[This is a digest culled from my notes, as well as David Oppenheimer's, George Candea's, and Aaron Brown's notes].

Contents
--------
  1. General/Highlights
  2. Talk highlights
  3. Talk details
  4. Panel sessions


General/Highlights:
-------------------


Talk highlights
---------------

Three main themes to the dependability-related talks:

Three main themes to the dependability-related talks:

  1. Making the system easier to develop/understand/modify.
  2. Making it easier to inspect/debug the system.
  3. Systems design techniques (including failure isolation).

There were also a couple papers on using formal methods to prove correctness (for some value of correctness).


Talk Details
------------

[in no particular order]

(1) Brittle systems will break, not bend: can aspect-oriented
programming help (Coady, UBC)

The big idea of this paper is that many system properties are by nature spread out across many modules of a system.  Aspect-Oriented Programming (AOP) helps you collect this code (called an "aspect") into a single file and apply it to the rest of your system.  The hypothesis is that by collecting all this code in a single file, it becomes much easier to understand how this cross-cutting functionality is implemented, and much easier to change this functionality without introducing bugs.  The potential disadvantage is that when you are looking at the "main functionality" of your system, it can be difficult to tell what aspects are being inserted into a given point in your code.  AOP-aware editors are being developed to help visualize how aspects interact with your code.

As a simple example, you can write a single debug statement:

    'println( "entered a function" );'

and automatically insert this debug statement into all functions in your system, or only some functions in your system.  You can even tell AOP to insert this statement only if a function X is called with function Y in its call stack.  These aspects can also pull in variables that exist anywhere in the call stack.
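The talk's examples were in an AOP language; as a rough analogy only (not the system described in the talk), a Python decorator can play the role of a simple "entered a function" aspect, keeping the cross-cutting tracing code in one place while the wrapped functions stay unaware of it.  All function names here are invented for illustration:

```python
import functools

trace_log = []

def traced(fn):
    """Cross-cutting 'aspect': record entry into every wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        trace_log.append(f"entered {fn.__name__}")
        return fn(*args, **kwargs)
    return wrapper

# The aspect is applied declaratively; the function bodies stay clean.
@traced
def handle_request():
    return compute()

@traced
def compute():
    return 42
```

Calling `handle_request()` returns 42 and leaves `["entered handle_request", "entered compute"]` in the trace log.  A real AOP system goes much further (pointcuts over call stacks, access to in-scope variables), but the "collect it in one place" idea is the same.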

---

(2) Automating data dependability (Keeton and Wilkes, HPL)

The big idea is to modify storage configuration tools to take dependability requirements into account.  In addition to specifying performance and space requirements, the user also specifies dependability and performability requirements.  The system has knowledge of the fault models of various storage systems, and generates a design for storing/protecting data (when/how to make backups, where to store them, etc.)  The hypothesis is that designing these storage systems is very error-prone, and by automating this task, there is a higher chance of getting it right.

Two main issues:  1) it can be hard to specify user requirements.  Does the user know what they need?  2) automation irony: if this system is a black box, how does the user understand the designed system well enough to be helpful when something fails?
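The core of the idea can be sketched as a search over candidate designs against the user's stated requirement.  This is only an illustration of the shape of the problem; all design names, data-loss figures, and costs below are invented, not taken from the talk:

```python
# Hypothetical candidate designs: (name, worst-case data loss in hours
# under the assumed fault model, cost per GB per month in dollars).
candidates = [
    ("nightly-tape-backup",       24.0, 0.05),
    ("hourly-snapshot",            1.0, 0.20),
    ("synchronous-remote-mirror",  0.0, 1.50),
]

def design_storage(max_data_loss_hours):
    """Return the cheapest design meeting the dependability requirement."""
    feasible = [c for c in candidates if c[1] <= max_data_loss_hours]
    if not feasible:
        raise ValueError("no candidate design meets the requirement")
    return min(feasible, key=lambda c: c[2])[0]
```

For example, a user who can tolerate at most two hours of lost data gets the hourly snapshot design; a user who can tolerate none gets the remote mirror.  The hard part (the point of issue 1 above) is getting the user to state `max_data_loss_hours` meaningfully in the first place.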

---

(3) InfoSpect: using a logic language for system health monitoring in
distributed systems (Roscoe, Intel Research@Berkeley)

The big idea of this paper is that we should not use predefined schemas when we're monitoring a system for problems.  The hypothesis is that an a priori schema assumes (and enforces) a particular consistency on your system's state.  If you apply this schema too early in your monitoring process, you will only be able to view your system through the lens of this consistent schema.  In contrast, if you record everything in the form of schema-less facts (e.g., "X was observed"), you can use a logic language like PROLOG to test for the consistency you want to maintain.  In a sense, you are doing a late binding of a schema to already existing data, and popping up errors whenever the data doesn't fit into your schema.  Another advantage of using a language like PROLOG is that it shows you what inference rules were used to discover the inconsistency in your system.
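InfoSpect itself uses a logic language; as a rough sketch of the late-binding idea in Python (the facts and the rule below are invented, not from the talk), observations are recorded as schema-less tuples, and a consistency rule is applied to them after the fact:

```python
# Schema-less facts, recorded exactly as observed -- no schema is
# imposed at collection time.
facts = {
    ("running",  "node1", "web-server"),
    ("running",  "node2", "web-server"),
    ("assigned", "node1", "web-server"),
    # node2 runs a service it was never assigned: an inconsistency.
}

def inconsistencies(facts):
    """Late-bound consistency rule: every running service must also
    have been assigned.  Returns the facts that violate the rule."""
    return sorted(
        (node, svc)
        for (kind, node, svc) in facts
        if kind == "running" and ("assigned", node, svc) not in facts
    )
```

Here the schema ("running implies assigned") is a query over existing data, so unexpected observations are still recorded; they simply surface later as rule violations rather than being rejected at collection time.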

The philosophy behind this system is very ROC-like.  It is assumed that the system is going to fail, and moreover, the system is going to fail in completely unexpected ways.  The focus here is on how to detect these unexpected failures.

---

(4) Dependable software needs pervasive debugging (Harris, U. of Cambridge)

Motivation: It's hard to debug heisenbugs.
Hypothesis: We can make debugging easier by putting all "non-deterministic events" under the control of the debugger, allowing replay, etc.  Basically, heisenbugs become deterministic bugs.

Harris has built a debugger that runs programs in a virtual environment, controlling non-deterministic events like thread-switching, etc.  Seems like a good idea.  Drawbacks are that 1) it's not clear how to show that the virtual environment is a good model of the real environment and 2) it's not easy to model network events.  It seems that this debugger can be a useful tool, as long as you don't depend on it to find all your bugs.
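The record/replay idea at the heart of this can be sketched very simply.  Harris's debugger does this at the level of a virtual machine; the toy Python version below (all names invented) just records one source of non-determinism on the first run and feeds the log back on the second, making the "heisenbug" reproducible:

```python
import random

class Nondet:
    """Wrap a non-deterministic source: record choices on the first
    run, then replay the recorded log deterministically."""
    def __init__(self, log=None):
        self.replaying = log is not None
        self.log = log if log is not None else []
        self.pos = 0

    def choice(self, options):
        if self.replaying:
            value = self.log[self.pos]      # replay the recorded choice
            self.pos += 1
        else:
            value = random.choice(options)  # non-deterministic; record it
            self.log.append(value)
        return value

def buggy_program(nd):
    # Pretend each step's thread interleaving is chosen non-deterministically.
    return [nd.choice(["A", "B"]) for _ in range(3)]

recorder = Nondet()
first = buggy_program(recorder)                  # recorded run
replayed = buggy_program(Nondet(recorder.log))   # deterministic replay
```

Both runs produce the identical interleaving, so a failure seen once can be stepped through repeatedly.  The hard parts Harris faces (faithfully virtualizing the environment, modeling network events) are exactly what this toy omits.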

---

(5) Rewind, Repair, Replay: 3Rs to Dependability (Brown, Berkeley)

Summary:


[The following is stolen verbatim from Aaron's summary of his talk]
"I gave my talk on the 3R's, which was generally well-received even though delivered through severe jet-lag. I spent a lot of time talking with various people (Hank Levy, Steve Gribble, Marc Shapiro, Brian Noble, Timothy Roscoe, John Wilkes, Matt Welsh, Yvonne Coady, among others) about the 3Rs and undo, and got a lot of very useful advice and feedback that I'm in the process of synthesizing now.

Much of the discussion revolved around verbs and whether or not the simple verb model I've proposed is sufficient; notably, whether additional verb properties ("adjectives" or "adverbs") are needed to better capture causality. There was again discussion of whether this could be accomplished by using a fully-transactional system, which showed me that I'm still not explaining clearly enough why the verb approach is different. However, Matt and others had some very enlightening comments on how I could better present the material to make this distinction clearer.

People asked about undo history models (what I talked about a few retreats ago concerning branching vs. linear histories), which I didn't have time to cover in the talk. There was also discussion about whether email was an appropriate application, or if it was "too easy". No one was able to propose an alternate application that they liked better (they did not think that auctions or e-commerce were particularly different); again, I think this can be best addressed by better motivation that more clearly defines the problem domain.

Steve Gribble had the interesting suggestion of forgetting about the user-driven service model and instead building a tool just to manage state below the application--for example, handling software installation. This is an interesting application of the 3Rs and probably in the end compatible with the user/service model of the 3Rs I've been working on, but it would be a big change of direction at this point. I'm going to spend some more time thinking about it, though, to see if/how it fits in to my mental picture of the 3Rs.

Finally, Marc Shapiro pointed me to a system he's developed for optimistic concurrency control that apparently solves a similar consistency-management problem as that addressed by the 3R verbs, so I'm going to take a look at that and see if it will make the 3Rs any easier."

---

(6) Overload Management fundamental as a service design primitive (Welsh, Berkeley)

Important points:

---

(7) Denali: A scalable isolation kernel. (Gribble, UW)

Common idea with Nooks: Build light-weight protection domains to contain unreliable code.

Motivation:

  1. Desire to run internet services inside a third-party/professionally managed hardware infrastructure.
  2. Zipf's law says that most Internet service requests will go to a large number of individually unpopular services.  These unpopular services won't warrant their own dedicated hardware.  So we need to multiplex.
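The Zipf's-law argument above can be illustrated numerically.  The sketch below (parameters invented for illustration) assumes request popularity proportional to 1/rank and computes what share of requests lands outside the most popular services:

```python
def zipf_tail_share(n_services, head):
    """Share of requests going to services ranked below `head`,
    assuming popularity proportional to 1/rank (Zipf, exponent 1)."""
    weights = [1.0 / rank for rank in range(1, n_services + 1)]
    return sum(weights[head:]) / sum(weights)

# With 10,000 services, the services *outside* the top 100 still
# collect roughly half of all requests between them -- a long tail
# too large to ignore, but too unpopular for dedicated hardware.
share = zipf_tail_share(10_000, 100)
```

This is exactly why multiplexing matters: no single tail service justifies its own machine, yet in aggregate the tail is a large fraction of the workload.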

Existing OSes have several weaknesses for this purpose: high-level abstractions, very wide APIs, and data sharing that is efficient but compromises isolation.

Denali addresses these problems by running each service in its own VM.  It uses low-level, hardware-like abstractions; very narrow APIs; and allows no inter-process communication or sharing except via the network interface.  To achieve its scalability requirements, Denali uses a number of tricks, such as not supporting non-virtualizable or otherwise expensive functions of the HW (e.g., no x86 segmentation, virtual memory, or BIOS); introducing an "idle timer" instruction; delivering interrupts in batches to avoid context switches; and drastically simplifying I/O interfaces to avoid kernel crossings.

Denali runs on the x86, borrows drivers from Flux, and has a rewritten kernel core.  On 1.5GHz P4 w/1 GB of RAM they get an aggregate throughput of 5,000 req/sec for up to 500 web servers.  With more web servers, they hit a memory bottleneck, and end up swapping processes from disk.  Here, Denali's performance becomes disk-bound---they can swap in/out 7 VMs / sec.

---

(8) Nooks: an architecture for reliable device drivers. (Swift, UW)

Common idea with Denali: Build light-weight protection domains to contain unreliable code.

Motivation: The buggiest code in OSes today is in device drivers.
Hypothesis: By running device drivers in light-weight protection domains, we can contain most failures without significantly harming performance.
Assumptions: device drivers are not malicious, just buggy.  Don't worry about being perfect; just make it better.
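The control flow of the contain-and-recover idea can be sketched as follows.  Nooks does this inside the kernel with real memory protection; this Python analogy (all names invented) only shows the shape of it: a buggy-but-not-malicious driver call is wrapped so its failure is contained and a recovery action runs instead of the whole system crashing:

```python
class DriverFailure(Exception):
    """Raised to the caller in place of a kernel crash."""

recoveries = []

def isolated(driver_fn, recover):
    """Run a driver call inside a 'protection domain': contain any
    fault it raises and invoke a recovery/restart function."""
    def wrapper(*args):
        try:
            return driver_fn(*args)
        except Exception:              # contain the fault...
            recover()                  # ...reset/restart the driver...
            raise DriverFailure(driver_fn.__name__)  # ...report cleanly
    return wrapper

def flaky_read(block):
    # Stand-in for a buggy driver: misbehaves on bad input.
    if block < 0:
        raise RuntimeError("driver dereferenced a bad pointer")
    return b"data"

safe_read = isolated(flaky_read, lambda: recoveries.append("reset"))
```

A good call returns data as usual; a bad one triggers the recovery hook and surfaces as a `DriverFailure` the OS can handle, matching the "don't be perfect, just make it better" assumption.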

---

(9) Specifying and Verifying Systems with TLA+

no comments.

---

(10) Rigour is good for you and feasible: reflections on formal treatments of C and UDP (Norris)

The authors wrote a formal behavior specification for the UDP protocol and the C language.  The point of the paper is not to describe these efforts, but rather to draw conclusions and to show that a formal approach is feasible and useful in making the contract between the system and its users concrete.

The authors don't advocate formal specs, but rather point out that they lead to a number of useful principles:
  1. Specify the system in small pieces, not all at once.
  2. Specify subsystems with subtle behaviors; don't waste time on the others.
  3. Use notation with tool support, to enable mechanical analysis.
  4. Begin as early as possible; what the authors did was post hoc and, while valuable, it would be even more so during the design process.

Result of the subsequent discussion: depending on the application domain, verification may be more appropriate than recovery, and vice versa.  A combination of the two is most likely.

---

(11) The Case for Transient Authentication

The big idea is that most people won't use security mechanisms (like logging in and out) if it gets in the way of doing work.  The question is how to make a "log-in/out" mechanism that is both secure (the user *must* be present in order to access secure data), and doesn't get in the way of the user.

The approach Noble takes is to use a personal token that communicates via radio with local devices.  The personal token has an authentication code that must be available in order to access confidential data and programs.  The main challenges seem to have been 1) making the system fast enough that the application/data was available by the time the user sat down at the computer (avoiding getting in the user's way), and 2) when the user walks away, making sure all confidential state is locked down and cleaned up.
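The control flow can be sketched as a simple state machine (all names and the `token_nearby` stand-in for the radio link are invented; the real system works at the OS/filesystem level):

```python
class TransientAuth:
    """Gate access to confidential state on the presence of a radio
    token, and scrub that state the moment the token departs."""
    def __init__(self):
        self.token_nearby = False
        self.secret = None

    def prefetch(self, secret):
        # Challenge 1: prepare/decrypt data while the user approaches,
        # so it is ready by the time they sit down.
        self.secret = secret

    def read_secret(self):
        if not self.token_nearby:
            raise PermissionError("token absent: data locked")
        return self.secret

    def token_departed(self):
        # Challenge 2: lock down and clean up confidential state
        # as soon as the user walks away.
        self.secret = None
        self.token_nearby = False
```

The security property is that `read_secret` only ever succeeds while the token is in range; the usability property is that, because of prefetching, the user never waits.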

Drawbacks: 1) there's a potential for piggybacking on a user's presence and using their authentication tokens without their knowledge. 2) also, malicious radio-repeaters can make radio tokens appear far away from the user's actual location.

Tangential relation to ROC: the user will not do the "right thing" if it's annoying or hard.  You have to make the right systems decisions so that the system makes it easy for the user to do "the right thing."  The question this paper answers is "how do you make the secure thing the easy thing?"

----------------------------------

Panel Sessions
--------------

The panel sessions were generally disappointing.  They mostly skirted around the hard questions.  Since neither panel came up with a useful consensus/conclusions, I'm simply highlighting what I thought were the most useful comments.

(1) Highlights: Can we really depend on an OS?

(2) Highlights: Is OS research worth doing?