Highlights from OSDI '04

A lot of papers in dependability! If you count both "recovery" and "bug finding" papers, between 13 and 15 (out of 27) papers are about dependability.

Mike Swift talked about recovering device drivers in Linux, building on Nooks (basically microreboots for device drivers; could this be part of the RADS platform??). In most cases you can "recover" the device driver and restart it in a way that lets the app continue without noticing anything except latency. It's a combination of shadowing, virtual memory, and transparent (to the app) syscall retry. Works better for drivers with narrow APIs.
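
Not the actual Nooks mechanism (which lives in the kernel and shadows the driver's interactions), but a minimal user-level analogue of the "recover and retry transparently" idea; ServiceUnavailable, call, and restart_service here are placeholders I made up:

    import time

    class ServiceUnavailable(Exception):
        """Stand-in for 'the driver is down / being recovered' (not a real Nooks API)."""

    def transparent_retry(call, restart_service, retries=3, delay=0.1):
        """Retry call() across a restart so the caller sees only added latency."""
        last = None
        for _ in range(retries):
            try:
                return call()
            except ServiceUnavailable as e:
                last = e
                restart_service()     # analogous to restarting the failed driver
                time.sleep(delay)     # the app notices nothing except the delay
        raise last                    # still failing after several recovery attempts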

Worm detection by subsampling packets going through a 1 Gb/s router, from Stefan Savage's group at UCSD. The big challenge is doing it at line speed. Approach: capture all substrings of length k and compute their Rabin fingerprint (a hash), then randomly sample based on the last n bits of the fingerprint. (For a 400-byte worm signature and 40-byte substrings, sampling at a rate of 1/64, i.e. checking the last 6 bits of the hash, finds a desired sample with probability >99.6%.) Since repeated substrings (as in a propagating worm) are uncommon (only ~1% of 40-byte substrings appear more than once), they use a "high-pass filter" (multistage filter) to get 3 orders of magnitude improvement over the naive approach of one table entry per string. Their software implementation handles 200 Mbps; their FPGA-assisted implementation does 2 Gbps.
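
A minimal sketch of the sampling step, with a toy polynomial rolling hash standing in for the real Rabin fingerprint (the window size and sampling rate are the ones quoted above; everything else is my own placeholder):

    WINDOW = 40                       # substring length k
    SAMPLE_BITS = 6                   # keep windows whose low 6 hash bits are 0 (rate 1/64)
    BASE, MOD = 257, (1 << 61) - 1    # toy rolling-hash parameters

    def sampled_substrings(payload: bytes):
        """Yield (offset, fingerprint) for the ~1/64 of 40-byte windows sampled."""
        fp = 0
        top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
        for i, b in enumerate(payload):
            fp = (fp * BASE + b) % MOD                              # shift new byte in
            if i >= WINDOW:
                fp = (fp - payload[i - WINDOW] * top * BASE) % MOD  # drop the old byte
            if i >= WINDOW - 1 and fp & ((1 << SAMPLE_BITS) - 1) == 0:
                yield i - WINDOW + 1, fp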

If the dispersion of a repeated substring (the number of unique sources and destinations seen in packets containing that substring) is high, it may be a worm in the process of being distributed. They use scalable bitmap counters since only an approximate count is needed: sample a larger virtual bitmap (see paper). Note that they don't need to keep track of which sources/dests at this point, just how many, so they use a table indexed by the fingerprint. (It's OK for fingerprints to collide, for the same reason.)
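
Roughly, the scalable-bitmap idea (my own toy rendering, not their data structure): hash each source or destination into a large virtual bitmap but physically store only a slice of it, then scale the linear-counting estimate back up:

    import hashlib
    import math

    class SampledBitmap:
        VIRT, PHYS = 4096, 64      # virtual bitmap size vs. bits actually stored

        def __init__(self):
            self.bits = [0] * self.PHYS

        def add(self, item: bytes):
            pos = int.from_bytes(hashlib.sha1(item).digest()[:4], "big") % self.VIRT
            if pos < self.PHYS:        # only hits in the sampled slice are recorded
                self.bits[pos] = 1

        def estimate(self) -> float:
            zeros = self.bits.count(0)
            if zeros == 0:
                return float("inf")    # slice saturated; a real counter would rescale
            # linear counting on the slice, scaled up by the sampling factor
            return self.VIRT * math.log(self.PHYS / zeros)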

In 8 months at UCSD, the system has detected and generated signatures for every worm outbreak (no false negatives). The most common false positives are "benign" traffic that generates signatures, like HTTP headers; they whitelist these. Spam and BitTorrent give false positives too.

Note that this relies on worms having deterministic signatures at the substring level. In paper: "self-modifying [polymorphic] worms still have a common 'seed' or kernel of code so this will still work." One person asked if adding random NOPs will break this, and if they've looked at techniques that can extract substring-type invariants that are robust to this.

Another person asked whether characterizations of current traffic figure too heavily into the optimizations that make the problem tractable (i.e., the observation that only 1% of 40-byte substrings are repeated).

Understanding and dealing with operator errors, from Rich Martin's group at Rutgers.

Configuration debugging as search, from Steve Gribble's group at UW. Talks about "WYNOT errors" ("Worked Yesterday, Not Today"), which typically result from misconfiguration or unknown config changes that broke some dependency. They use Denali (a paravirtual machine monitor) to boot a VM with a copy-on-write time-travel disk, plus a user-supplied "probe" that performs some test on the system; probes can run inside or outside the VM, depending on what they want to probe. The idea is to binary-search across the time-travel change log to find the transition that led to the failure. (There used to be a Macintosh tool that worked this way to find INIT conflicts.) This assumes there exists a single transition that took the system from working to nonworking; if multiple such transitions exist, Chronus may find the wrong one. Once found, diff the filesystem to find out what the cause was (what got changed). It also assumes the change that led to breakage is deterministic and is captured on disk. What happens if the "single transition" was a script that touched many files? Can you distinguish this from a case where only a subset of the file mods contained the badness?
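
The search itself is just a binary search over the change log, assuming a single deterministic transition; a sketch, where `states` are snapshot IDs in time order and `probe(state)` boots a copy-on-write VM at that snapshot and returns True if the test passes (both are placeholders, not their API):

    def find_breaking_change(states, probe):
        """Assumes probe(states[0]) is True and probe(states[-1]) is False;
        returns (last_good, first_bad) snapshot IDs."""
        lo, hi = 0, len(states) - 1
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if probe(states[mid]):    # boot a throwaway VM at this point in time and test
                lo = mid              # still worked here; badness is later
            else:
                hi = mid              # already broken here; badness is earlier
        return states[lo], states[hi]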

Note, the motivation for this work argues even further for using VM's as the unit of software distribution.

PlanetSeer - anomaly detection for TCP flows and routing loops on PlanetLab. This should be of interest to NetRADS. They look for anomaly indicators like: TTL change during a flow; routing change during a flow; n consecutive timeouts; an idle period of 3-16 secs (most congestion periods are <220 ms, so long ones are likely due to routing problems).
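
As a toy illustration of those indicators (the flow fields and the timeout count are placeholders, not PlanetSeer's actual thresholds):

    MAX_TIMEOUTS = 4   # stands in for the "n consecutive timeouts" threshold

    def looks_anomalous(flow) -> bool:
        """flow is any object exposing per-flow stats with the fields below."""
        return (flow.ttl_changed                          # TTL changed mid-flow
                or flow.route_changed                     # routing changed mid-flow
                or flow.consecutive_timeouts >= MAX_TIMEOUTS
                or 3.0 <= flow.idle_seconds <= 16.0)      # long idle: likely routing, not congestion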

Improving the reliability of Internet paths with one-hop source routing. Idea: understand how routing failures differ depending on whether the path is client-to-server (e.g. Web) or client-to-client (e.g. VoIP). Key results: (a) a large fraction of "client-side" failures are due to the broadband last hop at the client; (b) a large percentage of "server" failures are really due to failures in the core of the Internet; (c) the majority of failures are short (99.9% median path availability, 1-2 median duration), but a nontrivial percentage of paths see long-lived failures at the last hop. Failures happen about once every 1-2 days. Failures near end-nodes limit the performance of indirection routing; not a problem for servers, but a problem for broadband failures. Solution: one-hop source routing, i.e. route around a failed router/AS through a single intermediate node. For most failures, more than half of the available intermediate nodes could be used to avoid the failure, so a randomly-selected intermediate node has a pretty good chance of being able to help you route around it. "Random-4" (pick 4 at random) seems to be a sufficient threshold. Running one web client with this scheme and one without, fetching the same randomly-generated selection of web pages, the base client succeeds 99.8% of the time and the enhanced client gets only 20% fewer failures; it turns out this is because many of the failures are not Internet failures but server/app failures. Question: most users don't randomly surf but spend blocks of time at a site; does this make SOSR better or worse? (Better if failures are dominated by the network; worse if dominated by app failures.)
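
The "random-k" policy is simple enough to sketch; send_direct and send_via here are assumed transport helpers supplied by the caller, not anything from the paper's implementation:

    import random

    def send_with_sosr(packet, dest, intermediaries, send_direct, send_via, k=4):
        """Try the direct path; on failure, relay via up to k random intermediaries."""
        if send_direct(packet, dest):
            return True
        for hop in random.sample(intermediaries, min(k, len(intermediaries))):
            if send_via(hop, packet, dest):   # one-hop detour around the failed path
                return True
        return False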

Improving DNS with CoDNS. Defining a failure as ">5 sec latency to complete a DNS lookup" (which corresponds to the retransmission timeout for DNS clients in Unix), DNS servers at various universities seem to "fail" 20-40% of the time. A major source of this is the dropping of the UDP packets used for DNS lookups. "Cooperative DNS" lookup (which requires client-side changes, but no server changes) adds one nine to DNS performance, reduces average DNS lookup time by 27-82%, and outperforms CDN-based DNS (usually a negligible net difference; CoDNS resolves faster but doesn't always pick the best replica). This may be a "killer app" for PlanetLab/P2P. However, it's totally insecure, modulo DNSSec, and there is nothing addressing how you can tell whether peers were compromised or what you do about it. That seems like a dealbreaker for DNS.
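
The client-side idea, as I understood it, is roughly "ask locally, and if the answer is late, ask a peer"; a sketch (resolve_via_peer is a placeholder for an RPC to a cooperating node, and the 0.5 s deadline is arbitrary):

    import concurrent.futures
    import socket

    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def cooperative_lookup(name, resolve_via_peer, local_deadline=0.5):
        local = _pool.submit(socket.gethostbyname, name)   # local resolver, asynchronously
        try:
            return local.result(timeout=local_deadline)    # answered in time: use it
        except (concurrent.futures.TimeoutError, socket.gaierror):
            return resolve_via_peer(name)                  # slow or failed: ask a peer instead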

DOA, a middlebox-friendly resolution and routing scheme. Same goals as Cheriton's TRIAD, but requires end-host and middlebox (NAT, firewall, etc.) changes rather than router changes as TRIAD does. Claims exactly the same benefits as TRIAD as far as I can tell, but bases the mechanism on delegation.

Moises' talk made a good point about classification vs. regression. In their system, they do not try to predict latency, only whether latency will be above or below a threshold. Classification is a much simpler problem than regression: given a bunch of points in the plane, classification finds a surface that separates two classes of points, whereas regression tries to fit a curve to the points themselves. This is a good general lesson to keep in mind: sometimes transforming a regression problem into a classification problem makes it more tractable.
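
A tiny synthetic example of the transformation (made-up data and an off-the-shelf scikit-learn classifier, nothing from their system):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))                    # made-up per-request features
    latency = np.exp(X @ [0.5, 1.0, -0.3, 0.2] + rng.normal(scale=0.2, size=500))

    THRESHOLD = 1.0                                  # arbitrary SLO cutoff
    y = (latency > THRESHOLD).astype(int)            # the key step: continuous target -> class label

    clf = LogisticRegression().fit(X, y)             # any classifier would do
    print("P(over threshold):", clf.predict_proba(X[:1])[0, 1])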

Dawson Engler's group had a nice paper where they used model checking to find pretty bad bugs in filesystems, mostly in the filesystem recovery code (fsck). Part of the motivation: filesystem recovery code is complicated because the filesystem has to be crash-safe, yet recovery paths can be convoluted depending on when the crash occurs, etc. One person asked how this might be extended to check concurrency.
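
A very rough sketch of the crash-point exploration idea (the helper functions are placeholders, not the paper's checker): for every subset of buffered writes that might have reached disk before a crash, rebuild that disk image, run recovery, and check the filesystem invariants.

    from itertools import combinations

    def explore_crash_points(base_image, pending_writes,
                             apply_writes, run_fsck, check_invariants):
        """Yield write subsets for which recovery leaves the FS in a bad state."""
        for n in range(len(pending_writes) + 1):
            for subset in combinations(pending_writes, n):  # any subset may have hit disk
                crashed = apply_writes(base_image, subset)  # simulated post-crash disk
                recovered = run_fsck(crashed)               # run the recovery code under test
                if not check_invariants(recovered):
                    yield subset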

YY Zhou's group from Illinois had a method of detecting copy-pasted code, and thereby possible bugs from forgetting to change things when you copy and paste. Basically they look for frequent lexical sequences within a program. They find that 15-20% of the code in Linux, Apache, etc. consists of copy-pasted code (scary). Their approach gets a lot of false positives: e.g., of 421 reported Linux copy-paste bugs, 28 are true bugs, 21 are bad code but not actually buggy, and the rest are false alarms (~250).
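
The core of the idea can be mocked up as "find token windows that occur more than once" (their tool does much more: identifier-insensitive tokenization, frequent-subsequence mining, pruning); a toy version:

    from collections import defaultdict

    def repeated_token_windows(tokens, k=20):
        """Map each k-token window that appears more than once to its locations."""
        seen = defaultdict(list)
        for i in range(len(tokens) - k + 1):
            seen[tuple(tokens[i:i + k])].append(i)
        return {win: locs for win, locs in seen.items() if len(locs) > 1}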

Failure-oblivious computing. Martin Rinard's talk on this was, I thought, the most original work presented at OSDI. We can discuss it at the miniretreat, but I think this is a paper worth reading for all. Key observation: request-reply servers tend to have short error-propagation distances; errors occurring in one request tend not to propagate to subsequent requests. So the effect of obliviously eliminating bounds-check errors is likely to remain localized. For out-of-bounds reads, return manufactured values in the sequence 0,1,2,0,1,3,0,1,4..., because most programs terminate on 0 or 1, but some require a particular character to terminate a loop (e.g. newline). For out-of-bounds writes, just ignore them. (This requires a special compiler, and adds execution-time overhead ranging from 3-7x in some cases to almost nothing in others; more to the point, the added latency almost never crosses the 200ms "instantaneous" perceptibility threshold on any particular request.) There are guidelines about when to use it: sliding-time-window programs, short error-propagation distances, multiple largely-independent results, cases where it's unacceptable to stop executing. The observation about short propagation distances is the same reason microreboots work well for request-reply apps; one could use the constraints of the J2EE architecture combined with some static analysis to possibly prove which cases would be "safe" for failure-oblivious computing. Related work by Rinard: boundless memory blocks, which store OOB writes in a hashtable and serve corresponding OOB reads from that hashtable. In response to a question about whether you'd want to turn this on, since it makes bugs "disappear" or become hard to reproduce: from the developer's point of view, no; but from the user's point of view, you don't want the program to stop unless it has no other choice.
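
To make the read/write policy concrete, here is a toy model of a failure-oblivious array (the real work does this with a bounds-checking C compiler; this just mimics the manufactured-value sequence and the ignore-OOB-writes rule):

    class FailureObliviousArray:
        """Drops out-of-bounds writes; serves out-of-bounds reads from 0,1,2,0,1,3,0,1,4,..."""
        def __init__(self, data):
            self.data = list(data)
            self._manufactured = self._value_sequence()

        @staticmethod
        def _value_sequence():
            n = 2
            while True:
                yield 0; yield 1; yield n   # most loops terminate on 0 or 1
                n += 1                      # occasionally try other values (newline shows up at n=10)

        def read(self, i):
            if 0 <= i < len(self.data):
                return self.data[i]
            return next(self._manufactured)  # OOB read: manufactured value

        def write(self, i, v):
            if 0 <= i < len(self.data):
                self.data[i] = v             # in-bounds write proceeds normally
            # OOB write: silently ignored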