Armando's HotOS-9 notes

Andrew Hume's talk

tie c/o approach to this talk: c/o is not about "giving up" or doing hacks to work around problems of unreliability/bugs; it is about asserting invariants on runtime behavior of components, and reasoning from them to invariants on the system. in the absence of such invariants, apps are at a loss how to react when "something goes wrong". some examples of possible invariants and their implications--
- recovery cheap -> ok to make mistake -> ok to use aggressive methods like statistical analysis
- fails in bounded time -> ok to take action that assumes the failure (a la Timed-perfect Failure Detector). eg: client request taking too long -> client disconnects and retries request --> this shouldn't eventually make the system fall off a cliff because of abandoned connections.
- saturation rather than falling off performance cliff -> don't need further admission control -> ok to take part of system out of service even if we don't have excess capacity
would be nice to "prove" someday that if each component is c/o, a system built out of those plus some specific intercomponent comm/coord mechanisms is also c/o. need to define what the c/o invariants are on the components and the coord/comm. an aspect of it is "succeeding or unambiguously failing in bounded time".
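
a rough sketch of the "succeed or unambiguously fail in bounded time" invariant, seen from the caller's side. the names call_with_deadline, retry_bounded and BoundedFailure are made up for illustration and are not from the talk. note that the abandoned call keeps running in its worker thread, which is exactly the "abandoned connections" concern in the second bullet above.

```python
import concurrent.futures


class BoundedFailure(Exception):
    """The call did not succeed within its deadline (hypothetical name)."""


def call_with_deadline(fn, deadline_s, *args):
    """Return fn(*args), or raise BoundedFailure after roughly deadline_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        # The caller gives up; the worker thread is abandoned, not killed.
        raise BoundedFailure(f"no answer within {deadline_s}s") from None
    except Exception as exc:
        # Any other error is also reported as one unambiguous failure type.
        raise BoundedFailure(f"component failed: {exc!r}") from exc
    finally:
        pool.shutdown(wait=False)


def retry_bounded(fn, deadline_s, attempts):
    """Retry up to `attempts` times; total wait is about attempts * deadline_s."""
    for _ in range(attempts):
        try:
            return call_with_deadline(fn, deadline_s)
        except BoundedFailure:
            continue
    raise BoundedFailure(f"gave up after {attempts} attempts")
```

because every outcome is either a result or a BoundedFailure within a known time, the caller can act on the failure (retry, fail over, restart the component) without guessing whether the callee is merely slow.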

Distributed upgrading

Mothy's comment can be used to tie this to c/o: if an individual node has to be crash-safe anyway, why worry about preserving its state smoothly thru an upgrade?

C/O dryrun

we already know how to do this. -- yes, for specific apps, but we'd like to make it more general thru a combination of (a) innovating at the container level, which today is (userlevel?) middleware but in future could/should be OS, and (b) dealing w/state appropriateness issue. in both cases we'll provide specific tools.
don't transactions do this? -- yes, but not appropriate for all cases.

SSM dryrun/ tie-ins:

SSM "just says no" under load, and does so in bounded time; capacity threshold is auto-discovered. this is a good tie-in to Andrew Hume's rant.
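
a minimal sketch of "just say no" with a self-adjusting threshold. these notes don't say how SSM actually discovers its capacity, so the additive-increase / halve-on-overload rule below is an assumption, and SheddingGate and its parameters are hypothetical names. it is not thread-safe as written.

```python
class SheddingGate:
    """Reject work immediately once in-flight requests reach the current limit."""

    def __init__(self, initial_limit=4, target_latency_s=0.1):
        self.limit = initial_limit      # current guess at capacity
        self.in_flight = 0
        self.target = target_latency_s  # latency we consider healthy

    def try_admit(self):
        """Constant-time decision: True to admit, False to say no."""
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def done(self, latency_s):
        """Report a finished request and adjust the capacity guess."""
        self.in_flight -= 1
        if latency_s > self.target:
            self.limit = max(1, self.limit // 2)  # overloaded: back off hard
        else:
            self.limit += 1                       # healthy: probe for more
```

the point is that the refusal is a cheap, bounded-time answer, so callers see saturation rather than a performance cliff.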

Magpie talk (Paul Barham)

Interesting aspect is that they treat "paths" (in their terminology, "event strings") as sentences in a stochastic context-free grammar. (Think of a BNF definition of a grammar, turned into an FSM whose arcs are annotated with transition probabilities, so that for any legal string in the grammar, you can determine the probability that that string appears in a uniform collection.) You run the system for a while and it "learns" the CFG in time linear in the number of input strings; converges after ~4k paths. (Algorithm "ALERGIA"; people who do stochastic CFG's came up with it in previous work.) Then they can construct "Bayesian watchdogs" that let you flag a particular sentence as being highly unlikely, therefore possibly a problem. A nice concise representation of path structure and its possible deviants.
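
To make the "FSM with transition probabilities" picture concrete, here is a small sketch that builds a frequency-annotated prefix tree over observed paths and scores a new path against it. This is not Magpie's code and not full ALERGIA: ALERGIA starts from such a prefix tree and then merges statistically compatible states, and that merging step is omitted here, so this version assigns probability zero to anything it has never seen. PathModel, END, and min_prob are hypothetical names.

```python
import math
from collections import defaultdict

END = "$"  # hypothetical end-of-path marker


class PathModel:
    """Prefix tree over event strings, with counts on every arc."""

    def __init__(self):
        # counts[prefix][event] = times `event` followed `prefix` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, paths):
        for path in paths:
            prefix = ()
            for event in list(path) + [END]:
                self.counts[prefix][event] += 1
                prefix = prefix + (event,)

    def log_prob(self, path):
        """Log-probability of the whole path; -inf if any arc was never seen."""
        total = 0.0
        prefix = ()
        for event in list(path) + [END]:
            seen = self.counts.get(prefix)
            if not seen or seen.get(event, 0) == 0:
                return float("-inf")
            total += math.log(seen[event] / sum(seen.values()))
            prefix = prefix + (event,)
        return total

    def is_suspicious(self, path, min_prob=0.05):
        """Watchdog-style check: flag paths the model finds very unlikely."""
        return self.log_prob(path) < math.log(min_prob)
```

For example, after training on nine ["web", "db"] paths and one ["web", "cache"] path, the model gives the first a probability of 0.9 and the second 0.1, and a never-seen path such as ["db"] is always flagged.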
Interesting question from David Wetherall: clearly, not all performance "anomalies" are problems - they may just be bona fide but unusual requests. Isn't it a problem if you start treating those as "failures" and taking corrective action? [My view: first, you don't trigger this on a single request. Second, even when you trigger recovery, you may blow away some number of requests, but the assumption in our world is they're short and will probably get reissued; at some point someone has to decide - or the system has to learn - that this "unusual" behavior is actually OK. This should happen automagically if "unusual" is defined by historical behavior.]
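
One way to read "you don't trigger this on a single request" is a sliding-window rule: recovery fires only when the fraction of flagged requests in the recent window exceeds a threshold. A minimal sketch follows; RecoveryTrigger, the window size, and the threshold are illustrative assumptions, not anything presented in the talk.

```python
from collections import deque


class RecoveryTrigger:
    """Fire only when anomalies are frequent across a full recent window."""

    def __init__(self, window=100, max_anomalous_fraction=0.2):
        self.recent = deque(maxlen=window)  # oldest observations fall off
        self.max_fraction = max_anomalous_fraction

    def observe(self, anomalous):
        """Record one request; return True if recovery should be triggered."""
        self.recent.append(bool(anomalous))
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough history yet to judge
        return sum(self.recent) / len(self.recent) > self.max_fraction
```

A single odd request in an otherwise normal window never fires the trigger; a sustained run of them does.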
Question from audience: stochastic CFG's are very powerful, can't you just use a simple MM? Paul didn't really answer, though their current algo seems fine and is cheap enough to use in realtime.