Andrew Hume's talk
- tie c/o approach to this talk: c/o is not about "giving up" or
doing hacks to work around problems of unreliability/bugs; it is about
asserting invariants on runtime behavior of components, and reasoning from
them to invariants on the system. in the absence of such invariants,
apps are at a loss how to react when "something goes wrong".
some examples of possible invariants and their implications--
- recovery cheap -> ok to make mistake -> ok to use aggressive
methods like statistical analysis
- fails in bounded time -> ok to take action that assumes the
failure (a la Timed-perfect Failure Detector). eg: client request
taking too long -> client disconnects and retries request --> this
shouldn't eventually make the system fall off a cliff because of
abandoned connections.
- saturation rather than falling off performance cliff -> don't need
further admission control -> ok to take part of system out of service
even if we don't have excess capacity
- would be nice to "prove" someday that if a component is c/o, a
system built out of those plus some specific intercomponent comm/coord
mechanisms is also c/o. need to define what the c/o invariants are on
the components and the coord/comm. an aspect of it is "succeeding
or unambiguously failing in bounded time".
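
A minimal sketch of the "succeed or unambiguously fail in bounded time"
invariant from the caller's side, as in the client-disconnects-and-retries
example above. All names here (call_bounded, BoundedFailure, deadline_s,
attempts) are hypothetical and not from the talk. The worst case is roughly
attempts * deadline_s, after which the caller gets one unambiguous exception.

```python
import concurrent.futures as cf


class BoundedFailure(Exception):
    """Unambiguous failure: no attempt succeeded within its deadline."""


# Shared worker pool (size is an arbitrary assumption for the sketch).
_pool = cf.ThreadPoolExecutor(max_workers=4)


def call_bounded(fn, deadline_s, attempts=3):
    """Run fn() with a per-attempt deadline; return its result or raise
    BoundedFailure after at most `attempts` tries."""
    for _ in range(attempts):
        future = _pool.submit(fn)
        try:
            return future.result(timeout=deadline_s)
        except cf.TimeoutError:
            # Treat "too slow" as failed. cancel() is best effort only:
            # a call that is already running is abandoned, not interrupted.
            future.cancel()
        except Exception:
            # The component reported an explicit failure; retry.
            pass
    raise BoundedFailure(f"no success after {attempts} attempts")
```

Note the abandoned call keeps running in its worker thread. That is exactly
the "abandoned connections" concern above: the component has to tolerate
being walked away from without falling off a cliff.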
Distributed upgrading
- Mothy's comment can be used to tie this to c/o: if individual node has to
be crash-safe anyway, why worry about preserving its state smoothly thru an
upgrade.
C/O dryrun
- we already know how to do this. -- yes, for specific apps, but we'd
like to make it more general thru a combination of (a) innovating at the
container level, which today is (userlevel?) middleware but in future
could/should be OS, and (b) dealing w/state appropriateness issue. in
both cases we'll provide specific tools.
- don't transactions do this? -- yes, but not appropriate for all
cases.
SSM dryrun/ tie-ins:
- SSM "just says no" under load, and does so in bounded time;
capacity threshold is auto-discovered. this is a good tie-in to Andrew
Hume's rant.
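
A rough sketch of "just say no" under load with a self-adjusting threshold.
This is not SSM's actual mechanism (the notes don't say how SSM discovers its
threshold); it swaps in a simple additive-increase/multiplicative-decrease
rule driven by observed latency. SheddingGate, Overloaded, and
target_latency_s are hypothetical names and parameters.

```python
import threading


class Overloaded(Exception):
    """Raised immediately when the gate declines a request."""


class SheddingGate:
    def __init__(self, initial_limit=8, min_limit=1, max_limit=1024,
                 target_latency_s=0.05):
        self._lock = threading.Lock()
        self.limit = initial_limit          # current admitted-request cap
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.target_latency_s = target_latency_s
        self.inflight = 0

    def try_enter(self):
        """Admit the request or refuse at once; never queue or block."""
        with self._lock:
            if self.inflight >= self.limit:
                raise Overloaded("at capacity, retry later")
            self.inflight += 1

    def leave(self, latency_s):
        """Record completion and adjust the cap from observed latency."""
        with self._lock:
            self.inflight -= 1
            if latency_s > self.target_latency_s:
                # Too slow: assume we overshot capacity, back off hard.
                self.limit = max(self.min_limit, self.limit // 2)
            else:
                # Healthy: probe for more capacity one slot at a time.
                self.limit = min(self.max_limit, self.limit + 1)
```

The refusal path does constant work under a lock, so "no" arrives in bounded
time even when the system is saturated.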
Magpie talk (Paul Barham)
- Interesting aspect is that they treat "paths" (in their
terminology, "event strings") as sentences in a stochastic
context-free grammar. (Think of a BNF definition of a grammar,
turned into an FSM whose arcs are annotated with transition probabilities,
so that for any legal string in the grammar, you can determine the
probability that that string appears in a uniform collection.) You run
the system for a while and it "learns" the CFG in time linear in
the number of input strings; converges after ~4k paths. (Algorithm
"ALERGIA"; people who do stochastic CFG's came up with it in
previous work.) Then they can construct "Bayesian watchdogs"
that let you flag a particular sentence as being highly unlikely, therefore
possibly a problem. A nice concise representation of path structure
and its possible deviants.
- Interesting question from David Wetherall: clearly, not all
performance "anomalies" are problems - they may just be bona fide
but unusual requests. Isn't it a problem if you start treating those
as "failures" and taking corrective action? [My view: first,
you don't trigger this on a single request. Second, even when you
trigger recovery, you may blow away some number of requests, but the
assumption in our world is they're short and will probably get reissued; at
some point someone has to decide - or the system has to learn - that this
"unusual" behavior is actually OK. This should happen
automagically if "unusual" is defined by historical behavior.]
- Question from audience: stochastic CFG's are very powerful, can't you just
use a simple MM? Paul didn't really answer, though their current algo
seems fine and is cheap enough to use in realtime.
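
To make the "FSM with transition probabilities" idea concrete, a small sketch
that scores a path by multiplying arc probabilities learned from observed
paths, then flags unlikely ones. This is NOT ALERGIA and not Magpie's code:
it only builds a frequency prefix tree (no state merging), and the watchdog
is a plain probability threshold rather than anything Bayesian. The class
name, END marker, threshold, and event names are all made up for
illustration.

```python
from collections import defaultdict

END = "$"  # hypothetical end-of-path marker


class PrefixTreeModel:
    def __init__(self):
        # counts[state][symbol] = number of training paths taking that arc,
        # where a state is the tuple of symbols seen so far.
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, paths):
        """One pass over the paths: linear in total input length."""
        for path in paths:
            state = ()
            for symbol in list(path) + [END]:
                self.counts[state][symbol] += 1
                state = state + (symbol,)

    def probability(self, path):
        """Product of arc probabilities along the path; 0.0 if unseen."""
        prob = 1.0
        state = ()
        for symbol in list(path) + [END]:
            arcs = self.counts.get(state)
            if not arcs or symbol not in arcs:
                return 0.0
            prob *= arcs[symbol] / sum(arcs.values())
            state = state + (symbol,)
        return prob

    def is_suspicious(self, path, threshold=0.01):
        """Crude watchdog: flag paths rarer than the threshold."""
        return self.probability(path) < threshold


model = PrefixTreeModel()
model.train([["http", "db", "render"]] * 9 + [["http", "cache", "render"]])
```

Because the probabilities come straight from history, retraining on recent
paths is one way an "unusual but OK" path would stop being flagged over time.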