ICAC04 notes
- Generally good crowd, ~125 attendees, very interactive; keynote well
received and its points were revisited (and cited) in various other talks;
too many PC member papers (not sure what to make of that).
- The majority of the papers seemed clearly focused on enterprise-like
apps and distributed sw systems, but the organizers are a little
concerned that a handful of the papers on DHT's, sensornets, etc. just
relabeled themselves as "autonomic" to have one more place to
send papers, and they want to fix this for ICAC 05 (for which Yi-Min Wang
is a co-program chair).
- Overall, we were well represented by a diverse group of papers: Mike
Chen's paper on eBay diagnosis; Emre's paper (from his work at MS
Research) on Strider and debugging the Windows registry; Keith Coleman's
paper (he's an MS student of mine) on using a free-market model to
dynamically allocate resources in a cluster, with VMware as the unit of
application distribution; and my keynote, which makes the case for the
RADS proposal.
- My keynote was very well received. Some people had questions
like: would false positives still be a problem if you had more
sophisticated/multivariable analysis rather than looking at individual
variables? (They'd be less of a problem, but still nonzero.)
Aren't you concerned about stability issues, i.e., if the system is
"always recovering" maybe it is not making forward
progress? (Yes, you also need other monitoring to make sure that
forward progress is happening.) There was general agreement that
treating systems as "collections of black boxes" is the right
thing, but at the end of the conference was a panel session where a guy
from Microsoft seemed to be arguing against that and in favor of
drawing box-and-arrow diagrams of software systems. Unfortunately,
they placed the panel at the end of the second day, and a number of
people who had to make evening flights (myself included) had to leave
1/4 of the way through.
- Due to one session being parallel-tracked and also having to work on
OSDI papers, I missed one or two talks that I wanted to hear, including
the one from Kim K. and her intern on automatically determining tuning
parameters for storage arrays. Did someone else hear this talk?
- Steve White, IBM: autonomic architecture -> uniform means to manage and compose
components, enforcement of component self-management
- WS-policy, WS-agreement, WS-manageability, negotiation: a component must
advertise itself, offer the advertised service, adhere to its advertised
policies, and not accept an agreement it cannot fulfill. But how?
- How do you write an architectural spec for "self-optimizing"?
- Composition by agreement: DB asks storage sys for 5ms latency; storage sys
thus asks DB to limit thruput to 100 req/s. Why not
backpressure?
- Great question: the "limited interfaces" thing is motherhood -
CORBA, DCE, etc all did that. But the agreement stuff sounds like
Internet reservations/QoS and we never got that right because we don't know
how to compute the probabilistic bound on whole-system QoS based on local
behaviors. [No good answer for this except "get there
incrementally"]
- Rob Barrett (IBM) talk: was wrong for Autonomic to "exclude"
admins, right goal is to bring system functionality up to level where people
(operators) can meaningfully use them, and "bad automation is worse
than no automation"
- Mike Chen's formula for expected benefit from automatic diag: E(manual
approach)-E(automated approach) in terms of detection time, recovery time,
likelihood of correct diagnosis, and time to verify the fix - see
slides. He got a best case savings of 11 mins compared to human
operator diagnosis.
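Quick sketch of how I read the formula (the retry model -- a wrong diagnosis forces the whole attempt to be repeated -- and all the numbers here are my assumptions, not from his slides):

```python
# Hedged sketch of an expected-cost comparison for manual vs. automated
# diagnosis. Models one attempt as detect + recover + verify, and assumes
# a wrong diagnosis (prob 1 - p_correct) means repeating the attempt.

def expected_cost(detect_time, recovery_time, p_correct, verify_time):
    """Expected minutes to resolve a failure with one diagnosis approach."""
    attempt = detect_time + recovery_time + verify_time
    # Expected number of attempts under independent retries: 1 / p_correct
    return attempt / p_correct

def expected_benefit(manual, automated):
    """E(manual) - E(automated): positive means automation saves time."""
    return expected_cost(**manual) - expected_cost(**automated)

# Illustrative numbers only (minutes): automation detects faster but is
# less likely to be right on the first try.
manual = dict(detect_time=5.0, recovery_time=10.0, p_correct=0.9, verify_time=3.0)
auto = dict(detect_time=0.5, recovery_time=10.0, p_correct=0.7, verify_time=3.0)
print(round(expected_benefit(manual, auto), 2))
```

With these made-up numbers automation only wins by ~0.7 min; the 11-min best case in the talk presumably came from much faster automated detection.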
- Lessons: data is cheap, so simple algo's will do well; failed
components won't record observations correctly; separate analysis from
instrumentation/observation arch; use live workloads (since they'll
exercise your sys in ways you hadn't thought of).
- Mike Mesnier, Greg Ganger, Margo Seltzer: automatically classifying files
(zero size/nonzero; 0-16K/larger - use mirroring vs encoding; long/short
lived (can store in NVRAM), readonly/rw (replicate vs store in LFS), etc) so
you can automatically make policy decisions on how to manage
files. Same approach as Mike Chen's eBay failure diagnosis with
decision trees.
- Overtraining/overfitting of model?
- Correctness issues if make wrong decision (eg replication)?
Answer: everything they do is an optimization (the underlying FS's do
the Right Thing all the time.) [What about locks in NFS files that
become local files??]
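A toy version of the classifications listed above, one decision per attribute; the real system learns these rules with decision trees, and the 60-second lifetime threshold and the output encoding are my inventions:

```python
# Hedged sketch: per-axis policy decisions for a file, following the
# attribute list in my notes (size, lifetime, read-only vs. read-write).
# The 16K threshold is from the talk; the 60s lifetime cutoff is mine.

def classify(size_bytes, lifetime_s, read_only):
    """Return one policy decision per attribute axis."""
    return {
        # zero-size files need no layout; small files mirror, large encode
        "layout": ("none" if size_bytes == 0
                   else "mirror" if size_bytes <= 16 * 1024
                   else "erasure-code"),
        # short-lived files can live in NVRAM
        "placement": "nvram" if lifetime_s < 60 else "disk",
        # read-only: replicate for read speed; read-write: store in LFS
        "redundancy": "replicate" if read_only else "lfs",
    }

print(classify(8192, 3600, False))
print(classify(1 << 20, 10, True)["layout"])
```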
- Some utility-function stuff, pervasive computing + planning stuff, and
Grid scheduling, recast as 'autonomic' something or other. Most of it
bad.
- The 2nd keynote was about an economic scheme for sensornets to determine
who collects and forwards data to whom. This isn't any more
"autonomic" than traditional distributed systems work.
Autonomic seems to have an identity crisis - some of us are thinking
networked services, but a bunch of people doing other distributed systems (distributed
hash tables, p2p/Grid stuff, sensornets) have labeled their stuff
autonomic. The connection between this talk and the conference was weak.
(The speaker also mentioned multi-agent systems and the semantic Web as
autonomic efforts.... sigh)
- A theory paper about moving processes around to maximize availability
without running out of memory on any one node. It assumes some nice
clean metrics of "network availability" and "node
reliability". Not exciting.
- Repstore - replicated storage with smart bricks (MS Research Asia).
- Smart bricks cost about $1/GB on the street; operational cost of Gmail
said to be $2/GB
- Goal: thousands or tens-of-thousands of smart bricks as a huge storage
array, self-managing and self-tuning; eg as tape-array replacement, or
to displace high end large disk arrays
- "Leverage P2P DHT's techniques for self-organization... may be
most effective inside a datacenter" [I'm paraphrasing] - based on
XRing 1-hop DHT (done by maintaining logN "fingers"); when a
brick dies, its logical neighbors become neighbors of each other.
- Dynamically decides whether a given data item should be erasure-coded or
replicated.
- Fixed-length objects are grouped into sets; the upper bits of the
objectID determine which bricks an object lives on, and the lower bits
determine its key on a given brick. Can be erasure coded (striped) across those
bricks using a 4-out-of-7 erasure code, or can be replicated when higher
write performance is desired (since erasure coding makes writes
expensive). Replication cost (factor) is r, erasure cost is n/m
(n=7, m=4 in this system); writing replicated data costs r writes,
writing erasure-coded needs n-m+1 reads and writes (for the same reason
RAID does).
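Back-of-the-envelope version of those costs (the n=7, m=4 parameters and the n-m+1 read/write count are from the talk; counting each read and each write as one disk op is my accounting):

```python
# Sketch of the storage/write trade-off between r-way replication and an
# m-out-of-n erasure code, using the cost model from my notes.

def replication_costs(r):
    """(storage blowup, disk ops per small write) for r-way replication."""
    return r, r                      # r copies stored, r writes per update

def erasure_costs(n, m):
    """(storage blowup, disk ops per small write) for m-of-n coding."""
    storage = n / m                  # n fragments hold m fragments of data
    # A small write touches the data fragment plus the n-m parity
    # fragments: n-m+1 reads and n-m+1 writes, as in RAID.
    ops = 2 * (n - m + 1)
    return storage, ops

print(replication_costs(3))   # 3x storage, 3 writes
print(erasure_costs(7, 4))    # 1.75x storage, 8 disk ops per small write
```

This is consistent with the claim later in the talk: 1.75x vs. 3x storage is roughly the quoted 40% savings, paid for with more expensive writes.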
- Invariants: for replicated data, there are r replicas including the
object's root brick; for erasure-coded data, fragments are spread on n
bricks including the root brick. Goal: preserve these invariants under
new brick join or brick crash. Insight: due to P2P DHT's
self-organizing capability, this requires only local (neighborhood)
operations.
- Each brick owns tuning decisions for objects rooted on it.
Policy: devote a small percentage of its space to replicated data. On a
write, an erasure set is promoted to a replicated set; use LRU-W to demote
the least-recently-written replicated set when you run out of space for
replicated data. I.e., performance is optimized for hot and write-intensive
data (but you could choose a different management policy).
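My reconstruction of the promote/demote mechanics (class and method names are mine, not Repstore's):

```python
# Hedged sketch of the per-brick tuning policy: a write promotes the
# object's set to replicated form; when the replicated budget is full,
# the least-recently-written set is demoted back to erasure coding.

from collections import OrderedDict

class ReplicatedBudget:
    def __init__(self, capacity):
        self.capacity = capacity
        self.sets = OrderedDict()    # set_id -> None, ordered by last write

    def on_write(self, set_id):
        """Record a write; return the set demoted to erasure code, if any."""
        if set_id in self.sets:
            self.sets.move_to_end(set_id)          # refresh write recency
            return None
        self.sets[set_id] = None                   # promote to replicated
        if len(self.sets) > self.capacity:
            victim, _ = self.sets.popitem(last=False)  # LRU-W victim
            return victim
        return None

budget = ReplicatedBudget(capacity=2)
budget.on_write("a")
budget.on_write("b")
budget.on_write("a")                 # "a" is now most recently written
print(budget.on_write("c"))          # demotes "b", the LRU-W set
```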
- Conditions: working set must be small (else thrashing) and
slowly-changing. (else tuning overhead becomes significant part of
overall performance). To see if this is the case, they got traces
where each day a vector was captured recording #accesses to each object
in the trace (so most vector elts are zero, and a few are
nonzero). Then you compute the correlation of this set w/itself,
which captures how slowly the working set changes over time. This
seems like a neat and important technique for working-set evaluation.
They found from a UCB trace and HP trace that the working set is small
enough and stable enough for their scheme to be useful.
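The stability check as I understood it, on made-up data (the exact correlation statistic wasn't specified in the talk; Pearson correlation between adjacent days' access vectors is my assumption):

```python
# Sketch of the working-set stability check: one access-count vector per
# day (one element per object, mostly zeros); high correlation between
# days means the working set changes slowly.

import math

def pearson(x, y):
    """Pearson correlation of two equal-length access-count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical days over 6 objects: the same two objects are hot on
# both days, so the working set is stable and correlation is high.
day1 = [50, 40, 0, 0, 3, 0]
day2 = [45, 42, 0, 1, 0, 0]
print(round(pearson(day1, day2), 2))
```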
- Graphs show very good write performance even with minimal tuning,
since policy is optimized for hot writes. But with more tuning,
the cost of doing tuning increases (since object promotions/demotions
are more frequent?). Sweet spot: 6% of storage for
replication gives essentially full performance for writes while saving
40% storage over 3-way replication (given their traces).
- Wasn't 100% clear from the talk how variable-extent objects are
handled--I get the sense this is a backing store, like FAB or Dstore.
It's definitely not a filesystem. The guy giving the talk wasn't
the lead author, so he wasn't sure what the variance in extents is.
- Not clear how failure during write is handled (do they need 2PC, do
they have some rollback/repair mechanism to avoid leaving the write
group in an inconsistent state when a failure occurs during a write,
etc.). Will have to read the paper for that.
- Slight conflict w/motivation: if disk space really is free, why not
replicate everything? (Answer: with good working sets, the
overhead/complexity of tuning is minimal, so it's like getting back 40%
of the disk space for free.)
- In all, this was one of the better papers.
- Diaconescu & Mos, Dublin City Univ: portable open monitoring
infrastructure for J2EE.
- Low overhead: they claimed this but evaluation was pretty
weak. I believe it though.
- Collaborative: multiple independent probes embedded by modifying
containers, injecting proxy layer for each component and generating new
EAR file (so not appserver-specific)
- Centralized: probe events come into JMX-based event dispatching layer,
which filters them and sends them to event handlers corresponding to
invoke, create, delete, ... You can also write your own probes
that are server-specific, etc.
- readings are correlated (analyzed together) to determine
"true" bottleneck/hotspot EJB's.
- interaction models help: since #paths < #permutations of EJB's, you
can instrument only the head of each path. When a probe indicates a bad
path, the other probes in that path are activated. (When passive,
they observe locally but do not report events to the main event
dispatcher.)
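My reconstruction of the path-head idea (names and structure are mine, not theirs): only the first probe on each call path reports by default, and a flagged head switches the rest of its path from passive to reporting.

```python
# Hedged sketch of path-head probing: instrument only the head of each
# EJB call path; when a head flags a bad path, activate (make reporting)
# the downstream probes on that path.

class Probe:
    def __init__(self, name):
        self.name = name
        self.reporting = False       # passive: observes, doesn't report

class PathMonitor:
    def __init__(self, paths):
        # paths: list of call paths, each a list of component names
        self.paths = [[Probe(n) for n in p] for p in paths]
        for path in self.paths:
            path[0].reporting = True # only the head reports by default

    def head_flagged(self, head_name):
        """A head probe flagged its path: activate the rest of that path."""
        for path in self.paths:
            if path[0].name == head_name:
                for probe in path[1:]:
                    probe.reporting = True

mon = PathMonitor([["Front", "Cart", "DB"], ["Front2", "Search"]])
mon.head_flagged("Front")            # activates Cart and DB, not Search
print([p.reporting for p in mon.paths[0]])
print([p.reporting for p in mon.paths[1]])
```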
- What triggers a probe?
- How to adapt when a probe identifies a bottleneck: not implemented,
but "you could" replicate stateless SB's or activate redundant
components, then reroute some requests to use the extra instances by
modifying deployment descriptors (??). How well did it work?
They don't know. It depends heavily on constructing a probe algo
that fires quickly and with a low false positive rate. This is
very hard to do!
- Started out promising but a bit weak overall. We may want to use
their monitoring framework though for J2EE apps, as it is free and
extensible.