ICAC04 notes
- Generally good crowd, ~125 attendees, very interactive; keynote well
received and its points were revisited (and cited) in various other talks;
too many PC member papers (not sure what to make of that).
- The majority of the papers seemed clearly focused on enterprise-like
apps and distributed sw systems, but the organizers are a little
concerned that a handful of the papers on DHT's, sensornets, etc. just
relabeled themselves as "autonomic" to have one more place to
send papers, and they want to fix this for ICAC 05 (for which Yi-Min Wang
is a co-program chair).
- Overall, we were well represented by a diverse group of papers: Mike
Chen's paper on eBay diagnosis; Emre's paper (from his work at MS
Research) on Strider and debugging the Windows registry; Keith Coleman's
paper (he's an MS student of mine) on using a free-market model to
dynamically allocate resources in a cluster, with VMware as the unit of
application distribution; and my keynote, which makes the case for the
RADS proposal.
- My keynote was very well received. Some people had questions
like: would false positives still be a problem if you had more
sophisticated/multivariable analysis rather than looking at individual
variables? (They'd be less of a problem, but still nonzero.)
Aren't you concerned about stability issues, i.e., if the system is
"always recovering" maybe it is not making forward
progress? (Yes, you also need other monitoring to make sure that
forward progress is happening.) There was general agreement that
treating systems as "collections of black boxes" is the right
thing, but at the end of the conference was a panel session where a guy
from Microsoft seemed to be arguing against that and in favor of
drawing box-and-arrow diagrams of software systems. Unfortunately,
they placed the panel at the end of the second day, and a number of
people who had to make evening flights (myself included) had to leave
1/4 of the way through.
- Due to one session being parallel-tracked and also having to work on
OSDI papers, I missed one or two talks that I wanted to hear, including
the one from Kim K. and her intern on automatically determining tuning
parameters for storage arrays. Did someone else hear this talk?
- Steve White, IBM: autonomic architecture -> uniform means to manage and compose
components, enforcement of component self-management
- WS-policy, WS-agreement, WS-manageability, negotiation: a component must
advertise itself, offer the advertised service, adhere to its advertised
policies, and not accept an agreement it cannot fulfill. But how?
- How do you write an architectural spec for "self-optimizing"?
- Composition by agreement: DB asks storage sys for 5ms latency; storage sys
thus asks DB to limit thruput to 100 req/s. Why not
backpressure?
- Great question: the "limited interfaces" thing is motherhood -
CORBA, DCE, etc all did that. But the agreement stuff sounds like
Internet reservations/QoS and we never got that right because we don't know
how to compute the probabilistic bound on whole-system QoS based on local
behaviors. [No good answer for this except "get there
incrementally"]
- Rob Barrett (IBM) talk: was wrong for Autonomic to "exclude"
admins, right goal is to bring system functionality up to level where people
(operators) can meaningfully use them, and "bad automation is worse
than no automation"
- Mike Chen's formula for expected benefit from automatic diag: E(manual
approach)-E(automated approach) in terms of detection time, recovery time,
likelihood of correct diagnosis, and time to verify the fix - see
slides. He got a best case savings of 11 mins compared to human
operator diagnosis.
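Quick sketch of how I read the formula (the retry model -- a wrong diagnosis forces the whole attempt to be repeated -- and all the numbers here are my assumptions, not from his slides):

```python
# Hedged sketch of an expected-cost comparison for manual vs. automated
# diagnosis. Models one attempt as detect + recover + verify, and assumes
# a wrong diagnosis (prob 1 - p_correct) means repeating the attempt.

def expected_cost(detect_time, recovery_time, p_correct, verify_time):
    """Expected minutes to resolve a failure with one diagnosis approach."""
    attempt = detect_time + recovery_time + verify_time
    # Expected number of attempts under independent retries: 1 / p_correct
    return attempt / p_correct

def expected_benefit(manual, automated):
    """E(manual) - E(automated): positive means automation saves time."""
    return expected_cost(**manual) - expected_cost(**automated)

# Illustrative numbers only (minutes): automation detects faster but is
# less likely to be right on the first try.
manual = dict(detect_time=5.0, recovery_time=10.0, p_correct=0.9, verify_time=3.0)
auto = dict(detect_time=0.5, recovery_time=10.0, p_correct=0.7, verify_time=3.0)
print(round(expected_benefit(manual, auto), 2))
```

With these made-up numbers automation only wins by ~0.7 min; the 11-min best case in the talk presumably came from much faster automated detection.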
- Lessons: data is cheap, so simple algo's will do well; failed
components won't record observations correctly; separate analysis from
instrumentation/observation arch; use live workloads (since they'll
exercise your sys in ways you hadn't thought of).
- Mike Mesnier, Greg Ganger, Margo Seltzer: automatically classifying files
(zero size/nonzero; 0-16K/larger - use mirroring vs encoding; long/short
lived (can store in NVRAM), readonly/rw (replicate vs store in LFS), etc) so
you can automatically make policy decisions on how to manage
files. Same approach as Mike Chen's eBay failure diagnosis with
decision trees.
- Overtraining/overfitting of model?
- Correctness issues if make wrong decision (eg replication)?
Answer: everything they do is an optimization (the underlying FS's do
the Right Thing all the time.) [What about locks in NFS files that
become local files??]
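A toy version of the classifications listed above, one decision per attribute; the real system learns these rules with decision trees, and the 60-second lifetime threshold and the output encoding are my inventions:

```python
# Hedged sketch: per-axis policy decisions for a file, following the
# attribute list in my notes (size, lifetime, read-only vs. read-write).
# The 16K threshold is from the talk; the 60s lifetime cutoff is mine.

def classify(size_bytes, lifetime_s, read_only):
    """Return one policy decision per attribute axis."""
    return {
        # zero-size files need no layout; small files mirror, large encode
        "layout": ("none" if size_bytes == 0
                   else "mirror" if size_bytes <= 16 * 1024
                   else "erasure-code"),
        # short-lived files can live in NVRAM
        "placement": "nvram" if lifetime_s < 60 else "disk",
        # read-only: replicate for read speed; read-write: store in LFS
        "redundancy": "replicate" if read_only else "lfs",
    }

print(classify(8192, 3600, False))
print(classify(1 << 20, 10, True)["layout"])
```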
- Some utility-function stuff, pervasive computing + planning stuff, and
Grid scheduling, recast as 'autonomic' something or other. Most of it
bad.
- The 2nd keynote was about an economic scheme for sensornets to determine
who collects and forwards data to whom. This isn't any more
"autonomic" than traditional distributed systems work.
Autonomic seems to have an identity crisis - some of us are thinking
networked services, but a bunch of people doing other distributed systems (distributed
hash tables, p2p/Grid stuff, sensornets) have labeled their stuff
autonomic. The connection between this talk and the conference was weak.
(The speaker also mentioned multi-agent systems and the semantic Web as
autonomic efforts.... sigh)
- A theory paper about moving processes around to maximize availability
without running out of memory on any one node. It assumes some nice
clean metrics of "network availability" and "node
reliability". Not exciting.
- Repstore - replicated storage with smart bricks (MS Research Asia).
- Smart bricks cost about $1/GB on the street; operational cost of Gmail
said to be $2/GB
- Goal: thousands or tens-of-thousands of smart bricks as a huge storage
array, self-managing and self-tuning; eg as tape-array replacement, or
to displace high end large disk arrays
- "Leverage P2P DHT's techniques for self-organization... may be
most effective inside a datacenter" [I'm paraphrasing] - based on
XRing 1-hop DHT (done by maintaining logN "fingers"); when a
brick dies, its logical neighbors become neighbors of each other.
- Dynamically decides whether a given data item should be erasure-coded or
replicated.
- Fixed-length objects are grouped into sets; the upper bits of the
objectID determine which bricks an object lives on, and the lower bits
determine its key on a given brick. Can be erasure coded (striped) across those
bricks using a 4-out-of-7 erasure code, or can be replicated when higher
write performance is desired (since erasure coding makes writes
expensive). Replication cost (factor) is r, erasure cost is n/m
(n=7, m=4 in this system); writing replicated data costs r writes,
writing erasure-coded needs n-m+1 reads and writes (for the same reason
RAID does).
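Back-of-the-envelope version of those costs (the n=7, m=4 parameters and the n-m+1 read/write count are from the talk; counting each read and each write as one disk op is my accounting):

```python
# Sketch of the storage/write trade-off between r-way replication and an
# m-out-of-n erasure code, using the cost model from my notes.

def replication_costs(r):
    """(storage blowup, disk ops per small write) for r-way replication."""
    return r, r                      # r copies stored, r writes per update

def erasure_costs(n, m):
    """(storage blowup, disk ops per small write) for m-of-n coding."""
    storage = n / m                  # n fragments hold m fragments of data
    # A small write touches the data fragment plus the n-m parity
    # fragments: n-m+1 reads and n-m+1 writes, as in RAID.
    ops = 2 * (n - m + 1)
    return storage, ops

print(replication_costs(3))   # 3x storage, 3 writes
print(erasure_costs(7, 4))    # 1.75x storage, 8 disk ops per small write
```

This is consistent with the claim later in the talk: 1.75x vs. 3x storage is roughly the quoted 40% savings, paid for with more expensive writes.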
- Invariants: for replicated data, there are r replicas including the
object's root brick; for erasure-coded data, fragments are spread on n
bricks including the root brick. Goal: preserve these invariants under
new brick join or brick crash. Insight: due to P2P DHT's
self-organizing capability, this requires only local (neighborhood)
operations.
- Each brick owns tuning decisions for objects rooted on it.
Policy: devote a small percentage of its space to replicated data. On a
write, an erasure set is promoted to a replicated set; use LRU-W to demote
the least-recently-written replicated set when you run out of space for
replicated data. I.e., performance is optimized for hot and write-intensive
data (but you could choose a different management policy).
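My reconstruction of the promote/demote mechanics (class and method names are mine, not Repstore's):

```python
# Hedged sketch of the per-brick tuning policy: a write promotes the
# object's set to replicated form; when the replicated budget is full,
# the least-recently-written set is demoted back to erasure coding.

from collections import OrderedDict

class ReplicatedBudget:
    def __init__(self, capacity):
        self.capacity = capacity
        self.sets = OrderedDict()    # set_id -> None, ordered by last write

    def on_write(self, set_id):
        """Record a write; return the set demoted to erasure code, if any."""
        if set_id in self.sets:
            self.sets.move_to_end(set_id)          # refresh write recency
            return None
        self.sets[set_id] = None                   # promote to replicated
        if len(self.sets) > self.capacity:
            victim, _ = self.sets.popitem(last=False)  # LRU-W victim
            return victim
        return None

budget = ReplicatedBudget(capacity=2)
budget.on_write("a")
budget.on_write("b")
budget.on_write("a")                 # "a" is now most recently written
print(budget.on_write("c"))          # demotes "b", the LRU-W set
```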
- Conditions: working set must be small (else thrashing) and
slowly-changing. (else tuning overhead becomes significant part of
overall performance). To see if this is the case, they got traces
where each day a vector was captured recording #accesses to each object
in the trace (so most vector elts are zero, and a few are
nonzero). Then you compute the correlation of this set w/itself,
which captures how slowly the working set changes over time. This
seems like a neat and important technique for working-set evaluation.
They found from a UCB trace and HP trace that the working set is small
enough and stable enough for their scheme to be useful.
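The stability check as I understood it, on made-up data (the exact correlation statistic wasn't specified in the talk; Pearson correlation between adjacent days' access vectors is my assumption):

```python
# Sketch of the working-set stability check: one access-count vector per
# day (one element per object, mostly zeros); high correlation between
# days means the working set changes slowly.

import math

def pearson(x, y):
    """Pearson correlation of two equal-length access-count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two hypothetical days over 6 objects: the same two objects are hot on
# both days, so the working set is stable and correlation is high.
day1 = [50, 40, 0, 0, 3, 0]
day2 = [45, 42, 0, 1, 0, 0]
print(round(pearson(day1, day2), 2))
```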
- Graphs show very good write performance even with minimal tuning,
since policy is optimized for hot writes. But with more tuning,
the cost of doing tuning increases (since object promotions/demotions
are more frequent?). Sweet spot: 6% of storage for
replication gives essentially full performance for writes while saving
40% storage over 3-way replication (given their traces).
- Wasn't 100% clear from the talk how variable-extent objects are
handled--I get the sense this is a backing store, like FAB or Dstore.
It's definitely not a filesystem. The guy giving the talk wasn't
the lead author, so he wasn't sure what the variance in extents is.
- Not clear how failure during write is handled (do they need 2PC, do
they have some rollback/repair mechanism to avoid leaving the write
group in an inconsistent state when a failure occurs during a write,
etc.). Will have to read the paper for that.
- Slight conflict w/motivation: if disk space really is free, why not
replicate everything? (Answer: with good working sets, the
overhead/complexity of tuning is minimal, so it's like getting back 40%
of the disk space for free.)
- In all, this was one of the better papers.
- Diaconescu & Mos, Dublin City Univ: portable open monitoring
infrastructure for J2EE.
- Low overhead: they claimed this but evaluation was pretty
weak. I believe it though.
- Collaborative: multiple independent probes embedded by modifying
containers, injecting proxy layer for each component and generating new
EAR file (so not appserver-specific)
- Centralized: probe events come into JMX-based event dispatching layer,
which filters them and sends them to event handlers corresponding to
invoke, create, delete, ... You can also write your own probes
that are server-specific, etc.
- readings are correlated (analyzed together) to determine
"true" bottleneck/hotspot EJB's.
- interaction models help: since #paths < #permutations of EJB's, you
can instrument only the head of each path. When a probe indicates a bad
path, the other probes in that path are activated. (When passive,
they observe locally but do not report events to the main event
dispatcher.)
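My reconstruction of the path-head idea (names and structure are mine, not theirs): only the first probe on each call path reports by default, and a flagged head switches the rest of its path from passive to reporting.

```python
# Hedged sketch of path-head probing: instrument only the head of each
# EJB call path; when a head flags a bad path, activate (make reporting)
# the downstream probes on that path.

class Probe:
    def __init__(self, name):
        self.name = name
        self.reporting = False       # passive: observes, doesn't report

class PathMonitor:
    def __init__(self, paths):
        # paths: list of call paths, each a list of component names
        self.paths = [[Probe(n) for n in p] for p in paths]
        for path in self.paths:
            path[0].reporting = True # only the head reports by default

    def head_flagged(self, head_name):
        """A head probe flagged its path: activate the rest of that path."""
        for path in self.paths:
            if path[0].name == head_name:
                for probe in path[1:]:
                    probe.reporting = True

mon = PathMonitor([["Front", "Cart", "DB"], ["Front2", "Search"]])
mon.head_flagged("Front")            # activates Cart and DB, not Search
print([p.reporting for p in mon.paths[0]])
print([p.reporting for p in mon.paths[1]])
```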
- What triggers a probe?
- How to adapt when a probe identifies a bottleneck: not implemented,
but "you could" replicate stateless SB's or activate redundant
components, then reroute some requests to use the extra instances by
modifying deployment descriptors (??). How well did it work?
They don't know. It depends heavily on constructing a probe algo
that fires quickly and with a low false positive rate. This is
very hard to do!
- Started out promising but a bit weak overall. We may want to use
their monitoring framework though for J2EE apps, as it is free and
extensible.