K. Whisnant, R. K. Iyer, R. Some, et al.,
Proc. DSN 2002
Summary by AF
One-line summary:
ARMORs are assigned to "protect" components of a
componentized app: they can detect and recover from crashes and
hangs in either themselves or the components they protect, using
standard techniques such as heartbeats and progress counters. The
ARMORs use microcheckpointing of their own state to speed recovery, but
provide no support for app-level checkpointing or recovery (beyond
restarting failed app components; it is up to the component to do partial
checkpoint recovery if it is so designed). The app workload is an MPI
scientific app; the metric being optimized is total execution completion
time in the face of injected faults (SIGSTOP, SIGINT, single bit flips
in the register file, text segment, and heap).
Overview/Main Points
SIFT goals: detect crash and hang failures in itself and in MPI
componentized apps; recover from those kinds of failures by a combination
of restart and recovery from checkpoint. No
attempt is made to do fault diagnosis (and therefore fault-specific
recovery) - it's all generic.
App target: a scientific (parallelized image processing) app running on
a small cluster of 100Mbps-ethernet-connected nodes and using MPI for
communication. The metric being optimized is the overall
time-to-completion of the app in the presence of injected faults, i.e.
essentially a performability metric.
Assumptions about structure: App components are
"protected" by ARMORs that monitor them for liveness (using
progress counters and heartbeats). Each physical node also has an
ARMOR daemon that "represents" that node to the SIFT
environment: if communication with the ARMOR daemon is lost, it's
assumed that the node failed.
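To make the liveness detection concrete, here is a minimal sketch (my
illustration, not the paper's code) of a monitor combining a progress
counter with process liveness to tell hangs from crashes. The shared
counter, constants, and timing are all assumed values.

    /* Sketch of crash/hang detection via a progress counter (illustrative).
       The child "app component" bumps a shared counter as it makes progress;
       the monitoring "ARMOR" declares a hang if the counter stops advancing
       and a crash if the child exits unexpectedly. */
    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define CHECK_PERIOD_SEC 1   /* how often the monitor samples */
    #define HANG_THRESHOLD   3   /* stalled intervals before declaring a hang */

    int main(void) {
        /* Shared progress counter, visible to both monitor and component. */
        volatile long *progress = mmap(NULL, sizeof(long),
                                       PROT_READ | PROT_WRITE,
                                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (progress == MAP_FAILED) { perror("mmap"); return 1; }
        *progress = 0;

        pid_t child = fork();
        if (child == 0) {                 /* the "protected" app component */
            for (int step = 0; step < 10; step++) {
                usleep(300 * 1000);       /* do some work */
                (*progress)++;            /* one unit of progress completed */
            }
            sleep(60);                    /* simulate a hang: progress stops */
            _exit(0);
        }

        /* The monitor: watch both liveness (waitpid) and progress. */
        long last_seen = 0;
        int stalled = 0;
        for (;;) {
            sleep(CHECK_PERIOD_SEC);
            int status;
            if (waitpid(child, &status, WNOHANG) == child) {
                printf("crash detected: component exited\n");
                break;                    /* restart/recovery would go here */
            }
            if (*progress == last_seen) {
                if (++stalled >= HANG_THRESHOLD) {
                    printf("hang detected: no progress for %d intervals\n",
                           stalled);
                    kill(child, SIGKILL); /* recovery: restart the component */
                    break;
                }
            } else {
                stalled = 0;
                last_seen = *progress;
            }
        }
        return 0;
    }

In ARMOR, heartbeats presumably arrive as messages through the
event-delivery API rather than via shared memory; the counter stands in
for both signals here to keep the sketch small.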
ARMORs detect and recover from crashes and hangs in both themselves
and the app.
- Microcheckpointing: each ARMOR element is a single thread that
receives messages exclusively through its ARMOR event-delivery API,
and whose state is well-encapsulated and not updated by anyone
else. A microcheckpoint consists of periodically bulk-copying
this state to (simulated) NVRAM, so after an ARMOR element failure
it can restart from a checkpoint (see the sketch after this list).
Exactly what state is captured in ARMOR elements is unclear to me.
- No particular support for app-level checkpoints is provided. The
particular app used (pipelined image-processing) has its own
app-specific "checkpointing" in that it writes
intermediate files after each stage of processing, and upon restart
can check whether it can start from a recent intermediate result
(also sketched after this list).
- Sometimes a failure of an ARMOR can cause a failure of the app: if
ARMOR recovery takes long enough that heartbeats are missed from
another ARMOR (or the app blocks, causing its progress counter to
stop increasing), the rest of the SIFT will see it as a failure and
restart all or part of the app. Two different cases are
described, accounting for 2 and 22 out of 178 runs respectively.
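As referenced above, a rough sketch of the microcheckpointing idea,
assuming the element's entire state is one well-encapsulated struct that
only its own event thread modifies. The struct fields and the
"element.ckpt" path are invented for illustration; this is not the
ARMOR API.

    /* Sketch of ARMOR-style microcheckpointing (illustrative, not the
       paper's code).  The element's whole state lives in one struct that
       only its single event-handling thread modifies, so a checkpoint is a
       bulk copy of that struct to stable storage between messages. */
    #include <stdio.h>

    typedef struct {                 /* hypothetical ARMOR element state */
        long messages_handled;
        long last_progress_count;
        int  component_alive;
    } ElementState;

    /* Bulk-copy the whole state to (simulated) NVRAM, here a plain file. */
    static int micro_checkpoint(const ElementState *s, const char *nvram_path) {
        FILE *f = fopen(nvram_path, "wb");
        if (!f) return -1;
        size_t ok = fwrite(s, sizeof(*s), 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    /* After an element failure, reload the last checkpoint and resume. */
    static int micro_restore(ElementState *s, const char *nvram_path) {
        FILE *f = fopen(nvram_path, "rb");
        if (!f) return -1;
        size_t ok = fread(s, sizeof(*s), 1, f);
        fclose(f);
        return ok == 1 ? 0 : -1;
    }

    int main(void) {
        ElementState st = { 0, 0, 1 };
        for (int i = 0; i < 5; i++) {
            st.messages_handled++;                 /* handle one message */
            micro_checkpoint(&st, "element.ckpt"); /* checkpoint between messages */
        }
        ElementState recovered;
        if (micro_restore(&recovered, "element.ckpt") == 0)
            printf("restarted with %ld messages already handled\n",
                   recovered.messages_handled);
        return 0;
    }

Note that nothing here validates the state before copying it out, which
is exactly how the corrupt-checkpoint/crash-loop case under the fault
model below can arise.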
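For contrast, the app-level "checkpointing" described above amounts to
resuming from the last intermediate file that exists. A sketch, with the
stage_N.dat naming and the stage count being my assumptions:

    /* Sketch of the app-specific restart described above (illustrative):
       each pipeline stage writes an intermediate file, so on restart the
       app scans backwards for the most recent stage output and resumes
       from the stage after it.  File naming is an assumption. */
    #include <stdio.h>
    #include <unistd.h>

    #define NUM_STAGES 4

    static void run_stage(int stage) {
        char out[64];
        snprintf(out, sizeof(out), "stage_%d.dat", stage);
        /* ...process this stage's input, then write its intermediate result... */
        FILE *f = fopen(out, "wb");
        if (f) fclose(f);
    }

    int main(void) {
        /* On (re)start, find the last stage whose output already exists. */
        int resume_from = 1;
        for (int s = NUM_STAGES; s >= 1; s--) {
            char path[64];
            snprintf(path, sizeof(path), "stage_%d.dat", s);
            if (access(path, R_OK) == 0) { resume_from = s + 1; break; }
        }
        for (int s = resume_from; s <= NUM_STAGES; s++)
            run_stage(s);
        return 0;
    }

Whether most MPI apps can be restructured to allow this kind of partial
recovery is the question raised under Flaws below.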
Fault model:
- Crash and hang faults were injected by sending SIGSTOP and/or
SIGINT to ARMORs (a toy injector is sketched after this list).
- Other faults included single-bit flipping in either the register
file, text segment, or heap of a running app component or ARMOR
- In one case, this resulted in the ARMOR writing a corrupt
checkpoint and repeatedly crashing on restart from that checkpoint
until the overall app timed out. (Note that
in all cases there is a separate node on the spacecraft, the SCC,
that can reboot the whole cluster running the
scientific workload en masse if the app as a whole does not complete
within some bounded time; it is assumed that in that case the SCC would
take over. In this respect ARMORs are just an optimization,
albeit a big one.)
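A toy version of the signal-based injection referenced above (crash/hang
injection only; the register/text/heap bit flips would need ptrace or
similar and are not sketched). The command-line interface is mine, not
the paper's injection tooling.

    /* Toy crash/hang fault injector in the spirit of the experiments above
       (illustrative only).  SIGSTOP makes the target look hung (heartbeats
       and progress stop); SIGINT makes it look crashed.  The pid and delay
       are taken from the command line. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>

    int main(int argc, char **argv) {
        if (argc != 4) {
            fprintf(stderr, "usage: %s <pid> hang|crash <delay-seconds>\n", argv[0]);
            return 1;
        }
        pid_t target = (pid_t)atoi(argv[1]);
        int   delay  = atoi(argv[3]);

        sleep(delay);   /* let the workload reach steady state first */

        int sig = strcmp(argv[2], "hang") == 0 ? SIGSTOP : SIGINT;
        if (kill(target, sig) != 0) { perror("kill"); return 1; }
        printf("injected %s into pid %d\n", argv[2], (int)target);
        return 0;
    }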
Relevance
- A performability metric makes sense for this app; what makes sense
for Internet apps? (Geo's/Rich Martin's "integrate under the
curve" metric?)
- Usual complaint about the fault model... therefore we should have
a good story for our fault model
- Not clear what happens in the case of correlated/cascaded
failures, i.e. when the failure of one app component rapidly induces
failures in others. There doesn't appear to be anything in ARMOR to
explicitly handle this - each failure seems to be treated
separately. For that matter, it's not clear to me what kinds
of real-life conditions would cause a failure to manifest as a
SIGINT or SIGSTOP. (So we must be able to defend that there are
real-life conditions that would cause failures to manifest as
Java exceptions.)
- The notion of well-framed state updates (at least for the ARMORs,
which use checkpointing) is assumed. In a longer paper that
describes ARMORs more formally, this is made explicit: "there
are some state variables that completely characterize the state of a
process, and these variables are modified only through explicit
execution of certain code blocks, which in turn happen as a result
of a particular operation that results in the scheduling of a
thread to enter one of the code blocks" (paraphrased by me from
their article in IBM Sys. J.). A sketch of this discipline appears
after this list.
- Metrics: the style of argument here is: (a) in
steady-state, the overhead added by the fault-detection machinery is
minimal; (b) for most kinds of failures, the total time to completion
(the metric they optimize for) is greatly improved by doing recovery
using ARMORs rather than restarting the whole app blindly; (c) even if
we have to restart the whole app blindly, that is still correct, though
very slow and undesirable.
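A sketch of the "well-framed state updates" discipline paraphrased
above, with invented operation names: all state changes happen inside
explicit handler blocks, each run to completion by a single dispatch
thread, which is what makes a checkpoint taken between messages
consistent.

    /* Sketch of the "well-framed state update" discipline (illustrative,
       operation names invented).  All element state is modified only inside
       explicit handler blocks, each run to completion by a single dispatch
       thread, so the state is consistent whenever a checkpoint is taken
       between messages. */
    #include <stdio.h>

    typedef struct {
        long heartbeats_seen;
        long restarts_issued;
    } ElementState;

    typedef enum { OP_HEARTBEAT, OP_COMPONENT_FAILED } OpCode;

    /* The only code allowed to touch the state: one block per operation. */
    static void handle(ElementState *s, OpCode op) {
        switch (op) {
        case OP_HEARTBEAT:        s->heartbeats_seen++; break;
        case OP_COMPONENT_FAILED: s->restarts_issued++; break;
        }
    }

    int main(void) {
        ElementState state = { 0, 0 };
        OpCode inbox[] = { OP_HEARTBEAT, OP_HEARTBEAT, OP_COMPONENT_FAILED };

        /* Single dispatch thread: one message at a time; a microcheckpoint
           taken at the bottom of the loop always sees a complete update. */
        for (unsigned i = 0; i < sizeof(inbox) / sizeof(inbox[0]); i++) {
            handle(&state, inbox[i]);
            /* micro_checkpoint(&state, ...) would go here */
        }
        printf("heartbeats=%ld restarts=%ld\n",
               state.heartbeats_seen, state.restarts_issued);
        return 0;
    }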
See also: the "Performability - an
e-utility imperative" summary, and Rich Martin et al.'s
Performability for Internet Services/PRESS paper in USITS '03.
We should "formalize" the J2EE "computational
model" - can we use distinct bean types to be able to talk about
when state updates actually occur? - and formalize the metric we plan to
optimize around, then show (a) how much better life is with RR-Jboss
than with regular Jboss; (b) how much better still it can be with AFPI.
It seems like we should be able to do a lot more for generic J2EE apps,
because the "middleware" is much more structured than
MPI.
Flaws
- What if the kernel has a problem, or a corrupted data
structure? (ARMORs check their own data structures, but the
kernel doesn't.) In the spirit of crash-only, kernels should
have self-protecting data structures, and if corruption is detected,
the kernel should panic. (Crash-only kernels?)
- This app is built to do app-level checkpoints, but in general, can
most MPI apps actually undergo partial recovery?
- Why didn't they try doing bit flips in the NVRAM? That would
have caused a hang in a much larger number of cases. (Or does
the ARMOR checkpoint account for this by protecting its own
checkpoint? But they had a case where the ARMOR wrote and
restarted from a bad checkpoint!)