Experimental Evaluation of the REE SIFT environment for spaceborne applications
K. Whisnant, R. K. Iyer, R. Some, et al.
Proc. DSN 2002 
Summary by
AF

One-line summary:

ARMORs are assigned to "protect" components of a componentized app, in that they can detect and recover from crashes and hangs in either themselves or the components they protect, using standard techniques such as heartbeats and progress counters.  The ARMORs use microcheckpointing of their own state to speed recovery, but provide no support for app-level checkpointing or recovery (beyond restarting failed app components; it is up to the component to do partial checkpoint recovery if it is so designed).  The app workload is an MPI scientific app; the metric being optimized is total execution completion time in the face of injected faults (SIGSTOP, SIGINT, single-bit flips in the register file, text segment, and heap).

Overview/Main Points

SIFT goals: detect crash and hang failures in itself and in MPI componentized apps; recover from those kinds of failures by a combination of restart and recovery from checkpoint.  No attempt is made at fault diagnosis (and therefore fault-specific recovery) - it's all generic.

App target: scientific (parallelized image processing) app running on a small cluster of 100Mbps-ethernet-connected nodes and using MPI for communication.  Metric being optimized is overall time-to-completion of app in the presence of injected faults, i.e. essentially a performability metric.

Assumptions about structure:  App components are "protected" by ARMORs that monitor them for liveness (using progress counters and heartbeats).  Each physical node also has an ARMOR daemon that "represents" that node to the SIFT environment: if communication with the ARMOR daemon is lost, it's assumed that the node failed.
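
To make the liveness checks concrete, here is a minimal sketch of progress-counter monitoring in C.  It only illustrates the idea: the protected component bumps a shared counter as it makes progress, and a monitor flags a hang when the counter stops advancing.  The names, intervals, and single-process structure are mine for illustration, not the ARMOR API.

    /* Sketch of progress-counter liveness detection (illustrative names and
     * intervals, not the ARMOR API).  The "component" proves forward progress
     * by bumping a counter; the monitor suspects a hang if it stops advancing. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    static atomic_ulong progress;            /* incremented by the protected component */

    static void *component(void *arg) {
        (void)arg;
        for (;;) {
            /* ... one unit of real work would go here ... */
            atomic_fetch_add(&progress, 1);  /* proof of forward progress */
            usleep(100 * 1000);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, component, NULL);

        unsigned long last = 0;
        for (int i = 0; i < 20; i++) {
            sleep(1);                        /* monitoring period (illustrative) */
            unsigned long now = atomic_load(&progress);
            if (now == last)
                fprintf(stderr, "hang suspected: no progress in last period\n");
            last = now;
        }
        return 0;                            /* exiting main also ends the component thread */
    }

A real ARMOR would presumably trigger restart/recovery rather than just logging, and the heartbeat case is analogous, with the counter replaced by periodic "I'm alive" messages between ARMORs.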

ARMORs detect and recover from crashes and hangs in both themselves and the app.  

  • Microcheckpointing: each ARMOR element is a single thread that receives messages exclusively through its ARMOR event-delivery API, and whose state is well encapsulated and not updated by anyone else.  Microcheckpointing consists of periodically bulk-copying this state to (simulated) NVRAM, so that after an ARMOR element failure it can restart from a checkpoint (see the sketch after this list).  What state is actually captured in ARMOR elements?
  • No particular support for app-level checkpoints is provided.  The particular app used (pipelined image processing) has its own app-specific "checkpointing": it writes intermediate files after each stage of processing, and upon restart can check whether it can resume from a recent intermediate result (also sketched after this list).
  • Sometimes a failure of an ARMOR can cause a failure of the app: if ARMOR recovery takes long enough that heartbeats are missed from another ARMOR (or the app blocks, causing its progress counter to stop increasing), the rest of the SIFT will see it as a failure and restart all or part of the app.  Two different cases are described, accounting for 2 and 22 out of 178 runs respectively.
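
A minimal sketch of the microcheckpointing idea, assuming (as the paper describes) that an element's state is one well-encapsulated blob that only the element's single thread mutates between message deliveries.  The struct, file name, and "NVRAM-as-a-file" stand-in are mine for illustration, not the ARMOR implementation.

    /* Sketch of ARMOR-style microcheckpointing: bulk-copy the element's whole
     * (well-encapsulated) state to stable storage, and reload it on restart.
     * Names and the file-backed "NVRAM" are illustrative. */
    #include <stdio.h>
    #include <string.h>

    struct element_state {                   /* stand-in for an element's state */
        unsigned long msgs_handled;
        int           last_component_status;
    };

    /* Bulk-copy the whole state after handling a message (or periodically). */
    static int micro_checkpoint(const struct element_state *s, const char *path) {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        size_t n = fwrite(s, sizeof *s, 1, f);
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    /* On restart after a crash/hang, resume from the last checkpoint. */
    static int restore_checkpoint(struct element_state *s, const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;                   /* no checkpoint: cold start */
        size_t n = fread(s, sizeof *s, 1, f);
        fclose(f);
        return n == 1 ? 0 : -1;
    }

    int main(void) {
        struct element_state st;
        if (restore_checkpoint(&st, "element.ckpt") != 0)
            memset(&st, 0, sizeof st);       /* cold start */

        st.msgs_handled++;                   /* stand-in for handling one message */
        st.last_component_status = 0;
        micro_checkpoint(&st, "element.ckpt");

        printf("messages handled so far: %lu\n", st.msgs_handled);
        return 0;
    }

The app-level "checkpointing" is even simpler in spirit: each pipeline stage writes its output to a file, and a restarted run skips any stage whose output already exists.  Again a sketch with made-up file names, not the app's actual code:

    /* Sketch of restart-from-intermediate-results for a staged pipeline. */
    #include <stdio.h>

    static int output_exists(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        fclose(f);
        return 1;
    }

    static void run_stage(int stage, const char *out_path) {
        /* ... the real image-processing work would go here ... */
        FILE *f = fopen(out_path, "wb");
        if (f) { fprintf(f, "result of stage %d\n", stage); fclose(f); }
    }

    int main(void) {
        const char *outputs[] = { "stage1.out", "stage2.out", "stage3.out" };
        for (int i = 0; i < 3; i++) {
            if (output_exists(outputs[i]))
                continue;                    /* partial recovery: skip completed work */
            run_stage(i + 1, outputs[i]);
        }
        return 0;
    }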

Fault model:

  • Crash and hang faults were injected by sending SIGSTOP and/or SIGINT to ARMORs (a minimal injector sketch follows this list)
  • Other faults included single-bit flipping in either the register file, text segment, or heap of a running app component or ARMOR
  • In one case, this resulted in the ARMOR writing a corrupt checkpoint and repeatedly crashing on restart from that checkpoint until the overall app timed out.  (Note: in all cases, there is a separate node on the spacecraft, the SCC, that can reboot the whole cluster running the scientific workload en masse if the app as a whole does not complete within some bounded time; it is assumed that in this case, that node would take over.  In this respect ARMORs are just an optimization, albeit a big one.)
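
For concreteness, the crash/hang half of this fault model boils down to signal delivery; here is a minimal injector sketch.  The command-line interface and names are mine for illustration; the bit-flip injections would require debugger/ptrace-style machinery and are not shown.

    /* Sketch of crash/hang injection by signal: SIGSTOP to hang a target,
     * SIGINT to interrupt/crash it.  Usage: inject <pid> stop|int */
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> stop|int\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atol(argv[1]);
        int sig = strcmp(argv[2], "stop") == 0 ? SIGSTOP : SIGINT;
        if (kill(pid, sig) != 0) {           /* deliver the injected "fault" */
            perror("kill");
            return 1;
        }
        return 0;
    }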

Relevance

  • The performability metric makes sense for this app; what makes sense for Internet apps?  (Geo's/Rich Martin's "integrate under the curve" metric?)
  • Usual complaint about the fault model... therefore we should have a good story for our fault model
  • Not clear what happens in the case of correlated/cascaded failures, i.e. when the failure of one app component rapidly induces failures in others.  There doesn't appear to be anything in ARMOR to explicitly handle this - each failure seems to be treated separately.  For that matter, it's not clear to me what kinds of real-life conditions would cause a failure to manifest as a SIGINT or SIGSTOP.  (So we must be able to defend that there are real-life conditions that would cause failures to manifest as Java exceptions.)
  • The notion of well-framed state updates (at least for the ARMORs, which use checkpointing) is assumed.  In a longer paper that describes ARMORs more formally, this is made explicit: "there are some state variables that completely characterize the state of a process, and these variables are modified only through explicit execution of certain code blocks, which in turn happen as a result of a particular operation that results in the scheduling of a thread to enter one of the code blocks" (paraphrased by me, from their article in the IBM Systems Journal).
  • Metrics: the style of argument here is: (a) in steady state, the overhead added by the FD machinery is minimal; (b) for most kinds of failures, the total time to completion (the metric they optimize for) is greatly reduced by doing recovery using ARMORs rather than restarting the whole app blindly; (c) even if we have to restart the whole app blindly, that is still correct, though very slow and undesirable.

See also: the "Performability - an e-utility imperative" summary, and Rich Martin et al.'s Performability for Internet Services/PRESS paper in USITS '03.

We should "formalize" the J2EE "computational model" - can we use distinct bean types to be able to talk about when state updates actually occur? - and formalize the metric we plan to optimize around, then show (a) how much better life is with RR-Jboss than regular Jboss; (b) how much even better it can be with AFPI.  Seems like we should be able to do a lot more for generic J2EE apps, because the "middleware" is much more structured compared to MPI.

Flaws

  • What if the kernel has a problem, or a corrupted data structure?  (ARMORs check their own data structures, but the kernel doesn't.)  In the spirit of crash-only design, kernels should have self-protecting data structures, and if corruption is detected, the kernel should panic.  (Crash-only kernels?)
  • This app is built to do app-level checkpoints, but in general, can most MPI apps actually undergo partial recovery?
  • Why didn't they try doing bit flips in the NVRAM?  That would have caused a hang in a much larger number of cases.  (Or does the ARMOR checkpoint account for this by protecting its own checkpoint?  But they had a case where the ARMOR wrote and restarted from a bad checkpoint!)