Perfect failure detection in timed asynchronous systems
  • . Christoph Fetzer. IEEE Transactions on Computers, 52(2):99-112, February 2003 [PDF]
Summary by
AF

One-line summary:

Timeout-based crash detection meets leases: if a machine cannot renew a lease from the "observers" responsible for detecting it crashed, then it kills itself; if all observers can agree (distributed snapshot) than no lease renewal was seen, they can be sure the machine is crashed.  Prevents the problem of inconsistent behavior when machines make incorrect assumptions about when a peer crashed.

Overview/Main Points

  • Assumptions: computers can recover from a crash (either on their own or by some external recovery agent).  ie, the computers are to some extent crash-only.
  • Problem being addressed: suppose we use timeout or heartbeat type protocols to detect failure.  In many cases it's possible that for other reasons, we  incorrectly infer a failure when it hasn't happened.  (eg, network partition, remote node is too busy to heartbeat but hasn't actually failed, etc.)  We may then act in a way that assumes that node is dead, but if we were wrong, we might end up doing bad things (eg, writing to shared state in the mistaken belief that the other node won't do so, etc.)
  • Goal here: provide a way to be sure that our assumption about the other guy crashing is actually correct.

Sketch of algo:

  1. A participant c attempts to get a lease from its two neighbors d,e.  (The protocol extends to multiple computers if you group them into threes, where the neighbor cliques are not completely overlapping.)
  2. c sets a local watchdog timer such that if the lease expires without renewal, c will crash itself.
  3. Suppose c crashes or becomes disconnected (or for whatever reason doesn't try to renew its lease).  d will detect this fact.  It can then ask e, using  a distributed snapshot algorithm, whether e believes it has an active lease for c.  If  e says no, then it's safe for d to assume c has crashed itself.  Symmetrically for e. 
  4. Suppose e says it does have an active lease for c.  d must continue to believe c is up, but can check again after e's stated lease expires.  Also, at some point d may  try to renew its own lease from c; if c replies, d can conclude c is up.
  5. Suppose c becomes disconnected.  It will be unable to renew its lease and will have to crash itself.  Again, d or e will detect it.
  6. Suppose d crashes or becomes disconnected.  At some point c will try to renew its lease with d, but it can't.  It will then try to renew with e, and by a symmetric argument, either it will succeed (and thereby learn that d is crashed) or fail (in which case it will crash itself).
  7. Assumption: If a computer self-recovers, it does not begin to do so for D time units.  (ie, recovery does not begin until all pre-crash leases have expired.)
  8. Detail 1: the message-delivery service must be able to adjust for maximum delivery delays and be "fail-aware", i.e. it must know whether it successfully delivered the message.  This is subtly difficult to do in an end-to-end manner.  

This is a theory paper, and it is instructive to see the form in which the arguments about the theoretical correctness of the protocol are made:

  • Chandy and Lamport originally defined a class P of perfect failure detectors with 2 properties: (we assume crashed computers don't recover)  let t be the time on the global clock, let CRASHED be the set of all crashed computers, UP the set of noncrashed ones, P(d,c,t) the opinion of computer d regarding the up-ness of computer c at time t.
    • Strong completeness: eventually, each computer that is crashed is permanently suspected by all noncrashed computers (formally: for all t, and for all c in CRASHED, and for all d in UP, and for all u>=t,  P(d,c,u)=crashed.
    • Strong accuracy:  a noncrashed computer is never suspected of having crashed by anyone.  Formally: for all t,c,d: if c and d are up at t, then P(d,c,t) !=crashed.
    • No recovery: once crashed, a computer stays crashed. Formally: if a computer is crashed at time t, it is crashed for all times u>=t.
  • Since we want to consider recovery, this paper defines a new class of failure detectors called TP (timed-perfect): a failure detector in TP can detect crash failures and recovries within a known maximum time. There is a proof that if all timed-perfect detectors are perfect if the system has no recovery, ie if the No-Recovery property holds, then any detector in TP satisfies the Strong Completeness and Strong Accuracy properties.
  • Let FD be a  failure detector in TP.  Each computer c participating in the protocol has a module that implements FD and maintains an array TP(j), each whose values is always well-defined and is either up, crashed, or recovering:
    • TP(j)=crashed if I suspect computer j to be crashed,
    • TP(j)=up if I do not suspect computer j to be crashed,
    • TP(j)=recovering if I myself am recovering from a crash and therefore my TP() array is not properly initialized yet.
  • The difference between TP and P is that (a) the no-recovery property doesn't hold in TP, and (b) the Strong Completeness property is replaced by two accuracy properties, Up-Accuracy and Recovering-Accuracy.  Formally, for any detector in TP, there exists a finite constant D such that:
    • Crash-accuracy: for any time u such that TP(d,c,u)=crashed, there is some time t, where u-D<=t<=u (ie some time that is at most D in the past) such that c is crashed at time t.
    • Up-Accuracy: for any time u such that TP(d,c,u)=up, there is some time t, at most D in the past, such that c is up at t.
    • Recovering-accuracy: for any time u such that TP(d,c,u)=recovering, there exists a time at most D in the past such that d is crashed at t.  (Or, as a special case, the system just started.)
    • The first two say: Your assessment about the other guy's up-ness is stale by at most D.  The third says: until time D passes, you can't be sure about anything.  The intuition is: as soon as some computer reaches either a crashed or up state, a detector in TP must correctly classify that computer within time D.
    • The theory body of the paper focuses on proofs that the algo satisfies the three properties.
    •  

Relevance

  • Shankar: it's cool that they can guarantee an upper bound on crash detection and recovery detection even if there's no upper bound on message delivery (can reject messages that arrive after a certain delay).
  • Shankar: Piggybacking distributed-snapshot messages on lease messages is important
  • Watchdog timer is used to give a specific guarantee that can't be made in async systems - makes this algorithm more practical.

 

Flaws

Back to index

Summaries may be used for non-commercial purposes only, provided the summary's author and origin are acknowledged. For all other uses, please contact us.