- . Christoph Fetzer. IEEE Transactions on Computers,
52(2):99-112, February 2003 [PDF]
|
Summary by
AF |
One-line summary:
Timeout-based crash detection meets leases: if a machine cannot renew a lease
from the "observers" responsible for detecting it crashed, then it
kills itself; if all observers can agree (distributed snapshot) than no lease
renewal was seen, they can be sure the machine is crashed. Prevents the
problem of inconsistent behavior when machines make incorrect assumptions about
when a peer crashed.
Overview/Main Points
- Assumptions: computers can recover from a crash (either on their
own or by some external recovery agent). ie, the computers are
to some extent crash-only.
- Problem being addressed: suppose we use timeout or heartbeat type
protocols to detect failure. In many cases it's possible that
for other reasons, we incorrectly infer a failure when it
hasn't happened. (eg, network partition, remote node is too
busy to heartbeat but hasn't actually failed, etc.) We may
then act in a way that assumes that node is dead, but if we
were wrong, we might end up doing bad things (eg, writing to shared
state in the mistaken belief that the other node won't do so, etc.)
- Goal here: provide a way to be sure that our assumption
about the other guy crashing is actually correct.
Sketch of algo:
- A participant c attempts to get a lease from its two neighbors d,e.
(The protocol extends to multiple computers if you group them into
threes, where the neighbor cliques are not completely overlapping.)
- c sets a local watchdog timer such that if the lease expires
without renewal, c will crash itself.
- Suppose c crashes or becomes disconnected (or for whatever reason
doesn't try to renew its lease). d will detect this
fact. It can then ask e, using a distributed snapshot
algorithm, whether e believes it has an active lease for c.
If e says no, then it's safe for d to assume c has crashed
itself. Symmetrically for e.
- Suppose e says it does have an active lease for c. d
must continue to believe c is up, but can check again after e's
stated lease expires. Also, at some point d may try
to renew its own lease from c; if c replies, d can conclude c is up.
- Suppose c becomes disconnected. It will be unable to renew
its lease and will have to crash itself. Again, d or e will
detect it.
- Suppose d crashes or becomes disconnected. At some point c
will try to renew its lease with d, but it can't. It will then
try to renew with e, and by a symmetric argument, either it will
succeed (and thereby learn that d is crashed) or fail (in which case
it will crash itself).
- Assumption: If a computer self-recovers, it does not begin to do
so for D time units. (ie, recovery does not begin until all
pre-crash leases have expired.)
- Detail 1: the message-delivery service must be able to adjust for
maximum delivery delays and be "fail-aware", i.e. it must
know whether it successfully delivered the message. This is
subtly difficult to do in an end-to-end manner.
This is a theory paper, and it is instructive to see the form in
which the arguments about the theoretical correctness of the protocol
are made:
- Chandy and Lamport originally defined a class P of perfect failure
detectors with 2 properties: (we assume crashed computers don't
recover) let t be the time on the global clock, let CRASHED be
the set of all crashed computers, UP the set of noncrashed ones,
P(d,c,t) the opinion of computer d regarding the up-ness of computer
c at time t.
- Strong completeness: eventually, each computer that is crashed
is permanently suspected by all noncrashed computers (formally:
for all t, and for all c in CRASHED, and for all d in UP, and
for all u>=t, P(d,c,u)=crashed.
- Strong accuracy: a noncrashed computer is never
suspected of having crashed by anyone. Formally: for all
t,c,d: if c and d are up at t, then P(d,c,t) !=crashed.
- No recovery: once crashed, a computer stays crashed. Formally:
if a computer is crashed at time t, it is crashed for all times
u>=t.
- Since we want to consider recovery, this paper defines a new class
of failure detectors called TP (timed-perfect): a failure detector
in TP can detect crash failures and recovries within a known maximum
time. There is a proof that if all timed-perfect detectors are
perfect if the system has no recovery, ie if the No-Recovery
property holds, then any detector in TP satisfies the Strong
Completeness and Strong Accuracy properties.
- Let FD be a failure detector in TP. Each computer c
participating in the protocol has a module that implements FD
and maintains an array TP(j), each whose values is always
well-defined and is either up, crashed, or recovering:
- TP(j)=crashed if I suspect computer j to be crashed,
- TP(j)=up if I do not suspect computer j to be crashed,
- TP(j)=recovering if I myself am recovering from a crash and
therefore my TP() array is not properly initialized yet.
- The difference between TP and P is that (a) the no-recovery
property doesn't hold in TP, and (b) the Strong Completeness
property is replaced by two accuracy properties, Up-Accuracy and
Recovering-Accuracy. Formally, for any detector in TP, there
exists a finite constant D such that:
- Crash-accuracy: for any time u such that TP(d,c,u)=crashed,
there is some time t, where u-D<=t<=u (ie some time that
is at most D in the past) such that c is crashed at time t.
- Up-Accuracy: for any time u such that TP(d,c,u)=up, there is
some time t, at most D in the past, such that c is up at t.
- Recovering-accuracy: for any time u such that TP(d,c,u)=recovering,
there exists a time at most D in the past such that d
is crashed at t. (Or, as a special case, the system just
started.)
- The first two say: Your assessment about the other guy's
up-ness is stale by at most D. The third says: until time
D passes, you can't be sure about anything. The intuition
is: as soon as some computer reaches either a crashed or up
state, a detector in TP must correctly classify that computer
within time D.
- The theory body of the paper focuses on proofs that the algo
satisfies the three properties.
-
Relevance
- Shankar: it's cool that they can guarantee an upper bound on crash
detection and recovery detection even if there's no upper bound on
message delivery (can reject messages that arrive after a certain
delay).
- Shankar: Piggybacking distributed-snapshot messages on lease
messages is important
- Watchdog timer is used to give a specific guarantee that can't be
made in async systems - makes this algorithm more practical.
Flaws
|