Unit5 Compressed Fault Tolerance - PACE
Lecture 8
Fault Tolerance
Dealing successfully with partial failure within a Distributed System (a review: Gartner, 1999).
Basic Concepts
Fault Tolerance is closely related to the notion of “Dependability”
In Distributed Systems, this is characterized under a number of headings:
Maintainability – when a system fails, it can be repaired easily and quickly (and, sometimes, without its users noticing the
failure).
What Is “Failure”?
Distinction between preventing, removing, and forecasting faults (Avizienis et al., 2004).
Fault tolerance - meaning that a system can provide its services even in the presence of faults.
Types of Faults
Permanent Fault – once it occurs, only the replacement/repair of a faulty component will allow the DS to function normally.
Failure Models
Different types of failures (Cristian, 1991; Hadzilacos and Toueg, 1993).
Timing failure – a server's response lies outside the specified time interval
Response failure – a server's response is incorrect
State transition failure – the server deviates from the correct flow of control
Information Redundancy – add extra bits to allow for error detection/recovery (e.g., Hamming codes and the like).
Time Redundancy – perform an operation and, if need be, perform it again.
Think about how transactions work (BEGIN/END/COMMIT/ABORT).
Physical Redundancy – add extra (duplicate) hardware and/or software to the system (a small voting sketch follows).
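As a hedged illustration of physical redundancy, here is a minimal Python sketch of triple modular redundancy (TMR): the same computation runs on three replicas and a majority vote masks one faulty replica. The replica functions and names are illustrative, not part of the lecture.

```python
from collections import Counter

def tmr_vote(replicas, x):
    """Triple modular redundancy: run the same computation on three
    replicas and return the majority result (masking one faulty replica)."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:              # at least two replicas agree
        return value
    raise RuntimeError("no majority: more than one replica is faulty")

# Hypothetical replicas: two correct, one faulty.
replicas = [lambda x: x * x, lambda x: x * x, lambda x: x * x + 1]
print(tmr_vote(replicas, 6))    # -> 36, the faulty replica is outvoted
```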
1. Process Resilience
2. Reliable Client/Server Communications
3. Reliable Group Communication
4. Distributed Commit
5. Recovery Strategies
Process Resilience
(Guerraoui and Schiper, 1997)
Processes can be made fault tolerant by arranging to have a group of processes, with each member of the group being identical.
A message sent to the group is delivered to all of the “copies” of the process (the group members), and then only one of them
performs the required service.
If one of the processes fails, it is assumed that one of the others will still be able to function (and service any pending request or operation).
Communication in a flat group – all the processes are equal, decisions are made collectively.
Communication in a simple hierarchical group - one of the processes is elected to be the coordinator, which selects another
process (a worker) to perform the operation.
Note: the coordinator is a single point of failure; however, decisions are easily and quickly made by the coordinator without first having to get consensus.
By organizing a fault-tolerant group of processes, we can protect a single vulnerable process.
A group of processes is organized in a hierarchical fashion in which a primary coordinates all write operations.
When the primary crashes, the backups execute some election algorithm to choose a new primary.
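A minimal sketch of how the backups might pick a new primary, assuming every backup knows the identifiers of all group members and has some failure detector telling it who is still alive; the `alive` predicate below is a placeholder for such a detector, and a real election algorithm (e.g., the bully algorithm) also handles concurrent elections.

```python
def elect_new_primary(members, alive):
    """Pick the highest-numbered member that is still responding.
    `members` is a list of process ids; `alive(pid)` is assumed to be
    backed by a real failure detector (e.g., are-you-alive pings)."""
    candidates = [pid for pid in members if alive(pid)]
    if not candidates:
        raise RuntimeError("no live member left to act as primary")
    return max(candidates)

# Example: primary 5 has crashed, so 4 becomes the new primary.
print(elect_new_primary([1, 2, 3, 4, 5], alive=lambda pid: pid != 5))
```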
Replicated-Write Protocols
Replicated-write protocols are used in the form of active replication, as well as by means of quorum-based protocols.
Advantage – these groups have no single point of failure, at the cost of distributed coordination.
Goal of distributed agreement algorithms - have all the non-faulty processes reach consensus on some issue, and to establish that
consensus within a finite number of steps.
Complications:
Different assumptions about the underlying system require different solutions, assuming solutions even exist.
A system is synchronous if and only if the processes are known to operate in a lock-step mode.
Formally, this means that there should be some constant c >= 1, such that if any processor has taken c + 1 steps, every other
process has taken at least 1 step.
A system that is not synchronous is said to be asynchronous.
Delay is bounded if and only if we know that every message is delivered within a globally known and predetermined maximum time.
A further distinction is message ordering: we distinguish the situation where messages from the same sender are delivered in the order that they were sent from the situation in which we do not have such guarantees.
Note - most distributed systems in practice assume that processes behave asynchronously, message transmission is unicast, and
communication delays are unbounded.
[Figure (c): the vectors that each general receives in step 3. It is clear to all that General 3 is the traitor; in each 'column', the majority value is assumed to be correct.]
Goal of Byzantine agreement is that consensus is reached on the value for the non-faulty processes only.
Assume that processes are synchronous, messages are unicast while preserving ordering, and communication delay is
bounded.
Assume N processes, where each process i will provide a value vi to the others.
Goal – let each process construct a vector V of length N, such that if process i is non-faulty, V[i] = vi; otherwise V[i] is undefined.
We assume that there are at most k faulty processes.
1. Every non-faulty process i sends vi to every other process using reliable unicasting.
Faulty processes may send anything and different values to different processes.
Let vi = i. In (Fig.a), process 1 reports 1, process 2 reports 2, process 3 lies to everyone, giving x, y, and z, respectively, and process 4 reports a value of 4.
2. The results of the announcements of step 1 are collected together in the form of the vectors (Fig.b).
3. Every process passes its vector from (Fig.b) to every other process.
Every process gets three vectors, one from every other process.
Process 3 lies, inventing 12 new values, a through l.
Results in (Fig.c).
4. Each process examines the ith element of each of the newly received vectors.
If any value has a majority, that value is put into the result vector.
If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
From (Fig.c) - 1, 2, and 4 all come to agreement on the values for v1, v2, and v4, which is the correct result.
What these processes conclude regarding v3 cannot be decided, but is also irrelevant.
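A small sketch of step 4, assuming each process has already collected the vectors sent by the other processes (Fig.c); the variable names are illustrative only.

```python
from collections import Counter

UNKNOWN = None

def agree(received_vectors):
    """Step 4: for each position i, take the majority value over the
    received vectors; if there is no majority, mark it UNKNOWN."""
    n = len(received_vectors[0])
    result = []
    for i in range(n):
        counts = Counter(vec[i] for vec in received_vectors)
        value, count = counts.most_common(1)[0]
        result.append(value if count > len(received_vectors) // 2 else UNKNOWN)
    return result

# Vectors that process 1 receives from 2, 3, and 4 in the example (process 3 lies).
vectors = [(1, 2, 'y', 4), ('a', 'b', 'c', 'd'), (1, 2, 'z', 4)]
print(agree(vectors))   # -> [1, 2, None, 4]: agreement on v1, v2, v4
```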
Example Again:
With 2 loyal generals and 1 traitor.
Note: It is no longer possible to determine the majority value in each column, and the algorithm has failed to produce agreement.
Lamport et al. (1982) proved that in a system with k faulty processes, agreement can be achieved only if 2k + 1 correctly
functioning processes are present, for a total of 3k + 1.
Agreement is possible only if more than two-thirds of the processes are working properly.
Kinds of Failures:
Processes actively send "are you alive?" messages to each other (for which they obviously expect an answer)
Makes sense only when it can be guaranteed that there is enough communication between processes.
1. The client is unable to locate the server.
2. The client's request to the server is lost, so no response is returned by the server to the waiting client.
3. The server crashes after receiving the request, and the service request is left acknowledged, but undone.
4. The server’s reply is lost on its way to the client, the service has completed, but the results never arrive at the client
5. The client crashes after sending its request, and the server sends a reply to a newly-restarted client that may not be expecting
it.
Server crashes are dealt with by implementing one of three possible implementation philosophies:
At least once semantics: a guarantee is given that the RPC occurred at least once, but (also) possibly more than once.
At most once semantics: a guarantee is given that the RPC occurred at most once, but possibly not at all.
No semantics: nothing is guaranteed, and client and servers take their chances!
It has proved difficult to provide exactly-once semantics.
A request that can be repeated any number of times without any nasty side-effects is said to be idempotent.
Nonidempotent requests (for example, the electronic transfer of funds) are a little harder to deal with.
Another technique is the inclusion of additional bits in a retransmission to identify it as such to the server.
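A hedged sketch of the bookkeeping behind at-most-once semantics: the client tags each request with an identifier, and the server remembers which identifiers it has already executed, so a retransmission is answered from a cache instead of being re-executed. The names (`request_id`, `handler`, the reply cache) are assumptions for illustration.

```python
class AtMostOnceServer:
    """Execute each request id at most once; on a retransmission,
    return the cached reply instead of re-executing the operation."""
    def __init__(self, handler):
        self.handler = handler
        self.replies = {}            # request_id -> cached reply

    def handle(self, request_id, payload):
        if request_id in self.replies:      # retransmission detected
            return self.replies[request_id]
        reply = self.handler(payload)       # execute exactly this once
        self.replies[request_id] = reply
        return reply

# Transferring funds twice with the same id has no extra effect.
server = AtMostOnceServer(handler=lambda amount: f"transferred {amount}")
print(server.handle("req-42", 100))   # executed
print(server.handle("req-42", 100))   # duplicate: cached reply returned
```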
Client Crashes
When a client crashes before an outstanding RPC completes, the abandoned computation (and any 'old' reply it eventually produces) is known as an orphan. Four ways of dealing with orphans:
1. extermination (the orphan is explicitly killed off, e.g., based on a log kept by the client),
2. reincarnation (each client session has an epoch associated with it, making orphans easy to spot),
3. gentle reincarnation (when a new epoch is identified, an attempt is made to locate a request's owner; otherwise the orphan is killed),
4. expiration (if the RPC cannot be completed within a standard amount of time, it is assumed to have expired).
In practice, however, none of these methods are desirable for dealing with orphans.
Orphan elimination is discussed in more detail by Panzieri and Shrivastava (1988).
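A minimal sketch of the reincarnation idea, assuming every request carries the client's current epoch number so the server can spot (and drop) work belonging to an earlier epoch; the class and field names are illustrative only.

```python
class ReincarnatingServer:
    """Track the latest epoch seen per client; requests or replies that
    carry an older epoch belong to an orphan and are dropped."""
    def __init__(self):
        self.latest_epoch = {}     # client_id -> newest epoch seen

    def accept(self, client_id, epoch):
        newest = self.latest_epoch.get(client_id, 0)
        if epoch < newest:
            return False           # orphaned work from before the crash
        self.latest_epoch[client_id] = epoch
        return True

server = ReincarnatingServer()
print(server.accept("client-A", epoch=1))   # True: normal request
print(server.accept("client-A", epoch=2))   # True: client rebooted, new epoch
print(server.accept("client-A", epoch=1))   # False: an orphan, easy to spot
```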
Reliable Group Communication
Reliable multicasting sounds simple, but is surprisingly tricky (as multicasting services tend to be inherently unreliable).
Small group: multiple, reliable point-to-point channels will do the job, however, such a solution scales poorly as the group
membership grows.
Worse: what happens if the sender of the multiple, reliable point-to-point channels crashes half way through sending the
messages?
The sending process assigns a sequence number to outgoing messages (making it easy to spot when a message is missing).
Assume that messages are received in the order they are sent.
Each multicast message is stored locally in a history buffer at the sender.
Assuming the receivers are known to the sender, the sender simply keeps the message in its history buffer until each receiver
has returned an acknowledgment.
If a receiver detects it is missing a message, it may return a negative acknowledgment, requesting the sender for a
retransmission.
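A hedged sketch of the sender-side bookkeeping just described: sequence numbers, a history buffer that is emptied only when every known receiver has acknowledged a message, and retransmission in response to a NACK. The transport function and process names are assumptions.

```python
class MulticastSender:
    """Sender-side bookkeeping for reliable multicast: sequence numbers,
    a history buffer, and retransmission on a negative acknowledgement."""
    def __init__(self, receiver_ids, send):
        self.receiver_ids = set(receiver_ids)
        self.send = send                   # send(receiver_id, seq, payload)
        self.next_seq = 0
        self.history = {}                  # seq -> payload (until all have ACKed)
        self.acked = {}                    # seq -> receivers that ACKed it

    def multicast(self, payload):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.history[seq] = payload
        self.acked[seq] = set()
        for rid in self.receiver_ids:
            self.send(rid, seq, payload)

    def on_ack(self, receiver_id, seq):
        self.acked[seq].add(receiver_id)
        if self.acked[seq] == self.receiver_ids:
            del self.history[seq]          # every receiver has it: drop it

    def on_nack(self, receiver_id, seq):
        self.send(receiver_id, seq, self.history[seq])   # retransmit

# Usage sketch over a hypothetical transport.
sender = MulticastSender(["p1", "p2"], send=lambda rid, seq, m: print(rid, seq, m))
sender.multicast("update X")
sender.on_ack("p1", 0); sender.on_nack("p2", 0); sender.on_ack("p2", 0)
```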
But, how long does the sender keep its history-buffer populated?
Also, such schemes perform poorly as the group grows … there are too many ACKs.
An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004).
Negative acknowledgements (NACKs) are multicast to all group members, signalling that the message is missing and that no further NACKs for it need be sent.
To avoid “retransmission clashes”, each member is required to wait a random delay prior to NACKing.
See Towsley et al. (1997) for details - but no hard guarantees can be given that feedback implosions will never happen.
A comparison between different scalable reliable multicasting can be found in Levine and Garcia-Luna-Aceves (1998).
Successful delivery is never acknowledged, only missing messages are reported (NACK), which are multicast to all group
members.
If another process is about to NACK, this feedback is suppressed as a result of the first multicast NACK.
Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the
suppression of others.
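A minimal sketch of this feedback-suppression scheme: each receiver schedules its NACK after a random delay and cancels it if it overhears another receiver's NACK for the same message first. The timer handling (`tick`) is a stand-in for a real scheduler.

```python
import random

class NackSuppressingReceiver:
    """Wait a random delay before multicasting a NACK; if someone else's
    NACK for the same message is overheard first, suppress our own."""
    def __init__(self, multicast_nack, max_delay=0.5):
        self.multicast_nack = multicast_nack
        self.max_delay = max_delay
        self.pending = {}                     # seq -> scheduled firing time

    def missing(self, seq, now):
        # Schedule a NACK at a random point in the near future.
        self.pending[seq] = now + random.uniform(0, self.max_delay)

    def overheard_nack(self, seq):
        # Another receiver already asked for a retransmission: suppress ours.
        self.pending.pop(seq, None)

    def tick(self, now):
        for seq, fire_at in list(self.pending.items()):
            if now >= fire_at:
                self.multicast_nack(seq)      # we happened to be first
                del self.pending[seq]

r = NackSuppressingReceiver(multicast_nack=lambda seq: print("NACK", seq))
r.missing(7, now=0.0); r.overheard_nack(7); r.tick(now=1.0)   # nothing sent
```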
Conclusion:
Building reliable multicast schemes that can scale to a large number of receivers spread across a wide-area network, is a
difficult problem.
No single best solution exists, and each solution introduces new problems.
Atomic Multicast
Atomic multicast problem:
A requirement where the system needs to ensure that all processes get the message, or that none of them get it.
An additional requirement is that all messages arrive at all processes in sequential order.
Atomic multicasting ensures that nonfaulty processes maintain a consistent view of the database, and forces reconciliation
when a replica recovers and rejoins the group.
Virtual Synchrony
The concept of virtual synchrony was proposed by Kenneth Birman as the abstraction that group communication protocols should
attempt to build on top of an asynchronous system.
1. All recipients have identical group views when a message is delivered. (The group view of a recipient defines the set of "correct" processes from the perspective of that recipient.)
2. The destination list of the message consists precisely of the members in that view
3. The message should be delivered either to all members in its destination list or to no one at all. The latter case can occur only if
the sender fails during transmission.
Reliable multicast with the above properties is said to be virtually synchronous (Birman and Joseph, 1987).
Whole idea of atomic multicasting is that a multicast message m is uniquely associated with a list of processes to which it
should be delivered.
Delivery list corresponds to a group view, namely, the view on the set of processes contained in the group, which the sender
had at the time message m was multicast. (Virtual synchrony #2)
Each process on that list has the same view. In other words, they should all agree that m should be delivered to each one of
them and to no other process. (Virtual synchrony #1)
Need to guarantee that m is either delivered to all processes in the list in order or m is not delivered at all. (Virtual synchrony #3)
Message Ordering
Four different orderings:
1. Unordered multicasts
virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are
delivered by different processes
2. FIFO-ordered multicasts
the communication layer is forced to deliver incoming messages from the same process in the same order as they have been
sent
3. Causally-ordered multicasts
incoming messages are delivered such that (potential) causality between different messages is preserved
4. Totally-ordered multicasts
regardless of whether message delivery is unordered, FIFO ordered, or causally ordered, it is required additionally that when
messages are delivered, they are delivered in the same order to all group members.
Virtually synchronous reliable multicasting offering totally-ordered delivery of messages is called atomic multicasting.
With the three different message ordering constraints discussed above, this leads to six forms of reliable multicasting
(Hadzilacos and Toueg, 1993).
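As a hedged sketch of one of these orderings, the code below implements FIFO-ordered delivery with a per-sender hold-back queue: a message is delivered only after all earlier messages from the same sender have been delivered. The `deliver` callback stands in for handing the message to the application.

```python
class FifoOrderedReceiver:
    """Hold back messages until all earlier messages from the same
    sender have been delivered (FIFO order per sender)."""
    def __init__(self, deliver):
        self.deliver = deliver
        self.next_expected = {}        # sender -> next sequence number expected
        self.hold_back = {}            # sender -> {seq: message}

    def receive(self, sender, seq, message):
        self.hold_back.setdefault(sender, {})[seq] = message
        expected = self.next_expected.get(sender, 0)
        # Deliver as many consecutive messages as we now can.
        while expected in self.hold_back[sender]:
            self.deliver(sender, self.hold_back[sender].pop(expected))
            expected += 1
        self.next_expected[sender] = expected

r = FifoOrderedReceiver(deliver=lambda s, m: print(s, m))
r.receive("P", 1, "m1")          # held back: message 0 is still missing
r.receive("P", 0, "m0")          # delivers m0, then m1
```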
Distributed Commit
Examples of distributed commit, and how it can be solved are discussed in Tanisch (2000).
General Goal: We want an operation to be performed by all group members or none at all.
[In the case of atomic multicasting, the operation is the delivery of the message.]
There are three types of “commit protocol”: single-phase, two-phase and three-phase commit.
Single-phase commit: an elected coordinator simply tells all the other processes to perform the operation in question.
Two-phase commit (2PC) proceeds in four steps:
1. The coordinator sends a VOTE_REQUEST message to all group members.
2. Each group member returns either VOTE_COMMIT or VOTE_ABORT to the coordinator.
3. The coordinator collects all votes; if every member votes to commit, it multicasts GLOBAL_COMMIT, otherwise GLOBAL_ABORT.
4. Group members then COMMIT or ABORT based on the last message received from the coordinator.
First phase – voting phase: steps 1 and 2. Second phase – decision phase: steps 3 and 4.
It can lead to both the coordinator and the group members blocking, which may lead to the dreaded deadlock.
If the coordinator crashes, the group members may not be able to reach a final decision, and they may, therefore, block until the
coordinator recovers …
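A hedged, single-process sketch of the coordinator's decision rule in two-phase commit (steps 1-4 above). The message names follow the usual VOTE_REQUEST/VOTE_COMMIT/GLOBAL_COMMIT convention; a real implementation also needs timeouts and a write-ahead log to survive crashes.

```python
def two_phase_commit(participants):
    """Phase 1: ask every participant to vote.
    Phase 2: commit only if *all* voted to commit, otherwise abort."""
    votes = [p.vote_request() for p in participants]       # VOTE_REQUEST
    decision = "GLOBAL_COMMIT" if all(v == "VOTE_COMMIT" for v in votes) \
               else "GLOBAL_ABORT"
    for p in participants:
        p.decide(decision)                                  # step 4 happens here
    return decision

class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
    def vote_request(self):
        return "VOTE_COMMIT" if self.will_commit else "VOTE_ABORT"
    def decide(self, decision):
        print(self.name, "received", decision)

print(two_phase_commit([Participant("A"), Participant("B", will_commit=False)]))
# -> GLOBAL_ABORT, because one participant voted to abort
```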
Three-phase commit (3PC) avoids blocking on a crashed coordinator. Essence: the states of the coordinator and each participant satisfy the following two conditions:
1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be
made.
Recovery
Once a failure has occurred, it is essential that the process where the failure happened recovers to a correct state.
1. Backward Recovery: return the system to some previous correct state (using checkpoints), then continue executing (a minimal checkpoint sketch follows this list).
2. Forward Recovery: bring the system into a correct state, from which it can then continue to execute.
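A minimal checkpoint/restore sketch of backward recovery, assuming the process state is a picklable Python object and that a local file stands in for stable storage; real systems also have to coordinate checkpoints across processes to obtain a consistent global state.

```python
import pickle

CHECKPOINT_FILE = "checkpoint.pkl"   # assumed local file standing in for stable storage

def take_checkpoint(state):
    """Record the current (assumed consistent) state on stable storage."""
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

def recover():
    """Backward recovery: roll back to the most recent checkpoint."""
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

state = {"balance": 100, "processed": 7}
take_checkpoint(state)
state["balance"] = -999                # a fault corrupts the in-memory state
state = recover()                      # return to the previous correct state
print(state)                           # -> {'balance': 100, 'processed': 7}
```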
Backward recovery can be integrated into the middleware layer of a distributed system as a general-purpose service.
Disadvantages:
Checkpointing can be very expensive (especially when errors are very rare).
[Despite the cost, backward recovery is implemented more often than forward recovery. The "logging" of information can be thought of as a type of checkpointing.]
Recovery mechanisms are independent of the distributed application for which they are actually used – thus no guarantees can
be given that once recovery has taken place, the same or similar failure will not happen again.
Forward recovery requires that it be known in advance which errors may occur; only then, when an error occurs, does the recovery mechanism know how to bring the system forward to a correct state.
Example
Consider as an example: Reliable Communications.
Retransmission of a lost/damaged packet - backward recovery technique.
Erasure Correction - When a lost/damaged packet can be reconstructed as a result of the receipt of other successfully delivered
packets - forward recovery technique. [see Rizzo (1997)]
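A toy sketch of erasure correction as forward recovery: with one XOR parity packet per group of packets, any single lost packet can be rebuilt from the packets that did arrive, with no retransmission. Real erasure codes (e.g., those surveyed by Rizzo) tolerate more losses.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """Parity packet = XOR of all data packets (all assumed equal length)."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return parity

def reconstruct(received, parity):
    """Forward recovery: rebuild the single missing packet from the rest."""
    missing = parity
    for p in received:
        missing = xor_bytes(missing, p)
    return missing

packets = [b"data1", b"data2", b"data3"]
parity = make_parity(packets)
# Packet 2 is lost in transit, but can be reconstructed without retransmission.
print(reconstruct([packets[0], packets[2]], parity))   # -> b'data2'
```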
Elnozahy et al. (2002) and (Elnozahy and Planck, 2004) provide a survey of checkpointing and logging in distributed systems.
Recovery-Oriented Computing
Recovery-oriented computing - Start over again (Candea et al., 2004a).
Underlying principle – it may be much cheaper to optimize for recovery than to aim for systems that are free from failures for a long time.
Different flavors:
Restarting components – this means deleting all instances of the identified components, along with the threads operating on them, and (often) simply restarting the associated requests.
Another flavor – apply checkpointing and recovery techniques, but continue execution in a changed environment.
Basic idea – many failures can simply be avoided if programs are given extra buffer space, memory is zeroed before being allocated, the ordering of message delivery is changed (as long as this does not affect semantics), and so on (Qin et al., 2005).