Fault Tolerance
Johannes Sianipar
25 March 2021
Fault Tolerance, Fail, Error, Fault
■ Fault Tolerance
□ Whenever a failure occurs, the system should continue to operate in an
acceptable way while repairs are being made.
■ A system is said to fail when it cannot meet its promises.
■ An error is a part of a system’s state that may lead to a failure.
□ The cause of an error is called a fault.
□ For example, a programmer is the fault behind an error (a programming
bug), which in turn leads to a failure (a crashed program).
■ Fault tolerance means that a system can provide its services even in the
presence of faults.
Fault Tolerance Basic Concepts
■ Being fault tolerant is strongly related to what are called dependable
systems
■ Dependability implies the following:
□ Availability
– the system is ready to be used immediately
□ Reliability
– the system can run continuously without failure
□ Safety
– if a system fails, nothing catastrophic will happen
□ Maintainability
– when a system fails, it can be repaired easily and quickly (sometimes,
without its users noticing the failure)
Types of Fault
■ There are three main types of ‘fault’:
□ Transient Fault
– appears once, then disappears.
□ Intermittent Fault
– occurs, vanishes, and reappears, but follows no real pattern
(the worst kind).
□ Permanent Fault
– once it occurs, only the replacement/repair of the faulty component
will allow the distributed system to function normally
Failure Models
Halting failures classification
Halting type     Description
Fail-stop        Crash failures, but reliably detectable
Fail-noisy       Crash failures, eventually reliably detectable
Fail-silent      Crash failures: clients cannot tell what went wrong
Fail-safe        Arbitrary, yet benign failures (i.e., they cannot do any harm)
Fail-arbitrary   Arbitrary, with malicious failures
Failure Masking by Redundancy
■ Strategy: hide the occurrence of failure from other processes using
redundancy.
■ Three main types:
□ Information Redundancy
– add extra bits to allow for error detection/recovery (e.g., Hamming
codes and the like).
□ Time Redundancy
– perform an operation and, if need be, perform it again. Think about how
transactions work (BEGIN/END/COMMIT/ABORT).
□ Physical Redundancy
– add extra (duplicate) hardware and/or software to the system.
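The information redundancy item above mentions Hamming codes. As a minimal sketch, assuming the standard Hamming(7,4) layout (parity bits at positions 1, 2 and 4), the following shows how three extra bits let a receiver locate and correct a single flipped bit. The function names are hypothetical and chosen only for this illustration.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is 0 if no
    error was found and otherwise the 1-based position that was corrected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # the syndrome names the faulty position
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a transient bit flip in transit
    print(hamming74_decode(word))
```

The sketch handles only single-bit errors; two flipped bits would be miscorrected, which is why practical schemes often add an overall parity bit.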
Failure Masking by Redundancy (Cont.)
Triple modular redundancy
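As a rough illustration of triple modular redundancy, the sketch below models a single voter that receives the outputs of three replicated components and passes on the majority value, so that one faulty component is masked. The function name `vote` and the choice to raise an error when all three inputs disagree are assumptions made for this example.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    # Assumption: with three different inputs the voter cannot mask the fault.
    raise ValueError("no majority among replica outputs")


if __name__ == "__main__":
    print(vote(42, 42, 7))  # the single deviating replica is outvoted
```

In a full design each stage usually has three voters as well, so that a faulty voter is itself masked by the next stage.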
Process Resilience
■ Processes can be made fault tolerant by arranging to have a group of
processes, with each member of the group being identical.
■ A message sent to the group is delivered to all of the “copies” of the
process (the group members), and then only one of them performs the
required service.
■ If one of the processes fails, it is assumed that one of the others will still be
able to function (and service any pending request or operation).
Flat Groups versus Hierarchical Groups
(a) Communication in a flat group.
(b) Communication in a simple hierarchical group.
Failure Masking and Replication
■ By organizing a fault tolerant group of processes, we can protect a single
vulnerable process.
■ There are two approaches to arranging the replication of the group:
□ Primary (backup) Protocols
□ Replicated-Write Protocols
Groups and Failure masking
■ k-fault tolerant group
□ When a group can mask any k concurrent member failures (k is called the
degree of fault tolerance).
■ How large does a k-fault tolerant group need to be?
□ With halting failures (crash/omission/timing failures): we need a total of
k + 1 members, as no member will produce an incorrect result, so the result
of one member is good enough.
□ With arbitrary failures: we need 2k + 1 members so that the correct result
can be obtained through a majority vote.
■ Important assumptions
□ All members are identical
□ All members process commands in the same order
□ Result: We can now be sure that all processes do exactly the same thing.
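The group-size rule on this slide can be written down directly. The sketch below is only a restatement of the two cases (k + 1 for halting failures, 2k + 1 for arbitrary failures resolved by majority vote); the function name and the string labels for the failure models are hypothetical.

```python
def min_group_size(k, failure_model):
    """Smallest group that masks k concurrent member failures."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "halting":
        # One surviving member suffices, since no member returns a wrong result.
        return k + 1
    if failure_model == "arbitrary":
        # k wrong answers must be outvoted by k + 1 correct ones.
        return 2 * k + 1
    raise ValueError("unknown failure model: " + failure_model)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, min_group_size(k, "halting"), min_group_size(k, "arbitrary"))
```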
Consensus in Faulty Systems with Crash Failures
■ Prerequisite
□ In a fault-tolerant process group, each nonfaulty process executes the
same commands, and in the same order, as every other nonfaulty
process.
■ Reformulation
□ Nonfaulty group members need to reach consensus on which command
to execute next.
Flooding-based consensus
■ System model
□ A process group P = {P1,...,Pn}
□ Fail-stop failure semantics, i.e., with reliable failure detection
□ A client contacts a Pi requesting it to execute a command
□ Every Pi maintains a list of proposed commands
■ Basic algorithm (based on rounds)
□ In round r, Pi multicasts its known set of commands to all others
□ At the end of r, each Pi merges all received commands into a new set of
known commands
□ Next command selected through a globally shared, deterministic
function (select) applied to the merged set
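The round structure above can be simulated in a few lines. This is a sketch under stated assumptions, not the textbook's code: `select` is taken to be "smallest command in sort order", a process decides only in a round where it heard from every process that was alive when the round started, and a crash in the middle of a multicast is modelled by the hypothetical `partial` argument listing which receivers the crashing sender still reached.

```python
def select(commands):
    """Assumed deterministic choice shared by all processes."""
    return min(commands)


def flooding_round(known, senders, receivers, partial=None):
    """Run one round of flooding.

    known:     process id -> set of commands known at the start of the round
    senders:   processes alive at the start of the round
    receivers: processes still alive at the end of the round
    partial:   crashing sender -> set of receivers its multicast reached
    Returns (merged command sets, decisions made in this round).
    """
    partial = partial or {}
    merged = {p: set(known[p]) for p in receivers}
    decisions = {}
    for p in receivers:
        heard = {p}
        for s in senders:
            if s != p and p in partial.get(s, receivers):
                merged[p] |= known[s]
                heard.add(s)
        # Decide only if no sender was missed, i.e., no crash was detected.
        if heard == set(senders):
            decisions[p] = select(merged[p])
    return merged, decisions


if __name__ == "__main__":
    known = {1: {"c1"}, 2: {"c2"}, 3: {"c3"}, 4: {"c4"}}
    # Round 1: P1 crashes after its multicast reached only P2.
    known, decided = flooding_round(known, {1, 2, 3, 4}, {2, 3, 4}, {1: {2}})
    print("round 1:", decided)
    # Round 2: the survivors flood again; P3 and P4 now learn c1 via P2.
    known, decided = flooding_round(known, {2, 3, 4}, {2, 3, 4})
    print("round 2:", decided)
```

In the first round only P2 can decide; P3 and P4 miss P1's message and must wait one more round, after which all survivors select the same command.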
Flooding-based consensus: Example
■ P1 crashes.
■ P2 received all proposed commands from all other processes ⇒ makes a
decision.
■ P3 may have detected that P1 crashed, but does not know if P2 received
anything, i.e., P3 cannot know if it has the same information as P2 ⇒
cannot make a decision (same for P4).
Realistic consensus: Paxos
■ Assumptions (rather weak ones, and realistic)
□ A partially synchronous system (in fact, it may even be asynchronous).
□ Communication between processes may be unreliable: messages may be
lost, duplicated, or reordered.
□ Corrupted messages can be detected (and thus subsequently ignored).
□ All operations are deterministic: once an execution is started, it is known
exactly what it will do.
□ Processes may exhibit crash failures, but not arbitrary failures.
□ Processes do not collude.
Paxos essentials
■ Starting point
□ We assume a client-server configuration, with initially one primary server.
□ To make the server more robust, we start with adding a backup server.
□ To ensure that all commands are executed in the same order at
both servers, the primary assigns unique sequence numbers to all
commands.
□ In Paxos, the primary is called the leader.
□ Assume that actual commands can always be restored (either from clients
or servers) ⇒ we consider only control messages.
Some Paxos terminology
■ The leader sends an accept message ACCEPT(o,t) to backups when
assigning a timestamp t to command o.
■ A backup responds by sending a learn message: LEARN(o,t)
■ When the leader notices that operation o has not yet been learned,
it retransmits ACCEPT(o,t) with the original timestamp.
Two servers and one crash: problem
■ Primary crashes after executing an operation, but the backup
never received the accept message.
Two servers and one crash: solution
■ Never execute an operation before it is clear that it has been
learned.
Three servers and two crashes: still a problem?
■ What happens when LEARN(o1), as sent by S2 to S1, is lost?
□ S2 will also have to wait until it knows that S3 has learned o1.
Paxos: Fundamental Rule
■ In Paxos, a server S cannot execute an operation o until it has
received a LEARN(o) from all other nonfaulty servers.
Failure detection
■ Reliable failure detection is practically impossible.
□ A solution is to set timeouts, but take into account that a
detected failure may be false.
□ Each server is required to send a message declaring it is still
alive
Required number of servers
■ Paxos needs at least three servers
■ Adapted fundamental rule
□ In Paxos with three servers, a server S cannot execute an operation o
until it has received at least one (other) LEARN(o) message, so that it
knows that a majority of servers will execute o.
■ Assumptions before taking the next steps
□ Initially, S1 is the leader.
□ A server can reliably detect it has missed a message, and recover from
that miss.
□ When a new leader needs to be elected, the remaining servers follow a
strictly deterministic algorithm, such as S1 → S2 → S3.
□ A client cannot be asked to help the servers to resolve a situation.
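The adapted fundamental rule can be expressed as a small predicate. The sketch below assumes that a server counts itself plus the other servers from which it has received LEARN(o), and may execute o once that count is a majority. The slide states the rule only for three servers; the generalization to `n_servers` and the function name are assumptions for illustration.

```python
def can_execute(learns_from_others, n_servers=3):
    """True if a server may execute operation o.

    learns_from_others: ids of the other servers whose LEARN(o) has arrived.
    """
    majority = n_servers // 2 + 1
    # The server itself already knows o, so it counts toward the majority.
    return 1 + len(set(learns_from_others)) >= majority


if __name__ == "__main__":
    print(can_execute([]))        # no LEARN received yet
    print(can_execute(["S2"]))    # one other server has learned o
```

With three servers, a single LEARN from another server is enough, which matches the rule stated on the slide.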
■ If either one of the backups (S2 or S3) crashes, Paxos will behave correctly:
operations at nonfaulty servers are executed in the same order.
Paxos (Cont.)
■ Pages 443-449 in Distributed Systems by Maarten van Steen and Andrew S.
Tanenbaum.
Consensus under arbitrary failure semantics
■ Essence.
□ Consider process groups in which communication between processes is
inconsistent: (a) improper forwarding of messages, or (b) telling different
things to different processes.
Consensus under arbitrary failure semantics (Cont.)
■ System model
□ We consider a primary P and n − 1 backups B1 , . . . , Bn−1.
□ A client sends v ∈ {T , F} to P
□ Messages may be lost, but this can be detected.
□ Messages cannot be corrupted beyond detection.
□ A receiver of a message can reliably detect its sender.
■ Byzantine agreement: requirements
□ BA1: Every nonfaulty backup process stores the same value.
□ BA2: If the primary is nonfaulty then every nonfaulty backup process
stores exactly what the primary had sent.
■ Notes
□ Primary faulty ⇒ BA1 says that backups may all store the same value, but one
that differs from (and is thus wrong relative to) the value originally sent
by the client.
□ Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.
Why having 3k processes is not enough
Why having 3k + 1 processes is enough
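The figures for these two slides are not reproduced here. As a stand-in, the following sketch works through the case k = 1 under assumptions that are not taken from the slides: backups exchange the value they received from the primary in a single forwarding step, each honest backup takes the majority of what it has seen, and a tie means it cannot decide. The names `majority` and `run` and the `lies` argument (what a faulty backup tells each other backup) are hypothetical.

```python
from collections import Counter


def majority(values):
    """Most frequent value, or None when the top two counts tie."""
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def run(primary_sends, lies=None):
    """Decision of every honest backup after one forwarding step.

    primary_sends: backup -> value it received from the primary
    lies:          faulty backup -> {receiver: value it claims to have received}
    """
    lies = lies or {}
    decisions = {}
    for b in primary_sends:
        if b in lies:
            continue  # faulty backups have no meaningful decision
        view = [primary_sends[b]]
        for other in primary_sends:
            if other == b:
                continue
            if other in lies:
                view.append(lies[other][b])
            else:
                view.append(primary_sends[other])  # honest forwarding
        decisions[b] = majority(view)
    return decisions


if __name__ == "__main__":
    # 3 processes (3k), faulty primary: the two backups see a tie.
    print(run({"B1": "T", "B2": "F"}))
    # 4 processes (3k + 1), faulty primary: all backups agree.
    print(run({"B1": "T", "B2": "F", "B3": "T"}))
    # 4 processes, honest primary, faulty backup B3: B1 and B2 still store T.
    print(run({"B1": "T", "B2": "T", "B3": "T"},
              lies={"B3": {"B1": "F", "B2": "F"}}))
```

With three processes an honest backup cannot tell whether the primary or the other backup is lying, whereas with four the single faulty process is always outvoted.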
Distributed consensus: when can it be reached
■ Formal requirements for consensus
□ Processes produce the same output value
□ Every output value must be valid
□ Every process must eventually provide output
Failure detection
■ Issue
□ How can we reliably detect that a process has actually crashed?
■ General model
□ Each process is equipped with a failure detection module
□ A process P probes another process Q for a reaction
□ If Q reacts: Q is considered to be alive (by P)
□ If Q does not react within t time units: Q is suspected to have crashed
■ Observation for a synchronous system
□ a suspected crash ≡ a known crash
Practical failure detection
■ If P did not receive a heartbeat from Q within time t: P suspects Q.
■ If Q later sends a message (which is received by P):
□ P stops suspecting Q
□ P increases the timeout value t
■ Note: if Q did crash, P will keep suspecting Q
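The heartbeat scheme above can be sketched as a small class that P keeps for each monitored process Q. Time is passed in explicitly so the example stays deterministic. The class name, the method names, and the `backoff` factor are assumptions: the slide only says that the timeout is increased, not by how much.

```python
class HeartbeatDetector:
    """P's view of one monitored process Q."""

    def __init__(self, timeout, backoff=2.0, now=0.0):
        self.timeout = timeout      # t: how long P waits for a heartbeat
        self.backoff = backoff      # assumed growth factor after a false suspicion
        self.last_heard = now
        self.suspected = False

    def check(self, now):
        """Suspect Q if no message arrived within the current timeout."""
        if now - self.last_heard > self.timeout:
            self.suspected = True
        return self.suspected

    def on_message(self, now):
        """A message from Q arrived: stop suspecting and be more patient."""
        if self.suspected:
            self.suspected = False
            self.timeout *= self.backoff
        self.last_heard = now


if __name__ == "__main__":
    q = HeartbeatDetector(timeout=1.0)
    print(q.check(1.5))   # heartbeat overdue, Q becomes suspected
    q.on_message(1.6)     # Q was only slow: suspicion withdrawn
    print(q.suspected, q.timeout)
```

If Q really crashed, `on_message` is never called again and `check` keeps returning True, matching the note on the slide.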
Reference
■ Distributed Systems: Principles and Paradigms, by Andrew S. Tanenbaum
and Maarten van Steen
Thank you
for your attention!
Johannes Sianipar