Chapter 8: Fault Tolerance
Dependability
Basics
A component provides services to clients. To provide these services, the component
may in turn require services from other components ⇒ a component may depend
on some other component.
Specifically
A component C depends on C∗ if the correctness of C's behavior depends on
the correctness of C∗'s behavior. (Components are processes or channels.)
Dependability requirements
• Availability: readiness for usage
• Reliability: continuity of service delivery
• Safety: a very low probability of catastrophes
• Maintainability: how easily a failed system can be repaired
Traditional metrics
• Mean Time To Failure (MTTF): The average time until a component fails.
• Mean Time To Repair (MTTR): The average time needed to repair a
component.
• Mean Time Between Failures (MTBF): Simply MTTF + MTTR.
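These metrics yield the standard long-run availability estimate A = MTTF / (MTTF + MTTR) = MTTF / MTBF. A minimal sketch (the numbers are illustrative only):

def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the component is operational."""
    mtbf = mttf_hours + mttr_hours   # mean time between failures
    return mttf_hours / mtbf

# Example: a component that fails on average every 1000 hours and takes
# 2 hours to repair is available roughly 99.8% of the time.
print(availability(1000, 2))   # 0.998003...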
Observation
Reliability and availability make sense only if we have an accurate notion of
what a failure actually is.
Terminology
Failure, error, fault: a failure occurs when a component does not live up to its specifications; an error is the part of a component's state that may lead to a failure; a fault is the cause of an error.
Terminology
Handling faults: fault prevention (prevent the occurrence of a fault), fault tolerance (build a component such that it can mask the presence of faults), fault removal (reduce the presence, number, and seriousness of faults), and fault forecasting (estimate the current presence, future incidence, and consequences of faults).
Failure models
Types of failures: crash failures (a component halts, but behaved correctly until it halted), omission failures (a component fails to respond, to receive, or to send messages), timing failures (a response lies outside a specified time interval), response failures (a response is incorrect in value or in state transition), and arbitrary (Byzantine) failures (a component may produce arbitrary responses at arbitrary times).
Observation
Note that deliberate failures, be they omission or commission failures, are
typically security problems. Distinguishing between deliberate failures and
unintentional ones is, in general, impossible.
Halting failures
Scenario
C no longer perceives any activity from C∗ — a halting failure? Distinguishing
between a crash and an omission/timing failure may be impossible.
Assumptions we can make: fail-stop (crash failures that can be reliably detected), fail-noisy (crash failures that can eventually be reliably detected), fail-silent (crash and omission failures that other processes cannot distinguish), fail-safe (arbitrary yet benign failures that cannot do harm), fail-arbitrary (arbitrary failures, possibly unobservable).
Process resilience
Basic idea
Protect against malfunctioning processes through process replication,
organizing multiple processes into a process group. Distinguish between flat
groups and hierarchical groups.
Important assumptions
• All members are identical
• All members process commands in the same order
Result: We can now be sure that all processes do exactly the same thing.
Consensus
Prerequisite
In a fault-tolerant process group, each nonfaulty process executes the same
commands, and in the same order, as every other nonfaulty process.
Reformulation
Nonfaulty group members need to reach consensus on which command to
execute next.
Flooding-based consensus
System model
• A process group P = {P1, . . . , Pn}
• Fail-stop failure semantics, i.e., with reliable failure detection
• A client contacts a Pi requesting it to execute a command
• Every Pi maintains a list of proposed commands
Observations
• P2 received all proposed commands from all other processes ⇒ it makes a
decision.
• P3 may have detected that P1 crashed, but does not know whether P2 received
anything, i.e., P3 cannot know if it has the same information as P2 ⇒ it
cannot make a decision (the same holds for P4).
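A minimal sketch of the per-round decision rule under these assumptions (the process-group bookkeeping is illustrative; it glosses over the subtlety above that a process cannot tell what a crashed process still managed to send):

# Sketch: one round of flooding-based consensus at a process, assuming
# fail-stop semantics (crashes are reliably detected).
def flooding_round(my_id, my_commands, group, crashed, received):
    """group: all process ids; crashed: ids detected as crashed;
    received: dict sender_id -> set of commands heard this round."""
    expected = group - crashed - {my_id}
    if expected <= set(received):                 # heard from every live peer
        commands = set(my_commands).union(*received.values())
        return min(commands)                      # deterministic choice, e.g. lowest id
    return None                                   # cannot decide yet: flood again next round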
Raft
Developed for understandability
• Uses a fairly straightforward leader-election algorithm (see Chp. 5). The
current leader operates during the current term.
• Every server (typically there are five) keeps a log of operations, some of which
have been committed. A backup will not vote for a new leader if its own
log is more up to date.
• All committed operations have the same position in the log of each
respective server.
• The leader decides which pending operation is to be committed next ⇒ a
primary-backup approach.
When submitting an operation
• A client submits a request for operation o.
• The leader appends the request ⟨o, t⟩ to its own log (registering the
current term t and the current length of the log).
• The log is (conceptually) broadcast to the other servers.
• The others (conceptually) copy the log and acknowledge the receipt.
• When a majority of acks arrives, the leader commits o.
Note
In practice, only updates are broadcast. At the end, every server has the same
view and knows about the c committed operations. Note that effectively, any
information at the backups is overwritten.
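A minimal sketch of this commit rule; the send_log and count_acks callbacks are illustrative stand-ins, not Raft's actual RPC interface:

# Sketch: the leader appends an operation and commits it once a majority of
# the servers (counting itself) acknowledge having stored the entry.
class RaftLeaderSketch:
    def __init__(self, term, num_servers=5):
        self.term = term
        self.num_servers = num_servers
        self.log = []                 # entries (operation, term)
        self.commit_index = -1        # position of the last committed entry

    def submit(self, op, send_log, count_acks):
        self.log.append((op, self.term))
        index = len(self.log) - 1
        send_log(self.log)                                   # conceptually broadcast the log
        if count_acks(index) + 1 > self.num_servers // 2:    # +1: the leader's own copy
            self.commit_index = index                        # operation o is now committed
        return self.commit_index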
Crucial observations
• The new leader has the most committed operations in its log.
• Any missing commits will eventually be sent to the other backups.
Consensus in faulty systems with crash failures
Understanding Paxos
We will build up Paxos from scratch to understand where many consensus
algorithms actually come from.
Paxos essentials
Starting point
• We assume a client-server configuration, with initially one primary server.
• To make the server more robust, we start with adding a backup server.
• To ensure that all commands are executed in the same order at both
servers, the primary assigns unique sequence numbers to all commands.
In Paxos, the primary is called the leader.
• Assume that actual commands can always be restored (either from
clients or servers) ⇒ we consider only control messages.
Two-server situation
• When the leader notices that operation o has not yet been learned, it
retransmits ACCEPT(o, t) with the original timestamp.
Problem
Primary crashes after executing an operation, but the backup never received
the accept message.
Solution
Never execute an operation before it is clear that it has been learned.
Scenario
What happens when LEARN(o1), as sent by S2 to S1, is lost?
Solution
S2 will also have to wait until it knows that S3 has learned o1 .
Failure detection
Practice
Reliable failure detection is practically impossible. A solution is to set timeouts,
but take into account that a detected failure may be false.
Observation
If either one of the backups (S2 or S3 ) crashes, Paxos will behave correctly:
operations at nonfaulty servers are executed in the same order.
S2 missed ACCEPT(o1, 1)
• S2 did detect the crash and became the new leader
• If S2 sends ACCEPT(o1, 1) ⇒ S3 retransmits LEARN(o1).
Observation
Paxos (with three servers) behaves correctly when a single server crashes,
regardless of when that crash took place.
Essence of solution
When S2 takes over, it needs to make sure that any outstanding operations
initiated by S1 have been properly flushed, i.e., executed by enough servers.
This requires an explicit leadership takeover by which other servers are
informed before sending out new accept messages.
Consensus under arbitrary failure semantics
Observation
• Primary faulty ⇒ BA1 (every nonfaulty backup stores the same value) can still be
satisfied even though that value differs from (and is thus wrong with respect to)
the one originally sent by the client.
• Primary not faulty ⇒ satisfying BA2 (every nonfaulty backup stores exactly what
the primary sent) implies that BA1 is satisfied.
Assumptions
• A server may exhibit arbitrary failures
• Messages may be lost, delayed, and received out of order
• Messages have an identifiable sender (i.e., they are signed)
• Partially synchronous execution model
Essence
Practical Byzantine Fault Tolerance (PBFT): a primary-backup approach with 3k + 1 replica servers, tolerating at most k arbitrarily failing servers.
• C is the client
• P is the primary
• B1 , B2 , B3 are backups
• Assume B2 is faulty
Procedure
• The next primary P∗ is known deterministically.
• A backup server broadcasts VIEW-CHANGE(v + 1, P), where P is the set of
prepares it had sent out.
• P∗ waits for 2k + 1 VIEW-CHANGE messages, with X = ∪P containing all
previously sent prepares.
• P∗ sends out NEW-VIEW(v + 1, X, O), with O a new set of pre-prepare
messages.
• Essence: this allows the nonfaulty backups to replay what has gone on in
the previous view, if necessary, and bring o into the new view v + 1.
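A minimal sketch of the new primary's part of this procedure (the message layout is an assumed simplification, not the actual message format):

# Sketch: new primary P* builds NEW-VIEW(v+1, X, O) from 2k+1 VIEW-CHANGE messages.
def new_view(view_change_msgs, k, v_next):
    """view_change_msgs: list of (sender_id, set_of_prepares) for view v_next."""
    if len(view_change_msgs) < 2 * k + 1:
        return None                               # wait for more view-change messages
    X = set()
    for _, prepares in view_change_msgs:
        X |= prepares                             # union of all reported prepares
    O = {('PRE-PREPARE', v_next, p) for p in X}   # replayed in the new view
    return ('NEW-VIEW', v_next, frozenset(X), frozenset(O))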
Question
Are there limitations to what can be readily achieved?
• What is needed to enable reaching consensus?
• What happens when groups are partitioned?
Conclusion
In a network subject to communication failures, it is impossible to realize an
atomic read/write shared memory that guarantees a response to every
request.
Fundamental question
What are the practical ramifications of the CAP theorem?
Failure detection
Issue
How can we reliably detect that a process has actually crashed?
General model
• Each process is equipped with a failure detection module
• A process P probes another process Q for a reaction
• If Q reacts: Q is considered to be alive (by P)
• If Q does not react within t time units: Q is suspected to have crashed
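A minimal probe-based sketch of this model (the host, port, and probe/reply strings are illustrative assumptions):

import socket

# Sketch: P probes Q over TCP and suspects a crash after a timeout.
def probe(host, port, timeout=1.0):
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            s.sendall(b"ARE_YOU_ALIVE\n")
            return s.recv(32).startswith(b"I_AM_ALIVE")
    except OSError:
        return False   # no reaction within the timeout: suspect a crash (may be false)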
Reliable client-server communication
Problem
Consider a server crash in relation to an RPC. Where (a) is the normal case
(receive, execute, reply), situations (b) a crash after executing the operation but
before replying, and (c) a crash before the operation could be executed, require
different solutions. However, the client cannot tell which happened. Two approaches:
• At-least-once-semantics: The server guarantees it will carry out an
operation at least once, no matter what.
• At-most-once-semantics: The server guarantees it will carry out an
operation at most once.
Partial solution
Design the server such that its operations are idempotent: repeating the same
operation is the same as carrying it out exactly once:
• pure read operations
• strict overwrite operations
Many operations are inherently nonidempotent, such as many banking
transactions.
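A minimal sketch of the two cases, using illustrative names: a strict overwrite is idempotent as-is, while a nonidempotent deposit can be made safe to retry by filtering duplicates on a client-supplied request id (an at-most-once filter):

# Sketch: idempotent vs. nonidempotent server operations.
class AccountServer:
    def __init__(self):
        self.balance = 0
        self.processed = {}                 # request_id -> previous reply

    def set_balance(self, amount):
        self.balance = amount               # strict overwrite: idempotent as-is
        return self.balance

    def deposit(self, request_id, amount):
        if request_id in self.processed:    # duplicate (retransmitted) request
            return self.processed[request_id]
        self.balance += amount              # nonidempotent without the filter
        self.processed[request_id] = self.balance
        return self.balance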
Solution (dealing with orphan computations: server work whose client has crashed)
• The orphan is killed (or rolled back) by the client when it recovers
• The client broadcasts a new epoch number when recovering ⇒ servers kill
that client's orphans
• Require computations to complete within T time units; old ones are simply
removed
Reliable group communication
Tricky part
Agreement is needed on what the group actually looks like before a received
message can be delivered.
Distributed commit
Two-phase commit (2PC)
• Phase 1a: The coordinator multicasts VOTE-REQUEST to all participants.
• Phase 1b: Each participant replies with VOTE-COMMIT or VOTE-ABORT.
• Phase 2a: The coordinator collects the votes; if all voted to commit, it
multicasts GLOBAL-COMMIT, otherwise GLOBAL-ABORT.
• Phase 2b: Each participant waits for GLOBAL-COMMIT or GLOBAL-ABORT
and handles accordingly.
(Figure: finite state machines of the coordinator and of a participant in 2PC.)
Recovering a crashed participant requires making the entry into a state idempotent:
• ABORT: merely make entry into the abort state idempotent, e.g., by removing
the workspace of results
• COMMIT: also make entry into the commit state idempotent, e.g., by copying
the workspace to persistent storage
Observation
When distributed commit is required, having participants use temporary
workspaces to keep their results allows for simple recovery in the presence of
failures.
Cooperative termination: when participant P times out in the READY state, it
contacts another participant Q and acts on Q's state:
• Q in COMMIT ⇒ P makes the transition to COMMIT
• Q in ABORT ⇒ P makes the transition to ABORT
• Q in INIT ⇒ P makes the transition to ABORT
• Q in READY ⇒ contact yet another participant
Result
If all participants are in the READY state, the protocol blocks. Apparently, the
coordinator has failed. Note: the protocol prescribes that we need the decision
from the coordinator.
Alternative
Let a participant P in the READY state timeout when it hasn’t received the
coordinator’s decision; P tries to find out what other participants know (as
discussed).
Observation
The essence of the problem is that a recovering participant cannot make a local
decision: it depends on other (possibly failed) processes.
Coordinator in Python
class Coordinator:
    def run(self):
        yetToReceive = list(self.participants)
        self.log.info('WAIT')
        self.chan.sendTo(self.participants, VOTE_REQUEST)
        while len(yetToReceive) > 0:
            msg = self.chan.recvFrom(self.participants, BLOCK, TIMEOUT)
            if msg == -1 or (msg[1] == VOTE_ABORT):
                self.log.info('ABORT')
                self.chan.sendTo(self.participants, GLOBAL_ABORT)
                return
            else:  # msg[1] == VOTE_COMMIT
                yetToReceive.remove(msg[0])
        self.log.info('COMMIT')
        self.chan.sendTo(self.participants, GLOBAL_COMMIT)
Participant in Python
class Participant:
    def run(self):
        self.log.info('INIT')
        msg = self.chan.recvFrom(self.coordinator, BLOCK, TIMEOUT)
        if msg == -1:  # Crashed coordinator - give up entirely
            decision = LOCAL_ABORT
        else:  # Coordinator will have sent VOTE_REQUEST
            decision = self.do_work()
            if decision == LOCAL_ABORT:
                self.chan.sendTo(self.coordinator, VOTE_ABORT)
                self.log.info('LOCAL_ABORT')
            else:  # Ready to commit, enter READY state
                self.log.info('READY')
                self.chan.sendTo(self.coordinator, VOTE_COMMIT)
                msg = self.chan.recvFrom(self.coordinator, BLOCK, TIMEOUT)
                if msg == -1:  # Crashed coordinator - check the others
                    self.log.info('NEED_DECISION')
                    self.chan.sendTo(self.participants, NEED_DECISION)
                    while True:
                        msg = self.chan.recvFromAny()
                        if msg[1] in [GLOBAL_COMMIT, GLOBAL_ABORT, LOCAL_ABORT]:
                            decision = msg[1]
                            break
                else:  # Coordinator came to a decision
                    decision = msg[1]
        if decision == GLOBAL_COMMIT:
            self.log.info('COMMIT')
        else:  # decision in [GLOBAL_ABORT, LOCAL_ABORT]
            self.log.info('ABORT')
        while True:  # Help any other participant when coordinator crashed
            msg = self.chan.recvFrom(self.participants)
            if msg[1] == NEED_DECISION:
                self.chan.sendTo([msg[0]], decision)
Recovery: Background
Essence
When a failure occurs, we need to bring the system into an error-free state:
• Forward error recovery: Find a new state from which the system can
continue operation
• Backward error recovery: Bring the system back into a previous error-free
state
Practice
Use backward error recovery, requiring that we establish recovery points
Observation
Recovery in distributed systems is complicated by the fact that processes need
to cooperate in identifying a consistent state from which to recover.
Recovery line
Assuming processes regularly checkpoint their state, the recovery line is the most
recent consistent global checkpoint.
Checkpointing
Coordinated checkpointing
Essence
Each process takes a checkpoint after a globally coordinated action.
Simple solution
Use a two-phase blocking protocol:
• A coordinator multicasts a checkpoint request message
• When a participant receives such a message, it takes a checkpoint, stops
sending (application) messages, and reports back that it has taken a
checkpoint
• When all checkpoints have been confirmed at the coordinator, the latter
broadcasts a checkpoint done message to allow all processes to continue
Observation
It is possible to consider only those processes that depend on the recovery of
the coordinator, and ignore the rest.
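A minimal sketch of the coordinator's side of this protocol, assuming a simple group-communication object chan with multicast and receive operations:

# Sketch: two-phase blocking checkpoint protocol, coordinator side.
def coordinated_checkpoint(chan, participants):
    chan.multicast(participants, 'CHECKPOINT_REQUEST')
    pending = set(participants)
    while pending:                                # phase 1: collect confirmations
        sender, msg = chan.receive()
        if msg == 'CHECKPOINT_TAKEN':
            pending.discard(sender)               # sender checkpointed and is now quiescent
    # phase 2: all checkpoints confirmed; let everyone resume sending messages
    chan.multicast(participants, 'CHECKPOINT_DONE')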
Cascaded rollback
Observation
If checkpointing is done at the “wrong” instants, the recovery line may lie at
system startup time. We have a so-called cascaded rollback.
Independent checkpointing
Essence
Each process independently takes checkpoints, with the risk of a cascaded
rollback to system startup.
• Let CPi(m) denote the mth checkpoint of process Pi and INTi(m) the interval
between CPi(m−1) and CPi(m).
• When process Pi sends a message in interval INTi(m), it piggybacks (i, m).
• When process Pj receives a message in interval INTj(n), it records the
dependency INTi(m) → INTj(n).
• The dependency INTi(m) → INTj(n) is saved to stable storage when taking
checkpoint CPj(n).
Observation
If process Pi rolls back to CPi(m−1), Pj must roll back to CPj(n−1).
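A minimal sketch of this dependency tracking; the channel interface and the list used as stable storage are assumptions for illustration:

# Sketch: independent checkpointing with piggybacked interval indices.
class IndependentCheckpointer:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 1        # current interval index m of INTi(m)
        self.deps = set()        # dependencies recorded during this interval

    def send(self, chan, dest, payload):
        chan.send(dest, (self.pid, self.interval, payload))   # piggyback (i, m)

    def receive(self, msg):
        sender, sender_interval, payload = msg
        # record INT_sender(sender_interval) -> INT_self(self.interval)
        self.deps.add(((sender, sender_interval), (self.pid, self.interval)))
        return payload

    def checkpoint(self, stable_storage):
        # save the dependencies of the interval that ends with this checkpoint
        stable_storage.append((self.pid, self.interval, frozenset(self.deps)))
        self.interval += 1
        self.deps = set()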
Message logging
Alternative
Instead of taking an (expensive) checkpoint, try to replay your (communication)
behavior from the most recent checkpoint ⇒ store messages in a log.
Assumption
We assume a piecewise deterministic execution model:
• The execution of each process can be considered as a sequence of state
intervals
• Each state interval starts with a nondeterministic event (e.g., message
receipt)
• Execution in a state interval is deterministic
Conclusion
If we record nondeterministic events (to replay them later), we obtain a
deterministic execution model that will allow us to do a complete replay.
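A minimal sketch of this idea: log each nondeterministic event (here, a message receipt) before handling it, so that execution can later be replayed deterministically. The file-based log is an assumption for illustration:

import pickle

# Sketch: piecewise deterministic replay of received messages.
class ReplayableProcess:
    def __init__(self, logfile):
        self.logfile = logfile
        self.state = 0

    def _handle(self, msg):
        self.state += msg          # deterministic computation within a state interval

    def deliver(self, msg):
        with open(self.logfile, 'ab') as f:
            pickle.dump(msg, f)    # record the nondeterministic event first
        self._handle(msg)

    def replay(self):
        with open(self.logfile, 'rb') as f:
            while True:
                try:
                    self._handle(pickle.load(f))   # re-deliver in the original order
                except EOFError:
                    return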
Message-logging schemes
Notations
• DEP(m): processes to which m has been delivered. If message m∗ is
causally dependent on the delivery of m, and m∗ has been delivered to Q,
then Q ∈ DEP(m).
• COPY(m): processes that have a copy of m, but have not (yet) reliably
stored it.
• FAIL: the collection of crashed processes.
Characterization: process Q is an orphan if, for some message m, Q ∈ DEP(m), Q has not crashed, and COPY(m) ⊆ FAIL — that is, Q depends on m, but m can no longer be replayed.
Pessimistic protocol
For each nonstable message m, there is at most one process dependent on m,
that is, |DEP(m)| ≤ 1.
Consequence
An unstable message in a pessimistic protocol must be made stable before
sending the next message.
Optimistic protocol
For each unstable message m, we ensure that if COPY(m) ⊆ FAIL, then
eventually also DEP(m) ⊆ FAIL.
Consequence
To guarantee that DEP(m) ⊆ FAIL, we generally roll back each orphan process
Q until Q ∉ DEP(m).
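A minimal sketch of the resulting orphan check, representing DEP and COPY as dictionaries of sets (an assumed representation):

# Sketch: identifying orphan processes from the DEP/COPY/FAIL sets above.
def orphans(DEP, COPY, FAIL):
    """DEP, COPY: dicts mapping a message id to a set of process ids.
    A surviving process that depends on m is orphaned when every process
    holding a copy of m has crashed (m can no longer be replayed)."""
    result = set()
    for m, copies in COPY.items():
        if copies <= FAIL:                        # COPY(m) is a subset of FAIL
            result |= DEP.get(m, set()) - FAIL
    return result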