Fault Tolerance
Part I Introduction Part II Process Resilience Part III Reliable Communication Part IV Distributed Commit Part V Recovery
Most of the lecture notes are based on slides by Prof. Jalal Y. Kawash at Univ. of Calgary and Dr. Daniel M. Zimmerman at
CALTECH
Some of the lecture notes are based on slides by Scott Shenker and Ion Stoica at Univ. of California, Berkeley, Timo Alanko at Univ. of Helsinki, Finland, Hugh C. Lauer at Worcester Polytechnic Institute, and Xiuwen Liu at Florida State University
Chapter 8
Fault Tolerance
Part I Introduction
Fault Tolerance
A DS should be fault-tolerant
Should be able to continue functioning in the presence of faults
Dependability
Dependability includes:
- Availability
- Reliability
- Safety
- Maintainability
Faults
A system fails when it cannot meet its promises (specifications)
An error is a part of a system's state that may lead to a failure
A fault is the cause of an error
Fault tolerance: the system can provide services even in the presence of faults
Faults can be:
- Transient (appear once and disappear)
- Intermittent (appear-disappear-reappear behavior)
  A loose contact on a connector -> intermittent fault
Failure Models
Crash failure: A server halts, but is working correctly until it halts
Omission failure: A server fails to respond to incoming requests
  Receive omission: A server fails to receive incoming messages
  Send omission: A server fails to send messages
Timing failure: A server's response lies outside the specified time interval
Response failure: The server's response is incorrect
  Value failure: The value of the response is wrong
  State transition failure: The server deviates from the correct flow of control
Arbitrary failure (Byzantine failure): A server may produce arbitrary responses at arbitrary times
Failure Masking
Redundancy is the key technique for hiding failures
Redundancy types:
1. Information: add extra (control) information
   Example: error-correcting codes in messages
Chapter 8
Fault Tolerance
Part II Process Resilience
Process Resilience
Mask process failures by replication
Organize processes into groups; a message sent to a group is delivered to all members
Process Replication
Replicate a process and group the replicas in one group
How many replicas do we create?
A system is k fault-tolerant if it can survive and function even if it has k faulty processes
For crash failures (a faulty process halts, but is working correctly until it halts)
k+1 replicas
For Byzantine failures (a faulty process may produce arbitrary responses at arbitrary times)
2k+1 replicas
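The voting that masks Byzantine failures with 2k+1 replicas can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
from collections import Counter

def majority_vote(responses):
    """Return the value reported by a strict majority of replicas,
    or None if no value has a majority."""
    value, count = Counter(responses).most_common(1)[0]
    return value if count > len(responses) // 2 else None

# Crash failures: k+1 replicas suffice because at least one survivor
# still answers, and any answer it gives is correct.
# Byzantine failures: with k lying replicas, 2k+1 replicas guarantee
# that the k+1 correct answers outvote the k wrong ones.
print(majority_vote([42, 42, 99]))   # k = 1 Byzantine replica -> 42
```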
Agreement
Need agreement in DS:
Leader, commit, synchronize
Distributed agreement algorithm: all nonfaulty processes achieve consensus in a finite number of steps
- Perfect processes, faulty channels: the two-army problem
- Faulty processes, perfect channels: the Byzantine generals problem
Two-Army Problem
The Byzantine generals problem for 3 loyal generals and 1 traitor:
a) The generals announce the time to launch the attack (by messages marked with their ids)
b) The vectors that each general assembles based on (a)
c) The vectors that each general receives in step 3, where every general passes his vector from (b) to every other general
The same as in previous slide, except now with 2 loyal generals and one traitor.
Byzantine Generals
Given three processes, if one fails, consensus is impossible
Given N processes, if F processes fail, consensus is impossible if N ≤ 3F
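The vector exchange described above can be sketched in code (a hypothetical simulation: a traitor sends a different fabricated value to each peer, and loyal generals take a per-entry majority, with None standing for "unknown"):

```python
from collections import Counter

def byzantine_round(values, traitors):
    """Sketch of the vector exchange: every general announces a value,
    assembles a vector, then forwards its vector; each loyal general
    takes a per-entry majority over the vectors it received."""
    n = len(values)
    # Steps 1-2: the vector each general j assembles; a traitor sends a
    # different fabricated value to every peer.
    vectors = {j: [values[i] if i not in traitors else f"lie{i}->{j}"
                   for i in range(n)]
               for j in range(n)}

    def majority(col):
        v, c = Counter(col).most_common(1)[0]
        return v if c > len(col) // 2 else None

    # Step 3: each loyal general g takes the majority of every entry
    # across the vectors relayed by the other generals.
    return {g: [majority(col)
                for col in zip(*(vectors[j] for j in range(n) if j != g))]
            for g in range(n) if g not in traitors}

# N = 4 generals, F = 1 traitor (N > 3F): loyal generals 0-2 all decide
# the same vector, marking the traitor's entry as unknown.
print(byzantine_round([10, 10, 10, 99], traitors={3}))
```

With N = 3 and F = 1 the loyal majorities disappear, matching the impossibility result above.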
OceanStore
Global-Scale Persistent Storage on Untrusted Infrastructure
Update Model
Concurrent updates w/o wide-area locking
Conflict resolution
A master replica?
Updates Serialization
All updates are submitted to a primary tier of replicas, which chooses a final total order by following a Byzantine agreement protocol
The result of the updates is multicast down the dissemination tree to all the secondary replicas
Chapter 8
Fault Tolerance
Part III Reliable Communication
A simple solution to reliable multicasting when all receivers are known and assumed not to fail:
a) Message transmission
b) Reporting feedback
Atomic Multicast
All messages are delivered in the same order to all processes
Group view: the view on the set of processes contained in the group Virtual synchronous multicast: a message m multicast to a group view G is delivered to all non-faulty processes in G
The logical organization of a distributed system to distinguish between message receipt and message delivery
Message Delivery
Delivery of messages:
- new message => HBQ (hold-back queue)
- decision making:
  - delivery order
  - deliver or not to deliver?
- when the message is allowed to be delivered: HBQ => DQ (delivery queue)
- when at the head of DQ: message => application (application: receive)
[Figure: a message passes through the hold-back queue, then the delivery queue, before delivery to the application]
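The hold-back / delivery queue pair can be sketched as follows (a minimal sketch assuming the delivery rule is consecutive sequence numbers; the class and method names are hypothetical):

```python
import heapq

class HoldBackQueue:
    """Messages wait in the hold-back queue (HBQ) until the delivery
    rule (here: consecutive sequence numbers) lets them move to the
    delivery queue (DQ) and on to the application."""
    def __init__(self):
        self.hbq = []          # (seq, message) min-heap
        self.next_seq = 0      # next sequence number we may deliver
        self.dq = []           # messages handed to the application

    def receive(self, seq, message):
        heapq.heappush(self.hbq, (seq, message))
        # Move every message that is now deliverable: HBQ => DQ.
        while self.hbq and self.hbq[0][0] == self.next_seq:
            _, m = heapq.heappop(self.hbq)
            self.dq.append(m)
            self.next_seq += 1

q = HoldBackQueue()
q.receive(1, "b")          # out of order: held back in HBQ
q.receive(0, "a")          # releases both message 0 and message 1
print(q.dq)                # ['a', 'b']
```

Swapping in a different decision rule (e.g. total order from a sequencer) changes only the condition in the while loop.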
[Figure: virtual synchrony - a message from A is multicast in view Gi = (A, B, C); A crashes and the view changes to Gi+1 = (B, C); should the message still be delivered?]
Solution
Membership changes synchronize multicasting: during a multicast operation no membership changes occur
Virtual synchrony: all processes see message delivery and membership changes in the same order
[Figure: the message is delivered in view Gi = (A, B, C) before the view change to Gi+1 = (B, C)]
[Figure: processes P1-P5 exchange flush messages when installing a view change]
Announcement
2nd Midterm in the week after Spring Break
March 27, Wednesday
Distributed Commit
Goal: Either all members of a group decide to perform an operation, or none of them perform the operation
Atomic transaction: a transaction that happens completely or not at all
Assumptions
Failures:
- Crash failures from which processes can recover
- Communication failures detectable by timeouts
Notes:
Commit requires a set of processes to agree - similar to the Byzantine generals problem, but the solution is much simpler because of stronger assumptions
Distributed Transactions
[Figure: a distributed transaction - a client opens a transaction spanning several servers with databases; each server joins as a participant, e.g. one participant executes a.withdraw(4)]
One-phase Commit
One-phase commit protocol
One site is designated as the coordinator
The coordinator tells all the other processes whether or not to locally perform the operation in question
This scheme, however, is not fault tolerant
[Figure: the client runs "Open transaction; T_write F1,P1; T_write F2,P2; T_write F3,P3; Close transaction"; servers S1-S3 join as participants, each holding a file (F1, F2, F3) and a transaction record with flag init]
[Figure: 2PC message flow for a transaction ("Open transaction; T_read F1,P1; T_write F2,P2; T_write F3,P3; Close transaction") - the coordinator sends canCommit? to the participants, collects Yes votes, multicasts doCommit, and receives HaveCommitted; coordinator states: init, wait, committed, done; participant states: init, ready, committed]
a) The finite state machine for the coordinator in 2PC
b) The finite state machine for a participant
Actions taken by a participant P when residing in state READY and having contacted another participant Q.
Outline of the steps taken by the coordinator in 2PC:

write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}
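The coordinator's steps can be sketched as runnable code (a hypothetical simulation: vote callables stand in for participant messages, and None models a timeout):

```python
def two_phase_commit(participants, coordinator_votes_commit=True):
    """Sketch of the coordinator's side of 2PC. `participants` maps a
    name to a vote function returning 'VOTE_COMMIT', 'VOTE_ABORT', or
    None (timeout). Returns the global decision and the local log."""
    log = ["START_2PC"]                       # write START_2PC to local log
    votes = {}
    for name, vote in participants.items():   # multicast VOTE_REQUEST
        v = vote()
        if v is None:                         # timeout while collecting votes
            log.append("GLOBAL_ABORT")
            return "GLOBAL_ABORT", log
        votes[name] = v
    if all(v == "VOTE_COMMIT" for v in votes.values()) and coordinator_votes_commit:
        log.append("GLOBAL_COMMIT")
        return "GLOBAL_COMMIT", log
    log.append("GLOBAL_ABORT")
    return "GLOBAL_ABORT", log

decision, _ = two_phase_commit({
    "S1": lambda: "VOTE_COMMIT",
    "S2": lambda: "VOTE_COMMIT",
})
print(decision)            # GLOBAL_COMMIT
```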
Two-Phase Commit(7)
When all participants are in the READY state, no final decision can be reached: two-phase commit is a blocking commit protocol
A non-blocking commit protocol (e.g. three-phase commit) satisfies:
- There is no single state from which a transition can be made directly to either a COMMIT or an ABORT state
- There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made
Chapter 8
Fault Tolerance
Part V Recovery
Recovery
We've talked a lot about fault tolerance, but not about what happens after a fault has occurred
A process that exhibits a failure has to be able to recover to a correct state
There are two basic types of recovery:
Backward Recovery
The goal of backward recovery is to bring the system from an erroneous state back to a prior correct state
The state of the system must be recorded - checkpointed - from time to time, and then restored when things go wrong
Examples:
Forward Recovery
The goal of forward recovery is to bring a system from an erroneous state to a correct new state (not a previous state) Examples:
Reliable communication via erasure correction, such as an (n, k) block erasure code
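As an illustration of forward recovery, here is a minimal (n, k) erasure code with n = k + 1, i.e. a single XOR parity block (a sketch; real codes such as Reed-Solomon tolerate more than one erasure):

```python
from functools import reduce

def add_parity(blocks):
    """Append one XOR parity block to k data blocks, giving an
    (n, k) codeword with n = k + 1. Any one lost block can be
    reconstructed from the survivors - we move forward to a correct
    state instead of retransmitting."""
    parity = reduce(lambda a, b: a ^ b, blocks)
    return blocks + [parity]

def recover(received):
    """`received` is the n-block codeword with exactly one entry
    replaced by None (the erased block)."""
    missing = received.index(None)
    survivors = [b for b in received if b is not None]
    rebuilt = reduce(lambda a, b: a ^ b, survivors)
    return received[:missing] + [rebuilt] + received[missing + 1:]

codeword = add_parity([5, 12, 7])      # k = 3 data blocks, n = 4
codeword[1] = None                     # one block lost in transit
print(recover(codeword)[:3])           # [5, 12, 7]
```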
Backward recovery is far more widely applied
Checkpointing is costly, so it's often combined with message logging
Stable Storage
In order to store checkpoints and logs, information needs to be stored safely - not just able to survive crashes, but also able to survive hardware faults RAID is the typical example of stable storage
Checkpointing
Related to checkpointing, let us first discuss the global state and the distributed snapshot algorithm
The Global State of a distributed computation is the set of local states of all individual processes involved in the computation + the states of the communication channels How?
Synchronize the clocks of all processes and ask all processes to record their states at a known time t
Problems?
Global State
We cannot determine the exact global state of the system, but we can record a snapshot of it
Distributed Snapshot: a state the system might have been in [Chandy and Lamport]
+ So simple!! - Correct??
Example
Producer-Consumer problem
[Figure: p sends message m to q; the snapshot records q's receipt of m but not p's send]
The sender has no record of the sending; the receiver has a record of the receipt
What's Wrong?
Result: the global state has a record of the receive event but no send event, violating the happens-before relation!
Cuts
2. Marker receiving rule for a process Pk, on receipt of a marker over channel C:
if Pk has not yet recorded its state:
- record Pk's state
- record the state of C as empty
- turn on recording of messages over the other incoming channels
- for each outgoing channel C', send a marker on C'
else:
- record the state of C as all the messages received over C since Pk saved its state
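The marker-receiving rule can be sketched for a single process (a minimal sketch; the class and channel names are hypothetical, and marker sending is abstracted into a callback):

```python
class Process:
    """One process Pk in the Chandy-Lamport snapshot. `incoming` lists
    the channels into Pk; `send_markers` sends a marker on every
    outgoing channel."""
    def __init__(self, state, incoming, send_markers):
        self.state = state
        self.incoming = incoming
        self.send_markers = send_markers
        self.recorded_state = None
        self.channel_state = {}    # channel -> messages recorded for it
        self.recording = set()     # channels still being recorded

    def on_message(self, channel, msg):
        if channel in self.recording and msg != "MARKER":
            self.channel_state[channel].append(msg)   # in-flight message
        if msg == "MARKER":
            self.on_marker(channel)

    def on_marker(self, channel):
        if self.recorded_state is None:
            self.recorded_state = self.state          # record Pk's state
            self.channel_state = {c: [] for c in self.incoming}
            # state of C is empty; record the other incoming channels
            self.recording = set(self.incoming) - {channel}
            self.send_markers()       # marker on each outgoing channel
        else:
            self.recording.discard(channel)           # stop recording C

p = Process(state=42, incoming=["C21", "C31"], send_markers=lambda: None)
p.on_marker("C21")                 # first marker: snapshot taken
p.on_message("C31", "a")           # in-flight message recorded for C31
p.on_message("C31", "MARKER")      # marker closes C31
print(p.recorded_state, p.channel_state)   # 42 {'C21': [], 'C31': ['a']}
```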
Snapshot Example
[Figure: space-time diagram of P1, P2, P3 exchanging markers (M) and messages a, b during the snapshot]
1. P1 initiates the snapshot: records its state (S1); sends Markers to P2 & P3; turns on recording for channels C21 and C31
2. P2 receives Marker over C12, records its state (S2), sets state(C12) = {}, sends Marker to P1 & P3; turns on recording for channel C32
3. P1 receives Marker over C21, sets state(C21) = {a}
4. P3 receives Marker over C13, records its state (S3), sets state(C13) = {}, sends Marker to P1 & P2; turns on recording for channel C23
5. P2 receives Marker over C32, sets state(C32) = {b}
6. P3 receives Marker over C23, sets state(C23) = {}
7. P1 receives Marker over C31, sets state(C31) = {}
When a process finishes its local snapshot, it collects its local state (S and C) and sends it to the initiator of the distributed snapshot
The initiator can then analyze the state
This is one algorithm for distributed global snapshots, but it's not particularly efficient for large systems
Checkpointing
We've discussed distributed snapshots
The most recent distributed snapshot in a system is also called the recovery line
Independent Checkpointing
It is often difficult to find a recovery line in a system where every process just records its local state every so often - a domino effect or cascading rollback can result:
Coordinated Checkpointing
To solve this problem, systems can implement coordinated checkpointing
We've discussed one algorithm for distributed global snapshots, but it's not particularly efficient for large systems
Another way to do it is to use a two-phase blocking protocol (with a coordinator) to get every process to checkpoint its local state simultaneously
Coordinated Checkpointing
Make sure that processes are synchronized when taking the checkpoint
Two-phase blocking protocol:
1. Coordinator multicasts CHECKPOINT_REQUEST
2. Each process:
   - takes a local checkpoint
   - delays further sends
   - acknowledges to the coordinator, sending its state
3. Coordinator multicasts CHECKPOINT_DONE
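The two-phase blocking protocol can be sketched as follows (a hypothetical simulation in which the multicasts become method calls; all class and method names are illustrative):

```python
def coordinated_checkpoint(processes):
    """Coordinator side of the two-phase blocking protocol. Each
    process has take_checkpoint() and resume_sends() methods; sends
    stay delayed between the two phases so no message can cross the
    recovery line."""
    states = {}
    for p in processes:                  # phase 1: CHECKPOINT_REQUEST
        states[p.name] = p.take_checkpoint()   # checkpoint + ack with state
    for p in processes:                  # phase 2: CHECKPOINT_DONE
        p.resume_sends()
    return states                        # a consistent recovery line

class Proc:
    def __init__(self, name, state):
        self.name, self.state, self.sending = name, state, True
    def take_checkpoint(self):
        self.sending = False             # delay further sends
        return self.state
    def resume_sends(self):
        self.sending = True

line = coordinated_checkpoint([Proc("P1", "s1"), Proc("P2", "s2")])
print(line)                              # {'P1': 's1', 'P2': 's2'}
```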
Message Logging
Checkpointing is expensive - message logging allows the occurrences between checkpoints to be replayed, so that checkpoints don't need to happen as frequently
We need to choose when to log messages
Message-logging schemes can be characterized as pessimistic or optimistic according to how they deal with orphan processes
An orphan process is one that survives the crash of another process but has an inconsistent state after the other process recovers
We assume that each message m has a header containing all the information necessary to retransmit m (sender, receiver, sequence number, etc.)
A message is called stable if it can no longer be lost - a stable message can be used for recovery by replaying its transmission
Each message m leads to a set of dependent processes DEP(m), to which either m or a message causally dependent on m has been delivered
The set COPY(m) consists of the processes that have a copy of m, but not in their local stable storage any process in COPY(m) could deliver a copy of m on request
Process Q is an orphan process if there is a nonstable message m, such that Q is contained in DEP(m), and every process in COPY(m) has crashed
To avoid orphan processes, we need to ensure that if all processes in COPY(m) crash, no processes remain in DEP(m)
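The orphan condition can be checked directly from the DEP and COPY sets (a minimal sketch; the set-valued maps are hypothetical inputs):

```python
def orphans(dep, copy, crashed):
    """Q is an orphan if some nonstable message m has Q in DEP(m)
    while every process in COPY(m) has crashed. `dep` and `copy` map
    each nonstable message m to the sets DEP(m) and COPY(m)."""
    result = set()
    for m in dep:
        if copy[m] and copy[m] <= crashed:     # everyone holding m crashed
            result |= dep[m] - crashed         # survivors that depend on m
    return result

# m1's only copy was on P1, which crashed; Q delivered m1, so Q is
# orphaned - m1 can never be replayed for it.
print(orphans(dep={"m1": {"P1", "Q"}},
              copy={"m1": {"P1"}},
              crashed={"P1"}))               # {'Q'}
```

The pessimistic and optimistic schemes below differ only in whether this situation is prevented up front or repaired after the crash.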
Pessimistic Logging
For each nonstable message m, ensure that at most one process P is dependent on m
The worst that can happen is that P crashes without m ever having been logged
No other process can have become dependent on m, because m was nonstable, so this leaves no orphans
Optimistic Logging
The work is done after a crash occurs, not before If, for some m, each process in COPY(m) has crashed, then any orphan process in DEP(m) gets rolled back to a state in which it no longer belongs in DEP(m)
Dependencies need to be explicitly tracked, which makes this difficult to implement - as a result, pessimistic approaches are preferred in real-world implementations