Fault Tolerance and Recovery
Introduction
Recovery in computer systems refers to restoring a system to its normal
operational state.
Processes are given system resources (e.g., memory and files), and they
may have locked shared resources.
After the failure of a process, the resources must be reclaimed so that they
can be given to the other processes.
If a failed process has modified the database, then all the modifications made
to the database by the failed process must be undone.
On the other hand, if a process has executed for some time before failing, it would be
preferable to restart the process from the point of its failure and resume its
execution.
Introduction (Cont.)
Enhanced performance:
Concurrent execution of many processes that cooperate in performing a
task.
If one process fails, the entire task may have to be recovered: the
effects of the failed process's interactions with the other
processes must be undone.
Increased availability:
Achieved through replication of, e.g., data, processes, and hardware components.
If a site fails, the copies of data stored at that site may miss updates and thus
be inconsistent with the rest of the system when the site becomes
operational again.
Basic Concepts
System: a set of software and hardware components to provide a service.
An error is a part of a system state which differs from its intended value.
An error is a manifestation of a fault in a system, which could lead to system
failure.
Failure of a system - deviation from its desired behaviour.
Example
if (Balance < 1000)    // Fault: the '=' is missing; the test should be Balance <= 1000
{
    return false;
}
else Withdraw();
// Error: when the input balance is exactly 1000, the system enters an incorrect state,
// which then leads to a system/software failure.
Classification of Failures
Process Failure:
The computation results in an incorrect outcome, the process causes
the system state to deviate from specifications, the process may fail to
progress
A fail-silent fault is one where the faulty unit stops functioning and produces
no ill output (it produces no output or produces output to indicate failure).
A Byzantine fault is one where the faulty unit continues to run but produces
incorrect results.
Intermittent faults: these are the most annoying of component faults. This
fault is characterized by a fault occurring, then vanishing again, then
occurring.
An example of this kind of fault is a loose connection.
Permanent faults: this fault is persistent: it continues to exist until the faulty
component is repaired or replaced.
Examples of this fault are disk head crashes, software bugs, and burnt-out
hardware.
Fault Tolerance
Masking:
System always behaves as per specifications even in the
presence of faults.
Non-masking:
System may violate specifications in the presence of faults.
It should at least behave in a well-defined manner.
Example 1. Clocks lose synchronization, but recover soon
thereafter.
Example 2. A transaction crashes, but eventually recovers.
Fault Tolerance
A fault tolerant system should specify
Class of faults tolerated (Fault Model)
Failure Detection
Heartbeat messages with timeouts
Reliable Storage
RAID, Network Storage
Recovery after fault repair
Recovery - Restoring the system to its normal operational state after a fault
If a cooperating process fails, the effects on other processes due to the interactions of the
failed process need to be undone.
In a replicated system, if a node having a replica fails, its data might be stale when it comes
back and that needs to be made consistent.
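The "heartbeat messages with timeouts" item above can be illustrated with a minimal sketch. The class name `HeartbeatDetector`, its methods, and the 3-second timeout are hypothetical choices made for this illustration; timestamps are passed in explicitly so the example does not depend on a real clock.

```python
class HeartbeatDetector:
    """Suspect a node once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        # Record the arrival time of a heartbeat message from `node`.
        self.last_seen[node] = now

    def suspected(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("A", now=0.0)
detector.heartbeat("B", now=0.0)
detector.heartbeat("A", now=2.5)
print(detector.suspected(now=4.0))  # B has been silent for longer than the timeout
```

A timeout can only suspect a failure: a slow node or a delayed message looks the same as a crashed node, so the timeout value trades detection speed against false suspicions.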
Error Recovery
Forward-Error Recovery
Continue to go forward in the presence of failures
Use redundancy to mask the effect of errors, e.g. Error correcting codes
Less overhead
Backward-Error Recovery
If it is not possible to remove all errors in the system state, then the system state can be
restored to a previous error-free state
More general
Performance Penalty
No guarantee that the faults will not occur again
Some system components might be unrecoverable,
e.g., cash dispensed from an ATM.
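As an illustration of forward-error recovery through redundancy, the sketch below uses a triple-repetition code, one of the simplest error-correcting codes. It is chosen here only for brevity and is not a scheme named in the text; the function names `encode` and `decode` are hypothetical.

```python
def encode(bits):
    # Triple-repetition code: transmit every bit three times.
    return [b for b in bits for _ in range(3)]


def decode(coded):
    # A majority vote over each group of three masks any single flipped bit.
    return [1 if sum(coded[i:i + 3]) >= 2 else 0 for i in range(0, len(coded), 3)]


message = [1, 0, 1, 1]
received = encode(message)
received[4] ^= 1  # a fault flips one transmitted bit
recovered = decode(received)
print(recovered == message)  # the error is masked without any rollback
```

The receiver keeps going forward: no earlier state is restored, but the scheme only masks the class of errors it was designed for (here, at most one flipped bit per group of three).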
Backward Error Recovery: Centralized Recovery System
Recovery points: system states to which the system can be restored.
All the active processes and the modified data need to be restored to a
proper state.
Two approaches
Operation-based approach
State-based approach
Data from main memory is flushed to secondary storage using a paging
scheme.
Operation-based Approach
A transaction-based environment is assumed, where transactions update a
database.
All changes to the system state are recorded in a log kept in stable storage.
A commit action indicates that the transaction updating an object has
successfully completed; hence, the changes to the database should be made
permanent.
Every update to an object is recorded in the log with enough information to
completely undo and redo the operation.
The information recorded includes the object name, the old state, and the new state.
Before the updates are committed, the redo and undo logs are recorded.
Redo operations might be required if the updated objects were in memory and had not been
flushed to secondary storage.
Writing a log record on every update operation is expensive in terms of the storage required
and the CPU overhead incurred, especially if failures are rare.
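The logging scheme described above (object name, old state, new state) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the class `LoggedStore`, its method names, and the use of a Python list as a stand-in for stable storage are all hypothetical, and an old state of `None` is taken to mean "the object did not exist".

```python
class LoggedStore:
    def __init__(self):
        self.data = {}  # the "database" objects
        self.log = []   # stand-in for the log kept in stable storage

    def update(self, txn, name, new):
        # Write the log record (object name, old state, new state) before updating.
        self.log.append((txn, name, self.data.get(name), new))
        self.data[name] = new

    def undo(self, txn):
        # Undo a failed transaction by restoring old states, newest record first.
        for t, name, old, _new in reversed(self.log):
            if t == txn:
                if old is None:
                    self.data.pop(name, None)  # the object did not exist before
                else:
                    self.data[name] = old

    def redo(self, txn):
        # Reapply a transaction's new states, oldest record first.
        for t, name, _old, new in self.log:
            if t == txn:
                self.data[name] = new


store = LoggedStore()
store.update("T1", "balance", 100)
store.update("T2", "balance", 70)
store.undo("T2")    # T2 failed: its modification is undone
print(store.data)
```

Undo walks the log backwards so that the oldest recorded state wins; redo walks it forwards so that the newest state wins.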
State-based Approach
The complete state of a process is saved in stable storage when a recovery
point is established; this is called checkpointing.
Recovering a process involves reinstating its saved state and resuming the
execution of the process from that state; this is called process rollback.
It is desirable to roll back to the most recent state, so many checkpoints are taken
over the execution of a process.
If one process fails and resumes execution from a recovery point, the effects
it has caused at other processes after establishing the recovery point have to
be undone.
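[Figure: timelines of processes X, Y, and Z with checkpoints x1, x2, x3, y1, y2, z1, z2, a message m, and a failure; the horizontal axis is time.]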
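Checkpointing and process rollback in the state-based approach can be sketched for a single process as below. The class `CheckpointedProcess` and the dictionary used as process state are hypothetical, and a Python list stands in for stable storage. The sketch does not undo effects at other processes.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self.checkpoints = []  # stand-in for stable storage

    def checkpoint(self):
        # Save a deep copy so later changes cannot alter the recovery point.
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Reinstate the most recent saved state; execution resumes from there.
        self.state = copy.deepcopy(self.checkpoints[-1])


p = CheckpointedProcess({"step": 0, "items": []})
p.state["step"] = 1
p.checkpoint()                           # recovery point established
p.state["step"] = 2
p.state["items"].append("partial work")
p.rollback()                             # failure: resume from the recovery point
print(p.state)
```

The deep copy matters: a shallow copy would share the `items` list with the live state, so the "saved" checkpoint would silently change as the process continued.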
Lost Messages
Messages whose send is not undone but receive is undone due to rollback are called lost messages.
Such messages occur when a process rolls back to a checkpoint taken before the reception of the
message, while the sender does not roll back beyond the send operation of the message.
[Figure: processes X and Y with checkpoints x1 and y1, and a failure.]
Livelock
Livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing
the system from making progress.
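[Figure: livelock example with processes X and Y and checkpoints x1 and y1. Panel (a): messages m1 and n1. Panel (b), 2nd rollback: messages m2, n1, and n2.]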
Consistent Set of Checkpoints
Operation-based or state-based recovery techniques are not adequate in
distributed systems.
The local checkpoints (one from each process) together form a global checkpoint.
If no information flow takes place between any pair of processes in the set
of local checkpoints, the set is a strongly consistent set of checkpoints.
Consistent Vs Strongly Consistent Set of Checkpoints
[Figure: checkpoints x1, x2, y1, y2, z1, z2, a message m, and a failure.]
Assumption - Taking a checkpoint and message send and receive are atomic actions
If every process takes a checkpoint after sending every message, the set of most recent
checkpoints is always consistent (it may not be strongly consistent)
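The difference between a consistent and a strongly consistent set of checkpoints can be sketched by comparing what each local checkpoint has recorded. The sketch assumes the usual definition of consistency, namely that no checkpoint records the receipt of a message whose send is not recorded in any checkpoint of the set. The function `classify` and the checkpoint layout are hypothetical.

```python
def classify(checkpoints):
    """checkpoints: process name -> {"sent": set of message ids,
    "received": set of message ids}, as recorded at that local checkpoint."""
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    orphans = received - sent     # receive recorded, send not recorded
    in_transit = sent - received  # send recorded, receive not recorded
    if orphans:
        return "inconsistent"
    return "consistent" if in_transit else "strongly consistent"


# X checkpointed after sending m; Y checkpointed before receiving it.
after_send = {"X": {"sent": {"m"}, "received": set()},
              "Y": {"sent": set(), "received": set()}}
# Y's checkpoint records m as received, but X's checkpoint precedes the send.
before_send = {"X": {"sent": set(), "received": set()},
               "Y": {"sent": set(), "received": {"m"}}}
print(classify(after_send), "/", classify(before_send))
```

The first case matches the remark above: checkpointing after every send yields a consistent set, but a message may still be in transit across the checkpoint line, so the set need not be strongly consistent.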
Assumptions:
Channels are FIFO.
Communication failures do not partition the network.
A single process invokes the algorithm.
No site fails during the execution of the algorithm.
No computation messages are exchanged during the execution of the algorithm.
Second Phase:
Pi informs all the processes of the decision it reached at the end of the
first phase.
Correctness:
[Figure: tentative checkpoint example with process X, checkpoints x1, x2, y1, y2, z1, z2, a message m, and a "take a tentative ckpt" request message.]
The Rollback Recovery Algorithm
First Phase:
An initiating process Pi checks to see if all the processes are
willing to restart from their previous checkpoints
If Pi learns that all the processes are willing to restart from their
previous checkpoints, Pi decides that all the processes should
restart; otherwise Pi decides that all the processes should
continue their normal activities
Second Phase:
Pi propagates its decision to all the processes
On receiving Pi ’s decision, a process will act accordingly
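The two phases above amount to a unanimous vote followed by a broadcast of the outcome. A minimal single-machine sketch is given below; message passing is replaced by a dictionary of replies, and the function name `rollback_recovery` and the strings "restart" and "continue" are hypothetical labels.

```python
def rollback_recovery(willing):
    """willing: process name -> True if that process is willing to restart
    from its previous checkpoint."""
    # First phase: the initiator Pi collects a reply from every process.
    decision = "restart" if all(willing.values()) else "continue"
    # Second phase: Pi propagates the decision; every process acts on it.
    return {name: decision for name in willing}


print(rollback_recovery({"X": True, "Y": True, "Z": True}))
print(rollback_recovery({"X": True, "Y": False, "Z": True}))
```

A single unwilling process is enough to make every process continue its normal activities, which is what keeps the outcome the same at all processes.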
The Rollback Recovery Algorithm (Cont.)
Correctness
The recovery algorithm requires that no process sends
computation messages while it is waiting for Pi's decision.
Disadvantages
Additional messages are exchanged by the checkpoint algorithm when it takes
each checkpoint
A recovery algorithm has to search for the most recent consistent set of
checkpoints before it can initiate recovery
The messages that were received after establishing a recovery point can be
processed again in the event of a rollback to the recovery point
Asynchronous Checkpointing and Recovery
Two ways of message logging
Pessimistic
An incoming message is logged before it is processed
Slows down underlying computation even when there are no failures
Optimistic
Processes continue to perform the computation, and the messages received
are stored in volatile storage and logged to stable storage at certain intervals.
In case of a system failure, an incoming message may be lost because it may not
have been logged yet.
In the event of a rollback, the amount of computation redone during recovery
is likely to be greater.
It does not slow down the underlying computation during normal processing
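The trade-off between the two logging styles can be sketched with a toy receiver. The class `Receiver`, its methods, and the two lists standing in for stable and volatile storage are hypothetical; real systems would write to disk and process the message as well.

```python
class Receiver:
    def __init__(self, pessimistic):
        self.pessimistic = pessimistic
        self.stable_log = []    # survives a crash
        self.volatile_log = []  # lost on a crash

    def receive(self, msg):
        if self.pessimistic:
            self.stable_log.append(msg)    # log first, then process
        else:
            self.volatile_log.append(msg)  # process now, log later

    def flush(self):
        # Optimistic logging moves buffered messages to stable storage at intervals.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def crash(self):
        # A failure wipes volatile storage; return what can still be replayed.
        self.volatile_log.clear()
        return list(self.stable_log)


pess, opt = Receiver(pessimistic=True), Receiver(pessimistic=False)
for r in (pess, opt):
    r.receive("m1")
opt.flush()
for r in (pess, opt):
    r.receive("m2")
replay_pess, replay_opt = pess.crash(), opt.crash()
print(replay_pess, replay_opt)  # the optimistic receiver has lost m2
```

The pessimistic receiver pays a stable-storage write on every message; the optimistic one pays only at flush time but must redo more work after a failure.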
Asynchronous Checkpointing and Recovery
(Juang and Venkatesan)
Assumptions
Reliable and FIFO communication channels
The events at each process are identified by unique monotonically increasing numbers
Two types of logs are assumed - volatile log and stable log. Volatile log contents are
periodically flushed onto the stable log.
Asynchronous Checkpointing
Each process, after an event, records a triplet {s, m, msgs_sent} in volatile
storage, where s is the state of the process before the event, m is the
message whose arrival caused the event, and msgs_sent is the set of messages that were sent by the
process during the event.
A local checkpoint consists of the record of an event and is taken without any
synchronization with other processes.
Notations
RCVD i←j (CKPTi): the number of messages received by Pi from Pj, as per the information stored in the
checkpoint CKPTi
SENT i→j (CKPTi): the number of messages sent by Pi to Pj, as per the information stored in the
checkpoint CKPTi
Asynchronous Checkpointing
Recovery is based on finding a consistent set of checkpoints to which the system can be
restored
Each process keeps track of S, the number of messages it has sent to other processes, and R, the number
of messages it has received from other processes.
If R > S, one or more messages are orphans, and the process has to roll back to a state
where S = R.
The algorithm assumes that a process, upon restarting, will broadcast a message that it had
failed. This can be done with O(|E|) messages, where |E| is the total number of communication links.
The algorithm at a process is initiated when it restarts after a failure or when it learns about
another process’s failure
Event Driven Computation
[Figure: event-driven computation among processes X, Y, and Z, showing events ez0, ez1, ez2 and checkpoint z1.]
In the example, Y restarts at checkpoint y1, and ey2 is the latest
event logged.
CKPTx = ex3
CKPTy = ey2
CKPTz = ez2
Asynchronous Checkpointing Example
First Iteration
Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z
X sends ROLLBACK(X,2) to Y and ROLLBACK(X,0) to Z
Z sends ROLLBACK(Z,0) to X and ROLLBACK(Z,1) to Y
Since RCVD X←Y (CKPTX) = 3 > 2, X will set CKPTX to ex2
Since RCVD Z←Y (CKPTZ) = 2 > 1, Z will set CKPTZ to ez1
Y need not rollback any further
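The comparison made by X and Z in this iteration can be sketched as a small function. On receiving ROLLBACK(j, c), a process whose checkpoint records more than c messages received from Pj moves its checkpoint back to an earlier event. The function name `apply_rollback` and the per-event receive counts below are assumptions made for illustration; only the counts at the current checkpoints (3 for X, 2 for Z) come from the example.

```python
def apply_rollback(rcvd_from_j, ckpt, c):
    """rcvd_from_j[k]: number of messages received from Pj up to and including
    event k. ckpt: index of the current checkpoint event. c: the count carried
    by ROLLBACK(j, c). Returns the index of the new checkpoint event."""
    while ckpt > 0 and rcvd_from_j[ckpt] > c:
        ckpt -= 1  # undo events until no orphan message from Pj remains
    return ckpt


# Assumed counts of messages received from Y at events ex0..ex3 and ez0..ez2.
x_rcvd_from_y = [0, 1, 2, 3]
z_rcvd_from_y = [0, 1, 2]
new_x = apply_rollback(x_rcvd_from_y, ckpt=3, c=2)  # X got ROLLBACK(Y, 2)
new_z = apply_rollback(z_rcvd_from_y, ckpt=2, c=1)  # Z got ROLLBACK(Y, 1)
print(new_x, new_z)  # indices of ex2 and ez1
```

With these assumed counts the sketch reproduces the outcome stated above: CKPTX moves to ex2 and CKPTZ moves to ez1. A process whose count does not exceed c keeps its checkpoint unchanged.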
Second Iteration
Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y, 1) to Z
Z sends ROLLBACK(Z, 0) to X and ROLLBACK(Z, 1) to Y
X sends ROLLBACK(X, 2) to Y and ROLLBACK(X, 0) to Z