Fault Tolerance and Recovery

Introduction
 Recovery in computer systems refers to restoring a system to its normal
operational state.

 Processes are allocated system resources (e.g., memory and files) and
may hold locks on shared resources.

 After the failure of a process, the resources must be reclaimed so that they
can be given to the other processes.

 If a failed process has modified the database, then all the modifications made
to the database by the failed process must be undone.

 On the other hand, if a process has executed for some time before failing, it is
preferable to restart the process from the point of its failure and resume its
execution.
Introduction Cont..
 Enhanced performance:
 Concurrent execution of many processes that cooperate in performing a
task.

 If one process fails, the entire task may have to be recovered, and the
effects of the failed process's interactions with the other processes
must be undone.

 Increased availability:
 Through replication of, e.g., data, processes, and hardware components

 If a site fails, the copies of data stored at that site may miss updates and
thus be inconsistent with the rest of the system when the site becomes
operational again.
Basic Concepts
 System: a set of software and hardware components to provide a service.

 A fault is an anomalous physical condition. The causes of a fault include
design errors (in system specification or implementation), manufacturing
problems, and external disturbances such as harsh environmental conditions.
 Software bug
 Random hardware fault
 Memory bit “stuck”

 An erroneous state is one that could lead to a system failure through a
sequence of valid state transitions.

 An error is the part of the system state that differs from its intended value.
 An error is a manifestation of a fault in a system and could lead to a system
failure.
 A failure of a system is a deviation from its desired behaviour.
Example
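if (Balance < 1000)   // Fault: the '=' is missing (Balance <= 1000 was intended)
{
    return false;
}
else Withdraw();
// Error: when the input balance is exactly 1000, the system enters an
// incorrect state (the withdrawal proceeds).
// Failure: the system/software then deviates from its desired behaviour.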
Classification of Failures
 Process Failure:
 The computation produces an incorrect outcome, the process causes
the system state to deviate from its specification, or the process fails to
make progress

 Examples of errors that cause processes to fail
 Deadlocks
 Protection violations
 Wrong input provided by a user

 A failed process may be restarted from a prior state or aborted.
 A deadlocked process can be restarted from a prior state, where it can
acquire the resources again.
 Wrong input in the initial stages may require the process to be aborted.
System Failure
 A system failure occurs when the processor fails to execute.

 It is caused by software errors and hardware problems, e.g., CPU
failure, main memory failure, bus failure, or power failure.

 The system is stopped and restarted in a correct state.

 The correct state may be some predefined state or a prior state
(checkpoint) saved in stable (non-volatile) storage.
Secondary Storage
 A secondary storage failure is said to have occurred when the
stored data cannot be accessed.

 The failure is usually caused by a parity error or by dust particles on the
medium.

 Recovery can be based on an archived version or a log-structured
version.

 To tolerate such failures, we can use mirrored disk systems
(2 physically independent disks) with independent buses and
controllers.
Communication Medium
 A communication medium failure occurs when a site cannot
communicate with another operational site in the network.

 It is caused by the failure of switching nodes and/or links of the
communication system.

 It may not cause a total shutdown of the communication facilities.

 It may cause message loss or a network partition, in which a subset of sites
is unable to communicate with the sites in another subset.
Failures and Faults
 Software
 physical level = program code
 computational level = values of the program state
 system level = software system running the program

 A bug in a program is a fault.
 The incorrect values that this bug may cause are an error.
 A resulting crash of the operating system is a failure.
Types of Faults
 A fault may be either a fail-silent fault (also known as fail-stop) or a
Byzantine fault.

 A fail-silent fault is one where the faulty unit stops functioning and produces
no bad output (it produces no output, or produces output that indicates failure).
 Examples: disk head crashes, software bugs, and burnt-out hardware

 A Byzantine fault is one where the faulty unit continues to run but produces
incorrect results.
 Byzantine faults are more troublesome to deal with.
 Processes can crash, messages can be lost, etc.; the behaviour can also be
malicious (attacks, software bugs, etc.)
Types of Faults
 Faults can be classified into one of three categories:
 Transient faults: these occur once and then disappear.
 For example, a network message transmission times out but works fine when
attempted a second time.

 Intermittent faults: these are the most annoying of component faults. The
fault occurs, vanishes, and then occurs again.
 An example of this kind of fault is a loose connection.

 Permanent faults: these are persistent; the fault continues to exist until the
faulty component is repaired or replaced.
 Examples of this fault are disk head crashes, software bugs, and burnt-out
hardware.
Fault Tolerance

 The system's ability to deliver the desired services in spite of faults in its
components
 Can be full service (the specified behaviour of the fault-free state)
Ex: A primary-backup server system that tolerates one server failure

 Or a degraded service (deviates from the specified behaviour of the fault-free
state, but in a pre-defined manner)
Ex: A web service with multiple load-balanced servers in the backend that
fails to meet its response time guarantees after one backend server
fails, but still gives service with a slower response time.
Fault Tolerance
 Classification of Faults
 Crash:
 A server/process halts (e.g., due to a hardware failure) but works
correctly until it halts; the halt is irreversible.
 In synchronous systems, crashes can be detected using timeouts.
 In asynchronous systems, crashes are difficult to detect.
 Omission:
 Receive omission: A server fails to receive incoming messages
 Send omission: A server fails to send messages
 Timing:
 A server's response lies outside the specified time interval
 Arbitrary/Byzantine failure:
 A server may produce arbitrary responses at arbitrary times
Types of Tolerance

 Masking:
 The system always behaves as per its specification, even in the
presence of faults

 Non-masking:
 The system may violate its specification in the presence of faults.
 It should at least behave in a well-defined manner
 Example 1: Clocks lose synchronization, but recover soon thereafter.
 Example 2: A transaction crashes, but eventually recovers
Fault Tolerance
 A fault-tolerant system should specify
 The class of faults tolerated (fault model)
 What tolerance is provided for each class

 Needs some redundancy
 Hardware (primary and backup servers)
 k+1 redundancy to tolerate k failures
 Software (triple modular redundancy)
 2k+1 redundancy to tolerate k failures (can mask an error by executing three times
and taking a majority vote)
 Use voting to elect the majority result and isolate the at-fault module
 Time
 Redo operation: repeating tasks several times.
 Information redundancy
 Parity bits, error-detecting and error-correcting codes
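The majority vote used with 2k+1 redundancy can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names `vote` and `run_redundant` and the `good`/`bad` replicas are hypothetical, and the replicas are assumed to return hashable, directly comparable results.

```python
from collections import Counter

def vote(results):
    """Return the majority value and the indices of modules that disagree with it."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        # With 2k+1 modules, a strict majority exists only if at most k are faulty.
        raise RuntimeError("no majority: too many modules disagree")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty

def run_redundant(modules, x):
    """Run every replica on the same input and vote on the results."""
    return vote([m(x) for m in modules])

# Hypothetical replicas: two correct ones and one that returns a wrong value.
good = lambda x: x * x
bad = lambda x: x * x + 1

value, faulty = run_redundant([good, bad, good], 7)
print(value, faulty)
```

With three replicas (k = 1), the single wrong answer is outvoted and the index of the disagreeing module is reported so that it can be isolated.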
Some Building Blocks

 Failure Detection
 Heartbeat messages with timeouts

 Reliable Storage
 RAID, Network Storage
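Heartbeat-based failure detection can be sketched as below. This is an illustrative sketch under stated assumptions: the class `HeartbeatDetector` and its methods are hypothetical names, and timestamps are passed in explicitly instead of being read from a clock so that the example is deterministic.

```python
class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of the most recent heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat message received from `node` at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("A", now=0.0)
detector.heartbeat("B", now=0.0)
detector.heartbeat("A", now=2.0)
detector.heartbeat("A", now=4.0)   # B has been silent since time 0
suspects = detector.suspected(now=5.0)
print(suspects)
```

A suspected node is not necessarily crashed; in an asynchronous system it may only be slow, which is why crash detection is hard there.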
Recovery after fault repair

 Recovery - Restoring the system to its normal operational state after a fault

 Undoing certain operations

 If a cooperating process fails, the effects on other processes due to the interactions of the
failed process need to be undone.

 In a replicated system, if a node having a replica fails, its data might be stale when it comes
back and that needs to be made consistent.
Error Recovery

 Forward-Error Recovery
 Continue to go forward in the presence of failures
 Use redundancy to mask the effect of errors, e.g. Error correcting codes
 Less overhead

 Backward-Error Recovery
 If it is not possible to remove all errors from the system state, the system state can be
restored to a previous error-free state
 More general
 Performance penalty
 No guarantee that the fault will not occur again
 Some system components might be unrecoverable
 e.g., cash already dispensed by an ATM.
Backward Error Recovery: Centralized Recovery System
 Recovery point: a system state to which the system can be restored.

 All the active processes and the modified data need to be restored to a
proper state.

 Two approaches
 Operation-based approach
 State-based approach

 The existence of stable storage (storage that survives system crashes) is
assumed in both approaches.

 Secondary storage is assumed to be archived periodically.

 Data from main memory is flushed to secondary storage using a paging
scheme.
Operation-based Approach
 A transaction based environment is assumed where transactions update a
database.

 All changes to the system state are recorded in a log kept in stable storage.

 It is desirable to be able to commit or undo updates on a per-transaction
basis.

 A commit action indicates that the transaction updating an object has
completed successfully; hence, the changes to the database should be made
permanent.

 If a transaction does not commit, its database updates should be undone.

 If a part of the database is lost due to a storage media error, it should be possible
to reconstruct that part.
Updating-In-Place Scheme

 Every update to an object is recorded in the log with enough information to
completely undo and redo the operation

 The information recorded includes the object name, the old state, and the new state

 A recoverable update is implemented using the following operations
 A do operation, which performs the update and writes a log record
 An undo operation, which, given a log record written by a do operation, undoes the
action specified by the do operation
 A redo operation, which, given a log record written by a do operation, redoes the action
specified by the do operation
 An optional display operation, which displays the log record
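The do, undo, redo and display operations can be sketched as follows. This is a minimal in-memory sketch, not a real database: the dictionary `db`, the list `log`, and the `LogRecord` type are hypothetical stand-ins for the database and the stable-storage log.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogRecord:
    name: str     # object name
    old: object   # state before the update
    new: object   # state after the update

def do(db, log, name, new):
    """Perform the update in place, then write the log record."""
    record = LogRecord(name, db.get(name), new)
    db[name] = new
    # A crash between the update above and the append below would leave an
    # update that has no log record and therefore cannot be undone.
    log.append(record)
    return record

def undo(db, record):
    """Undo the action described by a record written by do."""
    db[record.name] = record.old

def redo(db, record):
    """Redo the action described by a record written by do."""
    db[record.name] = record.new

def display(record):
    """Show the contents of a log record."""
    print(f"{record.name}: {record.old!r} -> {record.new!r}")

db, log = {"balance": 100}, []
rec = do(db, log, "balance", 70)
after_do = db["balance"]
undo(db, rec)
after_undo = db["balance"]
redo(db, rec)
after_redo = db["balance"]
display(rec)
```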
Updating-In-Place Scheme

 In case of a failure, the changes made by a transaction can be undone by
using the undo operation.

 If a portion of the database has to be reconstructed, this can be done by
performing the redo operations on a previously archived copy of that portion of the
database.

 Problem: a do operation cannot be undone if the system crashes after the
update operation but before the log record is stored.
Write-Ahead-Log Scheme
 A recoverable update is implemented by the following operations

 Update an object only after the undo log is recorded

 Before committing the updates, redo and undo logs are recorded

 On restarting a system after a failure:
 Undo operations might be required to undo the changes made by the transactions that
were in progress at the time of the failure.

 Redo operations might be required if the updated objects were in memory and had not
been flushed to secondary storage.

 Writing a log record on every update operation is expensive in terms of the storage required
and the CPU overhead incurred, especially if failures are rare.
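The write-ahead rule and the restart procedure can be sketched as follows. This is a simplified sketch under stated assumptions: the log is a Python list standing in for stable storage, the function names are hypothetical, and transactions are assumed not to update the same object concurrently.

```python
def wal_update(db, log, txn, name, new):
    """Write the undo/redo information to the log BEFORE updating the object."""
    log.append(("update", txn, name, db.get(name), new))
    db[name] = new

def wal_commit(log, txn):
    """Record that the transaction completed successfully."""
    log.append(("commit", txn))

def recover(disk_db, log):
    """Rebuild a correct database state from the on-disk state and the log."""
    db = dict(disk_db)
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Redo committed updates in log order: they may not have reached the disk.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            db[rec[2]] = rec[4]
    # Undo uncommitted updates in reverse order: they may have reached the disk.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            db[rec[2]] = rec[3]
    return db

log = []
memory = {"a": 10, "b": 20}
wal_update(memory, log, "T1", "a", 15)
wal_commit(log, "T1")
wal_update(memory, log, "T2", "b", 99)   # T2 never commits

# Hypothetical disk contents at the crash: T1's update was still only in
# memory, while T2's uncommitted update had already been flushed.
disk_after_crash = {"a": 10, "b": 99}
recovered = recover(disk_after_crash, log)
print(recovered)
```

Because every update is logged before it is applied, recovery can always find the old value needed for an undo, which removes the problem of the plain updating-in-place scheme.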
State-based Approach
 The complete state of a process is saved in stable storage when a recovery
point is established; this is called checkpointing.

 Recovering a process involves reinstating its saved state and resuming the
execution of the process from that state; this is called process rollback.

 It is desirable to roll back to the most recent state, so many checkpoints are taken
over the execution of a process

 The previous checkpoints can be discarded when a new checkpoint is taken
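Checkpointing and process rollback can be sketched as below. This is an illustrative sketch: the class `CheckpointedProcess` is hypothetical, and a deep copy held in memory stands in for a checkpoint written to stable storage.

```python
import copy

class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self._stable = copy.deepcopy(state)  # stands in for stable storage

    def checkpoint(self):
        """Save the complete state; the previous checkpoint is discarded."""
        self._stable = copy.deepcopy(self.state)

    def rollback(self):
        """Reinstate the most recently saved state."""
        self.state = copy.deepcopy(self._stable)

p = CheckpointedProcess({"step": 0, "total": 0})
for step in range(1, 6):
    p.state["step"] = step
    p.state["total"] += step
    if step == 3:
        p.checkpoint()          # recovery point established after step 3

p.rollback()                    # simulate a failure after step 5
print(p.state)
```

The work done in steps 4 and 5 is lost and must be repeated, which is why checkpoints are taken frequently enough to keep the rollback distance short.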


Recovery in Distributed Systems

 In a distributed system, several processes cooperate by exchanging
messages to accomplish a task

 If one process fails and resumes execution from a recovery point, the effects
it has caused at other processes after establishing the recovery point have to
be undone

 An active process might also be required to rollback to an earlier state

 All cooperating processes need to establish a recovery point


Orphan Messages and Domino Effect
 A message whose receive is recorded but whose send is not recorded is called an orphan message.
 For example, a rollback might undo the send of such a message while leaving the receive event
intact at the receiving process.
 Orphan messages do not arise if processes roll back to a consistent global state.

[Figure: timelines of processes X, Y and Z with checkpoints x1, x2, x3, y1, y2, z1 and z2, a message m, and a failure.]
Lost Messages
 Messages whose send is not undone but receive is undone due to rollback are called lost messages.

 Lost messages occur when the receiver rolls back to a checkpoint taken before it received the
message, while the sender does not roll back beyond the send operation of the message.

[Figure: timelines of processes X and Y with checkpoints x1 and y1, and a failure.]
Livelock
 Livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing
the system from making progress.
[Figure: livelock between processes X and Y. (a) Checkpoints x1 and y1 with messages n1 and m1. (b) Second rollback, with messages n1, n2 and m2.]
Consistent Set of Checkpoints

 The operation-based and state-based recovery techniques alone are not adequate in
distributed systems

 Coordination among processes is required

 Each process takes local checkpoints

 A set of local checkpoints, one from each process, forms a global checkpoint

 For a global checkpoint to be consistent, it must not contain any orphan
message.
Strongly Consistent Set of Checkpoints
 In a consistent set of checkpoints, no local checkpoint records a message receive event
whose corresponding send event is not recorded in the local
checkpoint of the sending process

 If a consistent set of checkpoints also has no lost message, it is called a
strongly consistent set of checkpoints

 If no information flow takes place between any pair of processes during the interval
spanned by the local checkpoints, the set is strongly consistent
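The two definitions can be checked mechanically, as in the sketch below. The representation is an assumption made for illustration: each checkpoint is the time at which it was taken, each message is a hypothetical tuple `(sender, send_time, receiver, receive_time)`, and an event counts as recorded in a checkpoint if it happened at or before the checkpoint time.

```python
def classify(checkpoint, messages):
    """Classify a set of local checkpoints (one per process)."""
    orphan = lost = False
    for sender, t_send, receiver, t_recv in messages:
        sent = t_send <= checkpoint[sender]          # send recorded?
        received = t_recv <= checkpoint[receiver]    # receive recorded?
        if received and not sent:
            orphan = True    # receive recorded, send not recorded
        if sent and not received:
            lost = True      # send recorded, receive not recorded
    if orphan:
        return "inconsistent"
    return "consistent" if lost else "strongly consistent"

# One message from X (sent at time 2) to Y (received at time 4).
messages = [("X", 2, "Y", 4)]

r_orphan = classify({"X": 1, "Y": 5}, messages)   # X before send, Y after receive
r_lost = classify({"X": 3, "Y": 3}, messages)     # X after send, Y before receive
r_both = classify({"X": 3, "Y": 5}, messages)     # both events recorded
r_none = classify({"X": 1, "Y": 3}, messages)     # neither event recorded
print(r_orphan, r_lost, r_both, r_none, sep=" | ")
```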
Consistent Vs Strongly Consistent Set of Checkpoints

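[Figure: timelines of processes X, Y and Z with checkpoints x1, x2, y1, y2, z1 and z2, a message m, and a failure; one set of checkpoints is marked as consistent and another as strongly consistent.]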
A Simple Method for Taking a Consistent Set of Checkpoints

 Assumption - Taking a checkpoint and message send and receive are atomic actions

 If every process takes a checkpoint after sending every message, the set of most recent
checkpoints is always consistent (it may not be strongly consistent)

 So a rollback to the latest checkpoints will not result in orphan messages

 Disadvantage: high checkpointing overhead


Synchronous Checkpointing and Recovery (Koo and Toueg)
 All processes coordinate their local checkpointing actions such that the set
of all recent checkpoints in the system is guaranteed to be consistent.

 Assumptions:
 Channels are FIFO.
 Communication failures do not partition the network.
 A single process invokes the algorithm.
 No site fails during the execution of the algorithm.
 No computation messages are exchanged during the execution of the algorithm.

 Two kinds of checkpoints: tentative and permanent.

 Processes roll back only to their permanent checkpoints.


The Checkpoint Algorithm
 First Phase:
 An initiating process Pi takes a tentative checkpoint and requests all
the processes to take tentative checkpoints.

 Each process informs Pi whether it succeeded in taking a tentative
checkpoint.

 If Pi learns that all the processes have successfully taken tentative
checkpoints, it decides that all the tentative checkpoints should be
made permanent; otherwise Pi decides that all the tentative
checkpoints should be discarded.

 Second Phase:
 Pi informs all the processes of the decision it reached at the end of the
first phase.

 A receiving process acts accordingly.
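The two phases can be sketched as a single-threaded simulation. This is a simplified sketch, not the full algorithm from the paper: the `Process` class and the `can_checkpoint` flag are hypothetical, message passing is replaced by direct method calls, and the rule that processes send no messages between the tentative checkpoint and the decision is not modelled.

```python
class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.can_checkpoint = True   # hypothetical switch to simulate a refusal
        self.tentative = None
        self.permanent = 0

    def take_tentative(self):
        """Phase 1: try to take a tentative checkpoint and report success."""
        if self.can_checkpoint:
            self.tentative = self.state
            return True
        return False

    def decide(self, commit):
        """Phase 2: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None

def coordinated_checkpoint(initiator, others):
    # First phase: the initiator checkpoints and collects every reply.
    replies = [initiator.take_tentative()] + [p.take_tentative() for p in others]
    commit = all(replies)
    # Second phase: the decision is sent to all processes.
    for p in [initiator, *others]:
        p.decide(commit)
    return commit

x, y, z = Process("X"), Process("Y"), Process("Z")
x.state, y.state, z.state = 5, 7, 9
first = coordinated_checkpoint(x, [y, z])     # everyone succeeds

x.state, y.state, z.state = 6, 8, 10
z.can_checkpoint = False                      # Z cannot checkpoint this time
second = coordinated_checkpoint(x, [y, z])
print(first, second, [p.permanent for p in (x, y, z)])
```

After the failed second round, every process still holds the permanent checkpoint from the first round, which illustrates the all-or-none property.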


The Checkpoint Algorithm Cont..

 Correctness:
 Either all or none of the processes take a permanent checkpoint
 There are no orphan messages, since no process sends messages
after taking a tentative checkpoint until it receives the initiating
process's decision
Problem With the Algorithm

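[Figure: timelines of processes X, Y and Z with checkpoints x1, x2, y1, y2, z1 and z2 and a message m; X takes a tentative checkpoint and "take a tentative checkpoint" request messages are sent to the other processes. Caption: checkpoints taken unnecessarily.]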


The Rollback Recovery Algorithm
 Assumptions:
 A single process invokes the algorithm

 The checkpointing and rollback recovery algorithms are not invoked
concurrently

 First Phase:
 An initiating process Pi checks to see if all the processes are
willing to restart from their previous checkpoints

 If Pi learns that all the processes are willing to restart from their
previous checkpoints, Pi decides that all the processes should
restart; otherwise Pi decides that all the processes should
continue their normal activities

 Second Phase:
 Pi propagates its decision to all the processes
 On receiving Pi's decision, a process acts accordingly
The Rollback Recovery Algorithm Cont..

 Correctness
 The recovery algorithm requires that no process sends
computation messages while it is waiting for Pi's decision
 All processes either restart from their previous checkpoints or continue
with their normal activities
 If the processes restart, they resume execution in a consistent
state, because the checkpoint algorithm takes a consistent set of checkpoints
Discussion
 Advantages
 Synchronous checkpointing simplifies recovery

 Previous checkpoints can be discarded

 Disadvantages
 Additional messages are exchanged by the checkpoint algorithm each time it takes
a checkpoint

 Synchronization delays are introduced during normal operation (no computation
messages can be sent while the checkpoint algorithm is running)

 If failures rarely occur between successive checkpoints, the additional messages,
delays, and processing overhead place an unnecessary burden on the system
Asynchronous Checkpointing and Recovery
 Each process takes a checkpoint independently

 There is no guarantee that a set of local checkpoints will be consistent

 A recovery algorithm has to search for the most recent consistent set of
checkpoints before it can initiate recovery

 To minimize the computation undone during a rollback, all incoming messages
are logged

 The messages that were received after a recovery point was established can be
processed again in the event of a rollback to that recovery point
Asynchronous Checkpointing and Recovery
 Two ways of message logging

 Pessimistic
 An incoming message is logged before it is processed
 Slows down underlying computation even when there are no failures

 Optimistic
 Processes continue to perform the computation; received messages
are kept in volatile storage and logged to stable storage at certain intervals.
 In case of a system failure, an incoming message may be lost because it may not
yet have been logged
 In the event of a rollback, the amount of computation redone during recovery
is likely to be greater
 It does not slow down the underlying computation during normal processing
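The trade-off between the two logging styles can be sketched as follows. This is an illustrative model only: the `Receiver` class, the `flush_every` parameter, and the `stable_writes` counter are hypothetical, and Python lists stand in for the volatile and stable logs.

```python
class Receiver:
    def __init__(self, pessimistic, flush_every=3):
        self.pessimistic = pessimistic
        self.flush_every = flush_every
        self.stable_log = []      # survives a crash
        self.volatile_log = []    # lost in a crash
        self.stable_writes = 0    # number of (slow) writes to stable storage

    def receive(self, msg):
        if self.pessimistic:
            # Log to stable storage before the message is processed.
            self.stable_log.append(msg)
            self.stable_writes += 1
        else:
            # Keep the message in volatile storage and flush in batches.
            self.volatile_log.append(msg)
            if len(self.volatile_log) >= self.flush_every:
                self.flush()
        # ... the message would be processed here ...

    def flush(self):
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()
        self.stable_writes += 1

    def crash(self):
        """Lose volatile contents; return the messages available for replay."""
        self.volatile_log.clear()
        return list(self.stable_log)

msgs = ["m1", "m2", "m3", "m4", "m5"]
pess, opt = Receiver(pessimistic=True), Receiver(pessimistic=False)
for m in msgs:
    pess.receive(m)
    opt.receive(m)

pess_replay, opt_replay = pess.crash(), opt.crash()
print(pess.stable_writes, pess_replay)
print(opt.stable_writes, opt_replay)
```

In this run the pessimistic receiver pays one stable write per message and can replay everything, while the optimistic receiver performs a single stable write but loses the two messages that had not yet been flushed.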
Asynchronous Checkpointing and Recovery
(Juang and Venkatesan)
 Assumptions
 Reliable and FIFO communication channels

 Communication channels have infinite buffers

 The underlying computation is assumed to be event-driven: a process P waits until
a message m is received, processes m, changes its state, and sends zero or more
messages to some of its neighbours

 The events at each process are identified by unique, monotonically increasing numbers

 Two types of logs are assumed: a volatile log and a stable log. The contents of the
volatile log are periodically flushed to the stable log.
Asynchronous Checkpointing
 After each event, a process records a triplet {s, m, msgs_sent} in volatile
storage, where s is the state of the process before the event, m is the
message whose arrival caused the event, and msgs_sent is the set of messages
that the process sent while handling the event.

 A local checkpoint consists of the record of an event and is taken without any
synchronization with other processes.

 Notations
 RCVD i←j (CKPTi): the number of messages received by Pi from Pj, as per the information
stored in the checkpoint CKPTi

 SENT i→j (CKPTi): the number of messages sent by Pi to Pj, as per the information
stored in the checkpoint CKPTi
Asynchronous Checkpointing
 Recovery is based on finding a consistent set of checkpoints to which the system can be
restored

 Each process keeps track of the number of messages it has sent to each other process
and the number of messages it has received from each other process

 If the number R of messages that a process has received from a neighbour exceeds the
number S that the neighbour has recorded as sent to it, one or more messages are orphans,
and the receiving process has to roll back to a state where R = S

 The algorithm assumes that a process, upon restarting, broadcasts a message saying that it had
failed. This can be done with O(|E|) messages, where |E| is the total number of communication links.

 The algorithm is initiated at a process when it restarts after a failure or when it learns about
another process's failure
Event Driven Computation

[Figure: event timelines for processes X (events ex0, ex1, ex2), Y (events ey0, ey1, ey2, ey3) and Z (events ez0, ez1, ez2, ez3).]


Asynchronous Checkpointing Algorithm
 At Pi
(a) if i is a processor that is recovering after a failure
        then CKPTi = latest event logged in the stable storage
        else CKPTi = latest event that took place

(b) for k = 1 to N do    /* N = number of processes */
    begin
        for each neighbouring process j do
            send ROLLBACK(i, SENT i→j (CKPTi)) message
        wait for ROLLBACK messages from every neighbour
        for every ROLLBACK(j, c) message received from a neighbour j,
        i does the following
            if RCVD i←j (CKPTi) > c then    /* implies orphan messages */
            begin
                find the latest event e such that RCVD i←j (e) = c
                CKPTi = e
            end
    end    /* for k */
Asynchronous Checkpointing Example

[Figure: event timelines for processes X (checkpoint x1; events ex0, ex1, ex2, ex3), Y (checkpoint y1; events ey0, ey1, ey2, ey3) and Z (checkpoint z1; events ez0, ez1, ez2).]

 The procedure has N iterations. At the end of each iteration, at least
one process will roll back to its final recovery point, unless
the current recovery points are consistent

 In the example, Y restarts at checkpoint y1, and ey2 is the latest
event logged
 CKPTx = ex3
 CKPTy = ey2
 CKPTz = ez2
Asynchronous Checkpointing Example
 First Iteration
 Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z
 X sends ROLLBACK(X,2) to Y and ROLLBACK(X,0) to Z
 Z sends ROLLBACK(Z,0) to X and ROLLBACK(Z,1) to Y
 Since RCVD X←Y (CKPTX) = 3 > 2, X will set CKPTX to ex2
 Since RCVD Z←Y (CKPTZ) = 2 > 1, Z will set CKPTZ to ez1
 Y need not rollback any further

 Second Iteration
 Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y, 1) to Z
 Z sends ROLLBACK(Z, 0) to X and ROLLBACK(Z, 1) to Y
 X sends ROLLBACK(X, 2) to Y and ROLLBACK(X, 0) to Z

 The third iteration progresses in a similar fashion

 The set of recovery points chosen at the end of the first
iteration, {ex2, ey2, ez1}, is consistent
 No further rollbacks are required
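The iterations of the recovery procedure can be simulated with the sketch below. Several things here are assumptions made for illustration: the event history in `events` is hypothetical and was chosen only so that its send and receive counts agree with the ROLLBACK values in the example, all processes are treated as neighbours of each other, and message exchange is replaced by a synchronous round in which every process reads the values sent in that round.

```python
def cumulative(events):
    """Turn (received_from, sent_to_list) events into running per-peer counts."""
    log, rcvd, sent = [], {}, {}
    for received_from, sent_to in events:
        if received_from is not None:
            rcvd[received_from] = rcvd.get(received_from, 0) + 1
        for dest in sent_to:
            sent[dest] = sent.get(dest, 0) + 1
        log.append({"rcvd": dict(rcvd), "sent": dict(sent)})
    return log

def recover(logs, ckpt):
    """Return the recovery point (event index) chosen for each process."""
    ckpt = dict(ckpt)
    procs = list(logs)
    for _ in range(len(procs)):                       # N iterations
        # Every process i sends ROLLBACK(i, SENT i->j(CKPTi)) to each neighbour j.
        inbox = {p: [] for p in procs}
        for i in procs:
            for j in procs:
                if i != j:
                    inbox[j].append((i, logs[i][ckpt[i]]["sent"].get(j, 0)))
        # Every process i handles each ROLLBACK(j, c) it received.
        for i in procs:
            for j, c in inbox[i]:
                if logs[i][ckpt[i]]["rcvd"].get(j, 0) > c:    # orphan messages
                    # Latest event e with RCVD i<-j(e) = c.
                    ckpt[i] = max(e for e in range(ckpt[i] + 1)
                                  if logs[i][e]["rcvd"].get(j, 0) == c)
    return ckpt

# Hypothetical history: event k of process P is (sender of the message that
# triggered it, or None for the initial event; destinations of messages sent).
events = {
    "X": [(None, ["Y"]), ("Y", ["Y"]), ("Y", []), ("Y", [])],
    "Y": [(None, ["X", "Z"]), ("X", ["X"]), ("X", []), ("Z", ["X", "Z"])],
    "Z": [(None, []), ("Y", ["Y"]), ("Y", [])],
}
logs = {p: cumulative(ev) for p, ev in events.items()}

# Y failed and lost ey3, so it starts from ey2; X and Z start from their latest events.
result = recover(logs, {"X": 3, "Y": 2, "Z": 2})
print(result)
```

Under these assumptions X rolls back to ex2 and Z to ez1 in the first iteration, and the later iterations change nothing, matching the set {ex2, ey2, ez1} in the example.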
References

 1. T. Juang and S. Venkatesan, “Crash Recovery with Little Overhead”,
Proceedings of the 11th International Conference on Distributed Computing
Systems, May 1991, pp. 454-461

 2. R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for
Distributed Systems”, IEEE Transactions on Software Engineering, Vol. 14,
No. 6, June 1988, pp. 810-821

 3. M. Singhal and N. Shivaratri, “Advanced Concepts in Operating Systems”,
Tata McGraw-Hill
