Failure Recovery in Distributed Systems

This document discusses failure recovery in distributed systems. It classifies failures as process failures, system failures, secondary storage failures, and communication medium failures. It defines recovery as restoring a system to its normal operational state. Two types of error recovery are discussed: forward error recovery, which can remove errors if faults are fully understood, and backward error recovery, which restores a previous error-free state. Backward recovery has overhead while forward recovery requires fault assessment. Consistent checkpoints and the Koo-Toueg algorithm for synchronous checkpointing and recovery are also summarized.

Uploaded by

Sudha Patel

Failure Recovery in Distributed Systems

02/08/22 1
Classification of Failures
• Process Failure
  - Symptoms: process fails to progress, computation results in erroneous output, process leads to incorrect system state
  - Causes: deadlocks, consistency violation, wrong input

• System Failure
  - Symptoms: processor fails to execute
  - Causes: CPU failure, bus failure, power failure, main memory failure

• Secondary Storage Failure
  - Symptoms: stored data cannot be accessed
  - Causes: head crash, dust particles settled on the medium

• Communication Medium Failure
  - Symptoms: one site cannot communicate with another operational site in the network
  - Causes: failure of switching nodes or links

What is Recovery?

“Recovery in computer systems refers to restoring a system to its normal operational state.”

Error Recovery
• Forward Error Recovery
  - If the nature of the errors and damage caused by faults can be completely and accurately assessed, then it is possible to remove those errors from the process’s state and enable the process to move forward. This technique is known as forward error recovery.

• Backward Error Recovery
  - If it is not possible to foresee the nature of faults and to remove all the errors in the process’s state, then the process’s state can be restored to a previous error-free state of the process. This technique is known as backward error recovery.

Comparison of Backward and Forward Recovery

Backward Recovery
• Performance penalty due to the overhead of restoring state
• No guarantee that faults will not re-occur

Forward Recovery
• Less overhead
• Can be used only when damages can be correctly assessed

Backward Recovery: Operation-Based Approach
• Major points:
  - Every update operation on an object updates the object and records a log entry in stable storage. The entry holds enough information to completely undo and redo the operation.
  - The information recorded includes: (1) the name of the object, (2) the old state of the object (used for undo), (3) the new state of the object (used for redo)

• Methods
  - Updating-in-place protocol
  - Write-ahead-log protocol

Updating-in-place protocol
• Major operations:
  - A do operation, which performs the action (update) and writes a log record.
  - An undo operation, which, given a log record written by a do operation, undoes the action performed by the do operation.
  - A redo operation, which, given a log record written by a do operation, redoes the action specified by the do operation.

• Problem with this method:
  - A do operation cannot be undone if the system crashes after the update is applied but before its log record is written. The write-ahead-log protocol overcomes this problem.
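
The do, undo and redo operations can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the names `LogRecord`, `do`, `undo` and `redo` are hypothetical, the object store is a plain dictionary, the log is an in-memory list standing in for stable storage, and the updated object is assumed to exist already.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class LogRecord:
    """One log entry, holding the three fields listed for the operation-based approach."""
    name: str        # (1) name of the object
    old_state: Any   # (2) old state, used for undo
    new_state: Any   # (3) new state, used for redo


def do(store: Dict[str, Any], log: List[LogRecord], name: str, value: Any) -> LogRecord:
    """Perform the update in place, then write the log record."""
    record = LogRecord(name, store[name], value)
    store[name] = value      # update in place first ...
    log.append(record)       # ... then log it (a crash between the two loses the record)
    return record


def undo(store: Dict[str, Any], record: LogRecord) -> None:
    """Undo the action performed by the do operation that wrote `record`."""
    store[record.name] = record.old_state


def redo(store: Dict[str, Any], record: LogRecord) -> None:
    """Redo the action specified by the do operation that wrote `record`."""
    store[record.name] = record.new_state


store = {"x": 1}
log: List[LogRecord] = []
rec = do(store, log, "x", 2)
undo(store, rec)   # x is restored to its old state
redo(store, rec)   # x is set to its new state again
```

The two statements inside `do` mark the window in which a crash leaves an update that cannot be undone, which is the problem named on this slide.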

The write-ahead-log protocol
• In the write-ahead-log protocol, a recoverable update operation is implemented by the following operations:
  - Update an object only after the undo log is recorded
  - Before committing the updates, redo and undo logs are recorded
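
The ordering rule can be sketched in Python as follows. This is a toy model, not a definitive implementation: `WalStore`, its methods and the `crash_before_write` flag are hypothetical names, and the in-memory list `log` stands in for stable storage.

```python
from typing import Any, Dict, List, Tuple

# One log entry: (kind, object name, old state, new state).
LogEntry = Tuple[str, str, Any, Any]


class WalStore:
    """Toy object store; `log` is a hypothetical stand-in for stable storage."""

    def __init__(self, objects: Dict[str, Any]) -> None:
        self.objects = dict(objects)
        self.log: List[LogEntry] = []

    def update(self, name: str, new_state: Any, crash_before_write: bool = False) -> None:
        # Record the undo (and redo) information first ...
        self.log.append(("update", name, self.objects[name], new_state))
        if crash_before_write:
            return  # simulated crash: the log entry exists, the object is untouched
        # ... and only then update the object.
        self.objects[name] = new_state

    def commit(self) -> None:
        # Every undo/redo entry of the committed updates is already in the log.
        self.log.append(("commit", "", None, None))

    def recover(self) -> None:
        """Undo, newest first, every update logged after the last commit."""
        for kind, name, old_state, _new_state in reversed(self.log):
            if kind == "commit":
                break
            self.objects[name] = old_state


store = WalStore({"balance": 100})
store.update("balance", 80)
store.commit()
store.update("balance", 50)   # not committed
store.recover()               # balance is rolled back to the committed value
```

Because the log entry is always written before the object changes, a crash at any point leaves either an untouched object or an update whose undo information is available.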

State-based approach
• In this approach, the complete state of a process is saved when a recovery point is established.

• The process of saving a state is referred to as checkpointing.

• A process that fails is rolled back to the last checkpoint.

• A simple scheme to implement the state-based approach: the shadow-copy scheme.

Shadow Copy Method
• Under this technique, only a part of the system state is saved to facilitate recovery.

• Whenever a process wants to modify an object, the page containing that object is duplicated and maintained on stable storage. This duplicated page is termed the shadow copy.

• Modifications are made on the current copy.

• If the process fails, the current copy of the object is discarded and the shadow copy is restored.
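
A minimal Python sketch of the shadow-copy idea follows. The class `ShadowPagedStore` and its method names are hypothetical, pages are modelled as dictionaries, and the `shadow` dictionary stands in for stable storage.

```python
import copy
from typing import Any, Dict


class ShadowPagedStore:
    """Toy paged store that keeps a shadow copy of every page being modified."""

    def __init__(self, pages: Dict[int, Dict[str, Any]]) -> None:
        self.current = pages                         # page id -> objects on that page
        self.shadow: Dict[int, Dict[str, Any]] = {}  # stand-in for stable storage

    def modify(self, page_id: int, key: str, value: Any) -> None:
        # Duplicate the page before its first modification.
        if page_id not in self.shadow:
            self.shadow[page_id] = copy.deepcopy(self.current[page_id])
        # Modifications are made on the current copy only.
        self.current[page_id][key] = value

    def commit(self) -> None:
        # The current copies become the valid state; shadows are no longer needed.
        self.shadow.clear()

    def recover(self) -> None:
        # Discard the current copy of each modified page and restore its shadow.
        for page_id, page in self.shadow.items():
            self.current[page_id] = page
        self.shadow.clear()


store = ShadowPagedStore({0: {"a": 1}, 1: {"b": 2}})
store.modify(0, "a", 99)
store.recover()   # page 0 is restored from its shadow copy; page 1 was never copied
```

Only page 0 is duplicated in this example, which reflects the point that just part of the system state is saved.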

Problems occurring in concurrent systems: “Orphan Messages”
• Orphan Messages and the Domino Effect

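Definitions: Orphan Messages and Domino Effect
• Orphan message: A message whose source (parent) cannot be traced is called an orphan message. This happens when the receiver's state records the receipt of the message but the sender has been rolled back to a state in which the message was not yet sent.

• Domino effect: The effect where the rolling back of one process causes one or more other processes to roll back is known as the domino effect.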
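
The two definitions can be tied together with a small simulation. This is a sketch under simplifying assumptions that are not in the slides: all events are stamped with a single global clock, every process has an initial checkpoint at time 0, and the function name `rollback_line` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# A message is (sender, send_time, receiver, receive_time).
Message = Tuple[str, int, str, int]


def rollback_line(checkpoints: Dict[str, List[int]],
                  messages: List[Message],
                  failed: str) -> Dict[str, float]:
    """Return, per process, the time its state is restored to (inf = no rollback)."""
    restored: Dict[str, float] = {p: float("inf") for p in checkpoints}
    restored[failed] = checkpoints[failed][-1]   # failed process uses its last checkpoint
    changed = True
    while changed:
        changed = False
        for sender, t_send, receiver, t_recv in messages:
            send_undone = t_send > restored[sender]
            receipt_kept = t_recv <= restored[receiver]
            if send_undone and receipt_kept:
                # Orphan message: the receiver must roll back to before the receipt.
                restored[receiver] = max(c for c in checkpoints[receiver] if c < t_recv)
                changed = True
    return restored


checkpoints = {"X": [0, 40], "Y": [0, 20]}
messages = [("X", 10, "Y", 15), ("Y", 30, "X", 35)]
result = rollback_line(checkpoints, messages, failed="Y")
# Y rolls back to 20, which orphans the message X received at 35, so X rolls back to 0;
# that in turn orphans the message Y received at 15, so Y rolls back to 0 as well.
```

In this example a single failure of Y drives both processes back to their initial states, which is the domino effect in its simplest form.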

Problems occurring in concurrent systems: “Lost messages”
• Lost messages

Problems occurring in concurrent systems: “Deadlocks”
• Deadlock: A deadlock occurs when a set of processes in a system is blocked because each process is waiting for the release of some resource held by another process.

• Necessary conditions for deadlock:
1. Exclusive Access
2. Hold and Wait
3. No Pre-emption
4. Circular Wait
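
The circular-wait condition can be checked mechanically on a wait-for graph, in which an edge P → Q means that process P waits for a resource held by Q. The sketch below is an illustration only; the function name `has_circular_wait` and the dictionary representation are assumptions, not part of the slides.

```python
from typing import Dict, List, Set


def has_circular_wait(wait_for: Dict[str, List[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle (depth-first search)."""
    visiting: Set[str] = set()   # processes on the current search path
    done: Set[str] = set()       # processes already known to lead to no cycle

    def visit(process: str) -> bool:
        if process in done:
            return False
        if process in visiting:
            return True          # reached a process already on the path: a cycle
        visiting.add(process)
        for holder in wait_for.get(process, []):
            if visit(holder):
                return True
        visiting.discard(process)
        done.add(process)
        return False

    return any(visit(p) for p in list(wait_for))


cyclic = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}
acyclic = {"P1": ["P2"], "P2": ["P3"]}
print(has_circular_wait(cyclic), has_circular_wait(acyclic))
```

A cycle is a necessary condition for deadlock; it is also sufficient when every resource has a single instance.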

Problems occurring in concurrent systems: “Livelocks”
• Livelock: In rollback recovery, livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress.

Problems occurring in concurrent systems (continued)
• Problem of livelocks

Strongly consistent set of checkpoints
• Definition: To overcome the domino effect, a set of local checkpoints is needed (one for each process in the set) such that no information flow takes place between any pair of processes in the set, or between any process in the set and any process outside the set, during the interval spanned by the checkpoints. Such a set of checkpoints is known as a recovery line or a strongly consistent set of checkpoints.

• Limitation: Processes (with a strongly consistent set of checkpoints) experience delays during the checkpointing process, as processes cannot exchange messages while checkpointing is in progress.

Strongly consistent set of checkpoints: example

(x1,y1,z1) form a strongly consistent set of checkpoints

Consistent set of checkpoints
• Definition: A consistent set of checkpoints requires that each message recorded as received in a checkpoint (state) should also be recorded as sent in a checkpoint (state).
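
The definition translates directly into a check over recorded message identifiers. The sketch below assumes that every message has a globally unique id and that each checkpoint stores the ids its process has sent and received so far; the function name `is_consistent` and the dictionary layout are hypothetical.

```python
from typing import Dict, Set

# A checkpoint is modelled as {"sent": ids sent so far, "received": ids received so far}.
Checkpoint = Dict[str, Set[str]]


def is_consistent(checkpoints: Dict[str, Checkpoint]) -> bool:
    """True if every message recorded as received is also recorded as sent."""
    all_sent: Set[str] = set().union(*(c["sent"] for c in checkpoints.values()))
    return all(c["received"] <= all_sent for c in checkpoints.values())


consistent = {
    "X": {"sent": {"m1"}, "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
}
inconsistent = {
    "X": {"sent": set(), "received": set()},   # X's checkpoint predates sending m1
    "Y": {"sent": set(), "received": {"m1"}},  # Y's checkpoint already records m1
}
print(is_consistent(consistent), is_consistent(inconsistent))
```

Note that a message recorded as sent but not yet received does not violate this definition; only a receipt without a matching send does.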

Synchronous checkpointing and recovery
• Comments on the Checkpoint Algorithm (by Koo and Toueg):
  - Takes a consistent set of checkpoints

  - The algorithm assumes that a single process invokes the algorithm, as opposed to several processes concurrently invoking the algorithm to take permanent checkpoints

  - The algorithm works in two phases

The Checkpoint Algorithm (by Koo and Toueg)
• First Phase:
  - An initiating process Pi takes a tentative checkpoint and requests all the processes to take tentative checkpoints.
  - A process says “no” to a request if it fails to take a checkpoint for any reason.
  - When Pi learns that all processes have successfully taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints should be discarded.

• Second Phase:
  - Pi informs all processes of the decision it reached at the end of the first phase, and all processes act accordingly.
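
The two phases can be sketched in Python as shown below. This follows the simplified description on the slide, in which all processes are asked to checkpoint, and it replaces message exchange with direct method calls. The names `Process`, `take_tentative`, `finish`, `run_checkpoint` and the `can_checkpoint` flag are hypothetical.

```python
import copy
from typing import Any, List, Optional


class Process:
    def __init__(self, name: str, state: Any, can_checkpoint: bool = True) -> None:
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint   # False simulates a "no" reply
        self.tentative: Optional[Any] = None
        self.permanent: Optional[Any] = None
        self.sending_blocked = False

    def take_tentative(self) -> bool:
        """Phase 1: try to take a tentative checkpoint and answer yes or no."""
        if not self.can_checkpoint:
            return False
        self.tentative = copy.deepcopy(self.state)
        self.sending_blocked = True   # no messages are sent until the decision arrives
        return True

    def finish(self, make_permanent: bool) -> None:
        """Phase 2: act on the initiator's decision."""
        if make_permanent:
            self.permanent = self.tentative
        self.tentative = None
        self.sending_blocked = False


def run_checkpoint(initiator: Process, others: List[Process]) -> bool:
    everyone = [initiator, *others]
    # Phase 1: the initiator and all other processes take tentative checkpoints.
    replies = [p.take_tentative() for p in everyone]
    decision = all(replies)
    # Phase 2: the initiator's decision is propagated and applied everywhere.
    for p in everyone:
        p.finish(decision)
    return decision


p1, p2, p3 = Process("P1", {"n": 1}), Process("P2", {"n": 2}), Process("P3", {"n": 3})
committed = run_checkpoint(p1, [p2, p3])   # every process replies yes
```

If any process replies “no”, `run_checkpoint` returns `False` and every tentative checkpoint is discarded, so either all or none of the processes take permanent checkpoints.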

Correctness of the Checkpoint Algorithm
• The set of permanent checkpoints taken by this algorithm is consistent because:

  - Either all or none of the processes take permanent checkpoints.

  - A set of checkpoints will be inconsistent if there is a record of a message received but not of the event sending it. This will not happen, as no process sends messages after taking a tentative checkpoint until the receipt of the initiating process’s decision, by which time all processes would have taken checkpoints.

Conclusion
• Failure recovery is critical for ensuring the correctness and global consistency of processes in an operating system.
