Failure Recovery in Distributed Systems
02/08/22 1
Classification of Failures
Process Failure
Symptoms : process fails to progress, computation results in erroneous output,
process leads to incorrect system state
Causes : deadlocks, consistency violation, wrong input
System Failure
Symptoms: processor fails to execute
Causes : CPU failure, bus failure, power failure, main memory failure
What is Recovery ?
Error Recovery
Forward Error Recovery
If the nature of errors and damages caused by faults can be
completely and accurately assessed, then it is possible to remove
those errors in the process’s state and enable the process to
move forward. This technique is known as forward error
recovery.
Backward Error Recovery
If it is not possible to foresee the nature of the faults and remove
all the errors in the process’s state, then the process’s state can
be restored to a previous error-free state of the process. This
technique is known as backward-error recovery.
Comparison of Backward and
Forward Recovery
Backward Recovery Forward Recovery
Backward Recovery : Operation
Based Approach
Major points:
Every update operation to an object updates the object and results
in a log to be recorded in a stable storage which has enough
information to completely undo and redo the operation.
The information recorded includes: (1) the name of the object, (2)
the old state of the object (used for undo), (3) the new state of the
object (used for redo)
Methods
Updating-in-place protocol
Write-ahead-log protocol
Updating-in-place protocol
Major operations:
A do operation, which does the action(update) and writes a log record.
An undo operation, which, given a log record written by a do
operation, undoes the action performed by the do operation.
A redo operation, which, given a log record written by a do operation,
redoes the action specified by the do operation.
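The three operations above can be sketched in a few lines of Python. This is only an illustrative sketch, not a definitive implementation: the names LogRecord, do, undo and redo are hypothetical, the "stable storage" is an ordinary in-memory list, and objects are assumed to exist in the store before they are updated.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class LogRecord:
    name: str       # (1) name of the object
    old_state: Any  # (2) old state of the object, used for undo
    new_state: Any  # (3) new state of the object, used for redo


def do(store: Dict[str, Any], log: List[LogRecord], name: str, value: Any) -> LogRecord:
    """Perform the update in place and write a log record for it."""
    record = LogRecord(name, store[name], value)
    store[name] = value   # the action (update in place)
    log.append(record)    # assumption: a real system writes this to stable storage
    return record


def undo(store: Dict[str, Any], record: LogRecord) -> None:
    """Undo the action described by a log record written by do."""
    store[record.name] = record.old_state


def redo(store: Dict[str, Any], record: LogRecord) -> None:
    """Redo the action described by a log record written by do."""
    store[record.name] = record.new_state


store = {"x": 1}
log: List[LogRecord] = []
rec = do(store, log, "x", 2)
undo(store, rec)
after_undo = store["x"]
redo(store, rec)
after_redo = store["x"]
```

Because each record carries both the old and the new state, undo and redo need nothing except the record itself.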
The write-ahead-log protocol
In the write-ahead-log protocol a recoverable update operation
is implemented by the following operations:
State-based approach
In this approach, the complete state of a process is saved when
a recovery point is established.
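A minimal sketch of the idea, assuming the process state fits in a Python dictionary and that keeping copies in memory stands in for saving them to stable storage. The class and method names are hypothetical.

```python
import copy
from typing import Any, Dict, List


class CheckpointedProcess:
    """Toy process that saves its complete state at each recovery point."""

    def __init__(self, state: Dict[str, Any]) -> None:
        self.state = state
        self._recovery_points: List[Dict[str, Any]] = []

    def establish_recovery_point(self) -> None:
        # Save the complete state, not just the part that changed.
        self._recovery_points.append(copy.deepcopy(self.state))

    def rollback(self) -> None:
        # Restore the most recently saved state.
        if not self._recovery_points:
            raise RuntimeError("no recovery point has been established")
        self.state = copy.deepcopy(self._recovery_points[-1])


proc = CheckpointedProcess({"balance": 100, "pending": []})
proc.establish_recovery_point()
proc.state["balance"] = -999          # simulated erroneous update
proc.state["pending"].append("bad")   # nested data is also corrupted
proc.rollback()
```

The deep copy matters here: a shallow copy would share the nested list with the live state, so the saved recovery point would be corrupted along with it.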
Shadow Copy Method
Under this technique, only a part of the system state is saved to
facilitate recovery.
Problems occurring in concurrent
systems : “Orphan Messages”
Orphan Messages and the Domino Effect
Definitions : Orphan Messages
and Domino Effect
Orphan Message : A message whose source (parent) cannot
be traced is called an orphan message.
Problems occurring in concurrent
systems : “Lost messages”
Lost messages
Problems occurring in concurrent
systems : “Deadlocks”
Deadlock : A deadlock occurs when a set of processes in a
system is blocked because each process is waiting for the
release of some resource held by another process.
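One common way to make this definition concrete is a wait-for graph, in which an edge from P to Q means "P is waiting for a resource held by Q". Under the assumption that each resource has a single holder, a cycle in this graph corresponds to a deadlock. The function name and the graph encoding below are hypothetical, and the sketch only detects the condition; it does not resolve it.

```python
from typing import Dict, List


def has_deadlock(wait_for: Dict[str, List[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle (depth-first search)."""
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    color: Dict[str, int] = {}

    def visit(p: str) -> bool:
        color[p] = GREY
        for q in wait_for.get(p, []):
            state = color.get(q, WHITE)
            if state == GREY:      # back edge: q is already on the current path
                return True
            if state == WHITE and visit(q):
                return True
        color[p] = BLACK
        return False

    return any(color.get(p, WHITE) == WHITE and visit(p) for p in list(wait_for))


# P1 waits for P2, P2 waits for P3, P3 waits for P1: a cycle.
blocked = has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]})
# P1 waits for P2, P2 waits for P3, P3 waits for nobody: no cycle.
progressing = has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": []})
```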
Problems occurring in concurrent
systems : “Livelocks”
Livelock : In rollback recovery, livelock is a situation in which a
single failure can cause an infinite number of rollbacks,
preventing the system from making progress.
Problems occurring in concurrent
systems… continued
Problem of livelocks
Strongly consistent set of
Checkpoints
Definition: To overcome the domino effect, a set of local
checkpoints is needed (one for each process in the set) such
that no information flow takes place between any pair of
processes in the set, as well as between any process in the set
and any process outside the set during the interval spanned by
the checkpoints. Such a set of checkpoints is known as recovery
line or a strongly consistent set of checkpoints.
Consistent set of checkpoints
Definition : A consistent set of checkpoints requires that each
message recorded as received in a checkpoint (state) should
also be recorded as sent in a checkpoint (state).
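The definition can be checked mechanically. The sketch below assumes each local checkpoint records the identifiers of the messages its process has sent and received, and that message identifiers are globally unique. The encoding and the function names are hypothetical.

```python
from typing import Dict, Set

# One local checkpoint: the message ids recorded as sent and as received.
Checkpoint = Dict[str, Set[str]]


def unmatched_receives(checkpoints: Dict[str, Checkpoint]) -> Set[str]:
    """Messages recorded as received in some checkpoint but as sent in none."""
    sent: Set[str] = set().union(*(cp["sent"] for cp in checkpoints.values()))
    received: Set[str] = set().union(*(cp["received"] for cp in checkpoints.values()))
    return received - sent


def is_consistent(checkpoints: Dict[str, Checkpoint]) -> bool:
    """True if every recorded receive has a matching recorded send."""
    return not unmatched_receives(checkpoints)


# X checkpointed after sending m1, Y after receiving it: consistent.
good = {
    "X": {"sent": {"m1"}, "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
}
# X checkpointed before sending m1, Y after receiving it: m1 has no recorded sender.
bad = {
    "X": {"sent": set(), "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
}
good_ok = is_consistent(good)
bad_ok = is_consistent(bad)
bad_unmatched = unmatched_receives(bad)
```

Note that the check is one-directional: a message recorded as sent but not as received does not violate this definition.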
Synchronous checkpointing and
recovery
Comments on The Checkpoint Algorithm (by Koo and
Toueg):
Takes a consistent set of checkpoints
The Checkpoint Algorithm (by
Koo and Toueg)
First Phase:
An initiating process Pi takes a tentative checkpoint and requests all
the processes to take tentative checkpoints.
A process says “no” to a request if it fails to take checkpoint due to
any reason.
When Pi learns that all processes have successfully taken tentative
checkpoints, Pi decides that all tentative checkpoints should be
made permanent; otherwise, Pi decides that all the tentative
checkpoints should be discarded.
Second Phase:
Pi informs all processes of the decision it reached at the end of the
first phase, and all processes act accordingly.
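The two phases described above can be sketched as a small single-threaded simulation. This is a simplified illustration of the steps as stated here, not the full Koo and Toueg algorithm: direct method calls stand in for messages, failures are modelled by a can_checkpoint flag, and all class, method and attribute names are hypothetical.

```python
from typing import Dict, List, Optional


class Process:
    def __init__(self, name: str, state: Dict[str, int], can_checkpoint: bool = True) -> None:
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint   # assumption: models "fails for any reason"
        self.tentative: Optional[Dict[str, int]] = None
        self.permanent: Optional[Dict[str, int]] = None

    def take_tentative(self) -> bool:
        """First phase: try to take a tentative checkpoint; False means 'no'."""
        if not self.can_checkpoint:
            return False
        self.tentative = dict(self.state)
        return True

    def finish(self, commit: bool) -> None:
        """Second phase: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def run_checkpoint(initiator: Process, others: List[Process]) -> bool:
    # First phase: the initiator checkpoints and asks every other process to do so.
    replies = [initiator.take_tentative()]
    replies += [p.take_tentative() for p in others]
    decision = all(replies)   # commit only if nobody said "no"
    # Second phase: everyone is told the decision and acts on it.
    for p in [initiator, *others]:
        p.finish(decision)
    return decision


p1 = Process("P1", {"n": 1})
p2 = Process("P2", {"n": 2})
p3 = Process("P3", {"n": 3}, can_checkpoint=False)

first_round = run_checkpoint(p1, [p2])        # all succeed, checkpoints become permanent
p1.state["n"] = 10
second_round = run_checkpoint(p1, [p2, p3])   # P3 says "no", tentative checkpoints are discarded
```

After the second round, P1 still holds the permanent checkpoint from the first round, since a single "no" causes every tentative checkpoint to be discarded.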
Correctness of the Checkpoint
Algorithm
The set of permanent checkpoints taken by this algorithm is
consistent because:
Conclusion
Failure Recovery is critical for ensuring the
correctness and global consistency of processes in an
operating system.