Failure Recovery in Distributed Systems

This document discusses failure recovery in distributed systems. It classifies failures as process failures, system failures, secondary storage failures, and communication medium failures. It defines recovery as restoring a system to its normal operational state. Two types of error recovery are discussed: forward error recovery, which can remove errors if faults are fully understood, and backward error recovery, which restores a previous error-free state. Backward recovery has overhead while forward recovery requires fault assessment. Consistent checkpoints and the Koo-Toueg algorithm for synchronous checkpointing and recovery are also summarized.

Uploaded by

Sudha Patel


Failure Recovery in Distributed Systems

02/08/22 1
Classification of Failures
- Process Failure
  - Symptoms: the process fails to make progress, the computation produces erroneous output, or the process drives the system into an incorrect state
  - Causes: deadlocks, consistency violations, wrong input

- System Failure
  - Symptoms: the processor fails to execute
  - Causes: CPU failure, bus failure, power failure, main memory failure

- Secondary Storage Failure
  - Symptoms: stored data cannot be accessed
  - Causes: head crash, dust particles settled on the medium

- Communication Medium Failure
  - Symptoms: one site cannot communicate with another operational site in the network
  - Causes: failure of switching nodes or links

What is Recovery?

“Recovery in computer systems refers to restoring a system to its normal operational state.”

Error Recovery
- Forward Error Recovery
  - If the nature of the errors and the damage caused by faults can be completely and accurately assessed, those errors can be removed from the process’s state so that the process can move forward. This technique is known as forward error recovery.

- Backward Error Recovery
  - If the nature of the faults cannot be foreseen and not all errors can be removed from the process’s state, the process’s state can be restored to a previous error-free state. This technique is known as backward error recovery.

Comparison of Backward and Forward Recovery

Backward Recovery
- Performance penalty due to the overhead of restoring state
- No guarantee that faults will not re-occur

Forward Recovery
- Less overhead
- Can be used only when damages can be correctly assessed

Backward Recovery: Operation-Based Approach
- Major points:
  - Every update operation on an object updates the object and writes a log record to stable storage. The record holds enough information to completely undo and redo the operation.
  - The information recorded includes: (1) the name of the object, (2) the old state of the object (used for undo), (3) the new state of the object (used for redo)

- Methods
  - Updating-in-place protocol
  - Write-ahead-log protocol

Updating-in-place protocol
- Major operations:
  - A do operation, which performs the action (update) and writes a log record.
  - An undo operation, which, given a log record written by a do operation, undoes the action performed by the do operation.
  - A redo operation, which, given a log record written by a do operation, redoes the action specified by the do operation.

- Problem with this method
  - A do operation cannot be undone if the system crashes after the update is applied but before the log record is written. This problem is overcome by the write-ahead-log protocol.
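
The do, undo and redo operations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names LogRecord, store and log are hypothetical, and an in-memory list stands in for the log on stable storage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    name: str      # name of the object
    old: object    # old state of the object, used for undo
    new: object    # new state of the object, used for redo


store = {"x": 1}   # objects that are updated in place
log = []           # hypothetical stand-in for the log on stable storage


def do(name, value):
    """Perform the update, then write the log record."""
    record = LogRecord(name, store[name], value)
    store[name] = value    # a crash right after this line leaves no record to undo with
    log.append(record)
    return record


def undo(record):
    """Undo the action described by a log record."""
    store[record.name] = record.old


def redo(record):
    """Redo the action described by a log record."""
    store[record.name] = record.new


rec = do("x", 2)
undo(rec)    # x is back to its old state
redo(rec)    # x holds the new state again
```

The comment inside do marks the window described above: the object has changed but nothing has been logged yet.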

The write-ahead-log protocol
- In the write-ahead-log protocol a recoverable update operation is implemented by the following operations:
  - Update an object only after the undo log is recorded
  - Before committing the updates, redo and undo logs are recorded
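
A minimal sketch of this ordering, assuming an in-memory list in place of stable storage. The names wal_update, commit and recover, and the tuple layout of the log records, are illustrative choices and not part of the protocol as stated on the slide.

```python
store = {"x": 1}
stable_log = []    # hypothetical stand-in for stable storage


def wal_update(name, value):
    # 1. Record the undo information before touching the object.
    stable_log.append(("undo", name, store[name]))
    # 2. Only now update the object.
    store[name] = value
    # 3. Record the redo information before the update can be committed.
    stable_log.append(("redo", name, value))


def commit():
    stable_log.append(("commit",))


def recover():
    """Roll back updates that follow the last commit record, newest first."""
    last_commit = max(
        (i for i, rec in enumerate(stable_log) if rec[0] == "commit"),
        default=-1,
    )
    for kind, *rest in reversed(stable_log[last_commit + 1:]):
        if kind == "undo":
            name, old = rest
            store[name] = old
    del stable_log[last_commit + 1:]


wal_update("x", 2)
commit()
wal_update("x", 3)   # never committed
recover()            # simulated restart after a crash
```

Because the undo record is written first, a crash between steps 1 and 2 is harmless: recovery simply restores a value the object still holds.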

State-based approach
- In this approach, the complete state of a process is saved when a recovery point is established.
- The process of saving a state is referred to as checkpointing.
- A process that fails is rolled back to the last checkpoint.
- A simple scheme to implement the state-based approach: the shadow-copy scheme.
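
Checkpointing and rollback can be illustrated with a small sketch. It assumes the whole process state fits in one Python object; the class name CheckpointedProcess and its methods are hypothetical.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)   # initial recovery point

    def checkpoint(self):
        """Save the complete state (establish a recovery point)."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Restore the state saved at the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


p = CheckpointedProcess({"counter": 0})
p.state["counter"] = 5
p.checkpoint()
p.state["counter"] = 99   # work done after the checkpoint
p.rollback()              # simulated failure: this work is lost
```

The deep copies keep later changes to the live state from leaking into the saved checkpoint.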

Shadow Copy Method
- Under this technique, only a part of the system state is saved to facilitate recovery.
- Whenever a process wants to modify an object, the page containing that object is duplicated and maintained on stable storage. This duplicated page is termed the shadow copy.
- Modifications are made on the current copy.
- If the process fails, the current copy of the object is discarded and the shadow copy is restored.
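
The steps above can be sketched as follows, assuming pages are small dictionaries and a second dictionary stands in for stable storage. The function names are hypothetical, and the commit step is an added assumption that the slide does not describe.

```python
import copy

pages = {0: {"a": 1, "b": 2}, 1: {"c": 3}}   # current copies, keyed by page number
shadow = {}                                   # stand-in for shadow pages on stable storage


def modify(page_no, key, value):
    # Duplicate the page the first time it is modified.
    if page_no not in shadow:
        shadow[page_no] = copy.deepcopy(pages[page_no])
    pages[page_no][key] = value               # modifications go to the current copy


def fail():
    """Discard the current copies and restore the shadow copies."""
    for page_no, saved in shadow.items():
        pages[page_no] = saved
    shadow.clear()


def commit():
    """Assumed step: keep the current copies and drop the shadow copies."""
    shadow.clear()


modify(0, "a", 10)
fail()    # page 0 returns to its shadow copy; page 1 was never copied
```

Only the touched page is duplicated, which is what distinguishes this scheme from saving the complete process state.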

Problems occurring in concurrent systems: “Orphan Messages”
[Figure: orphan messages and the domino effect]

Definitions: Orphan Messages and Domino Effect
- Orphan message: A message whose receive event is recorded but whose send event is not, for example because the sender has rolled back to a state before the send. Its source (parent) can no longer be traced.
- Domino effect: The effect where the rolling back of one process causes one or more other processes to roll back is known as the domino effect.
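
A small simulation shows how orphan messages drive the domino effect. It is a sketch under simplifying assumptions that are not in the slides: events carry times from a single global clock, every process has an initial checkpoint at time 0, and the function name rollback_line is hypothetical.

```python
def rollback_line(checkpoints, messages, failed):
    """Return the checkpoint time each process restarts from (None = no rollback).

    checkpoints: {process: sorted list of local checkpoint times, starting at 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    failed:      the process that fails and rolls back first
    """
    restart = {p: None for p in checkpoints}
    restart[failed] = checkpoints[failed][-1]   # last checkpoint of the failed process

    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            send_undone = restart[sender] is not None and sent > restart[sender]
            recv_kept = restart[receiver] is None or received <= restart[receiver]
            if send_undone and recv_kept:
                # Orphan message: the receiver must roll back past the receive.
                earlier = [c for c in checkpoints[receiver] if c < received]
                restart[receiver] = max(earlier)
                changed = True
    return restart


result = rollback_line(
    checkpoints={"X": [0, 4], "Y": [0, 2]},
    messages=[("Y", 3, "X", 3.5), ("X", 5, "Y", 6)],
    failed="X",
)
```

In this run X fails and restarts from its checkpoint at time 4. That orphans the message Y received at time 6, so Y rolls back to 2, which in turn orphans the message X received at 3.5 and pushes X back to its initial checkpoint.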

Problems occurring in concurrent systems: “Lost messages”
[Figure: lost messages]

Problems occurring in concurrent systems: “Deadlocks”
- Deadlock: A deadlock occurs when a set of processes in a system is blocked because each process is waiting for the release of some resource held by another process.

- Necessary conditions for deadlock:
  1. Exclusive Access
  2. Hold and Wait
  3. No Pre-emption
  4. Circular Wait
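
The circular-wait condition can be checked on a wait-for relation. The sketch below assumes each blocked process waits for exactly one other process; the function name find_deadlock is hypothetical.

```python
def find_deadlock(wait_for):
    """Return a list of processes forming a circular wait, or None.

    wait_for maps each blocked process to the process that holds the
    resource it is waiting for (one outstanding request per process).
    """
    for start in wait_for:
        seen = []
        current = start
        # Follow the chain of waits until it ends or revisits a process.
        while current in wait_for and current not in seen:
            seen.append(current)
            current = wait_for[current]
        if current in seen:
            return seen[seen.index(current):]
    return None


cycle = find_deadlock({"P1": "P2", "P2": "P3", "P3": "P1"})
no_cycle = find_deadlock({"P1": "P2", "P2": "P3"})
```

Here cycle is the list P1, P2, P3, while no_cycle is None because P3 is not waiting for anything.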

Problems occurring in concurrent systems: “Livelocks”
- Livelock: In rollback recovery, livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing the system from making progress.

Problems occurring in concurrent systems… continued
[Figure: the problem of livelocks]

Strongly Consistent Set of Checkpoints
- Definition: To overcome the domino effect, a set of local checkpoints is needed (one for each process in the set) such that no information flows between any pair of processes in the set, or between any process in the set and any process outside the set, during the interval spanned by the checkpoints. Such a set of checkpoints is known as a recovery line or a strongly consistent set of checkpoints.

- Limitation: Processes taking a strongly consistent set of checkpoints experience delays, because they cannot exchange messages while checkpointing is in progress.

Strongly Consistent Set of Checkpoints

(x1, y1, z1) form a strongly consistent set of checkpoints.

Consistent Set of Checkpoints
- Definition: A consistent set of checkpoints requires that each message recorded as received in a checkpoint (state) should also be recorded as sent in a checkpoint (state).
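
This definition translates directly into a set comparison. The sketch assumes each local checkpoint records the identifiers of the messages sent and received so far; the function name is_consistent and the dictionary layout are hypothetical.

```python
def is_consistent(checkpoints):
    """checkpoints: {process: {"sent": set of ids, "received": set of ids}}"""
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    # Every message recorded as received must also be recorded as sent.
    return received <= sent


ok = is_consistent({
    "X": {"sent": {"m1"}, "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
})
orphan = is_consistent({
    "X": {"sent": set(), "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
})
```

A message that is recorded as sent but not yet received does not violate the definition, so such a set still counts as consistent.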

Synchronous Checkpointing and Recovery
- Comments on the checkpoint algorithm (by Koo and Toueg):
  - It takes a consistent set of checkpoints.
  - It assumes that a single process invokes the algorithm, as opposed to several processes concurrently invoking it to take permanent checkpoints.
  - It works in two phases.

The Checkpoint Algorithm (by Koo and Toueg)
- First Phase:
  - An initiating process Pi takes a tentative checkpoint and requests all the processes to take tentative checkpoints.
  - A process says “no” to a request if it fails to take a checkpoint for any reason.
  - When Pi learns that all processes have successfully taken tentative checkpoints, Pi decides that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the tentative checkpoints should be discarded.

- Second Phase:
  - Pi informs all processes of the decision it reached at the end of the first phase, and all processes act accordingly.
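
The two phases can be sketched as a single-threaded simulation. This is a simplified illustration of the description above, not the full published algorithm: message passing is replaced by direct method calls, the blocking of sends between the two phases is not modeled, and the Process class with its can_checkpoint switch is hypothetical.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint   # hypothetical switch to force a "no"
        self.state = 0
        self.tentative = None
        self.permanent = 0

    def take_tentative(self):
        """Phase 1: answer yes (True) or no (False)."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def decide(self, commit):
        """Phase 2: make the tentative checkpoint permanent or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def checkpoint_round(initiator, others):
    everyone = [initiator, *others]
    # Phase 1: the initiator and all other processes take tentative checkpoints.
    replies = [p.take_tentative() for p in everyone]
    commit = all(replies)
    # Phase 2: the initiator's decision is delivered to every process.
    for p in everyone:
        p.decide(commit)
    return commit


pi, pj, pk = Process("Pi"), Process("Pj"), Process("Pk")
for p, value in ((pi, 10), (pj, 20), (pk, 30)):
    p.state = value
first = checkpoint_round(pi, [pj, pk])    # every process answers yes

pk.can_checkpoint = False
pi.state = 11
second = checkpoint_round(pi, [pj, pk])   # Pk answers no
```

After the first round every permanent checkpoint matches the process state. The second round is aborted, so Pi keeps its permanent checkpoint of 10 even though its state has moved on to 11.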


Correctness of the Checkpoint Algorithm
- The set of permanent checkpoints taken by this algorithm is consistent because:
  - Either all or none of the processes take permanent checkpoints.
  - A set of checkpoints is inconsistent if there is a record of a message being received but no record of the event that sent it. This cannot happen, because no process sends messages after taking a tentative checkpoint until it receives the initiating process’s decision, by which time all processes have taken their checkpoints.

Conclusion
- Failure recovery is critical for ensuring the correctness and global consistency of processes in an operating system.
