Fault Tolerance and Recovery

Uploaded by

Nagendra

Introduction
 Recovery in computer systems refers to restoring a system to its normal
operational state.

 Processes are allocated system resources (e.g., memory and files), and they
may hold locks on shared resources.

 After the failure of a process, the resources must be reclaimed so that they
can be given to the other processes.

 If a failed process has modified the database, then all the modifications made
to the database by the failed process must be undone.

 On the other hand, if a process has executed for some time before failing, it
is preferable to restart the process from the point of failure and resume its
execution.
Introduction Cont..
 Enhanced performance:
 Concurrent execution of many processes that cooperate in performing a
task.

 If one process fails, the entire task may have to be recovered: the effects
of the failed process's interactions with the other processes must be undone.

 Increased availability:
 Through replication of, e.g., data, processes, and hardware components.

 If a site fails, copies of data stored at that site may miss updates and thus
be inconsistent with the rest of the system when the site becomes
operational again.
Basic Concepts
 System: a set of software and hardware components to provide a service.

 A fault is an anomalous physical condition. The causes of a fault include
design errors (in system specification or implementation), manufacturing
problems, and external disturbances such as harsh environmental conditions.
 Software bug
 Random hardware fault
 Memory bit "stuck"

 An erroneous state is one which could lead to a system failure by a sequence
of valid state transitions.

 An error is a part of a system state which differs from its intended value.
 An error is a manifestation of a fault in a system, which could lead to system
failure.
 A failure of a system is a deviation from its desired behaviour.
Example
if (Balance < 1000)   // Fault: '<' is used where '<=' was intended (the '=' is missing)
{
    return false;
}
else Withdraw();
// Error: when the input balance is exactly 1000, the system enters an incorrect state,
// which then leads to a system/software failure.
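The fault, error, and failure chain in this example can be made concrete with a small sketch. It assumes, as the "No '=' operator" comment suggests, that the intended check was `<=`; the function names and the constant `MIN_BALANCE` are hypothetical and only serve the illustration.

```python
MIN_BALANCE = 1000  # threshold taken from the example; the name is an assumption


def can_withdraw_faulty(balance: int) -> bool:
    """Contains the fault: '<' is used where '<=' was intended."""
    if balance < MIN_BALANCE:
        return False
    return True


def can_withdraw_intended(balance: int) -> bool:
    """Assumed intended behaviour: refuse when the balance is 1000 or less."""
    if balance <= MIN_BALANCE:
        return False
    return True


def fault_is_activated(balance: int) -> bool:
    """True when the fault produces an error, i.e. the two versions disagree."""
    return can_withdraw_faulty(balance) != can_withdraw_intended(balance)
```

The fault is dormant for most inputs; it turns into an error only at the boundary value 1000, which is why faults of this kind can survive testing.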
Classification of Failures
 Process Failure:
 The computation produces an incorrect outcome, the process causes
the system state to deviate from its specification, or the process fails to
make progress.

 Examples of errors that cause processes to fail:

 Deadlocks
 Protection violation
 Wrong input provided by a user

 Failed process may be restarted from a prior state or aborted.

 A deadlocked process can be restarted from a prior state, where it can
acquire the resources again.
 A wrong input in the initial stages may require a process to be aborted.
System Failure
 A system failure occurs when the processor fails to execute.

 Caused by software errors and hardware problems, e.g., CPU failure,
main memory failure, bus failure, or power failure.

 The system is stopped and restarted in a correct state.

 The correct state may be some predefined state or a prior state
(checkpoint) saved in stable (non-volatile) storage.
Secondary Storage
 A secondary storage failure is said to have occurred when the
stored data cannot be accessed.

 Failure is usually caused by a parity error or by dust particles on the
medium.

 Recovery can be based on an archived version or a log-structured
version of the data.

 To tolerate failures, we can use mirrored disk systems
(two physically independent disks) with independent buses and
controllers.
Communication Medium
 A communication medium failure occurs when a site cannot
communicate with another operational site in the network.

 Caused by the failure of switching nodes and/or links of the communication
system.

 May not cause a total shutdown of communication facilities.

 May result in message loss or a partition of the network, where a subset of sites is
unable to communicate with the sites in another subset.
Failures and Faults
 Software
 physical level = program code
 computational level = values of the program state
 system level = software system running the program

 A bug in a program is a fault.

 Incorrect values caused by this bug are an error.
 A resulting crash of the operating system is a failure.
Types of Faults
 A fault may be either a fail-silent fault (also known as fail-stop) or a
Byzantine fault.

 A fail-silent fault is one where the faulty unit stops functioning and produces
no bad output (it produces no output, or produces output that indicates failure).

 Examples: disk head crashes, software bugs, and burnt-out hardware

 A Byzantine fault is one where the faulty unit continues to run but produces
incorrect results.

 Byzantine faults are more troublesome to deal with.
 Processes can crash, messages can be lost, etc. The behaviour can even be malicious
(attacks, software bugs, etc.)
Types of Faults
 Faults can be classified into one of three categories:
 Transient faults: these occur once and then disappear.
 For example, a network message transmission times out but works fine when
attempted a second time.

 Intermittent faults: these are the most annoying of component faults. An
intermittent fault occurs, vanishes, and then reoccurs.
 An example of this kind of fault is a loose connection.

 Permanent faults: a permanent fault is persistent; it continues to exist until the faulty
component is repaired or replaced.
 Examples of this kind of fault are disk head crashes, software bugs, and burnt-out
hardware.
Fault Tolerance

 The system's ability to deliver the desired services in spite of faults in its
components
 Can be full service (the specified behaviour of the fault-free state)
Ex: A primary-backup server system that tolerates one server failure

 Or a degraded service (deviates from the specified behaviour of the fault-free state,
but in a pre-defined manner)
Ex: A web service with multiple load-balanced servers in the backend that
fails to meet its response time guarantees after one backend server
fails, but still gives service with a slower response time.
Fault Tolerance
 Classification of Faults
 Crash:
 A server/process halts (e.g., due to a hardware failure) but works
correctly until it halts; the halt is irreversible.
 In synchronous systems, crashes can be detected with timeouts.
 In asynchronous systems, crashes are difficult to detect.
 Omission:
 Receive omission: A server fails to receive incoming messages
 Send omission: A server fails to send messages
 Timing:
 A server's response lies outside the specified time interval
 Arbitrary/Byzantine failure:
 A server may produce arbitrary responses at arbitrary times
Types of Tolerance

 Masking:
 The system always behaves as per its specification, even in the
presence of faults

 Non-masking:
 The system may violate its specification in the presence of faults.
 It should at least behave in a well-defined manner
 Example 1: Clocks lose synchronization, but recover soon thereafter.
 Example 2: A transaction crashes, but eventually recovers
Fault Tolerance
 A fault-tolerant system should specify
 The class of faults tolerated (fault model)

 What tolerance is given for each class

 Needs some redundancy

 Hardware (primary and backup servers)

 k+1 redundancy to tolerate k failures

 Software (triple modular redundancy)
 2k+1 redundancy to tolerate k failures (with k = 1, an error can be masked by executing three times
and taking a majority vote)
 Use voting to select the majority result and isolate the at-fault module
 Time
 Redo operation: repeat a task several times.
 Information
 Parity bits, error-detecting and error-correcting codes
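The voting idea behind 2k+1 redundancy can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, replicas are modelled as plain Python callables, and results are assumed to be hashable so that they can be counted.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise ValueError("no majority: too many replicas disagree")


def run_redundant(replicas: Sequence[Callable[[int], Hashable]],
                  x: int) -> Tuple[Hashable, List[int]]:
    """Run every replica on x, vote, and report which replicas were outvoted."""
    results = [replica(x) for replica in replicas]
    winner = majority_vote(results)
    suspects = [i for i, r in enumerate(results) if r != winner]
    return winner, suspects
```

With three replicas (k = 1), one wrong result is masked and the index of the disagreeing replica is returned, which corresponds to isolating the at-fault module.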
Some Building Blocks

 Failure Detection
 Heartbeat messages with timeouts

 Reliable Storage
 RAID, Network Storage
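Failure detection with heartbeats and timeouts can be sketched as below. The class and method names are hypothetical, and the current time is passed in explicitly (an assumption made to keep the sketch deterministic); a real detector would read a clock and receive heartbeats over the network.

```python
from typing import Dict, Set


class HeartbeatDetector:
    """Suspects a node when no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from this node.
        self.last_seen[node] = now

    def suspected(self, now: float) -> Set[str]:
        # A node is suspected when its last heartbeat is older than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}
```

A suspected node may only be slow rather than crashed, so in an asynchronous system the result is a suspicion, not a certainty.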
Recovery after fault repair

 Recovery - Restoring the system to its normal operational state after a fault

 Undoing certain operations

 If a cooperating process fails, the effects on other processes due to the interactions of the
failed process need to be undone.

 In a replicated system, if a node having a replica fails, its data might be stale when it comes
back and that needs to be made consistent.
Error Recovery

 Forward-Error Recovery
 Continue to go forward in the presence of failures
 Use redundancy to mask the effect of errors, e.g. Error correcting codes
 Less overhead

 Backward-Error Recovery
 If it is not possible to remove all errors in the system state, then the system state can be
restored to a previous error-free state
 More general
 Performance Penalty
 No guarantee that the fault will not occur again
 Some system components might be unrecoverable
 e.g., cash dispensed from an ATM.
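As one concrete instance of forward error recovery with an error-correcting code, the sketch below uses a Hamming(7,4) code, which corrects any single flipped bit without going back to an earlier state. The choice of this particular code and the function names are assumptions for illustration; the slides only mention error-correcting codes in general.

```python
from typing import List


def hamming74_encode(d: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p4 d2 d3 d4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)  # work on a copy so the input is not modified
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]
```

The redundancy (3 parity bits per 4 data bits) is paid on every codeword, but recovery itself needs no rollback, which is the trade-off the forward-recovery bullets describe.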
Backward Error Recovery - Centralized Recovery System
 Recovery points: system states to which the system can be restored.

 All the active processes and the modified data need to be restored to a
proper state.

 Two approaches
 Operation-based approach
 State-based approach

 The existence of stable storage (storage that survives system crashes) is
assumed in both approaches.

 Secondary storage is assumed to be archived periodically.

 Data from main memory is flushed to secondary storage using a paging
scheme.
Operation-based Approach
 A transaction-based environment is assumed, where transactions update a
database.

 All changes to the system state are recorded in a log kept in stable storage.

 It is desirable to be able to commit or undo updates on a per-transaction
basis.

 A commit action indicates that the transaction updating an object has been
successfully completed; hence, the changes to the database should be made
permanent.

 If a transaction does not commit, its database updates should be undone.

 If a part of the database is lost due to a storage media error, it should be possible
to reconstruct that part.
Updating-In-Place Scheme

 Every update to an object is recorded in the log in enough detail that the
operation can be completely undone and redone

 The information recorded includes the object name, old state, and new state

 A recoverable update is implemented using the following operations

 A do operation – which does the update and writes a log record
 An undo operation – which, given a log record written by a do operation, undoes the
action specified by the do operation
 A redo operation - which, given a log record written by a do operation, redoes the action
specified by the do operation
 An optional display operation - which displays the log record
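The four operations listed above can be sketched as follows. This is an illustration under stated assumptions: the database is a plain dictionary, the log is an in-memory list standing in for stable storage, the updated object is assumed to exist already, and all names (`LogRecord`, `do`, `undo`, `redo`, `display`) are chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class LogRecord:
    name: str   # object name
    old: Any    # state before the update
    new: Any    # state after the update


def do(db: Dict[str, Any], log: List[LogRecord], name: str, new: Any) -> LogRecord:
    """Update the object in place, then write the log record."""
    record = LogRecord(name, db[name], new)
    db[name] = new
    log.append(record)   # a crash before this line loses the undo information
    return record


def undo(db: Dict[str, Any], record: LogRecord) -> None:
    db[record.name] = record.old


def redo(db: Dict[str, Any], record: LogRecord) -> None:
    db[record.name] = record.new


def display(record: LogRecord) -> str:
    return f"{record.name}: {record.old!r} -> {record.new!r}"
```

Because `do` updates the object before the record reaches the log, the sketch also exhibits the weakness of updating in place: a crash between the two steps leaves an update that cannot be undone.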
Updating-In-Place Scheme

 In case of a failure, the changes made by a transaction can be undone by
using the undo operation.

 If a portion of the database is to be reconstructed, this can be done by
performing the redo operation on a previously archived portion of the
database.

 Problem: a do operation cannot be undone if the system crashes after an
update operation but before the log record is stored.
Write-Ahead-Log Scheme
 A recoverable update is implemented by the following operations

 Update an object only after the undo log is recorded

 Before committing the updates, redo and undo logs are recorded

 On restarting a system after failure:

 Undo operations might be required to undo the changes made by the transactions that
were in progress at the time of the failure.

 Redo operations might be required if the updated objects were in memory and had not been
flushed to secondary storage.

 Writing a log record on every update operation is expensive in terms of storage
and CPU overhead, especially if failures are rare.
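The write-ahead rule and the restart procedure can be sketched as below. The sketch makes several simplifying assumptions that are not in the slides: the log list stands in for stable storage, `disk` and `cache` are dictionaries, an old value of `None` means the object did not exist, and no transaction writes an object that an uncommitted transaction has written. The class name `WalStore` is hypothetical.

```python
from typing import Any, Dict, List, Tuple


class WalStore:
    def __init__(self) -> None:
        self.log: List[Tuple] = []        # stands in for the log on stable storage
        self.disk: Dict[str, Any] = {}    # secondary-storage copy of the objects
        self.cache: Dict[str, Any] = {}   # volatile in-memory copy

    def write(self, txn: str, name: str, new: Any) -> None:
        old = self.cache.get(name, self.disk.get(name))
        self.log.append(("update", txn, name, old, new))  # log first ...
        self.cache[name] = new                            # ... then update

    def commit(self, txn: str) -> None:
        self.log.append(("commit", txn))

    def flush(self, name: str) -> None:
        self.disk[name] = self.cache[name]

    def crash(self) -> None:
        self.cache = {}   # volatile contents are lost

    def recover(self) -> None:
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:                      # redo committed updates
            if rec[0] == "update" and rec[1] in committed:
                self.disk[rec[2]] = rec[4]
        for rec in reversed(self.log):            # undo uncommitted updates
            if rec[0] == "update" and rec[1] not in committed:
                if rec[3] is None:
                    self.disk.pop(rec[2], None)
                else:
                    self.disk[rec[2]] = rec[3]
```

After a crash, `recover` redoes a committed update that never reached the disk and undoes an uncommitted update that did, which are the two restart cases described above.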
State-based Approach
 The complete state of a process is saved in stable storage when a recovery
point is established; this is called checkpointing.

 Recovering a process involves reinstating its saved state and resuming the
execution of the process from that state; this is called process rollback.

 It is desirable to roll back to the most recent state, so many checkpoints are taken
over the execution of a process

 The previous checkpoints can be discarded when a new checkpoint is taken
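Checkpointing and process rollback for a single process can be sketched as follows. The sketch assumes the process state fits in a Python object that can be deep-copied, and it keeps the checkpoint in memory as a stand-in for stable storage; the class name is hypothetical.

```python
import copy
from typing import Any


class CheckpointedProcess:
    def __init__(self, state: Any) -> None:
        self.state = state
        self._checkpoint = copy.deepcopy(state)   # initial recovery point

    def checkpoint(self) -> None:
        """Save the complete state; the previous checkpoint is discarded."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Reinstate the saved state so that execution can resume from it."""
        self.state = copy.deepcopy(self._checkpoint)
```

Only the work done since the latest checkpoint is lost on rollback, which is why frequent checkpoints reduce the amount of recomputation at the cost of more saving.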

Recovery in Distributed Systems

 In a distributed system several processes cooperate by exchanging

messages to accomplish a task

 If one process fails and resumes execution from a recovery point, the effects
it has caused at other processes after establishing the recovery point have to
be undone

 An active process might also be required to rollback to an earlier state

 All cooperating processes need to establish a recovery point

Orphan Messages and Domino Effect
 Messages whose receive is recorded but whose send is not recorded are called orphan messages.
 For example, a rollback might have undone the send of such a message, leaving the receive event
intact at the receiving process.
 Orphan messages do not arise if processes roll back to a consistent global state.

[Figure: time diagram of processes X, Y, and Z with checkpoints x1, x2, x3, y1, y2, z1, and z2, a message m, and a failure.]
Lost Messages
 Messages whose send is not undone but receive is undone due to rollback are called lost messages.

 Lost messages occur when a process rolls back to a checkpoint taken before the reception of the
message while the sender does not roll back beyond the send operation of the message.

[Figure: time diagram of processes X and Y with checkpoints x1 and y1 and a failure.]
Livelock
 Livelock is a situation in which a single failure can cause an infinite number of rollbacks, preventing
the system from making progress.
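[Figure (a): processes X and Y with checkpoints x1 and y1 and messages n1 and m1.]
[Figure (b): the same processes at the 2nd rollback, with messages n1, n2, and m2.]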
Consistent Set of Checkpoints

 Operation-based or state-based recovery techniques are not adequate in
distributed systems

 Coordination among processes is required

 Each process takes a local checkpoint

 The set of local checkpoints (one from each process) forms a global checkpoint

 For a global checkpoint to be consistent, it must not contain any orphan
message.
Strongly Consistent Set of Checkpoints
 There should be no record of a message receive event in a local checkpoint
when the corresponding message send event is not recorded in the local
checkpoint of the sending process

 If a consistent set of checkpoints also has no lost messages, it is called a
strongly consistent set of checkpoints

 If no information flow takes place between any pair of processes across the set
of local checkpoints, the set is strongly consistent
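The two definitions can be checked mechanically. In the sketch below, which is an illustration rather than part of any algorithm in the slides, each process numbers its local events, a checkpoint is the number of the last event it records, and a message carries the event numbers of its send and receive. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    sender: str
    send_event: int   # local event number of the send at the sender
    receiver: str
    recv_event: int   # local event number of the receive at the receiver


def orphan_messages(ckpt: Dict[str, int], msgs: List[Message]) -> List[Message]:
    """Receive is recorded in the receiver's checkpoint, send is not recorded."""
    return [m for m in msgs
            if m.recv_event <= ckpt[m.receiver] and m.send_event > ckpt[m.sender]]


def lost_messages(ckpt: Dict[str, int], msgs: List[Message]) -> List[Message]:
    """Send is recorded in the sender's checkpoint, receive is not recorded."""
    return [m for m in msgs
            if m.send_event <= ckpt[m.sender] and m.recv_event > ckpt[m.receiver]]


def is_consistent(ckpt: Dict[str, int], msgs: List[Message]) -> bool:
    return not orphan_messages(ckpt, msgs)


def is_strongly_consistent(ckpt: Dict[str, int], msgs: List[Message]) -> bool:
    return not orphan_messages(ckpt, msgs) and not lost_messages(ckpt, msgs)
```

A set of checkpoints that cuts a message between its send and its receive is consistent but not strongly consistent; one that records the receive without the send is not consistent at all.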
Consistent Vs Strongly Consistent Set of Checkpoints

[Figure: time diagram with checkpoints x1, x2, y1, y2, z1, and z2, a message m, and a failure, marking a consistent set of checkpoints and a strongly consistent set of checkpoints.]
A Simple Method for Taking a Consistent Set of Checkpoints

 Assumption - Taking a checkpoint and message send and receive are atomic actions

 If every process takes a checkpoint after sending every message, the set of most recent
checkpoints is always consistent (it may not be strongly consistent)

 So rollback to the latest checkpoint will not result in orphan messages

 Disadvantage: huge checkpointing overhead

Synchronous Checkpointing and Recovery (Koo and Toueg)
 All processes coordinate their local checkpointing actions such that the set
of all recent checkpoints in the system is guaranteed to be consistent.

 Assumptions:
 Channels are FIFO.
 Communication failures do not partition the network.
 A single process invokes the algorithm.
 No site fails during the execution of the algorithm.
 No computation messages are exchanged during the execution of the algorithm.

 Two kinds of checkpoints: tentative (temporary) and permanent.

 Processes roll back only to their permanent checkpoint.

The Checkpoint Algorithm
 First Phase:
 An initiating process Pi takes a tentative checkpoint and requests all
the processes to take tentative checkpoints.

 Each process informs Pi whether it succeeded in taking a tentative
checkpoint.

 If Pi learns that all the processes have successfully taken tentative
checkpoints, it decides that all the tentative checkpoints should be
made permanent; otherwise Pi decides that all the tentative
checkpoints should be discarded.

 Second Phase:
 Pi informs all the processes of the decision it reached at the end of the
first phase.

 A receiving process acts accordingly.

The Checkpoint Algorithm Cont..

 Correctness:

 Either all or none of the processes take a permanent checkpoint

 There are no orphan messages, since no process sends messages
after taking a tentative checkpoint until it receives the initiating
process's decision
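The two phases can be sketched as a small in-process simulation. This is a simplification under stated assumptions: request and reply messages are modelled as direct method calls, every process takes part (as described on the slides), and the blocking of computation messages between the phases is not modelled. The names `Process` and `coordinated_checkpoint` are hypothetical.

```python
import copy
from typing import Any, List, Optional


class Process:
    def __init__(self, name: str, state: Any, can_checkpoint: bool = True) -> None:
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint      # False simulates a refusal
        self.permanent: Any = copy.deepcopy(state)
        self.tentative: Optional[Any] = None

    def take_tentative(self) -> bool:
        """Phase 1: try to take a tentative checkpoint and report the outcome."""
        if not self.can_checkpoint:
            return False
        self.tentative = copy.deepcopy(self.state)
        return True

    def decide(self, make_permanent: bool) -> None:
        """Phase 2: act on the initiator's decision."""
        if make_permanent and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def coordinated_checkpoint(initiator: Process, others: List[Process]) -> bool:
    everyone = [initiator, *others]
    replies = [p.take_tentative() for p in everyone]   # first phase
    decision = all(replies)
    for p in everyone:                                  # second phase
        p.decide(decision)
    return decision
```

Either every process ends up with a new permanent checkpoint or none does, which mirrors the all-or-none correctness property stated above.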
Problem With the Algorithm

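[Figure: processes X, Y, and Z with checkpoints x1, x2, y1, y2, and z1, a message m, a tentative checkpoint, and "take a tentative checkpoint" request messages; some checkpoints are taken unnecessarily.]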

The Rollback Recovery Algorithm
 Assumptions:
 A single process invokes the algorithm

 The checkpointing and rollback recovery algorithms are not invoked
concurrently

 First Phase:
 An initiating process Pi checks to see if all the processes are
willing to restart from their previous checkpoints

 If Pi learns that all the processes are willing to restart from their
previous checkpoints, Pi decides that all the processes should
restart; otherwise Pi decides that all the processes should
continue their normal activities

 Second Phase:
 Pi propagates its decision to all the processes
 On receiving Pi ’s decision, a process will act accordingly
The Rollback Recovery Algorithm Cont..

 Correctness

 The recovery algorithm requires that no process sends
computation messages while it is waiting for Pi's decision

 All processes either restart from their previous checkpoints or continue
with their normal activities

 If the processes decide to restart, they resume execution in a consistent
state, because the checkpoint algorithm takes a consistent set of checkpoints
Discussion
 Advantages
 Synchronous checkpointing simplifies recovery

 Previous checkpoints can be discarded

 Disadvantages
 Additional messages are exchanged by the checkpoint algorithm when it takes
each checkpoint

 Synchronization delays are introduced during normal operation (no computation
messages can be sent while the checkpoint algorithm is in progress)

 If failures rarely occur between successive checkpoints, an unnecessary burden
is placed on the system in the form of additional messages, delays, and processing
overhead
Asynchronous Checkpointing and Recovery
 Each process takes a checkpoint independently

 There is no guarantee that a set of local checkpoints will be consistent

 A recovery algorithm has to search for the most recent consistent set of
checkpoints before it can initiate recovery

 To minimize the computation undone during a rollback, all incoming messages

are logged

 The messages that were received after establishing a recovery point can be
processed again in the event of a rollback to the recovery point
Asynchronous Checkpointing and Recovery
 Two ways of message logging

 Pessimistic
 An incoming message is logged before it is processed
 Slows down underlying computation even when there are no failures

 Optimistic
 Processes continue to perform the computation and the messages received
are stored in volatile storage, which are logged at certain intervals.
 In case of a system failure, an incoming message may be lost as it may not
have been logged
 In the event of a rollback, the amount of computation redone during recovery
is likely to be more
 It does not slow down the underlying computation during normal processing
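The difference between the two logging styles can be sketched as follows. The sketch assumes a list standing in for the stable log, a second list for volatile storage, and a flush after a fixed number of messages as the "certain interval"; the class name and the `flush_every` parameter are hypothetical.

```python
from typing import List


class Receiver:
    def __init__(self, pessimistic: bool, flush_every: int = 3) -> None:
        self.pessimistic = pessimistic
        self.flush_every = flush_every
        self.stable_log: List[str] = []   # survives a crash
        self.volatile: List[str] = []     # lost in a crash
        self.processed: List[str] = []

    def receive(self, msg: str) -> None:
        if self.pessimistic:
            self.stable_log.append(msg)   # log before processing
        else:
            self.volatile.append(msg)     # log later, in batches
            if len(self.volatile) >= self.flush_every:
                self.flush()
        self.processed.append(msg)

    def flush(self) -> None:
        self.stable_log.extend(self.volatile)
        self.volatile.clear()

    def crash(self) -> None:
        self.volatile.clear()
        self.processed.clear()

    def replayable(self) -> List[str]:
        """Messages that can be processed again after a rollback."""
        return list(self.stable_log)
```

With pessimistic logging every received message can be replayed after a crash; with optimistic logging the messages that were still in volatile storage are gone, so more computation has to be redone.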
Asynchronous Checkpointing and Recovery
(Juang and Venkatesan)
 Assumptions
 Reliable and FIFO communication channels

 Communication channels have infinite buffers

 The underlying computation is assumed to be event-driven: a process P waits until
a message m is received, processes m, changes its state, and sends zero or more
messages to some of its neighbours

 The events at each process are identified by unique monotonically increasing numbers

 Two types of logs are assumed - volatile log and stable log. Volatile log contents are
periodically flushed onto the stable log.
Asynchronous Checkpointing
 After each event, a process records a triplet {s, m, msgs_sent} in volatile
storage, where s is the state of the process before the event, m is the
message whose arrival caused the event, and msgs_sent is the set of messages sent by the
process during the event.

 A local checkpoint consists of the record of an event and is taken without any
synchronization with other processes.

 Notation
 RCVD i←j (CKPTi): the number of messages received by Pi from Pj, as per the information stored in the
checkpoint CKPTi

 SENT i→j (CKPTi): the number of messages sent by Pi to Pj, as per the information stored in the
checkpoint CKPTi
Asynchronous Checkpointing
 Recovery is based on finding a consistent set of checkpoints to which the system can be
restored

 Each process keeps track of the number of messages it has sent to, and received from, each
other process

 If the number R of messages a process has received from a neighbour exceeds the number S that the
neighbour has sent, one or more messages are orphans, and the receiving process has to roll back to a state
where R = S

 The algorithm assumes that a process, upon restarting, will broadcast a message announcing that it had
failed. This can be done in O(|E|) messages, where |E| is the total number of communication links.

 The algorithm at a process is initiated when it restarts after a failure or when it learns about
another process’s failure
Event Driven Computation

[Figure: event sequences of X (ex0, ex1, ex2), Y (ey0, ey1, ey2, ey3), and Z (ez0, ez1, ez2, ez3).]

Asynchronous Checkpointing Algorithm
 At Pi
(a) if Pi is a process that is recovering after a failure
      then CKPTi = latest event logged in stable storage
      else CKPTi = latest event that took place

(b) for k = 1 to N do   /* N = number of processes */
    begin
      for each neighbouring process j do
        send ROLLBACK(i, SENT i→j (CKPTi)) msg
      wait for ROLLBACK msgs from every neighbour
      for every ROLLBACK(j, c) msg received from a neighbour j, Pi does the following:
        if RCVD i←j (CKPTi) > c then   /* implies orphan msgs */
        begin
          find the latest event e such that RCVD i←j (e) = c
          CKPTi = e
        end
    end   (* for k *)
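The iteration above can be simulated in a few lines. The sketch below is a simplification, not the authors' implementation: all processes are assumed to be neighbours of each other, ROLLBACK messages are exchanged as a list within one loop, and each event stores cumulative sent and received counts per neighbour. The event counts in `EVENTS` are hypothetical values chosen for illustration, with event numbers as list indices.

```python
from typing import Dict, List


def ev(sent: Dict[str, int], rcvd: Dict[str, int]) -> dict:
    """One logged event with cumulative per-neighbour message counts."""
    return {"sent": sent, "rcvd": rcvd}


def recover(events: Dict[str, List[dict]], ckpt: Dict[str, int]) -> Dict[str, int]:
    ckpt = dict(ckpt)                      # do not modify the caller's mapping
    names = list(events)
    for _ in range(len(names)):            # N iterations
        # Every process i sends ROLLBACK(i, SENT i->j (CKPTi)) to every neighbour j.
        rollback_msgs = [(i, j, events[i][ckpt[i]]["sent"].get(j, 0))
                         for i in names for j in names if i != j]
        for sender, receiver, c in rollback_msgs:
            log = events[receiver]
            if log[ckpt[receiver]]["rcvd"].get(sender, 0) > c:   # orphan messages
                # Roll back to the latest event e with RCVD receiver<-sender (e) = c.
                ckpt[receiver] = max(e for e in range(ckpt[receiver] + 1)
                                     if log[e]["rcvd"].get(sender, 0) == c)
    return ckpt


# Hypothetical logs: Y failed, and its last event (index 3) was never made stable.
EVENTS = {
    "X": [ev({"Y": 1}, {}), ev({"Y": 2}, {"Y": 1}),
          ev({"Y": 2}, {"Y": 2}), ev({"Y": 2}, {"Y": 3})],
    "Y": [ev({"X": 1}, {}), ev({"X": 2}, {"X": 1}),
          ev({"X": 2, "Z": 1}, {"X": 1, "Z": 1}),
          ev({"X": 3, "Z": 2}, {"X": 2, "Z": 1})],
    "Z": [ev({"Y": 1}, {}), ev({"Y": 1}, {"Y": 1}), ev({"Y": 1}, {"Y": 2})],
}
START = {"X": 3, "Y": 2, "Z": 2}   # Y restarts from its latest stable event
```

Calling `recover(EVENTS, START)` rolls X back to event 2 and Z back to event 1 in the first iteration and changes nothing afterwards, since the recovery points are then consistent.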
Asynchronous Checkpointing Example

 The procedure has |N| iterations. At the end of each iteration, at least
one process will roll back to its final recovery point unless
the current recovery points are consistent

 In the example, Y restarts at checkpoint y1
and ey2 is the latest event logged
 CKPTx = ex3
 CKPTy = ey2
 CKPTz = ez2

[Figure: event sequences of X (ex0 to ex3), Y (ey0 to ey3), and Z (ez0 to ez2) with checkpoints x1, y1, and z1.]
Asynchronous Checkpointing Example
 First Iteration
 Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z
 X sends ROLLBACK(X,2) to Y and ROLLBACK(X,0) to Z
 Z sends ROLLBACK(Z,0) to X and ROLLBACK(Z,1) to Y
 Since RCVD X←Y (CKPTX) = 3 > 2, X will set CKPTX to ex2
 Since RCVD Z←Y (CKPTZ) = 2 > 1, Z will set CKPTZ to ez1
 Y need not rollback any further

 Second Iteration
 Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y, 1) to Z
 Z sends ROLLBACK(Z, 0) to X and ROLLBACK(Z, 1) to Y
 X sends ROLLBACK(X, 2) to Y and ROLLBACK(X, 0) to Z

 The third iteration will also progress in a similar fashion

 The set of recovery points chosen at the end of the first
iteration {ex2, ey2, ez1} is consistent
 No further rollbacks are required
References

 1. T. Juang and S. Venkatesan, “Crash Recovery with Little Overhead”,
Proceedings of the 11th International Conference on Distributed Computing
Systems, May 1991, pp. 454-461

 2. R. Koo and S. Toueg, “Checkpointing and Rollback-Recovery for
Distributed Systems”, IEEE Transactions on Software Engineering, Vol. 14,
No. 6, June 1988, pp. 810-821

 3. M. Singhal and N. Shivaratri, “Advanced Concepts in Operating Systems”,
Tata McGraw-Hill

STDcurs1 Merged
No ratings yet
STDcurs1 Merged
139 pages
Failure Recovery in Distributed Systems
No ratings yet
Failure Recovery in Distributed Systems
24 pages
Unit 11 Dependability-and-Security
No ratings yet
Unit 11 Dependability-and-Security
39 pages
Distrsyslectureset7 Win20
No ratings yet
Distrsyslectureset7 Win20
114 pages
Lecture 7
No ratings yet
Lecture 7
57 pages
Failures, Errors and Risks in Computer System Presentation (0024)
No ratings yet
Failures, Errors and Risks in Computer System Presentation (0024)
21 pages
Fault Tolerance in Distributed Systems
No ratings yet
Fault Tolerance in Distributed Systems
51 pages
Unit 4 - Deadlock Handling & Recovery Techniques & Failuere Classification
No ratings yet
Unit 4 - Deadlock Handling & Recovery Techniques & Failuere Classification
55 pages
Reliable System Design: Hardware Design Checklist Testing Embedded Systems Critical Systems
No ratings yet
Reliable System Design: Hardware Design Checklist Testing Embedded Systems Critical Systems
28 pages
Chapter 01
No ratings yet
Chapter 01
34 pages
Ch-4-Fault Tularance - Naming-SM
No ratings yet
Ch-4-Fault Tularance - Naming-SM
42 pages
Chapter 8
No ratings yet
Chapter 8
107 pages
Fault Tolerance Computing Lecture Note
No ratings yet
Fault Tolerance Computing Lecture Note
61 pages
Failure Model
No ratings yet
Failure Model
14 pages
Unit10 Fault Tolerance and Security
No ratings yet
Unit10 Fault Tolerance and Security
24 pages
Chapter 5: Availability: © Len Bass, Paul Clements, Rick Kazman, Distributed Under Creative Commons Attribution License
No ratings yet
Chapter 5: Availability: © Len Bass, Paul Clements, Rick Kazman, Distributed Under Creative Commons Attribution License
31 pages
Unit-3 Part2
No ratings yet
Unit-3 Part2
74 pages
Unit5 1
No ratings yet
Unit5 1
23 pages
4th Unit Topics Recovery
No ratings yet
4th Unit Topics Recovery
73 pages
BCS 413 - Lecture7 - Fault Tolerance
No ratings yet
BCS 413 - Lecture7 - Fault Tolerance
47 pages
DS CH7 - Fault Tolerance
No ratings yet
DS CH7 - Fault Tolerance
17 pages
OS Presentattion
No ratings yet
OS Presentattion
15 pages
11 Errors
No ratings yet
11 Errors
33 pages
Chapter 8-Fault Tolerance
100% (1)
Chapter 8-Fault Tolerance
71 pages
DS Unit - 4
No ratings yet
DS Unit - 4
20 pages
Fault Tolerant Message Passing Systems
No ratings yet
Fault Tolerant Message Passing Systems
26 pages
Fundamental Concepts of Dependability: Algirdas Aviz Ienis Jean-Claude Laprie Brian Randell
No ratings yet
Fundamental Concepts of Dependability: Algirdas Aviz Ienis Jean-Claude Laprie Brian Randell
21 pages
Failure Model
No ratings yet
Failure Model
14 pages
1-Lecture (2. Intro-Core Challenges) - Slides
No ratings yet
1-Lecture (2. Intro-Core Challenges) - Slides
22 pages
Distributed Failure Recovery
No ratings yet
Distributed Failure Recovery
30 pages
Fault Tolerance
No ratings yet
Fault Tolerance
10 pages
Fault Tolerance: Click To Add Text Dealing Successfully With Partial System. Key Technique: Redundancy
No ratings yet
Fault Tolerance: Click To Add Text Dealing Successfully With Partial System. Key Technique: Redundancy
48 pages
Lecture 7 - FAULT-TOLERANT COMPUTING
No ratings yet
Lecture 7 - FAULT-TOLERANT COMPUTING
13 pages
We Need To Talk About IT Architecture
No ratings yet
We Need To Talk About IT Architecture
60 pages
Chapter 3
No ratings yet
Chapter 3
40 pages
Fault Tolerance Techniques
No ratings yet
Fault Tolerance Techniques
4 pages
Reference Book Principles of Distributed Database System Chapters
No ratings yet
Reference Book Principles of Distributed Database System Chapters
25 pages
Errors
No ratings yet
Errors
9 pages
Failure Classification in DBMS
No ratings yet
Failure Classification in DBMS
2 pages
CH 4
No ratings yet
CH 4
25 pages
Lesson 2 - Fault and Error Modelling
No ratings yet
Lesson 2 - Fault and Error Modelling
7 pages
Faulty Computer and Networks-Systems
No ratings yet
Faulty Computer and Networks-Systems
6 pages
Dependable and Secure Computing Concepts
No ratings yet
Dependable and Secure Computing Concepts
14 pages
Software Reliability: CIS 376 Bruce R. Maxim UM-Dearborn
No ratings yet
Software Reliability: CIS 376 Bruce R. Maxim UM-Dearborn
37 pages
System Recovery
No ratings yet
System Recovery
38 pages
Various Failures in Distributed System
No ratings yet
Various Failures in Distributed System
2 pages
Distributed Computing: Farhad Muhammad Riaz
No ratings yet
Distributed Computing: Farhad Muhammad Riaz
18 pages
Distributed System - Failures
No ratings yet
Distributed System - Failures
12 pages
A Large-Scale Study of Failures in High-Performance Computing Systems
No ratings yet
A Large-Scale Study of Failures in High-Performance Computing Systems
10 pages
Fundamental Concepts of Dependability: Algirdas Aviz Ienis Jean-Claude Laprie Brian Randell
No ratings yet
Fundamental Concepts of Dependability: Algirdas Aviz Ienis Jean-Claude Laprie Brian Randell
6 pages
Notes On Fault Tolerance
No ratings yet
Notes On Fault Tolerance
2 pages
Group Activity 2
No ratings yet
Group Activity 2
6 pages
Safety Critical Computer Systems: Failure Independence and Software Diversity Effects On Reliability of Dual Channel Structures
No ratings yet
Safety Critical Computer Systems: Failure Independence and Software Diversity Effects On Reliability of Dual Channel Structures
10 pages
REPORT Contour
100% (3)
REPORT Contour
7 pages
WRL0004 TMP
No ratings yet
WRL0004 TMP
9 pages
Reliability: APSC 380: I M 1997/98 W S T 2
No ratings yet
Reliability: APSC 380: I M 1997/98 W S T 2
4 pages
20 Critical Systems
No ratings yet
20 Critical Systems
58 pages
Fault Tolerance in Distributed Systems
No ratings yet
Fault Tolerance in Distributed Systems
6 pages
Fault Tolerance Techniques: Unit 3
No ratings yet
Fault Tolerance Techniques: Unit 3
40 pages
AC2 Engineering Utilities 2 Syllabus
No ratings yet
AC2 Engineering Utilities 2 Syllabus
16 pages
Cost & Management Accounting
No ratings yet
Cost & Management Accounting
3 pages
 A fault is an anomalous physical condition.The causes of a fault include

 An erroneous state - one which could lead to a system failure by a sequence

 Examples of error causing processes to fail

 Failed process may be restarted from a prior state or aborted.

 Caused by software errors and hardware problems, e.g., CPU

 The system is stopped and restarted in a correct state.

 The correct state may be some predefined state or a prior state

 Failure is usually caused by parity error, or dust particles on the

 Recovery can be based on the archived version or log structured

 For tolerating failures, we can have mirrored disk systems

 Failure of switching nodes and/or the links of the communicating

 May not cause a total shut down of communication facilities.

 Message loss, partition of a network where a subset of sites may

 Bug in a program is a fault.

 Examples: disk head crashes, software bugs, and burnt-out

 Byzantine faults are more troublesome to deal with.

 The system's ability to deliver desired services in spite of faults in its

 Or a degraded service (deviate from specified behaviour in fault free state,

 What tolerance is given from each class?

 Needs Some Redundancy

 k+1 Redundancy to tolerate k failures

 Undoing certain operations

 Existence of stable storage (storage that survives system crashes) is

 Secondary storage is assumed to be archived periodically.

 It is desirable to be able to commit or undo updates on a per-transaction

 If a transaction does not commit, its database update should be undone.

 If a part of a database is lost due to storage media error, it should be possible

 A recoverable update is implemented using the following operations

 In case of a failure, the changes made by a transaction can be undone by

 If a portion of the database is to be reconstructed, it can be done so by

 Problem: a do operation cannot be undone if the system crashes after an

 Update an object only after the undo log is recorded
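The ordering rule above (record the undo information first, update the object second) can be illustrated with a minimal sketch. The class `RecoverableStore`, its method names, and the in-memory list standing in for a log on stable storage are all assumptions made for illustration; they do not come from the slides.

```python
_MISSING = object()  # marks "key did not exist before the update"


class RecoverableStore:
    """Toy key-value store with an undo log (hypothetical names)."""

    def __init__(self):
        self.data = {}      # the objects being updated
        self.undo_log = []  # stands in for a log kept on stable storage

    def update(self, txn, key, new_value):
        old_value = self.data.get(key, _MISSING)
        # Step 1: record the undo information before touching the object.
        self.undo_log.append((txn, key, old_value))
        # Step 2: only now apply the update.
        self.data[key] = new_value

    def abort(self, txn):
        # Undo the transaction's updates, newest first.
        for logged_txn, key, old_value in reversed(self.undo_log):
            if logged_txn != txn:
                continue
            if old_value is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old_value
        self.undo_log = [e for e in self.undo_log if e[0] != txn]


store = RecoverableStore()
store.update("T1", "x", 1)
store.update("T2", "y", 5)
store.update("T1", "x", 2)
store.abort("T1")  # T1's updates are undone, T2's update survives
```

Because the log entry always exists before the object changes, a crash between the two steps leaves at worst an undo record for an update that never happened, which is harmless to undo.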

 On restarting a system after failure:

 The previous checkpoints can be discarded when a new checkpoint is taken

 In a distributed system several processes cooperate by exchanging

 An active process might also be required to rollback to an earlier state

 All cooperating processes need to establish a recovery point

 Coordination among processes is required

 A process takes local checkpoint

 For a global checkpoint to be consistent, there should not be any orphan

 If there is no lost message in a consistent checkpoint, it is called a

Consistent set of checkpoints

 So rollback to the latest checkpoint will not result in orphan messages
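The orphan and lost message conditions can be checked mechanically. The sketch below assumes each local checkpoint records the identifiers of the messages it has sent and received; the function name `classify_checkpoint_set` and the dictionary layout are hypothetical, chosen only for illustration.

```python
def classify_checkpoint_set(checkpoints):
    """Classify a set of local checkpoints, one per process.

    checkpoints maps a process name to {"sent": set, "received": set},
    where the sets hold message identifiers recorded in that checkpoint.
    """
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))

    orphans = received - sent  # recorded as received, never recorded as sent
    lost = sent - received     # recorded as sent, never recorded as received

    if orphans:
        return "inconsistent"
    if lost:
        return "consistent"
    return "strongly consistent"


example = {
    "X": {"sent": {"m1"}, "received": set()},
    "Y": {"sent": set(), "received": {"m1"}},
}
result = classify_checkpoint_set(example)
```

In the example, message m1 is recorded on both sides, so there are neither orphan nor lost messages.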

 Disadvantage: huge overhead of checkpointing

 Two kinds of checkpoints-temporary

 Processes roll back only to their permanent checkpoint.

the processes to take tentative checkpoints.

 Each process informs Pi whether it succeeded in taking a tentative

 If Pi learns that all the processes have successfully taken tentative

 A receiving process acts accordingly.

 Either all or none of the processes take a permanent checkpoint
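The all-or-none behaviour can be sketched as a two-phase exchange: the initiator Pi collects replies on tentative checkpoints and then broadcasts one decision. This is a simplified, single-threaded sketch; the `Process` class, the `can_checkpoint` flag, and the function `coordinated_checkpoint` are assumptions for illustration, and message passing is replaced by direct method calls.

```python
class Process:
    """Toy process holding one permanent and at most one tentative checkpoint."""

    def __init__(self, name, state, can_checkpoint=True):
        self.name = name
        self.state = state
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.permanent = None
        self.tentative = None

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def commit(self):
        # Make the tentative checkpoint permanent.
        self.permanent = self.tentative
        self.tentative = None

    def discard(self):
        self.tentative = None


def coordinated_checkpoint(initiator, others):
    processes = [initiator] + list(others)
    # Phase 1: every process tries to take a tentative checkpoint and replies.
    replies = [p.take_tentative() for p in processes]
    # Phase 2: the initiator decides and every process acts on the decision.
    decision = all(replies)
    for p in processes:
        if decision:
            p.commit()
        else:
            p.discard()
    return decision


ok_group = [Process("Pi", 10), Process("Pj", 20), Process("Pk", 30)]
ok = coordinated_checkpoint(ok_group[0], ok_group[1:])

bad_group = [Process("Pi", 10), Process("Pj", 20, can_checkpoint=False)]
bad = coordinated_checkpoint(bad_group[0], bad_group[1:])
```

In the second run one process cannot take its tentative checkpoint, so no process in the group makes a checkpoint permanent.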

 There would be no orphan messages since no process sends messages

Checkpoints taken unnecessarily

 Checkpointing and rollback recovery algorithms are not invoked

 All processes either restart from their previous checkpoints or continue

 If processes decide to restart, then they resume execution in a consistent

 Previous checkpoints can be discarded

 Synchronization delays are introduced during normal operations (No computation

 If failures rarely occur between successive checkpoints, then unnecessary burden

 There is no guarantee that a set of local checkpoints will be consistent

 To minimize the computation undone during a rollback, all incoming messages

 Communication channels have infinite buffers

 The underlying computation is assumed to be event-driven,

[Figure: event timelines for processors X (events ex0 to ex2), Y (events ey0 to ey3), and Z (events ez0 to ez3)]

(b) for k = 1 to N do /* N no. of processes */

exo  The procedure has |N|

 The third iteration will also progress in a similar fashion
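The iterations can be simulated with a simplified, synchronous sketch of a counter-based rollback procedure. It assumes every logged event stores how many messages the process had sent to and received from each peer at that point, and that event 0 is the initial state with all counts zero. The function `rollback_recovery`, the log layout, and the example event histories are hypothetical and are not taken from the figure.

```python
def rollback_recovery(logs, start):
    """Return the event index each process restarts from.

    logs[p]  : list of event snapshots {"sent": {q: n}, "rcvd": {q: n}}
    start[p] : index of the latest event p still has available
    """
    procs = list(logs)
    ckpt = dict(start)
    for _ in range(len(procs)):  # one iteration per process
        # Each process announces how many messages it has sent to each peer.
        announced = {
            (p, q): logs[p][ckpt[p]]["sent"].get(q, 0)
            for p in procs for q in procs if p != q
        }
        for (p, q), count in announced.items():
            # q recorded more receipts from p than p recorded sends:
            # the extra receipts are orphans, so q rolls back further.
            while logs[q][ckpt[q]]["rcvd"].get(p, 0) > count:
                ckpt[q] -= 1
    return ckpt


logs = {
    "X": [{"sent": {}, "rcvd": {}},
          {"sent": {"Y": 1}, "rcvd": {}},
          {"sent": {"Y": 2}, "rcvd": {}}],
    "Y": [{"sent": {}, "rcvd": {}},
          {"sent": {}, "rcvd": {"X": 1}},
          {"sent": {}, "rcvd": {"X": 2}},
          {"sent": {"Z": 1}, "rcvd": {"X": 2}}],
    "Z": [{"sent": {}, "rcvd": {}},
          {"sent": {}, "rcvd": {"Y": 1}}],
}
# X crashed and only its first two events reached stable storage.
restart = rollback_recovery(logs, {"X": 1, "Y": 3, "Z": 1})
```

In this example the first iteration rolls Y back because it received a message X no longer remembers sending, and the second iteration rolls Z back because Y's rollback orphaned the message Z received from Y.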

 1. T. Juang and S. Venkatesan, “Crash Recovery with Little Overhead”,

 2. R. Koo and S. Toueg, “Checkpointing

 3. M. Singhal and N. Shivaratri, “Advanced Concepts in Operating Systems”,
