DS Chapter V8.0fault Tolerance

This document discusses fault tolerance in distributed systems. It covers: 1) Fault tolerance aims to construct systems that can automatically recover from partial failures when components fail. Dependability includes availability, reliability, safety, and maintainability. 2) Faults are classified as transient, intermittent, or permanent. Failure modes include crash, omission, timing, response, and arbitrary failures. 3) Redundancy is the key technique for fault tolerance. This includes information, time, and physical redundancy. Process groups use physical redundancy through replication to mask faults.

Uploaded by

Gofere Tube


Distributed Systems

CHAPTER EIGHT
Fault Tolerance

(CS, CCI, WKU, Ethiopia, 2022)

Lecturer Name: Habtamu Alemayehu (MSc in CSE)

5/9/2022 1
Introduction

• a major difference between distributed systems and single-machine systems is that in the former partial failure is possible, i.e., one component of the distributed system fails
• such a failure may affect some components while others continue to function properly
• an important goal of distributed systems design is to construct a system that can automatically recover from partial failure
• it should tolerate faults and continue to operate to some extent

Fault Tolerance
Basic Concepts
• fault tolerance is strongly related to dependable systems
• dependability covers the following
  - availability: the probability that the system is operating correctly at any given time; defined in terms of an instant in time
  - reliability: the property that a system can run continuously without failure; defined in terms of a time interval
  - safety: even when a system temporarily fails to operate correctly, nothing catastrophic happens
  - maintainability: how easily a failed system can be repaired

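The difference between availability (an instant in time) and reliability (a time interval) is easy to miss, so a small numeric sketch may help. The two systems and their downtime figures below are hypothetical illustrations, not measurements, and availability is estimated simply as the fraction of time a system is up.

```python
def availability(uptime_s: float, downtime_s: float) -> float:
    """Fraction of time the system is operating correctly."""
    return uptime_s / (uptime_s + downtime_s)

HOUR = 3600.0
WEEK = 7 * 24 * HOUR

# Hypothetical system A: goes down for 1 millisecond every hour.
# It is almost always up (high availability) but never runs longer
# than an hour without failing (low reliability).
a = availability(HOUR - 0.001, 0.001)

# Hypothetical system B: never crashes, but is shut down for two
# weeks every year. It runs failure-free for long intervals (high
# reliability) yet is up only about 96% of the time.
b = availability(50 * WEEK, 2 * WEEK)

print(f"A: {a:.7%}  B: {b:.2%}")
```

System A wins on availability and system B on reliability, which is why the two properties are listed separately.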
Cont’d
• dependable systems are also required to provide a high degree of security
• a system is said to fail when it cannot meet its promises, for instance by failing to provide its users with one or more of the services it promises
• an error is a part of a system’s state that may lead to a failure; e.g., damaged packets in communication
• the cause of an error is called a fault
• building dependable systems closely relates to controlling faults
• a distinction is made between preventing, removing, and forecasting faults
• a fault-tolerant system is a system that can provide its services even in the presence of faults

Cont’d
• faults are classified into three types
  - transient: occurs once and then disappears; if the operation is repeated, the fault goes away
  - intermittent: occurs, vanishes of its own accord, then reappears, and so on
  - permanent: continues to exist until the faulty component is repaired; e.g., disk head crash, software bug

Cont’d
• Failure Modes (five types)
  - Crash failure: a server halts, but was working correctly until it stopped
  - Omission failure: a server fails to respond to incoming requests
    - Receive omission: a server fails to receive incoming messages; e.g., perhaps no thread is listening
    - Send omission: a server fails to send messages
  - Timing failure: a server's response lies outside the specified time interval; e.g., it is too fast and floods the receiver, or it is too slow
  - Response failure: the server's response is incorrect
    - Value failure: the value of the response is wrong; e.g., a search engine returning wrong Web pages as a result of a search
    - State transition failure: the server deviates from the correct flow of control; e.g., taking default actions when it fails to understand the request
  - Arbitrary failure (or Byzantine failure): a server may produce arbitrary responses at arbitrary times; the most serious failure mode

Cont’d
• Failure Masking by Redundancy
  - to be fault tolerant, the system tries to hide the occurrence of failures from other processes; this is called masking
  - the key technique for masking faults is redundancy
  - three kinds are possible
    - information redundancy: add extra bits to allow recovery from garbled bits (error correction)
    - time redundancy: an action is performed more than once if needed; e.g., redo an aborted transaction; useful for transient and intermittent faults
    - physical redundancy: add (replicate) extra equipment (hardware) or processes (software)
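The three kinds of redundancy can be sketched in a few lines each. This is a minimal illustration under stated assumptions: the 3x repetition code is a deliberately simple stand-in for a real error-correcting code, and the names `encode`, `decode`, `retry`, `vote`, and `flaky` are hypothetical helpers, not part of any library.

```python
from collections import Counter

# Information redundancy: send every bit three times (repetition code).
def encode(bits):
    return [b for b in bits for _ in range(3)]

def decode(coded):
    # A majority vote inside each group of three repairs one flipped bit.
    return [1 if sum(coded[i:i + 3]) >= 2 else 0
            for i in range(0, len(coded), 3)]

# Time redundancy: repeat an action until it succeeds or attempts run out.
def retry(action, attempts=3):
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except RuntimeError as err:  # treated here as a transient fault
            last_error = err
    raise last_error

# Physical redundancy: ask several replicas and accept the majority answer.
def vote(replies):
    value, count = Counter(replies).most_common(1)[0]
    if count <= len(replies) // 2:
        raise ValueError("no majority among replicas")
    return value

coded = encode([1, 0, 1])
coded[4] ^= 1                      # garble one bit in transit
recovered = decode(coded)          # the garbled bit is corrected

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:             # fails twice, then succeeds
        raise RuntimeError("transient fault")
    return "done"

result = retry(flaky)
answer = vote([42, 42, 7])         # one replica returns a wrong value
print(recovered, result, answer)
```

In each case the caller sees a correct result even though a fault occurred, which is what masking means.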

Process Resilience
• how can fault tolerance be achieved in distributed systems?
• one method is protection against process failures by replicating processes into groups
• we discuss
  - the general design issues of process groups
  - what a fault-tolerant group actually is

Cont’d
• Design Issues
  - the key approach to tolerating a faulty process is to organize several identical processes into a group
  - all members of a group receive each message sent to the group, in the hope that if one process fails, another will take over
  - process groups may be dynamic
    - new groups can be created and old groups can be destroyed
    - a process can join or leave a group
    - a process can be a member of several groups at the same time
    - hence group management and membership mechanisms are required
  - groups may be flat (all processes are equal) or hierarchical (a coordinator and several workers)

Cont’d

Figure: (a) communication in a flat group; (b) communication in a simple hierarchical group

• the flat group has no single point of failure, but decision making is more complicated (voting may be required)
• the hierarchical group has the opposite properties: decisions are simple because the coordinator makes them, but the coordinator is a single point of failure
• group membership may be handled
  - through a group server to which all requests (joining, leaving, ...) are sent; the server is a single point of failure
  - in a distributed way (membership changes are multicast to all members)
Cont’d
• Failure Masking and Replication
  - how can processes be replicated so that they form groups? there are two ways:
    - primary-based protocols: for fault tolerance, a primary-backup protocol is used; processes are organized hierarchically and the primary coordinates all writes; if the primary crashes, the backups hold an election to choose a new primary
    - replicated-write protocols: in the form of active replication or quorum-based protocols; processes are organized as flat groups
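A primary-backup group can be sketched in memory as follows. The slides do not say how the election works, so the rule used here (the live replica with the highest id wins) is an assumption, and `Replica` and `Group` are hypothetical names; message passing and failure detection are left out.

```python
class Replica:
    def __init__(self, rid):
        self.rid = rid
        self.alive = True
        self.store = {}

class Group:
    def __init__(self, replicas):
        self.replicas = replicas
        self.primary = self.elect()

    def elect(self):
        # Assumed election rule: the live replica with the highest id wins.
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica left")
        return max(live, key=lambda r: r.rid)

    def write(self, key, value):
        # If the primary has crashed, the backups elect a new one first.
        if not self.primary.alive:
            self.primary = self.elect()
        # The primary coordinates the write: it is applied at the
        # primary and forwarded to every live backup.
        for r in self.replicas:
            if r.alive:
                r.store[key] = value
        return self.primary.rid

group = Group([Replica(1), Replica(2), Replica(3)])
first = group.write("x", 1)        # coordinated by replica 3
group.primary.alive = False        # the primary crashes
second = group.write("x", 2)       # replica 2 takes over
print(first, second, group.replicas[0].store)
```

Clients keep being served after the crash because a backup already holds the data and can take over as primary.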

Reliable Group Communication
how to reliably deliver messages to a process group (multicasting)
• Basic Reliable-Multicasting Schemes
  - reliable multicasting means that a message sent to a process group should be delivered to each member of that group
  - transport protocols typically do not offer reliable communication to a collection of processes
  - problems:
    - what happens if a process joins a group during communication?
    - what happens if a (sending) process crashes during communication?
    - what if there are faulty processes?
  - a weaker solution, assuming that all receivers are known and that none will fail, is for the sending process to assign a sequence number to each message and to buffer all messages so that lost ones can be retransmitted
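The sequence-number scheme can be sketched as below, assuming known, non-failing receivers as the slide does. The `Sender` and `Receiver` classes and the `lost_for` parameter (used only to simulate message loss) are hypothetical; a real sender would also drop a message from its history once every receiver has acknowledged it.

```python
class Sender:
    def __init__(self):
        self.next_seq = 0
        self.history = {}            # seq -> payload, kept for retransmission

    def multicast(self, payload, receivers, lost_for=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        for r in receivers:
            if r.name not in lost_for:   # simulate loss on some links
                r.receive(seq, payload, self)
        return seq

    def retransmit(self, seq, receiver):
        receiver.receive(seq, self.history[seq], self)

class Receiver:
    def __init__(self, name):
        self.name = name
        self.expected = 0            # next sequence number to deliver
        self.pending = {}            # messages that arrived out of order
        self.delivered = []

    def receive(self, seq, payload, sender):
        if seq < self.expected or seq in self.pending:
            return                   # duplicate, ignore
        self.pending[seq] = payload
        # Feedback: a gap in the sequence numbers reveals lost messages.
        for missing in range(self.expected, seq):
            if missing not in self.pending:
                sender.retransmit(missing, self)
        # Deliver everything that is now in order.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1

a, b = Receiver("a"), Receiver("b")
s = Sender()
s.multicast("m0", [a, b])
s.multicast("m1", [a, b], lost_for={"b"})   # b never sees m1
s.multicast("m2", [a, b])                   # b notices the gap and asks again
print(a.delivered, b.delivered)
```

Receiver b only detects the loss when a later message arrives, which is one reason the scheme is described as a weaker solution.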

Cont’d

Figure: a simple solution to reliable multicasting when all receivers are known and are assumed not to fail; (a) message transmission; (b) reporting feedback
Cont’d

• Atomic Multicast
  - how to achieve reliable multicasting in the presence of process failures
  - for example, in a replicated database, how to handle an update when a replica crashes while the update is in progress
  - the atomic multicast problem: guarantee that a message is delivered either to all processes or to none at all, and that all messages are delivered in the same order to all processes

Distributed Commit
• atomic multicasting is an example of the more general problem known as distributed commit
• in atomic multicasting, the operation is the delivery of a message
• the distributed commit problem involves having an arbitrary operation performed by every member of a process group, or by none at all
• there are three protocols: one-phase commit, two-phase commit, and three-phase commit
• One-Phase Commit Protocol
  - a coordinator tells all other processes, called participants, whether or not to (locally) perform an operation
  - drawback: if one of the participants cannot perform the operation, for example because of a violation of concurrency control constraints in a distributed transaction, it has no way to tell the coordinator

Cont’d
• Two-Phase Commit Protocol (2PC)
  - it has two phases, a voting phase and a decision phase, each involving two steps
  - voting phase
    - the coordinator sends a VOTE_REQUEST message to all participants
    - each participant replies with a VOTE_COMMIT or VOTE_ABORT message, depending on its local situation
  - decision phase
    - the coordinator collects all votes; if all participants vote to commit the transaction, it sends a GLOBAL_COMMIT message; if at least one participant sends VOTE_ABORT, it sends a GLOBAL_ABORT message
    - each participant that voted to commit waits for the coordinator's final decision and then commits or aborts accordingly
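The four steps of 2PC can be sketched as a failure-free, in-memory run. The message names come from the slide; the `Participant` class, its `can_commit` flag, and the function `two_phase_commit` are hypothetical, and no network, crashes, or timeouts are modelled.

```python
from enum import Enum

class Msg(Enum):
    VOTE_REQUEST = 1
    VOTE_COMMIT = 2
    VOTE_ABORT = 3
    GLOBAL_COMMIT = 4
    GLOBAL_ABORT = 5

class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit   # stands in for the "local situation"
        self.state = "INIT"

    def on_vote_request(self):
        # Voting phase, step 2: answer the coordinator's VOTE_REQUEST.
        if self.can_commit:
            self.state = "READY"
            return Msg.VOTE_COMMIT
        self.state = "ABORT"
        return Msg.VOTE_ABORT

    def on_decision(self, decision):
        # Decision phase, step 2: follow the coordinator's global decision.
        self.state = "COMMIT" if decision is Msg.GLOBAL_COMMIT else "ABORT"

def two_phase_commit(participants):
    # Voting phase, step 1: the coordinator asks every participant to vote.
    votes = [p.on_vote_request() for p in participants]
    # Decision phase, step 1: commit only if every vote is VOTE_COMMIT.
    if all(v is Msg.VOTE_COMMIT for v in votes):
        decision = Msg.GLOBAL_COMMIT
    else:
        decision = Msg.GLOBAL_ABORT
    for p in participants:
        p.on_decision(decision)
    return decision

happy = [Participant("p1", True), Participant("p2", True)]
mixed = [Participant("p1", True), Participant("p2", False)]
ok = two_phase_commit(happy)
bad = two_phase_commit(mixed)
print(ok.name, bad.name, [p.state for p in mixed])
```

A single VOTE_ABORT is enough to abort the operation at every participant, which gives the all-or-nothing behaviour that distributed commit requires.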

Cont’d

Figure: (a) the finite state machine for the coordinator in 2PC; (b) the finite state machine for a participant

Cont’d
• problems may occur in the event of failures
  - the coordinator and participants have states in which they block waiting for messages: INIT, READY, WAIT
  - when a process crashes, other processes may wait indefinitely
  - hence, timeout mechanisms are required
    - a participant waiting in its INIT state for a VOTE_REQUEST from the coordinator aborts locally and sends VOTE_ABORT if no vote request arrives within some time
    - the coordinator blocked in state WAIT aborts and sends GLOBAL_ABORT if not all votes have been collected in time
    - a participant P blocked in its READY state, waiting for the global decision, cannot simply abort; instead it must find out which message the coordinator actually sent
      - by blocking until the coordinator recovers
      - or by asking another participant, say Q
  - a process (participant or coordinator) can recover from a crash if its state has been saved to persistent storage
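Two of these timeout cases can be sketched with the standard library. `collect_votes` models the coordinator in WAIT and `resolve_ready` models participant P asking its peers; both names are hypothetical. The rule for interpreting a peer's state is one common resolution rule and is an assumption here, since the slides only say that P asks Q. The timeout is applied per vote rather than to the whole phase, to keep the sketch short.

```python
import queue

def collect_votes(inbox, expected, timeout_s):
    """Coordinator in WAIT: abort if a vote does not arrive in time."""
    votes = []
    try:
        while len(votes) < expected:
            votes.append(inbox.get(timeout=timeout_s))
    except queue.Empty:
        return "GLOBAL_ABORT"            # a participant is silent
    if all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"

def resolve_ready(peer_states):
    """Participant P in READY: infer the decision from peers' states."""
    for state in peer_states:
        if state == "COMMIT":
            return "COMMIT"              # GLOBAL_COMMIT must have been sent
        if state in ("ABORT", "INIT"):
            # The peer aborted, or has not voted yet, so the coordinator
            # cannot have decided to commit (assumed rule).
            return "ABORT"
    return "BLOCK"                       # every peer is READY too: keep waiting

full = queue.Queue()
full.put("VOTE_COMMIT")
full.put("VOTE_COMMIT")
partial = queue.Queue()
partial.put("VOTE_COMMIT")               # the second vote never arrives

committed = collect_votes(full, expected=2, timeout_s=0.05)
timed_out = collect_votes(partial, expected=2, timeout_s=0.05)
print(committed, timed_out, resolve_ready(["READY", "COMMIT"]))
```

The "BLOCK" result is the case that 2PC cannot resolve: if every reachable participant is in READY, nobody knows the decision until the coordinator recovers.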
Cont’d
• Three-Phase Commit Protocol (3PC)
  - the problem with 2PC is that, if the coordinator crashes, participants may need to block until the coordinator recovers
  - 3PC avoids blocking processes in the presence of crashes
  - the states of the coordinator and each participant satisfy the following two conditions
    - there is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state
    - there is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made
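The first condition can be checked mechanically on a transition table. The participant state machines below are written down from the usual textbook description (the figures are not reproduced in this text), so the extra PRECOMMIT state and the exact transitions are assumptions; `violating_states` is a hypothetical helper, and only the first condition is checked.

```python
# state -> set of states reachable in one transition (participant side)
TWO_PC_PARTICIPANT = {
    "INIT": {"READY", "ABORT"},
    "READY": {"COMMIT", "ABORT"},
    "COMMIT": set(),
    "ABORT": set(),
}

THREE_PC_PARTICIPANT = {
    "INIT": {"READY", "ABORT"},
    "READY": {"PRECOMMIT", "ABORT"},
    "PRECOMMIT": {"COMMIT"},          # assumed extra state of 3PC
    "COMMIT": set(),
    "ABORT": set(),
}

def violating_states(fsm):
    """States with direct transitions to both COMMIT and ABORT."""
    return sorted(s for s, nxt in fsm.items() if {"COMMIT", "ABORT"} <= nxt)

print(violating_states(TWO_PC_PARTICIPANT))    # READY breaks condition 1
print(violating_states(THREE_PC_PARTICIPANT))  # no state breaks it
```

In 2PC a participant in READY can go straight to either outcome, so a peer that finds it in READY learns nothing; the extra state in 3PC removes that ambiguity.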

Cont’d

Figure: (a) the finite state machine for the coordinator in 3PC; (b) the finite state machine for a participant

Recovery

• fundamental to fault tolerance is recovery from an error
• error recovery means replacing an erroneous state with an error-free state
• two forms of error recovery: backward recovery and forward recovery
• Backward Recovery
  - bring the system from its present erroneous state back into a previously correct state
  - for this, the system’s state must be recorded from time to time; each time a state is recorded, a checkpoint is said to be made
  - e.g., retransmitting lost or damaged packets in the implementation of reliable communication
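Checkpointing and rollback can be sketched for a single process with in-memory state. The `Checkpointed` class and the account example are hypothetical; a real system would write checkpoints to stable storage and would have to coordinate checkpoints across processes.

```python
import copy

class Checkpointed:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record the current (assumed correct) state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: return to the last recorded state.
        self.state = copy.deepcopy(self._checkpoint)

account = Checkpointed({"balance": 100})
account.state["balance"] -= 30
account.checkpoint()                  # state {"balance": 70} is recorded
account.state["balance"] = -999       # a simulated error corrupts the state
account.rollback()
print(account.state)
```

Work done after the last checkpoint is lost on rollback, so the checkpoint interval trades checkpointing cost against the amount of work that has to be redone.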

Cont’d
  - disadvantages of backward recovery:
    - checkpointing and restoring a process to its previous state are costly and can become performance bottlenecks
    - no guarantee can be given that the error will not recur, which may take an application into a loop of recovery
    - some actions are irreversible; e.g., deleting a file, handing over cash to a customer

• Forward Recovery
  - bring the system from its present erroneous state to a correct new state from which it can continue to execute
  - it has to be known in advance which errors may occur so that they can be corrected
  - e.g., erasure correction (or simply error correction), where a lost or damaged packet is reconstructed from other successfully delivered packets
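Erasure correction can be illustrated with a single XOR parity packet, which is one simple scheme chosen here for brevity (the slides do not name a specific code). It assumes equal-length packets and at most one lost packet per group; `make_parity` and `recover_missing` are hypothetical names.

```python
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    # The parity packet is the XOR of all data packets.
    return reduce(xor_bytes, packets)

def recover_missing(received, parity):
    # XOR-ing the parity with every packet that did arrive leaves
    # exactly the one packet that was lost (marked as None).
    present = [p for p in received if p is not None]
    return reduce(xor_bytes, present, parity)

packets = [b"ab", b"cd", b"ef"]
parity = make_parity(packets)
received = [b"ab", None, b"ef"]       # the second packet is lost
rebuilt = recover_missing(received, parity)
print(rebuilt)
```

The receiver moves forward to a correct state without asking for a retransmission, which is what distinguishes this from backward recovery.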

Thank You !!!
