
DS Chapter V8.0fault Tolerance

This document discusses fault tolerance in distributed systems. It covers: 1) Fault tolerance aims to construct systems that can automatically recover from partial failures when components fail. Dependability includes availability, reliability, safety, and maintainability. 2) Faults are classified as transient, intermittent, or permanent. Failure modes include crash, omission, timing, response, and arbitrary failures. 3) Redundancy is the key technique for fault tolerance. This includes information, time, and physical redundancy. Process groups use physical redundancy through replication to mask faults.

Uploaded by

Gofere Tube

Distributed Systems

CHAPTER EIGHT
Fault Tolerance

(CS, CCI, WKU, Ethiopia, 2022)

Lecturer Name: Habtamu Alemayehu (MSc in CSE)

5/9/2022 1
Introduction

• a major difference between distributed systems and single-machine systems is that in the former, partial failure is possible, i.e., one component of the distributed system fails
• such a failure may affect some components while others continue to function properly
• an important goal of distributed systems design is to construct a system that can automatically recover from partial failure
• it should tolerate faults and continue to operate to some extent

Fault Tolerance
Basic Concepts
• fault tolerance is strongly related to dependable systems
• dependability covers the following
  ◦ availability
    ▪ the probability that the system is operating correctly at any given moment; defined in terms of an instant in time
  ◦ reliability
    ▪ the property that a system can run continuously without failure; defined in terms of a time interval
  ◦ safety
    ▪ even when a system temporarily fails to operate correctly, nothing catastrophic happens
  ◦ maintainability
    ▪ how easily a failed system can be repaired
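
As a rough illustration of how availability and reliability can diverge, the sketch below assumes the common steady-state approximation availability = uptime / (uptime + downtime), which the slides do not state. Both systems and all their numbers are hypothetical.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of time the system is up (assumed steady-state approximation)."""
    return uptime_hours / (uptime_hours + downtime_hours)


# Hypothetical system A: goes down for 1 ms in every hour of operation.
ONE_MS_IN_HOURS = 1 / 3_600_000
avail_a = availability(1 - ONE_MS_IN_HOURS, ONE_MS_IN_HOURS)

# Hypothetical system B: never crashes, but is shut down 2 weeks per year.
HOURS_PER_WEEK = 7 * 24
avail_b = availability(50 * HOURS_PER_WEEK, 2 * HOURS_PER_WEEK)

print(f"A: availability {avail_a:.7%}, longest failure-free run under 1 hour")
print(f"B: availability {avail_b:.2%}, longest failure-free run 50 weeks")
```

System A is highly available but not reliable (it never runs a full hour without failing), while system B is reliable over long intervals but noticeably less available.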

Cont’d
• dependable systems are also required to provide a high degree of security
• a system is said to fail when it cannot meet its promises, for instance by failing to provide its users with one or more of the services it promises
• an error is a part of a system’s state that may lead to a failure; e.g., damaged packets in communication
• the cause of an error is called a fault
• building dependable systems closely relates to controlling faults
• a distinction is made between preventing, removing, and forecasting faults
• a fault tolerant system is a system that can provide its services even in the presence of faults

Cont’d
• faults are classified into three types
  ◦ transient
    ▪ occurs once and then disappears; if the operation is repeated, the fault goes away
  ◦ intermittent
    ▪ occurs, vanishes of its own accord, then reappears, and so on
  ◦ permanent
    ▪ continues to exist until the faulty component is repaired; e.g., disk head crash, software bug

Cont’d
• Failure Modes (five types)
  ◦ Crash failure: a server halts, but was working correctly until it stopped
  ◦ Omission failure: a server fails to respond to incoming requests
    ▪ Receive omission: a server fails to receive incoming messages; e.g., no thread may be listening
    ▪ Send omission: a server fails to send messages
  ◦ Timing failure: a server's response lies outside the specified time interval; e.g., it may be too fast, flooding the receiver, or too slow
  ◦ Response failure: the server's response is incorrect
    ▪ Value failure: the value of the response is wrong; e.g., a search engine returning wrong Web pages as the result of a search
    ▪ State transition failure: the server deviates from the correct flow of control; e.g., taking default actions when it fails to understand the request
  ◦ Arbitrary failure (or Byzantine failure): a server may produce arbitrary responses at arbitrary times; the most serious failure mode

Cont’d
• Failure Masking by Redundancy
  ◦ to be fault tolerant, the system tries to hide the occurrence of failures from other processes; this is called masking
  ◦ the key technique for masking faults is redundancy
  ◦ three kinds are possible
    ▪ information redundancy: add extra bits to allow recovery from garbled bits (error correction)
    ▪ time redundancy: an action is performed more than once if needed; e.g., redo an aborted transaction; useful for transient and intermittent faults
    ▪ physical redundancy: add (replicate) extra equipment (hardware) or processes (software)
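
Time redundancy can be sketched as a simple retry loop. The example below is only an illustration: `TransientFault`, `with_time_redundancy`, and `make_flaky` are hypothetical names, and the fault is simulated rather than real.

```python
class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may vanish on retry."""


def with_time_redundancy(action, max_attempts=3):
    """Perform `action` again, up to `max_attempts` times, while it keeps failing."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return action()
        except TransientFault as err:
            last_error = err  # remember the fault and repeat the action
    # Every attempt failed: the fault is not masked and becomes visible.
    raise last_error


def make_flaky(failures_before_success):
    """Build a simulated action that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def action():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("temporary glitch")
        return "committed"

    return action


print(with_time_redundancy(make_flaky(2)))  # third attempt succeeds
```

Retrying masks a fault that disappears within the retry budget; a permanent fault still surfaces once the attempts are used up, which is why the slide restricts time redundancy to transient and intermittent faults.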

Process Resilience
• how can fault tolerance be achieved in distributed systems?
• one method is protection against process failures by replicating processes into groups
• we discuss
  ◦ the general design issues of process groups
  ◦ what a fault tolerant group actually is

Cont’d
• Design Issues
  ◦ the key approach to tolerating a faulty process is to organize several identical processes into a group
  ◦ a message sent to the group is received by all members, so that if one process fails, another can take over
  ◦ process groups may be dynamic
    ▪ new groups can be created and old groups can be destroyed
    ▪ a process can join or leave a group
    ▪ a process can be a member of several groups at the same time
    ▪ hence group management and membership mechanisms are required
  ◦ groups may be flat (all processes are equal) or hierarchical (a coordinator and several workers)

Cont’d

(a) communication in a flat group (b) communication in a simple hierarchical group


• the flat group has no single point of failure, but decision making is more complicated (voting may be required)
• the hierarchical group has the opposite properties: the coordinator is a single point of failure, but decision making is simple
• group membership may be handled
  ◦ through a group server to which all requests (joining, leaving, ...) are sent; the server is a single point of failure
  ◦ in a distributed way (membership changes are multicast to all members)
Cont’d
• Failure Masking and Replication
  ◦ how can processes be replicated so that they form groups? there are two ways:
    ▪ primary-based protocols: for fault tolerance, a primary-backup protocol is used; processes are organized hierarchically and the primary coordinates all writes; if the primary crashes, the backups hold an election to choose a new primary
    ▪ replicated-write protocols: in the form of active replication or by means of quorum-based protocols; processes are organized as flat groups
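
Active replication in a flat group can be illustrated with a majority vote over the replies of all replicas. This is a minimal sketch under stated assumptions: the replicas are plain functions, the faulty ones simulate a value failure, and a strict majority of the group is required; none of these details come from the slides.

```python
from collections import Counter


def vote(replies):
    """Return the reply given by a strict majority of the replicas."""
    value, count = Counter(replies).most_common(1)[0]
    if count > len(replies) // 2:
        return value
    raise RuntimeError("no majority among replicas")


def replicated_call(replicas, request):
    """Send the same request to every replica and mask faults by voting."""
    replies = [replica(request) for replica in replicas]
    return vote(replies)


def correct_replica(x):
    return x * x


def faulty_replica(x):
    return x * x + 1  # simulated value failure


def other_faulty_replica(x):
    return -1  # a second, different simulated value failure


# One faulty replica out of three is outvoted by the two correct ones.
print(replicated_call([correct_replica, correct_replica, faulty_replica], 7))
```

With three replicas the group masks one faulty member; if two replicas are faulty, the vote either fails or, when they agree on the same wrong value, yields a wrong answer.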

Reliable Group Communication
• how can messages be reliably delivered to a process group (multicasting)?
• Basic Reliable-Multicasting Schemes
  ◦ reliable multicasting means that a message sent to a process group should be delivered to each member of that group
  ◦ transport protocols typically offer reliable point-to-point channels, but not reliable communication to a collection of processes
  ◦ problems:
    ▪ what happens if a process joins a group during communication?
    ▪ what happens if a (sending) process crashes during communication?
    ▪ what if there are faulty processes?
  ◦ a simple solution, assuming that all receivers are known and that none will fail, is for the sending process to assign a sequence number to each message and to buffer all messages so that lost ones can be retransmitted
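
The sequence-number scheme can be sketched as follows. This is an illustration under the same assumptions as the slide (known receivers that never fail); the class names, the `lost_for` parameter used to simulate message loss, and the choice to have a receiver request missing messages when it notices a gap are all assumptions of this sketch.

```python
class Sender:
    def __init__(self):
        self.next_seq = 0
        self.history = {}  # seq -> payload, buffered for retransmission

    def multicast(self, payload, receivers, lost_for=()):
        """Send payload to all receivers; names in lost_for simulate a lost copy."""
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        for receiver in receivers:
            if receiver.name not in lost_for:
                receiver.receive(seq, payload, self)
        return seq

    def retransmit(self, seq, receiver):
        receiver.receive(seq, self.history[seq], self)


class Receiver:
    def __init__(self, name):
        self.name = name
        self.expected = 0    # next sequence number to deliver
        self.held = {}       # messages that arrived ahead of a gap
        self.delivered = []  # payloads handed to the application, in order

    def receive(self, seq, payload, sender):
        if seq < self.expected or seq in self.held:
            return  # duplicate, ignore
        self.held[seq] = payload
        # A gap means earlier messages were lost: report it to the sender.
        for missing in range(self.expected, seq):
            if missing not in self.held:
                sender.retransmit(missing, self)
        # Deliver every message that is now in sequence.
        while self.expected in self.held:
            self.delivered.append(self.held.pop(self.expected))
            self.expected += 1


sender = Sender()
a, b = Receiver("a"), Receiver("b")
sender.multicast("m0", [a, b])
sender.multicast("m1", [a, b], lost_for={"b"})  # b misses m1
sender.multicast("m2", [a, b])                  # b detects the gap here
print(a.delivered, b.delivered)
```

In this sketch the sender keeps its history forever; a real scheme needs feedback from the receivers before buffered messages can be discarded.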

Cont’d

a simple solution to reliable multicasting when all receivers are known and are assumed not to fail; (a)
message transmission, (b) reporting feedback
Cont’d

• Atomic Multicast
  ◦ how can reliable multicasting be achieved in the presence of process failures?
  ◦ for example, in a replicated database, how should an update be handled when a replica crashes while the update is in progress?
  ◦ the atomic multicast problem: guarantee that a message is delivered to either all processes or none at all, and that messages are delivered in the same order to all processes

Distributed Commit
• atomic multicasting is an example of a more general problem known as distributed commit
• in atomic multicasting, the operation is the delivery of a message
• the distributed commit problem involves having an arbitrary operation performed by every member of a process group, or by none at all
• there are three protocols: one-phase commit, two-phase commit, and three-phase commit
• One-Phase Commit Protocol
  ◦ a coordinator tells all other processes, called participants, whether or not to (locally) perform an operation
  ◦ drawback: if one of the participants cannot perform the operation, for example due to a violation of concurrency control constraints in a distributed transaction, it has no way to tell the coordinator

Cont’d
• Two-Phase Commit Protocol (2PC)
  ◦ it has two phases, a voting phase and a decision phase, each involving two steps
  ◦ voting phase
    ▪ the coordinator sends a VOTE_REQUEST message to all participants
    ▪ each participant then replies with a VOTE_COMMIT or VOTE_ABORT message, depending on its local situation
  ◦ decision phase
    ▪ the coordinator collects all votes; if all participants voted to commit the transaction, it sends a GLOBAL_COMMIT message to all participants; if at least one participant sent VOTE_ABORT, it sends a GLOBAL_ABORT message
    ▪ each participant that voted to commit waits for the coordinator's final decision and commits or aborts accordingly
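
The four steps above can be sketched as a failure-free, in-process simulation. The message names come from the slides; the `Participant` class, the `can_commit` flag, and the state names used here are assumptions of this sketch, and no crashes or timeouts are modelled.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # stands in for the "local situation"
        self.state = "INIT"

    def on_vote_request(self):
        """Voting phase, step 2: reply to VOTE_REQUEST."""
        if self.can_commit:
            self.state = "READY"
            return "VOTE_COMMIT"
        self.state = "ABORT"
        return "VOTE_ABORT"

    def on_decision(self, decision):
        """Decision phase, step 2: act on the coordinator's decision."""
        if self.state == "READY":
            self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants):
    """Coordinator logic: collect votes, then broadcast the global decision."""
    # Voting phase, step 1: VOTE_REQUEST to everyone (a method call here).
    votes = [p.on_vote_request() for p in participants]
    # Decision phase, step 1: commit only if every vote is VOTE_COMMIT.
    if all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision


group = [Participant("p1"), Participant("p2"), Participant("p3", can_commit=False)]
print(two_phase_commit(group), [p.state for p in group])
```

A single VOTE_ABORT is enough to abort the operation at every participant, which is what gives the all-or-nothing property.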

Cont’d

a) the finite state machine for the coordinator in 2PC


b) the finite state machine for a participant

Cont’d
• problems may occur in the event of failures
  ◦ the coordinator and the participants have states in which they block waiting for messages: INIT and READY for a participant, WAIT for the coordinator
  ◦ when a process crashes, other processes may wait indefinitely
  ◦ hence, timeout mechanisms are required
    ▪ a participant waiting in its INIT state for a VOTE_REQUEST from the coordinator aborts and sends VOTE_ABORT if the request does not arrive in time
    ▪ the coordinator blocked in its WAIT state aborts and sends GLOBAL_ABORT if not all votes have been collected in time
    ▪ a participant P blocked in its READY state, waiting for the global decision, cannot simply abort; instead it must find out which message the coordinator actually sent
      ▫ by blocking until the coordinator recovers
      ▫ or by asking another participant, say Q
  ◦ a process (participant or coordinator) can recover from a crash if its state has been saved to persistent storage
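
What P can conclude from Q's state is not spelled out on the slide. The sketch below follows the usual textbook reasoning, stated here as an assumption: if Q is in COMMIT or ABORT, the coordinator's decision is known; if Q is still in INIT, the coordinator cannot have decided to commit, so aborting is safe; if Q is also in READY, P learns nothing and must ask someone else. The function names are hypothetical.

```python
def action_after_asking(q_state):
    """What a participant blocked in READY may do, given the state of peer Q."""
    if q_state == "COMMIT":
        return "COMMIT"       # the coordinator must have sent GLOBAL_COMMIT
    if q_state in ("ABORT", "INIT"):
        return "ABORT"        # a commit decision is impossible, aborting is safe
    if q_state == "READY":
        return "ASK_ANOTHER"  # Q is blocked too, so it cannot help
    raise ValueError(f"unknown participant state: {q_state}")


def resolve(peer_states):
    """Ask peers one by one; keep blocking if every peer is also in READY."""
    for state in peer_states:
        action = action_after_asking(state)
        if action != "ASK_ANOTHER":
            return action
    return "BLOCK"


print(resolve(["READY", "INIT"]))   # a peer still in INIT lets P abort
print(resolve(["READY", "READY"]))  # everyone is READY, so P stays blocked
```

The last case is the weakness of 2PC: when all reachable participants are in READY, nobody can decide until the coordinator recovers.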
Cont’d
• Three-Phase Commit Protocol (3PC)
  ◦ the problem with 2PC is that, if the coordinator crashes, participants may need to block until the coordinator recovers
  ◦ 3PC avoids blocking processes in the presence of fail-stop crashes
  ◦ the states of the coordinator and of each participant satisfy the following two conditions
    ▪ there is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state
    ▪ there is no state in which it is not possible to make a final decision and from which a transition to a COMMIT state can be made
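
The first condition can be checked mechanically on a transition table. Because the state diagrams are not reproduced in this text, the tables below are assumptions based on the standard textbook presentation of the coordinator's states (including the PRECOMMIT state that 3PC adds); the function name is hypothetical.

```python
# Assumed coordinator state machines: state -> set of directly reachable states.
COORDINATOR_2PC = {
    "INIT": {"WAIT"},
    "WAIT": {"COMMIT", "ABORT"},
    "COMMIT": set(),
    "ABORT": set(),
}

COORDINATOR_3PC = {
    "INIT": {"WAIT"},
    "WAIT": {"PRECOMMIT", "ABORT"},
    "PRECOMMIT": {"COMMIT"},
    "COMMIT": set(),
    "ABORT": set(),
}


def states_violating_condition_one(fsm):
    """States with direct transitions to both COMMIT and ABORT."""
    return sorted(
        state
        for state, successors in fsm.items()
        if {"COMMIT", "ABORT"} <= successors
    )


print(states_violating_condition_one(COORDINATOR_2PC))
print(states_violating_condition_one(COORDINATOR_3PC))
```

Under these assumed tables, the WAIT state of 2PC violates the condition, while the extra PRECOMMIT state of 3PC separates the path to COMMIT from the path to ABORT.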

Cont’d

a) finite state machine for the coordinator in 3PC


b) finite state machine for a participant

Recovery

• fundamental to fault tolerance is recovery from an error
• error recovery means replacing an erroneous state with an error-free state
• two forms of error recovery: backward recovery and forward recovery
• Backward Recovery
  ◦ bring the system from its present erroneous state back into a previously correct state
  ◦ for this, the system’s state must be recorded from time to time; each time a state is recorded, a checkpoint is said to be made
  ◦ e.g., retransmitting lost or damaged packets in the implementation of reliable communication
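
Checkpointing and rollback can be sketched for a single process whose state is an in-memory dictionary. This is a simplified illustration: the class name and the account example are hypothetical, and a real system would write checkpoints to stable storage rather than keep them in memory.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        self._checkpoints = []  # recorded states, oldest first

    def checkpoint(self):
        """Record a copy of the current state."""
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        """Backward recovery: return to the most recently recorded state."""
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._checkpoints[-1])


process = CheckpointedProcess({"balance": 100})
process.checkpoint()
process.state["balance"] = -999  # simulated erroneous state
process.rollback()
print(process.state)
```

The deep copies are what make each checkpoint independent of later changes to the live state, and they are also where the cost of checkpointing comes from.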

Cont’d
• disadvantages of backward recovery:
  ◦ checkpointing and restoring a process to its previous state are costly and can become performance bottlenecks
  ◦ no guarantee can be given that the error will not recur, which may trap an application in a loop of recovery
  ◦ some actions are irreversible; e.g., deleting a file, handing over cash to a customer

• Forward Recovery
  ◦ bring the system from its present erroneous state to a correct new state from which it can continue to execute
  ◦ it has to be known in advance which errors may occur, so that they can be corrected
  ◦ e.g., erasure correction (or simply error correction), where a lost or damaged packet is reconstructed from other successfully delivered packets
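
A very simple form of erasure correction uses one XOR parity packet per group of packets. The sketch below assumes equal-length packets and at most one lost packet per group; the function names are hypothetical, and practical erasure codes are more elaborate.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet: the XOR of all data packets in the group."""
    return reduce(xor_bytes, packets)


def recover_missing(received, parity):
    """Rebuild the single missing packet from the received ones and the parity."""
    return reduce(xor_bytes, received, parity)


packets = [b"abcd", b"efgh", b"ijkl"]
parity = make_parity(packets)

# Suppose the second packet is lost in transit.
rebuilt = recover_missing([packets[0], packets[2]], parity)
print(rebuilt == packets[1])
```

The receiver moves forward to a correct state without asking for a retransmission, at the price of sending the extra parity packet with every group.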

Thank You !!!
