Week 04

Uploaded by

Imaan Mufti

Parallel and Distributed Computing

Introduction to Fault
Tolerance

Objectives
• Introduction to Fault Tolerance.
• Fault Classification.
• Failure Classification.
• Failure Masking.
Fault Tolerance

“A fault-tolerant system is one that continues to provide the required functionality in the presence of faults or failures.”
Fault Tolerance Cont…
A characteristic feature of distributed systems is the notion of partial failure:
• A partial failure may happen when one component in a distributed system fails.
• This failure may affect the proper operation of other components, while at the same time leaving yet other components totally unaffected.
An important goal in distributed systems design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance.
Fault Classification

Faults are generally classified as transient, intermittent, or permanent:
• Transient fault: Occurs once and then disappears. If the operation is repeated, the fault goes away.
• Intermittent fault: Occurs, then vanishes of its own accord, then reappears, and so on. A loose contact on a connector will often cause an intermittent fault.
• Permanent fault: One that continues to exist until the faulty component is replaced. Burnt-out chips, software bugs, and disk head crashes are examples of permanent faults.
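The difference between a transient fault and a longer-lived one can be illustrated with a small retry loop. This is only a sketch: `TransientFault`, `retry`, and `make_flaky` are hypothetical names introduced for illustration, and the "fault" is simulated by a counter.

```python
class TransientFault(Exception):
    """Hypothetical error type standing in for a fault that may clear on retry."""


def retry(operation, attempts=3):
    """Repeat `operation` up to `attempts` times; re-raise if it never succeeds."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as err:
            last_error = err  # the fault may be transient, so try again
    raise last_error  # still failing: likely intermittent or permanent


def make_flaky(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientFault("temporary glitch")
        return "ok"

    return operation
```

With `make_flaky(1)` the second attempt succeeds, so the fault is masked. With `make_flaky(5)` and three attempts the error is re-raised, which is the signal that repetition alone is not enough.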
Failure Classification

Failures are generally classified into five categories:
• Crash failure: A server halts, but was working correctly until it halted.
• Omission failure: A server fails to respond to incoming requests.
• Timing failure: A server’s response lies outside the specified time interval.
• Response failure: A server’s response is incorrect.
• Arbitrary failure: A server may produce arbitrary responses at arbitrary times.
Failure Masking
Failure Masking Cont…

The most common approach to failure masking is redundancy, which is categorized into three types:
• Information redundancy: Add extra bits to allow recovery from garbled bits.
• Time redundancy: Repeat an action if needed.
• Physical redundancy: Add extra equipment or processes so that the system can tolerate the loss or malfunctioning of some components.
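Information redundancy can be sketched with a simple repetition code: every bit is sent three times and the receiver takes a majority vote. The scheme and the names `encode` and `decode` are assumptions chosen for illustration. Real systems usually use more efficient codes.

```python
def encode(bits, copies=3):
    """Repeat every bit `copies` times (the extra bits are the redundancy)."""
    return [b for b in bits for _ in range(copies)]


def decode(coded, copies=3):
    """Recover each original bit by majority vote over its copies."""
    decoded = []
    for i in range(0, len(coded), copies):
        group = coded[i:i + copies]
        decoded.append(1 if sum(group) * 2 > len(group) else 0)
    return decoded


message = [1, 0, 1, 1]
sent = encode(message)
garbled = list(sent)
garbled[1] ^= 1  # flip one bit in the first group
garbled[9] ^= 1  # flip one bit in the fourth group
recovered = decode(garbled)
```

One flipped bit per group is outvoted by the two intact copies, so `recovered` equals `message`. Two flips in the same group would defeat this particular code.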
Process Resilience

Objectives
• Introduction to Process Resilience.
• Flat Groups versus Hierarchical Groups.
• Failure Masking and Replication.
• Approaches for Replication.

Process Resilience

“Process resilience incorporates techniques by which one or more processes can fail without seriously disturbing the rest of the system.”
Process Resilience Cont…

Protection against process failures can be achieved by process replication, organizing several identical processes into a group.
Groups are categorized into two categories: Flat Group and Hierarchical Group.
Related to this issue is reliable multicasting, by which message transmission to a collection of processes is guaranteed to succeed. Reliable multicasting is often necessary to keep processes synchronized.
Flat Group

• All processes are equal.

• The processes make
decisions collectively.
• No single point of failure, but
decision making is more
complicated as consensus
is required.
Hierarchical Group

• A single coordinator makes all decisions.
• Single point of failure; however, decisions are easily and quickly made by the coordinator without first having to get consensus.
• Group is transparent to its users; the whole group is dealt with as a single process.
Failure Masking and Replication

By organizing a fault-tolerant group of processes, we can protect a single vulnerable process.
Two approaches to arranging the replication of the group are:
• Primary-based protocols.
• Replicated-write protocols.
Primary-Based Protocols

• Appears in the form of a primary-backup protocol.
• A group of processes is organized in a hierarchical fashion in which a primary coordinates all write operations.
• When the primary crashes, the backups execute some election algorithm to choose a new primary.
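The primary-backup idea can be sketched as a toy in-memory group. The class names and the election rule (the live replica with the highest id wins) are assumptions made for illustration. The slides do not prescribe a particular election algorithm.

```python
class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.alive = True
        self.store = {}


class PrimaryBackupGroup:
    """Toy primary-backup group; the primary coordinates all writes."""

    def __init__(self, replica_ids):
        self.replicas = [Replica(i) for i in replica_ids]
        self.primary = self.elect()

    def elect(self):
        # Assumed election rule: the live replica with the highest id wins.
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica left")
        return max(live, key=lambda r: r.replica_id)

    def write(self, key, value):
        if not self.primary.alive:
            self.primary = self.elect()  # backups choose a new primary
        # The write is applied at the primary and forwarded to live backups.
        for replica in self.replicas:
            if replica.alive:
                replica.store[key] = value

    def read(self, key):
        if not self.primary.alive:
            self.primary = self.elect()
        return self.primary.store[key]
```

If the primary is marked as crashed after a write, the next read triggers an election and is served by a backup that already holds the value.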
Replicated-Write Protocols

• Replicated-write protocols are used in the form of active replication, as well as by means of quorum-based protocols.
• Solutions correspond to organizing a collection of identical processes into a flat group.
• These groups have no single point of failure, at the cost of distributed coordination.
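A quorum-based protocol can be sketched as a versioned register spread over N replicas. The sketch assumes the usual overlap conditions (read quorum plus write quorum exceeds N, and a write quorum is larger than N/2). `QuorumStore` and its methods are hypothetical names.

```python
class QuorumStore:
    """Toy quorum-replicated register with N replicas."""

    def __init__(self, n, read_quorum, write_quorum):
        # Read and write quorums must overlap, and so must two write quorums.
        if read_quorum + write_quorum <= n or 2 * write_quorum <= n:
            raise ValueError("quorum sizes do not guarantee overlap")
        self.r, self.w = read_quorum, write_quorum
        # Each replica holds a (version, value) pair.
        self.replicas = [(0, None)] * n

    def write(self, value, replica_ids):
        if len(set(replica_ids)) < self.w:
            raise RuntimeError("write quorum not reached")
        # New version is one higher than any version seen in the quorum.
        version = max(self.replicas[i][0] for i in replica_ids) + 1
        for i in replica_ids:
            self.replicas[i] = (version, value)

    def read(self, replica_ids):
        if len(set(replica_ids)) < self.r:
            raise RuntimeError("read quorum not reached")
        # The highest version within any read quorum is the latest write.
        newest = max((self.replicas[i] for i in replica_ids), key=lambda p: p[0])
        return newest[1]
```

Because every read quorum intersects every write quorum, a read always sees at least one replica holding the most recent version, without any single coordinator.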
Reliable Client-Server
Communication

Objectives
• Understanding of Reliable Client-Server Communication.
• RPC Semantics in the Presence of Failures.
Reliable Client-Server Communication

Fault tolerance in distributed systems concentrates on faulty processes. However, communication failures should also be considered.
A communication channel may exhibit crash, omission, timing, and arbitrary failures.
Point-to-Point Communication

• Reliable point-to-point communication is established by making use of a reliable transport protocol, such as TCP.
• TCP masks omission failures, which occur in the form of lost messages, by using acknowledgments and retransmissions.
• Crash failures of connections are not masked. The only way to mask such failures is to let the distributed system attempt to automatically set up a new connection.
RPC Semantics in the Presence of Failures
The Remote Procedure Call (RPC) mechanism works well as long as both the client and server function perfectly.
Five classes of RPC failure can be identified:
• The client is unable to locate the server.
• The request message from the client to the server is lost.
• The server crashes after receiving a request.
• The reply message from the server to the client is lost.
• The client crashes after sending a request.
Server in Client-Server Communication

The sequence of events at a server is shown in the figure:
(a) A request arrives, is carried out, and a reply is sent.
(b) A request arrives and is carried out, just as before, but the server crashes before it can send the reply.
(c) Again, a request arrives, but this time the server crashes before the request can even be carried out, and no reply is sent back.
Server in Client-Server Communication
Cont..

Server crashes are dealt with by implementing one of three possible implementation philosophies:
• At-least-once semantics: A guarantee is given that the RPC occurred at least once, but (also) possibly more than once.
• At-most-once semantics: A guarantee is given that the RPC occurred at most once, but possibly not at all.
• No semantics: Nothing is guaranteed, and client and servers take their chances.
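The first two philosophies can be contrasted with a small simulation of a server that carries out the request and then crashes before replying. `Server`, `ServerCrashed`, and the two client functions are hypothetical names, and the sketch assumes the server reboots immediately after a crash.

```python
class ServerCrashed(Exception):
    """Raised to the client stub when no reply arrives (hypothetical name)."""


class Server:
    def __init__(self, crashes_before_reply=0):
        self.executions = 0
        self.crashes_left = crashes_before_reply

    def handle(self):
        self.executions += 1       # the request is carried out
        if self.crashes_left > 0:
            self.crashes_left -= 1
            raise ServerCrashed()  # crash before the reply is sent
        return "done"


def at_least_once(server):
    """Keep retrying until a reply arrives; the operation may run repeatedly."""
    while True:
        try:
            return server.handle()
        except ServerCrashed:
            continue  # assume the server reboots, then resend the request


def at_most_once(server):
    """Try once and report failure; the operation is never re-executed."""
    try:
        return server.handle()
    except ServerCrashed:
        return None
```

With one crash, the at-least-once client ends up executing the operation twice, while the at-most-once client executes it once and reports failure without knowing whether the work was done.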
Client in Client-Server Communication

When a client sends a request to a server and crashes before the server replies, a computation is active and no parent is waiting for the result. Such an unwanted computation is called an orphan. Four orphan solutions have been proposed:
• Extermination: The orphan is simply killed off.
• Reincarnation: Each client session has an epoch associated with it, making orphans easy to spot.
• Gentle reincarnation: When a new epoch is identified, an attempt is made to locate the request's owner; otherwise, the orphan is killed.
• Expiration: If the RPC cannot be completed within a standard amount of time, it is assumed to have expired.
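The reincarnation approach can be sketched as follows: each remote computation is tagged with the epoch of the client that started it, and a rebooted client announces a new epoch. `RpcServer` and its methods are hypothetical names, and the broadcast is modelled as a direct method call.

```python
class RpcServer:
    """Toy server that tags every running computation with its client's epoch."""

    def __init__(self):
        self.running = []  # list of (client id, epoch, task name)

    def start(self, client, epoch, task):
        self.running.append((client, epoch, task))

    def announce_epoch(self, client, epoch):
        """Called when a rebooted client announces its new epoch number."""
        # Computations from an older epoch of this client are orphans: drop them.
        self.running = [
            (c, e, t) for (c, e, t) in self.running
            if c != client or e >= epoch
        ]


server = RpcServer()
server.start("A", 1, "sum")
server.start("B", 1, "sort")
server.announce_epoch("A", 2)  # client A rebooted; its epoch-1 work is orphaned
server.start("A", 2, "sum")
```

Only client A's old computation is removed. Client B's work and A's new request are left alone.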
Reliable Group
Communication

Objectives
• Understanding of Reliable Group Communication.
• Reliable-Multicasting Schemes.
Reliable Group Communication

“Reliable multicast services guarantee that all messages are delivered to all members of a process group.”
Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting exists when all receivers are known and are assumed not to fail:
• The sending process assigns a sequence number to each message it multicasts.
• Assume that messages are received in the order they are sent.
• Each multicast message is stored locally in a history buffer at the sender.
• Assuming the receivers are known to the sender, the sender simply keeps the message in its history buffer until each receiver has returned an acknowledgment.
• If a receiver detects it is missing a message, it may return a negative acknowledgment, requesting a retransmission from the sender.
Basic Reliable-Multicasting Schemes
Cont..

(a) Message transmission – note that the third receiver is expecting 24.
(b) Reporting feedback – the third receiver informs the sender.
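The basic scheme (sequence numbers, a history buffer at the sender, and negative acknowledgments from receivers) can be sketched as a single-process simulation. `Sender`, `Receiver`, and the `drop_for` parameter are hypothetical names. The network is replaced by direct method calls.

```python
class Receiver:
    def __init__(self):
        self.expected = 0   # next sequence number this receiver expects
        self.delivered = []

    def receive(self, seq, payload):
        """Return ('ACK', seq) on success or ('NACK', missing_seq) on a gap."""
        if seq != self.expected:
            return ("NACK", self.expected)  # a message is missing
        self.delivered.append(payload)
        self.expected += 1
        return ("ACK", seq)


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.next_seq = 0
        self.history = {}   # seq -> payload, kept until everyone has ACKed
        self.pending = {}   # seq -> receiver indexes that still owe an ACK

    def multicast(self, payload, drop_for=()):
        """Send to every receiver; `drop_for` simulates lost transmissions."""
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.pending[seq] = set(range(len(self.receivers)))
        for idx in range(len(self.receivers)):
            if idx not in drop_for:
                self._send(idx, seq)

    def _send(self, idx, seq):
        kind, number = self.receivers[idx].receive(seq, self.history[seq])
        if kind == "NACK":
            # Retransmit from the missing message up to the one just sent.
            for missing in range(number, seq + 1):
                self._send(idx, missing)
        else:
            self.pending[seq].discard(idx)
            if not self.pending[seq]:
                # Every receiver has acknowledged: safe to forget the message.
                del self.pending[seq], self.history[seq]
```

If the first message is dropped for one receiver, that receiver answers the second message with a negative acknowledgment, the sender retransmits from its history buffer, and the buffer empties once all acknowledgments are in.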
Distributed Commit

Objectives
• Introduction to Distributed Commit.
• Distributed Commit Protocol Phases.
Distributed Commit

“The distributed commit problem involves having an operation being performed by each member of a process group, or none at all.”
Distributed Commit Cont…

• In the case of reliable multicasting, the operation is the delivery of a message.
• With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction.
• Other examples of distributed commit, and how it can be solved, are discussed in Tanisch (2000).
Distributed Commit Cont …

Commit protocols are divided into three types:
• Single-phase commit.
• Two-phase commit.
• Three-phase commit.
One-Phase Commit Protocol:

• The coordinator tells all other processes that are also involved, called participants, whether to (locally) perform the operation in question.
• If one of the participants cannot perform the operation, there is no way to tell the coordinator.
• It cannot efficiently handle the failure of the coordinator.
The solutions: Two-Phase and Three-Phase Commit Protocols.
Two-Phase Commit Protocol

“Assuming that no failures occur, the protocol consists of the following two phases, each consisting of two steps: The first phase is the voting phase, and the second phase is the decision phase.”
Two-Phase Commit Protocol Cont…

• The coordinator sends a VOTE_REQUEST message to all participants.
• A group member returns a VOTE_COMMIT message if it can commit locally, otherwise a VOTE_ABORT message.
• All votes are collected by the coordinator. A GLOBAL_COMMIT is sent if all the group members voted to commit. If one group member voted to abort, a GLOBAL_ABORT is sent.
• Group members then COMMIT or ABORT based on the last message received from the coordinator.
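The four steps can be sketched as a single function, assuming no crashes and no lost messages. `two_phase_commit` and its argument layout are hypothetical. Message passing is replaced by plain function logic.

```python
def two_phase_commit(participants):
    """Run 2PC over `participants`, a dict of name -> can_commit_locally.

    Returns the coordinator's decision and each participant's final state.
    Assumes no crashes and no lost messages.
    """
    # Phase 1 (voting): coordinator sends VOTE_REQUEST, participants vote.
    votes = {
        name: "VOTE_COMMIT" if can_commit else "VOTE_ABORT"
        for name, can_commit in participants.items()
    }
    # Phase 2 (decision): commit only if every vote was VOTE_COMMIT.
    if all(vote == "VOTE_COMMIT" for vote in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    # Each participant acts on the coordinator's last message.
    final = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"
    states = {name: final for name in participants}
    return decision, states
```

A single VOTE_ABORT is enough to make every participant abort, which is the all-or-nothing property that distributed commit asks for.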
Two-Phase Commit Protocol Cont…

(a) The finite state machine for the coordinator in 2PC.

(b) The finite state machine for a participant.
Drawbacks of Two-Phase Commit Protocol
• It can lead to both the coordinator and the participants blocking, which may lead to the dreaded deadlock.
• If the coordinator crashes, the participants may not be able to reach a final decision, and they may, therefore, block until the coordinator recovers.
• Two-Phase Commit is known as a blocking-commit protocol for this reason.

The solution: Three-Phase Commit Protocol

Three-Phase Commit Protocol (Pre-Commit)

Skeen (1981) developed a variant of 2PC, called the three-phase commit protocol (3PC), that avoids blocking processes in the presence of fail-stop crashes.
The states of the coordinator and each participant satisfy the following two conditions:
• There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
• There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
Three-Phase Commit Protocol Cont…

(a) The finite state machine for the coordinator in 3PC.

(b) The finite state machine for a participant.
Recovery

Objectives
• Basic Concept of Recovery.
• Types of Recovery.
Recovery

“The whole idea of error recovery is to replace an erroneous state with an error-free state. Once a failure has occurred, it is essential that the process where the failure happened recovers to a correct state.”
Recovery Cont…

Recovery from an error is fundamental to fault tolerance. Two main forms of recovery are:
• Backward Recovery: Return the system to some previous correct state (using checkpoints), then continue executing.
• Forward Recovery: When the system has entered an erroneous state, instead of moving back to a previous, checkpointed state, an attempt is made to bring the system into a correct new state from which it can continue to execute.
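Backward recovery with checkpoints can be sketched for a single process whose state is a dictionary. `CheckpointedProcess` and its methods are hypothetical names. A real system would write checkpoints to stable storage rather than keep them in memory.

```python
import copy


class CheckpointedProcess:
    """Toy process whose state can be rolled back to the last checkpoint."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record a known-correct state to return to later.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: replace the erroneous state with the saved one.
        self.state = copy.deepcopy(self._checkpoint)

    def apply(self, update):
        """Apply `update` to the state; roll back if it raises."""
        try:
            update(self.state)
        except Exception:
            self.rollback()
            return False
        return True
```

An update that corrupts the state and then fails leaves no trace after the rollback. The cost is that every checkpoint copies the whole state, which hints at why backward recovery is considered expensive.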
Backward Recovery

Advantages:
• Generally applicable method, independent of any specific system or process.
• It can be integrated into (the middleware layer of) a distributed system as a general-purpose service.

Disadvantages:
• Restoring a system or process to a previous state is generally a relatively costly operation in terms of performance.
• Because backward error recovery mechanisms are independent of the distributed application for which they are actually used, no guarantees can be given that once recovery has taken place, the same or a similar failure will not happen again.
Forward Recovery

Advantages:
• Generally has low overhead.

Disadvantages:
• It has to be known in advance which errors may occur. Only in that case is it possible to correct those errors and move to a new state.
• Only when an anticipated error occurs does the recovery mechanism know what to do to bring the system forward to a correct state.
