Week-04

Uploaded by

Imaan Mufti
Copyright
© © All Rights Reserved

Parallel and Distributed Computing

Introduction to Fault Tolerance


Objectives

• Introduction to Fault Tolerance.
• Fault Classification.
• Failure Classification.
• Failure Masking.
Fault Tolerance

“A fault-tolerant system is one that continues to provide the required functionality in the presence of faults or failures.”
Fault Tolerance Cont…
A characteristic feature of distributed systems is the notion of partial failure:
• A partial failure may happen when one component in a distributed system fails.
• This failure may affect the proper operation of other components, while at the same time leaving yet other components totally unaffected.
An important goal in distributed systems design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance.
Fault Classification

Faults are generally classified as transient, intermittent, or permanent:
• Transient fault: Occurs once and then disappears. If the operation is repeated, the fault goes away.
• Intermittent fault: Occurs, then vanishes of its own accord, then reappears, and so on. A loose contact on a connector will often cause an intermittent fault.
• Permanent fault: Continues to exist until the faulty component is replaced. Burnt-out chips, software bugs, and disk head crashes are examples of permanent faults.
Failure Classification

Failures are generally classified into five categories:
• Crash failure: A server halts, but works correctly until it halts.
• Omission failure: A server fails to respond to incoming requests.
• Timing failure: A server’s response lies outside the specified time interval.
• Response failure: A server’s response is incorrect.
• Arbitrary failure: A server may produce arbitrary responses at arbitrary times.
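As a rough illustration of the categories above, the sketch below maps what a client observes for one request onto a failure class. The names `Observation` and `classify` are hypothetical, the time interval is simplified to a single upper deadline, and a single missing reply cannot tell a crash from an omission, so the two are reported together. Arbitrary failures are not detectable this way and are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    reply: Optional[str]  # None means no reply was ever received
    elapsed: float        # seconds until the reply arrived (unused if reply is None)


def classify(obs: Observation, expected: str, deadline: float) -> str:
    """Map one observed request outcome onto a failure category."""
    if obs.reply is None:
        # From one silent request alone, crash and omission look the same.
        return "omission (or crash)"
    if obs.elapsed > deadline:
        # Assumption: the "specified time interval" is just an upper bound.
        return "timing"
    if obs.reply != expected:
        return "response"
    return "ok"
```

For example, `classify(Observation("41", 0.5), "42", 1.0)` reports a response failure, because the reply arrived in time but carries the wrong value.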
Failure Masking
Failure Masking Cont…

The most common approach to failure masking is redundancy, which is categorized into three types:
• Information redundancy: Add extra bits to allow recovery from garbled bits.
• Time redundancy: Repeat an action if needed.
• Physical redundancy: Add extra equipment or processes so that the system can tolerate the loss or malfunctioning of some components.
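The three kinds of redundancy can each be sketched in a few lines. This is only an illustration under simple assumptions: information redundancy is shown as a 3x repetition code (real systems typically use stronger codes), time redundancy as a bounded retry loop, and physical redundancy as a majority vote over replica outputs. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def encode_repetition(bits: Sequence[int]) -> list:
    """Information redundancy: send every bit three times."""
    return [b for b in bits for _ in range(3)]


def decode_repetition(coded: Sequence[int]) -> list:
    """Recover each bit by majority over its three copies."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0 for i in range(0, len(coded), 3)]


def with_retries(action: Callable[[], T], attempts: int = 3) -> T:
    """Time redundancy: repeat the action until it succeeds or attempts run out."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as err:  # a transient fault may vanish on the next try
            last_error = err
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


def majority_vote(replica_outputs: Sequence[T]) -> T:
    """Physical redundancy: accept the value produced by most replicas."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count <= len(replica_outputs) // 2:
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas, `majority_vote` masks one faulty output; with the repetition code, one flipped bit per group of three is corrected.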
Process Resilience


Objectives

• Introduction to Process Resilience.
• Flat Groups versus Hierarchical Groups.
• Failure Masking and Replication.
• Approaches for Replication.


Process Resilience

“Process resilience incorporates techniques by which one or more processes can fail without seriously disturbing the rest of the system.”
Process Resilience Cont…

Protection against process failures can be achieved by process replication, organizing several identical processes into a group.
Related to this issue is reliable multicasting, by which message transmission to a collection of processes is guaranteed to succeed. Reliable multicasting is often necessary to keep processes synchronized.
Groups are categorized into two categories: Flat Group and Hierarchical Group.
Flat Group

• All processes are equal.
• The processes make decisions collectively.
• No single point of failure, but decision making is more complicated as consensus is required.
Hierarchical Group

• A single coordinator makes all decisions.
• Single point of failure; however, decisions are easily and quickly made by the coordinator without first having to get consensus.
• Group is transparent to its users; the whole group is dealt with as a single process.
Failure Masking and Replication

By organizing a fault-tolerant group of processes, we can protect a single vulnerable process.
Two approaches to arranging the replication of the group are:
• Primary-based protocols.
• Replicated-write protocols.
Primary-Based Protocols

• Appears in the form of a primary-backup protocol.
• A group of processes is organized in a hierarchical fashion in which a primary coordinates all write operations.
• When the primary crashes, the backups execute some election algorithm to choose a new primary.
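A minimal in-memory sketch of a primary-backup group follows. The slides do not name an election algorithm, so the rule used here (the live replica with the highest id becomes primary, in the spirit of the bully algorithm) is an assumption, as are the class names `Replica` and `PrimaryBackupGroup`. Crashes are simulated by flipping an `alive` flag.

```python
class Replica:
    def __init__(self, replica_id: int) -> None:
        self.replica_id = replica_id
        self.alive = True
        self.store = {}  # key -> value held by this replica


class PrimaryBackupGroup:
    def __init__(self, replicas) -> None:
        self.replicas = list(replicas)
        self.primary = self._elect()

    def _elect(self) -> Replica:
        # Assumed election rule: highest id among live replicas wins.
        live = [r for r in self.replicas if r.alive]
        if not live:
            raise RuntimeError("no live replica left")
        return max(live, key=lambda r: r.replica_id)

    def _ensure_primary(self) -> None:
        if not self.primary.alive:
            self.primary = self._elect()

    def write(self, key: str, value: str) -> None:
        """The primary coordinates the write and forwards it to live backups."""
        self._ensure_primary()
        for replica in self.replicas:
            if replica.alive:
                replica.store[key] = value

    def read(self, key: str) -> str:
        self._ensure_primary()
        return self.primary.store[key]
```

Because every live backup applies each write, whichever backup wins the election already holds the latest value.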
Replicated-Write Protocols

• Replicated-write protocols are used in the form of active replication, as well as by means of quorum-based protocols.
• Solutions correspond to organizing a collection of identical processes into a flat group.
• These groups have no single point of failure, at the cost of distributed coordination.
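The slides mention quorum-based protocols only by name. As a hedged sketch, the code below uses the usual quorum constraints from the literature (read quorum plus write quorum exceeds the number of copies, and a write quorum is more than half of them), which the slides do not spell out. `Copy`, `valid_quorum`, `quorum_write`, and `quorum_read` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    version: int = 0
    value: str = ""


def valid_quorum(n: int, read_q: int, write_q: int) -> bool:
    """Read and write sets must overlap, and so must any two write sets."""
    return read_q + write_q > n and 2 * write_q > n


def quorum_write(copies, write_set, value: str) -> None:
    # Any two write sets overlap, so the newest version is always visible here.
    new_version = max(copies[i].version for i in write_set) + 1
    for i in write_set:
        copies[i] = Copy(new_version, value)


def quorum_read(copies, read_set) -> str:
    # At least one copy in the read set took part in the latest write.
    newest = max((copies[i] for i in read_set), key=lambda c: c.version)
    return newest.value
```

With five copies and quorums of three, a read of copies 0, 1, and 2 still sees a write that went to copies 2, 3, and 4, because copy 2 is in both sets.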
Reliable Client-Server Communication


Objectives

• Understanding of Reliable Client-Server Communication.
• RPC Semantics in the Presence of Failures.
Reliable Client-Server Communication

Fault tolerance in distributed systems concentrates on faulty processes. However, communication failures should also be considered.
A communication channel may exhibit crash, omission, timing, and arbitrary failures.
Point-to-Point Communication

• Reliable point-to-point communication is established by making use of a reliable transport protocol, such as TCP.
• TCP masks omission failures, which occur in the form of lost messages, by using acknowledgments and retransmissions.
• Crash failures of connections are not masked. The only way to mask such failures is to let the distributed system attempt to automatically set up a new connection.
RPC Semantics in the Presence of Failures

The Remote Procedure Call (RPC) mechanism works well as long as both the client and server function perfectly.
Five classes of RPC failure can be identified:
• The client is unable to locate the server.
• The request message from the client to the server is lost.
• The server crashes after receiving a request.
• The reply message from the server to the client is lost.
• The client crashes after sending a request.
Server in Client-Server Communication

The sequence of events at a server is shown in Fig.
(a) A request arrives, is carried out, and a reply is sent.
(b) A request arrives and is carried out, just as before, but the server crashes before it can send the reply.
(c) Again, a request arrives, but this time the server crashes before the request can even be carried out, and no reply is sent back.
Server in Client-Server Communication Cont…

Server crashes are dealt with by implementing one of three possible implementation philosophies:
• At least once semantics: A guarantee is given that the RPC occurred at least once, but (also) possibly more than once.
• At most once semantics: A guarantee is given that the RPC occurred at most once, but possibly not at all.
• No semantics: Nothing is guaranteed, and clients and servers take their chances.
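The difference between the first two philosophies can be sketched with a simulated server that crashes without replying. This is an illustration only: `FlakyServer`, `ServerCrash`, `at_least_once`, and `at_most_once` are hypothetical names, and a "crash" is modelled as an exception seen by the client, followed by an instant reboot.

```python
from typing import Optional


class ServerCrash(Exception):
    """Raised to the caller when the simulated server dies without replying."""


class FlakyServer:
    def __init__(self, crashes: int, execute_before_crash: bool = True) -> None:
        self.crashes_left = crashes
        self.execute_before_crash = execute_before_crash
        self.executions = 0  # how many times the operation was carried out

    def handle(self) -> str:
        if self.crashes_left > 0:
            self.crashes_left -= 1
            if self.execute_before_crash:
                self.executions += 1  # carried out, but the reply is never sent
            raise ServerCrash("no reply")
        self.executions += 1
        return "ok"


def at_least_once(server: FlakyServer, max_tries: int = 10) -> str:
    """Keep retrying until a reply arrives; the operation may run several times."""
    for _ in range(max_tries):
        try:
            return server.handle()
        except ServerCrash:
            continue
    raise ServerCrash("gave up")


def at_most_once(server: FlakyServer) -> Optional[str]:
    """Try once and report failure; the operation runs once or not at all."""
    try:
        return server.handle()
    except ServerCrash:
        return None
```

With two crashes that each happen after execution, `at_least_once` ends up running the operation three times, which is only safe if the operation is idempotent. With `at_most_once`, the client gets `None` and cannot tell whether the operation ran.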
Client in Client-Server Communication

A client may send a request to a server and crash before the server replies. At this point a computation is active and no parent is waiting for the result. Such an unwanted computation is called an orphan. Four orphan solutions have been proposed:
• Extermination: The orphan is simply killed off.
• Reincarnation: Each client session has an epoch associated with it, making orphans easy to spot.
• Gentle reincarnation: When a new epoch is identified, an attempt is made to locate a request’s owner; otherwise the orphan is killed.
• Expiration: If the RPC cannot be completed within a standard amount of time, it is assumed to have expired.
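The reincarnation idea can be sketched as a server that remembers the latest epoch per client and kills any computation started under an older one. This is a simplified, single-server illustration; `RpcServer`, `start_job`, and `new_epoch` are hypothetical names, and "killing" a job is modelled as removing it from a table.

```python
class RpcServer:
    def __init__(self) -> None:
        self.current_epoch = {}  # client -> latest epoch seen
        self.jobs = {}           # job id -> (client, epoch)

    def new_epoch(self, client: str, epoch: int) -> list:
        """Record the client's epoch and kill its jobs from older epochs."""
        latest = max(epoch, self.current_epoch.get(client, epoch))
        self.current_epoch[client] = latest
        orphans = [job for job, (c, e) in self.jobs.items()
                   if c == client and e < latest]
        for job in orphans:
            del self.jobs[job]
        return orphans

    def start_job(self, job_id: str, client: str, epoch: int) -> bool:
        """Accept a request unless it comes from a stale client session."""
        self.new_epoch(client, epoch)
        if epoch < self.current_epoch[client]:
            return False
        self.jobs[job_id] = (client, epoch)
        return True
```

When client A reboots and announces epoch 2, every job it started in epoch 1 is identified as an orphan and removed, while jobs of other clients are untouched.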
Reliable Group Communication


Objectives

• Understanding of Reliable Group Communication.
• Reliable-Multicasting Schemes.
Reliable Group Communication

“Reliable multicast services guarantee that all messages are delivered to all members of a process group.”
Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting exists when all receivers are known and are assumed not to fail:
• The sending process assigns a sequence number to each message it multicasts.
• Assume that messages are received in the order they are sent.
• Each multicast message is stored locally in a history buffer at the sender.
• Assuming the receivers are known to the sender, the sender simply keeps the message in its history buffer until each receiver has returned an acknowledgment.
• If a receiver detects it is missing a message, it may return a negative acknowledgment, requesting a retransmission from the sender.
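The scheme above can be sketched without any networking by passing messages as `(seq, payload)` tuples. The class and method names are hypothetical, and one simplification is assumed: a receiver that sees a gap discards the out-of-order message and asks for the missing one, so the later message has to be retransmitted as well.

```python
class Sender:
    def __init__(self, receivers) -> None:
        self.receivers = set(receivers)
        self.next_seq = 0
        self.history = {}  # seq -> (payload, receivers that have not ACKed yet)

    def multicast(self, payload: str):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = (payload, set(self.receivers))
        return seq, payload

    def on_ack(self, receiver: str, seq: int) -> None:
        entry = self.history.get(seq)
        if entry is None:
            return  # already fully acknowledged
        entry[1].discard(receiver)
        if not entry[1]:
            del self.history[seq]  # every receiver has it; free the buffer

    def retransmit(self, seq: int):
        return seq, self.history[seq][0]


class Receiver:
    def __init__(self) -> None:
        self.expected = 0
        self.delivered = []

    def receive(self, seq: int, payload: str):
        """Return ('ack', seq) or ('nack', missing_seq) as feedback."""
        if seq > self.expected:
            return ("nack", self.expected)  # gap detected; message is dropped
        if seq == self.expected:
            self.delivered.append(payload)
            self.expected += 1
        return ("ack", seq)  # also acknowledges harmless duplicates
```

The sender's history buffer only shrinks once every known receiver has acknowledged a message, which is what makes a later retransmission possible.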
Basic Reliable-Multicasting Schemes Cont…

(a) Message transmission – note that the third receiver is expecting 24.
(b) Reporting feedback – the third receiver informs the sender.
Distributed Commit


Objectives

• Introduction to Distributed Commit.
• Distributed Commit Protocol Phases.
Distributed Commit

“The distributed commit problem involves having an operation being performed by each member of a process group, or none at all.”
Distributed Commit Cont…

• In the case of reliable multicasting, the operation is the delivery of a message.
• With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction.
• Other examples of distributed commit, and how it can be solved, are discussed in Tanisch (2000).
Distributed Commit Cont …

Commit protocols are divided into three types:
• Single-phase commit
• Two-phase commit
• Three-phase commit
One-Phase Commit Protocol:

• The coordinator tells all other processes that are also involved, called participants, whether to (locally) perform the operation in question.
• If one of the participants cannot perform the operation, there is no way to tell the coordinator.
• It cannot efficiently handle the failure of the coordinator.
The solutions: Two-Phase and Three-Phase Commit Protocols
Two-Phase Commit Protocol

“Assuming that no failures occur, the protocol consists of the following two phases, each consisting of two steps: The first phase is the voting phase, and the second phase is the decision phase.”
Two-Phase Commit Protocol Cont…

• The coordinator sends a VOTE_REQUEST message to all participants.
• A group member returns VOTE_COMMIT if it can commit locally, otherwise a VOTE_ABORT message.
• All votes are collected by the coordinator. A GLOBAL_COMMIT is sent if all the group members voted to commit. If one group member voted to abort, a GLOBAL_ABORT is sent.
• Group members then COMMIT or ABORT based on the last message received from the coordinator.
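The four steps can be sketched for the failure-free case as follows. Message passing is replaced by direct method calls, and the `Participant` class, its state strings, and the `two_phase_commit` function are hypothetical names chosen for illustration. Timeouts and coordinator crashes are deliberately not handled.

```python
from enum import Enum


class Vote(Enum):
    VOTE_COMMIT = "VOTE_COMMIT"
    VOTE_ABORT = "VOTE_ABORT"


class Decision(Enum):
    GLOBAL_COMMIT = "GLOBAL_COMMIT"
    GLOBAL_ABORT = "GLOBAL_ABORT"


class Participant:
    def __init__(self, name: str, can_commit: bool) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self) -> Vote:
        # Voting phase, step 2: answer the coordinator's VOTE_REQUEST.
        if self.can_commit:
            self.state = "READY"
            return Vote.VOTE_COMMIT
        self.state = "ABORT"
        return Vote.VOTE_ABORT

    def on_decision(self, decision: Decision) -> None:
        # Decision phase, step 2: act on the coordinator's global decision.
        self.state = "COMMIT" if decision is Decision.GLOBAL_COMMIT else "ABORT"


def two_phase_commit(participants) -> Decision:
    # Voting phase, step 1: the coordinator asks every participant to vote.
    votes = [p.on_vote_request() for p in participants]
    # Decision phase, step 1: commit only if every vote was VOTE_COMMIT.
    if all(v is Vote.VOTE_COMMIT for v in votes):
        decision = Decision.GLOBAL_COMMIT
    else:
        decision = Decision.GLOBAL_ABORT
    for p in participants:
        p.on_decision(decision)
    return decision
```

A single participant created with `can_commit=False` is enough to drive every member of the group to the ABORT state.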
Two-Phase Commit Protocol Cont…

(a) The finite state machine for the coordinator in 2PC.


(b) The finite state machine for a participant.
Drawbacks of Two-Phase Commit Protocol
It can lead to both the coordinator and the participants blocking,
which may lead to the dreaded deadlock.

If the coordinator crashes, the participants may not be able to reach a final decision, and they may, therefore, block until the coordinator recovers.

Two-Phase Commit is known as a blocking-commit protocol for this reason.

The solution: Three-Phase Commit Protocol


Three-Phase Commit Protocol (Pre Commit)

Skeen (1981) developed a variant of 2PC, called the three-phase commit protocol (3PC), that avoids blocking processes in the presence of fail-stop crashes.
The states of the coordinator and each participant satisfy the following two conditions:
• There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
• There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
Three-Phase Commit Protocol Cont…

(a) The finite state machine for the coordinator in 3PC.


(b) The finite state machine for a participant.
Recovery


Objectives

• Basic Concept of Recovery.
• Types of Recovery.
Recovery

“The whole idea of error recovery is to replace an erroneous state with an error-free state. Once a failure has occurred, it is essential that the process where the failure happened recovers to a correct state.”
Recovery Cont…

Recovery from an error is fundamental to fault tolerance. Two main forms of recovery are:
• Backward Recovery: Return the system to some previous correct state (using checkpoints), then continue executing.
• Forward Recovery: When the system has entered an erroneous state, instead of moving back to a previous, checkpointed state, an attempt is made to bring the system into a correct new state from which it can continue to execute.
Backward Recovery

Advantages:
• Generally applicable method, independent of any specific system or process.
• It can be integrated into the middleware layer of a distributed system as a general-purpose service.
Disadvantages:
• Restoring a system or process to a previous state is generally a relatively costly operation in terms of performance.
• Because backward error recovery mechanisms are independent of the distributed application for which they are actually used, no guarantees can be given that once recovery has taken place, the same or similar failure will not happen again.
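Backward recovery with checkpoints can be sketched for a single process whose state is a plain Python object. `CheckpointedProcess` and its methods are hypothetical names, and checkpoints are kept in memory here, whereas a real system would write them to stable storage.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state) -> None:
        self.state = state
        self._checkpoints = [copy.deepcopy(state)]  # initial state is checkpoint 0

    def checkpoint(self) -> None:
        """Save a snapshot of the current (assumed correct) state."""
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self) -> None:
        """Backward recovery: replace the current state with the last snapshot."""
        self.state = copy.deepcopy(self._checkpoints[-1])

    def run(self, step) -> bool:
        """Run one step; on any error, roll back and report failure."""
        try:
            step(self.state)
        except Exception:
            self.rollback()
            return False
        return True
```

The deep copies taken on every checkpoint and rollback are what makes this approach generally applicable, and they are also the source of the performance cost noted above.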
Forward Recovery

Advantages:
• Generally has low overhead.
Disadvantages:
• It has to be known in advance which errors may occur. Only in that case is it possible to correct those errors and move to a new state, because only then does the recovery mechanism know what to do to bring the system forward to a correct state.
