
08 Falhas

This document discusses fault tolerance in distributed systems. It begins by defining basic concepts like availability, reliability, safety, and maintainability. It then discusses different types of failures like transient, intermittent, and permanent faults. Various failure models are described like crash, omission, timing, and response failures. Redundancy is discussed as a way to mask failures through information, time, and physical redundancy. Process replication is described as a way to achieve fault tolerance. The challenges of agreement in faulty systems and reliable communication are then covered, specifically focusing on point-to-point communication, RPC semantics, and reliable group communication.

Uploaded by

Daniella Costa


Fault Tolerance

Chapter 7
Introduction to fault tolerance
** Fault Tolerance

An important goal in distributed systems design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance.

Whenever a failure occurs, the distributed system should continue to operate in an acceptable way while repairs are being made.
** Basic Concepts
Fault tolerance is strongly related to dependable systems.
Dependability includes:
Availability: the property that a system is ready to be used immediately.

Reliability: the property that a system can run continuously without failure.

Safety: the property that when a system temporarily fails to operate correctly, nothing catastrophic happens.

Maintainability: how easily a failed system can be repaired.


** Basic Concepts
Fail (verb): in Portuguese, "faltar" or "falhar".
Fault (noun): in Portuguese, "falta" or "defeito".

A system is said to fail when it cannot meet its promises.
An error is a part of a system's state that may lead to a failure.
The cause of an error is called a fault.
** Basic Concepts
Fault classification

A transient fault occurs once and then disappears.

An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on.

A permanent fault is one that continues to exist until the faulty component is repaired.
Failure Models

Type of failure             Description
--------------------------  ------------------------------------------------------------
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times

Different types of failures.

** Failure Masking by Redundancy
If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes.
Kinds of redundancy:

Information redundancy: extra bits are added to allow recovery from garbled bits. Example: Hamming code.

Time redundancy: an action is performed and then, if need be, it is performed again.

Physical redundancy: extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some components.
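
As an illustration of information redundancy, here is a minimal sketch of a Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. The function names and the bit layout (parity bits at positions 1, 2 and 4) are choices made for this example, not something the slides prescribe.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Positions 1..7 hold p1 p2 d1 p3 d2 d3 d4 (parity bits at powers of two).
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0: no error; otherwise the 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the garbled bit back
    return [c[2], c[4], c[5], c[6]]
```

Flipping any one of the seven bits of an encoded word and decoding it returns the original four data bits. Two or more flipped bits are beyond what this code can repair.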
Physical Redundancy Example

Triple modular redundancy.
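
The voter at the heart of triple modular redundancy can be sketched in a few lines: three replicas compute the same value and the voter forwards the majority. The function name `tmr_vote` is hypothetical, and raising an error when all three inputs differ is an assumption of this sketch.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority of three replica outputs; raise if all three disagree."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value
```

With one faulty replica the voter still outputs the correct value, for example `tmr_vote(7, 7, 9)` yields 7. Two simultaneous faults cannot be masked this way.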

Process resilience

How fault tolerance can actually be achieved in distributed systems
** Design issues

The key approach to tolerating a faulty process is to organize several identical processes into a group.

The purpose of introducing groups is to allow processes to deal with collections of processes as a single abstraction.
Design issues

Flat Groups versus Hierarchical Groups

Communication in a flat group.
Communication in a simple hierarchical group
** Design issues
Group membership: when group communication is present, some method is needed for creating and deleting groups.

One possibility is a group server to which all these requests can be sent.

The opposite approach is to manage group membership in a distributed way.
** Failure Masking and Replication
Process groups are part of the solution for building fault-tolerant systems.

We can mask faults by replicating processes and organizing them into a group, so that a single process is replaced by a group.

All of these solutions build on the replication problem addressed in previous chapters.
** Agreement in Faulty Systems
Organizing replicated processes into a group helps to increase fault tolerance, as long as the processes do not team up to produce a wrong result.

Agreement is therefore needed in many cases.

The general goal of distributed agreement algorithms
is to have all the non-faulty processes reach
consensus on some issue, and to establish that
consensus within a finite number of steps.
Reliable client-server communication
** Point-to-point communication

Reliable point-to-point communication is established by making use of a reliable transport protocol such as TCP.

Crash failures of connections are often not masked.

A crash failure may occur when a TCP connection is abruptly broken so that no more messages can be transmitted through the channel.
** RPC semantics in the presence of failures
To structure our discussion, let us distinguish between five
different classes of failures in RPC systems.

1. The client is unable to locate the server

2. The request message from the client to the server is lost

3. The server crashes after receiving a request

4. The reply message from the server to the client is lost

5. The client crashes after sending a request

** RPC semantics in the presence of failures

The client is unable to locate the server

One possible solution is to have the error raise an exception or invoke a signal handler.

One drawback is that not every language supports exceptions or signals.

Another drawback is that having to write an exception or signal handler destroys transparency.
** RPC semantics in the presence of failures

The request message from the client to the server is lost

This is the easiest one to deal with: just have the operating system or client stub start a timer when sending the request.

If the timer expires before a reply comes back, the request is resent.

If many requests in a row are lost, the client eventually gives up and concludes that the server is down.
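
The timer-and-resend behaviour can be sketched as follows. This is a simulation under stated assumptions, not a real RPC stack: `LossyServer` is a hypothetical stand-in whose `handle` method returns `None` to model "the timer expired and no reply arrived", and the limit of three attempts is arbitrary.

```python
class ServerDown(Exception):
    """Raised when the client gives up and concludes the server is down."""


class LossyServer:
    """Hypothetical server behind a network that loses the first few requests."""

    def __init__(self, drop_first):
        self.drop_first = drop_first   # how many requests are lost before one gets through
        self.seen = 0

    def handle(self, request):
        self.seen += 1
        if self.seen <= self.drop_first:
            return None                # models a timer expiry: no reply came back
        return f"reply to {request}"


def call_with_retry(server, request, max_attempts=3):
    """Resend the request each time the (simulated) timer expires."""
    for attempt in range(1, max_attempts + 1):
        reply = server.handle(request)
        if reply is not None:
            return reply, attempt
    raise ServerDown(f"no reply after {max_attempts} attempts")
```

For example, `call_with_retry(LossyServer(2), "read")` succeeds on the third attempt, while a server that drops five requests makes the client raise `ServerDown`. Note that the client cannot tell a lost request from a lost reply, so blind resending is only safe for operations that can be repeated without harm.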
RPC semantics in the presence of failures
The server crashes after receiving a request

A server in client-server communication: (i) Normal case; (ii) Crash after execution; and (iii) Crash before execution.

Options for handling the crash: (i) keep trying until a reply has been received (at-least-once semantics); (ii) give up immediately and report back failure (at-most-once semantics); (iii) guarantee nothing.
** RPC semantics in the presence of failures

The reply message from the server to the client is lost

The obvious solution is to rely again on a timer that has been set by the client's operating system.

If no reply is forthcoming within a reasonable period, just send the request once more.
** RPC semantics in the presence of failures
The client crashes after sending a request: this leaves an unwanted computation, called an orphan. For example, the client reboots and does the RPC again, but the reply from the orphan comes back immediately afterward. What can be done about orphans?

(i) Extermination: before the client stub sends an RPC message, it makes a log entry telling what it is about to do. After a reboot, the log is checked and the orphan is explicitly killed.

(ii) Reincarnation: when a client reboots, it broadcasts a message to all machines declaring the start of a new epoch. All old computations of that client are then killed.

(iii) Gentle reincarnation: when an epoch broadcast comes in, each machine checks whether it has any remote computations and, if so, tries to locate their owner. Only if the owner cannot be found is the computation killed.

(iv) Expiration: each RPC is given a standard amount of time T to do the job. If the client waits a time T after rebooting, all orphans are sure to be gone.
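
Option (ii), the epoch broadcast, can be sketched as a server that tags every remote computation with the epoch of the client that started it. The class and method names below are hypothetical, and a "computation" is reduced to a job name for the sake of the example.

```python
class OrphanAwareServer:
    """Tracks remote computations per client, tagged with the client's epoch."""

    def __init__(self):
        self.epoch = {}     # client id -> latest epoch announced by that client
        self.running = []   # list of (client id, epoch, job name)

    def start(self, client, epoch, job):
        # Refuse work that belongs to an epoch older than the latest one announced.
        if epoch < self.epoch.get(client, 0):
            return False
        self.epoch[client] = epoch
        self.running.append((client, epoch, job))
        return True

    def new_epoch(self, client, epoch):
        """Handle the reboot broadcast: kill that client's older computations."""
        self.epoch[client] = epoch
        killed = [j for (c, e, j) in self.running if c == client and e < epoch]
        self.running = [(c, e, j) for (c, e, j) in self.running
                        if not (c == client and e < epoch)]
        return killed
```

After client A reboots and announces epoch 2, its epoch-1 jobs are killed, while the jobs of other clients keep running. A late request that still carries epoch 1 is rejected.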
Reliable group communication
** Basic Reliable-Multicasting Schemes
Reliable multicasting means that a message that is sent to a process group should be delivered to each member of that group.

What happens if a process joins the group during the communication? Should that process also receive the message?

We should also determine what happens if a (sending) process crashes during the communication.
Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. a) Message transmission; b) Reporting feedback.
When the sender receives a negative acknowledgement, the missing message is retransmitted.
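
A minimal sketch of this scheme, assuming the sender numbers its messages and keeps them in a history buffer: a receiver detects a gap in the sequence numbers and reports the missing ones as negative acknowledgements. The class names and the in-memory "network" are illustrative assumptions.

```python
class Sender:
    def __init__(self):
        self.history = {}    # sequence number -> message, kept for retransmission
        self.next_seq = 0

    def multicast(self, message):
        seq = self.next_seq
        self.history[seq] = message
        self.next_seq += 1
        return seq, message

    def retransmit(self, seq):
        return seq, self.history[seq]


class Receiver:
    def __init__(self):
        self.expected = 0    # next sequence number to deliver
        self.delivered = []
        self.pending = {}    # out-of-order messages waiting for a gap to be filled

    def receive(self, seq, message):
        """Accept a message and return the sequence numbers to NACK (the gap)."""
        if seq >= self.expected:
            self.pending[seq] = message
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        highest = max(self.pending, default=-1)
        return [s for s in range(self.expected, highest) if s not in self.pending]
```

If message 1 is lost, receiving message 2 makes the receiver report `[1]`. Once the sender retransmits it from its history buffer, messages 1 and 2 are delivered in order.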
Scalability in Reliable Multicasting
Nonhierarchical Feedback Control
The objective is to reduce the number of feedback messages.

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others.
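
Feedback suppression can be sketched with random timers: every receiver that missed the message waits a random delay before multicasting its retransmission request, and cancels its own request if it sees someone else's first. The function names, the uniform delay distribution and the `propagation` parameter are assumptions of this sketch.

```python
import random


def schedule_nacks(receivers, max_delay, rng):
    """Each receiver that missed the message picks a random feedback delay."""
    return {r: rng.uniform(0, max_delay) for r in receivers}


def nacks_sent(timers, propagation=0.0):
    """Return the receivers that actually multicast a retransmission request.

    The earliest timer fires first. Its request reaches the other receivers
    after `propagation` time units and cancels every timer still pending.
    """
    first = min(timers.values())
    return sorted(r for r, t in timers.items() if t <= first + propagation)
```

A typical use is `nacks_sent(schedule_nacks(["r1", "r2", "r3"], 1.0, random.Random()))`. With zero propagation delay only the receiver with the earliest timer sends feedback; with a larger delay, receivers whose timers fire before the first request arrives still send theirs, which is why suppression reduces feedback but does not eliminate duplicates.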
Hierarchical Feedback Control

The essence of hierarchical reliable multicasting.

a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.
** Atomic multicast
The atomic multicast problem arises when a message sent to a group is lost at some member because a process crashes: the message should be delivered to all processes or to none at all.

Example: consider a replicated database. Update operations are always multicast to all replicas. During an update a replica crashes. That update is lost for that replica, but it is performed at the other replicas.

The update is performed only if the remaining replicas have agreed that the crashed replica no longer belongs to the group. When the replica recovers, it must explicitly rejoin the group.
This is achieved by virtual synchrony.
** Atomic multicast
Message ordering

Unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes. It is the simplest variant.

FIFO-ordered multicast: the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent.

Totally-ordered multicast: when messages are delivered, they are delivered in the same order to all group members.
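
FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue. The sketch below assumes each sender numbers its messages from 0; the class name and the tuple format of `delivered` are illustrative.

```python
from collections import defaultdict


class FifoReceiver:
    """Hold-back queue that delivers each sender's messages in send order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next sequence number to deliver
        self.held = defaultdict(dict)      # sender -> {seq: message} received early
        self.delivered = []

    def receive(self, sender, seq, message):
        if seq < self.next_seq[sender]:
            return                         # duplicate of an already delivered message
        self.held[sender][seq] = message
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            msg = self.held[sender].pop(self.next_seq[sender])
            self.delivered.append((sender, msg))
            self.next_seq[sender] += 1
```

A message that arrives ahead of its predecessor from the same sender is held back until the gap is filled. Messages from different senders are not ordered relative to each other, which is exactly what separates FIFO ordering from total ordering.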
Distributed commit
** Distributed commit
The distributed commit problem involves having an operation performed by each member of a process group, or by none at all. It is often established by means of a coordinator.

In reliable multicasting, the operation is the delivery of a message.

With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction.
** One-phase commit

In one-phase commit, the coordinator tells all other processes that are also involved whether or not to perform the operation in question.

The drawback is that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator.
** Two-phase commit
Two-phase commit consists of a voting phase and a decision phase, described in the following steps:
1. The coordinator sends a VOTE_REQ message to all participants.

2. When a participant receives a VOTE_REQ message, it returns either a VOTE_COMMIT or a VOTE_ABORT message.

3. The coordinator collects all votes from the participants. If all participants vote to commit the transaction, the coordinator sends a GLOBAL_COMMIT message; otherwise, it sends a GLOBAL_ABORT message.

4. Each participant waits for the coordinator's decision. If it receives a GLOBAL_COMMIT message, it commits the transaction. Otherwise, if it receives a GLOBAL_ABORT message, the transaction is aborted.
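
The four steps can be sketched as a failure-free, single-process simulation. The message names come from the steps above; the `Participant` class, its `can_commit` flag and the state labels are assumptions made for illustration. Timeouts and crashes, which are what make the real protocol hard, are deliberately left out.

```python
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"


class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit   # hypothetical local decision
        self.state = "INIT"

    def on_vote_req(self):
        # Step 2: answer the coordinator's VOTE_REQ.
        self.state = "READY" if self.can_commit else "ABORT"
        return VOTE_COMMIT if self.can_commit else VOTE_ABORT

    def on_decision(self, decision):
        # Step 4: apply the coordinator's global decision.
        self.state = "COMMIT" if decision == GLOBAL_COMMIT else "ABORT"


def two_phase_commit(participants):
    # Steps 1 and 3: request and collect every vote, then send a single decision.
    votes = [p.on_vote_req() for p in participants]
    decision = GLOBAL_COMMIT if all(v == VOTE_COMMIT for v in votes) else GLOBAL_ABORT
    for p in participants:
        p.on_decision(decision)
    return decision
```

A single VOTE_ABORT is enough to abort the transaction everywhere, including at the participants that voted to commit.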
Two-phase commit

The finite state machine for the coordinator.

The finite state machine for a participant.
Three-Phase Commit
A problem with two-phase commit is that when the coordinator has crashed, participants may be unable to reach a final decision and must block until the coordinator recovers. Three-phase commit avoids blocking processes in the presence of fail-stop crashes.

a) Finite state machine for the coordinator in 3PC

b) Finite state machine for a participant
Recovery
**Introduction
In backward recovery, the main issue is to bring the system from its present erroneous state back into a previously correct state. To do this, snapshots of the system state must be taken from time to time.

In forward recovery, when the system has just entered an erroneous state, an attempt is made to bring it into a correct new state from which it can continue to execute.
Recovery: Stable Storage

a) Stable storage; b) Crash after drive 1 is updated; c) Bad spot.
Checkpointing

A recovery line. Recovery is based entirely on distributed snapshots: the system rolls back to the most recent distributed snapshot.
Independent Checkpointing

The domino effect.

If these local states jointly do not form a distributed snapshot, further rolling back is necessary.
The rollback continues until a consistent state is reached.
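
The cascading rollback can be sketched as follows, under simplifying assumptions that are not in the slides: each checkpoint records how many messages the process has sent to and received from each peer, and the first checkpoint of every process is its initial state with all counts at zero. A set of checkpoints is treated as consistent when no process has recorded receiving a message that its sender has not yet recorded sending.

```python
def is_consistent(cut):
    """cut maps process -> checkpoint {"sent": {peer: n}, "recv": {peer: n}}."""
    for q, cq in cut.items():
        for p, received in cq["recv"].items():
            if received > cut[p]["sent"].get(q, 0):
                return False   # q remembers a message that p has not sent yet
    return True


def recovery_line(checkpoints):
    """checkpoints maps process -> list of checkpoints, oldest first.

    Start from every process's latest checkpoint and keep rolling back any
    process whose checkpoint records a message its sender has not sent.
    Returns the index of the chosen checkpoint for each process.
    """
    index = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for q in checkpoints:
            cq = checkpoints[q][index[q]]
            for p, received in cq["recv"].items():
                sent = checkpoints[p][index[p]]["sent"].get(q, 0)
                if received > sent and index[q] > 0:
                    index[q] -= 1      # roll q back one checkpoint
                    changed = True
                    break
    return index
```

When the latest checkpoints were taken without coordination, one rollback can force another, and in the worst case every process ends up back at its initial state. That cascade is the domino effect.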
