
08 Falhas

This document discusses fault tolerance in distributed systems. It begins by defining basic concepts like availability, reliability, safety, and maintainability. It then discusses different types of failures like transient, intermittent, and permanent faults. Various failure models are described like crash, omission, timing, and response failures. Redundancy is discussed as a way to mask failures through information, time, and physical redundancy. Process replication is described as a way to achieve fault tolerance. The challenges of agreement in faulty systems and reliable communication are then covered, specifically focusing on point-to-point communication, RPC semantics, and reliable group communication.

Uploaded by

Daniella Costa
Copyright
© All Rights Reserved

Fault Tolerance

Chapter 7
Introduction to fault tolerance
** Fault Tolerance

A goal in distributed systems design is to construct
the system in such a way that it can automatically
recover from partial failures without seriously
affecting the overall performance.

Whenever a failure occurs, the distributed system
should continue to operate in an acceptable way
while repairs are being made.
** Basic Concepts
Fault tolerance is strongly related to dependable systems.
Dependability includes:
Availability: the property that a system is ready
to be used immediately.

Reliability: the property that a system can run
continuously without failure.

Safety: the situation in which, when a system temporarily
fails to operate correctly, nothing catastrophic happens.

Maintainability: how easily a failed system can be repaired.


** Basic Concepts
Fail (verb): in Portuguese, "faltar", "falhar".
Fault (noun): in Portuguese, "falta", "defeito".

A system is said to fail when it cannot meet its
promises.
An error is a part of a system's state that may lead
to a failure.
The cause of an error is called a fault.
** Basic Concepts
Fault classification

A transient fault occurs once and then disappears.

An intermittent fault occurs, then vanishes of its own
accord, then reappears, and so on.

A permanent fault is one that continues to exist until the
faulty component is repaired.
Failure Models

Type of failure             Description

Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times

Different types of failures.


** Failure Masking by Redundancy
If a system is to be fault tolerant, the best it can do is to try to
hide the occurrence of failures from other processes.
Kinds of redundancy:

Information redundancy: extra bits are added to allow recovery
from garbled bits. Example: Hamming code.

Time redundancy: an action is performed, and then, if need be,
it is performed again.

Physical redundancy: extra equipment or processes are added
to make it possible for the system as a whole to tolerate the
loss or malfunctioning of some component.
Physical Redundancy Example

Triple modular redundancy.
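
The voting step of triple modular redundancy can be sketched as follows. This is a minimal illustration, not taken from the slides: the names `majority_vote`, `tmr`, `good`, and `faulty` are hypothetical, and the replicas are modeled as plain functions.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


def tmr(replicas, x):
    """Run every replica on the same input and vote on the result."""
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    return x * 2


def faulty(x):
    # Hypothetical faulty replica: returns a wrong value.
    return x * 2 + 1


# One faulty replica out of three is masked by the two correct ones.
result = tmr([good, faulty, good], 21)
```

With three replicas the voter masks one faulty replica; if two replicas fail in different ways, no majority exists and the sketch raises an error instead of guessing.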


Process resilience

How fault tolerance can actually be
achieved in distributed systems
** Design issues

The key approach to tolerating a faulty process is to
organize several identical processes into a group.

The purpose of introducing groups is to allow
processes to deal with collections of processes as a
single abstraction.
Design issues

Flat Groups versus Hierarchical Groups


Communication in a flat group.
Communication in a simple hierarchical group
** Design issues
Group membership. When group communication is
present, some method is needed for creating and
deleting groups.

One possibility is a group server to which all these
requests can be sent.

The opposite approach is to manage group
membership in a distributed way.
** Failure Masking and Replication
Process groups are part of the solution for building
fault-tolerant systems.

We can mask faults by replicating processes and
organizing them into a group, so that a single
process is replaced by a group.

All of these solutions build on the replication problem
addressed in previous chapters.
** Agreement in Faulty Systems
Organizing replicated processes into a group helps to
increase fault tolerance. This assumes that the processes
do not team up to produce a wrong result.

So, agreement is needed in many cases.

The general goal of distributed agreement algorithms
is to have all the non-faulty processes reach
consensus on some issue, and to establish that
consensus within a finite number of steps.
Reliable client-server
communication
** Point-to-point communication

Reliable point-to-point communication is established
by making use of a reliable transport protocol such as TCP.

Crash failures of connections are often not masked.

A crash failure may occur when a TCP connection is
abruptly broken so that no more messages can be
transmitted through the channel.
** RPC semantics in the presence of failures
To structure our discussion, let us distinguish between five
different classes of failures in RPC systems.

1. The client is unable to locate the server

2. The request message from the client to server is lost

3. The server crashes after receiving a request

4. The reply message from the server to the client is lost

5. The client crashes after sending a request


** RPC semantics in the presence of failures

The client is unable to locate the server

One possible solution is to have the error raise an exception or
invoke a signal handler.

The drawback is that not every language supports exceptions
or signals.

Another drawback is that having to write an exception or
signal handler destroys the transparency.
** RPC semantics in the presence of failures

The request message from the client to server is lost

This is the easiest one to deal with: just have the operating
system or client stub start a timer when sending the request.

If the timer expires before a reply comes back, the request is resent.

If so many messages are lost that the client gives up, it concludes
that the server is down.
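
The timer-and-resend idea can be sketched as below. This is only an illustration under stated assumptions: `call_with_retries` and `make_lossy_channel` are hypothetical names, the channel is simulated in memory, and a lost message is modeled by raising `TimeoutError`.

```python
def call_with_retries(send_request, max_attempts=3, timeout=1.0):
    """Resend the request until a reply arrives or attempts run out.

    send_request(timeout) is assumed to return the reply, or to raise
    TimeoutError when the timer expires without a reply.
    """
    for _ in range(max_attempts):
        try:
            return send_request(timeout)
        except TimeoutError:
            continue  # timer expired: resend the request
    # Too many losses: the client concludes that the server is down.
    raise ConnectionError("server presumed down")


def make_lossy_channel(losses):
    """Hypothetical channel that loses the first `losses` requests."""
    state = {"sent": 0}

    def send_request(timeout):
        state["sent"] += 1
        if state["sent"] <= losses:
            raise TimeoutError("timer expired")
        return "reply"

    return send_request


reply = call_with_retries(make_lossy_channel(2))
```

Note that the client cannot tell a lost request from a slow or crashed server, so the conclusion "server is down" may be wrong.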
RPC semantics in the presence of failures
The server crashes after receiving a request

A server in client-server communication: (i) Normal case; (ii)
Crash after execution; and (iii) Crash before execution.

Options for recovering from the crash: (i) keep trying until a reply
has been received (at-least-once semantics); (ii) give up immediately
and report back failure (at-most-once semantics); (iii) guarantee nothing.
** RPC semantics in the presence of failures

The reply message from the server to the client is lost

The obvious solution is just to rely again on a timer that has
been set by the client's operating system.

If no reply is forthcoming within a reasonable period, just send
the request once more.
** RPC semantics in the presence of failures
The client crashes after sending a request: This causes an unwanted computation,
called an orphan. For example, the client reboots and does the RPC again, but the
reply from the orphan comes back immediately afterwards. What can be done about
orphans?

(i) Extermination: before the client stub sends an RPC message, it makes a log entry
telling what it is about to do. After a reboot, the log is checked and the orphan is
explicitly killed.

(ii) Reincarnation: when a client reboots, it broadcasts a message to all machines
declaring the start of a new epoch. All old computations of that client are then killed.

(iii) Gentle reincarnation: when an epoch broadcast comes in, each machine checks to
see if it has any remote computations, and if so, tries to locate their owner. Only if
the owner cannot be found is the computation killed.

(iv) Expiration: each RPC is given a standard amount of time T to do the job. If the
client waits a time T after a crash before rebooting, all orphans are sure to be gone.
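
Option (ii), the epoch broadcast, can be sketched as below. This is a simplified in-memory model: the class `Server` and its methods are hypothetical names, and "killing" a computation is modeled as removing it from a table.

```python
class Server:
    """Tracks remote computations and kills those from old epochs."""

    def __init__(self):
        # computation id -> (client name, epoch in which it was started)
        self.computations = {}

    def start(self, comp_id, client, epoch):
        self.computations[comp_id] = (client, epoch)

    def on_epoch_broadcast(self, client, new_epoch):
        """Kill every computation that `client` started before new_epoch."""
        orphans = [
            comp_id
            for comp_id, (owner, epoch) in self.computations.items()
            if owner == client and epoch < new_epoch
        ]
        for comp_id in orphans:
            del self.computations[comp_id]
        return orphans


server = Server()
server.start("a", "client1", 1)
server.start("b", "client2", 1)
# client1 reboots and announces epoch 2: its old computation is an orphan.
killed = server.on_epoch_broadcast("client1", 2)
```

Computations owned by other clients are left untouched, which is why the broadcast has to carry the identity of the rebooted client.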
Reliable group communication
** Basic Reliable-Multicasting Schemes
Reliable multicasting means that a message that is sent
to a process group should be delivered to each
member of that group.

What happens if a process joins the group during
the communication?

Should that process also receive the message?

We should also determine what happens if a (sending)
process crashes during the communication.
Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting when all receivers
are known and are assumed not to fail. a) Message
transmission; b) Reporting feedback.
When the sender receives a negative acknowledgement, the
missing message is retransmitted.
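
A minimal sketch of retransmission driven by negative acknowledgements is shown below. It is an assumption-laden model, not the scheme from the figure: the classes `Sender` and `Receiver` are hypothetical, messages carry sequence numbers, and the sender keeps every message in a history buffer.

```python
class Sender:
    def __init__(self):
        self.history = {}  # sequence number -> payload, kept for retransmission
        self.next_seq = 0

    def multicast(self, payload):
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq):
        return seq, self.history[seq]


class Receiver:
    def __init__(self):
        self.expected = 0    # next sequence number to deliver
        self.pending = {}    # received out of order, not yet delivered
        self.delivered = []

    def receive(self, seq, payload):
        """Accept a message; return sequence numbers to report as missing."""
        if seq >= self.expected:
            self.pending[seq] = payload
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        highest = max(self.pending, default=self.expected - 1)
        return [s for s in range(self.expected, highest + 1)
                if s not in self.pending]


sender, receiver = Sender(), Receiver()
m0, m1, m2 = (sender.multicast(p) for p in ("a", "b", "c"))
receiver.receive(*m0)
missing = receiver.receive(*m2)          # m1 was lost on the way
for seq in missing:
    receiver.receive(*sender.retransmit(seq))
```

The receiver detects the gap only when a later message arrives, which is the usual limitation of feedback based purely on negative acknowledgements.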
Scalability in Reliable Multicasting
Nonhierarchical Feedback Control
The objective is to reduce the number of feedback messages.

Several receivers have scheduled a request for retransmission, but the
first retransmission request leads to the suppression of others.
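
Feedback suppression can be illustrated with a small simulation. The sketch assumes that each receiver waits a random delay before multicasting its retransmission request and cancels its own request if it hears another one first; `nacks_sent`, `max_delay`, and `propagation` are hypothetical names and parameters.

```python
import random


def nacks_sent(receivers_missing, rng, max_delay=1.0, propagation=0.1):
    """Return the receivers whose retransmission request is actually sent.

    Each receiver schedules its request after a random delay. A request
    is suppressed if the first request reaches the receiver (after
    `propagation` time units) before its own timer fires.
    """
    timers = sorted((rng.uniform(0, max_delay), r) for r in receivers_missing)
    if not timers:
        return []
    first_time = timers[0][0]
    return [r for t, r in timers if t <= first_time + propagation]


rng = random.Random(0)
# With instantaneous propagation only the first request gets through.
sent = nacks_sent(["r1", "r2", "r3"], rng, propagation=0.0)
```

With a nonzero propagation delay several requests may still be sent, so suppression reduces feedback but does not guarantee a single request.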
Hierarchical Feedback Control

The essence of hierarchical reliable multicasting.


a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.
** Atomic multicast
The atomic multicast problem arises when a process crashes while a
message is being multicast to a group: the message must be delivered
either to all group members or to none at all.

Example: Consider a replicated database. Update operations
are always multicast to all replicas. During the update a
replica crashes. That update is lost for that replica but it is
performed at the other replicas.

The update is performed only if the remaining replicas have
agreed that the crashed replica no longer belongs to the group. When
the replica recovers, it must explicitly rejoin the group.
This is achieved through virtual synchrony.
** Atomic multicast
Message ordering

Unordered multicast is a virtually synchronous multicast in
which no guarantees are given concerning the order in
which received messages are delivered by different
processes. This is the simplest form.

In FIFO-ordered multicast, the communication layer is forced to
deliver incoming messages from the same process in the
same order as they were sent.

In totally-ordered multicast, it is required that when messages are
delivered, they are delivered in the same order to all group
members.
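
FIFO-ordered delivery is commonly implemented with per-sender sequence numbers and a hold-back queue. The sketch below assumes that scheme; the class `FifoDelivery` and its fields are hypothetical names, and it says nothing about total ordering.

```python
class FifoDelivery:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> message waiting for a gap
        self.delivered = []

    def receive(self, sender, seq, message):
        self.held[(sender, seq)] = message
        expected = self.next_seq.get(sender, 0)
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected


layer = FifoDelivery()
layer.receive("p", 1, "p-second")   # arrives early, is held back
layer.receive("q", 0, "q-first")    # different sender, delivered at once
layer.receive("p", 0, "p-first")    # fills the gap, releases both
```

Messages from different senders may still be interleaved differently at different receivers, which is exactly what totally-ordered multicast rules out.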
Distributed commit
** Distributed commit
The distributed commit problem involves having an
operation performed by each member of a
process group, or by none at all. It is often established
by means of a coordinator.

In reliable multicasting, the operation is the delivery of a message.

With distributed transactions, the operation may be the
commit of a transaction at a single site that takes part
in the transaction.
** One-phase commit

In one-phase commit, the coordinator tells all other
processes that are also involved whether or not to perform
the operation in question.

The drawback is that if one of the participants cannot
actually perform the operation, there is no way to
tell the coordinator.
** Two-phase commit
Two-phase commit consists of a voting phase and a decision phase,
described in the following steps:
1. The coordinator sends a VOTE_REQ message to all participants.

2. When a participant receives a VOTE_REQ message, it returns
either a VOTE_COMMIT or a VOTE_ABORT.

3. The coordinator collects all votes from the participants. If all
participants vote to commit the transaction, the coordinator sends a
GLOBAL_COMMIT message; otherwise it sends a GLOBAL_ABORT
message.

4. Each participant waits for the coordinator's decision. If it receives a
GLOBAL_COMMIT message, it commits the transaction.
Otherwise, if it receives a GLOBAL_ABORT message, the
transaction is aborted.
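
The four steps can be sketched as follows for the failure-free case. This is a simplified in-process model: the class `Participant` and the function `two_phase_commit` are hypothetical names, message passing is replaced by method calls, and timeouts and crashes are not handled.

```python
class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        # Step 2: answer VOTE_REQ with VOTE_COMMIT or VOTE_ABORT.
        if self.can_commit:
            self.state = "READY"
            return "VOTE_COMMIT"
        self.state = "ABORT"
        return "VOTE_ABORT"

    def on_decision(self, decision):
        # Step 4: follow the coordinator's global decision.
        self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"


def two_phase_commit(participants):
    # Steps 1 and 3a: send VOTE_REQ to everyone and collect the votes.
    votes = [p.on_vote_request() for p in participants]
    # Step 3b: commit only if every participant voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision


ok = two_phase_commit([Participant(True), Participant(True)])
group = [Participant(True), Participant(False)]
failed = two_phase_commit(group)
```

A single VOTE_ABORT is enough to abort the transaction at every participant, including those that voted to commit.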
Two-phase commit

The finite state machine for the coordinator.


The finite state machine for a participant.
Three-Phase Commit
A problem with two-phase commit is that when the coordinator has
crashed, participants may have to block until it recovers before they
can reach a final decision. Three-phase commit avoids blocking
processes in the presence of fail-stop crashes.

a) Finite state machine for the coordinator in 3PC


b) Finite state machine for a participant
Recovery
** Introduction
In backward recovery, the main issue is to bring a
system from an erroneous state back into a previously
correct state. To do this, snapshots of the
system state must be recorded from time to time.

In forward recovery, when the system has just entered
an erroneous state, an attempt is made to bring the system
into a correct new state from which it can continue to execute.
Recovery Stable Storage

Recovery Stable Storage: a) Stable Storage; b) Crash after


drive 1 is updated; c) Bad spot.
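
The two situations in the caption can be sketched with a pair of simulated drives. This is an illustrative model under stated assumptions: drives are dictionaries, each block carries a CRC computed with `zlib.crc32`, and the names `make_block`, `stable_write`, and `recover` are hypothetical.

```python
import zlib


def make_block(data):
    return {"data": data, "crc": zlib.crc32(data)}


def is_good(block):
    return block is not None and zlib.crc32(block["data"]) == block["crc"]


def stable_write(drive1, drive2, index, data, crash_after_first=False):
    """Update drive 1 first, then drive 2."""
    drive1[index] = make_block(data)
    if crash_after_first:
        return  # simulated crash: drive 2 still holds the old block
    drive2[index] = make_block(data)


def recover(drive1, drive2, index):
    """Compare both copies of a block and repair the pair."""
    b1, b2 = drive1.get(index), drive2.get(index)
    if is_good(b1):
        # Drive 1 is written first, so a good copy there is the newest.
        if not is_good(b2) or b1["data"] != b2["data"]:
            drive2[index] = dict(b1)
    elif is_good(b2):
        # Bad spot on drive 1: restore it from drive 2.
        drive1[index] = dict(b2)


d1, d2 = {}, {}
stable_write(d1, d2, 0, b"old")
stable_write(d1, d2, 0, b"new", crash_after_first=True)
recover(d1, d2, 0)                 # case b): drive 2 catches up
after_crash = d2[0]["data"]
d1[0]["data"] = b"garbage"         # case c): simulated bad spot on drive 1
recover(d1, d2, 0)
after_bad_spot = d1[0]["data"]
```

Because drive 1 is always written first, a good block on drive 1 can safely overwrite a differing block on drive 2 during recovery.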
Checkpointing

A recovery line. Recovery is based entirely on distributed snapshots:
the system rolls back to the most recent distributed snapshot.
Independent Checkpointing

The domino effect.


If these local states jointly do not form a distributed snapshot,
further rolling back is necessary.
Rollback continues until a consistent state is reached.
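
The repeated rollback can be sketched as a search for a consistent set of checkpoints. The model below is a simplification with hypothetical names: events are stamped with a global time for convenience, every process is assumed to have an initial checkpoint at time 0, and a set of checkpoints is treated as inconsistent when it records the receipt of a message but not its sending.

```python
def find_recovery_line(checkpoints, messages):
    """Roll processes back until their checkpoints are mutually consistent.

    checkpoints: process -> sorted list of checkpoint times (includes 0)
    messages: list of (sender, send_time, receiver, receive_time)
    """
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # The receipt is recorded but the send is not: the receiver
            # must roll back to a checkpoint taken before the receipt.
            if received <= line[receiver] and sent > line[sender]:
                earlier = [t for t in checkpoints[receiver] if t < received]
                line[receiver] = max(earlier)
                changed = True
    return line


checkpoints = {"P": [0, 3], "Q": [0, 6]}
messages = [
    ("Q", 1, "P", 2),   # sent and received before both latest checkpoints
    ("P", 4, "Q", 5),   # sent after P's checkpoint, received before Q's
]
line = find_recovery_line(checkpoints, messages)
```

In this example one rollback triggers another until both processes are back at their initial state, which is the domino effect.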
