
Lecture23 FaultTolerance


Distributed Systems

CS 15-440

Fault Tolerance- Part II


Lecture 23, Nov 19, 2014

Mohammad Hammoud

1
Today…
 Last Session:
 Quiz 2

 Today’s Session:
 Fault Tolerance – Part II
 Reliable communication

 Announcements:
 Project 4 is due on Dec 3rd by midnight
 PS5 will be posted by tonight. It is due on Dec 4th by midnight

2
Objectives
Discussion on Fault Tolerance

 General background on fault tolerance
 Process resilience, failure detection and reliable communication
 Atomicity and distributed commit protocols
 Recovery from failures
Reliable Communication
 Fault tolerance in distributed systems typically
concentrates on faulty processes

 However, we also need to consider
communication failures

 We will focus on two types of reliable communication:


 Reliable request-reply communication (e.g., RPC)
 Reliable group communication (e.g., multicasting schemes)
4
Reliable Communication

[Diagram: Reliable Communication divides into Reliable Request-Reply Communication and Reliable Group Communication]

5
Request-Reply Communication
 The request-reply (RR) communication is designed to support the
roles and message exchanges in typical client-server interactions

[Figure: request-reply exchange. The client calls doOperation, which sends the Request Message and waits. The server calls getRequest, selects and executes the operation, and calls sendReply. The Reply Message lets the client continue.]

 This sort of communication is mainly based on a trio of communication primitives: doOperation, getRequest and sendReply

6
Timeout Mechanisms
 Request-reply communication may suffer from crash, omission, timing, and Byzantine failures

 To allow for occasions where a request or a reply message is not delivered (e.g., lost), doOperation uses a timeout mechanism

 There are various options as to what doOperation can do after a timeout:
   Return immediately with an indication to the client that the request has failed
   Send the request message repeatedly until either a reply is received or the server is assumed to have failed
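The second option can be sketched roughly as follows. This is a minimal illustration, not the lecture's implementation: `LossyServer`, `do_operation`, and `max_retries` are hypothetical names, and a lost reply is modeled simply by returning `None` instead of running a real timer.

```python
class LossyServer:
    """Toy server whose replies are lost for the first `drops` requests."""

    def __init__(self, drops):
        self.drops = drops
        self.executions = 0          # how many times the operation ran

    def handle(self, request):
        self.executions += 1         # the operation is executed every time
        if self.drops > 0:
            self.drops -= 1
            return None              # models a lost reply (client times out)
        return f"reply to {request}"


def do_operation(server, request, max_retries=3):
    """Send `request`, retransmitting after each timeout.

    Returns the reply, or raises TimeoutError once the server is
    assumed to have failed (after `max_retries` retransmissions).
    """
    for _ in range(1 + max_retries):
        reply = server.handle(request)   # None stands for "timer expired"
        if reply is not None:
            return reply
    raise TimeoutError("server assumed to have failed")
```

Note that in this sketch a server whose first two replies are lost executes the operation three times before the client sees a reply, which is why retransmission raises the question of repeated execution.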
7
Idempotent Operations
 When the request message is retransmitted, the server may receive it more than once

 This can cause the server to execute an operation more than once for the same request

 Not every operation yields the same result when it is executed more than once

 Operations that can be executed repeatedly with the same effect as a single execution are called idempotent operations
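A small illustration of the distinction, using a hypothetical bank-account dictionary (the function names are made up for this example): setting a balance is idempotent, while depositing is not.

```python
def set_balance(account, amount):
    """Idempotent: repeating the call leaves the same final state."""
    account["balance"] = amount
    return account["balance"]


def deposit(account, amount):
    """Not idempotent: every repetition changes the state again."""
    account["balance"] += amount
    return account["balance"]


a = {"balance": 100}
set_balance(a, 150)
set_balance(a, 150)      # a duplicate request is harmless

b = {"balance": 100}
deposit(b, 50)
deposit(b, 50)           # a duplicate request deposits twice
```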
8
Duplicate Filtering
 To avoid problems with non-idempotent operations, the server should recognize successive messages from the same client and filter out duplicates

 If the server has already sent the reply when it receives a “duplicate” request, it can either:
   Re-execute the operation to obtain the result again (only for idempotent operations)
   Or skip re-execution and retransmit the reply, provided it has retained the outcome of the first and only execution

9
Keeping History
 Servers can maintain the execution outcomes of requests in what is called the history

 More precisely, the term ‘history’ refers to a structure that contains records of (reply) messages that have been transmitted

Fields of a history record: Request ID, Message, Client ID

10
Managing History
 The server can interpret each request from a client as an ACK of its previous reply

 Thus, the history needs to contain ONLY the last reply message sent to each client

 But if the number of clients is large, the memory cost might become a problem

 Messages in a history are normally discarded after a limited period of time
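Duplicate filtering with a one-record-per-client history might look like the sketch below. The class and field names are assumptions made for illustration, the operation is a deliberately non-idempotent counter increment, and the time-based discarding of old records is left out.

```python
class HistoryServer:
    """Server that keeps only the last reply sent to each client."""

    def __init__(self):
        self.history = {}      # client_id -> (request_id, reply message)
        self.counter = 0       # state changed by a non-idempotent operation

    def handle(self, client_id, request_id, increment):
        last = self.history.get(client_id)
        if last is not None and last[0] == request_id:
            # Duplicate request: retransmit the saved reply, do not re-execute.
            return last[1]
        self.counter += increment            # executed once per request ID
        reply = f"counter={self.counter}"
        # A new request ID acts as an ACK of the previous reply, so the
        # old record for this client can simply be overwritten.
        self.history[client_id] = (request_id, reply)
        return reply
```

With this arrangement a retransmitted request returns the stored reply and leaves the counter untouched, which corresponds to the "retransmit reply" choice discussed in the summary that follows.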

11
In Summary…
 The RR protocol can be implemented in different ways to provide different delivery guarantees. The main choices are:

1. Retry request message (client side): Controls whether to retransmit the request message until either a reply is received or the server is assumed to have failed

2. Duplicate filtering (server side): Controls when retransmissions are used and whether to filter out duplicate requests at the server

3. Retransmission of results (server side): Controls whether to keep a history of result messages to enable lost results to be retransmitted without re-executing the operations at the server

12
Request-Reply Call Semantics
 Combinations of request-reply protocols lead to a variety of possible
semantics for the reliability of remote invocations

Fault Tolerance Measures                                              Call Semantics
                                                                      (Pertaining to
Retransmit Request    Duplicate    Re-execute Procedure or            Remote Procedures)
Message               Filtering    Retransmit Reply

No                    N/A          N/A                                Maybe
Yes                   No           Re-execute Procedure               At-least-once
Yes                   Yes          Retransmit Reply                   At-most-once

13
Reliable Communication

[Diagram: Reliable Communication divides into Reliable Request-Reply Communication and Reliable Group Communication]

14
Reliable Group Communication
 Just as we considered reliable request-reply communication, we also need to consider reliable multicasting services

[Figure: a group of processes numbered 1, 2, 3, 4, 6, and 7]

 E.g., election algorithms use multicasting schemes

15
Reliable Group Communication
 A Basic Reliable-Multicasting Scheme
 Atomic Multicasting

16
Reliable Multicasting
 Reliable multicasting means that a message sent to a group of processes should be delivered to each member of that group

 A distinction should be made between:
   Reliable communication in the presence of faulty processes
   Reliable communication when processes are assumed to operate correctly

 In the presence of faulty processes, multicasting is considered to be reliable when it can be guaranteed that all non-faulty group members receive the message

18
Basic Reliable Multicasting Questions
 What happens if during multicasting a process P joins or
leaves a group?
 Should the sent message be delivered?
 Should P (if joining) also receive the message?

 What happens if the (sending) process crashes during multicasting?

 What about message ordering?

19
A Simple Case: Reliable Multicasting with Feedback Messages
 Consider the case when a single sender S wants to multicast a message to multiple receivers

 A message multicast by S may be lost partway and delivered to some, but not all, of the intended receivers

 Assume that messages are received in the same order as they are sent

20
Reliable Multicasting with Feedback Messages
[Figure, top: the sender keeps M25 in its history buffer and multicasts it over the network to four receivers whose last received messages are Last = 24, Last = 24, Last = 23, and Last = 24.]
[Figure, bottom: all four receivers get M25. The three receivers with Last = 24 return ACK25; the receiver with Last = 23 reports that it missed message 24.]

An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004)

21
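The feedback scheme in the figure can be approximated in a few lines. This is a sketch under simplifying assumptions: the class and method names are hypothetical, message passing is replaced by direct method calls, and pruning of the history buffer once every receiver has acknowledged a message is omitted.

```python
class Receiver:
    def __init__(self, last):
        self.last = last          # sequence number of the last delivered message
        self.delivered = []

    def receive(self, seq, payload):
        """Return ('ACK', seq) or ('MISSED', n) as feedback for the sender."""
        if seq == self.last + 1:
            self.last = seq
            self.delivered.append(payload)
            return ("ACK", seq)
        if seq <= self.last:
            return ("ACK", seq)              # duplicate, already delivered
        return ("MISSED", self.last + 1)     # gap detected


class Sender:
    def __init__(self):
        self.history = {}         # seq -> payload, kept for retransmission

    def multicast(self, seq, payload, receivers):
        self.history[seq] = payload
        for r in receivers:
            feedback = r.receive(seq, payload)
            while feedback[0] == "MISSED":
                missing = feedback[1]
                r.receive(missing, self.history[missing])   # retransmit
                feedback = r.receive(seq, payload)
```

Replaying the figure's scenario (three receivers at Last = 24, one at Last = 23, sender multicasting M25) leaves every receiver at Last = 25, with the lagging receiver having first obtained message 24 from the history buffer.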
Reliable Group Communication
 A Basic Reliable-Multicasting Scheme
 Atomic Multicasting

22
Atomic Multicast
 C1: What is often needed in a distributed system is the guarantee
that a message is delivered to either all processes or none at all

 C2: It is also generally required that all messages are delivered in the same order to all processes

 Satisfying C1 and C2 results in what we call atomic multicast

 Atomic multicast:

 Ensures that non-faulty processes maintain a consistent view

 Forces reconciliation when a process recovers and rejoins the group


23
Virtual Synchrony
 A multicast message m is uniquely associated with a list of processes to which it should be delivered

 This delivery list corresponds to a group view (G)

 In principle, the delivery of m is allowed to fail:
   When a group-membership change is the result of the sender of m crashing
     Accordingly, m may either be delivered to all remaining processes, or ignored by each of them
   Or when a group-membership change is the result of a receiver of m crashing
     Accordingly, m may be ignored by every other receiver, which corresponds to the situation in which the sender of m crashed before m was sent

A reliable multicast with this property is said to be “virtually synchronous”

24


The Principle of Virtual Synchrony

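[Figure: timelines of P1, P2, P3, and P4, with reliable multicast implemented by multiple point-to-point messages. P3 crashes and later rejoins, so the group view changes from G = {P1, P2, P3, P4} to G = {P1, P2, P4} and back to G = {P1, P2, P3, P4}. The partial multicast from P3 is discarded.]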

25
Message Ordering
 Four different virtually synchronous multicast orderings
are distinguished:

1. Unordered multicasts

2. FIFO-ordered multicasts

3. Causally-ordered multicasts

4. Totally-ordered multicasts

26
1. Unordered multicasts
 A reliable, unordered multicast is a virtually synchronous multicast in
which no guarantees are given concerning the order in which
received messages are delivered by different processes

Process P1    Process P2     Process P3
Sends m1      Receives m1    Receives m2
Sends m2      Receives m2    Receives m1

Three communicating processes in the same group

27
2. FIFO-Ordered Multicasts
 With FIFO-Ordered multicasts, the communication layer is forced to
deliver incoming messages from the same process in the same
order as they have been sent

Process P1    Process P2     Process P3     Process P4
Sends m1      Receives m1    Receives m3    Sends m3
Sends m2      Receives m3    Receives m1    Sends m4
              Receives m2    Receives m2
              Receives m4    Receives m4

Four processes in the same group with two different senders.
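One common way to enforce FIFO order, shown here only as an illustrative sketch, is to tag each message with a per-sender sequence number and hold back anything that arrives early. The sequence numbers and the `FifoReceiver` class are assumptions of this example; the slides do not prescribe a mechanism.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)     # sender -> next expected seq
        self.held = defaultdict(dict)        # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, message):
        self.held[sender][seq] = message
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            n = self.next_seq[sender]
            self.delivered.append(self.held[sender].pop(n))
            self.next_seq[sender] += 1
```

If m2 from P1 arrives before m1, it is held until m1 has been delivered; messages from P4 are unaffected, so interleavings such as m3, m1, m2, m4 (the order seen by P3 in the table) remain legal.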

28
3-4. Causally-Ordered and Total-Ordered Multicasts
 Causally-ordered multicasts preserve potential causality between different messages

 If message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender or not, the communication layer at each receiver will always deliver m1 before m2

 Total-ordered multicasts require that when messages are delivered, they are delivered in the same order to all group members (regardless of whether message delivery is unordered, FIFO-ordered, or causally-ordered)
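Causal ordering is often implemented with vector timestamps, a technique the slides do not spell out and which is assumed here for illustration. In the sketch below (hypothetical names throughout), a message from sender `s` with timestamp `ts` is delivered only when it is the next message from `s` and every message it causally depends on has already been delivered.

```python
class CausalReceiver:
    """Hold-back queue that delivers messages in causal order."""

    def __init__(self, n):
        self.clock = [0] * n      # clock[k] = messages delivered from process k
        self.pending = []         # (sender index, timestamp, message)
        self.delivered = []

    def _deliverable(self, sender, ts):
        # Next message from `sender`, and nothing it depends on is missing.
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[k] <= self.clock[k] for k in range(len(ts)) if k != sender)

    def receive(self, sender, ts, message):
        self.pending.append((sender, ts, message))
        progress = True
        while progress:           # keep delivering until nothing else is ready
            progress = False
            for item in list(self.pending):
                s, t, m = item
                if self._deliverable(s, t):
                    self.pending.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(m)
                    progress = True
```

For example, if process 1 sends m2 after having delivered m1 from process 0, a third process that receives m2 first holds it back until m1 arrives.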

29
Virtually Synchronous Reliable Multicasting
 Virtually synchronous reliable multicasting that offers total-ordered delivery of messages is what we refer to as atomic multicasting

Multicast                  Basic Message Ordering     Total-Ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes

Six different versions of virtually synchronous reliable multicasting

30
Distributed Commit
 The atomic multicasting problem is an example of a more general problem, known as distributed commit

 The distributed commit problem involves having an operation performed by every member of a process group, or by none at all
   With reliable multicasting, the operation is the delivery of a message
   With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction

 Distributed commit is often established by means of a coordinator and participants

31
One-Phase Commit Protocol
 In a simple scheme, a coordinator can tell all participants
whether or not to (locally) perform the operation in question

 This scheme is referred to as a one-phase commit protocol

 The main drawback of the one-phase commit protocol is that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator

 In practice, more sophisticated schemes are needed
   The most commonly used one is the two-phase commit protocol

32
Two-Phase Commit Protocol
 Assuming that no failures occur, the two-phase commit protocol (2PC) consists of the following two phases, each consisting of two steps:

Phase I: Voting Phase

Step 1
• The coordinator sends a VOTE_REQUEST message to all participants.

Step 2
• When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, telling the coordinator that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.

33
Two-Phase Commit Protocol
Phase II: Decision Phase

Step 1
• The coordinator collects all votes from the participants.
• If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants.
• However, if one participant has voted to abort the transaction, the coordinator will also decide to abort the transaction and multicasts a GLOBAL_ABORT message.

Step 2
• Each participant that voted for a commit waits for the final reaction by the coordinator.
• If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction.
• Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
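The failure-free case of the two phases can be condensed into a short simulation. This is only a sketch: the function name and the representation of participants as callables are assumptions for illustration, and logging, timeouts, and real messaging are all left out.

```python
def two_phase_commit(participants):
    """Run one failure-free 2PC round.

    `participants` maps a name to a zero-argument function that returns
    True if that participant can commit its part of the transaction.
    Returns the global decision and what each participant did locally.
    """
    # Phase I: the coordinator multicasts VOTE_REQUEST and collects votes.
    votes = {name: ("VOTE_COMMIT" if can_commit() else "VOTE_ABORT")
             for name, can_commit in participants.items()}

    # Phase II: commit only if every participant voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"

    # Every participant applies the same global decision locally.
    local = "committed" if decision == "GLOBAL_COMMIT" else "aborted"
    outcome = {name: local for name in participants}
    return decision, outcome
```

A single VOTE_ABORT is enough to abort the transaction everywhere, which is the all-or-nothing property that distributed commit is after.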
34
2PC Finite State Machines
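[Figure: two finite state machines, with transitions labeled "message received / message sent".]

The finite state machine for the coordinator in 2PC:
  INIT -> WAIT      Commit / Vote-request
  WAIT -> ABORT     Vote-abort / Global-abort
  WAIT -> COMMIT    Vote-commit / Global-commit

The finite state machine for a participant in 2PC:
  INIT -> ABORT     Vote-request / Vote-abort
  INIT -> WAIT      Vote-request / Vote-commit
  WAIT -> ABORT     Global-abort / ACK
  WAIT -> COMMIT    Global-commit / ACK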

35
2PC Algorithm
Actions by coordinator:
write START_2PC to local log;
multicast VOTE_REQUEST to all participants;
while not all votes have been collected {
    wait for any incoming vote;
    if timeout {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
        exit;
    }
    record vote;
}
if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
    write GLOBAL_COMMIT to local log;
    multicast GLOBAL_COMMIT to all participants;
} else {
    write GLOBAL_ABORT to local log;
    multicast GLOBAL_ABORT to all participants;
}

36
Two-Phase Commit Protocol
Actions by participants:
write INIT to local log;
wait for VOTE_REQUEST from coordinator;
if timeout {
    write VOTE_ABORT to local log;
    exit;
}
if participant votes COMMIT {
    write VOTE_COMMIT to local log;
    send VOTE_COMMIT to coordinator;
    wait for DECISION from coordinator;
    if timeout {
        multicast DECISION_REQUEST to other participants;
        wait until DECISION is received; /* remain blocked */
        write DECISION to local log;
    }
    if DECISION == GLOBAL_COMMIT { write GLOBAL_COMMIT to local log; }
    else if DECISION == GLOBAL_ABORT { write GLOBAL_ABORT to local log; }
} else {
    write VOTE_ABORT to local log;
    send VOTE_ABORT to coordinator;
}

37
Two-Phase Commit Protocol
Actions for handling decision requests:
/* executed by separate thread */

while true {
    wait until any incoming DECISION_REQUEST is received; /* remain blocked */
    read most recently recorded STATE from the local log;
    if STATE == GLOBAL_COMMIT
        send GLOBAL_COMMIT to requesting participant;
    else if STATE == INIT or STATE == GLOBAL_ABORT
        send GLOBAL_ABORT to requesting participant;
    else
        skip; /* participant remains blocked */
}
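The decision-request rule above can be exercised with a small runnable sketch. The function names are hypothetical, and peers are represented only by the most recent state in their local logs rather than by real threads or messages.

```python
def answer_decision_request(state):
    """Reply a participant gives to a peer's DECISION_REQUEST.

    `state` is the most recent entry in this participant's local log.
    Returns the message to send, or None if no answer can be given.
    """
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        # A participant still in INIT has not voted yet, so the
        # coordinator cannot have decided to commit.
        return "GLOBAL_ABORT"
    return None   # e.g. VOTE_COMMIT logged: this peer is waiting as well


def resolve(peer_states):
    """Decision a blocked participant learns from its peers, if any."""
    for state in peer_states:
        reply = answer_decision_request(state)
        if reply is not None:
            return reply
    return None   # every peer is undecided: remain blocked
```

When every reachable peer has also logged VOTE_COMMIT, `resolve` returns `None`, which corresponds to the case where participants stay blocked until the coordinator recovers.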

38
Objectives
Discussion on Fault Tolerance

 General background on fault tolerance
 Process resilience, failure detection and reliable communication
 Atomicity and distributed commit protocols
 Recovery from failures
Recovery
 So far, we have mainly concentrated on algorithms that allow us to
tolerate faults

 However, once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state

 In what follows we focus on:
   What it actually means to recover to a correct state
   When and how the state of a distributed system can be recorded and recovered, by means of checkpointing and message logging

40
Recovery
 Error Recovery
 Checkpointing
 Message Logging

41
Error Recovery
 Once a failure has occurred, it is essential that the process where
the failure has happened can recover to a correct state

 Fundamental to fault tolerance is the recovery from an error

 The idea of error recovery is to replace an erroneous state with an error-free state

 There are essentially two forms of error recovery:
  1. Backward recovery
  2. Forward recovery

43
Backward Recovery
 In backward recovery, the main issue is to bring the system from its present erroneous state “back” to a previously correct state

 It is necessary to record the system’s state from time to time onto stable storage, and to restore such a recorded state when things go wrong

 Each time (part of) the system’s present state is recorded, a checkpoint is said to be made

 Some problems with backward recovery:
   Restoring a system or a process to a previous state is generally expensive (in terms of performance)
   Some states can never be rolled back (e.g., typing rm -fr * in UNIX)
Forward Recovery
 When the system detects that it has made an error, forward recovery takes the erroneous state and corrects it into a new correct state from which execution can move forward

 Forward recovery is typically faster than backward recovery, but it requires knowing in advance which errors may occur

 Some systems make use of both forward and backward recovery for different errors or different parts of one error

45
Recovery
 Error Recovery
 Checkpointing
 Message Logging

46
Why Checkpointing?
 In fault-tolerant distributed systems, backward recovery requires that systems “regularly” save their states onto stable storage

 This process is referred to as checkpointing

 Checkpointing consists of storing a “distributed snapshot” of the current application state and later using it to restart the execution in case of a failure

47
Recovery Line
 In capturing a distributed snapshot, if a process P has recorded the receipt of a message, m, then there should also be a process Q that has recorded the sending of m

We are able to identify both senders and receivers.

[Figure: timelines of P and Q from the initial state to a failure, with local snapshots along each timeline and a message m sent from Q to P. One pair of snapshots is marked as a recovery line; they jointly form a distributed snapshot. Another pair is marked as not a recovery line.]

48
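The consistency condition for a distributed snapshot can be written as a short check. The data layout below (a dictionary per checkpoint with `sent` and `received` sets of message IDs) is an assumption chosen for this sketch, not something the slides define.

```python
def is_recovery_line(checkpoints):
    """Check whether a set of local checkpoints forms a consistent cut.

    `checkpoints` maps a process name to a dict with two sets, 'sent'
    and 'received', holding the IDs of messages whose sending or
    receipt had been recorded when the checkpoint was taken.
    """
    all_sent = set()
    for cp in checkpoints.values():
        all_sent |= cp["sent"]
    # Every recorded receipt must have a matching recorded send.
    return all(cp["received"] <= all_sent for cp in checkpoints.values())
```

A message that was sent but not yet received is acceptable (it is simply in transit); a message that was received but never sent according to the snapshot is what rules a set of checkpoints out as a recovery line.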
Checkpointing
 Checkpointing can be of two types:

1. Independent Checkpointing: each process simply records its local state from time to time in an uncoordinated fashion

2. Coordinated Checkpointing: all processes synchronize to jointly write their states to local stable storage

 Which algorithm among the ones we’ve studied can be used to implement coordinated checkpointing?
   A simple solution is to use 2PC

49
Domino Effect
 Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks

[Figure: after a failure, the processes roll back through successive sets of checkpoints, each of which is not a recovery line.]

 With coordinated checkpointing, the saved state is automatically globally consistent; hence, the domino effect is inherently avoided
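The cascading rollbacks behind the domino effect can be illustrated with a small search for a recovery line. This is a sketch under assumed data structures (hypothetical names; each checkpoint records `sent` and `received` message IDs, and checkpoint 0 is the empty initial state), not an algorithm given in the lecture.

```python
def find_recovery_line(history):
    """Roll back independently taken checkpoints until they are consistent.

    `history` maps a process name to its checkpoints, oldest first; each
    checkpoint is a dict with 'sent' and 'received' sets of message IDs,
    and index 0 is the initial state (both sets empty).
    Returns the index of the checkpoint each process restarts from.
    """
    line = {p: len(cps) - 1 for p, cps in history.items()}
    changed = True
    while changed:
        changed = False
        sent = set()
        for p, i in line.items():
            sent |= history[p][i]["sent"]
        for p, i in line.items():
            # A receipt without a recorded send forces p to roll back.
            if not history[p][i]["received"] <= sent:
                line[p] = i - 1
                changed = True
    return line
```

In the worst case each rollback invalidates another process's checkpoint in turn, and the search ends only at the initial state, which is the domino effect.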

50
Recovery
 Error Recovery
 Checkpointing
 Message Logging

51
Why Message Logging?
 Considering that checkpointing is an expensive operation,
techniques have been sought to reduce the number of checkpoints,
but still enable recovery

 An important technique in distributed systems is message logging

 The basic idea is that if the transmission of messages can be replayed, we can still reach a globally consistent state, yet without having to restore that state from stable storage

 In practice, the combination of having fewer checkpoints and message logging is more efficient than having to take many checkpoints

52
Message Logging
 Message logging can be of two types:

1. Sender-based logging: A process can log its messages before sending them off

2. Receiver-based logging: A receiving process can first log an incoming message before delivering it to the application

 When a sending or a receiving process crashes, it can restore the most recently checkpointed state, and from there on “replay” the logged messages (Is it fine for non-deterministic behaviors?)
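Receiver-based logging combined with a checkpoint can be sketched as follows. The class is hypothetical and deliberately tiny: the "application" just adds incoming integers to its state, and stable storage is modeled by ordinary attributes.

```python
class LoggingProcess:
    """Process with receiver-based logging and a deterministic handler."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0       # last state saved to stable storage
        self.log = []             # messages logged since that checkpoint

    def deliver(self, value):
        self.log.append(value)    # log first, then hand to the application
        self.state += value

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()          # older messages are no longer needed

    def crash_and_recover(self):
        self.state = self.checkpoint       # restore the checkpointed state
        for value in self.log:             # replay logged messages in order
            self.state += value
```

Replay reproduces the pre-crash state here only because the handler is deterministic; if processing a message depended on timing or random choices, replaying the same log could lead to a different state, which is the concern raised by the question above.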

53
Replay of Messages and Orphan Processes
 Caveat: Incorrect replay of messages after recovery can lead to orphan processes

[Figure: timelines of processes P, Q, and R exchanging messages M1, M2, and M3. Q crashes and later recovers. M1 (logged) is replayed, but M2 (unlogged) can never be replayed, so M3 becomes an orphan. Legend: logged message, unlogged message.]

54
Objectives
Discussion on Fault Tolerance

 General background on fault tolerance
 Process resilience, failure detection and reliable communication
 Atomicity and distributed commit protocols
 Recovery from failures

All Covered!
Next Class

Distributed File Systems-Part I

Thank You!

56
