w9s1 FaultTolerance1

Fault Tolerance

Johannes Sianipar
25 March 2021
Fault Tolerance, Failure, Error, Fault
■ Fault Tolerance
□ Whenever a failure occurs, the system should continue to operate in an
acceptable way while repairs are being made.
■ A system is said to fail when it cannot meet its promises.
■ An error is a part of a system’s state that may lead to a failure.
□ The cause of an error is called a fault.
□ For example, a programmer is the fault behind an error (a programming
bug), which in turn leads to a failure (a crashed program).
■ Fault tolerance means that a system can provide its services even in the
presence of faults.

Fault Tolerance Basic Concepts
■ Being fault tolerant is strongly related to what are called dependable
systems
■ Dependability implies the following:
□ Availability
– the system is ready to be used immediately
□ Reliability
– the system can run continuously without failure
□ Safety
– if a system fails, nothing catastrophic will happen
□ Maintainability
– when a system fails, it can be repaired easily and quickly (sometimes,
without its users noticing the failure)

Types of Fault
■ There are three main types of fault:
□ Transient Fault
– appears once, then disappears.
□ Intermittent Fault
– occurs, vanishes, then reappears, following no real pattern
(the worst kind).
□ Permanent Fault
– once it occurs, only the replacement or repair of the faulty component
will allow the distributed system (DS) to function normally.

Failure Models

Halting failures classification

Halting type      Description
Fail-stop         Crash failures, but reliably detectable
Fail-noisy        Crash failures, eventually reliably detectable
Fail-silent       Crash failures: clients cannot tell what went wrong
Fail-safe         Arbitrary, yet benign failures (i.e., they cannot do any harm)
Fail-arbitrary    Arbitrary, with malicious failures

Failure Masking by Redundancy
■ Strategy: hide the occurrence of failure from other processes using
redundancy.
■ Three main types:
□ Information Redundancy
– add extra bits to allow for error detection/recovery (e.g., Hamming
codes and the like).
□ Time Redundancy
– perform an operation and, if need be, perform it again. Think about how
transactions work (BEGIN/END/COMMIT/ABORT).
□ Physical Redundancy
– add extra (duplicate) hardware and/or software to the system.
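
Information redundancy can be illustrated with a Hamming(7,4) code, one member of the Hamming family mentioned above. The sketch below is not taken from the slides: the function names `hamming74_encode` and `hamming74_decode` are hypothetical, and the bit layout (parity bits at positions 1, 2 and 4) is the conventional one, chosen here as an assumption. Three extra bits per four data bits let the receiver correct any single flipped bit.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3   # 1-based; 0 means no error seen
    if error_position:
        c[error_position - 1] ^= 1          # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1                            # simulate a transient bit flip
    print(hamming74_decode(word))
```

Time redundancy would correspond to re-running the transmission when a check fails, and physical redundancy to sending the word over duplicate hardware; this sketch covers only the information-redundancy case.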
Failure Masking by Redundancy (Cont.)

Triple modular redundancy
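
The figure for this slide did not survive extraction. As a rough illustration of the idea, assuming the usual arrangement in which each stage is triplicated and a voter passes on the majority output, a voter could be sketched as follows. The function name `tmr_vote` is hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three replicas disagree: more than one is faulty, cannot mask.
        raise ValueError("no majority among the three replicas")
    return value


if __name__ == "__main__":
    # One replica produces a wrong output; the voter masks it.
    print(tmr_vote(5, 5, 9))
```

With three replicas, the voter masks one faulty replica per stage; two simultaneous faults in the same stage are not masked.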

Process Resilience
■ Processes can be made fault tolerant by arranging to have a group of
processes, with each member of the group being identical.
■ A message sent to the group is delivered to all of the “copies” of the
process (the group members), and then only one of them performs the
required service.
■ If one of the processes fails, it is assumed that one of the others will still be
able to function (and service any pending request or operation).

Flat Groups versus Hierarchical Groups

(a) Communication in a flat group.

(b) Communication in a simple hierarchical group.

Failure Masking and Replication
■ By organizing a fault tolerant group of processes, we can protect a single
vulnerable process.
■ There are two approaches to arranging the replication of the group:
□ Primary-backup protocols
□ Replicated-write protocols

Groups and Failure masking
■ k-fault tolerant group
□ A group is k-fault tolerant when it can mask any k concurrent member
failures (k is called the degree of fault tolerance).
■ How large does a k-fault tolerant group need to be?
□ With halting failures (crash/omission/timing failures): we need a total of
k + 1 members, as no member will produce an incorrect result, so the result
of one member is good enough.
□ With arbitrary failures: we need 2k + 1 members so that the correct result
can be obtained through a majority vote.
■ Important assumptions
□ All members are identical
□ All members process commands in the same order
□ Result: we can now be sure that all processes do exactly the same thing.
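
The two group-size rules can be made concrete with a small sketch. Everything named here (`required_group_size`, `mask_halting`, `mask_arbitrary`, and the convention that a halted member's reply is `None`) is an assumption made for illustration, not part of the slides.

```python
from collections import Counter


def required_group_size(k, failure_model):
    """Minimum number of members needed to mask k concurrent failures."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "halting":
        return k + 1            # one surviving member is enough
    if failure_model == "arbitrary":
        return 2 * k + 1        # k + 1 correct members outvote k faulty ones
    raise ValueError(f"unknown failure model: {failure_model!r}")


def mask_halting(replies):
    """Halting failures: any reply that arrives is correct (None = no reply)."""
    for reply in replies:
        if reply is not None:
            return reply
    raise RuntimeError("all members failed")


def mask_arbitrary(replies):
    """Arbitrary failures: accept only a value backed by a strict majority."""
    value, count = Counter(replies).most_common(1)[0]
    if count <= len(replies) // 2:
        raise RuntimeError("no majority")
    return value


if __name__ == "__main__":
    print(required_group_size(2, "halting"))     # group size for k = 2
    print(required_group_size(2, "arbitrary"))
    print(mask_halting([None, None, 42]))        # two members crashed
    print(mask_arbitrary([42, 42, 42, 7, 9]))    # two members lie
```

Both masking functions rely on the assumptions listed above: identical members that process the same commands in the same order, so that all correct replies agree.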
Consensus in Faulty Systems with Crash Failures
■ Prerequisite
□ In a fault-tolerant process group, each nonfaulty process executes the
same commands, and in the same order, as every other nonfaulty
process.
■ Reformulation
□ Nonfaulty group members need to reach consensus on which command
to execute next.

Flooding-based consensus
■ System model
□ A process group P = {P1,...,Pn}
□ Fail-stop failure semantics, i.e., with reliable failure detection
□ A client contacts a Pi requesting it to execute a command
□ Every Pi maintains a list of proposed commands
■ Basic algorithm (based on rounds)
□ In round r, Pi multicasts its known set of commands C_i^r to all others
□ At the end of r, each Pi merges all received commands into a new set C_i^{r+1}
□ Next command cmd_i selected through a globally shared, deterministic
function: cmd_i ← select(C_i^{r+1})

Flooding-based consensus: Example
■ P1 crashed.
■ P2 received all proposed commands from all other processes ⇒ makes
decision.
■ P3 may have detected that P1 crashed, but does not know if P2 received
anything, i.e., P3 cannot know if it has the same information as P2 ⇒
cannot make decision (same for P4).
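
The round structure and the example above can be traced with a small simulation. This is a sketch under stated assumptions: message delivery is modelled as a table of which senders reached which receiver, `select` simply picks the smallest command, and the command names are invented. In the second round P2 is assumed to multicast again, which is what lets P3 and P4 catch up.

```python
def select(commands):
    """Globally shared deterministic choice; taking the minimum is an assumption."""
    return min(commands)


def flooding_round(known, delivered):
    """Each live receiver merges its own set with every set that reached it.

    known:     process -> set of commands known at the start of the round
    delivered: receiver -> set of senders whose multicast arrived
    """
    return {
        receiver: set(known[receiver]).union(*(known[s] for s in senders))
        for receiver, senders in delivered.items()
    }


# Round 1: P1 crashes while multicasting, so only P2 receives its command.
known_r1 = {"P1": {"cmd1"}, "P2": {"cmd2"}, "P3": {"cmd3"}, "P4": {"cmd4"}}
delivered_r1 = {
    "P2": {"P1", "P3", "P4"},
    "P3": {"P2", "P4"},
    "P4": {"P2", "P3"},
}
known_r2 = flooding_round(known_r1, delivered_r1)

# P2 heard from everyone and may decide; P3 and P4 must wait a round.
decision_p2 = select(known_r2["P2"])

# Round 2: the three survivors exchange their merged sets.
delivered_r2 = {
    "P2": {"P3", "P4"},
    "P3": {"P2", "P4"},
    "P4": {"P2", "P3"},
}
known_r3 = flooding_round(known_r2, delivered_r2)
decisions = {p: select(cmds) for p, cmds in known_r3.items()}

if __name__ == "__main__":
    print(decision_p2, decisions)
```

After round 1, P3 and P4 are missing P1's command and would pick a different one if they decided early, which is why they have to postpone.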

Realistic consensus: Paxos
■ Assumptions (rather weak ones, and realistic)
□ A partially synchronous system (in fact, it may even be asynchronous).
□ Communication between processes may be unreliable: messages may be
lost, duplicated, or reordered.
□ Corrupted messages can be detected (and thus subsequently ignored).
□ All operations are deterministic: once an execution is started, it is known
exactly what it will do.
□ Processes may exhibit crash failures, but not arbitrary failures.
□ Processes do not collude.

Paxos essentials
■ Starting point
□ We assume a client-server configuration, with initially one primary server.
□ To make the server more robust, we start with adding a backup server.
□ To ensure that all commands are executed in the same order at
both servers, the primary assigns unique sequence numbers to all
commands.
□ In Paxos, the primary is called the leader.
□ Assume that actual commands can always be restored (either from clients
or servers) ⇒ we consider only control messages.

Some Paxos terminology
■ The leader sends an accept message ACCEPT(o,t) to backups when
assigning a timestamp t to command o.
■ A backup responds by sending a learn message: LEARN(o,t)
■ When the leader notices that operation o has not yet been learned,
it retransmits ACCEPT(o,t) with the original timestamp.
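
The ACCEPT/LEARN exchange described above might be tracked on the leader side roughly as follows. This is a simplified sketch, not a Paxos implementation: the `Leader` class, its `outbox` list standing in for the network, and the use of a counter as timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    next_timestamp: int = 0
    pending: dict = field(default_factory=dict)   # op -> timestamp, not yet learned
    outbox: list = field(default_factory=list)    # messages "sent" to the backups

    def accept(self, op):
        """Assign a timestamp t to op and send ACCEPT(op, t)."""
        t = self.next_timestamp
        self.next_timestamp += 1
        self.pending[op] = t
        self.outbox.append(("ACCEPT", op, t))
        return t

    def on_learn(self, op, t):
        """A backup answered with LEARN(op, t): op no longer needs retransmission."""
        if self.pending.get(op) == t:
            del self.pending[op]

    def retransmit_unlearned(self):
        """Resend ACCEPT for every unlearned op, keeping the original timestamp."""
        for op, t in self.pending.items():
            self.outbox.append(("ACCEPT", op, t))


if __name__ == "__main__":
    leader = Leader()
    leader.accept("o1")
    leader.retransmit_unlearned()   # the LEARN has not arrived yet
    leader.on_learn("o1", 0)
    print(leader.outbox, leader.pending)
```

Reusing the original timestamp on retransmission is what lets a backup recognise the second ACCEPT as a duplicate rather than a new command.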

Two servers and one crash: problem
■ Primary crashes after executing an operation, but the backup
never received the accept message.

Two servers and one crash: solution
■ Never execute an operation before it is clear that it has been
learned.

Three servers and two crashes: still a problem?
■ What happens when LEARN(o1), as sent by S2 to S1, is lost?
□ S2 will also have to wait until it knows that S3 has learned o1.

Paxos: Fundamental Rule
■ In Paxos, a server S cannot execute an operation o until it has
received a LEARN(o) from all other nonfaulty servers.

Failure detection
■ Reliable failure detection is practically impossible.
□ A solution is to set timeouts, but take into account that a
detected failure may be false.
□ Each server is required to send a message declaring it is still
alive

Required number of servers
■ Paxos needs at least three servers
■ Adapted fundamental rule
□ In Paxos with three servers, a server S cannot execute an operation o
until it has received at least one (other) LEARN(o) message, so that it
knows that a majority of servers will execute o.
■ Assumptions before taking the next steps
□ Initially, S1 is the leader.
□ A server can reliably detect it has missed a message, and recover from
that miss.
□ When a new leader needs to be elected, the remaining servers follow a
strictly deterministic algorithm, such as S1 → S2 → S3.
□ A client cannot be asked to help the servers to resolve a situation.
■ If either one of the backups (S2 or S3) crashes, Paxos will behave correctly:
operations at nonfaulty servers are executed in the same order.
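
The adapted fundamental rule for three servers can be sketched as a check on how many servers are known to have learned an operation. The `Server` class below is hypothetical, and it assumes that a server counts itself as having learned an operation once it starts tracking it. Ordering by timestamp and leader election are left out.

```python
class Server:
    def __init__(self, name, n_servers=3):
        self.name = name
        self.n_servers = n_servers
        self.learned_by = {}    # op -> names of servers known to have learned op
        self.executed = []      # operations executed so far, in order

    def on_learn(self, op, sender):
        """Record LEARN(op) from another server, then re-check the rule."""
        self.learned_by.setdefault(op, {self.name}).add(sender)
        self._maybe_execute(op)

    def _maybe_execute(self, op):
        majority = self.n_servers // 2 + 1
        if op not in self.executed and len(self.learned_by[op]) >= majority:
            self.executed.append(op)


if __name__ == "__main__":
    s1 = Server("S1")
    s1.on_learn("o1", "S2")     # one other LEARN gives a majority of 2 out of 3
    s1.on_learn("o1", "S3")     # a later LEARN must not execute o1 twice
    print(s1.executed)
```

With three servers, one LEARN from another server is enough for a majority, which matches the rule stated above.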
Paxos (Cont.)
■ See pages 443-449 of Distributed Systems by Maarten van Steen and Andrew S.
Tanenbaum.

Consensus under arbitrary failure semantics
■ Essence.
□ Consider process groups in which communication between processes is
inconsistent: (a) improper forwarding of messages, or (b) telling different
things to different processes.

Consensus under arbitrary failure semantics (Cont.)
■ System model
□ We consider a primary P and n − 1 backups B1 , . . . , Bn−1.
□ A client sends v ∈ {T , F} to P
□ Messages may be lost, but this can be detected.
□ Messages cannot be corrupted beyond detection.
□ A receiver of a message can reliably detect its sender.
■ Byzantine agreement: requirements
□ BA1: Every nonfaulty backup process stores the same value.
□ BA2: If the primary is nonfaulty then every nonfaulty backup process
stores exactly what the primary had sent.
■ Notes
□ Primary faulty ⇒ BA1 says that backups may all store the same value, yet
one that differs from (and is thus wrong compared to) the value originally
sent by the client.
□ Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.

Why having 3k processes is not enough

Why having 3k + 1 processes is enough
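
The figures for these two slides are missing, so the following sketch only illustrates the counting argument for k = 1 with a nonfaulty primary and one lying backup. It assumes each backup collects the value it got from the primary plus the values the other backups claim to have received, and stores a value only if a strict majority agrees. The function name `backup_decision` is hypothetical.

```python
from collections import Counter


def backup_decision(from_primary, forwarded_by_peers):
    """Return the strictly-majority value among all reports, or None if there is none."""
    votes = [from_primary, *forwarded_by_peers]
    value, count = Counter(votes).most_common(1)[0]
    return value if count > len(votes) / 2 else None


# n = 3k + 1 = 4: the primary sends T; the honest peer forwards T, the faulty one F.
with_four = backup_decision(True, [True, False])

# n = 3k = 3: the primary sends T; the only other backup is faulty and forwards F.
with_three = backup_decision(True, [False])

if __name__ == "__main__":
    print(with_four, with_three)
```

In the three-process case the honest backup sees one T and one F and cannot tell whether the primary or the other backup is lying; with four processes the extra honest report breaks the tie.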

Distributed consensus: when can it be reached
■ Formal requirements for consensus
□ Processes produce the same output value
□ Every output value must be valid
□ Every process must eventually provide output

Failure detection
■ Issue
□ How can we reliably detect that a process has actually crashed?
■ General model
□ Each process is equipped with a failure detection module
□ A process P probes another process Q for a reaction
□ If Q reacts: Q is considered to be alive (by P)
□ If Q does not react within t time units: Q is suspected to have crashed
■ Observation for a synchronous system
□ a suspected crash ≡ a known crash

Practical failure detection
■ If P did not receive a heartbeat from Q within time t: P suspects Q.
■ If Q later sends a message (which is received by P):
□ P stops suspecting Q
□ P increases the timeout value t
■ Note: if Q did crash, P will keep suspecting Q
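
The heartbeat scheme above could be sketched as follows. The `FailureDetector` class, the fixed `increment` added to the timeout after a false suspicion, and the explicit `now` arguments (used instead of a real clock so the behaviour is easy to follow) are assumptions made for illustration.

```python
class FailureDetector:
    def __init__(self, timeout, increment=1.0):
        self.timeout = timeout        # current value of t
        self.increment = increment    # how much t grows after a false suspicion
        self.last_heard = {}          # process -> time of its last heartbeat
        self.suspected = set()

    def on_heartbeat(self, q, now):
        """A message from q arrived: stop suspecting it and relax the timeout."""
        if q in self.suspected:
            self.suspected.discard(q)
            self.timeout += self.increment
        self.last_heard[q] = now

    def check(self, now):
        """Suspect every process that has been silent for longer than t."""
        for q, heard_at in self.last_heard.items():
            if now - heard_at > self.timeout:
                self.suspected.add(q)
        return set(self.suspected)


if __name__ == "__main__":
    fd = FailureDetector(timeout=2.0)
    fd.on_heartbeat("Q", now=0.0)
    print(fd.check(now=3.0))        # Q has been silent too long: suspected
    fd.on_heartbeat("Q", now=3.5)   # Q was only slow: suspicion dropped, t grows
    print(fd.check(now=3.5), fd.timeout)
```

If Q really crashed, no further heartbeat arrives and every later `check` keeps reporting it as suspected.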

Reference

■ Distributed Systems: Principles and Paradigms, by Andrew S. Tanenbaum
and Maarten van Steen


Thank you
for your attention!
Johannes Sianipar
