Distributed Systems - Fault Tolerance

The document discusses fault tolerance in distributed systems. It defines key terms like faults, failures, and fault tolerance. It describes various types of failures that can occur. It also discusses techniques for achieving fault tolerance like redundancy, replication, and consensus algorithms. Fault detection methods like pinging and gossiping are explained. Recovery mechanisms like checkpointing and backward error recovery are summarized as well.

Uploaded by

asd

Distributed Systems - Fault Tolerance

INTRODUCTION

• A distributed computing system consists of a variety of hardware and software components.
• Failure of any of these components can lead to unanticipated, potentially disruptive behavior and service unavailability.
• Faults may occur for various reasons, such as communication failures, resource or hardware failures, faults in a process, and software errors.
• A system's fault-tolerance capability aims to minimize the loss that may be caused by such unpredictable system behavior.

Slide 2
Fault

• At the lowest level of abstraction, a fault can be termed a "defect".
• It can lead to an inaccurate system state. Faults in the system can be categorized by how they behave over time:
  - Transient: the fault occurs once and then disappears.
  - Intermittent: the fault occurs repeatedly, at irregular intervals.
  - Permanent: the fault persists until the faulty component is repaired or replaced, and can bring the system to a halt.
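A transient fault can often be masked simply by trying the operation again. The sketch below illustrates that idea; `TransientError`, `retry`, and `make_flaky` are hypothetical names introduced here, and the fixed number of attempts is an assumption rather than something the slides prescribe.

```python
import time


class TransientError(Exception):
    """Raised by an operation that may succeed if it is tried again."""


def retry(operation, attempts=3, delay=0.0):
    """Call operation() up to `attempts` times; re-raise the last error."""
    last_error = RuntimeError("retry called with no attempts")
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as err:
            last_error = err
            time.sleep(delay)  # wait before the next attempt
    raise last_error


def make_flaky(failures):
    """Return an operation that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def operation():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("temporary glitch")
        return "ok"

    return operation


# Two transient failures are hidden from the caller by the third attempt.
print(retry(make_flaky(2)))
```

A permanent fault cannot be masked this way: the retries run out and the error reaches the caller.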

Slide 3
Failure

• A failure occurs when a system or component no longer delivers its specified service; it is the externally visible result of a fault.
• The different types of failure are:
  - Crash failure: a server halts, but works correctly until it halts.
  - Omission failure: a server fails to respond to incoming requests.
    - Receive omission: a server fails to receive incoming messages.
    - Send omission: a server fails to send messages.
  - Timing failure: a server's response lies outside the specified time interval.
  - Response failure: a server's response is incorrect.
    - Value failure: the value of the response is wrong.
    - State-transition failure: the server deviates from the correct flow of control.
  - Arbitrary (Byzantine) failure: a server may produce arbitrary responses at arbitrary times.
Slide 4
Need

• Availability: the system must be usable immediately at any time.
• Reliability: the system must work for a long period of time without error.
• Safety: a temporary failure should have no catastrophic consequences.
• Maintainability: it must be possible to repair the system and fix a fault quickly and easily.
• Security: the system should be able to resist attacks against its integrity.

Slide 5
Failure masking by redundancy

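• Information redundancy: extra bits are added so that errors can be detected or corrected (e.g. CRC).
• Time redundancy: an action may be redone (e.g. retrying a transaction after an abort).
• Physical redundancy: hardware and software components may be duplicated (e.g. adding an extra disk, replicating the database).
• Triple modular redundancy (TMR): a form of physical redundancy in which a component is triplicated and a voter takes the majority of the three outputs.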
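The voting step of triple modular redundancy can be sketched in a few lines. This is only an illustration: `tmr_vote`, `healthy`, and `faulty` are hypothetical names, and the replicas are plain functions rather than real hardware modules.

```python
from collections import Counter


def tmr_vote(outputs):
    """Return the majority value among three replica outputs."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three outputs differ, so more than one replica is faulty.
        raise RuntimeError("no majority among the replicas")
    return value


def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1  # a replica that produces a wrong value


replicas = [healthy, faulty, healthy]
result = tmr_vote([replica(4) for replica in replicas])
print(result)  # the single faulty replica is outvoted
```

With three replicas the voter masks one faulty replica; if two fail in the same way, the wrong value wins the vote.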

Slide 6
Metrics

• Mean Time To Failure (MTTF): the average time until a component fails.
• Mean Time To Repair (MTTR): the average time needed to repair a component.
• Mean Time Between Failures (MTBF): simply MTTF + MTTR.
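As a worked example of these metrics, the sketch below computes MTBF and the commonly used steady-state availability, MTTF / (MTTF + MTTR). The availability formula is not stated on the slide and is included here as a standard companion definition; the numbers are made up for illustration.

```python
def mtbf(mttf_hours, mttr_hours):
    """Mean time between failures: time to failure plus time to repair."""
    return mttf_hours + mttr_hours


def availability(mttf_hours, mttr_hours):
    """Fraction of time the component is up, in the long run."""
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical component: fails every 999 hours on average, 1 hour to repair.
mttf, mttr = 999.0, 1.0
print(mtbf(mttf, mttr))          # 1000.0
print(availability(mttf, mttr))  # 0.999
```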

Slide 7
Fault tolerance

Slide 8
PROCESS RESILIENCE

• The key property that all groups have is that when a message is sent to the group itself, all members of the group receive it. In this way, if one process in a group fails, hopefully some other process can take over for it.

Slide 9
Failure Masking and Replication

• Replicate processes and organize them into a group, replacing a single (vulnerable) process with a (fault-tolerant) group.
• Replication can be done by means of primary-based protocols:
  - A group of processes is organized in a hierarchical fashion in which a primary coordinates all write operations.
  - When the primary crashes, the backups execute an election algorithm to choose a new primary.
• Alternatively, replication can be done through replicated-write protocols.
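The primary-based scheme can be sketched as a small in-memory simulation. The slides do not name an election algorithm, so choosing the live replica with the highest id is an assumption made here; `Replica` and `ReplicaGroup` are hypothetical names.

```python
class Replica:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.alive = True
        self.store = {}


class ReplicaGroup:
    def __init__(self, ids):
        self.replicas = {i: Replica(i) for i in ids}
        self.primary_id = self.elect()

    def elect(self):
        """Assumed election rule: the live replica with the highest id wins."""
        live = [i for i, r in self.replicas.items() if r.alive]
        if not live:
            raise RuntimeError("no live replica left")
        return max(live)

    def write(self, key, value):
        """The primary coordinates the write and forwards it to live backups."""
        if not self.replicas[self.primary_id].alive:
            self.primary_id = self.elect()  # backups choose a new primary
        for replica in self.replicas.values():
            if replica.alive:
                replica.store[key] = value
        return self.primary_id


group = ReplicaGroup([1, 2, 3])
print(group.write("x", 1))       # handled by primary 3
group.replicas[3].alive = False  # the primary crashes
print(group.write("x", 2))       # replica 2 takes over
```

Clients keep writing to the group as a whole, so the crash of one member stays hidden from them.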

Slide 10
Agreement in Faulty Systems

• Synchronous versus asynchronous systems: a system is synchronous if and only if the processes are known to operate in lock-step mode. A system that is not synchronous is said to be asynchronous.
• Bounded versus unbounded communication delay: delay is bounded if and only if we know that every message is delivered within a globally predetermined maximum time.

Slide 11
Agreement in Faulty Systems

• Ordered versus unordered message delivery: we distinguish the situation where messages from the same sender are delivered in the order in which they were sent from the situation in which there is no such guarantee.
• Unicast versus multicast message transmission.
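One common way to obtain ordered delivery per sender is to number the messages and hold back any that arrive early. The sketch below assumes the sender attaches a sequence number starting at 0; `FifoReceiver` is a hypothetical name, and the slides themselves do not prescribe this mechanism.

```python
class FifoReceiver:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # sender -> {seq: payload} waiting for a gap to fill
        self.delivered = []  # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of a message that was already delivered
        self.held.setdefault(sender, {})[seq] = payload
        # Deliver every message that is now in sequence.
        while expected in self.held[sender]:
            self.delivered.append((sender, self.held[sender].pop(expected)))
            expected += 1
        self.next_seq[sender] = expected


rx = FifoReceiver()
rx.receive("A", 1, "second")  # arrives early, so it is held back
rx.receive("A", 0, "first")   # fills the gap, so both are delivered
print(rx.delivered)
```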

Slide 12
Failure Detection

• Pinging: a node actively sends a message to another node and expects a reply within a timeout.
• Gossiping: each node regularly announces to its neighbors that it is still up and running.
• A failure detector should distinguish network failures from node failures. One way of dealing with this problem is not to let a single node decide whether one of its neighbors has crashed.
• Instead, when a ping message times out, the node asks other neighbors whether they can reach the presumed failing node.
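The idea of asking other neighbors before declaring a node dead can be sketched as follows. The network is simulated with a set of broken links and a set of crashed nodes; `suspect_crashed` and `can_reach` are hypothetical names, and a real detector would send ping messages with timeouts instead of calling a function.

```python
broken_links = {frozenset(("A", "C"))}  # the A<->C link is down
crashed = set()                         # nodes that have really crashed


def can_reach(src, dst):
    """Simulated ping: fails if either node crashed or the link is broken."""
    if src in crashed or dst in crashed:
        return False
    return frozenset((src, dst)) not in broken_links


def suspect_crashed(target, observer, neighbors, ping):
    """Declare `target` crashed only if no one at all can reach it."""
    if ping(observer, target):
        return False
    helpers = [n for n in neighbors if n not in (observer, target)]
    return not any(ping(helper, target) for helper in helpers)


# A cannot reach C, but B and D can: a network failure, not a node failure.
print(suspect_crashed("C", "A", ["B", "D"], can_reach))  # False

crashed.add("C")
# Now nobody reaches C, so the node itself is presumed to have failed.
print(suspect_crashed("C", "A", ["B", "D"], can_reach))  # True
```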

Slide 13
DISTRIBUTED COMMIT

• THE TWO-PHASE COMMIT PROTOCOL
  - One of the processes is the coordinator and the other processes are cohorts.

Slide 14
Phase 1: Coordinator
• The coordinator sends a COMMIT_REQUEST message to every cohort, asking the cohorts to commit.
• The coordinator waits for the replies.
Phase 1: Cohort
• On receipt of the COMMIT_REQUEST message:
  - If the transaction is successful, the cohort
    • writes undo and redo log records to stable storage;
    • sends an AGREED message to the coordinator.
  - Otherwise, if the transaction is unsuccessful, the cohort
    • sends an ABORT message to the coordinator.

Slide 15
Phase 2: Coordinator
• If all the cohorts reply AGREED and the coordinator also agrees, the coordinator writes a COMMIT record into the log and sends a COMMIT message to all the cohorts.
• Otherwise, it sends an ABORT message to all the cohorts.
• The coordinator waits for acknowledgements from each cohort.
• If an acknowledgement does not arrive from a cohort within the timeout period, the coordinator resends the COMMIT/ABORT message to that cohort.
• When all the acknowledgements have been received, the coordinator writes a COMPLETE record to the log.

Slide 16
Phase 2: Cohorts
• On receiving a COMMIT message, a cohort releases all the resources and locks it held for executing the transaction, and sends an acknowledgement.
• On receiving an ABORT message, a cohort undoes the transaction using the undo log record, releases all the resources and locks it held for performing the transaction, and sends an acknowledgement.
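The two phases described above can be sketched as a single-process simulation. It leaves out timeouts, retransmission, and real stable storage (lists stand in for the logs); `Cohort` and `two_phase_commit` are hypothetical names, not part of any library.

```python
class Cohort:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []        # stands in for stable storage
        self.state = "INIT"

    def on_commit_request(self):
        """Phase 1: vote AGREED after logging, or vote ABORT."""
        if self.can_commit:
            self.log.append("UNDO/REDO")
            self.state = "AGREED"
            return "AGREED"
        self.state = "ABORTED"
        return "ABORT"

    def on_decision(self, decision):
        """Phase 2: apply the coordinator's decision and acknowledge it."""
        if decision == "ABORT" and self.state == "AGREED":
            self.log.append("UNDONE")  # roll back using the undo record
        self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"
        return "ACK"


def two_phase_commit(cohorts, coordinator_agrees=True):
    coordinator_log = []
    # Phase 1: send COMMIT_REQUEST to every cohort and collect the replies.
    votes = [cohort.on_commit_request() for cohort in cohorts]
    if coordinator_agrees and all(vote == "AGREED" for vote in votes):
        decision = "COMMIT"
        coordinator_log.append("COMMIT")
    else:
        decision = "ABORT"
    # Phase 2: send the decision and wait for every acknowledgement.
    acks = [cohort.on_decision(decision) for cohort in cohorts]
    if all(ack == "ACK" for ack in acks):
        coordinator_log.append("COMPLETE")
    return decision, coordinator_log


print(two_phase_commit([Cohort("c1"), Cohort("c2")]))
print(two_phase_commit([Cohort("c1"), Cohort("c2", can_commit=False)]))
```

A single ABORT vote is enough to abort the transaction at every cohort, including those that had already agreed.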

Slide 17
VOTING PROTOCOLS

• With the voting mechanism, each replica is assigned some number of votes, and a process must collect a majority of the votes before it can access a replica.
• The voting mechanism is more fault-tolerant than a commit protocol in that it allows access to data under network partitions, site failures, and message losses without compromising the integrity of the data.

Slide 18
Algorithm

• Every replica is assigned a certain number of votes.
• Every site has a lock manager.
• Every file has a version number.
• A read or write is permitted only if the requesting process obtains a certain number of votes: a read quorum for a read and a write quorum for a write.
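The quorum check can be sketched as below. The slide only says "a certain number of votes"; the constraints used here (read quorum + write quorum > total votes, and write quorum > half the total) are the usual ones for weighted voting and are an assumption added for illustration, as are the names `valid_quorums`, `collect_votes`, and `read_latest`.

```python
def valid_quorums(total_votes, read_quorum, write_quorum):
    """Assumed rule: reads overlap writes, and two writes always overlap."""
    return (read_quorum + write_quorum > total_votes
            and 2 * write_quorum > total_votes)


def collect_votes(votes, reachable):
    """Sum the votes of the replicas the requesting process can reach."""
    return sum(votes[name] for name in reachable)


def read_latest(replicas, votes, reachable, read_quorum):
    """Return the value with the highest version number, if quorum is met."""
    if collect_votes(votes, reachable) < read_quorum:
        raise RuntimeError("read quorum not reached")
    version, value = max((replicas[name] for name in reachable),
                         key=lambda pair: pair[0])
    return value


votes = {"A": 1, "B": 1, "C": 1}
# Each replica stores (version number, value); C missed the latest write.
replicas = {"A": (2, "new"), "B": (2, "new"), "C": (1, "old")}

print(valid_quorums(3, read_quorum=2, write_quorum=2))      # True
print(read_latest(replicas, votes, ["B", "C"], read_quorum=2))  # new
```

Because any read quorum overlaps the last write quorum, at least one reachable replica holds the newest version, and the version number identifies it.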

Slide 19
Recovery of a System

• Forward error recovery:
  - If the nature of the errors and damage caused by faults can be completely and accurately assessed, then it is possible to remove those errors from the process's state and enable the process to move forward.
• Backward error recovery:
  - If it is not possible to foresee the nature of the faults and to remove all the errors from the process's state, then the process state can be restored to a previous error-free state of the process.
• State-based approach:
  - Recovery point/checkpoint: in the state-based approach to recovery, the complete state of a process is saved when a recovery point is established, and recovering a process involves reinstating its saved state and resuming the execution of the process from that state.
Slide 20
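Checkpointing and backward error recovery can be sketched with an in-memory process state. A real system would write the checkpoint to stable storage; here a deep copy stands in for it, and `CheckpointedProcess` is a hypothetical name.

```python
import copy


class CheckpointedProcess:
    def __init__(self):
        self.state = {"counter": 0}
        self._saved = copy.deepcopy(self.state)  # initial recovery point

    def checkpoint(self):
        """Establish a recovery point by saving the complete state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Backward error recovery: reinstate the last saved state."""
        self.state = copy.deepcopy(self._saved)

    def step(self, fail=False):
        self.state["counter"] += 1
        if fail:
            self.state["counter"] = -999  # the state is now corrupted
            raise RuntimeError("error detected")


process = CheckpointedProcess()
process.step()
process.checkpoint()          # recovery point with counter == 1
try:
    process.step(fail=True)   # a fault corrupts the state
except RuntimeError:
    process.rollback()        # restore the error-free state
process.step()                # resume execution from the recovery point
print(process.state)          # {'counter': 2}
```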
