Distributed Systems - Fault Tolerance
INTRODUCTION
Slide 2
Fault
Slide 3
Failure
Slide 5
Failure masking by redundancy
Slide 6
Metrics
Slide 7
Fault tolerance
Slide 8
PROCESS RESILIENCE
Slide 9
Failure Masking and Replication
Slide 10
Agreement in Faulty Systems
Slide 11
Agreement in Faulty Systems
Slide 12
Failure Detection
Pinging - in which a node actively probes its neighbors and
expects a reply within a timeout.
Gossiping - in which each node regularly announces to its
neighbors that it is still up and running.
A detector must distinguish network failures from node failures.
One way of dealing with this problem is not to let a single node
decide whether one of its neighbors has crashed.
Instead, when noticing a timeout on a ping message, a
node asks other neighbors to check whether they can
reach the presumed failing node.
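The indirect check described above can be sketched in a few lines. This is a minimal illustration, not a production detector: the network is simulated by a hypothetical `make_network` helper, and the names `classify_target`, `can_reach`, and the three result strings are assumptions introduced for this example.

```python
from typing import Callable, Iterable, Set, FrozenSet

Reach = Callable[[str, str], bool]


def make_network(down_nodes: Set[str], broken_links: Set[FrozenSet[str]]) -> Reach:
    """Build a simulated ping function (hypothetical stand-in for real messaging)."""
    def can_reach(src: str, dst: str) -> bool:
        # A ping fails if either endpoint is down or the link between them is broken.
        if src in down_nodes or dst in down_nodes:
            return False
        return frozenset((src, dst)) not in broken_links
    return can_reach


def classify_target(observer: str, target: str,
                    helpers: Iterable[str], can_reach: Reach) -> str:
    """Decide what a ping timeout most likely means, using other neighbors."""
    if can_reach(observer, target):
        return "alive"
    # Direct ping timed out: ask each reachable neighbor to probe the target.
    for helper in helpers:
        if can_reach(observer, helper) and can_reach(helper, target):
            # Someone else can reach it, so the node is up and the link is at fault.
            return "network_problem"
    # Nobody could reach the target: suspect a node crash.
    return "suspected_crash"


link_cut = make_network(set(), {frozenset(("A", "B"))})
print(classify_target("A", "B", ["C", "D"], link_cut))

node_down = make_network({"B"}, set())
print(classify_target("A", "B", ["C", "D"], node_down))
```

The result is only a suspicion: if every helper is also cut off from the target, a partition still looks like a crash.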
Slide 13
DISTRIBUTED COMMIT
Slide 14
Phase 1: Coordinator
The coordinator sends a Commit_Request message to
every cohort, asking the cohorts to commit.
The coordinator then waits for the replies.
Phase 1: Cohort
On receipt of the Commit_Request message:
- If the transaction is successful, the cohort
• Writes the undo and redo logs to stable storage.
• Sends an AGREED message to the coordinator.
- Else, if the transaction is unsuccessful, the cohort
• Sends an ABORT message to the coordinator.
Slide 15
Phase 2: Coordinator
If all the cohorts reply AGREED and the coordinator also agrees,
the coordinator writes a COMMIT record into the log and sends
a COMMIT message to all the cohorts.
Otherwise, it sends an ABORT message to all the cohorts.
The coordinator then waits for an acknowledgement from each cohort.
If an acknowledgement does not arrive from a cohort within the
timeout period, the coordinator resends the COMMIT/ABORT message
to that cohort.
Once all the acknowledgements are received, the coordinator writes
a COMPLETE record to the log.
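The coordinator's side of both phases can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: messages are modelled as direct method calls, a lost acknowledgement is modelled by `on_decision` returning False, and the names `Coordinator`, `StubCohort`, `acks_lost`, and `max_retries` are hypothetical. The bounded retry count is an addition for the sketch; the protocol as described simply resends after each timeout.

```python
from typing import Dict, List


class StubCohort:
    """Hypothetical cohort stand-in: fixed vote, optionally loses some acks."""

    def __init__(self, vote: str, acks_lost: int = 0) -> None:
        self.vote = vote
        self.acks_lost = acks_lost
        self.received: List[str] = []

    def on_commit_request(self) -> str:
        return self.vote  # "AGREED" or "ABORT"

    def on_decision(self, decision: str) -> bool:
        self.received.append(decision)
        if self.acks_lost > 0:
            self.acks_lost -= 1
            return False  # simulate an acknowledgement that timed out
        return True


class Coordinator:
    def __init__(self, cohorts: Dict[str, StubCohort], max_retries: int = 3) -> None:
        self.cohorts = cohorts
        self.max_retries = max_retries  # assumption: give up after this many sends
        self.log: List[str] = []

    def run(self, coordinator_agrees: bool = True) -> str:
        # Phase 1: send Commit_Request to every cohort and collect the replies.
        votes = [c.on_commit_request() for c in self.cohorts.values()]
        commit = coordinator_agrees and all(v == "AGREED" for v in votes)
        decision = "COMMIT" if commit else "ABORT"
        if commit:
            self.log.append("COMMIT")  # commit record is logged before sending

        # Phase 2: send the decision, resending to a cohort whose ack timed out.
        for cohort in self.cohorts.values():
            acked = False
            for _ in range(self.max_retries):
                if cohort.on_decision(decision):
                    acked = True
                    break
            if not acked:
                return decision  # acks incomplete: no COMPLETE record yet

        self.log.append("COMPLETE")
        return decision


coord = Coordinator({"a": StubCohort("AGREED"), "b": StubCohort("AGREED", acks_lost=1)})
print(coord.run(), coord.log)
```

Here cohort "b" drops its first acknowledgement, so it receives the COMMIT message twice, which is why cohorts should treat a repeated decision as harmless.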
Slide 16
Phase 2: Cohort
On receiving a COMMIT message, a cohort releases all the
resources and locks it holds for executing the
transaction and sends an acknowledgement.
On receiving an ABORT message, a cohort undoes the transaction
using the undo log record, releases all the resources and
locks it holds for the transaction, and sends
an acknowledgement.
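A cohort's behavior in both phases can be sketched as a small state holder. This is an illustrative model only: the `Cohort` class, its single `value` field, the in-memory `stable_log` list standing in for stable storage, and the `succeeded` flag are all assumptions made for the example.

```python
from typing import Any, List, Tuple


class Cohort:
    """Hypothetical cohort holding one data item protected by one lock."""

    def __init__(self, value: Any) -> None:
        self.value = value
        self.stable_log: List[Tuple[str, Any]] = []  # stand-in for stable storage
        self.locks_held = False
        self._before: Any = None  # volatile copy of the pre-transaction value

    def execute(self, new_value: Any) -> None:
        # Tentatively apply the transaction while holding the lock.
        self.locks_held = True
        self._before = self.value
        self.value = new_value

    def on_commit_request(self, succeeded: bool = True) -> str:
        # Phase 1: vote. Log undo and redo records before agreeing.
        if not succeeded:
            return "ABORT"
        self.stable_log.append(("UNDO", self._before))
        self.stable_log.append(("REDO", self.value))
        return "AGREED"

    def on_commit(self) -> str:
        # Phase 2, COMMIT: keep the new value, release locks, acknowledge.
        self.locks_held = False
        return "ACK"

    def on_abort(self) -> str:
        # Phase 2, ABORT: restore the old value from the latest undo record.
        # Assumption: fall back to the volatile copy if nothing was logged.
        undo = next((v for kind, v in reversed(self.stable_log) if kind == "UNDO"),
                    self._before)
        self.value = undo
        self.locks_held = False
        return "ACK"


cohort = Cohort(10)
cohort.execute(20)
print(cohort.on_commit_request(), cohort.on_abort(), cohort.value)
```

Writing the undo and redo records before voting AGREED is what lets the cohort honor either decision later, even though it no longer controls the outcome.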
Slide 17
VOTING PROTOCOLS
Slide 18
Algorithm
Slide 19
Recovery of a System
https://programmerprodigy.code.blog/2021/07/07/fault-tolerance-and-recovery-in-distributed-systems/
https://www.scirp.org/html/1-9702032_61986.htm#txtF6
Slide 21