CS 194: Distributed Systems

This document summarizes key concepts related to distributed commit and recovery in distributed systems: 1) Two-phase commit (2PC) is a protocol that allows processes to agree whether to commit or abort a transaction in a distributed system despite failures. It uses a coordinator and participants. 2) Stable storage is used to log actions during 2PC to allow processes to recover after crashes and determine their commit decision. 3) Checkpointing involves periodically saving process state to stable storage to enable backward or forward recovery from failures through rollback or restart from a known good state. Message logging is another recovery technique.

Uploaded by

Karthik Kannan


CS 194: Distributed Systems Distributed Commit, Recovery

Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720-1776

Distributed Commit

Goal: Either all members of a group decide to perform an operation, or none of them perform the operation

Assumptions

Failures:
- Crash failures that can be recovered
- Communication failures detectable by timeouts

Notes:
- Commit requires a set of processes to agree
- Similar to the Byzantine generals problem
- But the solution is much simpler because the assumptions are stronger

Two Phase Commit (2PC)

Coordinator:
- send VOTE_REQ to all
- if (all votes yes): decide commit; send COMMIT to all
- else: decide abort; send ABORT to all who voted yes
- halt

Participants:
- send vote to coordinator
- if (vote == no): decide abort; halt
- if receive ABORT, decide abort; else decide commit
- halt
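The message flow above can be sketched as a small failure-free simulation. This is only an illustration, not the course's implementation: the function name `two_phase_commit`, the dictionary of votes, and the string decisions are all assumptions made for the example, and no crashes or timeouts are modeled.

```python
from typing import Dict


def two_phase_commit(votes: Dict[str, bool]) -> Dict[str, str]:
    """Simulate one failure-free round of 2PC.

    votes maps a (hypothetical) participant name to its vote:
    True means YES, False means NO.
    Returns the decision reached by the coordinator and each participant.
    """
    decisions: Dict[str, str] = {}

    # Phase 1: the coordinator sends VOTE_REQ and every participant replies.
    # A participant that votes NO decides abort on its own and halts.
    for name, vote in votes.items():
        if not vote:
            decisions[name] = "abort"

    # Phase 2: the coordinator commits only if every vote is YES.
    if all(votes.values()):
        decisions["coordinator"] = "commit"
        for name in votes:
            decisions[name] = "commit"      # COMMIT is sent to all
    else:
        decisions["coordinator"] = "abort"
        for name, vote in votes.items():
            if vote:
                decisions[name] = "abort"   # ABORT goes only to YES voters
    return decisions
```

For example, `two_phase_commit({"p1": True, "p2": False})` leaves the coordinator and both participants at `abort`, which matches the goal that either all members perform the operation or none do.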

2PC State Machine

Figure: (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

2PC: Crash Recovery Protocol

Stable storage is persistent memory that supports writes that are atomic with respect to failures.

Log actions:
- c sends VOTE_REQ: write start
- p votes YES: write yes
- p votes NO: write abort
- c decides commit: write commit (this is the commit point)
- c decides abort: write abort
- p receives decision: write decision

2PC: Crash Recovery Protocol

Upon recovery, a process r starts reading the values logged to stable storage.

If there is a start record, then r was the coordinator:
- If there is a subsequent abort or commit, then the decision was made; otherwise decide abort.

Otherwise, r was a participant:
- If there is an abort or commit, then the decision was made.
- If there is no yes, then decide abort.
- Otherwise (i.e., there is a yes record), run the termination protocol.

... when can these records be garbage collected?
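The recovery rules above amount to a short decision function over the log records. The sketch below assumes the log is a list of the record strings named on the slide ("start", "yes", "abort", "commit"); the function name `recover` and its return strings are hypothetical, and the termination protocol itself is not implemented.

```python
from typing import List


def recover(log: List[str]) -> str:
    """Decide what a recovering process should do from its stable-storage log.

    Returns "commit", "abort", or "run termination protocol".
    """
    # A logged decision always wins, for coordinator and participant alike.
    for record in log:
        if record in ("commit", "abort"):
            return record

    if "start" in log:
        # r was the coordinator and crashed before deciding.
        return "abort"

    if "yes" not in log:
        # r was a participant that had not voted YES yet.
        return "abort"

    # r voted YES but does not know the outcome: it cannot decide alone.
    return "run termination protocol"
```

The only case that cannot be resolved locally is a participant with a yes record and no decision, since the coordinator may already have committed.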

Recovery Techniques: Checkpoints

Goal: recover a process from error

Backward recovery: checkpoint the state of the process periodically
- Go to previous checkpoint, if error
- Problem: same failure may repeat

Forward recovery: go to a known good state if error
- Problem: need to know in advance which error may occur

Example: Reliable Communication

Backward recovery: retransmit packet if lost

Forward recovery: use erasure coding

- Instead of sending k packets, send n > k using erasure coding
- As long as receiver gets at least k packets out of n, it can reconstruct the original k packets
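As a concrete but deliberately minimal instance of this idea, the sketch below uses a single XOR parity packet, so n = k + 1 and any one lost packet can be rebuilt. The slides do not say which code is meant; this choice, the function names `encode` and `decode`, and the equal-length packet requirement are assumptions for illustration. Practical systems typically use stronger codes such as Reed-Solomon.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(packets: List[bytes]) -> List[bytes]:
    """Turn k equal-length packets into n = k + 1 by appending an XOR parity."""
    return packets + [reduce(xor_bytes, packets)]


def decode(received: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the k data packets when at most one of the n is missing (None)."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single parity packet can repair only one loss")
    repaired = list(received)
    if missing:
        # The XOR of all n packets is zero, so the missing packet equals
        # the XOR of the n - 1 packets that did arrive.
        present = [p for p in received if p is not None]
        repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired[:-1]  # drop the parity packet
```

With k = 3 data packets, the receiver can lose any one of the four transmitted packets and still recover the originals without asking for a retransmission.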

Recovery Techniques: Message Logging

Sender based: sender logs message before sending it out

Receiver based: receiver logs message before delivering it

Replaying the logged messages after restoring a checkpoint brings the state beyond the most recent checkpoint
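A toy receiver-based example may help show how a checkpoint and a message log combine. It assumes a deterministic process whose state is just a running sum of delivered integers; the class `LoggedProcess` and its methods are hypothetical names, and an in-memory list stands in for stable storage.

```python
from typing import List, Tuple


class LoggedProcess:
    """Toy process with receiver-based message logging and one checkpoint."""

    def __init__(self) -> None:
        self.state = 0                      # volatile state, lost on a crash
        self.log: List[int] = []            # stands in for stable storage
        # (saved state, number of log entries at the time of the checkpoint)
        self.checkpoint: Tuple[int, int] = (0, 0)

    def deliver(self, msg: int) -> None:
        self.log.append(msg)                # log the message first ...
        self.state += msg                   # ... then deliver it

    def take_checkpoint(self) -> None:
        self.checkpoint = (self.state, len(self.log))

    def crash_and_recover(self) -> None:
        saved_state, log_pos = self.checkpoint
        self.state = saved_state            # roll back to the checkpoint
        for msg in self.log[log_pos:]:      # replay messages logged after it
            self.state += msg
```

Replay only reproduces the pre-crash state because delivery here is deterministic; that is an assumption of the sketch, not something the log guarantees by itself.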

Distributed Checkpointing: Recovery Line

Recovery line: most recent consistent snapshot

- If a process P has recorded the receipt of a message m, there should be a process Q that has recorded the sending of m
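The consistency condition can be checked mechanically. In this sketch each checkpoint is assumed to record two sets of message identifiers, those sent and those received so far; the `Checkpoint` type and the function `is_consistent` are illustrative names, not part of the slides.

```python
from typing import Dict, FrozenSet, NamedTuple


class Checkpoint(NamedTuple):
    sent: FrozenSet[str]        # ids of messages recorded as sent
    received: FrozenSet[str]    # ids of messages recorded as received


def is_consistent(cut: Dict[str, Checkpoint]) -> bool:
    """True if every recorded receipt has a matching recorded send."""
    all_sent = set()
    for cp in cut.values():
        all_sent |= cp.sent
    return all(cp.received <= all_sent for cp in cut.values())
```

A cut in which Q has recorded receiving m1 while P's checkpoint predates sending m1 fails this check, so it cannot serve as a recovery line.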

How do you find a recovery line?

Independent Checkpointing: The Domino Effect

Domino effect: cascaded rollback to find a recovery line

Solutions:
- Coordinated checkpointing: use two-phase non-blocking protocol (see the book)
- Logging and replaying messages
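The cascade can be illustrated with a small search over independently taken checkpoints. The representation is an assumption made for this sketch: each process has a list of checkpoints, oldest first, each a pair of sets (messages sent, messages received), and index 0 is the initial state with both sets empty. The name `find_recovery_line` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Checkpoint = Tuple[Set[str], Set[str]]   # (sent ids, received ids)


def find_recovery_line(history: Dict[str, List[Checkpoint]]) -> Dict[str, int]:
    """Roll processes back until no checkpoint records an unsent message.

    Returns, per process, the index of the checkpoint on the recovery line.
    Assumes history[p][0] is the initial state with empty sets, so the
    search always terminates.
    """
    line = {p: len(cps) - 1 for p, cps in history.items()}
    while True:
        sent: Set[str] = set()
        for p, i in line.items():
            sent |= history[p][i][0]
        # A process is an orphan if it recorded a receipt nobody recorded sending.
        orphans = [p for p, i in line.items() if not history[p][i][1] <= sent]
        if not orphans:
            return line
        for p in orphans:
            line[p] -= 1                 # this rollback may orphan others
```

In the usage below, rolling P back removes the record of sending m1, which in turn forces Q back: both processes end up at their initial state.

```python
history = {
    "P": [(set(), set()), ({"m1"}, {"m2"})],   # P's last checkpoint saw m2 arrive
    "Q": [(set(), set()), (set(), {"m1"})],    # Q checkpointed before sending m2
}
line = find_recovery_line(history)
```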

Message Logging and Checkpointing

Incorrect replay of messages after recovery, leading to an orphan process

Stable Storage

Storage designed to survive anything except major calamities

Use two disks to record identical information

1) Write and verify sector on disk 1
2) Write and verify sector on disk 2

Recovery
- Verify all sectors
- If two corresponding sectors differ, copy sector from disk 1 to disk 2
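The two-disk scheme can be simulated in memory. Everything here is an assumption for illustration: the classes `Disk` and `StableStore`, the CRC-32 checksum used to detect a bad sector, and the `crash_between` flag that models a crash after disk 1 is updated. The slide only states the disk 1 to disk 2 copy; restoring a sector that fails verification from the other disk is this sketch's reading of the "bad spot" case.

```python
import zlib
from typing import List, Optional, Tuple


class Disk:
    """In-memory disk whose sectors carry a CRC-32 checksum."""

    def __init__(self, n_sectors: int) -> None:
        empty = (b"", zlib.crc32(b""))
        self.sectors: List[Tuple[bytes, int]] = [empty] * n_sectors

    def write(self, i: int, data: bytes) -> None:
        self.sectors[i] = (data, zlib.crc32(data))

    def read(self, i: int) -> Optional[bytes]:
        data, crc = self.sectors[i]
        return data if zlib.crc32(data) == crc else None   # None = bad sector


class StableStore:
    def __init__(self, n_sectors: int) -> None:
        self.d1, self.d2 = Disk(n_sectors), Disk(n_sectors)

    def write(self, i: int, data: bytes, crash_between: bool = False) -> None:
        self.d1.write(i, data)                 # 1) write and verify disk 1
        if self.d1.read(i) != data:
            raise IOError("verify failed on disk 1")
        if crash_between:
            return                             # simulated crash before step 2
        self.d2.write(i, data)                 # 2) write and verify disk 2
        if self.d2.read(i) != data:
            raise IOError("verify failed on disk 2")

    def recover(self) -> None:
        for i in range(len(self.d1.sectors)):
            a, b = self.d1.read(i), self.d2.read(i)
            if a is None and b is not None:
                self.d1.write(i, b)            # bad spot on disk 1
            elif a is not None and a != b:
                self.d2.write(i, a)            # disk 1 holds the newer data
```

Writing disk 1 first is what makes the recovery rule safe: after a crash between the two steps, disk 1 holds the new value and disk 2 the old one, so copying from disk 1 completes the interrupted write.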

Stable Storage Recovery

Figure: (a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.
