CS 194: Distributed Systems
Scott Shenker and Ion Stoica
Computer Science Division
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720-1776
Distributed Commit
Goal: Either all members of a group decide to perform an operation, or none of them perform the operation
Assumptions
Failures:
- Crash failures that can be recovered
- Communication failures detectable by timeouts
Notes:
- Commit requires a set of processes to agree, similar to the Byzantine generals problem
- The solution is much simpler because the assumptions are stronger
a) The finite state machine for the coordinator in 2PC
b) The finite state machine for a participant
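The two state machines can be sketched in code. The figures themselves are not reproduced here, so the state names (INIT, WAIT, READY, COMMIT, ABORT) and message names (VOTE_COMMIT, VOTE_ABORT, GLOBAL_COMMIT, GLOBAL_ABORT) below are assumptions taken from the usual textbook presentation of 2PC. The classes, the `will_commit` flag, and the in-process method calls that stand in for network messages are hypothetical and only illustrate the transitions.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    WAIT = "WAIT"      # coordinator only: waiting for votes
    READY = "READY"    # participant only: voted commit, waiting for decision
    COMMIT = "COMMIT"
    ABORT = "ABORT"


class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit  # hypothetical: local ability to commit
        self.state = State.INIT

    def on_vote_request(self):
        # INIT -> READY on a commit vote, INIT -> ABORT otherwise.
        if self.will_commit:
            self.state = State.READY
            return "VOTE_COMMIT"
        self.state = State.ABORT
        return "VOTE_ABORT"

    def on_decision(self, decision):
        # READY -> COMMIT or READY -> ABORT, following the coordinator.
        if decision == "GLOBAL_COMMIT":
            self.state = State.COMMIT
        else:
            self.state = State.ABORT


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.state = State.INIT

    def run(self):
        # Phase 1: INIT -> WAIT, collect one vote per participant.
        self.state = State.WAIT
        votes = []
        for p in self.participants:
            try:
                votes.append(p.on_vote_request())
            except TimeoutError:
                # A communication failure detected by timeout counts as abort.
                votes.append("VOTE_ABORT")

        # Phase 2: commit only if every vote was a commit vote.
        if all(v == "VOTE_COMMIT" for v in votes):
            decision, self.state = "GLOBAL_COMMIT", State.COMMIT
        else:
            decision, self.state = "GLOBAL_ABORT", State.ABORT
        for p in self.participants:
            p.on_decision(decision)
        return decision
```

With two participants that can both commit, `Coordinator(...).run()` returns "GLOBAL_COMMIT" and every process ends in COMMIT; a single abort vote drives all of them to ABORT, which is the all-or-nothing goal stated above.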
Goal: recover a process from an error
Backward recovery: checkpoint the state of the process periodically
- On error, go back to the previous checkpoint
- Problem: the same failure may repeat
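Backward recovery can be illustrated with a minimal in-memory sketch. The class `CheckpointedProcess` and its methods are hypothetical names, and a deep copy held in memory stands in for a checkpoint that a real system would write to stable storage.

```python
import copy


class CheckpointedProcess:
    def __init__(self, state):
        self.state = state
        # Assumption: an in-memory copy stands in for a checkpoint on disk.
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Save a snapshot of the current state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)

    def step(self, operation):
        # Apply an operation; on error, go back to the previous checkpoint.
        try:
            operation(self.state)
            return True
        except Exception:
            self.rollback()
            return False
```

Rolling back also discards correct work done since the checkpoint, and re-running the same failing operation from the restored state fails again, which is the "same failure may repeat" problem noted above.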
Sender based: the sender logs each message before sending it out
Receiver based: the receiver logs each message before delivering it
Replay logged messages between checkpoints to restore state beyond the most recent checkpoint
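Receiver-based logging combined with a checkpoint can be sketched as follows. The class `LoggingReceiver`, the integer messages, and the running-total state are hypothetical; the in-memory list stands in for a log kept on stable storage, and truncating the log at each checkpoint is a design choice made for this sketch.

```python
import copy


class LoggingReceiver:
    def __init__(self):
        self.state = {"total": 0}
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []  # assumption: stands in for a log on stable storage

    def _deliver(self, msg):
        # Application-level handling of a message (here: add to a total).
        self.state["total"] += msg

    def receive(self, msg):
        # Receiver based: log the message before delivering it.
        self._log.append(msg)
        self._deliver(msg)

    def checkpoint(self):
        # Messages before the checkpoint no longer need to be replayed.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def crash(self):
        # Volatile state is lost; checkpoint and log survive.
        self.state = None

    def recover(self):
        # Restore the checkpoint, then replay messages logged since it.
        self.state = copy.deepcopy(self._checkpoint)
        for msg in self._log:
            self._deliver(msg)
```

Replay only reproduces the pre-crash state if message handling is deterministic, as it is in this sketch.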
Stable Storage
Recovery
Verify all sectors
If two corresponding sectors differ, copy sector from disk 1 to disk
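The recovery pass can be sketched as a loop over corresponding sectors. Several things here are assumptions: the sentence above is truncated, so the destination of the copy is taken to be the other disk of the pair (called `mirror` below, with `primary` playing the role of disk 1); "verify" is modeled as a CRC check per sector; and copying a valid sector over an invalid one in either direction is added for completeness. The function and field names are hypothetical.

```python
import zlib


def make_sector(data: bytes):
    # A sector is modeled as its data plus a checksum of that data.
    return {"data": data, "crc": zlib.crc32(data)}


def is_valid(sector):
    # Verify a sector by recomputing its checksum.
    return zlib.crc32(sector["data"]) == sector["crc"]


def recover(primary, mirror):
    # Walk both disks in lockstep and repair each pair of sectors.
    for i in range(len(primary)):
        p_ok = is_valid(primary[i])
        m_ok = is_valid(mirror[i])
        if p_ok and not m_ok:
            # Assumption: a bad sector is overwritten from its good twin.
            mirror[i] = dict(primary[i])
        elif m_ok and not p_ok:
            primary[i] = dict(mirror[i])
        elif p_ok and m_ok and primary[i]["data"] != mirror[i]["data"]:
            # Both valid but different: copy from the first disk.
            mirror[i] = dict(primary[i])
        # If both sectors are invalid, this sketch leaves them untouched.
```

After `recover` runs, every pair in which at least one sector was valid holds identical, valid contents on both disks.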