Mutual exclusion in distributed systems
• All the solutions to the mutual exclusion problem studied so far assume the presence of shared memory
– Ex. Semaphores, monitors, etc. all rely on shared variables
• The mutual exclusion problem is complicated in distributed systems by
– lack of shared memory
– lack of a common physical clock
– unpredictable communication delays
• Several algorithms have been proposed to solve this
problem with different performance trade-offs
– Token-based solutions
– Permission-based solutions
A centralized algorithm
• A simple solution to the distributed mutual
exclusion problem:
– a single control site in charge of granting permissions
to access the resource
– requires 3 messages
– time to grant a new permission is 2T (T = average
message delay)
• This solution has drawbacks:
– existence of a single point of failure
– control site is a bottleneck
Lamport’s Algorithm
• Assumption: messages delivered in FIFO order (no message loss)
• Requesting the CS
– Pi sends message REQUEST(ti, i) to other processes, then
enqueues the request in its own request_queuei
– when Pj receives a request from Pi, it returns a timestamped
REPLY to Pi and places the request in request_queuej
– request_queue is ordered according to (ti, i)
• A process Pi executes the CS only when:
– Pi has received a message with timestamp larger than ti from
all other processes
– its own request is the first in request_queuei
Lamport’s Algorithm (2)
• Releasing the critical section:
– when done, a process removes its request from the queue and sends a
timestamped RELEASE message to all
– upon receiving a RELEASE message from Pi, a process removes
Pi’s request from the request queue
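To make the per-process bookkeeping concrete, here is a minimal Python sketch of the request/reply/release handling described above. It is not a complete networked implementation: the `LamportMutex` class, the `send` callback, and the `last_seen` bookkeeping are illustrative assumptions, and FIFO, loss-free channels are assumed as in the slides.

```python
import heapq

class LamportMutex:
    """Sketch of Lamport's mutual exclusion for process `pid` among `peers`.
    `send(dest, msg)` is an assumed transport callback (FIFO, reliable)."""

    def __init__(self, pid, peers, send):
        self.pid, self.peers, self.send = pid, peers, send
        self.clock = 0                           # Lamport logical clock
        self.queue = []                          # heap of (timestamp, pid) requests
        self.last_seen = {p: 0 for p in peers}   # highest timestamp seen from each peer
        self.my_request = None

    def _tick(self, ts=0):
        self.clock = max(self.clock, ts) + 1
        return self.clock

    def request_cs(self):
        ts = self._tick()
        self.my_request = (ts, self.pid)
        heapq.heappush(self.queue, self.my_request)
        for p in self.peers:
            self.send(p, ("REQUEST", ts, self.pid))

    def on_request(self, ts, sender):
        self._tick(ts)
        heapq.heappush(self.queue, (ts, sender))
        self.send(sender, ("REPLY", self._tick(), self.pid))

    def on_reply(self, ts, sender):
        # In the sketch only REPLYs update last_seen; any timestamped
        # message from the peer would do as well.
        self._tick(ts)
        self.last_seen[sender] = max(self.last_seen[sender], ts)

    def can_enter_cs(self):
        # Enter only if (1) our request is at the head of the queue and
        # (2) we have seen a timestamp larger than ours from every peer.
        return (self.my_request is not None
                and self.queue and self.queue[0] == self.my_request
                and all(self.last_seen[p] > self.my_request[0] for p in self.peers))

    def release_cs(self):
        self.queue.remove(self.my_request)
        heapq.heapify(self.queue)
        self.my_request = None
        for p in self.peers:
            self.send(p, ("RELEASE", self._tick(), self.pid))

    def on_release(self, ts, sender):
        self._tick(ts)
        self.queue = [r for r in self.queue if r[1] != sender]
        heapq.heapify(self.queue)
```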
Lamport’s Algorithm Example
[Figure: timelines of P1, P2, P3. P2 broadcasts REQUEST (1,2) and P1 broadcasts REQUEST (2,1); P2's request has the smaller timestamp, so P2 enters the CS first; after P2 leaves the CS and its RELEASE is delivered, P1 enters the CS.]
Lamport’s: proof of correctness
• Proof by contradiction:
– assume Pi and Pj are executing the CS at the same time
– assume request timestamp (ti, i) of Pi is smaller than that of Pj
– this means both Pi and Pj have their request at the top of the queue
– Pj executing => Pj has received a REPLY from Pi with timestamp > tj
– FIFO channels + second assumption => Pj has received (ti, i)
– contradiction: for Pj to execute, its own request must be at the top of request_queuej, yet (ti, i) is in that queue with a smaller timestamp, since timestamp(Pi) < timestamp(Pj) …
• Therefore it cannot be that Pi and Pj are executing the CS
at the same time!
What happens if channels are not
FIFO?
[Figure: with out-of-order message delivery, P1 and P2 can both enter the CS at the same time, violating the definition of the critical section.]
Ricart-Agrawala Algorithm
• Optimization of Lamport’s algorithm:
Lamport’s Algorithm
- Requesting the CS:
  - Pi sends message REQUEST(ti, i) and enqueues the request in request_queuei
  - when Pi receives a request from Pj, it enqueues it and returns a REPLY to Pj
- Pi executes the CS only when:
  - it has received a msg with timestamp > ti from everybody
  - its own request is the first in request_queuei
- Releasing the CS:
  - when done, a process removes its request from the queue and sends a timestamped RELEASE msg to everybody else
  - upon receiving a RELEASE message from Pi, a process removes Pi’s request from its request queue

Ricart-Agrawala Algorithm
- Requesting the CS:
  - Pi sends message REQUEST(ti, i)
  - when Pi receives a request from Pj, it returns a REPLY to Pj if it is not requesting or executing the CS, or if it made a request but with a larger timestamp. Otherwise, the request is deferred.
- Pi executes the CS only when:
  - it has received a REPLY from everybody
- Releasing the CS:
  - when done, a process sends a REPLY to all deferred requests
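The deferral rule is the heart of the optimization; here is a hedged Python sketch of it (same illustrative `send` callback as in the earlier sketch, class and method names are assumptions):

```python
class RicartAgrawala:
    """Sketch of Ricart-Agrawala permission handling for process `pid` among `peers`.
    `send(dest, msg)` is an assumed transport callback (reliable delivery)."""

    def __init__(self, pid, peers, send):
        self.pid, self.peers, self.send = pid, peers, send
        self.clock = 0
        self.my_request = None      # (timestamp, pid) while requesting or executing
        self.in_cs = False
        self.replies = set()
        self.deferred = []          # peers whose REPLY we postponed

    def request_cs(self):
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        self.replies.clear()
        for p in self.peers:
            self.send(p, ("REQUEST", self.clock, self.pid))

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        # Defer the REPLY if we are in the CS, or if our own pending request
        # has a smaller (earlier) timestamp than the incoming one.
        if self.in_cs or (self.my_request is not None and self.my_request < (ts, sender)):
            self.deferred.append(sender)
        else:
            self.send(sender, ("REPLY", self.pid))

    def on_reply(self, sender):
        self.replies.add(sender)

    def can_enter_cs(self):
        return self.my_request is not None and self.replies == set(self.peers)

    def enter_cs(self):
        self.in_cs = True

    def release_cs(self):
        self.in_cs = False
        self.my_request = None
        for p in self.deferred:     # the RELEASE is merged into the deferred REPLYs
            self.send(p, ("REPLY", self.pid))
        self.deferred.clear()
```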
Ricart-Agrawala Algorithm Example
[Figure: P2 enters the CS first; P1's REQUEST (2,1) is deferred by P2 until P2 leaves the CS and sends its REPLY, after which P1 enters the CS.]
Ricart-Agrawala: proof of
correctness
• Assumption: Lamport’s clock is used
• Proof by contradiction:
– assume Pi and Pj are executing the CS at the same time
– assume the request timestamp of Pi is smaller than that of Pj
– this means Pi issued its own request first and then received Pj‘s
request; otherwise Pj’s request timestamp would be smaller
– for Pi and Pj to execute the CS concurrently means Pi sent a
REPLY to Pj before exiting the CS
– Contradiction: a process is not allowed to send a REPLY if the
timestamp of its request is smaller than the incoming one
• Therefore it cannot be that Pi and Pj are executing the
CS at the same time!
Algorithm comparisons
• Ricart-Agrawala’s can be seen as an optimization
of Lamport’s:
– RELEASE messages are merged with REPLY messages
• Basic differences:
– Lamport’s idea is to maintain (partially) coherent copies of a
replicated data structure - the request_queue
– Ricart-Agrawala does away with the data structure and just
propagates state changes
– messages needed for CS execution in the two schemes:
• 3(N-1) vs. 2(N-1)
Maekawa’s Algorithm
• Difference with respect to previous algorithms:
– a site does not request permission from every other site but
only from a subset - called request set
• The request sets of any two sites have at least one site
in common:
∀i ∀j : 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅
• The basic idea is that each pair of sites is going to
have a third site mediating conflicts between the pair
Maekawa’s algorithm steps
• Requesting the CS
– Si sends a message REQUEST(i) to all the sites in Ri
– when Sj receives a request from Si, it returns a REPLY to Si if it has
not sent a REPLY since receiving the latest RELEASE message.
Otherwise the request is enqueued.
• Executing the CS
– A site Si executes the CS only after receiving REPLY messages from
all the sites in Ri
• Releasing the CS
– When done, a site Si sends a RELEASE message to all the sites in Ri
– When a site receives a RELEASE message, it sends a REPLY
message to the next site waiting in the queue and removes it
Construction of the request set
• The request sets are constructed to satisfy the following
conditions:
– ∀i ∀j : 1 ≤ i, j ≤ N :: Ri ∩ Rj ≠ ∅
• necessary for correctness
– ∀i : 1 ≤ i ≤ N :: Si ∈ Ri
• necessary for correctness (note: this condition, like the need for FIFO
comm., is really needed only in the extended version of the algorithm)
– ∀i : 1 ≤ i ≤ N :: |Ri| = K
• all Ri have equal size, so all sites do equal work to access the CS
– Any site S is contained in K of the Ri’s
• the same number of sites are requesting permission from each site (no
bottleneck)
More on the request set
• All the previous conditions are satisfied if N can
be expressed as:
N = K(K - 1) + 1
(examples: N = 3 and K = 2, N = 7 and K = 3, etc.)
– Note that, for large N, K ≈ √N
• Otherwise one of the last two conditions must be
relaxed
– for example, | Ri | = K no longer true for all i
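For the N = 7, K = 3 case mentioned above, one construction that satisfies all four conditions is to use the seven lines of the Fano plane as request sets. The small sketch below is illustrative (the assignment of lines to sites is one arbitrary but valid choice) and simply checks the conditions.

```python
from itertools import combinations

# One request-set construction for N = 7, K = 3: the lines of the Fano plane,
# assigned so that every site belongs to its own request set.
N, K = 7, 3
R = {
    1: {1, 2, 3}, 2: {2, 4, 6}, 3: {3, 4, 7}, 4: {1, 4, 5},
    5: {3, 5, 6}, 6: {1, 6, 7}, 7: {2, 5, 7},
}

# Any two request sets share at least one site.
assert all(R[i] & R[j] for i, j in combinations(range(1, N + 1), 2))
# Each site belongs to its own request set.
assert all(i in R[i] for i in R)
# Equal size K, and each site appears in exactly K request sets.
assert all(len(R[i]) == K for i in R)
assert all(sum(s in R[i] for i in R) == K for s in range(1, N + 1))
print("All four conditions hold for N = 7, K = 3")
```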
Notes on Maekawa’s algorithm
• Performance:
– 3√N messages are needed for execution of the CS
– synchronization delay is 2T
• Problem: the algorithm is deadlock prone!
– there is a variant of the algorithm that can prevent the
deadlock by using a priority-based preempting scheme
– this variant requires additional messages (up to 5√N)
Topics of Distributed System
Section
• Distributed deadlock detection
– Global state and distributed snapshot
– Logical clock and vector clock, …
• Distributed mutual exclusion
• Distributed transaction execution
• Fault tolerance
– Quorum-based approach
– Leader-based approach: Paxos and PBFT
• Consistency
Review of Atomicity
• Atomicity: all or nothing
BEGIN_TRANSACTION
destination_balance += amount;
if(source_balance < amount)
abort;
else
source_balance -= amount;
END_TRANSACTION
Example: transfer some money from source to destination
Atomicity: either both balances are updated or no update
Global atomicity
• A distributed database may partition data
– E.g. source_balance on site1; destination_balance on site2
• As a result, a transaction may touch multiple sites
• Global atomicity: either all sites commit or all abort
BEGIN_TRANSACTION
  destination_balance += amount;      (at site2)
  if (source_balance < amount)        (at site1)
    abort;
  else
    source_balance -= amount;
END_TRANSACTION
Common knowledge
• Global atomicity is related to the concept of
“common knowledge” in distributed systems
• What is “common knowledge”?
– Wiki: knowledge that is known by everyone
– But in a distributed system, this is not enough
– Let’s look at a video first
What is common knowledge?
• Everyone knows. Is this enough?
• Everyone knows everyone knows. Enough?
• Everyone knows everyone knows everyone
knows. Enough?
• Definition: for every k, “everyone knows … everyone knows” holds, where “everyone knows” appears k times.
• Global atomicity would be trivial if we could
design a protocol to achieve common knowledge.
The two-general problem
• Classic problem illustrating the difficulty of achieving
common knowledge:
Once upon a time there were two generals belonging to the
same army. To conquer a certain hill they needed to attack
simultaneously. If only one general attacks, he will be
defeated. They could only communicate through messengers.
Messengers can get lost or get killed by the enemy.
Assume they have perfectly synchronized clocks.
• How can they agree on the time of the attack?
• It is proved to be impossible.
The two-general problem
• It is impossible for multiple processes to agree on doing
something at the same time t, assuming
– Communication channels are unreliable.
– Or channels are reliable but delay is unbounded.
• How do we solve this problem in practice?
– Sometimes we assume our communication channel is reliable
– Probabilistic solution: send many messengers, hoping at least
one can survive
– Remove the requirement “at the same time t”
• If we remove the “at the same time” requirement, there exist solutions
– Multiple processes can agree that they will do (or not do)
something eventually (“eventual common knowledge”).
– Still useful: e.g. transferring money across accounts
– Solution: two phase commit
Two-phase commit protocol
• Goal: when a transaction involves multiple operations at different sites, we want to ensure that either they all commit or they all abort (atomicity)
• Additional requirement: if all sites can commit
and there are no site failures or communication
errors, then the transaction should commit.
Two-phase commit protocol
• Assumptions:
– A site may temporarily crash, but no permanent failure
– Persistent and reliable storage at each site
– A site knows whether it can commit or not
• General idea:
– Elect one site as coordinator (others called “cohorts”)
– A site should log important decisions to persistent storage
Two-phase commit protocol: definition
Phase I
At the coordinator:
- The coordinator sends a COMMIT-REQUEST message to every cohort, requesting them to commit
- The coordinator waits for replies from all cohorts
At the cohorts (upon receiving the COMMIT-REQUEST message):
- if the transaction is successful, the cohort writes UNDO and REDO records to its log on stable storage, then sends an AGREED message to the coordinator
- otherwise it sends an ABORT message to the coordinator

Phase II
At the coordinator:
- If all cohorts reply AGREED, the coordinator writes a COMMIT record in its log and sends a COMMIT message to all the cohorts. Otherwise, it sends an ABORT msg to all the cohorts
- The coordinator then waits for acknowledgements from all cohorts
- If an ack is not received from a cohort after a timeout period, the coordinator resends the commit/abort message to that cohort
- If all acks are received, the coordinator writes a COMPLETE record in its log
At the cohorts:
- Upon receiving a COMMIT message, a cohort releases all the resources held for executing the transaction and sends an ack
- Upon receiving an ABORT message, a cohort undoes the transaction using the UNDO log record, releases all the resources held, and sends an ack
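As a rough illustration of the two phases, here is a toy, failure-free sketch in Python. There is no real networking or stable storage: the `Cohort` class and the `log` lists merely stand in for cohorts and their persistent logs, and all names are illustrative.

```python
class Cohort:
    """Toy cohort: `ok` says whether its local part of the transaction succeeded."""
    def __init__(self, ok=True):
        self.ok, self.log = ok, []
    def can_commit(self): return self.ok
    def commit(self): self.log.append("COMMIT")   # release resources, then ack
    def abort(self): self.log.append("ABORT")     # undo using the UNDO record, then ack

def two_phase_commit(coordinator_log, cohorts):
    # Phase I: ask every cohort whether it can commit.
    votes = []
    for c in cohorts:
        if c.can_commit():
            c.log.append("UNDO/REDO")             # cohort persists its undo/redo records
            votes.append("AGREED")
        else:
            votes.append("ABORT")

    # Phase II: decide, persist the decision, and tell everyone.
    decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
    coordinator_log.append(decision)              # once COMMIT is logged, the
                                                  # coordinator can no longer abort
    for c in cohorts:
        c.commit() if decision == "COMMIT" else c.abort()
    coordinator_log.append("COMPLETE")
    return decision

print(two_phase_commit([], [Cohort(), Cohort()]))           # COMMIT
print(two_phase_commit([], [Cohort(), Cohort(ok=False)]))   # ABORT
```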
Two-phase commit protocol: message
exchanges
[Figure: message exchange between the coordinator and the cohorts. If the transaction is successful, a cohort writes UNDO and REDO records to its log and replies AGREED; otherwise it sends ABORT. The coordinator writes COMMIT in its log and sends COMMIT (or ABORT) to all cohorts. On COMMIT a cohort releases its resources and locks; on ABORT it undoes the transaction using the UNDO record. After receiving all ACKs, the coordinator writes COMPLETE in its log.]
Two-phase commit protocol: failures
• So far we assume sites do not fail and there is no message loss
• How about failures or message loss?
Two-phase commit protocol: failures
Coordinator crashes before sending anything
After it reboots, it can resend or simply decide to abort.
Two-phase commit protocol: failures
Some COMMIT-REQUEST messages are lost
Same: resend or simply abort
Two-phase commit protocol: failures
Some cohorts crash
The coordinator can abort.
The crashed cohort can learn the decision later.
Two-phase commit protocol: failures
Some AGREED/ABORT messages are lost
The coordinator can abort
Two-phase commit protocol: failures
The coordinator crashes before writing COMMIT
The coordinator can abort after rebooting
Two-phase commit protocol: failures
The coordinator crashes after writing COMMIT
The coordinator finds COMMIT in the log and resends it
Two-phase commit protocol: failures
The coordinator crashes after writing COMMIT
Can the coordinator abort in this case? No, because the coordinator does not know whether cohorts have received COMMIT.
Two-phase commit protocol: failures
If any failure happens before the coordinator writes COMMIT in its log, the coordinator can abort; after that point, it cannot.
Two-phase commit protocol: failures
What happens if the coordinator decides to abort but that decision is lost?
The abort is not logged, so the coordinator simply aborts after rebooting.
Two-phase commit protocol: failures
Cohort crashes in phase II.
Cohort can ask the coordinator after rebooting.
Two-phase commit protocol: failures
What to do if the coordinator also crashes? Of course we can
wait for the coordinator to come back. But can we do better?
Two-phase commit protocol: failures
• Things that can go wrong – more complicated:
– a cohort crashes in Phase II (i.e. after writing UNDO and REDO recs) and the coordinator also crashes. How to proceed?
– If the cohort voted “abort”, then it can safely abort. Otherwise, it asks the other cohorts.
– If another cohort has received “COMMIT” or “ABORT” from the coordinator, then this cohort can follow.
– If another cohort has voted “abort”, then this cohort can safely abort.
– Otherwise, if all remaining cohorts voted “agreed” but no one received the decision from the coordinator, then they have to wait for the coordinator to reboot.
Two-phase commit protocol: failures
Some ACKs are lost
The coordinator can resend COMMIT/ABORT. If a cohort has already executed the message, it will resend the ACK.
Two-phase commit protocol: failures
Coordinator crashes before writing COMPLETE.
Same: resend COMMIT/ABORT and wait for ACK
Two-phase commit protocol: failures
Coordinator crashes after writing COMPLETE.
Nothing to do after reboot.
Summary of two phase commit (2PC)
• Designed to achieve global atomicity
• Two round trips
• Widely used in practice
– Many research works try to reduce its overhead.
Topics of Distributed System
Section
• Distributed deadlock detection
– Global state and distributed snapshot
– Logical clock and vector clock, …
• Distributed mutual exclusion
• Distributed transaction execution
• Fault tolerance
– Quorum-based approach
– Leader-based approach: Paxos and PBFT
• Consistency
Failures in practice: reasons
• Hardware faults:
– Manufacturing problems, design errors, environmental issues, fatigue/deterioration, …
• Software faults:
– Bugs, …
• Human error:
– Delete a file by mistake, …
Failures in practice: types
• Omission failure: a process does not respond
– Crash/transient failure: power loss, segmentation fault, …
– Permanent failure: hardware damaged, …
• Commission failure: a process gives wrong
response
– Data corruption: bit flip, software bug, …
– Malicious: machine got hacked, …
Failures are common
• Mean time to failure (MTTF)
– Disk: 10-50 years
– Node: 4.3 months
– Rack: 10.2 years
– Data from “Availability in Globally Distributed Storage
Systems”, by Google (OSDI 2010)
• Assuming Google has 1 million machines
– Every day there are about 8K node failures
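A quick sanity check of the 8K/day figure from the node MTTF (a back-of-the-envelope estimate, assuming 30-day months):

```python
machines = 1_000_000
node_mttf_days = 4.3 * 30                 # 4.3 months, assuming 30-day months
failures_per_day = machines / node_mttf_days
print(round(failures_per_day))            # about 7,752 node failures per day, i.e. roughly 8K
```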
How to handle failures?
• Write better code:
– Find and fix bugs
– Check possible errors
• Checkpoint and log replay
• Data redundancy
– Coding
– Replication
Write better code
• Find and fix bugs: this is a big topic
– Use proper synchronization for multi-threaded apps
– Avoid deadlocks
– …….
• Related techniques
– Debugging: tools can help you find certain bugs
– Model checking: automatically drive your app to reach
different states
– Formal verification: prove that your code has no bug
Write better code
• It is a good habit to check all possible errors
– In C/C++, check the return value
– In Java, check all exceptions
• You may handle some of them (forward recovery)
– Not enough memory => release some objects
– Disk error => retry
– Network failure => retry
Write better code
• What failures can it handle?
– Software faults; a subset of hardware faults
• What failures does it not handle?
– Permanent hardware faults; human error
How to handle failures?
• Write better code:
– Find and fix bugs
– Check possible errors
• Checkpoint and log replay
• Data redundancy
– Coding
– Replication
Checkpoint and log replay
• Checkpoint: persistent snapshot of system state
– E.g. Windows restore point
– A user can “rollback” to a previous checkpoint
• Log: persistent record of individual operations
– E.g. redo/undo log in database system
– A user can replay logs to redo/undo operations
Checkpoint and log replay
• Checkpoint: heavy
– Often use copy-on-write (COW) to reduce its overhead
• Log: light but log file can grow long
• In practice, they are often used in combination
– Keep logging
– Take a checkpoint when the system is healthy
– Then discard all logs before the checkpoint
Checkpoint and log replay
• When a failure happens:
– Rollback to a previous checkpoint
– Play logs after the checkpoint
– If an operation causes problems, discard it and all
following operations
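A toy sketch of this recovery procedure, where the checkpoint is a dict snapshot and the log is a list of operations (all names are illustrative):

```python
import copy

def recover(checkpoint, log, is_bad):
    """Roll back to the checkpoint, then replay logged operations in order,
    discarding the first problematic operation and everything after it."""
    state = copy.deepcopy(checkpoint)
    for op in log:
        if is_bad(op):
            break                    # drop this operation and all following ones
        op(state)                    # replay the operation against the state
    return state

# Example: state is a tiny key-value store.
checkpoint = {"balance": 100}
log = [
    lambda s: s.update(balance=s["balance"] + 50),    # deposit 50
    lambda s: s.update(balance=s["balance"] - 999),   # buggy operation
    lambda s: s.update(balance=s["balance"] + 10),
]
print(recover(checkpoint, log, is_bad=lambda op: op is log[1]))   # {'balance': 150}
```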
Checkpoint and log replay
• What failures can it handle?
– Transient software and hardware faults; a subset of
human errors (e.g. delete a file by mistake)
• What failures does it not handle?
– Permanent faults; some human errors (e.g. delete the
checkpoint by mistake)
Checkpoint in distributed systems
• Recovery is harder in distributed systems due to
the presence of messages in transit
• Bad things can happen if checkpointing does not
take into account messages:
– orphan messages and domino effect
– lost messages
– potential for livelocks
• Remedies:
– (strongly) consistent set of checkpoints
Orphan messages
• Orphan message: suppose Y fails after sending m and
rolls back to y2
– the receiving of m is recorded (in x3) but the sending is not, violating the happened-before relationship
– X must be rolled back to undo the effects of the X-Y interaction
• Domino effect: suppose Z rolls back ...
[Figure: processes X, Y, Z with checkpoints x1-x3, y1-y2, z1-z2; Y sends m to X after taking y2, and X records its receipt before x3.]
Lost messages
• Lost message: suppose Y rolls back after receiving m
– the sending of m is recorded (in x1) but not its reception
[Figure: X takes checkpoint x1, which records the sending of m; Y takes checkpoint y1 before receiving m, receives m, then fails and rolls back to y1.]
How to handle lost messages?
• The sender should buffer messages till it receives
an acknowledgement from the receiver
– After recovery, resend unacknowledged messages
• The receiver should be able to detect duplicate or
old messages
– Assign a unique ID to each message
– TCP already has such support
• If properly handled, lost messages should not
cause problems to recovery
Global checkpoint
• A global checkpoint is a collection of local
checkpoints
• As shown in the previous slides, taking local
checkpoints at random can lead to problems
• Criteria to take safe global checkpoints
– (strongly) consistent set of checkpoints
Strongly/simply consistent set
• Strongly consistent set:
– no message in transit during interval spanned by
checkpoints
• Consistent set (we have learned this)
– each message recorded as received is also recorded as sent
– i.e. lost messages are acceptable, orphans are not
– {x3, y3, z3} is a consistent set
• A consistent set avoids the occurrence of domino
effect because it does not allow orphans
Consistent set of checkpoints
• {x2, y2, z2} is a strongly consistent set of checkpoints
[Figure: processes X, Y, Z with checkpoints x1-x3, y1-y3, z1-z3; a message m is exchanged between X and Y, but no message is in transit across the interval spanned by {x2, y2, z2}.]
Consistent set of checkpoint:
algorithms
• We have learned the Chandy-Lamport protocol. There exist others.
• Simple checkpointing algorithm:
– if a process takes a checkpoint after sending every message, the set of most
recent checkpoints is consistent
– (some assumptions required, like atomicity of sending/receiving and
checkpointing)
– expensive
• Synchronous checkpointing algorithm:
– Basic idea: all processes coordinate their actions in taking local checkpoints
– Assumptions: FIFO channels, end-to-end protocol to deal with loss of
messages (example: sliding window protocol)
– Similar algorithm exists for synchronous rollback that avoids livelock
• Asynchronous algorithms:
– Basic idea: no coordination at all, at rollback we’ll figure it out …
Synchronous checkpointing algorithm
• Assumptions: FIFO, reliable communication
• Algorithm works in two phases:
– Phase I: process Pi takes a tentative checkpoint and asks all
others to do the same. If everybody else is successful, then Pi
decides all tentative checkpoints should be made permanent
– Phase II: Pi informs others of its decision (discard or commit)
• The set of checkpoints is consistent because:
– either none or all processes take part
– for the set to be inconsistent, a message must be recorded as received but not sent. But no process will send messages after being asked to take a tentative checkpoint until it receives Pi’s notification
Synchronous recovery
• Analogous to synchronous checkpointing:
– Phase I: process Pi starts a vote for everybody to start a
rollback. A process may vote “no” if already busy with a
checkpoint or another rollback. If vote is a unanimous “yes”,
then Pi decides all processes should rollback
– Phase II: Pi informs others of its decision. Upon receiving its
notification, processes start rollback
• Correctness:
– all processes take same action - either rollback or continue
– if all restart, then they resume execution in a consistent state
(thanks to the synchronous checkpointing algorithm)
Asynchronous algorithms
• Asynchronous approach:
– don’t synchronize actions - local checkpoints taken independently
by each site
– in the event of a rollback, search most recent consistent set
between available checkpoints
• Different trade offs:
– Synchronous checkpointing
• simplifies recovery because checkpoints are consistent
• trade off: increased burden on regular operations
– Asynchronous checkpointing:
• simplifies checkpointing because processes work independently
• trade off: recovery becomes more complicated
Asynchronous algorithm example
[Figure: timelines of X (checkpoints x1-x3), Y (y1-y3), and Z (z1-z2), with messages exchanged among them.]
• The latest set of checkpoints {x3, y3, z2} is not consistent.
– The most recent set of consistent checkpoints (MRSCP) is {x2, y2, z2}
• One way to find the MRSCP is the following:
– Each checkpoint includes a tally of how many messages the process has sent to and received from everybody else;
– Upon roll back, orphan messages can be detected by comparing the number of
messages sent and received;
• example: if Y tries to roll back to y3 it will have one message sent to X on record – but X has
two messages from Y on its record.
– Keep trying new sets of checkpoints until one is found which is orphan-free.
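As a small illustration of this method, the sketch below encodes per-partner send/receive tallies at each checkpoint and flags orphans; the example data loosely mirrors the Y-to-X scenario above (the data layout and names are assumptions for illustration, not part of any particular protocol).

```python
def has_orphans(checkpoints):
    """checkpoints: {process: {"sent": {peer: n}, "recv": {peer: n}}}.
    An orphan exists if some process recorded receiving more messages from a
    peer than that peer recorded sending to it."""
    for p, cp in checkpoints.items():
        for peer, received in cp["recv"].items():
            if received > checkpoints[peer]["sent"].get(p, 0):
                return True
    return False

# Rough rendering of the slide's example: X's record shows two messages received
# from Y, but Y's checkpoint y3 records only one message sent to X -> orphan.
latest = {
    "X": {"sent": {"Y": 0}, "recv": {"Y": 2}},
    "Y": {"sent": {"X": 1}, "recv": {"X": 0}},
}
print(has_orphans(latest))   # True: this set is not consistent, try an earlier one
```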
How to handle failures?
• Write better code:
– Find and fix bugs
– Check possible errors
• Checkpoint and log replay
• Data redundancy
– Coding
– Replication
Data redundancy
• Put data at multiple locations
• Coding: XOR, CRC, Reed-Solomon coding, …
– E.g. checksum, RAID 5/6, …
• Replication: simply keep multiple copies of data
– E.g. Google File System, …
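A tiny illustration of the coding idea: with an XOR parity block (as used in RAID 5), any single lost block can be rebuilt from the remaining blocks. The block contents below are arbitrary examples.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"hello!", b"world!", b"foobar"]     # three equal-length data blocks
parity = xor_blocks(data)                    # stored on a fourth disk/site

# Suppose the second data block is lost: rebuild it from the survivors + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print(rebuilt)                               # b'world!'
```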
Data redundancy
• What failures can it handle?
– Uncorrelated failures (either software or hardware or
human, either transient or permanent)
• What failures does it not handle?
– Correlated failures (failures that manifest in all copies
of data)
Topics of Distributed System Section
• Distributed deadlock detection
– Global state and distributed snapshot
– Logical clock and vector clock, …
• Distributed mutual exclusion
• Distributed transaction execution
• Fault tolerance
– Quorum-based approach
– Leader-based approach: Paxos and PBFT
• Consistency
Key idea of replication
• Put data on multiple physical machines or even in different
datacenters
• If some of these machines/data centers fail, we can still retrieve data
from other replicas.
Models
• How can a replica fail?
• Omission (Fail-stop) model: if a replica fails, it becomes unresponsive
• Commission (Byzantine): a failed replica can do anything
• What do we assume about the communication channel?
• Synchronous: delay is bounded and clock drift is bounded
• Asynchronous: delay is unbounded and clock drift is unbounded
Key challenge
• Need to ensure all replicas are identical
[Figure: clients 0-3 send requests (a=3, a*=2, a++, a=5) to three replicas that all start with a=0.]
However, requests may arrive at different replicas in different orders
Overview of solution
Run an agreement/consensus protocol to agree on the order of input requests
Each replica executes its input requests deterministically
[Figure: the clients’ requests (a=3, a*=2, a++, a=5) first pass through an agreement protocol, which delivers them to all three replicas in the same order.]
Goals
• Correctness: correct replicas receive the same sequence of requests
• Liveness: a request is eventually processed
Solutions
• Quorum-based approaches: Gifford’s algorithm, …
• Leader-based approaches: Primary backup, Paxos, PBFT, …
• They are not completely orthogonal: leader-based approaches use
quorum ideas internally
Gifford’s voting algorithm
• Model: omission failures and synchronous network
• Scenario: solve readers/writers problem in an unreliable distributed system
• data replicated at several sites to provide fault tolerance
• client can contact any site to read/write data
• each site is assigned one or more votes
• each replica maintains a version number (a logical clock for its data)
• local Lock Managers implement multiple readers/single writer policy
• Reading (writing) permitted only if a minimum number of votes are collected
• requesting process verifies reaching of read/write quorum
• Timeouts employed in vote collection to prevent undefined waits in case of site failure
The algorithm
• Site i gets the lock from its local Lock Manager, then sends Vote_Request to all other sites
– If another site already holds the lock, site i waits till that site releases the lock.
– The lock manager follows a read/write lock rule: multiple readers can get the lock simultaneously; if there is a writer, other readers or writers have to wait.
• Site j checks with its local Lock Manager, then replies by sending its version number VNj and its number of votes Vj to site i
– If site j’s lock is held by another site, site j waits till that site releases the lock
– For simplicity, you can assume Vj = 1 for now
• Site i adds up the votes and verifies that the total reaches the read quorum size r (for a read) or the write quorum size w (for a write)
– We will discuss how to choose the read quorum/write quorum later
• If the quorum is reached, site i updates its local copy of the data if it is not current, then proceeds to read/write it
– Site i’s data is not current if site i’s version number < max(versions in the replies), which means some other sites have newer data.
– The request is aborted and the locks released if the quorum is not reached after a timeout
• If site i performs a write, it updates VNi (VNi++) and then sends the updated data and VNi to all sites that replied
– Those sites update their data and release their locks
Quorum and vote assignment
• If v is the total number of votes:
r + w > v and w > v/2
• This choice of values has the following properties:
• r + w > v => 1) read and write will not execute in parallel; 2) a
read will always access at least one current copy
• w > v/2 => no simultaneous writes on two distinct subset of
replicas
• Fault tolerance can be increased by assigning a higher
number of votes to reliable sites
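A small sketch of the quorum constraints and of the vote-counting/version-selection step, assuming each reply carries (votes, version); the function names are illustrative.

```python
def quorum_ok(r, w, votes):
    """Check the two constraints on the read/write quorum sizes."""
    v = sum(votes)
    return r + w > v and w > v / 2

def collect(replies, needed):
    """replies: list of (votes, version) from the replicas that answered.
    Return the newest version if enough votes were collected, else None."""
    total = sum(votes for votes, _ in replies)
    if total < needed:
        return None                          # quorum not reached: abort after timeout
    return max(version for _, version in replies)

votes = [1, 1, 2, 1]                         # one entry per site, so v = 5
assert quorum_ok(r=3, w=3, votes=votes)
print(collect([(1, 4), (2, 7)], needed=3))   # 3 votes collected -> newest version is 7
```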
Examples
[Figure: two example vote assignments with their quorum configurations (© Prentice Hall); one of them is the common scheme referred to as Read-One, Write-All (ROWA)]
One of these two configurations has a problem … Which one?
Examples
Vote assignment: site 1 = 1 vote, site 2 = 1 vote, site 3 = 2 votes, site 4 = 1 vote (v = 5)

Quorum        # of site failures that will disable the system
r=1, w=5      1
r=3, w=3      2 or 3

• Both vote assignment and quorum values have implications on the system’s ability to tolerate faults
Other voting protocols
• Gifford’s algorithm is static
• Both number of votes and each quorum remain unchanged
• Dynamic algorithms can adapt to system state by changing either:
• Set of sites that can form a majority
• Vote assignment
• There also exist variations for other models (commission failures,
asynchronous network)
Problem of quorum-based approaches
• Deadlock: in Gifford’s algorithm, if multiple writers try to grab the
write lock simultaneously, it is possible each gets the lock from less
than w replicas
• To resolve it, we need to cancel some writers, which is slow
• This problem motivates the leader-based solutions
• We should try to elect a leader
• The leader can order all requests
• Each replica follows the order proposed by the leader
Questions of leader-based approaches
• How to elect a leader?
• What to do when the leader is alive?
• This is usually the simple case.
• How to detect failures?
• What to do if a replica, in particular the leader, fails?
Overview of leader-based solutions
                         Synchronous        Asynchronous
Omission (Fail-stop)     Primary backup     Paxos
Commission (Byzantine)   Synchronous BFT    PBFT
Let’s start from the simplest case: omission failure and synchronous network
Leader election
• Requirements:
• Correctness: there should be no more than one leader at any time.
• Liveness: we should eventually elect a leader. (It is OK to have no leader for some time)
• We don’t care which process becomes the leader.
• Assumptions:
• Assume every process has a well-known unique ID
• Assume no knowledge about nodes status (up or down)
• Assume network latency is bounded
The Bully Algorithm (Garcia-Molina)
• P sends an ELECTION message to all processes with higher numbers
• If no one responds in given time, P wins the election and becomes a leader
• If one of the higher-ups answers, it takes over. P’s job is done.
• Repeat the above steps until one wins
• The new leader sends Coordinator message to all other processes
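A minimal sketch of these steps from one process's point of view; the `alive()` predicate stands in for "this higher-numbered process answered my ELECTION message within the timeout", and the function name is illustrative.

```python
def bully_election(my_id, all_ids, alive):
    """Return the new coordinator id as seen by process `my_id`.
    `alive(pid)` abstracts 'pid answered my ELECTION message in time'."""
    higher = [p for p in all_ids if p > my_id]
    responders = [p for p in higher if alive(p)]
    if not responders:
        return my_id        # nobody higher answered: I win and broadcast COORDINATOR
    # Some higher process took over; it (or an even higher one) repeats these
    # steps and eventually announces itself as the coordinator.
    return bully_election(max(responders), all_ids, alive)

# Example: processes 0-7, the old coordinator 7 has crashed, process 4 starts the election.
crashed = {7}
print(bully_election(4, range(8), alive=lambda p: p not in crashed))   # 6
```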
An Example
[Figure (a)-(e): eight processes numbered 0-7; process 4 notices that the coordinator (7) has failed and sends ELECTION to 5, 6, and 7; processes 5 and 6 answer OK; 5 and 6 then hold their own elections; 6 gets no answer from 7, wins, and sends a COORDINATOR message to everyone.]
A Ring Algorithm
• Each process knows who its successor is.
• Any process that notices the leader is not functioning builds an ELECTION
message containing its own process ID and sends the message to its successor
• At each step along the way, the sender adds its own process ID to the list
• The process with the highest ID wins the election
• When the message gets back, the first process changes the message type to
COORDINATOR and circulates the message once again.
An Example
[Figure: processes 0-7 arranged in a ring; the previous coordinator (7) has crashed and does not respond; processes 2 and 5 both notice and start elections, and their ELECTION messages accumulate process IDs ([2], [2, 3], … and [5], [5, 6], [5, 6, 0], …) as they travel around the ring.]
Overview of leader-based solutions
                         Synchronous        Asynchronous
Omission (Fail-stop)     Primary backup     Paxos
Commission (Byzantine)   Synchronous BFT    PBFT
Primary backup
• Key idea:
• Elect one replica as primary
• The primary orders clients’ requests and broadcasts them to backups
• If primary fails, elect a new primary
• How to detect failures:
• In synchronous environment, timeout is a classic solution for failure detection: a
node can periodically send a heartbeat to others; if one does not receive heartbeat
in time, it will consider the sender as failed.
• What is “in time”? Count max network latency and max clock drift
• Though its key idea is simple, its details are not.
• Read “Hypervisor-based Fault tolerance” in ACM Trans. Comput. Syst., 14(1), 1996.
Unbounded network latency?
• So far we assume network latency is bounded.
• Can you design an election algorithm assuming network latency is
unbounded?
• It is proved to be impossible.
• Say the current leader is not responsive. Shall we elect a new one?
• The current leader may be still alive but is experiencing long latency.
• The current leader may be actually dead
• YES will violate correctness in the first case; NO will violate liveness in the
second case.
Can Primary Backup work in asynchronous
environment?
• We have learned no leader election protocol can ensure there is
exactly one leader in asynchronous environment
• Failure detection also becomes unreliable because we don’t know the
max network latency
• It is possible that a backup is elected as a new primary while the old
primary is still alive. Why is this bad?
• Two primaries can order clients’ requests in different orders.
Unbounded network latency?
• Since no election algorithm can guarantee a single leader when network latency is unbounded, we have to make sure our algorithm can work correctly with more than one leader.
• Solution: Paxos
Paxos
Yang Wang
Model
              Synchronous        Asynchronous
Crash Only    Primary backup     Paxos
Byzantine     Synchronous BFT    PBFT
Goals and assumptions
• Goals:
– Correctness: all correct replicas execute the
incoming requests in the same order
– Liveness: the system eventually makes progress
• Assumptions
– Machines can crash
– Network can become asynchronous: there is no
upper bound on message delay
Authors
• Brian M. Oki and Barbara Liskov, Viewstamped
Replication: A New Primary Copy Method to
Support Highly-Available Distributed Systems.
In PODC 88.
• Leslie Lamport, The Part-Time Parliament, in TOCS 98.
• These two papers share the 2012 SIGOPS Hall
of Fame Award.
Paxos is quite hard to explain
• The part-time parliament, L. Lamport, 1998
• Paxos Made Simple, L. Lamport, 2001
– “At the PODC 2001 conference, I got tired of
everyone saying how difficult it was to understand
the Paxos algorithm… When I got home, I wrote
down the explanation as a short note …
– The “short note” became an 11-page paper.
– Still not so simple ……
Paxos is quite hard to explain
• The ABCDs of Paxos, B. Lampson, 2001
• …….
• In Search of an Understandable Consensus Algorithm, D. Ongaro and J. Ousterhout, Best Paper award at USENIX ATC 2014 (known as Raft)
• Paxos Made Moderately Complex, Robbert
Van Renesse and Denis Altinbuken, 2015
• Maybe it will continue ……
Why is it hard to explain?
• There are multiple versions
– I will focus on multi-Paxos, the most popular one
• All steps are correlated: it is hard to explain
one step without touching other steps
– My strategy: I will first present all steps. You
memorize them. Then I explain why.
Terminologies
• Slot (n): Paxos orders requests by assigning a
slot number to each request: ReqA is in slot 0;
ReqC is in slot 1; ReqB is in slot 2; ……
• View (v): Paxos tries to elect a leader. v is the
unique identifier of a leader. We say the
system is in view v if it is led by leader v.
– A replica may be elected as leader more than once,
but it will have different view numbers for each
instance.
Leader Election
• In asynchronous environment, it is impossible to
design an election protocol to ensure there is at most
one leader at a time.
• Paxos ensures correctness even when there is more than one leader.
• Leader election is a separate topic. We assume
– It can elect replicas as leaders and can assign unique view
numbers to them.
– A later leader will have a higher view number.
– It may elect multiple leaders at the same time.
Leader Election
• Simplest solution: round robin order
– Each replica maintains its own copy of v.
– During initialization, v is 0 and replica 0 is the leader.
– At any time, a replica sets leader to be replica v % N.
– If a replica finds its current leader is unresponsive, it
increments v by 1 and elects a new leader
– Note different replicas may have different values for v, so
they may elect different leaders.
• There exist other solutions.
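A sketch of this round-robin bookkeeping as kept by a single replica (class and method names are assumptions):

```python
class ViewTracker:
    """Per-replica view state for round-robin leader election among N replicas."""
    def __init__(self, n_replicas):
        self.n = n_replicas
        self.view = 0                    # v starts at 0, so replica 0 leads first

    def leader(self):
        return self.view % self.n        # the leader of view v is replica v % N

    def suspect_leader(self):
        self.view += 1                   # current leader unresponsive: move to the next view
        return self.leader()

vt = ViewTracker(n_replicas=3)
print(vt.leader())            # 0
print(vt.suspect_leader())    # 1  (note: other replicas may still be in view 0)
```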
Multi Paxos
[Figure: timeline of a client, the leader, and two other servers; a leader election phase precedes request processing.]
Leader Election
• The new leader sends a message to all replicas:
“Replica i with view number v wants to
become leader. Do you agree?”
• Other replicas:
– If it has not agreed on a leader with a higher v, it
will reply “Yes. And I promise I will not accept any
proposals from leaders with a lower v.”
– Otherwise, it will reply “Sorry. I made a promise to
a later leader.”
Leader Election
• If the new leader receives “YES” from a
majority of replicas, its election succeeds.
• Question here: for this idea to work, how
many replicas do we need if we assume f of
them can fail?
– N-f > N/2. So N>=2f+1
– Paxos needs 2f+1 replicas to tolerate f failures.
Learn the past
• The new leader must respect agreed decisions
in the past
• We will discuss this later. You can assume
– The new leader can learn what requests were
agreed for what slots in the past
– The new leader can make sure all correct replicas
are on the same page
– The new leader will start from the next unoccupied
slot.
Accept phase
[Figure: after the election phase, the leader sends an accept message for the client’s request to the other servers.]
Accept phase
• The leader will receive clients’ requests, order
them, and pick the next one, r.
– Each request r has a unique identifier. The simplest way to create such an identifier is to use (client ID, number of requests from that client)
• It proposes “leader v proposes request r to be
in slot n” to all replicas
– It then increments n by 1.
Learn phase
[Figure: after the accept phase, the servers broadcast confirmations in the learn phase.]
Learn phase
• Another replica R:
– If v is the latest leader R has agreed on, then R will
broadcast “confirmed” to all.
• If R has agreed on a proposal with smaller v, this new
proposal will overwrite the old one.
– If R has agreed on a later leader, R will reply “No. I
have promised not to accept proposals from you.”
– If R is still working with an older leader, R should
catch up with the new leader.
Learn phase
• Any replica:
– If it receives f+1 “Confirmed” for the same request,
it will execute the request.
– It will then send the reply to the client.
– A client will consider the request as complete after
it receives one reply.
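Putting the accept and learn steps together, here is a hedged sketch of the state a replica keeps and how it reacts to propose and confirmed messages. The message format, `broadcast` callback, and class name are illustrative assumptions, and leader election (with its promise step) is assumed to happen elsewhere.

```python
class PaxosReplica:
    """Sketch of one replica's accept/learn bookkeeping (tolerating f failures)."""

    def __init__(self, my_id, f, broadcast):
        self.my_id, self.f, self.broadcast = my_id, f, broadcast
        self.promised_view = 0      # highest view this replica has promised to follow
        self.accepted = {}          # slot -> (view, request)
        self.confirms = {}          # (view, slot, request) -> set of confirming replicas
        self.executed = {}          # slot -> request (only once the request is stable)

    def on_propose(self, view, slot, request):
        """Accept phase: the leader of `view` proposes `request` for `slot`."""
        if view < self.promised_view:
            return                  # we promised a later leader: ignore the proposal
        self.promised_view = view
        self.accepted[slot] = (view, request)   # a later view overwrites an older proposal
        self.broadcast(("CONFIRMED", view, slot, request, self.my_id))

    def on_confirmed(self, view, slot, request, sender):
        """Learn phase: count confirmations; execute once f+1 replicas agree."""
        group = self.confirms.setdefault((view, slot, request), set())
        group.add(sender)
        if len(group) >= self.f + 1 and slot not in self.executed:
            self.executed[slot] = request       # the request is stable: safe to execute
```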
The leader can then propose the next request
[Figure: after one accept/learn round completes, the leader runs the accept and learn phases again for the next slot.]
Learn the past in Leader election
• A new leader must learn what has been agreed on.
– During leader election, each non-leader replica will also
reply with its history of requests agreed in the past.
– If the new leader finds one slot n is already occupied by
some requests, it will choose the request with the highest v
for slot n. We will see why.
– The new leader will notify all replicas of its decision by
following the same accept-learn protocol.
– The new leader will continue after the max occupied slot (i.e. set n = max + 1).
Example
• Suppose N=5, f=2.
• Suppose a proposal has <v, n, req>
• In leader election (v=3), three replicas replied
– Leader itself: <0, 0, a> <1, 1, b>
– Replica 1: <0, 0, a> <2, 1, c>
– Replica 2: <0, 0, a> <2, 1, c> <2, 2, d>
• What is the leader’s action?
• The leader will (1) choose <0, 0, a> <2, 1, c> <2, 2, d>, (2) change them to <3, 0, a> <3, 1, c> <3, 2, d>, (3) start the accept phase for all of them, and (4) continue with n=3.
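The merge step the leader performs here can be written as "for every slot, keep the entry with the highest view and re-propose it under the new view"; the sketch below reproduces the example (the function name is illustrative).

```python
def merge_histories(new_view, histories):
    """histories: lists of (v, n, req) tuples reported during leader election.
    Keep the highest-view entry per slot and re-propose it under new_view."""
    best = {}
    for v, n, req in (entry for h in histories for entry in h):
        if n not in best or v > best[n][0]:
            best[n] = (v, req)
    return [(new_view, n, req) for n, (v, req) in sorted(best.items())]

replies = [
    [(0, 0, "a"), (1, 1, "b")],               # leader's own history
    [(0, 0, "a"), (2, 1, "c")],               # replica 1
    [(0, 0, "a"), (2, 1, "c"), (2, 2, "d")],  # replica 2
]
print(merge_histories(3, replies))
# [(3, 0, 'a'), (3, 1, 'c'), (3, 2, 'd')] -> continue proposing at n = 3
```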
Example
• Replica 4: <1, 1, b>; Replica 1: <2, 1, c>; Replica 2: <2, 1, c>. How did this happen?
– Leader (v=1) proposes <1, 1, b>. It was agreed by replicas 1 and 4. Then replica 1 dies.
– Leader (v=2) does not know about <1, 1, b> after it was elected, so it proposes <2, 1, c>.
– During this procedure, replica 1 comes back.
Summary so far
• A new leader needs to get a promise from f+1
replicas
• If a slot is already occupied, the new leader will
choose the one with the highest v.
• A proposal should be agreed by f+1 replicas
Key idea of Paxos
• A replica will only execute a request if it is
agreed by f+1 replicas as the next one.
• Why?
– Assuming f failures, we need at least f+1 replicas.
– Paxos’ leader election guarantees that if a request
is agreed by f+1 replicas, then later leaders will not
change this decision
Key idea
• Paxos’ leader election guarantees that if a
request is agreed by f+1 replicas, then later
leaders will not change this decision
– We say that this request becomes “stable”
– Then it is safe to execute the request.
– Can you prove this property?
Proof of correctness
• Correctness: for the same slot n, different
replicas should execute the same request r.
• Proof by contradiction: suppose one replica
executes <v, n, r> and another executes <v’, n,
r’>, and suppose there is no other different request between v and v’ for slot n.
– v < v’; r!=r’
Proof of correctness
• Proof by contradiction: <v, n, r> and <v’, n, r’>
• <v, n, r> must be agreed by f+1 replicas. We
call them Group(v).
• If a replica in Group(v) agrees to v’ being the new leader, it must do so after it agrees on <v, n, r>
– Because v’ will ask everyone not to accept old proposals.
Proof of correctness
• During the leader election of v’, v’ will ask all replicas
and wait for f+1 replies. We call this group Group(v’)
• Group(v) and Group(v’) both have at least f+1
members, so there must be at least one replica
existing in both groups (Quorum idea).
• This replica will tell v’ that it has already agreed on <v,
n, r>
• v’ should choose the request with the highest view,
which should be <v, n, r>, and change it to <v’, n, r>.
This contradicts <v’, n, r’>.
Liveness?
• What can prevent the system from making
progress?
– A leader is elected.
– Before it was able to successfully propose any
request, a new round of leader election happens.
– And this happens continuously.
– Note a new round of election will happen if some
replica believes the current leader is unresponsive
(maybe through a timeout)
Liveness?
• Problem: there is no upper bound on network
latency, so latency may be higher than timeout
interval. In this case, leader election will
happen continuously.
• Solution: let timeout interval grow
exponentially with v
Liveness?
• Paxos can achieve liveness when
– Max network latency < timeout interval
– Clocks of different replicas are reasonably
synchronized
– The above two conditions hold for sufficiently long
so that a new leader can successfully propose at
least one request.
• Paxos achieves liveness with a condition
– But it seems this is the best we can get
Summary so far
• Paxos is always correct.
• Paxos is live when the environment is
synchronous for sufficiently long.
• Two round trips in the common case
• One more round trip for leader election
Optimizing Paxos
• The protocol we discussed so far is expensive
• It can be optimized in many ways
– Some are simple
– Some could take a whole lecture to explain, so I
will only briefly go through them
Problem 1: log may grow long
• Each replica needs to record what it agreed on
– A new leader may ask for such info
• Consequences:
– The log grows longer
– Leader election becomes more expensive
• Solution: checkpoint and discard old logs
Problem 1: log may grow long
• Checkpoint: a replica can write all its state to a
file periodically or under certain conditions
• Discard old log: if the replica has executed
request n (slot) before the checkpoint, it can
discard all logs before n
Problem 1: log may grow long
• How about leader election?
– A replica taking checkpoint at request n means all
requests <= n must have already been stable
– So a new leader does not need to choose requests
for slot <= n
• What to do if some replica misses requests
before n?
– Just fetch the checkpoint from another replica
Problem 2: too many messages
[Figure: message timeline of one request across the client, the leader, and two servers, showing the election, accept, and learn phases.]
With 3 replicas, processing a single request needs 12 messages
Problem 2: too many messages
[Figure: message timeline of the optimized protocol across the client, the leader, and two servers.]
Solution 1: the leader delegates the broadcast (10 messages)
Trade-off: fewer messages at the cost of longer latency
Problem 2: too many messages
• Solution 2: batching
– We can put many requests in a slot
• Leader: I propose requests A, B, C, D, and E to
be in slot 99
– Instead of making a proposal for each request, the
leader can accumulate a few requests and make a
proposal for all of them together
Problem 2: too many messages
• Solution 3: avoid transferring requests content
in the learning phase
• The accept phase already sends full requests to
all replicas
• The learning phase just needs the identifier of
a request
Problem 3: long latency
[Figure: message timeline with election, accept, and learn phases.]
This protocol needs 2.5 round trips
The original one needs 2 round trips.
Problem 3: long latency
• Solution 1: pipelining
– Overlapping accept and learning phases
• The leader can start the accept phase for the
next slot before the learning phase of the
current slot completes
• Problem: an unlucky leader failure may cause
slot n+1 to be stable but not slot n
– How should the new leader handle such gap?
– The new leader can propose a special “no-op” for
slot n
Problem 3: long latency
• Solution 2: Fast Paxos (by Lamport)
– The idea carries over to later works like SpeculativePaxos and Tapir
• Basic idea:
– Fast path: one round trip under certain conditions
– Slow path: resolve the problem if the condition is
not met
Problem 3: long latency
[Figure: fast-path message timeline between the client and three replicas.]
Fast path: client sends a request to all replicas directly
Each replica executes it and replies to the client
Succeed if the client gets enough matching replies.
Problem 3: long latency
[Figure: slow-path message timeline through the leader.]
Slow path: the client cannot get matching replies from a super quorum, so it falls back to the leader.
I will skip the details since it is quite complicated
Problem 4: too many replicas
• Primary backup needs f+1 replicas to tolerate f
failures
• Paxos needs 2f+1 replicas to tolerate f failures
Problem 4: too many replicas
• Solution 1: read-only optimization
– Read-only requests should be executed on only
one replica
• Can the client send a read directly to a replica?
– No. It may read old data if the replica misses some
requests.
• From “Paxos made moderately complex”
– First run a light-weight agreement to determine
what replica has latest data
– Then send read to that replica
– If that replica fails, just retry
Problem 4: too many replicas
• Solution 2: on-demand instantiation
– Paxos only needs f+1 replies in all steps
– We can activate f+1 replicas first
– If some replica fails, activate remaining ones
• Problem: during activation there are not enough replicas, and activation can take a long time
– ThriftyPaxos from our group (USENIX ATC 16) has a
solution
Problem 5: load balancing
• The leader is doing more work than others
– It is likely to become a bottleneck
• Solution: EPaxos
– All replicas are leaders.
– Replica i can only propose for slots of the form i + kN, k = 0, 1, 2, … (N is the number of replicas), as sketched below
– Problem: what to do if some replica fails?
– Refer to the paper since it is complicated.
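A tiny illustration of the slot-assignment rule stated above (the variable names are assumptions for the example): with N replicas, replica i proposes only for slots i, i+N, i+2N, and so on, so no two replicas ever contend for the same slot.

N = 3   # number of replicas (assumed for the example)

def owner(slot, n=N):
    # The replica allowed to propose for this slot.
    return slot % n

def my_slots(i, count, n=N):
    # The first `count` slots that replica i may propose for.
    return [i + k * n for k in range(count)]

print(my_slots(1, 4))   # replica 1 proposes for slots [1, 4, 7, 10]
print(owner(7))         # 1: slot 7 belongs to replica 1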
Summary
• With all optimizations, the performance of
Paxos is not too bad
– Many of the optimizations are used in practice to
some extent.
• Paxos is widely used in practice
– Google Spanner, Microsoft Azure Storage, Alibaba
X-DB, Tencent, …
– They replicate data across datacenters, and the Internet is not as reliable as local networks
Byzantine Fault Tolerance
Yang Wang
Byzantine Generals Problem
• Several divisions of the Byzantine army
surround an enemy city
– Each division is led by a general.
– One general is the commander; the others are lieutenants.
– Generals communicate through messengers.
– They need to agree on a plan.
Byzantine Generals Problem
• Problem: some generals are traitors.
– Even the commander may be a traitor.
– Loyal generals don’t know who the traitors are, but traitors may know each other.
• Goals:
– Loyal generals should reach the same plan (either
attack or retreat)
– If the commander is loyal, all loyal generals should
follow his plan.
Byzantine Generals Problem
• Byzantine Generals Problem is an abstraction
of a replicated system with commission
failures/malicious replicas.
• Correctness: correct replicas should agree on
an order of incoming requests
– How to achieve this if some replicas, in particular
the leader, are malicious?
Byzantine Generals Problem
• Similarly, Byzantine Fault Tolerance (BFT)
protocols have synchronous and asynchronous
versions. Let’s start from the simpler one.
• Synchronous model:
– Each pair of generals has a separate channel.
– Traitors cannot intrude on other channels.
– There is an upper bound on message delay.
Byzantine Generals Problem
• Synchronous model:
– Each pair of generals has a separate channel.
– Traitors cannot intrude on other channels.
– There is an upper bound on message delay.
• Can you design a protocol for three
generals/replicas (one is a traitor)?
Impossibility results
Case 1: the commander is loyal and sends “attack” to both lieutenants, but lieutenant 2 is a traitor and tells lieutenant 1 “he said ‘retreat’”.
Case 2: the commander is a traitor who sends “attack” to lieutenant 1 and “retreat” to lieutenant 2, and lieutenant 2 honestly reports “he said ‘retreat’”.
Impossible because lieutenant 1 cannot differentiate the two cases.
Impossibility results
• General case: no solution with fewer than 3f+1 generals can cope with f traitors.
• General solution: L. Lamport, R. Shostak, M.
Pease, “The Byzantine Generals Problem”,
ACM Transactions on Programming Languages
and Systems (TOPLAS), 1982
Is the assumption realistic?
• Assumptions:
– Secret channel between each pair of nodes
– Upper bound of message delay
• However, today’s systems share the network
– A malicious node may forge identity
– A malicious node may slow down the network
How to ensure identity?
• Question: on the network, if someone tells you
“I am A”, how to know he/she is actually A?
• Problem: IPs can be forged.
– The sender IP is filled by the sender.
• Solution:?
How to ensure identity?
• Question: on the network, if someone tells you
“I am A”, how to know he/she is actually A?
• Problem: IPs can be forged.
– The sender IP is filled by the sender.
• Solution: signature, the magic of cryptography
Magic of signature
• One generates a public/private key pair
– The public key is known to everyone.
• The sender signs its message with the private key (a cryptographic computation).
• The sender sends the message together with the signature to the receiver.
• The receiver can verify the signature with the sender’s public key (another cryptographic computation).
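As a concrete (but purely illustrative) example, here is how signing and verification look with Ed25519 keys, assuming the pyca/cryptography package is installed; the message contents are made up.

from cryptography.hazmat.primitives.asymmetric import ed25519
from cryptography.exceptions import InvalidSignature

# The sender generates a key pair; the public key is known to everyone.
private_key = ed25519.Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = b"attack at dawn"
signature = private_key.sign(message)          # sign with the private key

# The receiver verifies the message with the sender's public key.
try:
    public_key.verify(signature, message)
    print("signature valid")
except InvalidSignature:
    print("forged or tampered message")

# Tampering with the message makes verification fail.
try:
    public_key.verify(signature, b"retreat")
except InvalidSignature:
    print("tampering detected")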
Magic of signature
• Property: computationally hard to find the
private key corresponding to a public key
• Signature allows a receiver to verify that a
message is actually sent by the sender.
• More importantly, it allows one party to prove that another party actually sent a message
– A can tell B, “I received m from C; here is C’s signature.” C cannot deny it.
Magic of signature
• If we assume message delay has an upper
bound, with the help of signature, we can cope
with any number of traitors.
Not a problem any more
With signatures, the two cases can be told apart:
– If the commander is loyal and signs “attack”, a traitorous lieutenant cannot forge a signed “retreat”.
– If the commander is a traitor and signs conflicting orders, the lieutenants can exchange the signed messages and detect it.
The commander needs to sign his message.
Slow network
• An attacker slowing down the network is much harder to defend against.
• Classic attack: Distributed Denial of Service
(DDoS)
– Use a lot of machines to flood the network.
• There is no perfect defense yet.
• That’s why I will focus on asynchronous BFT.
Model
Failure model vs. timing model:
– Crash only: Primary backup (synchronous); Paxos (asynchronous)
– Byzantine: Synchronous BFT (synchronous); PBFT (asynchronous)
Authors
• Miguel Castro, Barbara Liskov, Practical
Byzantine Fault Tolerance, OSDI 99
Review of Paxos
• Challenges: cannot guarantee there is a single
leader; no way to reliably detect failures
• Key idea/lessons:
– For a system composed of N replicas, we can only expect N-f of them to reply.
– If a decision is agreed by f+1 replicas, it becomes
stable, which means a new leader will not change
the decision.
Review of Paxos
• PBFT inherits these ideas.
• Question: does Paxos + signature solve the
problem?
Review of Paxos
• PBFT inherits these ideas.
• Question: does Paxos + signature solve the
problem?
– Problem 1: a malicious leader (traitor) could
propose different requests to different replicas.
– Example: leader could send a to replica 1 and b to
replica 2. They may execute different requests.
Review of Paxos
• PBFT inherits these ideas.
• Question: does Paxos + signature solve the
problem?
– Problem 2: even if a request is agreed by f+1
replicas, a new leader may not know it.
– The new leader gets f+1 responses for leader election: it is possible that f of them are from slow but correct replicas that missed the request, and 1 is from a traitor who lies that it has not seen it.
PBFT
• Goals:
– Correctness: all correct replicas execute the same
sequence of requests
– Validity: each request is from a client.
– Liveness: the system can eventually make progress
• Assumptions:
– Signatures; no upper bound on message delay
– 3f+1 replicas (f includes both byzantine and
omission failures)
Overview
• Looks similar to Paxos but has one more phase
Normal case operation (no leader election)
Request
A client signs its request so that a faulty leader cannot forge one.
Pre‐prepare
The leader proposes the next request <PRE-PREPARE, v, n, r>
v: view number (leader ID)
n: slot number
r: next request
The leader signs this proposal.
Prepare
After a replica receives a PRE-PREPARE, it first makes a series of checks:
The signature matches.
The view number matches.
It has not accepted a different PRE-PREPARE for the same v and n.
If something is wrong, the replica can ask for a leader election.
Prepare
If replica i passes all checks, it will send <PREPARE, v, n, r, i> to
other replicas together with its signature
Prepare
We say (v, n, r) is prepared at replica i if replica i has received the pre-prepare and 2f matching prepares (from distinct replicas) for (v, n, r), as sketched below.
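A minimal sketch of the “prepared” check at replica i, assuming the replica keeps a log of the signature-verified messages it has received; the message format is an assumption for illustration.

def prepared(log, v, n, r, f):
    """(v, n, r) is prepared if the log holds the leader's PRE-PREPARE plus
    2f PREPAREs from distinct replicas, all for the same (v, n, r)."""
    has_preprepare = any(
        m["type"] == "PRE-PREPARE" and (m["v"], m["n"], m["r"]) == (v, n, r)
        for m in log)
    prepare_senders = {
        m["i"] for m in log
        if m["type"] == "PREPARE" and (m["v"], m["n"], m["r"]) == (v, n, r)}
    return has_preprepare and len(prepare_senders) >= 2 * f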
Property of “Prepared”
• Property: if (v, n, r) is prepared at any replica,
then it is impossible for a (v, n, r’) (r’!=r) to be
prepared at any correct replica.
• Proof: if (v, n, r) is prepared, then 2f+1 of the 3f+1 replicas have sent either a pre-prepare or a prepare for it.
– Among them, at least f+1 must be correct, and a correct replica sends only one pre-prepare/prepare for a given (v, n), namely for r.
– So at most 3f+1 - (f+1) = 2f replicas could ever send a pre-prepare or prepare for a different r', which is not enough for any correct replica to collect the 2f+1 messages it needs.
Property of “Prepared”
• Property: if (v, n, r) is prepared at any replica,
then it is impossible for a (v, n, r’) (r’!=r) to be
prepared at any correct replica.
• If (v, n, r) is prepared at a replica i, can replica i
execute (v, n, r)?
– This is what we did in Paxos. If a request is agreed
by f+1 replicas, then a replica can execute it.
Property of “Prepared”
• Property: if (v, n, r) is prepared at any replica,
then it is impossible for a (v, n, r’) (r’!=r) to be
prepared at any correct replica.
• If (v, n, r) is prepared at a replica i, can replica i
execute (v, n, r)?
– No. If replica i dies in the future, a new leader,
which is malicious, can propose a different r’ for n.
– We must have some mechanism to prevent a new leader from proposing arbitrary values.
Property of “Prepared”
• Property: if (v, n, r) is prepared at any replica,
then it is impossible for a (v, n, r’) (r’!=r) to be
prepared at any correct replica.
• If (v, n, r) is prepared at a replica i, can replica i
execute (v, n, r)?
– No. If replica i dies in the future, a new leader,
which is malicious, can propose a different r’ for n.
• This is why we need the third phase.
Commit
If (v, n, r) is prepared at replica i, replica i sends <COMMIT, v, n, r, i>
to all replicas, together with its signature
Commit
We say (v, n, r) is committed at replica i, if replica i has received 2f+1
matching commits.
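The matching sketch for “committed”, under the same assumed message log as the prepared check: 2f+1 COMMITs from distinct replicas for the same (v, n, r).

def committed(log, v, n, r, f):
    commit_senders = {
        m["i"] for m in log
        if m["type"] == "COMMIT" and (m["v"], m["n"], m["r"]) == (v, n, r)}
    return len(commit_senders) >= 2 * f + 1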
Property of “Committed”
• Property: if (v, n, r) is committed at any replica,
it must be prepared at at least f+1 correct
replicas.
• Proof: the replica must have received 2f+1
commits.
– f+1 of them must be from correct replicas
– (v, n, r) must be prepared at those f+1 replicas.
• This property ensures that a “committed” (v, n, r) can survive a leader election.
Leader election (View Change)
• Similar to that in Paxos
• If something is wrong, a replica will increment
its v and start a leader election
– It may not receive requests for a long time: maybe the leader is dead
– It may receive conflicting prepares or commits.
–…
Leader election (View Change)
• The new leader must get 2f acknowledgements.
• As in Paxos, each replica also reports its “prepared” requests.
– When reporting, the replica attaches the corresponding pre-prepare and 2f prepare messages so that the new leader can verify them
Leader election (View Change)
• The new leader must get 2f acknowledgements.
• As in Paxos, each replica also reports its “prepared” requests.
• As in Paxos, if a slot is already occupied, the new leader chooses the request with the highest v and sends a pre-prepare for it.
Leader election (View Change)
• The new leader must get 2f acknowledgements.
• As in Paxos, each replica also reports its “prepared” requests.
• As in Paxos, if a slot is already occupied, the new leader chooses the request with the highest v and sends a pre-prepare for it.
– But what if the new leader is malicious?
Leader election (View Change)
• Unlike Paxos, a new leader needs to present all
messages (acknowledgements and reported
“prepared” requests) it received to other
replicas
– Other replicas can check that the new leader
actually did the right thing.
– Messages are signed so the new leader cannot
forge them.
Revisiting the commit phase
• Property: if (v, n, r) is committed at any replica,
it must be prepared at at least f+1 correct
replicas.
• The new leader hears from 2f+1 replicas (its 2f acknowledgements plus itself); since (2f+1) + (f+1) > 3f+1, at least one of them is a correct replica at which (v, n, r) is prepared.
• So the new leader will learn that (v, n, r) was prepared before.
• The new leader cannot pretend that it does not know: the reports are signed and must be presented to the other replicas.
Correctness
• Within one leader (or one view), we know that if a request is prepared at some replica, no correct replica can get a different request prepared for the same slot. So it is impossible for correct replicas to agree on different requests.
• During a leader election, we know that if a request is committed, the new leader will learn it and re-propose it.
Liveness
• A number of optimizations:
• E.g., a leader election starts only if f+1 replicas want to elect a new leader
– A single faulty replica cannot force elections
continuously
• ……
• As with Paxos, PBFT is live if the environment is synchronous for sufficiently long.
Topics of Distributed System
Section
• Distributed deadlock detection
– Global state and distributed snapshot
– Logical clock and vector clock, …
• Distributed mutual exclusion
• Distributed transaction execution
• Fault tolerance
– Quorum-based approach
– Leader-based approach: Paxos and PBFT
• Consistency
Consistency
• Consistency is a contract between a service
and a user: it is a specification of the
correctness guarantees of a service.
• Confusing terminology:
– In distributed systems, it is called “consistency”
– In database systems, it is called “isolation”
• Consistency in database systems has other meanings.
– In cache systems, it is called “coherence”
Outline
• Common consistency models in distributed
systems
• Common isolation levels in database systems
– We have learned them
• Common consistency models for
memory/cache systems
Consistency
• Linearizable (strongest)
• Sequential
• Causal
• Eventual (weakest)
Sequential consistency
• Definition: The result of any execution is the
same as if the (read and write) operations by
all processes on the data store were executed
in some sequential order and the operations of
each individual process appear in this
sequence in the order specified by its program.
• Similar to serializability among database isolation levels.
– In a DB, the unit is a transaction; in this section, the unit is a single read/write operation
Sequential consistency - Examples
a=0, b=0 at the beginning
A: W a=1
B: W b=1
C: R a=1, R b=1
Equal to: W a=1, W b=1, R a=1, R b=1
Sequential consistency - Examples
A: W a=1
B: W b=1
C: R a=0, R b=0
Equal to: R a=0, R b=0, W a=1, W b=1
Sequential consistency - Examples
A: W a=1
B: W b=1
C: R a=0, R b=1
Equal to: R a=0, W a=1, W b=1, R b=1
Sequential consistency - Examples
A: W a=1
B: W b=1
C: R a=1, R b=0
Equal to: W a=1, R a=1, R b=0, W b=1
Sequential consistency - Examples
A: W a=1
B: W b=1
C: R a=1, R a=0
Not equal to any sequential order: once C has read a=1, a later read cannot return the old value 0.
Sequential consistency - Examples
A: W a=1
B: W b=1
C: R a=1, R b=0
D: R b=1, R a=0
Not equal to any sequential order: C requires W a=1 to come before W b=1, while D requires the opposite.
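A brute-force way to check the examples above (a toy sketch, only practical for tiny histories): search for an interleaving that respects each process's program order and in which every read returns the most recent write (0 if none).

def sequentially_consistent(histories):
    """histories: one list per process of ("W"|"R", var, value) operations."""
    def search(pos, store):
        if all(pos[p] == len(h) for p, h in enumerate(histories)):
            return True
        for p, h in enumerate(histories):
            i = pos[p]
            if i == len(h):
                continue
            op, var, val = h[i]
            if op == "R" and store.get(var, 0) != val:
                continue          # this read cannot be the next operation
            new_store = dict(store)
            if op == "W":
                new_store[var] = val
            if search(pos[:p] + [i + 1] + pos[p + 1:], new_store):
                return True
        return False
    return search([0] * len(histories), {})

A = [("W", "a", 1)]
B = [("W", "b", 1)]
C = [("R", "a", 1), ("R", "b", 0)]
D = [("R", "b", 1), ("R", "a", 0)]
print(sequentially_consistent([A, B, C]))      # True: an earlier example
print(sequentially_consistent([A, B, C, D]))   # False: the last example above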
Linearizable
• Definition: sequential consistency + real time
constraint
• If a request R1 completes before R2 starts,
then R1 must be ordered before R2.
• Concurrent requests can be ordered in either
way.
Linearizable - Examples
A: W a=1
B: W b=1
C: R a=1, R b=1
Equal to: W a=1, W b=1, R a=1, R b=1
Linearizable - Examples
A: W a=1
B: W b=1
C: R a=0, R b=0
Not linearizable (though it is sequentially consistent): if the writes complete before C's reads start, the real-time constraint forbids ordering the reads before the writes.
Linearizable - Examples
A: W a=1
B: W a=2
C: R a=1
Linearizable - Examples
A: W a=1
B: W a=2
C: R a=2
Linearizable - Examples
A: W a=1
B: W a=2
C: R a=1
D: R a=2
• Linearizability is easy to use
• Protocols we learned so far (Gifford, Primary
Backup, Paxos, PBFT) all try to achieve
linearizability.
Eventual Consistency
• Definition: If no new updates are made to a
given data item, eventually all accesses to that
item will return the last updated value
• No global ordering requirement
• Requires conflict resolution for concurrent updates
Eventual - Examples
• Version control (e.g. svn, git):
– Each user updates locally
– A user commits code when ready
– A user may need to resolve conflicts when they update the repo
– Different users may see updates in a different order
Causal Consistency
• Definition: respect the partial (causal) order defined by the users
– If a user sees a request R, then they must also see all requests that R depends on.
• No global ordering requirement
• Requires conflict resolution for concurrent updates
Causal - Examples
• Access control in a social network
– Remove your parents from the access list
– Post a photo that you hope your parents never see
– This is problematic in an eventually consistent system, because your parents may see the photo before the “update access list” request
Implementing consistency
• There are different ways
– Quorum based: Gifford’s algorithm, …
– Agreement based: Paxos, PBFT, …
–…
• To give you an overview, let’s use quorum-based algorithms as an example
Implementing consistency
Linearizability
w > N/2: any two write quorums overlap, so writes are totally ordered
r + w > N: a read quorum intersects the latest write quorum, so a read always gets the latest data
[Diagram: U1 writes to a quorum of replicas R1–R3; U2 reads from a quorum]
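A minimal sketch of the quorum rule above, with in-memory "replicas" and version numbers (all names, and the fixed choice of which replicas form a quorum, are simplifications for illustration):

N, W_Q, R_Q = 3, 2, 2      # w > N/2 and r + w > N

replicas = [{"version": 0, "value": None} for _ in range(N)]

def write(value):
    # Read a quorum first to pick a version higher than anything committed,
    # then install the write at a write quorum of w replicas.
    version = max(rep["version"] for rep in replicas[:R_Q]) + 1
    for rep in replicas[:W_Q]:
        rep["version"] = version
        rep["value"] = value

def read():
    # r + w > N guarantees the read quorum overlaps the latest write quorum,
    # so the highest-versioned copy we see is the latest write.
    quorum = replicas[:R_Q]
    latest = max(quorum, key=lambda rep: rep["version"])
    return latest["value"]

write("a=1")
print(read())    # a=1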
Implementing consistency
Sequential
w > N/2: no concurrent writes; writes are totally ordered
Relaxing r + w > N: a read no longer always gets the latest data
A read could get stale data (so this gives sequential consistency, not linearizability)
[Diagram: U1 writes to a quorum of replicas R1–R3; U2 reads]
Implementing consistency
Eventual/Causal
Relaxing w > N/2 as well: writes are no longer totally ordered
Different replicas may see writes in a different order
[Diagram: U1 writes a=1 to one replica while U2 concurrently writes a=2 to another]
Replicas must exchange data later and resolve conflicts
Implementing Causal Consistency
• One common solution: eventual consistency +
vector clock
• Vector clocks (sketched below) have scalability issues in a large cluster
– Much research work addresses this scalability issue
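A minimal vector-clock sketch (the helper names are assumptions, not from any particular system): each replica keeps one counter per replica, and clocks are compared component-wise to tell causally ordered updates from concurrent ones.

def vc_new(n):
    return [0] * n

def vc_tick(clock, i):
    clock[i] += 1            # local event at replica i
    return clock

def vc_merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

def vc_leq(a, b):
    """True if the event stamped a causally precedes (or equals) b."""
    return all(x <= y for x, y in zip(a, b))

def concurrent(a, b):
    return not vc_leq(a, b) and not vc_leq(b, a)

c0 = vc_tick(vc_new(3), 0)                   # update at replica 0: [1, 0, 0]
c1 = vc_tick(vc_merge(vc_new(3), c0), 1)     # reply at replica 1:  [1, 1, 0]
c2 = vc_tick(vc_new(3), 2)                   # independent update:  [0, 0, 1]
print(vc_leq(c0, c1))      # True: c1 causally depends on c0
print(concurrent(c1, c2))  # True: concurrent, needs conflict resolution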
Why weaker consistency?
• Linearizable, Sequential: easier to use
• (There is a gap here.)
• Causal, Eventual: lower latency
Why weaker consistency?
• Weaker consistency is mainly motivated by the
problem of replicating data on unreliable or
long-latency networks
• Examples:
– You have a desktop, a laptop, smartphones, etc., and you want to replicate data among them
– Big companies want to replicate data across
different datacenters (geo-replication)
Why weaker consistency?
[Diagram: modifying a file that is replicated across devices over long-latency links; some devices may be shut down for a long time]
Why weaker consistency?
• Strong consistency (linearizability and
sequential)
– An operation needs to reach a quorum of replicas
– This will incur a long latency
• Weak consistency (causal and eventual)
– An operation only needs to reach one (local) replica
– Low latency
– Replicas can exchange data later in the background
• Question: can we achieve both strong
consistency and low latency?
– Low latency means an operation can be completed
at the local replica
CAP theorem
• Impossible to achieve all the following:
– Strong Consistency: sequential or higher
– Availability
– Partition tolerance
• Low latency => operation can be completed at the
local replica => system is available even if network is
partitioned
• CAP theorem says such a system cannot achieve strong
consistency.
• Question: can we achieve both strong
consistency and low latency?
– No. If we want low latency, we have to give up
strong consistency.
• To use weakly consistent system, we need to
address additional challenges
Challenge: Resolve conflicts
• For causal/eventual consistency, we need to resolve conflicts if replicas see writes in a different order.
• It is application specific.
• Example: number of remaining tickets; both U1’s and U2’s replicas show 58.
Challenge: Resolve conflicts
• It is application specific
• Example: Number of remaining tickets
U1 buys 1 at its replica and U2 buys 1 at its replica: both replicas now show 57.
Challenge: Resolve conflicts
• It is application specific
• Example: Number of remaining tickets
U1 and U2 each buy 1; both replicas show 57. When the replicas merge, should they simply exchange the value 57?
Challenge: Resolve conflicts
• It is application specific
• Example: Number of remaining tickets
Instead of merging the value 57, each replica sends the operation it applied: “buy 1” is merged as a delta of -1.
Challenge: Resolve conflicts
• It is application specific
• Example: Number of remaining tickets
After applying the remote delta of -1, both replicas correctly show 56.
Challenge: Resolve conflicts
• It does not solve all problems
• If the original ticket number is 1, it is possible
that both users get the ticket
• Resolving the conflict will set the number to -1, which is invalid.
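A sketch of the delta-based merge from the ticket example, including the corner case above (all of this is a toy illustration): replicas exchange the operations they applied (buy 1 = delta of -1) rather than the resulting counts.

def apply_local(count, delta):
    return count + delta

def merge(local_count, remote_deltas):
    # Apply the deltas we have not seen yet instead of overwriting
    # the counter with the remote value.
    return local_count + sum(remote_deltas)

tickets = 58
u1 = apply_local(tickets, -1)     # U1 buys 1 locally -> 57
u2 = apply_local(tickets, -1)     # U2 buys 1 locally -> 57
print(merge(u1, [-1]))            # 56 at U1's replica
print(merge(u2, [-1]))            # 56 at U2's replica

# The corner case: if only 1 ticket remains, both replicas sell it and the
# merged count becomes -1; the application must detect and compensate.
last = 1
print(merge(apply_local(last, -1), [-1]))   # -1: invalid, needs compensation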
Consistency in reality
• Who is using strong consistency?
– Online gaming
– Google
• Who is using weak consistency?
– Banks (ATMs)
– Version control
– Amazon; Facebook
Outline
• Common consistency models in distributed
systems
• Common isolation levels in database systems
– We have learned them
• Common consistency models in
memory/cache systems
The issue of Data Consistency
• Assume writes to an object become visible to all in the same order
• But when does a write become visible, exactly?
– How to establish an order between a write and a read by different processes?
– Think of event synchronization using more than one data object
/* Assume the initial value of A and flag is 0 */
P1:                     P2:
A = 1;                  while (flag == 0);  /* spin idly */
flag = 1;               print A;
• If our memory were linearizable, then “print A” would always output 1.
• Unfortunately, real memory systems do not guarantee this. That is why you should never write code like the above without explicit synchronization.
Weak Consistency
• Another approach to data consistency is to use
synchronization primitives
– think of these primitives as barriers, added by the programmer when enforcement of consistency is desired
• Simplest model: weak consistency
– Single synchronization primitive with following properties:
• Accesses to synchronization variables associated with a data store are
sequentially consistent => all processes see operations on same order
• No operation on a synchronization variable is allowed to be performed
until all previous writes have been completed everywhere => synch to
flush the pipeline
• No read or write operation on data items are allowed to be performed
until all previous operations to synchronization variables have been
performed => synch to get the latest value
Weak Consistency (3)
a) A valid sequence of events for weak consistency.
b) An invalid sequence for weak consistency.
Release Consistency (1)
• In this model the programmer has to use two separate
synchronization operations
– Acquire operation is used to enter a critical region
– Release is used to leave the region
• This is a weaker model because the responsibility of
maintaining consistency is delegated to the user
– Acquire and release do not have to apply to all shared data in a store
– if all data is protected, the result of the execution will be no different
than with sequential consistency model
• Difference with respect to weak consistency: two separate
synch operations for entering or leaving the critical section
allow for potentially more efficient implementation
Release Consistency (2)
• Semantic of acquire
– All local copies of the protected data are brought up to date with remote
ones
• Semantic of release
– All changes to local copy of protected data are propagated to remote
copies
• Possible variants
– Lazy release consistency: at time of release changes are not propagated.
Instead, during an acquire the most recent values of data are retrieved
from all possible copies holding them
Release Consistency (3)
• A valid event sequence for release consistency.
Cloud and Virtual Machine
Motivation and history
• Suppose you want to build a startup
At first you put servers in your office
Motivation and history
• Later your startup becomes successful
Too many servers to fit in your office
Motivation and history
• Put servers into a separate room
You need someone to manage them
Motivation and history
• Managing a machine room is not easy
– Cooling
– Power supply
– Reboot crashed machines
–…
• The IT guy may need to do other tasks
– Installing and updating software
– Backup data
–…
Motivation and history
• A new business appears
Build a room to manage servers for others
Motivation and history
• The room owner handles the following
– Cooling
– Power supply
– Reboot crashed machines
–…
• The server owner handles the following
– Installing and updating software
– Backup data
–…
Can we do better?
• Motivation 1:
– A company has to buy servers to handle peak load
– Peak load is usually much higher than average load
– This leads to a waste of resources
• Motivation 2:
– Load can change in unpredictable ways
– Buying new machines is too slow
Can we do better?
• It would be better if users can share machines
• Room owner can buy servers and users can
rent them on demand
Challenges
• Users may require different hardware
– A may only need a server to handle light traffic
– B may need to run heavy computation tasks
– What servers should the room owner buy?
• Users may require different software
– A may need Linux and B may need Windows
– How to share the server between them?
Early ideas
• Because of these challenges, the idea of “sharing a cluster” was only successful in limited areas in the early days
• Grid (1990s-2000s): share HPC cluster
– A user submits MPI jobs to the cluster
– Schedule the job when resources are available
– Charge the user based on job length * #CPUs
– Example: OSC, TACC, …
Much harder in general
• Advantages of HPC jobs
– They follow uniform patterns: written in MPI, use
similar libraries, store data in a shared file system, …
• Many applications do not follow such patterns
– They may require specific operating systems
– They may require installation of specific software
– They may require root privilege
– …
Made possible by virtual machine
• Virtual machine:
– Can install and run multiple instances of different
OSes on the same machine
– Each user can have root privilege to the OS
instance he/she owns
– Different OS instances are isolated
Challenges and solutions
• Users may require different hardware
– Cluster owner can buy powerful machines
– If some users only need crappy ones, cluster
owner can run multiple virtual machines on the
same physical machine
• Users may require different software
– They can install their own OS and software
This leads to the success of Cloud
(late 2000s till now)
More than virtual machine
• Virtual machine is the key technique that
enables cloud
• But Cloud provides more than that:
– Reliable and rich storage systems (e.g. Amazon
EBS, DynamoDB, …)
– Computation platform (e.g. TensorFlow)
–…
• By using cloud, a company outsources cluster
management to the cloud provider
Pros and Cons of Cloud
• Pros
– Elastic resource: reserve machines on demand
– Professional IT management
– Have someone to blame when service is down
• Cons
– Privacy is still a concern
– For very large companies, managing their own
clusters is cheaper
Topics
• Virtual machine internals
• Other cloud services
References
• Edouard Bugnion, Scott Devine, Mendel
Rosenblum, “Disco: running commodity
operating systems on scalable
multiprocessors”, SOSP 97
– Mendel Rosenblum later co-founded VMWare
– SIGOPS Hall of Fame award 2008
• Paul Barham et al. “Xen and the art of
virtualization”, SOSP 03
– SIGOPS Hall of Fame Award 2023
References
• They are not the earliest works in VM
– Early works date back to the 1970s at IBM
– But they are the popular ones used today
• They mainly focus on “replication” and
“optimization”, but not “emulation”
– Replication: run multiple OSes
– Optimization: improve performance
– Emulation: convert an ISA to another
• E.g. Android emulator on desktop
Key challenges of a VMM
• To understand key challenges, let’s first review
how an OS works
Key challenges of a VMM
• Most of the time, application code runs on hardware directly (native execution)
• Occasionally, control is given to OS
– When application makes a system call
– When interrupt occurs
– When there is a page fault
– When application has an error
– ……
Key challenges of a VMM
• To implement this mechanism, CPU provides
several features
– Privileged and unprivileged instructions: privileged
instructions can only be executed by OS
– Trap mechanism: OS registers handlers to all kinds
of events (e.g. system call, interrupt, page fault,
etc); if such an event happens, CPU automatically
jumps (traps) to the predefined function
Key challenges of a VMM
• Traditionally, OS and CPU are co-designed
under the assumptions that
– There is only one OS running
– OS has exclusive control of all hardware
– OS has the highest privilege
Key challenges of a VMM
• With the new architecture, all these
assumptions are broken
Key challenges of a VMM
• Examples of problems:
– An OS needs to register interrupt handlers. Now we have more than one OS registering for the same interrupt. How?
– OS virtual memory mechanism assumes OS has
full control of all memory, which is not true any
more.
– I/O devices are shared by multiple OSes. How?
– ……
Short answer
• VMM should intercept key instructions and
emulate them.
– E.g. when an OS registers an interrupt handler,
VMM should store it somewhere. VMM should
register its own interrupt handler, which
dispatches the interrupt to a stored handler.
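An illustrative (entirely hypothetical) sketch of the intercept-and-emulate idea for interrupt handlers: the VMM is the only party that registers a real handler with the hardware; it records what each guest OS tried to register and forwards interrupts to the right guest handler.

class VMM:
    def __init__(self):
        self.guest_handlers = {}      # (guest_id, vector) -> guest handler
        self.running_guest = 0        # which guest currently has the CPU

    def emulate_register_handler(self, guest_id, vector, handler):
        # Trapped privileged instruction: instead of touching the real
        # interrupt table, just remember the guest's handler.
        self.guest_handlers[(guest_id, vector)] = handler

    def real_interrupt_handler(self, vector):
        # The VMM's own handler, the only one the hardware knows about.
        # Dispatch to the handler of the guest the interrupt belongs to
        # (simplified here to the currently running guest).
        handler = self.guest_handlers.get((self.running_guest, vector))
        if handler is not None:
            handler()

vmm = VMM()
vmm.emulate_register_handler(0, 32, lambda: print("guest 0: timer tick"))
vmm.real_interrupt_handler(32)        # prints: guest 0: timer tick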
Outline
• How to intercept instructions
• Virtual Memory
• I/O devices
Intercept instructions
• DISCO: trap-and-emulate (full virtualization)
– Though for hardware reasons, DISCO does not
fully achieve full virtualization.
• Xen: change OS code (para-virtualization)
DISCO: trap-and-emulate
• Traditionally, application runs in low-privilege
mode and OS runs in high-privilege mode
• DISCO is designed for MIPS, which has three
modes
• DISCO moves OS to the middle mode and
executes VMM in high-privilege mode
– When OS executes privileged instructions, they
will trap into VMM
DISCO: trap-and-emulate
• Problem: MIPS does not trap all privileged
instructions. It is not designed for VMM in the
first place.
– DISCO’s solution: it re-writes OS code at binary
level to replace these unsupported privileged
instructions
Xen: para-virtualization
• Xen is designed for x86, which provides 4 modes.
• Xen also runs OS with lower privilege
• Xen provides hypercalls to OS
– We need to manually replace privileged instructions
with hypercalls in an OS
– Hypercalls are similar to system calls, except that they
are between VMM and OS
• Xen replaces interrupts with events
Intercept instructions - Summary
• To support VMM, instructions should be
categorized into three types
– Those that can be executed by applications
– Those that can be executed by OSes
– Those that can only be executed by VMM
• If an OS executes a third type, it should be
properly trapped into VMM.
• Early CPUs only support two types and create a
lot of problems for VMM design.
• Later CPUs provide good support.
Outline
• How to intercept instructions
• Virtual Memory
• I/O devices
Review of virtual memory
• Process sees virtual addresses
• OS maps a virtual address into a physical address
– Mechanisms: paging, segmentation, …
• CPU provides support to accelerate address
mapping
– Hardware accelerated page table
– TLB (cache for page table)
Challenges for VMM
• Multiplexing is not an issue: to support multi-
processing, CPU already supports multiple
page tables and can switch between them
• Problem: different OSes cannot all own the same physical pages.
– Each OS is designed under the assumption that it can access all physical memory. But with a VMM, it cannot.
Solutions
• DISCO: trap page-table related instructions
and emulate them
• Xen: the OS should make a hypercall when
operating page tables
DISCO memory management
• Three levels of addresses: virtual (seen by processes), physical (seen by each guest OS), and machine (the real hardware addresses)
DISCO memory management
• MIPS: software managed TLB
– When CPU finds a virtual page number is not in
TLB, it traps to OS
– OS uses a special instruction to load the correct
entry into TLB and retry
• DISCO: intercepts this special instruction and converts the guest's physical address to a machine address
DISCO memory management
• MIPS allows an OS to access physical memory
directly and bypass TLB
– DISCO disallows this mode: it needs to modify OS
DISCO memory management
• Context switch: page 0 of process 1 is different
from page 0 of process 2
– How to handle TLB entries after a context switch?
• Traditionally, two solutions:
– Flush the whole TLB: bad for performance
– Add an ASID (similar to process ID) to each entry
DISCO memory management
• Context switch: page 0 of process 1 is different
from page 0 of process 2
– How to handle TLB entries after a context switch?
• Traditionally, two solutions:
– Flush the whole TLB: bad for performance
– Add an ASID (similar to process ID) to each entry
• VMM is complicating things because different
OSes may assign same ASID
– DISCO has to do a full flush
Challenges of x86
• X86 does not have a software managed TLB
– When CPU finds a virtual page number is not in
TLB, it will search the page table in memory
– OS must organize the page table in a format
specified by the CPU and register it to CPU
• DISCO’s solution does not work for x86 CPUs
Xen memory management
• No three-layer mapping: just two layers
• An OS reserves some memory pages from
VMM (with hypercalls)
• When an OS updates its own page table, it
also makes a hypercall: Xen checks that the OS
only maps a page it owns
Xen memory management
• Xen copies itself to each address space
– This can avoid TLB flushes when OS switches to
Xen and when Xen switches to an OS
– TLB flush is still necessary when switching
between different OSes
– This optimization is a trade-off between memory
consumption and performance
VMWare memory management on x86
• Following DISCO, VMWare tries to achieve full
virtualization, which should not require
modification to OS code.
• Review of challenges from x86:
– No software managed TLB
– CPU assumes page table must be organized in a
specific format
VMWare memory management on x86
• Solution: shadow page table
• VMM lets each OS organize its own page tables,
but does not register them to CPU
• VMM maintains a copy for each page table
(shadow) and registers them to CPU
• VMM marks the OS’s page tables as read-only
– When the OS updates its page table, the write traps to the VMM
– The VMM updates both its shadow page table and the OS’s page table: in the shadow table, the VMM puts the machine address; in the original table, the physical address (see the sketch below)
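A toy sketch of the shadow-page-table bookkeeping described above (class and field names are made up): the guest keeps virtual-to-physical mappings, the VMM keeps the physical-to-machine mapping, and the shadow table the CPU actually walks holds virtual-to-machine entries; a trapped write to the guest table updates both.

class ShadowPaging:
    def __init__(self, phys_to_machine):
        self.guest_pt = {}             # virtual page -> guest "physical" page
        self.shadow_pt = {}            # virtual page -> machine page (used by CPU)
        self.p2m = phys_to_machine     # VMM's physical -> machine mapping

    def on_guest_pte_write(self, vpage, ppage):
        # The guest's page table is write-protected, so this write traps
        # into the VMM, which keeps both tables consistent.
        self.guest_pt[vpage] = ppage               # what the guest sees
        self.shadow_pt[vpage] = self.p2m[ppage]    # what the hardware uses

    def translate(self, vpage):
        return self.shadow_pt[vpage]

mmu = ShadowPaging(phys_to_machine={0: 100, 1: 101})
mmu.on_guest_pte_write(vpage=7, ppage=1)
print(mmu.translate(7))    # 101: the machine page, not the guest "physical" page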
Outline
• How to intercept instructions
• Virtual Memory
• I/O devices
I/O devices
• Some devices already support multiplexing (e.g. disk, network). They are not hard to handle:
– Address mapping for PIO and DMA
– Interrupt handling
• Some devices do not support multiplexing well (e.g. GPU, InfiniBand).
– How to virtualize them is still an open problem.
– The earlier papers don’t discuss these.
Implementation for I/O devices
• Trap-and-emulate is still a solution
• Another solution is to design specific device
drivers for guest OS
– These drivers can make hypercalls
– They do not require modification to OS
– DISCO uses this method
More about Virtual Machine
Review
• So far we have learned the ideas of early work
– Disco and Xen
– Intercept and emulate
• Later works continue to improve
Outline
• Hardware support
• Performance optimization
• Isolation (resource and security)
• Light-weight VM
• Privacy
Hardware assisted virtualization
• As one can see, many problems are created by
the fact that CPUs were designed without
considering VMM in the first place.
– E.g. only two privilege levels, some instructions
not properly trapped, TLB flush, ...
– Software emulation can incur a high overhead
Hardware assisted virtualization
• To solve these problems, CPU manufacturers
like Intel and AMD add virtualization support
in their CPUs
– They have simplified the design and improved the
performance of VMM
• We show Intel’s as an example
Hardware assisted virtualization
• DISCO published in 1997. Xen in 2003.
• Intel vt-x (2005): three privilege levels
– Add extended page table (2008)
• Intel vt-d (2012?): virtualization for directed
I/O
• Intel vt-c (2014?): virtualization for
connectivity
• Intel GVT: virtualization for GPU
Intel vt-x
• Three privilege levels: root (for VMM), non-
root (for OS), and user
– Instructions are properly trapped and trapping can
be configured
• Special instructions VMENTRY and VMEXIT to
quickly swap registers and address space
• VMCS (VM Control Structure): data structure
to manage state
Intel vt-x
• Add a virtual-processor-ID to each TLB entry
– Similar to an ASID, but used to identify the virtual processor
– No need to flush whole TLB during a switch
• Extended page table
– OS still maintains its own page tables to convert a
virtual address to a physical address
– VMM maintains a page table to convert a physical
address to a machine address
– CPU automatically searches both tables
Intel vt-d and vt-c
• Hardware DMA remapping
• Hardware interrupt remapping
• Single Root I/O Virtualization (SR-IOV):
virtualize PCIe devices (GPU and Infiniband
use PCIe)
Outline
• Hardware support
• Performance optimization
• Isolation (resource and security)
• Light-weight VM
• Privacy
Optimization opportunities
• Running multiple OSes on the same machine
creates many optimization opportunities
• There are many papers about them. I can only
cover a few classic ones.
Sharing memory pages
• It is not uncommon that users run multiple
instances of the same OS on the same
machine. Their code is the same.
• Instead of loading a copy for each OS instance,
we can load one copy for all and share these
memory pages with all instances
Flexible Memory Management
• When we create a VM, we usually specify how
much memory it needs, but usually it will not
use all such memory.
• We can allow multiple OS instances to
exchange free memory space
– Solution: balloon driver. It can ask for pages from
the OS and actually give them to VMM.
Efficient I/O
• Different OS instances communicate through
the network interface
• But if they are on the same VMM, VMM can
use memory copy instead
– Can be further optimized to use page sharing to
reduce memory copy
Outline
• Hardware support
• Performance optimization
• Isolation (resource and security)
• Light-weight VM
• Privacy
Resource Isolation
• Cloud providers may create multiple VMs on a physical machine
• Users get VMs instead of physical machines
• So VMs from different users may run on the same machine
• Their resources should be isolated
Resource Isolation
• Some resources are easy to isolate
– Number of CPU cores
– Memory and disk space
– Network bandwidth
• Some resources are not easy to isolate
– Disk I/Os
– Memory bandwidth
Disk
• Problem: big gap between sequential and
random I/O
– Sequential I/O can be 100x faster than random
• So the VMM cannot simply limit each VM’s
bandwidth or IOPS
– If the limit is for sequential (random) I/Os, then it
does not work well for random (sequential) I/Os
• Less of a problem for SSD and NVRAM
– Because their gap is smaller
Memory bandwidth
• Problem: there is no good place to enforce the
limit
• For disks, the VMM can intercept each I/O
• But intercepting each memory access is too
expensive
Security Isolation
• Goal: if one VM is hacked, other VMs on the
same physical machine are still secure
• Why is this better than a traditional OS?
– A traditional OS gives a similar guarantee: if a (non-root) user is hacked, other users on the same OS are still secure
Security Isolation
• Problem: to achieve our goal, we need to
ensure the underlying software is secure
– VMM in the VM case
– OS in the traditional case
• People believe it is easier to secure VMM than
securing OS
– Because the interface of VMM is simpler than OS
Simpler interface
• The interface of OS: system calls
– There are more than 400 system calls now
– If any of them has a vulnerability, a hacked user
can utilize it to control the whole system
• The interface of VMM: hypercalls and
instruction emulation; management
– Simpler than that of OS
– Less likely to have vulnerability
Counter argument
• VMMs are getting more complex
– Because people need more features
• Maybe in the future, a VMM will be as
complex as an OS?
Outline
• Hardware support
• Performance optimization
• Isolation (resource and security)
• Light-weight VM
• Privacy
Cost of a VM
• We need an OS image for each VM
– Usually hundreds of MBs to a few GBs
– Loading the image takes seconds
• An OS has a minimum memory requirement
• As a result
– Starting a VM is slow
– We can only run a limited number of VMs on a machine
Solutions
• Container: Docker, etc
• Unikernel: ClickOS, etc
• But they all have trade-offs
Container
• Modern OS actually already provides a
number of virtualization functions
– Process to virtualize CPUs; virtual memory; disk
partitions; network can be shared
– It even provides resource isolation features
• Why do we need VM?
– One reason is to run different OSes
– But if we don’t need that, is OS virtualization
enough?
Container
• One problem of OS virtualization: software
dependency
– A may need Linux version X1 and java version Y1
– B may need Linux version X2 and java version Y2
• Why can’t we always use the newest version?
Container
• Backward compatibility: the new version of a
software should always support code/data
generated by the old version
– E.g. Word 2013 should be able to open a Word 97
document
• Unfortunately, much software does not follow this principle
– The problem is more severe in the open-source community
– Microsoft is pretty good at backward compatibility
Container
• Without backward compatibility, if you update
software to newest version, your old
program/data may have problems
• Idea of container: build an application and its
dependencies into an image
Container vs VM
• Container: all users share the same OS
– Lightweight: no OS image; all containers run in one OS
– Of course, it cannot run different OSes on one machine
• VM: each user has his/her own OS
– More expensive
– Better security isolation
Unikernel
• A general-purpose OS is big (both image size
and memory consumption) because it needs
to support various applications
– Remember that is the major reason VM is
expensive
• Basic idea of Unikernel: create customized OS
for a specific application
Unikernel
• Customized OS
– No separation of user and kernel space: OS and
application run in the same space
– Remove all code not used by the application
• Can reduce OS size to a few MBs
• Of course it cannot run other applications
Outline
• Hardware support
• Performance optimization
• Isolation (resource and security)
• Light-weight VM
• Privacy
Concern about privacy
• Privacy is one major obstacle that keeps many organizations from moving their data to the cloud
• Risk model: they assume cloud providers can
be malicious or at least “curious”
– Curious means cloud providers may not hack your
system, but may try to read your data
Protecting privacy
• Traditional solution: encryption
– User encrypts data before sending it to the server
• This works for remote storage system, but not
for cloud
– Because a user needs to run computation
– Data must be decrypted at some point on cloud
– VMM can access all data
Protecting privacy
• Which component has even higher privilege
than VMM?
Protecting privacy
• Which component has even higher privilege
than VMM?
– Hardware (CPU)
Protecting privacy
• Which component has even higher privilege
than VMM?
– Hardware (CPU)
• To ensure privacy despite a malicious VMM,
CPU manufacturers provide new features
– Intel SGX, AMD SEV, ARM Trustzone, …
Intel SGX
• A user program can create a special memory
region called “enclave”
– CPU guarantees that no other component (VMM,
OS, process, etc) can access this memory region
• When loading a program, CPU computes a
hash for the program and signs the hash
– User can verify that it is the correct program
New trust model
• With such techniques, a user does not need to
trust cloud providers
• They have to trust CPU manufacturers
– It seems that we cannot avoid this.
Limitation
• These techniques have many limitations at
this moment:
– Limited trusted memory
– Need to rewrite code
– May be affected by hardware vulnerabilities: e.g.
Meltdown and Spectre attacks
• CPU manufacturers are still improving them