Replication Consistency
ACK: Slides use some material from Scott Shenker (UC Berkeley) and Jim Kurose (UMass)
Agenda
⚫ Replication (of what?)
⚫ Primary-backup Protocols
⚫ Asynchronous protocols
⚫ Synchronous protocols
Replication
⚫ When we replicate an object, we create copies of the object and store
them on different servers
Why Replication?
⚫ Fault tolerance
⚫ If one replica crashes, simply switch to another replica
⚫ With k replicas of each object, the system can tolerate the failure of any (k-1) servers
⚫ Performance
⚫ Helps when scaling for size or geographical area
⚫ Load balancing: divide the workload among multiple servers
⚫ Placing the copy of the object in proximity of the client, e.g., CDNs
Nature of Replicated Data
⚫ Read-only data
⚫ Easy to replicate; we just make multiple copies
⚫ Read-write data
⚫ Writes can cause replicas to diverge. What is the challenge?
⚫ Replicas must be kept consistent; modifications need to be propagated to all other copies!
⚫ What do applications need: read-only or read-write?
⚫ We want a distributed system with multiple replicas to appear as if there were one copy on a single machine, so we want read-write data replication!
⚫ Challenge:
⚫ When and how to propagate write updates? (see the divergence sketch below)
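To make the challenge concrete, here is a minimal sketch (plain Python dicts standing in for replica servers; purely illustrative) of what happens when a write reaches one replica and is never propagated:

```python
# Two dict-backed "replicas" of the same object (stand-ins for servers).
replica_a = {"x": 0}
replica_b = {"x": 0}

# A client writes through replica A, but the update is never propagated.
replica_a["x"] = 1

# Readers now see different values depending on which replica they hit:
print(replica_a["x"])  # 1
print(replica_b["x"])  # 0  <- stale; the replicas have diverged
```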
Depends on Application Requirements
⚫ What do applications require?
⚫ From replicated/distributed systems
⚫ Availability
⚫ The application remains operational and promptly processes requests
⚫ Some server failures do not prevent the surviving servers from continuing to operate
⚫ Partition tolerance
⚫ The application continues to operate despite message loss due to a network partition
Network Partitions Divide Systems
(figure: a network partition splits the system into groups of replicas that cannot exchange messages with each other)
Fundamental Tradeoff
⚫ Replicas appear to be a single [consistent] machine but lose availability during a network partition
⚫ OR replicas remain available during a partition but may return inconsistent (stale) answers
CAP Theorem Preview
⚫ You cannot achieve all three of:
1. Consistency
2. Availability
3. Partition tolerance
(figure: the three properties drawn at the corners of a triangle)
CAP Theorem [Gilbert Lynch 02]
Assume, to derive a contradiction, that an algorithm provides all of CAP
⚫ Start with a variable x=0 consistently stored at replicas A and B
⚫ A partition then separates A from B ("Partition Possible")
⚫ Client 1 issues w(x=1) at A; by availability, A must reply "ok" even though it cannot reach B
⚫ Client 2 then issues r(x) at B; by availability, B must reply, and it returns x=0
⚫ The read returns a stale value: not consistent (C) => contradiction!
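The proof scenario is small enough to simulate. In this sketch the Replica class, the peer link, and the partitioned flag are all illustrative inventions, not from the slides; the point is that a replica that must always answer (availability) will answer with stale data once the partition blocks propagation:

```python
class Replica:
    """An always-available replica (illustrative sketch)."""
    def __init__(self, name):
        self.name = name
        self.x = 0              # x=0 stored consistently at A and B
        self.peer = None
        self.partitioned = False

    def write(self, value):
        self.x = value
        if not self.partitioned:    # propagate only if the peer is reachable
            self.peer.x = value
        return "ok"                 # availability: must reply regardless

    def read(self):
        return self.x               # availability: must reply right now

A, B = Replica("A"), Replica("B")
A.peer, B.peer = B, A

A.partitioned = B.partitioned = True   # the network partitions
print(A.write(1))   # Client 1: w(x=1) -> "ok"
print(B.read())     # Client 2: r(x)   -> 0  (stale: not consistent)
```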
CAP Interpretation Part 1
⚫ Cannot “choose” to have no partitions
⚫ The 2-out-of-3 interpretation doesn’t make sense
⚫ Instead, the real choice during a partition is: availability OR consistency?
CAP Interpretation Part 2
⚫ It is a theorem, with a proof that you now understand!
⚫ Can engineer systems to make partitions extremely rare, and then just
take the rare hit to availability (or consistency)
Some Real Distributed Systems Relax Consistency Constraints …
⚫ Memcache at Facebook
Questions?
Node Failures and Correctness
Failure Model: Fail-Stop
⚫ A node fails by halting: it stops executing and sends no further messages
(figure: a node crashes and goes silent)
Failure Model: Byzantine Failures
⚫ A node fails by behaving arbitrarily, e.g., sending corrupted or contradictory messages
⚫ Can be caused by:
• Malicious attacks
• Software errors
Failures in Distributed Systems (DS)
⚫ What distinguishes DS from single-machine systems:
⚫ Some nodes might still be working correctly while others are experiencing failures
⚫ DS may continue to operate even if part of it is failing
Correctness for Strong Consistency
⚫ Replicas act like a single machine
⚫ Specifically
⚫ If one replica commits an update, no other replica rejects it
⚫ If one replica rejects an update, no replica commits it
We will assume the Fail-Stop model
⚫ Many consistency protocols assume fail-stop model
⚫ Reason: Byzantine failures are very hard to deal with and usually add
substantial performance overheads
Primary-backup protocols
Primary-backup protocols
⚫ One special node (the primary) orders requests; the backups apply updates in that order
Two Types of Primary-backup Schemes
⚫ Asynchronous primary-backup protocol
⚫ Synchronous primary-backup protocol
Primary-backup protocol
(figure: the client sends a write-request to the primary, which forwards the update to the backup replicas)
⚫ When does the primary ACK the client? Either:
• The primary sends an ACK to the client once it has performed the write locally, OR
• It waits for all replicas first to perform the update and then sends the ACK
Primary-backup protocol [asynchronous version]
⚫ Step 1: the client sends a write-request to the primary
⚫ Step 2: the primary performs the write locally and immediately ACKs the client
⚫ Step 3: the primary propagates the update to the backups in the background
(figure: the client's ACK arrives before the backups have received the update; see the sketch below)
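A minimal sketch of this asynchronous flow, using in-process objects and a background thread as stand-ins for real servers and the network (class and method names are illustrative, not from the slides):

```python
import threading
import time

class Backup:
    def __init__(self):
        self.store = {}
    def apply(self, key, value):
        self.store[key] = value

class AsyncPrimary:
    def __init__(self, backups):
        self.store = {}
        self.backups = backups

    def write(self, key, value):
        self.store[key] = value              # Step 2: write locally ...
        threading.Thread(target=self._propagate,
                         args=(key, value), daemon=True).start()
        return "ACK"                         # ... and ACK immediately

    def _propagate(self, key, value):
        time.sleep(0.1)                      # simulated network delay
        for b in self.backups:               # Step 3: update the backups;
            b.apply(key, value)              # if the primary crashes before
                                             # this runs, the update is lost

backup = Backup()
primary = AsyncPrimary([backup])
print(primary.write("x", 1))   # "ACK" returned right away
print(backup.store)            # likely {} -- the backup still lags
time.sleep(0.2)
print(backup.store)            # {'x': 1} once propagation completes
```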
What could go wrong if there are failures?
⚫ If the primary fails before the updates are sent to the backups, then updates may be lost
Primary-backup protocol [synchronous version]
⚫ Step 1: the client sends a write-request to the primary
⚫ Step 2: the primary forwards the update to all backups
⚫ Step 3: each backup applies the update and ACKs the primary
⚫ Step 4: the primary ACKs the client only after all backups have acknowledged
(figure: the client's ACK is delayed until every replica has applied the update; see the sketch below)
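For contrast, the synchronous flow under the same illustrative setup; the only change is that the primary updates every backup before it ACKs the client:

```python
class Backup:
    def __init__(self):
        self.store = {}
    def apply(self, key, value):
        self.store[key] = value

class SyncPrimary:
    def __init__(self, backups):
        self.store = {}
        self.backups = backups

    def write(self, key, value):
        self.store[key] = value    # the primary applies the update locally
        for b in self.backups:     # Steps 2-3: forward to every backup and
            b.apply(key, value)    # wait; a crashed backup would stall here
        return "ACK"               # Step 4: ACK only after all backups apply

backup = Backup()
primary = SyncPrimary([backup])
print(primary.write("x", 1))   # "ACK" arrives after the backup is updated
print(backup.store)            # {'x': 1} -- never lags behind the ACK
```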
What could go wrong if there are failures?
Failure Scenario
⚫ The client sends a write-request; the primary sends the update to the replicas
⚫ One replica crashes before it can apply and ACK the update
⚫ The other replicas have already applied the update, but the crashed one has not, and there is no way to undo the update on the replicas that applied it
(figure: the update and ACK messages reach every replica except the one that crashes)
Two Phase Commit (2-PC)
Key idea
⚫ Allow the system to roll back updates on failures
⚫ By using two phases
Terminology
⚫ Primary node: called the coordinator
⚫ Replicas: called cohorts, workers, or participants
2-PC
⚫ Phase 1 (Prepare): the coordinator sends Prepare to all replicas; each replica saves the update to disk and responds with Yes or No
⚫ Phase 2 (Commit): if the coordinator receives Yes from all replicas within the timeout, it sends Commit; each replica commits the update from disk to its store and replies with an ACK
⚫ Phase 2 (Abort): if any replica votes No, or the timeout expires before all votes arrive, the coordinator sends Abort; the replicas discard the tentative update and ACK
(figure: the message exchange between the coordinator and the replicas across the two phases; see the sketch below)
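The whole exchange fits in a short single-process sketch: method calls stand in for messages, a TimeoutError stands in for a vote that never arrives, and every name here (Coordinator, Participant, prepare/commit/abort) is an assumption of the sketch, not terminology from a real library:

```python
class Participant:
    """Minimal cohort: buffers the tentative update and votes (illustrative).
    A real cohort would save the update to disk before voting Yes."""
    def __init__(self):
        self.store = {}
        self.tentative = None

    def prepare(self, update):
        self.tentative = update
        return "Yes"

    def commit(self):
        self.store.update(self.tentative)   # move tentative update into store
        self.tentative = None

    def abort(self):
        self.tentative = None               # discard the tentative update

class Coordinator:
    def __init__(self, replicas):
        self.replicas = replicas

    def two_phase_commit(self, update):
        # Phase 1 (Prepare): collect votes; a timeout counts as a "No".
        try:
            votes = [r.prepare(update) for r in self.replicas]
        except TimeoutError:
            votes = ["No"]

        # Phase 2: Commit only if *every* replica voted Yes; otherwise Abort.
        if all(v == "Yes" for v in votes):
            for r in self.replicas:
                r.commit()
            return "committed"
        for r in self.replicas:
            r.abort()
        return "aborted"

replicas = [Participant(), Participant()]
print(Coordinator(replicas).two_phase_commit({"x": 1}))  # "committed"
print(replicas[0].store)                                 # {'x': 1}
```

Persisting the tentative update to disk before voting, which this sketch skips, is exactly the concern of the next slides.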
Failures in 2-PC: Food for Thought
⚫ If a server votes yes, can it commit unilaterally before receiving commit
message?
⚫ If a server voted No, can it abort right away without waiting for an abort
message?
Failures in 2-PC
⚫ To deal with replica crashes
⚫ Each replica saves tentative updates into permanent storage, right before replying
Yes/No in the first phase
⚫ Retrievable after crash recovery
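A sketch of that durability rule, assuming a JSON file per cohort plays the role of permanent storage (the file name and class names are purely illustrative):

```python
import json
import os

class DurableParticipant:
    def __init__(self, log_path):
        self.log_path = log_path
        self.store = {}

    def prepare(self, update):
        # Save the tentative update to permanent storage *before* voting Yes.
        with open(self.log_path, "w") as f:
            json.dump(update, f)
            f.flush()
            os.fsync(f.fileno())    # force the bytes onto the disk
        return "Yes"

    def recover(self):
        # After a crash: reload the tentative update, then ask the
        # coordinator whether the transaction committed or aborted.
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                return json.load(f)
        return None

p = DurableParticipant("cohort1.log")
print(p.prepare({"x": 1}))   # "Yes" -- the update hit the disk first
print(p.recover())           # {'x': 1}, even after a (simulated) crash
```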
Correctness and Performance of 2-PC
⚫ Correctness: All hosts that decide reach the same decision
⚫ No commit unless everyone says “yes”
2-PC Summary
⚫ Primary-backup schemes (1-phase protocols)
⚫ Safety may be violated ➙ can’t roll back updates
⚫ Two-Phase Commit
⚫ Allows rolling back updates
⚫ Sensitive to coordinator failure ➙ blocking
Summing it up …
⚫ Replication is needed for fault tolerance and performance
Advanced Topics on Consistency
⚫ Paxos
⚫ Raft
⚫ PBFT
⚫ Blockchain
Questions?