DisSys Lec7
Replication
• Replication is a technique used in distributed systems to improve the availability,
reliability, and performance of the system.
• Each copy of the data is referred to as a replica, and the process of creating and
maintaining replicas is known as replication.
Replication objectives
• Availability: By replicating data across multiple nodes, the system can continue to
function even if some nodes fail. If one replica becomes unavailable, the system
can switch to another replica, ensuring that the data remains available.
• Reliability: Replication can improve the reliability of the system by reducing the
likelihood of data loss due to node failures or other issues. If one replica becomes
corrupted or lost, the system can use another replica to restore the data.
Deduplicating requests requires the database to track, in stable storage, which requests it has already seen.
Retry behavior
• Choice of retry behavior:
• At-most-once semantics: send request, don't retry, update may not happen
• At-least-once semantics: retry request until acknowledged, may repeat
update
• Exactly-once semantics: retry + idempotence or deduplication
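The retry semantics above can be illustrated with a small simulation. This is a minimal sketch, not a definitive implementation: the `Database`, `LossyChannel`, and `increment_with_retry` names are hypothetical, the "network" is an in-process object that drops acknowledgements, and the set of seen request IDs lives in memory rather than in stable storage.

```python
import uuid


class Database:
    """Toy database that deduplicates requests by ID (hypothetical API)."""

    def __init__(self):
        self.counter = 0
        # Assumption: a real system would keep this set in stable storage.
        self.seen_request_ids = set()

    def increment(self, request_id):
        # Apply the update only the first time this request ID is seen.
        if request_id not in self.seen_request_ids:
            self.seen_request_ids.add(request_id)
            self.counter += 1
        return "ack"


class LossyChannel:
    """Delivers every request, but drops the first `drops` acknowledgements."""

    def __init__(self, db, drops):
        self.db = db
        self.drops = drops

    def send(self, request_id):
        reply = self.db.increment(request_id)
        if self.drops > 0:
            self.drops -= 1
            return None  # the acknowledgement was lost on the way back
        return reply


def increment_with_retry(channel, max_attempts=5):
    """At-least-once retry; reusing one request ID lets the database deduplicate."""
    request_id = str(uuid.uuid4())
    for _ in range(max_attempts):
        if channel.send(request_id) == "ack":
            return True
    return False


db = Database()
ok = increment_with_retry(LossyChannel(db, drops=2))
print(ok, db.counter)  # True 1
```

The request reaches the database three times, but the counter is incremented once. Without the `seen_request_ids` check, the same retries would apply the update three times (at-least-once); without the retry loop, the client would never learn whether the update happened (at-most-once).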
• A replica may be unavailable due to network partition or node fault (e.g. crash,
hardware problem).
• The details of how exactly the replication is performed have a big impact on the
reliability of the system.
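• Without fault tolerance, having multiple replicas would make reliability worse, because more
replicas means a higher probability that at least one of them is faulty at any given time.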
Read-after-write consistency
• Writing to one replica, reading from another: client does not read back the value it has written
• Require writing to/reading from both replicas ⟹ cannot write/read if one replica is unavailable
• Many systems require read-after-write consistency, in which we ensure that after a
client writes a value, the same client will be able to read back the value it has just
written.
• Even with read-after-write consistency, a client might not read back exactly the value it wrote,
because another client may have concurrently overwritten it.
• Therefore, we say that read-after-write consistency requires reading either the last value
written, or a later value.
• With two replicas, guaranteeing this requires every write and read to reach both replicas. Reads
and/or writes are then no longer fault-tolerant: if one replica is unavailable, a write or read
that requires responses from both replicas cannot complete.
Quorum (2 out of 3)
Write succeeds on B and C; read succeeds on A and B.
Choose between (t0, v0) and (t1, v1) based on timestamp.
Quorum
• We can solve this problem by using three replicas. We send every read and write request
to all three replicas, but we consider the request successful if we receive 2 responses.
• In the example, the write succeeds on replicas B and C, while the read succeeds on
replicas A and B. With a (2 out of 3) policy for both reads and writes, it is guaranteed that
at least one of the responses to a read is from a replica that saw the most recent write
(in the example, this is replica B).
• In this example, the set of replicas {B, C} that responded to the write request is a write
quorum, and the set {A, B} that responded to the read is a read quorum.
• A quorum is a minimum set of nodes that must respond to some request for it to be
successful. (The term comes from politics, where a quorum refers to the minimum
number of votes required to make a valid decision.)
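The 2-out-of-3 example can be sketched in a few lines. This is an illustrative model under stated assumptions: the `Replica` class, `quorum_write`, and `quorum_read` are hypothetical names, unavailability is modelled with a simple flag, and timestamps are plain integers (0 for t0, 1 for t1).

```python
class Replica:
    """In-memory replica holding one timestamped value (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.available = True
        self.timestamp = 0      # t0
        self.value = "v0"

    def write(self, timestamp, value):
        if not self.available:
            return None         # no response
        # Keep only the value with the newest timestamp.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value
        return "ok"

    def read(self):
        if not self.available:
            return None         # no response
        return (self.timestamp, self.value)


def quorum_write(replicas, timestamp, value, w=2):
    # Send to all replicas; succeed if at least w of them acknowledge.
    acks = [rep.write(timestamp, value) for rep in replicas]
    return sum(ack is not None for ack in acks) >= w


def quorum_read(replicas, r=2):
    # Send to all replicas; need at least r responses.
    replies = [rep.read() for rep in replicas]
    replies = [reply for reply in replies if reply is not None]
    if len(replies) < r:
        raise RuntimeError("read quorum not reached")
    return max(replies)         # the response with the newest timestamp wins


a, b, c = Replica("A"), Replica("B"), Replica("C")
replicas = [a, b, c]

a.available = False                         # the write reaches only B and C
write_ok = quorum_write(replicas, 1, "v1")
a.available, c.available = True, False      # the read reaches only A and B
result = quorum_read(replicas)
print(write_ok, result)  # True (1, 'v1')
```

Replica A still returns (t0, v0), but replica B is in both quorums and returns (t1, v1), so the read picks the newer value by timestamp.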
Read and write quorums
• In a system with 𝑛 replicas:
• If a write is acknowledged by w replicas (write quorum),
• and we subsequently read from r replicas (read quorum),
• and 𝑟 + 𝑤 > 𝑛,
• . . . then the read will see the previously written value (or a value that subsequently
overwrote it)
• Read quorum and write quorum share ≥ 1 replica
• Typical: 𝑟 = 𝑤 = (𝑛 + 1)/2, for 𝑛 = 3, 5, 7, . . . (majority)
• Reads can tolerate 𝑛 − 𝑟 unavailable replicas, writes 𝑛 − 𝑤
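The overlap condition can be checked by brute force for small 𝑛. The sketch below is only a sanity check, with hypothetical helper names (`quorums_always_overlap`, `majority`); it enumerates every possible read quorum and write quorum and tests whether they always share a replica.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every r-subset of n replicas intersects every w-subset."""
    nodes = range(n)
    return all(
        set(read_q) & set(write_q)          # empty intersection is falsy
        for read_q in combinations(nodes, r)
        for write_q in combinations(nodes, w)
    )


def majority(n):
    """Majority quorum size (n + 1) / 2 for odd n."""
    return (n + 1) // 2


for n in (3, 5, 7):
    q = majority(n)
    # columns: n, quorum size, overlap guaranteed?, tolerated unavailable replicas
    print(n, q, quorums_always_overlap(n, q, q), n - q)
# 3 2 True 1
# 5 3 True 2
# 7 4 True 3

print(quorums_always_overlap(3, 1, 2))  # False, because r + w = n
```

With 𝑟 = 1 and 𝑤 = 2 on three replicas, a read from one replica can miss both replicas that acknowledged the write, which is why the strict inequality 𝑟 + 𝑤 > 𝑛 is needed.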
Read repair
Replication using broadcast
• So far we have used best-effort broadcast for replication. What about
stronger broadcast models?
• Some distributed databases perform replication using total order broadcast, with each replica
independently executing the same deterministic transaction code (this is known as active replication).
• This principle also underpins blockchains, cryptocurrencies, and distributed ledgers: the "chain of
blocks" in a blockchain is nothing other than the sequence of messages delivered by a total order
broadcast protocol.
Total order broadcast algorithms
• Single leader approach:
• One node is designated as leader (sequencer)
• To broadcast message, send it to the leader; leader broadcasts it via FIFO broadcast.
• Problem: leader crashes ⇒ no more messages delivered
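The single-leader approach can be sketched as follows. This is a simplified in-process model, not a real protocol implementation: `Sequencer` and `OrderedReplica` are hypothetical names, FIFO delivery from the leader is modelled with sequence numbers and a hold-back buffer, and a crash is just a flag.

```python
class OrderedReplica:
    """Delivers messages strictly in the leader's sequence-number order."""

    def __init__(self):
        self.delivered = []
        self.next_seq = 0
        self.buffer = {}

    def receive(self, seq, message):
        # Hold back messages that arrive early, then deliver in order.
        self.buffer[seq] = message
        while self.next_seq in self.buffer:
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1


class Sequencer:
    """Leader that fixes one global order for all broadcast messages."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.seq = 0
        self.crashed = False

    def broadcast(self, message):
        if self.crashed:
            return False        # no leader, so nothing gets delivered
        seq = self.seq
        self.seq += 1
        for replica in self.replicas:
            replica.receive(seq, message)
        return True


replicas = [OrderedReplica(), OrderedReplica()]
leader = Sequencer(replicas)
leader.broadcast("m1 from node X")
leader.broadcast("m2 from node Y")
same_order = replicas[0].delivered == replicas[1].delivered

leader.crashed = True
after_crash = leader.broadcast("m3 from node X")
print(same_order, after_crash)  # True False
```

Every replica sees the order chosen by the leader, regardless of which node originally sent each message. Once the leader is marked as crashed, `broadcast` fails for everyone, which is the availability problem noted in the last bullet.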
• If replica state updates are commutative, replicas can process updates in different
orders and still end up in the same state.
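A quick way to see why commutativity matters is to apply the same set of updates in every possible order. The sketch below uses hypothetical example updates: counter increments (commutative) and a mix of addition and doubling (not commutative).

```python
from itertools import permutations


def apply_all(initial, updates):
    """Apply a sequence of update functions to an initial state."""
    state = initial
    for update in updates:
        state = update(state)
    return state


# Commutative updates: adding to a counter.
increments = [lambda s: s + 1, lambda s: s + 5, lambda s: s + 10]

# Non-commutative updates: addition mixed with doubling.
mixed = [lambda s: s + 1, lambda s: s * 2]

# Collect the final states reached over all delivery orders.
commutative_results = {apply_all(0, order) for order in permutations(increments)}
mixed_results = {apply_all(1, order) for order in permutations(mixed)}

print(sorted(commutative_results))  # [16]
print(sorted(mixed_results))        # [3, 4]
```

All six orderings of the increments end in the same state, so replicas that receive them in different orders still converge. The mixed updates produce two different final states, so they would need a delivery order that all replicas agree on.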
• Consistency: Every read from the system returns the most recent write or an
error.
• Availability: Every request receives a response, without guarantee that it
contains the most recent write.
• Partition tolerance: The system continues to function even when network
partitions occur.
Partition tolerance vs. consistency and availability
• Partition tolerance is a fundamental requirement of distributed systems, and
cannot be sacrificed.
• However, the choice between consistency and availability depends on how the
system handles network partitions.
• For example, a system that sacrifices consistency during a network partition may
be able to provide high availability, while a system that sacrifices availability
during a network partition may be able to provide strong consistency.
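The trade-off in the last bullet can be made concrete with a toy replica that is cut off from the rest of the system. This is a sketch under loud assumptions: the `PartitionedReplica` class and the "CP"/"AP" mode labels are hypothetical, and a partition is modelled as a boolean flag rather than a real network failure.

```python
class PartitionedReplica:
    """Replica that chooses consistency ("CP") or availability ("AP") when partitioned."""

    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (labels assumed for this sketch)
        self.local_value = "v0"     # may be stale if newer writes happened elsewhere
        self.partitioned = False

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Sacrifice availability: refuse rather than risk a stale answer.
            raise ConnectionError("cannot confirm the most recent write")
        # Sacrifice consistency (or no partition): answer from local state.
        return self.local_value


cp = PartitionedReplica("CP")
ap = PartitionedReplica("AP")
cp.partitioned = ap.partitioned = True

ap_answer = ap.read()               # responds, possibly with a stale value
try:
    cp.read()
    cp_answered = True
except ConnectionError:
    cp_answered = False             # returns an error instead of stale data

print(ap_answer, cp_answered)  # v0 False
```

When there is no partition, both replicas behave identically; the choice only becomes visible while the partition lasts.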
Latency and performance
• Achieving high consistency or availability may come at the cost of increased
latency or reduced performance.