SDLE
o The server returns data to the server stub, which sends a response back to the
client stub.
o The client stub unmarshals the response and gives it back to the client
application.
Limitations: RPC is typically well-suited for request-reply patterns, but it enforces
synchronous communication, which can lead to blocking and tight coupling
between client and server. Failures also complicate RPC semantics (e.g., lost
requests or duplicates).
o Microservices architectures.
o Workflow systems.
o Email, SMS, instant messaging (in a broader sense).
1. Data Replication
1.1 Motivation
Data replication means maintaining multiple copies of the same data on different
nodes in a distributed system.
Benefits:
1. Performance: Faster, local reads since each replica can serve read
requests.
2. Reliability: A single replica failure does not necessarily lead to data loss—
data is lost only if all replicas lose the data.
3. Availability: Data remains accessible as long as at least one replica is up
and reachable.
2. Conflicts in Updates
If multiple replicas are updated in different orders or with different writes, the
replicated data can become inconsistent.
For example, if each replica starts at value 5 and:
1. Replica 1 applies an update “+3”
2. Replica 2 applies an update “+7”
Depending on the order or synchronization strategy, replicas may end up with different final
values. With these two increments the order happens not to matter (5 + 3 = 8 then 8 + 7 = 15,
or 5 + 7 = 12 then 12 + 3 = 15), but replicas diverge if one of them never applies one of the
updates (stopping at 8 or 12), or if the updates are not commutative (e.g., “set to 8” vs.
“set to 12”). Ensuring that all replicas end up in a consistent state is part of the
replication protocol challenge.
3. Strong Consistency
A simple strategy for strong consistency is for all replicas to execute all updates in
the same global order, and each update must be deterministic to yield the same
final state everywhere.
Strong consistency aims to provide an abstraction as if there is a single copy of the
data. Updates happen atomically and in some consistent global sequence.
3.1 Consistency Models
Three notable strong consistency models are:
1. Sequential Consistency
2. Serializability
3. Linearizability
They each impose increasingly strict requirements on how operations appear to execute in
the system.
4. Sequential Consistency
Definition (Lamport 1979):
An execution is sequentially consistent if it can be seen as a legal sequential (one-after-
another) ordering of all operations such that:
Each process’s operations appear in that order in the global sequence (i.e., per-
process order is preserved).
This matches the intuition of how operations would interleave in a single-core
(uniprocessor) system with multiple threads.
4.1 Example of a Violation
Suppose there are two variables, x and y, each with initial values (x=2, y=3), and two
replicas performing:
o Replica 1: x = y + 2
o Replica 2: y = x + 3
If these happen concurrently and each replica reads the old values, the final outcome can be
(x=5, y=5). However, no single sequential ordering starting from (x=2, y=3) produces that
state: running Replica 1 first gives x = 3 + 2 = 5 and then y = 5 + 3 = 8, while running
Replica 2 first gives y = 2 + 3 = 5 and then x = 5 + 2 = 7. A purely sequential execution
therefore yields (5, 8) or (7, 5), never (5, 5).
4.2 Non-Composability
Even if each individual data structure (e.g., each array or each replicated object) is
sequentially consistent, combining them does not guarantee the entire system is
sequentially consistent.
Example: Two separate arrays (u and v), each replicated under a sequentially
consistent protocol, can still produce a global interleaving that is not sequentially
consistent when operations on u and v are merged.
5. Linearizability
Definition (Herlihy and Wing 1990):
An execution is linearizable if:
1. It is sequentially consistent.
2. If operation A completes before operation B begins (according to a global observer’s
real-time clock), then A precedes B in the global sequential order.
5.1 Observations
5.2 Composability
Linearizability is composable: if two data structures are each linearizable, then
the combination of both maintains linearizability for the entire system.
This is in contrast to the non-composability of sequential consistency.
7. Weak Consistency
Strong consistency (e.g., linearizability, serializability) simplifies application logic
because it mimics a single-copy or single-processor illusion.
However, strong guarantees often come with significant synchronization costs,
impacting:
o Scalability (performance under high load or many replicas).
o Availability (ability to withstand network partitions or replica failures
without halting writes/reads).
Weak consistency models aim to:
1. Improve scalability and availability.
2. Provide some useful guarantees for applications that can tolerate more relaxed
consistency requirements (e.g., eventual consistency, causal consistency, etc.).
Typically, these weaker models depend heavily on the application domain. Some
applications do not need strict global ordering of updates if certain invariants are
preserved.
1. Introduction
1.1 Quorums and Quorum Consensus Replication
Read:
1. The client polls Nr replicas to obtain their version numbers (and possibly the
data).
2. The client uses the highest version or determines the up-to-date replica
from which it reads the value.
Write:
1. The client first obtains the current (highest) version number, e.g., by polling a
read quorum of replicas.
2. The client writes the new value (with an incremented version number) to Nw
replicas.
o If not enough replicas (i.e., < Nw) can acknowledge the write, the
transaction aborts.
o Upon commit, all replicas in the write quorum finalize the new version.
Example: with N = 5 replicas, choosing Nr = 3 and Nw = 3 satisfies Nr + Nw > N, so every
read quorum intersects every write quorum and a read always sees the most recent committed
version.
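As a rough sketch of this version-based quorum scheme (assuming N = 5, Nr = 3, Nw = 3; the
Replica class and the quorum_read/quorum_write helpers are illustrative names, not any
specific system's API):

import random

class Replica:
    """One copy of the data item: a value tagged with a version number."""
    def __init__(self):
        self.version = 0
        self.value = None

def quorum_read(replicas, n_r):
    """Poll n_r replicas and return the value carrying the highest version."""
    polled = random.sample(replicas, n_r)
    newest = max(polled, key=lambda r: r.version)
    return newest.version, newest.value

def quorum_write(replicas, n_r, n_w, new_value):
    """Learn the current version from a read quorum, then install new_value
    with version + 1 on a write quorum."""
    version, _ = quorum_read(replicas, n_r)
    for r in random.sample(replicas, n_w):
        r.version = version + 1
        r.value = new_value

# N = 5 with Nr = 3 and Nw = 3: Nr + Nw > N, so read and write quorums overlap.
replicas = [Replica() for _ in range(5)]
quorum_write(replicas, 3, 3, "x = 42")
print(quorum_read(replicas, 3))   # always (1, 'x = 42')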
7. Dynamo Quorums
7.1 Dynamo Overview
Dynamo is Amazon’s key-value store designed for high availability.
It uses a put(key, value) / get(key) API (an “associative memory” instead of a single
read/write memory cell).
Replicas maintain version vectors instead of simple version numbers.
Multi-versioning (storing conflicting updates in parallel) is used to avoid blocking
under network partitions. Clients eventually reconcile conflicting versions.
A quorum for an operation is any set of replicas that covers both the operation’s
required initial and final quorums.
To ensure consistency, certain intersection constraints must hold between the
final quorum of one operation and the initial quorum of another (e.g., a write’s final
quorum must intersect a subsequent read’s initial quorum).
(write) → (read)
where an edge indicates that the final quorum of the first must intersect the initial quorum
of the second.
Example (5 replicas), writing quorums as (initial size, final size):
o Possible minimal quorums for read might be (1,0), (2,0), (3,0): read from 1 (or 2,
or 3) replicas initially, with no final quorum required.
o Possible minimal quorums for write might be (1,5), (2,4), (3,3), etc.
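A quick check of the intersection rule for these choices: a read with initial quorum of size
ri must intersect the final quorum (size wf) of any earlier write, which for 5 replicas
requires ri + wf > 5. Pairing the minimal read quorums (1,0), (2,0), (3,0) with the
corresponding write quorums (1,5), (2,4), (3,3) gives 1 + 5, 2 + 4 and 3 + 3, all equal to
6 > 5, so every pairing satisfies the constraint.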
3. Replicated Queue as an Example of a Replicated ADT
3.1 Building a Queue via Read/Write Quorums
A queue supports:
1. Enq: Read the current state/version from an initial quorum, append the new item,
and write the updated state to a final quorum.
2. Deq (normal):
Read from an initial quorum to obtain the current state/version.
Remove the item at the head of the queue (the least recently enqueued item), if any.
Write updated state (or version) to a final quorum.
3. Deq (abnormal, empty):
Read to discover the queue is empty; no final write needed.
o For Enq, no separate “initial round” is needed to obtain a new version number,
because each client can generate a globally unique, totally ordered timestamp (e.g.,
by using a transaction manager or a hierarchical timestamp generator).
o The client only needs to write the new state/timestamp to a final quorum.
Quorum Intersection Graph for 5 replicas can become simpler:
o Reads might only require an initial quorum, and writes only a final quorum,
with the edge constraint ensuring overlap when needed.
o Discard all enqueued items (and corresponding log entries) older than that
horizon.
Client Caching: Clients can cache logs to reduce the overhead of reading entire
logs from the replicas each time.
6.3 Intersection Constraints
For Enq, the initial quorum can be empty, because the operation does not depend
on prior results to return a value to the caller.
For Deq, both the initial quorum (to see up-to-date logs/items) and the final quorum
(to append a Deq event) are needed.
Different operations thus have different quorum intersection rules to guarantee
linearizability.
7. Critical Evaluation
7.1 Timestamps and Linearizability
Herlihy’s approach requires globally ordered timestamps consistent with real-
time (linearizability).
In practice, this often means integrating with a transaction manager or a protocol
for distributing timestamps (e.g., hierarchical timestamps or 2PC-based commit to
decide final ordering).
Without transactions, assigning these timestamps correctly (especially if the initial
quorum is empty) is tricky.
7.2 Log Garbage Collection
Event-based replication can lead to unbounded growth if logs are never pruned.
Herlihy shows how certain ADTs (like a queue) allow safe removal of stale entries
(e.g., after items are dequeued).
Other ADTs may have more complex garbage-collection logic or find it less
effective.
7.3 Relationship to CRDTs (CvRDTs)
Like Conflict-free Replicated Data Types (CRDTs), Herlihy’s replicated ADTs rely
on merges of concurrent updates to keep replicas consistent.
However, CRDTs typically provide eventual consistency rather than strong
(linearizable) consistency, and do not require a quorum for each operation.
Herlihy’s approach aims for strong linearizability but imposes more
synchronization (quorum intersection).
o Each replica holds a local state and merges states from other replicas.
o The state is organized as a join-semilattice, where the merge operation is
the least upper bound (lub or “join”) of two partial states.
o Updates are monotonic: applying an update never discards information, so
replicas can safely converge via merges.
Both approaches can simulate one another, but state-based CRDTs are often more
common in practice for large or unreliable networks. They are more tolerant of message
losses, reordering, and duplicates.
3. State-Based CRDTs
3.1 Join-Semilattice Basics
A join-semilattice is a set of states with a partial order ≤ and a join operator ⊔ that
computes the least upper bound of any two elements. The join satisfies:
1. Idempotence: a ⊔ a = a.
2. Commutativity: a ⊔ b = b ⊔ a.
3. Associativity: (a ⊔ b) ⊔ c = a ⊔ (b ⊔ c).
Typically, there is a bottom (⊥) which is the initial state. Each update moves the CRDT’s
state monotonically “upwards” in the partial order (x ≤ m(x)).
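A minimal sketch of a state-based CRDT whose states form a join-semilattice: a grow-only set
(G-Set) ordered by inclusion, with set union as the join (the GSet class below is
illustrative):

class GSet:
    """Grow-only set: states are ordered by subset inclusion, join = union."""
    def __init__(self, elements=()):
        self.elements = frozenset(elements)   # bottom is the empty set

    def add(self, x):
        """Update: monotonic, never discards information (X <= add(X, x))."""
        return GSet(self.elements | {x})

    def join(self, other):
        """Least upper bound of two states."""
        return GSet(self.elements | other.elements)

a = GSet().add(1).add(2)
b = GSet().add(2).add(3)
# The join laws: idempotent, commutative, associative.
assert a.join(a).elements == a.elements
assert a.join(b).elements == b.join(a).elements
assert a.join(b).join(a).elements == a.join(b.join(a)).elements
print(sorted(a.join(b).elements))   # [1, 2, 3]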
3.2 Interpreting State as a Log
Conceptually, one can think of a distributed partial-order log (polog) of all
operations. Each replica maintains a growing local subset of the log.
State = (S, T): a set of “added” elements S and a set of “removed” elements T (the
observed value is S \ T).
Delta mutators: an update m can be written as m(X) = X ⊔ mδ(X), where the delta mδ(X) is a
(usually much smaller) state that is joined into the full state, so replicas can ship only
deltas instead of full states.
This reduces bandwidth usage and speeds up convergence when states get large.
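A hedged sketch of the (S, T) state with a delta mutator in the style of m(X) = X ⊔ mδ(X)
(the TwoPSet class and delta_add helper are illustrative names, not a specific library's API):

class TwoPSet:
    """State = (S, T): a set of added elements and a set of removed elements."""
    def __init__(self, added=frozenset(), removed=frozenset()):
        self.added, self.removed = frozenset(added), frozenset(removed)

    def value(self):
        return self.added - self.removed      # observed contents

    def join(self, other):
        """Pointwise union is the least upper bound of two states."""
        return TwoPSet(self.added | other.added, self.removed | other.removed)

def delta_add(x):
    """Delta mutator: a small state that, once joined into X, performs add(x)."""
    return TwoPSet(added={x})

replica = TwoPSet()
delta = delta_add("item")          # only this delta needs to be shipped
replica = replica.join(delta)      # m(X) = X ⊔ mδ(X)
print(replica.value())             # frozenset({'item'})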
6. Pros and Cons of State-Based CRDTs
6.1 Advantages
1. Permissive Communication: They tolerate dropped, reordered, or duplicated
messages.
2. No Rigid Reliance on Causality Messages: Local state merges handle out-of-
order arrivals naturally.
3. Operation Rate Decoupled from Transmission: A replica can perform many local
updates and merge them later.
4. Built-In Convergence: Idempotent, commutative merges ensure eventual
consistency.
6.2 Disadvantages
1. Growing Metadata: CRDT states can accumulate “tombstones” or historical tags
(e.g., version vectors or removed-element tags).
2. Full-State Transmission: A naive approach may send the entire state each time
replicas synchronize, which can be large. (Delta-based CRDTs mitigate this.)
3. Application-Specific Garbage Collection: Some CRDTs (like sets) allow indefinite
growth, so removing obsolete metadata can be non-trivial.
1. Graph Fundamentals
A graph G(V,E) is defined by:
A set of vertices V.
A set of edges E connecting pairs of vertices.
Weighted Graph: Each edge has an associated weight (cost, distance, etc.).
Path: A sequence of vertices such that consecutive vertices are connected by
edges.
1.2 Walk, Trail, Path
Walk: A sequence of vertices where edges may repeat, and vertices may repeat.
Trail: A walk where edges are not repeated (but vertices may still repeat).
Path: A walk with no repeated vertices or edges.
Diameter: D = max{ ecc(v) : v ∈ V }, where the eccentricity ecc(v) is the greatest distance
from v to any other vertex.
1. Initialization:
o Time: At most diam rounds (where diam is the diameter from the root).
Applications:
Aggregation (computing global sums, maxima, etc. in a convergecast).
Leader Election (e.g., gather all UIDs, pick the max).
Broadcast (once a BFS tree is built, use it to disseminate messages).
Computing Diameter (build BFS from every node, combine results, though it can be
expensive in large graphs).
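A small sketch of the BFS-tree construction over an adjacency list; this is a sequential
simulation of the synchronous algorithm, where each sweep of the frontier corresponds to one
round, so the tree is complete after at most diam(root) rounds (the function and variable
names are illustrative):

from collections import deque

def bfs_tree(adj, root):
    """Return parent pointers of a BFS spanning tree rooted at `root`.
    adj maps each node to the list of its neighbours."""
    parent = {root: None}
    frontier = deque([root])
    while frontier:                       # one sweep of the frontier ~ one round
        node = frontier.popleft()
        for neighbour in adj[node]:
            if neighbour not in parent:   # first message received fixes the parent
                parent[neighbour] = node
                frontier.append(neighbour)
    return parent

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
print(bfs_tree(adj, 1))   # {1: None, 2: 1, 3: 1, 4: 2}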
2.2 Asynchronous Spanning Tree (AsynchSpanningTree)
Works with asynchronous message passing (no global rounds).
Each node:
o Messages: O(|E|).
o Time: O(diam · (l + d)), where l is the maximum local processing time and d is
the maximum channel delay.
channel delay.
Child pointers & Broadcast:
If each node reports back whether it accepted the sender as a parent (or not), a tree
can be augmented with child pointers.
Complexity can be higher if we wait for acknowledgments, potentially up to
O(n(l+d)) if a chain forms through many nodes.
Leader Election:
Combine asynchronous spanning tree with UID comparisons to pick the node with
the largest (or smallest) ID.
Plumtree (epidemic broadcast trees) combines an eager-push spanning tree with lazy-push
gossip links:
1. The broadcast source sends the full payload to its eager-push neighbours.
2. On receiving a new message for the first time, the node forwards the payload to its
eager-push neighbours.
3. On receiving a duplicate message, the node moves that neighbour to lazy-push
mode (just exchanging message IDs/metadata).
4. If the tree breaks (a node times out waiting for a payload that never arrives), it
upgrades the lazy link to an eager link to repair connectivity.
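A hedged per-node sketch of this eager/lazy behaviour (the Node class, its method names, and
the transport stubs are illustrative, not Plumtree's actual API):

class Node:
    """Plumtree-style per-node state: eager peers receive full payloads,
    lazy peers receive only message IDs (IHAVE-style announcements)."""
    def __init__(self, peers):
        self.eager = set(peers)      # start with every neighbour in eager mode
        self.lazy = set()
        self.delivered = set()       # message IDs already seen

    def on_payload(self, msg_id, payload, sender):
        if msg_id not in self.delivered:
            self.delivered.add(msg_id)
            for peer in self.eager - {sender}:
                self.send_payload(peer, msg_id, payload)   # eager push
            for peer in self.lazy - {sender}:
                self.send_ihave(peer, msg_id)              # lazy push (ID only)
        else:
            # duplicate payload: demote the sending link to prune the tree
            self.eager.discard(sender)
            self.lazy.add(sender)

    def on_timeout(self, msg_id, announcer):
        # the payload never arrived over the tree: promote the lazy link to repair it
        self.lazy.discard(announcer)
        self.eager.add(announcer)
        self.request_payload(announcer, msg_id)            # GRAFT-style request

    # Transport stubs (illustrative); a real node would send over the network.
    def send_payload(self, peer, msg_id, payload): pass
    def send_ihave(self, peer, msg_id): pass
    def request_payload(self, peer, msg_id): pass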
Formal random graphs like Erdős–Rényi achieve short diameters (log n) but lack
high clustering.
Watts–Strogatz keeps a lattice/ring (high clustering) but rewires a fraction of edges
randomly to produce shortcuts, reducing average path length.
4.3 Routing
Simple flooding in a small-world network finds short routes, but local, greedy
routing is trickier:
o With arbitrary random shortcuts, greedy local steps can lead to O(√n)
average path lengths.
Kleinberg: Showed a probability distribution for long-range links that preserves
distance at multiple scales, yielding polylogarithmic greedy routes (O(log² n) expected
steps) using only local information.
DHTs (Distributed Hash Tables) like Chord are engineered “small world” overlays
with carefully chosen links, giving logarithmic routing.
5. Concluding Remarks
1. Spanning Trees in synchronous or asynchronous networks allow foundational
tasks: broadcast, gather, leader election, global computations, etc.
2. Asynchronous algorithms do not guarantee BFS ordering but still produce a valid
spanning tree.
3. Epidemic (Gossip) Protocols and Tree-based methods can be combined to
balance high scalability/resilience with controlled overhead (e.g., Plumtree).
4. Small-World Networks (Watts–Strogatz, Kleinberg) demonstrate how adding
random or structured long-distance edges can drastically reduce average path
length while retaining clustering. They inspire scalable peer-to-peer overlays and
DHTs with logarithmic routing.
Overall, distributed systems exploit these scalable topologies and graph structures to
achieve efficient, resilient communication despite large scale and unreliable networks.
CHAP 7 - FAULT TOLERANCE – CONSENSUS
Given n processes, each starts with an input value in some set V. Each process must
produce a decision value in V. The following conditions must hold:
1. Agreement: All processes that decide must decide on the same value (only one
value).
2. Validity: If all processes start with the same input value v, then the only possible
decided value is v. (We cannot invent a value that was never proposed.)
3. Termination: Every correct process eventually decides some value.
o Each acceptor that receives ACCEPT(n, v) accepts it, unless it has already
responded to a PREPARE with a number greater than n and thereby promised not to
accept proposals numbered below that.
Key Property:
Once a value is chosen (accepted by a majority under a certain proposal number),
all higher-numbered proposals that succeed (i.e., get majority acceptance) must
carry the same value. This ensures agreement.
2.3 Synod Execution Examples
1. Simple Case: One proposer P1 does PREPARE(n) and gets “promises” from a
majority of acceptors (A1, A2, A3). It then sends ACCEPT(n, v), and the acceptors
accept.
3. Any approach must eventually gather enough acceptor votes (from a majority) to
confirm that a proposal is chosen.
Progress / Liveness:
o Multiple proposers can lead to livelock if they continuously issue conflicting
proposals.
o Leader election (choosing one distinguished proposer) is used to avoid
indefinite conflicts. Only that leader tries to propose new values,
guaranteeing eventual success.
o Under synchronous-like conditions with stable leader, the protocol
terminates.
If a proposer does not get “promise” from a majority in Phase 1, it can retry with a
higher proposal number.
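A minimal sketch of the acceptor side of the Synod protocol, following the promise/accept
rules described above (the class and message names are illustrative):

class Acceptor:
    """Remembers the highest promise made and the last proposal accepted."""
    def __init__(self):
        self.promised = -1        # highest proposal number promised so far
        self.accepted = None      # (number, value) of the last accepted proposal

    def on_prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # report any previously accepted proposal so the proposer must reuse its value
            return ("PROMISE", n, self.accepted)
        return ("NACK", n)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered PREPARE was already promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("ACCEPTED", n, value)
        return ("NACK", n)

a = Acceptor()
print(a.on_prepare(1))        # ('PROMISE', 1, None)
print(a.on_accept(1, "v"))    # ('ACCEPTED', 1, 'v')
print(a.on_prepare(0))        # ('NACK', 0): a higher number was already promised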
4. Paxos as an Implementation
Lamport’s Paxos is the practical embedding of the Synod Algorithm:
1. Leader: Each node can be a proposer, acceptor, and learner, but typically we run a
leader election. The leader acts as the distinguished proposer/learner.
2. Unique Proposal Numbers: Typically (counter, leader_id) ensures uniqueness.
3. Stable Storage: Acceptors record:
o The highest-numbered PREPARE they have promised.
o The highest-numbered proposal (number and value) they have accepted, if any.
Liveness if eventually messages arrive, and a stable leader remains in place (no
perpetual new leadership attempts).
If the leader fails, a new leader runs Phase 1 for all slots that remain uncertain,
collects accepted proposals, and issues new proposals accordingly.
Once a request is learned to be chosen for slot i, each replica executes it in state
machine order.
CHAP 8 - PRACTICAL BYZANTINE FAULT-TOLERANCE
1. Byzantine Failures
A Byzantine process can deviate arbitrarily from its specification. It may send
conflicting messages to different recipients.
Other processes do not initially know which ones are Byzantine; simply ignoring
some subset’s messages is not feasible if we don’t know who is faulty.
However, correct protocols must still address replay attacks and ensure signatures
aren’t reused incorrectly.
o Send replies to the client after committing and executing the operation.
In more detail, PBFT’s atomic broadcast ensures that:
Within a single view, no two requests obtain the same sequence number with
different operation digests.
Across view changes, a request partially ordered in a previous view is not lost or
re-ordered incorrectly.
3.3 Client Behavior
2. Wait: The client collects f+1 valid REPLY messages with matching results from
different replicas.
3. Timeouts: If replies do not arrive or are inconsistent, the client broadcasts the
request to all replicas, causing them either to resend the already committed reply
or to forward the request to the leader. If the leader fails, a view change will
eventually occur.
4. Atomic Broadcast Protocol (Three-Phase Commit in PBFT)
4.1 Quorums and Certificates
o Any two quorums intersect in at least f+1 replicas, ensuring overlap that
contains at least one correct (non-Byzantine) node.
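A quick check of this bound, assuming n = 3f + 1 replicas and quorums of size 2f + 1: any two
quorums Q1 and Q2 satisfy |Q1 ∩ Q2| ≥ |Q1| + |Q2| − n = (2f + 1) + (2f + 1) − (3f + 1) = f + 1,
and since at most f replicas are faulty, at least one replica in the intersection is correct.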
Certificates:
o The replica hasn’t accepted another pre-prepare for view v with the same
sequence number n but a different digest.
4.3 Prepare Phase
o Plus, 2f PREPARE messages from distinct replicas (total 2f+1 including the
leader).
Outcome: If a request is prepared, no conflicting request with the same (v,n) can also be
prepared, achieving total order within the same view.
Invariant: If a replica commits a request, at least f+1 correct replicas also prepared it,
so knowledge of that request cannot be lost in a view change.
4.5 Execution and Reply
A replica executes the operation after committing it in sequence number order (no
gaps).
Then it sends a signed reply to the client.
The client waits for f+1 identical replies to confirm the result.
The new leader waits until it collects 2f+1 valid VIEW-CHANGE messages for v+1.
This set forms the new-view certificate.
The leader combines the information (prepared requests, stable checkpoints) into
a ⟨NEW-VIEW, v+1, V, O, N⟩σℓ message, where V is the set of VIEW-CHANGE messages it
collected, O is the set of PRE-PREPARE messages for requests that must carry over into
view v+1, and N contains PRE-PREPAREs for null requests that fill any sequence-number gaps.
The replica then enters view v+1 and re-issues PREPARE messages for the included
requests to finalize them in the new view.
6. Correctness Arguments
6.1 Safety
1. No conflicting commits: Within a view, the prepare phase ensures no two different
requests get the same (v,n).
o During the view change to v′>v, at least one of those f+1 replicas will
include that prepared info in its VIEW-CHANGE message, thereby
propagating the request.
At most f consecutive leaders can fail before a correct leader emerges (since each
view increments the leader ID by 1 modulo N).
7. Final Remarks
7.1 Additional Protocol Aspects
1. Bitcoin
1.1 Motivation
Goal: Make direct online payments without a trusted third party (like PayPal, Visa,
or a bank).
Account-based model: Bitcoin tracks a set of accounts (public-key addresses)
rather than physical “digital coins.”
A blockchain maintains a public record of all transactions ever performed.
1.2 Assumptions
1. Peer-to-peer (P2P) network:
o Nodes can join/leave freely.
o The network is large; most nodes are expected to remain online most of the
time.
o Uses broadcast over an unstructured overlay and anti-entropy to spread
information.
2. Account/Keys:
o Each user controls one or more keypairs (private/public).
o The account identifier is a hash of the public key.
o Transactions (payments) move “BTC balance” from one account (address)
to others.
2. Bitcoin Blockchain
2.1 Basic Structure
A blockchain is an ordered sequence of blocks:
o Each block contains a set of transactions plus a header with metadata (e.g.,
pointer/hash of the previous block).
4. Bitcoin Forks
4.1 Fork Condition
A fork occurs if two different miners each produce a new block referencing the
same previous block around the same time.
The network may temporarily see two competing chain tips.
4.2 Resolution
Nodes choose the chain with the most accumulated work (or simply the longer
chain) as the valid one.
Because new blocks are constantly built on top of one chain tip, typically one
branch becomes longer faster, and the other is abandoned.
No Finality in the sense of classical consensus—blocks can be “rolled back” if a
competing fork eventually overtakes them.
Conventionally, transactions are considered safe after ~6 confirmations (blocks
following them).
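A rough sketch of this fork-resolution rule (the dictionary-based blocks and the per-block
"work" field are illustrative; a real node derives a block's work from its difficulty target):

def chain_work(chain):
    """Total work of a chain: the sum of each block's work
    (roughly proportional to 2**256 / target per block)."""
    return sum(block["work"] for block in chain)

def choose_tip(candidate_chains):
    """Pick the fork with the most accumulated work."""
    return max(candidate_chains, key=chain_work)

fork_a = [{"work": 10}, {"work": 10}, {"work": 10}]
fork_b = [{"work": 10}, {"work": 10}]
print(choose_tip([fork_a, fork_b]) is fork_a)   # True: fork_a carries more work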
4.3 Causes of Forks
Simply increasing block size or reducing interval faces propagation delays, more
frequent forks, storage bloat, etc.
5.2 Energy Use
Bitcoin’s PoW consumes massive energy to maintain security:
o The hash power must be large to deter majority attacks.
6. Proof-of-Stake (PoS)
6.1 Concept
Alternative to PoW to save energy.
No more brute-force hashing; instead, “stakeholders” who hold coins (or have coin
“age”) are randomly selected to propose new blocks.
o The chance of being selected scales with how many coins you stake and
how long you’ve held them.
Lottery Mechanism:
o Each candidate block carries a timestamp, and the proposer wins if the block
hash is below target × stake factor, i.e., the difficulty threshold is scaled by
the proposer's stake (a sketch of this check follows the list).
o If you hold more stake, or have accumulated more coin-age, you have a
higher chance to produce a block.
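A rough sketch of such a lottery check, assuming a Peercoin-style rule in which the hash must
fall below the target scaled by the proposer's stake (or coin-age); the function signature
and header encoding are illustrative:

import hashlib

def eligible_to_propose(block_header: bytes, timestamp: int,
                        stake: int, target: int) -> bool:
    """The proposer wins the lottery if hash(header, timestamp) < target * stake."""
    digest = hashlib.sha256(block_header + timestamp.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < target * stake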
Advantages:
o Much lower energy consumption.
o Faster block times are feasible.
Disadvantages:
o More complex security analysis.
7. Permissioned Blockchains
7.1 Motivation
Blockchain can be a useful structure for storing tamper-evident logs, implementing
“smart contracts,” etc.
Not all use cases require open, permissionless participation like Bitcoin’s:
o Some businesses need private or consortium blockchains with controlled
membership and data confidentiality.
o “Hyperledger Fabric,” “Corda,” and similar platforms are examples.
7.2 Taxonomy
Public (Permissionless): Anyone can join, propose blocks, and read data.
Example: Bitcoin, Ethereum.
Consortium: Only known organization members can run block-producing nodes.
Reading might be open or restricted.
Private/Permissioned: A single organization or trusted set controls who can run
nodes and see data.
7.3 PBFT vs PoW
Practical Byzantine Fault Tolerance (PBFT) can maintain a replicated log (a
blockchain) among a small group (like 4–7 nodes).
PBFT has O(n²) message complexity, but low latency, high throughput, and finality.
Bitcoin’s PoW scales to thousands of nodes but has probabilistic finality, low
throughput (~7 tps), and very high energy cost.
For many enterprise or consortium scenarios, PBFT-like protocols or other
Byzantine consensus variants are more suitable than PoW.
Comparison:
Feature              PoW (e.g., Bitcoin)                    PBFT
Participants         Thousands of open, anonymous nodes     Small, known group (e.g., 4–7 nodes)
Message complexity   Gossip/broadcast dissemination         O(n²)
Throughput           Low (~7 tps)                           High
Finality             Probabilistic (~6 confirmations)       Deterministic, immediate
Energy cost          Very high (PoW hashing)                Low
1. Motivation
The Internet has millions of connected users, and newly launched services may
experience sudden bursts of popularity that overload centralized resources (the
“Slashdot effect”).
Many users have always-on broadband, so it’s tempting to offload some service
responsibilities to client nodes at the network edge.
o For example, Blizzard (World of Warcraft) uses peer-to-peer distribution for
patches/demos to reduce bandwidth load on central servers.
In theory, P2P can scale as more users adopt a service: more resources
(bandwidth, storage) become available.
o However, diminishing returns can set in: each additional node might add
complexity, overhead, or maintenance cost, limiting the net gain.
Goal: Analyze radio signals from the Arecibo telescope to detect possible
extraterrestrial transmissions.
Central server splits raw data into “work units” (“buckets”) by time/frequency and
distributes them to volunteer machines worldwide.
Volunteers run the analysis locally, then upload results back to a central server.
A central index server stored metadata (which user had which songs). Actual file
transfers occurred directly peer-to-peer.
Relied on a single, centralized directory for searching.
Weakness: The central server became a legal attack point for the music industry.
Many users behind firewalls: they can connect outbound to the index server, but not
necessarily accept inbound connections from other peers (the double firewall
problem).
5. Chord
5.1 Overview
Each node and key are assigned a unique ID in a circular ID space from 0 to 2^m−1
using a hash function (e.g., SHA-1).
Ring structure: each key “belongs” to its successor, i.e., the first node whose ID is
equal to or follows the key’s ID on the ring.
A node does not store the entire ring membership; it only stores O(log n) pointers:
o The successor pointer plus a “finger table” of log n entries, where entry i points
to the first node that succeeds the local node by at least 2^(i−1) in the ID space.
Lookup: Forward queries via the finger table, roughly halving the distance in ID space
each time. Route length is O(log n) hops.
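A small sketch of one Chord routing step: forward to the finger that most closely precedes
the key, which roughly halves the remaining distance in ID space (the helper names and the
toy ring are illustrative):

def between(x, a, b, m):
    """True if x lies strictly between a and b going clockwise on a 2**m ID ring."""
    a, b, x = a % 2**m, b % 2**m, x % 2**m
    return a < x < b if a < b else (x > a or x < b)

def closest_preceding_finger(node_id, fingers, key, m):
    """One routing step: the finger closest to, but still preceding, the key.
    fingers[i-1] holds the node pointed to by finger entry i (the first node
    at or after node_id + 2**(i-1))."""
    for finger in reversed(fingers):
        if between(finger, node_id, key, m):
            return finger
    return node_id   # no closer finger: the local node's successor owns the key

# Toy ring with m = 4 (IDs 0..15) and nodes {1, 3, 6, 9, 12}: node 1's fingers
# point to the successors of 2, 3, 5 and 9.
print(closest_preceding_finger(1, [3, 3, 6, 9], key=12, m=4))   # 9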
5.2 Joining/Leaving
Each node updates its successor/finger references after a join/leave to preserve
ring structure.
Protocol ensures queries keep working despite churn, though some stale
references can temporarily increase routing hops.
6. Kademlia
6.1 XOR Metric
IDs are 160-bit values for nodes and keys (e.g., SHA-1).
Distance: d(a, b) = a ⊕ b (bitwise XOR). This is a valid metric: d(a, b) = 0 iff a = b, it
is symmetric, and it satisfies the triangle inequality.
A node is responsible for keys “close” to its own ID under XOR.
6.2 Routing Tables
Each node organizes its routing table into buckets based on the shared prefix length
with its own ID.
o E.g., bucket i contains nodes whose IDs share the first i−1 bits with the local
node, differ at bit i, and can vary in the remaining bits.
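A small sketch of the XOR metric and the bucket choice it induces (function names and the
4-bit toy IDs are illustrative; real Kademlia IDs are 160 bits):

def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two node/key IDs."""
    return a ^ b

def bucket_index(own_id: int, other_id: int, id_bits: int = 160) -> int:
    """Bucket chosen by the shared-prefix length: the index (from the most
    significant bit) of the first bit where the two IDs differ."""
    distance = xor_distance(own_id, other_id)
    return id_bits - distance.bit_length()

print(xor_distance(0b1100, 0b1010))              # 6 (binary 0110)
print(bucket_index(0b1100, 0b1010, id_bits=4))   # 1: the IDs share only the first bit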