SDLE

CHAP 1 - MESSAGE-ORIENTED MIDDLEWARE (MOM)

1. Distributed Systems and Message-Based Communication


1.1 Definition of a Distributed System
A distributed system is a collection of processes located on different machines that
communicate by exchanging messages. According to Leslie Lamport, such a system is
“distributed if the message transmission delay is not negligible compared to the time
between events in a single process.”
1.2 What is a Message?
 A message is an atomic bit string whose format and meaning are defined by a
communication protocol.
 Communication between distributed processes is typically realized via a computer
network. Messages flow through network hosts, where each host runs applications
that can send and receive messages.
1.3 Internet Protocols and Transport Properties
 The Internet communication stack offers:
o Transport layer (TCP/UDP) for end-to-end communication.

o Network layer (IP) for node-to-node communication across networks.


o Link layer for direct communication between devices on the same network
segment.
 UDP vs. TCP key differences:
o UDP is message-based, not connection-oriented, can lose or duplicate
packets, does not guarantee order or flow control, and can support one-to-
many delivery.

o TCP is stream-based, connection-oriented, ensures reliable in-order


delivery, performs flow control, and typically supports one-to-one
connections.
o TCP “reliability” means that if the connection breaks, the sender/receiver is
notified (e.g., via an error code). However, it does not guarantee zero data
loss in all possible failures (like unplugged cables or network partitions).
TCP only guarantees it will try to retransmit within an open connection until
it concludes the peer is unreachable. Beyond that, the application must
handle further recovery or reconnections.
1.4 Remote Procedure Call (RPC)
 Idea: Make distributed communication resemble local procedure calls. Instead of
passing messages manually, the client calls a “remote” function, and the RPC
machinery handles marshalling/unmarshalling parameters, sending messages,
and waiting for responses.
 Typical RPC Architecture:
o Client application calls a client stub, which marshals request parameters
into a message.
o The message is transported to the server stub, which unmarshals the data
and calls the actual server function.

o Server returns data to the stub, which sends a response back to the client
stub.
o The client stub unmarshals the response and gives it back to the client
application.
 Limitations: RPC is typically well-suited for request-reply patterns, but it enforces
synchronous communication, which can lead to blocking and tight coupling
between client and server. Failures also complicate RPC semantics (e.g., lost
requests or duplicates).

2. Asynchronous Communication (Message-Oriented Middleware,


MOM)
2.1 Motivation and Concept
 In asynchronous communication, the sender and receiver do not need to be active
or synchronized at the same time. Messages are stored and forwarded by an
intermediary service (the “middleware”) until the receiver is ready.
 Analogy to the postal system (snail mail):
o A message is placed into a “mailbox” or queue.
o The system retains it until the recipient is ready to pick it up.
o Sender and receiver remain loosely coupled.

2.2 MOM Basic Patterns


1. Point-to-point (Queues)
o One or more producers put messages into a queue.
o One or more consumers retrieve messages from that queue.
o Each individual message is delivered to at most one consumer.
2. Publish-subscribe (Topics)

o One or more publishers send messages to a topic.


o One or more subscribers receive messages from that topic.
o Each message can be delivered to multiple subscribers.

2.3 Architecture of MOM


 Often involves communication servers (message brokers) that store messages if
the receiver is not available.
 Sender places the message in the middleware. The middleware ensures eventual
delivery, subject to its reliability guarantees.
2.4 Applications
 Ideal for loosely coupled, event-driven, or asynchronous workflows:
o Enterprise Application Integration (connecting different enterprise apps).

o Microservices architectures.
o Workflow systems.
o Email, SMS, instant messaging (in a broader sense).

3. Java Message Service (JMS)


3.1 Overview and Architecture
 JMS is an API that defines how Java applications can send/receive messages from
a MOM (messaging provider).

 It is part of the Java EE (now Jakarta EE) specification, though it is commonly


referred to by its original name (JMS).
 JMS defines:
o Connection: a link to the JMS provider.
o Session: a context for producing and consuming messages within a
connection.
o Destination: a queue or a topic.
o MessageProducer: used by a client to send messages.
o MessageConsumer: used by a client to receive messages.
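For concreteness, here is a minimal point-to-point sketch using these abstractions with the classic javax.jms API; the JNDI names and the provider configuration are assumptions for illustration, not part of the JMS specification.

```java
import javax.jms.*;
import javax.naming.InitialContext;

public class JmsQueueExample {
    public static void main(String[] args) throws Exception {
        // Provider-specific setup: these JNDI names are hypothetical.
        InitialContext jndi = new InitialContext();
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
        Queue queue = (Queue) jndi.lookup("jms/ExampleQueue");

        Connection connection = factory.createConnection();
        try {
            connection.start();
            // Non-transacted session with automatic acknowledgment.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

            // Producer: send a text message to the queue.
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage("hello"));

            // Consumer: receive the message (waits up to 1 second).
            MessageConsumer consumer = session.createConsumer(queue);
            TextMessage received = (TextMessage) consumer.receive(1000);
            System.out.println(received != null ? received.getText() : "no message");
        } finally {
            connection.close();
        }
    }
}
```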

3.2 JMS Messages


Each JMS message has three parts:
1. Header: standard fields for identification and routing (e.g., JMSMessageID,
JMSDeliveryMode, JMSExpiration, JMSRedelivered, etc.).

2. Properties: extension mechanism (key-value pairs) for metadata defined by the


application (e.g., filtering or custom routing).
3. Body: the actual payload. JMS supports typed message bodies (e.g., TextMessage,
ObjectMessage, BytesMessage, etc.).
3.3 JMS Queues
 Point-to-point model:

o Multiple producers can send messages to the same queue.


o Multiple consumers can retrieve messages, but each message goes to only
one consumer.
 Persistence and Reliability:

o PERSISTENT mode: the message is stored by the JMS provider on non-


volatile storage; aims for once-and-only-once semantics (though
duplicates can still appear under certain failure conditions).
o NON_PERSISTENT mode: the message does not survive provider crashes;
leads to at-most-once delivery semantics but can be faster.
3.3.1 Acknowledgment Modes

 AUTO_ACKNOWLEDGE: automatic acknowledgment after a successful receive() or
message listener callback.
 DUPS_OK_ACKNOWLEDGE: allows lazy acknowledgment; duplicates are
possible.
 CLIENT_ACKNOWLEDGE: the application explicitly calls Message.acknowledge();
acknowledges all messages in that session up to that point.
 Transacted Sessions (SESSION_TRANSACTED): all sends/receives in a session
are part of a transaction that is committed or rolled back as a whole.
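A hedged sketch of CLIENT_ACKNOWLEDGE in use; the connection and queue are assumed to be obtained as in the previous example.

```java
import javax.jms.*;

public class ClientAckExample {
    // Sketch only: connection and queue are assumed to be set up as in the previous example.
    static void consumeWithClientAck(Connection connection, Queue queue) throws JMSException {
        Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(queue);

        Message message = consumer.receive(1000);   // wait up to 1 s
        if (message != null) {
            // ... application-level processing would happen here ...
            // acknowledge() confirms this message and every previously received,
            // unacknowledged message in the same session.
            message.acknowledge();
        }
        session.close();
    }
}
```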
3.3.2 Delivery Guarantees
 JMS uses acknowledgments and persistent storage to manage message reliability.
 On crash recovery, a message could be redelivered if the provider is uncertain
whether it was acknowledged. Therefore, the consumer may see JMSRedelivered
set to true.

3.4 JMS Topics


 Publish-subscribe model:
o Messages published by producers are delivered to multiple subscribers.
o Topics are long-lived, configured by an administrator.
3.4.1 Subscriptions
 Durable subscription: persists even if the subscriber is offline. Messages are
retained until they are consumed (once the subscriber reconnects).
 Non-durable subscription: only active while the consumer is connected; any
messages sent while offline are lost.
 Subscriptions can be shared (multiple consumers share one subscription, each
message delivered to only one of them) or unshared (only one consumer at a time).
3.4.2 Reliability with Topics
 Similar to queues: PERSISTENT vs. NON_PERSISTENT.

 Durable subscriptions combined with PERSISTENT delivery can achieve once-and-


only-once semantics (with the same caveat about possible redelivery on failure).
3.5 Ordering Guarantees
 Within a single session, messages sent are delivered in the order they were
produced—provided they have the same delivery mode (priority and other factors
can affect actual delivery order).

 Across different sessions or when multiple consumers share destinations, order is


not globally guaranteed.
3.6 JMS Limitations
 JMS is only an API. It does not specify an on-the-wire protocol, nor does it directly
address:
o Interoperability between different JMS providers (they might speak different
internal protocols).

o Fault tolerance or load balancing across providers.


o Security configurations or administrative features.

4. Message Queuing Protocols


Several open or proprietary protocols exist for message-oriented systems:
 AMQP (Advanced Message Queuing Protocol): an open-standard protocol
approved by OASIS and later ISO/IEC.
 MQTT (originally “MQ Telemetry Transport”): OASIS standard,
designed for lightweight, low-bandwidth scenarios (commonly used in IoT).
 OpenWire: the native protocol of Apache ActiveMQ, which also supports AMQP,
MQTT, and others.
Interoperability is much easier when communicating parties agree on a common message
protocol like AMQP or MQTT, rather than relying on JMS alone (since JMS only describes the
Java API, not the actual wire format).

5. Architecture and Message Brokers


5.1 Multi-hop Routing
 Larger deployments often use multiple message relays (routers or brokers).
Messages might pass through several servers before reaching their destination.
This helps in:
o Scaling out across multiple data centers.

o Applying routing logic or transformations.


5.2 Message Brokers
 Message brokers can do more than store and forward:

o They can transform or enrich messages, reconciling different formats when


integrating heterogeneous systems (Enterprise Application Integration).
o Brokers consult a rules repository or use code to convert message formats
between applications.

CHAP 2 - REPLICATION AND CONSISTENCY MODELS

1. Data Replication
1.1 Motivation
 Data replication means maintaining multiple copies of the same data on different
nodes in a distributed system.
 Benefits:

1. Performance: Faster, local reads since each replica can serve read
requests.
2. Reliability: A single replica failure does not necessarily lead to data loss—
data is lost only if all replicas lose the data.
3. Availability: Data remains accessible as long as at least one replica is up
and reachable.

4. Scalability: Read load can be distributed across different replicas.


1.2 The Challenge of Updates
 When data is updated, replicas can temporarily diverge (hold different values).
 To maintain consistency, replicas need to run some replication protocol that
disseminates updates to all replicas.
 This becomes more complex if there are concurrent updates or partial failures.

2. Conflicts in Updates
 If multiple replicas are updated in different orders or with different writes, the
replicated data can become inconsistent.
 For example, if each replica starts at value 5 and:
1. Replica 1 applies an update “+3”
2. Replica 2 applies an update “+7”

Depending on the order or synchronization strategy, replicas may end up with different final
values, e.g., one might compute 5 + 3 = 8, then 8 + 7 = 15, whereas another might compute
5 + 7 = 12, then 12 + 3 = 15 or never apply one of the updates at all. Ensuring they end up in
a consistent state is part of the replication protocol challenge.
3. Strong Consistency
 A simple strategy for strong consistency is for all replicas to execute all updates in
the same global order, and each update must be deterministic to yield the same
final state everywhere.
 Strong consistency aims to provide an abstraction as if there is a single copy of the
data. Updates happen atomically and in some consistent global sequence.
3.1 Consistency Models
Three notable strong consistency models are:
1. Sequential Consistency
2. Serializability
3. Linearizability

They each impose increasingly strict requirements on how operations appear to execute in
the system.

4. Sequential Consistency
Definition (Lamport 1979):
An execution is sequentially consistent if it can be seen as a legal sequential (one-after-
another) ordering of all operations such that:
 Each process’s operations appear in that order in the global sequence (i.e., per-
process order is preserved).
This matches the intuition of how operations would interleave in a single-core
(uniprocessor) system with multiple threads.
4.1 Example of a Violation

 Suppose there are two variables, x and y, each with initial values (x=2, y=3), and two
replicas performing:
o Replica 1: x = y + 2
o Replica 2: y = x + 3
 If these happen concurrently, one final outcome might be (x=5, y=5). However, that
final state can contradict any single sequential ordering where x and y both start as
(2,3). A purely sequential update would not yield (5,5) from that initial state if the
operations strictly happen one after the other.
4.2 Non-Composability
 Even if each individual data structure (e.g., each array or each replicated object) is
sequentially consistent, combining them does not guarantee the entire system is
sequentially consistent.
 Example: Two separate arrays (u and v), each replicated under a sequentially
consistent protocol, can still produce a global interleaving that is not sequentially
consistent when operations on u and v are merged.
5. Linearizability
Definition (Herlihy and Wing 1990):
An execution is linearizable if:

1. It is sequentially consistent.
2. If operation A completes before operation B begins (according to a global observer’s
real-time clock), then A precedes B in the global sequential order.
5.1 Observations

 Linearizability introduces a “real-time” constraint: if one operation finishes before


another starts, they must appear in that order.
 Overlapping operations (where one’s finish time is after the other’s start time) can
appear in either order.
 Synchronization among replicas is typically required to implement linearizability,
often incurring additional communication and performance overhead.

5.2 Composability
 Linearizability is composable: if two data structures are each linearizable, then
the combination of both maintains linearizability for the entire system.
 This is in contrast to the non-composability of sequential consistency.

6. One-Copy Serializability (Transactions)


Definition: An execution of a set of transactions is one-copy serializable if its outcome is
the same as if all those transactions were executed serially on a single copy of the data.
6.1 Relationship to Strong Consistency Models
 Serializability is a well-known model in database transactions to preserve complex
invariants (ACID properties).
 If each operation is within a transaction, then serializability effectively behaves
similarly to sequential consistency, but for transactions rather than low-level
reads/writes.
 Modern databases often relax serializability for performance reasons, offering
weaker isolation levels.

7. Weak Consistency
 Strong consistency (e.g., linearizability, serializability) simplifies application logic
because it mimics a single-copy or single-processor illusion.
 However, strong guarantees often come with significant synchronization costs,
impacting:
o Scalability (performance under high load or many replicas).
o Availability (ability to withstand network partitions or replica failures
without halting writes/reads).
 Weak consistency models aim to:
1. Improve scalability and availability.
2. Provide some useful guarantees for applications that can tolerate more relaxed
consistency requirements (e.g., eventual consistency, causal consistency, etc.).
 Typically, these weaker models depend heavily on the application domain. Some
applications do not need strict global ordering of updates if certain invariants are
preserved.

CHAP 3 - REPLICATION FOR FAULT TOLERANCE - QUORUM


CONSENSUS

1. Introduction
1.1 Quorums and Quorum Consensus Replication

 Quorum-based replication requires that each operation on a replicated object


(such as a read or write) collects acknowledgments (or votes) from a quorum of
replicas.
 Fundamental Property: If the output of operation B depends on the result of
operation A, then the quorums chosen for A and B must intersect (i.e., share at least
one common replica).
o In practice, we enforce conditions on quorum sizes to guarantee
consistency.

2. Quorum Consensus Protocols


2.1 Peer Replication and Simple Quorums
 A simple configuration treats all replicas as peers. Each replica holds exactly one
“vote.”
 Define two parameters:

o N: total number of replicas,

o Nr: size of the quorum for read operations,

o Nw: size of the quorum for write operations.

 Key Overlap Condition: Nr + Nw > N.


o Ensures that every read quorum intersects with every write quorum so that
a read sees the most recent committed write.

2.2 Version Numbers


 Each replica of an object maintains a version number.

 Read:

1. The client polls Nr replicas to obtain their version numbers (and possibly the
data).
2. The client uses the highest version or determines the up-to-date replica
from which it reads the value.
 Write:

1. The client polls Nw replicas to find the highest version number.

2. The client writes the new value (with an incremented version number) to Nw
replicas.
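The following single-process sketch illustrates this version-number protocol; the replicas are plain in-memory objects and pollQuorum stands in for contacting remote nodes, so all names here are illustrative rather than part of any real system.

```java
import java.util.*;

public class QuorumKvSketch {
    // One replica: a stored value plus its version number.
    static class Replica { long version = 0; String value = null; }

    private final List<Replica> replicas = new ArrayList<>();
    private final int nr, nw;   // read and write quorum sizes

    QuorumKvSketch(int n, int nr, int nw) {
        if (nr + nw <= n) throw new IllegalArgumentException("quorums must overlap: Nr + Nw > N");
        for (int i = 0; i < n; i++) replicas.add(new Replica());
        this.nr = nr;
        this.nw = nw;
    }

    // Read: poll Nr replicas and return the value carried by the highest version.
    String read() {
        return pollQuorum(nr).stream()
                .max(Comparator.comparingLong((Replica r) -> r.version))
                .map(r -> r.value)
                .orElse(null);
    }

    // Write: poll Nw replicas for the highest version, then store the new value
    // with version + 1 at Nw replicas.
    void write(String value) {
        List<Replica> quorum = pollQuorum(nw);
        long highest = quorum.stream().mapToLong(r -> r.version).max().orElse(0);
        for (Replica r : quorum) {
            r.version = highest + 1;
            r.value = value;
        }
    }

    // Stand-in for contacting a quorum of remote replicas: here, just the first k local ones.
    private List<Replica> pollQuorum(int k) {
        return replicas.subList(0, k);
    }
}
```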

3. Faults and Naïve Implementations


3.1 Network Partitions (Example with N=3, Nr=2, Nw=2)
 If a partition prevents a write quorum from forming properly, different clients may
each successfully update some but not all replicas, resulting in:

o Divergent values for the same version number.


o Clients subsequently reading can get inconsistent data (i.e., each sees a
different final value).
 This naïve approach does not preserve “read-your-writes” or converge properly
under partitions.

3.2 Concurrent Writes (Also N=3, Nr=2, Nw=2)


 Two concurrent clients might each gather different write quorums with partial
overlap, updating different subsets of replicas.
 Final result: some replicas have one value, others have a different value—again a
loss of consistency.

4. Ensuring Consistency with Transactions


4.1 Transactions and Two-Phase Commit
 Gifford’s original approach assumes each read or write quorum operation is part
of a distributed transaction.
o The transaction uses two-phase commit (2PC) or a similar protocol.

o If not enough replicas (i.e., < Nw) can acknowledge the write, the
transaction aborts.
o Upon commit, all replicas in the write quorum finalize the new version.
Example:

 With N=3, Nw=2:


o If a network partition prevents a client from writing to at least two replicas,
the transaction aborts, so the system avoids partial updates.
4.2 Concurrency Control
 Isolation is enforced via concurrency-control (often lock-based or version-based):
o Prevents two concurrent write transactions from committing incompatible
updates.
 Deadlock can arise if two transactions try to lock the same replicas in conflicting
orders. One transaction is ultimately aborted.
4.3 Benefits and Drawbacks
 Transactions address both fault tolerance (e.g., partial failures) and concurrency.
 They can also handle multi-object or multi-operation scenarios.
 However:
o 2PC can block if the coordinator fails at a critical moment (classic blocking
problem).
o Deadlocks or aborts may occur.
o Solutions often involve more sophisticated concurrency and replication
techniques.

5. Playing with Quorums


5.1 Trade-Offs

By choosing different values of Nr and Nw, one can prioritize:


 Write performance vs. Read performance (e.g., “read-one/write-all” or “read-
all/write-one”).
 Availability: Higher quorum sizes can reduce the chance of split-brain conflicts but
make it harder to achieve a quorum if some replicas fail.
5.2 Weighted Voting
 Instead of each replica having one vote, weighted voting assigns different vote
counts to replicas.
 This can reflect reliability differences, geographical location, or performance
constraints (e.g., a powerful replica might get more votes).

 Example: A replica may be given 0 votes, so that it stores a copy without
participating in quorum counts. This can improve overall performance while still
providing a local backup.
6. Fault Tolerance Analysis
 Quorum consensus is robust to a certain number of crash failures or network
partitions.

 Standard question: For a maximum of f simultaneous crash failures among


replicas, how large must N be, and how should Nr and Nw be chosen?

o Typically, we see constraints like Nw + Nr > N and Nw > f.

o The exact formula depends on whether one wants to tolerate Byzantine


failures vs. crash-only failures.

7. Dynamo Quorums
7.1 Dynamo Overview
 Dynamo is Amazon’s key-value store designed for high availability.
 It uses a put(key, value) / get(key) API (an “associative memory” instead of a single
read/write memory cell).
 Replicas maintain version vectors instead of simple version numbers.
 Multi-versioning (storing conflicting updates in parallel) is used to avoid blocking
under network partitions. Clients eventually reconcile conflicting versions.
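Conflict detection with version vectors boils down to a dominance check between two vectors; a minimal sketch (replica IDs as strings, purely illustrative):

```java
import java.util.*;

public class VersionVectors {
    // Returns true if vector a dominates (descends from) vector b:
    // every counter in b is <= the corresponding counter in a.
    static boolean dominates(Map<String, Integer> a, Map<String, Integer> b) {
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            if (a.getOrDefault(e.getKey(), 0) < e.getValue()) return false;
        }
        return true;
    }

    // Two versions are concurrent (conflicting) if neither dominates the other;
    // Dynamo keeps both and asks the client to reconcile them on a later get().
    static boolean concurrent(Map<String, Integer> a, Map<String, Integer> b) {
        return !dominates(a, b) && !dominates(b, a);
    }

    public static void main(String[] args) {
        Map<String, Integer> v1 = Map.of("A", 2, "B", 1);
        Map<String, Integer> v2 = Map.of("A", 1, "B", 2);
        System.out.println(concurrent(v1, v2));   // true: neither descends from the other
    }
}
```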

7.2 Quorum Definitions in Dynamo


 Each key has a preference list of primary replicas (the first N nodes on the ring). If
some of these nodes are down or unreachable, Dynamo can expand to backup replicas.
 Parameters:

o N: replication factor for each key.

o W: minimum number of replicas that must acknowledge a put().

o R: minimum number of replicas from which a get() must collect responses.

o Overlap condition: R + W > N for strong consistency when no failures


occur.
7.3 Sloppy Quorums and Hinted Handoff
 When a node in the preference list is unavailable, Dynamo uses sloppy quorums:
it writes to other “backup” replicas to still gather W acknowledgments.
 The backup node stores a hint (the identity of the intended node) and periodically
attempts to hand the data off once the primary replica recovers.
 Trade-Off: Sloppy quorums maximize availability but can violate the strict overlap
property (since read and write quorums might not actually intersect in the presence
of failures).
 Dynamo then relies on anti-entropy mechanisms for eventual synchronization
across replicas.
CHAP 4 - REPLICATION FOR FAULT TOLERANCE - QUORUM-
CONSENSUS REPLICATED ADTS

1. Quorum-Consensus and Replicated Abstract Data Types


1.1 Herlihy’s Extension to Quorum Consensus
 Maurice Herlihy proposed a generalization of quorum consensus for replicated
abstract data types (ADTs) such as queues, dictionaries, etc.
 Key Concept: Each operation requires:
1. An initial quorum of replicas to gather information needed to execute the
operation (e.g., the latest state or relevant entries).
2. A final quorum of replicas to propagate the new state or log entries after
performing the operation.
1.2 Initial and Final Quorums
 In read/write systems, “initial quorum” might be used to read the current version
(or log) of the data, while “final quorum” is where the new version (or new log
entries) is written.
 For read-only operations, the final quorum can be empty—no new data is stored.

 A quorum for an operation is any set of replicas that covers both the operation’s
required initial and final quorums.
 To ensure consistency, certain intersection constraints must hold between the
final quorum of one operation and the initial quorum of another (e.g., a write’s final
quorum must intersect a subsequent read’s initial quorum).

2. Example: Initial/Final Quorums for Read/Write Operations


 When an object supports only read and write, we have:
1. Every write final quorum must intersect every read initial quorum.
2. Every write final quorum must also intersect every write initial quorum (so
that version numbers or timestamps update consistently).
 These constraints can be visualized with a quorum intersection graph:

(write) → (read)
where an edge indicates that the final quorum of the first must intersect the initial quorum
of the second.
 Example (5 replicas):

o Possible minimal quorums for read might be (1,0), (2,0), (3,0), meaning:
read from 1 (or 2, or 3) replicas initially, and no final writes required.

o Possible minimal quorums for write might be (1,5), (2,4), (3,3), etc.
3. Replicated Queue as an Example of a Replicated ADT
3.1 Building a Queue via Read/Write Quorums
 A queue supports:

o Enq(x): enqueue item x.

o Deq(): dequeue the least recently enqueued item; raises an exception if


empty.
 Implementation sketch using read/write quorums:
1. Enq:

 Read from an initial quorum (to find the version or timestamp if


necessary).

 Modify the queue (enqueue x).


 Write the updated state to a final quorum.

2. Deq (normal):
 Read from an initial quorum to obtain the current state/version.
 Remove the least recent item if any.
 Write updated state (or version) to a final quorum.
3. Deq (abnormal, empty):
 Read to discover the queue is empty; no final write needed.

 By mapping these operations onto read/write quorum constraints, one obtains


minimal or valid quorum sizes for each operation. Sometimes Enq has different
quorum sizes than Deq.
3.2 Herlihy’s ADT Approach vs. Simple Read/Write
 Herlihy introduced:
1. Timestamps (instead of version numbers) that are totally ordered across
clients.
2. Logs (rather than replicated state) to record the history of operations,
merging these logs to recover the ADT’s state on demand.
 This approach can reduce the intersection constraints or the number of messages.

4. Replicated Read/Write Objects with Timestamps


 Read:
o A client queries an initial quorum to find the highest timestamp (or the best
local copy).
 Write:

o No need for a separate “initial round” to get a new version number, because
each client can generate a globally unique, totally ordered timestamp (e.g.,
by using a transaction manager or a hierarchical timestamp generator).
o The client only needs to write the new state/timestamp to a final quorum.
 Quorum Intersection Graph for 5 replicas can become simpler:
o Reads might only require an initial quorum, and writes only a final quorum,
with the edge constraint ensuring overlap when needed.

5. Replicated Event Logs vs. Replicated State


 Key Idea: Instead of storing the entire state, replicate an event log of timestamped
operations (including their results).

o An event can be something like [Enq(x), Ok()] or [Deq(), Ok(x)].


 Merging logs across replicas effectively reconstructs the queue or ADT’s current
state.
 This approach is particularly flexible but can lead to growing logs that require
garbage collection.

6. Herlihy’s Replicated Queue Implementation


6.1 Dequeue (Deq) Operation
1. Read logs from an initial Deq quorum; merge them (by timestamp) to produce a
view of all known events.
2. Reconstruct the queue’s state by replaying these events in timestamp order.
3. If the queue is non-empty, remove the oldest item and append a new [Deq(),
Ok(item)] event to the view.
4. Write the updated view (or the new event plus possibly updated metadata) to a final
Deq quorum.

5. Return the dequeued item (or raise an exception if empty).
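A compact sketch of this Deq path, under the assumption that timestamps are globally unique and totally ordered (as Herlihy's scheme requires); the quorum communication itself is not modeled, and all names are illustrative.

```java
import java.util.*;

public class ReplicatedQueueSketch {
    // One log entry: a timestamp plus the operation and its result,
    // e.g. [ts, "Enq", x] or [ts, "Deq", x].
    record Event(long timestamp, String op, String item) {}

    // Merge the logs read from an initial quorum into one timestamp-ordered view.
    static List<Event> mergeLogs(List<List<Event>> logsFromQuorum) {
        TreeMap<Long, Event> merged = new TreeMap<>();
        for (List<Event> log : logsFromQuorum)
            for (Event e : log) merged.put(e.timestamp(), e);
        return new ArrayList<>(merged.values());
    }

    // Replay the merged view to reconstruct the queue's current contents.
    static Deque<String> replay(List<Event> view) {
        Deque<String> queue = new ArrayDeque<>();
        for (Event e : view) {
            if (e.op().equals("Enq")) queue.addLast(e.item());
            else if (e.op().equals("Deq")) queue.pollFirst();
        }
        return queue;
    }

    // Deq: merge, replay, pick the oldest item, and produce the new event that
    // must then be written to a final quorum (not modeled here).
    static Optional<Event> deq(List<List<Event>> logsFromQuorum, long newTimestamp) {
        List<Event> view = mergeLogs(logsFromQuorum);
        Deque<String> state = replay(view);
        if (state.isEmpty()) return Optional.empty();   // abnormal case: no final write needed
        return Optional.of(new Event(newTimestamp, "Deq", state.peekFirst()));
    }
}
```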


6.2 Optimizations
 Garbage Collection of logs:
o Keep track of a horizon timestamp—the timestamp of the most recently
dequeued item.

o Discard all enqueued items (and corresponding log entries) older than that
horizon.
 Client Caching: Clients can cache logs to reduce the overhead of reading entire
logs from the replicas each time.
6.3 Intersection Constraints

 For Enq, the initial quorum can be empty, because the operation does not depend
on prior results to return a value to the caller.
 For Deq, both the initial quorum (to see up-to-date logs/items) and the final quorum
(to append a Deq event) are needed.
 Different operations thus have different quorum intersection rules to guarantee
linearizability.

7. Critical Evaluation
7.1 Timestamps and Linearizability
 Herlihy’s approach requires globally ordered timestamps consistent with real-
time (linearizability).
 In practice, this often means integrating with a transaction manager or a protocol
for distributing timestamps (e.g., hierarchical timestamps or 2PC-based commit to
decide final ordering).
 Without transactions, assigning these timestamps correctly (especially if the initial
quorum is empty) is tricky.
7.2 Log Garbage Collection
 Event-based replication can lead to unbounded growth if logs are never pruned.
 Herlihy shows how certain ADTs (like a queue) allow safe removal of stale entries
(e.g., after items are dequeued).
 Other ADTs may have more complex garbage-collection logic or find it less
effective.
7.3 Relationship to CRDTs (CvRDTs)

 Like Conflict-free Replicated Data Types (CRDTs), Herlihy’s replicated ADTs rely
on merges of concurrent updates to keep replicas consistent.
 However, CRDTs typically provide eventual consistency rather than strong
(linearizable) consistency, and do not require a quorum for each operation.
 Herlihy’s approach aims for strong linearizability but imposes more
synchronization (quorum intersection).

7.4 Other Observations


 Quorum-based approaches often appear in simple data stores, but Herlihy’s
method generalizes it for more complex operations and data types.
 Paxos-based state machine replication can be viewed as a majority-quorum
approach, illustrating how quorums are central to many replication strategies.
 Quorums need not be uniform or static: there can be square grids or dynamic
membership, though changing quorum configurations carefully without breaking
consistency is non-trivial.
CHAP 5 - CRDTS: STATE-BASED APPROACHES TO HIGH
AVAILABILITY

1. CAP Trade-Off and Motivation for CRDTs


 CAP Theorem: In a distributed system, you cannot simultaneously guarantee
(C)onsistency, (A)vailability, and tolerance to (P)artitions. At most two of these can
be fully satisfied.
 High Availability + Partition Tolerance under CAP typically leads to weaker forms
of consistency such as Eventual Consistency (EC):

o After an update, if no new updates occur, eventually all replicas converge


to the same final state.
o Example: DNS, where changes propagate but might be stale for a while.
 Geo-Replication: Systems spanning data centers around the globe incur
significant latencies (Λ on the order of 100–300 ms between continents). Achieving
strong consistency (e.g., via Paxos) across distant replicas can add that latency to
every client operation. Alternatively, multi-master replication can keep local write
latency low (λ), but allows temporary divergence.

 CRDTs (Conflict-Free Replicated Data Types):


o Designed to ensure replicas converge automatically after concurrent
updates, favoring an AP (Availability + Partition tolerance) approach under
CAP.
o Provide eventual consistency with well-defined merging rules to resolve
concurrent updates deterministically.

2. Operations vs. State in CRDTs


Two major CRDT models:
1. Operation-based CRDTs:

o Each update operation is broadcast to other replicas.


o The effect of operations must commute, ensuring no conflicts if messages
arrive out of order or concurrently.
2. State-based CRDTs:

o Each replica holds a local state and merges states from other replicas.
o The state is organized as a join-semilattice, where the merge operation is
the least upper bound (lub or “join”) of two partial states.
o Updates are monotonic: applying an update never discards information, so
replicas can safely converge via merges.
Both approaches can simulate one another, but state-based CRDTs are often more
common in practice for large or unreliable networks. They are more tolerant of message
losses, reordering, and duplicates.

3. State-Based CRDTs
3.1 Join-Semilattice Basics
A join-semilattice is defined as:

 A set S with a partial order ≤.

 A join operator ⊔ that computes the least upper bound (lub) of any two elements.
It satisfies:

1. Idempotence: a ⊔ a = a.

2. Commutativity: a ⊔ b = b ⊔ a.

3. Associativity: (a ⊔ b) ⊔ c =a ⊔ (b ⊔ c).

Typically, there is a bottom (⊥) which is the initial state. Each update moves the CRDT’s
state monotonically “upwards” in the partial order (x ≤ m(x)).
3.2 Interpreting State as a Log
 Conceptually, one can think of a distributed partial-order log (polog) of all
operations. Each replica maintains a growing local subset of the log.

 State-based CRDT implementations store a compact form of these logs (e.g.,


version vectors, sets, counters) rather than the entire operation list.

4. Examples of State-Based CRDTs


4.1 Counters
1. G-Counter (Grow-Only Counter)

o State is a map from replica-IDs to integer counts, e.g. m[i] tracks


increments at replica i.

o ⊥ is an empty map (zero for all IDs).

o inc_i increments m[i] by 1.

o Merge ⊔ is pointwise max over all IDs.


o The value is the sum of all mapped integers.

2. PN-Counter (Positive-Negative Counter)


o Tracks increments and decrements separately, each as a G-Counter.
o The final count is ∑ incs − ∑ decs.
o Merge is done by merging both G-Counter components pointwise.
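A minimal state-based G-Counter sketch (replica IDs as strings; names are illustrative):

```java
import java.util.*;

public class GCounter {
    // State: map from replica ID to the number of increments performed at that replica.
    private final Map<String, Long> counts = new HashMap<>();
    private final String myId;

    GCounter(String myId) { this.myId = myId; }

    // inc_i: increment this replica's own entry.
    void increment() {
        counts.merge(myId, 1L, Long::sum);
    }

    // Value: sum of all entries.
    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Join (⊔): pointwise maximum over all replica IDs; idempotent, commutative
    // and associative, so replicas converge regardless of message ordering,
    // duplication, or loss.
    void merge(GCounter other) {
        other.counts.forEach((id, n) -> counts.merge(id, n, Long::max));
    }
}
```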
4.2 Grow-Only Set (G-Set)

 State is a set of elements; ⊥={}.


 add(e) inserts e.
 Merge is set union.
 No removal is possible; once added, elements remain indefinitely.
4.3 Two-Phase Set (2P-Set)

 State = (S, T): a set of “added” elements S, and a set of “removed” elements T.

 Merge is pairwise union: (S ∪ S′, T ∪ T′).

 An element is in the “visible set” if it is in S but not in T.


 Limitation: after an element is removed, it cannot be re-added.
4.4 Add-Wins Sets
 Allow re-adding elements after removal by attaching unique tags to each addition.
 The “remove” operation removes all tags associated with an element, but a later
“add” with a newer tag supersedes older removes.
 Merge operation carefully resolves tags so that concurrent “add” vs. “remove”
conflicts yield an add-wins or remove-wins outcome, depending on design.
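One possible add-wins design uses unique tags plus tombstones; the sketch below is only one of several variants and keeps tombstones forever (see the metadata-growth discussion later), with all names illustrative.

```java
import java.util.*;

public class AddWinsSetSketch<E> {
    // Each addition carries a globally unique tag; removals record the tags they observed.
    private final Map<E, Set<UUID>> added = new HashMap<>();   // element -> tags added so far
    private final Set<UUID> removed = new HashSet<>();         // tombstoned tags

    void add(E element) {
        // A fresh tag makes this add "win" over any remove that did not observe it.
        added.computeIfAbsent(element, e -> new HashSet<>()).add(UUID.randomUUID());
    }

    void remove(E element) {
        // Only tags observed locally are tombstoned; a concurrent add's tag survives.
        removed.addAll(added.getOrDefault(element, Set.of()));
    }

    boolean contains(E element) {
        return added.getOrDefault(element, Set.of()).stream()
                    .anyMatch(tag -> !removed.contains(tag));
    }

    // Merge (join): union of both components; tags only grow, so this is monotonic.
    void merge(AddWinsSetSketch<E> other) {
        other.added.forEach((e, tags) ->
                added.computeIfAbsent(e, k -> new HashSet<>()).addAll(tags));
        removed.addAll(other.removed);
    }
}
```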
4.5 Using Causal Histories (Dots)

 Unique tags (dots) can be structured as (replicaID, counter).


 Merging sets or counters involves taking a pointwise max or union of these dot sets.

5. Delta State-Based CRDTs


 While traditional state-based CRDTs send the full local state on each
synchronization, delta-based CRDTs optimize by only shipping the delta (the small
incremental state change).
 A delta-mutator returns a delta-state that is then merged locally and can be sent
to other replicas.
o Formally, for each mutator m there is a corresponding delta-mutator m^δ such
that:

m(X) = X ⊔ m^δ(X)

 This reduces bandwidth usage and speeds up convergence when states get large.
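For example, a delta-mutator for a G-Counter can return just the updated entry instead of the whole map; a sketch under the assumption that deltas are merged with the same join as full states:

```java
import java.util.*;

public class DeltaGCounterSketch {
    private final Map<String, Long> counts = new HashMap<>();
    private final String myId;

    DeltaGCounterSketch(String myId) { this.myId = myId; }

    // Delta-mutator: apply the increment locally and return only the changed
    // entry, so that m(X) = X ⊔ m^δ(X) holds when the delta is merged back in.
    Map<String, Long> incrementDelta() {
        long updated = counts.merge(myId, 1L, Long::sum);
        return Map.of(myId, updated);
    }

    // Join: identical for full states and deltas (pointwise max).
    void merge(Map<String, Long> stateOrDelta) {
        stateOrDelta.forEach((id, n) -> counts.merge(id, n, Long::max));
    }

    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }
}
```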
6. Pros and Cons of State-Based CRDTs
6.1 Advantages
1. Permissive Communication: They tolerate dropped, reordered, or duplicated
messages.
2. No Rigid Reliance on Causality Messages: Local state merges handle out-of-
order arrivals naturally.
3. Operation Rate Decoupled from Transmission: A replica can perform many local
updates and merge them later.
4. Built-In Convergence: Idempotent, commutative merges ensure eventual
consistency.
6.2 Disadvantages
1. Growing Metadata: CRDT states can accumulate “tombstones” or historical tags
(e.g., version vectors or removed-element tags).

2. Full-State Transmission: A naive approach may send the entire state each time
replicas synchronize, which can be large. (Delta-based CRDTs mitigate this.)
3. Application-Specific Garbage Collection: Some CRDTs (like sets) allow indefinite
growth, so removing obsolete metadata can be non-trivial.

CHAP 6 - SCALABLE DISTRIBUTED TOPOLOGIES

1. Graph Fundamentals
A graph G(V,E) is defined by:

 A set of vertices V.

 A set of edges E connecting pairs of vertices.


1.1 Types of Graphs
 Directed/Undirected: In an undirected graph, edges connect vertices in both
directions.
 Simple Graph: Undirected, no loops, and at most one edge between any pair of
vertices.

 Weighted Graph: Each edge has an associated weight (cost, distance, etc.).
 Path: A sequence of vertices such that consecutive vertices are connected by
edges.
1.2 Walk, Trail, Path

 Walk: A sequence of vertices where edges may repeat, and vertices may repeat.
 Trail: A walk where edges are not repeated (but vertices may still repeat).
 Path: A walk with no repeated vertices or edges.

1.3 More Topologies


 Complete Graph: Every pair of distinct vertices is connected by an edge. A “clique”
is a sub-graph with that property.
 Connected Graph: There is a path between any two vertices.

 Star: One central vertex connected to all other “leaf” vertices.


 Tree: A connected graph with no cycles.
 Planar Graph: Can be drawn in a plane with no edges crossing (e.g., a ring or a tree).
Having cycles in a network context can improve robustness (redundant paths) but
sometimes complicates data handling. Many distributed algorithms use tree structures to
avoid the complexity of cycles, though others are tolerant of multi-path networks.
1.4 Additional Definitions
 Connected Component: A maximal connected subgraph.

 Degree of a vertex v: The number of adjacent neighbours. In a directed graph, we
distinguish in-degree and out-degree.

 Distance d(vi, vj): The length of the shortest path between vi and vj.

 Eccentricity: ecc(vi) = max{d(vi, vj) | vj ∈ V}.

 Diameter: D = max{ecc(vi) | vi ∈ V}.

 Radius: R = min{ecc(vi) | vi ∈ V}.

 Center: The set of vertices with eccentricity R.

 Periphery: The set of vertices with eccentricity D.


1.5 More Complex Randomized Topologies
 Random Geometric Graph: Vertices placed randomly (uniformly) in a unit square;
edges connect vertices within a specified Euclidean distance.
 Erdős–Rényi (Random) Graph G(n, p): A graph of n vertices where each possible
edge is included with probability p, independently. Often yields small diameters
(O(log n)) but lacks clustering.
 Watts–Strogatz: A “small-world” model that starts with a ring/lattice and then
randomly rewires some edges, achieving both high clustering and short average
path lengths.
 Barabási–Albert: Uses preferential attachment so that highly connected nodes are
more likely to attract new edges. Produces a power-law degree distribution.
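As a small illustration, a G(n, p) graph can be sampled by tossing an independent biased coin for each possible undirected edge; the sketch below is illustrative only.

```java
import java.util.*;

public class ErdosRenyi {
    // Sample G(n, p): each of the n*(n-1)/2 possible undirected edges
    // is included independently with probability p.
    static List<int[]> sample(int n, double p, Random rng) {
        List<int[]> edges = new ArrayList<>();
        for (int u = 0; u < n; u++)
            for (int v = u + 1; v < n; v++)
                if (rng.nextDouble() < p) edges.add(new int[]{u, v});
        return edges;
    }

    public static void main(String[] args) {
        // With p around ln(n)/n the graph is almost surely connected
        // and its diameter grows roughly like O(log n).
        int n = 1000;
        double p = Math.log(n) / n;
        System.out.println("edges: " + sample(n, p, new Random(42)).size());
    }
}
```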
2. Spanning Trees
A spanning tree in a connected graph is a subgraph that includes all vertices and is itself a
tree (no cycles).

2.1 Synchronous BFS (SyncBFS)


 Assumptions:
o We have a strongly connected directed graph.
o Each vertex (process) has a unique ID.
o Processes communicate in synchronous rounds over directed edges.
SyncBFS algorithm for building a Breadth-First Spanning Tree:

1. Initialization:

o A designated root i0 sets marked = True, others marked = False.


o parent = nil everywhere except the root.
2. Rounds:
o The root sends a “search” message in round 1.
o An unmarked node receiving “search” from x marks itself, sets parent = x,
and in the next round sends “search” messages to its own out-neighbors.
3. Complexities:

o Time: At most diam rounds (where diam is the maximum distance from the root).

o Messages: One message per edge ⟹ O(∣E∣).

Applications:
 Aggregation (computing global sums, maxima, etc. in a convergecast).
 Leader Election (e.g., gather all UIDs, pick the max).
 Broadcast (once a BFS tree is built, use it to disseminate messages).
 Computing Diameter (build BFS from every node, combine results, though it can be
expensive in large graphs).
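A single-process simulation of SyncBFS on an in-memory directed graph captures the round structure described above; the frontier plays the role of the nodes sending “search” in the current round (purely illustrative).

```java
import java.util.*;

public class SyncBFSSimulation {
    // adjacency.get(u) lists the out-neighbours of u; returns the parent[] array of the BFS tree.
    static int[] run(List<List<Integer>> adjacency, int root) {
        int n = adjacency.size();
        boolean[] marked = new boolean[n];
        int[] parent = new int[n];
        Arrays.fill(parent, -1);              // -1 stands for "nil"
        marked[root] = true;

        // Frontier = nodes that send "search" messages in the current round.
        List<Integer> frontier = List.of(root);
        while (!frontier.isEmpty()) {
            List<Integer> next = new ArrayList<>();
            for (int u : frontier) {
                for (int v : adjacency.get(u)) {
                    if (!marked[v]) {         // an unmarked node accepts the first "search" it sees
                        marked[v] = true;
                        parent[v] = u;
                        next.add(v);
                    }
                }
            }
            frontier = next;                  // next synchronous round
        }
        return parent;
    }
}
```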
2.2 Asynchronous Spanning Tree (AsynchSpanningTree)
 Works with asynchronous message passing (no global rounds).
 Each node:

o Initially parent = null except a root i0 (which might be a known leader or a


chosen node).
o When a node first receives a “search” message and has no parent, it sets
parent = sender and forwards “search” to all other neighbors.
o This forms a spanning tree over time, but not necessarily BFS (since
slower/longer paths might “win” if they arrive earlier).
 Correctness: Eventually, all nodes become part of a single spanning tree rooted at i0.
 Complexities:

o Messages: O(|E|).

o Time: O(diam(l + d)), where l is the max local processing time and d is the max
channel delay.
Child pointers & Broadcast:
 If each node reports back whether it accepted the sender as a parent (or not), a tree
can be augmented with child pointers.
 Complexity can be higher if we wait for acknowledgments, potentially up to
O(n(l+d)) if a chain forms through many nodes.

Leader Election:
 Combine asynchronous spanning tree with UID comparisons to pick the node with
the largest (or smallest) ID.

3. Epidemic Broadcast Trees


Plumtree Protocol:
 Aims to combine:
o Gossip broadcast: high scalability, resilience, but can produce excessive
message overhead (due to flooding or multi-path).
o Tree-based broadcast: minimal redundancy but fragile if links fail or nodes
crash.
Approach:
1. Each node keeps a set of peers in an eager-push set (the main tree edges) and a
lazy-push set (metadata-only links).

2. On receiving a new message for the first time, the node forwards the payload to its
eager-push neighbours.
3. On receiving a duplicate message, the node moves that neighbour to lazy-push
mode (just exchanging message IDs/metadata).
4. If the tree breaks (a node times out waiting for a payload that never arrives), it
upgrades the lazy link to an eager link to repair connectivity.

5. Eventually, excessive cycles (redundant edges) are pruned.


Result: A gossip-based approach that converges to an implicit spanning tree used for
broadcast, leveraging local repairs when failures are detected.
4. Small-World Networks
4.1 Milgram’s “Six Degrees of Separation”
 Experiments suggested short paths of acquaintances connect arbitrary people in
~6 steps.
4.2 Watts–Strogatz Model
 High clustering + short path lengths via small number of random long-range edges.

 Formal random graphs like Erdős–Rényi achieve short diameters (log n) but lack
high clustering.
 Watts–Strogatz keeps a lattice/ring (high clustering) but rewires a fraction of edges
randomly to produce shortcuts, reducing average path length.
4.3 Routing

 Simple flooding in a small-world network finds short routes, but local, greedy
routing is trickier:

o With arbitrary random shortcuts, greedy local steps can lead to O(√n)
average path lengths.
 Kleinberg: Showed a probability distribution for long-range links that preserves
distance at multiple scales, yielding polylogarithmic (O(log² n)) local greedy routes.

 DHTs (Distributed Hash Tables) like Chord are engineered “small world” overlays
with carefully chosen links, giving logarithmic routing.

5. Concluding Remarks
1. Spanning Trees in synchronous or asynchronous networks allow foundational
tasks: broadcast, gather, leader election, global computations, etc.

2. Asynchronous algorithms do not guarantee BFS ordering but still produce a valid
spanning tree.
3. Epidemic (Gossip) Protocols and Tree-based methods can be combined to
balance high scalability/resilience with controlled overhead (e.g., Plumtree).
4. Small-World Networks (Watts–Strogatz, Kleinberg) demonstrate how adding
random or structured long-distance edges can drastically reduce average path
length while retaining clustering. They inspire scalable peer-to-peer overlays and
DHTs with logarithmic routing.

Overall, distributed systems exploit these scalable topologies and graph structures to
achieve efficient, resilient communication despite large scale and unreliable networks.
CHAP 7 - FAULT TOLERANCE – CONSENSUS

1. Distributed Agreement: Motivation and Consensus Definition


1.1 Distributed Agreement (Informal)
 We want processes in a group to agree on actions or decisions.

 This general notion of “agreement” appears in multiple problems (atomic


commitment, group membership, leader election, etc.).
 Consensus is a fundamental formulation of the agreement problem.
1.2 Consensus (Formal Definition)

Given n processes, each starts with an input value in some set V. Each process must
produce a decision value in V. The following conditions must hold:
1. Agreement: All processes that decide must decide on the same value (only one
value).

2. Validity: If all processes start with the same input value v, then the only possible
decided value is v. (We cannot invent a value that was never proposed.)

3. Termination: In a failure-free execution, eventually all processes make a decision


(i.e., the algorithm terminates with a value).

2. The Synod Algorithm (Core of Paxos)


2.1 System Model & Assumptions
 Processes:
o Operate asynchronously (no global time bounds).
o May fail and recover (crash-restart).

o Have access to stable storage (recovery can read prior state).


 Messages:
o Delays are unbounded.
o Messages can be lost or duplicated but not corrupted.
 Remark: By the famous FLP (Fischer, Lynch, Paterson) impossibility result, there is
no purely deterministic protocol that guarantees progress in an asynchronous
system with potential crash failures. However, Paxos/Synod ensures safety always
and liveness under additional conditions (e.g., leader election with timeouts).
2.2 Roles
 Proposers: Propose (number, value) pairs to be chosen.
 Acceptors: Vote to accept or reject proposals. A value becomes chosen if a
majority of acceptors accept a particular proposal number and value.
 Learners: Learn which value was chosen by the acceptors.

Structure: Execution proceeds in rounds, each with up to two phases:


1. Phase 1 (“Prepare”):

o A proposer picks a unique proposal number n and sends PREPARE(n) to a


majority of acceptors.

o Each acceptor, upon receiving PREPARE(n), checks if n is higher than any


proposal number it has already promised not to override:

 If n is indeed higher, the acceptor promises not to accept proposals


< n in future and replies with the highest-numbered proposal it has
accepted so far (if any).

 If n is not higher, the acceptor may choose not to respond or


respond negatively.
2. Phase 2 (“Accept”):
o If the proposer gets successful “promise” responses from a majority, it
picks the value v of the highest-numbered accepted proposal (among
replies) or, if no acceptor reported any prior accepted proposal, it picks its
own input value.

o The proposer sends ACCEPT(n, v) to those same acceptors.

o Each acceptor that receives ACCEPT(n, v) accepts it unless it has meanwhile
responded to a PREPARE with a number greater than n (and thus promised to
ignore proposals numbered below that).
Key Property:
 Once a value is chosen (accepted by a majority under a certain proposal number),
all higher-numbered proposals that succeed (i.e., get majority acceptance) must
carry the same value. This ensures agreement.
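The acceptor-side rules of the two phases can be captured in a few lines; this is a simplified in-memory sketch (no stable storage, string values, illustrative names only), not a full Paxos implementation.

```java
import java.util.Optional;

public class SynodAcceptorSketch {
    // What an acceptor reports back in Phase 1: the highest-numbered proposal
    // it has accepted so far (a number < 0 means "none yet").
    record Promise(long acceptedNumber, String acceptedValue) {}

    // Acceptor state (kept on stable storage in a real implementation).
    private long promised = -1;          // highest proposal number promised in Phase 1
    private long acceptedNumber = -1;    // highest proposal number accepted so far
    private String acceptedValue = null; // value of that accepted proposal

    // Phase 1: PREPARE(n).
    synchronized Optional<Promise> onPrepare(long n) {
        if (n <= promised) return Optional.empty();   // n is not higher: no promise
        promised = n;                                 // promise to ignore proposals < n
        return Optional.of(new Promise(acceptedNumber, acceptedValue));
    }

    // Phase 2: ACCEPT(n, v). Accept unless a higher-numbered promise was already made.
    synchronized boolean onAccept(long n, String value) {
        if (n < promised) return false;
        promised = n;
        acceptedNumber = n;
        acceptedValue = value;
        return true;
    }
}
```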
2.3 Synod Execution Examples

1. Simple Case: One proposer P1 does PREPARE(n) and gets “promises” from a
majority of acceptors (A1, A2, A3). It then sends ACCEPT(n, v), and the acceptors
accept.

2. Concurrent Proposals: Two proposers with different numbers might overlap in


time. One crucial rule is that if an acceptor has accepted a proposal (n2, v2) with n2
> n1, it will reflect this in the PROMISE reply to any new PREPARE(n) with n > n2.
Hence, the new proposer must adopt v2 if v2 is the highest-numbered prior value.
2.4 Correctness Arguments
 Safety:

o If a proposal (n,v) is chosen, no other proposal with a different value can


also be chosen.
o The mechanism of requiring intersection of majorities and carrying forward
the highest-numbered accepted value ensures that once a value is chosen,
higher-numbered proposals will not choose a conflicting value.

 Learning (how learners discover the chosen value):


1. Each acceptor can notify all learners whenever it accepts a proposal.
2. Alternatively, acceptors can notify a distinguished learner or the distinguished
proposer.

3. Any approach must eventually gather enough acceptor votes (from a majority) to
confirm that a proposal is chosen.
 Progress / Liveness:
o Multiple proposers can lead to livelock if they continuously issue conflicting
proposals.
o Leader election (choosing one distinguished proposer) is used to avoid
indefinite conflicts. Only that leader tries to propose new values,
guaranteeing eventual success.
o Under synchronous-like conditions with stable leader, the protocol
terminates.

3. Handling Lost Messages & Failure Recovery


3.1 Lost Messages

 If a proposer does not get “promise” from a majority in Phase 1, it can retry with a
higher proposal number.

 Likewise, if ACCEPT(n, v) messages do not reach enough acceptors, the proposer


can reinitiate a new round.
3.2 Ensuring Progress
 Livelock can occur if multiple proposers keep interrupting each other with higher
numbered PREPAREs. In practice, use:
o A single leader (distinguished proposer) to propose.
o If the leader fails or times out, a new leader is elected with a higher range of
proposal numbers, ensuring it can eventually achieve acceptance.

4. Paxos as an Implementation
Lamport’s Paxos is the practical embedding of the Synod Algorithm:
1. Leader: Each node can be a proposer, acceptor, and learner, but typically we run a
leader election. The leader acts as the distinguished proposer/learner.
2. Unique Proposal Numbers: Typically (counter, leader_id) ensures uniqueness.
3. Stable Storage: Acceptors record:
o The highest numbered PREPARE they’ve promised.

o The highest numbered proposal they’ve accepted.


The protocol ensures:
 Safety under asynchrony and crash failures, if at most a minority of acceptors are
permanently lost.

 Liveness if eventually messages arrive, and a stable leader remains in place (no
perpetual new leadership attempts).

5. State Machine Replication (SMR) with Paxos


5.1 Concept
 State Machine: A deterministic replica that transitions state by applying a
sequence of operations (requests).
 Goal: Have each replica process the same sequence of client requests in the same
order, thus producing identical outputs.
 Challenge: In a fault-prone environment, clients might issue requests
concurrently, and replicas can fail.
5.2 Using Paxos (or Atomic Broadcast)
 Label each request with an increasing slot/index i.
 For slot i, run Paxos as an instance of consensus to decide which client operation
belongs there.
 Replicas apply the chosen operation in slot i only after all slots up to i−1 are
applied.
5.3 Implementation Details
 Typically, a single leader is elected.
 The leader proposes new client requests in new slot numbers via Phase 2 (possibly
skipping Phase 1 if it did a global Phase 1 once at election).

 If the leader fails, a new leader runs Phase 1 for all slots that remain uncertain,
collects accepted proposals, and issues new proposals accordingly.
 Once a request is learned to be chosen for slot i, each replica executes it in state
machine order.
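A minimal sketch of this replica-side execution rule, applying a decided operation for slot i only after every lower slot has been applied (the consensus instances themselves are not modeled; names are illustrative):

```java
import java.util.*;
import java.util.function.Consumer;

public class SmrReplicaSketch {
    private final Map<Long, String> decided = new HashMap<>();  // slot -> decided operation
    private long nextToApply = 0;                                // lowest unapplied slot
    private final Consumer<String> stateMachine;                 // deterministic apply function

    SmrReplicaSketch(Consumer<String> stateMachine) { this.stateMachine = stateMachine; }

    // Called when consensus (e.g., a Paxos instance) decides the operation for a slot.
    void onDecide(long slot, String operation) {
        decided.put(slot, operation);
        // Apply in slot order, never skipping a gap.
        while (decided.containsKey(nextToApply)) {
            stateMachine.accept(decided.remove(nextToApply));
            nextToApply++;
        }
    }
}
```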
CHAP 8 - PRACTICAL BYZANTINE FAULT-TOLERANCE

1. Byzantine Failures
 A Byzantine process can deviate arbitrarily from its specification. It may send
conflicting messages to different recipients.
 Other processes do not initially know which ones are Byzantine; simply ignoring
some subset’s messages is not feasible if we don’t know who is faulty.

 Byzantine Generals Problem (BGP): Formulates a reliable broadcast scenario in


which a broadcaster might be faulty, and we need agreement and validity among
correct receivers.
o Agreement: All non-faulty processes deliver the same message.
o Validity: If the broadcaster is non-faulty, then all non-faulty processes
deliver the original broadcast message.

 Impossibility result: even in a synchronous system, if messages are not signed, a
correct solution exists only when fewer than n/3 processes are faulty (f < n/3). Hence,
you need n ≥ 3f + 1.
1.1 Cryptographic Signatures
 If messages are signed (e.g., via digital signatures), a Byzantine node cannot
tamper with a message’s signature once issued. This prevents certain forms of
tampering or replay.

 However, correct protocols must still address replay attacks and ensure signatures
aren’t reused incorrectly.

2. System Model and Problem Definition


 The system is asynchronous:
o Message delays can be arbitrary.
o Messages can be lost, duplicated, or delivered out-of-order, but not
corrupted (if using digital signatures/hash checks).
 Each replica (process) can fail arbitrarily (Byzantine), independently of each other.
 Service Properties:
1. State Machine Replication (SMR): The service is modeled as a deterministic state
machine with a replicated state and atomic operations.
2. Safety (Linearizability): The replicated service behaves like a non-replicated
service with sequential, atomic execution of requests.
3. Liveness: Typically, cannot be guaranteed in a purely asynchronous model, but
PBFT aims for practical progress when network conditions are “stable enough.”

4. Resiliency: Tolerates up to f faulty (Byzantine) replicas out of N = 3f + 1 total.


3. Protocol Overview
3.1 Views and Leader
 The protocol proceeds in a series of views identified by v=0,1,2,….

 The leader (primary) of view v is replica ℓ=v mod N.


 If the leader is suspected to be faulty or slow, the system transitions to a higher view
v + 1 with a new leader.
3.2 Algorithm Outline
1. Clients:
o Sign and send requests to the leader.

o Collect at least f + 1 matching replies from different replicas before


accepting a result.
2. Replicas:
o Perform a three-phase atomic broadcast of the client request to ensure
total ordering:

1. Pre-prepare (leader proposes a sequence number n for the


request).
2. Prepare (replicas confirm they saw the same request and sequence
number).
3. Commit (replicas finalize the order and allow local execution).

o Send replies to the client after committing and executing the operation.
In more detail, PBFT’s atomic broadcast ensures that:
 Within a single view, no two requests obtain the same sequence number with
different operation digests.

 Across view changes, a request partially ordered in a previous view is not lost or
re-ordered incorrectly.
3.3 Client Behavior

1. Send: A client sends a signed REQUEST message ⟨REQUEST,op,t,c⟩ to the current


leader (t is a unique timestamp to avoid duplicates).

2. Wait: The client collects f+1 valid REPLY messages with matching results from
different replicas.
3. Timeouts: If replies do not arrive or are inconsistent, the client broadcasts the
request to all replicas, causing them either to resend the already committed reply
or to forward the request to the leader. If the leader fails, a view change will
eventually occur.
4. Atomic Broadcast Protocol (Three-Phase Commit in PBFT)
4.1 Quorums and Certificates

 PBFT uses quorums of size 2f + 1 among 3f+1 replicas.

o Any two quorums intersect in at least f+1 replicas, ensuring overlap that
contains at least one correct (non-Byzantine) node.
 Certificates:

o A quorum certificate is a set of 2f+1 matching messages each signed by


different replicas.

o A weak certificate requires f+1 signatures (used for client replies).
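The arithmetic behind these quorum and certificate sizes can be checked directly; a tiny sketch:

```java
public class PbftQuorumMath {
    public static void main(String[] args) {
        for (int f = 1; f <= 4; f++) {
            int n = 3 * f + 1;                      // total replicas
            int quorum = 2 * f + 1;                 // certificate size
            int minIntersection = 2 * quorum - n;   // worst-case overlap of two quorums
            // minIntersection = f + 1, so any two certificates share at least
            // one correct (non-Byzantine) replica.
            System.out.printf("f=%d  N=%d  quorum=%d  intersection>=%d%n",
                              f, n, quorum, minIntersection);
        }
    }
}
```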


4.2 Pre-Prepare Phase

1. Leader assigns the next sequence number n to the request m.

2. Leader multicasts ⟨PRE-PREPARE, v, n, d⟩σℓ, m to all replicas, where d is a digest
(hash) of the request m.
3. Replica accepts this pre-prepare if:

o It is in the correct view v.


o The signature is valid, and the digest d matches the request m.

o The sequence number n is within watermarks [h,H] (to prevent unbounded


sequence jumps).

o The replica hasn’t accepted another pre-prepare for view v with the same
n but a different digest.
4.3 Prepare Phase

 After accepting a pre-prepare message, a replica broadcasts ⟨PREPARE,v,n,d,i⟩σi


to all.
 A replica accepts each prepare if valid.

 A prepared (P-)certificate for (v, n, d) requires:


o The pre-prepare message from the leader,

o Plus, 2f PREPARE messages from distinct replicas (total 2f+1 including the
leader).

Outcome: If a request is prepared, no conflicting request with the same (v,n) can also be
prepared, achieving total order within the same view.

4.4 Commit Phase

 Once a replica has a P-certificate (it is “prepared”) for (v,n,m), it multicasts ⟨COMMIT,v,n,d,i⟩σi to all.
 A request is considered committed at replica i when i has both:
1. A P-certificate (from prepare),
2. A “commit certificate” of 2f+1 matching COMMIT messages for (v,n,d).

Invariant: If a replica commits a request, at least f+1 correct replicas also prepared it, so knowledge of that request cannot be lost in a view change.
4.5 Execution and Reply
 A replica executes the operation after committing it in sequence number order (no
gaps).
 Then it sends a signed reply to the client.

 The client waits for f+1 identical replies to confirm the result.
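A minimal sketch of the execute-in-order rule in Python (the dictionary-based log and the callback are simplifications for illustration, not PBFT’s actual data structures):

def execute_in_order(committed, last_executed, apply_op):
    """Execute committed requests strictly in sequence-number order, never
    skipping a gap. `committed` maps sequence number -> operation,
    `last_executed` is the highest sequence number applied so far."""
    while last_executed + 1 in committed:
        last_executed += 1
        apply_op(committed[last_executed])   # the signed REPLY would be sent here as well
    return last_executed

# Example: sequence number 3 is held back until number 2 has been committed.
log = {1: "op-a", 3: "op-c"}
done = execute_in_order(log, 0, print)       # prints op-a only
log[2] = "op-b"
done = execute_in_order(log, done, print)    # prints op-b, then op-c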

5. View Change Protocol


If the leader is faulty or too slow, replicas suspect it and move to view v+1. The new leader
becomes ℓ′= (v + 1 ) mod N.
5.1 Trigger
 A replica times out waiting for a request to be committed or sees clear Byzantine
misbehaviour from the leader.

 It sends a VIEW-CHANGE message for v+1, carrying:

o The latest stable checkpoint (with sequence n),

o The checkpoint’s certificate (f+1 matching checkpoint messages),

o A set of prepared certificates for all sequence numbers above n.


5.2 New-View Certificate

 The new leader waits until it collects 2f+1 valid VIEW-CHANGE messages for v+1.
 This set forms the new-view certificate.
 The leader combines the information (prepared requests, stable checkpoints) into
a ⟨NEW-VIEW, v+1, V, O, N⟩σℓ message, where:

o V is the set of VIEW-CHANGE messages,

o O, N are new pre-prepare messages carrying the assignments of sequence numbers from previous views (or “null” if needed).

 It multicasts NEW-VIEW to all replicas.

5.3 Accepting NEW-VIEW


 A replica accepts NEW-VIEW if:
o It is properly signed, references a valid new-view certificate,

o The included pre-prepare messages (O ∪ N) are consistent with the recorded prepared certificates.

The replica then enters view v+1 and re-issues PREPARE messages for the included
requests to finalize them in the new view.
6. Correctness Arguments
6.1 Safety
1. No conflicting commits: Within a view, the prepare phase ensures no two different
requests get the same (v,n).

2. View change safety:

o If a request commits in view v at sequence n, at least f+1 correct replicas also prepared it.

o During the view change to v′>v, at least one of those f+1 replicas will
include that prepared info in its VIEW-CHANGE message, thereby
propagating the request.

o Hence in any future view, request (v,n) is not replaced by a different request.
6.2 Liveness
 If there is a long-enough “stable period” (bounded message delays, correct leader,
etc.), the protocol will complete requests.
 If the leader is faulty or slow, eventually enough replicas will suspect it, trigger a
view change, and elect a new leader.
 PBFT uses exponential backoff of timeouts to reduce the frequency of premature
view changes.

 At most f consecutive leaders can fail before a correct leader emerges (since each
view increments the leader ID by 1 modulo N).
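A possible shape of that back-off, as a sketch (the base value and the doubling factor are illustrative assumptions, not constants taken from the paper):

def view_change_timeout(base, failed_attempts):
    """Timeout used while waiting for the next view change to complete.
    Doubling after every unsuccessful attempt means that, once the network
    stabilizes, the timeout eventually exceeds the real message delay and a
    correct leader gets enough time to commit requests."""
    return base * (2 ** failed_attempts)

# Example: a 2-second base grows to 4 s, 8 s, 16 s across failed view changes.
print([view_change_timeout(2.0, k) for k in range(4)])   # [2.0, 4.0, 8.0, 16.0]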

7. Final Remarks
7.1 Additional Protocol Aspects

 Checkpoints: Used to truncate logs. Every K requests, a replica takes a checkpoint of its state, broadcasts a CHECKPOINT message, and gathers f+1 matching signatures to form a stable checkpoint. Anything older can be garbage-collected.
 Client fairness: The protocol ensures no correct client is starved; if the leader tries
to ignore some clients, those clients eventually trigger a view change.
 Performance Optimizations:
o Replacing expensive digital signatures with MACs (Message Authentication
Codes) can reduce overhead.

o Using digests D(m) to avoid sending full requests in pre-prepare messages.


o Clients can multicast requests to all replicas, cutting one communication
hop.
CHAP 9 – BLOCKCHAIN

1. Bitcoin
1.1 Motivation
 Goal: Make direct online payments without a trusted third party (like PayPal, Visa,
or a bank).
 Account-based model: Bitcoin tracks balances of accounts (public-key addresses) rather than physical “digital coins” (internally, balances are represented as unspent transaction outputs, UTXOs).
 A blockchain maintains a public record of all transactions ever performed.
1.2 Assumptions
1. Peer-to-peer (P2P) network:
o Nodes can join/leave freely.
o The network is large; most nodes are expected to remain online most of the
time.
o Uses broadcast over an unstructured overlay and anti-entropy to spread
information.
2. Account/Keys:
o Each user controls one or more keypairs (private/public).
o The account identifier is a hash of the public key.
o Transactions (payments) move “BTC balance” from one account (address)
to others.

2. Bitcoin Blockchain
2.1 Basic Structure
 A blockchain is an ordered sequence of blocks:
o Each block contains a set of transactions plus a header with metadata (e.g.,
pointer/hash of the previous block).

o The genesis block is the first block.


o New blocks are appended to the chain’s head (the latest block).
o Block size in Bitcoin is limited to 1 MB.
2.2 Network and Consistency
 The Bitcoin P2P network:
o Each node typically connects to ~8 neighbours, though more connections
are possible.
o Each node aims to store the entire blockchain, which (as of November
2024) is around 616 GB total.
 Key Problem: Achieve consensus on the current chain head among thousands of
nodes that may be compromised or behave maliciously (Sybil attacks).
 Solution: Use Proof-of-Work to determine who appends the next block.

3. Bitcoin Proof-of-Work (PoW)


3.1 Mechanism
 Idea: To propose a new block, a node (miner) must solve a cryptographic puzzle:
o Find a nonce in the block header so that the block header’s SHA-256 hash
is below a certain target.

o This is a brute force search, as SHA-256 is non-invertible.


 Difficulty Adjustment:
o Bitcoin is designed to produce a block every ~10 minutes on average.
o Every 2016 blocks (~14 days at 10 min/block), the protocol adjusts the
target so that block production rate remains ~10 minutes/block.
o As hash power grows, the target is lowered; if hash power decreases, the
target is raised.
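The retargeting rule is essentially a proportion (sketched here; Bitcoin additionally clamps the adjustment to at most a factor of 4 per period):

new_target = old_target × (time taken by the last 2016 blocks) / (2016 × 10 minutes)

A smaller target means a harder puzzle, so if blocks arrived too quickly the ratio is below 1 and the target shrinks; if they arrived too slowly, the target grows.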
3.2 Miners
 Miners invest computational power (and energy) to do PoW.

 The block’s header includes:


1. Hash of the previous block (chain link).
2. Merkle root or hash of the block’s transaction set.
3. A nonce and other fields (timestamp, target).
 Once a miner finds a valid PoW, it broadcasts the new block to the network.
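A minimal Python sketch of this brute-force search (the string-based header and the tiny example target are simplifications; real Bitcoin hashes a fixed binary header layout):

import hashlib

def mine(prev_hash, merkle_root, target, max_nonce=2**32):
    """Search for a nonce whose double-SHA-256 header hash is below `target`."""
    for nonce in range(max_nonce):
        header = f"{prev_hash}{merkle_root}{nonce}".encode()
        digest = hashlib.sha256(hashlib.sha256(header).digest()).hexdigest()
        if int(digest, 16) < target:              # smaller target => harder puzzle
            return nonce, digest
    return None                                   # nonce space exhausted

# Toy example with a very easy target so the loop finishes almost instantly.
print(mine("00ab" * 16, "11cd" * 16, target=1 << 248))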

4. Bitcoin Forks
4.1 Fork Condition
 A fork occurs if two different miners each produce a new block referencing the
same previous block around the same time.
 The network may temporarily see two competing chain tips.
4.2 Resolution
 Nodes choose the chain with the most accumulated work (or simply the longer
chain) as the valid one.
 Because new blocks are constantly built on top of one chain tip, typically one
branch becomes longer faster, and the other is abandoned.
 No Finality in the sense of classical consensus—blocks can be “rolled back” if a
competing fork eventually overtakes them.
 Conventionally, transactions are considered safe after ~6 confirmations (blocks
following them).
4.3 Causes of Forks

 Accidental: Differences in block propagation times, random chance.


 Selfish mining or network partitions: Attackers might withhold blocks to rewrite
history, especially if they control significant hash power (though not necessarily
50%).

5. Bitcoin Scalability and Energy Consumption


5.1 Transaction Throughput
 Block size = 1 MB, block interval = 10 min → Theoretical max ~7–8 transactions/sec
(empirical data ~7.7 tps).
 Contrast with Visa averaging 1700 tps, with peak capacity >10k tps.

 Simply increasing block size or reducing interval faces propagation delays, more
frequent forks, storage bloat, etc.
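The ~7 tps figure follows from simple arithmetic, assuming an average transaction size of roughly 250 bytes (an assumption about typical transactions, not a protocol constant):

1 MB / ~250 B per transaction ≈ 4,000 transactions per block
4,000 transactions / 600 s per block ≈ 6.7 transactions per second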
5.2 Energy Use
 Bitcoin’s PoW consumes massive energy to maintain security:
o The hash power must be large to deter majority attacks.

o Global Bitcoin mining electricity usage is estimated at 10–40 GW—


translating to many TWh/year.
o The exact carbon footprint depends on the energy mix (coal vs. renewable),
but it is significant.

6. Proof-of-Stake (PoS)
6.1 Concept
 Alternative to PoW to save energy.
 No more brute-force hashing; instead, “stakeholders” who hold coins (or have coin
“age”) are randomly selected to propose new blocks.

o The chance of being selected scales with how many coins you stake and
how long you’ve held them.
 Lottery Mechanism:
o Each block references a timestamp used to check a hash < target * stake
factor.
o If you hold more stake, or have accumulated more coin-age, you have a
higher chance to produce a block.
 Advantages:
o Much lower energy consumption.
o Faster block times are feasible.
 Disadvantages:
o More complex security analysis.

o Implementation challenges (e.g., Ethereum’s transition was postponed


multiple times).
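A sketch of the lottery check described above, loosely following a Peercoin-style coin-age formulation (the kernel fields and the scaling are simplifications, not the exact rule of any particular chain):

import hashlib

def wins_slot(prev_hash, timestamp, address, stake, coin_age_days, base_target):
    """Return True if this stakeholder may propose the block at `timestamp`.
    The kernel hash is compared against a target scaled by stake and coin age,
    so holding more coins for longer proportionally raises the winning chance."""
    kernel = f"{prev_hash}{timestamp}{address}".encode()
    h = int(hashlib.sha256(kernel).hexdigest(), 16)
    return h < base_target * stake * max(coin_age_days, 1.0)

# Example: the same slot is more likely to be won with a larger stake.
print(wins_slot("ab" * 32, 1_700_000_000, "addr-1",
                stake=10.0, coin_age_days=30.0, base_target=1 << 232))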

7. Permissioned Blockchains
7.1 Motivation
 Blockchain can be a useful structure for storing tamper-evident logs, implementing
“smart contracts,” etc.
 Not all use cases require open, permissionless participation like Bitcoin’s:
o Some businesses need private or consortium blockchains with controlled
membership and data confidentiality.
o “Hyperledger Fabric,” “Corda,” and similar platforms are examples.

7.2 Taxonomy
 Public (Permissionless): Anyone can join, propose blocks, and read data.
Example: Bitcoin, Ethereum.
 Consortium: Only known organization members can run block-producing nodes.
Reading might be open or restricted.
 Private/Permissioned: A single organization or trusted set controls who can run
nodes and see data.
7.3 PBFT vs PoW
 Practical Byzantine Fault Tolerance (PBFT) can maintain a replicated log (a
blockchain) among a small group (like 4–7 nodes).

 PBFT has O(n²) message complexity, but low latency, high throughput, and finality.
 Bitcoin’s PoW scales to thousands of nodes but has probabilistic finality, low
throughput (~7 tps), and very high energy cost.
 For many enterprise or consortium scenarios, PBFT-like protocols or other
Byzantine consensus variants are more suitable than PoW.
Comparison:

Feature                 PoW (e.g., Bitcoin)            PBFT
Node IDs                Open                           Known a priori
Finality                Probabilistic                  Immediate (once decided)
Node Scalability        ~Thousands                     Unknown (practical < 100s)
Throughput (tx/s)       ~10                            ~Thousands
Latency                 Needs multiple confirmations   Network time + 2–3 rounds
Energy Consumption      Very high                      Low
Adversary Threshold     <50% of hash power             <1/3 of replicas faulty
Synchrony Assumption    Needed for block validation    For liveness
Proofs                  Less formal                    Well-established correctness

CHAP 10 - SYSTEM DESIGN FOR LARGE SCALE

1. Motivation
 The Internet has millions of connected users, and newly launched services may
experience sudden bursts of popularity that overload centralized resources (the
“Slashdot effect”).
 Many users have always-on broadband, so it’s tempting to offload some service
responsibilities to client nodes at the network edge.
o For example, Blizzard (World of Warcraft) uses peer-to-peer distribution for
patches/demos to reduce bandwidth load on central servers.
 In theory, P2P can scale as more users adopt a service: more resources
(bandwidth, storage) become available.
o However, diminishing returns can set in: each additional node might add
complexity, overhead, or maintenance cost, limiting the net gain.

2. Early History of P2P


2.1 SETI@Home

 Goal: Analyze radio signals from the Arecibo telescope to detect possible
extraterrestrial transmissions.
 Central server splits raw data into “work units” (“buckets”) by time/frequency and
distributes them to volunteer machines worldwide.
 Volunteers run the analysis locally, then upload results back to a central server.

 Architecture: This is data-parallel computing with no direct peer-to-peer


communication—all brokerage handled by a central server.
2.2 Napster
 Napster pioneered MP3 file-sharing.

 A central index server stored metadata (which user had which songs). Actual file
transfers occurred directly peer-to-peer.
 Relied on a single, centralized directory for searching.
 Weakness: The central server became a legal attack point for the music industry.

 Many users behind firewalls: they can connect outbound to the index server, but not
necessarily accept inbound connections from other peers (the double firewall
problem).

3. Gnutella (Early Design)


3.1 Fully Distributed P2P

 No central server for indexing or searching.

 Peers form a partially randomized overlay. Each node i connects to k_i other nodes, with k_i varying.
 Bootstrapping: Some known “host caches” (HTTP or previously known addresses)
are used to discover initial peers.
 Routing:
o Flooding used for queries.
o Reverse path routing: responses travel backward along the same path the
query took.
 Protocol:
1. PING/PONG for node discovery: Flood PING, respond with PONG along reverse
path.
2. QUERY/QUERY RESPONSE for searching metadata (e.g., file names). Queries are
flooded; responses go back via reverse path.
3. GET/PUSH for initiating the actual file transfer directly between peers. PUSH is a
workaround for single-direction connectivity (firewalls).
 Scaling Issues: Early Gnutella was dominated by overhead from PING/PONG
messages in a large network.
3.2 Later Gnutella Improvements (Super-Peers)
 Some nodes (with higher capacity and longer uptime) become super-peers:
o They handle search indexing for a cluster of regular peers.
o Two-tier architecture reduces flooding overhead.

 Peers send content digests (e.g., Bloom filters) to super-peers.


 Super-peers then forward queries to only likely hosting peers.
 This scaled better and Gnutella eventually captured ~40% of P2P file-sharing by
around 2005.
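A small Python sketch of such a content digest as a Bloom filter (the bit-array size and number of hash positions here are arbitrary illustrative choices):

import hashlib

class BloomFilter:
    """Fixed-size bit array with k hash positions per item: false positives are
    possible, false negatives are not, which is acceptable for routing queries."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# A peer digests its shared file names; a super-peer forwards a query only to
# peers whose digest might contain the requested name.
digest = BloomFilter()
digest.add("song.mp3")
print(digest.might_contain("song.mp3"), digest.might_contain("other.mp3"))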

4. Distributed Hash Tables (DHTs)


 Flood-based search (like early Gnutella) can still be expensive at large scale.
 DHTs: Provide a consistent way to map keys to network nodes with bounded
lookup complexity:
o On a DHT, each key is deterministically assigned to a node that is
“responsible” for that key.
o A node routes a query to the key’s responsible node in O (log n) or similar
hops, rather than flooding.
 Challenge: DHTs require structured overlays that must handle node churn
(joins/leaves/crashes) carefully to maintain the routing invariants.

5. Chord
5.1 Overview
 Each node and key are assigned a unique ID in a circular ID space from 0 to 2^m−1
using a hash function (e.g., SHA-1).

 Ring structure: A key “belongs” to its successor, i.e., the first node whose ID is ≥ the key ID, wrapping around the ring.

 A node does not store the entire ring membership; it only stores O(log n) pointers:
o The successor pointer plus a “finger table” of about log n entries, where entry i points to the node that succeeds it by at least 2^(i−1) in the ID space.
 Lookup: Forward queries via the finger table, roughly halving the remaining distance in ID space at each hop. Route length is O(log n) hops.
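A Chord-style lookup sketch over a static ring in Python (node IDs are hard-coded instead of hashed, fingers are 0-indexed, and joins/leaves are ignored, so this only illustrates the routing idea above):

from bisect import bisect_left

M = 6                                          # 2**6 = 64 identifiers on the ring
RING = 2 ** M
NODES = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])   # made-up node IDs

def successor(x):
    """First node whose ID is >= x, wrapping around the ring."""
    i = bisect_left(NODES, x % RING)
    return NODES[i % len(NODES)]

def between(x, a, b):
    """True if x lies strictly inside the clockwise interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)

def fingers(n):
    """Finger i of node n points at successor(n + 2**i) (0-indexed here)."""
    return [successor((n + 2 ** i) % RING) for i in range(M)]

def lookup(start, key):
    """Hop through finger tables until reaching the node responsible for key."""
    n, hops = start, 0
    while True:
        nxt = successor((n + 1) % RING)              # n's immediate successor
        if key == n or key == nxt or between(key, n, nxt):
            return (n if key == n else nxt), hops    # key in (n, nxt] or on n itself
        closer = [f for f in fingers(n) if between(f, n, key)]
        n = max(closer, key=lambda f: (f - n) % RING)   # closest preceding finger
        hops += 1

print(lookup(1, 54))    # -> (56, 3): node 56 is responsible for key 54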
5.2 Joining/Leaving
 Each node updates its successor/finger references after a join/leave to preserve
ring structure.

 Protocol ensures queries keep working despite churn, though some stale
references can temporarily increase routing hops.
6. Kademlia
6.1 XOR Metric
 IDs are 160-bit values for nodes and keys (e.g., SHA-1).

 Distance: d(a, b) = a ⊕ b (bitwise XOR). This is a proper metric (symmetric, obeys the triangle inequality).
 A node is responsible for keys “close” to its own ID under XOR.
6.2 Routing Tables
 Each node organizes its routing table into buckets based on the shared prefix length
with its own ID.

o E.g., bucket i contains nodes whose IDs share the first i−1 bits with the local
node, differ at bit i, and can vary in the remaining bits.

 Each bucket can store up to k nodes (commonly k ≈ 20).


 Lookups proceed by iteratively querying nodes that are closer in XOR space,
converging to the target key or node ID in about O (log n) steps.
 Symmetry: Kademlia routes the same way in both directions. This enables parallel
searches and better fault tolerance.
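A short Python illustration of the XOR metric and bucket indexing (node names and k = 2 are arbitrary choices for the example):

import hashlib

def node_id(name):
    """160-bit identifier derived with SHA-1, as in Kademlia."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def xor_distance(a, b):
    """Kademlia's distance: bitwise XOR of the two identifiers."""
    return a ^ b

def bucket_index(own_id, other_id):
    """Bucket for `other_id`: position of the highest differing bit,
    i.e. floor(log2(distance)); undefined for identical IDs."""
    return xor_distance(own_id, other_id).bit_length() - 1

# Example: pick, from a few known peers, the k closest to a target key.
me = node_id("peer-A")
peers = [node_id(f"peer-{c}") for c in "BCDEF"]
target = node_id("some-file-key")
closest = sorted(peers, key=lambda p: xor_distance(p, target))[:2]   # k = 2 here
print([hex(p)[:10] for p in closest], bucket_index(me, peers[0]))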
