DistSys Script
Roger Wattenhofer
[email protected]
Autumn 2022
Chapter 1
Introduction to Distributed
Systems
Computer systems are distributed for many reasons:
• Geography: Large organizations and companies are inherently geograph-
ically distributed, and a computer system needs to deal with this issue
anyway.
• Parallelism: To speed up computation, we employ multicore processors or
computing clusters.
• Reliability: Data is replicated on different machines to prevent data loss.
• Availability: Data is replicated on different machines to allow for access
at any time, without bottlenecks, minimizing latency.
Even though distributed systems have many benefits, such as increased stor-
age or computational power, they also introduce challenging coordination prob-
lems. Some say that going from one computer to two is a bit like having a
second child. When you have one child and all cookies are gone from the cookie
jar, you know who did it!
Coordination problems are so prevalent, they come with various flavors and
names. Probably there is a term for every letter of the alphabet: agreement,
blockchain, consensus, consistency, distributed ledger, event sourcing, fault-
tolerance, etc.
Coordination problems will happen quite often in a distributed system. Even
though every single node (node is a general term for anything that computes,
e.g. a computer, a multiprocessor core, a network switch, etc.) of a distributed
system will only fail once every few years, with millions of nodes, you can expect
a failure every minute. On the bright side, one may hope that a distributed
system may have enough redundancy to tolerate node failures and continue to
work correctly.
Chapter Notes
Many good textbooks have been written on the subject, e.g. [AW04, CGR11,
CDKB11, Lyn96, Mul93, Ray13, TS01]. James Aspnes has written excellent freely
available lecture notes on the theory of distributed systems [Asp14].
Bibliography
[Asp14] James Aspnes. Notes on Theory of Distributed Systems, 2014.
[AW04] Hagit Attiya and Jennifer Welch. Distributed Computing: Funda-
mentals, Simulations and Advanced Topics (2nd edition). John Wi-
ley Interscience, March 2004.
[CDKB11] George Coulouris, Jean Dollimore, Tim Kindberg, and Gordon Blair.
Distributed Systems: Concepts and Design. Addison-Wesley Pub-
lishing Company, USA, 5th edition, 2011.
15.1 Client/Server
Definition 15.1 (node). We call a single actor in the system a node. In a
computer network the computers are the nodes, in the classical client-server
model both the server and the client are nodes, and so on. If not stated otherwise,
the total number of nodes in the system is n.
Model 15.2 (message passing). In the message passing model we study
distributed systems that consist of a set of nodes. Each node can perform local
computations, and can send messages to every other node.
Remarks:
• We start with two nodes, the smallest number of nodes in a distributed
system. We have a client node that wants to “manipulate” data (e.g.,
store, update, . . . ) on a remote server node.
Model 15.4 (message loss). In the message passing model with message loss,
for any specific message, it is not guaranteed that it will arrive safely at the
receiver.
Remarks:
• A related problem is message corruption, i.e., a message is received
but the content of the message is corrupted. In practice, in contrast
to message loss, message corruption can be handled quite well, e.g. by
including additional information in the message, such as a checksum.
Remarks:
• Sending commands “one at a time” means that once the client has sent
a command c, it does not send any new command c′ until it has
received an acknowledgment for c.
• Since not only messages sent by the client can be lost, but also ac-
knowledgments, the client might resend a message that was already
received and executed on the server. To prevent multiple executions of
the same command, one can add a sequence number to each message,
allowing the receiver to identify duplicates (see the sketch after these
remarks).
• This simple algorithm is the basis of many reliable protocols, e.g.
TCP.
• The algorithm can easily be extended to work with multiple servers:
The client sends each command to every server, and once the client
received an acknowledgment from each server, the command is con-
sidered to be executed successfully.
• What about multiple clients?
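To make the retransmission idea from these remarks concrete, here is a minimal sketch in Python. It is an illustration only, not the script's Algorithm 15.5; the names Server, send_command and the loss probability are made up. The client resends a command until it receives an acknowledgment, and the server uses the sequence number to execute each command exactly once.

import random

class Server:
    def __init__(self):
        self.state = []
        self.last_seq = 0                  # highest sequence number executed

    def receive(self, seq, command):
        if seq == self.last_seq + 1:       # new command: execute it
            self.state.append(command)
            self.last_seq = seq
        # a duplicate (seq <= last_seq) is acknowledged but not re-executed
        return ("ack", seq)

def lossy(msg, loss=0.3):
    """Deliver msg with probability 1 - loss, otherwise lose it."""
    return msg if random.random() > loss else None

def send_command(server, seq, command):
    """Client: resend (seq, command) until an acknowledgment for seq arrives."""
    while True:
        if lossy((seq, command)) is not None:          # the request may be lost
            ack = lossy(server.receive(seq, command))  # the ack may be lost
            if ack == ("ack", seq):
                return

server = Server()
for seq, cmd in enumerate(["x = 1", "x = 2", "x = 3"], start=1):
    send_command(server, seq, cmd)
print(server.state)   # every command executed exactly once, in order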
Model 15.6 (variable message delay). In practice, messages might experience
different transmission times, even if they are being sent between the same two
nodes.
Remarks:
• Throughout this chapter, we assume the variable message delay model.
Theorem 15.7. If Algorithm 15.5 is used with multiple clients and multiple
servers, the servers might see the commands in different order, leading to an
inconsistent state.
Proof. Assume we have two clients u1 and u2 , and two servers s1 and s2 . Both
clients issue a command to update a variable x on the servers, initially x = 0.
Client u1 sends command x = x + 1 and client u2 sends x = 2 · x.
Let both clients send their message at the same time. With variable message
delay, it can happen that s1 receives the message from u1 first, and s2 receives
the message from u2 first.1 Hence, s1 computes x = (0 + 1) · 2 = 2 and s2
computes x = (0 · 2) + 1 = 1.
1 For example, u1 and s1 are (geographically) located close to each other, and so are u2
and s2.
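The arithmetic in this proof is easy to check directly; the following lines of Python only restate the two interleavings:

x = 0
print((x + 1) * 2)   # server s1: first u1's command, then u2's -> 2
print((x * 2) + 1)   # server s2: first u2's command, then u1's -> 1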
Remarks:
• This idea is sometimes also referred to as leader/follower (or par-
ent/child) replication.
• What about node failures? Our serializer is a single point of failure!
• Can we have a more distributed approach to solving state replication?
Instead of directly establishing a consistent order of commands, we
can use a different approach: We make sure that there is always at
most one client sending a command; i.e., we use mutual exclusion,
respectively locking.
Remarks:
• This idea appears in many contexts and with different names, usually
with slight variations, e.g. two-phase locking (2PL).
• It is often claimed that 2PL and 2PC provide better consistency guar-
antees than a simple serializer if nodes can recover after crashing. In
particular, alive nodes might be kept consistent with crashed nodes,
for transactions that started while the crashed node was still running.
This benefit was even improved in a protocol that uses an additional
phase (3PC).
• The problem with 2PC or 3PC is that they are not well-defined if
exceptions happen.
• Does Algorithm 15.10 really handle node crashes well? No! In fact,
it is even worse than the simple serializer approach (Algorithm 15.9):
Instead of needing one available node, Algorithm 15.10 requires all
servers to be responsive!
• Does Algorithm 15.10 also work if we only get the lock from a subset
of servers? Is a majority of servers enough?
15.2 Paxos
Definition 15.11 (ticket). A ticket is a weaker form of a lock, with the fol-
lowing properties:
Phase 3
8: if client hears a positive answer from a majority of the servers then
9: Client tells servers to execute the stored command
10: else
11: Client waits, and then starts with Phase 1 again
12: end if
Remarks:
• There are problems with this algorithm: Let u1 be the first client
that successfully stores its command c1 on a majority of the servers.
Assume that u1 becomes very slow just before it can notify the servers
(Line 9), and a client u2 updates the stored command in some servers
to c2 . Afterwards, u1 tells the servers to execute the command. Now
some servers will execute c1 and others c2 !
• How can this problem be fixed? We know that every client u2 that
updates the stored command after u1 must have used a newer ticket
than u1 . As u1 ’s ticket was accepted in Phase 2, it follows that u2
must have acquired its ticket after u1 already stored its value in the
respective server.
Phase 2
7: if a majority answers ok then
8: Pick (Tstore, C) with largest Tstore
Remarks:
• Unlike previously mentioned algorithms, there is no step where a client
explicitly decides to start a new attempt and jumps back to Phase 1.
Note that this is not necessary, as a client can decide to abort the
current attempt and start a new one at any point in the algorithm.
This has the advantage that we do not need to be careful about se-
lecting “good” values for timeouts, as correctness is independent of
the decisions when to start new attempts.
• The performance can be improved by letting the servers send negative
replies in Phases 1 and 2 if the ticket has expired; this way, a client
does not need to wait for a timeout to learn that its attempt failed.
Proof. Observe that there can be at most one proposal for every ticket number
τ since clients only send a proposal if they received a majority of the tickets for
τ (Line 7). Hence, every proposal is uniquely identified by its ticket number τ .
Assume that there is at least one propose(t′,c′) with t′ > t and c′ ≠ c; of
such proposals, consider the proposal with the smallest ticket number t′. Since
both this proposal and also the propose(t,c) have been sent to a majority of the
servers, we can denote by S the non-empty intersection of servers that have been
involved in both proposals. Since propose(t,c) has been chosen, this means that
at least one server s ∈ S must have stored command c; thus, when the command
was stored, the ticket number t was still valid. Hence, s must have received the
request for ticket t′ after it already stored propose(t,c), as the request for ticket
t′ invalidates ticket t.
Therefore, the client that sent propose(t′,c′) must have learned from s that
a client already stored propose(t,c). Since a client adapts its proposal to the
command that is stored with the highest ticket number so far (Line 8), the client
must have proposed c as well. There is only one possibility that would lead to
the client not adapting c: If the client received the information from a server
that some client stored propose(t∗,c∗), with c∗ ≠ c and t∗ > t. In this case, a
client must have sent propose(t∗,c∗) with t < t∗ < t′, but this contradicts the
assumption that t′ is the smallest ticket number of a proposal issued after t.
Proof. From Lemma 15.14 we know that once a proposal for c is chosen, every
subsequent proposal is for c. As there is exactly one first propose(t,c) that is
chosen, it follows that all successful proposals will be for the command c. Thus,
only proposals for a single command c can be chosen, and since clients only
tell servers to execute a command when it is chosen (Line 20), each client will
eventually tell every server to execute c.
Remarks:
• If the client with the first successful proposal does not crash, it will
directly tell every server to execute c.
• However, if the client crashes before notifying any of the servers, the
servers will execute the command only once the next client is success-
ful. Once a server received a request to execute c, it can inform every
client that arrives later that there is already a chosen command, so
that the client does not waste time with the proposal process.
• Note that Paxos cannot make progress if half (or more) of the servers
crash, as clients cannot achieve a majority anymore.
• So far, we only discussed how a set of nodes can reach a decision on a
single command with the help of Paxos. We call such a single decision
an instance of Paxos.
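The full Paxos pseudocode is only excerpted above (parts of Phases 2 and 3), but its three-phase structure can be sketched compactly. The following Python fragment is an illustration under simplifying assumptions, not the script's algorithm verbatim: there are no real messages or concurrency, and names such as Server.ask_ticket and paxos_attempt are made up.

class Server:
    def __init__(self):
        self.t_max = 0      # largest ticket issued so far
        self.T_store = 0    # ticket with which the stored command was written
        self.C = None       # currently stored command

    def ask_ticket(self, t):                 # Phase 1
        if t > self.t_max:
            self.t_max = t
            return ("ok", self.T_store, self.C)
        return None                          # ticket refused

    def propose(self, t, c):                 # Phase 2
        if t == self.t_max:                  # ticket t is still valid
            self.C, self.T_store = c, t
            return "success"
        return None

def paxos_attempt(servers, t, command):
    """One client attempt with ticket number t; returns the chosen command,
    or None if the attempt failed (the client would retry with a larger t)."""
    majority = len(servers) // 2 + 1
    # Phase 1: ask every server for ticket t; a majority of oks is needed.
    answers = [a for a in (s.ask_ticket(t) for s in servers) if a is not None]
    if len(answers) < majority:
        return None
    # Phase 2: if some server already stored a command, adopt the command
    # stored with the largest T_store (cf. Line 8 above), then propose.
    _, T_store, C = max(answers, key=lambda a: a[1])
    if T_store > 0:
        command = C
    if sum(s.propose(t, command) == "success" for s in servers) < majority:
        return None
    # Phase 3: the command is chosen; tell all servers to execute it.
    return command

servers = [Server() for _ in range(5)]
print(paxos_attempt(servers, t=1, command="x = x + 1"))   # x = x + 1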
Chapter Notes
Two-phase protocols have been around for a long time, and it is unclear if there
is a single source of this idea. One of the earlier descriptions of this concept can
be found in the book by Gray [Gra78].
Leslie Lamport introduced Paxos in 1989. But why is it called Paxos? Lam-
port described the algorithm as the solution to a problem of the parliament
of a fictitious Greek society on the island Paxos. He even liked this idea so
much that he gave some lectures in the persona of an Indiana-Jones-style
archaeologist! When the paper was submitted, many readers were so distracted
by the descriptions of the activities of the legislators that they did not understand
the meaning and purpose of the algorithm. The paper was rejected. But Lamport
refused to rewrite the paper, and he later wrote that he “was quite annoyed at
how humorless everyone working in the field seemed to be”. A few years later,
when the need for a protocol like Paxos arose again, Lamport simply took the
paper out of the drawer and gave it to his colleagues. They liked it. So Lamport
decided to submit the paper (in basically unaltered form!) again, 8 years after
he wrote it – and it got accepted! But as this paper [Lam98] is admittedly hard
to read, he had mercy, and later wrote a simpler description of Paxos [Lam01].
Leslie Lamport is an eminent scholar when it comes to understanding dis-
tributed systems, and we will learn some of his contributions in almost every
chapter. Not surprisingly, Lamport has won the 2013 Turing Award for his
fundamental contributions to the theory and practice of distributed and
concurrent systems [Mal13].
Bibliography
[Gra78] James N Gray. Notes on data base operating systems. Springer, 1978.
[Lam98] Leslie Lamport. The part-time parliament. ACM Transactions on
Computer Systems (TOCS), 16(2):133–169, 1998.
[Lam01] Leslie Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25,
2001.
[Mal13] Dahlia Malkhi. Leslie Lamport. ACM webpage, 2013.
Chapter 16
Consensus
16.2 Consensus
In Chapter 15 we studied a problem that we vaguely called agreement. We will
now introduce a formally specified variant of this problem, called consensus.
Definition 16.1 (consensus). There are n nodes, of which at most f might
crash, i.e., at least n − f nodes are correct. Node i starts with an input value
vi . The nodes must decide for one of those values, satisfying the following
properties:
Remarks:
• We assume that every node can send messages to every other node,
and that we have reliable links, i.e., a message that is sent will be
received.
• There is no broadcast medium. If a node wants to send a message
to multiple nodes, it needs to send multiple individual messages. If a
node crashes while broadcasting, not all nodes may receive the broad-
casted message. Later we will call this best-effort broadcast.
• Does Paxos satisfy all three criteria? If you study Paxos carefully, you
will notice that Paxos does not guarantee termination. For example,
the system can be stuck forever if two clients continuously request
tickets, and neither of them ever manages to acquire a majority.
• One may hope to fix Paxos somehow, to guarantee termination. How-
ever, this is impossible. In fact, the consensus problem of Definition
16.1 cannot be solved by any algorithm.
Remarks:
• The asynchronous time model is a widely used formalization of the
variable message delay model (Model 15.6).
Definition 16.3 (asynchronous runtime). For algorithms in the asynchronous
model, the runtime is the number of time units from the start of the execution
to its completion in the worst case (every legal input, every execution scenario),
assuming that each message has a delay of at most one time unit.
Remarks:
• The maximum delay cannot be used in the algorithm design, i.e., the
algorithm must work independent of the actual delay.
• Asynchronous algorithms can be thought of as systems, where local
computation is significantly faster than message delays, and thus can
be done in no time. Nodes are only active once an event occurs (a
message arrives), and then they perform their actions “immediately”.
• We will show now that crash failures in the asynchronous model can
be quite harsh. In particular there is no deterministic fault-tolerant
consensus algorithm in the asynchronous model, not even for binary
input.
Definition 16.4 (configuration). We say that a system is fully defined (at any
point during the execution) by its configuration C. The configuration includes
the state of every node, and all messages that are in transit (sent but not yet
received).
Remarks:
• The decision value depends on the order in which messages are re-
ceived or on crash events. I.e., the decision is not yet made.
Lemma 16.7. There is at least one selection of input values V such that the
corresponding initial configuration C0 is bivalent, if f ≥ 1.
Proof. As explained in the previous remark, C0 only depends on the input values
of the nodes. Let V = [v0 , v1 , . . . , vn−1 ] denote the array of input values, where
vi is the input value of node i.
We construct n + 1 arrays V0 , V1 , . . . , Vn , where the index i in Vi denotes the
position in the array up to which all input values are 1. So, V0 = [0, 0, 0, . . . , 0],
V1 = [1, 0, 0, . . . , 0], and so on, up to Vn = [1, 1, 1, . . . , 1].
Note that the configuration corresponding to V0 must be 0-valent so that the
validity requirement is satisfied. Analogously, the configuration corresponding
to Vn must be 1-valent. Assume that all initial configurations with starting
values Vi are univalent. Therefore, there must be at least one index b, such
Remarks:
• For any algorithm, there is exactly one configuration tree for every
selection of input values.
• Every path from the root to a leaf is one possible asynchronous exe-
cution of the algorithm.
Proof. Recall that there is at least one bivalent initial configuration (Lemma
16.7). Assuming that this configuration is not critical, there must be at least
one bivalent following configuration; hence, the system may enter this configura-
tion. But if this configuration is not critical as well, the system may afterwards
progress into another bivalent configuration. As long as there is no critical con-
figuration, an unfortunate scheduling (selection of transitions) can always lead
the system into another bivalent configuration. The only way an algorithm can
enforce arriving in a univalent configuration is by reaching a critical
configuration.
Therefore we can conclude that a system which does not reach a critical
configuration has at least one possible execution where it will terminate in a
bivalent configuration (hence it terminates without agreement), or it will not
terminate at all.
Therefore we can pick one particular node u for which there is a transition
τ = (u, m) ∈ T which leads to a 0-valent configuration. As shown before, all
transitions in T which lead to a 1-valent configuration must also take place on
u. Since C is critical, there must be at least one such transition. Applying the
same argument again, it follows that all transitions in T that lead to a 0-valent
configuration must take place on u as well, and since C is critical, there is no
transition in T that leads to a bivalent configuration. Therefore all transitions
applicable to C take place on the same node u!
If this node u crashes while the system is in C, all transitions are removed,
and therefore the system is stuck in C, i.e., it terminates in C. But as C is
critical, and therefore bivalent, the algorithm fails to reach an agreement.
Proof. We assume that the input values are binary, as this is the easiest non-
trivial possibility. From Lemma 16.7 we know that there must be at least one
bivalent initial configuration C. Using Lemma 16.12 we know that if an algo-
rithm solves consensus, all executions starting from the bivalent configuration
C must reach a critical configuration. But if the algorithm reaches a critical
configuration, a single crash can prevent agreement (Lemma 16.13).
Remarks:
• If f = 0, then each node can simply send its value to all others, wait
for all values, and choose the minimum.
• But if a single node may crash, there is no deterministic solution to
consensus in the asynchronous model.
• How can the situation be improved? For example by giving each node
access to randomness, i.e., we allow each node to toss a coin.
Remarks:
• The idea of Algorithm 16.15 is very simple: Either all nodes start
with the same input bit, which makes consensus easy. Otherwise,
nodes toss a coin until a large number of nodes get – by chance – the
same outcome.
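A toy, fully synchronous simulation of this idea may help; it is crash-free, ignores asynchrony, and uses made-up names, so it only illustrates why matching coin flips eventually lead to termination and does not reproduce Algorithm 16.15.

import random

def randomized_consensus(values):
    n, majority = len(values), len(values) // 2 + 1
    decided, rounds = [None] * n, 0
    while any(d is None for d in decided):
        rounds += 1
        # Every node broadcasts its value and proposes v if it sees a
        # majority for v (here all nodes see the same multiset of values).
        proposals = []
        for _ in range(n):
            for v in (0, 1):
                if values.count(v) >= majority:
                    proposals.append(v)
                    break
        # A node that sees a majority of proposals for v decides v; otherwise
        # it adopts a proposed value, or flips a coin if nothing was proposed.
        for i in range(n):
            if decided[i] is not None:
                continue
            for v in (0, 1):
                if proposals.count(v) >= majority:
                    decided[i] = v
                    values[i] = v
                    break
            else:
                values[i] = proposals[0] if proposals else random.randint(0, 1)
    return decided, rounds

print(randomized_consensus([0, 1, 1, 0, 1]))   # all decide 1 after one round
print(randomized_consensus([0, 1, 0, 1]))      # coin flips until all match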
Proof. The only two steps in the algorithm when a node waits are in Lines 5
and 11. Since a node only waits for a majority of the nodes to send a message,
and since f < n/2, the node will always receive enough messages to continue,
as long as no correct node terminates.
Proof. Observe that proposals for both 0 and 1 cannot occur in the same round,
as nodes only send a proposal for v if they hear a majority for v in Line 5.
Let u be the first node that decides for a value v in round r. Hence, it received
a majority of proposals for v in r (Line 7). Note that once a node receives a
majority of proposals for a value, it will adapt this value and terminate in the
same round. Since there cannot be a proposal for any other value in r, it follows
that no node decides for a different value in r.
In Lemma 16.16 we only showed that nodes do not get stuck as long as no
node decides, thus we need to be careful that no node gets stuck if u terminates.
Any node u′ ≠ u can experience one of two scenarios: Either it also receives
a majority for v in round r and terminates, or it does not receive a majority.
In the first case, the agreement requirement is directly satisfied, and also the
node cannot get stuck. Let us study the latter case. Since u heard a majority
of proposals for v, it follows that every node hears at least one proposal for v.
Hence, all nodes set their value vi to v in round r. The nodes that terminate
in round r also send one additional myValue and one propose message (Lines
13, 14). Therefore, all nodes will broadcast v at the beginning of round r + 1,
all nodes will propose v in the same round and, finally, all nodes will decide for
the same value v.
Lemma 16.19. Algorithm 16.15 satisfies the termination requirement, i.e., all
nodes terminate in expected time O(2^n).
Proof. We know from the proof of Lemma 16.18 that once a node hears a ma-
jority of proposals for a value, all nodes will terminate at most one round later.
Hence, we only need to show that a node receives a majority of proposals for
the same value within expected time O(2^n).
Assume that no node receives a majority of proposals for the same value.
In such a round, some nodes may update their value to v based on a proposal
(Line 17). As shown before, all nodes that update the value based on a proposal,
adapt the same value v. The rest of the nodes chooses 0 or 1 randomly. The
probability that all nodes choose the same value v in one round is hence at
least 1/2^n. Therefore, the expected number of rounds is bounded by O(2^n). As
every round consists of two message exchanges, the asymptotic runtime of the
algorithm is equal to the number of rounds.
Theorem 16.20. Algorithm 16.15 achieves binary consensus with expected
runtime O(2^n) if up to f < n/2 nodes crash.
Proof. Assume that there is an algorithm that can handle f = n/2 many
failures. We partition the set of all nodes into two sets N, N′, both containing
n/2 many nodes. Let us look at three different selections of input values: In V0
all nodes start with 0. In V1 all nodes start with 1. In Vhalf all nodes in N start
with 0, and all nodes in N′ start with 1.
Assume that nodes start with Vhalf . Since the algorithm must solve consensus
independent of the scheduling of the messages, we study the scenario where
all messages sent from nodes in N to nodes in N 0 (or vice versa) are heavily
delayed. Note that the nodes in N cannot determine if they started with V0 or
Vhalf. Analogously, the nodes in N′ cannot determine if they started with V1 or
Vhalf. Hence, if the algorithm terminates before any message from the other set
is received, N must decide for 0 and N′ must decide for 1 (to satisfy the validity
requirement, as they could have started with V0 respectively V1). Therefore,
the algorithm would fail to reach agreement.
The only possibility to overcome this problem is to wait for at least one
message sent from a node of the other set. However, as f = n/2 many nodes
can crash, the entire other set could have crashed before they sent any message.
In that case, the algorithm would wait forever and therefore not satisfy the
termination requirement.
3: Wait for n − f coins and store them in the local coin set Cu
4: Broadcast mySet(Cu )
5: Wait for n − f coin sets
6: if at least one coin is 0 among all coins in the coin sets then
7: return 0
8: else
9: return 1
10: end if
Remarks:
• Since at most f nodes crash, all nodes will always receive n − f coins
respectively coin sets in Lines 3 and 5. Therefore, all nodes make
progress and termination is guaranteed.
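A small randomized simulation of this exchange may help; it uses random rather than worst-case scheduling, models no crashes, and all names are made up, so it only illustrates the probabilities involved and is not Algorithm 16.22 itself.

import random

def shared_coin_run(n, f):
    # Lines 1-2 (not shown above): each node sets its local coin to 0 with
    # probability 1/n, otherwise to 1, and broadcasts it.
    coin = [0 if random.random() < 1 / n else 1 for _ in range(n)]
    # Line 3: each node receives the coins of some n - f nodes.
    coin_set = [set(random.sample(range(n), n - f)) for _ in range(n)]
    # Lines 4-5: each node receives n - f coin sets and merges them.
    outcomes = []
    for u in range(n):
        senders = random.sample(range(n), n - f)
        seen = set().union(*(coin_set[v] for v in senders))
        # Lines 6-10: return 0 if any seen coin is 0, otherwise 1.
        outcomes.append(0 if any(coin[v] == 0 for v in seen) else 1)
    return outcomes

runs = [shared_coin_run(n=30, f=9) for _ in range(2000)]
print(sum(all(o == 0 for o in r) for r in runs) / len(runs))  # all-0 fraction
print(sum(all(o == 1 for o in r) for r in runs) / len(runs))  # all-1 fraction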
Lemma 16.23. Let u be a node, and let W be the set of coins that u received
in at least f + 1 different coin sets. It holds that |W | ≥ f + 1.
|C| ≤ f · (n − f) + (n − f) · f = 2f(n − f).
Our assumption was that n > 3f, i.e., n − f > 2f. Therefore |C| ≤ 2f(n − f) <
(n − f)^2 = |C|, which is a contradiction.
Theorem 16.25. If f < n/3 nodes crash, Algorithm 16.22 implements a shared
coin.
Proof. Let us first bound the probability that the algorithm returns 1 for all
nodes. With probability (1 − 1/n)^n ≈ 1/e ≈ 0.37 all nodes choose their local
coin equal to 1 (Line 1), and in that case 1 will be decided. This is only a lower
bound on the probability that all nodes return 1, as there are also other scenarios
based on message scheduling and crashes which lead to a global decision for 1.
But a probability of 0.37 is good enough, so we do not need to consider these
scenarios.
With probability 1 − (1 − 1/n)^|W| there is at least one 0 in W. Using
Lemma 16.23 we know that |W| ≥ f + 1 ≈ n/3, hence this probability is about
1 − (1 − 1/n)^(n/3) ≈ 1 − (1/e)^(1/3) ≈ 0.28. We know that this 0 is seen by all
nodes (Lemma 16.24), and hence everybody will decide 0. Thus Algorithm
16.22 implements a shared coin.
Theorem 16.26. Plugging Algorithm 16.22 into Algorithm 16.15 we get a ran-
domized consensus algorithm which terminates in a constant expected number
of rounds tolerating up to f < n/3 crash failures.
Chapter Notes
The problem of two friends arranging a meeting was presented and studied under
many different names; nowadays, it is usually referred to as the Two Generals
Problem. The impossibility proof was established in 1975 by Akkoyunlu et
al. [AEH75].
The proof that there is no deterministic algorithm that always solves con-
sensus is based on the proof of Fischer, Lynch and Paterson [FLP85], known as
FLP, which they established in 1985. This result was awarded the 2001 PODC
Influential Paper Award (now called Dijkstra Prize). The idea for the ran-
domized consensus algorithm was originally presented by Ben-Or [Ben83]. The
concept of a shared coin was introduced by Bracha [Bra87]. The shared coin
algorithm in this chapter was proposed by [AW04] and it assumes randomized
scheduling. A shared coin that can withstand worst-case scheduling has been
developed by Alistarh et al. [AAKS14]; this shared coin was inspired by earlier
shared coin solutions in the shared memory model [Cha96].
Apart from randomization, there are other techniques to still get consensus.
One possibility is to drop asynchrony and rely on time more, e.g. by assuming
that message delays are usually bounded, as in the timed asynchronous model
[CF98], or by augmenting the nodes with failure detectors [CT96].
Bibliography
[AAKS14] Dan Alistarh, James Aspnes, Valerie King, and Jared Saia.
Communication-efficient randomized consensus. In 28th Interna-
tional Symposium of Distributed Computing (DISC), Austin, TX,
USA, October 12-15, 2014, pages 61–75, 2014.
[AEH75] EA Akkoyunlu, K Ekanadham, and RV Huber. Some constraints
and tradeoffs in the design of network communications. In ACM
SIGOPS Operating Systems Review, volume 9, pages 67–74. ACM,
1975.
[AW04] Hagit Attiya and Jennifer Welch. Distributed Computing: Funda-
mentals, Simulations and Advanced Topics (2nd edition). John Wi-
ley Interscience, March 2004.
[Ben83] Michael Ben-Or. Another advantage of free choice (extended ab-
stract): Completely asynchronous agreement protocols. In Proceed-
ings of the second annual ACM symposium on Principles of distrib-
uted computing, pages 27–30. ACM, 1983.
[Bra87] Gabriel Bracha. Asynchronous byzantine agreement protocols. In-
formation and Computation, 75(2):130–143, 1987.
[CF98] Flaviu Cristian and Christof Fetzer. The timed asynchronous dis-
tributed system model. In Digest of Papers: FTCS-28, The Twenty-
Eighth Annual International Symposium on Fault-Tolerant Comput-
ing, Munich, Germany, June 23-25, 1998, pages 140–149, 1998.
[Cha96] Tushar Deepak Chandra. Polylog randomized wait-free consensus. In
Proceedings of the Fifteenth Annual ACM Symposium on Principles
of Distributed Computing, Philadelphia, Pennsylvania, USA, pages
166–175, 1996.
[CT96] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors
for reliable distributed systems. J. ACM, 43(2):225–267, 1996.
Chapter 17
Byzantine Agreement
Remarks:
• As for consensus (Definition 16.1) we also need agreement, termination
and validity. Agreement and termination are straight-forward, but
what about validity?
17.1 Validity
Definition 17.3 (Any-Input Validity). The decision value must be the input
value of any node.
Remarks:
• This is the validity definition we used for consensus, in Definition 16.1.
• Does this definition still make sense in the presence of byzantine
nodes? What if byzantine nodes lie about their inputs?
• We would wish for a validity definition that differentiates between
byzantine and correct inputs.
Definition 17.4 (Correct-Input Validity). The decision value must be the input
value of a correct node.
Remarks:
• Unfortunately, implementing correct-input validity does not seem to
be easy, as a byzantine node following the protocol but lying about
its input value is indistinguishable from a correct node. Here is an
alternative.
Definition 17.5 (All-Same Validity). If all correct nodes start with the same
input v, the decision value must be v.
Remarks:
• If the decision values are binary, then correct-input validity is induced
by all-same validity.
• If the input values are not binary, but for example from sensors that
deliver values in R, all-same validity is in most scenarios not really
useful.
Definition 17.6 (Median Validity). If the input values are orderable, e.g.
v ∈ R, byzantine outliers can be prevented by agreeing on a value close to the
median of the correct input values – how close depends on the number of byzan-
tine nodes f .
Remarks:
• Is byzantine agreement possible? If yes, with what validity condition?
• Let us try to find an algorithm which tolerates a single byzantine node,
first restricting ourselves to the so-called synchronous model.
Model 17.7 (synchronous). In the synchronous model, nodes operate in
synchronous rounds. In each round, each node may send a message to the
other nodes, receive the messages sent by the other nodes, and do some local
computation.
Definition 17.8 (synchronous runtime). For algorithms in the synchronous
model, the runtime is simply the number of rounds from the start of the ex-
ecution to its completion in the worst case (every legal input, every execution
scenario).
Round 1
2: Send tuple(u, x) to all other nodes
3: Receive tuple(v, y) from all other nodes v
4: Store all received tuple(v, y) in a set Su
Round 2
5: Send set Su to all other nodes
6: Receive sets Sv from all nodes v
7: T = set of tuple(v, y) seen in at least two sets Sv , including own Su
8: Let tuple(v, y) ∈ T be the tuple with the smallest value y
9: Decide on value y
Remarks:
• Byzantine nodes may not follow the protocol and send syntactically
incorrect messages. Such messages can easily be detected and dis-
carded. It is worse if byzantine nodes send syntactically correct mes-
sages, but with bogus content, e.g., they send different messages to
different nodes.
• Some of these mistakes cannot easily be detected: For example, if a
byzantine node sends different values to different nodes in the first
round; such values will be put into Su . However, some mistakes can
and must be detected: Observe that all nodes only relay information
in Round 2, and do not repeat their own value. So, if a byzantine
node sends a set Sv which contains a tuple(v, y), this tuple must be
removed by u from Sv upon receiving it (Line 6).
• Recall that we assumed that nodes cannot forge their source address;
thus, if a node receives tuple(v, y) in Round 1, it is guaranteed that
this message was sent by v.
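For n = 4 and f = 1, the two rounds of Algorithm 17.9 can be simulated in a few lines of Python. This is only an illustration with made-up names; the byzantine node here equivocates in Round 1 and stays silent in Round 2 (it could also send arbitrary sets).

def simulate(inputs, byz, byz_round1):
    """inputs[v] is node v's input value; byz is the byzantine node, and
    byz_round1[u] is the value it sends to node u in Round 1."""
    n = len(inputs)
    correct = [u for u in range(n) if u != byz]
    # Round 1: S[u][v] = value that node u received from node v.
    S = {u: {v: (byz_round1[u] if v == byz else inputs[v])
             for v in range(n) if v != u}
         for u in correct}
    # Round 2: every correct node relays its set S[u]; node u keeps the
    # tuples seen in at least two sets (including its own) in T and decides
    # on the smallest value in T.
    decisions = {}
    for u in correct:
        count = {}
        for S_w in [S[u]] + [S[w] for w in correct if w != u]:
            for v, y in S_w.items():
                count[(v, y)] = count.get((v, y), 0) + 1
        T = [(v, y) for (v, y), c in count.items() if c >= 2]
        decisions[u] = min(y for _, y in T)
    return decisions

# The byzantine node 3 sends inconsistent values in Round 1, yet all correct
# nodes end up with the same set T and decide on the same value.
print(simulate([0, 1, 0, None], byz=3, byz_round1={0: 0, 1: 1, 2: 0}))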
Lemma 17.10. If n ≥ 4, all correct nodes have the same set T .
Proof. With f = 1 and n ≥ 4 we have at least 3 correct nodes. A correct node
will see every correct value at least twice, once directly from another correct
node, and once through the third correct node. So all correct values are in T .
If the byzantine node sends the same value to at least 2 other (correct) nodes,
all correct nodes will see the value twice, so all add it to set T . If the byzantine
node sends all different values to the correct nodes, none of these values will
end up in any set T .
Theorem 17.11. Algorithm 17.9 reaches byzantine agreement if n ≥ 4.
Proof. We need to show agreement, any-input validity and termination. With
Lemma 17.10 we know that all correct nodes have the same set T , and therefore
agree on the same minimum value. The nodes agree on a value proposed by any
node, so any-input validity holds. Moreover, the algorithm terminates after two
rounds.
Remarks:
• If n > 4 the byzantine node can put multiple values into T .
• Algorithm 17.9 only provides any-input validity, which is questionable
in the byzantine context: Assume a byzantine node sends different
values to different nodes; what is its input value in that case?
• Algorithm 17.9 can be slightly modified to achieve all-same validity
by choosing the smallest value that occurs at least twice.
• The idea of this algorithm can be generalized for any f and n >
3f. In the generalization, every node sends, in each of f + 1 rounds,
all the information it has learned so far to all other nodes. In other
words, the message size increases exponentially with f.
• Does Algorithm 17.9 also work with n = 3?
Theorem 17.12. Three nodes cannot reach byzantine agreement with all-same
validity if one node among them is byzantine.
Proof. We will assume that the three nodes satisfy all-same validity and show
that they will violate the agreement condition under this assumption.
In order to achieve all-same validity, nodes have to deterministically decide
for a value x if it is the input value of every correct node. Recall that a Byzantine
node which follows the protocol is indistinguishable from a correct node. Assume
a correct node sees that n−f nodes including itself have an input value x. Then,
by all-same validity, this correct node must deterministically decide for x.
In the case of three nodes (n − f = 2), a node has to decide on its own
input value if another node has the same input value. Let us call the three
nodes u, v and w. If correct node u has input 0 and correct node v has input
1, the byzantine node w can fool them by telling u that its value is 0 and
simultaneously telling v that its value is 1. By all-same validity, this leads to u
and v deciding on two different values, which violates the agreement condition.
Even if u talks to v, and they figure out that they have different assumptions
about w’s value, u cannot distinguish whether w or v is byzantine.
Theorem 17.13. A network with n nodes cannot reach byzantine agreement
with f ≥ n/3 byzantine nodes.
Proof. Assume (for the sake of contradiction) that there exists an algorithm
A that reaches byzantine agreement for n nodes with f ≥ ⌈n/3⌉ byzantine
nodes. We will show that A cannot satisfy all-same validity and agreement
simultaneously.
Let us divide the n nodes into three groups of size n/3 (either ⌊n/3⌋ or
⌈n/3⌉ if n is not divisible by 3). Assume that one group of size ⌈n/3⌉ ≥ n/3
contains only byzantine nodes and the other two groups only correct nodes. Let
one group of correct nodes start with input value 0 and the other with input
value 1. As in the proof of Theorem 17.12, the group of byzantine nodes supports the input
value of each node, so each correct node observes at least n − f nodes who
support its own input value. Because of all-same validity, every correct node
has to deterministically decide on its own input value. Since the two groups
of correct nodes had different input values, the nodes will decide on different
values respectively, thus violating the agreement property.
Vote
3: Broadcast value(x)
Propose
4: if some value(y) received at least n − f times then
5: Broadcast propose(y)
6: end if
7: if some propose(z) received more than f times then
8: x=z
9: end if
King
10: Let node vi be the predefined king of this phase i
11: The king vi broadcasts its current value w
12: if received strictly less than n − f propose(y) then
13: x=w
14: end if
15: end for
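A per-node sketch of one phase of the King algorithm above, written as a function of the messages a node has received. It is only an illustration: the function name, the message representation, and the surrounding loop over the f + 1 phases are not part of the script's pseudocode.

def king_phase(x, proposals, king_value, n, f):
    """x: the node's current value; proposals: the propose(z) messages this
    node received; king_value: the value w broadcast by this phase's king.
    Returns the node's value at the end of the phase."""
    # Lines 4-6 (not modeled here): the node would broadcast propose(y) if it
    # received some value(y) at least n - f times in the vote step.
    # Lines 7-9: adopt a value proposed more than f times; at most one such
    # value can exist, because correct nodes all propose the same value.
    for z in set(proposals):
        if proposals.count(z) > f:
            x = z
    # Lines 10-14: a node that received fewer than n - f propose messages for
    # that value cannot be sure and adopts the king's value instead.
    if max([proposals.count(z) for z in set(proposals)] + [0]) < n - f:
        x = king_value
    return x

# With n = 4 and f = 1: a node that saw three propose(0) messages keeps 0 even
# if the (possibly byzantine) king announces 1.
print(king_phase(x=0, proposals=[0, 0, 0], king_value=1, n=4, f=1))   # 0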
Proof. If all correct nodes start with the same value, all correct nodes propose it
in Line 5. All correct nodes will receive at least n − f proposals, i.e., all correct
nodes will stick with this value, and never change it to the king’s value. This
holds for all phases.
Proof. Assume (for the sake of contradiction) that a correct node proposes value
x and another correct node proposes value y. Since a good node only proposes
a value if it heard at least n − f value messages, we know that both nodes must
have received their value from at least n − 2f distinct correct nodes (as at most
f nodes can behave byzantine and send x to one node and y to the other one).
Hence, there must be a total of at least 2(n − 2f ) + f = 2n − 3f nodes in the
system. Using 3f < n, we have 2n − 3f > n nodes, a contradiction.
Remarks:
• Algorithm 17.14 requires f + 1 predefined kings. We assume that the
kings (and their order) are given. Finding the kings indeed would be
a byzantine agreement task by itself, so this must be done before the
execution of the King algorithm.
• Do algorithms exist which do not need predefined kings? Yes, see
Section 17.5.
• Can we solve byzantine agreement (or at least consensus) in less than
f + 1 rounds?
Remarks:
• A general proof without the restriction to decide for the minimum
value exists as well.
• Since byzantine nodes can also just crash, this lower bound also holds
for byzantine agreement, so Algorithm 17.14 has an asymptotically
optimal runtime.
• So far all our byzantine agreement algorithms assume the synchronous
model. Can byzantine agreement be solved in the asynchronous model?
Lemma 17.22. Let a correct node choose value x in Line 10; then no other
correct node chooses value y ≠ x in Line 10.
Proof. For the sake of contradiction, assume that both 0 and 1 are chosen
in Line 10. This means that both 0 and 1 had been proposed by at least
n/2 + 1 out of n − f correct nodes. In other words, we have a total of at least
2 · n/2 + 2 = n + 2 > n − f correct nodes. Contradiction!
Theorem 17.23. Algorithm 17.21 solves binary byzantine agreement as in Def-
inition 17.2 for up to f < n/10 byzantine nodes.
Proof. First note that it is not a problem to wait for n − f propose messages in
Line 5, since at most f nodes are byzantine. If all correct nodes have the same
input value x, then all (except the f byzantine nodes) will propose the same
value x. Thus, every node receives at least n − 2f propose messages containing
x. Observe that for f < n/10, we get n − 2f > n/2 + 3f and the nodes will
decide on x in the first round already. We have established all-same validity!
If the correct nodes have different (binary) input values, the validity condition
becomes trivial as any result is fine.
What about agreement? Let u be the first node to decide on value x (in
Line 8). Due to asynchrony, another node v received messages from a different
subset of the nodes, however, at most f senders may be different. Taking
into account that byzantine nodes may lie (send different propose messages to
different nodes), f additional propose messages received by v may differ from
those received by u. Since node u had at least n/2 + 3f + 1 propose messages
with value x, node v has at least n/2 + f + 1 propose messages with value x.
Hence every correct node will propose x in the next round and then decide on
x.
So we only need to worry about termination: We have already seen that
as soon as one correct node terminates (Line 8) everybody terminates in the
next round. So what are the chances that some node u terminates in Line 8?
Well, we can hope that all correct nodes randomly propose the same value (in
Line 12). Maybe there are some nodes not choosing randomly (entering Line 10
instead of 12), but according to Lemma 17.22 they will all propose the same.
Thus, at worst all n − f correct nodes need to randomly choose the same bit,
which happens with probability 2^(−(n−f)+1). If so, all correct nodes will send the
same propose message, and the algorithm terminates. So the expected running
time is exponential in the number of nodes n in the worst case.
Remarks:
• Local coinflips are responsible for the slow runtime of Algorithms 17.21
and 16.15. Is there a simple way to replace the local coinflips by
randomness that does not cause exponential runtime?
Remarks:
• Algorithm 17.25, as well as the upcoming Algorithm 17.28 will be
called in Line 12 of Algorithm 17.21. So instead of every node throwing
a local coin (and hoping that they all show the same), the nodes will
base their random decision on the proposed algorithm.
Theorem 17.26. Algorithm 17.25 plugged into Algorithm 17.21 solves asyn-
chronous byzantine agreement in expected constant number of rounds.
Proof. If there is a large majority for one of the input values in the system, all
nodes will decide within two rounds since Algorithm 17.21 satisfies all-same-
validity; the coin is not even used.
If there is no significant majority for any of the input values at the beginning
of Algorithm 17.21, all correct nodes will run Algorithm 17.25. Therefore, they
will set their new value to the bit given by the random oracle and terminate in
the following round.
If neither of the above cases holds, some of the nodes see an n/2 + f + 1
majority for one of the input values, while other nodes rely on the oracle. With
probability 1/2, the value of the oracle will coincide with the deterministic ma-
jority value of the other nodes. Therefore, with probability 1/2, the nodes will
terminate in the following round. The expected number of rounds for termina-
tion in this case is 3.
Remarks:
• Unfortunately, random oracles are a bit like pink fluffy unicorns: they
do not really exist in the real world. Can we fix that?
Definition 17.27 (Random Bitstring). A random bitstring is a string of
random binary values, known to all participating nodes when starting a protocol.
Remarks:
• But is such a precomputed bitstring really random enough? We should
be worried because of Theorem 16.14.
Theorem 17.29. If the scheduling is worst-case, Algorithm 17.28 plugged into
Algorithm 17.21 does not terminate.
Proof. We start Algorithm 17.28 with the following input: n/2 + f + 1 nodes
have input value 1, and n/2 − f − 1 nodes have input value 0. Assume w.l.o.g.
that the first bit of the random bitstring is 0.
If the second random bit in the bitstring is also 0, then a worst-case scheduler
will let n/2 + f + 1 nodes see all n/2 + f + 1 values 1, these will therefore
deterministically choose the value 1 as their new value. Because of scheduling
(or byzantine nodes), the remaining n/2 − f − 1 nodes receive strictly less than
n/2 + f + 1 values 1 and therefore have to rely on the value of the shared coin,
which is 0. The nodes will not come to a decision in this round. Moreover, we
have created the very same distribution of values for the next round (which has
also random bit 0).
If the second random bit in the bitstring is 1, then a worst-case scheduler can
let n/2 − f − 1 nodes see all n/2 + f + 1 values 1, and therefore deterministically
choose the value 1 as their new value. Because of scheduling (or byzantine
nodes), the remaining n/2 + f + 1 nodes receive strictly less than n/2 + f + 1
values 1 and therefore have to rely on the value of the shared coin, which is 0.
The nodes will not decide in this round. And we have created the symmetric
situation for input value 1 that is coming in the next round.
So if the current and the next random bit are known, worst-case scheduling
will keep the system in one of two symmetric states that never decide.
Remarks:
• Note that in the proof of Theorem 17.29 we did not even use any
byzantine nodes. Just bad scheduling was enough to prevent termi-
nation.
Chapter Notes
The project which started the study of byzantine failures was called SIFT and
was funded by NASA [WLG+78], and the research regarding byzantine agree-
ment started to get significant attention with the results by Pease, Shostak, and
Lamport [PSL80, LSP82]. In [PSL80] they presented the generalized version
of Algorithm 17.9 and also showed that byzantine agreement is unsolvable for
n ≤ 3f . The algorithm presented in that paper is nowadays called Exponential
Information Gathering (EIG), due to the exponential size of the messages.
There are many algorithms for the byzantine agreement problem. For example,
the Queen Algorithm [BG89] has a better runtime than the King
algorithm [BGP89], but tolerates fewer failures. That byzantine agreement re-
quires at least f + 1 many rounds was shown by Dolev and Strong [DS83],
based on a more complicated proof from Fischer and Lynch [FL82].
While many algorithms for the synchronous model have been around for a
long time, the asynchronous model is a lot harder. The only results were by
Ben-Or and Bracha. Ben-Or [Ben83] was able to tolerate f < n/5. Bracha
[BT85] improved this tolerance to f < n/3.
Nearly all developed algorithms only satisfy all-same validity. There are a
few exceptions, e.g., correct-input validity [FG03], available if the initial values
are from a finite domain, median validity [SW15, MW18, DGM+ 11] if the input
values are orderable, or values inside the convex hull of all correct input values
[VG13, MH13, MHVG15] if the input is multidimensional.
Before the term byzantine was coined, the terms Albanian Generals or Chi-
nese Generals were used in order to describe malicious behavior. When the
involved researchers met people from these countries they moved – for obvious
reasons – to the historic term byzantine [LSP82].
Hat tip to Peter Robinson for noting how to improve Algorithm 17.9 to all-
same validity. This chapter was written in collaboration with Barbara Keller.
Bibliography
[Ben83] Michael Ben-Or. Another advantage of free choice (extended ab-
stract): Completely asynchronous agreement protocols. In Proceed-
ings of the second annual ACM symposium on Principles of distrib-
uted computing, pages 27–30. ACM, 1983.
[DGM+ 11] Benjamin Doerr, Leslie Ann Goldberg, Lorenz Minder, Thomas
Sauerwald, and Christian Scheideler. Stabilizing Consensus with the
Power of Two Choices. In Proceedings of the Twenty-third Annual
ACM Symposium on Parallelism in Algorithms and Architectures,
SPAA, June 2011.
[FL82] Michael J. Fischer and Nancy A. Lynch. A lower bound for the time
to assure interactive consistency. Information Processing Letters,
14(4):183–186, June 1982.
Chapter 18
Broadcast & Shared Coins
Lemma 18.2. Algorithm 16.22 has exponential expected running time under
worst-case scheduling.
Remarks:
• We assume that the nodes cannot reconstruct the order in which the
messages are written to the blackboard since the system is asynchro-
nous.
Remarks:
• The sign function is used for the decision values. The sign function
returns +1 if the sum of all coinflips in C is positive, and −1 if it is
negative.
• If a node does not need to wait for other nodes, we call the algorithm
wait-free.
where Φ(z) is the cumulative distribution function of the standard normal dis-
tribution evaluated at z.
Proof. Each node in the algorithm terminates once at least n^2 coinflips are
written to the blackboard. Before terminating, nodes may write one additional
coinflip. Therefore, every node decides after reading at least n^2 and at most
n^2 + n − 1 coinflips. The power of the adversary lies in the fact that it can
prevent n − 1 nodes from writing their coinflips to the blackboard by delaying
their writes. Here, we will consider an even stronger adversary that can hide up
to n coinflips which were written on the blackboard.
We need to show that both outcomes for the shared coin (+1 or −1 in Line
6) will occur with constant probability, as in Definition 18.1. Let X be the sum
of all coinflips that are visible to every node. Since some of the nodes might read
n more values from the blackboard than others, the nodes cannot be prevented
from deciding if |X| > n. By applying Theorem 18.5 with N = n^2 and z = 1,
we get:
Pr(X ≤ −n) = Pr(X ≥ n) = 1 − Φ(1) > 0.15.
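A toy version of this blackboard coin, for illustration only: the scheduling below is random rather than worst-case, writes are never hidden, and a tie in the sum is broken towards −1.

import random

def blackboard_coin(n):
    blackboard = []                 # the shared blackboard of coinflips
    outcomes = [None] * n
    while any(o is None for o in outcomes):
        u = random.randrange(n)     # the "scheduler" picks a node to run
        if outcomes[u] is not None:
            continue
        blackboard.append(random.choice([-1, 1]))   # flip and write one coin
        if len(blackboard) >= n * n:                # at least n^2 coins seen
            outcomes[u] = 1 if sum(blackboard) > 0 else -1  # sign of the sum
    return outcomes

print(blackboard_coin(10))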
Lemma 18.7. Algorithm 18.4 uses n^2 coinflips, which is optimal in this model.
Proof. The proof showing the quadratic lower bound makes use of configurations
that are indistinguishable to all nodes, similar to Theorem 16.14. It requires
involved stochastic methods, and we will therefore only sketch the idea of where
the n^2 comes from.
The basic idea follows from Theorem 18.5. The standard deviation of the
sum of n^2 coinflips is n. The central limit theorem tells us that with constant
probability the sum of the coinflips will be only a constant factor away from
the standard deviation. As we showed in Theorem 18.6, this is large enough
to disarm a worst-case scheduler. However, with much less than n^2 coinflips, a
worst-case scheduler is still too powerful. If it sees a positive sum forming on
the blackboard, it delays messages trying to write +1 in order to turn the sum
temporarily negative, so the nodes finishing first see a negative sum, and the
delayed nodes see a positive sum.
Remarks:
• Note that best-effort broadcast is equivalent to the simple broadcast
primitive that we have used so far.
• Reliable broadcast is a stronger paradigm which implies that byzantine
nodes cannot send different values to different nodes. Such behavior
will be detected.
Definition 18.10 (Reliable Broadcast). Reliable broadcast ensures that the
nodes eventually agree on all accepted messages. That is, if a correct node v
considers message m as accepted, then every other node will eventually consider
message m as accepted.
to forge an incorrect sender address, see Definition 17.1. Instead, they can echo
messages from correct nodes with a wrong input value. If all byzantine nodes
echo a message that has not been broadcast by a correct node, each correct
node will receive at most f < n − 2f echo messages and thus no correct node
will accept such a message.
For the third property, assume that some message originated from a byzan-
tine node b, or a node b that has crashed in the process of sending its message.
If a correct node accepted message msg(b), this node must have received at least
n − f echoes for this message in Line 5. If at most f nodes are faulty, at least
n−2f correct nodes must have broadcast an echo message for msg(b). Therefore,
every correct node will receive these n − 2f echoes eventually and will broadcast
an echo. Finally, all n − f correct nodes will have broadcast an echo for msg(b)
and every correct node will accept msg(b).
Remarks:
• Algorithm 18.11 does not solve consensus according to Definition 16.1.
It only makes sure that all messages of correct nodes will be accepted
eventually. For correct nodes, this corresponds to sending and receiv-
ing messages in the asynchronous model (Model 16.2).
• The algorithm has a linear message overhead since every node again
broadcasts every message.
• Note that byzantine nodes can issue arbitrarily many messages. This
may be a problem for protocols where each node is only allowed to
send one message (per round). Can we fix this, for instance with
sequence numbers?
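An event-handler sketch of the echo-based reliable broadcast discussed above. It is only an approximation for illustration: the class name, message format and the in-process demo are made up, and the thresholds follow the proof sketch of Theorem 18.12 (echo once the message or n − 2f echoes are seen, accept after n − f echoes).

class ReliableBroadcast:
    def __init__(self, n, f, node_id, send_to_all):
        self.n, self.f, self.id = n, f, node_id
        self.send_to_all = send_to_all   # function that broadcasts a message
        self.echoed = set()              # messages this node already echoed
        self.accepted = set()            # messages this node accepted
        self.echo_from = {}              # message -> set of echoing nodes

    def broadcast(self, m):
        self.send_to_all(("msg", self.id, m))

    def on_msg(self, sender, m):
        self._echo((sender, m))          # echo the sender's message once

    def on_echo(self, sender, key):
        peers = self.echo_from.setdefault(key, set())
        peers.add(sender)
        if len(peers) >= self.n - 2 * self.f:   # enough echoes: join in
            self._echo(key)
        if len(peers) >= self.n - self.f:       # n - f echoes: accept
            self.accepted.add(key)

    def _echo(self, key):
        if key not in self.echoed:
            self.echoed.add(key)
            self.send_to_all(("echo", self.id, key))

# Tiny single-process demo with 4 correct nodes and instant delivery.
nodes = []
def deliver(msg):
    kind, sender, payload = msg
    for v in nodes:
        (v.on_msg if kind == "msg" else v.on_echo)(sender, payload)
nodes.extend(ReliableBroadcast(4, 1, i, deliver) for i in range(4))
nodes[0].broadcast("hello")
print([(0, "hello") in v.accepted for v in nodes])   # all nodes accept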
Definition 18.13 (FIFO Reliable Broadcast). The FIFO (reliable) broad-
cast defines an order in which the messages are accepted in the system. If a
node u broadcasts message m1 before m2 , then any node v will accept message
m1 before m2 .
Proof. Just as reliable broadcast, Algorithm 18.14 satisfies the three properties
of Theorem 18.12 by simply following the flow of messages of a correct node.
It remains to show that at most one message will be accepted from some node
v in round r. In the crash failure case, this property holds because all nodes
follow the algorithm and therefore send at most one message in a round. For
the byzantine case, assume some correct node u has accepted msg(v, r) in Line
7. This node must have received n − f echo messages for this message, n − 2f
of which were sent from the correct nodes. At least n − 2f − f = n − 3f of
those messages are sent for the first time by correct nodes. Now, assume for
contradiction that another correct node accepts msg′(v, r). Similarly, n − 3f
of those messages are sent for the first time by correct nodes. So, we would have
n − 3f + n − 3f > n − f (for f < n/5) correct nodes sending an echo for the first
time, a contradiction.
Remarks:
• Definition 18.16 is equivalent to Definition 15.8, i.e., atomic broadcast
= state replication.
• Now we have all the tools to finally solve asynchronous consensus.
Definition 18.20 (Signature). Every node can sign its messages in a way
that no other node can forge, thus nodes can reliably determine which node a
signed message originated from. We denote a message x signed by node u with
msg(x)u .
Remarks:
• Note that the communication between the dealer and the nodes must
be private, i.e., a byzantine party cannot see the shares sent to the
correct nodes.
Algorithm 18.22 Preprocessing Step for Algorithm 18.23 (code for dealer d)
1: According to Algorithm 18.21, choose polynomial p of degree f
2: for i = 1, . . . , n do
3: Choose coinflip ci , where ci = 0 with probability 1/2, else ci = 1
4: Using Algorithm 18.21, generate n shares (x^i_1, p(x^i_1)), . . . , (x^i_n, p(x^i_n)) for c_i
5: end for
6: Send shares msg(x^1_u, p(x^1_u))_d, . . . , msg(x^n_u, p(x^n_u))_d to node u
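The (f + 1)-out-of-n secret sharing the dealer relies on (Algorithm 18.21, not reproduced in this excerpt) can be sketched as classic polynomial secret sharing. The code below is a minimal illustration under that assumption and omits the signatures and the private channels to the nodes.

import random

P = 2**31 - 1          # a prime; all arithmetic is done modulo P

def share(secret, n, f):
    """Hide the secret as the constant term of a random degree-f polynomial
    and hand out one point of the polynomial per node."""
    coeffs = [secret] + [random.randrange(P) for _ in range(f)]
    def p(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, p(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 using any f + 1 shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + yj * num * pow(den, P - 2, P)) % P
    return secret

n, f = 10, 3
shares = share(secret=1, n=n, f=f)
print(reconstruct(shares[:f + 1]))        # any f + 1 shares suffice -> 1
print(reconstruct(shares[5:5 + f + 1]))   # a different subset also gives 1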
Theorem 18.24. Algorithm 17.21 together with Algorithm 18.22 and Algorithm
18.23 solves asynchronous byzantine agreement for f < n/10 in an expected
number of 3 rounds.
Proof. In Line 2 of Algorithm 18.23, the nodes collect shares from f + 1 nodes.
Since a byzantine node cannot forge the signature of the dealer, it is restricted
to either send its own share or decide to not send it at all. Therefore, each
correct node will eventually be able to reconstruct secret ci of round i correctly
in Line 3 of the algorithm. The running time analysis follows then from the
analysis of Theorem 17.26.
Remarks:
• In Line 3 of Algorithm 18.26, each node can verify the correctness of the
signed message using the public key.
Theorem 18.27. Algorithm 18.26 plugged into Algorithm 17.21 solves syn-
chronous byzantine agreement in expected 3 rounds (roughly) for up to f < n/10
byzantine failures.
Proof. With probability 1/10 the minimum hash value is generated by a byzan-
tine node. In such a case, we can assume that not all correct nodes will receive
the byzantine value and thus, different nodes might compute different values for
the shared coin.
With probability 9/10, the shared coin will be from a correct node, and
with probability 1/2 the value of the shared coin will correspond to the value
which was deterministically chosen by some of the correct nodes. Therefore,
with probability 9/20 the nodes will reach consensus in the next iteration of
Algorithm 17.21. Thus, the expected number of rounds is around 3 (the expected
number of iterations until a lucky round is 20/9, plus one more iteration to
terminate).
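Algorithm 18.26 itself is not reproduced in this excerpt, so the following Python fragment is only a rough sketch of the idea the proof uses: every node derives a value from a (here: faked) signature on the round number, the values are exchanged, and the lowest bit of the minimum hash serves as the shared coin. The message format and the way the bit is extracted are assumptions.

import hashlib

def node_hash(node_key: bytes, round_no: int) -> bytes:
    # Stand-in for hashing node u's signature on the round number; a real
    # implementation would hash an unforgeable signature instead.
    return hashlib.sha256(node_key + round_no.to_bytes(8, "big")).digest()

def shared_coin(node_keys, round_no):
    hashes = [node_hash(k, round_no) for k in node_keys]
    return min(hashes)[-1] & 1      # lowest bit of the minimum hash value

keys = [bytes([i]) * 16 for i in range(10)]      # toy "keys" for ten nodes
print([shared_coin(keys, r) for r in range(8)])  # one shared bit per round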
Chapter Notes
Asynchronous byzantine agreement is usually considered in one out of two com-
munication models – shared memory or message passing. The first polynomial
algorithm for the shared memory model that uses a shared coin was proposed by
Aspnes and Herlihy [AH90] and required exchanging O(n^4) messages in total.
Algorithm 18.4 is also an implementation of the shared coin in the shared mem-
ory model and it requires exchanging O(n^3) messages. This variant is due to
Saks, Shavit and Woll [SSW91]. Bracha and Rachman [BR92] later reduced the
number of messages exchanged to O(n^2 log n). The tight lower bound of Ω(n^2)
on the number of coinflips was proposed by Attiya and Censor [AC08] and
improved the first non-trivial lower bound of Ω(n^2/log^2 n) by Aspnes [Asp98].
In the message passing model, the shared coin is usually implemented using
reliable broadcast. Reliable broadcast was first proposed by Srikanth and Toueg
[ST87] as a method to simulate authenticated broadcast. There is also another
implementation which was proposed by Bracha [Bra87]. Today, a lot of variants
of reliable broadcast exist, including FIFO broadcast [AAD05], which was con-
sidered in this chapter. A good overview over the broadcast routines is given
by Cachin et al. [CGR14]. A possible way to reduce message complexity is
by simulating the read and write commands [ABND95] as in Algorithm 18.17.
The message complexity of this method is O(n^3). Alistarh et al. [AAKS14]
improved the number of exchanged messages to O(n^2 log^2 n) using a binary tree
that restricts the number of communicating nodes according to the depth of the
tree.
It remains an open question whether asynchronous byzantine agreement can
be solved in the message passing model without cryptographic assumptions.
If cryptographic assumptions are however used, byzantine agreement can be
solved in expected constant number of rounds. Algorithm 18.22 presents the
first implementation due to Rabin [Rab83] using threshold secret sharing. This
algorithm relies on the fact that the dealer provides the random bitstring. Chor
et al. [CGMA85] proposed the first algorithm where the nodes use verifiable
secret sharing in order to generate random bits. Later work focuses on improving
resilience [CR93] and practicability [CKS00]. Algorithm 18.26 by Micali [Mic18]
shows that cryptographic assumptions can also help to improve the running time
in the synchronous model.
This chapter was written in collaboration with Darya Melnyk.
Bibliography
[AAD05] Ittai Abraham, Yonatan Amit, and Danny Dolev. Optimal re-
silience asynchronous approximate agreement. In Proceedings of the
8th International Conference on Principles of Distributed Systems,
OPODIS’04, pages 229–239, Berlin, Heidelberg, 2005. Springer-
Verlag.
[AAKS14] Dan Alistarh, James Aspnes, Valerie King, and Jared Saia.
Communication-efficient randomized consensus. In Fabian Kuhn,
editor, Distributed Computing, pages 61–75, Berlin, Heidelberg,
2014. Springer Berlin Heidelberg.
[ABND95] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing mem-
ory robustly in message-passing systems. J. ACM, 42(1):124–142,
January 1995.
[AC08] Hagit Attiya and Keren Censor. Tight bounds for asynchronous
randomized consensus. J. ACM, 55(5):20:1–20:26, November 2008.
[Asp98] James Aspnes. Lower bounds for distributed coin-flipping and ran-
domized consensus. J. ACM, 45(3):415–450, May 1998.
[CKS00] Christian Cachin, Klaus Kursawe, and Victor Shoup. Random ora-
cles in constantinople: Practical asynchronous byzantine agreement
using cryptography. Journal of Cryptology, 18:219–246, 2000.
[CR93] Ran Canetti and Tal Rabin. Fast asynchronous byzantine agreement
with optimal resilience. In Proceedings of the Twenty-fifth Annual
ACM Symposium on Theory of Computing, STOC ’93, pages 42–51,
New York, NY, USA, 1993. ACM.
[Mic18] Silvio Micali. Byzantine agreement, made trivial. 2018.
[Rab83] M. O. Rabin. Randomized byzantine generals. In 24th Annual
Symposium on Foundations of Computer Science (sfcs 1983), pages
403–409, Nov 1983.
[SSW91] Michael Saks, Nir Shavit, and Heather Woll. Optimal time ran-
domized consensus – making resilient algorithms fast in practice.
In Proceedings of the Second Annual ACM-SIAM Symposium on
Discrete Algorithms, SODA ’91, pages 351–362, Philadelphia, PA,
USA, 1991. Society for Industrial and Applied Mathematics.
Chapter 19

Consistency & Logical Time

You submit a comment on your favorite social media platform using your phone.
The comment is immediately visible on the phone, but not on your laptop. Is
this level of consistency acceptable?
Remarks:
• Object is a general term for any entity that can be modified, like a
queue, stack, memory slot, file system, etc.
Remarks:
Remarks:
• In the introductory social media example, a linearizable implementa-
tion would have to make sure that the comment is immediately visible
on any device, as the read operation starts after the write operation
finishes. If the system is only sequentially consistent, the comment
does not need to be immediately visible on every device.
Definition 19.15 (restricted execution). Let E be an execution involving oper-
ations on multiple objects. For some object o we let the restricted execution
E|o be the execution E filtered to only contain operations involving object o.
Definition 19.16. A consistency model is called composable if the following
holds: If for every object o the restricted execution E|o is consistent, then also
E is consistent.
Remarks:
• Composability enables to implement, verify and execute multiple con-
current objects independently.
Lemma 19.17. Sequential consistency is not composable.
Proof. We consider an execution E with two nodes u and v, which operate on
two objects x and y initially set to 0. The operations are as follows: u1 reads
x = 1, u2 writes y := 1, v1 reads y = 1, v2 writes x := 1 with u1 < u2 on node
u and v1 < v2 on node v. It is clear that E|x as well as E|y are sequentially
consistent as the write operations may be before the respective read operations.
In contrast, execution E is not sequentially consistent: Neither u1 nor v1 can
possibly be the initial operation in any correct semantically equivalent sequential
execution S, as that would imply reading 1 when the variable is still 0.
Theorem 19.18. Linearizability is composable.
Proof. Let E be an execution composed of multiple restricted executions E|x.
For any object x there is a sequential execution S|x that is semantically con-
sistent to E|x and in which the operations are ordered according to wall-clock-
linearization points. Let S be the sequential execution ordered according to all
linearization points of all executions E|x. S is semantically equivalent to E as
S|x is semantically equivalent to E|x for all objects x and two object-disjoint
executions cannot interfere. Furthermore, if f < g in E, then the linearization
point of f comes before the linearization point of g, and therefore also f < g in S.
Remarks:
• If for two distinct operations f, g neither f → g nor g → f , then
we also say f and g are independent and write f ∼ g. Sequential
computations are characterized by → being a total order, whereas the
computation is entirely concurrent if no operations f, g with f → g
exist.
Definition 19.20 (Happened-before consistency). An execution E is called
happened-before consistent, if there is a sequence of operations S such that:
Remarks:
• In algorithms we write cu for the current logical time of node u.
• The simplest logical clock is the Lamport clock, given in Algorithm 19.24.
Every message includes a timestamp, such that the receiving node may
update its current logical time; a minimal code sketch follows below.
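A minimal Python sketch of such a Lamport clock (the class interface and the message format are illustrative assumptions, not Algorithm 19.24 verbatim):

```python
class LamportNode:
    def __init__(self):
        self.c = 0  # current logical time c_u of this node

    def local_event(self):
        self.c += 1          # every operation increments the logical clock
        return self.c

    def send(self):
        self.c += 1
        return self.c        # the message carries the sender's timestamp

    def receive(self, msg_time):
        # on receipt, fast-forward to just after the larger of the two clocks
        self.c = max(self.c, msg_time) + 1
        return self.c

u, v = LamportNode(), LamportNode()
t = u.send()      # u sends a message with timestamp 1
v.receive(t)      # v's clock jumps to 2, so the send happened before the receive
```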
Remarks:
• Lamport logical clocks are not strong logical clocks, which means we
cannot completely reconstruct → from the family of clocks cu .
Theorem 19.27. Define cu < cv if and only if cu [w] ≤ cv [w] for all entries
w, and cu [x] < cv [x] for at least one entry x. Then the vector clocks are strong
logical clocks.
Proof. We are given two operations f, g, with operation f on node u, and op-
eration g on node v, possibly v = u.
If we have f → g, then there must be a happened-before-path of operations
and messages from f to g. According to Algorithm 19.26, cv (g) must include
at least the values of the vector cu (f ), and the value cv (g)[v] > cu (f )[v].
If we do not have f → g, then cv (g)[u] cannot know about cu (f )[u], and
hence cv (g)[u] < cu (f )[u], since cu (f )[u] was incremented when executing f on
node u.
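The following Python sketch captures vector clocks along the lines described above, together with the comparison from Theorem 19.27; the node interface is an illustrative assumption:

```python
class VectorClockNode:
    def __init__(self, node_id, n):
        self.id = node_id
        self.c = [0] * n          # vector clock c_u, one entry per node

    def local_event(self):
        self.c[self.id] += 1

    def send(self):
        self.c[self.id] += 1
        return list(self.c)       # the whole vector travels with the message

    def receive(self, msg_vec):
        # entry-wise maximum, then count the receive event itself
        self.c = [max(a, b) for a, b in zip(self.c, msg_vec)]
        self.c[self.id] += 1

def happened_before(cu, cv):
    """cu < cv as in Theorem 19.27: <= in every entry and < in at least one."""
    return all(a <= b for a, b in zip(cu, cv)) and any(a < b for a, b in zip(cu, cv))

u, v = VectorClockNode(0, 2), VectorClockNode(1, 2)
m = u.send()
v.receive(m)
print(happened_before([1, 0], v.c))   # True: u's send happened before v's receive
```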
Remarks:
• In a consistent snapshot it is forbidden to see an effect without its
cause.
• Imagine a bank having lots of accounts with transactions all over the
world. The bank wants to make sure that at no point in time money
gets created or destroyed. This is where consistent snapshots come in:
They are supposed to capture the state of the system. Theoretically,
we have already used snapshots when we discussed configurations in
Definition 16.4:
Definition 19.30 (configuration). We say that a system is fully defined (at any
point during the execution) by its configuration. The configuration includes
the state of every node, and all messages that are in transit (sent but not yet
received).
Remarks:
• While a configuration describes the intractable state of a system at one
point in time, a snapshot extracts all relevant tractable information
of the system's state.
• One application of consistent snapshots is to check if certain invariants
hold in a distributed setting. Other applications include distributed
debugging or determining global states of a distributed system.
• In Algorithm 19.31 we assume that a node can record only its internal
state and the messages it sends and receives. There is no common
clock so it is not possible to just let each node record all information
at precisely the same time.
Remarks:
• It may of course happen that a node u sends a message m before
receiving the first snap message at time tu (hence not containing the
snap tag), and this message m is only received by node v after tv .
Such a message m will be reported by v, and is as such included in
the consistent snapshot (as a message that was in transit during the
snapshot).
• The number of possible consistent snapshots gives also information
about the degree of concurrency of the system.
• One extreme is a sequential computation, where stopping one node
halts the whole system. Let qu be the number of operations on node
u ∈ {1, . . . , n}. Then the number of consistent snapshots (including
the empty cut) in the sequential case is µs := 1 + q1 + q2 + · · · + qn .
• On the other hand, in an entirely concurrent computation the nodes
are not dependent on one another and therefore stopping one node
does not impact others. The number of consistent snapshots in this
case is µc := (1 + q1 ) · (1 + q2 ) · · · (1 + qn ).
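• For instance (a small illustrative calculation): with n = 2 nodes and
q1 = q2 = 2 operations each, the sequential case allows µs = 1 + 2 + 2 = 5
consistent snapshots, whereas the entirely concurrent case allows
µc = (1 + 2) · (1 + 2) = 9.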
Definition 19.33 (measure of concurrency). The concurrency measure of an
execution E = (S1, . . . , Sn) is defined as the ratio

m(E) := (µ − µs) / (µc − µs),

where µ denotes the number of consistent snapshots of E.
Remarks:
• This measure of concurrency is normalized to [0, 1].
• In order to evaluate the extent to which a computation is concurrent,
we need to compute the number of consistent snapshots µ. This can
be done via vector clocks.
Remarks:
• The advantage of using an open source definition like opentracing
is that it is easy to replace a specific tracing by another one. This
mitigates the lock-in effect that is often experienced when using some
specific technology.
• Algorithm 19.38 shows what is needed if you want to trace requests
to your system.
Remarks:
• All tracing information is collected and has to be sent to some tracing
backend which stores the traces and usually provides a frontend to
understand what is going on.
• Opentracing implementations are available for the most commonly
used programming frameworks and can therefore be used for hetero-
geneous collections of microservices.
Chapter Notes
In his seminal work, Leslie Lamport came up with the happened-before relation
and gave the first logical clock algorithm [Lam78]. This paper also laid the
foundation for the theory of logical clocks. Fidge came some time later up with
vector clocks [JF88]. An obvious drawback of vector clocks is the overhead
caused by including the whole vector. Can we do better? In general, we cannot
if we need strong logical clocks [CB91].
Lamport also introduced the algorithm for distributed snapshots, together
with Chandy [CL85]. Besides this very basic algorithm, there exist several other
algorithms, e.g., [LY87], [SK86].
Throughout the literature the definitions for, e.g., consistency or atomicity
slightly differ. These concepts are studied in different communities, e.g., lin-
earizability hails from the distributed systems community whereas the notion
of serializability was first treated by the database community. As the two areas
converged, the terminology got overloaded.
Our definitions for distributed tracing follow the OpenTracing API 1 . The
opentracing API only gives high-level definitions of how a tracing system is sup-
posed to work. Only the implementation specifies how it works internally. There
are several systems that implement these generic definitions, like Uber’s open
source tracer called Jaeger, or Zipkin, which was first developed by Twitter.
This technology is relevant for the growing number of companies that embrace
1 http://opentracing.io/documentation/
Bibliography
[CB91] Bernadette Charron-Bost. Concerning the size of logical clocks in dis-
tributed systems. Inf. Process. Lett., 39(1):11–16, July 1991.
[Lam78] Leslie Lamport. Time, clocks, and the ordering of events in a distrib-
uted system. Commun. ACM, 21(7):558–565, jul 1978.
[LY87] Ten H. Lai and Tao H. Yang. On distributed snapshots. Information
Processing Letters, 25(3):153 – 158, 1987.
[SK86] Madalene Spezialetti and Phil Kearns. Efficient distributed snapshots.
In ICDCS, pages 382–388. IEEE Computer Society, 1986.
Chapter 20

Time, Clocks & GPS
Remarks:
Definition 20.2 (Wall-Clock Time). The wall-clock time t∗ is the true time
(a perfectly accurate clock would show).
Definition 20.3 (Clock). A clock is a device which tracks and indicates time.
Remarks:
Definition 20.4 (Clock Error). The clock error or clock skew is the difference
between two clocks, e.g., t − t∗ or t − t′. In practice the clock error is often
modeled as t = (1 + δ)t∗ + ξ(t∗).
Figure 20.8: Drift (left) and Jitter (right). On top is a square wave, the wall-
clock time t∗ .
Remarks:
• Drift is relatively constant over time, but may change with supply
voltage, temperature and age of an oscillator.
• Stable clock sources, which offer a low drift, are generally preferred,
but also more expensive, larger and more power hungry, which is why
many consumer products feature inaccurate clocks.
Definition 20.6 (Parts Per Million). Clock drift is indicated in parts per mil-
lion (ppm). One ppm corresponds to a time error growth of one microsecond
per second.
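For instance (an illustrative calculation), a clock with a drift of 20 ppm may accumulate up to 20 µs of error per second, i.e., roughly 20 · 10^(−6) · 86,400 s ≈ 1.7 seconds per day.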
Remarks:
• Jitter captures all the errors that are not explained by drift. Fig-
ure 20.8 visualizes the concepts.
Remarks:
• A trade-off exists between synchronization accuracy, convergence time,
and cost.
• Different clock synchronization variants may tolerate crashing, erro-
neous or byzantine nodes.
2: while true do
3: Node u sends request to v at time tu
4: Node v receives request at time tv
5: Node v processes the request and replies at time t′v
6: Node u receives the response at time t′u
7: Propagation delay δ = ((t′u − tu) − (t′v − tv))/2 (assumption: symmetric)
8: Clock skew θ = ((tv − (tu + δ)) − (t′u − (t′v + δ)))/2 = ((tv − tu) + (t′v − t′u))/2
9: Node u adjusts clock by +θ
10: Sleep before next synchronization
11: end while
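A small Python sketch of the computation in Lines 7 and 8 (the four timestamps are illustrative values; the symmetric-delay assumption is kept):

```python
def ntp_skew(t_u, t_v, t_v2, t_u2):
    """One request/response exchange: u sends at t_u, v receives at t_v,
    v replies at t_v2, u receives the reply at t_u2 (Lines 3-6).
    Returns (propagation delay delta, clock skew theta) as in Lines 7-8."""
    delta = ((t_u2 - t_u) - (t_v2 - t_v)) / 2   # assumes symmetric delays
    theta = ((t_v - t_u) + (t_v2 - t_u2)) / 2
    return delta, theta

# Illustrative values: v's clock is 50 ms ahead, one-way delay is 20 ms.
delta, theta = ntp_skew(10.000, 10.070, 10.080, 10.050)
print(delta, theta)   # 0.02 0.05 -> node u adjusts its clock by +0.05
```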
Remarks:
• Many NTP servers are public, answering to UDP packets.
• The most accurate NTP servers derive their time from atomic clocks,
synchronized to UTC. To reduce those servers' load, a hierarchy of
NTP servers is available in a forest (multiple trees) structure.
• The regular synchronization of NTP limits the maximum error despite
unpredictable clock errors. Synchronizing clocks just once is only suf-
ficient for a short time period.
Definition 20.11 (PTP). The Precision Time Protocol (PTP) is a clock
synchronization protocol similar to NTP, but which uses medium access con-
trol (MAC) layer timestamps.
Remarks:
• MAC layer timestamping removes the unknown time delay incurred
through messages passing through the software stack.
• PTP can achieve sub-microsecond accuracy in local networks.
Definition 20.12 (Global Synchronization). Global synchronization estab-
lishes a common time between any two nodes in the system.
Remarks:
5: ∆u = tu − (ts + du )
6: ∆v = tv − (ts + dv )
Remarks:
• Time standards use leap seconds to compensate for the slowing of the
Earth’s rotation. In theory, also negative leap seconds can be used to
make some minutes only 59 seconds long. But so far, this was never
necessary.
• For easy implementation, not all time standards use leap seconds, for
instance TAI and GPS time do not.
Remarks:
• The global time standard Greenwich Mean Time (GMT) was already
established in 1884. With the invention of caesium atomic clocks and
the subsequent redefinition of the SI second, UTC replaced GMT in
1967.
• Before time standards existed, each city set their own time according
to the local mean solar time, which is difficult to measure exactly.
This was changed by the upcoming rail and communication networks.
• Different notations for time and date are in use. A standardized format
for timestamps, mostly used for processing by computers, is the ISO
8601 standard. According to this standard, a UTC timestamp looks
like this: 1712-02-30T07:39:52Z. T separates the date and time parts
while Z indicates the time zone with zero offset from UTC.
• Why UTC and not “CUT”? Because France insisted. Same for other
abbreviations in this domain, e.g. TAI.
Definition 20.18 (Time Zone). A time zone is a geographical region in which
the same time offset from UTC is officially used.
Remarks:
• Time zones serve to roughly synchronize noon with the sun reaching
the day’s highest apparent elevation angle.
• Some time zones’ offset is not a whole number of hours, e.g. India.
Remarks:
• Atomic clocks are the most accurate clocks known. They can have a
drift of only about one second in 150 million years, about 2e-10 ppm!
• Many atomic clocks are based on caesium atoms, which led to the
current definition of a second. Others use hydrogen-1 or rubidium-87.
• In the future, atoms with higher frequency oscillations could yield
even more accurate clocks.
• Atomic clocks are getting smaller and more energy efficient. Chip-
scale atomic clocks (CSAC) are currently being produced for space
applications and may eventually find their way into consumer elec-
tronics.
Definition 20.20 (System Clock). The system clock in a computer is an
oscillator used to synchronize all components on the motherboard.
Remarks:
• This keeps the computer’s time close to UTC even when the time
cannot be synchronized over a network.
• In many cases, the RTC frequency is 32.768 kHz, which allows for
simple timekeeping based on binary counter circuits because the fre-
quency is exactly 2^15 Hz.
Definition 20.22 (Radio Time Signal). A Radio Time Signal is a time code
transmitted via radio waves by a time signal station, referring to a time in a
given standard such as UTC.
Remarks:
• Time signal stations use atomic clocks to send as accurate time codes
as possible.
• Radio time signals can be received much farther than the horizon of
the transmitter due to signal reflections at the ionosphere. DCF77 for
instance has an official range of 2,000 km.
Definition 20.23 (Power Line Clock). A power line clock measures the os-
cillations from electric AC power lines, e.g. 50 Hz.
Remarks:
• The magnetic field radiating from power lines is strong enough that
power line clocks can work wirelessly.
Remarks:
• Due to low data rates from length of day measurements, sunlight time
synchronization is well-suited for long-time measurements with data
storage and post-processing, requiring no communication at the time
of measurement.
20.5 GPS
Definition 20.25 (Global Positioning System). The Global Positioning Sys-
tem (GPS) is a Global Navigation Satellite System (GNSS), consisting
of at least 24 satellites orbiting around the Earth, each continuously transmitting
its position and time code.
Remarks:
• Positioning is done in space and time!
• GPS provides position and time information to receivers anywhere on
Earth where at least four satellite signals can be received.
• Line of sight (LOS) between satellite and receiver is advantageous.
GPS works poorly indoors, or with reflections.
• Besides the US GPS, three other GNSS exist: the European Galileo,
the Russian GLONASS and the Chinese BeiDou.
• GPS satellites orbit around Earth approximately 20,000 km above the
surface, circling Earth twice a day. The signals take between 64 and
89 ms to reach Earth.
• The orbits are precisely determined by ground control stations, op-
timized for a high number of satellites being concurrently above the
horizon at any place on Earth.
3: while true do
4: for all bits Di ∈ D do
5: for j = 0 . . . 19 do
6: for k = 0 . . . 1022 do {this loop takes exactly 1 ms}
7: Send bit PRN_k · D_i
8: end for
9: end for
10: end for
11: end while
Remarks:
• The GPS PRN sequences are so-called Gold codes, which have low
cross-correlation with each other.
• To simplify our math (abstract from modulation), each PRN bit is
either 1 or −1.
Definition 20.28 (Navigation Data). Navigation Data is the data transmit-
ted from satellites, which includes orbit parameters to determine satellite po-
sitions, timestamps of signal transmission, atmospheric delay estimations and
status information of the satellites and GPS as a whole, such as the accuracy
and validity of the data.
Remarks:
• As seen in Algorithm 20.26 each bit is repeated 20 times for better
robustness. Thus, the navigation data rate is only 50 bit/s.
• Due to this limited data rate, timestamps are sent every 6 seconds,
satellite orbit parameters (function of the satellite position over time)
only every 30 seconds. As a result, the latency of a first position
estimate after turning on a receiver, which is called time-to-first-fix
(TTFF), can be high.
Definition 20.29 (Circular Cross-Correlation). The circular cross-correlation
is a similarity measure between two vectors of length N, circularly shifted by
a given displacement d:

cxcorr(a, b, d) = Σ_{i=0}^{N−1} a_i · b_{(i+d) mod N}
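A small Python sketch of this definition, together with the FFT shortcut mentioned in the remarks below (NumPy is assumed; the PRN length 1023 and the delay of 300 samples are illustrative):

```python
import numpy as np

def cxcorr_naive(a, b):
    """cxcorr(a, b, d) for all displacements d, directly from the definition."""
    n = len(a)
    return np.array([sum(a[i] * b[(i + d) % n] for i in range(n))
                     for d in range(n)])

def cxcorr_fft(a, b):
    """Same values in O(N log N): correlation is a product in the Fourier domain."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

prn = np.random.choice([-1, 1], size=1023)     # a PRN-like +-1 sequence
received = np.roll(prn, 300)                   # signal arriving 300 samples late
d = int(np.argmax(cxcorr_fft(prn, received)))  # -> 300, the detected delay
```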
Remarks:
• The two vectors are most similar at the displacement d where the sum
(cross-correlation value) is maximum.
• The vector of cross-correlation values with all N displacements can ef-
ficiently be computed using a fast Fourier transform (FFT) in O(N log N)
instead of O(N^2) time.
Remarks:
• Multiple milliseconds of acquisition can be summed up to average out
noise and therefore improve the arrival time detection probability.
Definition 20.31 (Acquisition). Acquisition is the process in a GPS receiver
that finds the visible satellite signals and detects the delays of the PRN sequences
and the Doppler shifts of the signals.
Remarks:
• GPS satellites carry precise atomic clocks, but the receiver is not syn-
chronized with the satellites. The arrival times of the signals at the
receiver are determined in the receiver’s local time. Therefore, even
though the satellite signals include transmit timestamps, the exact
distance between satellites and receiver is unknown.
• More received signals help reducing the measurement noise and thus
improving the accuracy.
• Since the positioning solution, which is also called position fix, in-
cludes the handset’s time offset ∆, this establishes a global time for
all handsets. Thus, GPS is useful for global time synchronization.
Remarks:
• A-GPS reduces the data transmission time, and thus the TTFF, from
a maximum of 30 seconds per satellite to a maximum of 6 seconds.
Remarks:
• Snapshot receivers aim at the remaining latency that results from the
transmission of timestamps from the satellites every six seconds.
Remarks:
• CD can tolerate a few low quality satellite signals and is thus more
robust than CTN.
• In essence, CD tests how well position hypotheses match the received
signal. For large position and time uncertainties, the high number of
hypotheses require a lot of computation power.
• CD can be sped up by a branch and bound approach, which reduces
the computation per position fix to the order of one second even for
uncertainties of 100 km and a minute.
between [0, 1]. In other words, we have a bounded but variable drift on the
hardware clocks and an arbitrary jitter in the delivery times. The goal is to
design a message-passing algorithm that ensures that the logical clock skew of
adjacent nodes is as small as possible at all times.
Definition 20.38 (Local and Global Clock Skew). In a network of nodes, the
local clock skew is the skew between neighboring nodes, while the global clock
skew is the maximum skew between any two nodes.
Remarks:
• Of interest is also the average global clock skew, that is the average
skew between any pair of nodes.
Theorem 20.39. The global clock skew (Definition 20.38) is Ω(D), where D
is the diameter of the network graph.
Proof. For a node u, let tu be the logical time of u and let (u → v) denote a
message sent from u to a node v. Let t(m) be the time delay of a message m
and let u and v be neighboring nodes. First consider a case where the message
delays between u and v are 1/2. Then, all the messages sent by u and v at time
t according to the clock of the sender arrive at time t + 1/2 according to the
clock of the receiver.
Then consider the following cases
where the message delivery time is always fast for one node and slow for the
other and the logical clocks are off by 1/2. In both scenarios, the messages sent
at time i according to the clock of the sender arrive at time i + 1/2 according
to the logical clock of the receiver. Therefore, for nodes u and v, both cases
with clock drift seem the same as the case with perfectly synchronized clocks.
Furthermore, in a linked list of D nodes, the left- and rightmost nodes l, r cannot
distinguish tl = tr + D/2 from tl = tr − D/2.
Remarks:
• As both message jitter and hardware clock drift are bounded by con-
stants, it feels like we should be able to get a constant drift at least
between neighboring nodes.
Proof. Let the graph be a linked list of D nodes. We denote the nodes by
v1 , v2 , . . . , vD from left to right and the logical clock of node vi by ti . Apart
from the left-most node v1 all hardware clocks run with speed 1 (real time).
Node v1 runs at maximum speed, i.e. the time between two pulses is not 1 but
1 − ε. Assume that initially all message delays are 1. After some time, node v1
will start to speed up v2 , and after some more time v2 will speed up v3 , and
so on. At some point of time, we will have a clock skew of 1 between any two
neighbors. In particular t1 = tD + D − 1.
Now we start playing around with the message delays. Let t1 = T . First we
set the delay between the v1 and v2 to 0. Now node v2 immediately adjusts its
logical clock to T . After this event (which is instantaneous in our model) we set
the delay between v2 and v3 to 0, which results in v3 setting its logical clock to T
as well. We perform this successively to all pairs of nodes until vD−2 and vD−1 .
Now node vD−1 sets its logical clock to T , which indicates that the difference
between the logical clocks of vD−1 and vD is T − (T − (D − 1)) = D − 1.
Remarks:
• The introduced examples may seem cooked-up, but examples like this
exist in all networks, and for all algorithms. Indeed, it was shown
that any natural clock synchronization algorithm must have a bad
local skew. In particular, a protocol that averages between all neigh-
bors (like Algorithm 20.13) is even worse than Algorithm 20.40. An
averaging algorithm has a clock skew of Ω(D2 ) in the linked list, at
all times.
• It was shown that the local clock skew is Θ(log D), i.e., there is a pro-
tocol that achieves this bound, and there is a proof that no algorithm
can be better than this bound!
• Note that these are worst-case bounds. In practice, clock drift and
message delays may not be the worst possible, typically the speed of
hardware clocks changes at a comparatively slow pace and the mes-
sage transmission times follow a benign probability distribution. If we
assume this, better protocols do exist, in theory as well as in practice.
Chapter Notes
Atomic clocks can be used as a GPS fallback for data center synchroniza-
tion [CDE+ 13].
GPS has been such a technological breakthrough that even though it dates
back to the 1970s, the new GNSS still use essentially the same techniques. Sev-
eral people worked on snapshot GPS receivers, but the technique has not pene-
trated into commercial receivers yet. Liu et al. [LPH+ 12] presented a practical
CTN receiver and reduced the solution space by eliminating solutions not lying
on the ground. CD receivers are studied since at least 2011 [ABD+ 11] and have
recently been made practically feasible through branch and bound [BEW17].
It has been known for a long time that the global clock skew is Θ(D) [LL84,
ST87]. The problem of synchronizing the clocks of nearby nodes was intro-
duced by Fan and Lynch in [LF04]; they proved a surprising lower bound of
Ω(log D/ log log D) for the local skew. The first algorithm providing a non-
trivial local skew of O(√D) was given in [LW06]. Later, matching upper and
lower bounds of Θ(log D) were given in [LLW10]. The problem has also been
studied in a dynamic setting [KLO09, KLLO10] or when a fraction of nodes ex-
perience byzantine faults and the other nodes have to recover from faulty initial
state (i.e., self-stabilizing) [DD06, DW04]. The self-stabilizing byzantine case
has been solved with asymptotically optimal skew [KL18].
Clock synchronization is a well-studied problem in practice, for instance
regarding the global clock skew in sensor networks, e.g. [EGE02, GKS03,
MKSL04, PSJ04]. One more recent line of work is focussing on the problem
of minimizing the local clock skew [BvRW07, SW09, LSW09, FW10, FZTS11].
This chapter was written in collaboration with Manuel Eichelberger.
Bibliography
[ABD+ 11] Penina Axelrad, Ben K Bradley, James Donna, Megan Mitchell, and
Shan Mohiuddin. Collective Detection and Direct Positioning Using
Multiple GNSS Satellites. Navigation, 58(4):305–321, 2011.
[CDE+ 13] James C Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes,
Christopher Frost, Jeffrey John Furman, Sanjay Ghemawat, An-
drey Gubarev, Christopher Heiser, Peter Hochschild, et al. Span-
ner: Google’s globally distributed database. ACM Transactions on
Computer Systems (TOCS), 31(3):8, 2013.
[DD06] Ariel Daliot and Danny Dolev. Self-Stabilizing Byzantine Pulse Syn-
chronization. Computing Research Repository, 2006.
[FZTS11] Federico Ferrari, Marco Zimmerling, Lothar Thiele, and Olga Saukh.
Efficient Network Flooding and Time Synchronization with Glossy.
In Proceedings of the 10th International Conference on Information
Processing in Sensor Networks (IPSN), pages 73–84, 2011.
[KLLO10] Fabian Kuhn, Christoph Lenzen, Thomas Locher, and Rotem Osh-
man. Optimal Gradient Clock Synchronization in Dynamic Net-
works. In 29th Symposium on Principles of Distributed Computing
(PODC), Zurich, Switzerland, July 2010.
[KLO09] Fabian Kuhn, Thomas Locher, and Rotem Oshman. Gradient Clock
Synchronization in Dynamic Networks. In 21st ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), Calgary,
Canada, August 2009.
[LL84] Jennifer Lundelius and Nancy Lynch. An Upper and Lower Bound
for Clock Synchronization. Information and Control, 62:190–204,
1984.
[LPH+ 12] Jie Liu, Bodhi Priyantha, Ted Hart, Heitor Ramos, Antonio A.F.
Loureiro, and Qiang Wang. Energy Efficient GPS Sensing with
Cloud Offloading. In 10th ACM Conference on Embedded Networked
Sensor Systems (SenSys 2012). ACM, November 2012.
[MKSL04] Miklós Maróti, Branislav Kusy, Gyula Simon, and Ákos Lédeczi. The
Flooding Time Synchronization Protocol. In Proceedings of the 2nd
international Conference on Embedded Networked Sensor Systems,
SenSys ’04, 2004.
Chapter 21

Quorum Systems
What happens if a single server is no longer powerful enough to service all your
customers? The obvious choice is to add more servers and to use the majority
approach (e.g. Paxos, Chapter 15) to guarantee consistency. However, even
if you buy one million servers, a client still has to access more than half of
them per request! While you gain fault-tolerance, your efficiency can at most
be doubled. Do we have to give up on consistency?
Let us take a step back: We used majorities because majority sets always
overlap. But are majority sets the only sets that guarantee overlap? In this
chapter we study the theory behind overlapping sets, known as quorum systems.
Remarks:
• When a quorum system is being used, a client selects a quorum, ac-
quires a lock (or ticket) on all nodes of the quorum, and when done
releases all locks again. The idea is that no matter which quorum is
chosen, its nodes will intersect with the nodes of every other quorum.
• What can happen if two quorums try to lock their nodes at the same
time?
• A quorum system S is called minimal if ∀Q1 , Q2 ∈ S : Q1 * Q2 .
Remarks:
• Note that you cannot choose different access strategies Z for work and
load, you have to pick a single Z for both.
Theorem 21.6. Let S be a quorum system. Then L(S) ≥ 1/√n holds.
Remarks:
• Can we achieve this load?
Remarks:
• Consider the right picture in Figure 21.8: The two quorums intersect
in two nodes. If both quorums were to be accessed at the same time,
it is not guaranteed that at least one quorum will lock all of its nodes,
as they could enter a deadlock!
• In the case of just two quorums, one could solve this by letting the
quorums just intersect in one node, see Figure 21.9. However, already
with three quorums the same situation could occur again, progress is
not guaranteed!
• However, by deviating from the “access all at once” strategy, we can
guarantee progress if the nodes are totally ordered!
Figure 21.9: There are other ways to choose quorums in the grid s.t. pairwise
different quorums only intersect in one node. The size of each quorum is between
√n and 2√n − 1, i.e., the work is in Θ(√n). When the access strategy Z is
uniform, the load of every node is in Θ(1/√n).
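The following Python sketch builds the n row-plus-column quorums of a √n × √n grid and evaluates quorum size and load under the uniform access strategy; representing nodes as (row, column) pairs is an implementation choice for illustration.

```python
from itertools import product

def grid_quorums(k):
    """All k*k quorums of a k x k grid: one full row plus one full column."""
    nodes = list(product(range(k), range(k)))
    quorums = [{(r, j) for j in range(k)} | {(i, c) for i in range(k)}
               for r, c in product(range(k), range(k))]
    return nodes, quorums

def uniform_load(nodes, quorums):
    """Maximal load of any node when every quorum is accessed with prob. 1/|S|."""
    prob = 1.0 / len(quorums)
    load = {v: 0.0 for v in nodes}
    for q in quorums:
        for v in q:
            load[v] += prob
    return max(load.values())

k = 10                                  # n = 100 nodes
nodes, quorums = grid_quorums(k)
print(len(quorums[0]))                  # quorum size (work): 2*k - 1 = 19
print(uniform_load(nodes, quorums))     # 0.19, i.e., Theta(1/sqrt(n))
```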
Remarks:
Theorem 21.13. If the nodes and quorums use Algorithm 21.12, at least one
quorum will obtain a lock for all of its nodes.
Proof. The proof is analogous to the proof of Theorem 21.11: Assume for con-
tradiction that no quorum can make progress. However, at least the quorum
with the highest vQ can always make progress – a contradiction! As the set of
nodes is finite, at least one quorum will eventually be able to acquire a lock on
all of its nodes.
Remarks:
• What if a quorum locks all of its nodes and then crashes? Is the
quorum system dead now? This issue can be prevented by, e.g., using
leases instead of locks: leases have a timeout, i.e., a lock is released
eventually. But what happens if a quorum is slow and its acquired
leases expire before it can acquire all leases?
Theorem 21.15. Let S be a Grid quorum system where each of the n quorums
consists of a full row and a full column. S has a resilience of √n − 1.

Proof. If all √n nodes on the diagonal of the grid fail, then every quorum will
have at least one failed node. Should less than √n nodes fail, then there is a
row and a column without failed nodes.
Remarks:
• The Grid quorum system in Theorem 21.15 is different from the Basic
Grid quorum system described in Definition 21.7. In each quorum in
the Basic Grid quorum system the row and column index are identical,
while in the Grid quorum system of Theorem 21.15 this is not the case.
Definition 21.16 (failure probability). Assume that every node works with a
fixed probability p (in the following we assume concrete values, e.g. p > 1/2).
The failure probability Fp(S) of a quorum system S is the probability that no
quorum consists entirely of working nodes.
Remarks:
• The asymptotic failure probability is Fp (S) for n → ∞.
Facts 21.17. A version of a Chernoff bound states the following:
Let x1, . . . , xn be independent Bernoulli-distributed random variables with
Pr[xi = 1] = pi and Pr[xi = 0] = 1 − pi = qi. Then for X := Σ_{i=1}^n xi and
µ := E[X] = Σ_{i=1}^n pi the following holds:

for all 0 < δ < 1: Pr[X ≤ (1 − δ)µ] ≤ e^(−µδ²/2).
Theorem 21.18. The asymptotic failure probability of the Majority quorum
system is 0, for p > 1/2.
Proof. In a Majority quorum system each quorum contains exactly ⌊n/2⌋ + 1
nodes and each subset of nodes with cardinality ⌊n/2⌋ + 1 forms a quorum. If
only ⌊n/2⌋ nodes work, then the Majority quorum system fails. Otherwise there
is at least one quorum available. In order to calculate the failure probability we
define the following random variables:

xi = 1, if node i works, which happens with probability p,
xi = 0, if node i fails, which happens with probability q = 1 − p,

and X := Σ_{i=1}^n xi, with µ = np, whereas X corresponds to the number of
working nodes. To estimate the probability that the number of working nodes
is less than ⌊n/2⌋ + 1 we will make use of the Chernoff inequality from above. By
setting δ = 1 − 1/(2p) we obtain Fp(S) = Pr[X ≤ ⌊n/2⌋] ≤ Pr[X ≤ n/2] = Pr[X ≤ (1 − δ)µ].

With δ = 1 − 1/(2p) we have 0 < δ ≤ 1/2 due to 1/2 < p ≤ 1. Thus, we can use
the Chernoff bound and get Fp(S) ≤ e^(−µδ²/2) ∈ e^(−Ω(n)).
Theorem 21.19. The asymptotic failure probability of the Grid quorum system
is 1 for p > 0.
Proof. Consider the n = d · d nodes to be arranged in a d × d grid. A quorum
always contains one full row. In this estimation we will make use of the Bernoulli
inequality which states that for all n ∈ N, x ≥ −1: (1 + x)^n ≥ 1 + nx.
The system fails, if in each row at least one node fails (which happens with
probability 1 − p^d for a particular row, as all nodes work with probability p^d).
Therefore we can bound the failure probability from below with:

Fp(S) ≥ Pr[at least one failure per row] = (1 − p^d)^d ≥ 1 − d·p^d → 1 for n → ∞.
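A small Monte Carlo sketch (illustrative, not from the script) contrasting the two theorems: with p = 2/3 and a few hundred nodes, the Majority system essentially never fails while the Grid system almost always does.

```python
import random

def majority_fails(n, p):
    working = sum(random.random() < p for _ in range(n))
    return working <= n // 2          # no quorum of floor(n/2)+1 working nodes

def grid_fails(d, p):
    # the Grid fails iff every row contains at least one failed node
    return all(any(random.random() >= p for _ in range(d)) for _ in range(d))

def estimate(fails, trials=10_000, **kw):
    return sum(fails(**kw) for _ in range(trials)) / trials

p = 2 / 3
print(estimate(majority_fails, n=400, p=p))   # close to 0
print(estimate(grid_fails, d=20, p=p))        # close to 1 (n = 400 nodes)
```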
Remarks:
• Now we have a quorum system with optimal load (the Grid) and one
with fault-tolerance (Majority), but what if we want both?
Definition 21.20 (B-Grid quorum system). Consider n = dhr nodes, arranged
in a rectangular grid with h · r rows and d columns. Each group of r rows is a
band, and r elements in a column restricted to a band are called a mini-column.
A quorum consists of one mini-column in every band and one element from
each mini-column of one band; thus every quorum has d + hr − 1 elements. The
B-Grid quorum system consists of all such quorums.
Theorem 21.22. The asymptotic failure probability of the B-Grid quorum sys-
tem is 0, for p ≥ 2/3.
Proof. Suppose n = dhr and the elements are arranged in a grid with d columns
and h · r rows. The B-Grid quorum system does fail if in each band a complete
mini-column fails, because then it is not possible to choose a band where in each
mini-column an element is still working. It also fails if in a band an element
in each mini-column fails. If none of those cases holds, then the B-Grid system
does not fail. Those events may not be independent of each other, but with the
help of the union bound, we can upper bound the failure probability with the
following equation:
Fp(S) ≤ Pr[in every band a complete mini-column fails]
        + Pr[in a band at least one element of every m.-col. fails]
      ≤ (d(1 − p)^r)^h + h(1 − p^r)^d

We use d = √n, r = ln d, and 0 ≤ 1 − p ≤ 1/3. Using n^(ln x) = x^(ln n), we have
d(1 − p)^r ≤ d · d^(ln 1/3) ≈ d^(−0.1), and hence for large enough d the whole first term
is bounded from above by d^(−0.1·h) ≪ 1/d^2 = 1/n.
Regarding the second term, we have p ≥ 2/3, and h = d/ln d < d. Hence
we can bound the term from above by d(1 − d^(ln 2/3))^d ≈ d(1 − d^(−0.4))^d. Using
(1 + t/n)^n ≤ e^t, we get (again, for large enough d) an upper bound of
d(1 − d^(−0.4))^d = d(1 − d^(0.6)/d)^d ≤ d · e^(−d^(0.6)) = d^((−d^(0.6)/ln d)+1) ≪ d^(−2) = 1/n.
In total, we have Fp(S) ∈ O(1/n).
Remarks:
• Thanks to (2), even with f byzantine nodes, the byzantine nodes
cannot stop all quorums by just pretending to have crashed. At least
one quorum will survive. We will also keep this assumption for the
upcoming more advanced byzantine quorum systems.
• Byzantine nodes can also do something worse than crashing - they
could falsify data! Nonetheless, due to (1), there is at least one
non-byzantine node in every quorum intersection. If the data is self-
verifying by, e.g., authentication, then this one node is enough.
• If the data is not self-verifying, then we need another mechanism.
Definition 21.25 (f -masking). A quorum system S is f -masking if (1) the
intersection of two different quorums always contains 2f + 1 nodes, and (2) for
any set of f byzantine nodes, there is at least one quorum without byzantine
nodes.
Remarks:
• Note that except for the second condition, an f -masking quorum sys-
tem is the same as a 2f -disseminating system. The idea is that the
non-byzantine nodes (at least f + 1) can outvote the byzantine ones
(at most f ), but only if all non-byzantine nodes are up-to-date!
• This raises an issue not covered yet in this chapter. If we access some
quorum and update its values, this change still has to be disseminated
to the other nodes in the byzantine quorum system. Opaque quorum
systems deal with this issue, which are discussed at the end of this
section.
• One can show that f -disseminating quorum systems need more than
3f nodes and f -masking quorum systems need more than 4f nodes.
In other words, f < n/3, or f < n/4. Essentially, the quorums may
not contain too many nodes, and the different intersection properties
lead to the different bounds.
Theorem 21.27. Let S be an f-masking quorum system. Then L(S) ≥ √((2f + 1)/n)
holds.
Proofs of Theorems 21.26 and 21.27. The proofs follow the proof of Theorem
21.6, by observing that now not just one element is accessed from a minimal
quorum, but f + 1 or 2f + 1, respectively.
Figure 21.29: An example how to choose a quorum in the f-masking Grid with
f = 2, i.e., 2 + 1 = 3 rows. The load is in Θ(f/√n) when the access strategy is
chosen to be uniform. Two quorums overlap by their columns intersecting each
other's rows, i.e., they overlap in at least 2f + 2 nodes.
Remarks:
• The f -masking Grid nearly hits the lower bound for the load of f -
masking quorum systems, but not quite. A small change and we will
be optimal asymptotically.
Corollary 21.32. The f -masking Grid quorum system and the M -Grid quorum
system are f -masking quorum systems.
Remarks:
• This property will be handled in the last part of this chapter by opaque
quorum systems. It will ensure that the number of correct up-to-date
nodes accessed will be larger than the number of out-of-date nodes
combined with the byzantine nodes in the quorum (cf. (21.33.1)).
Remarks:
• For any f-opaque quorum system, inequality (21.33.1) also holds for
|F| < f. In particular, substituting F = ∅ in (21.33.1) gives |Q1 ∩
Q2| > |Q2 \ Q1|; similarly, one can also deduce that |Q1 ∩ Q2| >
|Q1 \ Q2|. Therefore, |Q1| = |Q1 \ Q2| + |Q1 ∩ Q2| < 2|Q1 ∩ Q2|, so
|Q1 ∩ Q2| > |Q1|/2.
Proof. Due to (21.33.2), there exists a quorum Q1 with size at most n−f . With
(21.33.1), |Q1 | > f holds. Let F1 be a set of f (byzantine) nodes F1 ⊂ Q1 , and
with (21.33.2), there exists a Q2 ⊆ V \ F1 . Thus, |Q1 ∩ Q2 | ≤ n − 2f . With
(21.33.1), |Q1 ∩ Q2 | > f holds. Thus, one could choose f (byzantine) nodes
F2 with F2 ⊂ Q1 ∩ Q2 . Using (21.33.1) one can bound n − 3f from below:
n − 3f ≥ |Q2 ∩ Q1 | − |F2 | = |(Q2 ∩ Q1 ) \ F2 | > |(Q1 ∩ F2 ) ∪ (Q1 \ Q2 )| =
|F2 ∪ (Q1 \ Q2 )| = |F2 | + |Q1 \ Q2 | ≥ |F2 | + |F1 | = 2f .
Remarks:
Theorem 21.36. Let S be an f -opaque quorum system. Then L(S) > 1/2
holds.
Using the pigeonhole principle, there must be at least one node in Q1 with load
greater than 1/2.
Chapter Notes
Historically, a quorum is the minimum number of members of a deliberative
body necessary to conduct the business of that group. Their use has inspired the
introduction of quorum systems in computer science since the late 1970s/early
1980s. Early work focused on Majority quorum systems [Lam78, Gif79, Tho79],
with the notion of minimality introduced shortly after [GB85]. The Grid quo-
rum system was first considered in [Mae85], with the B-Grid being introduced
in [NW94]. The latter article and [PW95] also initiated the study of load and
resilience.
The f -masking Grid quorum system and opaque quorum systems are from
[MR98], and the M -Grid quorum system was introduced in [MRW97]. Both
papers also mark the start of the formal study of Byzantine quorum systems.
The f -masking and the M -Grid have asymptotic failure probabilities of 1, more
complex systems with better values can be found in these papers as well.
Quorum systems have also been extended to cope with nodes dynamically
leaving and joining, see, e.g., the dynamic paths quorum system in [NW05].
For a further overview on quorum systems, we refer to the book by Vukolić
[Vuk12] and the article by Merideth and Reiter [MR10].
This chapter was written in collaboration with Klaus-Tycho Förster.
Bibliography
[GB85] Hector Garcia-Molina and Daniel Barbará. How to assign votes in a
distributed system. J. ACM, 32(4):841–860, 1985.
[MR10] Michael G. Merideth and Michael K. Reiter. Selected results from the
latest decade of quorum systems research. In Bernadette Charron-
Bost, Fernando Pedone, and André Schiper, editors, Replication:
Theory and Practice, volume 5959 of Lecture Notes in Computer Sci-
ence, pages 185–206. Springer, 2010.
[MRW97] Dahlia Malkhi, Michael K. Reiter, and Avishai Wool. The load and
availability of byzantine quorum systems. In James E. Burns and
Hagit Attiya, editors, Proceedings of the Sixteenth Annual ACM Sym-
posium on Principles of Distributed Computing, Santa Barbara, Cal-
ifornia, USA, August 21-24, 1997, pages 249–257. ACM, 1997.
[NW94] Moni Naor and Avishai Wool. The load, capacity and availability
of quorum systems. In 35th Annual Symposium on Foundations of
Computer Science, Santa Fe, New Mexico, USA, 20-22 November
1994, pages 214–225. IEEE Computer Society, 1994.
[NW05] Moni Naor and Udi Wieder. Scalable and dynamic quorum systems.
Distributed Computing, 17(4):311–322, 2005.
[PW95] David Peleg and Avishai Wool. The availability of quorum systems.
Inf. Comput., 123(2):210–223, 1995.
Chapter 22

Distributed Storage
How do you store 1M movies, each with a size of about 1GB, on 1M nodes, each
equipped with a 1TB disk? Simply store the movies on the nodes, arbitrarily,
and memorize (with a global index) which movie is stored on which node. What
if the set of movies or nodes changes over time, and you do not want to change
your global index too often?
Proof. For a specific movie (out of m) and a specific hash function (out of k),
all n nodes have the same probability 1/n to hash closest to the movie hash.
By linearity of expectation, each node stores km/n movies in expectation if we
also count duplicates of movies on a node.
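A minimal Python sketch of this consistent-hashing idea: each of the k hash functions maps both movies and nodes to [0, 1), and copy i of a movie is stored on the node whose i-th hash value is closest. The hash family (salted SHA-256) and measuring "closest" as plain distance on [0, 1) are illustrative assumptions.

```python
import hashlib

def h(key, i):
    """Hash function number i, mapping a key to [0, 1)."""
    digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
    return int(digest, 16) / 16**64

def place(movie, node_list, k):
    """Return the k nodes that store the k copies of `movie`."""
    copies = []
    for i in range(k):
        target = h(movie, i)
        copies.append(min(node_list, key=lambda v: abs(h(v, i) - target)))
    return copies

nodes = [f"node{j}" for j in range(1000)]
print(place("movie42", nodes, k=3))   # three (not necessarily distinct) nodes
```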
Remarks:
• Using the Chernoff bound below with µ = km/n = 1K, the probability
that a node uses 10% more memory than expected is less than 1%.
Remarks:
• For better load balancing, we might also hash nodes multiple times.
• Instead, each node will just know about a small subset of 100 or less
other nodes (“neighbors”). This way, nodes can withstand high churn
situations.
• On the downside, nodes will not directly know which node is responsi-
ble for what movie. Instead, a node searching for a movie might have
to ask a neighbor node, which in turn will recursively ask another
neighbor node, until the correct node storing the movie (or a forward
pointer to the movie) is found. The nodes of our distributed storage
system form a virtual network, also called an overlay network.
Remarks:
• Some basic network topologies used in practice are trees, rings, grids
or tori. Many other suggested networks are simply combinations or
derivatives of these.
• The advantage of trees is that the routing is very easy: for every
source-destination pair there is only one path. However, since the
root of a tree is a bottleneck, trees are not homogeneous. Instead,
so-called fat trees should be used. Fat trees have the property that
every edge connecting a node v to its parent u has a capacity that is
proportional to the number of leaves of the subtree rooted at v. See
Figure 22.5 for a picture.
where [m] means the set {0, . . . , m − 1}. The (m, d)-torus T (m, d) is a graph
that consists of an (m, d)-mesh and additionally wrap-around edges from nodes
(a1 , . . . , ai−1 , m − 1, ai+1 , . . . , ad ) to nodes (a1 , . . . , ai−1 , 0, ai+1 , . . . , ad ) for all
i ∈ {1, . . . , d} and all aj ∈ [m] with j 6= i. In other words, we take the expression
ai − bi in the sum modulo m prior to computing the absolute value. M (m, 1) is
also called a path, T (m, 1) a cycle, and M (2, d) = T (2, d) a d-dimensional
hypercube. Figure 22.7 presents a linear array, a torus, and a hypercube.
Figure 22.7: The structure of M(m, 1), T(4, 2), and M(2, 3).
Remarks:
• Routing on a mesh, torus, or hypercube is trivial. On a d-dimensional
hypercube, to get from a source bitstring s to a target bitstring t one
only needs to fix each “wrong” bit, one at a time; in other words, if
the source and the target differ by k bits, there are k! routes with k
hops.
• As required by Definition 22.4, the d-bit IDs of the nodes need to be
mapped to the universe [0, 1). One way to do this is by turning each
ID into a fractional binary representation. For example, the ID 101
is mapped to 0.101_2 which has a decimal value of 0 · 2^0 + 1 · 2^(−1) +
0 · 2^(−2) + 1 · 2^(−3) = 5/8.
• The Chord architecture is a close relative of the hypercube, basically
a less rigid hypercube. The hypercube connects every node with an
ID in [0, 1) with other nodes at distance exactly 2^(−i), i = 1, 2, . . . , d
and
A node set {(i, α) | α ∈ [2]^d} is said to form level i of the butterfly. The d-
dimensional wrap-around butterfly W-BF(d) is defined by taking the BF(d)
and having (d, α) = (0, α) for all α ∈ [2]^d.
Remarks:
• Figure 22.9 shows the 3-dimensional butterfly BF(3). The BF(d) has
(d + 1) · 2^d nodes, 2d · 2^d edges and maximum degree 4. It is not difficult
to check that if for each α ∈ [2]^d we combine the nodes {(i, α) | i ∈
[d + 1]} into a single node then we get back the hypercube.
• Butterflies have the advantage of a constant node degree over hyper-
cubes, whereas hypercubes feature more fault-tolerant routing.
• You may have seen butterfly-like structures before, e.g. sorting net-
works, communication switches, data center networks, fast fourier
transform (FFT). The Beneš network (telecommunication) is noth-
ing but two back-to-back butterflies. The Clos network (data centers)
is a close relative to Butterflies too. Actually, merging the 2i nodes on
level i that share the first d − i bits into a single node, the Butterfly
becomes a fat tree.
Every year there are new applications for which hypercubic networks
are the perfect solution!
• Next we define the cube-connected-cycles network. It only has a de-
gree of 3 and it results from the hypercube by replacing the corners
by cycles.
Figure 22.9: The structure of BF(3).
Figure 22.11: The structure of CCC(3), in two different representations.
Remarks:
• Two possible representations of a CCC can be found in Figure 22.11.
• The shuffle-exchange is yet another way of transforming the hypercu-
bic interconnection structure into a constant degree network.
Definition 22.12 (Shuffle-Exchange). Let d ∈ N. The d-dimensional
shuffle-exchange SE(d) is defined as an undirected graph with node set
V = [2]^d and an edge set E = E1 ∪ E2 with

E1 = {{(a1, . . . , ad), (a1, . . . , ād)} | (a1, . . . , ad) ∈ [2]^d, ād = 1 − ad}

and

E2 = {{(a1, . . . , ad), (ad, a1, . . . , ad−1)} | (a1, . . . , ad) ∈ [2]^d}.
Figure 22.13 shows the 3- and 4-dimensional shuffle-exchange graph.
Figure 22.13: The structure of SE(3) and SE(4).
Figure 22.15: Two examples of a DeBruijn graph.
Remarks:
• Two examples of a DeBruijn graph can be found in Figure 22.15.
• There are some data structures which also qualify as hypercubic net-
works. An example of a hypercubic network is the skip list, the bal-
anced binary search tree for the lazy programmer:
Definition 22.16 (Skip List). The skip list is an ordinary ordered linked list
of objects, augmented with additional forward links. The ordinary linked list is
the level 0 of the skip list. In addition, every object is promoted to level 1 with
probability 1/2. As for level 0, all level 1 objects are connected by a linked list.
In general, every object on level i is promoted to the next level with probability
1/2. A special start-object points to the smallest/first object on each level.
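A compact Python sketch of such a skip list, supporting insert and search only; the sentinel start-object and the growth of its link array are implementation choices, not part of the definition.

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * (level + 1)     # one forward link per level

class SkipList:
    def __init__(self):
        self.head = Node(float("-inf"), 0)   # special start-object

    def insert(self, key):
        level = 0
        while random.random() < 0.5:         # promote with probability 1/2
            level += 1
        while len(self.head.next) <= level:  # grow the start-object if needed
            self.head.next.append(None)
        node = Node(key, level)
        cur = self.head
        for i in reversed(range(len(self.head.next))):
            while cur.next[i] is not None and cur.next[i].key < key:
                cur = cur.next[i]
            if i <= level:                   # splice the new object in on level i
                node.next[i] = cur.next[i]
                cur.next[i] = node

    def contains(self, key):
        cur = self.head
        for i in reversed(range(len(self.head.next))):
            while cur.next[i] is not None and cur.next[i].key < key:
                cur = cur.next[i]            # drop to a lower level on overshoot
        nxt = cur.next[0]
        return nxt is not None and nxt.key == key

s = SkipList()
for x in [3, 1, 4, 1, 5, 9, 2, 6]:
    s.insert(x)
print(s.contains(5), s.contains(7))          # True False
```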
Remarks:
• Search, insert, and delete can be implemented in O(log n) expected
time in a skip list, simply by jumping from higher levels to lower ones
when overshooting the searched position. Also, the amortized memory
cost of each object is constant, as on average an object only has two
forward links.
• The randomization can easily be discarded, by deterministically pro-
moting a constant fraction of objects of level i to level i+1, for all i. In
particular, when inserting or deleting, object o simply checks whether
its left and right level i neighbors are being promoted to level i + 1.
If none of them is, promote object o itself. Essentially we establish
a maximal independent set (MIS) on each level, hence at least every
third and at most every second object is promoted.
• There are obvious variants of the skip list, e.g., the skip graph. Instead
of promoting only half of the nodes to the next level, we always pro-
mote all the nodes, similarly to a balanced binary tree: All nodes are
part of the root level of the binary tree. Half the nodes are promoted
left, and half the nodes are promoted right, on each level. Hence on
level i we have have 2i lists (or, if we connect the last element again
with the first: rings) of about n/2i objects. The skip graph features
all the properties of Definition 22.4.
• More generally, how are degree and diameter of Definition 22.4 re-
lated? The following theorem gives a general lower bound.
Theorem 22.17. Every graph of maximum degree d > 2 and size n must have
a diameter of at least ⌈(log n)/(log(d − 1))⌉ − 2.
Proof. Suppose we have a graph G = (V, E) of maximum degree d and size
n. Start from any node v ∈ V . In a first step at most d other nodes can be
reached. In two steps at most d · (d − 1) additional nodes can be reached. Thus,
in general, in at most r steps at most

1 + Σ_{i=0}^{r−1} d · (d − 1)^i = 1 + d · ((d − 1)^r − 1)/((d − 1) − 1) ≤ d · (d − 1)^r/(d − 2)

nodes (including v) can be reached. This number must be at least n if v is to
reach every other node within r steps, hence (d − 1)^r ≥ (d − 2)/d · n, and thus
r ≥ log_{d−1}((d − 2)/d · n) ≥ (log n)/(log(d − 1)) − 2.
Remarks:
• In other words, constant-degree hypercubic networks feature an
asymptotically optimal diameter D.
• Other hypercubic graphs manage to have a different tradeoff between
node degree d and diameter D. The pancake graph, for instance, min-
imizes the maximum of these with max(d, D) = Θ(log n/ log log n).
The ID of a node u in the pancake graph of dimension d is an ar-
bitrary permutation of the numbers 1, 2, . . . , d. Two nodes u, v are
connected by an edge if one can get the ID of node v by taking the
ID of node u, and reversing (flipping) the first k (for k = 1, . . . , d)
numbers of u’s ID. For example, in dimension d = 4, nodes u = 2314
and v = 1324 are neighbors.
• There are a few other interesting graph classes which are not hyper-
cubic networks, but nevertheless seem to relate to the properties of
Definition 22.4. Small-world graphs (a popular representations for
social networks) also have small diameter, however, in contrast to hy-
percubic networks, they are not homogeneous and feature nodes with
large degrees.
Remarks:
• A DHT has many applications beyond storing movies, e.g., the Inter-
net domain name system (DNS) is essentially a DHT.
• Other hypercubic networks, e.g. the pancake graph, might need a bit
of twisting to find appropriate IDs.
• Second, the adversary does not have to wait until the system is recov-
ered before it crashes the next batch of nodes. Instead, the adversary
can constantly crash nodes, while the system is trying to stay alive.
Indeed, the system is never fully repaired but always fully functional.
In particular, the system is resilient against an adversary that contin-
uously attacks the “weakest part” of the system. The adversary could
for example insert a crawler into the DHT, learn the topology of the
system, and then repeatedly crash selected nodes, in an attempt to
partition the DHT. The system counters such an adversary by con-
tinuously moving the remaining or newly joining nodes towards the
areas under attack.
Remarks:
along the edges of the graph such that all hypernodes end up with the
same or almost the same number of tokens. While tokens are moved
around, an adversary constantly inserts and deletes tokens. See also
Figure 22.20.
Theorem 22.21 (DHT with Churn). We have a fully scalable, efficient distrib-
uted storage system which tolerates O(log n) worst-case joins and/or crashes per
constant time interval. As in other storage systems, nodes have O(log n) overlay
neighbors, and the usual operations (e.g., search, insert) take time O(log n).
Chapter Notes
The ideas behind distributed storage were laid during the peer-to-peer (P2P)
file sharing hype around the year 2000, so a lot of the seminal research
in this area is labeled P2P. The paper of Plaxton, Rajaraman, and Richa
[PRR97] laid out a blueprint for many so-called structured P2P architec-
ture proposals, such as Chord [SMK+ 01], CAN [RFH+ 01], Pastry [RD01],
Viceroy [MNR02], Kademlia [MM02], Koorde [KK03], SkipGraph [AS03], Skip-
Net [HJS+ 03], or Tapestry [ZHS+ 04]. Also the paper of Plaxton et al. was
standing on the shoulders of giants. Some of its eminent precursors are: lin-
ear and consistent hashing [KLL+ 97], locating shared objects [AP90, AP91],
compact routing [SK85, PU88], and even earlier: hypercubic networks, e.g.
[AJ75, Wit81, GS81, BA84].
Furthermore, the techniques in use for prefix-based overlay structures are
related to a proposal called LAND, a locality-aware distributed hash table pro-
posed by Abraham et al. [AMD04].
More recently, a lot of P2P research focussed on security aspects, describing
for instance attacks [LMSW06, SENB07, Lar07], and provable countermeasures
[KSW05, AS09, BSS09]. Another topic currently garnering interest is using
P2P to help distribute live streams of video content on a large scale [LMSW07].
There are several recommendable introductory books on P2P computing, e.g.
[SW05, SG05, MS07, KW08, BYL08].
Some of the figures in this chapter have been provided by Christian Schei-
deler.
Bibliography
[AJ75] George A. Anderson and E. Douglas Jensen. Computer Interconnec-
tion Structures: Taxonomy, Characteristics, and Examples. ACM
Comput. Surv., 7(4):197–213, December 1975.
[AMD04] Ittai Abraham, Dahlia Malkhi, and Oren Dobzinski. LAND: stretch
(1 + epsilon) locality-aware networks for DHTs. In Proceedings of
the fifteenth annual ACM-SIAM symposium on Discrete algorithms,
SODA ’04, pages 550–559, Philadelphia, PA, USA, 2004. Society for
Industrial and Applied Mathematics.
[AP90] Baruch Awerbuch and David Peleg. Efficient Distributed Construc-
tion of Sparse Covers. Technical report, The Weizmann Institute of
Science, 1990.
[AP91] Baruch Awerbuch and David Peleg. Concurrent Online Tracking of
Mobile Users. In SIGCOMM, pages 221–233, 1991.
[AS03] James Aspnes and Gauri Shah. Skip Graphs. In SODA, pages 384–
393. ACM/SIAM, 2003.
[AS09] Baruch Awerbuch and Christian Scheideler. Towards a Scalable and
Robust DHT. Theory Comput. Syst., 45(2):234–260, 2009.
[BA84] L. N. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hy-
perbus Structures for a Computer Network. IEEE Trans. Comput.,
33(4):323–333, April 1984.
[BSS09] Matthias Baumgart, Christian Scheideler, and Stefan Schmid. A
DoS-resilient information system for dynamic data management. In
Proceedings of the twenty-first annual symposium on Parallelism in
Algorithms and Architectures, SPAA '09. ACM, 2009.
[BYL08] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking
and Applications. Morgan Kaufmann Publishers Inc., San Francisco,
CA, USA, 2008.
[KLL+ 97] David R. Karger, Eric Lehman, Frank Thomson Leighton, Rina
Panigrahy, Matthew S. Levine, and Daniel Lewin. Consistent Hash-
ing and Random Trees: Distributed Caching Protocols for Relieving
Hot Spots on the World Wide Web. In Frank Thomson Leighton
and Peter W. Shor, editors, STOC, pages 654–663. ACM, 1997.
[LMSW06] Thomas Locher, Patrick Moor, Stefan Schmid, and Roger Watten-
hofer. Free Riding in BitTorrent is Cheap. In 5th Workshop on Hot
Topics in Networks (HotNets), Irvine, California, USA, November
2006.
[LMSW07] Thomas Locher, Remo Meier, Stefan Schmid, and Roger Watten-
hofer. Push-to-Pull Peer-to-Peer Live Streaming. In 21st Inter-
national Symposium on Distributed Computing (DISC), Lemesos,
Cyprus, September 2007.
[MNR02] Dahlia Malkhi, Moni Naor, and David Ratajczak. Viceroy: a scal-
able and dynamic emulation of the butterfly. In Proceedings of the
twenty-first annual symposium on Principles of distributed comput-
ing, PODC ’02, pages 183–192, New York, NY, USA, 2002. ACM.
[PU88] David Peleg and Eli Upfal. A tradeoff between space and efficiency
for routing tables. In Proceedings of the twentieth annual ACM
symposium on Theory of computing, STOC ’88, pages 43–52, New
York, NY, USA, 1988. ACM.
[RFH+ 01] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and
Scott Shenker. A scalable content-addressable network. SIGCOMM
Comput. Commun. Rev., 31(4):161–172, August 2001.
[SK85] Nicola Santoro and Ramez Khatib. Labelling and Implicit Routing
in Networks. Comput. J., 28(1):5–8, 1985.
[SMK+ 01] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and
Hari Balakrishnan. Chord: A scalable peer-to-peer lookup ser-
vice for internet applications. SIGCOMM Comput. Commun. Rev.,
31(4):149–160, August 2001.
[SW05] Ralf Steinmetz and Klaus Wehrle, editors. Peer-to-Peer Systems and
Applications, volume 3485 of Lecture Notes in Computer Science.
Springer, 2005.
[ZHS+ 04] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, An-
thony D. Joseph, and John Kubiatowicz. Tapestry: a resilient
global-scale overlay for service deployment. IEEE Journal on Se-
lected Areas in Communications, 22(1):41–53, 2004.
Chapter 23
Eventual Consistency & Bitcoin
How would you implement an ATM? Does the following implementation work
satisfactorily?
Remarks:
• A connection problem between the bank and the ATM may block
Algorithm 23.1 in Line 2.
• There are numerous causes for partitions to occur, e.g., physical dis-
connections, software errors, or incompatible protocol versions. From
the point of view of a node in the system, a partition is similar to a
period of sustained message loss.
Remarks:
• Algorithm 23.6 is partition tolerant and available since it continues to
process requests even when the bank is not reachable.
• The ATM’s local view of the balances may diverge from the balances
as seen by the bank, therefore consistency is no longer guaranteed.
• The algorithm will synchronize any changes it made to the local bal-
ances back to the bank once connectivity is re-established. This is
known as eventual consistency.
Definition 23.7 (Eventual Consistency). If no new updates to the shared state
are issued, then eventually the system is in a quiescent state, i.e., no more
messages need to be exchanged between nodes, and the shared state is consistent.
Remarks:
• Eventual consistency is a form of weak consistency.
• Eventual consistency guarantees that the state is eventually agreed
upon, but the nodes may disagree temporarily.
• During a partition, different updates may semantically conflict with
each other. A conflict resolution mechanism is required to resolve the
conflicts and allow the nodes to eventually agree on a common state.
23.2 Bitcoin
Definition 23.8 (Bitcoin Network). The Bitcoin network is a randomly con-
nected overlay network of a few tens of thousands of individually controlled
nodes.
Remarks:
• Old nodes re-entering the system try to connect to peers that they were
earlier connected to. If those peers are not available, they default to
the new node behavior.
• New nodes entering the system face the bootstrap problem: they can
find active peers in any way they want. If they cannot find an active
peer, they fall back to a set of authoritative sources that are hard-coded
in the Bitcoin source code.
Remarks:
• Bitcoin supports the ECDSA and the Schnorr digital signature algo-
rithms to verify ownership of bitcoins.
• It is hard to link public keys to the user that controls them, hence
Bitcoin is often referred to as being pseudonymous.
Remarks:
• Inputs reference the output that is being spent by a (h, i)-tuple, where
h is the hash of the transaction that created the output, and i specifies
the index of the output in that transaction.
• Transactions can be gossiped by any node in the network and are pro-
cessed by every node that receives them through the gossip protocol.
Remarks:
• The outputs of a transaction may assign less than the sum of inputs, in
which case the difference is called the transaction fee. The fee is used
to incentivize other participants in the system (see Definition 23.18).
Remarks:
2. For fixed parameters d and c, finding x such that F_d(c, x) = true is com-
putationally difficult but feasible. The difficulty d is used to adjust the time
to find such an x.
Definition 23.15 (Bitcoin PoW function). The Bitcoin PoW function is given
by
F_d(c, x) → SHA256(SHA256(c | x)) < 2^224 / d.
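As an illustration only (real Bitcoin hashes a specific block header serialization and encodes the target in the header's "bits" field), Definition 23.15 translates almost literally into Python:

```python
import hashlib

def pow_valid(c: bytes, x: bytes, d: int) -> bool:
    """F_d(c, x): the double SHA256 of c|x, read as a 256-bit integer,
    must be below the target 2^224 / d."""
    digest = hashlib.sha256(hashlib.sha256(c + x).digest()).digest()
    return int.from_bytes(digest, "big") < (2 ** 224) // d

# A single attempt succeeds with probability roughly 1 / (d * 2^32), so even
# at difficulty d = 1 finding a valid x takes billions of trials in
# expectation; this is exactly what mining hardware is built for.
c = b"hash of previous block"
print(pow_valid(c, (12345).to_bytes(8, "big"), d=1))   # almost surely False
```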
Remarks:
• With their reference to a previous block, the blocks build a tree, rooted
in the so called genesis block. The genesis block’s hash is hard-coded
in the Bitcoin source code.
• The primary goal for using the PoW mechanism is to adjust the rate
at which blocks are found in the network, giving the network time
to synchronize on the latest block. Bitcoin sets the difficulty so that
globally a block is created about every 10 minutes in expectation.
• Finding a block allows the finder to impose the transactions in its local
memory pool to all other nodes. Upon receiving a block, all nodes roll
back any local changes since the previous block and apply the new
block’s transactions.
Remarks:
• A coinbase transaction is the sole exception to the rule that the sum
of inputs must be at least the sum of outputs. New bitcoins enter the
system through coinbase transactions.
Definition 23.19 (Blockchain). The longest path from the genesis block (root
of the tree) to a (deepest) leaf is called the blockchain. The blockchain acts as a
consistent transaction history on which all nodes eventually agree.
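Definition 23.19 amounts to a longest-path computation in the block tree. Below is a minimal Python sketch with a hypothetical block/parent representation; note that deployed nodes compare accumulated difficulty rather than plain path length (cf. the notes of Chapter 24).

```python
def blockchain(parent: dict) -> list:
    """Given a mapping block -> parent (the genesis block maps to None),
    return the longest path from the genesis block to a deepest leaf."""
    heights = {}

    def height(b):
        if b not in heights:
            p = parent[b]
            heights[b] = 0 if p is None else height(p) + 1
        return heights[b]

    head = max(parent, key=height)            # deepest block = current head
    path = [head]
    while parent[path[-1]] is not None:       # walk back to the genesis block
        path.append(parent[path[-1]])
    return list(reversed(path))

# A small fork: blocks "b2a" and "b2b" compete, "b3" extends "b2b".
parent = {"genesis": None, "b1": "genesis", "b2a": "b1", "b2b": "b1", "b3": "b2b"}
print(blockchain(parent))   # ['genesis', 'b1', 'b2b', 'b3']
```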
Remarks:
• The path length from the genesis block to block b is the height hb .
• Only the longest path from the genesis block to a leaf is a valid trans-
action history, since branches may contradict each other because of
doublespends.
• Since only transactions in the longest path are agreed upon, miners
have an incentive to append their blocks to the longest chain, thus
agreeing on the current state.
• The mining incentives quickly increased the difficulty of the PoW
mechanism: initially miners used CPUs to mine blocks, but CPUs
were quickly replaced by GPUs, FPGAs and even application specific
integrated circuits (ASICs) as bitcoins appreciated. This results in
an equilibrium today in which only the most cost efficient miners, in
terms of hardware supply and electricity, make a profit in expectation.
• If multiple blocks are mined more or less concurrently, the system is
said to have forked. Forks happen naturally because mining is a dis-
tributed random process and two new blocks may be found at roughly
the same time.
Remarks:
• Algorithm 23.20 describes how a node updates its local state upon
receiving a block. Like Algorithm 23.12, this describes the local policy
and may also result in node states diverging, i.e., by accepting different
blocks at the same height as current head.
• Unlike extending the current path, switching paths may result in con-
firmed transactions no longer being confirmed, because the blocks in
the new path do not include them. Switching paths is referred to as
a reorg.
Theorem 23.21. Forks are eventually resolved and all nodes eventually agree
on which is the longest blockchain. The system therefore guarantees eventual
consistency.
Proof. In order for the fork to continue to exist, pairs of blocks need to be
found in close succession, extending distinct branches, otherwise the nodes on
the shorter branch would switch to the longer one. The probability of branches
being extended almost simultaneously decreases exponentially with the length
of the fork, hence there will eventually be a time when only one branch is being
extended, becoming the longest branch.
Remarks:
• As all nodes cannot upgrade at the same time, miners can create
blocks that have more restrictive is valid rules and older nodes will
still accept them as they accept broader rules. This way, rules can
still be changed without having to upgrade all nodes at the same
time. Miners, on the other hand, have to upgrade almost at the same
time.
23.3 Layer 2
Definition 23.24 (Smart Contract). A smart contract is an agreement between
two or more parties, encoded in such a way that the correct execution is guar-
anteed by the blockchain.
Remarks:
• Transactions with a timelock are not released into the network until
the timelock expires. It is the responsibility of the node receiving
the transaction to store it locally until the timelock expires and then
release it into the network.
• Transactions (and blocks) with future timelocks are invalid. Upon re-
ceiving invalid transactions or blocks, nodes discard them immediately
and do not forward them to their neighbors.
Remarks:
• ts is called a setup transaction and is used to lock funds into a shared
account. If ts is signed and broadcast immediately, one of the parties
could later refuse to collaborate in spending the multisig output, and
the funds would become unspendable. To avoid a situation where the funds cannot
be spent, the protocol also creates a timelocked refund transaction
tr which guarantees that, should the funds not be spent before the
timelock expires, the funds are returned to the respective party. At no
point in time one of the parties holds a fully signed setup transaction
without the other party holding a fully signed refund transaction,
guaranteeing that funds are eventually returned.
• Both transactions require the signature of both parties. The setup
transaction has two inputs from A and B respectively which require
individual signatures. The refund transaction requires both signatures
because of the 2-of-2 multisig input.
Remarks:
• Algorithm 23.28 implements a Simple Micropayment Channel, a smart
contract that is used for rapidly adjusting micropayments from a
spender to a recipient. Only two transactions are ever broadcast and
inserted into the blockchain: the setup transaction ts and the last set-
tlement transaction tf . There may have been any number of updates
to the settlement transaction, transferring ever more of the shared
output to the recipient.
• The number of bitcoins c used to fund the channel is also the maximum
total that may be transferred over the simple micropayment channel.
• At any time the recipient R is guaranteed to eventually receive the
bitcoins, since she holds a fully signed settlement transaction, while
the spender only has partially signed ones.
• The simple micropayment channel is intrinsically unidirectional. Since
the recipient may choose any of the settlement transactions in the
protocol, she will use the one with maximum payout for her.
Remarks:
• Users are annoyed if they receive a notification about a comment on
an online social network, but are unable to reply because the web
interface does not show the same notification yet. In this case the
notification acts as the first read operation, while looking up the com-
ment on the web interface is the second read operation.
Definition 23.30 (Monotonic Write Consistency). A write operation by a node
on a data item is completed before any successive write operation by the same
node (i.e., system guarantees to serialize writes by the same node).
Remarks:
• The ATM must replay all operations in order, otherwise it might hap-
pen that an earlier operation overwrites the result of a later operation,
resulting in an inconsistent final state.
Definition 23.31 (Read-Your-Write Consistency). After a node u has updated
a data item, any later reads from node u will never see an older value.
Definition 23.32 (Causal Relation). The following pairs of operations are said
to be causally related:
• Two writes by the same node to different variables.
• A read followed by a write of the same node.
• A read that returns the value of a write from any node.
• Two operations that are transitively related according to the above condi-
tions.
Remarks:
• The first rule ensures that writes by a single node are seen in the same
order. For example, a node might write a value into one variable and then
signal that it has written the value by writing into another variable.
Another node could then read the signalling variable but still read the
old value from the first variable, if the two writes were not causally
related.
Chapter Notes
The CAP theorem was first introduced by Fox and Brewer [FB99], although it
is commonly attributed to a talk by Eric Brewer [Bre00]. It was later proven
by Gilbert and Lynch [GL02] for the asynchronous model. Gilbert and Lynch
also showed how to relax the consistency requirement in a partially synchronous
system to achieve availability and partition tolerance.
Bitcoin was introduced in 2008 by Satoshi Nakamoto [Nak08]. Nakamoto is
thought to be a pseudonym used by either a single person or a group of people;
it is still unknown who invented Bitcoin, giving rise to speculation and con-
spiracy theories. Plausible candidates include the noted cryptographers Nick
Szabo [Big13] and Hal Finney [Gre14]. The first Bitcoin client was published
shortly after the paper and the first block was mined on January 3, 2009. The
genesis block contained the headline of the release date’s The Times issue “The
Times 03/Jan/2009 Chancellor on brink of second bailout for banks”, which
serves as proof that the genesis block has been indeed mined on that date, and
that no one had mined before that date. The quote in the genesis block is also
thought to be an ideological hint: Bitcoin was created in a climate of finan-
cial crisis, induced by rampant manipulation by the banking sector, and Bitcoin
quickly grew in popularity in anarchic and libertarian circles. The original client
is nowadays maintained by a group of independent core developers and remains
the most used client in the Bitcoin network.
Central to Bitcoin is the resolution of conflicts due to doublespends, which
is solved by waiting for transactions to be included in the blockchain. This
however introduces large delays for the confirmation of payments which are
undesirable in some scenarios in which an immediate confirmation is required.
Karame et al. [KAC12] show that accepting unconfirmed transactions leads to
a non-negligible probability of being defrauded as a result of a doublespending
attack. This is facilitated by information eclipsing [DW13], i.e., that nodes
do not forward conflicting transactions, hence the victim does not see both
transactions of the doublespend. Bamert et al. [BDE+ 13] showed that the odds
of detecting a doublespending attack in real-time can be improved by connecting
to a large sample of nodes and tracing the propagation of transactions in the
network.
Bitcoin does not scale very well due to its reliance on confirmations in the
blockchain. A copy of the entire transaction history is stored on every node
in order to bootstrap joining nodes, which have to reconstruct the transaction
history from the genesis block. Simple micropayment channels were introduced
by Hearn and Spilman [HS12] and may be used to bundle multiple transfers
between two parties but they are limited to transferring the funds locked into
the channel once. Duplex Micropayment Channels [DW15] and the Lightning
Network [PD15] were the first suggestions for bidirectional micropayment chan-
nels in which the funds can be transferred back and forth an arbitrary number
of times, greatly increasing the flexibility of Bitcoin transfers.
Bibliography
[BDE+ 13] Tobias Bamert, Christian Decker, Lennart Elsen, Samuel Welten,
and Roger Wattenhofer. Have a snack, pay with bitcoin. In IEEE
International Conference on Peer-to-Peer Computing (P2P), Trento,
Italy, 2013.
[Big13] John Biggs. Who is the real satoshi nakamoto? one researcher may
have found the answer. http://on.tcrn.ch/l/R0vA, 2013.
[Bre00] Eric A. Brewer. Towards robust distributed systems. In Symposium
on Principles of Distributed Computing (PODC). ACM, 2000.
[DW13] Christian Decker and Roger Wattenhofer. Information propagation
in the bitcoin network. In IEEE International Conference on Peer-
to-Peer Computing (P2P), Trento, Italy, September 2013.
[DW15] Christian Decker and Roger Wattenhofer. A Fast and Scalable Pay-
ment Network with Bitcoin Duplex Micropayment Channels. In Sym-
posium on Stabilization, Safety, and Security of Distributed Systems
(SSS), 2015.
[FB99] Armando Fox and Eric Brewer. Harvest, yield, and scalable tolerant
systems. In Hot Topics in Operating Systems. IEEE, 1999.
[GL02] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibil-
ity of consistent, available, partition-tolerant web services. SIGACT
News, 2002.
[Gre14] Andy Greenberg. Nakamoto’s neighbor: My hunt for bitcoin’s cre-
ator led to a paralyzed crypto genius. http://onforb.es/1rvyecq,
2014.
[HS12] Mike Hearn and Jeremy Spilman. Contract: Rapidly adjusting
micro-payments. https://en.bitcoin.it/wiki/Contract, 2012. Last ac-
cessed on November 11, 2015.
Chapter 24
Advanced Blockchain
In this chapter we study various advanced blockchain concepts, which are pop-
ular in research.
Figure 24.4: Each state of the Markov chain represents how many blocks the
selfish miner is ahead, i.e., ds − dp . In each state, the selfish miner finds a
block with probability α, and the honest miners find a block with probability
β = 1 − α. The interesting cases are the “irregular” β arrow from state 2 to
state 0, and the β arrow from state 1 to state 0 as it will include three subcases.
Proof. We model the current state of the system with a Markov chain, see
Figure 24.4.
We can solve the following Markov chain equations to figure out the proba-
bility of each state in the stationary distribution:
p1 = α p0,    β p_{i+1} = α p_i for all i ≥ 1,    and 1 = Σ_i p_i.

With ρ = α/β, the normalization gives

1 = p1/α + Σ_{i≥0} p1 ρ^i = p1/α + p1/(1 − ρ), hence p1 = (2α² − α)/(α² + α − 1).
Each state has an outgoing arrow with probability β. If this arrow is taken,
one or two blocks (depending on the state) are attached that will eventually
end up in the main chain of the blockchain. In state 0 (if arrow β is taken),
the honest miners attach a block. In all states i with i > 2, the selfish miner
eventually attaches a block. In state 2, the selfish miner directly attaches 2
blocks because of Line 11 in Algorithm 24.2.
State 1 in Line 8 is interesting. The selfish miner secretly was 1 block ahead,
but now (after taking the β arrow) the honest miners are attaching a competing
block. We have a race who attaches the next block, and where. There are three
possibilities:
• Either the selfish miner manages to attach another block to its own block,
giving 2 blocks to the selfish miner. This happens with probability α.
• Or the honest miners attach a block (with probability β) to their previous
honest block (with probability 1 − γ). This gives 2 blocks to the honest
miners, with total probability β(1 − γ).
• Or the honest miners attach a block to the selfish block, giving 1 block to
each side, with probability βγ.
The blockchain process is just a biased random walk through these states.
Since blocks are attached whenever we have an outgoing β arrow, the total
number of blocks being attached per state is simply 1 + p1 + p2 (all states attach
a single block, except states 1 and 2 which attach 2 blocks each).
As argued above, of these blocks, 1 − p0 + p2 + αp1 − β(1 − γ)p1 are blocks
by the selfish miner, i.e., the ratio of selfish blocks in the blockchain is

(1 − p0 + p2 + αp1 − β(1 − γ)p1) / (1 + p1 + p2).
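The formulas above are easy to evaluate numerically; this sketch (names are ours) plugs in the stationary distribution and reproduces the thresholds mentioned in the remarks below.

```python
def selfish_share(alpha: float, gamma: float) -> float:
    """Expected fraction of blocks in the blockchain mined by the selfish
    miner, using the stationary distribution derived above."""
    beta = 1 - alpha
    rho = alpha / beta
    p1 = (2 * alpha ** 2 - alpha) / (alpha ** 2 + alpha - 1)
    p0 = p1 / alpha
    p2 = rho * p1
    total = 1 + p1 + p2
    selfish = 1 - p0 + p2 + alpha * p1 - beta * (1 - gamma) * p1
    return selfish / total

# With gamma = 1/2, alpha = 1/4 is exactly the break-even point:
print(selfish_share(0.25, 0.5))   # ~0.25, same share as honest mining
print(selfish_share(0.30, 0.5))   # ~0.327 > 0.30, selfish mining pays off
```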
Remarks:
• If γ = 1/2 (the selfish miner learns about honest blocks very quickly
and manages to convince half of the honest miners to mine on the
selfish block instead of the slightly earlier published honest block),
already α = 1/4 is enough to have a higher share in expectation.
• And if γ = 1 (the selfish miner controls the network, and can hide any
honest block until the selfish block is published) any α > 0 justifies
selfish mining.
24.2 Ethereum
Definition 24.5 (Ethereum). Ethereum is a distributed state machine. Unlike
Bitcoin, Ethereum promises to run arbitrary computer programs in a blockchain.
Remarks:
• Like the Bitcoin network, Ethereum consists of nodes that are con-
nected by a random virtual network. These nodes can join or leave
the network arbitrarily. There is no central coordinator.
Remarks:
• Smart Contracts are written in higher level programming languages
like Solidity, Vyper, etc. and are compiled down to EVM (Ethereum
Virtual Machine) bytecode, which is a Turing complete low level pro-
gramming language.
• Smart contracts cannot be changed after deployment. But most smart
contracts contain mutable storage, and this storage can be used to
adapt the behavior of the smart contract. With this, many smart
contracts can update to a new version.
Definition 24.7 (Account). Ethereum knows two kinds of accounts. Exter-
nally Owned Accounts (EOAs) are controlled by individuals, with a secret key.
Contract Accounts (CAs) are for smart contracts. CAs are not controlled by a
user.
Definition 24.8 (Ethereum Transaction). An Ethereum transaction is sent by
a user who controls an EOA to the Ethereum network. A transaction contains:
• Nonce: This “number only used once” is simply a counter that counts how
many transactions the account of the sender of the transaction has already
sent.
• 160-bit address of the recipient.
• The transaction is signed by the user controlling the EOA.
• Value: The amount of Wei (the native currency of Ethereum) to transfer
from the sender to the recipient.
• Data: Optional data field, which can be accessed by smart contracts.
• StartGas: A value representing the maximum amount of computation this
transaction is allowed to use.
• GasPrice: How many Wei per unit of Gas the sender is paying. Miners
will probably select transactions with a higher GasPrice, so a high GasPrice
will make sure that the transaction is executed more quickly.
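Purely as an illustration of the fields listed above (the field names are ours; this is not the actual RLP-encoded wire format used by Ethereum), a transaction could be modeled like this:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    """Illustrative model of the fields from Definition 24.8."""
    nonce: int          # number of transactions already sent from this EOA
    to: bytes           # 160-bit (20-byte) address of the recipient
    value: int          # amount of Wei to transfer
    data: bytes         # optional payload, read by smart contracts
    start_gas: int      # maximum amount of computation allowed
    gas_price: int      # Wei paid per unit of Gas
    signature: bytes    # signature of the sending EOA

# A simple transaction: empty data field, value given in Wei.
tx = Transaction(nonce=7, to=bytes(20), value=10 ** 18, data=b"",
                 start_gas=21_000, gas_price=10 ** 9, signature=b"...")
print(tx.value, "Wei =", tx.value / 10 ** 18, "Ether")
```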
Remarks:
• There are three types of transactions.
Definition 24.9 (Simple Transaction). A simple transaction in Ethereum
transfers some of the native currency, called Wei, from one EOA to another.
Higher units of currency are called Szabo, Finney, and Ether, with 10^18 Wei =
10^6 Szabo = 10^3 Finney = 1 Ether. The data field in a simple transaction is
empty.
Definition 24.10 (Smart Contract Creation Transaction). A transaction whose
recipient address field is set to 0 and whose data field is set to compiled EVM
code is used to deploy that code as a smart contract on the Ethereum blockchain.
The contract is considered deployed after it has been mined in a block and is
included in the blockchain at a sufficient depth.
Remarks:
Definition 24.12 (Gas). Gas is the unit of an atomic computation, like swap-
ping two variables. Complex operations use more than 1 Gas, e.g., ADDing two
numbers costs 3 Gas.
Remarks:
Chapter Notes
Selfish mining has already been discussed shortly after the introduction of Bit-
coin [RHo10]. A few years later, Eyal and Sirer formally analyzed selfish mining
[ES14]. If the selfish miner is two or more blocks ahead, this original research
suggested to always answer a newly published block by releasing the oldest un-
published block, so as to have two blocks at the same level. The idea was that honest
miners will then split their mining power between these two blocks. However,
what matters is how long it takes the honest miners to find the next block to
extend the public blockchain. This time does not change whether the honest
miners split their efforts or not. Hence the case dp < ds − 1 is not needed in
Algorithm 24.2.
Similarly, Courtois and Bahack [CB14] study subversive mining strategies.
Nayak et al. [NKMS15] combine selfish mining and eclipse attacks. Algorithm
24.2 is not optimal for all parameters, e.g., sometimes it may be beneficial to
risk even a two-block advantage. Sapirshtein et al. [SSZ15] describe and analyze
the optimal algorithm.
Vitalik Buterin introduced Ethereum in the 2013 whitepaper [But13]. In
2014, Ethereum Foundation was founded to create Ethereum’s first implementa-
tion. An online crowd-sale was conducted to raise around 31,000 BTC (around
USD 18 million at the time) for this. In this sense, Ethereum was the first
ICO (Initial Coin Offering). Ethereum has also attempted to write a formal
specification of its protocol in their yellow paper [Gav18]. This is in contrast
to Bitcoin, which doesn’t have a formal specification.
Bitcoin’s blockchain forms as a chain, i.e., each block (except the genesis
block) has a parent block. The longest chain with the highest difficulty is
considered the main chain. GHOST [SZ15] is an alternative to the longest chain
rule for establishing consensus in PoW based blockchains and aims to alleviate
adverse impacts of stale blocks. Ethereum’s blockchain structure is a variant
of GHOST. Other systems based on DAGs have been proposed in [SLZ16],
[SZ18], [LLX+ 18], and [LSZ15].
Bibliography
[But13] Vitalik Buterin. A Next-Generation Smart Contract and Decentral-
ized Application Platform, 2013. Available from: https://github.
com/ethereum/wiki/wiki/White-Paper.
[ES14] Ittay Eyal and Emin Gün Sirer. Majority is not enough: Bitcoin
mining is vulnerable. In Financial Cryptography and Data Security,
pages 436–454. Springer, 2014.
[LLX+ 18] Chenxing Li, Peilun Li, Wei Xu, Fan Long, and Andrew Chi-Chih
Yao. Scaling nakamoto consensus to thousands of transactions per
second. CoRR, abs/1805.03870, 2018.
Chapter 25
Game Theory
“Game theory is a sort of umbrella or ‘unified field’ theory for the rational side
of social science, where ‘social’ is interpreted broadly, to include human as well
as non-human players (computers, animals, plants).”
– Robert Aumann, 1987
25.1 Introduction
In this chapter we look at a distributed system from a different perspective.
Nodes no longer have a common goal, but are selfish. The nodes are not byzan-
tine (actively malicious), instead they try to benefit from a distributed system
– possibly without contributing.
Game theory attempts to mathematically capture behavior in strategic sit-
uations, in which an individual’s success depends on the choices of others.
Remarks:
• We start with one of the most famous games to introduce some defi-
nitions and concepts of game theory.
25.2 Prisoner's Dilemma
                              Player u
                       Cooperate      Defect
  Player v  Cooperate   (1, 1)        (3, 0)
            Defect      (0, 3)        (2, 2)

Table 25.1: The prisoner's dilemma as a cost matrix; in each cell the first
entry is the cost (years in prison) of player v, the second the cost of player u.
• If both of them stay silent (cooperate), both will be sentenced to one year
of prison on a lesser charge.
• If both of them testify against their fellow prisoner (defect), the police has
a stronger case and they will be sentenced to two years each.
• If player u defects and the player v cooperates, then player u will go free
(snitching pays off) and player v will have to go to jail for three years; and
vice versa.
• This two player game can be represented as a matrix, see Table 25.1.
Definition 25.2 (game). A game requires at least two rational players, and
each player can choose from at least two options (strategies). In every possible
outcome (strategy profile) each player gets a certain payoff (or cost). The
payoff of a player depends on the strategies of the other players.
Remarks:
• The social optimum for the prisoner’s dilemma is when both players
cooperate – the corresponding cost sum is 2.
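The cost matrix of Table 25.1 is small enough to analyze mechanically. The following sketch (strategy names and helper functions are ours) enumerates best responses and confirms what the remarks below state: defecting is a dominant strategy, and mutual defection, with cost sum 4 instead of the social optimum of 2, is the only Nash Equilibrium.

```python
# cost[(s_u, s_v)] = (cost of u, cost of v), lower is better
C, D = "Cooperate", "Defect"
cost = {(C, C): (1, 1), (D, C): (0, 3), (C, D): (3, 0), (D, D): (2, 2)}

def best_response_u(strategy_v):
    """Strategy of u that minimizes u's cost, given v's strategy."""
    return min([C, D], key=lambda s_u: cost[(s_u, strategy_v)][0])

def best_response_v(strategy_u):
    """Strategy of v that minimizes v's cost, given u's strategy."""
    return min([C, D], key=lambda s_v: cost[(strategy_u, s_v)][1])

# Defect is the best response to both strategies of v, i.e. a dominant strategy.
print(best_response_u(C), best_response_u(D))          # Defect Defect

# Nash Equilibria: profiles where neither player can improve unilaterally.
nash = [(su, sv) for su in (C, D) for sv in (C, D)
        if su == best_response_u(sv) and sv == best_response_v(su)]
print(nash)                                            # [('Defect', 'Defect')]
```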
Remarks:
• The best response is the best strategy given a belief about the strategy
of the other players. In this game the best response to both strategies
of the other player is to defect. If one strategy is the best response to
any strategy of the other players, it is a dominant strategy.
• Game theorists were invited to come up with a strategy for 200 iter-
ations of the prisoner’s dilemma to compete in a tournament. Each
strategy had to play against every other strategy and accumulated
points throughout the tournament. The simple Tit4Tat strategy (co-
operate in the first game, then copy whatever the other player did in
the previous game) won. One year later, after analyzing each strat-
egy, another tournament (with new strategies) was held. Tit4Tat won
again.
• We will sometimes depict this game as a graph. The cost cv←u for
node v to access the file from node u is equivalent to the length of the
shortest path times the demand dv .
• Note that in undirected graphs cu←v > cv←u if and only if du > dv .
We assume that the graphs are undirected for the rest of the chapter.
Proof. Let u be a node that is not caching the file. Then there exists a node v
for which cu←v ≤ 1. Hence, node u has no incentive to cache.
Let u be a node that is caching the file. We now consider any other node v
that is also caching the file. First, we consider the case where v cached the file
before u did. Then it holds that cu←v > 1 by construction.
It could also be that v started caching the file after u did. Then it holds
that du ≥ dv and therefore cu←v ≥ cv←u . Furthermore, we have cv←u > 1 by
construction. Combining these implies that cu←v ≥ cv←u > 1.
In either case, node u has no incentive to stop caching.
OPoA = cost(NE⁺) / cost(SO).
Remarks:
• The Price of Anarchy measures how much a distributed system de-
grades because of selfish nodes.
• We have P oA ≥ OP oA ≥ 1.
Theorem 25.11. The (Optimistic) Price of Anarchy of Selfish Caching can be
Θ(n).
Proof. Consider a network as depicted in Figure 25.12. Every node v has de-
mand dv = 1. Note that if any node caches the file, no other node has an
incentive to cache the file as well since the cost to access the file is at most 1 − ε.
Without loss of generality, let us assume that a node v on the left caches the
file, then it is cheaper for every node on the right to access the file remotely.
Hence, the total cost of this solution is 1 + (n/2) · (1 − ε). In the social optimum
one node from the left and one node from the right cache the file. This reduces
the cost to 2. Hence, the Price of Anarchy is (1 + (n/2) · (1 − ε))/2, which for
ε → 0 becomes 1/2 + n/4 = Θ(n).
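As a sanity check, the two cost expressions from the proof can be evaluated numerically; this sketch only evaluates the formulas, it does not construct the network of Figure 25.12.

```python
def cost_nash(n: int, eps: float) -> float:
    """Worst Nash Equilibrium: one node caches, and the n/2 nodes on the
    other side each pay 1 - eps to access the file remotely."""
    return 1 + (n / 2) * (1 - eps)

def cost_social_optimum() -> float:
    """Social optimum: one node on each side caches the file."""
    return 2.0

n, eps = 1000, 1e-9
print(cost_nash(n, eps) / cost_social_optimum())   # ~250.5 = 1/2 + n/4
```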
(a) The road network without the shortcut (b) The road network with the shortcut
Figure 25.13: Braess’ Paradox, where d denotes the number of drivers using an
edge.
Remarks:
• We will now look at another famous game that will allow us to deepen
our understanding of game theory.
25.5 Rock-Paper-Scissors
There are two players, u and v. Each player simultaneously chooses one of three
options: rock, paper, or scissors. The rules are simple: paper beats rock, rock
beats scissors, and scissors beat paper. A matrix representation of this game is
in Table 25.15.
                                Player u
                      Rock        Paper       Scissors
  Player v  Rock     (0, 0)      (-1, 1)      (1, -1)
            Paper    (1, -1)      (0, 0)      (-1, 1)
            Scissors (-1, 1)      (1, -1)      (0, 0)

Table 25.15: Rock-Paper-Scissors as a payoff matrix; in each cell the first
entry is the payoff of player v, the second the payoff of player u.
Remarks:
• None of the three strategies is a Nash Equilibrium. Whatever player
u chooses, player v can always switch her strategy such that she wins.
• This is highlighted in the best response concept. The best response
to e.g. scissors is to play rock. The other player switches to paper.
And so on.
• Is this a game without a Nash Equilibrium? John Nash answered this
question in 1950. By choosing each strategy with a certain probability,
we can obtain a so called Mixed Nash Equilibrium.
Definition 25.16 (Mixed Nash Equilibrium). A Mixed Nash Equilibrium
(MNE) is a strategy profile in which at least one player is playing a random-
ized strategy (choose strategy profiles according to probabilities), and no player
can improve their expected payoff by unilaterally changing their (randomized)
strategy.
Theorem 25.17. Every game has a mixed Nash Equilibrium.
Remarks:
• The Nash Equilibrium of this game is if both players choose each
strategy with probability 1/3. The expected payoff is 0 (see the check
at the end of these remarks).
• Any strategy (or mix of them) is a best response to a player choosing
each strategy with probability 1/3.
• In a pure Nash Equilibrium, the strategies are chosen deterministi-
cally. Rock-Paper-Scissors does not have a pure Nash Equilibrium.
• Even though every game has a mixed Nash Equilibrium, such an
equilibrium is sometimes computationally difficult to compute. One
should be cautious about economic assumptions such as “the mar-
ket will always find the equilibrium”.
• Unfortunately, game theory does not always model problems accu-
rately. Many real world problems are too complex to be captured by
a game. And as you may know, humans (not only politicians) are
often not rational.
• In distributed systems, players can be servers, routers, etc. Game
theory can tell us whether systems and protocols are prone to selfish
behavior.
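Coming back to Rock-Paper-Scissors, the following small check (a sketch) confirms the equilibrium claim above: against the uniform 1/3 mix, every pure strategy has expected payoff 0, so no unilateral deviation improves the expected payoff.

```python
R, P, S = "Rock", "Paper", "Scissors"
# payoff[(a, b)] = payoff of the player choosing a against a player choosing b
payoff = {(R, R): 0, (R, P): -1, (R, S): 1,
          (P, R): 1, (P, P): 0, (P, S): -1,
          (S, R): -1, (S, P): 1, (S, S): 0}

uniform = {R: 1 / 3, P: 1 / 3, S: 1 / 3}

def expected_payoff(pure, mix):
    """Expected payoff of playing `pure` against the randomized strategy `mix`."""
    return sum(prob * payoff[(pure, other)] for other, prob in mix.items())

print([expected_payoff(s, uniform) for s in (R, P, S)])   # [0.0, 0.0, 0.0]
```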
Remarks:
• For simplicity, we assume that no two bids are the same, and that
b1 > b2 > b3 > . . .
• If zi < bmax < bi , then overbidding wins the auction, but the payoff
(zi − bmax ) is negative. Truthful bidding loses and yields a payoff of 0.
Likewise underbidding, i.e. bi < zi :
• If bmax < bi < zi , then both strategies win and yield the same payoff
(zi − bmax ).
• If bi < zi < bmax , then both strategies lose and yield a payoff of 0.
• If bi < bmax < zi , then truthful bidding wins and yields a positive payoff
(zi − bmax ). Underbidding loses and yields a payoff of 0.
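The case analysis above can also be checked mechanically; the following sketch compares truthful bidding with arbitrary alternative bids against a range of competing bids (ties are excluded, as assumed above).

```python
def payoff(bid: float, value: float, best_other_bid: float) -> float:
    """Second-price auction: the highest bid wins and the winner pays the
    second highest bid, i.e., here, the best competing bid."""
    return value - best_other_bid if bid > best_other_bid else 0.0

value = 10.0
for other in [2.0, 8.0, 9.5, 10.5, 12.0]:            # highest competing bids
    truthful = payoff(value, value, other)
    for alternative in [1.0, 5.0, 9.0, 11.0, 15.0]:   # over- and underbidding
        assert payoff(alternative, value, other) <= truthful
print("truthful bidding is never worse than any other bid")
```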
Remarks:
• Let us use this for Selfish Caching. We need to choose a node that is
the first to cache the file. But how? By holding an auction. Every
node says for which price it is willing to cache the file. We select the
node with the lowest offer and pay it the second lowest offer to ensure
truthful offers.
Proof. If the mechanism designer wants the nodes from the caching set S of the
Nash Equilibrium to cache, then she can offer the following deal to every node
not in S: “If any node from set S does not cache the file, then I will ensure
a positive payoff for you.” Thus, all nodes not in S prefer not to cache since
this is a dominant strategy for them. Consider now a node v ∈ S. Since S is a
Nash Equilibrium, node v incurs cost of at least 1 if it does not cache the file.
For nodes that incur cost of exactly 1, the mechanism designer can even issue a
penalty if the node does not cache the file. Thus, every node v ∈ S caches the
file.
Remarks:
• Mechanism design assumes that the players act rationally and want to
maximize their payoff. In real-world distributed systems some players
may be not selfish, but actively malicious (byzantine).
• Many techniques have been proposed to limit such free riding behavior,
e.g., tit-for-tat trading: I will only share something with you if you
share something with me. To solve the bootstrap problem (“I don’t
have anything yet”), nodes receive files or pieces of files whose hash
match their own hash for free. One can also imagine indirect trading.
Peer u uploads to peer v, who uploads to peer w, who uploads to peer
u. Finally, one could imagine using virtual currencies or a reputation
system (a history of who uploaded what). Reputation systems suffer
from collusion and Sybil attacks. If one node pretends to be many
nodes who rate each other well, it will have a good reputation.
Chapter Notes
Game theory was started by a proof for mixed-strategy equilibria in two-person
zero-sum games by John von Neumann [Neu28]. Later, von Neumann and Mor-
genstern introduced game theory to a wider audience [NM44]. In 1950 John
Nash proved that every game has a mixed Nash Equilibrium [Nas50]. The Pris-
oner’s Dilemma was first formalized by Flood and Dresher [Flo52]. The iterated
prisoner’s dilemma tournament was organized by Robert Axelrod [AH81]. The
Price of Anarchy definition is from Koutsoupias and Papadimitriou [KP99].
This allowed the creation of the Selfish Caching Game [CCW+ 04], which we
used as a running example in this chapter. Braess’ paradox was discovered by
Dietrich Braess in 1968 [Bra68]. A generalized version of the second-price auc-
tion is the VCG auction, named after three successive papers from first Vickrey,
then Clarke, and finally Groves [Vic61, Cla71, Gro73]. One popular exam-
ple of selfishness in practice is BitThief – a BitTorrent client that successfully
downloads without uploading [LMSW06]. Using game theory economists try to
understand markets and predict crashes. Apart from John Nash, the Sveriges
Riksbank Prize (Nobel Prize) in Economics has been awarded many times to
game theorists. For example in 2007 Hurwicz, Maskin, and Myerson received the
prize “for having laid the foundations of mechanism design theory”. There
is a considerable amount of work on mixed adversarial models with byzantine,
altruistic, and rational (“BAR”) players, e.g., [AAC+ 05, ADGH06, MSW06].
Daskalakis et al. [DGP09] showed that computing a Nash Equilibrium may not
be trivial.
This chapter was written in collaboration with Philipp Brandes.
Bibliography
[AAC+ 05] Amitanand S. Aiyer, Lorenzo Alvisi, Allen Clement, Michael Dahlin,
Jean-Philippe Martin, and Carl Porth. BAR fault tolerance for
cooperative services. In Proceedings of the 20th ACM Symposium
on Operating Systems Principles 2005, SOSP 2005, Brighton, UK,
October 23-26, 2005, pages 45–58, 2005.
[ADGH06] Ittai Abraham, Danny Dolev, Rica Gonen, and Joseph Y. Halpern.
Distributed computing meets game theory: robust mechanisms for
rational secret sharing and multiparty computation. In Proceedings
of the Twenty-Fifth Annual ACM Symposium on Principles of Dis-
tributed Computing, PODC 2006, Denver, CO, USA, July 23-26,
2006, pages 53–62, 2006.
[Bra68] Dietrich Braess. Über ein paradoxon aus der verkehrsplanung. Un-
ternehmensforschung, 12(1):258–268, 1968.
[CCW+ 04] Byung-Gon Chun, Kamalika Chaudhuri, Hoeteck Wee, Marco Bar-
reno, Christos H Papadimitriou, and John Kubiatowicz. Selfish
caching in distributed systems: a game-theoretic analysis. In Pro-
ceedings of the twenty-third annual ACM symposium on Principles
of distributed computing, pages 21–30. ACM, 2004.
[LMSW06] Thomas Locher, Patrick Moor, Stefan Schmid, and Roger Watten-
hofer. Free Riding in BitTorrent is Cheap. In 5th Workshop on Hot
Topics in Networks (HotNets), Irvine, California, USA, November
2006.
[MSW06] Thomas Moscibroda, Stefan Schmid, and Roger Wattenhofer. When
selfish meets evil: byzantine players in a virus inoculation game. In
Proceedings of the Twenty-Fifth Annual ACM Symposium on Prin-
ciples of Distributed Computing, PODC 2006, Denver, CO, USA,
July 23-26, 2006, pages 35–44, 2006.
[Nas50] John F. Nash. Equilibrium points in n-person games. Proc. Nat.
Acad. Sci. USA, 36(1):48–49, 1950.
[Neu28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathema-
tische Annalen, 100(1):295–320, 1928.
[NM44] John von Neumann and Oskar Morgenstern. Theory of games and
economic behavior. Princeton university press, 1944.
Chapter 26
Authenticated Agreement
In Section 18.4 we have already had a glimpse into the power of cryptography.
In this Chapter we want to build a practical byzantine fault-tolerant system
using cryptography. With cryptography, byzantine lies may be detected easily.
26.1 Agreement with Authentication
Remarks:
• Algorithm 26.2 solves byzantine agreement on binary inputs relying
on signatures. We assume there is a designated “primary” node p that
all other nodes know. The goal is to decide on p’s value.
Theorem 26.3. Algorithm 26.2 can tolerate f < n byzantine failures while
terminating in f + 1 rounds.
Proof. Assuming that the primary p is not byzantine and its input is 1, then p
broadcasts value(1)p in the first round, which will trigger all correct nodes to
decide on 1. If p’s input is 0, there is no signed message value(1)p , and no node
can decide on 1.
If primary p is byzantine, we need all correct nodes to decide on the same
value for the algorithm to be correct.
Assume i < f + 1 is the minimal round in which any correct node u decides
on 1. In this case, u has a set S of at least i messages from other nodes for
value 1 in round i, including one from p. Therefore, in round i + 1 ≤ f + 1, all
other correct nodes will receive S and u’s message for value 1 and thus decide
on 1 too.
Now assume that i = f + 1 is the minimal round in which a correct node
u decides for 1. Thus u must have received f + 1 messages for value 1, one of
which must be from a correct node since there are only f byzantine nodes. In
this case some other correct node v must have decided on 1 in some round j < i,
which contradicts i’s minimality; hence this case cannot happen.
Finally, if no correct node decides on 1 by the end of round f + 1, then all
correct nodes will decide on 0.
Remarks:
• If the primary is a correct node, Algorithm 26.2 only needs two rounds!
Otherwise, the algorithm terminates in at most f + 1 rounds, which
is optimal as described in Theorem 17.20.
• By using signatures, Algorithm 26.2 manages to solve consensus for
any number of failures! Does this contradict Theorem 17.12? Recall
that in the proof of Theorem 17.12 we assumed that a byzantine node
can distribute contradictory information about its own input. If mes-
sages are signed, correct nodes can detect such behavior. Specifically,
if a node u signs two contradicting messages, then observing these two
messages proves to all nodes that node u is byzantine.
• Does Algorithm 26.2 satisfy any of the validity conditions introduced
in Section 17.1? No! A byzantine primary can dictate the decision
value.
• Can we modify the algorithm such that the correct-input validity con-
dition is satisfied? Yes! We can run the algorithm in parallel for 2f +1
primary nodes. Either 0 or 1 will occur at least f + 1 times, which
means that one correct process had to have this value in the first place.
In this case, we can only handle f < n/2 byzantine nodes.
• Can we make it work with arbitrary inputs?
Remarks:
• At any given time, every node will consider one designated node to be
the primary and the other nodes to be backups.
• The timespan for which a node p is seen as the primary from the
perspective of another node is called a view.
Definition 26.5 (View). A view v is a non-negative integer representing the
node’s local perception of the system. We say that node u is in view v as long
as node u considers node p = v mod n to be the primary.
Remarks:
• All nodes start out in view 0. Nodes can potentially be in different
views (i.e. have different local values for v) at any given time.
• If backups detect faulty behavior in the primary, they switch to the
next primary with a so-called view change (see Section 26.4).
• In the asynchronous model, requests can arrive at the nodes in dif-
ferent orders. While a primary remains in charge (sufficiently many
nodes share the view v), it thus adopts the function of a serializer (cf.
Algorithm 15.9).
Definition 26.6 (Sequence Number). During a view, a node relies on the pri-
mary to assign consecutive sequence numbers (integers) that function as in-
dices in the global order (cf. Definition 15.8) for the requests that clients send.
Remarks:
• During a view change, we ensure that no two correct nodes execute
requests in different orders. On the one hand, we need to exchange
information on the current state to guarantee that a correct new pri-
mary knows the latest sequence number that has been accepted by
sufficiently many backups. On the other hand, exchanging informa-
tion will enable backups to determine if the new primary acts in a
byzantine fashion, e.g. reassigning the latest sequence number to a
different request.
Remarks:
• The protocol will guarantee that once a correct node has executed a
request r with sequence number s, then no correct node will execute
any request r′ ≠ r with sequence number s, not unlike Lemma 15.14.
• Correct primaries choose sequence numbers in order, without gaps, i.e.
if a correct primary proposed s as the sequence number for the last
request, then it will use s + 1 for the next request that it proposes.
• Before a node can safely execute a request r with a sequence number
s, it will wait until it knows that the decision to execute r with s has
been reached and is widely known.
• Informally, nodes will collect confirmation messages by sets of at least
2f + 1 nodes to guarantee that the information is sufficiently widely
distributed.
Lemma 26.8 (2f + 1 Quorum Intersection). Let S1 with |S1 | ≥ 2f + 1 and S2
with |S2 | ≥ 2f + 1 each be sets of nodes. Then there exists a correct node in
S1 ∩ S2 .
Proof. Let S1 , S2 each be sets of at least 2f + 1 nodes. There are 3f + 1 nodes
in total, thus due to the pigeonhole principle the intersection S1 ∩ S2 contains
at least f + 1 nodes. Since there are at most f faulty nodes, S1 ∩ S2 contains
at least 1 correct node.
Figure 26.9: The agreement protocol used in PBFT for processing a client
request issued by client c, exemplified for a system with n = 4 nodes. The
primary in view v is p = n0 = v mod n.
Remarks:
• Definitions 26.10, 26.12, 26.14, and 26.16 specify the agreement pro-
tocol formally. Backups run the pre-prepare and the prepare phase
concurrently.
Remarks:
• Note that the agreement protocol can run for multiple requests in
parallel. Since we are in the variable delay model and messages can
arrive out of order, we thus have to wait in Algorithm 26.17 Line 3
for all requests with lower sequence numbers to be executed.
• The client only considers the request to have been processed once it
received f + 1 reply-messages sent by the nodes in Algorithm 26.17
Line 5. Since a correct node only sends a reply-message once it
executed the request, with f + 1 reply-messages the client can be
certain that the request was executed by a correct node.
• We will see in Section 26.4 that PBFT guarantees that once a single
correct node executed the request, then all correct nodes will never
execute a different request with the same sequence number. Thus,
knowing that a single correct node executed a request is enough for
the client.
Remarks:
• If the faulty-timer expires, the backup considers the primary faulty
and triggers a view change. When triggering a view change, a correct
node will no longer participate in the protocol for the current view.
• We leave out the details regarding for what timespan to set the faulty-
timer. This is a patience trade-off (more patience: slower turnover if
the primary is byzantine; less patience: risk of prematurely firing view
changes).
• During a view change, the protocol has to guarantee that requests
that have already been executed by some correct nodes will not be
executed with different sequence numbers by other correct nodes.
• How can we guarantee that this happens?
Definition 26.20 (PBFT: View Change Protocol). In the view change proto-
col, a node whose faulty-timer has expired enters the view change phase by
running Algorithm 26.22. During the new view phase (which all nodes con-
tinuously listen for), the primary of the next view runs Algorithm 26.24 while
all other nodes run Algorithm 26.25.
Figure 26.21: The view change protocol used in PBFT. Node n0 is the pri-
mary of current view v, node n1 the primary of view v + 1. Once back-
ups consider n0 to be faulty, they start the view change protocol (cf. Algo-
rithms 26.22, 26.24, 26.25). The X signifies that n0 is faulty.
Remarks:
• The idea behind the view change protocol is as follows: during the view
change protocol, the new primary collects prepared-certificates from
2f + 1 nodes, so for every request that some correct node executed,
the new primary will have at least one prepared-certificate.
• After gathering that information, the primary distributes it and tells
all backups which requests need to be to executed with which sequence
numbers.
• Backups can check whether the new primary makes the decisions
required by the protocol, and if it does not, then the new primary
must be byzantine and the backups can directly move to the next
view change.
Algorithm 26.24 PBFT View Change Protocol: New View Phase - Primary
Code for new primary p of view v + 1:
1: accept 2f + 1 view-change-messages (including possibly p’s own) in a set
V (this is the new-view-certificate)
2: let O be a set of pre-prepare(v + 1, s, r, p)p for all pairs (s, r) where at
least one prepared-certificate for (s, r) exists in V
3: let s^V_max be the highest sequence number for which O contains a
pre-prepare-message
4: add to O a message pre-prepare(v + 1, s′, null, p)p for every sequence
number s′ < s^V_max for which O does not contain a pre-prepare-message
5: send new-view(v + 1, V, O, p)p to all nodes
6: start processing requests for view v + 1 according to Algorithm 26.11 starting
from sequence number s^V_max + 1
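The construction of O in Lines 2–4 can be sketched as follows, in a deliberately simplified model where each view-change-message is reduced to its list of (sequence number, request) prepared-certificates and signatures as well as message formats are omitted.

```python
def build_o(view_change_messages, new_view):
    """Lines 2-4 of Algorithm 26.24: one pre-prepare per prepared (s, r)
    pair found in V, plus null-requests filling the gaps below s_max."""
    prepared = {}                              # sequence number -> request
    for certificates in view_change_messages:  # list of (s, r) pairs
        for s, r in certificates:
            prepared[s] = r
    if not prepared:
        return []
    s_max = max(prepared)
    return [(new_view, s, prepared.get(s, None))   # None plays the role of null
            for s in range(1, s_max + 1)]

V = [[(1, "r1"), (3, "r3")], [(1, "r1")], []]      # 2f+1 = 3 view-change-messages
print(build_o(V, new_view=5))
# [(5, 1, 'r1'), (5, 2, None), (5, 3, 'r3')]
```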
Remarks:
• It is possible that V contains a prepared-certificate for a sequence
number s while it does not contain one for some sequence number s′ <
s. For each such sequence number s′, we fill up O in Algorithm 26.24
Line 4 with null-requests, i.e. requests that backups understand to
mean “do not do anything here”.
Algorithm 26.25 PBFT View Change Protocol: New View Phase - Backup
Code for backup b of view v + 1 if b’s local view is v′ < v + 1:
1: accept new-view(v + 1, V, O, p)p
2: stop accepting pre-prepare-/prepare-/commit-messages for v
3: set local view to v + 1
4: if p is primary of v + 1 then
5: if O was correctly constructed from V according to Algorithm 26.24
Lines 2 and 4 then
6: respond to all pre-prepare-messages in O as in the agreement protocol,
starting from Algorithm 26.13
7: start accepting messages for view v + 1
8: else
9: trigger view change to v + 2 using Algorithm 26.22
10: end if
11: end if
Correct backups will enter view v′ only if the new-view-message for v′ con-
tains a valid new-view-certificate V and if O was constructed correctly from
V, see Algorithm 26.25 Line 5. They will then respond to the messages in O
before they start accepting other pre-prepare-messages for v′ due to the order
of Algorithm 26.25 Lines 6 and 7. Therefore, for the sequence numbers that ap-
pear in O, correct backups will only send prepare-messages responding to the
pre-prepare-messages found in O due to Algorithm 26.13 Lines 2 and 3. This
guarantees that in v′, for every sequence number s that appears in O, backups
can only collect prepared-certificates for the triple (v′, s, r) that appears in O.
Together with the above, this proves that if some correct node executed
request r with sequence number s in v, then no node will be able to collect a
prepared-certificate for some r′ ≠ r with sequence number s in any view v′ ≥ v,
and thus no correct node will execute r′ with sequence number s.
Remarks:
• Since message delays are unknown, timers are doubling with every
view. Eventually, the timeout is larger than the maximum message
delay, and all correct messages are received before any timer expires.
Chapter Notes
PBFT is perhaps the central protocol for asynchronous byzantine state replica-
tion. The seminal first publication about it, of which we presented a simplified
version, can be found in [CL+ 99]. The canonical work about most versions of
PBFT is Miguel Castro’s PhD dissertation [Cas01].
Notice that the sets Pb in Algorithm 26.22 grow with each view change
as the system keeps running since they contain all prepared-certificates that
nodes have collected so far. All variants of the protocol found in the literature
introduce regular checkpoints where nodes agree that enough nodes executed
all requests up to a certain sequence number so they can continuously garbage-
collect prepared-certificates. We left this out for conciseness.
Remember that all messages are signed. Generating signatures is some-
what pricy, and variants of PBFT exist that use the cheaper, but less powerful
Message Authentication Codes (MACs). These variants are more complicated
because MACs only provide authentication between the two endpoints of a mes-
sage and cannot prove to a third party who created a message. An extensive
treatment of a variant that uses MACs can be found in [CL02].
Before PBFT, byzantine fault-tolerance was considered impractical, just
something academics would be interested in. PBFT changed that.
Bibliography
[AEMGG+ 05] Michael Abd-El-Malek, Gregory R Ganger, Garth R Goodson,
Michael K Reiter, and Jay J Wylie. Fault-scalable byzantine
fault-tolerant services. In ACM SIGOPS Operating Systems Re-
view, volume 39, pages 59–74. ACM, 2005.
[CL+ 99] Miguel Castro, Barbara Liskov, et al. Practical byzantine fault
tolerance. In OSDI, volume 99, pages 173–186, 1999.
[CL02] Miguel Castro and Barbara Liskov. Practical byzantine fault tol-
erance and proactive recovery. ACM Transactions on Computer
Systems (TOCS), 20(4):398–461, 2002.
[CML+ 06] James Cowling, Daniel Myers, Barbara Liskov, Rodrigo Ro-
drigues, and Liuba Shrira. Hq replication: A hybrid quorum
protocol for byzantine fault tolerance. In Proceedings of the
7th symposium on Operating systems design and implementa-
tion, pages 177–190. USENIX Association, 2006.
Index

GNSS, 68
GPS, 68
GPS receiver, 71
grid quorum system, 81
happened-before relation, 53
homogeneous system, 94
hypercube topology, 95
hypercubic network, 94
king algorithm, 30
Lamport clock, 54
linearizability, 51
linearization point, 51
load of a quorum system, 80
lock-free, 39
logical clock, 54
M-grid quorum system, 88
majority quorum system, 79
median validity, 27
mesh topology, 95
message loss model, 4
message passing model, 4
micropayment channel, 117
microservice architecture, 58
mining algorithm, 112
monotonic read consistency, 118
monotonic write consistency, 118
multisig output, 116
naive shared coin with a random bitstring, 34
network time protocol, 63
node, 4
non-blocking, 39
object, 50
operation, 50
partition tolerance, 108
parts per million, 62
paxos algorithm, 10
paxos proposal, 11
physical time, 61
precision time protocol, 63
proof of work, 111
pseudonymous, 109
quiescent consistency, 52
quorum, 79
quorum system, 79
random bitstring, 34
randomized consensus algorithm, 20
read-your-write consistency, 118
real time, 61
refund transaction, 117
resilience of a quorum system, 83
runtime, 15
selfish mining, 121
selfish mining algorithm, 121
semantic equivalence, 51
sequential consistency, 51
sequential execution, 50
sequential locking strategy, 82
serializer, 6
setup transaction, 117
SHA256, 112
shared coin (crash-resilient), 43
shared coin (sync, byz), 46
shared coin algorithm, 23
shared coin using secret sharing, 45
shared coin with magic random oracle, 33
shuffle-exchange network, 97
signature, 44, 140
simple ethereum transaction, 124
singlesig output, 116
singleton quorum system, 79
skip list topology, 98
smart contract, 115, 123
smart contract creation ethereum transaction, 124
smart contract execution ethereum transaction, 125
span, 58
start time, 50
starvation-free, 39
state replication, 6
state replication with serializer, 6
strong logical clock, 54
synchronization, 63
synchronous distributed system, 27
synchronous runtime, 27
termination, 15
threshold secret sharing, 45
ticket, 7
time standards, 65
timelock, 116
torus topology, 95
trace, 58
tracing, 59
transaction, 109
transaction algorithm, 110
transaction fee, 111
two-phase commit, 7
two-phase locking, 7
two-phase protocol, 6
univalent configuration, 16
validity, 15
variable message delay model, 5
vector clocks, 55
wait-free, 39
wall-clock time, 61
weak consistency, 118
work of a quorum system, 80