Distributed Systems for Practitioners

Dimos Raptis
Contents
Preface i
Acknowledgements iii
I Fundamental Concepts 1
1 Introduction 2
What is a distributed system and why we need it . . . . . . . . . . 2
The fallacies of distributed computing . . . . . . . . . . . . . . . . 5
Why distributed systems are hard . . . . . . . . . . . . . . . . . . 7
Correctness in distributed systems . . . . . . . . . . . . . . . . . . 8
System models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
The tale of exactly-once semantics . . . . . . . . . . . . . . . . . . 10
Failure in the world of distributed systems . . . . . . . . . . . . . . 13
Stateful and Stateless systems . . . . . . . . . . . . . . . . . . . . . 15
3 Distributed Transactions 54
What is a distributed transaction . . . . . . . . . . . . . . . . . . . 54
Achieving Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2-phase locking (2PL) . . . . . . . . . . . . . . . . . . . . . . . . . 58
Snapshot Isolation via MVCC . . . . . . . . . . . . . . . . . . . . . 59
Achieving atomicity . . . . . . . . . . . . . . . . . . . . . . . . . . 62
2-phase commit (2PC) . . . . . . . . . . . . . . . . . . . . . . . . . 63
3-phase commit (3PC) . . . . . . . . . . . . . . . . . . . . . . . . . 67
A quorum-based commit protocol . . . . . . . . . . . . . . . . . . . 69
How it all fits together . . . . . . . . . . . . . . . . . . . . . . . . . 72
Long-lived transactions & Sagas . . . . . . . . . . . . . . . . . . . 73
4 Consensus 77
Defining consensus . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Some use-cases of consensus . . . . . . . . . . . . . . . . . . . . . . 78
FLP impossibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
The Paxos algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Intricacies of Paxos . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Paxos in real-life . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Replicated state machine via consensus . . . . . . . . . . . . . . . 89
Distributed transactions via consensus . . . . . . . . . . . . . . . . 91
Raft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Standing on the shoulders of giants . . . . . . . . . . . . . . . . . . 98
5 Time 102
What is different in a distributed system . . . . . . . . . . . . . . . 102
A practical perspective . . . . . . . . . . . . . . . . . . . . . . . . . 103
A theoretical perspective . . . . . . . . . . . . . . . . . . . . . . . . 104
Logical clocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6 Order 108
Total and partial ordering . . . . . . . . . . . . . . . . . . . . . . . 108
References 243
Preface
Distributed systems are becoming ubiquitous in our lives nowadays: from how
we communicate with our friends to how we shop online and many more
things. It might be transparent to us sometimes, but many companies make
use of extremely complicated software systems under the hood to satisfy our
needs. Using these kinds of systems, companies are capable of significant
achievements, such as sending our message to a friend who is thousands of
miles away in a matter of milliseconds, delivering our orders despite outages
of whole datacenters or searching the whole Internet by processing more than
a million terabytes of data in less than a second. Putting all of this into
perspective, it's easy to understand the value that distributed systems bring
to the current world and why it's useful for software engineers to be able to
understand and make use of distributed systems.
The ultimate goal of this book is to help these people get started with
distributed systems.
Acknowledgements
Like any other book, this one might have been written by a single person, but
it would not have been possible without the contribution of many others. As a
result, credits should be given to all my previous employers and colleagues that
have given me the opportunity to work with large-scale, distributed systems
and appreciate both their capabilities and complexities, as well as to the
distributed systems community that was always open to answer any questions. I would
also like to thank Richard Gendal Brown for reviewing the case study on
Corda and giving feedback that was very useful in helping me to add clarity
and remove ambiguity. Of course, this book would not have been possible
without the understanding and support of my partner in life, Maria.
Part I
Fundamental Concepts
Chapter 1
Introduction
First of all, we need to define what a distributed system is. Multiple, different
definitions can be found, but we will use the following:
"A distributed system is a system whose components are lo-
cated on different networked computers, which communi-
cate and coordinate their actions by passing messages to one
another."[1]
As shown in Figure 1.1, this network can either consist of direct connections
between the components of the distributed system or there could be more
components that form the backbone of the network (if communication is
done through the Internet for example). These components can take many
forms; they could be servers, routers, web browsers or even mobile devices.
In an effort to keep an abstract and generic view, in the context of this book
we’ll refer to them as nodes, being agnostic to their real form. In some cases,
such as when providing a concrete example, it might be useful to escape this
generic view and see how things work in real-life. In these cases, we might
explain in detail the role of each node in the system.
As we will see later, the 2 parts that were highlighted in the definition above
are central to how distributed systems function:
• the various parts that compose a distributed system are located re-
motely, separated by a network.
• these parts communicate and coordinate their actions by passing messages
to one another over this network.
Now that we have defined what a distributed system is, let’s explore its
value.
Why do we really need distributed systems?
Looking at all the complexity that distributed systems introduce, as we will
see during this book, that’s a valid question. The main benefits of distributed
systems come mostly in the following 3 areas:
• performance
• scalability
• availability
Let’s explain each one separately. The performance of a single computer
has certain limits imposed by physical constraints on the hardware. Not
only that, but after a point, improving the hardware of a single computer
in order to achieve better performance becomes extremely expensive. As
a result, one can achieve the same performance with 2 or more low-spec
computers as with a single, high-end computer. So, distributed systems
allow us to achieve better performance at a lower cost. Note that
better performance can translate to different things depending on the context,
such as lower latency per request, higher throughput etc.
"Scalability is the capability of a system, network, or process to
handle a growing amount of work, or its potential to be enlarged
to accommodate that growth." [2]
Most of the value derived from software systems in the real world comes from
storing and processing data. As the customer base of a system grows, the
system needs to handle larger amounts of traffic and store larger amounts of
data. However, a system composed of a single computer can only scale up to
a certain point, as explained previously. Building a distributed system
allows us to split and store the data in multiple computers, while
also distributing the processing work amongst them1. As a result of
this, we are capable of scaling our systems to sizes that would not even be
imaginable with a single-computer system.
In the context of software systems, availability is the probability that a
system will work as required when required during the period of a mission.
Note that nowadays most of the online services are required to operate all
the time (known also as 24/7 service), which makes this a huge challenge.
So, when a service states that it has 5 nines of availability, this means that
it operates normally for 99.999% of the time. This implies that it’s allowed
to be down for up to 5 minutes a year, to satisfy this guarantee. Thinking
about how unreliable hardware can be, one can easily understand how big
an undertaking this is. Of course, using a single computer, it would be
infeasible to provide this kind of guarantee. One of the mechanisms
that are widely used to achieve higher availability is redundancy,
which means storing data into multiple, redundant computers. So,
when one of them fails, we can easily and quickly switch to another one,
preventing our customers from experiencing this failure. Given that data are
stored now in multiple computers, we end up with a distributed system!
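To put the five-nines figure into concrete numbers: a year has 365 x 24 x 60 = 525,600 minutes, so an availability of 99.999% allows for at most (1 - 0.99999) x 525,600 ≈ 5.26 minutes of downtime per year, which is where the figure of roughly 5 minutes a year comes from.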
Leveraging a distributed system we can get all of the above benefits. However,
as we will see later on, there is a tension between them and several other
properties of the system.
1 The approach of scaling a system by adding resources (memory, CPU, disk) to a single node is also referred to as vertical scaling, while the approach of scaling by adding more nodes to the system is referred to as horizontal scaling.
The fallacies of distributed computing

We will focus here on those that are most relevant to this book: 1, 2 and 3.
The first fallacy is sometimes enforced by abstractions provided to developers
by various technologies and protocols. Even though protocols like TCP
can make us believe that the network is reliable and never fails, this is just
an illusion. We should understand that network connections are also built
on top of hardware that will also fail at some point and we should design
our systems accordingly. The second assumption is also enforced nowadays
by libraries which attempt to model remote procedure calls as local calls,
such as RPC frameworks like gRPC3 or Thrift4.
2 See: https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing
A third common assumption is that there is a single, global clock that all the
nodes of a system share. This assumption can be quite deceiving, since it's
somewhat intuitive and holds true when working with systems that are not
distributed. For instance,
an application that runs in a single computer can use the computer’s local
clock in order to decide when events happen and what’s the order between
them. Nonetheless, that’s not true in a distributed system, where every node
in the system has its own local clock, which runs at a different rate from the
other ones. There are ways to try and keep the clocks in sync, but some of
them are very expensive and do not eliminate these differences completely.
This limitation is again bound by physical laws5. An example of such an
approach is the TrueTime API that was built by Google [5], which exposes
explicitly the clock uncertainty as a first-class citizen. However, as we will
see in the next chapters of the book, when one is mainly interested in cause
and effects, there are other ways to reason about time using logical clocks
instead.

3 See: https://grpc.io/
4 See: https://thrift.apache.org/
5 See: https://en.wikipedia.org/wiki/Time_dilation
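As a small preview of the logical clocks mentioned above and discussed later in the book, the following is a minimal sketch of a Lamport clock; the class and method names are illustrative and not taken from any specific library.

class LamportClock:
    """Minimal Lamport logical clock: counts events instead of measuring time."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        # Any local event advances the counter.
        self.time += 1
        return self.time

    def on_send(self):
        # The current timestamp is attached to every outgoing message.
        return self.local_event()

    def on_receive(self, message_time):
        # On receipt, jump ahead of the sender's timestamp if necessary, so
        # that causally later events always carry larger timestamps.
        self.time = max(self.time, message_time) + 1
        return self.time

If an event causally precedes another, it is guaranteed to have a smaller timestamp; the converse does not hold, which is why logical clocks capture cause and effect rather than physical time.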
Why distributed systems are hard

In general, distributed systems are hard to design, build and reason about,
thus increasing the risk of error. This will become more evident later in the
book while exploring some algorithms that solve fundamental problems that
emerge in distributed systems. It’s worth questioning: why are distributed
systems so hard? The answer to this question can help us understand the
main properties that make distributed systems challenging, eliminating our
blind spots and providing some guidance on the aspects we should be paying
attention to.
The main properties of distributed systems that make them challenging to
reason about are the following:
• network asynchrony
• partial failures
• concurrency
Network asynchrony is a property of communication networks that cannot
provide strong guarantees around delivery of events, e.g. a maximum amount
of time required for a message to be delivered. This can create a lot of
counter-intuitive behaviours that would not be present in non-distributed
systems; this is in contrast to memory operations, which can provide much
stricter guarantees6. For instance, in a distributed system
messages might take extremely long to be delivered or they might be delivered
out of order.
Partial failures are cases where only some components of a distributed
system fail. This behaviour is in contrast to certain kinds of applications
deployed in a single server, which work under the assumption that either
the whole server has crashed or everything is working fine. It introduces
significant complexity when there is a requirement for atomicity across com-
ponents in a distributed system, i.e. we need to ensure that an operation is
either applied to all the nodes of a system or to none of them. The chapter
about distributed transactions analyses this problem.
6 See: https://en.wikipedia.org/wiki/CAS_latency

Concurrency is execution of multiple computations happening at the same
time and potentially on the same piece of data interleaved with each other.
This introduces additional complexity, since these different computations can
interfere with each other and create unexpected behaviours. This is again in
contrast to simplistic applications with no concurrency, where the program
is expected to run in the order defined by the sequence of commands in the
source code. The various types of problematic behaviours that can arise from
concurrency are explained in the chapter that talks about isolation later in
the book.
As explained, these 3 properties are the major contributors to complexity in
the field of distributed systems. As a result, it will be useful to keep them in
mind during the rest of the book and when building distributed systems in
real life so that you can anticipate edge cases and handle them appropriately.
System models

An important part of a system model is the way nodes are assumed to fail.
The main types of failures a node can exhibit are the following:

• Fail-stop: A node halts and remains halted, and other nodes are able
to detect that it has halted.
• Crash: A node halts and remains halted, but it halts in a silent way.
So, other nodes may not be able to detect this state (i.e. they can only
assume it has failed on the basis of not being able to communicate
with it).
• Omission: A node fails to respond to incoming requests.
• Byzantine: A node exhibits arbitrary behavior: it may transmit
arbitrary messages at arbitrary times, it may stop or take an incorrect
step.
Byzantine failures can be exhibited when a node does not behave according to
the specified protocol/algorithm, e.g. because the node has been compromised
by a malicious actor or because of a software bug. Coping with these failures
introduces significant complexity to the resulting solutions. At the same
time, most distributed systems in companies are deployed in environments
that are assumed to be private and secure. Fail-stop failures are the simplest
and the most convenient ones from the perspective of someone that builds
distributed systems. However, they are also not very realistic, since there
are cases in real-life systems where it’s not easy to identify whether another
node has crashed or not. As a result, most of the algorithms analysed in this
book work under the assumption of crash failures.
The tale of exactly-once semantics

Exactly-once semantics can refer either to a message being delivered exactly
once or to it being processed exactly once, and these are two different
guarantees. It is therefore important to be clear about which of the two you
are referring to, when you are talking about exactly-once semantics.
Also, as a last note, it’s easy to see that at-most-once delivery semantics and
at-least-once delivery semantics can be trivially implemented. The former can
be achieved by sending every message only one time no matter what happens,
while the latter one can be achieved by sending a message continuously, until
we get an acknowledgement from the recipient.
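A minimal sketch of the at-least-once approach just described; send and wait_for_ack are placeholders for whatever transport the system actually uses.

def deliver_at_least_once(message, send, wait_for_ack, timeout=1.0):
    """Keep re-sending until the recipient acknowledges the message.

    The recipient may receive the message more than once, which is why
    at-least-once delivery is usually combined with deduplication or
    idempotent processing on the receiving side.
    """
    while True:
        send(message)
        if wait_for_ack(timeout):
            return  # acknowledged: stop retrying
        # No acknowledgement within the timeout: send the message again.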
Stateful and Stateless systems

We could say that a system can belong to one of the following 2 categories:
• stateless systems
• stateful systems
A stateless system is one that maintains no state of what has happened in
the past and performs its function purely based on the
inputs provided to it. For instance, a contrived stateless system is one that
receives a set of numbers as input, calculates the maximum of them and
returns it as the result. Note that these inputs can be direct or indirect.
Direct inputs are those included in the request, while indirect inputs are
those potentially received from other systems to fulfil the request. For
instance, imagine a service that calculates the price for a specific product by
retrieving the initial price for it and any currently available discounts from
some other services and then performing the necessary calculations with
this data. This service would still be stateless. On the other hand, stateful
systems are responsible for maintaining and mutating some state and their
results depend on this state. As an example, imagine a system that stores
the age of all the employees of a company and can be asked for the employee
with the maximum age. This system is stateful, since the result depends on
the employees we’ve registered so far in the system.
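To make the distinction concrete, here is a contrived sketch of the two examples above; the names are purely illustrative.

def max_of(numbers):
    """Stateless: the result depends only on the inputs of this call."""
    return max(numbers)


class EmployeeRegistry:
    """Stateful: the result depends on everything registered so far."""

    def __init__(self):
        self._ages = {}

    def register(self, name, age):
        self._ages[name] = age

    def oldest_employee(self):
        # The answer depends on the accumulated state, not just on the
        # inputs of this particular call.
        return max(self._ages, key=self._ages.get)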
There are some interesting observations to be made about these 2 types of
systems:
• Stateful systems can be really useful in real-life, since computers are
much more capable of storing and processing data than humans.
Chapter 2

Basic Concepts and Theorems
Partitioning
One common technique is range partitioning, where the dataset is split into
contiguous ranges based on the value of the partitioning key, each node
stores a specific range and the system maintains a mapping from ranges to
nodes. In this way, when the system receives a
request for a specific value (or a range of values), it consults this mapping
to identify to which node (or nodes, respectively) the request should be
redirected.
The advantages of this technique are:
• its simplicity and ease of implementation.
• the ability to perform range queries, using the value that is used as the
partitioning key.
• a good performance for range queries using the partitioning key, when
the queried range is small and resides in a single node.
• easy and efficient way to adjust the ranges (re-partition), since one
range can be increased or decreased, exchanging data only between 2
nodes.
Some of its disadvantages are:
• the inability to perform range queries, using other keys than the
partitioning key.
• a bad performance for range queries using the partitioning key, when
the queried range is big and resides in multiple nodes.
• an uneven distribution of the traffic or the data, causing some nodes to
be overloaded. For example, some letters are more frequent as initial
letters in surnames,3 which means that some nodes might have to store
more data and process more requests.
Some systems that leverage a range partitioning technique are Google’s
BigTable [7] and Apache HBase.4
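As an illustration of the range-partitioning lookup described above, here is a minimal sketch that uses the first letter of a surname as the partitioning key; the node names and range boundaries are hypothetical.

import bisect

# Exclusive upper bounds of each range and the node responsible for it.
# A real system would persist this mapping and adjust it when re-partitioning.
UPPER_BOUNDS = ["h", "p", "{"]            # '{' sorts right after 'z'
NODES = ["node-1", "node-2", "node-3"]    # a-g, h-o, p-z respectively

def node_for(surname):
    """Route a single key to the node that owns its range."""
    return NODES[bisect.bisect_right(UPPER_BOUNDS, surname[0].lower())]

def nodes_for_range(start, end):
    """A range query on the partitioning key only touches contiguous nodes."""
    first = bisect.bisect_right(UPPER_BOUNDS, start[0].lower())
    last = bisect.bisect_right(UPPER_BOUNDS, end[0].lower())
    return NODES[first:last + 1]

A query such as nodes_for_range("Adams", "Kelly") only needs to contact node-1 and node-2, while a query on any attribute other than the partitioning key would have to be sent to all the nodes.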
Replication
Optimistic (or lazy) replication allows the replicas to diverge temporarily,
with the guarantee that they will converge again if the system does not
receive any updates (a state also known as quiesced) for a period of time.
Replication is a very active field of research, so there are many different
algorithms. As an introduction, we will now discuss the 2 main techniques:
single-master replication and multi-master replication.
Single-master replication
Multi-master replication
The main pattern we’ve seen so far is writes being performed to all the
replica nodes, while reads are performed to one of them. Ensuring writes
ACID transactions
The CAP theorem

The CAP theorem [12] is one of the most fundamental theorems in the
field of distributed systems, outlining an inherent trade-off in the design of
distributed systems. It states that it’s impossible for a distributed data store
to simultaneously provide more than 2 of the following properties:
• Consistency10: this means that every successful read request will receive
the result of the most recent write request.
• Availability: this means that every request receives a non-error response,
without any guarantee that it contains the result of the most recent write.
• Partition tolerance: this means that the system can continue to operate,
even when some of its nodes cannot communicate with each other because of
network failures.
9 See: https://en.wikipedia.org/wiki/Foreign_key
10 As implied earlier, the concept of consistency in the CAP theorem is completely different from the concept of consistency in ACID transactions. The notion of consistency in the CAP theorem corresponds to the consistency model of linearizability, described below.

More details about the PACELC theorem, a refinement of the CAP theorem that also captures the trade-off between latency and consistency during normal operation, can be found in the associated Wikipedia page11.
Consistency models
There are many different consistency models in the literature. In the context
of this book, we will focus on the most fundamental ones, which are the
following:
• Linearizability
• Sequential Consistency
• Causal Consistency
• Eventual Consistency

11 See: https://en.wikipedia.org/wiki/PACELC_theorem
12 A history is a collection of operations, including their concurrent structure (i.e. the order they are interleaved during execution).
A system that supports the consistency model of linearizability[14] is
one, where operations appear to be instantaneous to the external client.
This means that they take effect at some specific point between the moment
the client invokes the operation and the moment the client receives the
acknowledgement from the system that the operation has been completed. Furthermore, once an
operation is complete and the acknowledgement has been delivered to the
client, it is visible to all other clients. This implies that if a client C2
invokes a read operation after a client C1 has received the completion of
its write operation, then C2 should see the result of this (or a subsequent)
write operation. This property of operations being "instantaneous" and
"visible" after they are completed seems obvious, right ? However, as we have
discussed previously, there is no such thing as instantaneity in a distributed
system. Figure 2.8 might help you understand why. When thinking about a
distributed system as a single node, it seems obvious that every operation
happens at a specific instant of time and it’s immediately visible to everyone.
However, when thinking about the distributed system as a set of cooperating
nodes, then it becomes clear that this should not be taken for granted. For
instance, the system in the bottom diagram is not linearizable: even though
T4 > T3, the second client won't observe the written value, because it hasn't
yet propagated to the node that processes the read operation. To relate
this to some of the techniques and principles we’ve discussed previously, the
non-linearizability comes from the use of asynchronous replication. By using
a synchronous replication technique, we could make the system linearizable.
However, that would mean that the first write operation would have to
take longer, until the new value has propagated to the rest of the nodes
(remember the latency-consistency trade-off from the PACELC theorem!).
As a result, one can realise that linearizability is a very powerful consistency
model, which can help us treat complex distributed systems as much simpler,
single-node datastores and reason about our applications more efficiently.
Moreover, leveraging atomic instructions provided by hardware (such as CAS
operations13 ), one can build more sophisticated logic on top of distributed
systems, such as mutexes, semaphores, counters etc., which would not be
possible under weaker consistency models.
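As a sketch of that last point, the snippet below builds a retry-based counter on top of a compare-and-swap operation; the LinearizableStore class is a single-process stand-in for a linearizable datastore, not a real product.

class LinearizableStore:
    """Single-process stand-in for a linearizable key-value store."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def compare_and_swap(self, key, expected, new):
        # Atomically set `key` to `new` only if it still holds `expected`.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False


def increment(store, key):
    """Optimistic retry loop: succeeds only when no other client raced us."""
    while True:
        current = store.get(key)
        new = (current or 0) + 1
        if store.compare_and_swap(key, current, new):
            return new

The pattern is safe only because linearizability guarantees that a successful compare_and_swap really observed the latest value; under a weaker consistency model, two clients could both "succeed" based on stale reads and one increment would be lost.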
Sequential Consistency is a weaker consistency model, where operations
are allowed to take effect before their invocation or after their completion.
13 See: https://en.wikipedia.org/wiki/Compare-and-swap
Isolation levels
The origin of the isolation levels above and the associated anomalies was
essentially the ANSI SQL-92 standard[16]. However, the definitions in this
standard have been criticised as being ambiguous and incomplete.
A dirty read occurs when a transaction reads a value that has been written
by another transaction that has not yet committed. Allowing dirty reads can
be acceptable for some applications, such as a reporting application that
can tolerate some inaccuracies in the numbers of the report. It can also be
useful when troubleshooting an issue and one wants to inspect the state of
the database in the middle of an ongoing transaction.
A fuzzy or non-repeatable read occurs when a value is retrieved twice
during a transaction (without it being updated in the same transaction) and
the value is different. This can lead to problematic situations similar to the
example presented above for dirty reads. Another case where this can lead to
problems is when the first read of the value is used for some conditional logic
and the second is used in order to update data. In this case, the transaction
might be acting on stale data.
A phantom read occurs when a transaction does a predicate-based read
and another transaction writes or removes a data item matched by that
predicate while the first transaction is still in flight. If that happens, then
the first transaction might be acting again on stale data or inconsistent
data. For example, let’s say transaction A is running 2 queries to calculate
the maximum and the average age of a specific set of employees. However,
between the 2 queries transaction B is interleaved and inserts a lot of old
employees in this set, thus making transaction A return an average that
is larger than the maximum! Allowing phantom reads can be safe for an
application that is not making use of predicate-based reads, i.e. performing
only reads that select records using a primary key.
A lost update occurs when two transactions read the same value and then
try to update it to two different values. The end result is that one of the two
updates survives, but the process executing the other update is not informed
that its update did not take effect, thus called lost update. For instance,
imagine a warehouse with various controllers that are used to update the
database when new items arrive. The transactions are rather simple, reading
the number of items currently in the warehouse, adding the number of new
items to this number and then storing the result back to the database. This
anomaly could lead to the following problem: transactions A and B read
simultaneously the current inventory size (say 100 items), add the number
of new items to this (say 5 and 10 respectively) and then store this back to
the database. Let's assume that transaction B was the last one to write; this
means that the final inventory is 110, instead of 115. Thus, 5 new items were
not recorded! See Figure 2.9 for a visualisation of this example. Depending
on the application, it might be safe to allow lost updates in some cases.
For example, consider an application that allows multiple administrators to
update specific parts of an internal website used by employees of a company.
In this case, lost updates might not be that catastrophic, since employees
can detect any inaccuracies and inform the administrators to correct them
without any serious consequences.
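The warehouse example can be sketched in a few lines of code; the dictionary stands in for the database table and the problematic interleaving is simulated by taking both reads before either write.

inventory = {"items": 100}

def record_delivery(read_value, new_items):
    # Each controller writes back what it read earlier plus the new items,
    # ignoring any writes that happened in between.
    inventory["items"] = read_value + new_items

# Both transactions read the inventory before either of them has written.
read_a = inventory["items"]       # 100
read_b = inventory["items"]       # 100

record_delivery(read_a, 5)        # transaction A writes 105
record_delivery(read_b, 10)       # transaction B writes 110, overwriting A

assert inventory["items"] == 110  # should have been 115: 5 items were lost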
A read skew occurs when there are integrity constraints between two data
items that seem to be violated because a transaction can only see partial
results of another transaction. For example, let’s imagine an application
that contains a table of persons, where each record represents a person and
contains a list of all the friends of this person. The main integrity constraint
is that friendships are mutual, so if person B is included in person A’s list of
friends, then A must also be included in B's list. Every time someone (say
P1) wants to unfriend someone else (say P2), a transaction is executed that
removes P2 from P1’s list and also removes P1 from P2’s list at a single
go. Now, let’s also assume that some other part of the application allows
people to view friends of multiple people at the same time. This is done
by a transaction that reads the friends list of these people. If the second
transaction reads the friends list of P1 before the first transaction has started,
but it reads the friends list of P2 after the first transaction has committed,
then it will notice an integrity violation. P2 will be in P1’s list of friends, but
P1 will not be in P2’s list of friends. Note that this case is not a dirty read,
since any values written by the first transaction are read only after it has
been committed. See Figure 2.10 for a visualisation of this example. A strict
requirement to prevent read skew is quite rare, as you might have guessed
already. For example, a common application of this type might allow a user
to view the profile of only one person at a time along with his or her friends,
thus not having a requirement for the integrity constraint described above.
A write skew occurs when two transactions read the same data, but then
modify disjoint sets of data. For example, imagine an application that
maintains the on-call rota of doctors in a hospital. A table contains one
record for every doctor with a field indicating whether they are on-call. The
application allows a doctor to remove himself/herself from the on-call rota if
another doctor is also registered. This is done via a transaction that reads
the number of doctors that are on-call from this table and if the number is
greater than one, then it updates the record corresponding to this doctor
to not be on-call. Now, let’s look at the problems that can arise from write
skew phenomena. Let’s say two doctors, Alice and Bob, are on-call currently
and they both decide to see if they can remove themselves. Two transactions
running concurrently might read the state of the database, seeing there are
two doctors and removing the associated doctor from being on-call. In the
end, the system ends with no doctors being on-call! See Figure 2.11 for a
visualisation of this example.
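The on-call example follows the same pattern, except that the two transactions check the same condition on their snapshots and then update disjoint rows:

on_call = {"alice": True, "bob": True}

def go_off_call(doctor, snapshot):
    # The check uses the snapshot taken when the transaction started.
    if sum(snapshot.values()) > 1:     # "at least one other doctor remains"
        on_call[doctor] = False        # each transaction updates a different row

# Both transactions take their snapshot before either of them writes.
snapshot_alice = dict(on_call)
snapshot_bob = dict(on_call)

go_off_call("alice", snapshot_alice)
go_off_call("bob", snapshot_bob)

assert sum(on_call.values()) == 0      # the rota ends up with nobody on call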
The isolation levels and anomalies described above were originally defined in
the context of the early relational database systems that were not distributed,
but they are applicable in distributed datastores too, as shown later in the book.
It is interesting to observe that isolation levels are not that different from
consistency models. Both of them are essentially constructs that allow us to
express what executions are possible or not possible. In both cases, some
of the models are stricter, allowing fewer executions, thus providing increased
safety at the cost of reduced performance and availability. For instance,
linearizability allows a subset of the executions causal consistency allows,
while serializability also allows a subset of the executions snapshot isolation
does. This strictness relationship can be expressed in a different way, saying
that one model implies another model. For example, the fact that a system
provides linearizability automatically implies that the same system also
provides causal consistency. Note that there are some models that are not
directly comparable, which means neither of them is stricter than the other.
At the same time, consistency models and isolation levels have some differences
with regards to the characteristics of the allowed and disallowed behaviours.

The previous chapters spent a significant amount of time going through many
different formal models. But why do we need all these complicated, formal,
academic constructs?
As explained before, these constructs help us define different types of proper-
ties in a more precise way. As a result, when designing a system it is easier
to reason about what kind of properties the system needs to satisfy and
which of these models are sufficient to provide the required guarantees. In
many cases, applications are built on top of pre-existing datastores and they
derive most of their properties from these datastores, since most of the data
management is delegated to them. As a consequence, necessary research
needs to be done to identify datastores that can provide the guarantees the
application needs.
Unfortunately, the terminology presented here and the associated models
are not used consistently across the industry making decision making and
comparison of systems a lot harder. For example, there are datastores that
do not state precisely what kind of consistency guarantees their system can
provide or at least these statements are well hidden, while they should be
highlighted as one of the most important things in their documentation. In
some other cases, this kind of documentation exists, but the various levels
presented before are misused leading to a lot of confusion. As mentioned
before, one source of this confusion was the initial ANSI-SQL standard.
For example, the SERIALIZABLE level provided by Oracle 11g, MySQL 5.7
and PostgreSQL 9.0 was not truly serializable and was susceptible to some
anomalies.
Understanding the models presented here is a good first step in thinking
more carefully when designing systems to reduce the risk of errors. You should
be willing to search the documentation of systems you consider using to
understand what kind of guarantees they provide. Ideally, you should also
be able to read between the lines and identify mistakes or incorrect usages
of terms. This will help you make more informed decisions. Hopefully, it
will also help raise awareness across the industry and encourage vendors of
distributed systems to specify the guarantees their system can provide.
Part II
Chapter 3
Distributed Transactions
One of the most common problems faced when moving from a centralised to
a distributed system is performing some operations across multiple nodes
in an atomic way, what is also known as a distributed transaction. In
this chapter, we will explore all the complexities involved in performing a
distributed transaction, examining several available solutions and the pitfalls
of each one.
Before diving into the available solutions, let's first take a tour of transactions,
their properties and what distinguishes a distributed transaction.
A transaction is a unit of work performed in a database system, representing a
change, which can be potentially composed of multiple operations. Database
transactions are an abstraction that has been invented in order to simplify
engineers’ work and relieve them from dealing with all the possible failures,
introduced by the inherent unreliability of hardware.
As described previously, the major guarantees provided by database transac-
tions are usually summed up in the acronym ACID, which stands for:
• Atomicity
• Consistency
• Isolation
• Durability
Durability guarantees that transactions that have committed will survive
permanently, which is typically achieved by writing their results to non-volatile
storage when they commit. In distributed systems, this might be a bit more
nuanced, since the system should ensure that results of a transaction are
stored in more than one node, so that the system can keep functioning if a
single node fails. In fact, this would be reasonable, since availability is one of
the main benefits of a distributed system, as we described in the beginning
of this book. This is achieved via replication as described previously.
As we just explained, a database transaction is a quite powerful abstraction,
which can simplify significantly how applications are built. Given the inherent
complexity in distributed systems, one can easily deduce that transactional
semantics can be even more useful in distributed systems. In this case, we
are talking about a distributed transaction, which is a transaction that
takes place in 2 or more different nodes. We could say that there are 2
slightly different variants of distributed transactions. The first variant is one,
where the same piece of data needs to be updated in multiple replicas. This
is the case where the whole database is essentially duplicated in multiple
nodes and a transaction needs to update all of them in an atomic way. The
second variant is one, where different pieces of data residing in different
nodes need to be updated atomically. For instance, a financial application
might be using a partitioned database for the accounts of customers, where
the balance of user A resides in node n1, while the balance of user B resides
in node n2 and we want to transfer some money from user A to user B. This
needs to be done in an atomic way, so that data are not lost (i.e. removed
from user A, but not added in user B, because the transaction failed midway).
The second variant is the most common use of distributed transactions, since
the first variant is mostly tackled via single-master synchronous replication.
The aspects of atomicity and isolation are significantly more complex and
require more things to be taken into consideration in the context of distributed
transactions. For instance, partial failures make it much harder to guarantee
atomicity, while the concurrency and the network asynchrony present in
distributed systems make it quite challenging to preserve isolation between
transactions running in different nodes. For this reason, this book contains
separate sections for these two aspects, analysing their characteristics and
presenting some solutions.
Achieving Isolation
Snapshot Isolation via MVCC

One approach to achieving isolation is Multi-Version Concurrency Control
(MVCC), where the datastore maintains multiple versions of each data item
and every transaction reads the version of the data that was applicable when
the transaction started. As
explained before, Snapshot Isolation (SI) is an isolation level that essentially
guarantees that all reads made in a transaction will see a consistent snapshot
of the database from the point it started and the transaction will commit
successfully if no other transaction has updated the same data since that
snapshot. As a result, it is practically easier to achieve snapshot isolation
using an MVCC technique.
This would work in the following way (a code sketch of the scheme follows the list below):
• The database keeps track of two timestamps for each transaction: Tstart
that denotes the time the transaction started and Tend that denotes
the time the transaction completed.
• The database maintains one record for each version of an item that
also contains the time the version was committed or equivalently the
transaction that committed this version.
• Every time a transaction performs a read of an item, it receives the
version that was last committed before the transaction started. There
is an exception to this rule: if the transaction has already updated this
item, then this value is returned instead.
• When a transaction attempts to write or update an item, a check is
performed in order to ensure no other transaction updated the same
item in the meanwhile, which could result in a lost update anomaly.
This is done by retrieving the latest version of the item - excluding
versions written by the same transaction - and checking if its timestamp
is later than the transaction’s start timestamp Tstart . If that is true, it
means another transaction has updated the item after the transaction
started, so the transaction is aborted and retried later on.
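The following is a minimal, single-process sketch of the scheme just described; it uses a logical counter for timestamps, buffers a transaction's writes until commit and omits garbage collection of old versions. The class and method names are illustrative, not taken from a real database.

class AbortTransaction(Exception):
    """Raised when committing would cause a lost update; callers should retry."""


class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_timestamp, value), oldest first
        self.clock = 0       # logical timestamp generator

    def begin(self):
        self.clock += 1
        return {"start": self.clock, "writes": {}}

    def read(self, txn, key):
        # A transaction first sees its own pending writes...
        if key in txn["writes"]:
            return txn["writes"][key]
        # ...otherwise the latest version committed before it started.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= txn["start"]]
        return visible[-1] if visible else None

    def write(self, txn, key, value):
        committed = self.versions.get(key, [])
        # Someone committed a newer version after we started: lost-update risk.
        if committed and committed[-1][0] > txn["start"]:
            raise AbortTransaction(key)
        txn["writes"][key] = value

    def commit(self, txn):
        # First-committer-wins check for every key written by the transaction...
        for key in txn["writes"]:
            committed = self.versions.get(key, [])
            if committed and committed[-1][0] > txn["start"]:
                raise AbortTransaction(key)
        # ...then install all new versions under a single commit timestamp.
        self.clock += 1
        for key, value in txn["writes"].items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock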
As explained already, this prevents a lot of the anomalies, but it is still not
serializable and some anomalies would still be possible. Research on this
field has resulted in an improved algorithm, called Serializable Snapshot
Isolation (SSI), which can provide full serializability [21] and has been
integrated in commercial, widely used databases [22]. This algorithm is still
optimistic and just adds some extensions on top of what has been described
above.
The mechanics of the solution are based on a key principle of previous
research that showed that all the non-serializable executions under snapshot
isolation share a common characteristic. This states that in the multi-version
serialization graph2 of any such execution, there is always a cycle containing
two consecutive rw-dependency edges, i.e. a transaction that has both an
incoming and an outgoing rw-dependency edge.
This approach detects these cases as they are about to happen and prevents
the cycle from being formed, by aborting one of the
involved transactions. In order to do so, it keeps track of the incoming and
outgoing rw-dependency edges of each transaction. If there is a transaction
that has both incoming and outgoing edges, the algorithm aborts one of the
transactions and retries it later.3 So, it is sufficient to maintain two boolean
flags per transaction T.inConflict and T.outConflict denoting whether
there is an incoming and outgoing rw-dependency edge. These flags can be
maintained in the following way:
• When a transaction T is performing a read, it is able to detect whether
there is a version of the same item that was written after the trans-
action started, e.g. by another transaction U. This would imply a
rw-dependency edge, so the algorithm can update T.outConflict and
U.inConflict to true.
2 A serialization graph is a graph that shows the data dependencies between transactions. See: https://en.wikipedia.org/wiki/Precedence_graph
3 Note this can lead to aborts that are false positives, since the algorithm does not check whether there is a cycle. This is done intentionally to avoid the computational costs associated with tracking cycles.
• However, this will not detect cases where the write happens after the
read. The algorithm uses a different mechanism to detect these cases
too. Every transaction creates a read lock, called SIREAD lock, when
performing a read. Transactions also create some form of write locks, as
described before to prevent lost update phenomena. As a result, when a
transaction performs a write it can read the existing SIREAD locks and
detect concurrent transactions that have previously read the same item,
thus updating accordingly the same boolean flags. Note that these are
a softer form of locks, since they do not block other transactions from
operating, but they exist mainly to signal data dependencies between
them. This means the algorithm preserves its optimistic nature.
Achieving atomicity
A common technique for achieving atomicity in a single node is journalling or
write-ahead logging, where the intended operations are first recorded in a log
so that, after a failure, the system can detect any half-finished work and
recover by either undoing the incomplete part or completing it and committing.
This approach is used extensively in file systems
and databases.
The issue of atomicity in a distributed system becomes even more complicated,
because components (nodes in this context) are separated by the network
that is slow and unreliable, as explained already. Furthermore, we do not
only need to make sure that an operation is performed atomically in a node,
but in most cases we need to ensure that an operation is performed atomically
across multiple nodes. This means that the operation needs to take effect
either at all the nodes or at none of them. This problem is also known as
atomic commit4 .
In the next sections, we will be looking at how atomicity can be achieved in
distributed settings. Algorithms are discussed in chronological order, so that
the reader can understand the pitfalls of each algorithm and how
they were addressed by subsequent ones.
this node. If this node recovers later on, it will identify that pending
transaction and will communicate with the coordinator to find out
what was the result and conclude it in the same way. So, if the result of
the transaction was successful, any crashed participant will eventually
find out upon recovery and commit it, the protocol does not allow
aborting it unilaterally.6 Thus, atomicity is maintained.
• Network failures in the same steps of the protocol have similar results to
the ones described previously, since timeouts will make them equivalent
to node failures.
Even though 2-phase commit can handle gracefully all the aforementioned
failures, there’s a single point of failure, the coordinator. Because of the
blocking nature of the protocol, failures of the coordinator at specific stages
of the protocol can bring the whole system to a halt. More specifically, if a
coordinator fails after sending a prepare message to the participants, then
the participants will block waiting for the coordinator to recover in order
to find out what was the outcome of the transaction, so that they commit
or abort it respectively. This means that failures of the coordinator can
decrease availability of the system significantly. Moreover, if the data from
the coordinator’s disk cannot be recovered (e.g. due to disk corruption),
then the result of pending transactions cannot be discovered and manual
intervention might be needed to unblock the protocol.
Despite this, the 2-phase commit has been widely used and a specification
for it has also been released, called the eXtended Architecture (XA)7 . In this
specification, each of the participant nodes are referred to as resources and
they must implement the interface of a resource manager. The specification
also defines the concept of a transaction manager, which plays the role of
the coordinator starting, coordinating and ending transactions.
To conclude, the 2PC protocol satisfies the safety property that all partic-
ipants will always arrive at the same decision (atomicity), but it does not
satisfy the liveness property that it will always make progress.
6 An astute reader will observe that there is a chance that the participants might fail at the point they try to commit the transaction, essentially breaking their promise, e.g. because they are out of disk space. Indeed, this is true, so participants have to do the minimum work possible as part of the commit phase to avoid this. For example, the participants can write all the necessary data on disk during the first phase, so that they can signal a transaction is committed by doing minimal work during the second phase (e.g. flipping a bit).
7 See: https://en.wikipedia.org/wiki/X/Open_XA
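To complement the discussion above, here is a minimal sketch of the coordinator's side of 2PC; the participant objects are assumed to expose prepare, commit and abort calls, and a real coordinator would additionally persist its decision durably and handle timeouts.

def two_phase_commit(coordinator_log, participants, txn_id):
    """Happy-path control flow of a 2PC coordinator."""
    # Phase 1 (voting): ask every participant whether it can commit.
    votes = [p.prepare(txn_id) for p in participants]

    # Commit only if every single participant voted yes.
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((txn_id, decision))   # must be durable in a real system

    # Phase 2 (commit/abort): broadcast the decision to everyone.
    for p in participants:
        if decision == "commit":
            p.commit(txn_id)
        else:
            p.abort(txn_id)
    return decision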
A quorum-based commit protocol

As we observed in the previous section, the main issue with the 3PC protocol
occurs at the end of the second phase, where a potential network partition can
bring the system to an inconsistent state. This can happen due to the fact
that participants attempt to unblock the protocol, by taking the lead without
having a picture of the overall system, resulting in a split-brain situation9 .
Ideally, we would like to be able to cope with this network partition, but
without compromising on the safety of the protocol. This can be done using
a concept we have already introduced in the book, a quorum.

9 See: https://en.wikipedia.org/wiki/Split-brain_(computing)
This approach is followed by the quorum-based commit protocol [28].
This protocol is significantly more complex, when compared to the other two
protocols we described previously, so you should study the original paper
carefully, if you want to examine all the possible edge cases. However, we
will attempt to give a high-level overview of the protocol in this section.
As we mentioned before, this protocol leverages the concept of a quorum to
ensure that different sides of a partition do not arrive at conflicting results.
The protocol establishes the concept of a commit quorum (VC ) and an abort
quorum (VA ). A node can proceed with committing only if a commit quorum
has been formed, while a node can proceed with aborting only if an abort
quorum has been formed. The values of the abort and commit quorums have
to be selected so that the property VA + VC > V holds, where V is the total
number of participants of the transaction. Based on the fact that a node can
be in only one of the two quorums, it’s impossible for both quorums to be
formed on two different sides of the partition, which would lead to conflicting results.
The protocol is composed of 3 different sub-protocols, used in different cases:
• the commit protocol, which is used when a new transaction starts
• the termination protocol, which is used when there is a network partition
• the merge protocol, which is used when the system recovers from a
network partition
The commit protocol is very similar to the 3PC protocol. The only difference
is that the coordinator is waiting for VC number of acknowledgements in
the end of the third phase to proceed with committing the transaction.
If there is a network partition at this stage, then the coordinator can be
rendered unable to complete the transaction. In this case, the participants
on each side of a partition will investigate whether they are able to complete
the transaction, using the termination protocol. Initially, a (surrogate)
coordinator will be selected amongst them via leader election. Note that
which leader election algorithm is used is irrelevant and even if multiple
leaders are elected, this does not violate the correctness of the protocol.
The elected coordinator queries the nodes of the partition for their status.
If there is at least one participant that has committed (or aborted), the
coordinator commits (or aborts) the transaction, maintaining the atomicity
property. If there is at least one participant at the prepare-to-commit state
and at least VC participants waiting for the votes result, the coordinator
sends prepare-to-commit to the participants and continues to the next step.
Alternatively, if there’s no participant at the prepare-to-commit state and
at least VA participants waiting for the votes result, the coordinator sends
a prepare-to-abort message. Note that this message does not exist in the
commit protocol, but only in the termination one. The last phase of the
termination protocol waits for acknowledgements and attempts to complete
the transaction in a similar fashion to the commit protocol. The merge
protocol is simple, including a leader election amongst the leaders of the 2
partitions that are merged and then execution of the termination protocol
we described.
Let’s examine what would happen in the network partition example from
the previous section (Figure 3.4). In this case, we had 3 participants (V =
3) and we will assume that the protocol would use quorums of size VA =
2 and VC = 2. As a result, during the network partition, the participant
on the left side of the partition would be unable to form a commit quorum.
On the other hand, the participants on the right side of the partition would
be able to form an abort quorum and they would proceed with aborting
the transaction, assuming no more partitions happen. Later on, when the
network partition recovers, the merge protocol would execute, ensuring that
the participant from the left side of the partition would also abort the
transaction, since the new coordinator would identify at least one node
that has aborted the transaction. Figure 3.5 contains a visualisation of this
execution. An interesting property of the protocol is that one can tune
the values of the quorums VA , VC , thus effectively adjusting the protocol’s
tendency to complete a transaction via commit or abort in the presence of a
partition.
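The decision rule used by the surrogate coordinator in the termination protocol can be transcribed roughly as follows; this is only a sketch of the rules summarised above and the original paper [28] covers the edge cases omitted here.

def termination_decision(states, commit_quorum, abort_quorum):
    """Decide what the elected coordinator should do on its side of a partition.

    `states` maps each reachable participant to one of: "committed",
    "aborted", "prepare-to-commit" or "waiting". Returns the message to
    send, or None if this side of the partition has to stay blocked.
    """
    values = list(states.values())
    if "committed" in values:
        return "commit"          # atomicity: someone has already committed
    if "aborted" in values:
        return "abort"
    waiting = values.count("waiting")
    if "prepare-to-commit" in values and waiting >= commit_quorum:
        return "prepare-to-commit"
    if "prepare-to-commit" not in values and waiting >= abort_quorum:
        return "prepare-to-abort"
    return None                  # no quorum can be formed on this side

For instance, with V = 3 and VC = VA = 2, a lone participant on one side of a partition cannot form either quorum and stays blocked, while the two participants on the other side can form the abort quorum and proceed towards aborting.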
To conclude, the quorum-based commit protocol satisfies the safety property
that all participants will always arrive at the same decision (atomicity). It
does not satisfy the liveness property that it will always make progress, since
there are always degenerate, extreme failure cases (e.g. multiple, continuous
and small partitions). However, it’s much more resilient, when compared to
2PC and other protocols and can make progress in the most common types
of failures.
Long-lived transactions & Sagas

As an example of the isolation problems that can arise, imagine that there is
only one item left in the warehouse and two orders, A and B, arrive
concurrently, each handled by a saga that first reserves the item and then
charges the customer's card. Order A's saga reserves the last item, so order
B fails at the first step and it's rejected because of zero inventory. Later
on, order A also fails at the second step because the customer's card does
not have enough money and the associated compensating transaction is run
returning the reserved item to the warehouse. This would mean that an order
would have been rejected while it could have been processed normally. Of
course, this violation of isolation does not have severe consequences, but in
some cases the consequences might be more serious, e.g. leading to customers
being charged without receiving a product.
In order to prevent these scenarios, some form of isolation can be introduced
at the application layer. This topic has been studied by previous research
that proposed some concrete techniques [31], referred to as countermeasures
to isolation anomalies. Some of these techniques are:
• the use of a semantic lock, which essentially signals that some data
items are currently in process and should be treated differently or not
accessed at all. The final transaction of a saga takes care of releasing
this lock and resetting the data to their normal state.
• the use of commutative updates that have the same effect regardless of
their order of execution. This can help mitigate cases that are otherwise
susceptible to lost update phenomena.
• re-ordering the structure of the saga, so that a transaction called
a pivot transaction delineates a boundary between transactions
that can fail and those that can’t. In this way, transactions that
can’t fail - but could lead to serious problems if being rolled-back
due to failures of other transactions - can be moved after the pivot
transaction. An example of this is a transaction that increases the
balance of an account. This transaction could have serious consequences
if another concurrent saga reads this increase in the balance, but then
the previous transaction is rolled back. Moving this transaction after
the pivot transaction means that it will never be rolled back, since all
the transactions after the pivot transaction can only succeed.
These techniques can be applied selectively in cases where they are needed.
However, they introduce significant complexity and move some of the burden
back to the application developer that has to think again about all the
possible failures and design accordingly. These trade-offs need to be taken into
consideration when choosing between using saga transactions or leveraging
transaction capabilities of the underlying datastore.
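To make the structure of a saga concrete, here is a minimal sketch of an executor that runs local transactions in order and, on a failure, runs the compensating transactions of the already completed steps in reverse order; the order-processing steps are hypothetical stand-ins for the example discussed earlier.

def run_saga(steps):
    """Execute (action, compensation) pairs; undo completed steps on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            # Run the compensating transactions, most recent step first.
            for compensate in reversed(completed):
                compensate()
            raise


# Hypothetical local transactions for an order-processing saga.
def reserve_item():     print("item reserved in warehouse")
def release_item():     print("reserved item returned to warehouse")  # compensation
def charge_customer():  raise RuntimeError("card declined")           # this step fails
def refund_customer():  print("charge refunded")                      # compensation

try:
    run_saga([(reserve_item, release_item),
              (charge_customer, refund_customer)])
except RuntimeError:
    print("order rejected, reservation rolled back")

Note that, as discussed above, the compensating transactions restore atomicity at the business level but do not provide isolation: another saga can observe the reserved item before the compensation runs.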
Chapter 4
Consensus
Amongst all the problems encountered so far in the book, there is a common
trait that characterizes most (if not all) of them. It’s the fact that the various
nodes of a distributed system try to reach agreement on a specific thing.
In the case of a distributed transaction, it’s whether a transaction has been
committed or not. In case of a message delivery, it’s whether a message has
been delivered or not. In fact, this underlying property is common in many
more problems in the distributed systems space. As a result, researchers
have formally defined this problem and researched possible solutions, since
these can then be used as a building block for more complicated problems.
This is known as the consensus problem and this chapter of the book is
devoted to it.
Defining consensus
In a nutshell, a consensus algorithm must satisfy the following properties:

• Termination: Every non-faulty node must eventually decide on a value.
• Agreement: All the nodes that decide must decide on the same value.
• Validity: The value that is agreed must have been proposed by one
of the nodes.
FLP impossibility
Researchers have found a lot of different solutions to this problem, but they
have also found important constraints that impose some limitations on the
possible solutions. We should note that it’s extremely useful to know the
limits of the available solutions to a problem and the research community
has benefited massively from this. As a result, this chapter will unfold in a
counter-intuitive way, first explaining these limitations and then discussing
one of the solutions to the problem. In this way, we hope the reader will
be able to gain a better understanding of the problem and will be better
equipped to reason about the solution presented later on.
As explained previously in the book, there are several different system models
with the asynchronous being the one that is really close to real-life distributed
systems. So, it’s been proved that in asynchronous systems, where there can
be at least one faulty node, any possible consensus algorithm will be unable to
terminate, under some scenarios. In other words, there can be no consensus
algorithm that will always satisfy all the aforementioned properties. This
is referred to as the FLP impossibility after the last initials of the authors
of the associated paper[33]. The proof in the paper is quite complicated,
but it’s essentially based on the following 2 parts. First, the fact that it’s
always possible that the initial state of the system is one, where nodes can
reach different decisions depending on the ordering of messages (the so-called
bivalent configuration), as long as there can be at least one faulty node.
Second, from such a state it’s always possible to end up in another bivalent
state, just by introducing delays in some messages.
As a result, it’s impossible to develop a consensus algorithm that will
always be able to terminate successfully in asynchronous systems,
where at least one failure is possible. What we can do instead is develop
algorithms that minimize the possibility of arriving at such bivalent situations.
The Paxos algorithm

As we have seen already, the failure of a single node that plays a special role
(e.g. the coordinator) could bring the whole system to a halt. The obvious next step is
to allow multiple nodes to inherit the role of the coordinator in these failure
cases. This would then mean that there might be multiple masters that might
produce conflicting results. We have already demonstrated this phenomenon
in the chapter about multi-master replication and when explaining the
3-phase commit.
One of the first algorithms that could solve the consensus problem safely
under these failures is called the Paxos algorithm. More specifically, this
algorithm guarantees that the system will come to an agreement on a single
value, tolerating the failure of any number of nodes (potentially all of them),
as long as more than half the nodes are working properly at any time, which
is a significant improvement. Funnily enough, this algorithm was invented by
Leslie Lamport during his attempt to prove this is actually impossible! He
decided to explain the algorithm in terms of a parliamentary procedure used
in an ancient, fictional Greek island, called Paxos. Despite being elegant
and highly entertaining, this first paper[34] was not well received by the
academic community, who found it extremely complicated and could not
discern its applicability in the field of distributed systems. A few years later
and after several successful attempts to use the algorithm in real-life systems,
Leslie decided to publish a second paper[35], explaining the algorithm in
simpler terms and demonstrating how it can be used to build an actual,
highly-available distributed system. A historical residue of all this is that the Paxos algorithm is regarded as rather complicated to this day. Hopefully, this section will help dispel this myth.
The Paxos algorithm defines 3 different roles: the proposers, the acceptors
and the learners. Every node in the system can potentially play multiple
roles. A proposer is responsible for proposing values (potentially received
from clients’ requests) to the acceptors and trying to persuade them to
accept their value in order to arrive at a common decision. An acceptor is
responsible for receiving these proposals and replying with their decision on
whether this value can be chosen or not. Last but not least, the learners
are responsible for learning the outcome of the consensus, storing it (in a
replicated way) and potentially acting on it, by either notifying clients about
the result or performing actions. Figure 4.1 contains a visual overview of
these roles and how they interact with the clients.
The algorithm is split into 2 phases, each of which contains two parts:
• Phase 1 (a): A proposer selects a number n and sends a prepare
request with this number (prepare(n)) to at least a majority of the
acceptors.
• Phase 1 (b): When receiving a prepare request, an acceptor has the
following options:
– If it has not already responded to another prepare request with a number higher than n, it responds to the request with a promise not to accept any proposals numbered less than n in the future. It
also returns the highest-numbered proposal it has accepted, if any
(note: the definition of a proposal follows).
– Otherwise, if it has already responded to a prepare request with a
higher number, it rejects this prepare request, ideally giving a
hint to the proposer about the number of that other prepare
request it has already responded to.
• Phase 2 (a): If the proposer receives a response to its prepare(n)
requests from a majority of acceptors, then it sends an accept(n, v)
request to these acceptors for a proposal numbered n with a value v.
The value is selected according to the following logic:
– If any of the acceptors had already accepted another proposal
and included that in its response, then the proposer uses the
value of the highest-numbered proposal among these responses.
Essentially, this means that the proposer attempts to bring the
latest proposal to conclusion.
– Otherwise, if none of the acceptors had accepted any other pro-
posal, then the proposer is free to select any desired value. This
value is usually selected based on the clients’ requests.
• Phase 2 (b): If the acceptor receives an accept(n, v) request for
a proposal numbered n, it accepts the proposal, unless it has already
responded to a prepare(k) request of a higher number (k > n).
Furthermore, as the acceptors accept proposals, they also announce their
acceptance to the learners. When a learner receives an acceptance from a
majority of acceptors, it knows that a value has been chosen. This is the
most basic version of the Paxos protocol. As we mentioned previously, nodes
can play multiple roles for practical reasons and this is usually the case in
real-life systems. As an example, one can observe that the proposers can play
the role of learners as well, since they will be receiving some of these accept
responses anyway, thus minimising traffic and improving the performance of
the system.
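To make the acceptor's side of the protocol more concrete, here is a minimal, single-decree sketch in Python of how an acceptor could handle the two types of requests described above. It is only an illustration of the rules; the networking and learner notification layers are assumed to exist elsewhere.

    # Minimal sketch of a single-decree Paxos acceptor (illustrative, not production code).
    class Acceptor:
        def __init__(self):
            self.promised_n = -1       # highest prepare number this acceptor has responded to
            self.accepted_n = -1       # number of the highest proposal accepted so far
            self.accepted_value = None

        def on_prepare(self, n):
            # Phase 1 (b): promise not to accept proposals numbered less than n,
            # returning the highest-numbered proposal accepted so far (if any).
            if n > self.promised_n:
                self.promised_n = n
                return ("promise", self.accepted_n, self.accepted_value)
            # Otherwise reject, hinting at the number already promised.
            return ("reject", self.promised_n)

        def on_accept(self, n, value):
            # Phase 2 (b): accept unless a prepare with a higher number was promised.
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted_n = n
                self.accepted_value = value
                return ("accepted", n)   # acceptances are also announced to the learners
            return ("rejected", self.promised_n)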
During Phase 1 (a) of the protocol, the proposers have to select a proposal number n. These numbers must be unique in order for the protocol to maintain its correctness properties: acceptors must always be able to compare any two proposals and determine unambiguously which one supersedes the other.
Intricacies of Paxos
The beginning of this chapter outlined how Paxos can be used to solve the
leader election problem. Nonetheless, Paxos itself needs to elect a leader in
order to reach consensus, which seems like a catch-22. The Paxos protocol
resolves this paradox, by allowing multiple leaders to be elected, thus not
needing to reach consensus for the leader itself. It still has to guarantee
that there will be a single decision, even though multiple nodes might be
proposing different values. Let's examine how Paxos achieves this and what some of the consequences are. When a proposer receives a response to a
prepare message from a majority of nodes, it considers itself the (temporary)
leader and proceeds with a proposal. If no other proposer has attempted to
become the leader in the meanwhile, its proposal will be accepted. However,
if another proposer managed to become a leader, the accept requests of the
initial node will be rejected. This prevents multiple values from being chosen by the proposals of the two nodes. This can result in a situation where proposers
are continuously duelling each other, thus not making any progress, as you
can see in Figure 4.2. There are many ways to avoid getting into this infinite
loop. The most basic one is forcing the proposers to use random delays or
exponential back-off every time they get their accept messages rejected and
have to send a new prepare request. In this way, they give more time to
the node that is currently leading to complete the protocol, by making a
successful proposal, instead of competing.
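A minimal sketch of such a retry policy follows; the try_propose helper and the specific delays are hypothetical and only illustrate the idea of backing off with a randomised, exponentially growing delay after a rejected round.

    import random
    import time

    def run_proposal_with_backoff(proposer, value, max_rounds=10):
        # Illustrative retry loop: after every failed round, wait a randomised,
        # exponentially growing delay before issuing a new prepare request.
        delay = 0.05
        for _ in range(max_rounds):
            if proposer.try_propose(value):   # hypothetical: runs Phase 1 and Phase 2 once
                return True
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 5.0)       # cap the exponential back-off
        return False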
Another subtle point is which value a proposer is allowed to propose at every round. Consider an example execution with candidate values A, B and C. For the first 3 rounds, none of the nodes in the majority
quorum have accepted any value, so proposers are free to propose their
own value. In rounds 4 and 5, proposers have to propose the value of the
highest-numbered proposal that has been accepted by the acceptors included
in Phase’s 1 majority quorum. This is A for round 4 and B for round 5. As
it’s demonstrated for round 6, at this point the behaviour depends partially
on the quorum that will be used. For example, if the next proposer selects
the yellow quorum, value C is going to be proposed, while value B will
be proposed if the green quorum is used instead. However, there is one
important thing to note: as soon as the system recovers from failures and a
proposer manages to get a proposal accepted by a majority quorum, then this
value is chosen and it cannot be changed. The reason is that any subsequent
proposer will need to get a majority quorum for Phase 1 of the protocol.
This majority will have to contain at least 1 node from the majority that has
accepted the aforementioned proposal, which will thus transfer the accepted
proposal to the prospective leader. Furthermore, it’s guaranteed this will be
the highest-numbered proposal, which means any subsequent proposer can
only propagate the chosen value to the acceptors that might not have it yet.
Paxos in real-life
Of course, the clients of the system learn the chosen values, so they could keep
track of the state on their side. But there will always be cases where some clients need to retrieve values chosen in the past, e.g. because they are clients that have just been brought into operation. So, Paxos
should also support read operations that return the decisions of previously
completed instances alongside write operations that start new instances of
the protocol. These read operations have to be routed to the current leader
of the system, which is essentially the node that completed successfully the
last proposal. It’s important to note that this node cannot reply to the client
using its local copy. The reason for this is that another node might have
made a proposal in the meantime (becoming the new leader), which would mean
that the reply will not reflect the latest state of the system.2 As a result,
that node will have to perform a read from a majority of nodes, essentially
seeing any potential new proposal from another node. You should be able to
understand how a majority quorum can guarantee that by now. If not, it
would probably be a good idea to revisit the section about quorums and their
intersection properties. This means that reads can become quite slow, since
they will have to execute in 2 phases. An alternative option that works as
an optimisation is to make use of the so-called master leases [37]. Using this
approach, a node can take a lease, by running a Paxos instance, establishing
a point in time,3 until which it’s guaranteed to be considered the leader and
no other node can challenge it. This means that this node can then serve
read operations locally. However, one has to take clock skew into account in
the implementation of this approach and keep in mind it will be safe only if
there’s an upper bound in the clock skew.
By the same logic, one could argue that electing a leader in every instance of
the Paxos protocol is not as efficient as possible and degrades performance
significantly under normal conditions without many failures. Indeed, that is
true and there is a slightly adjusted implementation of Paxos, called Multi-
Paxos that mitigates this issue [38]. In this approach, the node that has
performed the last successful proposal is considered the current distinguished
proposer. This means that a node can perform a full instance of Paxos
and then it can proceed straight to the second phase for the subsequent
instances, using the same proposal number that has been accepted previously.
2
This would mean that the read/write consensus operations would not be linearizable.
Note that in the context of consensus, operations such as proposals are considered single-
object operations. As a result, there is no need for isolation guarantees.
3
This point in time is essentially the time of the proposal (a timestamp that can be part
of the proposal’s value) plus a pre-defined time period, which corresponds to the duration
of the lease.
The rest of the nodes know which node is currently the leader based on
which node made the last successful proposal. They can perform periodic
health checks and if they believe this node has crashed, they can initiate a
prepare request in order to perform a successful proposal and become the
distinguished proposer. Essentially, this means that the protocol is much
more efficient under stable conditions, since it has only one phase. When
failures occur, the protocol just falls back to plain Paxos.
Another common need is a way to dynamically update the nodes that are
members of the system. The answer to this requirement might sound familiar
thanks to the elegance of the protocol; membership information can just
be propagated as a new Paxos proposal! The nodes that are members of the system can have their own way of identifying failures of other nodes (e.g. periodic health checks) and the corresponding policies on when a node
should be considered dead. When a node is considered dead, one of the
nodes that has identified it can trigger a new Paxos instance, proposing a
new membership list, which is the previous one minus the dead node. As
soon as this proposal completes, all the subsequent instances of Paxos should
make use of the updated membership list.
Replicated state machine via consensus

Consensus can also be used as a building block for a replicated state machine: a set of nodes that apply the same commands in the same order, so that they all go through the same sequence of states. A layered architecture is typically used to build such a system. The top layer is the one receiving requests from the clients
and creating proposals for the consensus layer, which conducts the necessary
coordination between the other nodes of the system and propagates the
chosen values to the lower layer, which just receives these values as inputs
and executes the necessary state transitions.
Let’s elaborate a bit more on what that would entail, assuming Paxos is
used as the consensus layer of the system. Essentially, the clients would
be sending regular requests to the system, depending on the domain of the
system. These requests could be either commands to the system or requests
to inspect its internal state. These requests would be dispatched to the top layer, which would create the corresponding proposals for the consensus layer, so that all the nodes end up applying them in the same order.

Distributed transactions via consensus

The introduction of this chapter mentioned that the consensus problem is very
similar to the problem of distributed transactions. However, after studying
the Paxos algorithm, one might think there seems to be a fundamental conflict
between distributed transactions and the way Paxos solves the consensus
problem. The core characteristic of distributed transactions is atomicity, the
fact that either the relevant update has to be performed in all the nodes or
it should not be performed in any of them. However, the Paxos algorithm
relies on just a majority quorum to decide on a value. Indeed, the problem
of distributed transactions, known as atomic commit, and the consensus
problem might be closely related, but they are not equivalent[39]. First of
all, the consensus problem mandates that every non-faulty node must reach
the same decision, while the atomic commit problem requires that all the
nodes (faulty or not) must reach the same decision. Furthermore, the atomic
commit problem imposes stricter relationships between votes or proposals
and the final decision than the consensus problem. In consensus, the only
requirement is that the value that is agreed must have been proposed by at
least one of the nodes. In atomic commit, a decision can be positive, only if
all the votes were positive. The decision is also required to be positive, if all
votes are positive and there are no failures.
As a result of this difference, one might think that the Paxos algorithm does
not have anything to offer in the problem space of distributed transactions.
This is not true and this section will try to illustrate what Paxos (and
any other consensus algorithm) has to offer. The biggest contribution of
a consensus algorithm would not be in the communication of the resource
managers’ results back to the transaction manager, which requires successful
communication for all of them and not just a majority. Its value would lie
in storing and transmitting the transaction’s result back to the resource
managers in a fault-tolerant way, so that the failure of a single node (the
transaction manager) cannot block the system.
Indeed, there is a very simple way to achieve that in the existing 2-phase
commit (2PC) protocol leveraging a consensus algorithm. Assuming we make
use of Paxos as a consensus algorithm, we could just have the transaction
manager start a new Paxos instance, proposing a value for the result of the
transaction, instead of just storing the result locally before sending it back to
the resource managers. The proposal value would be either commit or abort,
depending on the previous results of each one of the resource managers. This
adjustment on its own would make the 2-phase commit protocol resilient
against failures of the transaction manager, since another node could take
the role of the transaction manager and complete the protocol. That node
would have to read the result of the transaction from any existing Paxos
instance. If there’s no decision, that node would be free to make an abort
proposal.
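The following sketch illustrates this adjustment under the assumption of a hypothetical consensus client with propose and read operations: the transaction manager proposes the outcome to a consensus instance dedicated to the transaction, and a backup coordinator can later recover the decision (or default to abort) from that same instance.

    # Sketch of 2PC where the decision is recorded via consensus instead of only
    # locally at the transaction manager. The 'consensus' object is a hypothetical
    # client offering propose/read operations on named instances.
    def decide_transaction(txn_id, votes, consensus):
        # votes: responses collected from the resource managers during the first phase.
        proposal = "commit" if all(v == "yes" for v in votes) else "abort"
        # Whatever value the consensus instance chooses is the final outcome,
        # even if another node proposed a different value first.
        return consensus.propose(instance=txn_id, value=proposal)

    def recover_transaction(txn_id, consensus):
        # A backup coordinator reads the recorded outcome; if no value has been
        # chosen yet, it is free to propose an abort and complete the protocol.
        chosen = consensus.read(instance=txn_id)
        if chosen is None:
            chosen = consensus.propose(instance=txn_id, value="abort")
        return chosen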
This is simple and elegant, but it would require adding one more messaging
round to the 2-phase commit protocol. It’s actually possible to remove this
additional round, trading off some simplicity for increased performance. This
could be done by essentially "weaving" several instances of Paxos in the plain
2-phase commit protocol, practically obviating the need for a transaction
manager completely. More specifically, the resource managers would have to
send their response to the first phase to a set of acceptors, instead of sending
it to the transaction manager, thus creating a separate Paxos instance for each resource manager's vote. The outcome of the transaction can then be derived by combining the values chosen by all of these instances: the transaction is committed only if every instance has chosen a positive (prepared) value; otherwise, it is aborted.

Raft
Paxos has been the canonical solution to the consensus problem. However,
the initial specification of the algorithm did not cover some aspects that
were crucial in implementing the algorithm in practice. As explained previ-
ously, some of these aspects were covered in subsequent papers. The Paxos
algorithm is also known to be hard to understand.
As a response to these issues, researchers decided to create a new algorithm
with the goals of improved understandability and ease of implementation.
This algorithm is called Raft [41]. We will briefly examine this algorithm
in this section, since it has provided a good foundation for many practical
implementations and it nicely demonstrates how the various aspects described
before can be consolidated in a single protocol.
Raft establishes the concept of a replicated state machine and the associated
replicated log of commands as first class citizens and supports by default
multiple consecutive rounds of consensus. It requires a set of nodes that form
the consensus group, which is referred to as the Raft cluster. Each of these
nodes can be in one of 3 states: a leader, a follower or a candidate. One of
the nodes is elected the leader and is responsible for receiving log entries
from clients (proposals) and replicate them to the other follower nodes in
order to reach consensus. The leader is responsible for sending heartbeats to
the other nodes in order to maintain its leadership. Any node that hasn’t
heard from the leader for a while will assume the leader has crashed; it will
enter the candidate state and attempt to become leader by triggering a new
election. On the other hand, if a previous leader identifies another node has
gained leadership, it falls back to a follower state. Figure 4.5 illustrates the
behaviour of the nodes depending on their state.
In order to prevent two leaders from operating concurrently, Raft has the
temporal concept of terms. Time is divided into terms, which are numbered
with consecutive integers and each term begins with an election where one or
more candidates attempt to become leaders. In order to become a leader, a
candidate needs to receive votes from a majority of nodes. Each node votes
for at most one node per term on a first-come-first-served basis. Consequently,
at most one node can win the election for a specific term, since 2 different
majorities would overlap in at least one node. If a candidate wins the election,
it serves as the leader for the rest of the term. Any leader from previous
terms will not be able to replicate any new log entries across the group,
since the voters of the new leader will be rejecting its requests and it will
eventually discover it has been deposed. If none of the candidates manages
to get a majority of votes in a term, then this term ends with no leader and
a new term (with a new election) begins straight after.
Nodes communicate via remote procedure calls (RPCs) and Raft has 2 basic
RPC types:
• RequestVote: sent by candidates during an election.
• AppendEntries: sent by leaders to replicate log entries and also to
provide a form of heartbeat.
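As a rough illustration, the payloads of these two RPCs could be modelled as follows; the field names loosely follow the Raft paper and are not tied to any specific implementation.

    from dataclasses import dataclass, field
    from typing import Any, List

    @dataclass
    class RequestVote:
        term: int            # candidate's current term
        candidate_id: str
        last_log_index: int  # used by voters to check the candidate's log is up-to-date
        last_log_term: int

    @dataclass
    class LogEntry:
        term: int            # term in which the entry was created
        command: Any         # command for the state machine

    @dataclass
    class AppendEntries:
        term: int                 # leader's current term
        leader_id: str
        prev_log_index: int       # index of the entry immediately preceding the new ones
        prev_log_term: int        # term of that entry, used for the consistency check
        entries: List[LogEntry] = field(default_factory=list)   # empty for heartbeats
        leader_commit: int = 0    # highest index known by the leader to be committed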
The commands are stored in a log replicated to all the nodes of the cluster.
The entries of the log are numbered sequentially and they contain the term in
which they were created and the associated command for the state machine,
as shown in Figure 4.6. An entry is considered committed if it can be applied
to the state machine of the nodes. Raft guarantees that committed entries
are durable and will eventually be executed by all of the available state
machines, while also guaranteeing that no other entry will be committed
for the same index. It also guarantees that all the preceding entries of a
committed entry are also committed. This status essentially signals that
consensus has been reached on this entry.
For example, a follower might have crashed and thus missed some (committed) entries (a, b), it might have received some additional (non-committed) entries (c, d), or both things might have happened (e, f). Specifically, scenario
(f) could happen if a node was elected leader in both terms 2 and 3 and
replicated some entries, but it crashed before any of these entries were
committed.
Raft preserves the consistency of the replicated logs through the following two mechanisms:
• During elections, a candidate includes information about its log in the RequestVote RPC and a node grants its vote only if the candidate's log is at least as up-to-date as its own. Since a candidate needs votes from a majority of the nodes and any committed entry must be present in at least one node in that majority, then it's guaranteed to hold all the committed entries.
• When sending an AppendEntries RPC, a leader includes the index
and term of the entry that immediately precedes the new entries in its
log. The followers check against their own logs and reject the request
if their log differs. If that happens, the leader discovers the first index
where their logs disagree and starts sending all the entries after that
point from its log. The follower discards its own entries and adds the
leader’s entries to its log. As a result, their logs eventually converge
again.
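A simplified sketch of the follower-side consistency check just described could look as follows, with log indexes starting at 1 and an index of 0 denoting an empty log:

    def handle_append_entries(log, prev_log_index, prev_log_term, entries):
        # log is the follower's list of LogEntry objects (1-indexed conceptually).
        # Reject if the follower's log does not contain an entry at prev_log_index
        # with a matching term; the leader will then retry with an earlier index.
        if prev_log_index > 0:
            if len(log) < prev_log_index or log[prev_log_index - 1].term != prev_log_term:
                return False
        # Discard any conflicting entries after prev_log_index and append the leader's.
        del log[prev_log_index:]
        log.extend(entries)
        return True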
We mentioned previously that a leader knows that an entry from its term
can be considered committed when it has been successfully replicated to a
majority of nodes and it can then be safely applied to the state machine.
But, what happens when a leader crashes before committing an entry?
If subsequent leaders have received this entry, they will attempt to finish
replicating the entry. However, a subsequent leader cannot safely conclude
that an entry from a previous term is committed once it is stored on a
majority of nodes. The reason is there is an edge case where future leaders
can still replace this entry even if it’s stored on a majority of nodes. Feel free
to refer to the paper for a full description of how this can happen. As a result,
a leader can safely conclude an entry from a previous term is committed by replicating it and then replicating a new entry from its own term on top of it. If
the new entry from its own term is replicated to a majority, the leader can
safely consider it as committed and thus it can also consider all the previous
entries as committed at this point. So, a leader is guaranteed to have all the
committed entries at the start of its term, but it doesn’t know which those
are. To find out, it needs to commit an entry from its own term. To expedite
this in periods of idleness, the leader can just commit a no-op command in
the beginning of its term.
What has been described so far constitutes the main specification of the Raft
protocol. The paper contains more information on some other implementation
details that will be covered briefly here. Cluster membership changes can
be performed using the same mechanisms by storing the members of the
cluster in the same way regular data is stored. An important note is that
transition from an old configuration Cold to a new configuration Cnew must
be done via a transition to an intermediate configuration Cjoint that contains
both the old and the new configuration. This is to prevent two different
leaders from being elected for the same term. Figure 4.8 illustrates how that
could happen if the cluster transitioned from Cold directly to Cnew . During
this intermediate transition, log entries are replicated to the servers of both
configurations, any node from both configurations can serve as a leader and
consensus requires majority from both the old and the new configuration.
After the Cjoint configuration has been committed, the cluster then switches
to the new configuration Cnew . Since the log can grow infinitely, there also
needs to be a mechanism to avoid running out of storage. Nodes can perform
log compaction by writing a snapshot of the current state of the system on
stable storage and removing old entries. When handling read requests from
clients, a leader needs to first send heartbeats to ensure it’s still the current
leader. That guarantees the linearizability of reads. Alternatively, leaders
can rely on the heartbeat mechanism to provide some form of lease, but
this would assume bounded clock skew in order to be safe. A leader might
also fail after applying a committed entry to its state machine, but before
replying to the client. In these cases, clients are supposed to retry the request
to the new leader. If these requests are tagged with unique serial numbers,
the Raft nodes can identify commands that have already been executed and
reply to the clients without re-executing the same request twice.
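A minimal sketch of this deduplication idea is shown below: the state machine remembers the last serial number applied for each client and returns the cached response instead of re-executing the command. The names are illustrative.

    class DedupStateMachine:
        def __init__(self, apply_fn):
            self.apply_fn = apply_fn          # the real state-machine transition
            self.last_applied = {}            # client_id -> (serial_no, response)

        def apply(self, client_id, serial_no, command):
            last = self.last_applied.get(client_id)
            if last is not None and last[0] == serial_no:
                # The command was already executed; return the cached response
                # instead of executing it a second time.
                return last[1]
            response = self.apply_fn(command)
            self.last_applied[client_id] = (serial_no, response)
            return response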
At this point, we have spent enough time examining all the small details
of the various consensus algorithms. This can prove to be very useful,
when one thinks about how to design a system, what kinds of guarantees
it would require or even when troubleshooting edge cases. Hopefully, you
have realised by now that these problems are very complicated. As a result,
creating an algorithm that solves these problems or even translating an
existing algorithm to a concrete implementation is a really big undertaking.
If there is an existing solution out there, you should first consider re-using
this before rolling out your own, since it’s highly likely that the existing
solution would be much more mature and battle-tested. This is true not
only for consensus but other problems inherent to distributed systems as
well. A later chapter contains some case studies of basic categories of such
distributed systems that you can leverage to solve some common problems.
Part III
Time and order are some of the most challenging aspects of distributed
systems. As you might have realised already by the examples presented, they
are also very intertwined. For instance, time can be defined by the order
of some events, such as the ticks of a clock. At the same time, the order
of some events can also be defined based on the time each of these events
happened. This relationship can feel natural when dealing with a system that
is composed of a single node, but it gets a lot more complicated when dealing
with distributed systems that consist of many nodes. As a result, people
that have been building single-node, centralised applications sometimes get
accustomed to operating under principles and assumptions that do not hold
when working in a distributed setting. This part will study this relationship
between time and order, the intricacies related to distributed systems and
some of the techniques that can be used to tackle some of the problems
inherent in distributed systems.
Chapter 5
Time
A practical perspective
The clocks used in real systems are what we usually call physical clocks. A
physical clock is a physical process coupled with a method of measuring that
process to record the passage of time [42]. Most physical clocks are based on
cyclic processes. Below are some examples of such devices:
• Some of the most basic ones and easy to understand are devices like
a sundial or an hourglass. The former tells the time of the day using
a gnomon and tracking the shadow created by the sun. The latter
measures time by the regulated flow of sand through a bulb.
• Another common clock device is a pendulum clock, which uses an
oscillating weight as its timekeeping element.
• An electronic version of the last type, called a quartz clock, is used in software systems. This device makes use of a crystal, called a quartz crystal, which vibrates or ticks at a specific frequency when electricity is applied to it.
• Some of the most accurate timekeeping devices are atomic clocks, which use the frequency of electronic transitions in certain atoms to measure
time.
As explained initially, all these devices rely on physical processes to measure
time. Of course, there can be errors residing both in the measurement tools
being used and the actual physical processes themselves. As a result, no
matter how often we synchronize these clocks with each other or with other
clocks that have more accurate measurement methods, there will always be
a skew between the various clocks involved in a distributed system. When
building a distributed system, this difference between clocks must be taken
into account and the overall system should not operate under the assumption
that all these clocks are the same and can act as a single, global clock.
Figure 5.1 contains an example of what could happen otherwise. Let’s assume
we have a distributed system composed of 3 different nodes A, B and C.
Every time an event happens at a node, the node assigns a timestamp to
the event, using its own clock, and then propagates this event to the other
nodes. As the nodes receive events from the other nodes, they compare the
timestamps associated with these events to determine the order in which the
events happened. If all the clocks were completely accurate and reflecting
exactly the same time, then that scheme would theoretically be capable of
identifying the order. However, if there is a skew between the clocks of the
various nodes, the correctness of the system is violated. More specifically, in
our example, we assume that the clock of node A is running ahead of the
clock of node B. In the same way, the clock of node C is running behind the
clock of node B. As a result, even if the event in node A happened before
the event in node C, node B will compare the associated timestamps and
will believe the event from node C happened first.
So, from a practical point of view, the best we could do is accept there will
always be a difference between the clocks of different nodes in the system
and expose this uncertainty in some way, so that the various nodes in the
system can handle it appropriately. Spanner [5] is a system that follows this
approach, using the TrueTime API that directly exposes this uncertainty by
using time intervals (embedding an error) instead of plain timestamps.
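As a rough illustration of this idea (and not Spanner's actual API), a clock that exposes its uncertainty could return an interval, and two events can only be ordered safely when their intervals do not overlap:

    from collections import namedtuple

    # An interval [earliest, latest] that is guaranteed to contain the true time.
    TimeInterval = namedtuple("TimeInterval", ["earliest", "latest"])

    def definitely_before(a: TimeInterval, b: TimeInterval):
        # True only when event a certainly happened before event b.
        return a.latest < b.earliest

    def order_unknown(a: TimeInterval, b: TimeInterval):
        # Overlapping intervals: the clocks cannot tell which event came first,
        # so the system has to either wait out the uncertainty or fall back to
        # some other ordering mechanism.
        return not definitely_before(a, b) and not definitely_before(b, a)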
A theoretical perspective
Logical clocks
The focus of this section has been on physical clocks so far, explaining their
main limitations when used in distributed systems. However, there is an
alternative category of clocks, which is not subject to the same constraints,
logical clocks. These are clocks that do not rely on physical processes to
keep track of time. Instead, they make use of messages exchanged between
the nodes of the system, which is the main mechanism information flows in
a distributed system, as described previously.
We can imagine a trivial form of such a clock in a system consisting of only
a single node. Instead of using a physical clock, this node could instead
make use of a logical clock, which would consist of a single method, say
getTime(). When invoked, this method would return a counter, which would
subsequently be incremented. For example, if the system started at 9:00 and
events A, B and C happened at 9:01, 9:05 and 9:55 respectively, then they
could be assigned the timestamps 1, 2 and 3. As a result, the system would
still be able to order the events, but it would not be able to determine the
temporal distance between any two events. Some more elaborate types of
logical clocks are described in the next section.
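Before moving on, a sketch of this trivial, single-node logical clock could be as simple as the following:

    class SimpleLogicalClock:
        # A counter that only captures the order of events, not their temporal distance.
        def __init__(self):
            self.counter = 0

        def get_time(self):
            self.counter += 1
            return self.counter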
Chapter 6
Order
Total and partial ordering

Imposing an order on events is significantly easier in a system composed of a single node than in a distributed one. The reason is that there is a single actor, where all the events happen, so this
actor can impose a total order on these events as they occur. Total orderings
also make it much simpler to build protocols and algorithms. However, in
a distributed system it’s not that straightforward to impose a total order
on events, since there are multiple nodes in the system and events might be
happening concurrently on different nodes. As a result, a distributed system
can make use of any valid partial ordering of the events occurring, if there is
no strict need for a total ordering.
Figure 6.1 contains a diagram that illustrates why total ordering is much
harder to determine in a distributed system. As displayed in the diagram, in
a system composed of a single node that can only execute events serially, it’s
easy to define a total order on all the events happening, since between two
events (e1 , e2 ) one of them will have started after the other one finished. On
the other hand, in a distributed system composed of multiple nodes where
events are happening concurrently, it’s much harder to determine a total
order, since there might be pairs of events that cannot be ordered.
For instance, look at some of the social media platforms people use nowadays,
where they can create posts and add comments to the posts of other people.
Do you really care about the order in which two unrelated posts are shown
to you? Probably not. As a result, the system could potentially leverage a
partial ordering, where posts that can’t really be ordered are displayed in
an arbitrarily chosen order. However, there is still a need for preserving the
order of some events that are tightly linked. For example, if a comment CB
is a reply to a comment CA , then you would most probably like to see CB
after CA . Otherwise, a conversation could end up being confusing and hard
to follow.
What we just described is the notion of causality, where one event con-
tributes to the production of another event. Looking back at one of the
introductory sections, you can find the description of a consistency model,
called the causal consistency model, which ensures that events that are
causally related are observed by the various nodes in a single order, where
causes precede the effects. Violating causality can lead to behaviours that
are really hard to understand by the users of a system. Fortunately, as we
will explore in the next sections of this chapter, it’s possible to track causality
without the need of physical time.
The notion of causality is also present in real life. We subconsciously make
use of causality when planning or determining the feasibility of a plan or the
innocence of an accused.1 Causality is determined based on a set of loosely
synchronized clocks (e.g. wrist watches, wall clocks etc.) under the illusion
of a global clock. This appears to work in most cases, because the time
duration of events is much more coarse-grained in real life and information
"flows" much more slowly than in software systems. For instance, compare
the time a human needs to go from London to Manchester and the time
needed for 10 kilobytes to travel the same distance via the Internet. As a
result, small differences between clocks do not create significant problems in
most real life scenarios. However, in distributed computing systems, events
happen at a much higher rate, higher speed and their duration is several
orders of magnitude smaller. As a consequence, if the physical clocks of the
various nodes in the system are not precisely synchronised, the causality
relation between events may not be accurately captured.
To sum up, causality can be leveraged in the design of distributed systems
with 2 main benefits: increasing concurrency and replacing real time with the
notion of logical time, which can be tracked with less infrastructure and costs.
1
See: https://en.wikipedia.org/wiki/Alibi
Lamport clocks
One of the first and simplest types of logical clocks was invented by Leslie Lamport and is called the Lamport clock [44]. In this type of logical clock,
every node in the system maintains a logical clock in the form of a numeric
counter that can start from zero when a node starts operating. The rules of
the protocol are the following:
• (R1) Before executing an event (send, receive or local), a node incre-
ments the counter of its logical clock by one: Ci = Ci + 1.
• (R2) Every sent message piggybacks the clock value of its sender at
sending time. When a node ni receives a message with timestamp Cmsg, it first sets its clock to the maximum of its own value and the received one, Ci = max(Ci, Cmsg), and then applies rule R1 for the receive event.
These rules guarantee that when one event causally precedes another, the clocks will reflect this relationship in the clock's value. For instance, A1 causally precedes B1 and we can see that C(A1) = 1 < 2 = C(B1) (clock consistency condition). We can also see that the strong consistency condition does not hold. For instance, C(C2) < C(B2), but these 2 events are not causally
dependent. Event B2 could have happened either before or after C2 with the
same clock value.
Lamport clocks can be used to create a total ordering of events in a distributed
system by using some arbitrary mechanism to break ties, in case clocks of
different nodes have the same value (e.g. the ID of the node). The caveat
is this total ordering is somewhat arbitrary and cannot be used to infer
causal relationship, which limits the number of practical applications they
can have. The paper demonstrates how they could potentially be used to
solve synchronisation problems, such as mutual exclusion[44].
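A compact sketch of the two rules of Lamport clocks, assuming the actual sending and receiving of messages happens elsewhere:

    class LamportClock:
        def __init__(self):
            self.counter = 0

        def local_event(self):
            # R1: increment the counter before executing any event.
            self.counter += 1
            return self.counter

        def send_event(self):
            # R1 applies to send events too; the returned value is piggybacked
            # on the outgoing message (R2).
            self.counter += 1
            return self.counter

        def receive_event(self, message_timestamp):
            # R2: adopt the maximum of the local clock and the received timestamp,
            # then apply R1 for the receive event itself.
            self.counter = max(self.counter, message_timestamp) + 1
            return self.counter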
Vector clocks
A closely related mechanism is the version vector, whose update rules are very similar to those used by vector clocks. However, version vectors
are used for slightly different purposes. As explained previously, vector clocks
are used to maintain a logical form of time, which can then be used to identify
when events are happening, especially in comparison to other events. On
the other hand, version vectors are better suited for applications that store
data, where every data item is tagged with a version vector. In this way,
data can potentially be updated in multiple parts of the system concurrently
(e.g. when there is a network partition), so that the version vectors from the
resulting data items can help us identify those items that can be reconciled
automatically and those that require conflict resolution[49].
Version vectors maintain state identical to that in a vector clock, containing
one integer entry per node in the system. The update rules are slightly
different: nodes can experience both local updates (e.g. a write applied at a
server) or can synchronize with another node (e.g. when recovering from a
network partition).
• Initially, all vectors have all their elements set to zero.
• Each time a node experiences a local update event, it increments its
own counter in the vector by one.
• Each time two nodes a and b synchronize, they both set the elements in
their vector to the maximum of the elements across both vectors Va [x]
= Vb [x] = max(Va [x], Vb [x]). After synchronisation, both nodes will
have the same vectors. Furthermore, depending on whether the initial
vectors were causally related or not, one of the associated items will
supercede the other or some conflict resolution logic will be executed
to maintain one entry associated with the new vector.
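A minimal sketch of these rules, representing each vector as a dictionary from node identifiers to counters:

    def local_update(vector, node_id):
        # A node increments its own entry when it applies a local update.
        vector[node_id] = vector.get(node_id, 0) + 1

    def synchronize(va, vb):
        # Pairwise synchronisation: both nodes keep the element-wise maximum.
        merged = {n: max(va.get(n, 0), vb.get(n, 0)) for n in set(va) | set(vb)}
        va.clear()
        va.update(merged)
        vb.clear()
        vb.update(merged)

    def dominates(va, vb):
        # va dominates vb if every entry of va is >= the corresponding entry of vb.
        # If neither vector dominates the other, the associated items are concurrent
        # and conflict resolution is needed.
        return all(va.get(n, 0) >= vb.get(n, 0) for n in set(va) | set(vb))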
Version vectors are mostly beneficial in systems that act as datastores, so the
nodes in the system will belong to two basic categories: the server or replica
nodes that store the data and receive read/write operations and the client
nodes that read data from the replica nodes and send update instructions. In
many cases, clients might not even be part of our systems, such as scenarios
where our system receives operations directly from web browsers of customers.
As a result, it would be better to avoid imposing a significant amount of logic and storage
overheads in the client nodes.
The version vector mechanism allows this in the following way: one entry is
maintained for every node (both replica/server and client nodes). However,
the client nodes can be stateless, which means they do not store the version
vectors. Instead, they receive a version vector as part of every read operation
and they provide this version vector when executing an update operation; the version vectors provided with an update are referred to as its context. A refinement of this scheme is the so-called dotted version vector, where each entry can be either a single number or a pair of numbers: a plain number m represents the contiguous sequence of events [1, 2, ..., m], while a pair (m, n) represents that sequence plus the isolated event n. For example, the pair (4,7) represents the sequence [1,2,3,4,7]. Note that the second number is optional and some entries can still be a single number. This extra number can be leveraged to keep track of concurrency between multiple versions. The
order between 2 versions is now defined in terms of the contains relationship
on the corresponding sequences. So, for vectors v1 , v2 the relationship v1
≤ v2 holds if the sequence represented by v1 is a subset of the sequence
represented by v2 . More specifically:
• (m) ≤ (m') if m ≤ m'
• (m) ≤ (m', n') if m ≤ m' ∨ m = m' + 1 = n'
• (m, n) ≤ (m') if n ≤ m'
• (m, n) ≤ (m', n') if n ≤ m' ∨ (m ≤ m' ∧ n = n')
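The following sketch implements the per-entry comparison rules listed above, representing an entry either as a single-element tuple (m,) or as a pair (m, n):

    def entry_leq(e1, e2):
        # Per-entry <= relation for dotted version vector entries.
        # An entry is either (m,) meaning the sequence [1..m], or (m, n)
        # meaning the sequence [1..m] plus the isolated event n.
        if len(e1) == 1 and len(e2) == 1:
            (m,), (m2,) = e1, e2
            return m <= m2
        if len(e1) == 1 and len(e2) == 2:
            (m,), (m2, n2) = e1, e2
            return m <= m2 or m == m2 + 1 == n2
        if len(e1) == 2 and len(e2) == 1:
            (m, n), (m2,) = e1, e2
            return n <= m2
        (m, n), (m2, n2) = e1, e2
        return n <= m2 or (m <= m2 and n == n2)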
The update rule executed by each replica node when receiving a write
operation is also slightly different. For all the indexes except the one belonging
to the replica node, the node uses the value (m) where m is the maximum
number amongst those available in the provided version vectors in the
context. For the index corresponding to the replica node, the node uses the
pair (m, n+1), where m is the maximum number amongst those available in
the provided version vectors in the context and n is the maximum number
amongst the version vectors present in the replica node (essentially the value
of its logical clock). For a more elaborate analysis of the rules and a formal
proof that this technique is safe, refer to the original paper[50]. Figure 6.8
illustrates what a solution with dotted version vectors would look like in our
previous examples. As you can see, the write operations by client nodes C
and D end up receiving version vectors {(B,0,1)} and {(B,0,2)} and they
are successfully identified as concurrent, since neither {(B,0,1)} ≤ {(B,0,2)} nor {(B,0,2)} ≤ {(B,0,1)} holds.
Distributed snapshots
A distributed snapshot is a record of the global state of a distributed system, which consists of the local state of every node and any messages that are in transit between the nodes.
in recording this state is that the nodes that are part of the system do not
have a common clock, so they cannot record their local states at precisely
the same instant. As a result, the nodes have to coordinate with each other
by exchanging messages, so that each node records its state and the state
of associated communication channels. Thus, the collective set of all node
and channel states will form a global state. Furthermore, any communi-
cation required by the snapshot protocol should not alter the underlying
computation.
The paper presents a very interesting and illuminating analogy for this
problem. Imagine a group of photographers observing a panoramic, dynamic
scene such as a sky filled with migrating birds. This scene is so big that it
cannot be captured by a single photograph. As a result, the photographers
must take several snapshots and piece them together to form a picture of
the overall scene. The snapshots cannot be taken at the same time and the
photographers should not disturb the process that is being photographed, i.e.
they cannot get all the birds to remain motionless while the photographs are
taken. However, the composite picture should be meaningful.
This need for a meaningful snapshot still exists when talking about distributed
systems. For example, there’s no point recovering from a snapshot, if that
snapshot can lead to the system to an erroneous or corrupted state. A
meaningful snapshot is termed as a consistent snapshot in the paper, which
presents a formal definition of what this is6 . This definition will be presented
here in a more simplified way for ease of understanding. Let’s assume a
distributed system can be modelled as a directed graph, where vertices
represent nodes of the system and edges represent communication channels.
An event e in a node p is an atomic action that may change the state of p
itself and the state of at most one channel c incident on p: the state of c
may be changed by the sending of a message M along c (if c is an outbound
edge from p) or the receipt of a message M along c (if c is an inbound edge
to p). So, an event e could be represented by the tuple <p, s, s', M, c>,
where s and s' are the previous and new state of the node. An event ei
moves the global state of the system from Si to Si+1 . A snapshot Ssnapshot is
thus consistent if:
6
An alternative definition is that of a consistent cut [47], which partitions the space-time
diagram along the time axis in a way that respects causality, e.g. for each pair of events e
and f, if f is in the cut and e -> f then e is also in the cut. Note that the Chandy-Lamport
algorithm produces snapshots that are also consistent cuts.
• Ssnapshot is reachable from the state Sstart in which the algorithm was
initiated.
• the state Send in which the algorithm terminates is also reachable from
Ssnapshot .
Let’s look at an example to get some intuition about this. Let’s assume we
have a very simple distributed system consisting of 2 nodes p, q and two
channels c, c', as shown in Figure 6.9. The system contains one token that
is passed between the nodes. Each node has two possible states s0 and s1 ,
where s0 is the state in which the node does not possess the token and s1 is
the state in which it does. Figure 6.10 contains the possible global states of
the systems and the associated transitions. As a result, a snapshot where
the state is s0 for both of the nodes and the state of both channels is empty
would not be consistent, since the token is lost! A snapshot where the states
are s1 and s0 and channel c contains the token is also not consistent, since
there are now two tokens in the system.
The algorithm is based on the following main idea: a marker message is sent
between nodes using the available communication channels that represents
an instruction to a node to record a snapshot of the current state. The
algorithm works as follows:
• The node that initiates the protocol records its state and then sends a
marker message to all the outbound channels. Importantly, the marker
is sent after the node records its state and before any further messages
are sent to the channels.
• When a node receives a marker message, the behaviour depends on whether the node has already recorded its state. If it has not, it records its own state, records the state of the channel the marker arrived on as an empty sequence and then emits the marker to all its outbound channels. If it has already recorded its state, it records the state of the channel the marker arrived on as the sequence of messages received on that channel after it recorded its state and before it received the marker.
Figure 6.10: The possible global states and the corresponding transitions of
the token system
Going back to the token example, assume node p does not hold the token and initiates the protocol: it records its state as s0 and sends a marker along channel c. Node q, which had already sent the token to p along c', receives the marker, records its state as s0 and the state of channel c as an empty sequence, and emits a marker along c'. In the meantime, node p received the token, transitioned to state s1 and buffered the token in the sequence of messages received while
the snapshot protocol was executing. The node p then receives the marker
and records the state of the channel c' as the sequence [token]. At this
point, the protocol concludes, since the state of all nodes and channels has
been recorded and the global snapshot state is the following:
snapshot(p): s0
snapshot(q): s0
snapshot(c): []
snapshot(c'): [token]
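A rough sketch of the marker-handling logic of the algorithm is shown below; the sending and receiving plumbing is assumed to exist and is only hinted at via a hypothetical send callback.

    class SnapshotNode:
        def __init__(self, inbound_channels, outbound_channels, get_state, send):
            self.inbound = inbound_channels    # channel ids we receive from
            self.outbound = outbound_channels  # channel ids we send to
            self.get_state = get_state         # returns the node's current local state
            self.send = send                   # hypothetical: send(channel, message)
            self.recorded_state = None
            self.channel_state = {}            # channel id -> recorded message sequence
            self.recording = set()             # channels we are still recording

        def initiate_snapshot(self):
            self._record_and_propagate()

        def _record_and_propagate(self):
            self.recorded_state = self.get_state()
            self.recording = set(self.inbound)
            for c in self.outbound:
                self.send(c, "MARKER")         # marker goes out before any further messages

        def on_message(self, channel, message):
            if message == "MARKER":
                if self.recorded_state is None:
                    # First marker seen: record local state; this channel is recorded as empty.
                    self._record_and_propagate()
                    self.channel_state[channel] = []
                else:
                    # Already recorded: keep whatever was buffered for this channel.
                    self.channel_state[channel] = self.channel_state.get(channel, [])
                self.recording.discard(channel)
            elif channel in self.recording:
                # A regular message received after recording our state but before the
                # marker on this channel belongs to the channel's recorded state.
                self.channel_state.setdefault(channel, []).append(message)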
Hopefully, the two previous chapters helped you understand the difference
between the concepts of physical and logical time. At the same time, some
parts went into detail to explain the inner workings of some techniques and
their benefits and pitfalls. This might have left you with more questions,
so this section will contain an overview of what we have seen and some
additional observations. The goal is to help you finish this chapter with a more complete picture of these concepts and of how they relate to each other.
Chapter 7
Case studies
System Version
HDFS 3.1.2
Zookeeper 3.5.5
Hbase 2.0
Cassandra 3.11.4
FaunaDB 2.7
Kafka 2.3.1
Kubernetes 1.13.12
Corda 4.1
Spark 2.4.4
Flink 1.8
Distributed file systems (HDFS/GFS)

In a typical datacenter, machines are organised into racks: the aggregate bandwidth in and out of a rack may be less than the aggregate bandwidth of all the machines within the
rack and a failure of a single shared resource in a rack (a network switch
or power circuit) can essentially bring all the machines of the rack down.
When creating a new chunk and placing its initially empty replicas, a master
tries to use chunkservers with below-average disk space utilisation. It also
tries to use chunkservers that have a low number of recent creations, since
that can reliably predict imminent heavy write traffic. In this way, the
master attempts to balance disk and network bandwidth utilisation across
the cluster. When deciding where to place the replicas, the master also
follows a chunk replica placement policy that is configurable. By default,
it will attempt to store two replicas at two different nodes that reside in
the same rack, while storing the third replica at a node that resides in a
separate rack. This is a trade-off between high network bandwidth and data
reliability.
The clients can create, read, write and delete files from the distributed
file system by using a GFS client library linked in to the application that
abstracts some implementation details. For example, the applications can
operate based on byte offsets of files and the client library can translate these
byte offsets to the associated chunk index, communicate with the master to
retrieve the chunk handle for the provided chunk index and the location of
the associated chunkservers and finally contact the appropriate chunkserver
(most likely the closest one) to retrieve the data. Figure 7.1 displays this
workflow for a read operation. Clients cache the metadata for chunk locations
locally, so they only have to contact master for new chunks or when the cache
has expired. During migration of chunks due to failures, clients organically
request fresh data from the master, when they realise the old chunkservers
cannot serve the data for the specified chunk anymore. On the other hand,
clients do not cache the actual chunk data, since they are expected to stream
through huge files and have working sets that are too large to benefit from
caching.
The master stores the file and chunk namespaces, the mapping from files
to chunks and the chunk locations. All metadata is stored in the master’s
memory. The namespaces and the mappings are also kept persistent by
logging mutating operations (e.g. file creation, renaming etc.) to an operation
log that is stored on the master’s local disk and replicated on remote machines.
The master node also checkpoints its memory state to disk when the log
grows significantly. As a result, in case of the master’s failure the image of
the filesystem can be reconstructed by loading the last checkpoint in memory
and replaying the operation log from this point forward. File namespace
mutations are atomic and linearizable. This is achieved by executing this
operation in a single node, the master node. The operation log defines a
global total order for these operations and the master node also makes use
of read-write locks on the associated namespace nodes to perform proper
serialization on any concurrent writes.
GFS supports multiple concurrent writers for a single file. Figure 7.2 illus-
trates how this works. The client first communicates with the master node to
identify the chunkservers that contain the relevant chunks. Afterwards, the
client starts pushing the data to all the replicas using some form of chain
replication. The chunkservers are put in a chain depending on the network
topology and data is pushed linearly along the chain. For instance, the
client pushes the data to the first chunkserver in the chain, which pushes the
data to the second chunkserver etc. This helps fully utilize each machine’s
network bandwidth avoiding bottlenecks in a single node. The master grants
a lease for each chunk to one of the chunkservers, which is nominated as
the primary replica and is responsible for serializing all the mutations
on this chunk. After all the data is pushed to the chunkservers, the client
sends a write request to the primary replica, which identifies the data pushed
earlier. The primary assigns consecutive serial numbers to all the mutations,
applies them locally and then forwards the write request to all secondary
replicas, which apply the mutations in the same serial number order imposed by
the primary. After the secondary replicas have acknowledged the write to
the primary replica, then the primary replica can acknowledge the write to
the client.
Of course, this flow is vulnerable to partial failures. For example, think about
the scenario, where the primary replica crashes in the middle of performing a
write. After the lease expires, a secondary replica can request the lease and
start imposing a new serial number that might disagree with the writes of
other replicas in the past. As a result, a write might be persisted only in some
replicas or it might be persisted in different orders in different replicas. GFS
provides a custom, relaxed consistency model for write operations: the state of a file region after a mutation may differ across replicas or contain padding and duplicate records. To cope with this, applications are expected to write records that carry extra information like checksums so that their validity can be verified. A reader
can then identify and discard extra padding and record fragments using
these checksums. If occasional duplicates are not acceptable, e.g. if they
could trigger non-idempotent operations, the reader can filter them out using
unique record identifiers that are selected and persisted by the writer.
HDFS has taken a slightly different path to simplify the semantics of mutating
operations. Specifically, HDFS supports only a single writer at a time. It
provides support only for append (and not overwrite) operations. It also does
not provide a record append operation, since there are no concurrent writes
and it handles partial failures in the replication pipeline a bit differently,
removing failed nodes from the replica set completely in order to ensure file
content is the same in all replicas.
Both GFS and HDFS provide applications with the information where a
region of a file is stored. This enables the applications to schedule processing
jobs to be run in nodes that store the associated data, minimizing network
congestion and improving the overall throughput of the system. This principle
is also known as moving computation to the data.
A Zookeeper ensemble elects one of its nodes as the leader, which uses an atomic broadcast protocol, called Zab, to replicate the write operations to the followers4. Each of those nodes has a copy of the Zookeeper state in memory. Any changes are also recorded in a
durable, write-ahead log which can be used for recovery. All the nodes can
serve read requests using their local database. Followers have to forward any
write requests to the leader node, wait until the request has been successfully
replicated/broadcasted and then respond to the client. Reads can be served
locally without any communication between nodes, so they are extremely
fast. However, a follower node might be lagging behind the leader node,
so client reads might not necessarily reflect the latest write that has been
performed. For this reason, Zookeeper provides an additional operation
called sync. Clients can initiate a sync before performing a read. In this
way, the read will reflect any write operations that had happened before
the sync was issued. The sync operation does not need to go through the
broadcast protocol; it is just placed at the end of the leader's queue and
forwarded only to the associated follower5 .
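As an illustration of this pattern (using a hypothetical client handle rather than a specific Zookeeper client library), a read that needs to observe all previously completed writes could be issued as follows:

    def read_latest(zk, path):
        # 'zk' is a hypothetical Zookeeper client handle; the point is the pattern,
        # not a specific library. The sync causes the follower serving this session
        # to catch up with the leader before the read is executed, so the read
        # reflects all writes completed before the sync was issued.
        zk.sync(path)
        data = zk.get(path)
        return data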
As a result, Zookeeper provides the following 2 safety guarantees:
• Linearizable writes: all requests that update the state of Zookeeper
are serializable and respect precedence. As mentioned before, writes are handled by a single node, the leader, which imposes a total order on them.
• FIFO client order: all requests from a given client are executed in the order in which they were sent by that client.
4
Chubby uses Paxos for this purpose, while etcd makes use of Raft.
5
In contrast, in Chubby both read and write requests are directed to the master. This
has the benefit of increased consistency, but the downside of decreased throughput. To
mitigate this, Chubby clients cache extensively and the master is responsible for invalidating
the caches before completing writes, thus making the system a bit more sensitive to client
failures.
The Zookeeper API can be used to build more powerful primitives. Some
examples are the following:
• Configuration management: This can be achieved simply by having
the node that needs to publish some configuration information create a
znode zc and write the configuration as the znode’s data. The znode’s
path is provided to the other nodes of the system, which obtain the configuration by reading the znode and can set a watch on it in order to be notified of any subsequent changes.
Distributed datastores
BigTable/HBase
Data is organised in tables, where each row can contain multiple columns and columns are grouped into column families. The user can specify tuning configurations for each column family, such as
compression type or in-memory caching. Column families need to be declared
upfront during schema definition, but columns can be created dynamically.
Furthermore, the system supports a small number of column families, but
an unlimited number of columns. The keys are also uninterpreted bytes and
rows of the table are physically stored in lexicographical order of the keys.
Each table is partitioned horizontally using range partitioning based
on the row key into segments, called regions. The main goal of this data
model and the architecture described later is to allow the user to control the
physical layout of data, so that related data are stored near each other.
Figure 7.7 shows the high-level architecture of HBase, which is also based on
a master-slave architecture. The master is called HMaster and the slaves
are called region servers. The HMaster is responsible for assigning regions
to region servers, detecting the addition and expiration of region servers,
balancing region server load and handling schema changes. Each region server
manages a set of regions, handling read and write requests to the regions
it has loaded and splitting regions that have grown too large. Similar to
other single-master distributed systems, clients do not communicate with the
master for data flow operations, but only for control flow operations in order
to prevent it from becoming the performance bottleneck of the system. HBase
uses Zookeeper to perform leader election of the master node, maintain
group membership of region servers, store the bootstrap location of HBase
data and also store schema information and access control lists. Each region
server stores the data for the associated regions in HDFS, which provides
the necessary redundancy. A region server can be collocated at the same
machine of an HDFS datanode to enable data locality and minimize network
traffic.
There is a special HBase table, called the META table, which contains the
mapping between regions and region servers in the cluster. The location of
this table is stored in Zookeeper. As a result, the first time a client needs to
read/write to HBase, it first communicates with Zookeeper to retrieve the
region server that hosts the META table, then contacts this region server to
find the region server that contains the desired table and finally sends the
read/write operation to that server. The client caches locally the location
of the META table and the data already read from this table for future use.
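As an illustration, here is a rough sketch of this interaction using the HBase Java client API, called from Scala. The table, column family and row key are hypothetical; the META lookup and its caching happen transparently inside the client library.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClientSketch {
  def main(args: Array[String]): Unit = {
    // The client is pointed at the Zookeeper ensemble, which is used to locate
    // the region server hosting the META table.
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", "zk-1,zk-2,zk-3")

    val connection = ConnectionFactory.createConnection(conf)
    val table = connection.getTable(TableName.valueOf("orders"))

    // Write: the client routes the Put to the region server owning the row's region.
    val put = new Put(Bytes.toBytes("order#123"))
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("42.0"))
    table.put(put)

    // Read: same routing logic, reusing the locally cached META information.
    val result = table.get(new Get(Bytes.toBytes("order#123")))
    println(Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount"))))

    table.close()
    connection.close()
  }
}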
HMasters initially compete to create an ephemeral node in Zookeeper. The
first one to do so becomes the active master, while the others listen for
notifications from Zookeeper about the active master's failure. Similarly, region
servers create ephemeral nodes in Zookeeper at a directory monitored by the
HMaster. In this way, the HMaster is aware of region servers that join/leave
the cluster, so that it can manage assignment of regions accordingly.
Appends are more efficient than random writes, especially in a filesystem
like HDFS. Region servers try to take advantage of this fact by employing
the following components for storage and data retrieval:
• MemStore: this is used as a write cache. Writes are buffered in this
in-memory data structure, where they can be sorted efficiently, and they
are periodically flushed to HDFS as sorted files.
• HFile: this is the file in HDFS which stores sorted key-value entries
on disk.
• Write ahead log (WAL): this stores operations that have not been
persisted to permanent storage and are only stored in the MemStore.
This is also stored in HDFS and is used for recovery in the case of a
region server failure.
• BlockCache: this is the read cache. It stores frequently read data in
memory and least recently used data is evicted when the cache is full.
As a result, write operations go through the WAL and the MemStore first and are later flushed to disk as HFiles, while read operations combine data from the BlockCache, the MemStore and the HFiles on disk.
The following table maps the main HBase components to their counterparts in Bigtable:

HBase              Bigtable
region             tablet
region server      tablet server
Zookeeper          Chubby
HDFS               GFS
HFile              SSTable
MemStore           Memtable
Cassandra
The primary key of a table in Cassandra consists of a mandatory partition key and an optional set of clustering columns. If both of these components are present, then the primary
key is called a compound primary key. Furthermore, if the partition key is
composed of multiple columns, it’s called a composite partition key. Figure
7.9 contains an example of two tables, one having a simple primary key and
one having a compound primary key.
The primary key of a table is one of the most important parts of the schema,
because it determines how data is distributed across the system and also
how it is stored in every node. The first component of the primary key, the
partition key, determines the distribution of data. The rows of a table are
conceptually split into different partitions, where each partition contains
only rows with the same value for the defined partition key. All the rows
corresponding to a single partition are guaranteed to be stored collocated
in the same nodes, while rows belonging to different partitions can be
distributed across different nodes. The second component of the primary
key, the clustering columns, determines how rows of the same partition will
be stored on disk. Specifically, rows of the same partition will be stored in
ascending order of the clustering columns defined, unless specified otherwise.
Figure 7.10 elaborates on the previous example, showing how data from the
two tables would be split into partitions and stored in practice.
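To make the concepts concrete, the following sketch creates a table with a compound primary key using CQL executed through a 3.x-style DataStax Java driver session from Scala. The keyspace, table and column names are hypothetical and are not the ones used in Figure 7.9.

import com.datastax.driver.core.Cluster

object CompoundKeySketch {
  def main(args: Array[String]): Unit = {
    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("shop") // assumes the keyspace "shop" already exists

    // Partition key: store_id determines which nodes store each row.
    // Clustering column: sale_time determines the on-disk order within a partition.
    session.execute(
      """CREATE TABLE IF NOT EXISTS sales_by_store (
        |  store_id int,
        |  sale_time timestamp,
        |  amount decimal,
        |  PRIMARY KEY ((store_id), sale_time)
        |) WITH CLUSTERING ORDER BY (sale_time DESC)""".stripMargin)

    // All the rows of a given store end up in the same partition, sorted by sale_time.
    session.execute(
      "INSERT INTO sales_by_store (store_id, sale_time, amount) VALUES (1, toTimestamp(now()), 19.99)")

    cluster.close()
  }
}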
Cassandra distributes the partitions of a table across the available nodes
using consistent hashing, while also making use of virtual nodes to provide
balanced, fine-grained partitioning. As a result, all the virtual nodes of a
Cassandra cluster form a ring. Each virtual node corresponds to a specific
value in the ring, called the token, which determines which partitions will
belong to this virtual node. Specifically, each virtual node contains all the
partitions whose partition key (when hashed) falls in the range between
its token and the token of the previous virtual node in the ring13. Every
Cassandra node can be assigned multiple virtual nodes. Each partition is
also replicated across N nodes, where N is a number that is configurable
per keyspace, called the replication factor. There are multiple available
replication strategies that determine how the additional N-1 nodes
are selected. The simplest strategy just selects the next nodes clockwise in
the ring. More complicated strategies also take into account the network
topology of the nodes for the selection. The storage engine for each node
is inspired by Bigtable and is based on a commit log containing all the
mutations and a memtable that is periodically flushed to SSTables, which
are also periodically merged via compactions.
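The ring structure described above can be sketched in a few lines. The following is a simplified model only: the hash function is a toy stand-in for the Murmur3 partitioner that Cassandra uses by default, and replica selection simply walks the ring clockwise, similar to the simplest replication strategy.

import scala.collection.mutable
import scala.jdk.CollectionConverters._

class TokenRing(vnodesPerNode: Int) {
  // token -> physical node owning the virtual node with that token
  private val ring = new java.util.TreeMap[java.lang.Long, String]()

  // Toy hash for illustration purposes only.
  private def hash(s: String): Long =
    s.foldLeft(1125899906842597L)((h, c) => 31 * h + c)

  def addNode(node: String): Unit =
    (0 until vnodesPerNode).foreach(i => ring.put(hash(s"$node#vnode$i"), node))

  // The owning virtual node is the first one clockwise whose token is >= the key's hash.
  private def primaryToken(partitionKey: String): java.lang.Long = {
    val token = ring.ceilingKey(hash(partitionKey))
    if (token != null) token else ring.firstKey()
  }

  // SimpleStrategy-like selection: walk the ring clockwise until we have collected
  // the requested number of distinct physical nodes.
  def replicasFor(partitionKey: String, replicationFactor: Int): Seq[String] = {
    val target = math.min(replicationFactor, ring.values().asScala.toSet.size)
    val replicas = mutable.LinkedHashSet.empty[String]
    var token = primaryToken(partitionKey)
    while (replicas.size < target) {
      replicas += ring.get(token)
      val next = ring.higherKey(token)
      token = if (next != null) next else ring.firstKey()
    }
    replicas.toSeq
  }
}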
The nodes of the cluster communicate with each other periodically via a
gossip protocol, exchanging state and topology information about themselves
and other nodes they know about. New information is gradually spread
throughout the cluster via this process. In this way, nodes are able to keep
track of which nodes are responsible for which token ranges, so that they
can route requests accordingly. They can also determine which nodes are
healthy and which are not, so that they can omit sending requests to nodes
that are unreachable. Administrator tools are available that can be used
by an operator to instruct a node of the cluster to remove another node
that has crashed permanently from the ring. Any partitions belonging to
13. Cassandra also supports some form of range partitioning, via the ByteOrderedPartitioner. However, this is available mostly for backwards compatibility reasons and it's not recommended, since it can cause issues with hot spots and imbalanced data distribution.
that node will be replicated to a different node from the remaining replicas.
There is a need for a bootstrap process that will allow the first nodes to join
the cluster. For this reason, a set of nodes are designated as seed nodes
and they can be specified to all the nodes of the cluster via a configuration
file or a third-party system during startup.
Cassandra has no notion of a leader or primary node. All replica nodes are
considered equivalent. Every incoming request can be routed to any node in
the cluster. This node is called the coordinator node and is responsible
for managing the execution of the request on behalf of the client. This node
identifies the nodes that contain the data for the requested partition and
dispatches the requests. After successfully collecting the responses, it replies
to the client. Given there is no leader and all replica nodes are equivalent,
they can handle writes concurrently. As a result, there is a need for
a conflict resolution scheme and Cassandra makes use of a last-write-wins
(LWW) scheme. Every row that is written comes with a timestamp. When a
read is performed, the coordinator collects all the responses from the replica
nodes and returns the one with the latest timestamp.
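The conflict resolution step can be sketched as follows; the response type is a hypothetical stand-in for whatever a replica returns to the coordinator.

// Hypothetical shape of a replica's response, carrying the write timestamp that
// Cassandra attaches to every written value.
final case class ReplicaResponse(value: String, writeTimestampMicros: Long)

// Last-write-wins: the coordinator returns the value with the highest write timestamp.
def resolve(responses: Seq[ReplicaResponse]): ReplicaResponse =
  responses.maxBy(_.writeTimestampMicros)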
The client can also specify policies that define how this coordinator node is
selected, e.g. based on proximity or token awareness.
Cassandra also provides linearizable operations through lightweight transactions
(LWT), which are executed under a special consistency level, called SERIAL, and
use a protocol consisting of four phases. Two of these phases mirror the two phases
of Paxos and satisfy the same needs: the first phase is called prepare and
corresponds to the nodes trying to gather votes before proposing a value
which is done in the third phase, called propose. When run under SERIAL
level, the write operations are conditional using an IF clause, also known
as compare-and-set (CAS). The second phase of the protocol is called read
and is used to retrieve the data in order to check whether the condition
is satisfied before proceeding with the proposal. The last phase is called
commit and it’s used to move the accepted value into Cassandra storage
and allow a new consensus round, thus unblocking concurrent LWTs again.
Read and write operations executed under SERIAL are guaranteed to be
linearizable. Read operations will commit any accepted proposal that has
not been committed yet as part of the read operation. Write operations
under SERIAL are required to contain a conditional part.
In Cassandra, performing a query that does not make use of the primary key
is bound to be inefficient, because it will need to perform a full table
scan querying all the nodes of the cluster. There are two alternatives to this:
secondary indexes and materialized views. A secondary index can be defined
on some columns of a table. This means each node will index this table locally
using the specified columns. A query based on these columns will still need
to ask all the nodes of the system, but at least each node will have a more
efficient way to retrieve the necessary data without scanning all the data. A
materialised view can be defined as a query on an existing table with a newly
defined partition key. This materialised view is maintained as a separate
table and any changes on the original table are eventually propagated to it.
As a result, these two approaches are subject to the following trade-off:
• Secondary indexes are more suitable for high cardinality columns,
while materialized views are suitable for low cardinality columns as
they are stored as regular tables.
• Materialised views are expected to be more efficient during read
operations when compared to secondary indexes, since only the nodes
that contain the corresponding partition are queried.
• Secondary indexes are guaranteed to be strongly consistent, while
materialised views are eventually consistent.
Cassandra does not provide join operations, since they would be inefficient due
to the distribution of data. As a result, users are encouraged to denormalise
the data by potentially including the same data in multiple tables, so that
they can be queried efficiently reading only from a minimum number of
nodes. This means that any update operations on this data will need to
update multiple tables, but this is expected to be quite efficient. Cassandra
provides 2 flavours of batch operations that can update multiple partitions
and tables: logged and unlogged batches. Logged batches provide the
additional guarantee of atomicity, which means either all of the statements
of the batch operation will take effect or none of them. This can help ensure
that all the tables that share this denormalised data will be consistent with
each other. However, this is achieved by first logging the batch as a unit in a
system table which is replicated and then performing the operations, which
makes them less efficient than unlogged batches. Neither logged nor unlogged
batches provide any isolation, so concurrent requests might temporarily observe
the effects of only some of the operations.
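As a brief sketch of how denormalised tables can be kept consistent with each other, the following logged batch applies the same data to two hypothetical tables atomically, using a 3.x-style DataStax driver session; prepared statements are omitted for brevity.

// `session` is assumed to be an already connected com.datastax.driver.core.Session.
def recordUser(session: com.datastax.driver.core.Session, id: Int, email: String): Unit =
  session.execute(
    s"""BEGIN BATCH
       |  INSERT INTO users_by_id (id, email) VALUES ($id, '$email');
       |  INSERT INTO users_by_email (email, id) VALUES ('$email', $id);
       |APPLY BATCH""".stripMargin)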
Spanner
The data model of Spanner is very close to the data model of classical
relational databases. A database in Spanner can contain one or more tables,
which can contain multiple rows. Each row contains a value for each column
that is defined and one or more columns are defined as the primary key of
the table, which must be unique for each row. Each table contains a schema
that defines the data types of each column.
Spanner partitions the data of a table using horizontal range partitioning.
The rows of a table are partitioned in multiple segments, called splits. A split
is a range of contiguous rows, where the rows are ordered by the corresponding
primary key. Spanner can perform dynamic load-based splitting, so any split
that receives an extreme amount of traffic can be partitioned further and
stored in servers that have less traffic. The user can also define parent-
child relationships between tables, so that related rows from the tables are
collocated making join operations much more efficient. A table C can be
declared as a child table of A, using the INTERLEAVE keyword and ensuring
the primary key of the parent table is a prefix of the primary key of the child
table. An example is shown in Figure 7.15, where a parent table Singers is
interleaved with a child table, called Albums . Spanner guarantees that the
row of a parent table and the associated rows of the child table will never be
assigned to a different split.
A Spanner deployment, called a universe, consists of multiple zones; each zone contains a zonemaster that assigns data to a set of spanservers, which in turn serve
and store data. The per-zone location proxies are used by clients to locate
the spanservers that serve a specific portion of data. The universe master
displays status information about all the zones for troubleshooting and the
placement driver handles automated movement of data across zones, e.g. for
load balancing reasons.
Each spanserver can manage multiple splits and each split is replicated across
multiple zones for availability, durability and performance15 . All the replicas
of a split form a Paxos group. One of these replicas is voted as the leader
and is responsible for receiving incoming write requests and replicating them
to the replicas of the group via a Paxos round. The rest of the replicas are
followers and can serve some kinds of read requests. Spanner makes use of
long-lived leaders with time-based leader leases, which are renewed by default
every 10 seconds. Spanner makes use of pessimistic concurrency control to
ensure proper isolation between concurrent transactions, specifically two-
phase locking. The leader of each replica group maintains a lock table
that maps ranges of keys to lock states for this purpose16 . Spanner also
provides support for distributed transactions that involve multiple splits
that potentially belong to different replica groups. This is achieved via
two-phase commit across the involved replica groups. As a result, the
leader of each group also implements a transaction manager to take part in
the two-phase commit. The leaders of each group that take part are referred
to as participant leaders and the follower replicas of each one of those groups
are referred to as participant slaves. More specifically, one of these groups is
chosen as the coordinator for the two-phase commit protocol and the replicas
of this group are referred to as coordinator leader and slaves respectively.
Spanner makes use of a novel API to record time, called TrueTime [60],
which was the key enabler for most of the consistency guarantees provided
by Spanner. This API directly exposes clock uncertainty and nodes can wait
out that uncertainty when comparing timestamps retrieved from different
clocks. If the uncertainty gets large because of some failure, this will manifest
as increased latency due to nodes having to wait longer periods. TrueTime
represents time as a TTInterval, which is an interval [earliest, latest] with
bounded time uncertainty. The API provides a method TT.now() that returns a TTInterval which is guaranteed to contain the absolute time at which the method was invoked.
15. In fact, each split is stored in a distributed filesystem, called Colossus, that is the successor of GFS and already provides byte-level replication. However, Spanner adds another level of replication to provide the additional benefits of data availability and geographic locality.
16. In practice, these locks are also replicated in the replicas of the group to cover against failures of the leader.
Each replica tracks a safe time tsafe, which is the maximum timestamp at which it is up to date. When a replica has no transactions that have been prepared via two-phase commit but not yet committed, the transaction manager's contribution to this value is unbounded, i.e. tsafe,TM = +∞.
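The following is a minimal sketch of the commit-wait idea enabled by such an API. The TrueTime trait and the TTInterval type are hypothetical stand-ins for Spanner's internal API, which exposes clock uncertainty as an interval [earliest, latest].

final case class TTInterval(earliest: Long, latest: Long)

trait TrueTime {
  def now(): TTInterval // guaranteed to contain the absolute current time
}

class CommitWait(tt: TrueTime) {
  // Pick a timestamp that is no smaller than any clock in the system, then wait
  // until that timestamp is guaranteed to be in the past before acknowledging,
  // so that the commit order matches real-time order.
  def commitTimestamp(): Long = {
    val ts = tt.now().latest
    while (tt.now().earliest < ts) Thread.sleep(1) // wait out the uncertainty
    ts
  }
}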
Read-only transactions allow a client to perform multiple reads at the same
timestamp and these operations are also guaranteed to be strictly serializ-
able. An interesting property of read-only transactions is that they do not need
to hold any locks and do not block other transactions. The reason for
this is that these transactions perform reads at a specific timestamp, which
is selected in such a way as to guarantee that any concurrent/future write
operations will update data at a later timestamp. The timestamp is selected
at the beginning of the transaction as TT.now().latest and it’s used for all
the read operations that are executed as part of this transaction. In general,
the read operations at timestamp tread can be served by any replica g that is
up to date, which means tread ≤ tsafe,g . More specifically:
• In some cases, a replica can be certain via its internal state and
TrueTime that it is up to date enough to serve the read and does so.
• In some other cases, a replica might not be sure if it has seen the latest
data. It can then ask the leader of its group for the timestamp of the
last transaction it needs to apply in order to serve the read.
• In the case the replica is the leader itself, it can proceed directly since
it is always up to date.
Spanner also supports standalone reads outside the context of transactions.
These do not differ a lot from the read operations performed as part of
read-only transactions. For instance, their execution follows the same logic
using a specific timestamp. These reads can be strong or stale. A strong
read is a read at a current timestamp and is guaranteed to see all the data
that has been committed up until the start of the read. A stale read is a
read at a timestamp in the past, which can be provided by the application
or calculated by Spanner based on a specified upper bound on staleness. A
stale read is expected to have lower latency at the cost of stale data, since
it’s less likely the replica will need to wait before serving the request.
There is also another type of operations, called partitioned DML. This allows
a client to specify an update/delete operation in a declarative form, which
is then executed in parallel at each replica group. This parallelism and the
associated data locality makes these operations very efficient. However, this
comes with some tradeoffs. These operations need to be fully partitionable,
which means they must be expressible as the union of a set of statements,
where each statement accesses a single row of the table and no other
tables. This ensures each replica group will be able to
execute the operation locally without any coordination with other replica
FaunaDB
The followers passively replicate the leader. If the leader fails, one of the followers will
automatically detect that and become the new leader.
Kafka makes use of Zookeeper for various functions, such as leader election
between the replica brokers and group membership of brokers and consumers.
Interestingly, log replication is separated from the key elements of the con-
sensus protocol, such as leader election and membership changes. The latter
are implemented via Zookeeper, while the former is using a single-master
replication approach, where the leader waits for followers to persist each
message before acknowledging it to the client. For this purpose, Kafka has
the concept of in-sync replicas (ISR), which are replicas that have replicated
committed records and are thus considered to be in-sync with the leader. In
case of a leader failure, only a replica that is in the ISR set is allowed to be
elected as a leader. This guarantees zero data loss, since any replica in the
ISR set is guaranteed to have stored locally all the records acknowledged by
the previous leader. If a follower in the ISR set is very slow and lags behind,
the leader can evict that replica from the ISR set in order to make progress.
In this case, it’s important to note that the ISR update is completed before
CHAPTER 7. CASE STUDIES 172
proceeding, e.g. acknowledging records that have been persisted by the new,
smaller ISR set. Otherwise, there would be a risk of data loss, if the leader
failed after acknowledging these records but before updating the ISR set,
so that the slow follower could be elected as the new leader even though
it would be missing some acknowledged records. The leader maintains 2
offsets, the log end offset (LEO) and the high watermark (HW). The former
indicates the last record stored locally, which might not have been replicated
or acknowledged yet. The latter indicates the last record that has been successfully
replicated to all the in-sync replicas and can thus be acknowledged back to the client.
Kafka provides a lot of levers to adjust the way it operates depending on
the application’s needs. These levers should be tuned carefully depending
on requirements around availability, durability and performance. For
example, the user can control the replication factor of a topic, the minimum
size of the ISR set (min.insync.replicas) and the number of replicas from
the ISR set that need to acknowledge a record before it’s committed (acks).
Let’s see some of the trade-offs one can make using these values:
• Setting min.insync.replicas to a majority quorum (e.g.
(replication factor / 2) + 1) and acks to all would allow
one to enforce stricter durability guarantees, while also achieving good
availability. Let’s assume replication factor = 5, so there are 5
replicas per partition and min.insync.replicas = 3. This would
mean up to 2 node failures can be tolerated with zero data loss and
the cluster still being available for writes and reads.
• Setting min.insync.replicas equal to replication factor and
acks to all would provide even stronger durability guarantees at the
expense of lower availability. In our previous example of replication
factor = 5, this would mean that up to 4 node failures can now be
tolerated with zero data loss. However, a single node failure makes the
cluster unavailable for writes.
• Setting acks to 1 can provide better performance at the expense of
weaker durability and consistency guarantees. For example, records
will be considered committed and acknowledged as soon as the leader
has stored them locally without having to wait for any of the followers
to catch up. However, in case of a leader failure and election of a new
leader, records that had been acknowledged by the previous leader but
had not made it to the new leader yet will be lost.
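The following sketch shows how the first of these setups could be expressed using Kafka's admin and producer clients from Scala, with a replication factor of 5, a majority min.insync.replicas of 3 and acks set to all; broker addresses and topic names are hypothetical.

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.jdk.CollectionConverters._

object DurableTopicSketch {
  def main(args: Array[String]): Unit = {
    val adminProps = new Properties()
    adminProps.put("bootstrap.servers", "broker-1:9092")
    val admin = AdminClient.create(adminProps)

    // Topic with 6 partitions, replication factor 5 and a majority quorum of in-sync replicas.
    val topic = new NewTopic("orders-events", 6, 5.toShort)
      .configs(Map("min.insync.replicas" -> "3").asJava)
    admin.createTopics(List(topic).asJava).all().get()

    // Producer that waits for all in-sync replicas to acknowledge each record.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "broker-1:9092")
    producerProps.put("acks", "all")
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord[String, String]("orders-events", "order-1", "created"))
    producer.close()
    admin.close()
  }
}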
Kafka can provide at-least-once, at-most-once and exactly-once messaging
guarantees through various different configurations. Let’s see each one of
them separately:
• at-most-once semantics: this can be achieved on the producer
side by disabling any retries. If the write fails (e.g. due to a
TimeoutException), the producer will not retry the request, so the
message might or might not be delivered depending on whether it had
reached the broker. However, this guarantees that the message cannot
be delivered more than once. In a similar vein, consumers commit
message offsets before they process them. In that case, each message
is processed once in the happy path. However, if the consumer fails
after committing the offset but before processing the message, then
the message will never be processed.
• at-least-once semantics: this can be achieved by enabling retries for
producers. Since failed requests will now be retried, a message might
be delivered more than once to the broker leading to duplicates, but
it’s guaranteed it will be delivered at least once27 . The consumer can
process the message first and then commit the offset. This would mean
that the message could be processed multiple times, if the consumer
fails after processing it but before committing the offset.
• exactly-once semantics: this can be achieved using the idempotent
producer provided by Kafka. This producer is assigned a unique
identifier (PID) and tags every message with a sequence number. In
this way, the broker can keep track of the largest sequence number it has seen
per PID and partition and reject duplicates. The consumers can store the committed offsets in
Kafka or in an external datastore. If the offsets are stored in the same
datastore where the side-effects of the message processing are stored,
then the offsets can be committed atomically with the side-effects, thus
providing exactly-once guarantees.
Kafka also provides a transactional client that allows producers to produce
messages to multiple partitions of a topic atomically. It also makes it
possible to commit consumer offsets from a source topic in Kafka and
produce messages to a destination topic in Kafka atomically. This makes it
possible to provide exactly-once guarantees for an end-to-end pipeline. This
is achieved through the use of a two-phase commit protocol, where the
brokers of the cluster play the role of the transaction coordinator in a highly
available manner using the same underlying mechanisms for partitioning,
leader election and fault-tolerant replication. The coordinator stores the state of each transaction in an internal Kafka topic, so that it can be recovered after a failure.
27. Note that this assumes infinite retries. In practice, a maximum number of retries is usually configured, in which case a message might not be delivered if this limit is exhausted.
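A rough sketch of this end-to-end pattern using the transactional producer is shown below; topic names, the transactional id and the consumer group are hypothetical, and the consumption loop is omitted.

import java.util.Properties
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

object ExactlyOnceSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker-1:9092")
    props.put("transactional.id", "payments-processor-1") // also enables the idempotent producer
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    producer.initTransactions()
    try {
      producer.beginTransaction()
      // Produce to the destination topic and commit the source offsets atomically.
      producer.send(new ProducerRecord[String, String]("payments-enriched", "p-42", "enriched payload"))
      producer.sendOffsetsToTransaction(
        Map(new TopicPartition("payments", 0) -> new OffsetAndMetadata(43L)).asJava,
        "payments-consumer-group")
      producer.commitTransaction()
    } catch {
      case e: Exception =>
        producer.abortTransaction()
        throw e
    } finally {
      producer.close()
    }
  }
}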
The physical storage layout of Kafka is pretty simple: every log partition
is implemented as a set of segment files of approximately the same size
(e.g. 1 GB). Every time a producer publishes a message to a partition, the
broker simply appends the message to the last segment file. For better
performance, segment files are flushed to disk only after a configurable
number of messages have been published or a configurable amount of time
has elapsed28 . Each broker keeps in memory a sorted list of offsets, including
the offset of the first message in every segment file. Kafka employs some more
performance optimisations, such as using the sendfile API29 for sending data from segment files to consumers over the network without copying it through the application layer.
28. This behaviour is configurable through the values log.flush.interval.messages and log.flush.interval.ms. It is important to note that this behaviour has implications for the aforementioned durability guarantees, since some of the acknowledged records might be stored only in the memory of all the in-sync replicas for some time until they are flushed to disk.
29. See https://developer.ibm.com/articles/j-zerocopy
Kubernetes controllers continuously compare the current state of the cluster with the desired state, reacting to potential failures and converging back to the desired state. Since there are
multiple components reading and updating the current state of the cluster,
there is a need for some concurrency control to prevent anomalies arising
from reduced isolation. Kubernetes achieves this with the use of conditional
updates. Every resource object has a resourceVersion field representing
the version of the resource as stored in etcd. This version can be used to
perform a compare-and-swap (CAS) operation, so that anomalies like lost
updates are prevented.
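The pattern can be sketched as follows. The ApiClient trait below is a hypothetical stand-in; real Kubernetes client libraries expose the same idea through the resourceVersion field and conflict responses.

final case class Resource(name: String, resourceVersion: String, spec: Map[String, String])

trait ApiClient {
  def get(name: String): Resource
  // Succeeds only if expectedVersion still matches the version stored in etcd.
  def update(desired: Resource, expectedVersion: String): Boolean
}

// Optimistic concurrency: read, modify, then conditionally update; retry on conflict.
def updateWithRetry(client: ApiClient, name: String)(modify: Resource => Resource): Unit = {
  var done = false
  while (!done) {
    val current = client.get(name)                          // capture the resourceVersion
    val desired = modify(current)                           // compute the desired state
    done = client.update(desired, current.resourceVersion)  // compare-and-swap
  }
}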
Corda is a platform that allows multiple parties that do not fully trust each
other to maintain a distributed ledger with shared facts amongst each other.
By its nature, this means it is a distributed system similar to the systems
analysed previously. However, a distinctive characteristic of this system is
this lack of trust between the nodes that are part of the system, which also
gives it a decentralisation aspect. This distrust is managed through various
cryptographic primitives35 , as explained later. This section will give a rather
brief overview of Corda’s architecture, but you can refer to the available
whitepapers for a more detailed analysis[68][69].
Each node in Corda is a JVM-runtime environment with a unique identity on
the network. A Corda network is made up of many such nodes that want to
transact with each other in order to maintain and evolve a set of shared facts.
This network is permissioned, which means nodes need to acquire an X.509
certificate from the network operator in order to be part of the network. The
component that issues these certificates is referred to as the doorman. In this
context, the doorman operates as a certificate authority for the nodes that
are part of the network. Each node maintains a public and a private key36 ,
where the private key is used to attest to facts by signing the associated data
and the public key is used by other nodes to verify these signatures. This
X.509 certificate creates an association between the public key of the node
and a human-readable X.500 name (e.g. O=MegaCorp,L=London,C=GB). The
35. This is a book about distributed systems, so this section will focus mostly on the distribution aspect of Corda. For the sake of completeness, the analysis might also mention how some cryptographic techniques are used, but this will be done under the assumption that the reader is familiar with basic concepts and can study them further outside the scope of this book.
36. See: https://en.wikipedia.org/wiki/Public-key_cryptography
network also contains a network map service, which provides some form of
service discovery to the nodes that are part of the network. The nodes
can query this service to discover other nodes that are part of the network
in order to transact with them. Interestingly, the nodes do not fully trust
the network operator for the distribution of this information, so each entry
of this map that contains the identifying data of a node (i.e. IP address,
port, X.500 name, public key, X.509 certificate etc.) is also signed by the
corresponding node. In order to avoid censorship by the network operator,
the nodes can even exchange the files that contain this information with each
other out-of-band and install them locally.
Let’s have a look at the data model of Corda now. The shared facts between
Corda nodes are represented by states, which are immutable objects which
can contain arbitrary data depending on the use case. Since states are
immutable, they cannot be modified directly to reflect a change in the state
of the world. Instead, the current state is marked as historic and is replaced
by a new state, which creates a chain of states that gives us a full view
of the evolution of a shared fact over time. This evolution is done by a
Corda transaction, which specifies the states that are marked as historic
(also known as the input states of the transaction) and the new states that
supersede them (also known as the output states of the transaction). Of
course, there are specific rules that specify what kind of states each state
can be replaced by. These rules are specified by smart contracts and each
state also contains a reference to the contract that governs its evolution. The
smart contract is essentially a pure function that takes a transaction as an
input and determines whether this transaction is considered valid based on
the contract’s rules. Transactions can also contain commands, which indicate
the transaction’s intent in terms of how the data of the states are used. Each
command is also associated with a list of public keys that need to sign the
transaction in order to be valid.
Figure 7.24 contains a very simple example of this data model for electronic
money. In this case, each state represents an amount of money issued by
a specific bank and owned by a specific party at some point in time. We
can see that Alice combines two cash states in order to perform a payment
and transfer 10 GBP to Bob. After that, Bob decides to redeem this money
in order to get some cash from the bank. As shown in the diagram, there
are two different commands for each case. We can also guess some of the
rules of the associated contract for this cash state. For a Spend command,
the contract will verify that the sum of all input states equals the sum of
all output states, so that no money is lost or created out of thin air. Most
likely, it will also check that the Spend command contains all the owners of
the input states as signers, which need to attest to this transfer of money.
The astute reader will notice that in this case nothing would prevent someone
from spending a specific cash state to two different parties who would not
be able to detect that. This is known as double spend and it’s prevented
in Corda via the concept of notaries. A notary is a Corda service that
is responsible for attesting that a specific state has not been spent more
than once. In practice, every state is associated with a specific notary and
every transaction that wants to spend this state needs to acquire a signature
from this notary that proves that the state has not been spent already by
another transaction. This process is known as transaction finalisation in
Corda. The notarisation services are not necessarily provided by a single
node; they can also be provided by a notary cluster of multiple nodes in order
to provide better fault tolerance and availability. In that case, these nodes will form
a consensus group. Corda allows the consensus algorithm used by the
notary service to be pluggable depending on the requirements in terms of
privacy, scalability, performance etc. For instance, a notary cluster might
choose to use a crash fault tolerant (CFT) consensus algorithm (e.g. Raft)
that provides high performance but also requires high trust between the
nodes of the cluster. Alternatively, it might choose to use a byzantine fault
tolerant (BFT) algorithm that provides lower performance but also requires
less trust between the nodes of the cluster.
At this point, it’s important to note that permissioning has different impli-
cations on regular Corda nodes and notaries. In the first case, it forms the
foundation for authentication of communication between nodes, while in the
second case it makes it easier to detect when a notary service deviates from a
protocol (e.g. violating finality), identify the associated real-world entity and
take the necessary actions. This means that finalised transactions are not
reversible in Corda unless someone violates the protocol37 . As mentioned
previously, in some cases even some limited amount of protocol violation can
be tolerated, i.e. when using a byzantine consensus protocol.
The size of the ledger of all Corda applications deployed in a single network
can become pretty large. The various nodes of the network communicate
on a peer-to-peer fashion only with the nodes they need to transact, but
the notary service seems to be something that needs to be used by all the
nodes and could potentially end up being a scalability and performance
bottleneck. For this purpose, Corda supports both vertical and horizontal
partitioning. Each network can contain multiple notary clusters, so that
different applications can make use of different clusters (vertical partitioning).
Even the same application can choose to distribute its states between multiple
notary clusters for better performance and scalability (horizontal partitioning).
The only requirement is for all input states of a transaction to belong to the
same notary. This is so that the operation of checking whether a state is
spent and marking it as spent can be done atomically in a simple and efficient
way without the use of distributed transaction protocols. Corda provides
a special transaction type, called notary-change transaction, which allows one
to change the notary associated with a state by essentially spending the state
and creating a new one associated with the new notary. However, in some use
cases datasets can be partitioned in a way that requires a minimal number
of such transactions, because the majority of transactions are expected to
access states from the same partition. An example of this is partitioning of
states according to geographic regions if we know in advance that most of the
transactions will be accessing data from the same region. This architecture
also makes it possible to use states from different applications in a very easy
way without the use of distributed transaction protocols38 .
Corda applications are called CorDapps and contain several components of
which the most important ones are the states, their contracts and the flows.
37. This is in contrast to some other distributed ledger systems, such as Bitcoin[70], where nodes are anonymous and can thus collude in order to revert historic transactions.
38. This is known as atomic swap and a real use case in the financial world is known as delivery-versus-payment (DvP).
The flows define the workflows between nodes used to perform an update to
the ledger or simply exchange some messages. Corda provides a framework
that allows the application to define the interaction between nodes as a
set of blocking calls that send and receive messages and the framework is
responsible for transforming this to an asynchronous, event-driven execution.
Corda also provides a custom serialization framework that determines how
application messages are serialised when sent across the wire and how they
are deserialised when received. Messaging between nodes is performed with
the use of message queues, using the Apache ActiveMQ Artemis message
broker. Specifically, each node maintains an inbound queue for messages
received by other nodes and outbound queues for messages sent to other
nodes along with a bridge process responsible for forwarding messages from
the node’s outbound queues to the corresponding inbound queues of the
other nodes. Even though all of these moving parts can crash and restart
in the middle of some operation, the platform provides the guarantee that
every node will process each message exactly-once. This is achieved by
resending messages until they are acknowledged and by having nodes keep
track of messages they have already processed and discard duplicates. Nodes also
need to acknowledge a message, store its identifier and perform any related
side-effects in an atomic way, which is achieved by doing all of this in a single
database transaction. All the states from the ledger that are relevant to a
node are stored in its database; this part of the database is called the vault.
A node provides some more APIs that can be used for various purposes,
such as starting flows or querying the node’s vault. These can be accessed
remotely via a client, which provides a remote procedure call (RPC)
interface that’s implemented on top of the existing messaging infrastructure
and using the serialization protocol described before. Figure 7.25 contains a
high-level overview of the architecture described so far.
Corda is a very interesting case study from the perspective of backwards
compatibility. In a distributed system, the various nodes of the system
might be running different versions of the software, since in many cases
software has to be deployed incrementally to them and not in a single step.
In a decentralised system, there is an additional challenge, because the
various nodes of the systems are now controlled by different organisations,
so these discrepancies might last longer. Corda provides a lot of different
mechanisms to preserve backwards compatibility in different areas, so let's
explore some of them. First of all, Corda provides API & ABI39 backwards
compatibility for all the public APIs available to CorDapps. This means that
39. See: https://en.wikipedia.org/wiki/Application_binary_interface
any CorDapp should be able to run in future versions of the platform without
any change or re-compilation. Similar to other applications, CorDapps are
expected to evolve, which might involve changing the structure of data
exchanged between nodes and the structure of data stored in the ledger (e.g.
states). The serialization framework provides some support for evolution
for the first case. For instance, nullable properties can be added to a class
and the framework will take care of the associated conversions. A node
running an older version of the CorDapp will just ignore this new property
if data is sent from a node running a newer version of the CorDapp. A node
running a newer version of the CorDapp will populate the property with null
when receiving data from a node running the older version of the CorDapp.
Removing nullable properties and adding a non-nullable property is also
possible by providing a default value. However, the serialization framework
does not allow this form of data loss to happen for data that are persisted
in the ledger, such as states and commands. Since states can evolve and
the ledger might contain states from many earlier versions of a CorDapp,
newer versions of a contract need to contain appropriate logic that is able
to process states from earlier versions of the CorDapp. The contract logic
for handling states from version vi of the CorDapp can be removed by a
subsequent release of the CorDapp only after all unspent states in the ledger produced by that version have been consumed or upgraded.
implications on the node running the CorDapp, instead of the whole network.
Another example is the fact that flows in a CorDapp can be versioned in
order to evolve while maintaining backwards compatibility. In this way, a
flow can behave differently depending on the version of the CorDapp that
is deployed on the counterparty node. This makes it possible to upgrade a
CorDapp incrementally across various nodes, instead of all of them having
to do it in lockstep.
This section will examine distributed systems that are used to process large
amounts of data that would be impossible or very inefficient to process using
only a single machine. They can be classified in two main categories:
• batch processing systems: these systems group individual data
items into groups, called batches, which are processed one at a time. In
many cases, these groups can be quite large (e.g. all items for a day),
so the main goal for these systems is usually to provide high throughput,
sometimes at the cost of increased latency.
• stream processing systems: these systems receive and process data
continuously as a stream of data items. As a result, the main goal for
these systems is providing very low latency sometimes at the cost of
decreased throughput.
MapReduce
reduce(key, values) {
    int result = 0;
    for each v in values
        result += ParseInt(v);
    emit(key, result);
}
In this case, the map function would emit a single record for each word with
the value 1, while the reduce function would just count all these entries and
return the final sum for each word.
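For reference, this is roughly what the word-count example could look like against the Java API of Hadoop, the open-source implementation of MapReduce, written here in Scala; the job wiring (input/output paths, job configuration) is omitted.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.jdk.CollectionConverters._

class WordCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  // Emits (word, 1) for every word in the input line.
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty)
      .foreach(word => ctx.write(new Text(word), one))
}

class WordCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  // Sums all the counts emitted for a given word.
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    values.asScala.foreach(v => sum += v.get())
    ctx.write(key, new IntWritable(sum))
  }
}

The same reducer class could also be registered as a combine function through the job's setCombinerClass method, a concept discussed later in this section.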
This framework is also based on a master-worker architecture, where the
master node is responsible for scheduling tasks at worker nodes and managing
their execution, as shown in Figure 7.27. Apart from the definition of the
map/reduce functions, the user can also specify the number M of map tasks,
the number R of reduce tasks, the input/output files and a partitioning
function that defines how key/value pairs from the map tasks are partitioned
before being processed by the reduce tasks. By default, a hash partitioner
is used that selects a reduce task using the formula hash(key) mod R. The
execution of a MapReduce job proceeds in the following way:
• The framework divides the input files into M pieces, called input splits,
which are typically between 16 and 64 MB per split.
• It then starts an instance of a master node and multiple instances of
worker nodes on an existing cluster of machines.
• The master selects idle worker nodes and assigns map tasks to them.
• A worker node that is assigned a map task reads the contents of the
associated input split, it parses key/value pairs and passes them to the
user-defined map function. The entries emitted by the map function
are buffered in memory and periodically written to the local disk,
partitioned into R regions using the partitioning function. When a
worker node completes a map task, it sends the location of the local
file to the master node.
• The master node assigns reduce tasks to worker nodes providing the
location to the associated files. These worker nodes then perform
remote procedure calls (RPCs) to read the data from the local disks
of the map workers. The data is first sorted40 , so that all occurrences
of the same key are grouped together and then passed into the reduce
function.
• When all map and reduce tasks are completed, the master node returns
the control to the user program. After successful completion, the
output of the MapReduce job is available in the R output files that can
either be merged or passed as input to a separate MapReduce job.
40. If the size is prohibitively large to fit in memory, external sorting is used.
The master node communicates with every worker periodically in the back-
ground as a form of heartbeat. If no response is received for a specific
amount of time, the master node considers the worker node as failed and
re-schedules all its tasks for re-execution. More specifically, reduce tasks
that had been completed do not need to be rerun, since the output files are
stored in an external file system. However, map tasks are rerun regardless of
whether they had completed, since their output is stored on the local disk
and is therefore inaccessible to the reduce tasks that need it.
This means that network partitions between the master node and worker
nodes might lead to multiple executions of a single map or reduce task.
Duplicate executions of map tasks are deduplicated at the master node,
which ignores completion messages for already completed map tasks. Reduce
tasks write their output to a temporary file, which is atomically renamed
when the reduce task completes. This atomic rename operation provided by
the underlying file system guarantees that the output files will contain just
the data produced by a single execution of each reduce task. However, if the
map/reduce functions defined by the application code have additional side-
effects (e.g. writing to external datastores) the framework does not provide
any guarantees and the application writer needs to make sure these side-
effects are atomic and idempotent, since the framework might trigger
them more than once as part of a task re-execution.
Input and output files are usually stored in a distributed filesystem, such as
HDFS or GFS. MapReduce can take advantage of this to perform several
optimisations, such as scheduling map tasks on worker nodes that contain a
replica of the corresponding input to minimize network traffic or aligning
the size of input splits to the block size of the file system.
The framework provides the guarantee that within a given partition, the
intermediate key/value pairs are processed in increasing key order. This
ordering guarantee makes it easy to produce a sorted output file per partition,
which is useful for use cases that need to support efficient random access
lookups by key or need sorted data in general. Furthermore, some use-cases
would benefit from some form of pre-aggregation at the map level to reduce
the amount of data transferred between map and reduce tasks. This was
evident in the example presented above, where a map task would emit one
entry for each occurrence of a word, instead of a single entry per word
with the total number of occurrences. For this reason, the framework allows the
application code to also provide a combine function. This method has the
same type as the reduce function and is run as part of the map task in order
to pre-aggregate the data locally.
Apache Spark
Apache Spark [72][73] is a data processing system that was initially developed
at the University of California and then donated to the Apache Software
Foundation. It was developed in response to some of the limitations of
MapReduce. Specifically, the MapReduce model allowed developing and
running embarrassingly parallel computations on a big cluster of machines,
but every job had to read the input from disk and write the output to disk.
As a result, there was a lower bound in the latency of a job execution, which
was determined by disk speeds. This means MapReduce was not a good fit
for iterative computations, where a single job was executed multiple times
or data were passed through multiple jobs, and for interactive data analysis,
where a user wants to run multiple ad-hoc queries on the same dataset. Spark
tried to address these two use-cases.
Spark is based on the concept of Resilient Distributed Datasets (RDD), which
is a distributed memory abstraction used to perform in-memory computations
on large clusters of machines in a fault-tolerant way. More concretely, an
RDD is a read-only, partitioned collection of records. RDDs can be
created through operations on data in stable storage or other RDDs. The
operations performed on an RDD can be one of the following two types:
• transformations, which are lazy operations that define a new RDD.
Some examples of transformations are map, filter, join and union.
• actions, which trigger a computation to return a value to the program
or write data to external storage. Some examples of actions are count,
collect, reduce and save.
A typical Spark application will create an RDD by reading some data from
a distributed filesystem, it will then process the data by calculating new
RDDs through transformations and will finally store the results in an output
file. For example, an application that reads some log files from HDFS and
count the number of lines that contain the word "sale completed" would look
like the following:
lines = spark.textFile("hdfs://...")
completed_sales = lines.filter(_.contains("sale completed"))
number_of_sales = completed_sales.count()
This program can either be submitted to be run as an individual application
in the background or each one of the commands can be executed interactively
in the Spark interpreter. A Spark program is executed from a coordinator
process, called the driver. The Spark cluster contains a cluster manager node
and a set of worker nodes. The responsibilities between these components
are split in the following way:
• The cluster manager is responsible for managing the resources of the
cluster (i.e. the worker nodes) and allocating resources to clients that
need to run applications.
• The worker nodes are the nodes of the cluster waiting to receive
applications/jobs to execute.
• Spark also contains a master process that requests resources in the
cluster and makes them available to the driver41 .
41. Note that Spark supports both a standalone clustering mode and clustering modes using third-party cluster management systems, such as YARN, Mesos and Kubernetes. In the standalone mode, the master process also performs the functions of the cluster manager, while in some of the other clustering modes, such as Mesos and YARN, they are separate processes.
• The driver is responsible for requesting the required resources from the
master and starting a Spark agent process on each node that runs for
the entire lifecycle of the application, called executor. The driver then
analyses the user’s application code into a directed acyclic graph (DAG)
of stages, partitions the associated RDDs and assigns the corresponding
tasks to the executors available to compute them. The driver is also
responsible for managing the overall execution of the application, e.g.
receiving heartbeats from executors and restarting failed tasks.
Notably, in the previous example the second line is executed without any
data being read or processed yet, since filter is a transformation. The
data is being read from HDFS, filtered and then counted, when the third
line is processed, which contains the count operation which is an action.
To achieve that, the driver maintains the relationship between the various
RDDs through a lineage graph, triggering calculation of an RDD and all its
ancestors only when an action is performed. RDDs provide the following
basic operations42 :
• partitions(), which returns a list of partition objects. For example, an RDD representing an HDFS file has one partition for each block of the file.
42. Note that these operations are mainly used by the framework to orchestrate the execution of Spark applications. The applications are not supposed to make use of these operations; they should be using the transformations and actions that were presented previously.
Figure 7.29: Examples of narrow and wide dependencies. Each big rectangle
is an RDD with the smaller grey rectangles representing partitions of the
RDD
Tasks are scheduled based on data locality: if a task needs to process a partition that is available in the memory of a node or whose RDD provides preferred locations (e.g. an HDFS file), it's submitted to these nodes. For wide
dependencies that require data shuffling, nodes holding parent partitions
materialize intermediate records locally that are later pulled by nodes from
the next stage, similar to MapReduce.
The lineage graph is also the basic building block for efficient fault-tolerance. When
an executor fails for some reason, any tasks running on it are re-scheduled on
another executor. Along with this, tasks are scheduled for any parent RDDs
required to calculate the RDD of this task. As a consequence of this, wide
dependencies can be much more inefficient than narrow dependencies when
recovering from failures, as shown in Figure 7.31. Long lineage graphs can
also make recovery very slow, since many RDDs will need to be recomputed
in a potential failure near the end of the graph. For this reason, Spark
provides a checkpointing capability, which can be used to store RDDs from
specified tasks to stable storage (e.g. a distributed filesystem). In this way,
RDDs that have been checkpointed can be read from stable storage during
recovery, thus only having to rerun smaller portions of the graph. Users can
call a persist() method to indicate which RDDs need to be stored on disk.
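A brief sketch of how an application might use these facilities is shown below; the paths and the parsing logic are placeholders.

import org.apache.spark.storage.StorageLevel

sparkContext.setCheckpointDir("hdfs://.../checkpoints") // where checkpointed RDDs are written

val events = sparkContext
  .textFile("hdfs://.../logs")
  .map(line => line.split(",")) // placeholder parsing logic

events.persist(StorageLevel.DISK_ONLY) // keep the computed RDD on disk across actions
events.checkpoint()                    // truncate the lineage graph at this RDD
events.count()                         // the first action triggers computation and checkpointing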
In the two word-count programs shown below, the first one will send one record
over the network for every single occurrence of every word, while the second one
will send one record per word, containing the total number of occurrences.
// word count without pre-aggregation
sparkContext.textFile("hdfs://...")
.flatMap(line => line.split(" "))
.map(word => (word,1))
.groupByKey()
.map { case (word, counts) => (word, counts.sum) }
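And the same computation with pre-aggregation, where reduceByKey combines the counts locally on the map side before shuffling them across the network:

// word count with pre-aggregation
sparkContext.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)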
No matter which execution completed successfully, the output data will be the
same as long as the transformations that were used were deterministic
and idempotent. This is true for any transformations that act solely on
the data provided by the previous RDDs using deterministic operations.
However, this is not the case if these transformations make use of data
from other systems that might change between executions or if they use
non-deterministic actions (e.g. mathematical calculations based on random
number generation). Furthermore, if transformations perform side-effects on
external systems that are not idempotent, no guarantee is provided since
Spark might execute each side-effect more than once.
Apache Flink
Tasks execute user-specified operators that can produce other streams and report
their status to the Job Manager along with heartbeats that are used for
failure detection. When processing unbounded streams, these tasks are
supposed to be long-lived. If they fail, the Job Manager is responsible for
restarting them. To avoid making the Job Manager a single point of failure,
multiple instances can be running in parallel. One of them will be elected
leader via Zookeeper and will be responsible for coordinating the execution
of applications, while the rest will be waiting to take over in case of a leader
failure. As a result, the leader Job Manager stores some critical metadata
about every application in Zookeeper, so that it’s accessible to newly elected
leaders.
Flink supports two main notions of time: processing time, which refers to the
system clock of the machine processing a record, and event time, which refers to
the time the event occurred on its producing device. Each one of them can be used with
some trade-offs:
• When a streaming program runs on processing time, all time-based
operations (e.g. time windows) will use the system clock of the machines
that run the respective operation. This is the simplest notion of time
and requires no coordination between the nodes of the system. It also
provides good performance and reliably low latency on the produced
results. However, all this comes at the cost of consistency and
determinism. The system clocks of different machines will differ and
the various nodes of the system will process data at different rates. As
a consequence, different nodes might assign the same event to different
windows depending on timing.
• When a streaming program runs on event time, all time-based opera-
tions will use the event time embedded within the stream records to
track progress of time, instead of system clocks. This brings consistency
and determinism to the execution of the program, since nodes will
now have a common mechanism to track progress of time and assign
events to windows. However, it requires some coordination between the
various nodes, as we will see below. It also introduces some additional
latency, since nodes might have to wait for out-of-order or late events.
The main mechanism to track progress in event time in Flink is watermarks.
Watermarks are control records that flow as part of a data stream and carry a
timestamp t. A Watermark(t) record indicates that event time has reached
time t in that stream, which means there should be no more elements with
a timestamp t' ≤ t. Once a watermark reaches an operator, the operator
can advance its internal event time clock to the value of the watermark.
Watermarks can be generated either directly in the data stream source or by
a watermark generator in the beginning of a Flink pipeline. The operators
later in the pipeline are supposed to use the watermarks for their processing
(e.g. to trigger calculation of windows) and then emit them downstream
to the next operators. There are many different strategies for generating
watermarks. An example is the BoundedOutOfOrdernessGenerator, which
assumes that the latest elements for a certain timestamp t will arrive at
most n milliseconds after the earliest elements for timestamp t. Of course,
there could be elements that do not satisfy this condition and arrive after
the associated watermark has been emitted and the corresponding windows
have been calculated. These are called late elements and Flink provides
different ways to deal with them, such as discarding them or re-triggering
the calculation of the associated window. Figure 7.34 contains an illustration
of how watermarks flow in a Flink pipeline and how event time progresses.
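To make this more concrete, the following is a minimal, standalone sketch of the bounded out-of-orderness logic described above. It is deliberately not tied to Flink's actual generator interfaces; the class and method names are illustrative only.

import java.time.Duration;

/*
 * Illustrative sketch of a bounded out-of-orderness watermark generator: it
 * assumes events may arrive at most maxOutOfOrderness after the highest
 * timestamp seen so far, so the watermark trails that maximum.
 */
class BoundedOutOfOrdernessWatermarks {
    private final long maxOutOfOrdernessMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedOutOfOrdernessWatermarks(Duration maxOutOfOrderness) {
        this.maxOutOfOrdernessMillis = maxOutOfOrderness.toMillis();
    }

    // Called for every incoming record with its embedded event timestamp.
    void onEvent(long eventTimestampMillis) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMillis);
    }

    // Called periodically; the emitted watermark t declares that no more
    // elements with timestamp <= t are expected on this stream.
    long currentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet, avoid underflow
        }
        return maxTimestampSeen - maxOutOfOrdernessMillis;
    }
}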
to the next checkpoint. During recovery, updates that had been performed
after the last checkpoint are automatically ignored, since reads will return
the version corresponding to the last checkpoint. If the datastore does not
support MVCC, all the state changes can be maintained temporarily in local
storage in the form of a write-ahead-log (WAL), which will be committed to
the datastore during the second phase of the checkpoint protocol. Flink can
also integrate with various other systems that can be used to retrieve input
data from (sources) or send output data to (sinks), such as Kafka, RabbitMQ
etc. Each one of them provides different capabilities. For example, Kafka
provides an offset-based interface, which makes it very easy to replay data
records in case of recovery from a failure. A source task just has to keep track
of the offset of each checkpoint and start reading from that offset during
recovery. However, message queues do not provide this interface, so Flink
has to use alternative methods to provide the same guarantees. For instance,
in the case of RabbitMQ, messages are acknowledged and removed from the
queue only after the associated checkpoint is complete, again during the
second phase of the protocol. Similarly, a sink needs to coordinate with the
checkpoint protocol in order to be able to provide exactly-once guarantees.
Kafka is a system that can support this through the use of its transactional
client. When a checkpoint is created by the sink, the flush() operation is
called as part of the checkpoint. After the notification that the checkpoint
has been completed in all operators is received from the Job Manager, the
sink calls Kafka’s commitTransaction method. Flink provides an abstract
class called TwoPhaseCommitSinkFunction that provides the basic methods
that need to be implemented by a sink that wants to provide these guarantees
(i.e. beginTransaction, preCommit, commit, abort).
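As a rough illustration of the shape of such a sink, the following sketch mirrors the two-phase structure described above. It is a simplification rather than Flink's actual TwoPhaseCommitSinkFunction signature; the interface and the TxnHandle type parameter are hypothetical.

/*
 * Simplified sketch of a two-phase-commit sink. TxnHandle stands in for
 * whatever transaction object the external system exposes (e.g. a
 * transactional Kafka producer); all names here are illustrative.
 */
interface TwoPhaseSink<T, TxnHandle> {
    // Start a new transaction in the external system for the next checkpoint.
    TxnHandle beginTransaction();

    // Write a record into the currently open transaction.
    void invoke(TxnHandle txn, T record);

    // First phase: flush all pending writes as part of the checkpoint,
    // without making them visible yet.
    void preCommit(TxnHandle txn);

    // Second phase: called after the Job Manager has notified the sink that
    // the checkpoint completed in all operators; make the writes visible.
    void commit(TxnHandle txn);

    // Called if the checkpoint fails; discard the pending writes.
    void abort(TxnHandle txn);
}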
To sum up, some of the guarantees provided by Flink are the following:
• Depending on the types of sources and sinks used, Flink provides
exactly-once processing semantics even across failures.
• The user can also optionally downgrade to at-least-once processing
semantics, which can provide increased performance.
• It’s important to note that the exactly-once guarantees apply to stream
records and local state produced using the Flink APIs. If an operator
performs additional side-effects on systems external to Flink, then no
guarantees are provided for them.
• Flink does not provide ordering guarantees after any form of repartition-
ing or broadcasting and the responsibility of dealing with out-of-order
records is left to the operator implementation.
Chapter 8
Practices & Patterns
This chapter covers common practices and patterns used when building
and operating distributed systems. These are not supposed to be exact
prescriptions, but they can help you identify some of the basic approaches
available to you and the associated trade-offs. It goes without saying that
there are so many practices and patterns that we would never be able to
cover them all. As a result, the goal is to cover some of the most fundamental
and valuable ones.
Communication patterns
The nodes of a distributed system need to agree on how to serialise the data they send to each other. There are various options available for this purpose, so
let’s have a look at some of them:
• Some languages provide native support for serialisation, such as Java (via its built-in serialisation mechanism) and Python (via its pickle module). The main benefit of this option is
convenience, since there is very little extra code needed to serialise and
deserialise an object. However, this comes at the cost of maintainability,
security and interoperability. Given the transparent nature of how
these serialisation methods work, it becomes hard to keep the format
stable, since it can be affected even by small changes to an object
that do not affect the data contained in it (e.g. implementing a new
interface). Furthermore, some of these mechanisms are not very secure,
since they indirectly allow a remote sender of data to initialise any
objects they want, thus introducing the risk of remote code execution.
Last but not least, most of these methods are available only in specific
languages, which means systems developed in different programming
languages will not be able to communicate. Note that there are some
third-party libraries that operate in a similar way using reflection or
bytecode manipulation, such as Kryo2 . These libraries tend to be
subject to the same trade-offs.
• Another option is a set of libraries that serialise an in-memory object
2 See: https://github.com/EsotericSoftware/kryo
head. Messages are typically associated with an index, which can be used by
consumers to declare where they want to consume from. Another difference
with message queues is that the log is typically immutable, so messages are
not removed after they are processed. Instead, a garbage collection is run
periodically that removes old messages from the head of the log. This means
that consumers are responsible for keeping track of an offset indicating the
part of the log they have already consumed in order to avoid processing the
same message twice, thus achieving exactly-once processing semantics.
This is done in a similar way as described previously with this offset playing
the role of the unique identifier for each message. Some examples of event
logs are Apache Kafka13 , Amazon Kinesis14 and Azure Event Hubs15 .
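The following is a minimal sketch of this offset-tracking idea; the EventLog, Record and OffsetStore interfaces are hypothetical stand-ins for an event log client and a durable offset store.

import java.util.List;

/*
 * Illustrative sketch of a log consumer that tracks its own offset so that,
 * after a restart, it resumes from where it left off instead of reprocessing
 * earlier messages.
 */
class OffsetTrackingConsumer {
    interface Record { long offset(); byte[] payload(); }
    interface EventLog { List<Record> readFrom(long offset, int maxRecords); }
    interface OffsetStore { long lastProcessedOffset(); void save(long offset); }

    private final EventLog log;
    private final OffsetStore offsets;

    OffsetTrackingConsumer(EventLog log, OffsetStore offsets) {
        this.log = log;
        this.offsets = offsets;
    }

    void poll() {
        // Convention assumed here: lastProcessedOffset() returns -1 initially.
        long next = offsets.lastProcessedOffset() + 1;
        for (Record record : log.readFrom(next, 100)) {
            process(record);
            // Ideally the side-effects of process() and the offset update are
            // persisted atomically (e.g. in the same transaction); otherwise a
            // crash between the two can lead to duplicate processing.
            offsets.save(record.offset());
        }
    }

    private void process(Record record) { /* application logic */ }
}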
Message queues and event logs can enable two slightly different forms of
communication, known as point-to-point and publish-subscribe. The
point-to-point model is used when we need to connect only 2 applications16 .
The publish-subscribe model is used when more than one application might
be interested in the messages sent by a single application. For example,
the fact that a customer made an order might be needed by an application
sending recommendations for similar products, an application that sends the
associated invoices and an application that calculates loyalty points for the
customer. Using a publish-subscribe model, the application handling the
order is capable of sending a message about this order to all the applications
that need to know about it, sometimes without even knowing which these
applications are.
These two models of communication are implemented slightly differently de-
pending on whether an event log or a message queue is used. This difference
is mostly due to the fact that consumption of messages behaves differently
in each system:
• Point-to-point communication is pretty straightforward when using
a message queue. Both applications are connected to a single queue,
where messages are produced and consumed. In the publish-subscribe
model, one queue can be created and managed by the producer appli-
cation and every consumer application can create its own queue. There
also needs to be a background process that receives messages from the
13 See: https://kafka.apache.org
14 See: https://aws.amazon.com/kinesis
15 See: https://azure.microsoft.com/services/event-hubs
16 Note that point-to-point refers to the number of applications, not the actual servers. Every application on each side might consist of multiple servers that produce and consume messages concurrently.
Coordination patterns
Data synchronisation
There are some cases, where the same data needs to be stored in multiple
places and in potentially different forms18 . Below are some typical examples:
• Data that reside in a persistent datastore also need to be cached in a
separate in-memory store, so that read operations can be processed
18 These are also referred to as materialized views. See: https://en.wikipedia.org/wiki/Materialized_view
from the cache with lower latency. Write operations need to update
both the persistent datastore and the in-memory store.
• Data that are stored in a distributed key-value store need to also be
stored in a separate datastore that provides efficient full-text search,
such as ElasticSearch19 or Solr20 . Depending on the form of read oper-
ations, the appropriate datastore can be used for optimal performance.
• Data that are stored in a relational database need to also be stored in
a graph database, so that graph queries can be performed in a more
efficient way.
Given that data reside in multiple places, there needs to be a mechanism that
keeps them in sync. This section will examine some of the approaches
available for this purpose and the associated trade-offs.
One approach is to perform writes to all the associated datastores from
a single application that receives update operations. This approach is
sometimes referred to as dual writes. Typically, writes to the associated
data stores are performed synchronously, so that data have been updated in
all the locations before responding to the client with a confirmation that the
update operation was successful. One drawback of this approach is the way
partial failures are handled and their impact on atomicity. If the application
manages to update the first datastore successfully, but the request to update
the second datastore fails, then atomicity is violated and the overall system
is left in an inconsistent state. It’s also unclear what the response to the
client should be in this case, since data has been updated, but only in some
places. However, even if we assume that there are no partial failures, there is
another pitfall that has to do with how race conditions are handled between
concurrent writers and their impact on isolation. Let’s assume two writers
submit an update operation for the same item. The application receives them
and attempts to update both datastores, but the associated requests are re-
ordered, as shown in Figure 8.7. This means that the first datastore contains
data from the first request, while the second datastore contains data from the
second request. This also leaves the overall system in an inconsistent state.
An obvious solution to mitigate these issues is to introduce a distributed
transaction protocol that provides the necessary atomicity & isolation, such
as a combination of two-phase commit and two-phase locking. In order to do this, the underlying datastores need to support such a protocol. Even in that case though, this protocol will have some performance
19 See: https://github.com/elastic/elasticsearch
20 See: https://github.com/apache/lucene-solr
Shared-nothing architectures
At this point, it must have become evident that sharing leads to coordination,
which is one of the main factors that inhibit high availability, performance
and scalability. For example, we have already explained how distributed
databases can scale to larger datasets and in a more cost-efficient manner than
centralised, single-node databases. At the same time, some form of sharing
is sometimes necessary and even beneficial for the same characteristics. For
instance, a system can increase its overall availability by reducing sharing
through partitioning, since the various partitions can have independent failure modes. However, when looking at a single data item, availability can be increased by increasing sharing via replication.
21 This is also known as a compare-and-set (CAS) operation. See: https://en.wikipedia.org/wiki/Compare-and-swap
22 An example of a tool that does this is Debezium. See: https://debezium.io
A key takeaway from all this is that reducing sharing can be very beneficial,
when applied properly. There are some system architectures that follow this
principle to the extreme in order to reduce coordination and contention, so
that every request can be processed independently by a single node or a single
group of nodes in the system. These are usually called shared-nothing
architectures. This section will explain how this principle can be used in
practice to build such architectures and what some of the trade-offs are.
A basic and widely used technique to reduce sharing is separating the stateful and stateless parts of a system. The main benefit from this is that stateless
parts of a system tend to be fully symmetric and homogeneous, which means
that every instance of a stateless application is indistinguishable from the
rest. Separating them makes scaling a lot easier. Since all the instances of an application are identical and capable of processing any incoming request23, load can be balanced across them easily. The system can be scaled out in order to handle more load
by simply introducing more instances of the applications behind the load
balancer. The instances could also send heartbeats to the load balancer, so
that the load balancer is capable of identifying the ones that have potentially
failed and stop sending requests to them. The same could also be achieved
by the instances exposing an API, where requests can be sent periodically by
the load balancer to identify healthy instances. Of course, in order to achieve
high availability and be able to scale incrementally, the load balancer also
needs to be composed of multiple, redundant nodes. There are different ways
to achieve this in practice, but a typical implementation uses a single domain
name (DNS) for the application that resolves to multiple IPs belonging to
the various servers of the load balancer. The clients, such as web browsers
or other applications, can rotate between these IPs. The DNS entry needs
to specify a relatively small time-to-live (TTL), so that clients can identify
new servers in the load balancer fleet relatively quickly.
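The following sketch illustrates the core of this idea, assuming a hypothetical health-check probe; real load balancers are considerably more sophisticated, but the principle of routing requests only to interchangeable, healthy instances is the same.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

/*
 * Minimal sketch of load balancing for a stateless application: every
 * instance is interchangeable, so requests are routed round-robin across the
 * instances currently considered healthy.
 */
class RoundRobinBalancer {
    interface HealthCheck { boolean isHealthy(String instance); }

    private final List<String> instances;
    private final HealthCheck healthCheck;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinBalancer(List<String> instances, HealthCheck healthCheck) {
        this.instances = new CopyOnWriteArrayList<>(instances);
        this.healthCheck = healthCheck;
    }

    // Pick the next healthy instance for an incoming request.
    String pickInstance() {
        for (int attempts = 0; attempts < instances.size(); attempts++) {
            String candidate =
                instances.get(Math.floorMod(counter.getAndIncrement(), instances.size()));
            if (healthCheck.isHealthy(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no healthy instances available");
    }
}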
As seen already throughout the book, stateful systems are a bit harder
to manage, since the various nodes of the system are not identical. Each
node contains different pieces of data, so the appropriate routing must be
23 Note that in practice the load balancer might need to have some state that reflects the load of the various instances. But, this detail is omitted here on purpose for the sake of simplicity.
might require to access all the nodes of the system. This reduced flexibility
might also manifest in a lack of strong transactional semantics. Applications
that need to perform reads of multiple items with strong isolation or write
multiple items in a single atomic transaction might not be able to do this
under this form of architecture or it might only be possible at the cost of
excessive additional overhead.
Distributed locking
A lease is essentially a lock with an expiry timeout, after which it is automatically released by the system that is responsible for managing the locks. By using leases
instead of locks, the system can automatically recover from failures of nodes
that have acquired locks by releasing these locks and giving the opportunity
to other nodes to acquire them in order to make progress. However, this
introduces new safety risks. There are now two different nodes in the system
that can have different views about the state of the system, specifically
which node holds a lock. This is not only due to the fact that these nodes
have different clocks so the time of expiry can differ between them, but also
because a failure detector cannot be perfect, as explained earlier in the book.
The fact that part of the system considers a node to be failed does not mean
this node is necessarily failed. It could be a network partition that prevents
some messages from being exchanged between some nodes or that node might
just be busy with processing something unrelated. As a result, that node
might think it still holds the lock even though the lock has expired and it
has been automatically released by the system.
Figure 8.10 shows an example of this problem, where nodes A and B are
trying to acquire a lease in order to perform some operations in a separate
system. Node A manages to successfully acquire a lease first. However,
there is a significant delay between acquiring the lease and performing the
associated operation. This could be due to various reasons, such as a long
garbage collection pause, scheduling delays or simply network delays. In the
meanwhile, the lease has expired, it has been released by the system and
acquired by node B, which has also managed to perform the operation that’s
protected by the lock. After a while, the operation from node A also reaches
the system and it's executed even though the lease is not held anymore by that node, violating the basic invariant that was supposed to be protected by the lease mechanism. Note that simply performing another check that the lease is still held before initiating the operation on node A would not solve
the problem, since the same delays can occur between this check and the
initiation of the operation or even the delivery of the operation to the system.
There is one simple technique that solves this problem and it’s called fencing.
The main idea behind fencing is that the system can block some nodes from
performing some operations they are trying to perform, when these nodes
are malfunctioning. In our case, nodes are malfunctioning in the sense that
they think they hold a lease, while they don’t. The locking subsystem can
associate every lease with a monotonically increasing number. This number
can then be used by all the other systems in order to keep track of the node
that has performed an operation with the most recent lease. If a node with
an older lease attempts to perform an operation, the system is able to detect
that and reject the operation, while also notifying the node that it no longer holds the lock. Figure 8.11 shows how that would work in practice. This
essentially means that in a distributed system, lock management cannot be
performed by a single part of the system, but it has to be done collectively
by all the parts of the system that are protected by this lock. For this to
be possible, the various components of the system need to provide some
basic capabilities. The locking subsystem needs to provide a monotonically
increasing identifier for every lock acquisition. Some examples of systems that provide this are Zookeeper, via the zxid or the znode version number, and Hazelcast, as part of the fenced token provided via the FencedLock API. Any external system protected by the locks needs to provide conditional updates with linearizability guarantees.
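The check performed by the protected system can be illustrated with the following sketch, where every operation carries the fencing token of the lease under which it was issued. The names are illustrative, and the synchronized block stands in for the conditional, linearizable update mentioned above.

/*
 * Illustrative sketch of fencing on the side of the protected system: the
 * system remembers the highest fencing token it has seen and rejects
 * operations carrying an older one.
 */
class FencedResource {
    private long highestTokenSeen = -1;

    synchronized void execute(long fencingToken, Runnable operation) {
        if (fencingToken < highestTokenSeen) {
            // The lease that issued this operation has already been superseded.
            throw new IllegalStateException(
                "stale fencing token " + fencingToken + " < " + highestTokenSeen);
        }
        highestTokenSeen = fencingToken;
        operation.run();
    }
}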
Compatibility patterns
It is often necessary to allow the various nodes of a distributed system to operate independently for various reasons. A very typical requirement for some applications in real life is to be
able to deploy new versions of the software with zero downtime. The simplest
way to achieve that is to perform rolling deployments, instead of deploying
in lockstep the software to all the servers at the same time. In some cases,
this is not just a nice-to-have, but an inherent characteristic of the system.
An example of this is mobile applications (e.g. Android applications), where
user consent is required to perform an upgrade, which implies that users are
deploying the new version of the software at their own pace. As a result, the
various nodes of a distributed system can be running different versions of
the software at any time. This section will examine the implications of this
and some techniques that can help manage this complication.
The phenomenon described previously manifests in many different ways.
One of the most common ones is when two different applications need to
communicate with each other, while each one of them evolves independently
by deploying new versions of its software. For example, one of the applications
might want to expose more data at some point. If this is not done in a
careful way, the other application might not be able to understand the new
data making the whole interaction between the applications fail. Two very
useful properties related to this are backward compatibility and forward
compatibility:
• Backward compatibility is a property of a system that provides inter-
operability with an earlier version of itself or other systems.
• Forward compatibility is a property of a system that provides interoper-
ability with a later version of itself or other systems.
These two properties essentially reflect a single characteristic viewed from
different perspectives, those of the sender and the recipient of data. Let’s
consider for a moment a contrived example of two systems S and R, where the
former sends some data to the latter. We can say that a change on system S
is backward compatible, if older versions of R will be able to communicate
successfully with a new version of S. We can also say that the system R is
designed in a forward compatible way, so that it will be able to understand
new versions of S. Let’s look at some examples. Let’s assume system S needs
to change the data type of a specific attribute. Changing the data type of
an existing attribute is not a backward compatible change in general, since
system R would be expecting a different type for this attribute and it would
fail to understand the data. However, this change could be decomposed into the following sub-changes that preserve compatibility between the two systems. The system R can deploy a new version of the software that is
capable of reading that data either from the new attribute with the new
data type or the old attribute with the old data type. The system S can
then deploy a new version of the software that stops populating the old
attribute and starts populating the new attribute24. The previous example demonstrates that seemingly trivial changes to software are a lot
more complicated when they need to be performed in a distributed system in
a safe way. As a consequence, maintaining backward compatibility imposes
a tradeoff between agility and safety.
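The following sketch illustrates the reader side of such a staged change; the Message type and field names are hypothetical and chosen only to show the idea of reading either the new or the old attribute.

/*
 * Illustrative sketch of the compatible, staged change described above: the
 * reader (system R) is first upgraded to understand both the old and the new
 * attribute, so that the writer (system S) can later switch to the new one.
 */
class OrderReader {
    static class Message {
        String price;      // old attribute: price encoded as a string, e.g. "19.99"
        Long priceCents;   // new attribute: price as an integer number of cents
    }

    long readPriceCents(Message message) {
        if (message.priceCents != null) {
            // New attribute present: the sender has already switched over.
            return message.priceCents;
        }
        // Fall back to the old attribute for senders that still populate it.
        return Math.round(Double.parseDouble(message.price) * 100);
    }
}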
It’s usually beneficial to version the API of an application, since that makes
it easier to compare versions of different nodes and applications of the
system and determine which versions are compatible with each other or
not. Semantic versioning is a very useful convention, where each version
number consists of three numbers x.y.z. The last one (z) is called the patch version
and it’s incremented when a backward compatible bug fix is made to the
software. The second one (y) is called the minor version and it’s incremented
when new functionality is added in a backward compatible way. The first
one (x) is called the major version and it’s incremented when a backward
incompatible change is made. As a result, the clients of the software can quickly understand the compatibility characteristics of new software and
the associated implications. When providing software as a binary artifact,
the version is usually embedded in the artifact. The consumers of the
artifact then need to take the necessary actions, if it includes a backward
incompatible change, e.g. adjusting their application’s code. However, when
applied to live applications, semantic versioning needs to be implemented
slightly differently. The major version needs to be embedded in the address
of the application's endpoint, while the minor and patch versions can be
included in the application’s responses25 . This is needed so that clients can
be automatically upgraded to newer versions of the software if desired, but
only if these are backward compatible.
Another technique for maintaining backward compatibility through the use
of explicitly versioned software is protocol negotiation. Let’s assume a
scenario as mentioned previously, where the client of an application is a
mobile application. Every version of the application needs to be backward
24 Note that whether a change is backward compatible or not can differ depending on the serialization protocol that is used. The following article explains this nicely: https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
25 For an example of this, see: https://developers.facebook.com/docs/apps/versions
compatible with all the versions of the client application running on user
phones currently. This means that the staged approach described previously
cannot be used when making backward incompatible changes, since end users
cannot be forced to upgrade to a newer version. Instead, the application can
identify the version of the client and adjust its behaviour accordingly. For
example, consider the case of a feature introduced in version 4.1.2 that is
backward incompatible with versions < 4.x.x. If the application receives a
request from a 3.0.1 client, it can disable that feature in order to maintain
compatibility. If it receives a request from a 4.0.3 client, it can enable the
feature.
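A minimal sketch of this kind of protocol negotiation could look as follows; the version parsing and feature logic are simplified for illustration and mirror the example above.

/*
 * Illustrative sketch of protocol negotiation: the server inspects the
 * client's reported version and adjusts its behaviour so that older clients
 * keep working.
 */
class FeatureGate {
    // Returns true if the new feature (introduced in the 4.x line) can be enabled.
    boolean newFeatureEnabled(String clientVersion) {
        int majorVersion = Integer.parseInt(clientVersion.split("\\.")[0]);
        return majorVersion >= 4;
    }
}

// Example: newFeatureEnabled("3.0.1") returns false, newFeatureEnabled("4.0.3") returns true.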
In some cases, an application might not be aware of the applications that will
be consuming its data. An example of this is the publish-subscribe model,
where the publisher does not necessarily need to know all the subscribers.
It’s still very important to ensure consumers will be able to deserialise and
process any produced data successfully as its format changes. One pattern
used here is defining a schema for the data, which is used by both the
producers and consumers. This schema can be embedded in the message
itself. Otherwise, to avoid duplication of the schema data, a reference to the
schema can be put inside the message and the schema can be stored in a
separate store. For example, this is a pattern commonly used in Kafka via the
Schema Registry26 . However, it’s important to remember that even in this
case, producers and consumers are evolving independently, so consumers are
not necessarily using the latest version of the schema used by the producer.
So, producers need to preserve compatibility either by ensuring consumers
can read data of the new schema using an older schema or by ensuring all
consumers have started using the new schema before starting to produce
messages with it. Note that similar considerations need to be made for the
compatibility of the new schema with old data. For example, if consumers are
not able to read old data with the new schema, the producers might have to
make sure all the messages with the previous schema have been consumed by
everyone first. Interestingly, the Schema Registry defines different categories
of compatibility along these dimensions, which determine what changes are
allowed in each category and what the upgrade process is, e.g. whether producers or consumers need to upgrade first. It can also check two different versions of
a schema and confirm that they are compatible under one of these categories
to prevent errors later on27 .
26 See: https://docs.confluent.io/current/schema-registry
27 See: https://docs.confluent.io/current/schema-registry/avro.html#schema-evolution-and-compatibility
Note that it’s not only changes in data that can break backward compatibility.
Slight changes in behaviour or semantics of an API can also have serious
consequences in a distributed system. For example, let’s consider a failure
detector that makes use of heartbeats to identify failed nodes. Every node
sends a heartbeat every 1 second and the failure detector considers a node
failed if it hasn’t received a single heartbeat in the last 3 seconds. This causes
a lot of network traffic that affects the performance of the application, so we
decide to increase the interval of a heartbeat from 1 to 5 seconds and the
threshold of the failure detector from 3 to 15 seconds. Note that if we start
performing a rolling deployment of this change, all the servers with the old
version of the software will start thinking all the servers with the new version
have failed. This is due to the fact that their failure detectors will still have the
old deadline of 3 seconds, while the new servers will send a heartbeat every
5 seconds. One way to make this change backward compatible would be to
perform an initial change that increases the failure detector threshold from 3
to 15 seconds and then follow this with a subsequent change that increases the
heartbeat interval to 5 seconds, only after the first change has been deployed
to all the nodes. This technique of splitting a change in two parts to make it
backward compatible is commonly used and it’s also known as two-phase
deployment.
the write operation or even a long time after that was completed. Below are
some techniques that are commonly used to handle these kinds of failures:
• One way to detect these failures when sending a message to another
node is to introduce some redundancy in the message using a check-
sum derived from the actual payload. If the message is corrupted,
the checksum will not be valid. As a result, the recipient can ask the sender to send the message again (see the sketch after this list).
• When writing data to disk, this technique might not be useful, since
the corruption will be detected a long time after a write operation has
been performed by the client, which means it might not be feasible
to rewrite the data. Instead, the application can make sure that data
is written to multiple disks, so that corrupted data can be discarded
later on and the right data can be read from another disk with a valid
checksum.
• Another technique used in cases where retransmitting or storing the
data multiple times is impossible or costly is error correcting codes
(ECC). These are similar to checksums and are stored alongside the
actual payload, but they have the additional property that they can
also be used to correct corruption errors by recovering the original payload
again. The downside is they are larger than checksums, thus having
a higher overhead in terms of data stored or transmitted across the
network.
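As a small illustration of the checksum technique from the first bullet of the list above, the following sketch uses Java's built-in CRC32 as an example algorithm; a real protocol would also define how the checksum is encoded alongside the payload.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/*
 * Minimal sketch of checksum-based corruption detection for a message: the
 * sender computes a CRC32 over the payload and sends it along; the recipient
 * recomputes it and asks for a retransmission on mismatch.
 */
class ChecksummedMessage {
    final byte[] payload;
    final long checksum;

    ChecksummedMessage(byte[] payload) {
        this.payload = payload;
        this.checksum = crc32(payload);
    }

    // Returns true if the payload still matches the attached checksum.
    boolean isValid() {
        return crc32(payload) == checksum;
    }

    private static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        ChecksummedMessage msg =
            new ChecksummedMessage("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(msg.isValid()); // true, unless the payload was corrupted
    }
}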
A distributed system consists of many different parts and these kind of
failures can happen on any of them. This raises the question of where and
how to apply these techniques. There is a design principle, known as the
end-to-end argument, which suggests that some functions such as the fault
tolerance techniques described above can be implemented completely and
correctly only with the knowledge and help of the application standing at the
end points of the communication system. A canonical example to illustrate
this point is the "careful file transfer" application, where a file needs to be
moved from computer A’s storage to computer B’s storage without damage.
As shown in Figure 8.12, hardware failures can happen in many places during
this process, such as the disks of computers, the software of the file system,
the hardware processors, their local memory or the communication system.
Even if the various subsystems embed error recovery functionality, this can
only cover lower levels of the system and it cannot protect from errors
happening at a higher level of the system. For example, error detection and
recovery implemented at the disk level or in the operating system won’t help,
if the application has a defect that leads to writing the wrong data in the
first place. This implies that complete correctness can only be achieved by
implementing this function at the application level. Note that this function
can be implemented redundantly at lower levels too, but this is done mostly as
a performance optimisation. It’s also important to note that this redundant
implementation at lower levels is not always beneficial, but it depends on the
use case. There is existing literature that covers this trade-off extensively, so
we’ll refer the reader to it instead of repeating the same analysis here[81][82].
this case, the application will think the packet has been successfully
processed, while it wasn’t.
• The TCP layer on the side of the application B might receive a packet
and deliver it successfully to the application, which processes it success-
fully. However, a failure happens at this point and the applications on
both sides are forced to establish a new TCP connection. Application A had not received an application acknowledgement for the last message,
so it attempts to resend it on the new connection. TCP provides
reliable transfer only in the scope of a single connection, so it will not
be able to detect this packet has been received and processed in a
previous connection. As a result, a packet will be processed by the
application more than once.
The main takeaway is that any functionality needed for exactly-once semantics
(e.g. retries, acknowledgements and deduplication) needs to be implemented
at the application level in order to be correct and safe against all kinds of
failures28 . Another problem where the end-to-end principle manifests in a
slightly different shade is the problem of mutual exclusion in a distributed
system. The fencing technique presented previously essentially extends the
function of mutual exclusion to all the involved ends of the application.
The goal of this section is not to go through all the problems, where the
end-to-end argument is applicable. Instead, the goal is to raise awareness
and make the reader appreciate its value in system design, so that it's taken into account when needed.
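As a minimal sketch of what application-level deduplication could look like, the following assumes every request carries a unique identifier assigned by the sender; the in-memory set is a simplification, since in practice the processed identifiers and the side-effects of processing must be persisted atomically.

import java.util.HashSet;
import java.util.Set;

/*
 * Illustrative sketch of application-level deduplication: the recipient
 * records the IDs of processed requests and ignores retries that carry an
 * ID it has already seen.
 */
class DeduplicatingHandler {
    private final Set<String> processedIds = new HashSet<>();

    synchronized void onRequest(String requestId, Runnable handle) {
        if (processedIds.contains(requestId)) {
            return; // duplicate delivery of a retried request; ignore it
        }
        handle.run();
        // In a real system, the side-effects of handle() and this update must
        // be made durable atomically to preserve exactly-once semantics.
        processedIds.add(requestId);
    }
}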
The main technique to recover from failures is using retries. In the case
of a stateless system, the application of retries is pretty simple, since all
the nodes of the application are identical from the perspective of the client
so it could retry a request on any node. In some cases, that can be done
in a fully transparent way to the client. For example, the application can
be fronted by a load balancer that receives all the requests under a single
domain and it’s responsible for forwarding the requests to the various nodes
of the application. In this way, the client would only have to retry the request
to the same endpoint and the load balancer would take care of balancing
the requests across all the available nodes. In the case of stateful systems,
this gets slightly more complicated, since nodes are not identical and retries
need to be directed to the right one. For example, when using a system with
28 It's also worth reminding here that the side-effects from processing a request and storing the associated deduplication ID need to be done in an atomic way to avoid partial failures violating the exactly-once guarantees.
Let’s look first at how applications can exert backpressure. It is useful for
a system to know its limits and exert backpressure when they are reached,
instead of relying on implicit backpressure. Otherwise, there can be many
failure modes that are unexpected and harder to deal with when they
happen. The main technique to exert backpressure is load shedding, where
an application is aware of the maximum load it can handle and rejects any
requests that cross this threshold in order to keep operating at the desired
levels. A more specialised form of load shedding is selective client throttling,
where an application assigns different quotas to each of its clients. This
technique can also be used to prioritise traffic from some clients that are more
important. Let's consider a service that is responsible for serving the prices of products, which is used both by systems that are responsible for displaying product pages and by systems that are responsible for receiving purchases and charging the customer. In case of a failure, that service could throttle the
former type of systems more than the latter, since purchases are considered
to be more important for a business and they also tend to constitute a smaller
percentage of the overall traffic. In the case of asynchronous systems that
make use of message queues, load shedding can be performed by imposing
an upper bound on the size of the queue. There is a common misconception
that message queues can help absorb any form of backpressure without any consequences. In reality, this absorption comes at the cost of an increased backlog of
messages, which can lead to increased processing latency or even failure of
the messaging system in extreme cases.
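A minimal sketch of load shedding could bound the number of requests processed concurrently and reject the excess immediately; the limit and the names used here are illustrative.

import java.util.concurrent.Semaphore;

/*
 * Illustrative sketch of load shedding: the application knows the maximum
 * number of requests it can process concurrently and rejects any requests
 * beyond that limit instead of queueing them indefinitely.
 */
class LoadSheddingHandler {
    private final Semaphore permits;

    LoadSheddingHandler(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    void handle(Runnable request) {
        if (!permits.tryAcquire()) {
            // Over capacity: shed the load by rejecting the request right away,
            // so the caller can back off or retry elsewhere.
            throw new IllegalStateException("overloaded, request rejected");
        }
        try {
            request.run();
        } finally {
            permits.release();
        }
    }
}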
Distributed tracing
30 See: https://opentracing.io
31 See: https://zipkin.io
Chapter 9
Closing thoughts
Hopefully, this book has helped you understand how distributed systems
can be useful, some of the main challenges one might face when building distributed systems and how to overcome them. Ideally, it has also
made you realise that building or using a distributed system is a serious
undertaking that should be done only when necessary. If your requirements
around performance, scalability or availability can be met without the use of
a distributed system, then not using a distributed system might actually be
a wise decision.
This might feel like the end of a journey, but for some of you it might just
be the beginning of it. For this reason, I think it would be useful to recap
some key learnings from this book, while also highlighting topics that were
left uncovered. In this way, those of you that are willing to dive deeper on
some areas will have some starting points to do so.
First of all, we introduced some of the basic areas where distributed systems
can help: performance, scalability and availability. Throughout the book, we
analysed basic mechanisms that can help in these areas, such as partitioning
and replication. It also became evident that these mechanisms introduce
some tension between the aforementioned characteristics and other properties,
such as consistency. This tension is formalised by basic theorems, such as the
CAP theorem and the FLP result. This tension manifests in various ways, as
shown in the book. For example, the decision on whether replication operates
synchronously or asynchronously can be a trade-off between performance
and availability or durability. Early on, we explained the difference between
liveness and safety properties and tried to provide an overview of the basic
one used in the Bitcoin protocol and known as the Nakamoto consensus[70].
In the fifth and sixth chapter, we introduced the notions of time and or-
der and their relationship. We explained the difference between total and
partial order, which is quite important in the field of distributed systems.
While consensus can be considered as the problem of establishing total order
amongst all events of a system, there are also systems that do not have a
need for such strict requirements and can also operate successfully under
a partial order. Vector clocks are one mechanism outlined in the book that
allows a system to keep track of such a partial order that preserves causality
relationships between events. However, there are more techniques that were
not presented in the book. An example is conflict-free replicated data types
(CRDTs)[98][99], which are data structures that can be replicated across multiple nodes, where the replicas can be updated independently and concurrently without coordination between them, and where it is always possible to resolve any inconsistencies that might result from this. Some examples of such
data structures are a grow-only counter, a grow-only set or a linear sequence
CRDT. The lack of need for coordination makes these data structures more
efficient due to reduced contention and more tolerant to partitions, since the
various replicas can keep operating independently. However, they require the
underlying operations to have some specific characteristics (e.g. commutativ-
ity, associativity, idempotency etc.), which can limit their expressibility and
practical application.
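As a small illustration of the idea, the following sketch of a grow-only counter (one of the examples mentioned above) lets each replica increment only its own entry, while merging takes the element-wise maximum; the node identifiers are hypothetical.

import java.util.HashMap;
import java.util.Map;

/*
 * Sketch of a grow-only counter (G-Counter) CRDT: every replica increments
 * only its own entry and the value is the sum of all entries. Merging is
 * commutative, associative and idempotent, so replicas converge without
 * coordination.
 */
class GCounter {
    private final String nodeId;
    private final Map<String, Long> counts = new HashMap<>();

    GCounter(String nodeId) {
        this.nodeId = nodeId;
    }

    void increment() {
        counts.merge(nodeId, 1L, Long::sum);
    }

    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merge the state received from another replica into this one.
    void merge(GCounter other) {
        other.counts.forEach((node, count) -> counts.merge(node, count, Math::max));
    }
}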
We believe it is a lot easier for someone to understand theory when it is
put into context by demonstrating how it is used in practical systems. This
is the reason we included a chapter dedicated to case studies about real
systems and how they use algorithms and techniques presented in the book.
We tried to cover systems from as many different categories as possible, but
we have to admit there are many more systems that are very interesting and that we would have liked to include in this chapter, but we couldn't due to time constraints. The last chapter on practices and patterns was written in the same spirit and subject to similar time constraints. So, we encourage the reader to study more resources beyond those available in this book for a deeper
understanding of how theory can be put in practice[100][101][102][103][104].
At the risk of being unfair to other systems and material out there, we would
like to mention CockroachDB3 as one system that has a lot of public material
probability of this value being considered not agreed (what is also referred to as reversed)
in the future.
3 See: https://github.com/cockroachdb/cockroach
might need to be taken in order to make sure the system operates in a secure
way. The data transmitted in this network might need to be protected so that
only authorised nodes can read it. The nodes might need to authenticate each
other, so that they can be sure they are communicating with the right node
and not some impersonator. There are a lot of cryptographic techniques that
can help with these aspects, such as encryption, authentication and digital
signatures. However, the interested reader will have to study them separately, since cryptography is a field of study in its own right. We also did not examine how networks
can be designed, so that distributed systems can run on top of them at scale,
which is another broad and challenging topic[109]. Another important topic
that we did not cover is formal verification of systems. There are many
formal verification techniques and tools that can be used to prove safety
and liveness properties of systems with TLA+[110] being one of the most
commonly used across the software industry[111]. It is important to note that users of these formal verification methods have acknowledged publicly that they have not only helped them discover bugs in their designs, but have also helped them reason significantly better about the behaviour of their systems.
References