Understanding Distributed Systems
Roberto Vitillo
February 2021
Contents
Copyright 6
Acknowledgements 8
Preface 9
0.1 Who should read this book . . . . . . . . . . . . . . 10
1 Introduction 11
1.1 Communication . . . . . . . . . . . . . . . . . . . . 12
1.2 Coordination . . . . . . . . . . . . . . . . . . . . . 13
1.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Resiliency . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Operations . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Anatomy of a distributed system . . . . . . . . . . 17
I Communication 20
2 Reliable links 23
2.1 Reliability . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Connection lifecycle . . . . . . . . . . . . . . . . . 24
2.3 Flow control . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Congestion control . . . . . . . . . . . . . . . . . . 27
2.5 Custom protocols . . . . . . . . . . . . . . . . . . . 28
3 Secure links 30
3.1 Encryption . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Authentication . . . . . . . . . . . . . . . . . . . . 31
3.3 Integrity . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Handshake . . . . . . . . . . . . . . . . . . . . . . 34
4 Discovery 35
5 APIs 39
5.1 HTTP . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Resources . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Request methods . . . . . . . . . . . . . . . . . . . 45
5.4 Response status codes . . . . . . . . . . . . . . . . 46
5.5 OpenAPI . . . . . . . . . . . . . . . . . . . . . . . 48
5.6 Evolution . . . . . . . . . . . . . . . . . . . . . . . 49
II Coordination 51
6 System models 54
7 Failure detection 57
8 Time 59
8.1 Physical clocks . . . . . . . . . . . . . . . . . . . . 60
8.2 Logical clocks . . . . . . . . . . . . . . . . . . . . . 61
8.3 Vector clocks . . . . . . . . . . . . . . . . . . . . . 63
9 Leader election 65
9.1 Raft leader election . . . . . . . . . . . . . . . . . . 65
9.2 Practical considerations . . . . . . . . . . . . . . . . 67
10 Replication 71
10.1 State machine replication . . . . . . . . . . . . . . . 72
10.2 Consensus . . . . . . . . . . . . . . . . . . . . . . . 75
10.3 Consistency models . . . . . . . . . . . . . . . . . . 75
10.3.1 Strong consistency . . . . . . . . . . . . . . 77
10.3.2 Sequential consistency . . . . . . . . . . . . 78
10.3.3 Eventual consistency . . . . . . . . . . . . . 80
10.3.4 CAP theorem . . . . . . . . . . . . . . . . . 80
10.4 Practical considerations . . . . . . . . . . . . . . . . 81
11 Transactions 83
11.1 ACID . . . . . . . . . . . . . . . . . . . . . . . . . 83
11.2 Isolation . . . . . . . . . . . . . . . . . . . . . . . . 84
11.2.1 Concurrency control . . . . . . . . . . . . . 87
11.3 Atomicity . . . . . . . . . . . . . . . . . . . . . . . 88
11.3.1 Two-phase commit . . . . . . . . . . . . . . 89
III Scalability 99
12 Functional decomposition 103
12.1 Microservices . . . . . . . . . . . . . . . . . . . . . 103
12.1.1 Benefits . . . . . . . . . . . . . . . . . . . . 105
12.1.2 Costs . . . . . . . . . . . . . . . . . . . . . . 106
12.1.3 Practical considerations . . . . . . . . . . . 108
12.2 API gateway . . . . . . . . . . . . . . . . . . . . . . 110
12.2.1 Routing . . . . . . . . . . . . . . . . . . . . 110
12.2.2 Composition . . . . . . . . . . . . . . . . . 111
12.2.3 Translation . . . . . . . . . . . . . . . . . . 112
12.2.4 Cross-cutting concerns . . . . . . . . . . . . 112
12.2.5 Caveats . . . . . . . . . . . . . . . . . . . . 115
12.3 CQRS . . . . . . . . . . . . . . . . . . . . . . . . . 117
12.4 Messaging . . . . . . . . . . . . . . . . . . . . . . . 118
12.4.1 Guarantees . . . . . . . . . . . . . . . . . . 121
12.4.2 Exactly-once processing . . . . . . . . . . . 123
12.4.3 Failures . . . . . . . . . . . . . . . . . . . . 124
12.4.4 Backlogs . . . . . . . . . . . . . . . . . . . . 124
12.4.5 Fault isolation . . . . . . . . . . . . . . . . . 125
12.4.6 Reference plus blob . . . . . . . . . . . . . . 126
13 Partitioning 128
13.1 Sharding strategies . . . . . . . . . . . . . . . . . . 128
13.1.1 Range partitioning . . . . . . . . . . . . . . 129
13.1.2 Hash partitioning . . . . . . . . . . . . . . . 129
13.2 Rebalancing . . . . . . . . . . . . . . . . . . . . . . 133
13.2.1 Static partitioning . . . . . . . . . . . . . . . 133
13.2.2 Dynamic partitioning . . . . . . . . . . . . . 133
13.2.3 Practical considerations . . . . . . . . . . . 134
14 Duplication 135
14.1 Network load balancing . . . . . . . . . . . . . . . 135
14.1.1 DNS load balancing . . . . . . . . . . . . . . 138
14.1.2 Transport layer load balancing . . . . . . . . 138
14.1.3 Application layer load balancing . . . . . . 141
14.1.4 Geo load balancing . . . . . . . . . . . . . . 143
IV Resiliency 157
15 Common failure causes 160
15.1 Single point of failure . . . . . . . . . . . . . . . . . 160
15.2 Unreliable network . . . . . . . . . . . . . . . . . . 161
15.3 Slow processes . . . . . . . . . . . . . . . . . . . . 162
15.4 Unexpected load . . . . . . . . . . . . . . . . . . . 163
15.5 Cascading failures . . . . . . . . . . . . . . . . . . 164
15.6 Risk management . . . . . . . . . . . . . . . . . . . 165
20 Monitoring 207
20.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 208
20.2 Service-level indicators . . . . . . . . . . . . . . . . 211
20.3 Service-level objectives . . . . . . . . . . . . . . . . 214
20.4 Alerts . . . . . . . . . . . . . . . . . . . . . . . . . 217
20.5 Dashboards . . . . . . . . . . . . . . . . . . . . . . 219
20.5.1 Best practices . . . . . . . . . . . . . . . . . 221
20.6 On-call . . . . . . . . . . . . . . . . . . . . . . . . . 222
21 Observability 225
21.1 Logs . . . . . . . . . . . . . . . . . . . . . . . . . . 226
21.2 Traces . . . . . . . . . . . . . . . . . . . . . . . . . 229
21.3 Putting it all together . . . . . . . . . . . . . . . . . 231
2 https://understandingdistributed.systems/
3 roberto@understandingdistributed.systems
Chapter 1
Introduction
Some applications need to tackle workloads that are just too big to
fit on a single node, no matter how powerful. For example, Google
receives hundreds of thousands of search requests per second from
all over the globe. There is no way a single node could handle that.
And finally, some applications have performance requirements
that would be physically impossible to achieve with a single
node. Netflix can seamlessly stream movies to your TV in high
resolution because it has a datacenter close to you.
This book will guide you through the fundamental challenges that
need to be solved to design, build and operate distributed sys-
tems: communication, coordination, scalability, resiliency, and op-
erations.
1.1 Communication
The first challenge comes from the fact that nodes need to commu-
nicate over the network with each other. For example, when your
browser wants to load a website, it resolves the server’s address
from the URL and sends an HTTP request to it. In turn, the server
returns a response with the content of the page to the client.
How are request and response messages represented on the wire?
What happens when there is a temporary network outage, or some
faulty network switch flips a few bits in the messages? How can
you guarantee that no intermediary can snoop into the communi-
cation?
Although it would be convenient to assume that some networking
library is going to abstract all communication concerns away, in
practice it’s not that simple because abstractions leak1 , and you
need to understand how the stack works when that happens.
1 https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/
1.2 Coordination
Another hard challenge of building distributed systems is coordi-
nating nodes into a single coherent whole in the presence of fail-
ures. A fault is a component that stopped working, and a system is
fault-tolerant when it can continue to operate despite one or more
faults. The “two generals” problem is a famous thought experi-
ment that showcases why this is a challenging problem.
Suppose there are two generals (nodes), each commanding its own
army, that need to agree on a time to jointly attack a city. There is
some distance between the armies, and the only way to communi-
cate is by sending a messenger (messages). Unfortunately, these
messengers can be captured by the enemy (network failure).
Is there a way for the generals to agree on a time? Well, general 1
could send a message with a proposed time to general 2 and wait
for a response. What if no response arrives, though? Was one
of the messengers captured? Perhaps a messenger was injured,
and it’s taking longer than expected to arrive at the destination?
Should the general send another messenger?
You can see that this problem is much harder than it originally ap-
peared. As it turns out, no matter how many messengers are dis-
patched, neither general can be completely certain that the other
army will attack the city at the same time. Although sending more
messengers increases the general’s confidence, it never reaches ab-
solute certainty.
Because coordination is such a key topic, the second part of this
book is dedicated to distributed algorithms used to implement co-
ordination.
1.3 Scalability
The performance of a distributed system represents how efficiently
it handles load, and it’s generally measured with throughput and
response time. Throughput is the number of operations processed
per second, and response time is the total time between a client
sending a request and receiving a response.
When the load grows, a quick fix is to buy a more powerful machine, which is referred to
as scaling up. But that will hit a brick wall sooner or later. When
that option is no longer available, the alternative is scaling out by
adding more machines to the system.
In the book’s third part, we will explore the main architectural pat-
terns that you can leverage to scale out applications: functional
decomposition, duplication, and partitioning.
1.4 Resiliency
A distributed system is resilient when it can continue to do its job
even when failures happen. And at scale, any failure that can hap-
pen will eventually occur. Every component of a system has a
probability of failing — nodes can crash, network links can be sev-
ered, etc. No matter how small that probability is, the more com-
ponents there are, and the more operations the system performs,
the higher the absolute number of failures becomes. And it gets
worse: since failures typically are not independent, the failure of one
component can increase the probability that another one will fail.
Failures that are left unchecked can impact the system’s availability,
which is defined as the amount of time the application can serve
requests divided by the duration of the period measured. In other
words, it’s the percentage of time the system is capable of servicing
requests and doing useful work.
Availability is often described with nines, a shorthand way of ex-
pressing percentages of availability. Three nines are typically con-
sidered acceptable, and anything above four is considered to be
highly available.
1.5 Operations
Distributed systems need to be tested, deployed, and maintained.
It used to be that one team developed an application, and another
was responsible for operating it. The rise of microservices and De-
vOps has changed that. The same team that designs a system is
also responsible for its live-site operation. That’s a good thing as
there is no better way to find out where a system falls short than
experiencing it by being on-call for it.
New deployments need to be rolled out continuously in a safe
manner without affecting the system’s availability. The system
needs to be observable so that it’s easy to understand what’s hap-
pening at any time. Alerts need to fire when its service level objec-
tives are at risk of being breached, and a human needs to be looped
in. The book’s final part explores best practices to test and operate
distributed systems.
Figure 1.2: The business logic uses the messaging interface imple-
mented by the Kafka producer to send messages and the reposi-
tory interface to access the SQL store. In contrast, the HTTP con-
troller handles incoming requests using the service interface.
Part I
Communication
Introduction
The network layer routes packets from one machine to another across the network. The Internet Protocol (IP) is the core
protocol of this layer, which delivers packets on a best-effort basis.
Routers operate at this layer and forward IP packets based on their
destination IP address.
The transport layer transmits data between two processes using
port numbers to address the processes on either end. The most
important protocol in this layer is the Transmission Control
Protocol (TCP).
The application layer defines high-level communication protocols,
like HTTP or DNS. Typically your code will target this level of ab-
straction.
Even though each protocol builds on top of the other, sometimes
the abstractions leak. If you don't know how the bottom lay-
ers work, you will have a hard time troubleshooting networking
issues that will inevitably arise.
Chapter 2 describes how to build a reliable communication chan-
nel (TCP) on top of an unreliable one (IP), which can drop, dupli-
cate and deliver data out of order. Building reliable abstractions on
top of unreliable ones is a common pattern that we will encounter
many times as we explore further how distributed systems work.
Chapter 3 describes how to build a secure channel (TLS) on top of
a reliable one (TCP), which provides encryption, authentication,
and integrity.
Chapter 4 dives into how the phone book of the Internet (DNS)
works, which allows nodes to discover others using names. At its
heart, DNS is a distributed, hierarchical, and eventually consistent
key-value store. By studying it, we will get a first taste of eventu-
ally consistency.
Chapter 5 concludes this part by discussing how services can ex-
pose APIs that other nodes can use to send them commands or
notifications. Specifically, we will dive into the implementation of a
RESTful HTTP API.
Chapter 2
Reliable links
2.1 Reliability
To create the illusion of a reliable channel, TCP partitions a byte
stream into discrete packets called segments. The segments are
sequentially numbered, which allows the receiver to detect holes
and duplicates. Every segment sent needs to be acknowledged
by the receiver. When that doesn’t happen, a timer fires on the
sending side, and the segment is retransmitted. To ensure that the
data hasn’t been corrupted in transit, the receiver uses a checksum
to verify the integrity of a delivered segment.
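To make this concrete, here is a minimal sketch of what that reliable byte-stream abstraction looks like from the application's point of view, using Python's standard library and a placeholder host; the segmentation, acknowledgments, and retransmissions described above all happen below this API:

import socket

# Open a TCP connection; the host and port are placeholder assumptions.
with socket.create_connection(("example.com", 80), timeout=5) as conn:
    # The application writes and reads a byte stream; TCP splits it into
    # segments, retransmits lost ones, and discards duplicates.
    conn.sendall(b"HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
    response = b""
    while True:
        chunk = conn.recv(4096)  # bytes arrive in order, without gaps
        if not chunk:
            break                # the peer closed the connection
        response += chunk
print(response.decode(errors="replace"))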
1 https://tools.ietf.org/html/rfc793
Figure 2.2: The receive buffer stores data that hasn’t been pro-
cessed yet by the application.
Figure 2.4: The lower the RTT is, the quicker the sender can start
utilizing the underlying network’s bandwidth.
Bandwidth = WinSize / RTT
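For example, assuming a 64 KiB receive window and a round trip time of 50 ms, the sender can achieve at most 65,536 bytes / 0.05 s ≈ 1.3 MB/s, no matter how much bandwidth the underlying network has.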
If you strip away all the mechanisms that TCP provides, what you get is a simple protocol named
User Datagram Protocol6 (UDP) — a connectionless transport layer
protocol that can be used as an alternative to TCP.
Unlike TCP, UDP does not expose the abstraction of a byte
stream to its clients. Clients can only send discrete packets,
called datagrams, with a limited size. UDP doesn’t offer any
reliability as datagrams don’t have sequence numbers and are
not acknowledged. UDP doesn’t implement flow and congestion
control either. Overall, UDP is a lean and barebone protocol. It’s
used to bootstrap custom protocols, which provide some, but not
all, of the stability and reliability guarantees that TCP does7 .
For example, in modern multi-player games, clients sample
gamepad, mouse and keyboard events several times per second
and send them to a server that keeps track of the global game state.
Similarly, the server samples the game state several times per
second and sends these snapshots back to the clients. If a snapshot
is lost in transmission, there is no value in retransmitting it as the
game evolves in real-time; by the time the retransmitted snapshot
would get to the destination, it would be obsolete. This is a use
case where UDP shines, as TCP would attempt to redeliver the
missing data and consequently slow down the client’s experience.
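As an illustration, here is a minimal sketch of such a fire-and-forget snapshot in Python; the server address and payload are placeholders:

import socket

# UDP is connectionless: there is no handshake, no acknowledgment, and
# no retransmission. If this datagram is lost, the next snapshot simply
# supersedes it.
snapshot = b'{"tick": 1024, "x": 10, "y": 7}'
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(snapshot, ("game-server.example.com", 9999))
sock.close()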
6 https://en.wikipedia.org/wiki/User_Datagram_Protocol
7 As we will later see, HTTP 3 is based on UDP to avoid some of TCP's shortcomings.
Chapter 3
Secure links
We now know how to reliably send bytes from one process to an-
other over the network. The problem is these bytes are sent in the
clear, and any middle-man can intercept our communication. To
protect against that, we can use the Transport Layer Security1 (TLS)
protocol. TLS runs on top of TCP and encrypts the communica-
tion channel so that application layer protocols, like HTTP, can
leverage it to communicate securely. In a nutshell, TLS provides
encryption, authentication, and integrity.
3.1 Encryption
Encryption guarantees that the data transmitted between a client
and a server is obfuscated and can only be read by the communi-
cating processes.
When the TLS connection is first opened, the client and the server
negotiate a shared encryption secret using asymmetric encryption.
Both parties generate a key-pair consisting of a private and public
part. The processes are then able to create a shared secret by ex-
changing their public keys. This is possible thanks to some mathematical properties of the key-pairs.
1 https://en.wikipedia.org/wiki/Transport_Layer_Security
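The following toy sketch illustrates how two parties can derive the same secret by exchanging only public values; the numbers are unrealistically small and the scheme is for illustration only, not usable cryptography:

import secrets

# Toy Diffie-Hellman-style exchange. Each party combines its own private
# key with the other's public key and arrives at the same shared secret,
# which never travels over the network.
p, g = 23, 5                         # public parameters (tiny on purpose)
a = secrets.randbelow(p - 2) + 1     # client's private key
b = secrets.randbelow(p - 2) + 1     # server's private key
A = pow(g, a, p)                     # client's public key, sent to the server
B = pow(g, b, p)                     # server's public key, sent to the client
assert pow(B, a, p) == pow(A, b, p)  # both sides compute the same secret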
3.2 Authentication
Although we have a way to obfuscate data transmitted across the
wire, the client still needs to authenticate the server to verify it’s
who it claims to be. Similarly, the server might want to authenti-
cate the identity of the client.
TLS implements authentication using digital signatures based on
asymmetric cryptography. The server generates a key-pair with a
private and a public key, and shares its public key with the client.
When the server sends a message to the client, it signs it with its
private key. The client uses the public key of the server to verify
that the digital signature was actually signed with the private key.
This is possible thanks to mathematical properties3 of the key-pair.
The problem with this naive approach is that the client has no idea
whether the public key shared by the server is authentic, so we
have certificates to prove the ownership of a public key for a spe-
cific entity. A certificate includes information about the owning
entity, expiration date, public key, and a digital signature of the
third-party entity that issued the certificate. The certificate's issuing entity is called a certificate authority (CA).
2 https://blog.cloudflare.com/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/
3 https://en.wikipedia.org/wiki/Digital_signature
When a TLS connection is opened, the server sends the full cer-
tificate chain to the client, starting with the server’s certificate and
ending with the root CA. The client verifies the server's certificate by walking up the chain until it finds a certificate issued by a CA it trusts.
4 https://letsencrypt.org/
3.3 Integrity
Even if the data is obfuscated, a middle man could still tamper
with it; for example, random bits within the messages could be
swapped. To protect against tampering, TLS verifies the integrity
of the data by calculating a message digest. A secure hash function
is used to create a message authentication code5 (HMAC). When a
process receives a message, it recomputes the digest of the message
and checks whether it matches the digest included in the message.
If not, then the message has either been corrupted during trans-
mission or has been tampered with. In this case, the message is
dropped.
The TLS HMAC protects against data corruption as well, not just
tampering. You might be wondering how data can be corrupted if
TCP is supposed to guarantee its integrity. While TCP does use a
checksum to protect against data corruption, it’s not 100% reliable6
because it fails to detect errors for roughly 1 in 16 million to 10
billion packets. With packets of 1KB, this can happen every 16 GB
to 10 TB transmitted.
5 https://en.wikipedia.org/wiki/HMAC
6 https://dl.acm.org/doi/10.1145/347057.347561
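Here is a minimal sketch of the digest computation and check using Python's standard library; the key is a placeholder for the secret negotiated during the handshake:

import hashlib
import hmac

key = b"secret-derived-from-the-handshake"   # assumption: agreed beforehand
message = b"GET /products HTTP/1.1"

# The sender attaches the digest to the message.
digest = hmac.new(key, message, hashlib.sha256).digest()

# The receiver recomputes the digest and compares it in constant time;
# on a mismatch the message is dropped.
expected = hmac.new(key, message, hashlib.sha256).digest()
if not hmac.compare_digest(digest, expected):
    raise ValueError("message corrupted or tampered with")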
3.4 Handshake
When a new TLS connection is established, a handshake between
the client and server occurs during which:
1. The parties agree on the cipher suite to use. A cipher suite
specifies the different algorithms that the client and the
server intend to use to create a secure channel, like the:
• key exchange algorithm used to generate shared
secrets;
• signature algorithm used to sign certificates;
• symmetric encryption algorithm used to encrypt the ap-
plication data;
• HMAC algorithm used to guarantee the integrity and
authenticity of the application data.
2. The parties use the negotiated key exchange algorithm to cre-
ate a shared secret. The shared secret is used by the chosen
symmetric encryption algorithm to encrypt the communica-
tion of the secure channel going forwards.
3. The client verifies the certificate provided by the server. The
verification process confirms that the server is who it says it
is. If the verification is successful, the client can start send-
ing encrypted application data to the server. The server can
optionally also verify the client certificate if one is available.
These operations don’t necessarily happen in this order as modern
implementations use several optimizations to reduce round trips.
The handshake typically requires 2 round trips with TLS 1.2 and
just one with TLS 1.37 . The bottom line is creating a new connec-
tion is expensive; yet another reason to put your servers geograph-
ically closer to the clients and reuse connections when possible.
7 https://tools.ietf.org/html/rfc8446
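In practice you rarely drive the handshake yourself. Here is a minimal sketch of opening a TLS channel on top of TCP with Python's standard library and a placeholder hostname; the cipher suite negotiation, key exchange, and certificate verification all happen inside wrap_socket:

import socket
import ssl

hostname = "example.com"                 # placeholder
context = ssl.create_default_context()   # trusts the system's root CAs

with socket.create_connection((hostname, 443)) as tcp_conn:
    with context.wrap_socket(tcp_conn, server_hostname=hostname) as tls_conn:
        print(tls_conn.version())        # e.g., 'TLSv1.3'
        print(tls_conn.getpeercert()["subject"])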
Chapter 4
Discovery
If a record's time to live (TTL) is set too long, many clients won't see a change for a long time. But if you set it too
short, you increase the load on the name servers and the average
response time of requests because the clients will have to resolve
the entry more often.
If your name server becomes unavailable for any reason, the
smaller the record's TTL is, the higher the number of clients
impacted will be. DNS can easily become a single point of failure
— if your DNS name server is down and the clients can't find the
IP address of your service, they won't have a way to connect to it.
This can lead to massive outages3 .
3 https://en.wikipedia.org/wiki/2016_Dyn_cyberattack
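As an illustration, here is a minimal sketch of a lookup through the operating system's resolver, with a placeholder hostname; resolvers cache records and only refresh them once the TTL expires:

import socket

for family, _, _, _, sockaddr in socket.getaddrinfo(
        "example.com", 443, proto=socket.IPPROTO_TCP):
    print(family.name, sockaddr[0])   # prints the resolved IP addresses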
Chapter 5
APIs
5.1 HTTP
HTTP7 is a request-response protocol used to encode and transport
information between a client and a server. In an HTTP transaction,
the client sends a request message to the server’s API endpoint, and
the server replies back with a response message, as shown in Figure
5.2.
In HTTP 1.1, a message is a textual block of data that contains a
start line, a set of headers, and an optional body:
• In a request message, the start line indicates what the request
is for, and in a response message, it indicates what the re-
sponse’s result is.
• The headers are key-value pairs with meta-information that
describe the message.
• The message’s body is a container for data.
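To make that structure concrete, here is a minimal sketch of an HTTP 1.1 exchange using Python's standard library against a placeholder host and path:

import http.client

conn = http.client.HTTPConnection("example.com", 80)
# Request message: start line (method and path) plus headers.
conn.request("GET", "/products?sort=price", headers={"Accept": "application/json"})
response = conn.getresponse()
print(response.status, response.reason)   # the response start line's result
print(response.getheaders())              # the response headers
print(response.read()[:100])              # the beginning of the body
conn.close()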
HTTP is a stateless protocol, which means that everything needed
by a server to process a request needs to be specified within the
request itself, without context from previous requests. HTTP 1.1 uses TCP as its underlying transport protocol for the reliability guarantees it provides.
3 https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/async/
4 https://grpc.io/
5 https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
6 https://graphql.org/
7 https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
5.2 Resources
Suppose we are responsible for implementing a service to man-
age the product catalog of an e-commerce application. The service
must allow users to browse the catalog and admins to create, up-
date, or delete products. Sounds simple enough; the interface of
the service could be defined like this:
interface CatalogService
{
    List<Product> GetProducts(...);
    Product GetProduct(...);
    void AddProduct(...);
    void DeleteProduct(...);
    void UpdateProduct(...);
}
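As a preview of how such an interface maps onto HTTP, here is a minimal sketch using Flask with an in-memory stand-in for the catalog; the framework, route names, and data are illustrative assumptions, not part of the service's actual design:

from flask import Flask, jsonify, request

app = Flask(__name__)
products = {1: {"id": 1, "name": "Radio", "category": "electronics"}}  # stand-in store

@app.route("/products", methods=["GET"])                    # GetProducts
def get_products():
    return jsonify(list(products.values()))

@app.route("/products/<int:product_id>", methods=["GET"])   # GetProduct
def get_product(product_id):
    return jsonify(products[product_id])

@app.route("/products", methods=["POST"])                   # AddProduct
def add_product():
    product = request.get_json()
    products[product["id"]] = product
    return jsonify(product), 201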
5.5 OpenAPI
Now that we have learned how to map the operations defined by
our service’s interface onto RESTful HTTP endpoints, we can for-
mally define the API with an interface definition language (IDL), a
language independent description of it. The IDL definition can be
used to generate boilerplate code for the IPC adapter and client
SDKs in your languages of choice.
The OpenAPI13 specification, which evolved from the Swagger
project, is one of the most popular IDL for RESTful APIs based
on HTTP. With it, we can formally describe our API in a YAML
document, including the available endpoints, supported request
methods and response status codes for each endpoint, and the
schema of the resources’ JSON representation.
For example, this is how part of the /products endpoint of the cata-
log service’s API could be defined:
openapi: 3.0.0
info:
  version: "1.0.0"
  title: Catalog Service API
paths:
  /products:
    get:
      summary: List products
      parameters:
        - in: query
          name: sort
          required: false
          schema:
            type: string
      responses:
        '200':
          description: list of products in catalog
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/ProductItem'
        '400':
          description: bad input
components:
  schemas:
    ProductItem:
      type: object
      required:
        - id
        - name
        - category
      properties:
        id:
          type: number
        name:
          type: string
        category:
          type: string
13 https://swagger.io/specification/
5.6 Evolution
APIs start out as beautifully-designed interfaces. Slowly, but
surely, they will need to change over time to adapt to new use
cases. The last thing you want to do when evolving your API is
to introduce a breaking change that requires modifying all the
clients in unison, some of which you might have no control over.
14 https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
15 https://github.com/Azure/openapi-diff
Part II
Coordination
Introduction
Chapter 6
System models
It's possible to build a stronger link abstraction on top of a weaker
one. For example, TCP does precisely that (and more), while TLS
implements authentication (and more).
We can also model the different types of node failures we expect
to happen:
• The arbitrary-fault model assumes that a node can deviate
from its algorithm in arbitrary ways, leading to crashes or
unexpected behavior due to bugs or malicious activity. The
arbitrary fault model is also referred to as the “Byzantine”
model for historical reasons. Interestingly, it can be theoreti-
cally proven that a system with Byzantine nodes can tolerate
up to 1/3 of faulty nodes1 and still operate correctly.
• The crash-recovery model assumes that a node doesn’t devi-
ate from its algorithm, but can crash and restart at any time,
losing its in-memory state.
• The crash-stop model assumes that a node doesn’t deviate
from its algorithm, but if it crashes it never comes back on-
line.
While it’s possible to take an unreliable communication link and
convert it into a more reliable one using a protocol (e.g., keep re-
transmitting lost messages), the equivalent isn’t possible for nodes.
Because of that, algorithms for different node models look very dif-
ferent from each other.
Byzantine node models are typically used to model safety-critical
systems like airplane engine systems, nuclear power plants, finan-
cial systems, and other systems where a single entity doesn’t fully
control all the nodes2 . These use cases are outside of the book’s
scope, and the algorithms presented will generally assume a crash-
recovery model.
Finally, we can also model the timing assumptions:
• The synchronous model assumes that sending a message or
executing an operation never takes over a certain amount of
time. This is very unrealistic in the real world, where we know that network delays and process pauses can be unbounded.
1 https://en.wikipedia.org/wiki/Byzantine_fault
2 For example, digital cryptocurrencies such as Bitcoin implement algorithms that assume Byzantine nodes.
3 https://www.distributedprogramming.net/
4 https://en.wikipedia.org/wiki/All_models_are_wrong
Chapter 7
Failure detection
In the worst case, the client will wait forever for a response that
will never arrive. The best it can do is make an educated guess on
whether the server is likely to be down or unreachable after some
time has passed. To do that, the client can configure a timeout
to trigger if it hasn’t received a response from the server after a
certain amount of time. If and when the timeout triggers, the client
considers the server unavailable and throws an error.
The tricky part is defining how long the amount of time that trig-
gers this timeout should be. If it’s too short and the server is reach-
able, the client will wrongly consider the server dead; if it’s too
long and the server is not reachable, the client will block waiting
for a response. The bottom line is that it’s not possible to build a
perfect failure detector.
A process doesn’t necessarily need to wait to send a message to
find out that the destination is not reachable. It can also actively
try to maintain a list of processes that are available using pings or
heartbeats.
A ping is a periodic request that a process sends to another to check
whether it’s still available. The process expects a response to the
ping within a specific time frame. If that doesn’t happen, a time-
out is triggered that marks the destination as dead. However, the
process will keep regularly sending pings to it so that if and when
it comes back online, it will reply to a ping and be marked as avail-
able again.
A heartbeat is a message that a process periodically sends to another
to inform it that it’s still up and running. If the destination doesn’t
receive a heartbeat within a specific time frame, it triggers a time-
out and marks the process that missed the heartbeat as dead. If
that process later comes back to life and starts sending out heart-
beats, it will eventually be marked as available again.
Pings and heartbeats are typically used when specific processes
frequently interact with each other, and an action needs to be taken
as soon as one of them is no longer reachable. If that’s not the case,
detecting failures just at communication time is good enough.
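As an illustration, here is a minimal sketch of a ping-based detector; send_ping is a hypothetical function that returns whether a reply arrived within the timeout:

import time

class PingFailureDetector:
    def __init__(self, peers, timeout=1.0, interval=5.0):
        self.timeout = timeout
        self.interval = interval
        self.available = {peer: True for peer in peers}

    def run_once(self, send_ping):
        # A peer marked dead keeps being pinged so that it can be marked
        # available again once it comes back online.
        for peer in self.available:
            self.available[peer] = send_ping(peer, self.timeout)

    def run_forever(self, send_ping):
        while True:
            self.run_once(send_ping)
            time.sleep(self.interval)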
Chapter 8
Time
Given two operations 𝑂1 and 𝑂2 with vector clock timestamps 𝑇1 and 𝑇2, if:
• every counter in 𝑇1 is less than or equal to the corresponding
counter in 𝑇2,
• and there is at least one counter in 𝑇1 that is strictly less than
the corresponding counter in 𝑇2,
then 𝑂1 happened-before 𝑂2. For example, in Figure 8.2, B
happened-before C.
If 𝑂1 didn’t happen before 𝑂2 and 𝑂2 didn’t happen before 𝑂1 ,
then the timestamps can’t be ordered, and the operations are con-
sidered to be concurrent. For example, operation E and C in Figure
8.2 can’t be ordered, and therefore they are considered to be con-
current.
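The comparison rule is mechanical enough to express in a few lines; in this sketch the timestamps are made-up values that mirror the figure, not taken from it:

def happened_before(t1, t2):
    # t1 happened-before t2 if every counter is <= and at least one is <.
    return (all(a <= b for a, b in zip(t1, t2))
            and any(a < b for a, b in zip(t1, t2)))

def concurrent(t1, t2):
    # Neither happened before the other.
    return not happened_before(t1, t2) and not happened_before(t2, t1)

b, c, e = [1, 2, 0], [1, 3, 0], [0, 0, 2]   # illustrative vector timestamps
assert happened_before(b, c)
assert concurrent(e, c)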
This discussion about logical clocks might feel quite abstract. Later
in the book, we will encounter some practical applications of logi-
cal clocks. Once you learn to spot them, you will realize they are ev-
erywhere, as they can be disguised under different names. What’s
important to internalize at this point is that generally, you can't
use physical clocks to accurately derive the order of events that
happened on different processes8.
8 That said, sometimes physical clocks are good enough. For example, using physical clocks to timestamp logs is fine as they are mostly used for debugging purposes.
Chapter 9
Leader election
was to stop for any reason, like for a long GC pause, by the
time it resumes another process could have won the election.
• A period of time goes by with no winner — It’s unlikely but
possible that multiple followers become candidates simulta-
neously, and none manages to receive a majority of votes;
this is referred to as a split vote. When that happens, the
candidate will eventually time out and start a new election.
The election timeout is picked randomly from a fixed inter-
val to reduce the likelihood of another split vote in the next
election.
Raft's leader election algorithm was designed for understandability, which is why I chose it. That said, you will rarely need to
implement leader election from scratch, as you can leverage lin-
earizable key-value stores, like etcd2 or ZooKeeper3 , which offer
abstractions that make it easy to implement leader election. The
abstractions range from basic primitives like compare-and-swap
to full-fledged distributed mutexes.
Ideally, the external store should at the very least offer an atomic
compare-and-swap operation with an expiration time (TTL). The
compare-and-swap operation updates the value of a key if and
only if the value matches the expected one; the expiration time
defines the time to live for a key, after which the key expires and is
removed from the store if the lease hasn’t been extended. The idea
is that each competing process tries to acquire a lease by creating
a new key with compare-and-swap using a specific TTL. The first
process to succeed becomes the leader and remains such until it
stops renewing the lease, after which another process can become
the leader.
The TTL expiry logic can also be implemented on the client-side,
like this locking library4 for DynamoDB does, but the implemen-
tation is more complex, and it still requires the data store to offer
a compare-and-swap operation.
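The acquisition loop might look something like the following sketch, where the kv client and its compare_and_swap method are hypothetical stand-ins for what stores like etcd or ZooKeeper expose:

import time

def lead(kv, node_id, key="service/leader", ttl=10):
    while True:
        # Acquire the lease only if the key doesn't exist yet.
        if kv.compare_and_swap(key, expected=None, new=node_id, ttl=ttl):
            # Keep renewing the lease while we still hold it.
            while kv.compare_and_swap(key, expected=node_id, new=node_id, ttl=ttl):
                time.sleep(ttl / 2)
        # Not (or no longer) the leader: wait for the lease to expire and retry.
        time.sleep(ttl)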
You might think that’s enough to guarantee there can’t be more
than one leader in your application. Unfortunately, that’s not the
case.
To see why, suppose there are multiple processes that need to up-
date a file on a shared blob store, and you want to guarantee that
only a single process at a time can do so to avoid race conditions.
To achieve that, you decide to use a distributed mutex, a form of
leader election. Each process tries to acquire the lock, and the one
that does so successfully reads the file, updates it in memory, and
writes it back to the store:
2 https://etcd.io/
3 https://zookeeper.apache.org/
4 https://aws.amazon.com/blogs/database/building-distributed-locks-with-the-dynamodb-lock-client/
if lock.acquire():
    try:
        content = store.read(blob_name)
        new_content = update(content)
        store.write(blob_name, new_content)
    finally:
        lock.release()
The problem here is that by the time the process writes the content
to the store, it might no longer be the leader and a lot might have
happened since it was elected. For example, the operating system
might have preempted and stopped the process, and several sec-
onds will have passed by the time it’s running again. So how can
the process ensure that it’s still the leader then? It could check one
more time before writing to the store, but that doesn’t eliminate
the race condition, it just makes it less likely.
To avoid this issue, the data store downstream needs to verify that
the request has been sent by the current leader. One way to do
that is by using a fencing token. A fencing token5 is a number that
increases every time that a distributed lock is acquired — in other
words, it’s a logical clock. When the leader writes to the store, it
passes down the fencing token to it. The store remembers the value
of the last token and accepts only writes with a greater value:
success, token = lock.acquire()
if success:
    try:
        content = store.read(blob_name)
        new_content = update(content)
        store.write(blob_name, new_content, token)
    finally:
        lock.release()
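The store-side check could look like the following sketch; the class and its behavior are made up for illustration:

class FencedBlobStore:
    def __init__(self):
        self.blobs = {}
        self.last_token = {}

    def read(self, blob_name):
        return self.blobs.get(blob_name)

    def write(self, blob_name, content, token):
        # Reject writes carrying a token older than the last one seen.
        if token <= self.last_token.get(blob_name, -1):
            raise PermissionError("stale fencing token; write rejected")
        self.last_token[blob_name] = token
        self.blobs[blob_name] = content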
In practice, you should design the system around the fact that occasionally there will be more than one leader. For ex-
ample, if there are momentarily two leaders and they both perform
the same idempotent operation, no harm is done.
Although having a leader can simplify the design of a system as it
eliminates concurrency, it can become a scaling bottleneck if the
number of operations performed by the leader increases to the
point where it can no longer keep up. When that happens, you
might be forced to re-design the whole system.
Also, having a leader introduces a single point of failure with a
large blast radius; if the election process stops working or the
leader isn’t working as expected, it can bring down the entire
system with it.
You can mitigate some of these downsides by introducing parti-
tions and assigning a different leader per partition, but that comes
with additional complexity. This is the solution many distributed
data stores use.
Before considering the use of a leader, check whether there are
other ways of achieving the desired functionality without it. For
example, optimistic locking is one way to guarantee mutual ex-
clusion at the cost of wasting some computing power. Or per-
haps high availability is not a requirement for your application, in
which case having just a single process that occasionally crashes
and restarts is not a big deal.
As a rule of thumb, if you must use leader election, you have to
minimize the work it performs and be prepared to occasionally
have more than one leader if you can’t support fencing tokens end-
to-end.
Chapter 10
Replication
The leader can't just broadcast log entries to its followers and call it a day, as any process can fail at any time, and
the network can lose messages. This is why a large part of the al-
gorithm is dedicated to fault-tolerance.
10.2 Consensus
State machine replication can be used for much more than just
replicating data since it’s a solution to the consensus problem. Con-
sensus2 is a fundamental problem studied in distributed systems
research, which requires a set of processes to agree on a value in a
fault-tolerant way so that:
• every non-faulty process eventually agrees on a value;
• the final decision of every non-faulty process is the same ev-
erywhere;
• and the value that has been agreed on has been proposed by
a process.
Consensus has a large number of practical applications. For ex-
ample, a set of processes agreeing which one should hold a lock
or commit a transaction are consensus problems in disguise. As
it turns out, deciding on a value can be solved with state machine
replication. Hence, any problem that requires consensus can be
solved with state machine replication too.
Typically, when you have a problem that requires consensus, the
last thing you want to do is to solve it from scratch by implement-
ing an algorithm like Raft. While it’s important to understand
what consensus is and how it can be solved, many good open-
source projects implement state machine replication and expose
simple APIs on top of it, like etcd and ZooKeeper.
The best guarantee the system can provide is that the request
executes somewhere between its invocation and completion time.
You might think that this doesn’t look like a big deal; after all, it’s
what you are used to when writing single-threaded applications.
If you assign 1 to x and read its value right after, you expect to find
1 in there, assuming there is no other thread writing to the same
variable. But, once you start dealing with systems that replicate
their state on multiple nodes for high availability and scalability,
all bets are off. To understand why that’s the case, we will explore
different ways to implement reads in our replicated store.
When a network partition occurs, some clients might no longer be able to reach the leader. The system has two choices
when this happens, it can either:
• remain available by allowing followers to serve reads, sacri-
ficing strong consistency;
• or guarantee strong consistency by failing reads that can’t
reach the leader.
This concept is expressed by the CAP theorem6 , which can be sum-
marized as: “strong consistency, availability and partition toler-
ance: pick two out of three.” In reality, the choice really is only
between strong consistency and availability, as network faults are
a given and can’t be avoided.
Even though network partitions can happen, they are usually rare.
But, there is a trade-off between consistency and latency in the ab-
sence of a network partition. The stronger the consistency guar-
antee is, the higher the latency of individual operations must be.
This relationship is expressed by the PACELC theorem7 . It states
that in case of network partitioning (P) in a distributed computer
system, one has to choose between availability (A) and consistency
(C), but else (E), even when the system is running normally in the
absence of partitions, one has to choose between latency (L) and
consistency (C).
Chapter 11
Transactions
11.1 ACID
Consider a money transfer from one bank account to another. If
the withdrawal succeeds, but the deposit doesn’t, the funds need
to be deposited back into the source account — money can’t just
disappear into thin air. In other words, the transfer needs to ex-
ecute atomically; either both the withdrawal and the deposit suc-
ceed, or neither do. To achieve that, the withdrawal and deposit
need to be wrapped in an inseparable unit: a transaction.
In a traditional relational database, a transaction is a group of
operations for which the database guarantees a set of properties,
known as ACID:
• Atomicity guarantees that partial failures aren’t possible; ei-
ther all the operations in the transactions complete success-
fully, or they are rolled back as if they never happened.
• Consistency guarantees that the application-level invariants,
like a column that can’t be null, must always be true. Con-
fusingly, the “C” in ACID has nothing to do with the con-
sistency models we talked about so far, and according to
Joe Hellerstein, the “C” was tossed in to make the acronym
work1 . Therefore, we will safely ignore this property in the
rest of this chapter.
• Isolation guarantees that the concurrent execution of trans-
actions doesn’t cause any race conditions.
• Durability guarantees that once the data store commits the
transaction, the changes are persisted on durable storage.
The use of a write-ahead log2 (WAL) is the standard method
used to ensure durability. When using a WAL, the data
store can update its state only after log entries describing
the changes have been flushed to permanent storage. Most
of the time, the database doesn’t read from this log at all.
But if the database crashes, the log can be used to recover its
prior state.
Transactions relieve you from a whole range of possible failure sce-
narios so that you can focus on the actual application logic rather
than all possible things that can go wrong. This chapter explores
how distributed transactions differ from ACID transactions and
how you can implement them in your systems. We will focus our
attention mainly on atomicity and isolation.
11.2 Isolation
A set of concurrently running transactions that access the same
data can run into all sorts of race conditions, like dirty writes, dirty
reads, fuzzy reads, and phantom reads.
1 http://www.bailis.org/blog/when-is-acid-acid-rarely/
2 https://www.postgresql.org/docs/9.1/wal-intro.html
Figure 11.1: Isolation levels define which race conditions they for-
bid.
11.3 Atomicity
Going back to our original example of sending money from one
bank account to another, suppose the two accounts belong to two
different banks that use separate data stores. How should we go
about guaranteeing atomicity across the two accounts? We can’t
just run two separate transactions to respectively withdraw and
deposit the funds — if the second transaction fails, then the sys-
tem is left in an inconsistent state. We need atomicity: the guar-
antee that either both transactions succeed and their changes are
committed, or that they fail without any side effects.
7 https://en.wikipedia.org/wiki/Multiversion_concurrency_control
8 https://wiki.postgresql.org/wiki/SSI
Figure 11.3: The producer appends entries at the end of the log,
while the consumers read the entries at their own pace.
11.4.2 Sagas
Suppose we own a travel booking service. To book a trip, the
travel service has to atomically book a flight through a dedicated
service and a hotel through another. However, either of these ser-
vices can fail their respective requests. If one booking succeeds,
but the other fails, then the former needs to be canceled to guar-
antee atomicity. Hence, booking a trip requires multiple steps to
complete, some of which are only required in case of failure. Since
appending a single message to a log is no longer sufficient to com-
mit the transaction, we can’t use the simple log-oriented solution
presented earlier.
The Saga18 pattern provides a solution to this problem. A saga is
a distributed transaction composed of a set of local transactions
𝑇1 , 𝑇2 , ..., 𝑇𝑛 , where 𝑇𝑖 has a corresponding compensating local
transaction 𝐶𝑖 used to undo its changes. The Saga guarantees that
either all local transactions succeed, or in case of failure, that the
compensating local transactions undo the partial execution of the
transaction altogether. This guarantees the atomicity of the pro-
tocol; either all local transactions succeed, or none of them do. A
Saga can be implemented with an orchestrator, the transaction’s
coordinator, that manages the execution of the local transactions
across the processes involved, the transaction’s participants.
18 https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf
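A minimal sketch of an orchestrator for this example could look as follows; the two service clients and their book and cancel operations are hypothetical. A production orchestrator would also need to checkpoint its progress durably and retry compensations, since it can itself crash mid-saga:

def book_trip(flight_service, hotel_service, trip):
    flight = flight_service.book(trip)        # local transaction T1
    try:
        hotel = hotel_service.book(trip)      # local transaction T2
    except Exception:
        flight_service.cancel(flight)         # compensating transaction C1
        raise
    return flight, hotel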
11.4.3 Isolation
We started our journey into asynchronous transactions as a way to
design around the blocking nature of 2PC. To get here, we had to
sacrifice the isolation guarantee that traditional ACID transactions
provide. As it turns out, we can work around the lack of isolation
as well. For example, one way to do that is with the use of semantic
locks22. The idea is that any data the Saga modifies is marked with
a dirty flag. This flag is only cleared at the end of the transaction
when it completes. Another transaction trying to access a dirty
record can either fail and roll back its changes, or block until the
dirty flag is cleared. The latter approach can introduce deadlocks,
though, which requires a strategy to mitigate them.
20 https://aws.amazon.com/step-functions
21 https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-orchestrations
22 https://dl.acm.org/doi/10.5555/284472.284478
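A sketch of the dirty-flag check, with a hypothetical data access object standing in for the service's store:

def update_record(db, record_id, changes):
    record = db.get(record_id)
    if record["dirty"]:
        # Another saga is in flight: fail fast here, or alternatively block
        # until the flag is cleared (at the risk of deadlocks).
        raise RuntimeError("record is locked by an in-flight saga")
    record.update(changes)
    record["dirty"] = True   # cleared by the orchestrator when the saga completes
    db.put(record_id, record)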
Part III
Scalability
Introduction
Chapter 12
Functional decomposition
12.1 Microservices
An application typically starts its life as a monolith. Take a modern
backend of a single-page JavaScript application (SPA), for exam-
ple. It might start out as a single stateless web service that exposes
a RESTful HTTP API and uses a relational database as a backing
store. The service is likely to be composed of a number of compo-
nents or libraries that implement different business capabilities, as
shown in Figure 12.1.
As the number of feature teams contributing to the same codebase
increases, the components become increasingly coupled over time.
This leads the teams to step on each other’s toes more and more
frequently, decreasing their productivity.
The codebase becomes complex enough that nobody fully under-
stands every part of it, and implementing new features or fixing
bugs becomes time-consuming. Even if the backend is componen-
tized into different libraries owned by different teams, a change
to a library requires the service to be redeployed. And if a change
introduces a bug like a memory leak, the entire service can poten-
tially be affected by it. Additionally, rolling back a faulty build
affects the velocity of all teams, not just the one that introduced
the bug.
One way to mitigate the growing pains of a monolithic backend is to
split it into a set of independently deployable services that commu-
nicate via APIs, as shown in Figure 12.2. The APIs decouple the
services from each other by creating boundaries that are hard to
violate, unlike the ones between components running in the same
process.
This architectural style is also referred to as the microservice archi-
tecture. The term micro can be misleading, though — there doesn’t
have to be anything micro about the services. In fact, I would argue
that if a service doesn’t do much, it just creates more operational
overhead than benefits. A more appropriate name for this architec-
ture is service-oriented architecture1 , but unfortunately, that name
comes with some old baggage as well. Perhaps in 10 years, we will
call the same concept by yet another name, but for now we will
stick with the term microservices.
1 https://en.wikipedia.org/wiki/Service-oriented_architecture
12.1.1 Benefits
Breaking down the backend by business capabilities into a set of
services with well-defined boundaries allows each service to be
developed and operated by a single small team. Smaller teams
can increase the application’s development speed for a variety of
reasons:
• They are more effective as the communication overhead
grows quadratically2 with the team’s size.
• Since each team dictates its own release schedule and has
complete control over its codebase, less cross-team commu-
2
https://en.wikipedia.org/wiki/The_Mythical_Man-Month
CHAPTER 12. FUNCTIONAL DECOMPOSITION 106
12.1.2 Costs
The microservice architecture adds more moving parts to the over-
all system, and this doesn’t come for free. The cost of fully em-
bracing microservices is only worth paying if it can be amortized
across dozens of development teams.
Development experience
Nothing forbids the use of different languages, libraries, and data-
stores in each microservice, but doing so transforms the applica-
tion into an unmaintainable mess. For example, it becomes more
challenging for a developer to move from one team to another if
the software stack is completely different. And think of the sheer
number of libraries, one for each language adopted, that need to be
supported to provide common functionality that all services need,
like logging.
It's only reasonable then that a certain degree of standardization is needed.
12.2.1 Routing
The API gateway can route the requests it receives to the appro-
priate backend service. It does so with the help of a routing map,
which maps the external APIs to the internal ones. For example,
the map might have a 1:1 mapping between an external path and
internal one. If in the future the internal path changes, the public
API can continue to expose the old path to guarantee backward
compatibility.
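A minimal sketch of such a routing map; the external paths and internal endpoints are made up for illustration:

ROUTES = {
    "/api/products": "http://catalog-service.internal/v2/products",
    "/api/orders": "http://order-service.internal/orders",
}

def route(external_path):
    # A real gateway would do longest-prefix matching; an exact match
    # keeps the sketch short.
    try:
        return ROUTES[external_path]
    except KeyError:
        raise LookupError(f"no internal route for {external_path}")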
Figure 12.3: The API gateway hides the internal APIs from its
clients.
12.2.2 Composition
While data of a monolithic application typically resides in a sin-
gle data store, in a distributed system, it’s spread across multiple
services. As such, some use cases might require stitching data
back together from multiple sources. The API gateway can offer a
higher-level API that queries multiple services and composes their
responses within a single one that is then returned to the client.
This relieves the client from knowing which services to query and
reduces the number of requests it needs to perform to get the data
it needs.
Composition can be hard to get right. The availability of the com-
posed API decreases as the number of internal calls increases since
each has a non-zero probability of failure. Additionally, the data
across the services might be inconsistent as some updates might
not have propagated to all services yet; in that case, the gateway
will have to somehow resolve this discrepancy.
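A sketch of a composed endpoint, with hypothetical internal clients, that degrades gracefully when a non-essential call fails:

def get_product_page(product_id, catalog_client, reviews_client):
    product = catalog_client.get_product(product_id)        # essential call
    try:
        reviews = reviews_client.get_reviews(product_id)    # nice to have
    except Exception:
        reviews = []   # degrade gracefully rather than fail the whole page
    return {"product": product, "reviews": reviews}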
12.2.3 Translation
The API gateway can translate from one IPC mechanism to an-
other. For example, it can translate a RESTful HTTP request into
an internal gRPC call.
The gateway can also expose different APIs to different types of
clients. For example, a web API for a desktop application can po-
tentially return more data than the one for a mobile application, as
the screen estate is larger and more information can be presented
at once. Also, network calls are expensive for mobile clients, and
requests generally need to be batched to reduce battery usage.
To meet these different and competing requirements, the gateway
can provide different APIs tailored to different use cases and trans-
late these APIs to the internal ones. An increasingly popular ap-
proach to tailor APIs to individual use cases is to use graph-based
APIs. A graph-based API exposes a schema composed of types,
fields, and relationships across types. The API allows a client to
declare what data it needs and let the gateway figure out how to
translate the request into a series of internal API calls.
This approach reduces the development time as there is no need
to introduce different APIs for different use cases, and the clients
are free to specify what they need. There is still an API, though; it
just happens that it’s described with a graph schema. In a way, it’s
as if the gateway grants the clients the ability to perform restricted
queries on its backend APIs. GraphQL4 is the most popular tech-
nology in the space at the time of writing.
Figure 12.4:
1. API client sends a request with credentials to API gateway
2. API gateway tries to authenticate credentials with auth service
3. Auth service validates credentials and replies with a security
token
4. API gateway sends a request to service A including the security
token
5. API gateway sends a request to service B including the security
token
6. API gateway composes the responses from A and B and replies
to the client
12.2.5 Caveats
One of the drawbacks of using an API gateway is that it can be-
come a development bottleneck. As it’s coupled with the services
it's hiding, every new service that is created needs to be wired up to it.
6 https://jwt.io/
7 https://openid.net/connect/
8 https://oauth.net/2/
9 https://www.manning.com/books/microservices-security-in-action
10 https://www.nginx.com/
11 https://azure.microsoft.com/en-gb/services/api-management/
12.3 CQRS
The API gateway's ability to compose internal APIs is quite lim-
ited, and querying data distributed across services can be very in-
efficient if the composition requires large in-memory joins.
Accessing data can also be inefficient for reasons that have nothing
to do with using a microservice architecture:
• The data store used might not be well suited for specific
types of queries. For example, a vanilla relational data store
isn’t optimized for geospatial queries.
• The data store might not scale to handle the number of reads,
which could be several orders of magnitude higher than the
number of writes.
In these cases, decoupling the read path from the write path can
yield substantial benefits. This approach is also referred to as the
Command Query Responsibility Segregation12 (CQRS) pattern.
The two paths can use different data models and data stores that fit
their specific use cases (see Figure 12.5). For example, the read path
could use a specialized data store tailored to a particular query pat-
tern required by the application, like geospatial or graph-based.
To keep the read and write data models synchronized, the write
path pushes updates to the read path whenever the data changes.
External clients could still use the write path for simple queries,
but complex queries are routed to the read path.
This separation adds more complexity to the system. For exam-
ple, when the data model changes, both paths might need to be
updated. Similarly, operational costs increase as there are more
moving parts to maintain and operate. Also, there is an inherent
replication lag between the time a change has been applied on the
write path and the read path has received and applied it, which
makes the system sequentially consistent.
12 https://martinfowler.com/bliki/CQRS.html
Figure 12.5: In this example, the read and write paths are separated
out into different services.
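A minimal sketch of the two paths; the stores, channel, and event shape are hypothetical:

def handle_update_product(write_db, channel, product):
    # Command (write path): persist the change and notify the read path.
    write_db.update(product)
    channel.publish({"type": "product_updated", "product": product})

def project_event(read_store, event):
    # Projector: applies events to a read-optimized store, e.g., a
    # geospatial or full-text index. Reads lag writes by this delay.
    if event["type"] == "product_updated":
        read_store.index(event["product"])

def handle_search(read_store, query):
    # Query (read path).
    return read_store.search(query)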
12.4 Messaging
When an application is decomposed into services, the number of
network calls increases, and with it, the probability that a request’s
destination is momentarily unavailable. So far, we have mostly as-
sumed services communicate using a direct request-response com-
munication style, which requires the destination to be available
and respond promptly. Messaging — a form of indirect communi-
cation — doesn’t have this requirement, though.
Messaging was first introduced when we discussed the implemen-
tation of asynchronous transactions in section 11.4.1. It is a form
of indirect communication in which a producer writes a message
to a channel — or message broker — that delivers the message to
a consumer on the other end.
By decoupling the producer from the consumer, the former gains the ability to make progress regardless of whether the consumer is available, how many consumers there are, or where they are located.
Request-response messaging
This messaging style is similar to the direct request-response style
we are familiar with, albeit with the difference that the request
and response messages flow through channels. The consumer has
a point-to-point request channel from which it reads messages,
while every producer has its own dedicated response channel (see
Figure 12.7).
When a producer writes a message to the request channel, it dec-
orates it with a request id and a reference to its response channel.
After a consumer has read and processed the message, it writes
a reply to the producer’s response channel, tagging it with the re-
quest’s id, which allows the producer to identify the request it be-
longs to.
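As a rough sketch of this flow, using in-process queues as stand-ins for the broker’s channels (the field names are illustrative):

import queue
import uuid

request_channel = queue.Queue()    # point-to-point request channel
response_channel = queue.Queue()   # this producer's dedicated response channel

def producer_send(payload):
    request_id = str(uuid.uuid4())
    request_channel.put({
        "request_id": request_id,       # used to correlate the reply
        "reply_to": response_channel,   # reference to the response channel
        "payload": payload,
    })
    return request_id

def consumer_process_one():
    message = request_channel.get()
    result = message["payload"].upper()   # stand-in for the real work
    message["reply_to"].put({"request_id": message["request_id"],
                             "result": result})

sent_id = producer_send("hello")
consumer_process_one()
reply = response_channel.get()
assert reply["request_id"] == sent_id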
Broadcast messaging
In this messaging style, a producer writes a message to a publish-
subscribe channel to broadcast it to all consumers (see Figure 12.8).
This mechanism is generally used to notify a group of processes
that a specific event has occurred. We have already encountered
this pattern when discussing log-based transactions in section
11.4.1.
12.4.1 Guarantees
A message channel is implemented by a messaging service, like
AWS SQS13 or Kafka14 . The messaging service, or broker, acts as
a buffer for messages. It decouples producers from consumers so
that they don’t need to know the consumers’ addresses, how many
of them there are, or whether they are available.
Different message brokers implement the channel abstraction dif-
ferently depending on the tradeoffs and the guarantees they offer.
For example, you would think that a channel should respect the
insertion order of its messages, but you will find that some im-
plementations, like SQS standard queues15 , don’t offer any strong
ordering guarantees. Why is that?
Because a message broker needs to scale out just like the applica-
tions that use it, its implementation is necessarily distributed. And
when multiple nodes are involved, guaranteeing order becomes much harder, as it requires additional coordination between them.
13 https://aws.amazon.com/sqs/
14 https://kafka.apache.org
15 https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues.html
12.4.3 Failures
When a consumer fails to process a message, the visibility timeout
triggers, and the message is eventually delivered to another con-
sumer. What happens if processing a specific message consistently
fails with an error, though? To guard against the message being
picked up repeatedly in perpetuity, we need to limit the maximum
number of times the same message can be read from the channel.
To enforce a maximum number of retries, the broker can stamp
messages with a counter that keeps track of the number of
times the message has been delivered to a consumer. If the
broker doesn’t support this functionality out of the box, it can be
implemented by the consumers.
Once you have a way to count the number of times a message has
been retried, you still have to decide what to do when the maxi-
mum is reached. A consumer shouldn’t delete a message without
processing it, as that would cause data loss. But what it can do is
remove the message from the channel after writing it to a dead let-
ter channel — a channel that acts as a buffer for messages that have
been retried too many times.
This way, messages that consistently fail are not lost forever but
merely put on the side so that they don’t pollute the main channel,
wasting consumers’ processing resources. A human can then in-
spect these messages to debug the failure, and once the root cause
has been identified and fixed, move them back to the main channel
to be reprocessed.
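A sketch of the consumer-side logic might look like the following; the channel interface (get, ack, put) and the broker-provided delivery counter are assumptions made for illustration:

MAX_DELIVERIES = 5

def consume_one(channel, dead_letter_channel, process):
    message = channel.get()
    if message["delivery_count"] > MAX_DELIVERIES:
        # Too many failed attempts: park the message for later inspection
        # rather than deleting it, which would cause data loss.
        dead_letter_channel.put(message)
        channel.ack(message)    # remove it from the main channel
        return
    try:
        process(message["payload"])
        channel.ack(message)    # processed successfully
    except Exception:
        # Leave the message unacknowledged; once the visibility timeout
        # expires, it will be redelivered with a higher delivery count.
        pass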
12.4.4 Backlogs
One of the main advantages of using a messaging broker is that it
makes the system more robust to outages. Producers can continue
to write messages to a channel even if one or more consumers are
not available or are degraded. As long as the rate of arrival of
messages is lower or equal to the rate they are being deleted from
the channel, everything is great. When that is no longer true, and
consumers can’t keep up with producers, a backlog starts to build
up.
A messaging channel introduces a bi-modal behavior in the sys-
tem. In one mode, there is no backlog, and everything works as
expected. In the other, a backlog builds up, and the system en-
ters a degraded state. The issue with a backlog is that the longer it
builds up, the more resources and/or time it will take to drain it.
There are several reasons for backlogs, for example:
• more producers came online, and/or their throughput in-
creased, and the consumers can’t match their rate;
• the consumers have become slower to process individual
messages, which in turn decreased their deletion rate;
• the consumers fail to process a fraction of the messages,
which are picked up again by other consumers until they
eventually end up in the dead letter channel. This can cause
a negative feedback loop that delays healthy messages and
wastes the consumers’ processing time.
To detect backlogs, you should measure the average time a mes-
sage waits in the channel to be read for the first time. Typically,
brokers attach a timestamp of when the message was first written
to it. The consumer can use that timestamp to compute how long
the message has been waiting in the channel by comparing it to the
timestamp taken when the message was read. Although the two
timestamps have been generated by two physical clocks that aren’t
perfectly synchronized (see section 8.1), the measure still provides
a good indication of the backlog.
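For example, assuming the broker stamps each message with an enqueue timestamp (the attribute name here is made up), the consumer can compute the wait time like this:

import time

def message_wait_time(message):
    # The two timestamps come from different physical clocks, so treat
    # the result as an approximation rather than an exact measure.
    return max(0.0, time.time() - message["enqueued_at"])

Feeding this value into a metric, like an average or a percentile, gives you a backlog signal to alert on.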
Chapter 13
Partitioning
Now it’s time to change gears and dive into another tool you have at your disposal to scale out applications — partitioning, or sharding.
When a dataset no longer fits on a single node, it needs to be par-
titioned across multiple nodes. Partitioning is a general technique
that can be used in a variety of circumstances, like sharding TCP
connections across backends in a load balancer. To ground the dis-
cussion in this chapter, we will anchor it to the implementation of
a sharded key-value store.
But how is a key mapped to a partition in the first place? At a high level, there are two ways to implement the mapping: range partitioning or hash partitioning.
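As a minimal illustration of the hash variant, a key can be mapped to a partition by hashing it with a stable hash function and taking the result modulo the number of partitions:

import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

assert partition_for("user-42", 16) == partition_for("user-42", 16)

Note that with a plain modulo mapping, changing the number of partitions reshuffles most keys, which is one of the reasons rebalancing needs care.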
13.2 Rebalancing
When the number of requests to the data store becomes too large,
or the dataset’s size becomes too large, the number of nodes serv-
ing partitions needs to be increased. Similarly, if the dataset’s size
keeps shrinking, the number of nodes can be decreased to reduce
costs. The process of adding and removing nodes to balance the
system’s load is called rebalancing.
Rebalancing needs to be implemented in such a way as to minimize disruption to the data store, which needs to continue serving requests. Hence, the amount of data transferred during rebalancing needs to be minimized.
Chapter 14
Duplication
Now it’s time to change gears and dive into another tool you have
at your disposal to design horizontally scalable applications — du-
plication.
As the data going out of the servers usually has a greater volume
than the data coming in, there is a way for servers to bypass the
LB and respond directly to the clients using a mechanism called
direct server return7 , but this is beyond the scope of this section.
Because the LB is communicating directly with the servers, it can
detect unavailable ones (e.g., with a passive health check) and automatically take them out of the pool.
7 https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236
If a L7 LB sounds a lot like an API gateway, it’s because they are both HTTP proxies, and therefore their responsibilities can be blurred.
A L7 LB is typically used as the backend of a L4 LB to load bal-
ance requests sent by external clients from the internet (see Figure
14.3). Although L7 LBs offer more functionality than L4 LBs, they
have a lower throughput in comparison, which makes L4 LBs bet-
ter suited to protect against certain DDoS attacks, like SYN floods.
Geo load balancing is an extension to DNS that considers the location of the client, inferred from its IP, and returns a list of the geographically closest L4 LB VIPs
(see Figure 14.4). The LB also needs to take into account the capac-
ity of each data center and its health status.
Figure 14.4: Geo load balancing infers the location of the client
from its IP
14.2 Replication
If the servers behind a load balancer are stateless, scaling out is as
simple as adding more servers. But when there is state involved,
some form of coordination is required.
Replication is the process of storing a copy of the same data in
multiple nodes. If the data is static, replication is easy: just copy
the data to multiple nodes, add a load balancer in front of it, and
you are done. The challenge is dealing with dynamically changing
data, which requires coordination to keep it in sync.
Replication and sharding are techniques that are often combined,
but are orthogonal to each other. For example, a distributed data
store can divide its data into N partitions and distribute them over
K nodes. Then, a state-machine replication algorithm like Raft can
be used to replicate each partition R times (see Figure 14.5).
We have already discussed one way of replicating data in chap-
ter 10. This section will take a broader, but less detailed, look at
replication and explore different approaches with varying trade-
offs. To keep things simple, we will assume that the dataset is
small enough to fit on a single node, and therefore no partitioning
is needed.
clients, and there are edge cases that affect consistency even when W + R > N is satisfied. For example, if a write succeeded on fewer than W replicas and failed on the others, the replicas are left in an inconsistent state.
14.3 Caching
Let’s take a look now at a very specific type of replication that only
offers best effort guarantees: caching.
Suppose a service requires retrieving data from a remote depen-
dency, like a data store, to handle its requests. As the service scales
out, the dependency needs to do the same to keep up with the
ever-increasing load. A cache can be introduced to reduce the load
on the dependency and improve the performance of accessing the
data.
A cache is a high-speed storage layer that temporarily buffers re-
sponses from downstream dependencies so that future requests
can be served directly from it — it’s a form of best effort replication.
For a cache to be cost-effective, there should be a high probability
that requested data can be found in it. This requires the data access
pattern to have a high locality of reference, like a high likelihood
of accessing the same data again and again over time.
14.3.1 Policies
When a cache miss occurs13 , the missing data item has to be re-
quested from the remote dependency, and the cache has to be up-
dated with it. This can happen in two ways:
• The client, after getting an “item-not-found” error from the
cache, requests the data item from the dependency and up-
dates the cache. In this case, the cache is said to be a side
cache.
• Alternatively, if the cache is inline, the cache communicates
directly with the dependency and requests the missing data
item. In this case, the client only ever accesses the cache.
Because a cache has a maximum capacity for holding entries, an
entry needs to be evicted to make room for a new one when its ca-
pacity is reached. Which entry to remove depends on the eviction
13 A cache hit occurs when the requested data can be found in the cache, while a cache miss occurs when it cannot.
policy used by the cache and the client’s access pattern. One com-
monly used policy is to evict the least recently used (LRU) entry.
A cache also has an expiration policy that dictates for how long
to store an entry. For example, a simple expiration policy defines
the maximum time to live (TTL) in seconds. When a data item has
been in the cache for longer than its TTL, it expires and can safely
be evicted.
The expiration doesn’t need to occur immediately, though, and it
can be deferred to the next time the entry is requested. In fact, that
might be preferable — if the dependency is temporarily unavail-
able, and the cache is inline, it can opt to return an entry with an
expired TTL to the client rather than an error.
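The following is a small sketch of a side cache that combines an LRU eviction policy with a TTL expiration policy; the fetch callback stands in for the remote dependency:

import time
from collections import OrderedDict

class SideCache:
    def __init__(self, max_entries=1024, ttl_seconds=60):
        self._entries = OrderedDict()   # key -> (value, written_at)
        self._max_entries = max_entries
        self._ttl = ttl_seconds

    def get(self, key, fetch_from_dependency):
        entry = self._entries.get(key)
        if entry is not None:
            value, written_at = entry
            if time.time() - written_at <= self._ttl:
                self._entries.move_to_end(key)   # mark as recently used
                return value
            del self._entries[key]               # expired: evict lazily
        value = fetch_from_dependency(key)       # cache miss
        self._entries[key] = (value, time.time())
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)    # evict the LRU entry
        return value

cache = SideCache(max_entries=2, ttl_seconds=30)
print(cache.get("user-1", lambda key: f"value-for-{key}"))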
14 Remember when we talked about the bi-modal behavior of message channels in section 12.4? As we will learn later, you always want to minimize the number of modes in your applications to make them simple to understand and operate.
Part IV
Resiliency
Introduction
As you scale out your applications, any failure that can happen will
eventually happen. Hardware failures, software crashes, memory
leaks — you name it. The more components you have, the more
failures you will experience.
Suppose you have a buggy service that leaks 1 MB of memory on
average every hundred requests. If the service does a thousand
requests per day, chances are you will restart the service to deploy
a new build before the leak reaches any significant size. But if your
service is doing 10 million requests per day, then by the end of the
day you lose 100 GB of memory! Eventually, the servers won’t have enough memory available, and they will start to thrash due to the constant swapping of pages in and out of disk.
This nasty behavior is caused by cruel math; given an operation that
has a certain probability of failing, the total number of failures in-
creases with the total number of operations performed. In other
words, the more you scale out your system to handle more load,
and the more operations and moving parts there are, the more fail-
ures your systems will experience.
Remember when we talked about availability and “nines” in chap-
ter 1? Well, to guarantee just two nines, your system can be un-
available for up to 15 min a day. That’s very little time to take any
manual action. If you strive for 3 nines, then you only have 43 min-
utes per month available. Although you can’t escape cruel math,
you can mitigate it by implementing self-healing mechanisms to
reduce the impact of failures.
Single points of failure should be identified when the system is architected, before they can cause any harm. The best way to detect
them is to examine every component of the system and ask what
would happen if that component were to fail. Some single points
of failure can be architected away, e.g., by introducing redundancy,
while others can’t. In that case, the only option left is to minimize
the blast radius.
On top of that, the code you write isn’t the only one accessing mem-
ory, threads, and sockets. The libraries your application depends
on access the same resources, and they can do all kinds of shady
things. Without digging into their implementation, assuming it’s
open in the first place, you can’t be sure whether they can wreak
havoc or not.
Figure 15.1: Two replicas behind an LB; each is handling half the
load.
To address a failure, you can either find a way to reduce the prob-
ability of it happening, or reduce its impact.
Chapter 16
Downstream resiliency
16.1 Timeout
When you make a network call, you can configure a timeout to fail
the request if there is no response within a certain amount of time.
If you make the call without setting a timeout, you tell your code
that you are 100% confident that the call will succeed. Would you
really take that bet?
Unfortunately, some network APIs don’t have a way to set a time-
out in the first place. When the default timeout is infinity, it’s all
too easy for a client to shoot itself in the foot. As mentioned ear-
lier, network calls that don’t return lead to resource leaks at best.
Timeouts limit and isolate failures, stopping them from cascading
to the rest of the system. And they are useful not just for network
calls, but also for requesting a resource from a pool and for syn-
chronization primitives like mutexes.
To drive the point home on the importance of setting timeouts, let’s
take a look at some concrete examples. JavaScript’s XMLHttpRequest is the web API to retrieve data from a server asynchronously, and its default timeout is zero, which means there is no timeout at all.
Things aren’t much rosier for Python. The popular requests library
uses a default timeout of infinity5 :
# No timeout by default: this call could block forever.
response = requests.get('https://github.com/')
# Pass an explicit timeout (in seconds) to bound the call.
response = requests.get('https://github.com/', timeout=10)
Modern HTTP clients for Java and .NET do a much better job and
usually come with default timeouts. For example, .NET Core Http-
Client has a default timeout of 100 seconds7 . It’s lax but better than
not setting a timeout at all.
As a rule of thumb, always set timeouts when making network
calls, and be wary of third-party libraries that do network calls or
use internal resource pools but don’t expose settings for timeouts.
And if you build libraries, always set reasonable default timeouts
and make them configurable for your clients.
Ideally, you should set your timeouts based on the desired false
timeout rate8 . Say you want to have about 0.1% false timeouts; to
achieve that, you should set the timeout to the 99.9th percentile
of the remote call’s response time, which you can measure empiri-
cally.
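For instance, given a sample of measured response times, the timeout can be derived from the desired false timeout rate roughly like this (the latencies here are made up):

def timeout_from_samples(latencies_seconds, false_timeout_rate=0.001):
    ordered = sorted(latencies_seconds)
    index = min(len(ordered) - 1, int(len(ordered) * (1 - false_timeout_rate)))
    return ordered[index]   # e.g., the 99.9th percentile

samples = [0.05, 0.07, 0.06, 0.30, 0.08] * 200   # illustrative measurements
print(timeout_from_samples(samples))             # roughly the p99.9 latency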
You also want to have good monitoring in place to measure the en-
tire lifecycle of your network calls, like the duration of the call, the
status code received, and if a timeout was triggered. We will talk
about monitoring later in the book, but the point I want to make
here is that you have to measure what happens at the integration
5 https://requests.readthedocs.io/en/master/user/quickstart/#timeouts
7 https://docs.microsoft.com/en-us/dotnet/api/system.net.http.httpclient.timeout?view=netcore-3.1#remarks
8 https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
16.2 Retry
You know by now that a client should configure a timeout when
making a network request. But, what should it do when the re-
quest fails, or the timeout fires? The client has two options at that
point: it can either fail fast or retry the request at a later time.
If the failure or timeout was caused by a short-lived connectivity
issue, then retrying after some backoff time has a high probability of
succeeding. However, if the downstream service is overwhelmed,
retrying immediately will only make matters worse. This is why
retrying needs to be slowed down with increasingly longer delays
between the individual retries until either a maximum number of
retries is reached or a certain amount of time has passed since the
initial request.
9 We talked about this in section 14.1.3 when discussing the sidecar pattern and the service mesh.
For example, if the cap is set to 8 seconds, and the initial backoff
duration is 2 seconds, then the first retry delay is 2 seconds, the
second is 4 seconds, the third is 8 seconds, and any further delay
will be capped to 8 seconds.
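In code, a capped exponential backoff matching the example above could look like the following sketch; the optional random jitter is not part of the example but is a common addition that addresses the synchronized retries discussed next:

import random

def retry_delay(attempt, initial_backoff=2.0, cap=8.0, jitter=False):
    delay = min(cap, initial_backoff * (2 ** attempt))
    return random.uniform(0, delay) if jitter else delay

print([retry_delay(attempt) for attempt in range(5)])   # [2.0, 4.0, 8.0, 8.0, 8.0]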
Although exponential backoff does reduce the pressure on the
downstream dependency, there is still a problem. When the
downstream service is temporarily degraded, it’s likely that mul-
tiple clients see their requests failing around the same time. This
causes the clients to retry simultaneously, hitting the downstream
service with load spikes that can further degrade it, as shown in
Figure 16.1.
And if the pressure gets bad enough, this behavior can easily bring
down the whole system. That’s why when you have long depen-
dency chains, you should only retry at a single level of the chain,
and fail fast in all the other ones.
Chapter 17
Upstream resiliency
Figure 17.1: The channel smooths out the load for the consuming
service.
17.3 Rate-limiting
Rate-limiting, or throttling, is a mechanism that rejects a request
when a specific quota is exceeded. A service can have multiple
quotas, like for the number of requests seen, or the number of bytes
received within a time interval. Quotas are typically applied to
specific users, API keys, or IP addresses.
For example, if a service with a quota of 10 requests per second per API key receives on average 12 requests per second from a specific API key, it will, on average, reject 2 requests per second tagged with that API key.
When a service rate-limits a request, it needs to return a response
with a particular error code so that the sender knows that it failed
because a quota has been breached. For services with HTTP APIs, the 429 (Too Many Requests) status code is typically used for this purpose.
Figure 17.3: When a new request comes in, its timestamp is used
to determine the bucket it belongs to.
Rather than updating the shared database on every request, each service instance can batch bucket updates in memory and flush them asynchronously (see Figure 17.5). This reduces the shared state’s accuracy, but it’s a good trade-off, as it reduces the load on the database and the number of requests sent to it.
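A single-node sketch of the bucket approach might look like this; the quota is illustrative, and sharing or flushing the counters is left out for brevity:

import time
from collections import defaultdict

QUOTA_PER_SECOND = 10
_buckets = defaultdict(int)    # (api_key, bucket) -> number of requests seen

def allow_request(api_key, now=None):
    now = time.time() if now is None else now
    bucket = int(now)          # the request's timestamp determines its bucket
    _buckets[(api_key, bucket)] += 1
    return _buckets[(api_key, bucket)] <= QUOTA_PER_SECOND

# 12 requests within the same second: the last 2 are rejected.
results = [allow_request("key-1", now=1000.0) for _ in range(12)]
assert results.count(False) == 2

A real implementation would also purge old buckets and, as described above, periodically reconcile the counters with the shared data store.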
17.4 Bulkhead
The goal of the bulkhead pattern is to isolate a fault in one part of a
service from taking the entire service down with it. The pattern is
named after the partitions of a ship’s hull. If one partition is dam-
aged and fills up with water, the leak is isolated to that partition
and doesn’t spread to the rest of the ship.
Some clients can create much more load on a service than others.
Without any protections, a single greedy client can hammer the sys-
tem and degrade every other client. We have seen some patterns,
like rate-limiting, that help prevent a single client from using more
resources than it should. But rate-limiting is not bulletproof. You
can rate-limit clients based on the number of requests per second;
but what if a client sends very heavy or poisonous requests that
cause the servers to degrade? In that case, rate-limiting wouldn’t
help much as the issue is intrinsic with the requests sent by that
client, which could eventually lead to degrading the service for
every other client.
When everything else fails, the bulkhead pattern provides guar-
anteed fault isolation by design. The idea is to partition a shared
resource, like a pool of service instances behind a load balancer,
and assign each user of the service to a specific partition so that its
requests can only utilize resources belonging to the partition it’s
assigned to.
Consequently, a heavy or poisonous user can only degrade the re-
quests of users within the same partition. For example, suppose
there are 10 instances of a service behind a load balancer, which are
divided into 5 partitions (see Figure 17.6). In that case, a problem-
atic user can only ever impact 20 percent of the service’s instances.
The problem is that the unlucky users who happen to be on the
same partition as the problematic one are fully impacted. Can we
do better?
The answer is yes: instead of physical partitions, each user can be assigned to a virtual partition composed of a random, but stable, subset of instances. This makes it much more unlikely for another user to be allocated to the exact same virtual partition.
In our example, we can extract 45 combinations of 2 instances (vir-
tual partitions) from a pool of 10 instances. When a virtual par-
tition is degraded, other virtual partitions are only partially im-
pacted as they don’t fully overlap (see Figure 17.7). If you combine
this with a health check on the load balancer, and a retry mecha-
nism on the client side, what you get is much better fault isolation.
Figure 17.7: Virtual partitions are far less likely to fully overlap
with each other.
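A sketch of how a user could be deterministically mapped to a virtual partition follows (names and sizes are illustrative):

import hashlib
import random

INSTANCES = [f"instance-{i}" for i in range(10)]
PARTITION_SIZE = 2   # 2 out of 10 instances: 45 possible combinations

def virtual_partition(user_id):
    seed = int.from_bytes(hashlib.md5(user_id.encode("utf-8")).digest()[:8], "big")
    # A seeded RNG makes the assignment random but stable for a given user.
    return random.Random(seed).sample(INSTANCES, PARTITION_SIZE)

print(virtual_partition("greedy-user"))    # always the same pair of instances
print(virtual_partition("another-user"))   # very likely a different pair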
17.5 Health checks
A simple form of health check detects whether a process is degraded: the process compares one or more local metrics, like memory available or remaining disk space, with some fixed upper- and
lower-bound thresholds. When a metric is above an upper-bound
threshold, or below a lower-bound one, the process reports itself
as unhealthy.
A more advanced, and also harder check to get right, is the depen-
dency health check. This type of health check detects a degradation
caused by a remote dependency, like a database, that needs to be
accessed to handle incoming requests. The process measures the
response time, timeouts, and errors of the remote calls directed to
the dependency. If any measure breaks a predefined threshold,
the process reports itself as unhealthy to reduce the load on the
downstream dependency.
But here be dragons5 : if the downstream dependency is temporar-
ily unreachable, or the health-check has a bug, then it’s possible
that all the processes behind the load balancer fail the health check.
In that case, a naive load balancer would just take all service in-
stances out of rotation, bringing the entire service down!
A smart load balancer instead detects that a large fraction of the
service instances is being reported as unhealthy and considers the
health check to no longer be reliable. Rather than continuing to re-
move processes from the pool, it starts to ignore the health-checks
altogether so that new requests can be sent to any process in the
pool.
17.6 Watchdog
One of the main reasons to build distributed services is to be able
to withstand single-process failures. Since you are designing your
system under the assumption that any process can crash at any
time, your service needs to be able to deal with that eventuality.
For a process’s crash not to affect your service’s health, you should ideally ensure that:
5 https://aws.amazon.com/builders-library/implementing-health-checks/
• there are other processes that are identical to the one that
crashed that can handle incoming requests;
• requests are stateless and can be served by any process;
• any non-volatile state is stored on a separate and dedicated
data store so that when the process crashes its state isn’t lost;
• all shared resources are leased so that when the process
crashes, the leases expire and the resources can be accessed
by other processes;
• the service is always running slightly over-scaled to with-
stand the occasional individual process failures.
Because crashes are inevitable and your service is prepared for
them, you don’t have to come up with complex recovery logic
when a process gets into some weird degraded state — you can just
let it crash. A transient but rare failure can be hard to diagnose and
fix. Crashing and restarting the affected process gives operators
maintaining the service some breathing room until the root-cause
can be identified, giving the system a kind of self-healing property.
Imagine that a latent memory leak causes the available memory to
decrease over time. When a process doesn’t have more physical
memory available, it starts to swap back and forth to the page file
on disk. This swapping is extremely expensive and degrades the
process’s performance dramatically. If left unchecked, the memory
leak would eventually bring all processes running the service to their knees. Would you rather have the processes detect they are
degraded and restart themselves, or try to debug the root cause for
the degradation at 3 AM?
To implement this pattern, a process should have a separate back-
ground thread that wakes up periodically — a watchdog — that
monitors its health. For example, the watchdog could monitor
the available physical memory left. When any monitored metric
breaches a configured threshold, the watchdog considers the pro-
cess degraded and deliberately restarts it.
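A bare-bones sketch of such a watchdog follows; reading the available memory via psutil is an assumption, and crashing with os._exit relies on a supervisor or orchestrator restarting the process:

import os
import threading
import time

import psutil   # third-party library used here just to read a local metric

MIN_AVAILABLE_MEMORY_BYTES = 200 * 1024 * 1024
CHECK_INTERVAL_SECONDS = 30

def watchdog_loop():
    while True:
        if psutil.virtual_memory().available < MIN_AVAILABLE_MEMORY_BYTES:
            os._exit(1)   # crash on purpose; the supervisor restarts the process
        time.sleep(CHECK_INTERVAL_SECONDS)

threading.Thread(target=watchdog_loop, daemon=True).start()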
The watchdog’s implementation needs to be well-tested and moni-
tored since a bug could cause the processes to restart continuously.
Part V
When all resiliency mechanisms fail, human operators are the last
line of defense. Historically, developers, testers, and operators
were part of different teams. The developers handed over their
software to a team of QA engineers responsible for testing it. When
the software passed that stage, it moved to an operations team
responsible for deploying it to production, monitoring it, and re-
sponding to alerts.
This model is being phased out in the industry as it has become
commonplace for the development team to also be responsible for
testing and operating the software they write. This forces the de-
velopers to embrace an end-to-end view of their applications, ac-
knowledging that faults are inevitable and need to be accounted
for.
Chapter 18 describes the different types of tests — unit, integration,
and end-to-end tests — you can leverage to increase the confidence
that your distributed applications work as expected.
Chapter 19 dives into continuous delivery and deployment
pipelines used to release changes safely and efficiently to produc-
tion.
Chapter 20 discusses how to use metrics and service-level indica-
tors to monitor the health of distributed systems. It then describes
how to define objectives that trigger alerts when breached. Finally,
the chapter lists best practices for dashboard design.
Chapter 21 introduces the concept of observability and how it re-
lates to monitoring. Then it describes how traces and logs can help
developers debug their systems.
Chapter 18
Testing
18.1 Scope
Tests come in different shapes and sizes. To begin with, we need
to distinguish between code paths a test is actually testing (aka
system under test or SUT) from the ones that are being run. The
SUT represents the scope of the test, and depending on it, the test
can be categorized as either a unit test, an integration test, or an
end-to-end test.
A unit test validates the behavior of a small part of the codebase,
like an individual class. A good unit test should be relatively static
in time and change only when the behavior of the SUT changes —
refactoring, fixing a bug, or adding a new feature shouldn’t require
a unit test to change. To achieve that, a unit test should:
• use only the public interfaces of the SUT;
• test for state changes in the SUT (not a predetermined sequence of actions);
• test for behaviors, i.e., how the SUT handles a given input
when it’s in a specific state.
An integration test has a larger scope than a unit test, since it ver-
ifies that a service can interact with its external dependencies as
expected. This definition is not universal, though, because inte-
gration testing has different meanings for different people.
Martin Fowler2 makes the distinction between narrow and broad
integration tests. A narrow integration test exercises only the code
paths of a service that communicate with an external dependency,
like the adapters and their supporting classes. In contrast, a broad
integration test exercises code paths across multiple live services.
In the rest of the chapter, we will refer to these broader integration
tests as end-to-end tests. An end-to-end test validates behavior that
spans multiple services in the system, like a user-facing scenario.
2 https://martinfowler.com/bliki/IntegrationTest.html
18.2 Size
The size of a test3 reflects how much computing resources it needs
to run, like the number of nodes. Generally, that depends on
how realistic the environment is where the test runs. Although
the scope and size of a test tend to be correlated, they are distinct
concepts, and it helps to separate them.
A small test runs in a single process and doesn’t perform any block-
ing calls or I/O. It’s very fast, deterministic, and has a very small
probability of failing intermittently.
An intermediate test runs on a single node and performs local I/O,
like reads from disk or network calls to localhost. This introduces
more room for delays and non-determinism, increasing the likeli-
hood of intermittent failures.
A large test requires multiple nodes to run, introducing even more
non-determinism and longer delays.
Unsurprisingly, the larger a test is, the longer it takes to run and
the flakier it becomes. This is why you should write the smallest
possible test for a given behavior. But how do you reduce the size
3 https://www.amazon.co.uk/dp/B0859PF5HB
Suppose, for example, that the SUT integrates with an external API whose production endpoint would require the test to issue real transactions. Fortunately, the API has
a different endpoint that offers a playground environment, which
the test can use without creating real transactions. If there was no
playground environment available and no fake either, we would
have to resort to stubbing or mocking.
In this case, we have cut the test’s size considerably, while keeping
its scope mostly intact.
Here is a more nuanced example. Suppose we need to test whether
purging the data belonging to a specific user across the entire ap-
plication stack works as expected. In Europe, this functionality is
mandated by law (GDPR), and failing to comply with it can result
in fines up to 20 million euros or 4% annual turnover, whichever is
greater. In this case, because the risk for the functionality silently
breaking is too high, we want to be as confident as possible that
the functionality is working as expected. This warrants the use of
an end-to-end test that runs in production and uses live services
rather than test doubles.
Chapter 19
Continuous delivery and deployment
Once a change and its newly introduced tests have been merged to a repository, the change needs to be released to production.
When releasing a change requires a manual process, it won’t happen frequently. This means that several changes, possibly over days
or even weeks, end up being batched and released together. This
makes it harder to pinpoint the breaking change1 when a deploy-
ment fails, creating interruptions for the whole team. The devel-
oper who initiated the release also needs to keep an eye on it by
monitoring dashboards and alerts to ensure that it’s working as
expected, or roll it back if it isn’t.
Manual deployments are a terrible use of engineering time. The
problem gets further exacerbated when there are many services.
Eventually, the only way to release changes safely and efficiently
is to automate the entire process. Once a change has been merged
to a repository, it should automatically be rolled out to production
safely. The developer is then free to context-switch to their next
task, rather than shepherding the deployment. The whole release
1 There could be multiple breaking changes, actually.
It all starts with a pull request (PR) submitted for review by a devel-
oper to a repository. When the PR is submitted for review, it needs
CHAPTER 19. CONTINUOUS DELIVERY AND DEPLOYMENT 202
19.2 Pre-production
During this stage, the artifact is deployed and released to a syn-
thetic pre-production environment. Although this environment
lacks the realism of production, it’s useful to verify that no hard
failures are triggered (e.g., a null pointer exception at startup due
to a missing configuration setting) and that end-to-end tests suc-
ceed. Because releasing a new version to pre-production requires
significantly less time than releasing it to production, bugs can be
detected earlier.
You can even have multiple pre-production environments, start-
ing with one created from scratch for each artifact and used to run
simple smoke tests, to a persistent one similar to production that
receives a small fraction of mirrored requests from it. AWS, for ex-
ample, uses multiple pre-production environments3 (Alpha, Beta,
and Gamma).
A service released to a pre-production environment should call the
production endpoints of its external dependencies to make the en-
vironment as stable as possible; it could call the pre-production
endpoints of other services owned by the same team, though.
Ideally, the CD pipeline should assess the artifact’s health in
pre-production using the same health signals used in production.
Metrics, alerts, and tests used in pre-production should be equiv-
alent to those used in production to avoid the former becoming a second-class citizen with sub-par health coverage.
19.3 Production
Once an artifact has been rolled out to pre-production successfully,
the CD pipeline can proceed to the final stage and release the arti-
fact to production. It should start by releasing it to a small number
of production instances at first4 . The goal is to surface problems
3 https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/
4 This is also referred to as canary testing.
19.4 Rollbacks
After each step, the CD pipeline needs to assess whether the arti-
fact deployed is healthy, or else stop the release and roll it back. A
variety of health signals can be used to make that decision, such
as:
• the result of end-to-end tests;
• health metrics like latencies and errors;
• alerts;
• and health endpoints.
Monitoring just the health signals of the service being rolled out
is not enough. The CD pipeline should also monitor the health of
upstream and downstream services to detect any indirect impact
of the rollout. The pipeline should allow enough time to pass be-
tween one step and the next (bake time) to ensure that it was suc-
cessful, as some issues can appear only after some time has passed.
For example, a performance degradation could be visible only at
peak time.
The CD pipeline can further gate the bake time on the number of
requests seen for specific API endpoints to guarantee that the API
surface has been properly exercised. To speed up the release, the
bake time can be reduced after each step succeeds and confidence
is built up.
When a health signal reports a degradation, the CD pipeline stops.
At that point, it can either roll back the artifact automatically, or
trigger an alert to engage the engineer on-call, who needs to de-
cide whether a rollback is warranted or not5 . Based on their input,
the CD pipeline retries the stage that failed (e.g., perhaps because
something else was going into production at the time), or rolls back
the release entirely. The operator can also stop the pipeline and
wait for a new artifact with a hotfix to be rolled forward. This
might be necessary if the release can’t be rolled back because a
backward-incompatible change has been introduced.
Since rolling forward is much riskier than rolling back, any change
introduced should always be backward compatible as a rule of
thumb. The most common cause for backward-incompatibility is
changing the serialization format used either for persistence or IPC
purposes.
To safely introduce a backward-incompatible change, it needs to
be broken down into multiple backward-compatible changes6 . For
example, suppose the messaging schema between a producer and
a consumer service needs to change in a backward incompatible
way. In this case, the change is broken down into three smaller
changes that can individually be rolled back safely:
• In the prepare change, the consumer is modified to support both the new and the old messaging format.
5 CD pipelines can be configured to run only during business hours to minimize the disruption to on-call engineers.
6 https://aws.amazon.com/builders-library/ensuring-rollback-safety-during-deployments/
Chapter 20
Monitoring
A simple form of monitoring, known as blackbox monitoring, uses scripts that periodically send test requests to the system’s public endpoints and report how long they took and whether they were successful. These scripts are deployed
in the same regions the application’s users are and hit the same
endpoints they do. Because they exercise the system’s public sur-
face from the outside, they can catch issues that aren’t visible from
within the application, like connectivity problems. These scripts
are also useful to detect issues with APIs that aren’t exercised of-
ten by users.
Blackbox monitoring is good at detecting the symptoms when
something is broken; in contrast, white-box monitoring can help
identify the root cause of known hard-failure modes before users
are impacted. As a rule of thumb, if you can’t design away a
hard-failure mode, you should add monitoring for it. The longer
a system has been around, the better you will understand how it
can fail and what needs to be monitored.
20.1 Metrics
A metric is a numeric representation of information measured over
a time interval and represented as a time-series, like the number
of requests handled by a service. Conceptually, a metric is a list
of samples, where each sample is represented by a floating-point
number and a timestamp.
Modern monitoring systems allow a metric to be tagged with a set
of key-value pairs called labels, which increases the dimensionality
of the metric. Essentially, every distinct combination of labels is a
different metric. This has become a necessity as modern services
can have a large amount of metadata associated with each metric,
like datacenter, cluster, node, pod, service, etc. High-cardinality
metrics make it easy to slice and dice the data, and eliminate the
instrumentation cost of manually creating a metric for each label
combination.
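For example, a labeled sample could be represented as follows when it’s emitted; the field names are illustrative rather than those of any specific monitoring system:

import time

def emit_metric(name, value, labels):
    sample = {
        "name": name,
        "value": float(value),
        "timestamp": int(time.time()),
        "labels": labels,   # every distinct label combination is its own series
    }
    print(sample)           # a real client would buffer and ship this to a backend

emit_metric("http_requests_total", 1,
            {"datacenter": "eu-west", "service": "gateway", "statusCode": "200"})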
A service should emit metrics about its load, internal state, and
availability and performance of downstream service dependen-
cies. Combined with the metrics emitted by downstream services,
this allows operators to identify problems quickly. This requires instrumenting the code, as the questions in the following handler snippet suggest:
def get_resource(self, id):
    resource = self._repository.get(id)
    # Did the remote call fail, and if so, why?
    # Did the remote call timeout?
    # How long did the call take?
    self._cache[id] = resource
    # What's the size of the cache?
    return resource
    # How long did it take for the handler to run?
"timestamp": 1614438079
}
Figure 20.1: An SLI defined as the ratio of good events over the
total number of events.
SLOs are helpful for alerting purposes and help the team prioritize repair tasks against feature work. For example, the team can agree
that when an error budget has been exhausted, repair items will
take precedence over new features until the SLO is restored. Also,
an incident’s importance can be measured by how much of the
error budget has been burned. An incident that burned 20% of the error budget warrants more scrutiny than one that burned only 1%.
Smaller time windows force the team to act quicker and priori-
tize bug fixes and repair items, while longer windows are better
suited to make long-term decisions about which projects to invest
in. Therefore it makes sense to have multiple SLOs with different
window sizes.
How strict should SLOs be? Choosing the right target range
is harder than it looks. If it’s too loose, you won’t detect user-
facing issues; if it’s too strict, you will waste engineering time
micro-optimizing and get diminishing returns. Even if you
could guarantee 100% reliability for your system, you can’t make
guarantees for anything that your users depend on to access
your service that is outside your control, like their last-mile
connection. Thus, 100% reliability doesn’t translate into a 100%
reliable experience for users.
When setting the target range for your SLOs, start with comfort-
able ranges and tighten them as you build up confidence. Don’t
just pick targets that your service meets today that might become
unattainable in a year after the load increases; work backward
from what users care about. In general, anything above 3 nines
of availability is very costly to achieve and provides diminishing
returns.
How many SLOs should you have? You should strive to keep
things simple and have as few as possible that provide a good
enough indication of the desired service level. SLOs should also
be documented and reviewed periodically. For example, suppose
you discover that a specific user-facing issue generated lots of sup-
port tickets, but none of your SLOs showed any degradations. In
that case, they are either too relaxed, or you are not measuring
something that you should.
SLOs need to be agreed on with multiple stakeholders. Engineers
need to agree that the targets are achievable without excessive toil.
If the error budget is burning too rapidly or has been exhausted,
repair items will take priority over features. Product managers
have to agree that the targets guarantee a good user experience.
As Google’s SRE book11 mentions: “if you can’t ever win a conver-
sation about priorities by quoting a particular SLO, it’s probably
not worth having that SLO”.
Users can become over-reliant on the actual behavior of your ser-
vice rather than the published SLO. To mitigate that, you can con-
sider injecting controlled failures12 in production — also known as
chaos testing — to “shake the tree” and ensure the dependencies
can cope with the targeted service level and are not making un-
realistic assumptions. As an added benefit, injecting faults helps
validate that resiliency mechanisms work as expected.
11 https://sre.google/sre-book/service-level-objectives/
12 https://en.wikipedia.org/wiki/Chaos_engineering
20.4 Alerts
Alerting is the part of a monitoring system that triggers an action
when a specific condition happens, like a metric crossing a thresh-
old. Depending on the severity and the type of the alert, the action
triggered can range from running some automation, like restarting
a service instance, to ringing the phone of a human operator who
is on-call. In the rest of this section, we will be mostly focusing on
the latter case.
For an alert to be useful, it has to be actionable. The operator
shouldn’t spend time digging into dashboards to assess the alert’s
impact and urgency. For example, an alert signaling a spike in
CPU usage is not useful as it’s not clear whether it has any impact
on the system without further investigation. On the other hand,
an SLO is a good candidate for an alert because it quantifies its
impact on the users. The SLO’s error budget can be monitored to
trigger an alert whenever a large fraction of it has been consumed.
Before we can discuss how to define an alert, it’s important to un-
derstand that there is a trade-off between its precision and recall.
Formally, precision is the fraction of significant events over the to-
tal number of alerts, while recall is the ratio of significant events
that triggered an alert. Alerts with low precision are noisy and of-
ten not actionable, while alerts with low recall don’t always trigger
during an outage. Although it would be nice to have 100% preci-
sion and recall, you have to make a trade-off since improving one
typically lowers the other.
Suppose you have an availability SLO of 99% over 30 days, and
you would like to configure an alert for it. A naive way would
be to trigger an alert whenever the availability goes below 99%
within a relatively short time window, like an hour. But how much
of the error budget has actually been burned by the time the alert
triggers?
Because the time window of the alert is one hour, and the SLO error budget is defined over 30 days, the percentage of error budget that has been spent when the alert triggers is 1 hour / 30 days = 0.14%. Is it really
critical to be notified that 0.14% of the SLO’s error budget has been
burned? Probably not. In this case, you have high recall, but low
precision.
You can improve the alert’s precision by increasing the amount of
time its condition needs to be true. The problem with it is that
now the alert will take longer to trigger, which will be an issue
when there is an actual outage. The alternative is to alert based on
how fast the error budget is burning, also known as the burn rate,
which lowers the detection time.
The burn rate is defined as the percentage of the error budget
consumed over the percentage of the SLO time window that has
elapsed — it’s the rate of increase of the error budget. Concretely,
for our SLO example, a burn rate of 1 means the error budget will
be exhausted precisely in 30 days; if the rate is 2, then it will be 15
days; if the rate is 3, it will be 10 days, and so on.
By rearranging the burn rate’s equation, you can derive the alert
threshold that triggers when a specific percentage of the error bud-
get has been burned. For example, to have an alert trigger when
an error budget of 2% has been burned in a one-hour window, the
threshold for the burn rate should be set to 14.4:
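The following back-of-the-envelope calculation (a sketch, using the window sizes from the example) shows where that number comes from:

slo_window_hours = 30 * 24      # the SLO is defined over 30 days
alert_window_hours = 1
budget_consumed = 0.02          # alert once 2% of the error budget is gone

window_fraction_elapsed = alert_window_hours / slo_window_hours
burn_rate_threshold = budget_consumed / window_fraction_elapsed
print(round(burn_rate_threshold, 1))   # 14.4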
While you should define most of your alerts based on SLOs, some
should trigger for known hard-failure modes that you haven’t had
the time to design or debug away. For example, suppose you know
your service suffers from a memory leak that has led to an incident
in the past, but you haven’t managed yet to track down the root-
cause or build a resiliency mechanism to mitigate it. In this case, it
could be useful to define an alert that triggers an automated restart
when a service instance is running out of memory.
20.5 Dashboards
After alerting, the other main use case for metrics is to power real-
time dashboards that display the overall health of a system.
Unfortunately, dashboards can easily become a dumping ground
for charts that end up being forgotten, have questionable useful-
ness, or are just plain confusing. Good dashboards don’t happen
by coincidence. In this section, I will present some best practices
on how to create useful dashboards.
The first decision you have to make when creating a dashboard
is to decide who the audience is14 and what they are looking for.
Given the audience, you can work backward to decide which
charts, and therefore metrics, to include.
The categories of dashboards presented here (see Figure 20.3) are
by no means standard but should give you an idea of how to orga-
nize dashboards.
SLO dashboard
The SLO summary dashboard is designed to be used by various
stakeholders from across the organization to gain visibility into the
system’s health as represented by its SLOs. During an incident,
this dashboard quantifies the impact it’s having on users.
Public API dashboard
14 https://aws.amazon.com/builders-library/building-dashboards-for-operational-visibility
20.6 On-call
A healthy on-call rotation is only possible when services are built
from the ground up with reliability and operability in mind. By
making the developers responsible for operating what they build,
they are incentivized to reduce the operational toll to a minimum.
They are also in the best position to be on-call since they are in-
timately familiar with the system’s architecture, brick walls, and
trade-offs.
Being on-call can be very stressful. Even when there are no call-
outs, just the thought of not having the same freedom usually en-
joyed outside of regular working hours can cause anxiety. This is
why being on-call should be compensated, and there shouldn’t be
any expectations for the on-call engineer to make any progress on
feature work. Since they will be interrupted by alerts, they should
make the most out of it and be given free rein to improve the on-
call experience, for example, by revising dashboards or improving
resiliency mechanisms.
Achieving a healthy on-call is only possible when alerts are action-
able. When an alert triggers, at the very least, it should link to rele-
vant dashboards and a run-book that lists the actions the engineer
should take, as it’s all too easy to miss a step when you get a call
in the middle of the night15 . Unless the alert was a false positive,
all actions taken by the operator should be communicated in a shared channel, like a global chat, that’s accessible by other teams.
This allows others to chime in, track the incident’s progress, and
make it easier to hand over an ongoing incident to someone else.
The first step to address an alert is to mitigate it, not fix the under-
lying root cause that created it. A new artifact has been rolled out
that degrades the service? Roll it back. The service can’t cope with
the load even though it hasn’t increased? Scale it out.
Once the incident has been mitigated, the next step is to brainstorm
ways to prevent it from happening again. The more widespread
the impact was, the more time you should spend on this. Incidents
that burned a significant fraction of an SLO’s error budget require
a postmortem.
A postmortem’s goal is to understand an incident’s root cause and
come up with a set of repair items that will prevent it from hap-
pening again. There should also be an agreement in the team that
if an SLO’s error budget is burned or the number of alerts spirals out of control, improving reliability takes precedence over feature work until the situation is back under control.
15 For the same reason, you should automate what you can to minimize manual actions that operators need to perform. Machines are good at following instructions; use that to your advantage.
Chapter 21
Observability
21.1 Logs
A log is an immutable list of time-stamped events that happened
over time. An event can have different formats. In its simplest
form, it’s just free-form text. It can also be structured and repre-
sented with a textual format like JSON, or a binary one like Proto-
buf. When structured, an event is typically represented with a bag
of key-value pairs:
{
"failureCount": 1,
"serviceRegion": "EastUs2",
"timestamp": 1614438079
A single event in isolation is often not enough to understand why a remote call failed; the events for the whole work unit need to be correlated. To make that possible, every
event should include the id of the request or message for the work
unit.
Costs
There are various ways to keep the costs of logging under con-
trol. A simple approach is to have different logging levels (e.g.:
debug, info, warning, error) controlled by a dynamic knob that de-
termines which ones are emitted. This allows operators to increase
the logging verbosity for investigation purposes and reduce costs
when granular logs aren’t needed.
Sampling2 is another option to reduce verbosity. For example, a service could log only every n-th event. Additionally, events can be prioritized based on their expected signal-to-noise ratio; for example, logging failed requests should have a higher sampling frequency than logging successful ones.
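A sketch of that kind of priority-based sampling, with made-up rates:

import random

SAMPLING_RATES = {"error": 1.0, "warning": 0.5, "info": 0.01}

def log_event(level, event, sink=print):
    # Emit the event only if it wins the sampling lottery for its level.
    if random.random() < SAMPLING_RATES.get(level, 0.0):
        sink({"level": level, **event})

log_event("error", {"message": "request failed", "requestId": "abc-123"})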
The options discussed so far only reduce the logging verbosity on
a single node. As you scale out and add more nodes, logging vol-
ume will necessarily increase. Even with the best intentions, someone could check in a bug that leads to excessive logging. To avoid
costs soaring through the roof or killing your logging pipeline en-
tirely, log collectors need to be able to rate-limit requests. If you
use a third-party service to ingest, store, and query your logs, there
probably is a quota in place already.
Of course, you can always opt to create in-memory aggregates
from the measurements collected in events (e.g., metrics) and emit
just those rather than raw logs. By doing so, you trade off the ability to drill down into the aggregates if needed.
21.2 Traces
Tracing captures the entire lifespan of a request as it propagates
throughout the services of a distributed system. A trace is a list of causally related spans that represent the execution flow of a request in a system.
2 https://www.honeycomb.io/blog/dynamic-sampling-by-example/
Just like in the previous case, you can emit individual span events
and have the backend aggregate them together into traces.
Chapter 22
Final words
5 I worked on a time-series data store that builds on top of Azure Data Explorer and Azure Storage; unfortunately, no public paper is available for it just yet.