UNDERSTANDING DISTRIBUTED SYSTEMS
SAMPLE
WHAT EVERY DEVELOPER SHOULD KNOW ABOUT LARGE DISTRIBUTED APPLICATIONS
Understanding Distributed Systems
Version 1.1.0
Roberto Vitillo
March 2021
Contents
Copyright 6
Acknowledgements 8
Preface 9
0.1 Who should read this book . . . . . . . . . . . . . . 10
1 Introduction 11
1.1 Communication . . . . . . . . . . . . . . . . . . . . 12
1.2 Coordination . . . . . . . . . . . . . . . . . . . . . 13
1.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Resiliency . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Operations . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Anatomy of a distributed system . . . . . . . . . . 17
I Communication 20
2 Reliable links 23
2.1 Reliability . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Connection lifecycle . . . . . . . . . . . . . . . . . 24
2.3 Flow control . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Congestion control . . . . . . . . . . . . . . . . . . 27
2.5 Custom protocols . . . . . . . . . . . . . . . . . . . 28
3 Secure links 30
3.1 Encryption . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Authentication . . . . . . . . . . . . . . . . . . . . 31
3.3 Integrity . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Handshake . . . . . . . . . . . . . . . . . . . . . . 34
4 Discovery 35
5 APIs 39
5.1 HTTP . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Resources . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Request methods . . . . . . . . . . . . . . . . . . . 45
5.4 Response status codes . . . . . . . . . . . . . . . . 46
5.5 OpenAPI . . . . . . . . . . . . . . . . . . . . . . . 48
5.6 Evolution . . . . . . . . . . . . . . . . . . . . . . . 49
II Coordination 51
6 System models 54
7 Failure detection 57
8 Time 59
8.1 Physical clocks . . . . . . . . . . . . . . . . . . . . 60
8.2 Logical clocks . . . . . . . . . . . . . . . . . . . . . 61
8.3 Vector clocks . . . . . . . . . . . . . . . . . . . . . 63
9 Leader election 65
9.1 Raft leader election . . . . . . . . . . . . . . . . . . 65
9.2 Practical considerations . . . . . . . . . . . . . . . . 67
10 Replication 71
10.1 State machine replication . . . . . . . . . . . . . . . 72
10.2 Consensus . . . . . . . . . . . . . . . . . . . . . . . 75
10.3 Consistency models . . . . . . . . . . . . . . . . . . 76
10.3.1 Strong consistency . . . . . . . . . . . . . . 77
10.3.2 Sequential consistency . . . . . . . . . . . . 79
10.3.3 Eventual consistency . . . . . . . . . . . . . 79
10.3.4 CAP theorem . . . . . . . . . . . . . . . . . 81
10.4 Chain replication . . . . . . . . . . . . . . . . . . . 82
10.5 Solving the CAP theorem . . . . . . . . . . . . . . . 86
10.5.1 Broadcast protocols . . . . . . . . . . . . . . 87
10.5.2 Conflict free replicated data types . . . . . . 89
10.5.3 Dynamo-style data stores . . . . . . . . . . . 94
10.5.4 CALM theorem . . . . . . . . . . . . . . . . 96
10.5.5 Causal consistency . . . . . . . . . . . . . . 97
11 Transactions 101
11.1 ACID . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.2 Isolation . . . . . . . . . . . . . . . . . . . . . . . . 103
11.2.1 Concurrency control . . . . . . . . . . . . . 105
11.3 Atomicity . . . . . . . . . . . . . . . . . . . . . . . 106
11.3.1 Two-phase commit . . . . . . . . . . . . . . 107
11.4 Asynchronous transactions . . . . . . . . . . . . . . 109
11.4.1 Log-based transactions . . . . . . . . . . . . 110
11.4.2 Sagas . . . . . . . . . . . . . . . . . . . . . . 113
11.4.3 Isolation . . . . . . . . . . . . . . . . . . . . 116
13 Partitioning 146
13.1 Sharding strategies . . . . . . . . . . . . . . . . . . 146
13.1.1 Range partitioning . . . . . . . . . . . . . . 147
13.1.2 Hash partitioning . . . . . . . . . . . . . . . 147
13.2 Rebalancing . . . . . . . . . . . . . . . . . . . . . . 151
13.2.1 Static partitioning . . . . . . . . . . . . . . . 151
13.2.2 Dynamic partitioning . . . . . . . . . . . . . 151
13.2.3 Practical considerations . . . . . . . . . . . 152
14 Duplication 153
14.1 Network load balancing . . . . . . . . . . . . . . . 153
14.1.1 DNS load balancing . . . . . . . . . . . . . . 156
14.1.2 Transport layer load balancing . . . . . . . . 156
14.1.3 Application layer load balancing . . . . . . 159
14.1.4 Geo load balancing . . . . . . . . . . . . . . 161
14.2 Replication . . . . . . . . . . . . . . . . . . . . . . 163
14.2.1 Single leader replication . . . . . . . . . . . 163
14.2.2 Multi-leader replication . . . . . . . . . . . 166
14.2.3 Leaderless replication . . . . . . . . . . . . . 168
14.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . 169
14.3.1 Policies . . . . . . . . . . . . . . . . . . . . 169
14.3.2 In-process cache . . . . . . . . . . . . . . . . 170
14.3.3 Out-of-process cache . . . . . . . . . . . . . 171
IV Resiliency 174
15 Common failure causes 177
15.1 Single point of failure . . . . . . . . . . . . . . . . . 177
15.2 Unreliable network . . . . . . . . . . . . . . . . . . 178
15.3 Slow processes . . . . . . . . . . . . . . . . . . . . 179
15.4 Unexpected load . . . . . . . . . . . . . . . . . . . 180
15.5 Cascading failures . . . . . . . . . . . . . . . . . . 181
15.6 Risk management . . . . . . . . . . . . . . . . . . . 182
20 Monitoring 224
20.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 225
20.2 Service-level indicators . . . . . . . . . . . . . . . . 228
20.3 Service-level objectives . . . . . . . . . . . . . . . . 231
20.4 Alerts . . . . . . . . . . . . . . . . . . . . . . . . . 234
20.5 Dashboards . . . . . . . . . . . . . . . . . . . . . . 236
20.5.1 Best practices . . . . . . . . . . . . . . . . . 238
20.6 On-call . . . . . . . . . . . . . . . . . . . . . . . . . 239
21 Observability 242
21.1 Logs . . . . . . . . . . . . . . . . . . . . . . . . . . 243
21.2 Traces . . . . . . . . . . . . . . . . . . . . . . . . . 246
21.3 Putting it all together . . . . . . . . . . . . . . . . . 248
Chapter 1
Introduction
Some applications need to tackle workloads that are just too big to
fit on a single node, no matter how powerful. For example, Google
receives hundreds of thousands of search requests per second from
all over the globe. There is no way a single node could handle that.
And finally, some applications have performance requirements
that would be physically impossible to achieve with a single
node. Netflix can seamlessly stream movies to your TV in high
resolution because it has a datacenter close to you.
This book will guide you through the fundamental challenges that
need to be solved to design, build and operate distributed sys-
tems: communication, coordination, scalability, resiliency, and op-
erations.
1.1 Communication
The first challenge comes from the fact that nodes need to commu-
nicate over the network with each other. For example, when your
browser wants to load a website, it resolves the server’s address
from the URL and sends an HTTP request to it. In turn, the server
returns a response with the content of the page to the client.
How are request and response messages represented on the wire?
What happens when there is a temporary network outage, or some
faulty network switch flips a few bits in the messages? How can
you guarantee that no intermediary can snoop into the communi-
cation?
Although it would be convenient to assume that some networking
library is going to abstract all communication concerns away, in
practice it’s not that simple because abstractions leak1, and you
need to understand how the stack works when that happens.
1 https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/
1.2 Coordination
Another hard challenge of building distributed systems is coordi-
nating nodes into a single coherent whole in the presence of fail-
ures. A fault is a component that stopped working, and a system is
fault-tolerant when it can continue to operate despite one or more
faults. The “two generals” problem is a famous thought experi-
ment that showcases why this is a challenging problem.
Suppose there are two generals (nodes), each commanding its own
army, that need to agree on a time to jointly attack a city. There is
some distance between the armies, and the only way to communi-
cate is by sending a messenger (messages). Unfortunately, these
messengers can be captured by the enemy (network failure).
Is there a way for the generals to agree on a time? Well, general 1
could send a message with a proposed time to general 2 and wait
for a response. What if no response arrives, though? Was one
of the messengers captured? Perhaps a messenger was injured,
and it’s taking longer than expected to arrive at the destination?
Should the general send another messenger?
You can see that this problem is much harder than it originally ap-
peared. As it turns out, no matter how many messengers are dis-
patched, neither general can be completely certain that the other
army will attack the city at the same time. Although sending more
messengers increases the general’s confidence, it never reaches ab-
solute certainty.
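To make that last point concrete, here is a toy calculation (a sketch of ours, not from the book): if each messenger is captured independently with probability p, dispatching more messengers makes it ever more likely that at least one gets through, yet the probability never reaches 1.

```python
# Probability that at least one of n messengers evades capture,
# assuming each is captured independently with probability p_capture.
def confidence(n_messengers, p_capture=0.3):
    return 1 - p_capture ** n_messengers

for n in (1, 5, 10, 50):
    print(f"{n:2d} messengers -> confidence {confidence(n):.12f}")
# The confidence approaches, but never reaches, absolute certainty.
```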
Because coordination is such a key topic, the second part of this
book is dedicated to distributed algorithms used to implement co-
ordination.
1.3 Scalability
The performance of a distributed system represents how efficiently
it handles load, and it’s generally measured with throughput and
response time. Throughput is the number of operations processed
per second, and response time is the total time between a client
sending a request and receiving a response. One way to handle
more load is to buy a more powerful machine, an approach known
as scaling up. But that will hit a brick wall sooner or later. When
that option is no longer available, the alternative is scaling out by
adding more machines to the system.
In the book’s third part, we will explore the main architectural pat-
terns that you can leverage to scale out applications: functional
decomposition, duplication, and partitioning.
1.4 Resiliency
A distributed system is resilient when it can continue to do its job
even when failures happen. And at scale, any failure that can hap-
pen will eventually occur. Every component of a system has a
probability of failing — nodes can crash, network links can be sev-
ered, etc. No matter how small that probability is, the more com-
ponents there are, and the more operations the system performs,
the higher the absolute number of failures becomes. And it gets
worse: since failures typically are not independent, the failure of
one component can increase the probability that another will fail.
Failures that are left unchecked can impact the system’s availability,
which is defined as the amount of time the application can serve
requests divided by the duration of the period measured. In other
words, it’s the percentage of time the system is capable of servicing
requests and doing useful work.
Availability is often described with nines, a shorthand way of ex-
pressing percentages of availability. Three nines are typically con-
sidered acceptable, and anything above four is considered to be
highly available.
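As a back-of-the-envelope sketch, the nines translate into a yearly downtime budget like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

for nines in (2, 3, 4, 5):
    unavailability = 10 ** -nines  # e.g., 3 nines -> 0.1% of the time
    downtime_hours = SECONDS_PER_YEAR * unavailability / 3600
    print(f"{nines} nines: at most {downtime_hours:.2f} hours down per year")
# 3 nines allow roughly 8.76 hours of downtime per year; 4 nines,
# about 53 minutes.
```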
1.5 Operations
Distributed systems need to be tested, deployed, and maintained.
It used to be that one team developed an application, and another
was responsible for operating it. The rise of microservices and De-
vOps has changed that. The same team that designs a system is
also responsible for its live-site operation. That’s a good thing as
there is no better way to find out where a system falls short than
experiencing it by being on-call for it.
New deployments need to be rolled out continuously in a safe
manner without affecting the system’s availability. The system
needs to be observable so that it’s easy to understand what’s hap-
pening at any time. Alerts need to fire when its service level objec-
tives are at risk of being breached, and a human needs to be looped
in. The book’s final part explores best practices to test and operate
distributed systems.
Figure 1.2: The business logic uses the messaging interface implemented by the Kafka producer to send messages and the repository interface to access the SQL store. In contrast, the HTTP controller handles incoming requests using the service interface.
Communication
Introduction
The network layer routes packets from one node to another across
the network. The Internet Protocol (IP) is the core
protocol of this layer, which delivers packets on a best-effort basis.
Routers operate at this layer and forward IP packets based on their
destination IP address.
The transport layer transmits data between two processes using
port numbers to address the processes on either end. The most
important protocol in this layer is the Transmission Control
Protocol (TCP).
The application layer defines high-level communication protocols,
like HTTP or DNS. Typically your code will target this level of ab-
straction.
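For instance, in the minimal sketch below (ours; example.com is just a stand-in for any web server), the application code speaks HTTP while the standard library and the operating system take care of the layers beneath it:

```python
from urllib.request import urlopen

# Application layer: we issue an HTTP request. The transport (TCP)
# and network (IP) layers are handled for us further down the stack.
with urlopen("https://example.com") as response:
    print(response.status)    # HTTP status code, e.g., 200
    print(response.read(80))  # first bytes of the page body
```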
Even though each protocol builds on top of the one below it, some-
times the abstractions leak. If you don’t know how the bottom lay-
ers work, you will have a hard time troubleshooting networking
issues that will inevitably arise.
Chapter 2 describes how to build a reliable communication chan-
nel (TCP) on top of an unreliable one (IP), which can drop, dupli-
cate and deliver data out of order. Building reliable abstractions on
top of unreliable ones is a common pattern that we will encounter
many times as we explore further how distributed systems work.
Chapter 3 describes how to build a secure channel (TLS) on top of
a reliable one (TCP), which provides encryption, authentication,
and integrity.
Chapter 4 dives into how the phone book of the Internet (DNS)
works, which allows nodes to discover others using names. At its
heart, DNS is a distributed, hierarchical, and eventually consistent
key-value store. By studying it, we will get a first taste of eventual
consistency.
Chapter 5 concludes this part by discussing how services can ex-
pose APIs that other nodes can use to send commands or notifi-
cations to them. Specifically, we will dive into the implementation of a
RESTful HTTP API.
Chapter 2
Reliable links
2.1 Reliability
To create the illusion of a reliable channel, TCP partitions a byte
stream into discrete packets called segments. The segments are
sequentially numbered, which allows the receiver to detect holes
and duplicates. Every segment sent needs to be acknowledged
by the receiver. When that doesn’t happen, a timer fires on the
sending side, and the segment is retransmitted. To ensure that the
data hasn’t been corrupted in transit, the receiver uses a checksum
to verify the integrity of a delivered segment.
1 https://tools.ietf.org/html/rfc793
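To illustrate the mechanism, here is a drastically simplified stop-and-wait sketch of ours; real TCP pipelines many segments in flight and does far more, but the numbering, retransmission, and de-duplication logic is the same in spirit:

```python
import random

def send_reliably(segments, loss=0.2):
    # Toy reliable delivery over a lossy channel: number each segment
    # and retransmit it until an acknowledgment comes back.
    received = {}
    for seq, data in enumerate(segments):
        while True:
            if random.random() < loss:   # segment lost in transit:
                continue                 # the sender's timer fires
            received[seq] = data         # duplicates collapse by seq
            if random.random() < loss:   # the ack was lost, so the
                continue                 # sender retransmits anyway
            break                        # ack received: next segment
    return b"".join(received[i] for i in sorted(received))

print(send_reliably([b"hel", b"lo ", b"wor", b"ld!"]))  # b'hello world!'
```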
Figure 2.2: The receive buffer stores data that hasn’t been processed yet by the application.
Figure 2.4: The lower the RTT is, the quicker the sender can start utilizing the underlying network’s bandwidth.
Bandwidth = WinSize / RTT
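To get a feel for the formula, here is a quick calculation with assumed numbers (a 64 KiB window and a 100 ms round trip):

```python
win_size = 64 * 1024    # receive window in bytes (assumed)
rtt = 0.1               # round trip time in seconds (assumed)

bandwidth = win_size / rtt                # bytes per second
print(f"{bandwidth * 8 / 1e6:.2f} Mbps")  # ~5.24 Mbps
```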
If you strip away the stability and reliability mechanisms that TCP provides, what you get is a simple protocol named
User Datagram Protocol6 (UDP) — a connectionless transport layer
protocol that can be used as an alternative to TCP.
Unlike TCP, UDP does not expose the abstraction of a byte
stream to its clients. Clients can only send discrete packets,
called datagrams, with a limited size. UDP doesn’t offer any
reliability as datagrams don’t have sequence numbers and are
not acknowledged. UDP doesn’t implement flow and congestion
control either. Overall, UDP is a lean and bare-bones protocol. It’s
used to bootstrap custom protocols, which provide some, but not
all, of the stability and reliability guarantees that TCP does7.
For example, in modern multi-player games, clients sample
gamepad, mouse and keyboard events several times per second
and send them to a server that keeps track of the global game state.
Similarly, the server samples the game state several times per
second and sends these snapshots back to the clients. If a snapshot
is lost in transmission, there is no value in retransmitting it as the
game evolves in real-time; by the time the retransmitted snapshot
would get to the destination, it would be obsolete. This is a use
case where UDP shines, as TCP would attempt to redeliver the
missing data and consequently slow down the client’s experience.
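A minimal sketch of that fire-and-forget style (ours; the address and payload are made up for illustration):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP socket
snapshot = b"tick=184;player=42;x=10;y=7"  # hypothetical game state
sock.sendto(snapshot, ("127.0.0.1", 9999))  # fire and forget
# No connection setup, no acknowledgment, no retransmission: a lost
# snapshot is simply superseded by the next one.
sock.close()
```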
6 https://en.wikipedia.org/wiki/User_Datagram_Protocol
7 As we will later see, HTTP 3 is based on UDP to avoid some of TCP’s shortcomings.