UNDERSTANDING DISTRIBUTED SYSTEMS
SAMPLE
WHAT EVERY DEVELOPER SHOULD KNOW ABOUT LARGE DISTRIBUTED APPLICATIONS
Understanding Distributed Systems
Version 1.1.0
Roberto Vitillo
March 2021
Contents
Copyright 6
Acknowledgements 8
Preface 9
0.1 Who should read this book . . . . . . . . . . . . . . 10
1 Introduction 11
1.1 Communication . . . . . . . . . . . . . . . . . . . . 12
1.2 Coordination . . . . . . . . . . . . . . . . . . . . . 13
1.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . 13
1.4 Resiliency . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Operations . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Anatomy of a distributed system . . . . . . . . . . 17
I Communication 20
2 Reliable links 23
2.1 Reliability . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Connection lifecycle . . . . . . . . . . . . . . . . . 24
2.3 Flow control . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Congestion control . . . . . . . . . . . . . . . . . . 27
2.5 Custom protocols . . . . . . . . . . . . . . . . . . . 28
3 Secure links 30
3.1 Encryption . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Authentication . . . . . . . . . . . . . . . . . . . . 31
3.3 Integrity . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Handshake . . . . . . . . . . . . . . . . . . . . . . 34
4 Discovery 35
5 APIs 39
5.1 HTTP . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Resources . . . . . . . . . . . . . . . . . . . . . . . 43
5.3 Request methods . . . . . . . . . . . . . . . . . . . 45
5.4 Response status codes . . . . . . . . . . . . . . . . 46
5.5 OpenAPI . . . . . . . . . . . . . . . . . . . . . . . 48
5.6 Evolution . . . . . . . . . . . . . . . . . . . . . . . 49
II Coordination 51
6 System models 54
7 Failure detection 57
8 Time 59
8.1 Physical clocks . . . . . . . . . . . . . . . . . . . . 60
8.2 Logical clocks . . . . . . . . . . . . . . . . . . . . . 61
8.3 Vector clocks . . . . . . . . . . . . . . . . . . . . . 63
9 Leader election 65
9.1 Raft leader election . . . . . . . . . . . . . . . . . . 65
9.2 Practical considerations . . . . . . . . . . . . . . . . 67
10 Replication 71
10.1 State machine replication . . . . . . . . . . . . . . . 72
10.2 Consensus . . . . . . . . . . . . . . . . . . . . . . . 75
10.3 Consistency models . . . . . . . . . . . . . . . . . . 76
10.3.1 Strong consistency . . . . . . . . . . . . . . 77
10.3.2 Sequential consistency . . . . . . . . . . . . 79
10.3.3 Eventual consistency . . . . . . . . . . . . . 79
10.3.4 CAP theorem . . . . . . . . . . . . . . . . . 81
10.4 Chain replication . . . . . . . . . . . . . . . . . . . 82
10.5 Solving the CAP theorem . . . . . . . . . . . . . . . 86
10.5.1 Broadcast protocols . . . . . . . . . . . . . . 87
10.5.2 Conflict free replicated data types . . . . . . 89
10.5.3 Dynamo-style data stores . . . . . . . . . . . 94
10.5.4 CALM theorem . . . . . . . . . . . . . . . . 96
10.5.5 Causal consistency . . . . . . . . . . . . . . 97
11 Transactions 101
11.1 ACID . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.2 Isolation . . . . . . . . . . . . . . . . . . . . . . . . 103
11.2.1 Concurrency control . . . . . . . . . . . . . 105
11.3 Atomicity . . . . . . . . . . . . . . . . . . . . . . . 106
11.3.1 Two-phase commit . . . . . . . . . . . . . . 107
11.4 Asynchronous transactions . . . . . . . . . . . . . . 109
11.4.1 Log-based transactions . . . . . . . . . . . . 110
11.4.2 Sagas . . . . . . . . . . . . . . . . . . . . . . 113
11.4.3 Isolation . . . . . . . . . . . . . . . . . . . . 116
13 Partitioning 146
13.1 Sharding strategies . . . . . . . . . . . . . . . . . . 146
13.1.1 Range partitioning . . . . . . . . . . . . . . 147
13.1.2 Hash partitioning . . . . . . . . . . . . . . . 147
13.2 Rebalancing . . . . . . . . . . . . . . . . . . . . . . 151
13.2.1 Static partitioning . . . . . . . . . . . . . . . 151
13.2.2 Dynamic partitioning . . . . . . . . . . . . . 151
13.2.3 Practical considerations . . . . . . . . . . . 152
14 Duplication 153
14.1 Network load balancing . . . . . . . . . . . . . . . 153
14.1.1 DNS load balancing . . . . . . . . . . . . . . 156
14.1.2 Transport layer load balancing . . . . . . . . 156
14.1.3 Application layer load balancing . . . . . . 159
14.1.4 Geo load balancing . . . . . . . . . . . . . . 161
14.2 Replication . . . . . . . . . . . . . . . . . . . . . . 163
14.2.1 Single leader replication . . . . . . . . . . . 163
14.2.2 Multi-leader replication . . . . . . . . . . . 166
14.2.3 Leaderless replication . . . . . . . . . . . . . 168
14.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . 169
14.3.1 Policies . . . . . . . . . . . . . . . . . . . . 169
14.3.2 In-process cache . . . . . . . . . . . . . . . . 170
14.3.3 Out-of-process cache . . . . . . . . . . . . . 171
IV Resiliency 174
15 Common failure causes 177
15.1 Single point of failure . . . . . . . . . . . . . . . . . 177
15.2 Unreliable network . . . . . . . . . . . . . . . . . . 178
15.3 Slow processes . . . . . . . . . . . . . . . . . . . . 179
15.4 Unexpected load . . . . . . . . . . . . . . . . . . . 180
15.5 Cascading failures . . . . . . . . . . . . . . . . . . 181
15.6 Risk management . . . . . . . . . . . . . . . . . . . 182
20 Monitoring 224
20.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 225
20.2 Service-level indicators . . . . . . . . . . . . . . . . 228
20.3 Service-level objectives . . . . . . . . . . . . . . . . 231
20.4 Alerts . . . . . . . . . . . . . . . . . . . . . . . . . 234
20.5 Dashboards . . . . . . . . . . . . . . . . . . . . . . 236
20.5.1 Best practices . . . . . . . . . . . . . . . . . 238
20.6 On-call . . . . . . . . . . . . . . . . . . . . . . . . . 239
21 Observability 242
21.1 Logs . . . . . . . . . . . . . . . . . . . . . . . . . . 243
21.2 Traces . . . . . . . . . . . . . . . . . . . . . . . . . 246
21.3 Putting it all together . . . . . . . . . . . . . . . . . 248
Chapter 1
Introduction
Some applications need to tackle workloads that are just too big to
fit on a single node, no matter how powerful. For example, Google
receives hundreds of thousands of search requests per second from
all over the globe. There is no way a single node could handle that.
And finally, some applications have performance requirements
that would be physically impossible to achieve with a single
node. Netflix can seamlessly stream movies to your TV in high
resolution because it has a datacenter close to you.
This book will guide you through the fundamental challenges that
need to be solved to design, build and operate distributed sys-
tems: communication, coordination, scalability, resiliency, and op-
erations.
1.1 Communication
The first challenge comes from the fact that nodes need to commu-
nicate over the network with each other. For example, when your
browser wants to load a website, it resolves the server’s address
from the URL and sends an HTTP request to it. In turn, the server
returns a response with the content of the page to the client.
How are request and response messages represented on the wire?
What happens when there is a temporary network outage, or some
faulty network switch flips a few bits in the messages? How can
you guarantee that no intermediary can snoop into the communi-
cation?
Although it would be convenient to assume that some networking
library is going to abstract all communication concerns away, in
practice it’s not that simple because abstractions leak1, and you
need to understand how the stack works when that happens.
1 https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/
1.2 Coordination
Another hard challenge of building distributed systems is coordi-
nating nodes into a single coherent whole in the presence of fail-
ures. A fault is a component that stopped working, and a system is
fault-tolerant when it can continue to operate despite one or more
faults. The “two generals” problem is a famous thought experi-
ment that showcases why this is a challenging problem.
Suppose there are two generals (nodes), each commanding its own
army, that need to agree on a time to jointly attack a city. There is
some distance between the armies, and the only way to communi-
cate is by sending a messenger (messages). Unfortunately, these
messengers can be captured by the enemy (network failure).
Is there a way for the generals to agree on a time? Well, general 1
could send a message with a proposed time to general 2 and wait
for a response. What if no response arrives, though? Was one
of the messengers captured? Perhaps a messenger was injured,
and it’s taking longer than expected to arrive at the destination?
Should the general send another messenger?
You can see that this problem is much harder than it originally ap-
peared. As it turns out, no matter how many messengers are dis-
patched, neither general can be completely certain that the other
army will attack the city at the same time. Although sending more
messengers increases the general’s confidence, it never reaches ab-
solute certainty.
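To make that last point concrete, here is a toy calculation (a sketch of ours, not from the book): if each messenger is captured independently with probability p, dispatching more messengers makes it ever more likely that at least one gets through, yet the probability never reaches 1.

```python
# Probability that at least one of n messengers evades capture,
# assuming each is captured independently with probability p_capture.
def confidence(n_messengers, p_capture=0.3):
    return 1 - p_capture ** n_messengers

for n in (1, 5, 10, 50):
    print(f"{n:2d} messengers -> confidence {confidence(n):.12f}")
# The confidence approaches, but never reaches, absolute certainty.
```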
Because coordination is such a key topic, the second part of this
book is dedicated to distributed algorithms used to implement co-
ordination.
1.3 Scalability
The performance of a distributed system represents how efficiently
it handles load, and it’s generally measured with throughput and
response time. Throughput is the number of operations processed
per second, and response time is the total time between a client
sending a request and receiving a response. One way to handle
more load is to buy a more powerful machine, an approach known
as scaling up. But that will hit a brick wall sooner or later. When
that option is no longer available, the alternative is scaling out by
adding more machines to the system.
In the book’s third part, we will explore the main architectural pat-
terns that you can leverage to scale out applications: functional
decomposition, duplication, and partitioning.
1.4 Resiliency
A distributed system is resilient when it can continue to do its job
even when failures happen. And at scale, any failure that can hap-
pen will eventually occur. Every component of a system has a
probability of failing — nodes can crash, network links can be sev-
ered, etc. No matter how small that probability is, the more com-
ponents there are, and the more operations the system performs,
the higher the absolute number of failures becomes. And it gets
worse: since failures typically are not independent, the failure of
one component can increase the probability that another will fail.
Failures that are left unchecked can impact the system’s availability,
which is defined as the amount of time the application can serve
requests divided by the duration of the period measured. In other
words, it’s the percentage of time the system is capable of servicing
requests and doing useful work.
Availability is often described with nines, a shorthand way of ex-
pressing percentages of availability. Three nines are typically con-
sidered acceptable, and anything above four is considered to be
highly available.
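As a back-of-the-envelope sketch, the nines translate into a yearly downtime budget like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

for nines in (2, 3, 4, 5):
    unavailability = 10 ** -nines  # e.g., 3 nines -> 0.1% of the time
    downtime_hours = SECONDS_PER_YEAR * unavailability / 3600
    print(f"{nines} nines: at most {downtime_hours:.2f} hours down per year")
# 3 nines allow roughly 8.76 hours of downtime per year; 4 nines,
# about 53 minutes.
```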
1.5 Operations
Distributed systems need to be tested, deployed, and maintained.
It used to be that one team developed an application, and another
was responsible for operating it. The rise of microservices and De-
vOps has changed that. The same team that designs a system is
also responsible for its live-site operation. That’s a good thing as
there is no better way to find out where a system falls short than
experiencing it by being on-call for it.
New deployments need to be rolled out continuously in a safe
manner without affecting the system’s availability. The system
needs to be observable so that it’s easy to understand what’s hap-
pening at any time. Alerts need to fire when its service level objec-
tives are at risk of being breached, and a human needs to be looped
in. The book’s final part explores best practices to test and operate
distributed systems.
Figure 1.2: The business logic uses the messaging interface implemented by the Kafka producer to send messages and the repository interface to access the SQL store. In contrast, the HTTP controller handles incoming requests using the service interface.
Communication
Introduction
The network layer routes packets from one node to another across
the network. The Internet Protocol (IP) is the core
protocol of this layer, which delivers packets on a best-effort basis.
Routers operate at this layer and forward IP packets based on their
destination IP address.
The transport layer transmits data between two processes using
port numbers to address the processes on either end. The most
important protocol in this layer is the Transmission Control
Protocol (TCP).
The application layer defines high-level communication protocols,
like HTTP or DNS. Typically your code will target this level of ab-
straction.
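For instance, in the minimal sketch below (ours; example.com is just a stand-in for any web server), the application code speaks HTTP while the standard library and the operating system take care of the layers beneath it:

```python
from urllib.request import urlopen

# Application layer: we issue an HTTP request. The transport (TCP)
# and network (IP) layers are handled for us further down the stack.
with urlopen("https://example.com") as response:
    print(response.status)    # HTTP status code, e.g., 200
    print(response.read(80))  # first bytes of the page body
```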
Even though each protocol builds on top of the one below it, some-
times the abstractions leak. If you don’t know how the bottom lay-
ers work, you will have a hard time troubleshooting networking
issues that will inevitably arise.
Chapter 2 describes how to build a reliable communication chan-
nel (TCP) on top of an unreliable one (IP), which can drop, dupli-
cate and deliver data out of order. Building reliable abstractions on
top of unreliable ones is a common pattern that we will encounter
many times as we explore further how distributed systems work.
Chapter 3 describes how to build a secure channel (TLS) on top of
a reliable one (TCP), which provides encryption, authentication,
and integrity.
Chapter 4 dives into how the phone book of the Internet (DNS)
works, which allows nodes to discover others using names. At its
heart, DNS is a distributed, hierarchical, and eventually consistent
key-value store. By studying it, we will get a first taste of eventual
consistency.
Chapter 5 concludes this part by discussing how services can ex-
pose APIs that other nodes can use to send commands or notifi-
cations to them. Specifically, we will dive into the implementation of a
RESTful HTTP API.
Chapter 2
Reliable links
2.1 Reliability
To create the illusion of a reliable channel, TCP partitions a byte
stream into discrete packets called segments. The segments are
sequentially numbered, which allows the receiver to detect holes
and duplicates. Every segment sent needs to be acknowledged
by the receiver. When that doesn’t happen, a timer fires on the
sending side, and the segment is retransmitted. To ensure that the
data hasn’t been corrupted in transit, the receiver uses a checksum
to verify the integrity of a delivered segment.
1 https://tools.ietf.org/html/rfc793
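To illustrate the mechanism, here is a drastically simplified stop-and-wait sketch of ours; real TCP pipelines many segments in flight and does far more, but the numbering, retransmission, and de-duplication logic is the same in spirit:

```python
import random

def send_reliably(segments, loss=0.2):
    # Toy reliable delivery over a lossy channel: number each segment
    # and retransmit it until an acknowledgment comes back.
    received = {}
    for seq, data in enumerate(segments):
        while True:
            if random.random() < loss:   # segment lost in transit:
                continue                 # the sender's timer fires
            received[seq] = data         # duplicates collapse by seq
            if random.random() < loss:   # the ack was lost, so the
                continue                 # sender retransmits anyway
            break                        # ack received: next segment
    return b"".join(received[i] for i in sorted(received))

print(send_reliably([b"hel", b"lo ", b"wor", b"ld!"]))  # b'hello world!'
```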
Figure 2.2: The receive buffer stores data that hasn’t been processed yet by the application.
Figure 2.4: The lower the RTT is, the quicker the sender can start utilizing the underlying network’s bandwidth.
Bandwidth = WinSize / RTT
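To get a feel for the formula, here is a quick calculation with assumed numbers (a 64 KiB window and a 100 ms round trip):

```python
win_size = 64 * 1024    # receive window in bytes (assumed)
rtt = 0.1               # round trip time in seconds (assumed)

bandwidth = win_size / rtt                # bytes per second
print(f"{bandwidth * 8 / 1e6:.2f} Mbps")  # ~5.24 Mbps
```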
If you strip away the stability and reliability mechanisms that TCP provides, what you get is a simple protocol named
User Datagram Protocol6 (UDP) — a connectionless transport layer
protocol that can be used as an alternative to TCP.
Unlike TCP, UDP does not expose the abstraction of a byte
stream to its clients. Clients can only send discrete packets,
called datagrams, with a limited size. UDP doesn’t offer any
reliability as datagrams don’t have sequence numbers and are
not acknowledged. UDP doesn’t implement flow and congestion
control either. Overall, UDP is a lean and bare-bones protocol. It’s
used to bootstrap custom protocols, which provide some, but not
all, of the stability and reliability guarantees that TCP does7.
For example, in modern multi-player games, clients sample
gamepad, mouse and keyboard events several times per second
and send them to a server that keeps track of the global game state.
Similarly, the server samples the game state several times per
second and sends these snapshots back to the clients. If a snapshot
is lost in transmission, there is no value in retransmitting it as the
game evolves in real-time; by the time the retransmitted snapshot
would get to the destination, it would be obsolete. This is a use
case where UDP shines, as TCP would attempt to redeliver the
missing data and consequently slow down the client’s experience.
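A minimal sketch of that fire-and-forget style (ours; the address and payload are made up for illustration):

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # UDP socket
snapshot = b"tick=184;player=42;x=10;y=7"  # hypothetical game state
sock.sendto(snapshot, ("127.0.0.1", 9999))  # fire and forget
# No connection setup, no acknowledgment, no retransmission: a lost
# snapshot is simply superseded by the next one.
sock.close()
```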
6 https://en.wikipedia.org/wiki/User_Datagram_Protocol
7 As we will later see, HTTP 3 is based on UDP to avoid some of TCP’s shortcomings.