ByteByteGo LinkedIn PDF


System Design

What are database isolation levels? What are they used for? 4

What is IaaS/PaaS/SaaS? 6

Most popular programming languages 7

What is the future of online payments? 9

What is SSO (Single Sign-On)? 11

How to store passwords safely in the database? 13

How does HTTPS work? 16

How to learn design patterns? 18

A visual guide on how to choose the right Database 20

Do you know how to generate globally unique IDs? 22

How does Twitter work? 24

What is the difference between Process and Thread? 26

Interview Question: design Google Docs 28

Deployment strategies 30

Flowchart of how slack decides to send a notification 32

How does Amazon build and operate the software? 33

How to design a secure web API access for your website? 35

How do microservices collaborate and interact with each other? 38

What are the differences between Virtualization (VMware) and Containerization (Docker)? 40

Which cloud provider should be used when building a big data solution? 42

How to avoid crawling duplicate URLs at Google scale? 44

Why is a solid-state drive (SSD) fast? 47

Handling a large-scale outage 49

AWS Lambda behind the scenes 51

HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC). 53

How to scale a website to support millions of users? 55

DevOps Books 58

Why is Kafka fast? 60

SOAP vs REST vs GraphQL vs RPC. 62

How do modern browsers work? 63

Redis vs Memcached 64

Optimistic locking 65

Tradeoff between latency and consistency 67

Cache miss attack 68

How to diagnose a mysterious process that’s taking too much CPU, memory, IO, etc? 70

What are the top cache strategies? 71

Upload large files 74

Why is Redis so Fast? 76

SWIFT payment network 77

At-most once, at-least once, and exactly once 80

Vertical partitioning and Horizontal partitioning 82

CDN 84

Erasure coding 87

Foreign exchange in payment 89

Block storage, file storage and object storage 94

Block storage, file storage and object storage 95

Domain Name System (DNS) lookup 97

What happens when you type a URL into your browser? 99

AI Coding engine 101

Read replica pattern 103

Read replica pattern 105

Email receiving flow 107

Email sending flow 109

Interview Question: Design Gmail 111

Map rendering 113

Interview Question: Design Google Maps 115

Pull vs push models 117

Money movement 119

Reconciliation 122

Which database shall I use for the metrics collecting system? 126

Metrics monitoring and alerting system 129

Reconciliation 131

Big data papers 134

Avoid double charge 136

Payment security 138

System Design Interview Tip 139

Big data evolvement 140

Quadtree 142

How do we find nearby restaurants on Yelp? 144

How does a modern stock exchange achieve microsecond latency? 147

Match buy and sell orders 149

Stock exchange design 151

Design a payment system 153

Design a flash sale system 155

Back-of-the-envelope estimation 157

What are database isolation levels? What are they used for?

Database isolation allows a transaction to execute as if there are no
other concurrently running transactions.

The diagram below illustrates four isolation levels.

🔹 Serializable: This is the highest isolation level. Concurrent
transactions are guaranteed to be executed in sequence.

🔹 Repeatable Read: Data read during the transaction stays the same
as it was when the transaction started.

🔹 Read Committed: Data modification can only be read after the
transaction is committed.

🔹 Read Uncommitted: The data modification can be read by other
transactions before a transaction is committed.

The isolation is guaranteed by MVCC (Multi-Version Concurrency
Control) and locks.

The diagram below takes Repeatable Read as an example to
demonstrate how MVCC works:

There are two hidden columns for each row: transaction_id and
roll_pointer. When transaction A starts, a new Read View with
transaction_id=201 is created. Shortly afterward, transaction B starts,
and a new Read View with transaction_id=202 is created.

Now transaction A modifies the balance to 200, a new row of the log is
created, and the roll_pointer points to the old row. Before transaction A
commits, transaction B reads the balance data. Transaction B finds
that transaction_id 201 is not committed, so it reads the next committed
record (transaction_id=200).

Even when transaction A commits, transaction B still reads data based
on the Read View created when transaction B started. So transaction B
always reads the data with balance=100.

Over to you: have you seen isolation levels used in the wrong way?
Did it cause serious outages?

Check out our bestselling system design books.
Paperback: Amazon Digital: ByteByteGo.
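The Repeatable Read walkthrough in the isolation-levels post can be sketched in a few lines. This is a simplified model, not how any particular database implements MVCC: the names RowVersion, ReadView, and read_balance are hypothetical, and the visibility rule (a version is visible if it is the reader's own write, or its writer is older than the reader and had committed when the view was created) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    balance: int
    transaction_id: int                   # hidden column: who wrote this version
    roll_pointer: Optional["RowVersion"]  # hidden column: the previous version


@dataclass(frozen=True)
class ReadView:
    own_id: int
    active_ids: frozenset  # transactions still uncommitted when the view was created

    def can_see(self, version: RowVersion) -> bool:
        if version.transaction_id == self.own_id:
            return True  # a transaction always sees its own writes
        return (version.transaction_id < self.own_id
                and version.transaction_id not in self.active_ids)


def read_balance(latest: RowVersion, view: ReadView) -> int:
    """Follow roll_pointer until a version visible to this Read View is found."""
    version = latest
    while version is not None and not view.can_see(version):
        version = version.roll_pointer
    if version is None:
        raise LookupError("no visible version")
    return version.balance


# Committed row written by transaction 200, then overwritten by transaction A (201).
old_row = RowVersion(balance=100, transaction_id=200, roll_pointer=None)
latest_row = RowVersion(balance=200, transaction_id=201, roll_pointer=old_row)

# Transaction B (202) created its Read View while A was still active.
view_b = ReadView(own_id=202, active_ids=frozenset({201}))
# A hypothetical later transaction (203) starts after A has committed.
view_later = ReadView(own_id=203, active_ids=frozenset())

balance_seen_by_b = read_balance(latest_row, view_b)
balance_seen_later = read_balance(latest_row, view_later)
```

Because view_b is never rebuilt, transaction B keeps reading the old version even after A commits, which is the behavior the post describes.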

What is IaaS/PaaS/SaaS?

The diagram below illustrates the differences between IaaS
(Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), and SaaS
(Software-as-a-Service).

For a non-cloud application, we own and manage all the hardware and
software. We say the application is on-premises.

With cloud computing, cloud service vendors provide three kinds of
models for us to use: IaaS, PaaS, and SaaS.

𝐈𝐚𝐚𝐒 provides us access to cloud vendors' infrastructure, like servers,
storage, and networking. We pay for the infrastructure service and
install and manage supporting software on it for our application.

𝐏𝐚𝐚𝐒 goes further. It provides a platform with a variety of middleware,
frameworks, and tools to build our application. We only focus on
application development and data.

𝐒𝐚𝐚𝐒 enables the application to run in the cloud. We pay a monthly or
annual fee to use the SaaS product.

Over to you: which IaaS/PaaS/SaaS products have you used? How do
you decide which architecture to use?

Image Source: https://www.ibm.com/cloud/learn/iaas-paas-saas

Most popular programming languages

Programming languages come and go. Some stand the test of time.
Some are already shooting stars and some are rising rapidly on the
horizon.

I drew a diagram by putting the top 38 most commonly used
programming languages in one place, sorted by year. Data source:
StackOverflow survey.

1 JavaScript
2 HTML/CSS
3 Python
4 SQL
5 Java
6 Node
7 TypeScript
8 C#
9 Bash/Shell
10 C++
11 PHP
12 C
13 PowerShell
14 Go
15 Kotlin
16 Rust
17 Ruby
18 Dart
19 Assembly
20 Swift
21 R
22 VBA
23 Matlab
24 Groovy
25 Objective-C
26 Scala
27 Perl
28 Haskell
29 Delphi
30 Clojure
31 Elixir
32 LISP
33 Julia
34 F#
35 Erlang
36 APL
37 Crystal
38 COBOL

Over to you: what’s the first programming language you learned? And
what are the other languages you learned over the years?

What is the future of online payments?

I don’t know the answer, but I do know one of the candidates is the
blockchain.

As a fan of technology, I always seek new solutions to old challenges.
A book that explains a lot about an emerging payment system is
‘Mastering Bitcoin’ by Andreas M. Antonopoulos. I want to share my
discovery of this book with you because it explains bitcoin and its
underlying blockchain very clearly. This book makes me rethink how to
renovate payment systems.

Here are the takeaways:

1. The bitcoin wallet balance is calculated on the fly, while the
traditional wallet balance is stored in the database. You can check
chapter 12 of System Design Interview Volume 2, on how to implement
a traditional wallet (https://amzn.to/34G2vmC).

2. The golden source of truth for bitcoin is the blockchain, which is also
the journal. It’s the same if we use Event Sourcing architecture to build
a traditional wallet, although there are other options.

3. There is a small virtual machine for bitcoin - and also Ethereum. The
virtual machine defines a set of bytecodes to do basic tasks such as
validation.

Over to you: if Elon Musk set up a base on planet Mars, what payment
solution will you recommend?

What is SSO (Single Sign-On)?

A friend recently went through the irksome experience of being signed
out from a number of websites they use daily. This event will be familiar
to millions of web users, and it is a tedious process to fix. It can involve
trying to remember multiple long-forgotten passwords, or typing in the
names of pets from childhood to answer security questions. SSO
removes this inconvenience and makes life online better. But how does
it work?

Basically, Single Sign-On (SSO) is an authentication scheme. It allows
a user to log in to different systems using a single ID.

The diagram below illustrates how SSO works.

Step 1: A user visits Gmail, or any email service. Gmail finds the user
is not logged in and so redirects them to the SSO authentication
server, which also finds the user is not logged in. As a result, the user
is redirected to the SSO login page, where they enter their login
credentials.

Steps 2-3: The SSO authentication server validates the credentials,
creates the global session for the user, and creates a token.

Steps 4-7: Gmail validates the token in the SSO authentication server.
The authentication server registers the Gmail system, and returns
“valid.” Gmail returns the protected resource to the user.

Step 8: From Gmail, the user navigates to another Google-owned
website, for example, YouTube.

Steps 9-10: YouTube finds the user is not logged in, and then requests
authentication. The SSO authentication server finds the user is already
logged in and returns the token.

Steps 11-14: YouTube validates the token in the SSO authentication
server. The authentication server registers the YouTube system, and
returns “valid.” YouTube returns the protected resource to the user.

The process is complete and the user gets back access to their
account.

Over to you:

Question 1: have you implemented SSO in your projects? What is the
most difficult part?

Question 2: what’s your favorite sign-in method and why?

How to store passwords safely in the database?

Let’s take a look.

𝐓𝐡𝐢𝐧𝐠𝐬 𝐍𝐎𝐓 𝐭𝐨 𝐝𝐨

🔹 Storing passwords in plain text is not a good idea because anyone
with internal access can see them.

🔹 Storing password hashes directly is not sufficient because it is
prone to precomputation attacks, such as rainbow tables.

🔹 To mitigate precomputation attacks, we salt the passwords.

𝐖𝐡𝐚𝐭 𝐢𝐬 𝐬𝐚𝐥𝐭?

According to OWASP guidelines, “a salt is a unique, randomly
generated string that is added to each password as part of the hashing
process”.

𝐇𝐨𝐰 𝐭𝐨 𝐬𝐭𝐨𝐫𝐞 𝐚 𝐩𝐚𝐬𝐬𝐰𝐨𝐫𝐝 𝐚𝐧𝐝 𝐬𝐚𝐥𝐭?

1️⃣ A salt is not meant to be secret and it can be stored in plain text in
the database. It is used to ensure the hash result is unique to each
password.

2️⃣ The password can be stored in the database using the following
format: 𝘩𝘢𝘴𝘩( 𝘱𝘢𝘴𝘴𝘸𝘰𝘳𝘥 + 𝘴𝘢𝘭𝘵).

𝐇𝐨𝐰 𝐭𝐨 𝐯𝐚𝐥𝐢𝐝𝐚𝐭𝐞 𝐚 𝐩𝐚𝐬𝐬𝐰𝐨𝐫𝐝?

To validate a password, it can go through the following process:
1️⃣ A client enters the password.
2️⃣ The system fetches the corresponding salt from the database.
3️⃣ The system appends the salt to the password and hashes it. Let’s
call the hashed value H1.
4️⃣ The system compares H1 and H2, where H2 is the hash stored in the
database. If they are the same, the password is valid.

Over to you: what other mechanisms can we use to ensure password
safety?
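The store-and-validate steps from the password post can be sketched with Python's standard library. One deliberate substitution: instead of a single hash(password + salt), this sketch uses PBKDF2 (hashlib.pbkdf2_hmac), a salted and intentionally slow hash. The function names, the record layout, and the ITERATIONS value are assumptions for illustration, not recommendations from the original post.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # hypothetical work factor; tune it for your hardware


def hash_password(password: str, salt: bytes) -> bytes:
    """Salted, slow hash of the password (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def create_record(password: str) -> dict:
    """What we would store in the database: the plain-text salt and the hash."""
    salt = secrets.token_bytes(16)  # unique random salt per password
    return {"salt": salt, "hash": hash_password(password, salt)}


def validate(password: str, record: dict) -> bool:
    h1 = hash_password(password, record["salt"])   # steps 2-3: fetch salt, hash
    return hmac.compare_digest(h1, record["hash"])  # step 4: compare H1 and H2


record = create_record("correct horse battery staple")
```

The comparison uses hmac.compare_digest so that the time taken does not depend on how many leading bytes match.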
How does HTTPS work?

Hypertext Transfer Protocol Secure (HTTPS) is an extension of the
Hypertext Transfer Protocol (HTTP). HTTPS transmits encrypted data
using Transport Layer Security (TLS). If the data is hijacked online, all
the hijacker gets is binary code.

How is the data encrypted and decrypted?

Step 1 - The client (browser) and the server establish a TCP
connection.

Step 2 - The client sends a “client hello” to the server. The message
contains a set of necessary encryption algorithms (cipher suites) and
the latest TLS version it can support. The server responds with a
“server hello” so the browser knows whether it can support the
algorithms and TLS version.

The server then sends the SSL certificate to the client. The certificate
contains the public key, host name, expiry dates, etc. The client
validates the certificate.

Step 3 - After validating the SSL certificate, the client generates a
session key and encrypts it using the public key. The server receives
the encrypted session key and decrypts it with the private key.

Step 4 - Now that both the client and the server hold the same session
key (symmetric encryption), the encrypted data is transmitted in a
secure bi-directional channel.

Why does HTTPS switch to symmetric encryption during data
transmission? There are two main reasons:

1. Security: The asymmetric encryption goes only one way. This means
that if the server tries to send the encrypted data back to the client,
anyone can decrypt the data using the public key.

2. Server resources: The asymmetric encryption adds quite a lot of
mathematical overhead. It is not suitable for data transmissions in long
sessions.

Over to you: how much performance overhead does HTTPS add,
compared to HTTP?
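Steps 3 and 4 of the HTTPS post (wrap a session key with the public key, then switch to symmetric encryption) can be illustrated with a toy model. Everything here is an assumption made for teaching purposes: the RSA primes are tiny, the XOR keystream cipher is made up, and none of it resembles what a real TLS library does or is safe to use.

```python
import hashlib
import secrets

# Toy RSA key pair with tiny primes (real keys are 2048 bits or more).
P, Q = 61, 53
N = P * Q                          # public modulus
E = 17                             # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def keystream_xor(session_key: int, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR the data with a SHA-256-derived keystream.
    Applying it twice with the same key returns the original data."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(f"{session_key}:{counter}".encode()).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


# Step 3: the client picks a session key and encrypts it with the public key;
# the server recovers it with the private key.
client_session_key = secrets.randbelow(N - 2) + 2
encrypted_session_key = pow(client_session_key, E, N)
server_session_key = pow(encrypted_session_key, D, N)

# Step 4: both sides now use the same symmetric key in either direction.
request = b"GET /account HTTP/1.1"
ciphertext = keystream_xor(client_session_key, request)
decrypted_by_server = keystream_xor(server_session_key, ciphertext)
```

The asymmetric operation happens once per session, while the cheap symmetric operation is reused for every message, which is the resource argument the post makes.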

How to learn design patterns?

Besides reading a lot of well-written code, a good book guides us like a
good teacher.

𝐇𝐞𝐚𝐝 𝐅𝐢𝐫𝐬𝐭 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐚𝐭𝐭𝐞𝐫𝐧𝐬, second edition, is the one I would
recommend.

When I began my journey in software engineering, I found it hard to
understand the classic textbook, 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐚𝐭𝐭𝐞𝐫𝐧𝐬, by the Gang of Four.
Luckily, I discovered Head First Design Patterns in the school library.
This book solved a lot of puzzles for me. When I went back to the
Design Patterns book, everything looked familiar and more
understandable.

Last year, I bought the second edition of Head First Design Patterns
and read through it. Here are a few things I like about the book:

🔹 This book solves the challenge of software’s abstract, “invisible”
nature. Software is difficult to build because we cannot see its
architecture; its details are embedded in the code and binary files. It is
even harder to understand software design patterns because these are
higher-level abstractions of the software. The book fixes this by using
visualization. There are lots of diagrams, arrows, and comments on
almost every page. If I do not understand the text, it’s no problem. The
diagrams explain things very well.

🔹 We all have questions we are afraid to ask when we first learn a
new skill. Maybe we think it’s an easy one. This book is good at
tackling design patterns from the student’s point of view. It guides us by
asking our questions and clearly answering them. There is a Guru in
the book and there’s also a Student.

Over to you: which book helped you understand a challenging topic?
Why do you like it?
A visual guide on how to choose the right Database

Picking a database is a long-term commitment so the decision
shouldn’t be made lightly. The important thing to keep in mind is to
choose the right database for the right job.

Data can be structured (SQL table schema), semi-structured (JSON,
XML, etc.), and unstructured (Blob).

Common database categories include:

🔹 Relational
🔹 Columnar
🔹 Key-value
🔹 In-memory
🔹 Wide column
🔹 Time Series
🔹 Immutable ledger
🔹 Geospatial
🔹 Graph
🔹 Document
🔹 Text search
🔹 Blob

Thanks, Satish Chandra Gupta

Over to you - Which database have you used for which workload?

Do you know how to generate globally unique IDs?

In this post, we will explore common requirements for IDs that are used
in social media such as Facebook, Twitter, and LinkedIn.

Requirements:

🔹 Globally unique
🔹 Roughly sorted by time
🔹 Numerical values only
🔹 64 bits
🔹 Highly scalable, low latency

The implementation details of the algorithms can be found online so
we will not go into detail here.

Over to you: What kind of ID generators have you used?
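The post does not name an algorithm, so the following is only one possible way to meet the listed requirements: a Snowflake-style layout that packs a millisecond timestamp, a machine ID, and a per-millisecond sequence into one integer. The bit widths (41/10/12), the EPOCH_MS value, and the class name are assumptions chosen for this sketch.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds
MACHINE_BITS = 10
SEQUENCE_BITS = 12


class IdGenerator:
    def __init__(self, machine_id: int):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self._machine_id = machine_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    @staticmethod
    def _now_ms() -> int:
        return int(time.time() * 1000)

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self._sequence == 0:          # sequence exhausted for this ms
                    while now <= self._last_ms:  # spin until the next millisecond
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return ((timestamp << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self._machine_id << SEQUENCE_BITS)
                    | self._sequence)


generator = IdGenerator(machine_id=7)
ids = [generator.next_id() for _ in range(10_000)]
```

Each machine generates IDs without coordinating with the others, which is what keeps latency low; uniqueness across machines relies on every machine_id being assigned exactly once.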
How does Twitter work?

This post is a summary of a tech talk given by Twitter in 2013. Let’s
take a look.

𝐓𝐡𝐞 𝐋𝐢𝐟𝐞 𝐨𝐟 𝐚 𝐓𝐰𝐞𝐞𝐭:

1️⃣ A tweet comes in through the Write API.
2️⃣ The Write API routes the request to the Fanout service.
3️⃣ The Fanout service does a lot of processing and stores the results
in the Redis cache.
4️⃣ The Timeline service is used to find the Redis server that has the
home timeline on it.
5️⃣ A user pulls their home timeline through the Timeline service.

𝐒𝐞𝐚𝐫𝐜𝐡 & 𝐃𝐢𝐬𝐜𝐨𝐯𝐞𝐫𝐲

🔹 Ingester: annotates and tokenizes Tweets so the data can be
indexed.
🔹 Earlybird: stores search index.
🔹 Blender: creates the search and discovery timelines.

𝐏𝐮𝐬𝐡 𝐂𝐨𝐦𝐩𝐮𝐭𝐞

🔹 HTTP push
🔹 Mobile push

Disclaimer: This article is based on the tech talk given by Twitter in
2013 (https://bit.ly/3vNfjRp). Even though many years have passed, it’s
still quite relevant. I redrew the diagram as the original diagram is
difficult to read.

Over to you:
Do you use Twitter? What are some of the biggest differences between
LinkedIn and Twitter that might shape their system architectures?
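The write path in "The Life of a Tweet" is a fan-out on write: the work happens when the tweet is posted, so reading a home timeline becomes a simple lookup. A minimal in-memory sketch, assuming a plain dictionary stands in for the Redis cache and that FanoutService, timeline_size, and the follower map are hypothetical names:

```python
from collections import defaultdict, deque


class FanoutService:
    def __init__(self, followers: dict, timeline_size: int = 3):
        self._followers = followers  # author -> set of follower names
        # Stand-in for the Redis cache: one bounded, newest-first list per user.
        self._timelines = defaultdict(lambda: deque(maxlen=timeline_size))

    def write(self, author: str, tweet_id: int) -> None:
        """Push the tweet ID onto the home timeline of every follower."""
        for follower in self._followers.get(author, ()):
            self._timelines[follower].appendleft(tweet_id)

    def home_timeline(self, user: str) -> list:
        """Reading is just a lookup of the precomputed timeline."""
        return list(self._timelines[user])


service = FanoutService({"alice": {"bob", "carol"}})
for tweet_id in (1, 2, 3, 4):
    service.write("alice", tweet_id)

bob_timeline = service.home_timeline("bob")
```

The bounded deque drops the oldest entries once a timeline is full, which keeps the per-user cache size fixed.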

What is the difference between Process and Thread?

To better understand this question, let’s first take a look at what is a
Program. A 𝐏𝐫𝐨𝐠𝐫𝐚𝐦 is an executable file containing a set of
instructions and passively stored on disk. One program can have
multiple processes. For example, the Chrome browser creates a
different process for every single tab.

A 𝐏𝐫𝐨𝐜𝐞𝐬𝐬 means a program is in execution. When a program is loaded
into the memory and becomes active, the program becomes a
process. The process requires some essential resources such as
registers, program counter, and stack.

A 𝐓𝐡𝐫𝐞𝐚𝐝 is the smallest unit of execution within a process.

The following process explains the relationship between program,
process, and thread.

1. The program contains a set of instructions.
2. The program is loaded into memory. It becomes one or more
running processes.
3. When a process starts, it is assigned memory and resources. A
process can have one or more threads. For example, in the Microsoft
Word app, one thread might be responsible for spell checking and
another thread for inserting text into the doc.

Main differences between process and thread:

🔹 Processes are usually independent, while threads exist as subsets
of a process.
🔹 Each process has its own memory space. Threads that belong to
the same process share the same memory.
🔹 A process is a heavyweight operation. It takes more time to create
and terminate.
🔹 Context switching is more expensive between processes.
🔹 Communication between threads is faster than communication
between processes.

Over to you:
1). Some programming languages support coroutine. What is the
difference between coroutine and thread?
2). How to list running processes in Linux?
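The point that threads of one process share the same memory can be seen directly. In this small sketch (the variable names and counts are arbitrary), four threads update one dictionary and all report the same process ID; the lock is there because shared memory also means shared responsibility for synchronization.

```python
import os
import threading

shared = {"count": 0}   # lives in the process's memory, visible to every thread
lock = threading.Lock()
seen_pids = set()


def worker(increments: int) -> None:
    for _ in range(increments):
        with lock:                   # protect the shared counter
            shared["count"] += 1
    with lock:
        seen_pids.add(os.getpid())   # every thread reports the same process ID


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Separate processes would each get their own copy of `shared`, so they would need an explicit channel such as a pipe or socket to exchange the count.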
Interview Question: design Google Docs

1️⃣ Clients send document editing operations to the WebSocket Server.
2️⃣ The real-time communication is handled by the WebSocket Server.
3️⃣ Document operations are persisted in the Message Queue.
4️⃣ The File Operation Server consumes operations produced by clients
and generates transformed operations using collaboration algorithms.
5️⃣ Three types of data are stored: file metadata, file content, and
operations.

One of the biggest challenges is real-time conflict resolution. Common
algorithms include:

🔹 Operational transformation (OT)
🔹 Differential Synchronization (DS)
🔹 Conflict-free replicated data type (CRDT)

Google Docs uses OT according to its Wikipedia page and CRDT is an
active area of research for real-time concurrent editing.

Over to you - Have you encountered any issues while using Google
Docs? If so, what do you think might have caused the issue?

Deployment strategies

Deploying or upgrading services is risky. In this post, we explore risk
mitigation strategies.

The diagram below illustrates the common ones.

𝐌𝐮𝐥𝐭𝐢-𝐒𝐞𝐫𝐯𝐢𝐜𝐞 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
In this model, we deploy new changes to multiple services
simultaneously. This approach is easy to implement. But since all the
services are upgraded at the same time, it is hard to manage and test
dependencies. It’s also hard to roll back safely.

𝐁𝐥𝐮𝐞-𝐆𝐫𝐞𝐞𝐧 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
With blue-green deployment, we have two identical environments: one
is staging (blue) and the other is production (green). The staging
environment is one version ahead of production. Once testing is done
in the staging environment, user traffic is switched to the staging
environment, and the staging becomes the production. This
deployment strategy makes rollback simple, but having two
identical production-quality environments could be expensive.

𝐂𝐚𝐧𝐚𝐫𝐲 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
A canary deployment upgrades services gradually, each time to a
subset of users. It is cheaper than blue-green deployment and easy to
roll back. However, since there is no staging environment, we
have to test on production. This process is more complicated because
we need to monitor the canary while gradually migrating more and
more users away from the old version.

𝐀/𝐁 𝐓𝐞𝐬𝐭
In the A/B test, different versions of services run in production
simultaneously. Each version runs an “experiment” for a subset of
users. A/B test is a cheap method to test new features in production.
We need to control the deployment process in case some features are
pushed to users by accident.

Over to you - Which deployment strategy have you used? Did you
witness any deployment-related outages in production and why did
they happen?
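The gradual migration in a canary deployment needs some rule for deciding which users see the new version. One common approach, sketched here under the assumption that users are identified by a string ID, hashes the ID into one of 100 buckets and compares it with the current rollout percentage. The function names and the percentage steps are hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send the user to the canary if their bucket falls under the rollout percentage."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{n}" for n in range(1000)]
canary_at_10 = {u for u in users if route(u, 10) == "canary"}
canary_at_50 = {u for u in users if route(u, 50) == "canary"}
```

Because the bucket depends only on the user ID, a user who is moved to the canary stays there as the percentage grows, and setting the percentage back to 0 acts as the rollback.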
Flowchart of how Slack decides to send a notification

It is a great example of why a simple feature may take much longer to
develop than many people think.

When we have a great design, users may not notice the complexity
because it feels like the feature is just working as intended.

What’s your takeaway from this diagram?

Image source:
https://slack.engineering/reducing-slacks-memory-footprint/

How does Amazon build and operate the software?

In 2019, Amazon released The Amazon Builders' Library. It contains
architecture-based articles that describe how Amazon architects,
releases, and operates technology.

As of today, it published 26 articles. It took me two weekends to go
through all the articles. I’ve had great fun and learned a lot. Here are
some of my favorites:

🔹 Making retries safe with idempotent APIs
🔹 Timeouts, retries, and backoff with jitter
🔹 Beyond five 9s: Lessons from our highest available data planes
🔹 Caching challenges and strategies
🔹 Ensuring rollback safety during deployments
🔹 Going faster with continuous delivery

🔹 Challenges with distributed systems
🔹 Amazon's approach to high-availability deployment

Over to you: what’s your favorite place to learn system design and
design principles?

Link to The Amazon Builders' Library: aws.amazon.com/builders-library

How to design a secure web API access for your website?

When we open web API access to users, we need to make sure each
API call is authenticated. This means the user must be who they claim
to be.

In this post, we explore two common ways:
1. Token based authentication
2. HMAC (Hash-based Message Authentication Code) authentication

The diagram below illustrates how they work.
𝐓𝐨𝐤𝐞𝐧 𝐛𝐚𝐬𝐞𝐝
Step 1 - the user enters their password into the client, and the client
sends the password to the Authentication Server.

Step 2 - the Authentication Server authenticates the credentials and
generates a token with an expiry time.

Steps 3 and 4 - now the client can send requests to access server
resources with the token in the HTTP header. This access is valid until
the token expires.

𝐇𝐌𝐀𝐂 𝐛𝐚𝐬𝐞𝐝
This mechanism generates a Message Authentication Code
(signature) by using a hash function (SHA256 or MD5).

Steps 1 and 2 - the server generates two keys, one is Public APP ID
(public key) and the other one is API Key (private key).

Step 3 - we now generate an HMAC signature on the client side (hmac
A). This signature is generated with a set of attributes listed in the
diagram.

Step 4 - the client sends requests to access server resources with
hmac A in the HTTP header.

Step 5 - the server receives the request which contains the request
data and the authentication header. It extracts the necessary attributes
from the request and uses the API key that’s stored on the server side
to generate a signature (hmac B).

Steps 6 and 7 - the server compares hmac A (generated on the client
side) and hmac B (generated on the server side). If they match,
the requested resource will be returned to the client.

Question - How does HMAC authentication ensure data integrity? Why
do we include “request timestamp” in HMAC signature generation?
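The HMAC flow in steps 3 to 7 can be sketched with Python's hmac module. The exact attribute set is defined in the diagram, which is not reproduced here, so the fields signed below (app ID, HTTP method, path, body digest, timestamp), the max_skew window, and the function names are all assumptions for illustration.

```python
import hashlib
import hmac
import time


def sign(api_key: bytes, app_id: str, method: str, path: str,
         body: bytes, timestamp: int) -> str:
    """Build the signature from the request attributes (client and server both do this)."""
    message = "\n".join([
        app_id, method, path, hashlib.sha256(body).hexdigest(), str(timestamp),
    ]).encode("utf-8")
    return hmac.new(api_key, message, hashlib.sha256).hexdigest()


def verify(api_key: bytes, app_id: str, method: str, path: str, body: bytes,
           timestamp: int, signature: str, now=None, max_skew: int = 300) -> bool:
    """Server side: reject stale requests, then recompute hmac B and compare with hmac A."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > max_skew:
        return False
    expected = sign(api_key, app_id, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)


api_key = b"server-issued-secret"
body = b'{"item": "book", "qty": 1}'
hmac_a = sign(api_key, "app-1", "POST", "/orders", body, timestamp=1_000)

accepted = verify(api_key, "app-1", "POST", "/orders", body, 1_000, hmac_a, now=1_010)
tampered = verify(api_key, "app-1", "POST", "/orders", b'{"qty": 99}', 1_000, hmac_a, now=1_010)
replayed = verify(api_key, "app-1", "POST", "/orders", body, 1_000, hmac_a, now=5_000)
```

Changing the body changes the signature, which is how integrity is checked, and a captured request stops being accepted once its signed timestamp falls outside the allowed window.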

How do microservices collaborate and interact with each other?

There are two ways: 𝐨𝐫𝐜𝐡𝐞𝐬𝐭𝐫𝐚𝐭𝐢𝐨𝐧 and 𝐜𝐡𝐨𝐫𝐞𝐨𝐠𝐫𝐚𝐩𝐡𝐲.

The diagram below illustrates the collaboration of microservices.

Choreography is like having a choreographer set all the rules. Then the
dancers on stage (the microservices) interact according to them.
Service choreography describes this exchange of messages and the
rules by which the microservices interact.

Orchestration is different. The orchestrator acts as a center of
authority. It is responsible for invoking and combining the services. It
describes the interactions between all the participating services. It is
just like a conductor leading the musicians in a musical symphony. The
orchestration pattern also includes the transaction management
among different services.

The benefits of orchestration:
1. Reliability - orchestration has built-in transaction management and
error handling, while choreography is point-to-point communications
and the fault tolerance scenarios are much more complicated.
2. Scalability - when adding a new service into orchestration, only the
orchestrator needs to modify the interaction rules, while in
choreography all the interacting services need to be modified.

Some limitations of orchestration:
1. Performance - all the services talk via a centralized orchestrator, so
latency is higher than it is with choreography. Also, the throughput is
bound to the capacity of the orchestrator.
2. Single point of failure - if the orchestrator goes down, no services
can talk to each other. To mitigate this, the orchestrator must be highly
available.

Real-world use case: Netflix Conductor is a microservice orchestrator
and you can read more details on the orchestrator design.

Question - Have you used orchestrator products in production? What
are their pros & cons?
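The difference between the two styles shows up clearly in code. In this minimal sketch, three hypothetical services (stock, payment, shipping) are plain functions; the orchestrated version calls them from one central place, while the choreographed version has each step react to an event published by the previous one. The EventBus class and the event names are assumptions made for the example.

```python
from collections import defaultdict


def reserve_stock(order: str) -> str:
    return f"stock reserved for {order}"


def charge_payment(order: str) -> str:
    return f"payment charged for {order}"


def create_shipment(order: str) -> str:
    return f"shipment created for {order}"


def orchestrate(order: str) -> list:
    """Orchestration: one central function knows the whole flow."""
    return [reserve_stock(order), charge_payment(order), create_shipment(order)]


class EventBus:
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event: str, handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, order: str) -> None:
        for handler in list(self._handlers[event]):
            handler(order)


def choreograph(order: str) -> list:
    """Choreography: each service only knows which event it reacts to."""
    bus, trail = EventBus(), []

    def on_order_placed(o):
        trail.append(reserve_stock(o))
        bus.publish("stock_reserved", o)

    def on_stock_reserved(o):
        trail.append(charge_payment(o))
        bus.publish("payment_charged", o)

    def on_payment_charged(o):
        trail.append(create_shipment(o))

    bus.subscribe("order_placed", on_order_placed)
    bus.subscribe("stock_reserved", on_stock_reserved)
    bus.subscribe("payment_charged", on_payment_charged)
    bus.publish("order_placed", order)
    return trail


orchestrated_steps = orchestrate("order-42")
choreographed_steps = choreograph("order-42")
```

Adding a fourth step means editing only `orchestrate` in the first style, but touching the publishing service and adding a subscriber in the second, which mirrors the scalability point in the post.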
What are the differences between Virtualization (VMware) and
Containerization (Docker)?

The diagram below illustrates the layered architecture of virtualization
and containerization.

“Virtualization is a technology that allows you to create multiple
simulated environments or dedicated resources from a single, physical
hardware system” [1].

“Containerization is the packaging together of software code with all its
necessary components like libraries, frameworks, and other
dependencies so that they are isolated in their own "container"” [2].

The major differences are:

🔹 In virtualization, the hypervisor creates an abstraction layer over
hardware, so that multiple operating systems can run alongside each
other. This technique is considered to be the first generation of cloud
computing.

🔹 Containerization is considered to be a lightweight version of
virtualization, which virtualizes the operating system instead of
hardware. Without the hypervisor, the containers enjoy faster resource
provisioning. All the resources (including code, dependencies) that are
needed to run the application or microservice are packaged together,
so that the applications can run anywhere.

Question: how much performance difference have you observed in
production between virtualization, containerization, and bare-metal?

Image Source: https://lnkd.in/gaPYcGTz

Sources:
[1] Understanding virtualization: https://lnkd.in/gtQY9gkx
[2] What is containerization?: https://lnkd.in/gm4Qv_x2

Which cloud provider should be used when building a big data
solution?

The diagram below illustrates the detailed comparison of AWS, Google
Cloud, and Microsoft Azure.

The common parts of the solutions:

1. Data ingestion of structured or unstructured data.
2. Raw data storage.
3. Data processing, including filtering, transformation, normalization,
etc.
4. Data warehouse, including key-value storage, relational database,
OLAP database, etc.
5. Presentation layer with dashboards and real-time notifications.

It is interesting to see different cloud vendors have different names for
the same type of products.

For example, the first step and the last step both use the serverless
product. The product is called “lambda” in AWS, and “function” in
Azure and Google Cloud.

Question - which products have you used in production? What kind of
application did you use it for?

Source: S.C. Gupta’s post
How to avoid crawling duplicate URLs at Google scale?
Option 1: Use a Set data structure to check if a URL already exists or
not. Set is fast, but it is not space-efficient.

Option 2: Store URLs in a database and check if a new URL is in the
database. This can work but the load to the database will be very high.

Option 3: 𝐁𝐥𝐨𝐨𝐦 𝐟𝐢𝐥𝐭𝐞𝐫. This option is preferred. Bloom filter was
proposed by Burton Howard Bloom in 1970. It is a probabilistic data
structure that is used to test whether an element is a member of a set.

🔹 false: the element is definitely not in the set.

🔹 true: the element is probably in the set.

False-positive matches are possible, but false negatives are not.
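
A minimal sketch of such a filter, assuming three hash positions derived from seeded SHA-256 digests (a stand-in chosen because it ships with Python; it is not one of the fast hash functions that production filters use) and a byte array with one byte per bit for readability. The class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        """Yield one bit position per hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True means probably present."""
        return all(self.bits[position] for position in self._positions(item))


crawled = BloomFilter()
urls = [f"www.myweb{n}.com" for n in range(100)]
for url in urls:
    crawled.add(url)
```

A URL that was added is always reported as present, while a URL that was never added is usually, but not always, reported as absent.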

The diagram below illustrates how the Bloom filter works. The basic
data structure for the Bloom filter is Bit Vector. Each bit represents a
hashed value.

Step 1: To add an element to the bloom filter, we feed it to 3 different
hash functions (A, B, and C) and set the bits at the resulting positions.
Note that both “www.myweb1.com” and “www.myweb2.com” mark the
same bit with 1 at index 5. False positives are possible because a bit
might be set by another element.

Step 2: When testing the existence of a URL string, the same hash
functions A, B, and C are applied to the URL string. If all three bits are


1, then the URL may exist in the dataset; if any of the bits is 0, then the
URL definitely does not exist in the dataset.

Hash function choices are important. They must be uniformly
distributed and fast. For example, RedisBloom and Apache Spark use
murmur, and InfluxDB uses xxhash.

Question - In our example, we used three hash functions. How many
hash functions should we use in reality? What are the trade-offs?

Why is a solid-state drive (SSD) fast?

“A solid state drive reads up to 10 times faster and writes up to 20
times faster than a hard disk drive.” [1].

“An SSD is a flash-memory based data storage device. Bits are stored
into cells, which are made of floating-gate transistors. SSDs are made
entirely of electronic components, there are no moving or mechanical
parts like in hard drives (HDD)” [2].

The diagram below illustrates the SSD architecture.
Step 1: “Commands come from the user through the host interface” [2].
The interface can be Serial ATA (SATA) or PCI Express (PCIe).

Step 2: “The processor in the SSD controller takes the commands and
passes them to the flash controller” [2].

Step 3: “SSDs also have embedded RAM memory, generally for
caching purposes and to store mapping information” [2].

Step 4: “The packages of NAND flash memory are organized in gangs,
over multiple channels” [2].

The second diagram illustrates how the logical and physical pages are
mapped, and why this architecture is fast.

SSD controller operates multiple FLASH particles in parallel, greatly
improving the underlying bandwidth. When we need to write more than
one page, the SSD controller can write them in parallel [3], whereas
the HDD has a single head and it can only read from one head at a
time.

Every time a HOST Page is written, the SSD controller finds a Physical
Page to write the data and this mapping is recorded. With this
mapping, the next time HOST reads a HOST Page, the SSD knows
where to read the data from FLASH [3].

Question - What are the main differences between SSD and HDD?

If you are interested in the architecture, I recommend reading Coding
for SSDs by Emmanuel Goossaert in reference [2].

Handling a large-scale outage

This is a true story about handling a large-scale outage, written by
Sahn Lam, Staff Engineer at Discord.

About 10 years ago, I witnessed the most impactful UI bug in my
career.

It was 9PM on a Friday. I was on the team responsible for one of the
largest social games at the time. It had about 30 million DAU. I just
happened to glance at the operational dashboard before shutting down
for the night.

Every line on the dashboard was at zero.

At that very moment, I got a phone call from my boss. He said the
entire game was down. Firefighting mode. Full on.

Everything had shut down. Every single instance on AWS was
terminated. HA proxy instances, PHP web servers, MySQL databases,
Memcache nodes, everything.

It took 50 people 10 hours to bring everything back up. It was quite a
feat. That in itself is a story for another day.

We used a cloud management software vendor to manage our AWS
deployment. This was before Infrastructure as Code was a thing. There
was no Terraform. It was so early in cloud computing and we were so
big that AWS required an advance warning before we scaled up.
Sources:
[1] SSD or HDD: Which Is Right for You?:
What had gone wrong? The software vendor had introduced a bug that
https://www.avg.com/en/signal/ssd-hdd-which-is-best
week in their confirmation dialog flow. When terminating a subset of
[2] Coding for SSDs:
nodes in the UI, it would correctly show in the confirmation dialog box
https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introductio
the list of nodes to be terminated, but under the hood, it terminated
n-and-table-of-contents/
everything.
[3] Overview of SSD Structure and Basic Working Principle:
https://www.elinfor.com/knowledge/overview-of-ssd-structure-and-basic
Shortly before 9PM that fateful evening, one of our poor SREs fulfilled
-working-principle1-p-11203
our routine request and terminated an unused Memcache pool. I could
only imagine the horror and the phone conversation that ensured.

48 49

What kind of code structure could allow this disastrous bug to slip through? We could only guess. We never received a full explanation.

What are some of the most impactful software bugs you encountered in your career?

AWS Lambda behind the scenes

Serverless is one of the hottest topics in cloud services. How does AWS Lambda work behind the scenes?

Lambda is a 𝐬𝐞𝐫𝐯𝐞𝐫𝐥𝐞𝐬𝐬 computing service provided by Amazon Web Services (AWS), which runs functions in response to events.

𝐅𝐢𝐫𝐞𝐜𝐫𝐚𝐜𝐤𝐞𝐫 𝐌𝐢𝐜𝐫𝐨𝐕𝐌
Firecracker is the engine powering all of the Lambda functions [1]. It is a virtualization technology developed at Amazon and written in Rust.

The diagram below illustrates the isolation model for AWS Lambda Workers.
Lambda functions run within a sandbox, which provides a minimal Linux userland, some common libraries and utilities. It creates the Execution environment (worker) on EC2 instances.

How are lambdas initiated and invoked? There are two ways.

𝐒𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Step 1: "The Worker Manager communicates with a Placement Service which is responsible to place a workload on a location for the given host (it’s provisioning the sandbox) and returns that to the Worker Manager" [2].

Step 2: "The Worker Manager can then call 𝘐𝘯𝘪𝘵 to initialize the function for execution by downloading the Lambda package from S3 and setting up the Lambda runtime" [2].

Step 3: The Frontend Worker is now able to call 𝘐𝘯𝘷𝘰𝘬𝘦 [2].

𝐀𝐬𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Step 1: The Application Load Balancer forwards the invocation to an available Frontend, which places the event onto an internal queue (SQS).

Step 2: There is "a set of pollers assigned to this internal queue which are responsible for polling it and moving the event onto a Frontend synchronously. After it’s been placed onto the Frontend it follows the synchronous invocation call pattern which we covered earlier" [2].

Question: Can you think of any use cases for AWS Lambda?

Sources:
[1] AWS Lambda whitepaper: https://docs.aws.amazon.com/whitepapers/latest/security-overview-aws-lambda/lambda-executions.html
[2] Behind the scenes, Lambda: https://www.bschaatsbergen.com/behind-the-scenes-lambda/

Image source: [1] [2]

HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC).

What problem does each generation of HTTP solve?

The diagram below illustrates the key features.

🔹 HTTP 1.0 was finalized and fully documented in 1996. Every request to the same server requires a separate TCP connection.

🔹 HTTP 1.1 was published in 1997. A TCP connection can be left open for reuse (persistent connection), but it doesn’t solve the HOL (head-of-line) blocking issue.

HOL blocking - when the number of allowed parallel requests in the browser is used up, subsequent requests need to wait for the former ones to complete.

🔹 HTTP 2.0 was published in 2015. It addresses the HOL issue through request multiplexing, which eliminates HOL blocking at the application layer, but HOL still exists at the transport (TCP) layer.

As you can see in the diagram, HTTP 2.0 introduced the concept of HTTP “streams”: an abstraction that allows multiplexing different HTTP exchanges onto the same TCP connection. Each stream doesn’t need to be sent in order.

🔹 HTTP 3.0 first draft was published in 2020. It is the proposed successor to HTTP 2.0. It uses QUIC instead of TCP for the underlying transport protocol, thus removing HOL blocking in the transport layer.

QUIC is based on UDP. It introduces streams as first-class citizens at the transport layer. QUIC streams share the same QUIC connection, so no additional handshakes and slow starts are required to create new ones, but QUIC streams are delivered independently such that in most cases packet loss affecting one stream doesn't affect others.

Question: When shall we upgrade to HTTP 3.0? Any pros & cons you can think of?

How to scale a website to support millions of users?

We will explain this step-by-step.

The diagram below illustrates the evolution of a simplified eCommerce website. It goes from a monolithic design on one single server, to a service-oriented/microservice architecture.
Suppose we have two services: inventory service (handles product
descriptions and inventory management) and user service (handles
user information, registration, login, etc.).

Step 1 - With the growth of the user base, one single application server
cannot handle the traffic anymore. We put the application server and
the database server into two separate servers.

Step 2 - The business continues to grow, and a single application server is no longer enough. So we deploy a cluster of application servers.

Step 3 - Now the incoming requests have to be routed to multiple application servers. How can we ensure each application server gets an even load? The load balancer handles this nicely.

Step 4 - With the business continuing to grow, the database might become the bottleneck. To mitigate this, we separate reads and writes so that frequent read queries go to read replicas. With this setup, the main database is relieved of most read traffic, so the throughput for database writes can be greatly increased.
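Steps 3 and 4 can be sketched as two small routing decisions. This is a simplified illustration, not how any particular load balancer or database proxy works: the class names, the round-robin policy, and the "statement starts with SELECT" rule for detecting reads are all assumptions made for the example.

```python
import itertools


class RoundRobinBalancer:
    """Step 3: hand out servers in turn so each gets an even share of requests."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class ReadWriteRouter:
    """Step 4: send reads to replicas and everything else to the main database."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = RoundRobinBalancer(replicas)

    def route(self, sql):
        # Naive read detection, assumed for illustration only.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return self._replicas.pick() if is_read else self.primary
```

For example, `ReadWriteRouter("primary", ["replica-1", "replica-2"])` would send consecutive `SELECT` statements to alternating replicas, while an `UPDATE` always goes to `"primary"`.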

Step 5 - Suppose the business continues to grow. One single database cannot handle the load on both the inventory table and user table. We have a few options:

1. Vertical scaling: adding more power (CPU, RAM, etc.) to the database server. It has a hard limit.
2. Horizontal scaling: adding more database servers.
3. Adding a caching layer to offload read requests.

Step 6 - Now we can modularize the functions into different services. The architecture becomes service-oriented / microservice.

Question: what else do we need to support an e-commerce website at Amazon’s scale?

DevOps Books

Some 𝐃𝐞𝐯𝐎𝐩𝐬 books I find enlightening:

🔹 Accelerate - presents both the findings and the science behind measuring software delivery performance.

🔹 Continuous Delivery - introduces automated architecture management and data migration. It also points out key problems and optimal solutions in each area.

🔹 Site Reliability Engineering - famous Google SRE book. It explains the whole life cycle of Google’s development, deployment, and monitoring, and how to manage the world’s biggest software systems.

🔹 Effective DevOps - provides effective ways to improve team coordination.

🔹 The Phoenix Project - a classic novel about effectiveness and communications. IT work is like manufacturing plant work, and a system must be established to streamline the workflow. Very interesting read!

🔹 The DevOps Handbook - introduces product development, quality assurance, IT operations, and information security.

What’s your favorite dev-ops book?
Why is Kafka fast?

Kafka achieves low latency message delivery through Sequential I/O and Zero Copy Principle. The same techniques are commonly used in many other messaging/streaming platforms.

The diagram below illustrates how the data is transmitted between producer and consumer, and what zero-copy means.

🔹 Step 1.1 - 1.3: Producer writes data to the disk

🔹 Step 2: Consumer reads data without zero-copy
2.1 The data is loaded from disk to OS cache
2.2 The data is copied from OS cache to Kafka application
2.3 Kafka application copies the data into the socket buffer
2.4 The data is copied from socket buffer to network card
2.5 The network card sends data out to the consumer

🔹 Step 3: Consumer reads data with zero-copy
3.1 The data is loaded from disk to OS cache
3.2 OS cache directly copies the data to the network card via sendfile() command
3.3 The network card sends data out to the consumer

Zero copy is a shortcut to save the multiple data copies between application context and kernel context. This approach brings down the time by approximately 65%.
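The difference between the two consumer read paths can be illustrated with Python's standard library. This is a sketch, not Kafka's code (Kafka runs on the JVM): the function names are made up for the example. `send_with_copies` mirrors the path without zero-copy, where data passes through an application buffer. `send_zero_copy` relies on `socket.sendfile()`, which uses the operating system's `sendfile` call where available and otherwise falls back to ordinary sends.

```python
import socket


def send_with_copies(path, sock, chunk_size=64 * 1024):
    """Without zero-copy: read() copies data into the application,
    then sendall() copies it back into the kernel's socket buffer."""
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sock.sendall(chunk)
            sent += len(chunk)
    return sent


def send_zero_copy(path, sock):
    """With zero-copy: ask the OS to move file data to the socket directly,
    so the bytes never enter an application-level buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

Both functions return the number of bytes sent. The received bytes are identical; only the number of copies between kernel and application differs.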

SOAP vs REST vs GraphQL vs RPC.

The diagram below illustrates the API timeline and API styles comparison.

Over time, different API architectural styles are released. Each of them has its own patterns of standardizing data exchange.

You can check out the use cases of each style in the diagram.

Source: https://lnkd.in/gFgi33RY I combined a few diagrams together. The credit all goes to AltexSoft.

How do modern browsers work?

Google published a series of articles about "Inside look at modern web browser". It's a great read.

Links:
https://developer.chrome.com/blog/inside-browser-part1/
https://developer.chrome.com/blog/inside-browser-part2/
https://developer.chrome.com/blog/inside-browser-part3/
https://developer.chrome.com/blog/inside-browser-part4/
Redis vs Memcached

The diagram below illustrates the key differences.

The advantages on data structures make Redis a good choice for:

🔹 Recording the number of clicks and comments for each post (hash)

🔹 Sorting the commented user list and deduping the users (zset)

🔹 Caching user behavior history and filtering malicious behaviors (zset, hash)

🔹 Storing boolean information of extremely large data into small space. For example, login status, membership status. (bitmap)

Optimistic locking

Optimistic locking, also referred to as optimistic concurrency control, allows multiple concurrent users to attempt to update the same resource.

There are two common ways to implement optimistic locking: version number and timestamp. Version number is generally considered to be a better option because the server clock can be inaccurate over time. We explain how optimistic locking works with version number.

The diagram below shows a successful case and a failure case.

1. A new column called “version” is added to the database table.

2. Before a user modifies a database row, the application reads the version number of the row.

3. When the user updates the row, the application increases the version number by 1 and writes it back to the database.

4. A database validation check is put in place; the next version number should exceed the current version number by 1. The transaction aborts if the validation fails and the user tries again from step 2.
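The four steps can be sketched with SQLite from Python's standard library. The `items` table, its columns, and the helper names are assumptions made for this example. The validation in step 4 is expressed here as a `WHERE version = ?` condition on the `UPDATE`, which is one common way to implement the check.

```python
import sqlite3


def make_demo_db():
    """Step 1: a table with a "version" column (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, stock INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO items VALUES (1, 10, 1)")
    conn.commit()
    return conn


def read_item(conn, item_id):
    """Step 2: read the current value together with its version number."""
    return conn.execute(
        "SELECT stock, version FROM items WHERE id = ?", (item_id,)
    ).fetchone()


def write_item(conn, item_id, new_stock, read_version):
    """Steps 3-4: write version + 1, but only if the row still has the
    version we read. Returns False when another writer got there first."""
    cur = conn.execute(
        "UPDATE items SET stock = ?, version = ? WHERE id = ? AND version = ?",
        (new_stock, read_version + 1, item_id, read_version),
    )
    conn.commit()
    return cur.rowcount == 1
```

If two clients both read version 1, the first `write_item` call succeeds and moves the row to version 2. The second call matches no row, returns `False`, and that client has to re-read and retry from step 2.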

Optimistic locking is usually faster than pessimistic locking because we do not lock the database. However, the performance of optimistic locking drops dramatically when concurrency is high.

To understand why, consider the case when many clients try to reserve a hotel room at the same time. Because there is no limit on how many clients can read the available room count, all of them read back the same available room count and the current version number. When different clients make reservations and write back the results to the database, only one of them will succeed, and the rest of the clients receive a version check failure message. These clients have to retry. In the subsequent round of retries, there is only one successful client again, and the rest have to retry. Although the end result is correct, repeated retries cause a very unpleasant user experience.

Question: what are the possible ways of solving race conditions?

Tradeoff between latency and consistency

Understanding the 𝐭𝐫𝐚𝐝𝐞𝐨𝐟𝐟𝐬 is very important not only in system design interviews but also when designing real-world systems. When we talk about data replication, there is a fundamental tradeoff between 𝐥𝐚𝐭𝐞𝐧𝐜𝐲 and 𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲. It is illustrated by the diagram below.
Cache miss attack

Caching is awesome but it doesn’t come without a cost, just like many things in life.

One of the issues is 𝐂𝐚𝐜𝐡𝐞 𝐌𝐢𝐬𝐬 𝐀𝐭𝐭𝐚𝐜𝐤. Correct me if this is not the right term. It refers to the scenario where data to fetch doesn't exist in the database and the data isn’t cached either. So every request hits the database eventually, defeating the purpose of using a cache. If a malicious user initiates lots of queries with such keys, the database can easily be overloaded.

The diagram below illustrates the process.

Two approaches are commonly used to solve this problem:

🔹 Cache keys with null value. Set a short TTL (Time to Live) for keys with null value.

🔹 Using Bloom filter. A Bloom filter is a data structure that can rapidly tell us whether an element is present in a set or not. If the key exists, the request first goes to the cache and then queries the database if needed. If the key doesn't exist in the data set, it means the key doesn’t exist in the cache/database. In this case, the query will not hit the cache or database layer.
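Both defenses, caching null values with a short TTL and screening keys with a Bloom filter before touching the cache or database, can be combined in one lookup path. The sketch below is an illustration under assumptions: `GuardedCache`, its parameters, and the dictionary standing in for the database are made up for the example. `key_filter` can be any object supporting `key in key_filter`; a plain `set` is enough to try it out, while a real Bloom filter would occasionally let a missing key through, which the null-value cache then absorbs.

```python
import time

MISSING = object()  # sentinel cached for keys that are absent from the database


class GuardedCache:
    def __init__(self, db, key_filter, null_ttl=30.0, clock=time.monotonic):
        self.db = db                  # dict standing in for the database
        self.key_filter = key_filter  # Bloom-filter-like membership check
        self.null_ttl = null_ttl      # short TTL (seconds) for cached "null" entries
        self.clock = clock
        self.cache = {}               # key -> (value, expires_at or None)
        self.db_hits = 0              # counts how often the database is queried

    def get(self, key):
        # Defense 2: keys the filter has never seen skip cache and database.
        if key not in self.key_filter:
            return None
        entry = self.cache.get(key)
        if entry is not None:
            value, expires_at = entry
            if expires_at is None or self.clock() < expires_at:
                return None if value is MISSING else value
            del self.cache[key]  # the cached null entry has expired
        self.db_hits += 1
        value = self.db.get(key)
        if value is None:
            # Defense 1: remember the miss briefly so repeats stay off the database.
            self.cache[key] = (MISSING, self.clock() + self.null_ttl)
            return None
        self.cache[key] = (value, None)
        return value
```

Repeated requests for the same missing key reach the database at most once per `null_ttl` window.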

How to diagnose a mysterious process that’s taking too much CPU, memory, IO, etc?

The diagram below illustrates helpful tools in a Linux system.

🔹 ‘vmstat’ - reports information about processes, memory, paging, block IO, traps, and CPU activity.

🔹 ‘iostat’ - reports CPU and input/output statistics of the system.

🔹 ‘netstat’ - displays statistical data related to IP, TCP, UDP, and ICMP protocols.

🔹 ‘lsof’ - lists open files of the current system.

🔹 ‘pidstat’ - monitors the utilization of system resources by all or specified processes, including CPU, memory, device IO, task switching, threads, etc.

What are the top cache strategies?

Read data from the system:
🔹 Cache aside
🔹 Read through

Write data to the system:
🔹 Write around
🔹 Write back
🔹 Write through

The diagram below illustrates how those 5 strategies work. Some of the caching strategies can be used together.
Question: What are the pros and cons of each caching strategy? How
to choose the right one to use?



I left out a lot of details as that will make the post very long. Feel free to
leave a comment so we can learn from each other.


Upload large files

How can we optimize performance when we 𝐮𝐩𝐥𝐨𝐚𝐝 𝐥𝐚𝐫𝐠𝐞 𝐟𝐢𝐥𝐞𝐬 to object storage service such as S3?

Before we answer this question, let's take a look at why we need to optimize this process. Some files might be larger than a few GBs. It is possible to upload such a large object file directly, but it could take a long time. If the network connection fails in the middle of the upload, we have to start over. A better solution is to slice a large object into smaller parts and upload them independently. After all the parts are uploaded, the object store re-assembles the object from the parts. This process is called 𝐦𝐮𝐥𝐭𝐢𝐩𝐚𝐫𝐭 𝐮𝐩𝐥𝐨𝐚𝐝.

The diagram below illustrates how multipart upload works:

1. The client calls the object storage to initiate a multipart upload.

2. The data store returns an uploadID, which uniquely identifies the upload.

3. The client splits the large file into small objects and starts uploading. Let’s assume the size of the file is 1.6GB and the client splits it into 8 parts, so each part is 200 MB in size. The client uploads the first part to the data store together with the uploadID it received in step 2.

4. When a part is uploaded, the data store returns an ETag, which is essentially the md5 checksum of that part. It is used to verify multipart uploads.

5. After all parts are uploaded, the client sends a complete multipart upload request, which includes the uploadID, part numbers, and ETags.

6. The data store reassembles the object from its parts based on the part number. Since the object is really large, this process may take a few minutes. After reassembly is complete, it returns a success message to the client.
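The six steps can be walked through with a small in-memory stand-in for the data store. This is a sketch under assumptions, not the S3 API: `InMemoryObjectStore`, its method names, and `multipart_upload` are hypothetical, and the parts are uploaded one after another here, whereas a real client would typically send them in parallel and retry only the parts that fail.

```python
import hashlib
import uuid


class InMemoryObjectStore:
    """Hypothetical data store that keeps uploaded parts in memory."""

    def __init__(self):
        self._uploads = {}  # upload_id -> {part_number: bytes}
        self.objects = {}   # key -> reassembled bytes

    def initiate(self):
        # Steps 1-2: start an upload and return its uploadID.
        upload_id = uuid.uuid4().hex
        self._uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id, part_number, data):
        # Step 4: store the part and return its ETag (MD5 of the part).
        self._uploads[upload_id][part_number] = data
        return hashlib.md5(data).hexdigest()

    def complete(self, upload_id, key, parts):
        # Steps 5-6: verify every ETag, then reassemble by part number.
        stored = self._uploads[upload_id]
        for part_number, etag in parts:
            if hashlib.md5(stored[part_number]).hexdigest() != etag:
                raise ValueError(f"ETag mismatch for part {part_number}")
        self.objects[key] = b"".join(stored[n] for n, _ in sorted(parts))
        del self._uploads[upload_id]
        return "success"


def multipart_upload(store, key, data, part_size):
    """Client side: split the data into parts, upload each, then complete."""
    upload_id = store.initiate()
    parts = []
    for number, start in enumerate(range(0, len(data), part_size), start=1):
        etag = store.upload_part(upload_id, number, data[start:start + part_size])
        parts.append((number, etag))  # step 3: part number travels with its ETag
    return store.complete(upload_id, key, parts)
```

With the numbers from step 3, a 1.6 GB file and a 200 MB `part_size` would yield 8 parts.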


Why is Redis so Fast?

There are 3 main reasons as shown in the diagram below.

1. Redis is a RAM-based database. RAM access is at least 1000 times faster than random disk access.

2. Redis leverages IO multiplexing and single-threaded execution loop for execution efficiency.

3. Redis leverages several efficient lower-level data structures.

Question: Another popular in-memory store is Memcached. Do you know the differences between Redis and Memcached?

You might have noticed the style of this diagram is different from my previous posts. Please let me know which one you prefer.

SWIFT payment network

You probably heard about 𝐒𝐖𝐈𝐅𝐓. What is SWIFT? What role does it play in cross-border payments? You can find answers to those questions in this post.

The Society for Worldwide Interbank Financial Telecommunication (SWIFT) is the main secure 𝐦𝐞𝐬𝐬𝐚𝐠𝐢𝐧𝐠 𝐬𝐲𝐬𝐭𝐞𝐦 that links the world’s banks.

The Belgium-based system is run by its member banks and handles millions of payment messages per day. The diagram below illustrates how payment messages are transmitted from Bank A (in New York) to Bank B (in London).

Step 1: Bank A sends a message with transfer details to Regional Processor A in New York. The destination is Bank B.

Step 2: Regional Processor A validates the format and sends it to Slice Processor A. The Regional Processor is responsible for input message validation and output message queuing. The Slice Processor is responsible for storing and routing messages safely.

Step 3: Slice Processor A stores the message.

Step 4: Slice Processor A informs Regional Processor A that the message is stored.

Step 5: Regional Processor A sends ACK/NAK to Bank A. ACK means a message will be sent to Bank B. NAK means the message will NOT be sent to Bank B.

Step 6: Slice Processor A sends the message to Regional Processor B in London.

Step 7: Regional Processor B stores the message temporarily.

Step 8: Regional Processor B assigns a unique ID MON (Message Output Number) to the message and sends it to Slice Processor B.

Step 9: Slice Processor B validates MON.

Step 10: Slice Processor B authorizes Regional Processor B to send the message to Bank B.

Step 11: Regional Processor B sends the message to Bank B.

Step 12: Bank B receives the message and stores it.

Step 13: Bank B sends UAK/UNK to Regional Processor B. UAK (user positive acknowledgment) means Bank B received the message without error; UNK (user negative acknowledgment) means Bank B encountered a checksum failure.

Step 14: Regional Processor B creates a report based on Bank B’s response, and sends it to Slice Processor B.

Step 15: Slice Processor B stores the report.

Step 16 - 17: Slice Processor B sends a copy of the report to Slice Processor A. Slice Processor A stores the report.
At-most once, at-least once, and exactly once

In modern architecture, systems are broken up into small and independent building blocks with well-defined interfaces between them. Message queues provide communication and coordination for those building blocks. Today, let’s discuss different delivery semantics: at-most once, at-least once, and exactly once.

𝐀𝐭-𝐦𝐨𝐬𝐭 𝐨𝐧𝐜𝐞
As the name suggests, at-most once means a message will be delivered not more than once. Messages may be lost but are not redelivered. This is how at-most once delivery works at the high level.

Use cases: It is suitable for use cases like monitoring metrics, where a small amount of data loss is acceptable.

𝐀𝐭-𝐥𝐞𝐚𝐬𝐭 𝐨𝐧𝐜𝐞
With this data delivery semantic, it’s acceptable to deliver a message more than once, but no message should be lost.

Use cases: With at-least once, messages won’t be lost but the same message might be delivered multiple times. While not ideal from a user perspective, at-least once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a message can be rejected when writing duplicate data to the database.

𝐄𝐱𝐚𝐜𝐭𝐥𝐲 𝐨𝐧𝐜𝐞
Exactly once is the most difficult delivery semantic to implement. It is friendly to users, but it has a high cost for the system’s performance and complexity.

Use cases: Financial-related use cases (payment, trading, accounting, etc.). Exactly once is especially important when duplication is not acceptable and the downstream service or third party doesn’t support idempotency.

Question: what is the difference between message queues vs event streaming platforms such as Kafka, Apache Pulsar, etc?
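The consumer-side deduplication mentioned for at-least once delivery, where a unique key in each message lets the database reject duplicates, can be sketched with SQLite. The `payments` table, the message fields `id` and `amount`, and the function names are assumptions made for this illustration.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    # The unique message key is the primary key, so a redelivered
    # message cannot be written twice.
    conn.execute("CREATE TABLE payments (message_id TEXT PRIMARY KEY, amount INTEGER)")
    return conn


def handle(conn, message):
    """Process one delivery. Returns True if applied, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (message_id, amount) VALUES (?, ?)",
                (message["id"], message["amount"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # same message_id seen before: ignore the redelivery
```

If the broker delivers the message with id `m1` twice, the second call to `handle` returns `False` and the stored total is unaffected.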


Vertical partitioning and Horizontal partitioning

In many large-scale applications, data is divided into partitions that can be accessed separately. There are two typical strategies for partitioning data.

🔹 Vertical partitioning: it means some columns are moved to new tables. Each table contains the same number of rows but fewer columns (see diagram below).

🔹 Horizontal partitioning (often called sharding): it divides a table into multiple smaller tables. Each table is a separate data store, and it contains the same number of columns, but fewer rows (see diagram below).

Horizontal partitioning is widely used so let’s take a closer look.

𝐑𝐨𝐮𝐭𝐢𝐧𝐠 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦
Routing algorithm decides which partition (shard) stores the data.

🔹 Range-based sharding. This algorithm uses ordered columns, such as integers, longs, timestamps, to separate the rows. For example, the diagram below uses the User ID column for range partition: User IDs 1 and 2 are in shard 1, User IDs 3 and 4 are in shard 2.

🔹 Hash-based sharding. This algorithm applies a hash function to one column or several columns to decide which row goes to which table. For example, the diagram below uses 𝐔𝐬𝐞𝐫 𝐈𝐃 𝐦𝐨𝐝 2 as a hash function. User IDs 1 and 3 are in shard 1, User IDs 2 and 4 are in shard 2.
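Both routing algorithms fit in a few lines. The sketch below reproduces the shard assignments from the two examples; the function names, the `RANGE_UPPER_BOUNDS` list, and the choice to map a remainder of 0 to the last shard are assumptions made so the numbering matches the text.

```python
from bisect import bisect_left

# Inclusive upper bound of each range-based shard: IDs 1-2 -> shard 1, IDs 3-4 -> shard 2.
RANGE_UPPER_BOUNDS = [2, 4]


def range_shard(user_id, upper_bounds=RANGE_UPPER_BOUNDS):
    """Range-based sharding: find the first range whose upper bound covers the ID."""
    index = bisect_left(upper_bounds, user_id)
    if index == len(upper_bounds):
        raise ValueError(f"user_id {user_id} is outside every configured range")
    return index + 1  # shards are numbered from 1


def hash_shard(user_id, num_shards=2):
    """Hash-based sharding with "User ID mod N" as the hash function."""
    remainder = user_id % num_shards
    # Remainder 1 -> shard 1; remainder 0 -> the last shard (shard 2 when N = 2).
    return remainder if remainder else num_shards
```

Note how the same four User IDs land on different shards depending on the algorithm: ranges keep neighboring IDs together, while the hash spreads them out.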

𝐁𝐞𝐧𝐞𝐟𝐢𝐭𝐬
🔹 Facilitate horizontal scaling. Sharding makes it possible to add more machines to spread out the load.

🔹 Shorten response time. By sharding one table into multiple tables, queries go over fewer rows, and results are returned much more quickly.

𝐃𝐫𝐚𝐰𝐛𝐚𝐜𝐤𝐬
🔹 The ORDER BY operation is more complicated. Usually, we need to fetch data from different shards and sort the data in the application's code.

🔹 Uneven distribution. Some shards may contain more data than others (this is also called the hotspot).

This topic is very big and I’m sure I missed a lot of important details.
What else do you think is important for data partitioning?

CDN

A content delivery network (CDN) refers to geographically distributed servers (also called edge servers) which provide fast delivery of static and dynamic content. Let’s take a look at how it works.

Suppose Bob who lives in New York wants to visit an eCommerce website that is deployed in London. If the request goes to servers located in London, the response will be quite slow. So we deploy CDN servers close to where Bob lives, and the content will be loaded from the nearby CDN server.

The diagram below illustrates the process:

1. Bob types in www.myshop.com in the browser. The browser looks up the domain name in the local DNS cache.

2. If the domain name does not exist in the local DNS cache, the browser goes to the DNS resolver to resolve the name. The DNS resolver usually sits in the Internet Service Provider (ISP).

3. The DNS resolver recursively resolves the domain name (see my previous post for details). Finally, it asks the authoritative name server to resolve the domain name.

4. If we don’t use CDN, the authoritative name server returns the IP address for www.myshop.com. But with CDN, the authoritative name server has an alias pointing to www.myshop.cdn.com (the domain name of the CDN server).

5. The DNS resolver asks the authoritative name server to resolve www.myshop.cdn.com.

6. The authoritative name server returns the domain name for the load balancer of CDN www.myshop.lb.com.

7. The DNS resolver asks the CDN load balancer to resolve www.myshop.lb.com. The load balancer chooses an optimal CDN edge server based on the user’s IP address, user’s ISP, the content requested, and the server load.

8. The CDN load balancer returns the CDN edge server’s IP address for www.myshop.lb.com.

9. Now we finally get the actual IP address to visit. The DNS resolver returns the IP address to the browser.

10. The browser visits the CDN edge server to load the content. There are two types of content cached on the CDN servers: static content and dynamic content. The former contains static pages, pictures, and videos; the latter includes results of edge computing.

11. If the edge CDN server cache doesn't contain the content, it goes upward to the regional CDN server. If the content is still not found, it will go upward to the central CDN server, or even go to the origin - the London web server. This is called the CDN distribution network, where the servers are deployed geographically.

Over to you: How do you prevent videos cached on CDN from being pirated?

Erasure coding

A really cool technique that’s commonly used in object storage such as S3 to improve durability is called 𝐄𝐫𝐚𝐬𝐮𝐫𝐞 𝐂𝐨𝐝𝐢𝐧𝐠. Let’s take a look at how it works.
Erasure coding deals with data durability differently from replication. It chunks data into smaller pieces (placed on different servers) and creates parities for redundancy. In the event of failures, we can use chunk data and parities to reconstruct the data. Let’s take a look at a concrete example (4 + 2 erasure coding) as shown in Figure 1.

1️⃣ Data is broken up into four even-sized data chunks d1, d2, d3, and d4.

2️⃣ The mathematical formula is used to calculate the parities p1 and p2. To give a much simplified example, p1 = d1 + 2*d2 - d3 + 4*d4 and p2 = -d1 + 5*d2 + d3 - 3*d4.

3️⃣ Data d3 and d4 are lost due to node crashes.

4️⃣ The mathematical formula is used to reconstruct lost data d3 and d4, using the known values of d1, d2, p1, and p2. With only d3 and d4 unknown, the two parity formulas become two linear equations in two unknowns, which in this simplified example solve to d4 = p1 + p2 - 7*d2 and d3 = 3*p1 + 4*p2 + d1 - 26*d2.

How much extra space does erasure coding need? For every two chunks of data, we need one parity block, so the storage overhead is 50% (Figure 2). In 3-copy replication, by contrast, the storage overhead is 200% (Figure 2).

Does erasure coding increase data durability? Let’s assume a node has a 0.81% annual failure rate. According to the calculation done by Backblaze, erasure coding can achieve 11 nines durability, whereas 3-copy replication can achieve 6 nines durability.

What other techniques do you think are important to improve the scalability and durability of an object store such as S3?

Foreign exchange in payment

Have you wondered what happens under the hood when you pay with USD online and the seller from Europe receives EUR (euro)? This process is called foreign exchange.

Suppose Bob (the buyer) needs to pay 100 USD to Alice (the seller), and Alice can only receive EUR. The diagram below illustrates the process.

1. Bob sends 100 USD via a third-party payment provider. In our example, it is PayPal. The money is transferred from Bob’s bank account (Bank B) to PayPal’s account in Bank P1.

2. PayPal needs to convert USD to EUR. It leverages the foreign exchange provider (Bank E). PayPal sends 100 USD to its USD account in Bank E.

3. 100 USD is sold to Bank E’s funding pool.

4. Bank E’s funding pool provides 88 EUR in exchange for 100 USD. The money is put into PayPal’s EUR account in Bank E.

5. PayPal’s EUR account in Bank P2 receives 88 EUR.

6. 88 EUR is paid to Alice’s EUR account in Bank A.

Now let’s take a close look at the foreign exchange (forex) market. It has 3 layers:

🔹 Retail market. Funding pools are parts of the retail market. To improve efficiency, PayPal usually buys a certain amount of foreign currencies in advance.

🔹 Wholesale market. The wholesale business is composed of investment banks, commercial banks, and foreign exchange providers. It usually handles accumulated orders from the retail market.

🔹 Top-level participants. They are multinational commercial banks that hold a large number of certificates of deposit from different countries. They exchange these certificates for foreign exchange trading.

When Bank E’s funding pool needs more EUR, it goes upward to the wholesale market to sell USD and buy EUR. When the wholesale market accumulates enough orders, it goes upward to top-level participants. Steps 3.1-3.3 and 4.1-4.3 explain how it works.

If you have any questions, please leave a comment.

What foreign currency did you find difficult to exchange? And what company have you used for foreign currency exchange?

Interview Question: Design S3

What happens when you upload a file to Amazon S3? Let’s design an S3 like object storage system.

Before we dive into the design, let’s define some terms.
𝐁𝐮𝐜𝐤𝐞𝐭. A logical container for objects. The bucket name is globally
unique. To upload data to S3, we must first create a bucket.

𝐎𝐛𝐣𝐞𝐜𝐭. An object is an individual piece of data we store in a bucket. It
contains object data (also called payload) and metadata. Object data
can be any sequence of bytes we want to store. The metadata is a set
of name-value pairs that describe the object.

An S3 object consists of (Figure 1):

🔹 Metadata. It is mutable and contains attributes such as ID, bucket
name, object name, etc.

🔹 Object data. It is immutable and contains the actual data.

In S3, an object resides in a bucket. The path looks like this:
/bucket-to-share/script.txt. The bucket only has metadata. The object
has metadata and the actual data.

The diagram below (Figure 2) illustrates how file uploading works. In
this example, we first create a bucket named “bucket-to-share” and
then upload a file named “script.txt” to the bucket.

1. The client sends an HTTP PUT request to create a bucket named
“bucket-to-share.” The request is forwarded to the API service.

2. The API service calls the Identity and Access Management (IAM) to
ensure the user is authorized and has WRITE permission.

3. The API service calls the metadata store to create an entry with the
bucket info in the metadata database. Once the entry is created, a
success message is returned to the client.

4. After the bucket is created, the client sends an HTTP PUT request
to create an object named “script.txt”.

5. The API service verifies the user’s identity and ensures the user has
WRITE permission on the bucket.

6. Once validation succeeds, the API service sends the object data in
the HTTP PUT payload to the data store. The data store persists the
payload as an object and returns the UUID of the object.

7. The API service calls the metadata store to create a new entry in the
metadata database. It contains important metadata such as the
object_id (UUID), bucket_id (which bucket the object belongs to),
object_name, etc.
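
The upload flow can be sketched in a few lines of Python. This is only a toy, in-memory illustration: the class name `ObjectStoreSketch`, its methods, and the use of plain dictionaries for the IAM check, the metadata store, and the data store are all assumptions made for this example, not how S3 is actually built.

```python
import uuid


class ObjectStoreSketch:
    """Toy in-memory stand-in for the API service and its backing stores."""

    def __init__(self, write_grants):
        # Hypothetical stand-in for IAM: the set of users with WRITE permission.
        self.write_grants = set(write_grants)
        self.buckets = {}     # metadata store: bucket_name -> bucket_id
        self.objects = {}     # metadata store: (bucket_id, object_name) -> object_id
        self.data_store = {}  # data store: object_id -> payload bytes

    def _authorize(self, user):
        if user not in self.write_grants:
            raise PermissionError(f"{user} lacks WRITE permission")

    def create_bucket(self, user, bucket_name):
        self._authorize(user)
        if bucket_name in self.buckets:
            raise ValueError("bucket names must be globally unique")
        # Only metadata is created for a bucket.
        self.buckets[bucket_name] = str(uuid.uuid4())
        return self.buckets[bucket_name]

    def put_object(self, user, bucket_name, object_name, payload):
        self._authorize(user)
        bucket_id = self.buckets[bucket_name]
        # The data store persists the payload and hands back a UUID.
        object_id = str(uuid.uuid4())
        self.data_store[object_id] = bytes(payload)
        # The metadata store records which bucket and name the UUID belongs to.
        self.objects[(bucket_id, object_name)] = object_id
        return object_id
```

Calling `create_bucket("alice", "bucket-to-share")` followed by `put_object("alice", "bucket-to-share", "script.txt", b"...")` walks through the same sequence as the steps: authorize, write bucket metadata, authorize again, persist the payload, then write object metadata.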


Block storage, file storage and object storage

Yesterday, I posted the definitions of block storage, file storage, and
object storage. Let’s continue the discussion and compare those 3
options.

Block storage, file storage and object storage

In this post, let’s review the storage systems in general.

Storage systems fall into three broad categories:

🔹 Block storage
🔹 File storage
🔹 Object storage

The diagram below illustrates the comparison of different storage
systems.

𝐁𝐥𝐨𝐜𝐤 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
Block storage came first, in the 1960s. Common storage devices like
hard disk drives (HDD) and solid-state drives (SSD) that are physically
attached to servers are all considered as block storage.

Block storage presents the raw blocks to the server as a volume. This
is the most flexible and versatile form of storage. The server can
format the raw blocks and use them as a file system, or it can hand
control of those blocks to an application. Some applications like a
database or a virtual machine engine manage these blocks directly in
order to squeeze every drop of performance out of them.

Block storage is not limited to physically attached storage. Block
storage could be connected to a server over a high-speed network or
over industry-standard connectivity protocols like Fibre Channel (FC)
and iSCSI. Conceptually, the network-attached block storage still
presents raw blocks. To the servers, it works the same as physically
attached block storage. Whether network-attached or physically
attached, block storage is fully owned by a single server. It is not a
shared resource.

𝐅𝐢𝐥𝐞 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
File storage is built on top of block storage. It provides a higher-level
abstraction to make it easier to handle files and directories. Data is
stored as files under a hierarchical directory structure. File storage is
the most common general-purpose storage solution. File storage could
be made accessible by a large number of servers using common
file-level network protocols like SMB/CIFS and NFS. The servers
accessing file storage do not need to deal with the complexity of
managing the blocks, formatting volume, etc. The simplicity of file
storage makes it a great solution for sharing a large number of files
and folders within an organization.

𝐎𝐛𝐣𝐞𝐜𝐭 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
Object storage is new. It makes a very deliberate tradeoff to sacrifice
performance for high durability, vast scale, and low cost. It targets
relatively “cold” data and is mainly used for archival and backup.
Object storage stores all data as objects in a flat structure. There is no
hierarchical directory structure. Data access is normally provided via a
RESTful API. It is relatively slow compared to other storage types.
Most public cloud service providers have an object storage offering,
such as AWS S3, Google Cloud Storage, and Azure Blob Storage.

Domain Name System (DNS) lookup

DNS acts as an address book. It translates human-readable domain
names (google.com) to machine-readable IP addresses
(142.251.46.238).

To achieve better scalability, the DNS servers are organized in a
hierarchical tree structure.

There are 3 basic levels of DNS servers:

1. Root name server (.). It stores the IP addresses of Top Level
Domain (TLD) name servers. There are 13 logical root name servers
globally.

2. TLD name server. It stores the IP addresses of authoritative name
servers. There are several types of TLD names. For example, generic
TLD (.com, .org), country code TLD (.us), test TLD (.test).

3. Authoritative name server. It provides actual answers to the DNS
query. You can register authoritative name servers with a domain name
registrar such as GoDaddy, Namecheap, etc.

The diagram below illustrates how DNS lookup works under the hood:

1. google.com is typed into the browser, and the browser sends the
domain name to the DNS resolver.

2. The resolver queries a DNS root name server.

3. The root server responds to the resolver with the address of a TLD
DNS server. In this case, it is .com.

4. The resolver then makes a request to the .com TLD.

5. The TLD server responds with the IP address of the domain’s name
server, google.com (authoritative name server).

6. The DNS resolver sends a query to the domain’s nameserver.

7. The IP address for google.com is then returned to the resolver from
the nameserver.

8. The DNS resolver responds to the web browser with the IP address
(142.251.46.238) of the domain requested initially.

DNS lookups on average take between 20-120 milliseconds to
complete (according to YSlow).

What happens when you type a URL into your browser?

The diagram below illustrates the steps.

1. Bob enters a URL into the browser and hits Enter. In this example,
the URL is composed of 4 parts:

🔹 scheme - 𝒉𝒕𝒕𝒑𝒔://. This tells the browser to open a connection to the
server using HTTPS.

🔹 domain - 𝒆𝒙𝒂𝒎𝒑𝒍𝒆.𝒄𝒐𝒎. This is the domain name of the site.

🔹 path - 𝒑𝒓𝒐𝒅𝒖𝒄𝒕/𝒆𝒍𝒆𝒄𝒕𝒓𝒊𝒄. It is the path on the server to the requested
resource: phone.

🔹 resource - 𝒑𝒉𝒐𝒏𝒆. It is the name of the resource Bob wants to visit.

2. The browser looks up the IP address for the domain with a domain
name system (DNS) lookup. To make the lookup process fast, data is
cached at different layers: browser cache, OS cache, local network
cache and ISP cache.
2.1 If the IP address cannot be found at any of the caches, the browser
goes to DNS servers to do a recursive DNS lookup until the IP address
is found (this will be covered in another post).

3. Now that we have the IP address of the server, the browser
establishes a TCP connection with the server.

4. The browser sends an HTTP request to the server. The request looks
like this:

GET /phone HTTP/1.1
Host: example.com

5. The server processes the request and sends back the response. For
a successful response (the status code is 200), the response
might look like this:

HTTP/1.1 200 OK
Date: Sun, 30 Jan 2022 00:01:01 GMT
Server: Apache
Content-Type: text/html; charset=utf-8

<!DOCTYPE html>
<html lang="en">
Hello world
</html>

6. The browser renders the HTML content.

AI Coding engine

DeepMind says its new AI coding engine (AlphaCode) is as good as an
average programmer.

The AI bot participated in 10 Codeforces coding competitions and
ranked within the top 54.3%, which puts its score close to that of the
median human contestant. Among users who competed in the last 6
months, AlphaCode ranks within the top 28%.

The diagram below explains how the AI bot works:

1. Pre-train the transformer models on GitHub code.

2. Fine-tune the models on the relatively small competitive
programming dataset.
3. At evaluation time, generate a massive number of solutions for each
problem.

4. Filter, cluster and rerank the solutions to a small set of candidate
programs (at most 10), and then submit for further assessments.

5. Run the candidate programs against the test cases, evaluate the
performance, and choose the best one.
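
Step 4 can be pictured with a toy sketch. Everything here is an assumption made for illustration: candidates are plain Python callables, "filter" means passing the example tests, "cluster" means grouping programs that behave identically on a few extra probe inputs, and "rerank" means preferring larger clusters. The names `select_candidates`, `example_tests`, and `probe_inputs` are hypothetical and do not come from the post.

```python
from collections import defaultdict

_FAILED = object()  # sentinel for a program that raised an exception


def _run(program, value):
    """Run one candidate, treating any exception as a failed run."""
    try:
        return program(value)
    except Exception:
        return _FAILED


def select_candidates(programs, example_tests, probe_inputs, limit=10):
    # Filter: keep only programs that pass every example test.
    passing = [
        p for p in programs
        if all(_run(p, given) == expected for given, expected in example_tests)
    ]
    # Cluster: programs with identical outputs on the probe inputs
    # are treated as behaving the same.
    clusters = defaultdict(list)
    for p in passing:
        signature = tuple(repr(_run(p, x)) for x in probe_inputs)
        clusters[signature].append(p)
    # Rerank: larger clusters first, one representative per cluster.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [cluster[0] for cluster in ranked[:limit]]
```

The cap of 10 mirrors the "at most 10" in step 4; the rest is a simplification meant only to show how a large pool can be cut down to a handful of distinct candidates.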


Do you think AI bot will be better at Leetcode or competitive
programming than software engineers five years from now?

Read replica pattern

There are two common ways to implement the read replica pattern:

1. Embed the routing logic in the application code (explained in the last
post).
2. Use database middleware.

We focus on option 2 here. The middleware provides transparent
routing between the application and database servers. We can
customize the routing logic based on different rules such as user,
schema, statement, etc.

The diagram below illustrates the setup:
1. When Alice places an order on amazon, the request is sent to Order
Service.

2. Order Service does not directly interact with the database. Instead, it
sends database queries to the database middleware.

3. The database middleware routes writes to the primary database.
Data is replicated to two replicas.

4. Alice views the order details (read). The request is sent through the
middleware.

5. Alice views the recent order history (read). The request is sent
through the middleware.

The database middleware acts as a proxy between the application and
databases. It uses standard MySQL network protocol for
communication.

Pros:
- Simplified application code. The application doesn’t need to be aware
of the database topology and manage access to the database directly.

- Better compatibility. The middleware uses the MySQL network
protocol. Any MySQL compatible client can connect to the middleware
easily. This makes database migration easier.

Cons:
- Increased system complexity. A database middleware is a complex
system. Since all database queries go through the middleware, it
usually requires a high availability setup to avoid a single point of
failure.

- Additional middleware layer means additional network latency.
Therefore, this layer requires excellent performance.

Read replica pattern

In this post, we talk about a simple yet commonly used database
design pattern (setup): 𝐑𝐞𝐚𝐝 𝐫𝐞𝐩𝐥𝐢𝐜𝐚 𝐩𝐚𝐭𝐭𝐞𝐫𝐧.

In this setup, all data-modifying commands like insert, delete, or
update are sent to the primary DB, and reads are sent to read replicas.

The diagram below illustrates the setup:

1. When Alice places an order on amazon.com, the request is sent
to Order Service.
2. Order Service creates a record about the order in the primary
DB (write). Data is replicated to two replicas.
3. Alice views the order details. Data is served from a replica
(read).
4. Alice views the recent order history. Data is served from a
replica (read).
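
The routing rule in this setup (writes go to the primary, reads go to a replica) can be sketched as a small router. This is a minimal illustration, not a real middleware: the class `ReadWriteRouter`, the string server names, and the crude first-word statement check are assumptions for this example. The optional `pin_seconds` window, which keeps a user's reads on the primary shortly after that user writes, is an extra assumption for replicas that have not yet received the write.

```python
import itertools
import time


class ReadWriteRouter:
    """Toy router: writes go to the primary, reads rotate across replicas."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE"}

    def __init__(self, primary, replicas, pin_seconds=0.0, clock=time.monotonic):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # simple round-robin
        self.pin_seconds = pin_seconds  # how long a writer's reads stay on the primary
        self._clock = clock
        self._last_write = {}           # user -> time of that user's latest write

    def route(self, user, statement):
        # Classify the statement by its first word (a deliberate simplification).
        verb = statement.lstrip().split(None, 1)[0].upper()
        if verb in self.WRITE_VERBS:
            self._last_write[user] = self._clock()
            return self.primary
        wrote_at = self._last_write.get(user)
        if wrote_at is not None and self._clock() - wrote_at < self.pin_seconds:
            return self.primary  # recent writer: read from the primary
        return next(self._replicas)
```

With `pin_seconds=0` the router implements only the plain split between writes and reads. The same logic could live in the application code or in a middleware layer; the sketch does not depend on which one hosts it.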

There is one major problem in this setup: 𝐫𝐞𝐩𝐥𝐢𝐜𝐚𝐭𝐢𝐨𝐧 𝐥𝐚𝐠.

Under certain circumstances (network delay, server overload, etc.),
data in replicas might be seconds or even minutes behind. In this case,
if Alice immediately checks the order status (query is served by the
replica) after the order is placed, she might not see the order at all.
This leaves Alice confused. In this case, we need “read-after-write”
consistency.

Possible solutions to mitigate this problem:

1️⃣ Latency sensitive reads are sent to the primary database.

2️⃣ Reads that immediately follow writes are routed to the primary
database.

3️⃣ A relational DB generally provides a way to check if a replica is
caught up with the primary. If data is up to date, query the replica.
Otherwise, fail the read request or read from the primary.

Email receiving flow

The following diagram demonstrates the email receiving flow.
1. Incoming emails arrive at the SMTP load balancer.

2. The load balancer distributes traffic among SMTP servers. Email
acceptance policy can be configured and applied at the
SMTP-connection level. For example, invalid emails are bounced to
avoid unnecessary email processing.

3. If the attachment of an email is too large to put into the queue, we
can put it into the attachment store (S3).

4. Emails are put in the incoming email queue. The queue decouples
mail processing workers from SMTP servers so they can be scaled
independently. Moreover, the queue serves as a buffer in case the
email volume surges.

5. Mail processing workers are responsible for a lot of tasks, including
filtering out spam mails, stopping viruses, etc. The following steps
assume an email passed the validation.

6. The email is stored in the mail storage, cache, and object data store.

7. If the receiver is currently online, the email is pushed to real-time
servers.

8. Real-time servers are WebSocket servers that allow clients to
receive new emails in real-time.

9. For offline users, emails are stored in the storage layer. When a user
comes back online, the webmail client connects to web servers via
RESTful API.

10. Web servers pull new emails from the storage layer and return
them to the client.

Email sending flow

In this post, we will take a closer look at the email sending flow.
1. A user writes an email on webmail and presses the “send” button.
The request is sent to the load balancer.

2. The load balancer makes sure it doesn’t exceed the rate limit and
routes traffic to web servers.

3. Web servers are responsible for:

- Basic email validation. Each incoming email is checked against
pre-defined rules such as email size limit.

- Checking if the domain of the recipient’s email address is the
same as the sender’s. If it is the same, email data is inserted to storage,
cache, and object store directly. The recipient can fetch the email
directly via the RESTful API. There is no need to go to step 4.

4. Message queues.

4.a. If basic email validation succeeds, the email data is passed to
the outgoing queue.

4.b. If basic email validation fails, the email is put in the error
queue.

5. SMTP outgoing workers pull events from the outgoing queue and
make sure emails are spam and virus free.

6. The outgoing email is stored in the “Sent Folder” of the storage
layer.

7. SMTP outgoing workers send the email to the recipient mail server.

Each message in the outgoing queue contains all the metadata
required to create an email. A distributed message queue is a critical
component that allows asynchronous mail processing. By decoupling
SMTP outgoing workers from the web servers, we can scale SMTP
outgoing workers independently.

We monitor the size of the outgoing queue very closely. If there are
many emails stuck in the queue, we need to analyze the cause of the
issue. Here are some possibilities:

- The recipient’s mail server is unavailable. In this case, we need to
retry sending the email at a later time. Exponential backoff might be a
good retry strategy.

- Not enough consumers to send emails. In this case, we may need
more consumers to reduce the processing time.

Interview Question: Design Gmail

One picture is worth more than a thousand words. In this post, we will
take a look at what happens when Alice sends an email to Bob.

1. Alice logs in to her Outlook client, composes an email, and presses
“send”. The email is sent to the Outlook mail server. The
communication protocol between the Outlook client and mail server is
SMTP.

2. Outlook mail server queries the DNS (not shown in the diagram) to
find the address of the recipient’s SMTP server. In this case, it is
Gmail’s SMTP server. Next, it transfers the email to the Gmail mail
server. The communication protocol between the mail servers is SMTP.

3. The Gmail server stores the email and makes it available to Bob, the
recipient.
4. Gmail client fetches new emails through the IMAP/POP server when
Bob logs in to Gmail.

Please keep in mind this is a highly simplified design. Hope it sparks
your interest and curiosity:) I'll explain each component in more depth
in the future.

Map rendering

Google Maps Continued. Let’s take a look at 𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠 in this
post.

𝐏𝐫𝐞-𝐂𝐨𝐦𝐩𝐮𝐭𝐞𝐝 𝐓𝐢𝐥𝐞𝐬
One foundational concept in map rendering is tiling. Instead of
rendering the entire map as one large custom image, the world is
broken up into smaller tiles. The client only downloads the relevant
tiles for the area the user is in and stitches them together like a mosaic
for display. The tiles are pre-computed at different zoom levels. Google
Maps uses 21 zoom levels.

For example, at zoom level 0, the entire map is represented by a
single tile of size 256 * 256 pixels. Then at zoom level 1, the number of
map tiles doubles in both north-south and east-west directions, while
each tile stays at 256 * 256 pixels. So we have 4 tiles at zoom level 1,
and the whole image of zoom level 1 is 512 * 512 pixels. With each
increment, the entire set of tiles has 4x as many pixels as the previous
level. The increased pixel count provides an increasing level of detail
to the user.

This allows the client to render the map at the best granularities
depending on the client’s zoom level without consuming excessive
bandwidth to download tiles with too much detail. This is especially
important when we are loading the images from mobile clients.
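
The tile arithmetic above is easy to check with a short sketch. The function name `tile_pyramid_level` and the returned field names are made up for this illustration; only the 256-pixel tile size and the doubling per zoom level come from the text.

```python
TILE_SIZE_PX = 256  # each tile is 256 * 256 pixels at every zoom level


def tile_pyramid_level(zoom):
    """Return tile counts and full-image size for one zoom level."""
    tiles_per_axis = 2 ** zoom  # doubles in each direction per level
    return {
        "tiles_per_axis": tiles_per_axis,
        "total_tiles": tiles_per_axis ** 2,
        "image_side_px": tiles_per_axis * TILE_SIZE_PX,
    }


if __name__ == "__main__":
    for zoom in range(4):
        print(zoom, tile_pyramid_level(zoom))
```

Since the tile count grows as 4 to the power of the zoom level, the deepest levels hold vastly more tiles than the shallow ones, which is why the client only downloads the tiles for the area and zoom level it is actually showing.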

𝐑𝐨𝐚𝐝 𝐒𝐞𝐠𝐦𝐞𝐧𝐭𝐬
Now that we have transformed massive maps into tiles, we also need
to define a data structure for the roads. We divide the world of roads
into small blocks. We call these blocks road segments. Each road
segment contains multiple roads, junctions, and other metadata.

We group nearby segments into super segments. This process can be
applied repeatedly to meet the level of coverage required.

We then transform the road segments into a data structure that the
navigation algorithms can use. The typical approach is to convert the
map into a 𝒈𝒓𝒂𝒑𝒉, where the nodes are road segments, and two nodes
are connected if the corresponding road segments are reachable
neighbors. In this way, finding a path between two locations becomes a
shortest-path problem, where we can leverage Dijkstra or A*
algorithms.

Interview Question: Design Google Maps

Google started project 𝐆𝐨𝐨𝐠𝐥𝐞 𝐌𝐚𝐩𝐬 in 2005. As of March 2021, Google
Maps had one billion daily active users, 99% coverage of the world in
200 countries.

Although Google Maps is a very complex system, we can break it
down into 3 high-level components. In this post, let’s take a look at how
to design a simplified Google Maps.
𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
The location service is responsible for recording a user’s location
update. The Google Maps clients send location updates every few
seconds. The user location data is used in many cases:

- detect new and recently closed roads
- improve the accuracy of the map over time
- used as an input for live traffic data.

𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠
The world’s map is projected into a huge 2D map image. It is broken
down into small image blocks called “tiles” (see below). The tiles are
static. They don’t change very often. An efficient way to serve static tile
files is with a CDN backed by cloud storage like S3. The users can
load the necessary tiles to compose a map from nearby CDN.

What if a user is zooming and panning the map viewpoint on the client
to explore their surroundings?

An efficient way is to pre-calculate the map blocks with different zoom
levels and load the images when needed.

𝐍𝐚𝐯𝐢𝐠𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
This component is responsible for finding a reasonably fast route from
point A to point B. It calls two services to help with the path calculation:

1️⃣ Geocoding Service: resolve the given address to a latitude/longitude
pair

2️⃣ Route Planner Service: this service does three things in sequence:

- Calculate the top-K shortest paths between A and B
- Calculate the estimation of time for each path based on current
traffic and historical data
- Rank the paths by time predictions and user filtering. For example,
the user wants to avoid tolls.

Pull vs push models

There are two ways metrics data can be collected, pull or push. It is a
routine debate as to which one is better and there is no clear answer.
In this post, we will take a look at the pull model.

Figure 1 shows data collection with a pull model over HTTP. We have
dedicated metric collectors which pull metrics values from the running
applications periodically.

In this approach, the metrics collector needs to know the complete list
of service endpoints to pull data from. One naive approach is to use a
file to hold DNS/IP information for every service endpoint on the
“metric collector” servers. While the idea is simple, this approach is
hard to maintain in a large-scale environment where servers are added
or removed frequently, and we want to ensure that metric collectors
don’t miss out on collecting metrics from any new servers.

The good news is that we have a reliable, scalable, and maintainable
solution available through Service Discovery, provided by Kubernetes,
Zookeeper, etc., wherein services register their availability and the
metrics collector can be notified by the Service Discovery component
whenever the list of service endpoints changes. Service discovery
contains configuration rules about when and where to collect metrics
as shown in Figure 2.

Figure 3 explains the pull model in detail.

1️⃣ The metrics collector fetches configuration metadata of service
endpoints from Service Discovery. Metadata include pulling interval, IP
addresses, timeout and retries parameters, etc.

2️⃣ The metrics collector pulls metrics data via a pre-defined HTTP
endpoint (for example, /metrics). To expose the endpoint, a client
library usually needs to be added to the service. In Figure 3, the
service is Web Servers.

3️⃣ Optionally, the metrics collector registers a change event notification
with Service Discovery to receive an update whenever the service
endpoints change. Alternatively, the metrics collector can poll for
endpoint changes periodically.

Money movement

One picture is worth more than a thousand words. This is what
happens when you buy a product using Paypal/bank card under the
hood.

To understand this, we need to digest two concepts: 𝐜𝐥𝐞𝐚𝐫𝐢𝐧𝐠 &
𝐬𝐞𝐭𝐭𝐥𝐞𝐦𝐞𝐧𝐭. Clearing is a process that calculates who should pay whom
with how much money; while settlement is a process where real money
moves between reserves in the settlement bank.
Let’s say Bob wants to buy an SDI book from Claire’s shop on
Amazon.

- Pay-in flow (Bob pays Amazon money):
1.1 Bob buys a book on Amazon using Paypal.
1.2 Amazon issues a money transfer request to Paypal.
1.3 Since the payment token of Bob’s debit card is stored in Paypal,
Paypal can transfer money, on Bob’s behalf, to Amazon’s bank
account in Bank A.
1.4 Both Bank A and Bank B send transaction statements to the
clearing institution. It reduces the transactions that need to be settled.
Let’s assume Bank A owes Bank B $100 and Bank B owes Bank A
$500 at the end of the day. When they settle, the net position is that
Bank B pays Bank A $400.
1.5 & 1.6 The clearing institution sends clearing and settlement
information to the settlement bank. Both Bank A and Bank B have
pre-deposited funds in the settlement bank as money reserves, so
actual money movement happens between two reserve accounts in
the settlement bank.

- Pay-out flow (Amazon pays the money to the seller: Claire):
2.1 Amazon informs the seller (Claire) that she will get paid soon.
2.2 Amazon issues a money transfer request from its own bank (Bank
A) to the seller bank (Bank C). Here both banks record the
transactions, but no real money is moved.
2.3 Both Bank A and Bank C send transaction statements to the
clearing institution.
2.4 & 2.5 The clearing institution sends clearing and settlement
information to the settlement bank. Money is transferred from Bank A’s
reserve to Bank C’s reserve.

Notice that we have three layers:

- Transaction layer: where the online purchases happen
- Payment and clearing layer: where the payment instructions and
transaction netting happen
- Settlement layer: where the actual money movement happens

The first two layers are called information flow, and the settlement layer
is called fund flow.

You can see the 𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 𝐟𝐥𝐨𝐰 𝐚𝐧𝐝 𝐟𝐮𝐧𝐝 𝐟𝐥𝐨𝐰 𝐚𝐫𝐞 𝐬𝐞𝐩𝐚𝐫𝐚𝐭𝐞𝐝. In the
info flow, the money seems to be deducted from one bank account and
added to another bank account, but the actual money movement
happens in the settlement bank at the end of the day.

Because of the asynchronous nature of the info flow and the fund flow,
reconciliation is very important for data consistency in the systems
along with the flow.

It makes things even more interesting when Bob wants to buy a book
in the Indian market, where Bob pays USD but the seller can only
receive INR.
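
The netting in step 1.4 can be sketched as follows. This is a minimal bilateral-netting illustration, assuming obligations arrive as `(debtor, creditor, amount)` tuples in whole dollars; the function name `net_positions` and that data shape are assumptions made for this example, not a description of how a real clearing institution works.

```python
from collections import defaultdict


def net_positions(obligations):
    """Collapse gross obligations between each pair of banks into one net payment.

    obligations: iterable of (debtor, creditor, amount) tuples.
    Returns a list of (payer, payee, amount) tuples.
    """
    balance = defaultdict(int)
    for debtor, creditor, amount in obligations:
        pair = tuple(sorted((debtor, creditor)))
        # A positive balance means pair[0] owes pair[1].
        balance[pair] += amount if debtor == pair[0] else -amount

    settlements = []
    for (first, second), amount in balance.items():
        if amount > 0:
            settlements.append((first, second, amount))
        elif amount < 0:
            settlements.append((second, first, -amount))
    return settlements


if __name__ == "__main__":
    # The example from step 1.4: A owes B $100, B owes A $500.
    print(net_positions([("Bank A", "Bank B", 100), ("Bank B", "Bank A", 500)]))
```

Two gross obligations become a single transfer, which is the sense in which clearing reduces the transactions that need to be settled.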


Reconciliation

My previous post about painful payment reconciliation problems
sparked lots of interesting discussions. One of the readers shared
more problems we may face when working with intermediary payment
processors in the trenches and a potential solution:

1. Foreign Currency Problem: When you operate a store globally, you
will come across this problem quite frequently. To go back to the
example from Paypal - if the transaction happens in a currency
different from the standard currency of Paypal, this will create another
layer, where the transaction is first received in that currency and
exchanged to whatever currency your Paypal is using. There needs to
be a reliable way to reconcile that currency exchange transaction. It
certainly does not help that every payment provider handles this
differently.

2. Payment providers are only that - intermediaries. Each purchase
does not trigger two events for a company, but actually at least 4. The
purchase via Paypal (where both the time and the currency dimension
can come into play) triggers the debit/credit pair for the transaction and
then, usually a few days later, another pair when the money is
transferred from Paypal to a bank account (where there might be yet
another FX discrepancy to reconcile if, for example, the initial purchase
was in JPY, Paypal is set up in USD and your bank account is in EUR).
There needs to be a way to reconcile all of these.

3. Some problems also pop up on the buyer side that are very
platform-specific. One example is shadow transaction from Paypal: if
you buy two items on Paypal with 1 week of time between the two
transactions, Paypal will first debit money from your bank account for
transaction A. If at the time of transaction B, transaction A has not
gone through completely or is canceled, there might be a world where
Paypal will use the money from transaction A to partially pay for
transaction B, which leads to only a partial amount of transaction B
being withdrawn from the bank account.

In practice, this usually looks something like this:

1) Your shop assigns an order number to the purchase
2) The order number is carried over to the payment provider
3) The payment provider creates another internal ID, which is carried
over across transactions within the system
4) The payment ID is used when you get the payout on your bank
account (or the payment provider bundles individual payments, which
can be reconciled within the payment provider system)
5) Ideally, your payment provider and your shop have an
integration/API with the tool you use to (hopefully automatically) create
invoices. This usually carries over the order id from the shop (closing
the loop) and sometimes even the payment id to match it with the
invoice id, which you then can use to reconcile it with your accounts
receivable/payable. :)

Credit: A knowledgeable reader who prefers to stay private. Thank
you!
Continued: how to choose the right database for metrics collecting
service?

Since a time-series database is a specialized database, you are not
expected to understand the internals in an interview unless you
explicitly mentioned it in your resume. For the purpose of an interview,
it’s important to understand the metrics data are time-series in nature
and we can select time-series databases such as InfluxDB to store
them.

Another feature of a strong time-series database is efficient
aggregation and analysis of a large amount of time-series data by
labels, also known as tags in some databases. For example, InfluxDB
builds indexes on labels to facilitate the fast lookup of time-series by
labels. It provides clear best-practice guidelines on how to use labels,
without overloading the database. The key is to make sure each label
is of low cardinality (having a small set of possible values). This feature
is critical for visualization, and it would take a lot of effort to build this
with a general-purpose database.
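
To see why low cardinality matters, it helps to count how many distinct time series a set of labels can produce: in the worst case it is the product of the number of possible values of each label. The sketch below only does that arithmetic; the label names and counts are hypothetical, and the function `max_series_count` is not part of any database API.

```python
from math import prod


def max_series_count(label_cardinalities):
    """Upper bound on distinct series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


if __name__ == "__main__":
    # Hypothetical labels with small value sets.
    low = {"host": 100, "region": 4, "status": 5}
    print(max_series_count(low))  # 100 * 4 * 5

    # Adding one high-cardinality label multiplies the bound.
    high = dict(low, user_id=1_000_000)
    print(max_series_count(high))
```

A single unbounded label such as a user ID can multiply the number of series, and therefore the index size, by orders of magnitude, which is what the low-cardinality guideline is meant to prevent.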

There are many storage systems available that are optimized for
time-series data. The optimization lets us use far fewer servers to
handle the same volume of data. Many of these databases also have
custom query interfaces specially designed for the analysis of
time-series data that are much easier to use than SQL. Some even
provide features to manage data retention and data aggregation. Here
are a few examples of time-series databases.

OpenTSDB is a distributed time-series database, but since it is based
on Hadoop and HBase, running a Hadoop/HBase cluster adds
complexity. Twitter uses MetricsDB, and Amazon offers Timestream as
a time-series database. According to DB-engines, the two most
popular time-series databases are InfluxDB and Prometheus, which
are designed to store large volumes of time-series data and quickly
perform real-time analysis on that data. Both of them primarily rely on
an in-memory cache and on-disk storage. And they both handle
durability and performance quite well. According to the benchmark, an
InfluxDB with 8 cores and 32GB RAM can handle over 250,000 writes
per second.

124 125

Which database shall I use for the metrics collecting system?

This is one of the most important questions we need to address in an
interview.

𝐃𝐚𝐭𝐚 𝐚𝐜𝐜𝐞𝐬𝐬 𝐩𝐚𝐭𝐭𝐞𝐫𝐧
As shown in the diagram, each label on the y-axis represents a time
series (uniquely identified by the names and labels) while the x-axis
represents time.

The write load is heavy. As you can see, there can be many
time-series data points written at any moment. There are millions of
operational metrics written per day, and many metrics are collected at
high frequency, so the traffic is undoubtedly write-heavy.

At the same time, the read load is spiky. Both visualization and alert
services send queries to the database and depending on the access
patterns of the graphs and alerts, the read volume could be bursty.

𝐂𝐡𝐨𝐨𝐬𝐞 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞
The data storage system is the heart of the design. It’s not
recommended to build your own storage system or use a
general-purpose storage system (MySQL) for this job.

A general-purpose database, in theory, could support time-series data,
but it would require expert-level tuning to make it work at our scale.
Specifically, a relational database is not optimized for operations you
would commonly perform against time-series data. For example,
computing the moving average in a rolling time window requires
complicated SQL that is difficult to read (there is an example of this in
the deep dive section). Besides, to support tagging/labeling data, we
need to add an index for each tag. Moreover, a general-purpose
relational database does not perform well under constant heavy write
load. At our scale, we would need to expend significant effort in tuning
the database, and even then, it might not perform well.

How about NoSQL? In theory, a few NoSQL databases on the market
could handle time-series data effectively. For example, Cassandra and
Bigtable can both be used for time-series data. However, this would
require deep knowledge of the internal workings of each NoSQL
database to devise a scalable schema for effectively storing and
querying time-series data. With industrial-scale time-series databases
readily available, using a general-purpose NoSQL database is not
appealing.

There are many storage systems available that are optimized for
time-series data. The optimization lets us use far fewer servers to
handle the same volume of data. Many of these databases also have
custom query interfaces specially designed for the analysis of
time-series data that are much easier to use than SQL. Some even
provide features to manage data retention and data aggregation. Here
are a few examples of time-series databases.

OpenTSDB is a distributed time-series database, but since it is based
on Hadoop and HBase, running a Hadoop/HBase cluster adds
complexity. Twitter uses MetricsDB, and Amazon offers Timestream as
a time-series database. According to DB-engines, the two most
popular time-series databases are InfluxDB and Prometheus, which
are designed to store large volumes of time-series data and quickly
perform real-time analysis on that data. Both of them primarily rely on
an in-memory cache and on-disk storage. And they both handle
durability and performance quite well. According to the benchmark
listed on the InfluxDB website, a DB server with 8 cores and 32GB RAM
can handle over 250,000 writes per second.

Since a time-series database is a specialized database, you are not
expected to understand the internals in an interview unless you
explicitly mentioned it in your resume. For the purpose of an interview,
it’s important to understand that metrics data are time-series in nature
and that we can select a time-series database such as InfluxDB to
store them.

Another feature of a strong time-series database is efficient
aggregation and analysis of a large amount of time-series data by
labels, also known as tags in some databases. For example, InfluxDB
builds indexes on labels to facilitate the fast lookup of time-series by
labels. It provides clear best-practice guidelines on how to use labels,
without overloading the database. The key is to make sure each label
is of low cardinality (having a small set of possible values). This feature
is critical for visualization, and it would take a lot of effort to build this
with a general-purpose database.

Metrics monitoring and alerting system

A well-designed 𝐦𝐞𝐭𝐫𝐢𝐜𝐬 𝐦𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠 and alerting system plays a key
role in providing clear visibility into the health of the infrastructure to
ensure high availability and reliability. The diagram below explains how
it works at a high level.

Metrics source: This can be application servers, SQL databases,
message queues, etc.

Metrics collector: It gathers metrics data and writes data into the
time-series database.

Time-series database: This stores metrics data as time series. It
usually provides a custom query interface for analyzing and
summarizing a large amount of time-series data. It maintains indexes
on labels to facilitate the fast lookup of time-series data by labels.

Kafka: Kafka is used as a highly reliable and scalable distributed
messaging platform. It decouples the data collection and data
processing services from each other.

Consumers: Consumers or streaming processing services such as
Apache Storm, Flink and Spark, process and push data to the
time-series database.

Query service: The query service makes it easy to query and retrieve
data from the time-series database. This should be a very thin wrapper
if we choose a good time-series database. It could also be entirely
replaced by the time-series database’s own query interface.

Alerting system: This sends alert notifications to various alerting
destinations.

Visualization system: This shows metrics in the form of various
graphs/charts.

Reconciliation

𝐑𝐞𝐜𝐨𝐧𝐜𝐢𝐥𝐢𝐚𝐭𝐢𝐨𝐧 might be the most painful process in a payment system.
It is the process of comparing records in different systems to make
sure the amounts match each other.

For example, if you pay $200 to buy a watch with PayPal:

- The eCommerce website should have a record about the purchase
order of $200.
- There should be a transaction record of $200 in PayPal (marked with
2 in the diagram).
- The Ledger should record a debit of $200 for the buyer, and a
credit of $200 for the seller. This is called double-entry bookkeeping
(see the table below).
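
The $200 example above can be turned into a small check. This is only a sketch under assumptions: `LedgerEntry`, `is_balanced` and `reconcile` are hypothetical names, and the three "systems" are reduced to plain values rather than real order, PSP and ledger stores.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    """One line of a double-entry ledger (hypothetical shape)."""
    account: str
    debit: Decimal = Decimal("0")
    credit: Decimal = Decimal("0")


def is_balanced(entries):
    """Double-entry rule: total debits must equal total credits."""
    total_debit = sum((e.debit for e in entries), Decimal("0"))
    total_credit = sum((e.credit for e in entries), Decimal("0"))
    return total_debit == total_credit


def reconcile(order_amount, psp_amount, ledger_entries):
    """Return a list of discrepancies; an empty list means everything matches."""
    problems = []
    if order_amount != psp_amount:
        problems.append(f"order {order_amount} != PSP {psp_amount}")
    if not is_balanced(ledger_entries):
        problems.append("ledger debits and credits do not balance")
    ledger_amount = sum((e.debit for e in ledger_entries), Decimal("0"))
    if ledger_amount != order_amount:
        problems.append(f"ledger {ledger_amount} != order {order_amount}")
    return problems


# The watch purchase: buyer is debited $200, seller is credited $200.
ledger = [
    LedgerEntry(account="buyer", debit=Decimal("200")),
    LedgerEntry(account="seller", credit=Decimal("200")),
]
matched = reconcile(Decimal("200"), Decimal("200"), ledger)
mismatched = reconcile(Decimal("200"), Decimal("199"), ledger)
```

`matched` comes back empty, while `mismatched` reports that the order and the PSP record disagree. A real reconciliation job would compare whole files of records, not a single purchase.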

Let’s take a look at some pain points and how we can address them:

𝐏𝐫𝐨𝐛𝐥𝐞𝐦 1: Data normalization. Records in different systems come in
different formats. For example, the timestamp
can be “2022/01/01” in one system and “Jan 1, 2022” in another.
𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can add a layer to transform different formats into
the same format.

𝐏𝐫𝐨𝐛𝐥𝐞𝐦 2: Massive data volume
𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can use big data processing techniques to speed
up data comparisons. If we need near real-time reconciliation, a
streaming platform such as Flink is used; otherwise, end-of-day batch
processing such as Hadoop is enough.

𝐏𝐫𝐨𝐛𝐥𝐞𝐦 3: Cut-off time issue. For example, if we choose 00:00:00 as
the daily cut-off time, a record stamped 23:59:55 in the
internal system might be stamped 00:00:30 in the external system
(PayPal), which is the next day. In this case, we can't find this record
in today’s PayPal records, which causes a discrepancy.
𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we need to categorize this break as a “temporary
break” and run it later against the next day’s PayPal records. If we find
a match in the next day’s PayPal records, the break is cleared, and no
more action is needed.

You may argue that if we have exactly-once semantics in the system,
there shouldn’t be any discrepancies. But the truth is, there are so
many places where things can go wrong. Having a reconciliation system is
always necessary. It is like having a safety net to keep you sleeping
well at night.

Which database shall I use? This is one of the most important
questions we usually need to address in an interview.

Choosing the right database is hard. Google Cloud recently posted a
great article that summarized different database options available in
Google Cloud and explained which use cases are best suited for each
database option.

Big data papers

Below is a timeline of important big data papers and how the
techniques evolved over time.

The green highlighted boxes are the famous 3 Google papers, which
established the foundation of the big data framework. At a high level:

𝘉𝘪𝘨 𝘋𝘢𝘵𝘢 𝘛𝘦𝘤𝘩𝘯𝘪𝘲𝘶𝘦𝘴 = 𝘔𝘢𝘴𝘴𝘪𝘷𝘦 𝘥𝘢𝘵𝘢 + 𝘔𝘢𝘴𝘴𝘪𝘷𝘦 𝘤𝘢𝘭𝘤𝘶𝘭𝘢𝘵𝘪𝘰𝘯

Let’s look at the 𝐎𝐋𝐓𝐏 evolution. BigTable provided a distributed
storage system for structured data but dropped some characteristics of
relational DB. Then Megastore brought back schema and simple
transactions; Spanner brought back data consistency.

Now let’s look at the 𝐎𝐋𝐀𝐏 evolution. MapReduce was not easy to
program, so Hive solved this by introducing a SQL-like query
language. But Hive still used MapReduce under the hood, so it was not
very responsive. In 2010, Dremel provided an interactive query engine.

𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐢𝐧𝐠 was born to further solve the latency issue in
OLAP. The famous 𝒍𝒂𝒎𝒃𝒅𝒂 architecture was based on Storm and
MapReduce, where streaming processing and batch processing have
different processing flows. Then people started to build streaming
processing with Apache Kafka. 𝑲𝒂𝒑𝒑𝒂 architecture was proposed in
2014, where streaming and batch processing were merged into
one flow. Google published The Dataflow Model in 2015, which was an
abstraction standard for streaming processing, and Flink implemented
this model.

To manage a big crowd of commodity server resources, we need
resource management such as Kubernetes.

Avoid double charge

One of the most serious problems a payment system can have is to
𝐝𝐨𝐮𝐛𝐥𝐞 𝐜𝐡𝐚𝐫𝐠𝐞 𝐚 𝐜𝐮𝐬𝐭𝐨𝐦𝐞𝐫. When we design the payment system, it is
important to guarantee that the payment system executes a payment
order exactly-once.

At first glance, exactly-once delivery seems very hard to tackle, but
if we divide the problem into two parts, it is much easier to solve.
Mathematically, an operation is executed exactly-once if:

1. It is executed at least once.
2. At the same time, it is executed at most once.

We now explain how to implement at least once using retry and at
most once using idempotency check.

𝐑𝐞𝐭𝐫𝐲
Occasionally, we need to retry a payment transaction due to network
errors or timeout. Retry provides the at-least-once guarantee. For
example, as shown in Figure 10, the client tries to make a $10
payment, but the payment keeps failing due to a poor network
connection. Considering the network condition might get better, the
client retries the request and this payment finally succeeds at the
fourth attempt.

𝐈𝐝𝐞𝐦𝐩𝐨𝐭𝐞𝐧𝐜𝐲
From an API standpoint, idempotency means clients can make the
same call repeatedly and produce the same result.

For communication between clients (web and mobile applications) and
servers, an idempotency key is usually a unique value that is
generated by clients and expires after a certain period of time. A UUID
is commonly used as an idempotency key and it is recommended by
many tech companies such as Stripe and PayPal. To perform an
idempotent payment request, an idempotency key is added to the
HTTP header: <idempotency-key: key_value>.
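
To see how retry and the idempotency check combine, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `PaymentServer`, `FlakyNetwork` and `pay_with_retry` are hypothetical names, the key is passed as a function argument instead of an HTTP header, and key expiration and retry back-off are left out.

```python
import uuid


class PaymentServer:
    """Hypothetical payment server that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}        # idempotency key -> stored result
        self.total_charged = 0   # money actually moved

    def pay(self, idempotency_key, amount):
        # At-most-once: a key we have seen before returns the stored
        # result instead of charging the customer again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.total_charged += amount
        result = {"status": "success", "amount": amount}
        self.results[idempotency_key] = result
        return result


class FlakyNetwork:
    """Delivers every request to the server but drops the first `drops` responses."""

    def __init__(self, server, drops):
        self.server = server
        self.drops = drops

    def send(self, idempotency_key, amount):
        result = self.server.pay(idempotency_key, amount)
        if self.drops > 0:
            self.drops -= 1
            raise TimeoutError("response lost")
        return result


def pay_with_retry(network, amount, max_attempts=5):
    # The client generates the key once and reuses it on every retry.
    key = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt, network.send(key, amount)
        except TimeoutError:
            continue  # at-least-once: try again with the same key
    raise RuntimeError("payment failed after retries")


server = PaymentServer()
attempts, result = pay_with_retry(FlakyNetwork(server, drops=3), amount=10)
```

In this run the client needs four attempts, yet `server.total_charged` stays at 10 because the three duplicate requests hit the stored result. The sketch deliberately models the nasty case where the request arrives but the response is lost, which is exactly when a retry without an idempotency key would double charge.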

Payment security

A few weeks ago, I posted the high-level design for the payment
system. Today, I’ll continue the discussion and focus on payment
security.

The table below summarizes techniques that are commonly used in
payment security. If you have any questions or I missed anything,
please leave a comment.

System Design Interview Tip

One pro tip for acing a system design interview is to read the
engineering blog of the company you are interviewing with. You can
get a good sense of what technology they use, why the technology
was chosen over others, and learn what issues are important to
engineers.

For example, here are 4 blog posts Twitter Engineering recommends:


1. The Infrastructure Behind Twitter: Scale
2. Discovery and Consumption of Analytics Data at Twitter
3. The what and why of product experimentation at Twitter
4. Twitter experimentation: technical overview

Big data evolvement

I hope everyone has a great time with friends and family during the
holidays. If you are looking for some readings, classic engineering
papers are a good start.

When we are busy with work, we often focus only on scattered
information that tells us “how” and “what”, just enough to meet our
immediate needs and get things done.

However, reading the classics helps us understand the “why” behind
the scenes, and teaches us how to solve problems, make better
decisions, or even contribute to open source projects.

Let’s take big data as an example.

The big data area has progressed a lot over the past 20 years. It started
from 3 Google papers (see the links in the comment), which tackled
real engineering challenges at Google scale:

- GFS (2003) - big data storage
- MapReduce (2004) - calculation model
- BigTable (2006) - online services

The diagram below shows the functionalities and limitations of the 3
techniques, and how they evolve over time into two streams: OLTP and
OLAP. Each evolved product was trying to solve the limitations of the
last generation. For example, “Hive - support SQL” means Hive was
trying to solve the lack of SQL in MapReduce.

If you want to learn more, you can refer to the papers for details. What
other classics would you recommend?

Quadtree

In this post, let’s explore another data structure to find nearby
restaurants on Yelp or Google Maps.

A quadtree is a data structure that is commonly used to partition a
two-dimensional space by recursively subdividing it into four quadrants
(grids) until the contents of the grids meet certain criteria (see the first
diagram).

Quadtree is an 𝐢𝐧-𝐦𝐞𝐦𝐨𝐫𝐲 𝐝𝐚𝐭𝐚 𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 and it is not a database
solution. It runs on each LBS (Location-Based Service, see last week’s
post) server, and the data structure is built at server start-up time.

The second diagram explains the quadtree building process in more
detail. The root node represents the whole world map. The root node is
𝐫𝐞𝐜𝐮𝐫𝐬𝐢𝐯𝐞𝐥𝐲 broken down into 4 quadrants until no nodes are left with
more than 100 businesses.

𝐇𝐨𝐰 𝐭𝐨 𝐠𝐞𝐭 𝐧𝐞𝐚𝐫𝐛𝐲 𝐛𝐮𝐬𝐢𝐧𝐞𝐬𝐬𝐞𝐬 𝐰𝐢𝐭𝐡 𝐪𝐮𝐚𝐝𝐭𝐫𝐞𝐞?


- Build the quadtree in memory.

- After the quadtree is built, start searching from the root and traverse
the tree, until we find the leaf node where the search origin is.

- If that leaf node has 100 businesses, return the node. Otherwise, add
businesses from its neighbors until enough businesses are returned.
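
The build and search steps above can be sketched as follows. This is a minimal sketch, not a production index: `Node`, `build_quadtree` and `find_leaf` are hypothetical names, x is longitude and y is latitude, and `MAX_DEPTH` is an added assumption so that many businesses at the same coordinates cannot cause endless splitting. Collecting businesses from neighboring leaves is left out.

```python
from dataclasses import dataclass, field

MAX_PER_LEAF = 100  # threshold from the text
MAX_DEPTH = 32      # assumption: stops splitting on identical coordinates


@dataclass
class Node:
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    depth: int = 0
    businesses: list = field(default_factory=list)  # (x, y, name) tuples
    children: list = field(default_factory=list)    # empty for a leaf

    def child_index(self, x, y):
        """Pick the quadrant (0..3) that contains the point."""
        mid_x = (self.min_x + self.max_x) / 2
        mid_y = (self.min_y + self.max_y) / 2
        return (1 if x >= mid_x else 0) + (2 if y >= mid_y else 0)

    def subdivide(self):
        """Split this node into 4 quadrants and push businesses down."""
        mid_x = (self.min_x + self.max_x) / 2
        mid_y = (self.min_y + self.max_y) / 2
        d = self.depth + 1
        self.children = [
            Node(self.min_x, self.min_y, mid_x, mid_y, d),  # index 0
            Node(mid_x, self.min_y, self.max_x, mid_y, d),  # index 1
            Node(self.min_x, mid_y, mid_x, self.max_y, d),  # index 2
            Node(mid_x, mid_y, self.max_x, self.max_y, d),  # index 3
        ]
        for b in self.businesses:
            self.children[self.child_index(b[0], b[1])].businesses.append(b)
        self.businesses = []
        for child in self.children:
            child.split_if_needed()

    def split_if_needed(self):
        if len(self.businesses) > MAX_PER_LEAF and self.depth < MAX_DEPTH:
            self.subdivide()


def build_quadtree(businesses):
    """The root covers the whole world map."""
    root = Node(-180.0, -90.0, 180.0, 90.0, businesses=list(businesses))
    root.split_if_needed()
    return root


def find_leaf(root, x, y):
    """Walk down from the root to the leaf that contains the search origin."""
    node = root
    while node.children:
        node = node.children[node.child_index(x, y)]
    return node


# 1,000 made-up businesses on a regular grid.
businesses = [
    (lon, lat, f"biz-{lon}-{lat}")
    for lon in range(-50, 50, 2)
    for lat in range(-20, 20, 2)
]
root = build_quadtree(businesses)
leaf = find_leaf(root, 10.0, 10.0)
```

`leaf.businesses` then holds at most 100 entries, including the business located at the search origin.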

𝐔𝐩𝐝𝐚𝐭𝐞 𝐋𝐁𝐒 𝐬𝐞𝐫𝐯𝐞𝐫 𝐚𝐧𝐝 𝐫𝐞𝐛𝐮𝐢𝐥𝐝 𝐪𝐮𝐚𝐝𝐭𝐫𝐞𝐞


- It may take a few minutes to build a quadtree in memory with 200
million businesses at the server start-up time.

- While the quadtree is being built, the server cannot serve traffic.

- Therefore, we should roll out a new release of the server
incrementally to 𝐚 𝐬𝐦𝐚𝐥𝐥 𝐬𝐮𝐛𝐬𝐞𝐭 of servers at a time. This avoids taking a
large swathe of the server cluster offline, which would cause a service
brownout.

How do we find nearby restaurants on Yelp?

Here are some design details behind the scenes.

There are two key services (see the diagram below):

- 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
  - Add/delete/update restaurant information
  - Customers view restaurant details
- 𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧-𝐛𝐚𝐬𝐞𝐝 𝐒𝐞𝐫𝐯𝐢𝐜𝐞 (𝐋𝐁𝐒)
  - Given a radius and location, return a list of nearby restaurants

How are the restaurant locations stored in the database so that LBS
can return nearby restaurants efficiently?

Store the latitude and longitude of restaurants in the database? The
query will be very inefficient when you need to calculate the distance
between you and every restaurant.

One way to speed up the search is using the 𝐠𝐞𝐨𝐡𝐚𝐬𝐡 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦.

First, divide the planet into four quadrants along the prime
meridian and equator:

- Latitude range [-90, 0] is represented by 0
- Latitude range [0, 90] is represented by 1
- Longitude range [-180, 0] is represented by 0
- Longitude range [0, 180] is represented by 1

Second, divide each grid into four smaller grids. Each grid can be
represented by alternating between longitude bit and latitude bit.

So when you want to search for the nearby restaurants in the
red-highlighted block, you can write SQL like:

SELECT * FROM geohash_index WHERE geohash LIKE '01%'

Geohash has some limitations. There can be a lot of restaurants in one
block (downtown New York), but none in another block (ocean). So
there are other more complicated algorithms to optimize the process.
Let me know if you are interested in the details.
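
The bit-alternating scheme described in the geohash post can be sketched like this. It is a simplified illustration: `geohash_bits` is a hypothetical helper that returns the raw bit string, longitude bit first, and it skips the base32 encoding that the standard geohash applies on top of these bits.

```python
def geohash_bits(lat, lon, num_bits):
    """Return the interleaved bit string for a point, longitude bit first."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    for i in range(num_bits):
        if i % 2 == 0:
            # Even positions: halve the longitude range.
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append("1")
                lon_lo = mid
            else:
                bits.append("0")
                lon_hi = mid
        else:
            # Odd positions: halve the latitude range.
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append("1")
                lat_lo = mid
            else:
                bits.append("0")
                lat_hi = mid
    return "".join(bits)


# Made-up coordinates, roughly New York and Sydney.
new_york = geohash_bits(40.7, -74.0, 4)
sydney = geohash_bits(-33.9, 151.2, 4)

# Same idea as the SQL prefix query: keep points whose hash starts with "01".
in_block = [h for h in (new_york, sydney) if h.startswith("01")]
```

Points that share a prefix fall in the same grid, which is why a prefix match (the `LIKE` query above, or `startswith` here) finds candidates without computing any distances.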

How does a modern stock exchange achieve microsecond latency?

The principle is:

𝐃𝐨 𝐥𝐞𝐬𝐬 𝐨𝐧 𝐭𝐡𝐞 𝐜𝐫𝐢𝐭𝐢𝐜𝐚𝐥 𝐩𝐚𝐭𝐡!

- Fewer tasks on the critical path
- Less time on each task
- Fewer network hops
- Less disk usage

For the stock exchange, the critical path is:

- 𝐬𝐭𝐚𝐫𝐭: an order comes into the order manager
- mandatory risk checks
- the order gets matched and the execution is sent back
- 𝐞𝐧𝐝: the execution comes out of the order manager

Other non-critical tasks should be removed from the critical path.

We put together a design as shown in the diagram:

- deploy all the components in a single giant server (no containers)

- use shared memory as an event bus to communicate among the
components, no hard disk

- key components like Order Manager and Matching Engine are
single-threaded on the critical path, and each pinned to a CPU so that
there is 𝐧𝐨 𝐜𝐨𝐧𝐭𝐞𝐱𝐭 𝐬𝐰𝐢𝐭𝐜𝐡 and 𝐧𝐨 𝐥𝐨𝐜𝐤𝐬

- the single-threaded application loop executes tasks one by one in
sequence

- other components listen on the event bus and react accordingly

One picture is worth more than a thousand words. Log4j from attack to
prevention in one illustration.

Credit GovCERT

Link:
https://www.govcert.ch/blog/zero-day-exploit-targeting-popular-java-library-log4j/

Match buy and sell orders

Stocks go up and down. Do you know what data structure is used to
efficiently match buy and sell orders?

Stock exchanges use the data structure called 𝐨𝐫𝐝𝐞𝐫 𝐛𝐨𝐨𝐤𝐬. An order
book is an electronic list of buy and sell orders, organized by price
levels. It has a buy book and a sell book, where each side of the book
contains a bunch of price levels, and each price level contains a list of
orders (first in first out).
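
The structure described above (price levels, each holding a first-in-first-out list of orders) can be sketched as below. The names `OrderBookSide` and `match_market_buy`, the order IDs and the quantities are all made up for illustration. Finding the best price with `min` scans every level, so this is a readable sketch rather than a fast one; a real matching engine would keep the best price at hand.

```python
from collections import deque
from decimal import Decimal


class OrderBookSide:
    """One side of the book: price level -> FIFO queue of [order_id, quantity]."""

    def __init__(self):
        self.levels = {}

    def add(self, order_id, price, quantity):
        # New orders join the back of the queue at their price level.
        self.levels.setdefault(price, deque()).append([order_id, quantity])

    def volume_at(self, price):
        return sum(qty for _, qty in self.levels.get(price, ()))


def match_market_buy(sell_book, quantity):
    """Fill a market buy against the sell book, lowest price first.

    Returns a list of (order_id, price, filled_quantity) executions.
    """
    executions = []
    while quantity > 0 and sell_book.levels:
        best_ask = min(sell_book.levels)
        queue = sell_book.levels[best_ask]
        resting = queue[0]                  # oldest order at this level
        fill = min(quantity, resting[1])
        executions.append((resting[0], best_ask, fill))
        resting[1] -= fill
        quantity -= fill
        if resting[1] == 0:
            queue.popleft()                 # fully filled, leave the queue
        if not queue:
            del sell_book.levels[best_ask]  # price level is eaten up
    return executions


sells = OrderBookSide()
sells.add("s1", Decimal("100.10"), 500)
sells.add("s2", Decimal("100.10"), 300)
sells.add("s3", Decimal("100.11"), 400)
fills = match_market_buy(sells, 1000)
best_ask_after = min(sells.levels)
```

The buy order consumes both orders at 100.10 in arrival order, then takes 200 from the first order at 100.11, so `best_ask_after` moves up to 100.11.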

The image is an example of price levels and the queued quantity in
each price level.

So what happens when you place a market order to buy 2700 shares
in the diagram?

- The buy order is matched with all the sell orders at price 100.10,
and the first order at price 100.11 (illustrated in light red).

- Now because of the big buy order which “eats up” the first price level
on the sell book, the best ask price goes up from 100.10 to 100.11.

- So when the market is bullish, people tend to buy stocks, and the
price goes up and up.

An efficient data structure for an order book must satisfy:

- Constant lookup time. Operations include: get volume at a price level
or between price levels, query best bid/ask.

- Fast add/cancel/execute/update operations, preferably O(1) time
complexity. Operations include: place a new order, cancel an order,
and match an order.

Stock exchange design

The stock market has been volatile recently.

Coincidentally, we just finished a new chapter “Design a stock
exchange”. I’ll use plain English to explain what happens when you
place a stock buying order. The focus is on the exchange side.

Step 1: client places an order via the broker’s web or mobile app.

Step 2: broker sends the order to the exchange.

Step 3: the exchange client gateway performs operations such as
validation, rate limiting, authentication, normalization, etc., and sends
the order to the order manager.

Step 4: the order manager performs risk checks based on rules set by
the risk manager.

Step 5: once risk checks pass, the order manager checks if there is
enough balance in the wallet.

Step 6-7: the order is sent to the matching engine. The matching
engine sends back the execution result if a match is found. Both order
and execution results need to be sequenced first in the sequencer so
that matching determinism is guaranteed.

Step 8-10: execution result is passed all the way back to the client.

Step 11-12: market data (including the candlestick chart and order
book) are sent to the data service for consolidation. Brokers query the
data service to get the market data.

Step 13: the reporter composes all the necessary reporting fields (e.g.
client_id, price, quantity, order_type, filled_quantity,
remaining_quantity) and writes the data to the database for
persistence.

A stock exchange requires 𝐞𝐱𝐭𝐫𝐞𝐦𝐞𝐥𝐲 𝐥𝐨𝐰 𝐥𝐚𝐭𝐞𝐧𝐜𝐲. While most web
applications are ok with hundreds of milliseconds latency, a stock
exchange requires 𝐦𝐢𝐜𝐫𝐨-𝐬𝐞𝐜𝐨𝐧𝐝 𝐥𝐞𝐯𝐞𝐥 𝐥𝐚𝐭𝐞𝐧𝐜𝐲. I’ll leave the latency
discussion for a separate post since the post is already long.

Design a payment system

Today is Cyber Monday. Here is how money moves when you click the
Buy button on Amazon or any of your favorite shopping websites.

I posted the same diagram last week for an overview and a few people
asked me about the detailed steps, so here you go:

1. When a user clicks the “Buy” button, a payment event is generated
and sent to the payment service.

2. The payment service stores the payment event in the database.

3. Sometimes a single payment event may contain several payment
orders. For example, you may select products from multiple sellers in a
single checkout process. The payment service will call the payment
executor for each payment order.

4. The payment executor stores the payment order in the database.

5. The payment executor calls an external PSP to finish the credit card
payment.

6. After the payment executor has successfully executed the payment,
the payment service will update the wallet to record how much money
a given seller has.

7. The wallet server stores the updated balance information in the
database.

8. After the wallet service has successfully updated the seller’s balance
information, the payment service will call the ledger to update it.

9. The ledger service appends the new ledger information to the
database.

10. Every night the PSP or banks send settlement files to their clients.
The settlement file contains the balance of the bank account, together
with all the transactions that took place on this bank account during the
day.

Design a flash sale system

Black Friday is coming. Designing a system with extremely high
concurrency, high availability and quick responsiveness needs to
consider many aspects 𝐚𝐥𝐥 𝐭𝐡𝐞 𝐰𝐚𝐲 𝐟𝐫𝐨𝐦 𝐟𝐫𝐨𝐧𝐭𝐞𝐧𝐝 𝐭𝐨 𝐛𝐚𝐜𝐤𝐞𝐧𝐝. See the
picture below for details:

𝐃𝐞𝐬𝐢𝐠𝐧 𝐩𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞𝐬:
1. Less is more - fewer elements on the web page, fewer data
queries to the database, fewer web requests, fewer system
dependencies
2. Short critical path - fewer hops among services or merge into
one service
3. Async - use message queues to handle high TPS
4. Isolation - isolate static and dynamic contents, isolate processes
and databases for rare items
5. Overselling is bad. When to decrease the inventory is important
6. User experience is important. We definitely don’t want to inform
users that they have successfully placed orders but later tell
them no items are actually available

Back-of-the-envelope estimation

Recently, a few engineers asked me whether we really need
back-of-the-envelope estimation in a system design interview. I think it
would be helpful to clarify.

Estimations are important because we need them to understand the
scale of the system and justify the design. They help answer questions
like:

- Do we really need a distributed solution?
- Is a cache layer necessary?
- Shall we choose data replication or sharding?

Here is an example of how the estimations shape the design decision.

One interview question is to design a proximity service, and how to
scale the geospatial index is a key part of it. Here are a few paragraphs we
wrote to show why jumping to a sharding design without estimations is
a bad idea:

“One common mistake about scaling the geospatial index is to quickly
jump to a sharding scheme without considering the actual data size of
the table. In our case, the full dataset for the geospatial index table is
not large (quadtree index only takes 1.71G memory and storage
requirement for geohash index is similar). The whole geospatial index
can easily fit in the working set of a modern database server. However,
depending on the read volume, a single database server might not
have enough CPU or network bandwidth to service all read requests. If
that is the case, it will be necessary to spread the read load among
multiple database servers.

There are two general approaches to spread the load of a relational
database server. We can add read replicas or shard the database.

Many engineers like to talk about sharding during interviews. However,
it might not be a good fit for the geohash table. Sharding is
complicated. The sharding logic has to be added to the application
layer. Sometimes, sharding is the only option. In this case though,
since everything can fit in the working set of a database server, there is
no strong technical reason to shard the data among multiple servers.

A better approach, in this case, is to have a series of read replicas to
help with the read load. This method is much simpler to develop and
maintain. Thus, we recommend scaling the geospatial index table
through replicas.”
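
The quoted text gives the 1.71G figure without showing the arithmetic. Here is one way such an estimate could be put together. The 200 million businesses and the 100-per-leaf limit come from the quadtree post; the per-node byte sizes are assumptions picked for illustration, so treat the result as a rough order-of-magnitude check and not as the authors' exact derivation.

```python
BUSINESSES = 200_000_000   # from the quadtree post
MAX_PER_LEAF = 100         # from the quadtree post

# Assumed node layouts (hypothetical, for illustration only):
COORDS_BYTES = 32          # two corner points, 4 numbers x 8 bytes
BUSINESS_ID_BYTES = 8      # one ID per business stored in a leaf
CHILD_POINTER_BYTES = 8    # one pointer per child of an internal node

LEAF_BYTES = COORDS_BYTES + BUSINESS_ID_BYTES * MAX_PER_LEAF   # 832
INTERNAL_BYTES = COORDS_BYTES + CHILD_POINTER_BYTES * 4        # 64

# If every leaf is full, we need about 2 million leaves.
leaf_nodes = BUSINESSES // MAX_PER_LEAF
# A full 4-ary tree has roughly one third as many internal nodes as leaves.
internal_nodes = leaf_nodes // 3

total_bytes = leaf_nodes * LEAF_BYTES + internal_nodes * INTERNAL_BYTES
total_gb = total_bytes / 1e9
```

Under these assumptions `total_gb` comes out at roughly 1.7, which is small enough for a single server's memory. That is the whole point of the estimate: a number like this tells you replicas are enough before anyone starts drawing shards.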


Check out our bestselling system design books.
Paperback: Amazon Digital: ByteByteGo.
