ByteByteGo LinkedIn PDF

What are database isolation levels? What are they used for?
What is IaaS/PaaS/SaaS?
Deployment strategies
HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC)
How to scale a website to support millions of users?
How do modern browsers work?
Cache miss attack
How to diagnose a mysterious process that’s taking too much CPU, memory, IO, etc?
What are the top cache strategies?
Upload large files
Why is Redis so Fast?
SWIFT payment network
At-most once, at-least once, and exactly once
Vertical partitioning and Horizontal partitioning
CDN
Erasure coding
Foreign exchange in payment
Block storage, file storage and object storage
Block storage, file storage and object storage
Domain Name System (DNS) lookup
What happens when you type a URL into your browser?
AI Coding engine
Read replica pattern
Email receiving flow
Interview Question: Design Google Maps
Which database shall I use for the metrics collecting system?
Metrics monitoring and alerting system
Reconciliation
Big data papers
Avoid double charge
Payment security
System Design Interview Tip
Big data evolvement
Quadtree
How do we find nearby restaurants on Yelp?
How does a modern stock exchange achieve microsecond latency?
Match buy and sell orders
Stock exchange design
Design a payment system
Design a flash sale system
Back-of-the-envelope estimation
What are database isolation levels? What are they used for?

Database isolation allows a transaction to execute as if there are no other concurrently running transactions.

The diagram below illustrates four isolation levels.

🔹 Read Uncommitted: The data modification can be read by other transactions before a transaction is committed.

🔹 Repeatable Read: Data read during the transaction stays the same as when the transaction started.

The isolation is guaranteed by MVCC (Multi-Version Concurrency Control) and locks.

The diagram below takes Repeatable Read as an example to demonstrate how MVCC works:

There are two hidden columns for each row: transaction_id and roll_pointer. When transaction A starts, a new Read View with transaction_id=201 is created. Shortly afterward, transaction B starts, and a new Read View with transaction_id=202 is created.

Now transaction A modifies the balance to 200; a new row of the log is created, and the roll_pointer points to the old row. Before transaction A commits, transaction B reads the balance data. Transaction B finds that transaction_id 201 is not committed, so it reads the next committed record (transaction_id=200).
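The read rule above fits in a few lines of Python. A toy sketch only — the old balance of 500 and the committed-transaction set are hypothetical values standing in for the diagram:

```python
# Toy model of the MVCC read path: every row version records the writer's
# transaction_id and a roll_pointer to the previous version.
class Version:
    def __init__(self, value, txn_id, roll_pointer=None):
        self.value, self.txn_id, self.roll_pointer = value, txn_id, roll_pointer

committed = {200}                          # transaction 201 has not committed yet
balance = Version(200, txn_id=201,         # uncommitted write by transaction A
                  roll_pointer=Version(500, txn_id=200))  # hypothetical old value

def read(version):
    # Walk the version chain until a committed version is found.
    while version.txn_id not in committed:
        version = version.roll_pointer
    return version.value

print(read(balance))  # 500 -> transaction B sees the last committed record
```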
Over to you: have you seen isolation levels used in the wrong way? Did it cause serious outages?
What is IaaS/PaaS/SaaS?

For a non-cloud application, we own and manage all the hardware and software. We say the application is on-premises.
12 C
13 PowerShell
14 Go
15 Kotlin
16 Rust
17 Ruby
18 Dart
19 Assembly
20 Swift
21 R
22 VBA
23 Matlab
24 Groovy
25 Objective-C
26 Scala
27 Perl
28 Haskell
29 Delphi
30 Clojure
31 Elixir
32 LISP
33 Julia
34 F
35 Erlang
36 APL
37 Crystal
38 COBOL

Over to you: what’s the first programming language you learned? And what are the other languages you learned over the years?

What is the future of online payments?

I don’t know the answer, but I do know one of the candidates is the blockchain.

As a fan of technology, I always seek new solutions to old challenges. A book that explains a lot about an emerging payment system is ‘Mastering Bitcoin’ by Andreas M. Antonopoulos. I want to share my discovery of this book with you because it explains very clearly bitcoin and its underlying blockchain. This book makes me rethink how to renovate payment systems.
2. The golden source of truth for bitcoin is the blockchain, which is also the journal. It’s the same if we use Event Sourcing architecture to build a traditional wallet, although there are other options.

3. There is a small virtual machine for bitcoin - and also Ethereum. The virtual machine defines a set of bytecodes to do basic tasks such as validation.

Over to you: if Elon Musk set up a base on planet Mars, what payment solution would you recommend?

What is SSO (Single Sign-On)?

A friend recently went through the irksome experience of being signed out from a number of websites they use daily. This event will be familiar to millions of web users, and it is a tedious process to fix. It can involve trying to remember multiple long-forgotten passwords, or typing in the names of pets from childhood to answer security questions. SSO removes this inconvenience and makes life online better. But how does it work?

Basically, Single Sign-On (SSO) is an authentication scheme. It allows a user to log in to different systems using a single ID.

Step 1: A user visits Gmail, or any email service. Gmail finds the user is not logged in and so redirects them to the SSO authentication server, which also finds the user is not logged in. As a result, the user is redirected to the SSO login page, where they enter their login credentials.
Steps 4-7: Gmail validates the token in the SSO authentication server. The authentication server registers the Gmail system, and returns “valid.” Gmail returns the protected resource to the user.

Step 8: From Gmail, the user navigates to another Google-owned website, for example, YouTube.

Steps 9-10: YouTube finds the user is not logged in, and then requests authentication. The SSO authentication server finds the user is already logged in and returns the token.

Steps 11-14: YouTube validates the token in the SSO authentication server. The authentication server registers the YouTube system, and returns “valid.” YouTube returns the protected resource to the user.

The process is complete and the user gets back access to their account.
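Real SSO deployments issue the token with a standard protocol such as SAML or OpenID Connect. Purely as an illustration of “issue a token, then validate it,” here is a minimal sketch — the secret, claims, and token format are all assumptions:

```python
import base64, hashlib, hmac, json, time

SECRET = b"sso-server-secret"  # hypothetical key known to the SSO server

def issue_token(user_id: str, ttl: int = 3600) -> str:
    # The SSO server signs a small set of claims (subject + expiry).
    payload = base64.urlsafe_b64encode(
        json.dumps({"sub": user_id, "exp": int(time.time()) + ttl}).encode())
    sig = base64.urlsafe_b64encode(
        hmac.new(SECRET, payload, hashlib.sha256).digest())
    return (payload + b"." + sig).decode()

def validate(token: str) -> bool:
    # A service (Gmail, YouTube) checks the signature and the expiry.
    payload, sig = token.encode().split(b".")
    expected = base64.urlsafe_b64encode(
        hmac.new(SECRET, payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False
    return json.loads(base64.urlsafe_b64decode(payload))["exp"] > time.time()

token = issue_token("alice")
print(validate(token))  # True -> the service returns the protected resource
```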
Over to you:
How to store passwords safely in the database?

𝐓𝐡𝐢𝐧𝐠𝐬 𝐍𝐎𝐓 𝐭𝐨 𝐝𝐨
🔹 Storing passwords in plain text is not a good idea because anyone with internal access can see them.
🔹 Storing password hashes directly is not sufficient because it is prone to precomputation attacks, such as rainbow tables.
🔹 To mitigate precomputation attacks, we salt the passwords.

𝐖𝐡𝐚𝐭 𝐢𝐬 𝐬𝐚𝐥𝐭?
According to OWASP guidelines, “a salt is a unique, randomly generated string that is added to each password as part of the hashing process”.
2️⃣ The password can be stored in the database using the following format: 𝘩𝘢𝘴𝘩( 𝘱𝘢𝘴𝘴𝘸𝘰𝘳𝘥 + 𝘴𝘢𝘭𝘵).

3️⃣ The system appends the salt to the password and hashes it. Let’s call the hashed value H1.

4️⃣ The system compares H1 and H2, where H2 is the hash stored in the database. If they are the same, the password is valid.
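A minimal sketch of this flow, assuming SHA-256 as the hash function; note that OWASP recommends slow, dedicated password-hashing functions (bcrypt, scrypt, Argon2) over a bare SHA-256 in production:

```python
import hashlib, os, secrets

def store_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)                    # unique random salt per password
    h2 = hashlib.sha256(password.encode() + salt).digest()  # hash(password + salt)
    return salt, h2                          # both columns go in the database

def verify_password(password: str, salt: bytes, h2: bytes) -> bool:
    h1 = hashlib.sha256(password.encode() + salt).digest()  # steps 3️⃣-4️⃣ above
    return secrets.compare_digest(h1, h2)

salt, h2 = store_password("hunter2")
print(verify_password("hunter2", salt, h2))  # True
print(verify_password("wrong", salt, h2))    # False
```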
How does HTTPS work?

Hypertext Transfer Protocol Secure (HTTPS) is an extension of the Hypertext Transfer Protocol (HTTP). HTTPS transmits encrypted data using Transport Layer Security (TLS). If the data is hijacked online, all the hijacker gets is binary code.

Step 2 - The client sends a “client hello” to the server. The message contains a set of necessary encryption algorithms (cipher suites) and the latest TLS version it can support. The server responds with a “server hello” so the browser knows whether it can support the algorithms and TLS version.

The server then sends the SSL certificate to the client. The certificate contains the public key, host name, expiry dates, etc. The client validates the certificate.

Step 3 - After validating the SSL certificate, the client generates a session key and encrypts it using the public key. The server receives the encrypted session key and decrypts it with the private key.

Step 4 - Now that both the client and the server hold the same session key (symmetric encryption), the encrypted data is transmitted in a secure bi-directional channel.

Why does HTTPS switch to symmetric encryption during data transmission?
1. Security: The asymmetric encryption goes only one way. This means that if the server tries to send the encrypted data back to the client, anyone can decrypt the data using the public key.
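Python’s standard ssl module performs this whole handshake when wrapping a socket. A minimal client sketch (the host name is just a placeholder):

```python
import socket
import ssl

ctx = ssl.create_default_context()          # trusted CAs for certificate validation
with socket.create_connection(("example.com", 443)) as tcp:
    # wrap_socket runs the handshake: hellos, certificate check, key exchange
    with ctx.wrap_socket(tcp, server_hostname="example.com") as tls:
        print(tls.version(), tls.cipher())  # negotiated TLS version + cipher suite
        tls.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
        print(tls.recv(100))                # data now travels encrypted
```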
How to learn design patterns?

Besides reading a lot of well-written code, a good book guides us like a good teacher.

𝐇𝐞𝐚𝐝 𝐅𝐢𝐫𝐬𝐭 𝐃𝐞𝐬𝐢𝐠𝐧 𝐏𝐚𝐭𝐭𝐞𝐫𝐧𝐬, second edition, is the one I would recommend.

Last year, I bought the second edition of Head First Design Patterns and read through it. Here are a few things I like about the book:

🔹 This book solves the challenge of software’s abstract, “invisible” nature. Software is difficult to build because we cannot see its architecture; its details are embedded in the code and binary files. It is even harder to understand software design patterns because these are higher-level abstractions of the software. The book fixes this by using visualization. There are lots of diagrams, arrows, and comments on almost every page. If I do not understand the text, it’s no problem. The diagrams explain things very well.
A visual guide on how to choose the right Database

Picking a database is a long-term commitment so the decision shouldn’t be made lightly. The important thing to keep in mind is to choose the right database for the right job.

Data can be structured (SQL table schema), semi-structured (JSON, XML, etc.), and unstructured (Blob).

Common database categories include:
🔹 Relational
🔹 Columnar
🔹 Key-value
🔹 In-memory
🔹 Wide column
🔹 Time Series
🔹 Immutable ledger
🔹 Geospatial
🔹 Graph
🔹 Document
🔹 Text search
🔹 Blob

Over to you - Which database have you used for which workload?
Requirements:
🔹 Globally unique
🔹 Roughly sorted by time
🔹 Numerical values only
🔹 64 bits
🔹 Highly scalable, low latency
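These requirements match Twitter’s Snowflake design. A minimal sketch, assuming Snowflake’s usual bit layout (41-bit millisecond timestamp, 10-bit machine ID, 12-bit per-millisecond sequence):

```python
import threading
import time

class SnowflakeId:
    EPOCH = 1288834974657  # Twitter's custom epoch, in milliseconds

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024          # 10 bits for the machine ID
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:          # exhausted: wait for the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            # timestamp | machine id | sequence -> roughly time-sorted 64-bit int
            return ((now - self.EPOCH) << 22) | (self.machine_id << 12) | self.sequence

gen = SnowflakeId(machine_id=1)
print(gen.next_id(), gen.next_id())  # strictly increasing numeric IDs
```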
How does Twitter work?

This post is a summary of a tech talk given by Twitter in 2013. Let’s take a look.

4️⃣ The Timeline service is used to find the Redis server that has the home timeline on it.
5️⃣ A user pulls their home timeline through the Timeline service.

𝐒𝐞𝐚𝐫𝐜𝐡 & 𝐃𝐢𝐬𝐜𝐨𝐯𝐞𝐫𝐲
🔹 Ingester: annotates and tokenizes Tweets so the data can be indexed.
🔹 Earlybird: stores search index.
🔹 Blender: creates the search and discovery timelines.

𝐏𝐮𝐬𝐡 𝐂𝐨𝐦𝐩𝐮𝐭𝐞
🔹 HTTP push
🔹 Mobile push

Over to you: Do you use Twitter? What are some of the biggest differences between LinkedIn and Twitter that might shape their system architectures?
What is the difference between Process and Thread?

To better understand this question, let’s first take a look at what a Program is. A 𝐏𝐫𝐨𝐠𝐫𝐚𝐦 is an executable file containing a set of instructions, passively stored on disk. One program can have multiple processes. For example, the Chrome browser creates a different process for every single tab.

A 𝐏𝐫𝐨𝐜𝐞𝐬𝐬 means a program is in execution. When a program is loaded into the memory and becomes active, the program becomes a process. The process requires some essential resources such as registers, program counter, and stack.

A 𝐓𝐡𝐫𝐞𝐚𝐝 is the smallest unit of execution within a process.

The major differences are:
🔹 A thread is a component of a process.
🔹 Each process has its own memory space. Threads that belong to the same process share the same memory.
🔹 A process is a heavyweight operation. It takes more time to create and terminate.
🔹 Context switching is more expensive between processes.
🔹 Inter-thread communication is faster than inter-process communication.

Over to you:
1). Some programming languages support coroutine. What is the difference between coroutine and thread?
2). How to list running processes in Linux?
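The memory difference is easy to demonstrate. A small sketch: a thread’s update of a global variable is visible to its parent, while a child process only changes its own copy:

```python
import multiprocessing
import threading

counter = 0

def bump():
    global counter
    counter += 1

if __name__ == "__main__":
    # Threads share the parent's memory: the update is visible afterward.
    t = threading.Thread(target=bump)
    t.start(); t.join()
    print("after thread:", counter)   # 1

    # A process gets its own memory space: the parent's counter is unchanged.
    p = multiprocessing.Process(target=bump)
    p.start(); p.join()
    print("after process:", counter)  # still 1
```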
Interview Question: design Google Docs

4️⃣ The File Operation Server consumes operations produced by clients and generates transformed operations using collaboration algorithms.

5️⃣ Three types of data are stored: file metadata, file content, and operations.

Commonly used collaboration algorithms include:
🔹 Operational transformation (OT)
🔹 Differential Synchronization (DS)
🔹 Conflict-free replicated data type (CRDT)
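As a flavor of what OT does, here is a deliberately tiny sketch that transforms one concurrent insert against another so both replicas converge; real OT implementations also handle deletes, formatting, and multi-way concurrency:

```python
def transform_insert(op_a, op_b):
    """Adjust op_a so it can apply after op_b. Ops are (position, text)."""
    pos_a, text_a = op_a
    pos_b, text_b = op_b
    if pos_a <= pos_b:                    # ties: op_a keeps its position
        return (pos_a, text_a)
    return (pos_a + len(text_b), text_a)  # shift past the concurrent insert

def apply(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

doc = "helo"
a = (3, "l")        # client A fixes the typo
b = (4, " world")   # client B appends concurrently
# Apply A, then B transformed against A; the replicas converge.
print(apply(apply(doc, a), transform_insert(b, a)))  # "hello world"
```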
Over to you - Have you encountered any issues while using Google Docs? If so, what do you think might have caused the issue?
Deployment strategies

Deploying or upgrading services is risky. In this post, we explore risk mitigation strategies. The diagram below illustrates the common ones.

𝐁𝐥𝐮𝐞-𝐆𝐫𝐞𝐞𝐧 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
With blue-green deployment, we have two identical environments: one is staging (blue) and the other is production (green). The staging environment is one version ahead of production. Once testing is done in the staging environment, user traffic is switched to the staging environment, and the staging becomes the production. This deployment strategy makes rollback simple to perform, but having two identical production-quality environments could be expensive.

𝐂𝐚𝐧𝐚𝐫𝐲 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
A canary deployment upgrades services gradually, each time to a subset of users. It is cheaper than blue-green deployment and easy to perform rollback. However, since there is no staging environment, we have to test on production. This process is more complicated because we need to monitor the canary while gradually migrating more and more users away from the old version.

𝐀/𝐁 𝐓𝐞𝐬𝐭
In the A/B test, different versions of services run in production simultaneously. Each version runs an “experiment” for a subset of users. A/B test is a cheap method to test new features in production. We need to control the deployment process in case some features are pushed to users by accident.

𝐌𝐮𝐥𝐭𝐢-𝐒𝐞𝐫𝐯𝐢𝐜𝐞 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
In this model, we deploy new changes to multiple services simultaneously. This approach is easy to implement. But since all the services are upgraded at the same time, it is hard to manage and test dependencies. It’s also hard to roll back safely.

Over to you - Which deployment strategy have you used? Did you witness any deployment-related outages in production and why did they happen?
Flowchart of how Slack decides to send a notification

It is a great example of why a simple feature may take much longer to develop than many people think. When we have a great design, users may not notice the complexity because it feels like the feature is just working as intended.

Image source: https://slack.engineering/reducing-slacks-memory-footprint/

How does Amazon build and operate the software?

In 2019, Amazon released The Amazon Builders' Library. It contains architecture-based articles that describe how Amazon architects, releases, and operates technology.

🔹 Making retries safe with idempotent APIs
🔹 Timeouts, retries, and backoff with jitter
🔹 Beyond five 9s: Lessons from our highest available data planes
🔹 Caching challenges and strategies
🔹 Ensuring rollback safety during deployments
🔹 Going faster with continuous delivery
🔹 Challenges with distributed systems
🔹 Amazon's approach to high-availability deployment

Over to you: what’s your favorite place to learn system design and design principles?

Link to The Amazon Builders' Library: aws.amazon.com/builders-library

How to design a secure web API access for your website?

When we open web API access to users, we need to make sure each API call is authenticated. This means the user must be who they claim to be.
In this post, we explore two common ways:
1. Token based authentication
2. HMAC (Hash-based Message Authentication Code) authentication
𝐓𝐨𝐤𝐞𝐧 𝐛𝐚𝐬𝐞𝐝
Step 1 - the user enters their password into the client, and the client sends the password to the Authentication Server.
Step 2 - the Authentication Server authenticates the credentials and generates a token with an expiry time.
Steps 3 and 4 - now the client can send requests to access server resources with the token in the HTTP header. This access is valid until the token expires.

𝐇𝐌𝐀𝐂 𝐛𝐚𝐬𝐞𝐝
This mechanism generates a Message Authentication Code (signature) by using a hash function (SHA256 or MD5).
Steps 1 and 2 - the server generates two keys, one is Public APP ID (public key) and the other one is API Key (private key).
Step 3 - we now generate an HMAC signature on the client side (hmac A). This signature is generated with a set of attributes listed in the diagram.
Step 5 - the server receives the request which contains the request data and the authentication header. It extracts the necessary attributes from the request and uses the API key that’s stored on the server side to generate a signature (hmac B).
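A sketch of steps 3 and 5 in Python. The attribute set and key values are hypothetical, since every API defines its own canonical string to sign:

```python
import hashlib, hmac, time

def sign(api_key: str, app_id: str, method: str, path: str, timestamp: str) -> str:
    # Build the canonical message from the agreed attributes, then HMAC it.
    message = "\n".join([app_id, method, path, timestamp])
    return hmac.new(api_key.encode(), message.encode(), hashlib.sha256).hexdigest()

# Client side (step 3): compute hmac A and send it with the request attributes.
ts = str(int(time.time()))
hmac_a = sign("server-issued-secret", "public-app-id", "GET", "/v1/orders", ts)

# Server side (step 5): recompute hmac B from the same attributes and compare.
hmac_b = sign("server-issued-secret", "public-app-id", "GET", "/v1/orders", ts)
print(hmac.compare_digest(hmac_a, hmac_b))  # True -> the request is authentic
```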
How do microservices collaborate and interact with each other?

There are two ways: 𝐨𝐫𝐜𝐡𝐞𝐬𝐭𝐫𝐚𝐭𝐢𝐨𝐧 and 𝐜𝐡𝐨𝐫𝐞𝐨𝐠𝐫𝐚𝐩𝐡𝐲.

The diagram below illustrates the collaboration of microservices.

Service orchestration describes the interactions between all the participating services. It is just like a conductor leading the musicians in a musical symphony. The orchestration pattern also includes the transaction management among different services.

Choreography is like having a choreographer set all the rules. Then the dancers on stage (the microservices) interact according to them. Service choreography describes this exchange of messages and the rules by which the microservices interact.

The benefits of orchestration:
1. Reliability - orchestration has built-in transaction management and error handling, while choreography is point-to-point communications and the fault tolerance scenarios are much more complicated.
2. Scalability - when adding a new service into orchestration, only the orchestrator needs to modify the interaction rules, while in choreography all the interacting services need to be modified.
What are the differences between Virtualization (VMware) and Containerization (Docker)?

The major differences are:

In virtualization, the hypervisor creates an abstraction layer over hardware, so that multiple operating systems can run alongside each other. This technique is considered to be the first generation of cloud computing.

In containerization, all the components needed to run the application or microservice are packaged together, so that the applications can run anywhere.

Sources:
[1] Understanding virtualization: https://lnkd.in/gtQY9gkx
[2] What is containerization?: https://lnkd.in/gm4Qv_x2
Which cloud provider should be used when building a big data solution?

The diagram below illustrates the detailed comparison of AWS, Google Cloud, and Microsoft Azure.

The common parts of the solutions:
1. Data ingestion of structured or unstructured data.
2. Raw data storage.
3. Data processing, including filtering, transformation, normalization, etc.
4. Data warehouse, including key-value storage, relational database, OLAP database, etc.
5. Presentation layer with dashboards and real-time notifications.

For example, the first step and the last step both use the serverless product. The product is called “lambda” in AWS, and “function” in Azure and Google Cloud.
How to avoid crawling duplicate URLs at Google scale?

Option 1: Use a Set data structure to check if a URL already exists or not. Set is fast, but it is not space-efficient.

The diagram below illustrates how the Bloom filter works. The basic data structure for the Bloom filter is Bit Vector. Each bit represents a hashed value.

Step 2: When testing the existence of a URL string, the same hash functions A, B, and C are applied to the URL string. If all three bits are 1, then the URL may exist in the dataset; if any of the bits is 0, then the URL definitely does not exist in the dataset.

Hash function choices are important. They must be uniformly distributed and fast. For example, RedisBloom and Apache Spark use murmur, and InfluxDB uses xxhash.

Question - In our example, we used three hash functions. How many hash functions should we use in reality? What are the trade-offs?
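A compact Bloom filter sketch with k hash functions derived from SHA-256; the sizes here are arbitrary, and real systems tune the bit count m and the hash count k to a target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)  # the bit vector

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the function index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, item: str) -> bool:
        # True means "maybe present"; False means "definitely not present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/a")
print(bf.may_contain("https://example.com/a"))  # True
print(bf.may_contain("https://example.com/b"))  # False (rare false positives possible)
```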
Why is a solid-state drive (SSD) fast?

“A solid state drive reads up to 10 times faster and writes up to 20 times faster than a hard disk drive.” [1]

“An SSD is a flash-memory based data storage device. Bits are stored into cells, which are made of floating-gate transistors. SSDs are made entirely of electronic components, there are no moving or mechanical parts like in hard drives (HDD)” [2].
Step 1: “Commands come from the user through the host interface” [2]. The interface can be Serial ATA (SATA) or PCI Express (PCIe).
Step 2: “The processor in the SSD controller takes the commands and passes them to the flash controller” [2].
Step 3: “SSDs also have embedded RAM memory, generally for caching purposes and to store mapping information” [2].
Step 4: “The packages of NAND flash memory are organized in gangs, over multiple channels” [2].

The second diagram illustrates how the logical and physical pages are mapped, and why this architecture is fast.

The SSD controller operates multiple FLASH particles in parallel, greatly improving the underlying bandwidth. When we need to write more than one page, the SSD controller can write them in parallel [3], whereas the HDD has a single head and it can only read from one head at a time.

Handling a large-scale outage

This is a true story about handling a large-scale outage, written by Sahn Lam, a Staff Engineer at Discord.

About 10 years ago, I witnessed the most impactful UI bug of my career.

It was 9PM on a Friday. I was on the team responsible for one of the largest social games at the time. It had about 30 million DAU. I just so happened to glance at the operational dashboard before shutting down for the night.

Every line on the dashboard was at zero.

At that very moment, I got a phone call from my boss. He said the entire game was down. Firefighting mode. Full on.
𝐅𝐢𝐫𝐞𝐜𝐫𝐚𝐜𝐤𝐞𝐫 𝐌𝐢𝐜𝐫𝐨𝐕𝐌
Firecracker is the engine powering all of the Lambda functions [1]. It is
a virtualization technology developed at Amazon and written in Rust.
The diagram below illustrates the isolation model for AWS Lambda
Workers.
Lambda functions run within a sandbox, which provides a minimal Linux userland, some common libraries and utilities. It creates the Execution environment (worker) on EC2 instances.

𝐒𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Step 1: "The Worker Manager communicates with a Placement Service which is responsible to place a workload on a location for the given host (it’s provisioning the sandbox) and returns that to the Worker Manager" [2].
Step 2: "The Worker Manager can then call 𝘐𝘯𝘪𝘵 to initialize the function for execution by downloading the Lambda package from S3 and setting up the Lambda runtime" [2].

𝐀𝐬𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Step 1: The Application Load Balancer forwards the invocation to an available Frontend which places the event onto an internal queue (SQS).
Step 2: There is "a set of pollers assigned to this internal queue which are responsible for polling it and moving the event onto a Frontend synchronously. After it’s been placed onto the Frontend it follows the synchronous invocation call pattern which we covered earlier" [2].

Question: Can you think of any use cases for AWS Lambda?

Sources:
[1] AWS Lambda whitepaper: https://docs.aws.amazon.com/whitepapers/latest/security-overview-aws-lambda/lambda-executions.html
[2] Behind the scenes, Lambda: https://www.bschaatsbergen.com/behind-the-scenes-lambda/
Image source: [1] [2]

HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC). What problem does each generation of HTTP solve?

🔹 HTTP 1.0 was finalized and fully documented in 1996. Every request to the same server requires a separate TCP connection.

🔹 HTTP 1.1 was published in 1997. A TCP connection can be left open for reuse (persistent connection), but it doesn’t solve the HOL (head-of-line) blocking issue.

HOL blocking - when the number of allowed parallel requests in the browser is used up, subsequent requests need to wait for the former ones to complete.
🔹 HTTP 2.0 was published in 2015. It addresses the HOL issue through request multiplexing, which eliminates HOL blocking at the application layer, but HOL still exists at the transport (TCP) layer.

Question: When shall we upgrade to HTTP 3.0? Any pros & cons you can think of?

How to scale a website to support millions of users?

We will explain this step-by-step.
Suppose we have two services: inventory service (handles product
descriptions and inventory management) and user service (handles
user information, registration, login, etc.).
Step 1 - With the growth of the user base, one single application server
cannot handle the traffic anymore. We put the application server and
the database server into two separate servers.
DevOps Books

Some 𝐃𝐞𝐯𝐎𝐩𝐬 books I find enlightening:

🔹 The Phoenix Project - a classic novel about effectiveness and communications. IT work is like manufacturing plant work, and a system must be established to streamline the workflow. Very interesting read!
Why is Kafka fast?

Kafka achieves low latency message delivery through Sequential I/O and Zero Copy Principle. The same techniques are commonly used in many other messaging/streaming platforms.

The diagram below illustrates how the data is transmitted between producer and consumer, and what zero-copy means.

🔹 Step 2: Consumer reads data without zero-copy
2.1: The data is loaded from disk to OS cache
2.2 The data is copied from OS cache to Kafka application
2.3 Kafka application copies the data into the socket buffer
2.4 The data is copied from socket buffer to network card
2.5 The network card sends data out to the consumer

🔹 Step 3: Consumer reads data with zero-copy
3.1: The data is loaded from disk to OS cache
3.2 OS cache directly copies the data to the network card via sendfile() command
3.3 The network card sends data out to the consumer
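Python exposes the same kernel primitive, so the two consumer read paths can be contrasted directly. A sketch assuming a log file on disk and an already-connected socket; socket.sendfile() calls the kernel’s sendfile() on Linux, so data flows disk -> OS cache -> NIC without entering user space:

```python
import socket

def send_without_zero_copy(conn: socket.socket, path: str):
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024):  # 2.2: copy into the application buffer
            conn.sendall(chunk)            # 2.3-2.4: copy back out via socket buffer

def send_with_zero_copy(conn: socket.socket, path: str):
    with open(path, "rb") as f:
        conn.sendfile(f)                   # 3.2: kernel-space copy only
```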
How do modern browsers work?

Links:
https://developer.chrome.com/blog/inside-browser-part1/
https://developer.chrome.com/blog/inside-browser-part2/
https://developer.chrome.com/blog/inside-browser-part3/
https://developer.chrome.com/blog/inside-browser-part4/
Over time, different API architectural styles have been released. Each of them has its own patterns of standardizing data exchange. You can check out the use cases of each style in the diagram.
Redis vs Memcached

The diagram below illustrates the key differences.

🔹 Recording the number of clicks and comments for each post (hash)
🔹 Sorting the commented user list and deduping the users (zset)
🔹 Caching user behavior history and filtering malicious behaviors (zset, hash)
🔹 Storing boolean information of extremely large data into small space. For example, login status, membership status. (bitmap)

Optimistic locking

Optimistic locking, also referred to as optimistic concurrency control, allows multiple concurrent users to attempt to update the same resource.

1. A new column called “version” is added to the database table.
2. Before a user modifies a database row, the application reads the version number of the row.
3. When the user updates the row, the application increases the version number by 1 and writes it back to the database.
Optimistic locking is usually faster than pessimistic locking because we do not lock the database. However, the performance of optimistic locking drops dramatically when concurrency is high.

To understand why, consider the case when many clients try to reserve a hotel room at the same time. Because there is no limit on how many clients can read the available room count, all of them read back the same available room count and the current version number. When different clients make reservations and write back the results to the database, only one of them will succeed, and the rest of the clients receive a version check failure message. These clients have to retry. In the subsequent round of retries, there is only one successful client again, and the rest have to retry. Although the end result is correct, repeated retries cause a very unpleasant user experience.
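A minimal sketch of the version check with SQLite, assuming a hypothetical rooms(id, available, version) table. The UPDATE only takes effect when the version is unchanged, which is exactly the check that fails for all but one concurrent client:

```python
import sqlite3

def reserve_room(db: sqlite3.Connection, room_id: int) -> bool:
    available, version = db.execute(
        "SELECT available, version FROM rooms WHERE id = ?", (room_id,)
    ).fetchone()
    if available == 0:
        return False
    # Succeeds only if nobody changed the row since we read it.
    cur = db.execute(
        "UPDATE rooms SET available = available - 1, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (room_id, version),
    )
    db.commit()
    return cur.rowcount == 1  # 0 rows -> version check failed, caller retries
```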
Tradeoff between latency and consistency

Understanding the 𝐭𝐫𝐚𝐝𝐞𝐨𝐟𝐟𝐬 is very important not only in system design interviews but also in designing real-world systems. When we talk about data replication, there is a fundamental tradeoff between 𝐥𝐚𝐭𝐞𝐧𝐜𝐲 and 𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲. It is illustrated by the diagram below.
Cache miss attack

Caching is awesome but it doesn’t come without a cost, just like many things in life.

One of the issues is 𝐂𝐚𝐜𝐡𝐞 𝐌𝐢𝐬𝐬 𝐀𝐭𝐭𝐚𝐜𝐤. Correct me if this is not the right term. It refers to the scenario where data to fetch doesn't exist in the database and the data isn’t cached either. So every request hits the database eventually, defeating the purpose of using a cache. If a malicious user initiates lots of queries with such keys, the database can easily be overloaded.

The diagram below illustrates the process.

Two approaches can mitigate the attack:

🔹 Cache keys with null value. Set a short TTL (Time to Live) for keys with null value.

🔹 Using Bloom filter. A Bloom filter is a data structure that can rapidly tell us whether an element is present in a set or not. If the key exists, the request first goes to the cache and then queries the database if needed. If the key doesn't exist in the data set, it means the key doesn’t exist in the cache/database. In this case, the query will not hit the cache or database layer.
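A sketch of the first mitigation: cache the “not found” result itself under a short TTL, so repeated queries for a missing key stop reaching the database. The TTL values are arbitrary:

```python
import time

cache: dict[str, tuple[object, float]] = {}
NULL_TTL = 60       # assumption: cache "not found" results for 60 seconds
NORMAL_TTL = 3600

def get_user(user_id: str, db: dict):
    entry = cache.get(user_id)
    if entry and entry[1] > time.time():
        return entry[0]                      # cache hit (possibly a cached None)
    value = db.get(user_id)                  # miss: fall through to the database
    ttl = NORMAL_TTL if value is not None else NULL_TTL
    cache[user_id] = (value, time.time() + ttl)
    return value
```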
How to diagnose a mysterious process that’s taking too much CPU, memory, IO, etc?

The diagram below illustrates helpful tools in a Linux system.

🔹 ‘netstat’ - displays statistical data related to IP, TCP, UDP, and ICMP protocols.

What are the top cache strategies?

Read data from the system:
🔹 Cache aside
🔹 Read through

Write data to the system:
🔹 Write around
🔹 Write back
🔹 Write through
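As a reference point, here is what the most common strategy, cache aside, looks like in a few lines; the in-memory dicts stand in for a real cache and database:

```python
cache: dict[str, object] = {}

# Cache-aside: the application checks the cache first and, on a miss,
# loads from the database and populates the cache itself.
def get_product(product_id: str, db: dict):
    if product_id in cache:
        return cache[product_id]        # cache hit
    value = db.get(product_id)          # cache miss: read the database
    cache[product_id] = value           # populate the cache for next time
    return value

def update_product(product_id: str, value: object, db: dict):
    db[product_id] = value              # write to the database...
    cache.pop(product_id, None)         # ...and invalidate the stale entry
```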
Question: What are the pros and cons of each caching strategy? How
to choose the right one to use?
I left out a lot of details as that will make the post very long. Feel free to
leave a comment so we can learn from each other.
Upload large files

6. The data store reassembles the object from its parts based on the part number. Since the object is really large, this process may take a few minutes. After reassembly is complete, it returns a success message to the client.
Why is Redis so Fast?

There are 3 main reasons as shown in the diagram below.

SWIFT payment network

You have probably heard about 𝐒𝐖𝐈𝐅𝐓. What is SWIFT? What role does it play in cross-border payments? You can find answers to those questions in this post.
At-most once, at-least once, and exactly once

In modern architecture, systems are broken up into small and independent building blocks with well-defined interfaces between them. Message queues provide communication and coordination for those building blocks. Today, let’s discuss different delivery semantics: at-most once, at-least once, and exactly once.

𝐀𝐭-𝐦𝐨𝐬𝐭 𝐨𝐧𝐜𝐞
As the name suggests, at-most once means a message will be delivered not more than once. Messages may be lost but are not redelivered. This is how at-most once delivery works at the high level.
Use cases: It is suitable for use cases like monitoring metrics, where a small amount of data loss is acceptable.

𝐀𝐭-𝐥𝐞𝐚𝐬𝐭 𝐨𝐧𝐜𝐞
With this data delivery semantic, it’s acceptable to deliver a message more than once, but no message should be lost.
Use cases: With at-least once, messages won’t be lost but the same message might be delivered multiple times. While not ideal from a user perspective, at-least once delivery semantics are usually good enough for use cases where data duplication is not a big issue or deduplication is possible on the consumer side. For example, with a unique key in each message, a message can be rejected when writing duplicate data to the database.

𝐄𝐱𝐚𝐜𝐭𝐥𝐲 𝐨𝐧𝐜𝐞
Exactly once is the most difficult delivery semantic to implement. It is friendly to users, but it has a high cost for the system’s performance and complexity.
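The deduplication idea mentioned under at-least once fits in a few lines; in practice the processed-key set would live in a database or Redis rather than in memory:

```python
processed_keys: set[str] = set()

# At-least-once delivery + idempotent consumer: duplicates are detected by a
# unique key in each message, so redelivery cannot write the same data twice.
def handle(message: dict, db: list):
    key = message["id"]
    if key in processed_keys:
        return                 # duplicate delivery: safely ignore
    db.append(message["payload"])
    processed_keys.add(key)

db: list = []
msg = {"id": "order-42", "payload": {"amount": 100}}
handle(msg, db)
handle(msg, db)   # redelivered by the queue; ignored
print(len(db))    # 1
```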
Vertical partitioning and Horizontal partitioning

Horizontal partitioning is widely used, so let’s take a closer look.

𝐁𝐞𝐧𝐞𝐟𝐢𝐭𝐬
🔹 Facilitate horizontal scaling. Sharding facilitates the possibility of adding more machines to spread out the load.

𝐃𝐫𝐚𝐰𝐛𝐚𝐜𝐤𝐬
🔹 The “order by” operation is more complicated. Usually, we need to fetch data from different shards and sort the data in the application's code.

This topic is very big and I’m sure I missed a lot of important details. What else do you think is important for data partitioning?
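A minimal sketch of how an application can route a key to a horizontal shard. Hash-based partitioning with hypothetical shard names; real systems often prefer consistent hashing so that resharding moves less data:

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical shard names

def shard_for(user_id: str) -> str:
    # The shard is derived from the key, so all rows for one user
    # always land on the same machine.
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("alice"), shard_for("bob"))
# A cross-shard "order by" must then merge per-shard sorted results
# in application code, e.g. with heapq.merge.
```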
CDN

A content delivery network (CDN) refers to geographically distributed servers (also called edge servers) which provide fast delivery of static and dynamic content. Let’s take a look at how it works.

Suppose Bob who lives in New York wants to visit an eCommerce website that is deployed in London. If the request goes to servers located in London, the response will be quite slow. So we deploy CDN servers close to where Bob lives, and the content will be loaded from the nearby CDN server.

The diagram below illustrates the process:

1. Bob types in www.myshop.com in the browser. The browser looks up the domain name in the local DNS cache.

2. If the domain name does not exist in the local DNS cache, the browser goes to the DNS resolver to resolve the name. The DNS resolver usually sits in the Internet Service Provider (ISP).

3. The DNS resolver recursively resolves the domain name (see my previous post for details). Finally, it asks the authoritative name server to resolve the domain name.

4. If we don’t use CDN, the authoritative name server returns the IP address for www.myshop.com. But with CDN, the authoritative name server has an alias pointing to www.myshop.cdn.com (the domain name of the CDN server).

5. The DNS resolver asks the authoritative name server to resolve www.myshop.cdn.com.

6. The authoritative name server returns the domain name for the load balancer of CDN www.myshop.lb.com.

8. The CDN load balancer returns the CDN edge server’s IP address for www.myshop.lb.com.

9. Now we finally get the actual IP address to visit. The DNS resolver returns the IP address to the browser.

10. The browser visits the CDN edge server to load the content. There are two types of contents cached on the CDN servers: static contents and dynamic contents. The former contains static pages, pictures, and videos; the latter one includes results of edge computing.

11. If the edge CDN server cache doesn't contain the content, it goes upward to the regional CDN server. If the content is still not found, it will go upward to the central CDN server, or even go to the origin - the London web server. This is called the CDN distribution network, where the servers are deployed geographically.

Over to you: How do you prevent videos cached on CDN from being pirated?

Erasure coding

A really cool technique that’s commonly used in object storage such as S3 to improve durability is called 𝐄𝐫𝐚𝐬𝐮𝐫𝐞 𝐂𝐨𝐝𝐢𝐧𝐠. Let’s take a look at how it works.
Erasure coding deals with data durability differently from replication. It chunks data into smaller pieces (placed on different servers) and creates parities for redundancy. In the event of failures, we can use chunk data and parities to reconstruct the data. Let’s take a look at a concrete example (4 + 2 erasure coding) as shown in Figure 1.

1️⃣ Data is broken up into four even-sized data chunks d1, d2, d3, and d4.

2️⃣ The mathematical formula is used to calculate the parities p1 and p2. To give a much simplified example, p1 = d1 + 2*d2 - d3 + 4*d4 and p2 = -d1 + 5*d2 + d3 - 3*d4.

4️⃣ The mathematical formula is used to reconstruct lost data d3 and d4, using the known values of d1, d2, p1, and p2.

How much extra space does erasure coding need? For every two chunks of data, we need one parity block, so the storage overhead is 50% (Figure 2). While in 3-copy replication, the storage overhead is 200% (Figure 2).
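With the simplified integer formulas above, losing d3 and d4 leaves two linear equations in two unknowns, so reconstruction is direct. (Production systems use Reed-Solomon coding over finite fields instead; this sketch only illustrates the idea.)

```python
def encode(d1, d2, d3, d4):
    # The post's example formulas for the two parities.
    p1 = d1 + 2 * d2 - d3 + 4 * d4
    p2 = -d1 + 5 * d2 + d3 - 3 * d4
    return p1, p2

def reconstruct_d3_d4(d1, d2, p1, p2):
    # From the parity formulas:
    #   -d3 + 4*d4 = p1 - d1 - 2*d2 = a
    #    d3 - 3*d4 = p2 + d1 - 5*d2 = b
    a = p1 - d1 - 2 * d2
    b = p2 + d1 - 5 * d2
    d4 = a + b            # adding the two equations eliminates d3
    d3 = b + 3 * d4
    return d3, d4

d1, d2, d3, d4 = 3, 7, 2, 9
p1, p2 = encode(d1, d2, d3, d4)
print(reconstruct_d3_d4(d1, d2, p1, p2))  # (2, 9) -> the lost chunks are recovered
```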
Foreign exchange in payment

Have you wondered what happens under the hood when you pay with USD online and the seller from Europe receives EUR (euro)? This process is called foreign exchange.
3. 100 USD is sold to Bank E’s funding pool.

4. Bank E’s funding pool provides 88 EUR in exchange for 100 USD. The money is put into Paypal’s EUR account in Bank E.

Now let’s take a close look at the foreign exchange (forex) market. It has 3 layers:
🔹 Retail market: … currencies in advance.
🔹 Wholesale market. The wholesale business is composed of investment banks, commercial banks, and foreign exchange providers. It usually handles accumulated orders from the retail market.
🔹 Top-level participants. They are multinational commercial banks that hold a large number of certificates of deposit from different countries. They exchange these certificates for foreign exchange trading.

When Bank E’s funding pool needs more EUR, it goes upward to the wholesale market to sell USD and buy EUR. When the wholesale market accumulates enough orders, it goes upward to top-level participants. Steps 3.1-3.3 and 4.1-4.3 explain how it works.

Over to you: What foreign currency did you find difficult to exchange? And what company have you used for foreign currency exchange?

Interview Question: Design S3

What happens when you upload a file to Amazon S3? Let’s design an S3-like object storage system.
𝐁𝐮𝐜𝐤𝐞𝐭. A logical container for objects. The bucket name is globally unique. To upload data to S3, we must first create a bucket.

𝐎𝐛𝐣𝐞𝐜𝐭. An object is an individual piece of data we store in a bucket. It contains object data (also called payload) and metadata. Object data can be any sequence of bytes we want to store. The metadata is a set of name-value pairs that describe the object.

An S3 object consists of (Figure 1):
🔹 Metadata. It is mutable and contains attributes such as ID, bucket name, object name, etc.
🔹 Object data. It is immutable and contains the actual data.

2. The API service calls the Identity and Access Management (IAM) to ensure the user is authorized and has WRITE permission.

3. The API service calls the metadata store to create an entry with the bucket info in the metadata database. Once the entry is created, a success message is returned to the client.

4. After the bucket is created, the client sends an HTTP PUT request to create an object named “script.txt”.

5. The API service verifies the user’s identity and ensures the user has WRITE permission on the bucket.

6. Once validation succeeds, the API service sends the object data in the HTTP PUT payload to the data store. The data store persists the payload as an object and returns the UUID of the object.

7. The API service calls the metadata store to create a new entry in the metadata database. It contains important metadata such as the object_id (UUID), bucket_id (which bucket the object belongs to), object_name, etc.
Block storage, file storage and object storage

Yesterday, I posted the definitions of block storage, file storage, and object storage. Let’s continue the discussion and compare those 3 options.

Block storage, file storage and object storage

In this post, let’s review the storage systems in general.

Storage systems fall into three broad categories:
🔹 Block storage
🔹 File storage
🔹 Object storage

𝐁𝐥𝐨𝐜𝐤 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
Block storage came first, in the 1960s. Common storage devices like hard disk drives (HDD) and solid-state drives (SSD) that are physically attached to servers are all considered as block storage.

Block storage presents the raw blocks to the server as a volume. This is the most flexible and versatile form of storage. The server can format the raw blocks and use them as a file system, or it can hand control of those blocks to an application. Some applications like a database or a virtual machine engine manage these blocks directly in order to squeeze every drop of performance out of them.
…and iSCSI. Conceptually, the network-attached block storage still presents raw blocks. To the servers, it works the same as physically attached block storage. Whether to a network or physically attached, block storage is fully owned by a single server. It is not a shared resource.

Domain Name System (DNS) lookup

DNS acts as an address book. It translates human-readable domain names (google.com) to machine-readable IP addresses (142.251.46.238).

1. google.com is typed into the browser, and the browser sends the domain name to the DNS resolver.
2. The resolver queries a DNS root name server.

3. The root server responds to the resolver with the address of a TLD DNS server. In this case, it is .com.

5. The TLD server responds with the IP address of the domain’s name server, google.com (authoritative name server).

8. The DNS resolver responds to the web browser with the IP address (142.251.46.238) of the domain requested initially.
What happens when you type a URL into your browser?

The diagram below illustrates the steps.

1. Bob enters a URL into the browser and hits Enter. In this example, the URL is composed of 4 parts:
🔹 scheme - 𝒉𝒕𝒕𝒑𝒔://. It tells the browser to connect to the server using HTTPS.
🔹 domain - 𝒆𝒙𝒂𝒎𝒑𝒍𝒆.𝒄𝒐𝒎. This is the domain name of the site.
🔹 path - 𝒑𝒓𝒐𝒅𝒖𝒄𝒕/𝒆𝒍𝒆𝒄𝒕𝒓𝒊𝒄. It is the path on the server to the requested resource: phone.
🔹 resource - 𝒑𝒉𝒐𝒏𝒆. It is the name of the resource Bob wants to visit.

2. The browser looks up the IP address for the domain with a domain name system (DNS) lookup. To make the lookup process fast, data is cached at different layers: browser cache, OS cache, local network cache and ISP cache.
2.1 If the IP address cannot be found at any of the caches, the browser goes to DNS servers to do a recursive DNS lookup until the IP address is found (this will be covered in another post).

5. The server processes the request and sends back the response. For a successful response, the status code is 200. The HTML response might look like this:

𝘏𝘛𝘛𝘗/1.1 200 𝘖𝘒
𝘋𝘢𝘵𝘦: 𝘚𝘶𝘯, 30 𝘑𝘢𝘯 2022 00:01:01 𝘎𝘔𝘛
𝘚𝘦𝘳𝘷𝘦𝘳: 𝘈𝘱𝘢𝘤𝘩𝘦
𝘊𝘰𝘯𝘵𝘦𝘯𝘵-𝘛𝘺𝘱𝘦: 𝘵𝘦𝘹𝘵/𝘩𝘵𝘮𝘭; 𝘤𝘩𝘢𝘳𝘴𝘦𝘵=𝘶𝘵𝘧-8

<!𝘋𝘖𝘊𝘛𝘠𝘗𝘌 𝘩𝘵𝘮𝘭>
<𝘩𝘵𝘮𝘭 𝘭𝘢𝘯𝘨="𝘦𝘯">
𝘏𝘦𝘭𝘭𝘰 𝘸𝘰𝘳𝘭𝘥
</𝘩𝘵𝘮𝘭>

6. The browser renders the HTML content.

AI Coding engine

DeepMind says its new AI coding engine (AlphaCode) is as good as an average programmer.

1. Pre-train the transformer models on GitHub code.
2. Fine-tune the models on the relatively small competitive programming dataset.
3. At evaluation time, create a massive amount of solutions for each problem.
5. Run the candidate programs against the test cases, evaluate the performance, and choose the best one.
Do you think AI bots will be better at Leetcode or competitive programming than software engineers five years from now?
Read replica pattern

In this post, we talk about a simple yet commonly used database design pattern (setup): 𝐑𝐞𝐚𝐝 𝐫𝐞𝐩𝐥𝐢𝐜𝐚 𝐩𝐚𝐭𝐭𝐞𝐫𝐧.

In this setup, all data-modifying commands like insert, delete, or update are sent to the primary DB, and reads are sent to read replicas. The diagram below illustrates the setup:

1. When Alice places an order on amazon.com, the request is sent to Order Service.
2. Order Service creates a record about the order in the primary DB (write). Data is replicated to two replicas.
3. Alice views the order details. Data is served from a replica (read).
4. Alice views the recent order history. Data is served from a replica (read).

Read replica pattern

There are two common ways to implement the read replica pattern:
1. Embed the routing logic in the application code (explained in the last post).
2. Use database middleware.

1. When Alice places an order on amazon, the request is sent to Order Service.
2. Order Service does not directly interact with the database. Instead, it sends database queries to the database middleware.
3. The database middleware routes writes to the primary database. Data is replicated to two replicas.
4. Alice views the order details (read). The request is sent through the middleware.
5. Alice views the recent order history (read). The request is sent through the middleware.

The database middleware acts as a proxy between the application and databases. It uses standard MySQL network protocol for communication.

Pros:
- Simplified application code. The application doesn’t need to be aware of the database topology and manage access to the database directly.

Cons:
- Increased system complexity. A database middleware is a complex system. Since all database queries go through the middleware, it usually requires a high availability setup to avoid a single point of failure.
2️⃣ Reads that immediately follow writes are routed to the primary
database.
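A sketch combining both rules from this section: writes go to the primary, reads go to a replica, and reads that immediately follow a write fall back to the primary. The lag window is an assumed budget for replication delay, and the connection objects are placeholders:

```python
import random
import time

class RoutingConnection:
    def __init__(self, primary, replicas, lag_window: float = 1.0):
        self.primary, self.replicas = primary, replicas
        self.lag_window = lag_window   # assumed replication-lag budget (seconds)
        self.last_write = 0.0

    def execute(self, sql: str, *args):
        verb = sql.lstrip().split()[0].upper()
        if verb in {"INSERT", "UPDATE", "DELETE"}:
            self.last_write = time.time()
            return self.primary.execute(sql, *args)      # writes -> primary
        if time.time() - self.last_write < self.lag_window:
            return self.primary.execute(sql, *args)      # read-after-write -> primary
        return random.choice(self.replicas).execute(sql, *args)  # reads -> replica
```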
Email receiving flow

4. Emails are put in the incoming email queue. The queue decouples mail processing workers from SMTP servers so they can be scaled independently. Moreover, the queue serves as a buffer in case the email volume surges.

6. The email is stored in the mail storage, cache, and object data store.
7. If the receiver is currently online, the email is pushed to real-time servers.

8. Real-time servers are WebSocket servers that allow clients to receive new emails in real-time.

9. For offline users, emails are stored in the storage layer. When a user comes back online, the webmail client connects to web servers via RESTful API.

10. Web servers pull new emails from the storage layer and return them to the client.

Email sending flow

In this post, we will take a closer look at the email sending flow.

2. The load balancer makes sure it doesn’t exceed the rate limit and routes traffic to web servers.

4. Message queues.
4.a. If basic email validation succeeds, the email data is passed to the outgoing queue.
4.b. If basic email validation fails, the email is put in the error queue.

5. SMTP outgoing workers pull events from the outgoing queue and make sure emails are spam and virus free.

7. SMTP outgoing workers send the email to the recipient mail server.

We monitor the size of the outgoing queue very closely. If there are many emails stuck in the queue, we need to analyze the cause of the issue. Here are some possibilities:
- The recipient’s mail server is unavailable. In this case, we need to retry sending the email at a later time. Exponential backoff might be a good retry strategy.

Interview Question: Design Gmail

One picture is worth more than a thousand words. In this post, we will take a look at what happens when Alice sends an email to Bob.

2. Outlook mail server queries the DNS (not shown in the diagram) to find the address of the recipient’s SMTP server. In this case, it is Gmail’s SMTP server. Next, it transfers the email to the Gmail mail server. The communication protocol between the mail servers is SMTP.

3. The Gmail server stores the email and makes it available to Bob, the recipient.
4. Gmail client fetches new emails through the IMAP/POP server when Bob logs in to Gmail.

Please keep in mind this is a highly simplified design. Hope it sparks your interest and curiosity:) I'll explain each component in more depth in the future.

Map rendering

Google Maps Continued. Let’s take a look at 𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠 in this post.

𝐏𝐫𝐞-𝐂𝐨𝐦𝐩𝐮𝐭𝐞𝐝 𝐓𝐢𝐥𝐞𝐬
One foundational concept in map rendering is tiling. Instead of
rendering the entire map as one large custom image, the world is
broken up into smaller tiles. The client only downloads the relevant
tiles for the area the user is in and stitches them together like a mosaic
for display. The tiles are pre-computed at different zoom levels. Google
Maps uses 21 zoom levels.
This allows the client to render the map at the best granularities
depending on the client’s zoom level without consuming excessive
bandwidth to download tiles with too much detail. This is especially
important when we are loading the images from mobile clients.
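The post doesn’t spell out Google’s exact tiling scheme, but the common “slippy map” convention (Web Mercator projection, a 2^zoom x 2^zoom grid of tiles) shows how a client maps its viewport to tile coordinates:

```python
import math

def lat_lng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
    # Standard Web Mercator tile math: at zoom z the world is 2^z x 2^z tiles.
    n = 2 ** zoom
    x = int((lng + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Times Square at zoom 17: the client fetches just this tile and its neighbors.
print(lat_lng_to_tile(40.758, -73.9855, 17))
```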
𝐑𝐨𝐚𝐝 𝐒𝐞𝐠𝐦𝐞𝐧𝐭𝐬
Now that we have transformed massive maps into tiles, we also need
to define a data structure for the roads. We divide the world of roads
into small blocks. We call these blocks road segments. Each road
segment contains multiple roads, junctions, and other metadata.
We then transform the road segments into a data structure that the
navigation algorithms can use. The typical approach is to convert the
map into a 𝒈𝒓𝒂𝒑𝒉, where the nodes are road segments, and two nodes
are connected if the corresponding road segments are reachable
neighbors. In this way, finding a path between two locations becomes a shortest-path problem, where we can leverage Dijkstra or A* algorithms.
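A minimal Dijkstra sketch over such a graph, with node names standing in for road segments and edge weights for travel cost; the toy graph is hypothetical:

```python
import heapq

def shortest_path(graph: dict, start: str, goal: str) -> float:
    dist = {start: 0.0}
    pq = [(0.0, start)]
    while pq:
        d, node = heapq.heappop(pq)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(pq, (nd, neighbor))
    return float("inf")

graph = {"A": [("B", 4), ("C", 2)], "C": [("B", 1)], "B": [("D", 5)]}
print(shortest_path(graph, "A", "D"))  # 8.0 via A -> C -> B -> D
```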
Interview Question: Design Google Maps

Google started project G𝐨𝐨𝐠𝐥𝐞 M𝐚𝐩𝐬 in 2005. As of March 2021, Google Maps had one billion daily active users, 99% coverage of the world in 200 countries.
𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
The location service is responsible for recording a user’s location update. The Google Map clients send location updates every few seconds. The user location data is used in many cases:

𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠
The world’s map is projected into a huge 2D map image. It is broken down into small image blocks called “tiles” (see below). The tiles are static. They don’t change very often. An efficient way to serve static tile files is with a CDN backed by cloud storage like S3. The users can load the necessary tiles to compose a map from nearby CDN.

What if a user is zooming and panning the map viewpoint on the client to explore their surroundings?

𝐍𝐚𝐯𝐢𝐠𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
This component is responsible for finding a reasonably fast route from point A to point B. It calls two services to help with the path calculation:
2️⃣ Route Planner Service: this service does three things in sequence:

Pull vs push models

There are two ways metrics data can be collected, pull or push. It is a routine debate as to which one is better and there is no clear answer. In this post, we will take a look at the pull model.
Figure 1 shows data collection with a pull model over HTTP. We have dedicated metric collectors which pull metrics values from the running applications periodically.

In this approach, the metrics collector needs to know the complete list of service endpoints to pull data from. One naive approach is to use a file to hold DNS/IP information for every service endpoint on the “metric collector” servers. While the idea is simple, this approach is hard to maintain in a large-scale environment where servers are added or removed frequently, and we want to ensure that metric collectors don’t miss out on collecting metrics from any new servers.

2️⃣ The metrics collector pulls metrics data via a pre-defined HTTP endpoint (for example, /metrics). To expose the endpoint, a client library usually needs to be added to the service. In Figure 3, the service is Web Servers.
Money movement

One picture is worth more than a thousand words. This is what happens when you buy a product using Paypal/bank card under the hood.

To understand this, we need to digest two concepts: 𝐜𝐥𝐞𝐚𝐫𝐢𝐧𝐠 & 𝐬𝐞𝐭𝐭𝐥𝐞𝐦𝐞𝐧𝐭. Clearing is a process that calculates who should pay whom with how much money; while settlement is a process where real money moves between reserves in the settlement bank.
Let’s say Bob wants to buy an SDI book from Claire’s shop on Amazon.

- Pay-in flow (Bob pays Amazon money):
1.1 Bob buys a book on Amazon using Paypal.
1.2 Amazon issues a money transfer request to Paypal.
1.3 Since the payment token of Bob’s debit card is stored in Paypal, Paypal can transfer money, on Bob’s behalf, to Amazon’s bank account in Bank A.
1.4 Both Bank A and Bank B send transaction statements to the clearing institution. It reduces the transactions that need to be settled. Let’s assume Bank A owes Bank B $100 and Bank B owes Bank A $500 at the end of the day. When they settle, the net position is that Bank B pays Bank A $400.
1.5 & 1.6 The clearing institution sends clearing and settlement information to the settlement bank. Both Bank A and Bank B have pre-deposited funds in the settlement bank as money reserves, so actual money movement happens between two reserve accounts in the settlement bank.

The first two layers are called information flow, and the settlement layer is called fund flow.

You can see the 𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 𝐟𝐥𝐨𝐰 𝐚𝐧𝐝 𝐟𝐮𝐧𝐝 𝐟𝐥𝐨𝐰 𝐚𝐫𝐞 𝐬𝐞𝐩𝐚𝐫𝐚𝐭𝐞𝐝. In the info flow, the money seems to be deducted from one bank account and added to another bank account, but the actual money movement happens in the settlement bank at the end of the day.

Because of the asynchronous nature of the info flow and the fund flow, reconciliation is very important for data consistency in the systems along with the flow.

It makes things even more interesting when Bob wants to buy a book in the Indian market, where Bob pays USD but the seller can only receive INR.
Which database shall I use for the metrics collecting system?

This is one of the most important questions we need to address in an interview.

𝐃𝐚𝐭𝐚 𝐚𝐜𝐜𝐞𝐬𝐬 𝐩𝐚𝐭𝐭𝐞𝐫𝐧

As shown in the diagram, each label on the y-axis represents a time series (uniquely identified by the names and labels), while the x-axis represents time.

The write load is heavy. As you can see, there can be many time-series data points written at any moment. Millions of operational metrics are written per day, and many metrics are collected at high frequency, so the traffic is undoubtedly write-heavy.

At the same time, the read load is spiky. Both visualization and alert services send queries to the database, and depending on the access patterns of the graphs and alerts, the read volume can be bursty.

𝐂𝐡𝐨𝐨𝐬𝐞 𝐭𝐡𝐞 𝐫𝐢𝐠𝐡𝐭 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞

The data storage system is the heart of the design. It is not recommended to build your own storage system or to use a general-purpose storage system (such as MySQL) for this job.

A general-purpose database could, in theory, support time-series data, but it would require expert-level tuning to make it work at our scale. Specifically, a relational database is not optimized for operations commonly performed against time-series data. For example, computing the moving average in a rolling time window requires complicated SQL that is difficult to read (there is an example of this in the deep dive section). Besides, to support tagging/labeling data, we need to add an index for each tag. Moreover, a general-purpose relational database does not perform well under constant heavy write load. At our scale, we would need to expend significant effort in tuning the database, and even then, it might not perform well.

How about NoSQL? In theory, a few NoSQL databases on the market could handle time-series data effectively. For example, Cassandra and Bigtable can both be used for time-series data. However, this would require deep knowledge of the internal workings of each NoSQL database to devise a scalable schema for effectively storing and querying time-series data. With industrial-scale time-series databases readily available, using a general-purpose NoSQL database is not appealing.

There are many storage systems available that are optimized for time-series data. The optimization lets us use far fewer servers to handle the same volume of data. Many of these databases also have custom query interfaces specially designed for the analysis of time-series data that are much easier to use than SQL. Some even provide features to manage data retention and data aggregation. Here are a few examples of time-series databases.

OpenTSDB is a distributed time-series database, but since it is based on Hadoop and HBase, running a Hadoop/HBase cluster adds complexity. Twitter uses MetricsDB, and Amazon offers Timestream as a time-series database. According to DB-engines, the two most popular time-series databases are InfluxDB and Prometheus, which are designed to store large volumes of time-series data and quickly perform real-time analysis on that data. Both of them primarily rely on an in-memory cache and on-disk storage, and both handle durability and performance quite well. According to the benchmark listed on the InfluxDB website, a DB server with 8 cores and 32GB RAM can handle over 250,000 writes per second.

Since a time-series database is a specialized database, you are not expected to understand its internals in an interview unless you explicitly mentioned it in your resume. For the purpose of an interview, it is important to understand that metrics data is time-series in nature, and that we can select a time-series database such as InfluxDB to store it.

Another feature of a strong time-series database is efficient aggregation and analysis of a large amount of time-series data by labels, also known as tags in some databases. For example, InfluxDB builds indexes on labels to facilitate fast lookup of time series by labels. It provides clear best-practice guidelines on how to use labels without overloading the database: the key is to make sure each label is of low cardinality (having a small set of possible values). This feature is critical for visualization, and it would take a lot of effort to build it with a general-purpose database.
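To make the data model concrete, here is a minimal Python sketch (all names are hypothetical and not taken from any real database) of keying a series by metric name plus low-cardinality labels, and of the rolling-window moving average that is painful to express in plain SQL:

```python
from collections import defaultdict, deque

# A time series is uniquely identified by its metric name plus labels.
# Keeping every label low-cardinality keeps the number of distinct
# series bounded.
def series_key(name, labels):
    return (name, tuple(sorted(labels.items())))

# In-memory stand-in for the storage engine:
# series key -> append-only list of (unix_ts, value) points.
store = defaultdict(list)

def write(name, labels, ts, value):
    store[series_key(name, labels)].append((ts, value))

def moving_average(name, labels, window_sec):
    """Average over a rolling time window, one output per data point."""
    window, total = deque(), 0.0
    for ts, value in store[series_key(name, labels)]:
        window.append((ts, value))
        total += value
        while window[0][0] <= ts - window_sec:   # evict expired points
            _, old = window.popleft()
            total -= old
        yield ts, total / len(window)

labels = {"host": "host-1", "region": "us-west"}   # low cardinality
for t in range(10):                                # write-heavy append path
    write("cpu.load", labels, t, float(t))
print(list(moving_average("cpu.load", labels, window_sec=5)))
```

A real time-series engine would run this over compressed on-disk chunks rather than Python lists, but the access pattern is the same: append-heavy writes per series, windowed reads per query.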
Metrics monitoring and alerting system

A well-designed 𝐦𝐞𝐭𝐫𝐢𝐜𝐬 𝐦𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠 and alerting system plays a key role in providing clear visibility into the health of the infrastructure to ensure high availability and reliability. The diagram below explains how it works at a high level.

Metrics collector: it gathers metrics data and writes the data into the time-series database.
Reconciliation

Reconciliation is the practice of comparing records in different systems to make sure they match. Let’s take a look at some pain points and how we can address them:
𝐏𝐫𝐨𝐛𝐥𝐞𝐦 1: Data normalization. When comparing records in different systems, they come in different formats. For example, the timestamp can be “2022/01/01” in one system and “Jan 1, 2022” in another.
𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can add a layer to transform different formats into the same format, as sketched below.

𝐏𝐫𝐨𝐛𝐥𝐞𝐦 2: Massive data volume
𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can use big data processing techniques to speed up data comparisons. If we need near real-time reconciliation, a streaming platform such as Flink is used; otherwise, end-of-day batch processing such as Hadoop is enough.
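Here is a minimal sketch of such a normalization layer in Python; the system names and formats are made up for illustration:

```python
from datetime import datetime

# Each source system registers the timestamp format it emits; every
# record is converted to one canonical ISO 8601 form before comparison.
SOURCE_FORMATS = {
    "ledger": "%Y/%m/%d",     # e.g. "2022/01/01"
    "psp":    "%b %d, %Y",    # e.g. "Jan 1, 2022"
}

def normalize_timestamp(source, raw):
    parsed = datetime.strptime(raw, SOURCE_FORMATS[source])
    return parsed.date().isoformat()   # canonical form: "2022-01-01"

assert normalize_timestamp("ledger", "2022/01/01") == "2022-01-01"
assert normalize_timestamp("psp", "Jan 1, 2022") == "2022-01-01"
```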
Which database shall I use? This is one of the most important questions we usually need to address in an interview.

Choosing the right database is hard. Google Cloud recently posted a great article that summarized the different database options available in Google Cloud and explained which use cases are best suited for each database option.
Big data papers

Below is a timeline of important big data papers and how the techniques evolved over time.

The green highlighted boxes are the famous 3 Google papers, which established the foundation of the big data framework. At a high level:

Now let’s look at the 𝐎𝐋𝐀𝐏 evolution. MapReduce was not easy to program, so Hive solved this problem by introducing a SQL-like query language. But Hive still used MapReduce under the hood, so it was not very responsive. In 2010, Dremel provided an interactive query engine.

𝐒𝐭𝐫𝐞𝐚𝐦𝐢𝐧𝐠 𝐩𝐫𝐨𝐜𝐞𝐬𝐬𝐢𝐧𝐠 was born to further solve the latency issue in OLAP. The famous 𝒍𝒂𝒎𝒃𝒅𝒂 architecture was based on Storm and MapReduce, where streaming processing and batch processing have different processing flows. Then people started to build streaming processing with Apache Kafka. The 𝑲𝒂𝒑𝒑𝒂 architecture was proposed in 2014, where streaming and batch processing were merged into one flow. Google published The Dataflow Model in 2015, an abstraction standard for streaming processing, and Flink implemented this model.
Avoid double charge

One of the most serious problems a payment system can have is to 𝐝𝐨𝐮𝐛𝐥𝐞 𝐜𝐡𝐚𝐫𝐠𝐞 𝐚 𝐜𝐮𝐬𝐭𝐨𝐦𝐞𝐫. When we design the payment system, it is important to guarantee that the payment system executes a payment order exactly once.

At first glance, exactly-once delivery seems very hard to tackle, but if we divide the problem into two parts, it is much easier to solve. Mathematically, an operation is executed exactly once if:

1. It is executed at least once.
2. At the same time, it is executed at most once.

𝐑𝐞𝐭𝐫𝐲
Occasionally, we need to retry a payment transaction due to network errors or timeouts. Retry provides the at-least-once guarantee. For example, as shown in Figure 10, the client tries to make a $10 payment, but the payment keeps failing due to a poor network connection. Considering that the network condition might get better, the client retries the request, and the payment finally succeeds on the fourth attempt.

𝐈𝐝𝐞𝐦𝐩𝐨𝐭𝐞𝐧𝐜𝐲
From an API standpoint, idempotency means clients can make the same call repeatedly and produce the same result.
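Here is a minimal Python sketch of the idempotency idea, assuming a client-generated idempotency key per payment order and an in-memory dict standing in for a database table with a unique-key constraint:

```python
import uuid

# The client generates one idempotency key per payment order and reuses
# it on every retry, so the charge is executed at most once per key.
processed = {}   # idempotency_key -> stored result

def pay(idempotency_key, amount_cents):
    if idempotency_key in processed:
        return processed[idempotency_key]        # replay stored result
    result = f"charged {amount_cents} cents"     # call the PSP here
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())       # generated once per payment order
first = pay(key, 1000)
retry = pay(key, 1000)        # network retry reuses the same key
assert first == retry         # no double charge
```

The retry and the original request carry the same key, so at-least-once retries combined with at-most-once execution per key give the exactly-once effect.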
Big data evolvement

I hope everyone has a great time with friends and family during the holidays. If you are looking for some readings, classic engineering papers are a good start.

A lot of times when we are busy with work, we only focus on scattered information telling us “how” and “what” to meet our immediate needs and get things done.

However, reading the classics helps us understand the “why” behind the scenes, and teaches us how to solve problems, make better decisions, or even contribute to open source projects.

The big data area has progressed a lot over the past 20 years. It started from 3 Google papers (see the links in the comment), which tackled real engineering challenges at Google scale:

In the diagram, each technique is annotated with the problem it solved from the last generation. For example, “Hive - support SQL” means Hive was trying to solve the lack of SQL in MapReduce.

If you want to learn more, you can refer to the papers for details. What other classics would you recommend?
Quadtree

- After the quadtree is built, start searching from the root and traverse the tree until we find the leaf node where the search origin is.
- If that leaf node has 100 businesses, return the node. Otherwise, add businesses from its neighbors until enough businesses are returned.
- While the quadtree is being built, the server cannot serve traffic.
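Here is a minimal Python sketch of the lookup described in the first two bullets, assuming a simple in-memory quadtree; the names Node and find_leaf are illustrative:

```python
from dataclasses import dataclass, field

# Internal nodes have four children; leaves hold the businesses that
# fall inside their grid.
@dataclass
class Node:
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    businesses: list = field(default_factory=list)
    children: list = field(default_factory=list)   # empty => leaf

    def contains(self, lng, lat):
        return self.x_min <= lng < self.x_max and self.y_min <= lat < self.y_max

def find_leaf(node, lng, lat):
    """Traverse from the root down to the leaf grid holding the origin."""
    while node.children:
        node = next(c for c in node.children if c.contains(lng, lat))
    return node

root = Node(-180, 180, -90, 90, businesses=["restaurant-1"])
print(find_leaf(root, -122.42, 37.77).businesses)   # ['restaurant-1']
```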
How do we find nearby restaurants on Yelp?

- 𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧-𝐛𝐚𝐬𝐞𝐝 𝐒𝐞𝐫𝐯𝐢𝐜𝐞 (𝐋𝐁𝐒)
- Given a radius and location, return a list of nearby restaurants
- Add/delete/update restaurant information
- Customers view restaurant details

Here are some design details behind the scenes.

First, divide the planet into four quadrants along the prime meridian and the equator.

Second, divide each grid into four smaller grids. Each grid can be represented by alternating between a longitude bit and a latitude bit.
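Here is a minimal Python sketch of the alternating-bit idea (this is essentially how geohash encoding works); the precision and coordinates are illustrative:

```python
# Each bit halves the current range: 1 means the upper half, 0 the
# lower half, alternating longitude bit, latitude bit, and so on.
def encode(lng, lat, precision_bits=12):
    lng_range, lat_range = [-180.0, 180.0], [-90.0, 90.0]
    bits = []
    for i in range(precision_bits):
        value, rng = (lng, lng_range) if i % 2 == 0 else (lat, lat_range)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append("1")
            rng[0] = mid   # keep the upper half
        else:
            bits.append("0")
            rng[1] = mid   # keep the lower half
    return "".join(bits)

# Nearby points share a common prefix (identical at this coarse precision):
print(encode(-122.42, 37.77))   # San Francisco
print(encode(-122.41, 37.78))   # a few blocks away
```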
Credit: GovCERT
Link: https://www.govcert.ch/blog/zero-day-exploit-targeting-popular-java-library-log4j/
How does a modern stock exchange achieve microsecond latency?

- deploy all the components in a single giant server (no containers)
- use shared memory as an event bus to communicate among the components, no hard disk

Match buy and sell orders

Stocks go up and down. Do you know what data structure is used to efficiently match buy and sell orders?

Stock exchanges use a data structure called 𝐨𝐫𝐝𝐞𝐫 𝐛𝐨𝐨𝐤𝐬. An order book is an electronic list of buy and sell orders, organized by price level. It has a buy book and a sell book, where each side of the book contains a bunch of price levels, and each price level contains a list of orders (first in, first out).

So what happens when you place a market order to buy 2700 shares in the diagram?

- The buy order is matched with all the sell orders at price 100.10 and the first order at price 100.11 (illustrated in light red).
- Now, because the big buy order “eats up” the first price level on the sell book, the best ask price goes up from 100.10 to 100.11.
- So when the market is bullish, people tend to buy stocks, and the price goes up and up.

An efficient data structure for an order book must satisfy several requirements (listed in the diagram).
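Here is a minimal Python sketch of that matching walk, assuming an in-memory sell book of price levels, each holding a FIFO queue of resting order quantities; the numbers echo the 2700-share example:

```python
import collections

# Sell book: price level -> FIFO queue of resting order quantities.
sell_book = {
    100.10: collections.deque([1000, 500]),
    100.11: collections.deque([1200, 700]),
}

def market_buy(quantity):
    """Walk the sell book from the best (lowest) ask upward."""
    fills = []
    for price in sorted(sell_book):
        queue = sell_book[price]
        while queue and quantity > 0:
            filled = min(queue[0], quantity)
            fills.append((price, filled))
            quantity -= filled
            if filled == queue[0]:
                queue.popleft()        # resting order fully consumed
            else:
                queue[0] -= filled     # partial fill, order stays
        if quantity == 0:
            break
    return fills

# Eats both orders at 100.10 and the first order at 100.11;
# the best ask then moves up to 100.11.
print(market_buy(2700))   # [(100.1, 1000), (100.1, 500), (100.11, 1200)]
```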
Stock exchange design

The stock market has been volatile recently.

Coincidentally, we just finished a new chapter “Design a stock exchange”. I’ll use plain English to explain what happens when you place a stock buying order. The focus is on the exchange side.

Step 1: the client places an order via the broker’s web or mobile app.
Step 3: the exchange client gateway performs operations such as validation, rate limiting, authentication, and normalization, and sends the order to the order manager.

Step 4: the order manager performs risk checks based on rules set by the risk manager.

Step 5: once risk checks pass, the order manager checks whether there is enough balance in the wallet.

Step 6-7: the order is sent to the matching engine. The matching engine sends back the execution result if a match is found. Both orders and execution results need to be sequenced first in the sequencer so that matching determinism is guaranteed (see the sketch after this list).

Step 8-10: the execution result is passed all the way back to the client.

Step 11-12: market data (including the candlestick chart and order book) is sent to the data service for consolidation. Brokers query the data service to get the market data.

Step 13: the reporter composes all the necessary reporting fields (e.g. client_id, price, quantity, order_type, filled_quantity, remaining_quantity) and writes the data to the database for persistence.

A stock exchange requires 𝐞𝐱𝐭𝐫𝐞𝐦𝐞𝐥𝐲 𝐥𝐨𝐰 𝐥𝐚𝐭𝐞𝐧𝐜𝐲. While most web applications are OK with hundreds of milliseconds of latency, a stock exchange requires 𝐦𝐢𝐜𝐫𝐨-𝐬𝐞𝐜𝐨𝐧𝐝 𝐥𝐞𝐯𝐞𝐥 𝐥𝐚𝐭𝐞𝐧𝐜𝐲. I’ll leave the latency discussion for a separate post since this post is already long.
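Here is a minimal Python sketch of the sequencer mentioned in steps 6-7, assuming a single global counter and an append-only log; the event shapes are illustrative:

```python
import itertools

# Every inbound order and outbound execution is stamped with a
# monotonically increasing sequence number and appended to a log, so
# replaying the log in sequence order reproduces the exact same matches.
sequence = itertools.count(1)
event_log = []

def sequence_event(event):
    event["seq"] = next(sequence)   # global total order
    event_log.append(event)         # persisted log enables replay
    return event

sequence_event({"type": "new_order", "side": "buy", "qty": 2700})
sequence_event({"type": "execution", "price": 100.10, "qty": 1500})
assert [e["seq"] for e in event_log] == [1, 2]
```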
Design a payment system

Today is Cyber Monday. Here is how money moves when you click the Buy button on Amazon or any of your favorite shopping websites.

I posted the same diagram last week for an overview, and a few people asked me about the detailed steps, so here you go:

1. When a user clicks the “Buy” button, a payment event is generated and sent to the payment service.

2. The payment service stores the payment event in the database.

3. Sometimes a single payment event may contain several payment orders. For example, you may select products from multiple sellers in a single checkout process. The payment service will call the payment executor for each payment order.

4. The payment executor stores the payment order in the database.

5. The payment executor calls an external PSP to finish the credit card payment.
10. Every night the PSP or banks send settlement files to their clients.
The settlement file contains the balance of the bank account, together
with all the transactions that took place on this bank account during the
day.
Design a flash sale system

𝐃𝐞𝐬𝐢𝐠𝐧 𝐩𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞𝐬:

1. Less is more - fewer elements on the web page, fewer data queries to the database, fewer web requests, fewer system dependencies

2. Short critical path - fewer hops among services, or merge multiple services into one

3. Async - use message queues to handle high TPS

4. Isolation - isolate static and dynamic contents, isolate processes and databases for rare items

5. Overselling is bad. When to decrease the inventory is important (see the sketch below)

6. User experience is important. We definitely don’t want to inform users that they have successfully placed orders but later tell them no items are actually available
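Here is a minimal Python sketch of the inventory concern in principle 5, assuming a single process; a real flash-sale system would push this check-and-decrement into an atomic operation in the database or cache:

```python
import threading

# The check and the decrement must happen atomically and must never
# drive inventory below zero, otherwise we oversell. A lock stands in
# for an atomic conditional update in the database or cache.
inventory = 100
lock = threading.Lock()

def try_place_order(quantity):
    global inventory
    with lock:
        if inventory >= quantity:
            inventory -= quantity   # decrement only while stock remains
            return True
        return False                # sold out: reject before charging

assert try_place_order(1)
assert not try_place_order(1000)   # cannot oversell
```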
Back-of-the-envelope estimation

Recently, a few engineers asked me whether we really need back-of-the-envelope estimation in a system design interview. I think it would be helpful to clarify.
—
Check out our bestselling system design books.
Paperback: Amazon. Digital: ByteByteGo.