SD Blueprint Merged
[Figure: System Design Blueprint: The Ultimate Guide. Overview diagram covering DNS resolution, API gateway, load balancing, frontend servers, CDN / edge servers, concurrency and locking, distributed ID generators, message queues and pub/sub, distributed object storage, cache eviction, shards and replicas, and common fan-out services.]
System Design
What are database isolation levels? What are they used for?
What is IaaS/PaaS/SaaS?
Deployment strategies
HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC).
DevOps Books
Redis vs Memcached
Optimistic locking
CDN
Erasure coding
Read replica pattern
Reconciliation
Which database shall I use for the metrics collecting system?
Reconciliation
Quadtree
What are database isolation levels? What are they used
for?
Database isolation allows a transaction to execute as if there are no
other concurrently running transactions.
🔹 Repeatable Read: Data read during the transaction stays the same
as it was when the transaction started.
🔹 Read Uncommitted: The data modification can be read by other
transactions before a transaction is committed.
There are two hidden columns for each row: transaction_id and
roll_pointer. When transaction A starts, a new Read View with
transaction_id=201 is created. Shortly afterward, transaction B starts,
and a new Read View with transaction_id=202 is created.
Now transaction A modifies the balance to 200, a new row of the log is
created, and the roll_pointer points to the old row. Before transaction A
commits, transaction B reads the balance data. Transaction B finds
that transaction_id 201 is not committed, so it reads the next committed
record (transaction_id=200).
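The read path described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular database engine implements it: `RowVersion`, `read_visible`, and the `committed` set are hypothetical names, and the starting balance of 100 is made up for the example.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class RowVersion:
    """One version of a row, carrying the two hidden columns from the text."""
    balance: int
    transaction_id: int                    # transaction that wrote this version
    roll_pointer: Optional["RowVersion"]   # link to the previous (older) version


def read_visible(latest: RowVersion, committed: Set[int]) -> Optional[RowVersion]:
    """Walk the version chain until a version from a committed transaction is found."""
    version = latest
    while version is not None and version.transaction_id not in committed:
        version = version.roll_pointer
    return version


# Assumption for this sketch: transaction 200 committed a balance of 100.
old_row = RowVersion(balance=100, transaction_id=200, roll_pointer=None)
# Transaction A (201) writes balance=200 but has not committed yet.
new_row = RowVersion(balance=200, transaction_id=201, roll_pointer=old_row)

# Transaction B (202) reads: 201 is uncommitted, so it falls back to 200's version.
seen_by_b = read_visible(new_row, committed={200})
print(seen_by_b.balance, seen_by_b.transaction_id)  # 100 200
```

Once transaction 201 is added to the committed set, the same call returns the newer version with balance 200.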
Over to you: have you seen isolation levels used in the wrong way?
Did it cause serious outages?
—
Check out our bestselling system design books.
Paperback: Amazon Digital: ByteByteGo.
What is IaaS/PaaS/SaaS?
The diagram below illustrates the differences between IaaS
(Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), and SaaS
(Software-as-a-Service).
For a non-cloud application, we own and manage all the hardware and
software. We say the application is on-premises.
Most popular programming languages
Programming languages come and go. Some stand the test of time.
Some already are shooting stars and some are rising rapidly on the
horizon.
1 JavaScript
2 HTML/CSS
3 Python
4 SQL
5 Java
6 Node
7 TypeScript
8 C#
9 Bash/Shell
10 C++
11 PHP
12 C
13 PowerShell
14 Go
15 Kotlin
16 Rust
17 Ruby
18 Dart
19 Assembly
20 Swift
21 R
22 VBA
23 Matlab
24 Groovy
25 Objective-C
26 Scala
27 Perl
28 Haskell
29 Delphi
30 Clojure
31 Elixir
32 LISP
33 Julia
34 F#
35 Erlang
36 APL
37 Crystal
38 COBOL
Over to you: what’s the first programming language you learned? And
what are the other languages you learned over the years?
What is the future of online payments?
I don’t know the answer, but I do know one of the candidates is the
blockchain.
2. The golden source of truth for bitcoin is the blockchain, which is also
the journal. It’s the same if we use Event Sourcing architecture to build
a traditional wallet, although there are other options.
3. There is a small virtual machine for bitcoin - and also Ethereum. The
virtual machine defines a set of bytecodes to do basic tasks such as
validation.
Over to you: if Elon Musk set up a base on planet Mars, what payment
solution will you recommend?
What is SSO (Single Sign-On)?
A friend recently went through the irksome experience of being signed
out from a number of websites they use daily. This event will be familiar
to millions of web users, and it is a tedious process to fix. It can involve
trying to remember multiple long-forgotten passwords, or typing in the
names of pets from childhood to answer security questions. SSO
removes this inconvenience and makes life online better. But how does
it work?
Step 1: A user visits Gmail, or any email service. Gmail finds the user
is not logged in and so redirects them to the SSO authentication
server, which also finds the user is not logged in. As a result, the user
is redirected to the SSO login page, where they enter their login
credentials.
Steps 4-7: Gmail validates the token with the SSO authentication server.
The authentication server registers the Gmail system, and returns
“valid.” Gmail returns the protected resource to the user.
Steps 9-10: YouTube finds the user is not logged in, and then requests
authentication. The SSO authentication server finds the user is already
logged in and returns the token.
The process is complete and the user gets back access to their
account.
Over to you:
How to store passwords safely in the database?
𝐓𝐡𝐢𝐧𝐠𝐬 𝐍𝐎𝐓 𝐭𝐨 𝐝𝐨
🔹 Storing passwords in plain text is not a good idea because anyone
with internal access can see them.
𝐖𝐡𝐚𝐭 𝐢𝐬 𝐬𝐚𝐥𝐭?
According to OWASP guidelines, “a salt is a unique, randomly
generated string that is added to each password as part of the hashing
process”.
𝐇𝐨𝐰 𝐭𝐨 𝐬𝐭𝐨𝐫𝐞 𝐚 𝐩𝐚𝐬𝐬𝐰𝐨𝐫𝐝 𝐚𝐧𝐝 𝐬𝐚𝐥𝐭?
1️⃣ A salt is not meant to be secret and it can be stored in plain text in
the database. It is used to ensure the hash result is unique to each
password.
2️⃣ The password can be stored in the database using the following
format: 𝘩𝘢𝘴𝘩( 𝘱𝘢𝘴𝘴𝘸𝘰𝘳𝘥 + 𝘴𝘢𝘭𝘵).
3️⃣ The system appends the salt to the password and hashes it. Let’s
call the hashed value H1.
4️⃣ The system compares H1 and H2, where H2 is the hash stored in the
database. If they are the same, the password is valid.
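The store-and-validate flow above can be sketched with Python's standard library. Note the swap: instead of a single plain hash of password + salt, this sketch uses PBKDF2-HMAC-SHA256, a deliberately slow key-derivation function. The iteration count and the function names `register` and `verify` are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor for this sketch; tune for real deployments


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a hash from the password and salt (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def register(password: str) -> tuple:
    """Return (salt, hash); both would be stored in the user's database row."""
    salt = secrets.token_bytes(16)  # unique, randomly generated salt per password
    return salt, hash_password(password, salt)


def verify(password: str, salt: bytes, stored_hash: bytes) -> bool:
    """Recompute H1 from the submitted password and compare it with the stored H2."""
    h1 = hash_password(password, salt)
    return hmac.compare_digest(h1, stored_hash)  # constant-time comparison


salt, h2 = register("correct horse battery staple")
print(verify("correct horse battery staple", salt, h2))  # True
print(verify("wrong password", salt, h2))                # False
```

Because each password gets its own random salt, two users with the same password end up with different stored hashes.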
How does HTTPS work?
Hypertext Transfer Protocol Secure (HTTPS) is an extension of the
Hypertext Transfer Protocol (HTTP). HTTPS transmits encrypted data
using Transport Layer Security (TLS). If the data is hijacked online, all
the hijacker gets is binary code.
Step 2 - The client sends a “client hello” to the server. The message
contains a set of necessary encryption algorithms (cipher suites) and
the latest TLS version it can support. The server responds with a
“server hello” so the browser knows whether it can support the
algorithms and TLS version.
The server then sends the SSL certificate to the client. The certificate
contains the public key, host name, expiry dates, etc. The client
validates the certificate.
Step 4 - Now that both the client and the server hold the same session
key (symmetric encryption), the encrypted data is transmitted in a
secure bi-directional channel.
1. Security: The asymmetric encryption goes only one way. This means
that if the server tries to send the encrypted data back to the client,
anyone can decrypt the data using the public key.
How to learn design patterns?
Besides reading a lot of well-written code, a good book guides us like a
good teacher.
Last year, I bought the second edition of Head First Design Patterns
and read through it. Here are a few things I like about the book:
🔹 This book solves the challenge of software’s abstract, “invisible”
nature. Software is difficult to build because we cannot see its
architecture; its details are embedded in the code and binary files. It is
even harder to understand software design patterns because these are
higher-level abstractions of the software. The book fixes this by using
visualization. There are lots of diagrams, arrows, and comments on
almost every page. If I do not understand the text, it’s no problem. The
diagrams explain things very well.
A visual guide on how to choose the right Database
Picking a database is a long-term commitment so the decision
shouldn’t be made lightly. The important thing to keep in mind is to
choose the right database for the right job.
Data can be structured (SQL table schema), semi-structured (JSON,
XML, etc.), and unstructured (Blob).
Common database categories include:
🔹 Relational
🔹 Columnar
🔹 Key-value
🔹 In-memory
🔹 Wide column
🔹 Time Series
🔹 Immutable ledger
🔹 Geospatial
🔹 Graph
🔹 Document
🔹 Text search
🔹 Blob
Over to you - Which database have you used for which workload?
Do you know how to generate globally unique IDs?
In this post, we will explore common requirements for IDs that are used
in social media such as Facebook, Twitter, and LinkedIn.
Requirements:
🔹 Globally unique
🔹 Roughly sorted by time
🔹 Numerical values only
🔹 64 bits
🔹 Highly scalable, low latency
The implementation details of the algorithms can be found online so
we will not go into detail here.
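For readers who want something concrete, here is a minimal sketch of one way to meet the requirements above, loosely following the Snowflake-style layout of timestamp, machine ID, and sequence number. The bit widths, the custom epoch, and the class name `SnowflakeLikeGenerator` are assumptions chosen for illustration, not a specification of any production generator.

```python
import threading
import time


class SnowflakeLikeGenerator:
    """64-bit IDs: 41-bit millisecond timestamp | 10-bit machine ID | 12-bit sequence."""

    EPOCH_MS = 1_288_834_974_657  # assumed custom epoch; any fixed past instant works
    MACHINE_BITS = 10
    SEQUENCE_BITS = 12
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, machine_id: int):
        if not 0 <= machine_id < (1 << self.MACHINE_BITS):
            raise ValueError("machine_id does not fit in 10 bits")
        self.machine_id = machine_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                # Clock moved backwards: reuse the last timestamp to keep IDs increasing.
                now = self.last_ms
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & self.MAX_SEQUENCE
                if self.sequence == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now = self.last_ms + 1
            else:
                self.sequence = 0
            self.last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.MACHINE_BITS + self.SEQUENCE_BITS))
                | (self.machine_id << self.SEQUENCE_BITS)
                | self.sequence
            )


generator = SnowflakeLikeGenerator(machine_id=1)
ids = [generator.next_id() for _ in range(5)]
print(ids == sorted(ids), len(set(ids)) == len(ids))  # True True
```

Putting the timestamp in the high bits is what makes the IDs roughly time-sorted, and the machine ID keeps generators on different hosts from colliding without coordination.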
How does Twitter work?
This post is a summary of a tech talk given by Twitter in 2013. Let’s
take a look.
4️⃣ The Timeline service is used to find the Redis server that has the
home timeline on it.
5️⃣ A user pulls their home timeline through the Timeline service.
𝐒𝐞𝐚𝐫𝐜𝐡 & 𝐃𝐢𝐬𝐜𝐨𝐯𝐞𝐫𝐲
🔹 Ingester: annotates and tokenizes Tweets so the data can be indexed.
🔹 Earlybird: stores search index.
🔹 Blender: creates the search and discovery timelines.
𝐏𝐮𝐬𝐡 𝐂𝐨𝐦𝐩𝐮𝐭𝐞
🔹 HTTP push
🔹 Mobile push
Over to you:
Do you use Twitter? What are some of the biggest differences between
LinkedIn and Twitter that might shape their system architectures?
What is the difference between Process and Thread?
A 𝐓𝐡𝐫𝐞𝐚𝐝 is the smallest unit of execution within a process.
🔹 [...] of a process.
🔹 Each process has its own memory space. Threads that belong to
the same process share the same memory.
🔹 A process is a heavyweight operation. It takes more time to create
and terminate.
🔹 Context switching is more expensive between processes.
🔹 Communication between threads is faster than communication between
processes.
Over to you:
1). Some programming languages support coroutines. What is the
difference between a coroutine and a thread?
Interview Question: design Google Docs
4️⃣ The File Operation Server consumes operations produced by clients
and generates transformed operations using collaboration algorithms.
5️⃣ Three types of data are stored: file metadata, file content, and
operations.
🔹 Operational transformation (OT)
🔹 Differential Synchronization (DS)
🔹 Conflict-free replicated data type (CRDT)
Over to you - Have you encountered any issues while using Google
Docs? If so, what do you think might have caused the issue?
Deployment strategies
Deploying or upgrading services is risky. In this post, we explore risk
mitigation strategies.
𝐌𝐮𝐥𝐭𝐢-𝐒𝐞𝐫𝐯𝐢𝐜𝐞 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
In this model, we deploy new changes to multiple services
simultaneously. This approach is easy to implement. But since all the
services are upgraded at the same time, it is hard to manage and test
dependencies. It’s also hard to roll back safely.
𝐁𝐥𝐮𝐞-𝐆𝐫𝐞𝐞𝐧 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
With blue-green deployment, we have two identical environments: one
is staging (blue) and the other is production (green). The staging
environment is one version ahead of production. Once testing is done
in the staging environment, user traffic is switched to the staging
environment, and the staging becomes the production. Rolling back is
simple with this strategy, but having two identical production-quality
environments could be expensive.
𝐂𝐚𝐧𝐚𝐫𝐲 𝐃𝐞𝐩𝐥𝐨𝐲𝐦𝐞𝐧𝐭
A canary deployment upgrades services gradually, each time to a
subset of users. It is cheaper than blue-green deployment and easy to
perform rollback. However, since there is no staging environment, we
have to test on production. This process is more complicated because
we need to monitor the canary while gradually migrating more and
more users away from the old version.
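One common way to pick the "subset of users" for a canary is to hash each user ID into a stable bucket and compare it with the current rollout percentage. The sketch below assumes that approach; `route_version` and the bucket scheme are hypothetical names and choices, not part of any specific deployment tool.

```python
import hashlib


def route_version(user_id: str, canary_percent: int) -> str:
    """Send a stable subset of users to the canary based on a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


users = [f"user_{i}" for i in range(1000)]
for percent in (5, 25, 100):
    on_canary = sum(route_version(u, percent) == "canary" for u in users)
    print(f"{percent}% rollout -> {on_canary} of {len(users)} users on the canary")
```

Because the bucket depends only on the user ID, a user who reaches the canary at 5% stays on it as the percentage grows, and setting the percentage back to 0 is the rollback.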
𝐀/𝐁 𝐓𝐞𝐬𝐭
In an A/B test, different versions of services run in production
simultaneously. Each version runs an “experiment” for a subset of
users. A/B testing is a cheap way to test new features in production.
We need to control the deployment process in case some features are
pushed to users by accident.
Over to you - Which deployment strategy have you used? Did you
witness any deployment-related outages in production and why did
they happen?
Flowchart of how Slack decides to send a notification
It is a great example of why a simple feature may take much longer to
develop than many people think.
When we have a great design, users may not notice the complexity
because it feels like the feature is just working as intended.
Image source:
https://slack.engineering/reducing-slacks-memory-footprint/
How does Amazon build and operate the software?
In 2019, Amazon released The Amazon Builders' Library. It contains
architecture-based articles that describe how Amazon architects,
releases, and operates technology.
🔹 Making retries safe with idempotent APIs
🔹 Timeouts, retries, and backoff with jitter
🔹 Beyond five 9s: Lessons from our highest available data planes
🔹 Caching challenges and strategies
🔹 Ensuring rollback safety during deployments
🔹 Going faster with continuous delivery
🔹 Challenges with distributed systems
🔹 Amazon's approach to high-availability deployment
Over to you: what’s your favorite place to learn system design and
design principles?
How to design a secure web API access for your
website?
When we open web API access to users, we need to make sure each
API call is authenticated. This means the user must be who they claim
to be.
𝐓𝐨𝐤𝐞𝐧 𝐛𝐚𝐬𝐞𝐝
Step 1 - the user enters their password into the client, and the client
sends the password to the Authentication Server.
Step 2 - the Authentication Server authenticates the credentials and
generates a token with an expiry time.
Steps 3 and 4 - now the client can send requests to access server
resources with the token in the HTTP header. This access is valid until
the token expires.
𝐇𝐌𝐀𝐂 𝐛𝐚𝐬𝐞𝐝
This mechanism generates a Message Authentication Code
(signature) by using a hash function (SHA256 or MD5).
Steps 1 and 2 - the server generates two keys, one is Public APP ID
(public key) and the other one is API Key (private key).
Step 3 - we now generate an HMAC signature on the client side (hmac
A). This signature is generated with a set of attributes listed in the
diagram.
Step 5 - the server receives the request, which contains the request
data and the authentication header. It extracts the necessary attributes
from the request and uses the API key that’s stored on the server side
to generate a signature (hmac B).
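The signing and verification steps can be sketched with Python's `hmac` module. The exact attributes come from the diagram, which is not reproduced here, so the ones used below (APP ID, HTTP method, URI, timestamp, and body) are assumptions, as are the function names `sign` and `server_verify`.

```python
import hashlib
import hmac


def sign(api_key: bytes, app_id: str, method: str, uri: str,
         timestamp: str, body: bytes) -> str:
    """Build an HMAC-SHA256 signature over an assumed set of request attributes."""
    message = "\n".join([app_id, method, uri, timestamp]).encode("utf-8") + b"\n" + body
    return hmac.new(api_key, message, hashlib.sha256).hexdigest()


def server_verify(api_key: bytes, app_id: str, method: str, uri: str,
                  timestamp: str, body: bytes, hmac_a: str) -> bool:
    """Recompute the signature (hmac B) on the server and compare it with hmac A."""
    hmac_b = sign(api_key, app_id, method, uri, timestamp, body)
    return hmac.compare_digest(hmac_a, hmac_b)  # constant-time comparison


API_KEY = b"server-side-secret"   # hypothetical API key shared with the client
request = ("app-123", "POST", "/orders", "1700000000", b'{"item": 42}')

hmac_a = sign(API_KEY, *request)                      # client side
print(server_verify(API_KEY, *request, hmac_a))       # True
tampered = request[:4] + (b'{"item": 43}',)
print(server_verify(API_KEY, *tampered, hmac_a))      # False
```

Including a timestamp in the signed attributes lets the server reject old requests, which limits replay of a captured signature.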
How do microservices collaborate and interact with each
other?
Choreography is like having a choreographer set all the rules. Then the
dancers on stage (the microservices) interact according to them.
Service choreography describes this exchange of messages and the
rules by which the microservices interact.
describes the interactions between all the participating services. It is
just like a conductor leading the musicians in a musical symphony. The
orchestration pattern also includes the transaction management
among different services.
What are the differences between Virtualization
(VMware) and Containerization (Docker)?
The major differences are:
🔹 In virtualization, the hypervisor creates an abstraction layer over
hardware, so that multiple operating systems can run alongside each
other. This technique is considered to be the first generation of cloud
computing.
needed to run the application or microservice are packaged together,
so that the applications can run anywhere.
Sources:
[1] Understanding virtualization: https://lnkd.in/gtQY9gkx
[2] What is containerization?: https://lnkd.in/gm4Qv_x2
Which cloud provider should be used when building a
big data solution?
The diagram below illustrates the detailed comparison of AWS, Google
Cloud, and Microsoft Azure.
The common parts of the solutions:
For example, the first step and the last step both use the serverless
product. The product is called “lambda” in AWS, and “function” in
Azure and Google Cloud.
How to avoid crawling duplicate URLs at Google scale?
Option 1: Use a Set data structure to check if a URL already exists or
not. Set is fast, but it is not space-efficient.
The diagram below illustrates how the Bloom filter works. The basic
data structure for the Bloom filter is Bit Vector. Each bit represents a
hashed value.
Step 1: To add an element to the bloom filter, we feed it to 3 different
hash functions (A, B, and C) and set the bits at the resulting positions.
Note that both “www.myweb1.com” and “www.myweb2.com” mark the
same bit with 1 at index 5. False positives are possible because a bit
might be set by another element.
Step 2: When testing the existence of a URL string, the same hash
functions A, B, and C are applied to the URL string. If all three bits are
1, then the URL may exist in the dataset; if any of the bits is 0, then the
URL definitely does not exist in the dataset.
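The two steps above translate into a small class. This is a minimal sketch, assuming a 1024-bit vector and three hash functions derived by salting SHA-256 with a seed; the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit vector plus several hash functions; supports add and membership test."""

    def __init__(self, size: int = 1024, num_hashes: int = 3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item: str):
        # Derive num_hashes different hash functions by prefixing a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        """Step 1: set the bit at every position the hash functions produce."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """Step 2: all bits set means 'may exist'; any 0 bit means 'definitely not'."""
        return all(self.bits[pos] for pos in self._positions(item))


seen_urls = BloomFilter()
seen_urls.add("www.myweb1.com")
seen_urls.add("www.myweb2.com")
print(seen_urls.might_contain("www.myweb1.com"))  # True: added URLs are never missed
```

A crawler would skip a URL only when `might_contain` returns True, accepting that a small fraction of new URLs are skipped because of false positives.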
Why is a solid-state drive (SSD) fast?
“A solid state drive reads up to 10 times faster and writes up to 20
times faster than a hard disk drive.” [1].
“An SSD is a flash-memory based data storage device. Bits are stored
into cells, which are made of floating-gate transistors. SSDs are made
entirely of electronic components, there are no moving or mechanical
parts like in hard drives (HDD)” [2].
Step 1: “Commands come from the user through the host interface” [2].
The interface can be Serial ATA (SATA) or PCI Express (PCIe).
Step 2: “The processor in the SSD controller takes the commands and
passes them to the flash controller” [2].
Step 3: “SSDs also have embedded RAM memory, generally for
caching purposes and to store mapping information” [2].
Step 4: “The packages of NAND flash memory are organized in gangs,
over multiple channels” [2].
The second diagram illustrates how the logical and physical pages are
mapped, and why this architecture is fast.
Every time a HOST Page is written, the SSD controller finds a Physical
Page to write the data and this mapping is recorded. With this
mapping, the next time HOST reads a HOST Page, the SSD knows
where to read the data from FLASH [3].
Question - What are the main differences between SSD and HDD?
Sources:
[1] SSD or HDD: Which Is Right for You?:
https://www.avg.com/en/signal/ssd-hdd-which-is-best
[2] Coding for SSDs:
https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/
[3] Overview of SSD Structure and Basic Working Principle:
https://www.elinfor.com/knowledge/overview-of-ssd-structure-and-basic-working-principle1-p-11203
Handling a large-scale outage
This is a true story about handling a large-scale outage, written by
Sahn Lam, Staff Engineer at Discord.
It was 9PM on a Friday. I was on the team responsible for one of the
largest social games at the time. It had about 30 million DAU. I just so
happened to glance at the operational dashboard before shutting down
for the night.
At that very moment, I got a phone call from my boss. He said the
entire game was down. Firefighting mode. Full on.
What had gone wrong? The software vendor had introduced a bug that
week in their confirmation dialog flow. When terminating a subset of
nodes in the UI, it would correctly show in the confirmation dialog box
the list of nodes to be terminated, but under the hood, it terminated
everything.
Shortly before 9PM that fateful evening, one of our poor SREs fulfilled
our routine request and terminated an unused Memcache pool. I could
only imagine the horror and the phone conversation that ensued.
What kind of code structure could allow this disastrous bug to slip
through? We could only guess. We never received a full explanation.
What are some of the most impactful software bugs you encountered
in your career?
AWS Lambda behind the scenes
Serverless is one of the hottest topics in cloud services. How does
AWS Lambda work behind the scenes?
𝐅𝐢𝐫𝐞𝐜𝐫𝐚𝐜𝐤𝐞𝐫 𝐌𝐢𝐜𝐫𝐨𝐕𝐌
Firecracker is the engine powering all of the Lambda functions [1]. It is
a virtualization technology developed at Amazon and written in Rust.
The diagram below illustrates the isolation model for AWS Lambda
Workers.
Lambda functions run within a sandbox, which provides a minimal
Linux userland, some common libraries and utilities. It creates the
Execution environment (worker) on EC2 instances.
How are lambdas initiated and invoked? There are two ways.
𝐒𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Step 1: "The Worker Manager communicates with a Placement Service
which is responsible to place a workload on a location for the given
host (it’s provisioning the sandbox) and returns that to the Worker
Manager" [2].
Step 2: "The Worker Manager can then call 𝘐𝘯𝘪𝘵 to initialize the function
for execution by downloading the Lambda package from S3 and
setting up the Lambda runtime" [2].
𝐀𝐬𝐲𝐧𝐜𝐡𝐫𝐨𝐧𝐨𝐮𝐬 𝐞𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧
Step 1: The Application Load Balancer forwards the invocation to an
available Frontend which places the event onto an internal
queue (SQS).
Step 2: There is "a set of pollers assigned to this internal queue which
are responsible for polling it and moving the event onto a Frontend
synchronously. After it’s been placed onto the Frontend it follows the
synchronous invocation call pattern which we covered earlier" [2].
Question: Can you think of any use cases for AWS Lambda?
Sources:
[1] AWS Lambda whitepaper:
https://docs.aws.amazon.com/whitepapers/latest/security-overview-aws-lambda/lambda-executions.html
[2] Behind the scenes, Lambda:
https://www.bschaatsbergen.com/behind-the-scenes-lambda/
HTTP 1.0 -> HTTP 1.1 -> HTTP 2.0 -> HTTP 3.0 (QUIC).
What problem does each generation of HTTP solve?
🔹 HTTP 2.0 was published in 2015. It addresses the head-of-line (HOL)
blocking issue through request multiplexing, which eliminates HOL
blocking at the application layer, but HOL blocking still exists at the
transport (TCP) layer.
As you can see in the diagram, HTTP 2.0 introduced the concept of
HTTP “streams”: an abstraction that allows multiplexing different HTTP
exchanges onto the same TCP connection. Each stream doesn’t need
to be sent in order.
Question: When shall we upgrade to HTTP 3.0? Any pros & cons you
can think of?
How to scale a website to support millions of users?
We will explain this step-by-step.
Suppose we have two services: inventory service (handles product
descriptions and inventory management) and user service (handles
user information, registration, login, etc.).
Step 1 - With the growth of the user base, one single application server
cannot handle the traffic anymore. We put the application server and
the database server into two separate servers.
DevOps Books
Some 𝐃𝐞𝐯𝐎𝐩𝐬 books I find enlightening:
🔹 The Phoenix Project - a classic novel about effectiveness and
communications. IT work is like manufacturing plant work, and a
system must be established to streamline the workflow. Very
interesting read!
Why is Kafka fast?
Kafka achieves low latency message delivery through Sequential I/O
and Zero Copy Principle. The same techniques are commonly used in
many other messaging/streaming platforms.
🔹 Step 2: Consumer reads data without zero-copy
2.1 The data is loaded from disk to OS cache
2.2 The data is copied from OS cache to Kafka application
2.3 Kafka application copies the data into the socket buffer
2.4 The data is copied from socket buffer to network card
2.5 The network card sends data out to the consumer
SOAP vs REST vs GraphQL vs RPC.
The diagram below illustrates the API timeline and API styles
comparison.
Over time, different API architectural styles have been released. Each of
them has its own patterns for standardizing data exchange.
You can check out the use cases of each style in the diagram.
How do modern browsers work?
Links:
https://developer.chrome.com/blog/inside-browser-part1/
https://developer.chrome.com/blog/inside-browser-part2/
https://developer.chrome.com/blog/inside-browser-part3/
https://developer.chrome.com/blog/inside-browser-part4/
Redis vs Memcached
The diagram below illustrates the key differences.
🔹 Recording the number of clicks and comments for each post (hash)
🔹 Sorting the commented user list and deduping the users (zset)
Optimistic locking
Optimistic locking, also referred to as optimistic concurrency control,
allows multiple concurrent users to attempt to update the same
resource.
3. When the user updates the row, the application increases the
version number by 1 and writes it back to the database.
Optimistic locking is usually faster than pessimistic locking because we
do not lock the database. However, the performance of optimistic
locking drops dramatically when concurrency is high.
To understand why, consider the case when many clients try to reserve
a hotel room at the same time. Because there is no limit on how many
clients can read the available room count, all of them read back the
same available room count and the current version number. When
different clients make reservations and write back the results to the
database, only one of them will succeed, and the rest of the clients
receive a version check failure message. These clients have to retry. In
the subsequent round of retries, there is only one successful client
again, and the rest have to retry. Although the end result is correct,
repeated retries cause a very unpleasant user experience.
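The version check and the retry problem can be seen in a small sketch. It assumes an in-memory SQLite table named `rooms` with a `version` column; the table layout and the helper names `read_room` and `reserve` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rooms (id INTEGER PRIMARY KEY, available INTEGER, version INTEGER)")
conn.execute("INSERT INTO rooms VALUES (1, 1, 1)")  # one room left, version 1
conn.commit()


def read_room(room_id: int):
    """Read the available room count together with the current version number."""
    return conn.execute(
        "SELECT available, version FROM rooms WHERE id = ?", (room_id,)
    ).fetchone()


def reserve(room_id: int, available: int, version: int) -> bool:
    """Write back only if the version is unchanged since it was read."""
    cursor = conn.execute(
        "UPDATE rooms SET available = ?, version = ? WHERE id = ? AND version = ?",
        (available - 1, version + 1, room_id, version),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 rows updated means the version check failed


# Two clients read the same count and version, then both try to reserve.
client_a = read_room(1)
client_b = read_room(1)
print(reserve(1, *client_a))  # True: the first writer wins
print(reserve(1, *client_b))  # False: stale version, this client has to retry
```

No lock is held between the read and the write; the `WHERE version = ?` clause is the only guard, which is why every losing client has to read again and retry.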
Tradeoff between latency and consistency
Understanding the 𝐭𝐫𝐚𝐝𝐞𝐨𝐟𝐟𝐬 is very important not only in system design
interviews but also in designing real-world systems. When we talk about
data replication, there is a fundamental tradeoff between 𝐥𝐚𝐭𝐞𝐧𝐜𝐲 and
𝐜𝐨𝐧𝐬𝐢𝐬𝐭𝐞𝐧𝐜𝐲. It is illustrated by the diagram below.
Cache miss attack
Caching is awesome but it doesn’t come without a cost, just like many
things in life.
One of the issues is 𝐂𝐚𝐜𝐡𝐞 𝐌𝐢𝐬𝐬 𝐀𝐭𝐭𝐚𝐜𝐤. Correct me if this is not the
right term. It refers to the scenario where data to fetch doesn't exist in
the database and the data isn’t cached either. So every request hits
the database eventually, defeating the purpose of using a cache. If a
malicious user initiates lots of queries with such keys, the database
can easily be overloaded.
🔹 Cache keys with null value. Set a short TTL (Time to Live) for keys
with null value.
🔹 Using a Bloom filter. A Bloom filter is a data structure that can rapidly
tell us whether an element is possibly in a set or definitely not in it. If
the filter reports that the key may exist, the request first goes to the
cache and then queries the database if needed. If the filter reports that
the key doesn't exist in the data set, the key doesn’t exist in the
cache/database. In this case, the query will not hit the cache or
database layer.
How to diagnose a mysterious process that’s taking too
much CPU, memory, IO, etc?
The diagram below illustrates helpful tools in a Linux system.
🔹 ‘netstat’ - displays statistical data related to IP, TCP, UDP, and ICMP
protocols.
What are the top cache strategies?
Read data from the system:
🔹 Cache aside
🔹 Read through
Write data to the system:
🔹 Write around
🔹 Write back
🔹 Write through
I left out a lot of details as that will make the post very long. Feel free to
leave a comment so we can learn from each other.
Question: What are the pros and cons of each caching strategy? How
to choose the right one to use?
Upload large files
How can we optimize performance when we 𝐮𝐩𝐥𝐨𝐚𝐝 𝐥𝐚𝐫𝐠𝐞 𝐟𝐢𝐥𝐞𝐬 to object
storage service such as S3?
1. The client calls the object storage to initiate a multipart upload.
3. The client splits the large file into small objects and starts uploading.
Let’s assume the size of the file is 1.6 GB and the client splits it into 8
parts, so each part is 200 MB in size. The client uploads the first part to
the data store together with the uploadID it received in step 2.
5. After all parts are uploaded, the client sends a complete multipart
upload request, which includes the uploadID, part numbers, and
ETags.
6. The data store reassembles the object from its parts based on the
part number. Since the object is really large, this process may take a
few minutes. After reassembly is complete, it returns a success
message to the client.
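The flow above can be simulated locally to show how the uploadID, part numbers, and ETags fit together. `InMemoryObjectStore` is a hypothetical stand-in, not a real S3 client, and using a SHA-256 digest of each part as its ETag is an assumption of this sketch.

```python
import hashlib
import uuid


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store; not a real S3 client."""

    def __init__(self):
        self.uploads = {}   # upload_id -> {part_number: bytes}
        self.objects = {}   # key -> bytes

    def initiate_multipart_upload(self) -> str:
        upload_id = uuid.uuid4().hex
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        self.uploads[upload_id][part_number] = data
        return hashlib.sha256(data).hexdigest()  # ETag: checksum of this part (assumed)

    def complete_multipart_upload(self, upload_id: str, key: str, parts: list) -> None:
        stored = self.uploads.pop(upload_id)
        # Verify every part's ETag, then reassemble in part-number order.
        for part_number, etag in parts:
            if hashlib.sha256(stored[part_number]).hexdigest() != etag:
                raise ValueError(f"ETag mismatch for part {part_number}")
        self.objects[key] = b"".join(stored[n] for n, _ in sorted(parts))


def upload_large_file(store, key: str, data: bytes, part_size: int) -> int:
    """Split data into parts, upload each one, then complete; returns the part count."""
    upload_id = store.initiate_multipart_upload()
    parts = []
    for number, offset in enumerate(range(0, len(data), part_size), start=1):
        etag = store.upload_part(upload_id, number, data[offset:offset + part_size])
        parts.append((number, etag))
    store.complete_multipart_upload(upload_id, key, parts)
    return len(parts)


store = InMemoryObjectStore()
payload = bytes(range(256)) * 64   # 16 KiB stand-in for the 1.6 GB file
part_count = upload_large_file(store, "big.bin", payload, part_size=2048)
print(part_count, store.objects["big.bin"] == payload)  # 8 True
```

Because each part is identified by its number, parts can be uploaded in parallel or retried individually, and the store can still reassemble them in the right order.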
Why is Redis so Fast?
There are 3 main reasons as shown in the diagram below.
You might have noticed the style of this diagram is different from my
previous posts. Please let me know which one you prefer.
SWIFT payment network
You have probably heard about 𝐒𝐖𝐈𝐅𝐓. What is SWIFT? What role does it
play in cross-border payments? You can find answers to those
questions in this post.
77
Step 2: Regional processor validates the format and sends it to Slice
Processor A. The Regional Processor is responsible for input message
validation and output message queuing. The Slice Processor is
responsible for storing and routing messages safely.
78
Step 15: Slice Processor B stores the report.
79
At-most once, at-least once, and exactly once
In modern architecture, systems are broken up into small and
independent building blocks with well-defined interfaces between them.
Message queues provide communication and coordination for those
building blocks. Today, let’s discuss different delivery semantics:
at-most once, at-least once, and exactly once.
𝐀𝐭-𝐦𝐨𝐬𝐭 𝐨𝐧𝐜𝐞
As the name suggests, at-most once means a message will be
delivered not more than once. Messages may be lost but are not
redelivered. This is how at-most once delivery works at a high level.
Use cases: It is suitable for use cases like monitoring metrics, where a
small amount of data loss is acceptable.
𝐀𝐭-𝐥𝐞𝐚𝐬𝐭 𝐨𝐧𝐜𝐞
With this data delivery semantic, it’s acceptable to deliver a message
more than once, but no message should be lost.
Use cases: With at-least once, messages won’t be lost but the same
message might be delivered multiple times. While not ideal from a user
perspective, at-least once delivery semantics are usually good enough
for use cases where data duplication is not a big issue or deduplication
is possible on the consumer side. For example, with a unique key in
each message, a message can be rejected when writing duplicate data
to the database.
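Here is a small sketch of consumer-side deduplication under at-least once delivery. It assumes each message carries a unique `message_id` and uses a SQLite primary key to reject duplicates; the table and function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (message_id TEXT PRIMARY KEY, amount INTEGER)")

def handle(message_id, amount):
    """Return True if the message was applied, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO events VALUES (?, ?)", (message_id, amount))
        return True
    except sqlite3.IntegrityError:
        # The unique key already exists, so this is a redelivery.
        return False
```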
𝐄𝐱𝐚𝐜𝐭𝐥𝐲 𝐨𝐧𝐜𝐞
Exactly once is the most difficult delivery semantic to implement. It is
friendly to users, but it has a high cost for the system’s performance
and complexity.
81
Vertical partitioning and Horizontal partitioning
In many large-scale applications, data is divided into partitions that can
be accessed separately. There are two typical strategies for partitioning
data.
82
Horizontal partitioning is widely used so let’s take a closer look.
𝐑𝐨𝐮𝐭𝐢𝐧𝐠 𝐚𝐥𝐠𝐨𝐫𝐢𝐭𝐡𝐦
The routing algorithm decides which partition (shard) stores the data.
𝐁𝐞𝐧𝐞𝐟𝐢𝐭𝐬
🔹 Facilitate horizontal scaling. Sharding makes it possible to add
more machines to spread out the load.
𝐃𝐫𝐚𝐰𝐛𝐚𝐜𝐤𝐬
🔹 The ORDER BY operation is more complicated. Usually, we need
to fetch data from different shards and sort the data in the application's
code.
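A minimal sketch of both ideas: a modulo-based routing function and an ORDER BY that merges sorted results from every shard in application code. The shard count, the in-memory lists, and the function names are assumptions for illustration.

```python
import heapq

NUM_SHARDS = 4  # assumed shard count
shards = [[] for _ in range(NUM_SHARDS)]

def route(user_id):
    """Routing algorithm: pick the shard from the key (here, a simple modulo)."""
    return user_id % NUM_SHARDS

def insert(user_id, name):
    shards[route(user_id)].append((user_id, name))

def select_all_ordered_by_id():
    """ORDER BY across shards: sort each shard's rows, then merge in the application."""
    return list(heapq.merge(*(sorted(rows) for rows in shards)))
```

A plain modulo is easy to reason about, but changing `NUM_SHARDS` moves most keys, which is one reason resharding is hard.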
This topic is very big and I’m sure I missed a lot of important details.
What else do you think is important for data partitioning?
83
CDN
A content delivery network (CDN) is a set of geographically distributed
servers (also called edge servers) that provide fast delivery of static
and dynamic content. Let’s take a look at how it works.
84
2. If the domain name does not exist in the local DNS cache, the
browser goes to the DNS resolver to resolve the name. The DNS
resolver usually sits in the Internet Service Provider (ISP).
6. The authoritative name server returns the domain name for the load
balancer of CDN www.myshop.lb.com.
8. The CDN load balancer returns the CDN edge server’s IP address
for www.myshop.lb.com.
9. Now we finally get the actual IP address to visit. The DNS resolver
returns the IP address to the browser.
10. The browser visits the CDN edge server to load the content. There
are two types of content cached on the CDN servers: static content
and dynamic content. The former contains static pages, pictures, and
videos; the latter includes results of edge computing.
11. If the edge CDN server cache doesn't contain the content, it goes
upward to the regional CDN server. If the content is still not found, it
will go upward to the central CDN server, or even go to the origin - the
London web server. This is called the CDN distribution network, where
the servers are deployed geographically.
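The upward lookup in step 11 can be sketched as a walk through cache tiers. The `fetch` function and the dictionary caches are hypothetical, and filling the closer tiers after a miss is an assumption of this sketch rather than something the post states.

```python
def fetch(path, tiers, origin):
    """Walk the cache tiers from nearest to farthest; fall back to the origin on a full miss."""
    for i, (name, cache) in enumerate(tiers):
        if path in cache:
            content, source = cache[path], name
            break
    else:
        content, source, i = origin[path], "origin", len(tiers)
    # Assumed behaviour: fill every tier that missed so the next request is served closer.
    for _, cache in tiers[:i]:
        cache[path] = content
    return content, source
```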
Over to you: How do you prevent videos cached on CDN from being
pirated?
86
Erasure coding
A really cool technique that’s commonly used in object storage such as
S3 to improve durability is called 𝐄𝐫𝐚𝐬𝐮𝐫𝐞 𝐂𝐨𝐝𝐢𝐧𝐠. Let’s take a look at
how it works.
87
Erasure coding deals with data durability differently from replication. It
chunks data into smaller pieces (placed on different servers) and
creates parities for redundancy. In the event of failures, we can use
chunk data and parities to reconstruct the data. Let’s take a look at a
concrete example (4 + 2 erasure coding) as shown in Figure 1.
1️⃣ Data is broken up into four even-sized data chunks d1, d2, d3, and
d4.
2️⃣ The mathematical formula is used to calculate the parities p1 and p2.
To give a much simplified example, p1 = d1 + 2*d2 - d3 + 4*d4 and p2
= -d1 + 5*d2 + d3 - 3*d4.
4️⃣ The mathematical formula is used to reconstruct lost data d3 and d4,
using the known values of d1, d2, p1, and p2.
How much extra space does erasure coding need? For every two
chunks of data, we need one parity block, so the storage overhead is
50% (Figure 2). In 3-copy replication, by contrast, the storage
overhead is 200% (Figure 2).
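The 4 + 2 example can be worked through in a few lines. The sketch below uses the simplified integer formulas from the text, solves them for the lost chunks d3 and d4, and computes the storage overhead. The function names are made up, and production systems typically use codes such as Reed-Solomon over finite fields rather than plain integer arithmetic.

```python
def encode(d1, d2, d3, d4):
    """Compute the two parities using the simplified formulas from the text."""
    p1 = d1 + 2 * d2 - d3 + 4 * d4
    p2 = -d1 + 5 * d2 + d3 - 3 * d4
    return p1, p2

def recover_d3_d4(d1, d2, p1, p2):
    """Solve the two parity equations for the lost chunks d3 and d4."""
    a = p1 - d1 - 2 * d2   # a = -d3 + 4*d4
    b = p2 + d1 - 5 * d2   # b =  d3 - 3*d4
    d4 = a + b
    d3 = b + 3 * d4
    return d3, d4

def overhead(data_chunks, redundant_chunks):
    """Extra storage as a fraction of the original data size."""
    return redundant_chunks / data_chunks
```

For example, `overhead(4, 2)` gives 0.5 for 4 + 2 erasure coding, while `overhead(1, 2)` gives 2.0 for 3-copy replication (one original plus two extra copies).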
88
Foreign exchange in payment
Have you wondered what happens under the hood when you pay with
USD online and the seller from Europe receives EUR (euro)? This
process is called foreign exchange.
Suppose Bob (the buyer) needs to pay 100 USD to Alice (the seller),
and Alice can only receive EUR. The diagram below illustrates the
process.
89
3. 100 USD is sold to Bank E’s funding pool.
4. Bank E’s funding pool provides 88 EUR in exchange for 100 USD.
The money is put into Paypal’s EUR account in Bank E.
Now let’s take a close look at the foreign exchange (forex) market. It
has 3 layers:
currencies in advance.
🔹 Wholesale market. The wholesale business is composed of
investment banks, commercial banks, and foreign exchange providers.
It usually handles accumulated orders from the retail market.
🔹 Top-level participants. They are multinational commercial banks
that hold a large number of certificates of deposit from different
countries. They exchange these certificates for foreign exchange
trading.
When Bank E’s funding pool needs more EUR, it goes upward to the
wholesale market to sell USD and buy EUR. When the wholesale
market accumulates enough orders, it goes upward to top-level
participants. Steps 3.1-3.3 and 4.1-4.3 explain how it works.
What foreign currency did you find difficult to exchange? And what
company have you used for foreign currency exchange?
90
Interview Question: Design S3
What happens when you upload a file to Amazon S3? Let’s design an
S3-like object storage system.
91
𝐁𝐮𝐜𝐤𝐞𝐭. A logical container for objects. The bucket name is globally
unique. To upload data to S3, we must first create a bucket.
An S3 object consists of (Figure 1):
🔹 Metadata. It is mutable and contains attributes such as ID, bucket
name, object name, etc.
🔹 Object data. It is immutable and contains the actual data.
2. The API service calls the Identity and Access Management (IAM) to
ensure the user is authorized and has WRITE permission.
3. The API service calls the metadata store to create an entry with the
bucket info in the metadata database. Once the entry is created, a
success message is returned to the client.
4. After the bucket is created, the client sends an HTTP PUT request
to create an object named “script.txt”.
5. The API service verifies the user’s identity and ensures the user has
WRITE permission on the bucket.
6. Once validation succeeds, the API service sends the object data in
the HTTP PUT payload to the data store. The data store persists the
payload as an object and returns the UUID of the object.
7. The API service calls the metadata store to create a new entry in the
metadata database. It contains important metadata such as the
object_id (UUID), bucket_id (which bucket the object belongs to),
object_name, etc.
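The split between the data store and the metadata store can be sketched as follows. All names here are hypothetical, the stores are plain dictionaries, and the IAM checks from steps 2 and 5 are left out to keep the sketch short.

```python
import uuid

data_store = {}       # object_id (UUID) -> immutable object data
metadata_store = {}   # (bucket, object_name) -> metadata entry
buckets = set()

def create_bucket(bucket):
    if bucket in buckets:
        raise ValueError("bucket names must be unique")
    buckets.add(bucket)

def put_object(bucket, object_name, payload):
    if bucket not in buckets:
        raise KeyError("bucket does not exist")
    object_id = str(uuid.uuid4())          # step 6: the data store returns a UUID
    data_store[object_id] = bytes(payload)
    metadata_store[(bucket, object_name)] = {   # step 7: record the mapping
        "object_id": object_id,
        "bucket_id": bucket,
        "object_name": object_name,
    }
    return object_id

def get_object(bucket, object_name):
    """Look up the UUID in the metadata store, then read the data store."""
    return data_store[metadata_store[(bucket, object_name)]["object_id"]]
```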
93
Block storage, file storage and object storage
Yesterday, I posted the definitions of block storage, file storage, and
object storage. Let’s continue the discussion and compare those 3
options.
94
Block storage, file storage and object storage
In this post, let’s review the storage systems in general.
🔹 Block storage
🔹 File storage
🔹 Object storage
𝐁𝐥𝐨𝐜𝐤 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
Block storage came first, in the 1960s. Common storage devices like
hard disk drives (HDD) and solid-state drives (SSD) that are physically
attached to servers are all considered as block storage.
Block storage presents the raw blocks to the server as a volume. This
is the most flexible and versatile form of storage. The server can
format the raw blocks and use them as a file system, or it can hand
control of those blocks to an application. Some applications like a
database or a virtual machine engine manage these blocks directly in
order to squeeze every drop of performance out of them.
95
and iSCSI. Conceptually, the network-attached block storage still
presents raw blocks. To the servers, it works the same as physically
attached block storage. Whether network-attached or physically attached,
block storage is fully owned by a single server. It is not a shared
resource.
𝐅𝐢𝐥𝐞 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
File storage is built on top of block storage. It provides a higher-level
abstraction to make it easier to handle files and directories. Data is
stored as files under a hierarchical directory structure. File storage is
the most common general-purpose storage solution. File storage could
be made accessible by a large number of servers using common
file-level network protocols like SMB/CIFS and NFS. The servers
accessing file storage do not need to deal with the complexity of
managing the blocks, formatting volume, etc. The simplicity of file
storage makes it a great solution for sharing a large number of files
and folders within an organization.
𝐎𝐛𝐣𝐞𝐜𝐭 𝐬𝐭𝐨𝐫𝐚𝐠𝐞
Object storage is new. It makes a very deliberate tradeoff to sacrifice
performance for high durability, vast scale, and low cost. It targets
relatively “cold” data and is mainly used for archival and backup.
Object storage stores all data as objects in a flat structure. There is no
hierarchical directory structure. Data access is normally provided via a
RESTful API. It is relatively slow compared to other storage types.
Most public cloud service providers have an object storage offering,
such as AWS S3, Google Cloud Storage, and Azure Blob Storage.
96
Domain Name System (DNS) lookup
DNS acts as an address book. It translates human-readable domain
names (google.com) to machine-readable IP addresses
(142.251.46.238).
The diagram below illustrates how DNS lookup works under the hood:
1. google.com is typed into the browser, and the browser sends the
domain name to the DNS resolver.
2. The resolver queries a DNS root name server.
3. The root server responds to the resolver with the address of a TLD
DNS server. In this case, it is .com.
5. The TLD server responds with the IP address of the domain’s name
server, google.com (authoritative name server).
8. The DNS resolver responds to the web browser with the IP address
(142.251.46.238) of the domain requested initially.
98
What happens when you type a URL into your browser?
The diagram below illustrates the steps.
1. Bob enters a URL into the browser and hits Enter. In this example,
the URL is composed of 4 parts:
server using HTTPS.
🔹 domain - 𝒆𝒙𝒂𝒎𝒑𝒍𝒆.𝒄𝒐𝒎. This is the domain name of the site.
🔹 path - 𝒑𝒓𝒐𝒅𝒖𝒄𝒕/𝒆𝒍𝒆𝒄𝒕𝒓𝒊𝒄. It is the path on the server to the requested
resource: phone.
🔹 resource - 𝒑𝒉𝒐𝒏𝒆. It is the name of the resource Bob wants to visit.
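The same breakdown can be reproduced with Python's standard `urllib.parse`. The full URL used in the usage note is an assumption assembled from the parts listed above, and `split_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Split a URL into the four parts described above."""
    parts = urlsplit(url)
    # Treat the last path segment as the resource and the rest as the path.
    directory, _, resource = parts.path.strip("/").rpartition("/")
    return {
        "scheme": parts.scheme,
        "domain": parts.netloc,
        "path": directory,
        "resource": resource,
    }
```

Calling `split_url("https://example.com/product/electric/phone")` yields the scheme `https`, the domain `example.com`, the path `product/electric`, and the resource `phone`.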
2. The browser looks up the IP address for the domain with a domain
name system (DNS) lookup. To make the lookup process fast, data is
cached at different layers: browser cache, OS cache, local network
cache and ISP cache.
2.1 If the IP address cannot be found at any of the caches, the browser
goes to DNS servers to do a recursive DNS lookup until the IP address
is found (this will be covered in another post).
4. The browser sends an HTTP request to the server. The request looks
like this:
5. The server processes the request and sends back the response. For
a successful response (the status code is 200), the HTML response
might look like this:
𝘏𝘛𝘛𝘗/1.1 200 𝘖𝘒
𝘋𝘢𝘵𝘦: 𝘚𝘶𝘯, 30 𝘑𝘢𝘯 2022 00:01:01 𝘎𝘔𝘛
𝘚𝘦𝘳𝘷𝘦𝘳: 𝘈𝘱𝘢𝘤𝘩𝘦
𝘊𝘰𝘯𝘵𝘦𝘯𝘵-𝘛𝘺𝘱𝘦: 𝘵𝘦𝘹𝘵/𝘩𝘵𝘮𝘭; 𝘤𝘩𝘢𝘳𝘴𝘦𝘵=𝘶𝘵𝘧-8
<!𝘋𝘖𝘊𝘛𝘠𝘗𝘌 𝘩𝘵𝘮𝘭>
<𝘩𝘵𝘮𝘭 𝘭𝘢𝘯𝘨="𝘦𝘯">
𝘏𝘦𝘭𝘭𝘰 𝘸𝘰𝘳𝘭𝘥
</𝘩𝘵𝘮𝘭>
100
AI Coding engine
DeepMind says its new AI coding engine (AlphaCode) is as good as an
average programmer.
5. Run the candidate programs against the test cases, evaluate the
performance, and choose the best one.
101
Do you think AI bots will be better at LeetCode or competitive
programming than software engineers five years from now?
102
Read replica pattern
There are two common ways to implement the read replica pattern:
1. Embed the routing logic in the application code (explained in the last
post).
2. Use database middleware.
103
1. When Alice places an order on Amazon, the request is sent to Order
Service.
2. Order Service does not directly interact with the database. Instead, it
sends database queries to the database middleware.
4. Alice views the order details (read). The request is sent through the
middleware.
5. Alice views the recent order history (read). The request is sent
through the middleware.
Pros:
- Simplified application code. The application doesn’t need to be aware
of the database topology and manage access to the database directly.
Cons:
- Increased system complexity. A database middleware is a complex
system. Since all database queries go through the middleware, it
usually requires a high availability setup to avoid a single point of
failure.
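As a rough sketch of what the middleware does, the class below sends writes to the primary and rotates reads across replicas. The class name, the server labels, and the "starts with SELECT" check are simplifying assumptions; real middleware parses SQL properly and tracks replica health.

```python
import itertools

class Middleware:
    """Hypothetical database middleware: writes go to the primary, reads rotate across replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        # Naive read detection, good enough for a sketch.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary
```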
104
Read replica pattern
In this post, we talk about a simple yet commonly used database
design pattern (setup): 𝐑𝐞𝐚𝐝 𝐫𝐞𝐩𝐥𝐢𝐜𝐚 𝐩𝐚𝐭𝐭𝐞𝐫𝐧.
105
Under certain circumstances (network delay, server overload, etc.),
data in replicas might be seconds or even minutes behind. In this case,
if Alice immediately checks the order status (query is served by the
replica) after the order is placed, she might not see the order at all.
This leaves Alice confused. In this case, we need “read-after-write”
consistency.
2️⃣ Reads that immediately follow writes are routed to the primary
database.
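One way to route "reads that immediately follow writes" is to pin a user to the primary for a short window after each write. This is a sketch under assumptions: the class name and the 5-second window are made up, and the window should be chosen from the replication lag you actually observe.

```python
import time

class ReadAfterWriteRouter:
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # assumed upper bound on replication lag
        self.clock = clock
        self._last_write = {}            # user -> time of the user's latest write

    def on_write(self, user):
        self._last_write[user] = self.clock()
        return "primary"

    def on_read(self, user):
        last = self._last_write.get(user)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"             # a read right after a write
        return "replica"
```

Other users keep reading from replicas, so the primary only absorbs the extra reads of users who have just written.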
106
Email receiving flow
The following diagram demonstrates the email receiving flow.
4. Emails are put in the incoming email queue. The queue decouples
mail processing workers from SMTP servers so they can be scaled
independently. Moreover, the queue serves as a buffer in case the
email volume surges.
6. The email is stored in the mail storage, cache, and object data store.
7. If the receiver is currently online, the email is pushed to real-time
servers.
9. For offline users, emails are stored in the storage layer. When a user
comes back online, the webmail client connects to web servers via
RESTful API.
10. Web servers pull new emails from the storage layer and return
them to the client.
108
Email sending flow
In this post, we will take a closer look at the email sending flow.
2. The load balancer makes sure it doesn’t exceed the rate limit and
routes traffic to web servers.
4. Message queues.
4.a. If basic email validation succeeds, the email data is passed to
the outgoing queue.
4.b. If basic email validation fails, the email is put in the error
queue.
5. SMTP outgoing workers pull events from the outgoing queue and
make sure emails are spam and virus free.
7. SMTP outgoing workers send the email to the recipient mail server.
We monitor the size of the outgoing queue very closely. If there are
many emails stuck in the queue, we need to analyze the cause of the
issue. Here are some possibilities:
- The recipient’s mail server is unavailable. In this case, we need to
retry sending the email at a later time. Exponential backoff might be a
good retry strategy.
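A minimal sketch of an exponential backoff schedule for those retries. The base delay, the cap, and the optional jitter are assumptions picked for illustration, not values from the post.

```python
import random

def backoff_delays(attempts, base=1.0, cap=300.0, jitter=False, rng=random.random):
    """Yield the wait (in seconds) before each retry: base * 2**n, capped, optionally jittered."""
    for n in range(attempts):
        delay = min(cap, base * 2 ** n)
        # Jitter spreads out retries so many workers do not hit the server at once.
        yield delay * rng() if jitter else delay
```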
110
Interview Question: Design Gmail
One picture is worth more than a thousand words. In this post, we will
take a look at what happens when Alice sends an email to Bob.
2. Outlook mail server queries the DNS (not shown in the diagram) to
find the address of the recipient’s SMTP server. In this case, it is
Gmail’s SMTP server. Next, it transfers the email to the Gmail mail
server. The communication protocol between the mail servers is SMTP.
3. The Gmail server stores the email and makes it available to Bob, the
recipient.
111
4. Gmail client fetches new emails through the IMAP/POP server when
Bob logs in to Gmail.
112
Map rendering
Google Maps Continued. Let’s take a look at 𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠 in this
post.
𝐏𝐫𝐞-𝐂𝐨𝐦𝐩𝐮𝐭𝐞𝐝 𝐓𝐢𝐥𝐞𝐬
One foundational concept in map rendering is tiling. Instead of
rendering the entire map as one large custom image, the world is
broken up into smaller tiles. The client only downloads the relevant
tiles for the area the user is in and stitches them together like a mosaic
for display. The tiles are pre-computed at different zoom levels. Google
Maps uses 21 zoom levels.
This allows the client to render the map at the best granularities
depending on the client’s zoom level without consuming excessive
bandwidth to download tiles with too much detail. This is especially
important when we are loading the images from mobile clients.
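To show how a client could pick the tile for a location, here is a sketch that assumes the common Web Mercator tiling scheme, where zoom level z has 2^z by 2^z tiles. The post does not say which projection or tile indexing is used, so treat the formula and the function names as assumptions.

```python
import math

def tile_for(lat_deg, lon_deg, zoom):
    """Return the (x, y) tile index containing a point, assuming Web Mercator tiling."""
    n = 2 ** zoom                      # tiles per axis at this zoom level
    x = int((lon_deg + 180.0) / 360.0 * n)
    lat = math.radians(lat_deg)
    y = int((1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n)
    return min(x, n - 1), min(y, n - 1)

def tile_count(zoom):
    """Total number of tiles at a zoom level: each level splits every tile into four."""
    return 4 ** zoom
```

The tile count grows by 4x per zoom level, which is why the client only downloads the tiles for the visible area.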
𝐑𝐨𝐚𝐝 𝐒𝐞𝐠𝐦𝐞𝐧𝐭𝐬
Now that we have transformed massive maps into tiles, we also need
to define a data structure for the roads. We divide the world of roads
into small blocks. We call these blocks road segments. Each road
segment contains multiple roads, junctions, and other metadata.
We then transform the road segments into a data structure that the
navigation algorithms can use. The typical approach is to convert the
map into a 𝒈𝒓𝒂𝒑𝒉, where the nodes are road segments, and two nodes
are connected if the corresponding road segments are reachable
neighbors. In this way, finding a path between two locations becomes a
shortest-path problem, where we can leverage Dijkstra or A*
algorithms.
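Here is a compact Dijkstra sketch over such a graph. The graph is a plain dictionary mapping each node (road segment) to its reachable neighbors with a travel cost; the node names and costs in the usage note are made up.

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a dict: node -> list of (neighbor, cost). Returns (total_cost, path) or None."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None
```

For example, with `{"A": [("B", 4), ("C", 1)], "C": [("B", 2)], "B": [("D", 3)]}`, the cheapest route from A to D goes through C and B.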
114
Interview Question: Design Google Maps
Google started project 𝐆𝐨𝐨𝐠𝐥𝐞 𝐌𝐚𝐩𝐬 in 2005. As of March 2021, Google
Maps had one billion daily active users and 99% coverage of the world
across 200 countries.
115
𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
The location service is responsible for recording a user’s location
update. Google Maps clients send location updates every few
seconds. The user location data is used in many cases:
𝐌𝐚𝐩 𝐑𝐞𝐧𝐝𝐞𝐫𝐢𝐧𝐠
The world’s map is projected into a huge 2D map image. It is broken
down into small image blocks called “tiles” (see below). The tiles are
static. They don’t change very often. An efficient way to serve static tile
files is with a CDN backed by cloud storage like S3. The users can
load the necessary tiles to compose a map from a nearby CDN.
What if a user is zooming and panning the map viewpoint on the client
to explore their surroundings?
𝐍𝐚𝐯𝐢𝐠𝐚𝐭𝐢𝐨𝐧 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
This component is responsible for finding a reasonably fast route from
point A to point B. It calls two services to help with the path calculation:
2️⃣ Route Planner Service: this service does three things in sequence:
116
Pull vs push models
There are two ways metrics data can be collected, pull or push. It is a
routine debate as to which one is better and there is no clear answer.
In this post, we will take a look at the pull model.
117
Figure 1 shows data collection with a pull model over HTTP. We have
dedicated metric collectors which pull metrics values from the running
applications periodically.
In this approach, the metrics collector needs to know the complete list
of service endpoints to pull data from. One naive approach is to use a
file to hold DNS/IP information for every service endpoint on the
“metric collector” servers. While the idea is simple, this approach is
hard to maintain in a large-scale environment where servers are added
or removed frequently, and we want to ensure that metric collectors
don’t miss out on collecting metrics from any new servers.
2️⃣ The metrics collector pulls metrics data via a pre-defined HTTP
endpoint (for example, /metrics). To expose the endpoint, a client
library usually needs to be added to the service. In Figure 3, the
service is Web Servers.
118
Money movement
One picture is worth more than a thousand words. This is what
happens under the hood when you buy a product using PayPal or a
bank card.
119
Let’s say Bob wants to buy an SDI book from Claire’s shop on
Amazon.
120
The first two layers are called information flow, and the settlement layer
is called fund flow.
You can see the 𝐢𝐧𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧 𝐟𝐥𝐨𝐰 𝐚𝐧𝐝 𝐟𝐮𝐧𝐝 𝐟𝐥𝐨𝐰 𝐚𝐫𝐞 𝐬𝐞𝐩𝐚𝐫𝐚𝐭𝐞𝐝. In the
info flow, the money seems to be deducted from one bank account and
added to another bank account, but the actual money movement
happens in the settlement bank at the end of the day.
Because of the asynchronous nature of the info flow and the fund flow,
reconciliation is very important for data consistency in the systems
along the flow.
It makes things even more interesting when Bob wants to buy a book
in the Indian market, where Bob pays USD but the seller can only
receive INR.
121
Reconciliation
My previous post about painful payment reconciliation problems
sparked lots of interesting discussions. One of the readers shared
more problems we may face when working with intermediary payment
processors in the trenches and a potential solution:
122
2) The order number is carried over to the payment provider
3) The payment provider creates another internal ID, which is carried
over across transactions within the system
4) The payment ID is used when you get the payout on your bank
account (or the payment provider bundles individual payments, which
can be reconciled within the payment provider system)
5) Ideally, your payment provider and your shop have an
integration/API with the tool you use to (hopefully automatically) create
invoices. This usually carries over the order id from the shop (closing
the loop) and sometimes even the payment id to match it with the
invoice id, which you then can use to reconcile it with your accounts
receivable/payable. :)
123
Continued: how to choose the right database for metrics collecting
service?
There are many storage systems available that are optimized for
time-series data. The optimization lets us use far fewer servers to
handle the same volume of data. Many of these databases also have
custom query interfaces specially designed for the analysis of
time-series data that are much easier to use than SQL. Some even
provide features to manage data retention and data aggregation. Here
are a few examples of time-series databases.
124
Since a time-series database is a specialized database, you are not
expected to understand the internals in an interview unless you
explicitly mentioned it in your resume. For the purpose of an interview,
it’s important to understand the metrics data are time-series in nature
and we can select time-series databases such as InfluxDB to store
them.
125
Which database shall I use for the metrics collecting
system?
The write load is heavy. As you can see, there can be many
time-series data points written at any moment. There are millions of
operational metrics written per day, and many metrics are collected at
high frequency, so the traffic is undoubtedly write-heavy.
At the same time, the read load is spiky. Both visualization and alert
services send queries to the database and depending on the access
patterns of the graphs and alerts, the read volume could be bursty.
126
How about NoSQL? In theory, a few NoSQL databases on the market
could handle time-series data effectively. For example, Cassandra and
Bigtable can both be used for time series data. However, this would
require deep knowledge of the internal workings of each NoSQL to
devise a scalable schema for effectively storing and querying
time-series data. With industrial-scale time-series databases readily
available, using a general purpose NoSQL database is not appealing.
There are many storage systems available that are optimized for
time-series data. The optimization lets us use far fewer servers to
handle the same volume of data. Many of these databases also have
custom query interfaces specially designed for the analysis of
time-series data that are much easier to use than SQL. Some even
provide features to manage data retention and data aggregation. Here
are a few examples of time-series databases.
127
labels. It provides clear best-practice guidelines on how to use labels,
without overloading the database. The key is to make sure each label
is of low cardinality (having a small set of possible values). This feature
is critical for visualization, and it would take a lot of effort to build this
with a general-purpose database.
128
Metrics monitoring and alerting system
A well-designed 𝐦𝐞𝐭𝐫𝐢𝐜𝐬 𝐦𝐨𝐧𝐢𝐭𝐨𝐫𝐢𝐧𝐠 and alerting system plays a key
role in providing clear visibility into the health of the infrastructure to
ensure high availability and reliability. The diagram below explains how
it works at a high level.
Metrics collector: It gathers metrics data and writes data into the
time-series database.
129
Consumers: Consumers or streaming processing services such as
Apache Storm, Flink and Spark, process and push data to the
time-series database.
Query service: The query service makes it easy to query and retrieve
data from the time-series database. This should be a very thin wrapper
if we choose a good time-series database. It could also be entirely
replaced by the time-series database’s own query interface.
130
Reconciliation
𝐑𝐞𝐜𝐨𝐧𝐜𝐢𝐥𝐢𝐚𝐭𝐢𝐨𝐧 might be the most painful process in a payment system.
It is the process of comparing records in different systems to make
sure the amounts match each other.
Let’s take a look at some pain points and how we can address them:
131
𝐏𝐫𝐨𝐛𝐥𝐞𝐦 1: Data normalization. When comparing records in different
systems, they come in different formats. For example, the timestamp
can be “2022/01/01” in one system and “Jan 1, 2022” in another.
𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐬𝐨𝐥𝐮𝐭𝐢𝐨𝐧: we can add a layer to transform different formats into
the same format.
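Such a normalization layer can be as small as the sketch below, which converts both example timestamps to ISO 8601. The list of known formats and the function name are assumptions; a real system would register one format per upstream source.

```python
from datetime import datetime

KNOWN_FORMATS = ["%Y/%m/%d", "%b %d, %Y"]   # assumed set of source formats

def normalize_date(raw):
    """Convert a date string in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")
```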
132
Which database shall I use? This is one of the most important
questions we usually need to address in an interview.
133
Big data papers
Below is a timeline of important big data papers and how the
techniques evolved over time.
The green highlighted boxes are the famous 3 Google papers, which
established the foundation of the big data framework. At a high level:
Now let’s look at the 𝐎𝐋𝐀𝐏 evolution. MapReduce was not easy to
program, so Hive solved this by introducing a SQL-like query
language. But Hive still used MapReduce under the hood, so it’s not
very responsive. In 2010, Dremel provided an interactive query engine.
135
Avoid double charge
One of the most serious problems a payment system can have is to
𝐝𝐨𝐮𝐛𝐥𝐞 𝐜𝐡𝐚𝐫𝐠𝐞 𝐚 𝐜𝐮𝐬𝐭𝐨𝐦𝐞𝐫. When we design the payment system, it is
important to guarantee that the payment system executes a payment
order exactly-once.
136
At first glance, exactly-once delivery seems very hard to tackle, but
if we divide the problem into two parts, it is much easier to solve.
Mathematically, an operation is executed exactly-once if:
𝐑𝐞𝐭𝐫𝐲
Occasionally, we need to retry a payment transaction due to network
errors or timeout. Retry provides the at-least-once guarantee. For
example, as shown in Figure 10, the client tries to make a $10
payment, but the payment keeps failing due to a poor network
connection. Considering the network condition might get better, the
client retries the request and this payment finally succeeds at the
fourth attempt.
𝐈𝐝𝐞𝐦𝐩𝐨𝐭𝐞𝐧𝐜𝐲
From an API standpoint, idempotency means clients can make the
same call repeatedly and produce the same result.
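Retry plus idempotency can be sketched with an idempotency key. `PaymentService` and the key format are hypothetical; the point is that a retried request with the same key returns the stored result instead of charging again.

```python
class PaymentService:
    """Hypothetical service that deduplicates requests by an idempotency key."""
    def __init__(self):
        self.results = {}    # idempotency key -> stored result
        self.charged = 0     # total amount actually charged

    def pay(self, idempotency_key, amount):
        if idempotency_key in self.results:      # a retry: return the earlier result
            return self.results[idempotency_key]
        self.charged += amount                   # executed only once per key
        result = {"status": "success", "amount": amount}
        self.results[idempotency_key] = result
        return result
```

The client can now retry safely: at-least-once delivery from retries combined with at-most-once execution from the key gives the exactly-once effect.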
137
Payment security
A few weeks ago, I posted the high-level design for the payment
system. Today, I’ll continue the discussion and focus on payment
security.
138
System Design Interview Tip
One pro tip for acing a system design interview is to read the
engineering blog of the company you are interviewing with. You can
get a good sense of what technology they use, why the technology
was chosen over others, and learn what issues are important to
engineers.
139
Big data evolvement
I hope everyone has a great time with friends and family during the
holidays. If you are looking for some readings, classic engineering
papers are a good start.
A lot of times when we are busy with work, we focus only on scattered
information that tells us “how” and “what”, just enough to meet our
immediate needs and get things done.
However, reading the classics helps us know “why” behind the scenes,
and teaches us how to solve problems, make better decisions, or even
contribute to open source projects.
The big data area has progressed a lot over the past 20 years. It started
from 3 Google papers (see the links in the comment), which tackled
real engineering challenges at Google scale:
140
last generation. For example, “Hive - support SQL” means Hive was
trying to solve the lack of SQL in MapReduce.
If you want to learn more, you can refer to the papers for details. What
other classics would you recommend?
141
Quadtree
In this post, let’s explore another data structure to find nearby
restaurants on Yelp or Google Maps.
142
Quadtree is an 𝐢𝐧-𝐦𝐞𝐦𝐨𝐫𝐲 𝐝𝐚𝐭𝐚 𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞 and it is not a database
solution. It runs on each LBS (Location-Based Service, see last week’s
post) server, and the data structure is built at server start-up time.
- After the quadtree is built, start searching from the root and traverse
the tree, until we find the leaf node where the search origin is.
- If that leaf node has 100 businesses, return the node. Otherwise, add
businesses from its neighbors until enough businesses are returned.
- While the quadtree is being built, the server cannot serve traffic.
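The build and search steps above can be sketched as follows. This is a simplified, hypothetical quadtree over (x, y) points: a leaf splits into four children once it holds more than `capacity` points, and `find_leaf` walks from the root to the leaf containing the search origin. Collecting businesses from neighboring leaves is left out.

```python
class QuadTree:
    """Hypothetical quadtree; a leaf splits once it exceeds `capacity` points."""
    def __init__(self, x0, y0, x1, y1, capacity=100):
        self.bounds = (x0, y0, x1, y1)
        self.capacity = capacity
        self.points = []
        self.children = None

    def _mid(self):
        x0, y0, x1, y1 = self.bounds
        return (x0 + x1) / 2, (y0 + y1) / 2

    def _child_for(self, x, y):
        mx, my = self._mid()
        return self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]

    def insert(self, x, y):
        if self.children is not None:
            return self._child_for(x, y).insert(x, y)
        self.points.append((x, y))
        # The size guard stops endless splitting when many points coincide.
        if len(self.points) > self.capacity and self.bounds[2] - self.bounds[0] > 1e-6:
            self._split()

    def _split(self):
        x0, y0, x1, y1 = self.bounds
        mx, my = self._mid()
        c = self.capacity
        self.children = [QuadTree(x0, y0, mx, my, c), QuadTree(mx, y0, x1, my, c),
                         QuadTree(x0, my, mx, y1, c), QuadTree(mx, my, x1, y1, c)]
        points, self.points = self.points, []
        for x, y in points:
            self._child_for(x, y).insert(x, y)

    def find_leaf(self, x, y):
        """Traverse from the root to the leaf that contains the search origin."""
        node = self
        while node.children is not None:
            node = node._child_for(x, y)
        return node
```

Dense areas end up with small leaves and sparse areas with large ones, so every leaf holds a manageable number of businesses.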
143
How do we find nearby restaurants on Yelp?
- 𝐁𝐮𝐬𝐢𝐧𝐞𝐬𝐬 𝐒𝐞𝐫𝐯𝐢𝐜𝐞
- Add/delete/update restaurant information
- Customers view restaurant details
- 𝐋𝐨𝐜𝐚𝐭𝐢𝐨𝐧-𝐛𝐚𝐬𝐞𝐝 𝐒𝐞𝐫𝐯𝐢𝐜𝐞 (𝐋𝐁𝐒)
- Given a radius and location, return a list of nearby restaurants
How are the restaurant locations stored in the database so that LBS
can return nearby restaurants efficiently?
First, divide the planet into four quadrants along the prime
meridian and equator:
Second, divide each grid into four smaller grids. Each grid can be
represented by alternating between longitude bit and latitude bit.
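The alternating-bit encoding can be sketched by repeatedly halving the longitude and latitude ranges, which is the scheme geohash uses. The function name is made up, and the sketch stops at the bit string rather than producing a base32 geohash.

```python
def geohash_bits(lat, lon, length):
    """Interleave longitude and latitude bits, longitude first, by repeated halving."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits = []
    for i in range(length):
        value, rng = (lon, lon_range) if i % 2 == 0 else (lat, lat_range)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append("1")
            rng[0] = mid      # keep the upper half
        else:
            bits.append("0")
            rng[1] = mid      # keep the lower half
    return "".join(bits)
```

Nearby locations usually share a long common prefix, which is what lets a prefix lookup find nearby restaurants.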
145
One picture is worth more than a thousand words. Log4j from attack to
prevention in one illustration.
Credit GovCERT
Link:
https://www.govcert.ch/blog/zero-day-exploit-targeting-popular-java-library-log4j/
146
How does a modern stock exchange achieve
microsecond latency?
147
- deploy all the components in a single giant server (no containers)
148
Match buy and sell orders
Stocks go up and down. Do you know what data structure is used to
efficiently match buy and sell orders?
Stock exchanges use the data structure called 𝐨𝐫𝐝𝐞𝐫 𝐛𝐨𝐨𝐤𝐬. An order
book is an electronic list of buy and sell orders, organized by price
levels. It has a buy book and a sell book, where each side of the book
contains a bunch of price levels, and each price level contains a list of
orders (first in first out).
So what happens when you place a market order to buy 2700 shares
in the diagram?
- The buy order is matched with all the sell orders at price 100.10,
and the first order at price 100.11 (illustrated in light red).
- Now because of the big buy order which “eats up” the first price level
on the sell book, the best ask price goes up from 100.10 to 100.11.
- So when the market is bullish, people tend to buy stocks, and the
price goes up and up.
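The matching walk-through can be sketched with a sell book keyed by price, where each price level holds a FIFO queue of order quantities. The quantities in the usage note are made up, since the diagram's numbers are not in the text, and prices are stored as integer cents to avoid floating-point issues.

```python
from collections import deque

def match_market_buy(sell_book, quantity):
    """Fill a market buy against the sell book, lowest price first, FIFO within a level.

    sell_book maps price -> deque of order quantities. Returns a list of (price, filled_qty).
    """
    fills = []
    for price in sorted(sell_book):
        orders = sell_book[price]
        while orders and quantity > 0:
            take = min(orders[0], quantity)
            fills.append((price, take))
            quantity -= take
            if take == orders[0]:
                orders.popleft()      # order fully filled
            else:
                orders[0] -= take     # order partially filled
        if not orders:
            del sell_book[price]      # price level is exhausted
        if quantity == 0:
            break
    return fills
```

With a hypothetical book of `{10010: deque([1500, 600]), 10011: deque([600, 900])}`, a market buy of 2700 shares consumes the whole 100.10 level and the first order at 100.11, so the best ask moves up to 100.11.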
150
Stock exchange design
The stock market has been volatile recently.
Step 1: client places an order via the broker’s web or mobile app.
Step 3: the exchange client gateway performs operations such as
validation, rate limiting, authentication, normalization, etc., and sends
the order to the order manager.
Step 4: the order manager performs risk checks based on rules set by
the risk manager.
Step 5: once risk checks pass, the order manager checks if there is
enough balance in the wallet.
Step 6-7: the order is sent to the matching engine. The matching
engine sends back the execution result if a match is found. Both order
and execution results need to be sequenced first in the sequencer so
that matching determinism is guaranteed.
Step 8-10: the execution result is passed all the way back to the client.
Step 11-12: market data (including the candlestick chart and order
book) are sent to the data service for consolidation. Brokers query the
data service to get the market data.
Step 13: the reporter composes all the necessary reporting fields (e.g.
client_id, price, quantity, order_type, filled_quantity,
remaining_quantity) and writes the data to the database for
persistence.
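To make the sequencing idea in Step 6-7 concrete, here is a minimal sketch of a sequencer in Python. It assumes a single in-process counter and an in-memory log; the names `Event`, `Sequencer`, and `stamp` are hypothetical, and a real exchange would persist the log and handle failover.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Event:
    kind: str        # "order" or "execution"
    payload: dict
    seq: int = -1    # assigned by the sequencer


class Sequencer:
    """Stamps every event with a strictly increasing sequence number."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.log = []  # ordered record of everything that was stamped

    def stamp(self, event):
        event.seq = next(self._counter)
        self.log.append(event)
        return event


sequencer = Sequencer()
order = sequencer.stamp(Event("order", {"order_id": "o1", "quantity": 100}))
execution = sequencer.stamp(Event("execution", {"order_id": "o1", "filled_quantity": 100}))
print(order.seq, execution.seq)
```

Because every order and execution carries a unique position in one total order, replaying the log in sequence order feeds the matching engine the same inputs in the same order, which is what makes the matching result reproducible.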
Design a payment system
Today is Cyber Monday. Here is how money moves when you click the
Buy button on Amazon or any of your favorite shopping websites.
I posted the same diagram last week for an overview and a few people
asked me about the detailed steps, so here you go:
5. The payment executor calls an external PSP to finish the credit card
payment.
7. The wallet server stores the updated balance information in the
database.
8. After the wallet service has successfully updated the seller’s balance
information, the payment service will call the ledger to update it.
10. Every night the PSP or banks send settlement files to their clients.
The settlement file contains the balance of the bank account, together
with all the transactions that took place on this bank account during the
day.
Design a flash sale system
Black Friday is coming. Designing a system with extremely high
concurrency, high availability and quick responsiveness requires
considering many aspects 𝐚𝐥𝐥 𝐭𝐡𝐞 𝐰𝐚𝐲 𝐟𝐫𝐨𝐦 𝐟𝐫𝐨𝐧𝐭𝐞𝐧𝐝 𝐭𝐨 𝐛𝐚𝐜𝐤𝐞𝐧𝐝. See the
picture below for details:
𝐃𝐞𝐬𝐢𝐠𝐧 𝐩𝐫𝐢𝐧𝐜𝐢𝐩𝐥𝐞𝐬:
1. Less is more - fewer elements on the web page, fewer data
queries to the database, fewer web requests, fewer system
dependencies
2. Short critical path - fewer hops among services or merge into
one service
3. Async - use message queues to handle high TPS
4. Isolation - isolate static and dynamic contents, isolate processes
and databases for rare items
5. Overselling is bad. When to decrease the inventory is important
6. User experience is important. We definitely don’t want to inform
users that they have successfully placed orders but later tell
them no items are actually available.
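Principles 5 and 6 can be illustrated with a small sketch of an inventory that never oversells. It is a single-process simplification under stated assumptions: the class `Inventory` and the helper `run_flash_sale` are hypothetical, and a lock stands in for whatever atomic decrement the real datastore provides. The point is that the stock check and the decrement happen as one atomic step, and a buyer is told "success" only after the reservation has actually been made.

```python
import threading


class Inventory:
    """Stock counter whose check-and-decrement is a single atomic step."""

    def __init__(self, stock):
        self._stock = stock
        self._lock = threading.Lock()

    def try_reserve(self, qty=1):
        """Return True only if the items were actually reserved."""
        with self._lock:
            if self._stock >= qty:
                self._stock -= qty
                return True
            return False

    @property
    def remaining(self):
        with self._lock:
            return self._stock


def run_flash_sale(stock, buyers):
    """Simulate `buyers` concurrent purchase attempts against `stock` items."""
    inventory = Inventory(stock)
    results = []
    results_lock = threading.Lock()

    def buy():
        ok = inventory.try_reserve()
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=buy) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), inventory.remaining


sold, left = run_flash_sale(stock=10, buyers=100)
print(sold, left)  # never sells more than the available stock
```

Without the lock, two buyers could both see the last item as available and both decrement it, which is exactly the overselling case the list warns about.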
Back-of-the-envelope estimation
Recently, a few engineers asked me whether we really need
back-of-the-envelope estimation in a system design interview. I think it
would be helpful to clarify.
A better approach, in this case, is to have a series of read replicas to
help with the read load. This method is much simpler to develop and
maintain. Thus, we recommend scaling the geospatial index table
through replicas.”