System Design
This is a detailed blog that covers all the topics you need to master to solve
any system design problem in interviews. You will not need to study any
more theory except this blog. Directly start attempting interview problems
after reading this blog.
For system design, in most places, you will only see theoretical stuff, but in
this blog, I have tried to show the practical implementation of a lot of things
so you will not just be preparing for interviews but also know how these
things are being used in the real world.
2. What is a Server?
5. Auto Scaling
6. Back-of-the-envelope Estimation
7. CAP Theorem
8. Scaling of Database
– Indexing
– Partitioning
– Master Slave Architecture
– Multi-master Setup
– Database Sharding
– Disadvantage of sharding
12. Caching
– Caching Introduction
– Benefits of Caching
– Types of Caches
– Redis Deep Dive
25. Proxy
– What is Proxy?
– Forward Proxy
– Reverse Proxy
– Building our own Reverse Proxy
Note: Whenever I draw a box for a client, that client can be anything: a ReactJS application, an Android application, an iOS application, or anything a normal person uses on their device (laptop, phone, etc).
What is a Server?
Many of you might already know what a server is, but this blog is for
beginners, so I am explaining it.
For external websites, you type https://abc.com . When you hit this in your
browser, the following things happen:
1. Your browser sends abc.com to a DNS (Domain Name System) resolver to find the IP address of the server corresponding to this domain. Every server has an IP address, a unique address that identifies that device on the network.
2. Once your browser has the server's IP address, it uses that address to send the request to the server.
3. The server receives the request. Multiple applications run on a server (just as multiple applications, such as Google Chrome and Netflix, run on your laptop at the same time). The server finds the correct application with the help of the port number, and then that application returns the response.
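To make the port idea concrete, here is a minimal Node.js sketch of an application bound to a port (3000 is an arbitrary choice):

// Minimal Node.js server: requests arriving on port 3000 are delivered to this process.
const http = require('node:http');

const server = http.createServer((req, res) => {
  res.end('Hello from the application listening on port 3000');
});

server.listen(3000);

Another application on the same machine could listen on a different port, say 8080, and the server would route each incoming request to the right process by its port.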
Doing all this and managing the server on your own is a pain, so people
generally rent servers from Cloud Providers such as AWS, Azure, GCP, etc.
They give you a virtual machine where you can run your application, and that machine has a public IP address attached to it, so you can reach it from anywhere. In AWS, this virtual machine is called an EC2 instance. Putting your application code from your local laptop onto the virtual machine of a cloud provider is called deployment.
Exercise for you: If you want to see this in action, how to deploy a simple
application in AWS, then read this blog.
Latency
Latency is the time taken for a single request to travel from the client to the
server and back (or a single unit of work to complete). It is usually measured
in milliseconds (ms).
Loading a webpage: If it takes 200ms for the server to send the page data
back to the browser, the latency is 200ms.
In simple terms, if a website loads faster, then it takes less time, so it has low
latency. If it loads slower, then it takes more time, so it has high latency.
Round Trip Time (RTT): The total time it takes for a request to go to the server and for the response to come back. RTT is sometimes used interchangeably with latency.
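As a rough illustration (not a precise benchmark), you can observe the RTT of a single request yourself. This small sketch assumes Node 18+ with the built-in fetch and uses an arbitrary URL:

// Measure the round trip time of one HTTP request.
(async () => {
  const start = performance.now();
  await fetch('https://example.com'); // replace with any reachable URL
  const rtt = performance.now() - start;
  console.log(`Round trip took ~${rtt.toFixed(0)} ms`);
})();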
Throughput
Throughput is the number of requests or units of work the system can
handle per second. It is typically measured in requests per second (RPS) or
transactions per second (TPS).
Every server has a limit, which means it can handle only X requests per second. Pushing more load than that can choke it, or it may go down.
High Throughput: The system can process many requests at the same
time.
In the ideal case, we want to make a system whose throughput is high, and
latency is low.
Example:
Latency: The time it takes for a car to travel from one point to another (e.g.,
10 minutes).
Throughput: The number of cars that can travel on a highway in one hour
(e.g., 1,000 cars).
In short, you can relate this to a mobile phone: when you buy a cheap phone with little RAM and storage, it hangs when you run heavy games or a lot of applications simultaneously. The same happens with an EC2 instance: when a lot of traffic arrives at the same time, it starts choking, and that is when we need to scale our system.
Types of Scaling
It is of 2 types:
1. Vertical Scaling
2. Horizontal Scaling
Vertical scaling means increasing the capacity of a single machine (more CPU, RAM, storage). This type of scaling is mostly used in SQL databases and in stateful applications because it is difficult to maintain consistency of state in a horizontally scaled setup.
Vertical scaling has a limit, though; you can't keep upgrading a single machine forever. The solution is to add more machines and distribute the incoming load. This is called horizontal scaling.
Ex: We have 8 clients and 2 machines and want to distribute our load equally, so the first 4 clients could hit machine-1 and the next 4 clients could hit machine-2. But clients are not smart; we can't give them 2 different IP addresses and let them decide which machine to hit, because they don't know about our system. Instead, we put a load balancer in between. All the clients hit the load balancer, and the load balancer is responsible for routing the traffic to the least busy server.
In this setup, Clients don’t make requests directly to the server. Instead, they
send requests to the load balancer. The load balancer takes the incoming
traffic and transfers it to the least busy machine.
Below is the picture where you can see that 3 clients are making requests,
and the load balancer distributes the load in 3 EC2 instances equally.
Exercise for you: If you want to see this in action, how to do horizontal scaling
and set up a load balancer using AWS, then read this blog.
Auto Scaling
Suppose you started a business and made it online. You rented an EC2 server
to deploy your application. If a lot of users come to your website at the same
time, then your website may crash because the EC2 server has limitations
(CPU, RAM, etc) on serving a certain number of users concurrently at one
time. At this time, you will do horizontal scaling and increase the number of
EC2 instances and attach a load balancer.
Suppose one EC2 machine can serve 1,000 users without choking, but the number of users on our website is not constant. Some days we have 10,000 users, which can be served by 10 instances; other days we have 100,000 users, and then we need 100 EC2 instances. One solution might be to keep the maximum number (100) of EC2 instances running all the time. That way, we can serve all users at all times without any problem, but during low-traffic periods we are wasting money on extra instances: we only need 10 instances, yet we are running 100 all the time and paying for them.
The best solution is to run only the required number of instances at any time, and to add some mechanism so that if the CPU usage of an EC2 instance crosses a certain threshold (say 90%), another instance is launched and traffic is distributed to it without any manual work. Changing the number of servers dynamically based on traffic is called Auto Scaling.
Note: These numbers are all hypothetical to make you understand the topic.
If you want to find the actual threshold, then you can do Load Testing on
your instance.
Exercise for you: If you want to see this in action, how to configure Auto
Scaling using AWS, then read this blog.
Back-of-the-envelope Estimation
Above, in horizontal scaling, we saw that we need more servers to handle the load. In back-of-the-envelope estimation, we estimate the number of servers, storage, etc., needed.
There are many things you can calculate here, but I prefer calculating only the following:
1. Load Estimation
2. Storage Estimation
3. Resource Estimation
Load Estimation
Here, ask for DAU (Daily Active Users). Then, calculate the number of reads
and the number of writes.
Suppose Twitter has 100 million daily active users, and each user posts 10 tweets per day.
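Working that out (the numbers come from the example above; only the seconds-per-day conversion is added):

// Back-of-the-envelope write load for the Twitter example.
const dau = 100_000_000;                         // 100 million daily active users
const tweetsPerUserPerDay = 10;

const writesPerDay = dau * tweetsPerUserPerDay;  // 1,000,000,000 tweets/day
const writesPerSecond = writesPerDay / 86_400;   // 86,400 seconds in a day

console.log(writesPerDay);                       // 1e9
console.log(Math.round(writesPerSecond));        // ~11,574 writes/sec

If reads are, say, 10x the writes (an assumption, not a number from the example), that would be on the order of 100k reads per second.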
Storage Estimation
Tweets are of two types:
Normal tweet and tweet with a photo. Suppose only 10% of the tweets
contain photos.
Let one tweet comprise 200 characters. One character takes 2 bytes, and one photo is 2 MB.
Size of one tweet without a photo = 200 characters * 2 bytes = 400 bytes ≈ 500 bytes
Tweets per day = 100 million users * 10 tweets = 1 billion tweets
Text storage per day = 1 billion tweets * 500 bytes ≈ 500 GB ≈ 0.5 TB
Photo storage per day = 10% of 1 billion tweets * 2 MB = 100 million * 2 MB ≈ 200 TB
=> roughly 0.5 TB of text + 200 TB of photos per day, so photos dominate the storage.
Resource Estimation
Here, calculate the total number of CPUs and servers required.
Assuming we get 10 thousand requests per second and each request takes 10 ms of CPU time to process, the total CPU time needed per second = 10,000 * 10 ms = 100,000 ms.
Assuming each core of the CPU can handle 1,000 ms of processing per second, the total no. of cores required:
=> 100,000 / 1,000 = 100 cores.
If each server has, say, 4 cores, that is roughly 25 servers.
CAP Theorem
This theorem states a very important tradeoff while designing any system.
C: Consistency — every read sees the most recent write, i.e., all nodes return the same data.
A: Availability — every request gets a response, even if some nodes are down.
P: Partition Tolerance — the system keeps working even when the network between nodes breaks.
With multiple servers, we can spread the workload & handle more
requests simultaneously, improving overall performance.
Keep Databases in different locations and serve data from the nearest
location to the user. It reduces the time of access & retrieval.
An individual server that is part of the overall distributed system is called a Node.
In this system, we replicate the same data across different servers to get the
above benefits.
You can see this in the picture below. The same data is stored in multiple
database servers (nodes) kept in different locations in India.
If data is added in one of the nodes then it gets replicated into all the other
nodes automatically. How this replication happens, we will talk about it later
in this blog.
CA — Possible
AP — Possible
CP — Possible
CAP — Impossible
Scaling of Database
You normally have one database server; your application server queries from
this DB and gets the result.
When you reach a certain scale, this database server starts giving slow responses or may go down because of its limitations. How do we scale the database in that situation? That is what we will study in this section.
We will scale our database step by step: if we have only 10k users, then scaling it to support 10 million is a waste; it's over-engineering. We will only scale up to the limit that is sufficient for our business.
Suppose you have a database server, and inside that, there is a user table.
There are a lot of read requests fired from the application server to get the
user with a specific ID. To make the read request faster, do the following
thing:
Indexing
Before indexing, the database checks each row in the table to find your data.
This is called a full table scan, and it can be slow for large tables. It takes
O(N) time to check each id.
With indexing, the database uses the index to directly jump to the rows you
need, making it much faster.
If you make the "id" column indexed, the database keeps a copy of that column in a data structure called a B-tree. It uses the B-tree to search for a specific id. Searching is faster here because the IDs are stored in sorted order, so something like binary search can be applied to find an id in O(log N).
If you want to enable indexing on any column, you just need to add one line of SQL, and all the overhead of creating and maintaining the B-tree is handled by the database. You don't need to worry about anything.
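As a minimal sketch of that one line, assuming PostgreSQL, the pg client (npm install pg), and a users table with an id column (a primary key column already gets an index automatically; this just mirrors the example):

const { Client } = require('pg');

async function addIndex() {
  const client = new Client({ connectionString: process.env.DATABASE_URL }); // assumed env var
  await client.connect();

  // The single line that enables indexing; the database builds and maintains the B-tree itself.
  await client.query('CREATE INDEX IF NOT EXISTS idx_users_id ON users (id)');

  // EXPLAIN shows whether the planner now uses an index scan instead of a full table (sequential) scan.
  const { rows } = await client.query('EXPLAIN SELECT * FROM users WHERE id = 4');
  console.log(rows);

  await client.end();
}

addIndex().catch(console.error);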
Partitioning
Partitioning means breaking the big table into multiple small tables.
You can see that we have broken the users table into 3 tables:
- user_table_1
- user_table_2
- user_table_3
You may be wondering how we know which table to query, since before partitioning we could simply run SELECT * FROM users WHERE id = 4. Don't worry, you can still run the same query. Behind the scenes, PostgreSQL is smart: it will find the appropriate partition and give you the result. You can also handle this routing at the application level if you want.
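For reference, here is a minimal sketch of how such a partitioned users table can be declared in PostgreSQL (version 11+ assumed; the id ranges are arbitrary, chosen only to mirror the three tables above), again using the pg client:

const { Client } = require('pg');

async function createPartitionedTable() {
  const client = new Client({ connectionString: process.env.DATABASE_URL }); // assumed env var
  await client.connect();

  await client.query(`
    CREATE TABLE users (id BIGINT PRIMARY KEY, name TEXT) PARTITION BY RANGE (id);
    CREATE TABLE user_table_1 PARTITION OF users FOR VALUES FROM (1)       TO (1000001);
    CREATE TABLE user_table_2 PARTITION OF users FOR VALUES FROM (1000001) TO (2000001);
    CREATE TABLE user_table_3 PARTITION OF users FOR VALUES FROM (2000001) TO (3000001);
  `);

  // Same query as before partitioning; PostgreSQL routes it to user_table_1 behind the scenes.
  const { rows } = await client.query('SELECT * FROM users WHERE id = 4');
  console.log(rows);

  await client.end();
}

createPartitionedTable().catch(console.error);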
Master Slave Architecture
Use this method when you hit a bottleneck: even after indexing, partitioning and vertical scaling, your queries are slow, or your database can't handle any more requests on a single server.
In this setup, you replicate the data into multiple servers instead of one.
When you do any read request, your read request (SELECT queries) will be
redirected to the least busy server. In this way, you distribute your load.
But all the Write requests (INSERT, UPDATE, DELETE) will only be processed
by one server.
The node (server) which processes the write request is called the Master
Node.
Nodes (servers) that take the read requests are called Slave Nodes.
When you make a Write Request, it is processed and written in the master
node, and then it asynchronously (or synchronously depending upon
configuration) gets replicated to all the slave nodes.
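From the application side, this often looks like simple read/write splitting. A minimal sketch, assuming two PostgreSQL endpoints exposed through hypothetical PRIMARY_URL and REPLICA_URL environment variables and the pg client:

const { Pool } = require('pg');

// Writes go to the master; reads go to a replica (slave).
const master = new Pool({ connectionString: process.env.PRIMARY_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_URL });

const getUser = (id) =>
  replica.query('SELECT * FROM users WHERE id = $1', [id]);            // SELECTs hit a slave

const createUser = (id, name) =>
  master.query('INSERT INTO users (id, name) VALUES ($1, $2)', [id, name]); // writes hit the master

Note that with asynchronous replication, a read that hits a replica immediately after a write may briefly see stale data; that is the consistency trade-off discussed later in this blog.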
Multi-master Setup
When write queries become slow or one master node cannot handle all the
write requests, then you do this.
Ex: A very common thing is to put two master nodes, one for North India
and another for South India. All the write requests coming from North India
are processed by North-India-DB, and all the write requests coming from
South India are processed by South-India-DB, and periodically, they sync (or
replicate) their data.
In a multi-master setup, the most challenging part is how you handle conflicts. If two different values exist for the same ID in the two masters, you have to decide in code how to resolve it: accept both, overwrite the older value with the latest one, merge them, etc. There is no fixed rule here; it totally depends on the business use case.
Database Sharding
Sharding is a very complex thing. Try to avoid this in practical life and only
do this when all the above things are not sufficient and you require further
scaling.
Note: The sharding key should distribute data evenly across shards to avoid
overloading a single shard.
Sharding Strategies
1. Range-Based Sharding:
Data is divided into shards based on ranges of values in the sharding key.
Example:
Shard 1: Users with user_id 1–1000 .
3. Geographic/Entity-Based Sharding:
Data is divided based on a logical grouping, like region or department.
Example:
Shard 1: Users from America.
Shard 2: Users from Europe.
Pros: Useful for geographically distributed systems.
Cons: Some shards may become “hotspots” with uneven traffic.
4. Directory-Based Sharding:
A mapping directory keeps track of which shard contains specific data.
Example: A lookup table maps user_id ranges to shard IDs.
Pros: Flexibility to reassign shards without changing application logic.
Cons: The directory can become a bottleneck.
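To make the routing concrete, here is a minimal application-side sketch of range-based sharding, following the "user_id 1–1000 on shard 1" example above (the ranges and shard names are assumptions):

// Each entry maps an upper bound of user_id to a shard (in practice, a connection pool).
const SHARD_RANGES = [
  { maxUserId: 1000, shard: 'shard-1' },
  { maxUserId: 2000, shard: 'shard-2' },
  { maxUserId: Infinity, shard: 'shard-3' },
];

function shardFor(userId) {
  return SHARD_RANGES.find((range) => userId <= range.maxUserId).shard;
}

console.log(shardFor(4));    // shard-1
console.log(shardFor(1500)); // shard-2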
Disadvantage of sharding
1. Difficult to implement because you have to write the logic yourself to
know which shard to query from and in which shard to write the data.
2. Partitions are kept in different servers (called shards). So, when you
perform joins, then you have to pull out data from different shards to do
the join with different tables. It is an expensive operation.
3. You lose consistency, since different parts of the data live on different servers and keeping them in sync is difficult.
When you have write-heavy traffic, do sharding, because the entire data can't fit in one machine. Just try to avoid cross-shard queries.
If you have read-heavy traffic but the master-slave architecture becomes slow or can't handle the load, then you can also shard and distribute the load. But that generally happens only at a very large scale.
SQL Database
Data is stored in the form of tables.
It has a predefined schema, which means the structure of the data (the
tables, columns, and their data types) must be defined before inserting
data.
NoSQL Database
It is categorized into 4 types:
– Document-based: Stores data in documents, like JSON or BSON. Ex:
MongoDB.
– Key-value stores: Stores data in key-value pairs. Ex: Redis, AWS
DynamoDB
– Column-family stores: Stores data in columns rather than rows. Ex:
Apache Cassandra
– Graph databases: Focuses on relationships between data as it is stored
like a graph. Useful in social media applications like creating mutual
friends, friends of friends, etc. Ex: Neo4j.
It has a flexible schema, which means we can insert new data types or
fields which may not be defined in the initial schema.
Sharding can also be done in a SQL DB, but we generally avoid it because we use SQL DBs for ACID guarantees, and ensuring data consistency becomes very difficult when data is distributed across multiple servers; querying data with JOINs across shards is also complex and expensive.
Microservices
Ex: For an e-commerce app, suppose you break it into the following services:
— User Service
— Product Service
— Order Service
— Payment Service
Most startups start with a monolith because, in the beginning, only 2–3 people work on the tech side; they eventually move to microservices as the number of teams increases.
You can scale each service independently. Suppose the product service has
more traffic and requires 3 machines, user service requires 2 machines, and
payment service is sufficient with 1 machine, then you can do this also. See
the below picture
API Gateway provides several other advantages as well:
— Rate Limiting
— Caching
— Security (Authentication and authorization)
Load Balancer acts as the single point of contact for the clients. They request
the Domain Name of the Load Balancer, and the load balancer redirects to
one of the servers which is least busy.
Exercise for you: I would highly recommend you read this blog to see how it is
implemented.
What algorithm does the load balancer follow to decide which server to send the traffic to? We will study this in the load balancer algorithms below.
1. Round Robin
How it works: Requests are distributed to the servers one by one in circular order.
Disadvantages:
2. Weighted Round Robin
How it works: Similar to Round Robin, but servers are assigned weights based on their capacity. Servers with higher weights receive more requests.
You can see the request numbers in the picture below to understand how it works. In that picture, the 3rd server is bigger (more RAM, storage, etc.), so twice as many requests go to it as to the 1st and 2nd.
Advantages:
Disadvantages:
3. Least Connections
How it works: Directs traffic to the server with the fewest active connections.
The connection here can be anything like HTTP, TCP, WebSocket etc. Here,
the load balancer will redirect traffic to the server which has the least active
connection with the load balancer.
Advantages:
Disadvantages:
May not work well with servers handling connections with varying
durations.
4. Hash-Based Algorithm
How it works: The load balancer takes some input, such as the client's IP or user_id, and hashes it to pick the server. This ensures a specific client is consistently routed to the same server.
Advantages:
Disadvantages:
Exercise for you: I would highly recommend you read this blog to see how a load balancer is configured. I would also encourage you to code your own load balancer from scratch using any of the above algorithms in any programming language, such as Go, NodeJS, etc.
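If you attempt that exercise, the core of a load balancer is just a "pick a server" function. Here is a minimal sketch of two of the strategies above, with a hypothetical server list:

const crypto = require('node:crypto');

const SERVERS = ['http://10.0.0.1:3000', 'http://10.0.0.2:3000', 'http://10.0.0.3:3000'];

// Round Robin: cycle through the servers one by one.
let next = 0;
function pickRoundRobin() {
  const server = SERVERS[next];
  next = (next + 1) % SERVERS.length;
  return server;
}

// Hash-based: the same client IP always maps to the same server.
function pickByHash(clientIp) {
  const digest = crypto.createHash('md5').update(clientIp).digest();
  return SERVERS[digest.readUInt32BE(0) % SERVERS.length];
}

console.log(pickRoundRobin());      // cycles 1 -> 2 -> 3 -> 1 ...
console.log(pickByHash('1.2.3.4')); // always the same server for this IP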
Caching
Caching Introduction
Caching is the process of storing frequently accessed data in a high-speed
storage layer so that future requests for that data can be served faster.
Ex: Suppose some data is taking 500ms to fetch from MongoDB Database,
then it takes 100 ms to do some calculations in the backend on that data and
finally send it to the client. So, in total, it takes 600 ms for the client to get the
data. If we cache this calculated data and store it in a high-speed store like
Redis and serve it from there, then we can reduce the time from 600 ms to 60
ms. (These are hypothetical numbers).
Example: Let's take the example of a blog website. When we hit the route /blogs, we get all the blogs. If a user hits this route for the first time, there is no data in the cache, so we have to get the data from the database; suppose its response time is 800 ms. We then store this data in Redis. The next time a user hits this route, they get the data from Redis, not from the database, and the response time is 20 ms. When a new blog is added, we have to somehow remove the old value of the blogs from Redis and replace it with the new one. This is called cache invalidation. There are many ways to do cache invalidation. We can set an expiry time (Time To Live, TTL): after 24 hours, Redis deletes the blogs, and when the first request comes from any user after those 24 hours, they get the data from the DB. After that, it is cached again for the next requests.
Benefits of Caching:
Improved Performance: Reduces latency for end-users.
Types of Caches
1. Client-Side Cache:
- Stored on the user’s device (e.g., browser cache).
- Reduces server requests and bandwidth usage.
- Examples: HTML, CSS, JavaScript files.
2. Server-Side Cache:
- Stored on the server.
- Examples: In-memory caches like Redis or Memcached.
3. CDN Cache:
- Used for static content delivery (HTML, CSS, PNG, MP4 etc files).
- Cached in geographically distributed servers.
- Examples: AWS CloudFront, Cloudflare CDNs
4. Application-Level Cache:
- Embedded within application code.
- Caches intermediate results or database query results.
This was just the introduction. We will go deep dive into each type of cache.
In-memory means data is stored in RAM. And if you have basic knowledge of
computer science, then you know that reading and writing data from RAM is
extremely fast compared to disk.
Databases store data on disk, and reading/writing from disk is very slow compared to Redis.
One question you might be asking: if Redis is so fast, then why use a database at all? Can't we rely on Redis to store all the data?
Ans) Redis stores data in RAM, and RAM is very small compared to disk. If you have done competitive programming on LeetCode or Codeforces, you might sometimes have seen a "Memory Limit Exceeded" error. In the same way, if we store too much data in Redis, we can run out of memory.
Redis stores data in key-value pairs. Just as we access data from tables in a database, in Redis we access data by keys.
Values can be of any data type like string, list, etc, as shown in the figure
below.
There are more data types also, but the above ones are mostly used.
I will be showing all the things about Redis on CLI, but you can configure all
the corresponding things in any application, such as NodeJS, Springboot,
and Go.
Run the below command to install and run Redis on your local laptop.
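If you use Docker, one common way (an assumption, not the only way) to get Redis running locally and open its CLI is:

docker run --name local-redis -p 6379:6379 -d redis
docker exec -it local-redis redis-cli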
1. String
SET key value NX: Stores a string value only if the key doesn’t already
exist
2. List
LPOP key: Pops left value and returns it. RPOP key: Pops right value and
returns it.
To make a queue, use only LPUSH and RPOP: push items from the left side and pop them from the right side (FIFO).
To make a stack, use only LPUSH and LPOP: push and pop items from the same side (LIFO).
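The same idea from application code, as a small sketch using the ioredis client (assuming Redis is running locally on the default port):

const Redis = require('ioredis');
const redis = new Redis(); // connects to localhost:6379

(async () => {
  await redis.lpush('jobs', 'job-1', 'job-2'); // list is now [job-2, job-1]

  const fromQueue = await redis.rpop('jobs');  // FIFO: returns 'job-1' (the oldest item)
  const fromStack = await redis.lpop('jobs');  // LIFO: returns 'job-2' (the newest remaining item)

  console.log(fromQueue, fromStack);
  redis.disconnect();
})();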
Try more commands and different datatypes on your own from the Redis
documentation.
Below is basic NodeJS usage. Feel free to try it in your preferred backend language, like Django, Go, etc.
I have coded the same blog example. If the /blog route is hit for the first time, the data comes from an apiCall or the database. After that, it is cached and served from Redis (used in middleware). Data in Redis is valid for 24 hours; after that, it automatically gets deleted from Redis.
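A minimal sketch of that kind of caching middleware, assuming Express and ioredis; the route, key name and the stand-in DB call are illustrative:

const express = require('express');
const Redis = require('ioredis');

const app = express();
const redis = new Redis();

// Stand-in for the slow database/API call described above.
const fetchBlogsFromDb = async () => [{ id: 1, title: 'Hello world' }];

// Middleware: serve from Redis on a cache hit, otherwise fall through to the route handler.
async function cache(req, res, next) {
  const cached = await redis.get('blogs');
  if (cached) return res.json(JSON.parse(cached)); // fast path (~20 ms in the example above)
  next();
}

app.get('/blog', cache, async (req, res) => {
  const blogs = await fetchBlogsFromDb();                       // slow path (~800 ms in the example)
  await redis.set('blogs', JSON.stringify(blogs), 'EX', 86400); // TTL: expire after 24 hours
  res.json(blogs);
});

app.listen(3000);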
The other way of caching is to write to the cache at the same time as the database: whenever the server writes to the database, it also writes to the cache (Redis).
Example: When there is a contest on Codeforces, whenever a user submits a solution, you update the rank list in the database and in Redis at the same time, so that users see the current ranking if you are serving the rankings list from Redis.
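A small sketch of that idea using a Redis sorted set for the rank list (the table, key and function names are made up for illustration; db stands for any SQL client):

const Redis = require('ioredis');
const redis = new Redis();

// Called whenever a submission is accepted: write to the DB and to Redis at the same time.
async function onAcceptedSubmission(db, userId, newScore) {
  await db.query('UPDATE contest_ranks SET score = $1 WHERE user_id = $2', [newScore, userId]); // hypothetical table
  await redis.zadd('contest:ranklist', newScore, userId); // sorted set keeps users ordered by score
}

// The rankings endpoint reads straight from Redis.
async function topTen() {
  return redis.zrevrange('contest:ranklist', 0, 9, 'WITHSCORES'); // highest scores first
}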
Nodejs: https://npmjs.com/package/ioredis
Django: https://pypi.org/project/django-redis/
Go: https://redis.uptrace.dev
Blob Storage
Files such as videos, images and PDFs can be represented as a bunch of 0s and 1s, and this binary representation is called a Blob (Binary Large Object). Storing the raw mp4 file in a database is not feasible, but storing its blob is easy because it's just a bunch of 0s and 1s.
The size of a Blob can be very big; a single mp4 video can be 1 GB. If you store this in a database like MySQL or MongoDB, your queries become too slow. You also need to take care of scaling, backups and availability when storing such a large volume of data. These are some of the reasons why we don't store Blob data in a database. Instead, we store it in Blob Storage, which is a managed service (managed service means its scaling, security, etc., are taken care of by a company such as Amazon, and we use it as a black box).
AWS S3
This is just an introduction to S3. If you want to learn it in detail with
examples and implementation, then you can follow my AWS Series.
S3 is used to store files (blob data) such as mp4, png, jpeg, pdf, html or any
kind of file that you can think of.
You can think of S3 as Google Drive, where you store all your files.
The best thing about S3 is that it's very cheap. Storing 1 GB of data in S3 is a lot cheaper than storing it in RDS (RDS is the DB service of AWS that provides various DBs such as PostgreSQL, MySQL, etc).
Features of S3:
Exercise for you: Your task is to make an application in your desired language
(such as nodejs, spring boot, fast API, etc) with an image upload feature.
When a user uploads an image, it gets stored in S3. Code this exercise, then
you will learn a lot about S3 rather than reading theoretical stuff.
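If you attempt it, the upload itself is only a few lines with the AWS SDK for JavaScript v3; the bucket name, region and key below are placeholders, and credentials are assumed to come from your AWS configuration:

const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const fs = require('node:fs');

const s3 = new S3Client({ region: 'ap-south-1' }); // placeholder region

async function uploadImage(localPath, key) {
  await s3.send(new PutObjectCommand({
    Bucket: 'my-example-bucket',      // placeholder bucket name
    Key: key,                         // e.g. 'uploads/profile-42.png'
    Body: fs.readFileSync(localPath), // the image bytes
    ContentType: 'image/png',
  }));
}

uploadImage('./photo.png', 'uploads/photo.png').catch(console.error);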
CDN Introduction
CDNs are the easiest way to scale static files (such as mp4, jpeg, pdf, png, etc.).
Suppose the files are stored in an S3 bucket located in the India region. When users from the USA, Australia or anywhere far from India request these files, it takes a lot of time to serve them. A request served from a nearby location is always faster than one served from far away. So we want these files to be stored close to the users for low latency, and that is what we do with the help of CDNs.
1. Edge Servers
Users are routed to the nearest edge server for faster delivery.
2. Origin Server
This is your main web server (e.g., AWS S3).
The CDN fetches content from here if it’s not already cached at the edge
server.
3. Caching
CDNs store copies of your static content (e.g., images, videos, HTML) in
their edge servers.
Cached content is served directly to users, reducing the need for frequent
requests to the origin server.
5. GeoDNS
CDNs use GeoDNS to route users to the nearest edge server based on their
geographic location.
Exercise for you: Your task is to configure AWS CloudFront CDN with an S3
bucket to see this in practice. You can follow my AWS Series. There we will
cover it.
Message Broker
Asynchronous Programming
Before that, we need to know what Synchronous Programming is. It means that whenever the client sends a request, the server processes the request and immediately sends back the response. Most of the things we build are synchronous in nature.
Here, we don’t directly assign the task to the worker; instead, we place a
message broker in between.
This message broker acts like a queue in between. The server puts the task as
a message in the queue, and the worker pulls it from the queue and
processes it. After processing is completed, the worker can delete the task
message from the queue.
The server which puts the message into the message queue is called the
producer, and the server that pulls and processes the message is called the
consumer (or worker).
2. It also gives a retry feature. If the worker fails midway, it can retry after some time because the message is still present in the message broker.
Message Queue
As the name suggests, it is a kind of queue where the producer puts the
message from one side, and the consumer pulls out the message to process
from the other side.
The main difference between a message queue and a message stream is that a message queue can have only one kind of consumer for one type of message.
Suppose you have a message queue where you put the metadata of the video
and let the consumer get the video from metadata and transcode it to various
formats (480p, 720p etc).
After the transcoder service has finished transcoding the video, it deletes the message from the message queue. Various message queues, such as RabbitMQ and AWS SQS, provide APIs to delete messages from the queue.
If one video transcoder server is not enough, then what will we do?
Ans) We will horizontally scale it. Whichever server is free can pick up a message from the message queue, process it and delete it from the queue after processing.
This way, we can also do parallel processing, meaning more than one message is handled at a time. If we have 3 consumers, as shown in the figure, then all three can work on 3 different messages at the same time.
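A minimal sketch of this producer/consumer flow, assuming RabbitMQ running locally and the amqplib client (npm install amqplib); the queue name and message shape are illustrative:

const amqp = require('amqplib');

(async () => {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  const queue = 'video-metadata';
  await channel.assertQueue(queue, { durable: true });

  // Producer: put the video metadata as a message into the queue.
  channel.sendToQueue(queue, Buffer.from(JSON.stringify({ videoId: 'v1', url: 's3://bucket/v1.mp4' })));

  // Consumer (transcoder): pull a message, process it, then ack so it is removed from the queue.
  channel.consume(queue, async (msg) => {
    const meta = JSON.parse(msg.content.toString());
    // ... transcode meta.url to 480p / 720p here ...
    channel.ack(msg); // acknowledged => deleted from the queue
  });
})();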
Message Stream
In this, for one message we can have more than one type of consumer.
I hope the difference between message queues and message streams is clear.
Exercise for you: Your task is to utilise RabbitMQ or AWS SQS (anyone) in your
project to see this in action.
Kafka also has very high throughput. This means you can dump a lot of data
simultaneously in Kafka, and it can handle it without crashing.
Ex: Suppose Uber is tracking drivers' locations, fetching each driver's location every 2 seconds and inserting it. If there are thousands of drivers and we do thousands of writes to the database every 2 seconds, the database might go down, because a DB's throughput (no. of operations per second) is low. Instead, we can put that data into Kafka every 2 seconds, because Kafka's throughput is very high. Every 10 minutes, a consumer takes the accumulated data from Kafka in bulk and writes it into the DB. This way, we hit the DB every 10 minutes, not every 2 seconds.
Kafka Internals
Producer: Publishes messages to Kafka. For example, for sending emails, the producer can publish a message like {"email": "...", "message": "..."} to Kafka.
Let’s consider a topic with four partitions and one consumer group with
three consumers (maybe 3 different servers part of the same consumer
group) subscribed to that topic.
In the picture below, you can see that we have two different consumer groups: one for video transcoding and the other for caption generation. You can also see that Kafka balanced all the partitions among the consumers of each group.
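As a rough sketch with the kafkajs client (npm install kafkajs), assuming Kafka on localhost:9092 and a topic named 'videos': every consumer group receives each message, and partitions are balanced among the consumers within a group.

const { Kafka } = require('kafkajs');
const kafka = new Kafka({ clientId: 'demo-app', brokers: ['localhost:9092'] });

async function produce() {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'videos',
    messages: [{ key: 'video-1', value: JSON.stringify({ videoId: 'video-1' }) }],
  });
}

async function consume(groupId) {
  const consumer = kafka.consumer({ groupId }); // e.g. 'video-transcoder' or 'caption-generator'
  await consumer.connect();
  await consumer.subscribe({ topic: 'videos', fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(groupId, 'got', message.value.toString(), 'from partition', partition);
    },
  });
}

produce().catch(console.error);
consume('video-transcoder').catch(console.error);
consume('caption-generator').catch(console.error);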
Exercise for you: Your task is to set up Kafka locally on your laptop and code
any application which involves Kafka in NodeJS (or any other framework) so
that you will get a better understanding. I am not showing code here, but I
have taught all the required theories. The coding part is easy when you
understand the theory.
Realtime Pubsub
In a message broker, whenever a publisher pushes a message to the broker, it remains in the broker until the consumer pulls it. How is the message pulled from the broker? All message brokers, like AWS SQS, provide APIs (or SDKs) to do this.
In short, in the message broker, consumers pull the message from the
broker, while the Pubsub broker pushes the message to the consumer.
One thing to note here is messages are not stored/retained in the Pubsub
broker. As soon as the Pubsub broker receives the message, it pushes it to all
the consumers who are subscribed to this channel and gets done with it. It
does not store anything.
Redis is not only used for caching but also for real-time Pubsub.
One use case is when you want to build a real-time chatting application. For
chatting applications, we use Websocket. But in a horizontally scaled
environment, there can be many servers connected to different clients, as
you can see in the below picture.
You need to somehow deliver client-1's message to server-2 so that server-2 can send it to client-3. You can do this via Redis Pubsub. See the picture below.
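A minimal sketch of this with ioredis; a connection in subscriber mode cannot publish, so two connections are used, and the channel name is arbitrary:

const Redis = require('ioredis');

const pub = new Redis(); // used by a server to publish incoming chat messages
const sub = new Redis(); // every server also subscribes to the channel

sub.subscribe('chat-room-1');
sub.on('message', (channel, raw) => {
  const msg = JSON.parse(raw);
  // Each server forwards the message to its own connected WebSocket clients here.
  console.log(`[${channel}]`, msg.from, msg.text);
});

// e.g. server-1 received this over a WebSocket from client-1:
pub.publish('chat-room-1', JSON.stringify({ from: 'client-1', text: 'hello' }));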
Exercise for you: Your task is to set up Redis locally (as we have done in the
caching section) and explore its Pubsub feature.
Event-Driven Architecture
Introduction to EDA
EDA is the short form for event-driven architecture.
What we can do is put the order-success information as a message in the message broker and let the inventory and notification services consume it asynchronously, without making the client wait. That is event-driven architecture: the producer puts the message (called an event) in the message broker and forgets about it. It is then the consumer's duty to process it.
Why use EDA?
1. Decoupling: Producers don’t need to know about consumers.
In the above example, without EDA everything was tightly coupled. If the Inventory Service goes down, it can affect the whole system, because the Order Service calls it directly. With EDA, the Order Service has nothing to do with the Inventory Service.
1. Event Notification
2. Event-Carried State Transfer
3. Event Sourcing
4. Event Sourcing with CQRS (Command Query Responsibility Segregation)
We will only study the first two patterns because, in practice, they are the ones used most of the time. The last two have very specific use cases, so we will not cover them here.
In the first pattern (event notification), the producer puts only lightweight info in the message broker: it can put just the order_id. The consumer pulls this from the broker, and if it needs additional info for this order_id, it can query the database.
In the second pattern (event-carried state transfer), the producer puts the complete data inside the event itself, so consumers don't have to query anything back.
Disadvantage: Events are larger in size, which could increase broker storage and bandwidth costs.
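Purely as an illustration, the two event shapes might look like this (all field names are made up):

// Pattern 1: lightweight event; the consumer queries the database for anything else it needs.
const lightweightEvent = { type: 'order.success', orderId: 'ord_123' };

// Pattern 2: the event carries the full state, so consumers never call back to the order service.
const fullStateEvent = {
  type: 'order.success',
  orderId: 'ord_123',
  userId: 'u_42',
  items: [{ productId: 'p_7', quantity: 2 }],
  totalAmount: 499,
};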
Distributed Systems
From the client’s perspective, in a distributed system, clients should feel like
they are making requests to a single machine (not multiple). It’s the job of
the system to take requests from the client, do it in a distributed fashion, and
return the result.
CAP theorem that we studied earlier is the fundamental theorem that applies
in all distributed systems.
The system needs to decide which server should be the leader in two cases:
1. The first time when the entire system starts, then, it needs to decide
which server should be the leader.
2. When the leader goes down for any reason, then again, any follower
should come forward and become the leader. For this, followers should
quickly detect when the leader goes down at any moment.
We will not discuss these leader election algorithms here because it will go
out of scope for system design. Distributed System is itself a very vast topic.
But this much theory for distributed systems is enough from a system design
point of view.
Just think of the leader election algorithm as a black box. It will make one of
the servers the leader, and whenever the leader crashes or goes down, then
the entire system (which means most servers) will automatically detect it
and again run the leader-election algorithm to make any new server the
leader.
Exercise for you: Study some leader election algorithms and code them in
your favourite programming language.
Auto-Recoverable System using Leader Election
Suppose you built a horizontally scalable system where several servers are put behind a load balancer to serve requests. You want at least 4 servers to be serving requests at all times, no matter what.
Without automation, you would need to monitor the servers with your own eyes and manually restart any server that crashes or goes down. Don't you think that's a tedious and boring task? In this section, we are going to study how to automate it: whenever any server goes down, the system automatically detects it and restarts it without us manually doing anything.
For this, we need an Orchestrator. Its job is to keep monitoring the servers, and whenever any server goes down, the orchestrator restarts it.
Now, what happens when the orchestrator goes down? Who is keeping an
eye on the orchestrator?
In this scenario, the leader election algorithm comes in. Here, we don’t keep
only one orchestrator server; instead, we keep a bunch of multiple
orchestrator servers. We pick one of the orchestrator servers as leader-
orchestrator using the leader-election algorithm. This leader-orchestrator
keeps an eye on all the worker-orchestrators, and these worker-orchestrators
keep an eye on the servers. Whenever the leader-orchestrator goes down, one of the worker-orchestrators will promote itself to become the leader using the leader-election algorithm.
In this setup, we don’t need any human intervention. The system will be
auto-recoverable on its own.
Whenever any server goes down, a worker-orchestrator will restart it. When any of the worker-orchestrators goes down, the leader will restart it. When the leader-orchestrator goes down, one of the worker-orchestrators will promote itself to become the leader using the leader-election algorithm.
When you have a very large amount of data, you use big data tools to process it.
In this setup, we have a coordinator and workers. The client makes a request to the coordinator, whose job is to divide the large dataset into smaller chunks, assign them to the workers, take the results from the workers, combine them and return the final result. Each worker processes the small chunk of the dataset given to it by the coordinator.
There are several things the coordinator needs to take care of in this system:
If any worker crashes, then move its data to another machine for
processing.
Recovery: This means if any worker goes down, then restart it.
Take the individual results from the worker and combine them to return
the final result.
Logging
Taking large amounts of data from multiple sources and dumping it into
any warehouse.
Building this system is very hard, as you saw several things that needed to be
taken care of. It's a very common problem, so there are several open-source
tools (such as Spark and Flink) which give this infrastructure of coordinators
and workers.
We are only concerned about business logic, like writing code for training
ML Models. Big Data tools such as Apache Spark give us the distributed
system infrastructure. We use it as a black box and only write our business
logic in Python, Java or Scala in the form of jobs in Spark, and Spark will
process it in a distributed fashion.
We will not study Apache Spark in detail here because designing data-
intensive applications is itself a vast field. You will learn it only when you
work in any company that deals with big data. For system design interviews,
this theory on big data is enough.
Consistency Deep Dive
First of all, we need to know that consistency will only be considered when
we have a distributed stateful system.
Stateful System: The machines (or servers) store some data for future use.
Most of the time, our application servers are stateless. They don’t hold any
data. However, databases are stateful because they store all the data for our
application.
Consistency means data should be the same across all nodes (machines) at any point in time, so it only comes into the picture when we have distributed systems that are stateful. That's why the term consistency mostly comes up with databases, not with application servers.
We use databases as a black box. Because of that, you may not do much
coding here but don’t skip it. It is very important. In future, if you code any
stateful application or build your own database, then these concepts will be
useful. Also, whatever database, like DynamoDB, Cassandra, MongoDB, etc,
you are using, do check what type of consistency they provide.
1. Strong Consistency
2. Eventual Consistency
Strong Consistency
Any read operation after a write operation will always return the most
recent write.
Once a write is acknowledged, all subsequent reads will reflect that write.
Trading application where the user gets the latest and correct share
prices.
Eventual Consistency
It does not guarantee immediate consistency after a write operation but
ensures that eventually, after some time, all reads will return the same
value.
There might be a delay before all replicas reflect the latest write.
Since you compromise on consistency here, you get higher availability (CAP theorem).
2. Quorum-Based Protocols
In distributed databases, the leader-follower setup is followed, as we
studied in the distributed system section. When a follower completes the
write or read operation, it gives acknowledgement to the leader.
- Read quorum refers to the number of nodes that must return the data for a read. Suppose, for key user_id_2, the value "Shivam" is returned by 5 nodes; then the read quorum is 5.
- Write quorum refers to the number of nodes that must acknowledge a successful write for a particular key.
For strong consistency, if W is the write quorum, R is the read quorum and N is the total number of replicas, then we need W + R > N. For example, with N = 5 replicas, W = 3 and R = 3 guarantee that every read overlaps with at least one node holding the latest write.
3. Consensus Algorithms
It's a vast topic in distributed systems, so we will not cover it here. The summary is that it uses a leader election algorithm, and a write or read is successful when more than 50% of the nodes acknowledge it.
If you want to know more about it then learn Raft. It is an easy consensus
algorithm.
Ex: Docker Swarm uses Raft internally
4. Gossip Protocol
Nodes exchange heartbeats with a subset of other nodes, spreading
updates throughout the system. Heartbeat is just HTTP or TCP requests
sent periodically every 2–3 seconds. This way, we detect any failed nodes.
For reads and writes during such failures, we configure a degree of consistency: how many replicas must participate for a read or write to be allowed.
Ex: DynamoDB, Cassandra uses it
Consistent Hashing
Consistent hashing is an algorithm that tells which data belongs to which
node. It's just an algorithm, nothing else.
Since data is involved, consistent hashing is mostly used with stateful applications in distributed systems.
Hash the key with any hash function (such as SHA256, SHA128, MD5,
etc).
Then, take Mod of that with the number of servers to find its belonging.
Let's say the number of servers is 3.
key1 belongs to => (Hash(key1) % 3) server
This approach is perfectly fine when the number of servers is fixed. This
means we don’t have an auto-scaling or a dynamic number of servers.
Suppose, for our previous example, the number of servers changed from 3 to
2. Then now, key1 belongs to => (Hash(key1) % 2) server. This number can be
different from the previous one.
You saw that key1 may now belong to a different server. So, we need to move
key1 from one server to another. This is the drawback of the above approach.
If we follow the above simple approach to find which key belongs to which
server, then there will be a lot of data movements happening when the
number of servers changes.
How is consistent hashing used to find which data belongs to which node?
We take the server identity (such as IP or ID) and pass it to a hash
function such as SHA128. It will generate some random number between
[0, 2¹²⁸).
For visualization, we place it on a ring. The ring covers [0, 2¹²⁸), i.e., the range of the hash function. Whatever number comes out when we pass the server_id to the hash function, we place that server at that position on the ring.
Similarly, do the same for the keys: pass the key to the same hash function, and whatever number comes out, place the key at that position on the ring.
Any key will go to its nearest clockwise server. In the above example:
Suppose Node-2 is removed from the above setup, then the key-3 will go to
Node-1, and no other keys will be affected. This is the power of consistent
hashing. It makes the minimum movement of keys. But it is not the job of
consistent hashing to do this movement. You need to do it yourself.
Consistent hashing is just an algorithm that tells which key belongs to which
node.
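Here is a minimal sketch of that algorithm (no virtual nodes; real implementations add many virtual nodes per server for better balance, and MD5 is used here only as an example hash):

const crypto = require('node:crypto');

// Position of a server or key on the ring: first 8 hex chars of its MD5 hash as a number.
function ringPosition(value) {
  return parseInt(crypto.createHash('md5').update(String(value)).digest('hex').slice(0, 8), 16);
}

class HashRing {
  constructor(nodes) {
    this.ring = nodes
      .map((node) => ({ node, pos: ringPosition(node) }))
      .sort((a, b) => a.pos - b.pos);
  }

  // A key belongs to the nearest node clockwise from its position (wrapping around the ring).
  nodeFor(key) {
    const pos = ringPosition(key);
    const entry = this.ring.find((e) => e.pos >= pos) ?? this.ring[0];
    return entry.node;
  }
}

const ring = new HashRing(['node-1', 'node-2', 'node-3']);
console.log(ring.nodeFor('key-3'));

// Removing a node only remaps the keys that pointed to it; all other keys stay where they were.
const smaller = new HashRing(['node-1', 'node-3']);
console.log(smaller.nodeFor('key-3'));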
In real life, you can manually copy-paste or take snapshots of the data to
move it from one server to another.
Use of consistent hashing: Distributed databases such as AWS DynamoDB,
Apache Cassandra, Riak use it internally.
We make our database redundant and store a copy of the data on multiple database servers (machines).
If some technical failure happens and somehow the disk of the database
server gets corrupted or crashes, at that time we don’t lose the data if we
have a copy of it on a different server.
Do a daily backup every night: whatever changes were made in the database, take a snapshot (replicate) and store it on a different database server.
Do a weekly backup.
In this way, if some failure happens in the database server, then you have the
data at least till the last day or last week.
Continuous Redundancy
The daily and weekly backup approaches I mentioned above are generally not followed on their own anymore. Most companies nowadays follow continuous redundancy.
When we do any read or write operation, it is done on the main DB and gets replicated synchronously or asynchronously (according to your configuration choice) to the replica. This replica DB doesn't participate in any read/write operations from the client; its only job is to keep itself in sync with the main DB. When a failure happens in the main DB, the replica becomes the main DB and starts serving requests. In this way, we make our system resilient.
Your task is to set up this Replica in any cloud provider (like AWS RDS) and
see how this thing is done in the real world.
And second task is to do it locally. Set up two MySQL servers locally on the
laptop and configure this replication. Write to the main and see if it gets
replicated into the replica.
Proxy
What is Proxy?
A proxy is an intermediary server that sits between a client and another
server.
Forward Proxy
A forward proxy acts on behalf of the client. When a client makes a request,
this request goes through a forward proxy. The server doesn’t know about
the client's identity (IP). The server will only know the forward proxy IP.
If you have ever used a VPN to access websites, that is an example of a forward proxy: the VPN makes requests on your behalf.
Main Feature: Hides the client. The server only sees the forward proxy, not
the client.
Use Cases:
It is also used for caching. A forward proxy can cache frequently accessed content so the client doesn't have to make a request to the server; the content is returned directly from the forward proxy.
That’s it for the forward proxy. In system design, we mostly focus on the
reverse proxy, not on the forward proxy, because forward proxy is related to
the client (frontend). In system design, we mostly build server-side
(backend) apps.
Reverse Proxy
A reverse proxy acts on behalf of a server. Clients send requests to the
reverse proxy, which forwards them to the appropriate server in the
backend. The response goes back through the reverse proxy to the client.
With a forward proxy, the server doesn't know the client's identity; with a reverse proxy, the client doesn't know about the servers. Clients send requests to the reverse proxy, and it is the reverse proxy's job to route each request to the appropriate server.
Main Feature: Hides the server. The client only sees the reverse proxy, not
the server.
An example of a reverse proxy could be a load balancer. In load balancers,
clients don’t know about the actual server. They send requests to load
balancers.
Use Cases:
Flow:
Exercise for you: Learn about Nginx and use it in your side projects.
Suppose we have two microservices: one is the Order microservice, and the other is the Product microservice. Requests coming to /product should go to the Product microservice, and requests coming to /order should go to the Order microservice. Clients only make requests to the reverse proxy, and the reverse proxy does the work of forwarding each request to the correct microservice by looking at the URL, as sketched below.
1. Install Dependencies
npm init -y
npm install http-proxy
2. Create the Reverse Proxy Server
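A minimal sketch of such a reverse proxy with http-proxy, assuming the Order and Product microservices run locally on ports 4000 and 4001 (both addresses are placeholders):

const http = require('node:http');
const httpProxy = require('http-proxy');

const proxy = httpProxy.createProxyServer({});

const ORDER_SERVICE = 'http://localhost:4000';   // assumed address
const PRODUCT_SERVICE = 'http://localhost:4001'; // assumed address

const server = http.createServer((req, res) => {
  if (req.url.startsWith('/product')) {
    proxy.web(req, res, { target: PRODUCT_SERVICE }); // /product/* -> Product microservice
  } else if (req.url.startsWith('/order')) {
    proxy.web(req, res, { target: ORDER_SERVICE });   // /order/* -> Order microservice
  } else {
    res.statusCode = 404;
    res.end('Not found');
  }
});

server.listen(8080, () => console.log('Reverse proxy listening on http://localhost:8080'));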
Most of the things we learnt in this blog, such as load balancers, reverse proxies, message brokers and Redis, we could code ourselves, but we don't reinvent the wheel. We use these tools as black boxes from well-known open-source projects or big organisations that have solved these problems in a highly optimised way.
Note: Only create a new sub-problem when it is actually needed. Don’t create
it unnecessarily to over-complicate the design.
It took me a lot of time to write this blog. I hope you enjoyed reading this.
The best thing you can do is implement all these things. I also gave exercises
for implementation.
If you liked my efforts and want to support them, you can donate any amount below:
LinkedIn