System Design Basics
ACID Transaction
o Atomicity: The operations that constitute the transaction will either all succeed or all fail.
There is no in-between state.
o Consistency: The transaction cannot bring the database to an invalid state. After the
transaction is committed or rolled back, the rules for each record will still apply, and all
future transactions will see the effect of the transaction. Also named Strong Consistency.
o Isolation: The execution of multiple transactions concurrently will have the same effect as if
they had been executed sequentially.
o Durability: Any committed transaction is written to non-volatile storage. It will not be
undone by a crash, power loss, or network partition.
Alerting
The process through which system administrators get notified when critical system issues
occur. Alerting can be set up by defining specific thresholds on monitoring charts, past
which alerts are sent to a communication channel like Slack.
Availability
The odds of a particular server or service being up and running at any point in time, usually
measured in percentages. A server that has 99% availability will be operational 99% of the
time (this would be described as having two nines of availability).
Blob Storage
A widely used kind of storage in both small and large scale systems. Blob stores don’t really count as
databases per se, partially because they only allow the user to store and retrieve data based
on the name of the blob. This is sort of like a key-value store but usually blob stores have
different guarantees. They might be slower than KV stores but values can be megabytes
large (or sometimes gigabytes large). Usually people use this to store things like large
binaries, database snapshots, or images and other static assets that a website might have.
Blob storage is rather complicated to run on premises, and only giant companies like Google
and Amazon have infrastructure that supports it. So usually in the context of System Design
interviews you can assume that you will be able to use GCS or S3. These are blob storage
services hosted by Google and Amazon respectively, that cost money depending on how
much storage you use and how often you store and retrieve blobs from that storage.
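For instance, a minimal sketch of storing and retrieving a blob in S3 with Python's boto3 client (the bucket name and object key below are placeholders, and AWS credentials are assumed to be configured):

import boto3

s3 = boto3.client("s3")

# Store a database snapshot (potentially megabytes or gigabytes large) under a blob name.
with open("snapshot.bin", "rb") as f:
    s3.put_object(Bucket="my-example-bucket", Key="backups/snapshot.bin", Body=f)

# Retrieve the blob later by its name.
blob = s3.get_object(Bucket="my-example-bucket", Key="backups/snapshot.bin")
data = blob["Body"].read()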
Cache
A piece of hardware or software that stores data, typically meant to retrieve that data faster
than otherwise.
Caches are often used to store responses to network requests as well as results of
computationally-long operations.
Note that data in a cache can become stale if the main source of truth for that data (i.e., the
main database behind the cache) gets updated and the cache doesn't.
Cache Eviction Policy
The policy by which values get evicted or removed from a cache. Popular cache eviction
policies include LRU (least-recently used), FIFO (first in first out), and LFU (least-frequently
used).
Cache Hit
When requested data is found in a cache.
Cache Miss
When requested data could have been found in a cache but isn't. This is typically used to
refer to a negative consequence of a system failure or of a poor design choice. For example:
If a server goes down, our load balancer will have to forward requests to a new server,
which will result in cache misses.
CAP Theorem
Stands for Consistency, Availability, Partition tolerance. In a nutshell, this theorem states
that any distributed system can only achieve 2 of these 3 properties. Furthermore, since
almost all useful systems do have network-partition tolerance, it's generally boiled down
to: Consistency vs. Availability; pick one.
One thing to keep in mind is that some levels of consistency are still achievable with high
availability, but strong consistency is much harder.
Client
A machine or process that requests data or service from a server.
Client—Server Model
The paradigm by which modern systems are designed, which consists of clients requesting
data or service from servers and servers providing data or service to clients.
Configuration
A set of parameters or constants that a system relies on, typically written in a format like JSON
or YAML and kept separate from the application code.
Consensus Algorithm
A type of complex algorithm used to have multiple entities agree on a single data value, like
who the "leader" is amongst a group of machines. Two popular consensus algorithms
are Paxos and Raft.
Consistent Hashing
A type of hashing that minimizes the number of keys that need to be remapped when a
hash table gets resized. It's often used by load balancers to distribute traffic to servers; it
minimizes the number of requests that get forwarded to different servers when new servers
are added or when existing servers are brought down.
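A minimal sketch of the idea in Python (real implementations add many virtual nodes per server for a more even spread):

import hashlib
from bisect import bisect

class ConsistentHashRing:
    def __init__(self, servers):
        # Place each server at a fixed position on a ring of hash values.
        self.ring = sorted((self._hash(s), s) for s in servers)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, key):
        # Walk clockwise to the first server at or after the key's position.
        positions = [pos for pos, _ in self.ring]
        index = bisect(positions, self._hash(key)) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.server_for("user-42"))  # only keys near an added/removed server get remapped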
Content Delivery Network (CDN)
A CDN is a third-party service that acts like a cache for your servers. Sometimes, web
applications can be slow for users in a particular region if your servers are located only in
another region. A CDN has servers all around the world, meaning that the latency to a CDN's
servers will almost always be far better than the latency to your servers. Two of the most
popular CDNs are Cloudflare and Google Cloud CDN.
Database Index
A special auxiliary data structure that allows your database to perform certain queries much
faster. Indexes can typically only exist to reference structured data, like data stored in
relational databases. In practice, you create an index on one or multiple columns in your
database to greatly speed up read queries that you run very often, with the downside of
slightly longer writes to your database, since writes have to also take place in the relevant
index.
Database Lock
In a relational database that provides ACID transactions, updating rows inside a table will
cause a lock to be held on that table or on the rows you are updating. If a second
transaction tries to update the same rows, it will block before the update until the first
transaction releases that lock. This is one of the core mechanisms for enforcing the Isolation of
ACID transactions.
Databases
Databases are programs that either use disk or memory to do 2 core things: record data
and query data. In general, they are themselves servers that are long lived and interact with
the rest of your application through network calls, with protocols on top of TCP or even
HTTP.
Some databases only keep records in memory, and the users of such databases are aware of
the fact that those records may be lost forever if the machine or process dies.
For the most part though, databases need persistence of those records, and thus cannot rely on
memory alone. This means that you have to write your data to disk. Anything written to disk will
remain through power loss or network partitions, so that’s what is used to keep permanent
records.
Since machines die often in a large scale system, special disk partitions or volumes are used
by the database processes, and those volumes can get recovered even if the machine were
to go down permanently.
DDoS Attack
Short for "distributed denial-of-service attack", a DDoS attack is a DoS attack in which the
traffic flooding the target system comes from many different sources (like thousands of
machines), making it much harder to defend against.
Disk
Usually refers to either HDD (hard-disk drive) or SSD (solid-state drive). Data written to disk
will persist through power failures and general machine crashes. Disk is also referred to
as non-volatile storage.
SSD is far faster than HDD (see latencies of accessing data from SSD and HDD) but also far
more expensive from a financial point of view. Because of that, HDD will typically be used
for data that's rarely accessed or updated, but that's stored for a long time, and SSD will be
used for data that's frequently accessed and updated.
DNS
Short for Domain Name System, it describes the entities and protocols involved in the
translation from domain names to IP Addresses. Typically, machines make a DNS query to a
well known entity which is responsible for returning the IP address (or multiple ones) of the
requested domain name in the response.
DoS Attack
Short for "denial-of-service attack", a DoS attack is an attack in which a malicious user tries
to bring down or damage a system in order to render it unavailable to users. Much of the
time, it consists of flooding it with traffic. Some DoS attacks are easily preventable with rate
limiting, while others can be far trickier to defend against.
Etcd
Etcd is a strongly consistent and highly available key-value store that's often used to
implement leader election in a system.
Website: https://etcd.io/
Eventual Consistency
A consistency model that is weaker than Strong Consistency. In this model, reads might return a
view of the system that is stale. An eventually consistent datastore will give the guarantee
that the state of the database will eventually reflect writes within some time period (could be 10
seconds, or minutes).
Forward Proxy
A server that sits between a client and servers and acts on behalf of the client, typically used
to mask the client's identity (IP address). Note that forward proxies are often referred to as
just proxies.
Google Cloud Storage (GCS)
GCS is a blob storage service provided by Google.
Website: https://cloud.google.com/storage
Gossip Protocol
When a set of machines talk to each other in an uncoordinated manner in a cluster to spread
information through a system without requiring a central source of data.
Hashing Function
A function that takes in a specific data type (such as a string or an identifier) and outputs a
number. Different inputs may have the same output, but a good hashing function attempts
to minimize those hashing collisions (which is equivalent to maximizing uniformity).
High Availability
Used to describe systems that have particularly high levels of availability, typically 5 nines or
more; sometimes abbreviated "HA".
Horizontal Scaling
Scaling a system horizontally means adding more machines to perform the same task,
resulting in increased throughput for the system. Typically, horizontal scaling increases a
system's throughput roughly linearly with the number of machines performing a given task.
Hot Spot
When distributing a workload across a set of servers, that workload might be spread
unevenly. This can happen if your sharding key or your hashing function is suboptimal, or if
your workload is naturally skewed: some servers will receive a lot more traffic than others,
thus creating a "hot spot".
HTTP
HyperText Transfer Protocol, a very common network protocol implemented on top of TCP.
Clients make HTTP requests, and servers reply with HTTP responses.
IP
Stands for Internet Protocol. This network protocol outlines how almost all machine-to-
machine communications should happen in the world. Other protocols
like TCP, UDP and HTTP are built on top of IP.
IP Address
An address given to each machine connected to the public internet. IPv4 addresses consist
of four numbers separated by dots: a.b.c.d where all four numbers are between 0 and 255.
Special values include:
o 127.0.0.1: your own local machine, also referred to as localhost.
o 192.168.x.y: addresses reserved for private networks; for instance, machines on your home Wi-Fi network will typically have addresses with the 192.168 prefix.
IP Packet
Sometimes referred to more broadly as just a (network) packet, an IP packet is effectively the
smallest unit used to describe data being sent over IP, aside from bytes. It consists of:
o an IP header, which contains the source and destination IP addresses as well as other
information related to the network
o a payload, which is just the data being sent over the network
JSON
A file format heavily used in APIs and configuration. Stands for JavaScript Object Notation.
Example:
{
"version": 1.0,
"name": "AlgoExpert Configuration"
}
Kafka
A distributed message passing storage system created by LinkedIn. Very useful when using
the streaming paradigm as opposed to polling.
Website: https://kafka.apache.org/
Key-Value Store
A Key-Value Store is a flexible NoSQL database that's often used for caching and dynamic
configuration. Popular options include DynamoDB, Etcd, Redis, and ZooKeeper.
Latency
The time it takes for a certain operation to complete in a system. Most often this measure is
a time duration, like milliseconds or seconds. You should know these orders of magnitude:
o Reading 1MB sequentially from memory: ~0.25ms
o Reading 1MB sequentially from SSD: ~1ms
o Transferring 1MB over a 1Gbps network: ~10ms
o Reading 1MB sequentially from HDD: ~20ms
o Sending a packet from California to the Netherlands and back: ~150ms
Leader Election
The process by which nodes in a cluster (for instance, servers in a set of servers) elect a so-
called "leader" amongst them, responsible for the primary operations of the service that
these nodes support. When correctly implemented, leader election guarantees that all
nodes in the cluster know which one is the leader at any given time and can elect a new
leader if the leader dies for whatever reason.
Load Balancer
A type of reverse proxy that distributes traffic across servers. Load balancers can be found in
many parts of a system, from the DNS layer all the way to the database layer.
Logging
The act of collecting and storing logs--useful information about events in your system.
Typically your programs will output log messages to their STDOUT or STDERR pipes, which will
automatically get aggregated into a centralized logging solution.
MapReduce
A popular framework for data processing at a very large scale by splitting the work into as
many sub-tasks as needed and processing those in parallel on a big cluster of machines. It
consists of 2 main steps: Map and Reduce. The map step takes the input and its output
gets passed on to the reducers. The output of the reducers gets concatenated into the
final result.
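A toy word-count sketch in Python of those two steps (a real framework would run the map and reduce calls in parallel across many machines):

from collections import defaultdict

def map_phase(document):
    # Map: turn each document into intermediate (word, 1) pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework would between the two steps.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for a key into the final count.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}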
Memory
Short for Random Access Memory (RAM). Data stored in memory will be lost when the
process that has written that data dies.
Microservice Architecture
When a system is made up of many small web services that can be compiled and deployed
independently. This is usually thought of as a counterpart of monoliths.
MongoDB
A popular NoSQL document database that stores data in JSON-like documents.
Website: https://www.mongodb.com/
Monitoring
The process of having visibility into a system's key metrics, monitoring is typically
implemented by collecting important events in a system and aggregating them in human-
readable charts.
Monolith Architecture
When a system is primarily made up of a single large web application that is compiled and
rolled out as a unit. Typically a counterpart of microservices. Companies sometimes try to
split up this monolith into microservices once it reaches a very large size in an attempt to
increase developer productivity.
MySQL
A relational database that provides ACID transactions and supports a SQL dialect.
Website: https://www.mysql.com/
Nginx
Pronounced "engine X"—not "N jinx", Nginx is a very popular webserver that's often used as
a reverse proxy and load balancer.
Website: https://www.nginx.com/
Nines
Typically refers to percentages of uptime. For example, 5 nines of availability means an
uptime of 99.999% of the time. Below are the downtimes expected per year depending on
those 9s:
o 99% (two 9s): ~87.7 hours of downtime per year
o 99.9% (three 9s): ~8.8 hours per year
o 99.99% (four 9s): ~53 minutes per year
o 99.999% (five 9s): ~5.3 minutes per year
Node/Instance/Host
These three terms refer to the same thing most of the time: a virtual or physical machine on
which the developer runs processes. Sometimes the word server also refers to this same
concept.
Non-Relational Database
In contrast with relational databases (SQL databases), a type of database that is free of
imposed, tabular-like structure. Non-relational databases are often referred to as NoSQL
databases.
NoSQL Database
Any database that is not SQL-compatible; see Non-Relational Database.
Paxos & Raft
Two consensus algorithms that, when implemented correctly, allow for the synchronization
of certain operations, even in a distributed setting.
Peer-To-Peer Network
A collection of machines, referred to as peers, that divide a workload amongst themselves to
presumably complete it faster than would otherwise be possible. Peer-to-peer networks are
often used in file-distribution systems.
Percentiles
Most often used when describing a latency distribution. If your Xth percentile is 100
milliseconds, it means that X% of the requests have latencies of 100ms or less. Sometimes,
SLAs describe their guarantees using these percentiles.
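For example, a small Python helper computing a nearest-rank percentile over a list of latency samples:

import math

def percentile(samples_ms, p):
    # Nearest-rank method: the smallest sample such that at least p% of samples are <= it.
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies = [12, 15, 20, 22, 30, 35, 40, 80, 120, 400]
print(percentile(latencies, 50))  # median latency -> 30
print(percentile(latencies, 90))  # 90th percentile -> 120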
Persistent Storage
Usually refers to disk, but in general it is any form of storage that persists if the process in
charge of managing it dies.
Polling
The act of fetching a resource or piece of data regularly at an interval to make sure your
data is not too stale.
Postgres
A relational database that uses a dialect of SQL called PostgreSQL. Provides ACID
transactions.
Website: https://www.postgresql.org/
Process
A program that is currently running on a machine. You should always assume that any
process may get terminated at any time in a sufficiently large system.
Pub/Sub Model
Short for Publish/Subscribe. In this model, the subscribers usually create a long lived
connection with a server waiting for messages pertaining to a topic (sometimes
called channel). Independently of those subscribers, other clients, the publishers, will create
messages pertaining to one of the topics. The service implementing this Pub/Sub Model is
required to notify and pass along the messages all the way to the subscribers within some
time period (could be 1 second or 10 minutes depending on the system).
Rate Limiting
The act of limiting the number of requests sent to or from a system. Rate limiting is most
often used to limit the number of incoming requests in order to prevent DoS attacks and
can be enforced at the IP-address level, at the user-account level, or at the region level, for
example. Rate limiting can also be implemented in tiers; for instance, a type of network
request could be limited to 1 per second, 5 per 10 seconds, and 10 per minute.
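As an illustration, a minimal in-memory fixed-window rate limiter in Python (a production limiter would usually keep its counters in a shared store such as Redis):

import time
from collections import defaultdict

class FixedWindowRateLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit                  # max requests allowed per window
        self.window = window_seconds
        self.counters = defaultdict(int)    # (client_id, window_index) -> request count

    def allow(self, client_id):
        window_index = int(time.time() // self.window)
        self.counters[(client_id, window_index)] += 1
        return self.counters[(client_id, window_index)] <= self.limit

limiter = FixedWindowRateLimiter(limit=5, window_seconds=10)
print(limiter.allow("203.0.113.7"))  # True until the 6th request within the same 10s window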
Redis
An in-memory key-value store. Does offer some persistent storage options but is typically
used as a really fast, best-effort caching solution. Redis is also often used to implement rate
limiting.
Website: https://redis.io/
Redundancy
The process of duplicating critical parts of a system in order to remove single points of failure
and make the system more reliable.
Relational Database
A type of structured database in which data is stored following a tabular format; often
supports powerful querying using SQL.
Rendezvous Hashing
A type of hashing also known as highest random weight hashing. Allows for minimal re-
distribution of mappings when a server goes down.
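A minimal Python sketch of the idea: every server gets a pseudo-random weight for a given key, the server with the highest weight wins, and removing a server only remaps the keys it owned:

import hashlib

def rendezvous_pick(key, servers):
    def weight(server):
        return int(hashlib.md5(f"{key}:{server}".encode()).hexdigest(), 16)
    return max(servers, key=weight)

print(rendezvous_pick("user-42", ["server-a", "server-b", "server-c"]))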
Replication
The act of duplicating the data from one database server to others. This is sometimes used
to increase the redundancy of your system and tolerate regional failures for instance. Other
times you can use replication to move data closer to your clients, thus decreasing the
latency of accessing specific data.
Reverse Proxy
A server that sits between clients and servers and acts on behalf of the servers, typically
used for logging, load balancing, or caching.
S3
S3 is a blob storage service provided by Amazon through Amazon Web Services (AWS).
Website: https://aws.amazon.com/s3/
Server
A machine or process that provides data or service for a client, usually by listening for
incoming network calls.
Note that a single machine or piece of software can be both a client and a server at the
same time. For instance, a single machine could act as a server for end users and as a client
for a database.
Server-Selection Strategy
How a load balancer chooses servers when distributing traffic between multiple servers.
Commonly used strategies include round-robin, random, health-checks, and IP-based
routing.
SHA
Short for "Secure Hash Algorithms", SHA is a collection of cryptographic hash functions
used in the industry. These days, SHA-3 is a popular choice to use in a system.
Sharding
Sometimes called data partitioning, sharding is the act of splitting a database into two or
more pieces called shards and is typically done to increase the throughput of your database.
Popular sharding strategies include:
o Sharding based on a client's region
o Sharding based on the type of data being stored (e.g., user data stored in one shard, payment data in another)
o Sharding based on the hash of a column (for structured data)
SLA
Short for "service-level agreement", an SLA is a collection of guarantees given to a customer by
a service provider. SLAs typically make guarantees on a system's availability, amongst other
things, and are made up of one or multiple SLOs.
SLO
Short for "service-level objective", an SLO is a guarantee given to a customer by a service
provider. SLOs typically make guarantees on a system's availability, amongst other things, and
constitute an SLA.
Socket
A kind of file that acts like a stream. Processes can read and write to sockets and
communicate in this manner. Most of the time, sockets are fronts for TCP connections.
SQL
Structured Query Language. Relational databases can be queried using a derivative of SQL, such
as PostgreSQL in the case of Postgres.
SQL Database
Any database that supports SQL. This term is often used synonymously with "Relational
Database", though in practice, not every relational database supports SQL.
Streaming
In networking, it usually refers to the act of continuously getting a feed of information from
a server by keeping an open connection between the two machines or processes.
Strong Consistency
A consistency model usually used to refer to the consistency of ACID transactions, as opposed
to Eventual Consistency: once a write is committed, subsequent reads reflect it.
TCP
Network protocol built on top of the Internet Protocol (IP). Allows for ordered, reliable data
delivery between machines over the public internet by creating a connection.
TCP is usually implemented in the kernel, which exposes sockets to applications that they
can use to stream data through an open connection.
Throughput
The number of operations that a system can handle properly per time unit. For instance the
throughput of a server can often be measured in requests per second (RPS or QPS).
Vertical Scaling
Scaling a system vertically means increasing the resources (CPU / Memory) available to a
certain task on a single machine, so that your throughput may increase.
Virtual Machine
Similar to the Task Queue Pattern. In this design, a pool of workers, usually themselves
servers, take tasks off of a single shared queue and process those tasks independently. In
order to ensure that every task gets done at least once despite potential partitions between
queue and workers, the workers must confirm the status of the task after it is done
(usually success or failure).
YAML
A file format mostly used in configuration, similar in purpose to JSON. Example:
version: 1.0
name: AlgoExpert Configuration
ZooKeeper
ZooKeeper is a strongly consistent, highly available key-value store. It's often used to store
important configuration or to perform leader election.
Lots of people struggle with system design interviews (SDIs), primarily because of: 1) the
unstructured nature of SDIs, where you’re asked to work on an open-ended design problem
that doesn’t have a standard answer; 2) a lack of experience in developing large scale
systems; and 3) not spending enough time preparing for SDIs.
Just like coding interviews, candidates who haven’t put in a conscious effort to prepare for
SDIs usually perform poorly. This gets aggravated when you’re interviewing at top
companies like Google, Facebook, Uber, etc. In these companies, if a candidate doesn’t
perform above average, they have a limited chance to get an offer. On the other hand, a
good performance always results in a better offer (higher position and salary), since it
reflects upon your ability to handle large complex systems - a skill that all such companies
require.
In this course, we’ll follow a step by step approach to solve multiple design problems. Here
are those seven steps:
1. Requirements clarifications
2. System interface (API) definition
3. Back-of-the-envelope capacity estimation
4. Defining the data model
5. High-level design
6. Detailed design
7. Identifying and resolving bottlenecks
Always ask questions to find the exact scope of the problem you’re solving. Design questions
are mostly open-ended, and they don’t have ONE correct answer, that’s why clarifying
ambiguities early in the interview becomes critical. Candidates who spend enough time
defining the end goals of the system always have a better chance of being successful in the
interview. Also, since you only have 35-40 minutes to design a (supposedly) large system,
you should clarify what parts of the system you would be focusing on.
Under each step, we’ll try to give examples of different design considerations for developing
a Twitter-like service.
Here are some questions for designing Twitter that should be answered before moving on to
the next steps:
Will users of our service be able to post tweets and follow other people?
Should we also design to create and display user’s timeline?
Will tweets contain photos and videos?
Are we focusing on backend only or are we developing front-end too?
Will users be able to search tweets?
Do we need to display hot trending topics?
Would there be any push notification for new (or important) tweets?
All such questions will determine how our end design will look.
Define what APIs are expected from the system. This will not only establish the exact
contract expected from the system but will also ensure that you haven’t gotten any
requirements wrong. Some examples for our Twitter-like service would be:
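For instance (these signatures are purely illustrative, not a prescribed contract):

postTweet(api_dev_key, tweet_data, tweet_location, user_location, media_ids)
generateTimeline(api_dev_key, user_id, current_time, user_location)
markTweetFavorite(api_dev_key, user_id, tweet_id, timestamp)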
It’s always a good idea to estimate the scale of the system you’re going to design. This
would also help later when you’ll be focusing on scaling, partitioning, load balancing and
caching.
What scale is expected from the system (e.g., number of new tweets, number of tweet
views, how many timeline generations per sec., etc.)?
How much storage will we need? We will have different numbers if users can have photos
and videos in their tweets.
What is the network bandwidth usage we expect? This would be crucial in deciding how
would we manage traffic and balance load between servers.
Defining the data model early will clarify how data will flow among different components of
the system. Later, it will guide towards data partitioning and management. The candidate
should be able to identify the various entities of the system, how they will interact with each
other, and the different aspects of data management like storage, transportation, encryption, etc.
Here are some entities for our Twitter-like service:
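For example (field names here are only suggestions), the core entities might be:

User: UserID, Name, Email, DateOfBirth, CreationDate, LastLogin
Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp
UserFollow: UserID1, UserID2
FavoriteTweets: UserID, TweetID, TimeStamp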
Draw a block diagram with 5-6 boxes representing the core components of your system. You
should identify enough components that are needed to solve the actual problem from end-
to-end.
For Twitter, at a high level, we would need multiple application servers to serve all the
read/write requests with load balancers in front of them for traffic distributions. If we’re
assuming that we’ll have a lot more read traffic (as compared to write), we can decide to
have separate servers for handling these scenarios. On the backend, we need an efficient
database that can store all the tweets and can support a huge number of reads. We would
also need a distributed file storage system for storing photos and videos.
Dig deeper into 2-3 components; the interviewer’s feedback should always guide you towards
which parts of the system they want you to explain further. You should be able to present
different approaches, their pros and cons, and explain why you would choose one. Remember
there is no single answer, the only important thing is to consider tradeoffs between
different options while keeping system constraints in mind.
Since we will be storing a massive amount of data, how should we partition our data to
distribute it to multiple databases? Should we try to store all the data of a user on the same
database? What issue can it cause?
How would we handle hot users, who tweet a lot or follow lots of people?
Since a user’s timeline will contain the most recent (and relevant) tweets, should we try to store
our data in a way that is optimized for scanning the latest tweets?
How much and at which layer should we introduce cache to speed things up?
What components need better load balancing?
Try to discuss as many bottlenecks as possible and different approaches to mitigate them.
Is there any single point of failure in our system? What are we doing to mitigate it?
Do we have enough replicas of the data so that if we lose a few servers, we can still serve our
users?
Similarly, do we have enough copies of different services running, such that a few failures will
not cause total system shutdown?
How are we monitoring the performance of our service? Do we get alerts whenever critical
components fail or their performance degrades?
In summary, preparation and being organized during the interview are the keys to be
successful in system design interviews.
Designing a URL Shortening Service like TinyURL
Let’s design a URL shortening service like TinyURL. This service will provide short aliases
redirecting to long URLs. Similar services: bit.ly, goo.gl, qlink.me, etc. Difficulty Level: Easy
1. Why do we need URL shortening?
URL shortening is used to create shorter aliases for long URLs. We call these shortened
aliases “short links.” Users are redirected to the original URL when they hit these short links.
Short links save a lot of space when displayed, printed, messaged or tweeted. Additionally,
users are less likely to mistype shorter URLs.
For example, if we shorten this page through TinyURL:
https://www.educative.io/collection/page/
5668639101419520/5649050225344512/5668600916475904/
We would get:
http://tinyurl.com/jlg8zpc
The shortened URL is nearly one-third of the size of the actual URL.
URL shortening is used for optimizing links across devices, tracking individual links to analyze
audience and campaign performance, and hiding affiliated original URLs.
If you haven’t used tinyurl.com before, please try creating a new shortened URL and
spend some time going through the various options their service offers. This will help you a
lot in understanding this chapter better.
2. Requirements and Goals of the System
Functional Requirements:
1. Given a URL, our service should generate a shorter and unique alias of it. This is
called a short link.
2. When users access a short link, our service should redirect them to the original link.
3. Users should optionally be able to pick a custom short link for their URL.
4. Links will expire after a standard default timespan. Users should also be able to
specify the expiration time.
Non-Functional Requirements:
1. The system should be highly available. This is required because, if our service is
down, all the URL redirections will start failing.
2. URL redirection should happen in real-time with minimal latency.
3. Shortened links should not be guessable (not predictable).
Extended Requirements:
1. Analytics; e.g., how many times did a redirection happen?
2. Our service should also be accessible through REST APIs by other services.
3. Capacity Estimation and Constraints
Our system will be read-heavy. There will be lots of redirection requests compared to new
URL shortenings. Let's assume a 100:1 ratio between reads and writes.
Traffic estimates: If we assume we will have 500M new URL shortenings per month, we can
expect (100 * 500M => 50B) redirections during that same period. What would be Queries
Per Second (QPS) for our system? New URLs per second: 500M / (30 days * 24 hours * 3600 seconds) ~= 200 URLs/s.
With a 100:1 read:write ratio, URL redirections per second: 100 * 200 ~= 20K/s.
Storage estimates: Let's assume we store every URL (and its shortened alias) for 5 years. Since
we expect 500M new URLs every month, the total number of objects we will store will be 30
billion: 500 million * 5 years * 12 months => 30 billion.
Let's assume that each stored object will be approximately 500 bytes (just a ballpark
estimate–we will dig into it later). We will need 15TB of total storage: 30 billion * 500 bytes => 15 TB.
Bandwidth estimates: For write requests, since we expect 200 new URLs every second, total
incoming data for our service will be 100KB per second: 200 * 500 bytes => 100 KB/s.
For read requests, since every second we expect ~19K URL redirections, total outgoing data
for our service would be about 9MB per second: 19K * 500 bytes ~= 9 MB/s.
Memory estimates: If we want to cache some of the hot URLs that are frequently accessed,
how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of
URLs generate 80% of traffic, we would like to cache these 20% hot URLs.
Since we have 19K requests per second, we will be getting 1.7 billion requests per day:
19K * 3600 seconds * 24 hours ~= 1.7 billion
To cache 20% of these requests, we will need 170GB of memory: 0.2 * 1.7 billion * 500 bytes ~= 170 GB
(Note that since there will be many duplicate requests for the same URL, our actual memory usage will be less than 170GB.)
High level estimates: Assuming 500 million new URLs per month and a 100:1 read:write ratio,
the following is a summary of the high level estimates for our service:
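As a sanity check, here is the arithmetic behind those numbers written out in Python (assuming each stored object is ~500 bytes and URLs are kept for 5 years):

seconds_per_month = 30 * 24 * 3600                          # ~2.6 million

new_urls_per_month = 500_000_000
write_qps = new_urls_per_month / seconds_per_month          # ~200 new URLs/s
read_qps = 100 * write_qps                                  # ~20K redirections/s

object_size = 500                                           # bytes per stored URL object
storage_5_years = new_urls_per_month * 12 * 5 * object_size # ~15 TB

incoming_bandwidth = write_qps * object_size                # ~100 KB/s
outgoing_bandwidth = read_qps * object_size                 # ~9-10 MB/s

requests_per_day = read_qps * 24 * 3600                     # ~1.7 billion
cache_memory = 0.2 * requests_per_day * object_size         # ~170 GB for the hottest 20%

print(write_qps, read_qps, storage_5_years, incoming_bandwidth, outgoing_bandwidth, cache_memory)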
4. System APIs
Once we’ve finalized the requirements, it’s always a good idea to define the
system APIs. This should explicitly state what is expected from the system.
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definitions of the APIs for creating and deleting URLs:
createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
original_url (string): Original URL to be shortened.
custom_alias (string): Optional custom key for the URL.
user_name (string): Optional user name to be used in encoding.
expire_date (string): Optional expiration date for the shortened URL.
Returns: (string)
A successful insertion returns the shortened URL; otherwise, it returns an error code.
deleteURL(api_dev_key, url_key)
How do we detect and prevent abuse? A malicious user can put us out of business by
consuming all URL keys in the current design. To prevent abuse, we can limit users via their
api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and
redirections per some time period (which may be set to a different duration per developer
key).
5. Database Design
Defining the DB schema in the early stages of the interview would help to
understand the data flow among various components and later would guide towards the
data partitioning. A few observations about the nature of the data we will store:
1. We need to store billions of records.
2. Each object we store is small (less than 1K).
3. There are no relationships between records, other than storing which user created a URL.
4. Our service is read-heavy.
Database Schema:
We would need two tables: one for storing information about the URL mappings, and one
for the data of the user who created the short link.
What kind of database should we use? Since we anticipate storing billions of rows, and we
don’t need to use relationships between objects – a NoSQL key-value store like Dynamo or
Cassandra is a better choice. A NoSQL choice would also be easier to scale. Please see SQL vs
NoSQL for more details.
6. Basic System Design and Algorithm
The problem we are solving here is: how to generate a short and unique key for a given
URL?
a. Encoding actual URL
We can compute a unique hash (e.g., MD5 or SHA256) of the given URL. The hash
can then be encoded for display. This encoding could be base36 ([a-z, 0-9]) or base62 ([A-
Z, a-z, 0-9]), and if we add ‘-’ and ‘.’, we can use base64 encoding. A reasonable question
would be: what should be the length of the short key? 6, 8 or 10 characters?
Using base64 encoding, a 6 letter long key would result in 64^6 = ~68.7 billion possible
strings
Using base64 encoding, an 8 letter long key would result in 64^8 = ~281 trillion possible
strings
With 68.7B unique strings, let’s assume for our system six letter keys would suffice.
If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After
base64 encoding, we’ll get a string having more than 21 characters (since each base64
character encodes 6 bits of the hash value). Since we only have space for 8 characters per
short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key.
This could result in key duplication though, upon which we can choose some other
characters out of the encoding string or swap some characters.
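A small Python sketch of this key-generation approach, using the URL-safe base64 alphabet ('-' and '_') as a stand-in for the alphabet described above:

import base64
import hashlib

def short_key(original_url, length=6):
    digest = hashlib.md5(original_url.encode()).digest()      # 128-bit hash of the URL
    encoded = base64.urlsafe_b64encode(digest).decode()       # 24-character base64 string
    return encoded[:length]                                   # take the first 6 (or 8) characters

print(short_key("http://www.educative.io/distributed.php?id=design"))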
What are different issues with our solution? We have the following couple of problems
with our encoding scheme:
1. If multiple users enter the same URL, they can get the same shortened URL, which is
not acceptable.
2. What if parts of the URL are URL-encoded?
e.g., http://www.educative.io/distributed.php?id=design
and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except
for the URL encoding.
Workaround for the issues: We can append an increasing sequence number to each input
URL to make it unique, and then generate a hash of it. We don’t need to store this sequence
number in the databases, though. Possible problems with this approach could be an ever-
increasing sequence number. Can it overflow? Appending an increasing sequence number
will also impact the performance of the service.
Another solution could be to append user id (which should be unique) to the input URL.
However, if the user has not signed in, we would have to ask the user to choose a
uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until
we get a unique one.
b. Generating keys offline
We can have a standalone Key Generation Service (KGS) that generates random six letter
strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want
to shorten a URL, we will just take one of the already-generated keys and use it. This
approach will make things quite simple and fast. Not only are we not encoding the URL, but
we won’t have to worry about duplications or collisions. KGS will make sure all the keys
inserted into key-DB are unique.
Can concurrency cause problems? As soon as a key is used, it should be marked in the
database to ensure it doesn’t get used again. If there are multiple servers reading keys
concurrently, we might get a scenario where two or more servers try to read the same key
from the database. How can we solve this concurrency problem?
Servers can use KGS to read/mark keys in the database. KGS can use two tables to store
keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives
keys to one of the servers, it can move them to the used keys table. KGS can always keep
some keys in memory so that it can quickly provide them whenever a server needs them.
For simplicity, as soon as KGS loads some keys in memory, it can move them to the used
keys table. This ensures each server gets unique keys. If KGS dies before assigning all the
loaded keys to some server, we will be wasting those keys–which is acceptable, given the
huge number of keys we have.
KGS also has to make sure not to give the same key to multiple servers. For that, it must
synchronize (or get a lock to) the data structure holding the keys before removing keys from
it and giving them to a server
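A minimal single-process sketch of this idea in Python, where a lock stands in for the synchronization KGS needs and two lists stand in for the unused/used key tables:

import threading

class KeyGenerationService:
    def __init__(self, pregenerated_keys):
        self._unused = list(pregenerated_keys)   # "unused keys" table
        self._used = []                          # "used keys" table
        self._lock = threading.Lock()

    def get_keys(self, count):
        # Synchronize access so no two app servers ever receive the same key.
        with self._lock:
            batch, self._unused = self._unused[:count], self._unused[count:]
            self._used.extend(batch)             # mark keys as used as soon as they are handed out
            return batch

kgs = KeyGenerationService(["aZ3kQ9", "Bx7Lm2", "Qp0Rt8"])
print(kgs.get_keys(2))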
What would be the key-DB size? With base64 encoding, we can generate 68.7B unique six
letter keys. If we need one byte to store one alpha-numeric character, we can store all
these keys in: 6 (characters per key) * 68.7B (unique keys) => 412 GB.
Isn’t KGS the single point of failure? Yes, it is. To solve this, we can have a standby replica of
KGS. Whenever the primary server dies, the standby server can take over to generate and
provide keys.
Can each app server cache some keys from key-DB? Yes, this can surely speed things up.
Although in this case, if the application server dies before consuming all the keys, we will
end up losing those keys. This could be acceptable since we have 68B unique six letter keys.
How would we perform a key lookup? We can look up the key in our database or key-value
store to get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the
browser, passing the stored URL in the “Location” field of the response. If that key is not
present in our system, issue an “HTTP 404 Not Found” status, or redirect the user back to
the homepage.
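Purely for illustration, the lookup path could look like the following tiny Flask handler (Flask and the in-memory dictionary are stand-ins for the real web tier and key-value store):

from flask import Flask, abort, redirect

app = Flask(__name__)
url_store = {"jlg8zpc": "https://example.com/some/very/long/url"}  # stand-in for the datastore

@app.route("/<key>")
def follow_short_link(key):
    original_url = url_store.get(key)
    if original_url is None:
        abort(404)                           # key not found: HTTP 404 Not Found
    return redirect(original_url, code=302)  # HTTP 302 with the stored URL in the Location header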
Should we impose size limits on custom aliases? Our service supports custom aliases. Users
can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is
reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have
a consistent URL database. Let’s assume users can specify a maximum of 16 characters per
custom key (as reflected in the above database schema).
7. Data Partitioning and Replication
To scale out our DB, we need to partition it so that it can store information about billions of
URLs. We need to come up with a partitioning scheme that would divide and store our data
in different DB servers.
a. Range Based Partitioning: We can store URLs in separate partitions based on the first
letter of the URL or the hash key. Hence we save all the URLs starting with letter ‘A’ in one
partition, save those that start with letter ‘B’ in another partition and so on. This approach is
called range-based partitioning. We can even combine certain less frequently occurring
letters into one database partition. We should come up with a static partitioning scheme so
that we can always store/find a file in a predictable manner.
The main problem with this approach is that it can lead to unbalanced servers. For example:
we decide to put all URLs starting with letter ‘E’ into a DB partition, but later we realize that
we have too many URLs that start with letter ‘E’.
b. Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing. We
then calculate which partition to use based upon the hash. In our case, we can take the hash
of the ‘key’ or the actual URL to determine the partition in which we store the data object.
Our hashing function will randomly distribute URLs into different partitions (e.g., our
hashing function can always map any key to a number between [1…256]), and this number
would represent the partition in which we store our object.
This approach can still lead to overloaded partitions, which can be solved by
using Consistent Hashing.
8. Cache
We can cache URLs that are frequently accessed. We can use some off-the-shelf solution
like Memcache, which can store full URLs with their respective hashes. The application
servers, before hitting backend storage, can quickly check if the cache has the desired URL.
How much cache should we have? We can start with 20% of daily traffic and, based on
clients’ usage pattern, we can adjust how many cache servers we need. As estimated above,
we need 170GB memory to cache 20% of daily traffic. Since a modern day server can have
256GB memory, we can easily fit all the cache into one machine. Alternatively, we can use a
couple of smaller servers to store all these hot URLs.
Which cache eviction policy would best fit our needs? When the cache is full, and we want
to replace a link with a newer/hotter URL, how would we choose? Least Recently Used (LRU)
can be a reasonable policy for our system. Under this policy, we discard the least recently
used URL first. We can use a Linked Hash Map or a similar data structure to store our
URLs and Hashes, which will also keep track of which URLs are accessed recently.
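A minimal Python LRU cache sketch, using an ordered dictionary in place of a Linked Hash Map:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()          # iteration order doubles as recency order

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)         # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("jlg8zpc", "https://example.com/some/very/long/url")
print(cache.get("jlg8zpc"))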
To further increase the efficiency, we can replicate our caching servers to distribute load
between them.
How can each cache replica be updated? Whenever there is a cache miss, our servers
would be hitting a backend database. Whenever this happens, we can update the cache and
pass the new entry to all the cache replicas. Each replica can update their cache by adding
the new entry. If a replica already has that entry, it can simply ignore it.
9. Load Balancing (LB)
Initially, we could use a simple Round Robin approach that distributes incoming requests
equally among backend servers. This LB is simple to implement and does not introduce any
overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of
the rotation and will stop sending any traffic to it.
A problem with Round Robin LB is that server load is not taken into consideration. If a server
is overloaded or slow, the LB will not stop sending new requests to that server. To handle
this, a more intelligent LB solution can be placed that periodically queries the backend
server about its load and adjusts traffic based on that.
10. Purging or DB Cleanup
Should entries stick around forever or should they be purged? If a user-specified expiration
time is reached, what should happen to the link?
If we chose to actively search for expired links to remove them, it would put a lot of
pressure on our database. Instead, we can slowly remove expired links and do a lazy
cleanup. Our service will make sure that only expired links are deleted; although some
expired links may live longer, they will never be returned to users.
Whenever a user tries to access an expired link, we can delete the link and return an error
to the user.
A separate Cleanup service can run periodically to remove expired links from our storage
and cache. This service should be very lightweight and can be scheduled to run only when
the user traffic is expected to be low.
We can have a default expiration time for each link (e.g., two years).
After removing an expired link, we can put the key back in the key-DB to be reused.
Should we remove links that haven’t been visited in some length of time, say six months?
This could be tricky. Since storage is getting cheap, we can decide to keep links forever.
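A small sketch of the lazy-cleanup idea, assuming a dictionary-like key-value store and the default two-year expiration mentioned above:

from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=730)   # default expiration of roughly two years

def get_link(store, key):
    # Expired entries are only removed when someone actually tries to access them.
    entry = store.get(key)
    if entry is None:
        return None
    if datetime.now(timezone.utc) > entry["expiration"]:
        del store[key]              # delete lazily; the key can be returned to key-DB for reuse
        return None
    return entry["original_url"]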
11. Telemetry
How many times a short URL has been used, what were user locations, etc.? How would we
store these statistics? If it is part of a DB row that gets updated on each view, what will
happen when a popular URL is slammed with a large number of concurrent requests?
Some statistics worth tracking: country of the visitor, date and time of access, web page that
referred the click, and the browser or platform from which the page was accessed.
12. Security and Permissions
Can users create private URLs or allow a particular set of users to access a URL?
We can store permission level (public/private) with each URL in the database. We can also
create a separate table to store UserIDs that have permission to see a specific URL. If a user
does not have permission and tries to access a URL, we can send an error (HTTP 401) back.
Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key
for the table storing permissions would be the ‘Hash’ (or the KGS generated ‘key’). The
columns will store the UserIDs of those users that have permissions to see the URL.
Designing Pastebin
Let’s design a Pastebin like web service, where users can store plain text. Users of the
service will enter a piece of text and get a randomly generated URL to access it. Similar
Services: pastebin.com, pasted.co, chopapp.com Difficulty Level: Easy
1. What is Pastebin?
Pastebin like services enable users to store plain text or images over the network (typically
the Internet) and generate unique URLs to access the uploaded data. Such services are also
used to share data over the network quickly, as users would just need to pass the URL to let
other users see it.
If you haven’t used pastebin.com before, please try creating a new ‘Paste’ there and
spend some time going through different options their service offers. This will help you a lot
in understanding this chapter better.
2. Requirements and Goals of the System
Functional Requirements:
1. Users should be able to upload or “paste” their data and get a unique URL to access
it.
2. Users will only be able to upload text.
3. Data and links will expire after a specific timespan automatically; users should also
be able to specify expiration time.
4. Users should optionally be able to pick a custom alias for their paste.
Non-Functional Requirements:
1. The system should be highly reliable; any data uploaded should not be lost.
2. The system should be highly available. This is required because if our service is down,
users will not be able to access their Pastes.
3. Users should be able to access their Pastes in real-time with minimum latency.
4. Paste links should not be guessable (not predictable).
Extended Requirements:
1. Analytics; e.g., how many times was a paste accessed?
2. Our service should also be accessible through REST APIs by other services.
3. Some Design Considerations
Pastebin shares some requirements with the URL Shortening service, but there are some
additional design considerations we should keep in mind.
What should be the limit on the amount of text a user can paste at a time? We can limit
users to Pastes no bigger than 10MB to prevent abuse of the service.
Should we impose size limits on custom URLs? Since our service supports custom URLs,
users can pick any URL that they like, but providing a custom URL is not mandatory.
However, it is reasonable (and often desirable) to impose a size limit on custom URLs, so
that we have a consistent URL database.
4. Capacity Estimation and Constraints
Our service will be read-heavy; there will be more read requests compared to new Paste
creations. We can assume a 5:1 ratio between reads and writes.
Traffic estimates: Pastebin services are not expected to have traffic similar to Twitter or
Facebook, let’s assume here that we get one million new pastes added to our system every
day. This leaves us with five million reads per day.
Storage estimates: Users can upload maximum 10MB of data; commonly Pastebin like
services are used to share source code, configs or logs. Such texts are not huge, so let’s
assume that each paste on average contains 10KB.
At this rate, we will be storing about 10GB of data per day: 1M * 10KB => 10 GB/day.
If we want to store this data for ten years, we would need a total storage capacity of 36TB.
With 1M pastes every day, we will have 3.6 billion Pastes in 10 years. We need to generate
and store keys to uniquely identify these pastes. If we use base64 encoding ([A-Z, a-z, 0-9, .,
-]) we would need six letter strings: 64^6 ~= 68.7 billion unique strings.
If it takes one byte to store one character, the total size required to store 3.6B keys would be:
3.6B * 6 => 22 GB
22GB is negligible compared to 36TB. To keep some margin, we will assume a 70% capacity
model (meaning we don’t want to use more than 70% of our total storage capacity at any
point), which raises our storage needs to 51.4TB.
Bandwidth estimates: For write requests, we expect 12 new pastes per second, resulting in
120KB of ingress per second: 12 * 10KB => 120 KB/s.
For read requests, we expect about 58 reads per second, resulting in roughly 0.6MB of egress
per second: 58 * 10KB => 0.6 MB/s.
Although total ingress and egress are not big, we should keep these numbers in mind while
designing our service.
Memory estimates: We can cache some of the hot pastes that are frequently accessed.
Following the 80-20 rule, meaning 20% of hot pastes generate 80% of traffic, we would like
to cache these 20% pastes
Since we have 5M read requests per day, to cache 20% of these requests, we would need:
0.2 * 5M * 10KB ~= 10 GB
5. System APIs
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definitions of the APIs to create/retrieve/delete Pastes:
addPaste(api_dev_key, paste_data, custom_url=None, user_name=None, paste_name=None, expire_date=None)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
paste_data (string): Textual data of the paste.
custom_url (string): Optional custom URL.
user_name (string): Optional user name to be used to generate URL.
paste_name (string): Optional name of the paste.
expire_date (string): Optional expiration date for the paste.
Returns: (string)
A successful insertion returns the URL through which the paste can be accessed, otherwise,
returns an error code.
getPaste(api_dev_key, api_paste_key)
Where “api_paste_key” is a string representing the Paste Key of the paste to be retrieved.
This API will return the textual data of the paste.
deletePaste(api_dev_key, api_paste_key)
6. Database Design
Defining the DB schema in the early stages of the interview would help to understand the
data flow among various components and later would guide towards data partitioning.
A few observations about the nature of the data we are going to store:
1. We need to store billions of records.
2. Each metadata object we store will be small (less than 1KB).
3. Each paste object can be of medium size (up to a few MB).
4. There are no relationships between records, other than storing which user created which Paste.
5. Our service is read-heavy.
Database Schema:
We would need two tables, one for storing information about the Pastes and the other for
users’ data.
Here, ‘URLHash’ is the URL equivalent of the TinyURL and ‘ContentKey’ is the object key
storing the contents of the paste.
7. High Level Design
At a high level, we need an application layer that will serve all the read and write requests.
Application layer will talk to a storage layer to store and retrieve data. We can segregate our
storage layer with one database storing metadata related to each paste, users, etc., while
the other storing the paste contents in some object storage (like Amazon S3). This
division of data will also allow us to scale them individually.
8. Component Design
a. Application layer
Our application layer will process all incoming and outgoing requests. The application
servers will be talking to the backend data store components to serve the requests.
How to handle a write request? Upon receiving a write request, our application server will
generate a six-letter random string, which would serve as the key of the paste (if the user
has not provided a custom key). The application server will then store the contents of the
paste and the generated key in the database. After the successful insertion, the server can
return the key to the user. One possible problem here could be that the insertion fails
because of a duplicate key. Since we are generating a random key, there is a possibility that
the newly generated key could match an existing one. In that case, we should regenerate a
new key and try again. We should keep retrying until we don’t see failure due to the
duplicate key. We should return an error to the user if the custom key they have provided is
already present in our database.
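A sketch of that write path in Python; db.insert_if_absent is a hypothetical atomic "insert only if the key is free" call standing in for the real datastore operation:

import secrets
import string

ALPHABET = string.ascii_letters + string.digits + ".-"   # base64-style alphabet from above

def create_paste(db, contents, custom_key=None):
    if custom_key is not None:
        if not db.insert_if_absent(custom_key, contents):
            raise ValueError("custom key already taken")  # surface the duplicate to the user
        return custom_key
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(6))
        if db.insert_if_absent(key, contents):            # retry on a duplicate-key failure
            return key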
Another solution to the above problem could be to run a standalone Key Generation
Service (KGS) that generates random six-letter strings beforehand and stores them in a
database (let’s call it key-DB). Whenever we want to store a new paste, we will just take one
of the already generated keys and use it. This approach will make things quite simple and
fast since we will not be worrying about duplications or collisions. KGS will make sure all the
keys inserted in key-DB are unique. KGS can use two tables to store keys, one for keys that
are not used yet and one for all the used keys. As soon as KGS gives some keys to an
application server, it can move these to the used keys table. KGS can always keep some keys
in memory so that whenever a server needs them, it can quickly provide them. As soon as
KGS loads some keys in memory, it can move them to the used keys table; this way, we can
make sure each server gets unique keys. If KGS dies before using all the keys loaded in
memory, we will be wasting those keys. We can ignore these keys given that we have a huge
number of them.
Isn’t KGS a single point of failure? Yes, it is. To solve this, we can have a standby replica of
KGS, and whenever the primary server dies, it can take over to generate and provide keys.
Can each app server cache some keys from key-DB? Yes, this can surely speed things up.
Although in this case, if the application server dies before consuming all the keys, we will
end up losing those keys. This could be acceptable since we have 68B unique six-letter keys,
which is a lot more than we require.
How to handle a paste read request? Upon receiving a read paste request, the application
service layer contacts the datastore. The datastore searches for the key, and if it is found,
returns the paste’s contents. Otherwise, an error code is returned.
b. Datastore layer
As described in the high level design, the datastore layer is split in two: a metadata database
that stores information about each paste and user, and an object storage (like Amazon S3)
that stores the actual paste contents. This division allows us to scale each part individually.
9. Purging or DB Cleanup
Please see Designing a URL Shortening service.
Designing Instagram
Let’s design a photo-sharing service like Instagram, where users can upload photos to share
them with other users. Similar Services: Flickr, Picasa Difficulty Level: Medium
1. What is Instagram?
Instagram is a social networking service which enables its users to upload and share their
photos and videos with other users. Instagram users can choose to share either publicly or
privately. Anything shared publicly can be seen by any other user, whereas privately
shared content can only be accessed by a specified set of people. Instagram also enables
its users to share through many other social networking platforms, such as Facebook,
Twitter, Flickr, and Tumblr.
For the sake of this exercise, we plan to design a simpler version of Instagram, where a user
can share photos and can also follow other users. The ‘News Feed’ for each user will consist
of top photos of all the people the user follows.
2. Requirements and Goals of the System
We’ll focus on the following set of requirements while designing Instagram:
Functional Requirements
1. Users should be able to upload/download/view photos.
2. Users can perform searches based on photo/video titles.
3. Users can follow other users.
4. The system should be able to generate and display a user’s News Feed consisting of top photos from all the people the user follows.
Non-functional Requirements
1. Our service needs to be highly available.
2. The acceptable latency of the system is around 200ms for News Feed generation.
3. Consistency can take a hit (in the interest of availability) if a user doesn’t see a photo for a while.
4. The system should be highly reliable; any uploaded photo or video should never be lost.
Not in scope: Adding tags to photos, searching photos on tags, commenting on photos,
tagging users to photos, who to follow, suggestions, etc.
3. Some Design Considerations
The system would be read-heavy, so we will focus on building a system that can retrieve
photos quickly.
1. Practically users can upload as many photos as they like. Efficient management of
storage should be a crucial factor while designing this system.
2. Low latency is expected while viewing photos.
3. Data should be 100% reliable. If a user uploads a photo, the system will guarantee
that it will never be lost.
4. Capacity Estimation and Constraints
Let’s assume we have 500M total users, with 1M daily active users.
2M new photos every day, which works out to about 23 new photos every second.
Average photo file size => 200KB
Total space required for 1 day of photos: 2M * 200KB => 400 GB
Total space required for 10 years: 400GB * 365 (days a year) * 10 (years) ~= 1425 TB
5. High Level System Design
At a high level, we need to support two scenarios, one to upload photos and the other to
view/search photos. Our service would need some object storage servers to store photos
and also some database servers to store metadata information about the photos.
6. Database Schema
Defining the DB schema in the early stages of the interview would help to
understand the data flow among various components and later would guide towards the
data partitioning.
We need to store data about users, their uploaded photos, and people they follow. Photo
table will store all data related to a photo; we need to have an index on (PhotoID,
CreationDate) since we need to fetch recent photos first.
A straightforward approach for storing the above schema would be to use an RDBMS like
MySQL since we require joins. But relational databases come with their challenges,
especially when we need to scale them. For details, please take a look at SQL vs. NoSQL.
We can store the above schema in a distributed key-value store to enjoy the benefits
offered by NoSQL. All the metadata related to photos can go to a table, where the ‘key’
would be the ‘PhotoID’ and the ‘value’ would be an object containing PhotoLocation,
UserLocation, CreationTimestamp, etc.
We need to store relationships between users and photos, to know who owns which photo.
We also need to store the list of people a user follows. For both of these tables, we can use
a wide-column datastore like Cassandra. For the ‘UserPhoto’ table, the ‘key’ would be
‘UserID’ and the ‘value’ would be the list of ‘PhotoIDs’ the user owns, stored in different
columns. We will have a similar scheme for the ‘UserFollow’ table.
7. Data Size Estimation
Let’s estimate how much data will be going into each table and how much total storage we
will need for 10 years.
User: Assuming each “int” and “dateTime” is four bytes, each row in the User’s table will be
of 68 bytes:
UserID (4 bytes) + Name (20 bytes) + Email (32 bytes) + DateOfBirth (4 bytes) + CreationDate
(4 bytes) + LastLogin (4 bytes) = 68 bytes
If we have 500 million users, we will need 32GB of total storage: 500 million * 68 bytes ~= 32GB
Photo: Let's assume each row in the Photo table will be of 284 bytes. If 2M new photos get
uploaded every day, we will need 0.5GB of storage for one day:
2M * 284 bytes ~= 0.5GB per day
For 10 years, we will need 1.88TB of storage.
UserFollow: Each row in the UserFollow table will be of 8 bytes. If we have 500 million users
and on average each user follows 500 users, we would need 1.82TB of storage for the
UserFollow table: 500 million users * 500 followers * 8 bytes ~= 1.82TB
Total space required for all tables for 10 years will be 3.7TB: 32GB + 1.88TB + 1.82TB ~= 3.7TB
8. Component Design
Photo uploads (or writes) can be slow as they have to go to the disk, whereas reads will be
faster, especially if they are being served from cache.
Uploading users can consume all the available connections, as uploading is a slow process.
This means that ‘reads’ cannot be served if the system gets busy with all the write requests.
Web servers have a connection limit, and we should keep this in mind while designing our system. If we assume that a web server can have a maximum of 500 connections at any time, it can’t have more than 500 concurrent uploads or reads. To handle this bottleneck we can split reads and writes into separate services. We will have dedicated servers for reads and different servers for writes to ensure that uploads don’t hog the system.
Separating photos’ read and write requests will also allow us to scale and optimize each of
these operations independently.
The same principle applies to other components of the system too. If we want high availability of the system, we need to have multiple replicas of services running, so that even if a few services die, the system remains available and serving. Redundancy removes the single points of failure in the system.
If only one instance of a service is required to be running at any point, we can run a redundant secondary copy of the service that is not serving any traffic, but whenever the primary has a problem, the secondary can take over after a failover.
Creating redundancy in a system can remove single points of failure and provide a backup or
spare functionality if needed in a crisis. For example, if there are two instances of the same
service running in production, and one fails or degrades, the system can failover to the
healthy copy. Failover can happen automatically or require manual intervention.
a. Partitioning based on UserID
Let’s assume we shard based on the ‘UserID’ so that we can keep all photos of a user on the same shard. If one DB shard is 1TB, we will need four shards to store 3.7TB of data. Let’s assume we keep 10 shards for better performance and scalability.
So we’ll find the shard number by UserID % 10 and then store the data there. To uniquely identify any photo in our system, we can append the shard number to each PhotoID.
How can we generate PhotoIDs? Each DB shard can have its own auto-increment sequence for PhotoIDs, and since we will append the ShardID to each PhotoID, it will be unique throughout our system.
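To make the scheme concrete, here is a small Python sketch of composing a globally unique PhotoID from a per-shard auto-increment value and the shard number. Reserving the low decimal digit for the shard is just one possible encoding, chosen for illustration:

NUM_SHARDS = 10

def shard_for_user(user_id: int) -> int:
    # All photos of a user live on the shard selected by UserID % 10.
    return user_id % NUM_SHARDS

def make_photo_id(per_shard_id: int, shard: int) -> int:
    # Append the shard number to the per-shard auto-increment value so the
    # PhotoID is unique across the whole system.
    return per_shard_id * NUM_SHARDS + shard

def shard_from_photo_id(photo_id: int) -> int:
    return photo_id % NUM_SHARDS

pid = make_photo_id(12345, 7)     # the 12345th photo written to shard 7
assert shard_from_photo_id(pid) == 7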
1. How would we handle hot users? Several people follow such hot users, and a lot of other people see any photo they upload.
2. Some users will have a lot of photos compared to others, thus making a non-uniform
distribution of storage.
3. What if we cannot store all pictures of a user on one shard? If we distribute photos
of a user onto multiple shards, will it cause higher latencies?
4. Storing all photos of a user on one shard can cause issues like unavailability of all of the user’s data if that shard is down, or higher latency if it is serving a high load.
b. Partitioning based on PhotoID
If we can generate unique PhotoIDs first and then find the shard number through “PhotoID % 10”, this can solve the above problems. We would not need to append the ShardID to the PhotoID in this case, as the PhotoID will itself be unique throughout the system.
Wouldn’t this key generating DB be a single point of failure? Yes, it will be. A workaround could be to define two such databases, with one generating even-numbered IDs and the other odd-numbered. For MySQL, the following configuration can define such sequences:
KeyGeneratingServer1:
auto-increment-increment = 2
auto-increment-offset = 1
KeyGeneratingServer2:
auto-increment-increment = 2
auto-increment-offset = 2
We can put a load balancer in front of both of these databases to round robin between
them and to deal with downtime. Both these servers could be out of sync with one
generating more keys than the other, but this will not cause any issue in our system. We can
extend this design by defining separate ID tables for Users, Photo-Comments or other
objects present in our system.
Alternatively, we can implement a ‘key’ generation scheme similar to what we have discussed in Designing a URL Shortening service like TinyURL.
How can we plan for future growth of our system? We can have a large number of logical
partitions to accommodate future data growth, such that, in the beginning, multiple logical
partitions reside on a single physical database server. Since each database server can have
multiple database instances on it, we can have separate databases for each logical partition
on any server. So whenever we feel that a particular database server has a lot of data, we
can migrate some logical partitions from it to another server. We can maintain a config file
(or a separate database) that can map our logical partitions to database servers; this will
enable us to move partitions around easily. Whenever we want to move a partition, we only
have to update the config file to announce the change.
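A small sketch of the config-driven lookup described above; the file name, JSON layout, and partition count are assumptions made for illustration:

import json

NUM_LOGICAL_PARTITIONS = 1024   # assumed; far more partitions than physical servers

def load_partition_map(path: str = "partitions.json") -> dict[int, str]:
    # Hypothetical layout: {"0": "db1.internal", "1": "db1.internal", "2": "db2.internal", ...}
    with open(path) as f:
        return {int(k): v for k, v in json.load(f).items()}

def server_for_user(user_id: int, partition_map: dict[int, str]) -> str:
    logical_partition = user_id % NUM_LOGICAL_PARTITIONS
    return partition_map[logical_partition]

Moving a partition then only requires copying its data and editing the config file; application code keeps resolving servers through the same lookup.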
To create the News Feed for any given user, we need to fetch the latest, most popular and
relevant photos of other people the user follows.
For simplicity, let’s assume we need to fetch the top 100 photos for a user’s News Feed. Our application server will first get a list of people the user follows and then fetch metadata info of the latest 100 photos from each user. In the final step, the server will submit all these photos to our ranking algorithm, which will determine the top 100 photos (based on recency, likes, etc.) and return them to the user. A possible problem with this approach would be
higher latency, as we have to query multiple tables and perform sorting/merging/ranking on
the results. To improve the efficiency, we can pre-generate the News Feed and store it in a
separate table.
Pre-generating the News Feed: We can have dedicated servers that are continuously
generating users’ News Feeds and storing them in a ‘UserNewsFeed’ table. So whenever any
user needs the latest photos for their News Feed, we will simply query this table and return
the results to the user.
Whenever these servers need to generate the News Feed of a user, they will first query the
UserNewsFeed table to find the last time the News Feed was generated for that user. Then,
new News Feed data will be generated from that time onwards (following the
abovementioned steps).
What are the different approaches for sending News Feed contents to the users?
1. Pull: Clients can pull the News Feed contents from the server on a regular basis or
manually whenever they need it. Possible problems with this approach are a) New data
might not be shown to the users until clients issue a pull request b) Most of the time pull
requests will result in an empty response if there is no new data.
2. Push: Servers can push new data to the users as soon as it is available. To efficiently manage this, users have to maintain a Long Poll request with the server for receiving the updates. A possible problem with this approach arises when a user has a lot of follows, or with a celebrity user who has millions of followers; in this case, the server has to push updates quite frequently.
3. Hybrid: We can adopt a hybrid approach. We can move all the users with high followings to a pull-based model and only push data to those users who have a few hundred (or thousand) follows. Another approach could be that the server pushes updates to all the users not more than a certain frequency, letting users with a lot of follows/updates regularly pull data.
For a detailed discussion about News Feed generation, take a look at Designing Facebook’s Newsfeed.
One of the most important requirements to create the News Feed for any given user is to fetch the latest photos from all the people the user follows. For this, we need to have a mechanism to sort photos by their time of creation. To do this efficiently, we can make the photo creation time part of the PhotoID. As we will have a primary index on PhotoID, it will be quite quick to find the latest PhotoIDs.
We can use epoch time for this. Let’s say our PhotoID will have two parts; the first part will represent the epoch time and the second part will be an auto-incrementing sequence. So to make a new PhotoID, we can take the current epoch time and append an auto-incrementing ID from our key-generating DB. We can figure out the shard number from this PhotoID (PhotoID % 10) and store the photo there.
What could be the size of our PhotoID? Let’s say our epoch time starts today; how many bits would we need to store the number of seconds for the next 50 years?
86400 sec/day * 365 (days a year) * 50 (years) => 1.6 billion seconds
We would need 31 bits to store this number. Since, on average, we are expecting 23 new photos per second, we can allocate 9 bits to store the auto-incremented sequence. So every second we can store (2^9 => 512) new photos. We can reset our auto-incrementing sequence every second.
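A minimal Python sketch of packing the two parts into a single PhotoID, using the 31-bit seconds value and 9-bit sequence from the estimate above; the custom epoch constant and the in-memory counter are stand-ins for the real key-generating DB:

import time

CUSTOM_EPOCH = 1_600_000_000   # assumed "epoch starts today" reference point
SEQUENCE_BITS = 9              # 2^9 = 512 photos per second

_sequence = 0                  # stand-in for the key-generating DB, reset every second

def next_photo_id() -> int:
    global _sequence
    seconds = int(time.time()) - CUSTOM_EPOCH          # fits in 31 bits for ~50 years
    _sequence = (_sequence + 1) % (1 << SEQUENCE_BITS)
    return (seconds << SEQUENCE_BITS) | _sequence

def shard_for_photo(photo_id: int, num_shards: int = 10) -> int:
    return photo_id % num_shards

pid = next_photo_id()
print(pid, shard_for_photo(pid))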
We will discuss more details about this technique under ‘Data Sharding’ in Designing Twitter.
Our service would need a massive-scale photo delivery system to serve the globally
distributed users. Our service should push its content closer to the user using a large
number of geographically distributed photo cache servers and use CDNs (for details
see Caching).
We can introduce a cache for metadata servers to cache hot database rows. We can use Memcache to cache the data, and Application servers can quickly check whether the cache has the desired rows before hitting the database. Least Recently Used (LRU) can be a reasonable cache eviction policy for our system. Under this policy, we discard the least recently viewed row first.
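A cache-aside read path for photo metadata could look roughly like the sketch below; the dicts stand in for Memcache and the metadata database, since the exact client library is not specified here:

cache: dict[int, dict] = {}                                     # stand-in for Memcache
database = {42: {"photo_path": "/blobs/42.jpg", "user_id": 7}}  # stand-in for the metadata DB

def get_photo_metadata(photo_id: int):
    row = cache.get(photo_id)        # 1. check the cache first
    if row is not None:
        return row
    row = database.get(photo_id)     # 2. fall back to the database on a miss
    if row is not None:
        cache[photo_id] = row        # 3. populate the cache; LRU eviction handles capacity
    return row

print(get_photo_metadata(42))        # first call misses, later calls are served from the cache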
How can we build a more intelligent cache? If we go with the 80-20 rule, i.e., 20% of the photos generate 80% of the daily read traffic, certain photos are so popular that the majority of people read them. This dictates that we can try caching 20% of the daily read volume of photos and metadata.
Let’s design a file hosting service like Dropbox or Google Drive. Cloud file storage enables
users to store their data on remote servers. Usually, these servers are maintained by cloud
storage providers and made available to users over a network (typically through the
Internet). Users pay for their cloud data storage on a monthly basis. Similar Services:
OneDrive, Google Drive Difficulty Level: Medium
Cloud file storage services have become very popular recently as they simplify the storage and exchange of digital resources among multiple devices. The shift from using a single personal computer to using multiple devices with different platforms and operating systems (such as smartphones and tablets), accessible from various geographical locations at any time, is believed to account for the huge popularity of cloud storage services. Some of the top benefits of such services are:
Availability: The motto of cloud storage services is to have data availability anywhere
anytime. Users can access their files/photos from any device whenever and wherever they
like.
Reliability and Durability: Another benefit of cloud storage is that it offers 100% reliability
and durability of data. Cloud storage ensures that users will never lose their data, by
keeping multiple copies of the data stored on different geographically located servers.
Scalability: Users will never have to worry about getting out of storage space. With cloud
storage, you have unlimited storage as long as you are ready to pay for it.
What do we wish to achieve from a Cloud Storage system? Here are the top-level
requirements for our system:
1. Users should be able to upload and download their files/photos from any device.
2. Users should be able to share files or folders with other users.
3. Our service should support automatic synchronization between devices, i.e., after
updating a file on one device, it should get synchronized on all devices.
4. The system should support storing large files up to a GB.
5. ACID-ity is required. Atomicity, Consistency, Isolation and Durability of all file
operations should be guaranteed.
6. Our system should support offline editing. Users should be able to
add/delete/modify files while offline, and as soon as they come online, all their
changes should be synced to the remote servers and other online devices.
Extended Requirements
The system should support snapshotting of the data, so that users can go back to any
version of the files.
Let’s assume that we have 500M total users, and 100M daily active users (DAU).
Let’s assume that on average each user connects from three different devices.
On average if a user has 200 files/photos, we will have 100 billion total files.
Let’s assume that average file size is 100KB, this would give us ten petabytes of total
storage.
The user will specify a folder as the workspace on their device. Any file/photo/folder placed
in this folder will be uploaded to the cloud, and whenever a file is modified or deleted, it will
be reflected in the same way in the cloud storage. The user can specify similar workspaces
on all their devices and any modification done on one device will be propagated to all other
devices to have the same view of the workspace everywhere.
At a high level, we need to store files and their metadata information like File Name, File
Size, Directory, etc., and who this file is shared with. So, we need some servers that can help
the clients to upload/download files to Cloud Storage and some servers that can facilitate
updating metadata about files and users. We also need some mechanism to notify all clients
whenever an update happens so they can synchronize their files.
As shown in the diagram below, Block servers will work with the clients to upload/download
files from cloud storage, and Metadata servers will keep metadata of files updated in a SQL
or NoSQL database. Synchronization servers will handle the workflow of notifying all clients
about different changes for synchronization.
6. Component Design
a. Client
The Client Application monitors the workspace folder on user’s machine and syncs all
files/folders in it with the remote Cloud Storage. The client application will work with the
storage servers to upload, download and modify actual files to backend Cloud Storage. The
client also interacts with the remote Synchronization Service to handle any file metadata
updates e.g. change in the file name, size, modification date, etc.
How do we handle file transfer efficiently? As mentioned above, we can break each file into
smaller chunks so that we transfer only those chunks that are modified and not the whole
file. Let’s say we divide each file into fixed-size chunks of 4MB. We can statically calculate
what could be an optimal chunk size based on 1) Storage devices we use in the cloud to
optimize space utilization and Input/output operations per second (IOPS) 2) Network
bandwidth 3) Average file size in the storage etc. In our metadata, we should also keep a
record of each file and the chunks that constitute it.
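A Python sketch of fixed-size chunking with a per-chunk fingerprint (4MB chunks as above; SHA-256 is one reasonable choice of hash and is mentioned later under the Synchronization Service):

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024   # 4MB, per the discussion above

def chunk_file(path: str):
    """Yield (index, sha256_hex, data) for each fixed-size chunk of the file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield index, hashlib.sha256(data).hexdigest(), data
            index += 1

The client's metadata would record, per file, the ordered list of chunk hashes; on a later edit, only chunks whose hash changed need to be re-uploaded.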
Should we keep a copy of metadata with the Client? Keeping a local copy of metadata not only enables us to do offline updates but also saves a lot of round trips to update remote metadata.
How can clients efficiently listen to changes happening on other clients? One solution could be that the clients periodically check with the server if there are any changes. The problem with this approach is that we will have a delay in reflecting changes locally, as clients will only be checking for changes periodically, compared to the server notifying them whenever there is a change. If the client frequently checks the server for changes, it will not only waste bandwidth, as the server has to return an empty response most of the time, but will also keep the server busy. Pulling information in this manner is not scalable either.
A solution to the above problem could be to use HTTP long polling. With long polling, the client
requests information from the server with the expectation that the server may not respond
immediately. If the server has no new data for the client when the poll is received, instead
of sending an empty response, the server holds the request open and waits for response
information to become available. Once it does have new information, the server
immediately sends an HTTP/S response to the client, completing the open HTTP/S Request.
Upon receipt of the server response, the client can immediately issue another server
request for future updates.
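A rough client-side long-polling loop, assuming a hypothetical /changes endpoint that holds the request open until there is new data or a timeout (the requests package is a third-party HTTP client):

import time
import requests

SERVER = "https://sync.example.com"   # placeholder URL for illustration

def apply_changes(changes: list) -> None:
    print("applying", len(changes), "changes")   # stand-in for the Indexer

def long_poll(cursor: str = "") -> None:
    while True:
        try:
            # The server holds this request open until new data arrives or ~60s pass.
            resp = requests.get(f"{SERVER}/changes", params={"cursor": cursor}, timeout=70)
            resp.raise_for_status()
            body = resp.json()
            if body.get("changes"):
                apply_changes(body["changes"])
            cursor = body.get("cursor", cursor)   # next poll resumes from here
        except requests.RequestException:
            time.sleep(1)   # timeout or transient error: pause briefly, then reissue the poll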
Based on the above considerations, we can divide our client into the following four parts:
I. Internal Metadata Database will keep track of all the files, chunks, their versions, and
their location in the file system.
II. Chunker will split the files into smaller pieces called chunks. It will also be responsible for
reconstructing a file from its chunks. Our chunking algorithm will detect the parts of the files
that have been modified by the user and only transfer those parts to the Cloud Storage; this
will save us bandwidth and synchronization time.
III. Watcher will monitor the local workspace folders and notify the Indexer (discussed
below) of any action performed by the users, e.g., when users create, delete, or update files
or folders. Watcher also listens to any changes happening on other clients that are
broadcasted by Synchronization service.
IV. Indexer will process the events received from the Watcher and update the internal metadata database with information about the chunks of the modified files. Once the chunks are successfully uploaded to or downloaded from the Cloud Storage, the Indexer will communicate with the remote Synchronization Service to broadcast changes to other clients and update the remote metadata database.
How should clients handle slow servers? Clients should exponentially back-off if the server
is busy/not-responding. Meaning, if a server is too slow to respond, clients should delay
their retries, and this delay should increase exponentially.
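A minimal sketch of such exponential back-off; the jitter term is a common refinement added here for illustration, not something prescribed by the text:

import random
import time

def call_with_backoff(request_fn, max_attempts: int = 6, base_delay: float = 1.0):
    """Retry request_fn, doubling the delay after each failed attempt."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                        # give up after the last attempt
            delay = base_delay * (2 ** attempt)              # 1s, 2s, 4s, 8s, ...
            time.sleep(delay + random.uniform(0, 1))         # jitter avoids synchronized retries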
Should mobile clients sync remote changes immediately? Unlike desktop or web clients, which check for file changes on a regular basis, mobile clients usually sync on demand to save the user’s bandwidth and space.
b. Metadata Database
The Metadata Database is responsible for maintaining the versioning and metadata
information about files/chunks, users, and workspaces. The Metadata Database can be a
relational database such as MySQL, or a NoSQL database service such as DynamoDB.
Regardless of the type of the database, the Synchronization Service should be able to provide a consistent view of the files using the database, especially if more than one user works with the same file simultaneously. Since NoSQL data stores do not support ACID properties in favor of scalability and performance, we need to incorporate support for ACID properties programmatically in the logic of our Synchronization Service in case we opt for this kind of database. However, using a relational database can simplify the implementation of the Synchronization Service, as relational databases natively support ACID properties.
The Metadata Database should store information about the following objects:
1. Chunks
2. Files
3. User
4. Devices
5. Workspace (sync folders)
c. Synchronization Service
The Synchronization Service is the component that processes file updates made by a client
and applies these changes to other subscribed clients. It also synchronizes clients’ local
databases with the information stored in the remote Metadata DB. The Synchronization
Service is the most important part of the system architecture due to its critical role in
managing the metadata and synchronizing users’ files. Desktop clients communicate with
the Synchronization Service to either obtain updates from the Cloud Storage or send files
and updates to the Cloud Storage and potentially other users. If a client was offline for a
period, it polls the system for new updates as soon as it becomes online. When the
Synchronization Service receives an update request, it checks with the Metadata Database
for consistency and then proceeds with the update. Subsequently, a notification is sent to all
subscribed users or devices to report the file update.
The Synchronization Service should be designed in such a way to transmit less data between
clients and the Cloud Storage to achieve better response time. To meet this design goal, the
Synchronization Service can employ a differencing algorithm to reduce the amount of the
data that needs to be synchronized. Instead of transmitting entire files from clients to the
server or vice versa, we can just transmit the difference between two versions of a file.
Therefore, only the part of the file that has been changed is transmitted. This also decreases
bandwidth consumption and cloud data storage for the end user. As described above we
will be dividing our files into 4MB chunks and will be transferring modified chunks only.
Server and clients can calculate a hash (e.g., SHA-256) to see whether to update the local copy of a chunk or not. On the server, if we already have a chunk with the same hash (even from another user), we don’t need to create another copy; we can use the same chunk. This is discussed in detail later under Data Deduplication.
d. Message Queuing Service
The Message Queuing Service will implement two types of queues in our system. The Request Queue is a global queue, and all clients will share it. Clients’ requests to update the Metadata Database will be sent to the Request Queue first; from there, the Synchronization Service will take them to update the metadata. The Response Queues that correspond to individual subscribed clients are responsible for delivering the update messages to each client. Since a message will be deleted from the queue once received by a client, we need to create separate Response Queues for each subscribed client to share update messages.
e. Cloud/Block Storage
Cloud/Block Storage stores chunks of files uploaded by the users. Clients directly interact
with the storage to send and receive objects from it. Separation of the metadata from
storage enables us to use any storage either in cloud or in-house.
The sequence below shows the interaction between the components of the application in a
scenario when Client A updates a file that is shared with Client B and C, so they should
receive the update too. If the other clients were not online at the time of the update, the
Message Queuing Service keeps the update notifications in separate response queues for
them until they become online later.
8. Data Deduplication
Data deduplication is a technique used for eliminating duplicate copies of data to improve storage utilization. It can also be applied to network data transfers to reduce the number of bytes that must be sent. For each new incoming chunk, we can calculate a hash of it and compare that hash with the hashes of the existing chunks to see if we already have the same chunk present in our storage. We can implement deduplication in two ways:
a. Post-process deduplication
With post-process deduplication, new chunks are first stored on the storage device, and later some process analyzes the data looking for duplication. The benefit is that clients will not need to wait for the hash calculation or lookup to complete before storing the data, ensuring there is no degradation in storage performance. Drawbacks of this approach are that duplicate data will be stored, though only for a short time, and duplicate data will be transferred, consuming bandwidth.
b. In-line deduplication
Alternatively, deduplication hash calculations can be done in real-time as the clients are
entering data on their device. If our system identifies a chunk which it has already stored,
only a reference to the existing chunk will be added in the metadata, rather than the full
copy of the chunk. This approach will give us optimal network and storage usage.
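The in-line check can be sketched as a lookup keyed by the chunk hash; the dict below stands in for the chunk index kept in the metadata DB, and put_to_block_storage is a placeholder for the actual upload call:

import hashlib

chunk_index: dict[str, str] = {}    # chunk hash -> storage location (stand-in for metadata DB)

def store_chunk(data: bytes, put_to_block_storage) -> str:
    """Return the storage reference for this chunk, uploading it only if unseen."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in chunk_index:                  # already stored, possibly by another user
        return chunk_index[digest]             # only a reference is added to the metadata
    location = put_to_block_storage(data)      # genuinely new chunk: upload it
    chunk_index[digest] = location
    return location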
9. Metadata Partitioning
To scale out metadata DB, we need to partition it so that it can store information about
millions of users and billions of files/chunks. We need to come up with a partitioning
scheme that would divide and store our data to different DB servers.
1. Vertical Partitioning: We can partition our database in such a way that we store tables
related to one particular feature on one server. For example, we can store all the user
related tables in one database and all files/chunks related tables in another database.
Although this approach is straightforward to implement it has some issues:
1. Will we still have scale issues? What if we have trillions of chunks to be stored and our database cannot support storing such a huge number of records? How would we further partition such tables?
2. Joining two tables in two separate databases can cause performance and consistency
issues. How frequently do we have to join user and file tables?
2. Range Based Partitioning: We can store files/chunks in separate partitions based on the first letter of the file path; for example, all files starting with the letter ‘A’ go into one partition, files starting with ‘B’ into another, and so on. The main problem with this approach is that it can lead to unbalanced servers. For example, if we decide to put all files starting with the letter ‘E’ into one DB partition, we may later realize that we have too many files that start with the letter ‘E’, to such an extent that we cannot fit them into one DB partition.
3. Hash-Based Partitioning: In this scheme, we take a hash of the object we are storing and, based on this hash, we figure out the DB partition to which this object should go. In our case, we can take the hash of the ‘FileID’ of the File object we are storing to determine the partition in which the file will be stored. Our hashing function will randomly distribute objects into different partitions, e.g., our hashing function can always map any ID to a number between [1…256], and this number would be the partition in which we will store our object.
This approach can still lead to overloaded partitions, which can be solved by using Consistent Hashing.
10. Caching
We can have two kinds of caches in our system. To deal with hot files/chunks, we can introduce a cache for Block storage. We can use an off-the-shelf solution like Memcache that can store whole chunks with their respective IDs/hashes, and Block servers can quickly check whether the cache has the desired chunk before hitting Block storage. Based on clients’ usage patterns, we can determine how many cache servers we need. A high-end commercial server can have up to 144GB of memory, so one such server can cache 36K chunks (144GB / 4MB).
Which cache replacement policy would best fit our needs? When the cache is full, and we
want to replace a chunk with a newer/hotter chunk, how would we choose? Least Recently
Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least
recently used chunk first.
We can add a load balancing layer at two places in our system: 1) between Clients and Block servers and 2) between Clients and Metadata servers. Initially, a simple Round Robin approach can be adopted that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and will stop sending any traffic to it. A problem with Round Robin LB is that it won’t take server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries the backend servers about their load and adjusts traffic based on that.
One of the primary concerns users will have while storing their files in the cloud is the privacy and security of their data, especially since in our system users can share their files with other users or even make them public. To handle this, we will store the permissions of each file in our metadata DB to reflect which files are visible or modifiable by any user.
Let’s design an instant messaging service like Facebook Messenger, where users can send
text messages to each other through web and mobile interfaces.
Functional Requirements:
Non-functional Requirements:
Extended Requirements:
Group Chats: Messenger should support multiple people talking to each other in a group.
Push notifications: Messenger should be able to notify users of new messages when they
are offline.
Let’s assume that we have 500 million daily active users and on average each user sends 40
messages daily; this gives us 20 billion messages per day.
Storage Estimation: Let’s assume that on average a message is 100 bytes, so to store all the
messages for one day we would need 2TB of storage.
Although Facebook Messenger stores all previous chat history, just for estimation, to save five years of chat history we would need 3.6 petabytes of storage: 2TB/day * 365 days * 5 years ~= 3.6PB.
Other than the chat messages, we would also need to store users’ information and messages’ metadata (ID, timestamp, etc.). Also, the above calculations didn’t take data compression and replication into consideration.
Bandwidth Estimation: If our service is getting 2TB of data every day, this will give us 25MB
of incoming data for each second.
Since each incoming message needs to go out to another user, we will need the same
amount of bandwidth 25MB/s for both upload and download.
At a high level, we will need a chat server that would be the central piece orchestrating all
the communications between users. When a user wants to send a message to another user,
they will connect to the chat server and send the message to the server; the server then
passes that message to the other user and also stores it in the database.
Let’s try to build a simple solution first where everything runs on one server. At a high level, we have two options for delivering messages to users:
1. Pull model: Users can periodically ask the server if there are any new messages for
them.
2. Push model: Users can keep a connection open with the server and can depend
upon the server to notify them whenever there are new messages.
If we go with our first approach, then the server needs to keep track of messages that are
still waiting to be delivered, and as soon as the receiving user connects to the server to ask
for any new message, the server can return all the pending messages. To minimize latency
for the user, they have to check the server quite frequently, and most of the time they will be getting an empty response if there are no pending messages. This will waste a lot of resources and does not look like an efficient solution.
If we go with our second approach, where all the active users keep a connection open with
the server, then as soon as the server receives a message it can immediately pass the
message to the intended user. This way, the server does not need to keep track of the
pending messages, and we will have minimum latency, as the messages are delivered
instantly on the opened connection.
How will clients maintain an open connection with the server? We can use HTTP Long Polling or WebSockets. In long polling, clients can request information from the server
with the expectation that the server may not respond immediately. If the server has no new
data for the client when the poll is received, instead of sending an empty response, the
server holds the request open and waits for response information to become available.
Once it does have new information, the server immediately sends the response to the client,
completing the open request. Upon receipt of the server response, the client can
immediately issue another server request for future updates. This gives a lot of
improvements in latencies, throughputs, and performance. The long polling request can
timeout or can receive a disconnect from the server, in that case, the client has to open a
new request.
How can the server keep track of all the open connections to redirect messages to the users efficiently? The server can maintain a hash table, where the “key” would be the UserID
and “value” would be the connection object. So whenever the server receives a message for
a user, it looks up that user in the hash table to find the connection object and sends the
message on the open request.
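A sketch of that in-memory map on a single chat server; the Connection class is a stand-in for whatever handle the server framework provides for an open long-poll or WebSocket:

class Connection:
    """Stand-in for an open long-poll/WebSocket handle."""
    def __init__(self, user_id: int):
        self.user_id = user_id
    def send(self, message: str) -> None:
        print(f"-> user {self.user_id}: {message}")

connections: dict[int, Connection] = {}       # UserID -> open connection object

def register(user_id: int, conn: Connection) -> None:
    connections[user_id] = conn

def deliver(recipient_id: int, message: str) -> bool:
    conn = connections.get(recipient_id)
    if conn is None:
        return False          # receiver offline: caller can report failure or retry later
    conn.send(message)
    return True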
What will happen when the server receives a message for a user who has gone offline? If
the receiver has disconnected, the server can notify the sender about the delivery failure. If
it is a temporary disconnect, e.g., the receiver’s long-poll request just timed out, then we
should expect a reconnect from the user. In that case, we can ask the sender to retry
sending the message. This retry could be embedded in the client’s logic so that users don’t
have to retype the message. The server can also store the message for a while and retry
sending it once the receiver reconnects.
How many chat servers do we need? Let’s plan for 500 million connections at any time. Assuming a modern server can handle 50K concurrent connections at any time, we would need 10K such servers.
How to know which server holds the connection to which user? We can introduce a
software load balancer in front of our chat servers; that can map each UserID to a server to
redirect the request.
How should the server process a ‘deliver message’ request? The server needs to do the following things upon receiving a new message: 1) store the message in the database, 2) send the message to the receiver, and 3) send an acknowledgment to the sender.
The chat server will first find the server that holds the connection for the receiver and pass
the message to that server to send it to the receiver. The chat server can then send the
acknowledgment to the sender; we don’t need to wait for storing the message in the
database; this can happen in the background. Storing the message is discussed in the next
section.
How does the messenger maintain the sequencing of the messages? We can store a
timestamp with each message, which would be the time when the message is received at
the server. But this will still not ensure correct ordering of messages for clients. The scenario
where the server timestamp cannot determine the exact order of messages would look like
this:
1. User-1 sends a message M1 to the server for User-2.
2. The server receives M1 at T1.
3. Meanwhile, User-2 sends a message M2 to the server for User-1.
4. The server receives M2 at T2, such that T2 > T1.
5. The server sends message M1 to User-2 and message M2 to User-1.
So User-1 will see M1 first and then M2, whereas User-2 will see M2 first and then M1.
To resolve this, we need to keep a sequence number with every message for each client.
This sequence number will determine the exact ordering of messages for EACH user. With
this solution, both clients will see a different view of the message sequence, but this view
will be consistent for them on all devices.
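One way to realize these per-client sequence numbers on the server (a sketch; a real implementation would persist the counters instead of keeping them in memory):

from collections import defaultdict

# Monotonic counter per recipient; every message delivered to a user carries the
# next number, so all of that user's devices order messages identically.
next_seq = defaultdict(int)

def stamp_for_recipient(recipient_id: int, message: dict) -> dict:
    next_seq[recipient_id] += 1
    message["seq"] = next_seq[recipient_id]
    return message

Clients then order messages strictly by "seq": User-1 and User-2 may still see M1 and M2 in different orders, but each user's own view stays consistent across devices.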
Whenever the chat server receives a new message, it needs to store it in the database. To
do so, we have two options:
1. Start a separate thread, which will work with the database to store the message.
2. Send an asynchronous request to the database to store the message.
We have to keep certain things in mind while designing our database.
Which storage system should we use? We need to have a database that can support a very high rate of small updates and also fetch a range of records quickly. This is
required because we have a huge number of small messages that need to be inserted in the
database and while querying, a user is mostly interested in sequentially accessing the
messages.
We cannot use an RDBMS like MySQL or a NoSQL database like MongoDB because we cannot afford to read/write a row from the database every time a user receives/sends a message. This would not only make the basic operations of our service run with high latency but also create a huge load on the databases.
Both of our requirements can be easily met with a wide-column database solution like HBase. HBase is a column-oriented key-value NoSQL database that can store multiple values against one key in multiple columns. HBase is modeled after Google’s BigTable and runs on top of the Hadoop Distributed File System (HDFS). HBase groups data together to store new data in a memory buffer and, once the buffer is full, it dumps the data to the disk. This way of storage not only helps in storing a lot of small data quickly but also in fetching rows by the key or scanning ranges of rows. HBase is also an efficient database for storing variable-sized data, which is also required by our service.
How should clients efficiently fetch data from the server? Clients should paginate while fetching data from the server. Page size could be different for different clients, e.g., cell phones have smaller screens, so we need fewer messages/conversations in the viewport.
We need to keep track of user’s online/offline status and notify all the relevant users
whenever a status change happens. Since we are maintaining a connection object on the
server for all active users, we can easily figure out the user’s current status from this. With
500M active users at any time, if we have to broadcast each status change to all the relevant
active users, it will consume a lot of resources. We can do the following optimization around
this:
1. Whenever a client starts the app, it can pull current status of all users in their
friends’ list.
2. Whenever a user sends a message to another user that has gone offline, we can send
a failure to the sender and update the status on the client.
3. Whenever a user comes online, the server can always broadcast that status with a
delay of a few seconds to see if the user does not go offline immediately.
4. Clients can pull the status from the server for those users that are being shown in the user’s viewport. This should not be a frequent operation, as the server is broadcasting the online status of users and we can live with a stale offline status for a while.
5. Whenever the client starts a new chat with another user, we can pull the status at
that time.
Design Summary: Clients will open a connection to the chat server to send a message; the
server will then pass it to the requested user. All the active users will keep a connection
open with the server to receive messages. Whenever a new message arrives, the chat server
will push it to the receiving user on the long poll request. Messages can be stored in HBase,
which supports quick small updates, and range based searches. The servers can broadcast
the online status of a user to other relevant users. Clients can pull status updates for users
who are visible in the client’s viewport on a less frequent basis.
6. Data partitioning
Since we will be storing a lot of data (3.6PB for five years), we need to distribute it onto
multiple database servers. What would be our partitioning scheme?
Partitioning based on UserID: Let’s assume we partition based on the hash of the UserID so
that we can keep all messages of a user on the same database. If one DB shard is 4TB, we
will have “3.6PB/4TB ~= 900” shards for five years. For simplicity, let’s assume we keep 1K
shards. So we will find the shard number by “hash(UserID) % 1000”, and then store/retrieve
the data from there. This partitioning scheme will also be very quick to fetch chat history for
any user.
In the beginning, we can start with fewer database servers with multiple shards residing on
one physical server. Since we can have multiple database instances on a server, we can
easily store multiple partitions on a single server. Our hash function needs to understand
this logical partitioning scheme so that it can map multiple logical partitions on one physical
server.
Since we will store an unlimited history of messages, we can start with a big number of
logical partitions, which would be mapped to fewer physical servers, and as our storage
demand increases, we can add more physical servers to distribute our logical partitions.
7. Cache
We can cache a few recent messages (say last 15) in a few recent conversations that are
visible in user’s viewport (say last 5). Since we decided to store all of the user’s messages on
one shard, cache for a user should entirely reside on one machine too.
8. Load balancing
We will need a load balancer in front of our chat servers; that can map each UserID to a
server that holds the connection for the user and then direct the request to that server.
Similarly, we would need a load balancer for our cache servers.
What will happen when a chat server fails? Our chat servers are holding connections with
the users. If a server goes down, should we devise a mechanism to transfer those
connections to some other server? It’s extremely hard to failover TCP connections to other
servers; an easier approach can be to have clients automatically reconnect if the connection
is lost.
Should we store multiple copies of user messages? We cannot have only one copy of the
user’s data, because if the server holding the data crashes or is down permanently, we don’t
have any mechanism to recover that data. For this, either we have to store multiple copies
of the data on different servers or use techniques like Reed-Solomon encoding to distribute
and replicate it.
a. Group chat
We can have separate group-chat objects in our system that can be stored on the chat servers. A group-chat object is identified by GroupChatID and will also maintain a list of people who are part of that chat. Our load balancer can direct each group chat message
based on GroupChatID and the server handling that group chat can iterate through all the
users of the chat to find the server handling the connection of each user to deliver the
message.
In databases, we can store all the group chats in a separate table partitioned based on
GroupChatID.
b. Push notifications
In our current design, users can only send messages to active users, and if the receiving user is offline, we send a failure to the sending user. Push notifications will enable our system to send messages to offline users.
For Push notifications, each user can opt-in from their device (or a web browser) to get
notifications whenever there is a new message or event. Each manufacturer maintains a set
of servers that handles pushing these notifications to the user.
To have push notifications in our system, we would need to set up a Notification server, which will take the messages for offline users and send them to the manufacturer’s push notification server, which will then send them to the user’s device.
Designing Twitter
Let’s design a Twitter like social networking service. Users of the service will be able to post
tweets, follow other people and favorite tweets. Difficulty Level: Medium
1. What is Twitter?
Twitter is an online social networking service where users post and read short 140-character
messages called “tweets”. Registered users can post and read tweets, but those who are not
registered can only read them. Users access Twitter through their website interface, SMS or
mobile app.
Functional Requirements
Non-functional Requirements
Extended Requirements
1. Searching tweets.
2. Reply to a tweet.
3. Trending topics – current hot topics/searches.
4. Tagging other users.
5. Tweet Notification.
6. Who to follow? Suggestions?
7. Moments.
Let’s assume we have one billion total users, with 200 million daily active users (DAU). Also,
we have 100 million new tweets every day, and on average each user follows 200 people.
How many favorites per day? If on average each user favorites five tweets per day, we will have:
200M users * 5 favorites => 1B favorites per day
How many total tweet-views will our system generate? Let’s assume on average a user visits their timeline two times a day and visits five other people’s pages. On each page, if a user sees 20 tweets, the total tweet-views our system will generate will be:
200M DAU * ((2 + 5) * 20 tweets) => 28B/day
Storage Estimates Let’s say each tweet has 140 characters and we need two bytes to store a character without compression. Let’s assume we need 30 bytes to store metadata with each tweet (like ID, timestamp, user ID, etc.). Total storage we would need:
100M * (280 + 30) bytes => 30GB/day
What would be our storage needs for five years? How much storage would we need for users’ data, follows, and favorites? We will leave this as an exercise.
Not all tweets will have media; let’s assume that on average every fifth tweet has a photo and every tenth has a video. Let’s also assume on average a photo is 200KB and a video is 2MB. This will lead us to have 24TB of new media every day: (100M/5 photos * 200KB) + (100M/10 videos * 2MB) ~= 24TB/day.
Bandwidth Estimates Since total ingress is 24TB per day, this would translate into
290MB/sec.
Remember that we have 28B tweet views per day. We must show the photo of every tweet (if it has a photo), but let’s assume that the users watch every 3rd video they see in their timeline. So, total egress will be:
28B * 280 bytes / 86400s of text ~= 90MB/s
+ 28B/5 * 200KB / 86400s of photos ~= 13GB/s
+ 28B/10/3 * 2MB / 86400s of videos ~= 22GB/s
Total ~= 35GB/s
4. System APIs
Once we’ve finalized the requirements, it’s always a good idea to define the
system APIs. This should explicitly state what is expected from the system.
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definition of the API for posting a new tweet:
tweet(api_dev_key, tweet_data, tweet_location, user_location, media_ids)
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
tweet_data (string): The text of the tweet, typically up to 140 characters.
tweet_location (string): Optional location (longitude, latitude) this Tweet refers to.
user_location (string): Optional location (longitude, latitude) of the user adding the tweet.
media_ids (number[]): Optional list of media_ids to be associated with the Tweet. (All the media, i.e., photos, videos, etc., need to be uploaded separately.)
Returns: (string)
A successful post will return the URL to access that tweet. Otherwise, an appropriate HTTP
error is returned.
At a high level, we need multiple application servers to serve all these requests with load
balancers in front of them for traffic distributions. On the backend, we need an efficient
database that can store all the new tweets and can support a huge number of reads. We
would also need some file storage to store photos and videos.
Our expected daily write load is 100 million tweets and read load is 28 billion tweet views. This means that, on average, our system will receive around 1160 new tweets and 325K read requests per second. This traffic will be distributed unevenly throughout the day, though; at peak time we should expect at least a few thousand write requests and around 1M read requests per second. We should keep this in mind while designing the architecture of our system.
6. Database Schema
We need to store data about users, their tweets, their favorite tweets, and people they
follow.
For choosing between SQL and NoSQL databases to store the above schema, please see ‘Database schema’ under Designing Instagram.
7. Data Sharding
Since we have a huge number of new tweets every day and our read load is extremely high
too, we need to distribute our data onto multiple machines such that we can read/write it
efficiently. We have many options to shard our data; let’s go through them one by one:
Sharding based on UserID: We can try storing all the data of a user on one server. While storing, we can pass the UserID to our hash function that will map the user to a database server where we will store all of the user’s tweets, favorites, follows, etc. While querying for tweets/follows/favorites of a user, we can ask our hash function where we can find the data of that user and then read it from there. This approach has a couple of issues:
1. What if a user becomes hot? There could be a lot of queries on the server holding
the user. This high load will affect the performance of our service.
2. Over time some users can end up storing a lot of tweets or have a lot of follows
compared to others. Maintaining a uniform distribution of growing user’s data is
quite difficult.
To recover from these situations either we have to repartition/redistribute our data or use
consistent hashing.
Sharding based on TweetID: Our hash function will map each TweetID to a random server where we will store that Tweet. To search tweets, we have to query all servers, and each server will return a set of tweets. A centralized server will aggregate these results to return them to the user. Let’s look into a timeline generation example; here are the steps our system has to perform to generate a user’s timeline:
1. Our application (app) server will find all the people the user follows.
2. App server will send the query to all database servers to find tweets from these
people.
3. Each database server will find the tweets for each user, sort them by recency and
return the top tweets.
4. App server will merge all the results and sort them again to return the top results to
the user.
This approach solves the problem of hot users, but in contrast to sharding by UserID, we
have to query all database partitions to find tweets of a user, which can result in higher
latencies.
We can further improve our performance by introducing cache to store hot tweets in front
of the database servers.
Sharding based on Tweet creation time: Storing tweets based on recency will give us the
advantage of fetching all the top tweets quickly, and we only have to query a very small set
of servers. But the problem here is that the traffic load will not be distributed, e.g., while
writing, all new tweets will be going to one server, and the remaining servers will be sitting
idle. Similarly while reading, the server holding latest data will have a very high load as
compared to servers holding old data.
What if we can combine sharding by TweetID and Tweet creation time? If we don’t store the tweet creation time separately and use the TweetID to reflect that, we can get the benefits of both approaches. This way it will be quite quick to find the latest Tweets. For this, we must make each TweetID universally unique in our system, and each TweetID should contain a timestamp too.
We can use epoch time for this. Let’s say our TweetID will have two parts; the first part will
be representing epoch seconds and the second part will be an auto-incrementing sequence.
So, to make a new TweetID, we can take the current epoch time and append an auto-
incrementing number to it. We can figure out shard number from this TweetID and store it
there.
What could be the size of our TweetID? Let’s say our epoch time starts today; how many bits would we need to store the number of seconds for the next 50 years?
86400 sec/day * 365 (days a year) * 50 (years) => 1.6 billion seconds
We would need 31 bits to store this number. Since on average we are expecting 1150 new
tweets per second, we can allocate 17 bits to store auto incremented sequence; this will
make our TweetID 48 bits long. So, every second we can store (2^17 => 130K) new tweets.
We can reset our auto incrementing sequence every second. For fault tolerance and better
performance, we can have two database servers to generate auto-incrementing keys for us,
one generating even numbered keys and the other generating odd numbered keys.
If we assume our current epoch seconds are “1483228800”, our TweetID will look like this:
1483228800 000001
1483228800 000002
1483228800 000003
1483228800 000004
…
If we make our TweetID 64 bits (8 bytes) long, we can easily store tweets for the next 100 years and also store them with millisecond granularity.
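A small sketch producing TweetIDs in the two-part form listed above (epoch seconds followed by a zero-padded sequence); the six-digit padding simply mirrors the example values, and the in-memory counter stands in for the even/odd key-generating servers:

import itertools
import time

_sequence = itertools.count(1)     # stand-in for the key-generating databases

def next_tweet_id(epoch_seconds: int | None = None) -> str:
    seconds = int(time.time()) if epoch_seconds is None else epoch_seconds
    return f"{seconds}{next(_sequence):06d}"

print(next_tweet_id(1483228800))   # 1483228800000001
print(next_tweet_id(1483228800))   # 1483228800000002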
8. Cache
We can introduce a cache for database servers to cache hot tweets and users. We can use an off-the-shelf solution like Memcache that can store whole tweet objects. Application servers can quickly check whether the cache has the desired tweets before hitting the database. Based on clients’ usage patterns, we can determine how many cache servers we need.
Which cache replacement policy would best fit our needs? When the cache is full, and we
want to replace a tweet with a newer/hotter tweet, how would we choose? Least Recently
Used (LRU) can be a reasonable policy for our system. Under this policy, we discard the least
recently viewed tweet first.
How can we have a more intelligent cache? If we go with the 80-20 rule, that is, 20% of tweets generate 80% of the read traffic, certain tweets are so popular that the majority of people read them. This dictates that we can try to cache 20% of the daily read volume from each shard.
What if we cache the latest data? Our service can benefit from this approach. Let’s say 80% of our users see tweets from the past three days only; then we can try to cache all the tweets from the past three days. Let’s say we have dedicated cache servers that cache all the tweets
from all users from past three days. As estimated above, we are getting 100 million new
tweets or 30GB of new data every day (without photos and videos). If we want to store all
the tweets from last three days, we would need less than 100GB of memory. This data can
easily fit into one server, but we should replicate it onto multiple servers to distribute all the
read traffic to reduce the load on cache servers. So whenever we are generating a user’s
timeline, we can ask the cache servers if they have all the recent tweets for that user, if yes,
we can simply return all the data from the cache. If we don’t have enough tweets in the
cache, we have to query backend to fetch that data. On a similar design, we can try caching
photos and videos from last three days.
Our cache would be like a hash table, where ‘key’ would be ‘OwnerID’ and ‘value’ would be
a doubly linked list containing all the tweets from that user in past three days. Since we
want to retrieve most recent data first, we can always insert new tweets at the head of the
linked list, which means all the older tweets will be near the tail of the linked list. Therefore,
we can remove tweets from the tail to make space for newer tweets.
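Python’s OrderedDict provides exactly this hash-table-plus-doubly-linked-list behavior, so a compact sketch of the per-owner timeline cache could look like this (the capacity constant is illustrative, not from the text):

from collections import OrderedDict

MAX_TWEETS_PER_OWNER = 1000        # illustrative capacity per owner

class OwnerTimelineCache:
    """OwnerID -> most-recent-first tweets, evicting from the tail when full."""
    def __init__(self):
        self._store: dict[int, OrderedDict] = {}

    def add_tweet(self, owner_id: int, tweet_id: int, tweet: dict) -> None:
        timeline = self._store.setdefault(owner_id, OrderedDict())
        timeline[tweet_id] = tweet
        timeline.move_to_end(tweet_id, last=False)   # newest tweet goes to the head
        while len(timeline) > MAX_TWEETS_PER_OWNER:
            timeline.popitem(last=True)              # drop the oldest from the tail

    def recent_tweets(self, owner_id: int, limit: int = 20) -> list:
        timeline = self._store.get(owner_id, OrderedDict())
        return list(timeline.values())[:limit]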
9. Timeline Generation
For a detailed discussion about timeline generation, take a look at Designing Facebook’s Newsfeed.
10. Replication and Fault Tolerance
Since our system is read-heavy, we can have multiple secondary database servers for each
DB partition. Secondary servers will be used for read traffic only. All writes will first go to the
primary server and then will be replicated to secondary servers. This scheme will also give
us fault tolerance, as whenever the primary server goes down, we can failover to a
secondary server.
11. Load Balancing
We can add a load balancing layer at three places in our system: 1) between Clients and Application servers, 2) between Application servers and database replication servers, and 3) between Aggregation servers and Cache servers. Initially, a simple Round Robin approach can be adopted that distributes incoming requests equally among servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and will stop sending any traffic to it. A problem with Round Robin LB is that it won’t take server load into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries the backend servers about their load and adjusts traffic based on that.
12. Monitoring
Having the ability to monitor our systems is crucial. We should constantly collect data to get
an instant insight into how our system is doing. We can collect the following metrics/counters to get an understanding of the performance of our service:
By monitoring these counters, we will realize if we need more replication or load balancing
or caching, etc.
How to serve feeds? Get all the latest tweets from the people someone follows and merge/sort them by time. Use pagination to fetch/show tweets. Only fetch the top N tweets from all the people someone follows. This N will depend on the client’s viewport, as on mobile we show fewer tweets compared to a Web client. We can also cache the next top tweets to speed things up.
Alternatively, we can pre-generate the feed to improve efficiency; for details please see ‘Ranking and timeline generation’ under Designing Instagram.
Retweet: With each Tweet object in the database, we can store the ID of the original Tweet and not store any contents in this retweet object.
Trending Topics: We can cache most frequently occurring hashtags or searched queries in
the last N seconds and keep updating them after every M seconds. We can rank trending
topics based on the frequency of tweets or search queries or retweets or likes. We can give
more weight to topics which are shown to more people.
Who to follow? How to give suggestions? This feature will improve user engagement. We can suggest friends of people someone follows. We can go two or three levels down to find famous people for the suggestions. We can give preference to people with more followers.
As only a few suggestions can be made at any time, use Machine Learning (ML) to shuffle
and re-prioritize. ML signals could include people with recently increased follow-ship,
common followers if the other person is following this user, common location or interests,
etc.
Moments: Get top news from different websites for the past 1 or 2 hours, figure out related tweets, prioritize them, and categorize them (news, support, financials, entertainment, etc.) using ML – supervised learning or clustering. Then we can show these articles as trending topics in Moments.
Search: Search involves Indexing, Ranking, and Retrieval of tweets. A similar solution is discussed in our next problem, Design Twitter Search.
Let’s design a video sharing service like Youtube, where users will be able to upload/view/search videos. Similar Services: netflix.com, vimeo.com, dailymotion.com, veoh.com Difficulty Level: Medium
1. Why Youtube?
Youtube is one of the most popular video sharing websites in the world. Users of the service
can upload, view, share, rate, and report videos as well as add comments on videos.
For the sake of this exercise, we plan to design a simpler version of Youtube with following
requirements:
Functional Requirements:
Non-Functional Requirements:
1. The system should be highly reliable; any video uploaded should not be lost.
2. The system should be highly available. Consistency can take a hit (in the interest of availability); if a user doesn’t see a video for a while, it should be fine.
3. Users should have a real-time experience while watching videos and should not feel any lag.
Not in scope: Video recommendation, most popular videos, channels, and subscriptions,
watch later, favorites, etc.
3. Capacity Estimation and Constraints
Let’s assume we have 1.5 billion total users, 800 million of whom are daily active users. If, on average, a user views five videos per day, total video-views per second would be:
800M * 5 / 86400 sec => ~46K videos/sec
Let’s assume our upload:view ratio is 1:200, i.e., for every video uploaded we have 200 videos viewed, giving us 230 videos uploaded per second:
46K / 200 => ~230 videos/sec
Storage Estimates: Let’s assume that every minute 500 hours worth of videos are uploaded to Youtube. If, on average, one minute of video needs 50MB of storage (videos need to be stored in multiple formats), the total storage needed for videos uploaded in a minute would be:
500 hours * 60 min * 50MB => 1500 GB/min (25 GB/sec)
These numbers are estimates that ignore video compression and replication, which would change them.
Bandwidth estimates: With 500 hours of video uploads per minute and each video upload taking a bandwidth of 10MB/min, we would be getting 300GB of uploads every minute:
500 hours * 60 mins * 10MB => 300GB/min (5 GB/sec)
4. System APIs
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definitions of the APIs for uploading and searching videos:
Parameters (for uploading a video):
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
video_title (string): Title of the video.
video_description (string): Optional description of the video.
tags (string[]): Optional tags for the video.
category_id (string): Category of the video, e.g., Film, Song, People, etc.
default_language (string): For example English, Mandarin, Hindi, etc.
recording_details (string): Location where the video was recorded.
video_contents (stream): Video to be uploaded.
Returns: (string)
A successful upload will return HTTP 202 (request accepted), and once the video encoding is
completed, the user is notified through email with a link to access the video. We can also
expose a queryable API to let users know the current status of their uploaded video.
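For illustration only, the upload API described above might look like the following Python-style stub; the function name and return shape are assumptions, since the original signature is not shown here.

def upload_video(api_dev_key, video_title, video_description=None, tags=None,
                 category_id=None, default_language=None,
                 recording_details=None, video_contents=None):
    # Hypothetical endpoint: persist video_contents to blob storage, enqueue an
    # encoding task, and acknowledge the request with an HTTP 202-style response.
    return {"status": 202, "message": "Upload accepted; encoding in progress"}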
Parameters (for searching videos):
api_dev_key (string): The API developer key of a registered account of our service.
search_query (string): A string containing the search terms.
user_location (string): Optional location of the user performing the search.
maximum_videos_to_return (number): Maximum number of results returned in one
request.
page_token (string): This token will specify a page in the result set that should be returned.
Returns: (JSON)
A JSON containing information about the list of video resources matching the search query.
Each video resource will have a video title, a thumbnail, a video creation date and how
many views it has.
Parameters (for streaming a video):
api_dev_key (string): The API developer key of a registered account of our service.
video_id (string): A string to identify the video.
offset (number): We should be able to stream the video from any offset; this offset would be a time in seconds from the beginning of the video. If we support playing/pausing a video from multiple devices, we will need to store the offset on the server. This will enable users to start watching a video on any device from the same point where they left off.
codec (string) & resolution (string): We should send the codec and resolution info in the API from the client to support play/pause from multiple devices. Imagine you are watching a video on your TV’s Netflix app, pause it there, and then start watching it on your phone’s Netflix app. In this case, you would need the codec and resolution, as the two devices have different resolutions and use different codecs.
Returns: (STREAM)
A media stream (a video chunk) from the given offset.
6. Database Schema
Video metadata storage: For each video, we can store:
VideoID
Title
Description
Size
Thumbnail
Uploader/User
Total number of likes
Total number of dislikes
Total number of views
For each video comment, we can store:
CommentID
VideoID
UserID
Comment
TimeOfCreation
User data storage - MySql
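A minimal sketch of the video and comment metadata tables above, using SQLite purely for illustration; the column types and any names beyond the fields listed are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Video (
    VideoID        TEXT PRIMARY KEY,
    Title          TEXT,
    Description    TEXT,
    Size           INTEGER,            -- bytes
    Thumbnail      TEXT,               -- path/URL of the thumbnail
    UploaderUserID TEXT,
    TotalLikes     INTEGER DEFAULT 0,
    TotalDislikes  INTEGER DEFAULT 0,
    TotalViews     INTEGER DEFAULT 0
);
CREATE TABLE VideoComment (
    CommentID      TEXT PRIMARY KEY,
    VideoID        TEXT REFERENCES Video(VideoID),
    UserID         TEXT,
    Comment        TEXT,
    TimeOfCreation INTEGER             -- epoch seconds
);
""")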
The service would be read-heavy, so we will focus on building a system that can retrieve
videos quickly. We can expect our read:write ratio as 200:1, which means for every video
upload there are 200 video views.
Where would videos be stored? Videos can be stored in a distributed file storage system like HDFS or GlusterFS.
How should we efficiently manage read traffic? We should segregate our read traffic from write traffic. Since we will have multiple copies of each video, we can distribute our read traffic onto different servers. For metadata, we can have master-slave configurations where writes go to the master first and are then replayed at all the slaves. Such configurations can cause some staleness in data; e.g., when a new video is added, its metadata will be inserted into the master first, and before it gets replayed at the slaves, the slaves will not be able to see it and will therefore return stale results to the user. This staleness might be acceptable in our system, as it would be very short-lived and the user would be able to see the new videos after a few milliseconds.
Where would thumbnails be stored? There will be a lot more thumbnails than videos. If we assume that every video will have five thumbnails, we need a very efficient storage system that can serve huge read traffic. There will be two considerations before deciding which storage system should be used for thumbnails:
Let’s evaluate storing all the thumbnails on disk. Given that we have a huge number of files, reading them requires a lot of seeks to different locations on the disk. This is quite inefficient and will result in higher latencies.
Bigtable can be a reasonable choice here, as it combines multiple files into one block to store on disk and is very efficient at reading a small amount of data; both of these are the most significant requirements of our service. Keeping hot thumbnails in a cache will also help in improving latencies, and given that thumbnail files are small in size, we can easily cache a large number of them in memory.
Video Uploads: Since videos could be huge, if the connection drops while uploading, we should support resuming from the same point.
Video Encoding: Newly uploaded videos are stored on the server, and a new task is added to the processing queue to encode the video into multiple formats. Once all the encoding is completed, the uploader is notified and the video is made available for viewing/sharing.
8. Metadata Sharding
Since we have a huge number of new videos every day and our read load is extremely high
too, we need to distribute our data onto multiple machines so that we can perform
read/write operations efficiently. We have many options to shard our data. Let’s go through
different strategies of sharding this data one by one:
Sharding based on UserID: We can try storing all the data for a particular user on one
server. While storing, we can pass the UserID to our hash function which will map the user
to a database server where we will store all the metadata for that user’s videos. While
querying for videos of a user, we can ask our hash function to find the server holding user’s
data and then read it from there. To search videos by titles, we will have to query all servers,
and each server will return a set of videos. A centralized server will then aggregate and rank
these results before returning them to the user.
1. What if a user becomes popular? There could be a lot of queries on the server
holding that user, creating a performance bottleneck. This will affect the overall
performance of our service.
2. Over time, some users can end up storing a lot of videos compared to others.
Maintaining a uniform distribution of growing user’s data is quite tricky.
To recover from these situations, either we have to repartition/redistribute our data or use consistent hashing to balance the load between servers; a minimal consistent-hashing sketch follows.
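For illustration, a minimal consistent-hashing sketch in Python; the number of virtual nodes and the use of MD5 are assumptions made for the example.

import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes=100):
        self._ring = []                      # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):          # virtual nodes smooth the load
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, key):
        # Walk clockwise on the ring to the first virtual node at or after hash(key).
        idx = bisect.bisect(self._ring, (self._hash(key),))
        if idx == len(self._ring):           # wrap around the ring
            idx = 0
        return self._ring[idx][1]

ring = ConsistentHashRing(["meta-db-1", "meta-db-2", "meta-db-3"])
print(ring.get_server("user:42"))            # the same key always maps to one server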
Sharding based on VideoID: Our hash function will map each VideoID to a random server
where we will store that Video’s metadata. To find videos of a user we will query all servers,
and each server will return a set of videos. A centralized server will aggregate and rank
these results before returning them to the user. This approach solves our problem of
popular users but shifts it to popular videos.
We can further improve our performance by introducing cache to store hot videos in front
of the database servers.
9. Video Deduplication
With a huge number of users uploading a massive amount of video data, our service will have to deal with widespread video duplication. Duplicate videos often differ in aspect ratios or encodings, can contain overlays or additional borders, or can be excerpts from a longer original video. The proliferation of duplicate videos has an impact on many levels:
1. Data Storage: We could be wasting storage space by keeping multiple copies of the
same video.
2. Caching: Duplicate videos would result in degraded cache efficiency by taking up
space that could be used for unique content.
3. Network usage: Increasing the amount of data that must be sent over the network to
in-network caching systems.
4. Energy consumption: Higher storage, inefficient cache, and network usage will result
in energy wastage.
For the end user, these inefficiencies will be realized in the form of duplicate search results,
longer video startup times, and interrupted streaming.
For our service, deduplication makes the most sense early, when a user is uploading a video, as opposed to post-processing videos later to find duplicates. Inline deduplication will save us a lot of resources that would otherwise be used to encode, transfer, and store the duplicate copy of the video. As soon as a user starts uploading a video, our service can run video matching algorithms (e.g., Block Matching, Phase Correlation, etc.) to find duplicates. If we already have a copy of the video being uploaded, we can either stop the upload and use the existing copy or use the newly uploaded video if it is of higher quality. If the newly uploaded video is a subpart of an existing video or vice versa, we can intelligently divide the video into smaller chunks so that we only upload the parts that are missing.
10. Load Balancing
We should use Consistent Hashing among our cache servers, which will also help in balancing the load between cache servers. Since we will be using a static hash-based scheme to map videos to hostnames, it can lead to an uneven load on the logical replicas
due to the different popularity of each video. For instance, if a video becomes popular, the
logical replica corresponding to that video will experience more traffic than other servers.
These uneven loads for logical replicas can then translate into uneven load distribution on
corresponding physical servers. To resolve this issue, any busy server in one location can
redirect a client to a less busy server in the same cache location. We can use dynamic HTTP
redirections for this scenario.
However, the use of redirections also has its drawbacks. First, since our service tries to load
balance locally, it leads to multiple redirections if the host that receives the redirection can’t
serve the video. Also, each redirection requires a client to make an additional HTTP request;
it also leads to higher delays before the video starts playing back. Moreover, inter-tier (or
cross data-center) redirections lead a client to a distant cache location because the higher
tier caches are only present at a small number of locations.
11. Cache
To serve globally distributed users, our service needs a massive-scale video delivery system.
Our service should push its content closer to the user using a large number of geographically
distributed video cache servers. We need a strategy that will maximize user performance and also evenly distribute the load on the cache servers.
We can introduce a cache for metadata servers to cache hot database rows. We can use Memcache to cache the data, and Application servers can quickly check whether the cache has the desired rows before hitting the database. Least Recently Used (LRU) can be a reasonable cache eviction policy for our system; under this policy, we discard the least recently viewed row first.
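A minimal LRU sketch in Python, purely for illustration; the capacity and the in-process dictionary are assumptions, whereas a real deployment would typically rely on Memcache's own eviction.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()           # keys ordered from oldest to newest

    def get(self, key):
        if key not in self._data:
            return None                      # cache miss: caller falls back to the DB
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("video:1", {"title": "a"})
cache.put("video:2", {"title": "b"})
cache.get("video:1")                         # touch video:1
cache.put("video:3", {"title": "c"})         # evicts video:2
print(cache.get("video:2"))                  # None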
How can we build a more intelligent cache? If we go with the 80-20 rule, i.e., 20% of the daily read volume of videos generates 80% of the traffic, meaning certain videos are so popular that the majority of people view them, it follows that we can try caching 20% of the daily read volume of videos and metadata.
12. Content Delivery Network (CDN)
A CDN is a system of distributed servers that deliver web content to a user based on the geographic location of the user, the origin of the web page, and a content delivery server. Take a look at the ‘CDN’ section in our Caching chapter.
CDNs replicate content in multiple places. There’s a better chance of videos being closer to
the user, and with fewer hops, videos will stream from a friendlier network.
CDN machines make heavy use of caching and can mostly serve videos out of memory.
Less popular videos (1-20 views per day) that are not cached by CDNs can be served by our
servers in various data centers.
Let’s design a real-time suggestion service, which will recommend terms to users as they
enter text for searching. Similar Services: Auto-suggestions, Typeahead search Difficulty:
Medium
Typeahead suggestions enable users to search for known and frequently searched terms. As
the user types into the search box, it tries to predict the query based on the characters the
user has entered and gives a list of suggestions to complete the query. Typeahead
suggestions help the user to articulate their search queries better. It’s not about speeding
up the search process but rather about guiding the users and lending them a helping hand in
constructing their search query.
Functional Requirements: As the user types in their query, our service should suggest the top 10 terms starting with whatever the user has typed.
Non-functional Requirements: The suggestions should appear in real-time; the user should be able to see the suggestions within 200ms.
The problem we are solving is that we have a lot of ‘strings’ that we need to store in such a way that users can search on any prefix. Our service will suggest the next terms that match the given prefix. For example, if our database contains the following terms: cap, cat, captain, capital; and the user has typed in ‘cap’, our system should suggest ‘cap’, ‘captain’ and ‘capital’.
Since we have to serve a lot of queries with minimum latency, we need a scheme that can efficiently store our data such that it can be queried quickly. We can’t depend on a database for this; we need to store our index in memory in a highly efficient data structure.
One of the most appropriate data structures for our purpose is the Trie (pronounced “try”). A trie is a tree-like data structure used to store phrases, where each node stores a character of the phrase in a sequential manner. For example, if we need to store ‘cap, cat, caption, captain, capital’ in the trie, it would look like:
Now if the user has typed ‘cap’, our service can traverse the trie to go to the node ‘P’ to find
all the terms that start with this prefix (e.g., cap-tion, cap-ital etc).
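For illustration, a minimal trie sketch in Python that supports inserting terms with a search count and collecting all terms under a prefix; the node layout and names are assumptions made for this example.

class TrieNode:
    def __init__(self):
        self.children = {}        # character -> TrieNode
        self.count = 0            # searches that terminated at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, count=1):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def terms_with_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []
        self._collect(node, prefix, results)
        return results

    def _collect(self, node, path, results):
        if node.count:
            results.append((path, node.count))
        for ch, child in node.children.items():
            self._collect(child, path + ch, results)

trie = Trie()
for term, cnt in [("cap", 50), ("cat", 10), ("caption", 500), ("captain", 100), ("capital", 70)]:
    trie.insert(term, cnt)
print(sorted(trie.terms_with_prefix("cap"), key=lambda t: -t[1])[:3])
# [('caption', 500), ('captain', 100), ('capital', 70)]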
We can merge nodes that have only one branch to save storage space. The above trie can
be stored like this:
Should we have case insensitive trie? For simplicity and search use case let’s assume our
data is case insensitive.
How to find top suggestions? Now that we can find all the terms for a given prefix, how can we know the top 10 terms that we should suggest? One simple solution could be to store the count of searches that terminated at each node; e.g., if users have searched for ‘CAPTAIN’ 100 times and ‘CAPTION’ 500 times, we can store these numbers with the last character of each phrase. So now, if the user has typed ‘CAP’, we know the most searched term under the prefix ‘CAP’ is ‘CAPTION’. Given a prefix, we can then traverse the sub-tree under it to find the top suggestions.
Given a prefix, how much time will it take to traverse its sub-tree? Given the amount of data we need to index, we should expect a huge tree. Even traversing a sub-tree could take really long; e.g., the phrase ‘system design interview questions’ is 30 levels deep. Since we have very strict latency requirements, we do need to improve the efficiency of our solution.
Can we store top suggestions with each node? This can surely speed up our searches but
will require a lot of extra storage. We can store top 10 suggestions at each node that we can
return to the user. We have to bear the big increase in our storage capacity to achieve the
required efficiency.
We can optimize our storage by storing only references of the terminal nodes rather than
storing the entire phrase. To find the suggested term we’ve to traverse back using the
parent reference from the terminal node. We will also need to store the frequency with
each reference to keep track of top suggestions.
How would we build this trie? We can efficiently build our trie bottom up. Each parent node
will recursively call all the child nodes to calculate their top suggestions and their counts.
Parent nodes will combine top suggestions from all of their children to determine their top
suggestions.
How to update the trie? Assuming five billion searches every day, we would get approximately 60K queries per second. If we try to update our trie for every query, it will be extremely resource intensive, and this can hamper our read requests too. One solution to handle this could be to update our trie offline after a certain interval.
As new queries come in, we can log them and also track their frequencies. We can either log every query or sample and log every 1000th query. For example, if we don’t want to show a term that is searched less than 1000 times, it’s safe to log only every 1000th searched term.
We can have a Map-Reduce (MR) setup to process all the logged data periodically, say every hour. These MR jobs will calculate the frequencies of all terms searched in the past hour.
We can then update our trie with this new data. We can take the current snapshot of the
trie and update it with all the new terms and their frequencies. We should do this offline, as
we don’t want our read queries to be blocked by update trie requests. We can have two
options:
1. We can make a copy of the trie on each server to update it offline. Once done we
can switch to start using it and discard the old one.
2. Another option is we can have a master-slave configuration for each trie server. We
can update slave while the master is serving traffic. Once the update is complete, we
can make the slave our new master. We can later update our old master, which can
then start serving traffic too.
How can we update the frequencies of typeahead suggestions? Since we are storing the frequencies of our typeahead suggestions with each node, we need to update them too. We can update only the differences in frequencies rather than recounting all search terms from scratch. If we’re keeping counts of all the terms searched in the last 10 days, we’ll need to subtract the counts from the time period no longer included and add the counts for the new time period being included. We can add and subtract frequencies based on the Exponential Moving Average (EMA) of each term. In EMA, we give more weight to the latest data. It’s also known as the exponentially weighted moving average.
After inserting a new term in the trie, we’ll go to the terminal node of the phrase and increase its frequency. Since we’re storing the top 10 queries in each node, it is possible that this particular search term jumped into the top 10 queries of a few other nodes, so we need to update the top 10 queries of those nodes too. We have to traverse back from the node all the way up to the root. For every parent, we check if the current query is part of the top 10. If so, we update the corresponding frequency. If not, we check if the current query’s frequency is high enough to be a part of the top 10; if so, we insert this new term and remove the term with the lowest frequency. A minimal sketch of this upward update follows.
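This sketch assumes each node keeps a parent pointer and a small cached list of (term, frequency) pairs; those fields are assumptions layered on top of the trie sketch above.

TOP_K = 10

class SuggestNode:
    def __init__(self, parent=None):
        self.parent = parent
        self.top = []                        # cached list of (term, freq), best first

def propagate_to_ancestors(terminal_node, term, new_freq):
    # Walk from the terminal node up to the root, refreshing each node's
    # cached top-K list with the term's new frequency.
    node = terminal_node
    while node is not None:
        entries = dict(node.top)
        if term in entries or len(entries) < TOP_K:
            entries[term] = new_freq
        else:
            weakest = min(entries, key=entries.get)
            if new_freq > entries[weakest]:  # replace the weakest entry
                del entries[weakest]
                entries[term] = new_freq
        node.top = sorted(entries.items(), key=lambda e: -e[1])[:TOP_K]
        node = node.parent

root = SuggestNode()
leaf = SuggestNode(parent=SuggestNode(parent=root))
propagate_to_ancestors(leaf, "caption", 501)
print(root.top)                              # [('caption', 501)]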
How can we remove a term from the trie? Let’s say we have to remove a term from the trie because of a legal issue, hate speech, piracy, etc. We can completely remove such terms from the trie when the regular update happens; meanwhile, we can add a filtering layer on each server that will remove any such term before sending suggestions to users.
What could be different ranking criteria for suggestions? In addition to a simple count, for
terms ranking, we have to consider other factors too, e.g., freshness, user location,
language, demographics, personal history etc.
How to store the trie in a file so that we can rebuild it easily (this will be needed when a machine restarts)? We can take a snapshot of our trie periodically and store it in a file. This will enable us to rebuild the trie if a server goes down. To store it, we can start with the root node and save the trie node by node. With each node, we store the character it contains and how many children it has, and right after each node we put its children’s sub-trees. Let’s assume we have the following trie:
If we store this trie in a file with the above-mentioned scheme, we will have:
“C2,A2,R1,T,P,O1,D”. From this, we can easily rebuild our trie.
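A minimal sketch of this serialization scheme in Python; interpreting the example string as a pre-order (depth-first) walk, where each node is written as its character followed by its child count, is an assumption made for this example.

def serialize(node, out):
    # Emit '<char><child-count>' for this node, then its children's sub-trees.
    out.append(f"{node['ch']}{len(node['children']) or ''}")
    for child in node["children"]:
        serialize(child, out)

def deserialize(tokens):
    # Rebuild a node from the token stream produced by serialize().
    token = next(tokens)
    ch, count = token[0], int(token[1:] or 0)
    return {"ch": ch, "children": [deserialize(tokens) for _ in range(count)]}

# The trie from the example: C -> (A -> (R -> T, P), O -> D)
trie = {"ch": "C", "children": [
    {"ch": "A", "children": [
        {"ch": "R", "children": [{"ch": "T", "children": []}]},
        {"ch": "P", "children": []},
    ]},
    {"ch": "O", "children": [{"ch": "D", "children": []}]},
]}

out = []
serialize(trie, out)
print(",".join(out))                                    # C2,A2,R1,T,P,O1,D
rebuilt = deserialize(iter(",".join(out).split(",")))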
If you’ve noticed, we are not storing the top suggestions and their counts with each node. This information is hard to store with this scheme: since the trie is written top-down, child nodes are not created before their parents, so there is no easy way to store references to them. For this, we have to recalculate all the top terms with counts while we are building the trie: each node calculates its top suggestions and passes them to its parent, and each parent node merges the results from all of its children to figure out its own top suggestions.
5. Scale Estimation
If we are building a service, which has the same scale as that of Google, we can expect 5
billion searches every day, which would give us approximately 60K queries per second.
Since there will be a lot of duplicates in 5 billion queries, we can assume that only 20% of
these will be unique. If we only want to index top 50% of the search terms, we can get rid of
a lot of less frequently searched queries. Let’s assume we will have 100 million unique terms
for which we want to build an index.
Storage Estimation: If, on average, each query consists of 3 words, and the average length of a word is 5 characters, this gives us 15 characters of average query size. Assuming we need 2 bytes to store a character, we will need 30 bytes to store an average query. So the total storage we will need:
100 million * 30 bytes => 3 GB
We can expect some growth in this data every day, but we should also be removing some terms that are not searched anymore. If we assume we have 2% new queries every day and we maintain our index for the last one year, the total storage we should expect:
3 GB + (0.02 * 3 GB * 365 days) => ~25 GB
6. Data Partition
Although our index can easily fit on one server, we can still partition it in order to meet our
requirements of higher efficiency and lower latencies. How can we efficiently partition our
data to distribute it onto multiple servers?
a. Range Based Partitioning: What if we store our phrases in separate partitions based on their first letter? So we save all the terms starting with the letter ‘A’ in one partition, those starting with the letter ‘B’ in another partition, and so on. We can even combine certain less frequently occurring letters into one database partition. We should come up with this partitioning scheme statically so that we can always store and search terms in a predictable manner.
The main problem with this approach is that it can lead to unbalanced servers; for instance, if we decide to put all terms starting with the letter ‘E’ into one DB partition, we may later realize that we have too many terms starting with ‘E’ to fit into a single partition.
We can see that the above problem will happen with every statically defined scheme. It is
not possible to calculate if each of our partitions will fit on one server statically.
b. Partition based on the maximum capacity of the server: Let’s say we partition our trie based on the maximum memory capacity of the servers. We can keep storing data on a server as long as it has memory available. Whenever a sub-tree cannot fit on a server, we break our partition there, assign that range to the current server, and move on to the next server to repeat this process. Let’s say our first trie server can store all terms from ‘A’ to ‘AABC’, which means our next server will store terms from ‘AABD’ onwards. If our second server can store up to ‘BXA’, the next server will start from ‘BXB’, and so on. We can keep a hash table to quickly access this partitioning scheme:
Server 1, A-AABC
Server 2, AABD-BXA
Server 3, BXB-CDA
For querying, if the user has typed ‘A’ we have to query both server 1 and 2 to find the top
suggestions. When the user has typed ‘AA’, still we have to query server 1 and 2, but when
the user has typed ‘AAA’ we only need to query server 1.
We can have a load balancer in front of our trie servers, which can store this mapping and
redirect traffic. Also if we are querying from multiple servers, either we need to merge the
results at the server side to calculate overall top results, or make our clients do that. If we
prefer to do this on the server side, we need to introduce another layer of servers between
load balancers and trie servers, let’s call them aggregator. These servers will aggregate
results from multiple trie servers and return the top results to the client.
Partitioning based on the maximum capacity can still lead us to hotspots e.g., if there are a
lot of queries for terms starting with ‘cap’, the server holding it will have a high load
compared to others.
c. Partition based on the hash of the term: Each term will be passed to a hash function,
which will generate a server number and we will store the term on that server. This will
make our term distribution random and hence minimizing hotspots. To find typeahead
suggestions for a term, we have to ask all servers and then aggregate the results. We have
to use consistent hashing for fault tolerance and load distribution.
7. Cache
We should realize that caching the top searched terms will be extremely helpful in our
service. There will be a small percentage of queries that will be responsible for most of the
traffic. We can have separate cache servers in front of the trie servers, holding most
frequently searched terms and their typeahead suggestions. Application servers should
check these cache servers before hitting the trie servers to see if they have the desired
searched terms.
We can also build a simple Machine Learning (ML) model that can try to predict the
engagement on each suggestion based on simple counting, personalization, or trending data
etc., and cache these terms.
We should have replicas for our trie servers both for load balancing and also for fault
tolerance. We also need a load balancer that keeps track of our data partitioning scheme
and redirects traffic based on the prefixes.
9. Fault Tolerance
What will happen when a trie server goes down? As discussed above, we can have a master-slave configuration; if the master dies, the slave can take over after failover. Any server that comes back up can rebuild the trie from the last snapshot.
We can perform the following optimizations on the client to improve user’s experience:
1. The client should only try hitting the server if the user has not pressed any key for
50ms.
2. If the user is constantly typing, the client can cancel the in-progress requests.
3. Initially, the client can wait until the user enters a couple of characters.
4. Clients can pre-fetch some data from the server to save future requests.
5. Clients can store the recent history of suggestions locally. Recent history has a very
high rate of being reused.
6. Establishing an early connection with the server turns out to be one of the most important factors. As soon as the user opens the search engine website, the client can open a connection with the server, so that when the user types in the first character, the client doesn’t waste time establishing the connection.
7. The server can push some part of its cache to CDNs and Internet Service Providers (ISPs) for efficiency.
11. Personalization
Users will receive some typeahead suggestions based on their historical searches, location,
language, etc. We can store the personal history of each user separately on the server and
cache them on the client too. The server can add these personalized terms in the final set,
before sending it to the user. Personalized searches should always come before others.
Let’s design an API Rate Limiter which will throttle users based upon the number of requests they are sending. Difficulty Level: Medium
Imagine we have a service which is receiving a huge number of requests but can only serve a limited number of requests per second. To handle this problem, we need some kind of throttling or rate limiting mechanism that will allow only the number of requests our service can respond to. A rate limiter, at a high level, limits the number of events an entity (user, device, IP, etc.) can perform in a particular time window.
In general, a rate limiter caps how many requests a sender can issue in a specific time
window. It then blocks requests once the cap is reached.
2. Why do we need API rate limiting?
Rate Limiting helps to protect services against abusive behaviors targeting the application layer, like Denial-of-Service (DoS) attacks, brute-force password attempts, brute-force credit card transactions, etc. These attacks are usually a barrage of HTTP/S requests which may look like they are coming from real users but are typically generated by machines (or bots). As a result, these attacks are often harder to detect and can more easily bring down a service, application, or API.
Rate limiting is also used to prevent revenue loss, to reduce infrastructure costs, to stop
spam and online harassment. Following is a list of scenarios that can benefit from Rate
limiting by making a service (or API) more reliable:
Functional Requirements:
1. Limit the number of requests an entity can send to an API within a time window,
e.g., 15 requests per second.
2. The APIs are accessible through a cluster, so the rate limit should be considered
across different servers. The user should get an error message whenever the defined
threshold is crossed within a single server or across a combination of servers.
Non-Functional Requirements:
1. The system should be highly available. The rate limiter should always work since it
protects our service from external attacks.
2. Our rate limiter should not introduce substantial latencies affecting the user
experience.
Rate Limiting is a process used to define the rate at which consumers can access APIs. Throttling is the process of controlling the usage of the APIs by customers during a given period. Throttling can be defined at the application level and/or API level. When a throttle limit is crossed, the server returns HTTP status 429 (“Too many requests”) to the user.
Here are the three common throttling types used by different services:
Hard Throttling: The number of API requests cannot exceed the throttle limit.
Soft Throttling: In this type, we can set the API request limit to exceed the threshold by a certain percentage. For example, if we have a rate limit of 100 messages a minute and a 10% exceed-limit, our rate limiter will allow up to 110 messages per minute.
Elastic or Dynamic Throttling: Under Elastic throttling, the number of requests can go beyond the threshold if the system has some resources available. For example, if a user is allowed only 100 messages a minute, we can let the user send more than 100 messages a minute when there are free resources available in the system.
Following are the two types of algorithms used for Rate Limiting:
Fixed Window Algorithm: In this algorithm, the time window is considered from the start of the time-unit to the end of the time-unit. For example, a period would be considered 0-60 seconds of a minute, irrespective of the point within the minute at which the API request was made. In the diagram below, there are two messages between 0-1 second and three messages between 1-2 seconds. With a rate limit of two messages a second, this algorithm will throttle only ‘m5’.
[Figure: five messages (m1-m5) arriving over a two-second window]
Rolling Window Algorithm: In this algorithm, the time window starts at the fraction of time at which the request is made and extends for the window length. For example, if two messages are sent at the 300th and 400th millisecond of a second, we count both of them in the window from the 300th millisecond of that second up to the 300th millisecond of the next second. In the above diagram, with a limit of two messages a second, we would throttle ‘m3’ and ‘m4’.
Rate Limiter will be responsible for deciding which request will be served by the API servers
and which request will be declined. Once a new request arrives, Web Server first asks the
Rate Limiter to decide if it will be served or throttled. If the request is not throttled, then it’ll
be passed to the API servers.
Let’s take the example where we want to limit the number of requests per user. In this scenario, for each unique user we would keep a count representing how many requests the user has made and a timestamp when we started counting the requests. We can keep these in a hashtable, where the ‘key’ would be the ‘UserID’ and the ‘value’ would be a structure containing an integer for the ‘Count’ and an integer for the epoch time:
Let’s assume our rate limiter is allowing three requests per minute per user, so whenever a new request comes in, our rate limiter will perform the following steps (a minimal sketch follows the list):
1. If the ‘UserID’ is not present in the hash-table, insert it, set the ‘Count’ to 1, set the ‘StartTime’ to the current time (normalized to a minute), and allow the request.
2. Otherwise, find the record of the ‘UserID’, and if ‘CurrentTime - StartTime >= 1 min’, set the ‘StartTime’ to the current time, set the ‘Count’ to 1, and allow the request.
3. If ‘CurrentTime - StartTime <= 1 min’:
If ‘Count < 3’, increment the Count and allow the request.
If ‘Count >= 3’, reject the request.
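A minimal in-process sketch of these steps in Python; the dictionary mirrors the hash-table described above, and constants such as the window length are assumptions for the example.

import time

RATE_LIMIT = 3                    # requests
WINDOW = 60                       # seconds (one minute)
buckets = {}                      # user_id -> {"start": window_start, "count": n}

def allow_request(user_id, now=None):
    now = time.time() if now is None else now
    window_start = int(now // WINDOW) * WINDOW     # normalize to the minute
    record = buckets.get(user_id)
    if record is None or now - record["start"] >= WINDOW:
        buckets[user_id] = {"start": window_start, "count": 1}
        return True
    if record["count"] < RATE_LIMIT:
        record["count"] += 1
        return True
    return False                                   # over the limit: throttle

print([allow_request("kristie", now=100 + i) for i in range(5)])
# [True, True, True, False, False]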
1. This is a Fixed Window algorithm, since we’re resetting the ‘StartTime’ at the end of every minute, which means it can potentially allow twice the number of requests per minute. Imagine if Kristie sends three requests in the last second of a minute; she can then immediately send three more requests in the very first second of the next minute, resulting in 6 requests in the span of two seconds. The solution to this problem would be a sliding window algorithm, which we’ll discuss later.
2. Atomicity: In a distributed environment, the “read-and-then-write” behavior can create a race condition; two requests from the same user arriving at nearly the same time on different servers may both read the old count and both be allowed, exceeding the limit.
If we are using Redis to store our key-value data, one solution to resolve the atomicity problem is to use a Redis lock for the duration of the read-update operation. This, however, would come at the expense of slowing down concurrent requests from the same user and introducing another layer of complexity. We could use Memcached instead, but it would have comparable complications.
If we are using a simple hash-table, we can have a custom implementation for ‘locking’ each
record to solve our atomicity problems.
How much memory would we need to store all of the user data? Let’s assume the simple
solution where we are keeping all of the data in a hash-table.
Let’s assume ‘UserID’ takes 8 bytes. Let’s also assume a 2 byte ‘Count’, which can count up
to 65k, is sufficient for our use case. Although epoch time will need 4 bytes, we can choose
to store only the minute and second part, which can fit into 2 bytes. Hence, we need total
12 bytes to store a user’s data:
8 + 2 + 2 = 12 bytes
Let’s assume our hash-table has an overhead of 20 bytes for each record. If we need to track one million users at any time, the total memory we would need would be 32MB:
(12 + 20) bytes * 1 million => 32MB
If we assume that we would need a 4-byte number to lock each user’s record to resolve our
atomicity problems, we would require a total 36MB memory.
This can easily fit on a single server; however, we would not like to route all of our traffic through a single machine. Also, if we assume a rate limit of 10 requests per second per user, this would translate into 10 million QPS for our rate limiter! This would be too much for a single server. Practically, we can assume we would use a Redis- or Memcached-like solution in a distributed setup. We’ll store all the data in remote Redis servers, and all the Rate Limiter servers will read (and update) these servers before serving or throttling any request.
We can maintain a sliding window if we keep track of each request per user. We can store the timestamp of each request in a Redis Sorted Set in the ‘value’ field of our hash-table. Let’s assume our rate limiter allows three requests per minute per user, so whenever a new request comes in, the Rate Limiter will perform the following steps (a minimal sketch follows the list):
1. Remove all the timestamps from the Sorted Set that are older than “CurrentTime - 1 minute”.
2. Count the total number of elements in the sorted set. Reject the request if this count is greater than our throttling limit of “3”.
3. Insert the current time in the sorted set and accept the request.
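A minimal sketch of these steps, assuming the redis-py client and a locally running Redis instance; the key naming is an assumption for the example.

import time
import redis

r = redis.Redis()                     # assumes Redis is reachable on localhost
RATE_LIMIT = 3
WINDOW = 60                           # seconds

def allow_request(user_id):
    key = f"rl:{user_id}"
    now = time.time()
    # 1. Drop timestamps that fell out of the sliding window.
    r.zremrangebyscore(key, 0, now - WINDOW)
    # 2. Reject if the user already has RATE_LIMIT requests in the window.
    if r.zcard(key) >= RATE_LIMIT:
        return False
    # 3. Record this request and accept it.
    r.zadd(key, {str(now): now})
    r.expire(key, WINDOW)             # let idle users' keys age out
    return True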
How much memory would we need to store all of the user data for the sliding window? Let’s assume ‘UserID’ takes 8 bytes and each epoch time requires 4 bytes. Let’s suppose we need rate limiting of 500 requests per hour. Let’s assume 20 bytes of overhead for the hash-table and 20 bytes of overhead for the Sorted Set. At max, we would need a total of about 12KB to store one user’s data:
8 + (4 + 20 (sorted set overhead)) * 500 + 20 (hash-table overhead) = 12KB
Here we are reserving 20 bytes of overhead per element. In a sorted set, we can assume that we need at least two pointers to maintain order among elements: one pointer to the previous element and one to the next element. On a 64-bit machine, each pointer will cost 8 bytes, so we will need 16 bytes for pointers. We added an extra word (4 bytes) for storing other overhead.
If we need to track one million users at any time, total memory we would need would be
12GB:
12KB * 1 million ~= 12GB
Sliding Window Algorithm is taking a lot of memory compared to the Fixed Window; this
would be a scalability issue. What if we can combine the above two algorithms to optimize
our memory usage?
Sliding Window with Counters: What if we keep track of request counts for each user using multiple fixed time windows, e.g., windows 1/60th the size of our rate limit’s time window? For example, if we have an hourly rate limit, we can keep a count for each minute and, when we receive a new request, calculate the sum of all counters in the past hour to check the throttling limit. This would reduce our memory footprint. Let’s take an example where we rate-limit at 500 requests per hour with an additional limit of 10 requests per minute. This means that when the sum of the counters with timestamps in the past hour exceeds the request threshold (500), Kristie has exceeded the rate limit. In addition to that, she can’t send more than ten requests per minute. This would be a reasonable and practical consideration, as few real users would send such frequent requests; even if they do, they will see success with retries since their limits get reset every minute.
We can store our counters in a Redis Hash, since it offers extremely efficient storage for fewer than 100 keys. When each request increments a counter in the hash, it also sets the hash to expire an hour later. We will normalize each ‘time’ to a minute.
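A minimal sketch of this counter scheme with the redis-py client; the key naming and the one-hour window length are assumptions for the example.

import time
import redis

r = redis.Redis()
RATE_LIMIT = 500                      # requests per hour
WINDOW = 3600                         # seconds

def allow_request(user_id):
    key = f"rlc:{user_id}"
    now = int(time.time())
    minute = now - now % 60           # normalize the timestamp to its minute
    counters = r.hgetall(key)         # {minute: count}, both stored as bytes
    total = sum(int(c) for m, c in counters.items() if int(m) > now - WINDOW)
    if total >= RATE_LIMIT:
        return False
    r.hincrby(key, minute, 1)         # bump this minute's counter
    r.expire(key, WINDOW)             # the whole hash ages out an hour later
    return True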
How much memory would we need to store all the user data for sliding window with counters? Let’s assume ‘UserID’ takes 8 bytes, each epoch time needs 4 bytes, and the Counter needs 2 bytes. Let’s suppose we need rate limiting of 500 requests per hour. Assume 20 bytes of overhead for the hash-table and 20 bytes for the Redis hash. Since we’ll keep a count for each minute, at max we would need 60 entries for each user. We would need a total of about 1.6KB to store one user’s data:
8 + (4 + 2 + 20 (Redis hash overhead)) * 60 + 20 (hash-table overhead) = 1.6KB
If we need to track one million users at any time, the total memory we would need would be 1.6GB:
1.6KB * 1 million ~= 1.6GB
So, our ‘Sliding Window with Counters’ algorithm uses 86% less memory than the simple sliding window algorithm.
11. Data Sharding and Caching
We can shard based on the ‘UserID’ to distribute the user’s data. For fault tolerance and replication we should use Consistent Hashing. If we want to have different throttling limits for different APIs, we can choose to shard per user per API. Take the example of URL Shortener: we can have a different rate limiter for the createURL() and deleteURL() APIs for each user or IP.
If our APIs are partitioned, a practical consideration could be to have a separate (somewhat
smaller) rate limiter for each API shard as well. Let’s take the example of our URL Shortener,
where we want to limit each user to not create more than 100 short URLs per hour.
Assuming we are using Hash-Based Partitioning for our createURL() API, we can rate limit each partition to allow a user to create no more than three short URLs per minute, in addition to 100 short URLs per hour.
Our system can get huge benefits from caching recently active users. Application servers can quickly check if the cache has the desired record before hitting backend servers. Our rate limiter can greatly benefit from a Write-back cache, updating all counters and timestamps in the cache only; the write to permanent storage can be done at fixed intervals. This way we can ensure minimum latency is added to the user’s requests by the rate limiter. Reads can always hit the cache first, which will be extremely useful once the user has hit their maximum limit and the rate limiter is only reading data without any updates.
Least Recently Used (LRU) can be a reasonable cache eviction policy for our system.
Should we rate limit by IP or by user? Let’s discuss the pros and cons of using each of these schemes:
IP: In this scheme, we throttle requests per IP. Although it’s not optimal in terms of differentiating between ‘good’ and ‘bad’ actors, it’s still better than not having rate limiting at all. The biggest problem with IP-based throttling is when multiple users share a single public IP, like in an internet cafe or smartphone users using the same gateway: one bad user can cause throttling for the others. Another issue can arise while caching IP-based limits: since a huge number of IPv6 addresses are available to a hacker from even one computer, it’s trivial to make a server run out of memory tracking IPv6 addresses!
User: Rate limiting can be done on APIs after user authentication. Once authenticated, the user will be provided with a token which the user will pass with each request. This will ensure that we rate limit against a particular API that has a valid authentication token. But what if we have to rate limit on the login API itself? The weakness of this rate limiting would be that a hacker can perform a denial of service attack against a user by entering wrong credentials up to the limit; after that, the actual user will not be able to log in.
Twitter is one of the largest social networking services where users can share photos, news, and text-based messages. In this chapter, we will design a service that can store and search user tweets. Similar Problems: Tweet search. Difficulty Level: Medium
Twitter users can update their status whenever they like. Each status consists of plain text,
and our goal is to design a system that allows searching over all the user statuses.
Let’s assume Twitter has 1.5 billion total users with 800 million daily active users.
On the average Twitter gets 400 million status updates every day.
Average size of a status is 300 bytes.
Let’s assume there will be 500M searches every day.
The search query will consist of multiple words combined with AND/OR.
We need to design a system that can efficiently store and query user statuses.
Storage Capacity: Since we have 400 million new statuses every day and each status on average is 300 bytes, the total storage we need will be:
400M * 300 bytes => 120GB/day
120 GB / 24 hours / 3600 sec ~= 1.38MB/second
4. System APIs
We can have SOAP or REST APIs to expose functionality of our service; following could be
the definition of search API:
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
search_terms (string): A string containing the search terms.
maximum_results_to_return (number): Number of status messages to return.
sort (number): Optional sort mode: Latest first (0 - default), Best matched (1), Most liked (2).
page_token (string): This token will specify a page in the result set that should be returned.
Returns: (JSON)
A JSON containing information about a list of status messages matching the search query.
Each result entry can have the user ID & name, status text, status ID, creation time, number
of likes, etc.
At a high level, we need to store all the statuses in a database and also build an index that keeps track of which word appears in which status. This index will help us quickly find the statuses that users are trying to search for.
1. Storage: We need to store 120GB of new data every day. Given this huge amount of data, we need to come up with a data partitioning scheme that will efficiently distribute it onto multiple servers. If we plan for the next five years, we will need the following storage:
120 GB * 365 days * 5 years ~= 200 TB
If we never want to be more than 80% full at any time, we will approximately need 250TB of
total storage. Let’s assume that we want to keep an extra copy of all the statuses for fault
tolerance; then our total storage requirement will be 500TB. If we assume a modern server
can store up to 4TB of data, then we would need 125 such servers to hold all of the required
data for the next five years.
Let’s start with a simplistic design where we store the statuses in a MySQL database. We can
assume to store the statuses in a table having two columns, StatusID and StatusText. Let’s
assume we partition our data based on StatusID. If our StatusIDs are system-wide unique,
we can define a hash function that can map a StatusID to a storage server, where we can
store that status object.
How can we create system-wide unique StatusIDs? If we are getting 400M new statuses each day, how many status objects can we expect in five years?
400M * 365 days * 5 years => ~730 billion
This means we would need a five-byte number to identify StatusIDs uniquely. Let’s assume we have a service that can generate a unique StatusID whenever we need to store an object (the StatusID could be similar to the TweetID discussed in Designing Twitter). We can feed the StatusID to our hash function to find the storage server and store our status object there.
2. Index: What should our index look like? Since our status queries will consist of words, let’s build an index that can tell us which word appears in which status object. Let’s first estimate how big our index will be. If we want to build an index for all the English words and some famous nouns like people names, city names, etc., and if we assume that we have around 300K English words and 200K nouns, then we will have 500K total words in our index. Let’s assume the average length of a word is five characters. If we are keeping our index in memory, we would need 2.5MB of memory to store all the words:
500K * 5 bytes => 2.5 MB
Let’s assume that we want to keep the index in memory only for status objects from the past two years. Since we will be getting 730B status objects in 5 years, this gives us about 292B status messages in two years. Given that each StatusID is 5 bytes, how much memory will we need to store all the StatusIDs?
292B * 5 bytes => 1460 GB
So our index would be like a big distributed hash table, where the ‘key’ is the word and the ‘value’ is a list of StatusIDs of all the status objects which contain that word. Assuming we have on average 40 words in each status, and since we will not be indexing prepositions and other small words like ‘the’, ‘an’, ‘and’, etc., let’s assume we will have around 15 words in each status that need to be indexed. This means each StatusID will be stored 15 times in our index. So the total memory we will need to store our index:
(1460 GB * 15) + 2.5 MB ~= 21 TB
Assuming a high-end server has 144GB of memory, we would need 152 such servers to hold
our index.
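For illustration, a tiny in-memory version of such an inverted index in Python; the stop-word list and tokenization are assumptions made for the example.

from collections import defaultdict

STOP_WORDS = {"the", "an", "and", "a", "of", "to", "in"}   # illustrative subset

index = defaultdict(list)            # word -> list of StatusIDs containing it

def index_status(status_id, text):
    for word in set(text.lower().split()):
        if word not in STOP_WORDS:
            index[word].append(status_id)

def search(word):
    return index.get(word.lower(), [])

index_status(101, "Designing a search service for tweets")
index_status(102, "Search and ranking in distributed systems")
print(search("search"))              # [101, 102]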
Sharding based on Words: While building our index, we will iterate through all the words of a status and calculate the hash of each word to find the server where it will be indexed. To find all statuses containing a specific word, we only have to query the server which holds that word. This approach has a couple of issues: if a word becomes hot, there will be a lot of queries on the server holding it, creating a performance bottleneck; also, over time some words can end up storing a lot of StatusIDs compared to others, so maintaining a uniform distribution of words is tricky.
To recover from these situations, either we have to repartition our data or use Consistent Hashing.
Sharding based on the status object: While storing, we will pass the StatusID to our hash
function to find the server and index all the words of the status on that server. While
querying for a particular word, we have to query all the servers, and each server will return
a set of StatusIDs. A centralized server will aggregate these results to return them to the
user.
7. Fault Tolerance
What will happen when an index server dies? We can have a secondary replica of each server, and if the primary server dies, the secondary can take control after failover. Both primary and secondary servers will have the same copy of the index.
What if both primary and secondary servers die at the same time? We have to allocate a new server and rebuild the same index on it. How can we do that? We don’t know what words/statuses were kept on this server. If we were using ‘Sharding based on the status object’, the brute-force solution would be to iterate through the whole database and filter StatusIDs using our hash function to figure out all the required statuses that should be stored on this server. This would be inefficient; also, while the server is being rebuilt, we would not be able to serve any queries from it, thus missing some statuses that should have been seen by the user.
How can we efficiently retrieve a mapping between statuses and index servers? We can build a reverse index that maps each StatusID to its index server. Our Index-Builder server can hold this information. We will need to build a hashtable where the ‘key’ is the index server number and the ‘value’ is a HashSet containing all the StatusIDs kept at that index server. Notice that we are keeping the StatusIDs in a HashSet; this enables us to add/remove statuses from our index quickly. So now, whenever an index server has to rebuild itself, it can simply ask the Index-Builder server for all the statuses it needs to store and then fetch those statuses to build the index. This approach would be quite fast. We should also have a replica of the Index-Builder server for fault tolerance.
8. Cache
To deal with hot status objects, we can introduce a cache in front of our database. We can
use Memcache , which can store all such hot status objects in memory. Application servers
before hitting backend database can quickly check if the cache has that status object. Based
on clients’ usage pattern we can adjust how many cache servers we need. For cache eviction
policy, Least Recently Used (LRU) seems suitable for our system.
9. Load Balancing
We can add a load balancing layer at two places in our system: 1) between Clients and Application servers and 2) between Application servers and Backend servers. Initially, a simple Round Robin approach can be adopted that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of the rotation and stop sending any traffic to it. A problem with Round Robin LB is that it does not take server load into consideration: if a server is overloaded or slow, the LB will not stop sending new requests to it. To handle this, a more intelligent LB solution can be placed that periodically queries backend servers about their load and adjusts traffic based on that.
10. Ranking
What if we want to rank the search results by social graph distance, popularity, relevance, etc.?
Let’s assume we want to rank statuses by popularity, e.g., how many likes or comments a status is getting. In that case, our ranking algorithm can calculate a ‘popularity number’ (based on the number of likes, etc.) and store it with the index. Each partition can sort its results based on this popularity number before returning them to the aggregator server. The aggregator server combines all these results, sorts them based on the popularity number, and sends the top results to the user.
A web crawler is a software program which browses the World Wide Web in a methodical and automated manner. It collects documents by recursively fetching links from a set of starting pages. Many sites, particularly search engines, use web crawling as a means of providing up-to-date data. Search engines download all the pages to create an index on them in order to perform faster searches. Crawlers are also used for a variety of other purposes:
To test web pages and links for valid syntax and structure.
To monitor sites to see when their structure or contents change.
To maintain mirror sites for popular Web sites.
To search for copyright infringements.
To build a special-purpose index, e.g., one that has some understanding of the content
stored in multimedia files on the Web.
Scalability: Our service needs to be scalable such that it can crawl the entire Web, and can
be used to fetch hundreds of millions of Web documents.
Extensibility: Our service should be designed in a modular way, with the expectation that
new functionality will be added to it. There could be newer document types that needs to
be downloaded and processed in the future.
Crawling the web is a complex task, and there are many ways to go about it. We should be
asking a few questions before going any further:
Is it a crawler for HTML pages only? Or should we fetch and store other types of media,
such as sound files, images, videos, etc.? This is important because the answer can change
the design. If we are writing a general-purpose crawler to download different media types,
we might want to break down the parsing module into different sets of modules: one for
HTML, another for images, another for videos, where each module extracts what is
considered interesting for that media type.
Let’s assume for now that our crawler is going to deal with HTML only, but it should be
extensible and make it easy to add support for new media types.
What protocols are we looking at? HTTP? What about FTP links? What different protocols
should our crawler handle? For the sake of the exercise, we will assume HTTP. Again, it
shouldn’t be hard to extend the design to use FTP and other protocols later.
What is the expected number of pages we will crawl? How big will the URL database
become? Assuming we need to crawl one billion websites. Since a website can contain
many, many URLs, let’s assume an upper bound of 15 billion different web pages that will be
reached by our crawler.
What is ‘RobotsExclusion’ and how should we deal with it? Courteous Web crawlers implement the Robots Exclusion Protocol, which allows Webmasters to declare parts of their sites off-limits to crawlers. The Robots Exclusion Protocol requires a Web crawler to fetch a special document called robots.txt, containing these declarations, from a Web site before downloading any real content from it.
If we want to crawl 15 billion pages within four weeks, how many pages do we need to fetch per second?
15B / (4 weeks * 7 days * 86400 sec) ~= 6200 pages/sec
What about storage? Page sizes vary a lot, but as mentioned above, since we will be dealing with HTML text only, let’s assume an average page size of 100KB. With each page, if we are storing 500 bytes of metadata, the total storage we would need:
15B * (100KB + 500 bytes) ~= 1.5 petabytes
Assuming a 70% capacity model (we don’t want to go above 70% of the total capacity of our storage system), the total storage we will need:
1.5 petabytes / 0.7 ~= 2.14 petabytes
The basic algorithm executed by any Web crawler is to take a list of seed URLs as its input and repeatedly execute the following steps: pick a URL from the unvisited URL list, download the corresponding document, parse it to extract new URLs, and add the new URLs that have not been seen before back to the list of unvisited URLs (a minimal sketch of this loop follows the next paragraph).
Breadth first or depth first? Breadth-first search (BFS) is usually used. However, Depth First Search (DFS) is also utilized in some situations; for example, if the crawler has already established a connection with a website, it might just DFS all the URLs within that website to save some handshaking overhead.
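A minimal single-threaded sketch of this BFS loop in Python; the fetch and link-extraction functions are left as stubs, and the 'seen' set stands in for the URL-seen test and duplicate elimination described later.

from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    # Breadth-first crawl: the deque acts as the URL frontier.
    frontier = deque(seed_urls)
    seen = set(seed_urls)
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()          # FIFO order gives breadth-first traversal
        document = fetch(url)             # stub: download the page (HTTP fetcher)
        if document is None:
            continue
        crawled += 1
        for link in extract_links(document, base_url=url):
            if link not in seen:          # URL-seen test before adding to the frontier
                seen.add(link)
                frontier.append(link)
    return crawled

# Example with stubbed fetch/extract functions:
pages = {"http://a": ["http://b", "http://c"], "http://b": [], "http://c": ["http://a"]}
print(crawl(["http://a"], fetch=lambda u: u if u in pages else None,
            extract_links=lambda doc, base_url: pages[doc]))   # 3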
There are two important characteristics of the Web that make web crawling a very difficult task:
1. Large volume of Web pages: A large volume of web pages implies that the web crawler can only download a fraction of the web pages at any time; hence it is critical that the web crawler is intelligent enough to prioritize downloads.
2. Rate of change of web pages: Web pages on the internet change very frequently. As a result, by the time the crawler is downloading the last page from a site, earlier pages may have changed, or new pages may have been added to the site.
1. URL frontier: To store the list of URLs to download and also prioritize which URLs should
be crawled first.
2. HTTP Fetcher: To retrieve a web page from the server.
3. Extractor: To extract links from HTML documents.
4. Duplicate Eliminator: To make sure the same content is not extracted twice unintentionally.
5. Datastore: To store retrieved pages, URLs, and other metadata.
Let’s assume our crawler is running on one server, and all the crawling is done by multiple
working threads, where each working thread performs all the steps needed to download
and process a document in a loop.
The first step of this loop is to remove an absolute URL from the shared URL frontier for
downloading. An absolute URL begins with a scheme (e.g., “HTTP”), which identifies the
network protocol that should be used to download it. We can implement these protocols in
a modular way for extensibility, so that later if our crawler needs to support more protocols,
it can be easily done. Based on the URL’s scheme, the worker calls the appropriate protocol
module to download the document. After downloading, the document is placed into a
Document Input Stream (DIS). Putting documents into DIS will enable other modules to re-
read the document multiple times.
Once the document has been written to the DIS, the worker thread invokes the dedupe test
to determine whether this document (associated with a different URL) has been seen
before. If so, the document is not processed any further, and the worker thread removes
the next URL from the frontier.
Next, our crawler needs to process the downloaded document. Each document can have a
different MIME type like HTML page, Image, Video, etc. We can implement these MIME
schemes in a modular way so that later if our crawler needs to support more types, we can
easily implement them. Based on the downloaded document’s MIME type, the worker
invokes the process method of each processing module associated with that MIME type.
Furthermore, our HTML processing module will extract all links from the page. Each link is
converted into an absolute URL and tested against a user-supplied URL filter to determine if
it should be downloaded. If the URL passes the filter, the worker performs the URL-seen
test, which checks if the URL has been seen before, namely, if it is in the URL frontier or has
already been downloaded. If the URL is new, it is added to the frontier.
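A minimal single-server sketch of this worker loop in Python (the helper objects frontier, fetchers, dedupe_store, url_store, url_filter, and processors are hypothetical stand-ins for the components described above, not a real library):

def crawl_worker(frontier, fetchers, dedupe_store, url_store, url_filter, processors):
    """One worker's loop: fetch a URL, dedupe, process, and enqueue newly found URLs."""
    while True:
        url = frontier.pop()                     # absolute URL, e.g. "http://example.com/a"
        scheme = url.split(":", 1)[0].lower()    # choose the protocol module by scheme
        fetcher = fetchers.get(scheme)
        if fetcher is None:
            continue                             # unsupported protocol, skip this URL
        document = fetcher.download(url)         # returns the Document Input Stream (DIS)
        if document is None:
            continue                             # fetch failed
        if dedupe_store.contains(document.checksum()):
            continue                             # same content already seen under another URL
        dedupe_store.add(document.checksum())
        for processor in processors.for_mime_type(document.mime_type()):
            processor.process(document)          # e.g. the HTML module extracts links
        for link in document.extracted_links():
            if url_filter.allows(link) and not url_store.seen(link):
                url_store.add(link)              # URL-seen test passed
                frontier.push(link)              # schedule the new URL for crawling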
Let’s discuss these components one by one, and see how they can be distributed onto
multiple machines:
1. The URL frontier: The URL frontier is the data structure that contains all the URLs that
remain to be downloaded. We can crawl by performing a breadth-first traversal of the Web,
starting from the pages in the seed set. Such traversals are easily implemented by using a
FIFO queue.
Since we’ll be having a huge list of URLs to crawl, we can distribute our URL frontier into
multiple servers. Let’s assume on each server we have multiple worker threads performing
the crawling tasks. Let’s also assume that our hash function maps each URL to a server
which will be responsible for crawling it.
The following politeness requirements must be kept in mind while designing a distributed
URL frontier:
1. Our crawler should not overload a server by downloading a lot of pages from it.
2. We should not have multiple machines connecting to the same web server at the same time.
To implement this politeness constraint, our crawler can have a collection of distinct FIFO
sub-queues on each server. Each worker thread will have its separate sub-queue, from
which it removes URLs for crawling. When a new URL needs to be added, the FIFO sub-
queue in which it is placed will be determined by the URL’s canonical hostname. Our hash
function can map each hostname to a thread number. Together, these two points imply that
at most one worker thread will download documents from a given Web server and also by
using FIFO queue it’ll not overload a Web server.
How big will our URL frontier be? The size would be in the hundreds of millions of URLs.
Hence, we need to store our URLs on disk. We can implement our queues in such a way that
they have separate buffers for enqueuing and dequeuing. The enqueue buffer, once filled,
will be dumped to disk, whereas the dequeue buffer will keep a cache of URLs that need to
be visited; it can periodically read from disk to refill itself.
2. The fetcher module: The purpose of the fetcher module is to download the document
corresponding to a given URL using the appropriate network protocol, like HTTP. As
discussed above, webmasters create robots.txt to make certain parts of their websites off
limits for the crawler. To avoid downloading this file on every request, our crawler's HTTP
protocol module can maintain a fixed-sized cache mapping host-names to their robots
exclusion rules.
3. Document input stream: Our crawler’s design enables the same document to be
processed by multiple processing modules. To avoid downloading a document multiple
times, we cache the document locally using an abstraction called a Document Input Stream
(DIS).
A DIS is an input stream that caches the entire contents of the document read from the
internet. It also provides methods to re-read the document. The DIS can cache small
documents (64 KB or less) entirely in memory, while larger documents can be temporarily
written to a backing file.
Each worker thread has an associated DIS, which it reuses from document to document.
After extracting a URL from the frontier, the worker passes that URL to the relevant protocol
module, which initializes the DIS from a network connection to contain the document’s
contents. The worker then passes the DIS to all relevant processing modules.
4. Document Dedupe test: Many documents on the Web are available under multiple,
different URLs. There are also many cases in which documents are mirrored on various
servers. Both of these effects will cause any Web crawler to download the same document
contents multiple times. To prevent processing a document more than once, we perform a
dedupe test on each document to remove duplication.
To perform this test, we can calculate a 64-bit checksum of every processed document and
store it in a database. For every new document, we can compare its checksum to all the
previously calculated checksums to see if the document has been seen before. We can use
MD5 or SHA to calculate these checksums.
How big would the checksum store be? If the whole purpose of our checksum store is to do
dedupe, then we just need to keep a unique set containing the checksums of all previously
processed documents. Considering 15 billion distinct web pages, we would need:
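15B * 8 bytes => 120 GB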
5. URL filters: The URL filtering mechanism provides a customizable way to control the set of
URLs that are downloaded. This is used to blacklist websites so that our crawler can ignore
them. Before adding each URL to the frontier, the worker thread consults the user-supplied
URL filter. We can define filters to restrict URLs by domain, prefix, or protocol type.
6. Domain name resolution: Before contacting a Web server, a Web crawler must use the
Domain Name Service (DNS) to map the Web server's hostname to an IP address. DNS
name resolution will be a big bottleneck for our crawler given the number of URLs we will be
working with. To avoid repeated requests, we can start caching DNS results by building our
own local DNS server.
7. URL dedupe test: While extracting links, any Web crawler will encounter multiple links to
the same document. To avoid downloading and processing a document multiple times, a
URL dedupe test must be performed on each extracted link before adding it to the URL
frontier.
To perform the URL dedupe test, we can store all the URLs seen by our crawler in canonical
form in a database. To save space, we do not store the textual representation of each URL in
the URL set, but rather a fixed-sized checksum.
To reduce the number of operations on the database store, we can keep an in-memory
cache of popular URLs on each host shared by all threads. The reason to have this cache is
that links to some URLs are quite common, so caching the popular ones in memory will lead
to a high in-memory hit rate.
How much storage would we need for the URL store? If the whole purpose of our checksum
is to do URL dedupe, then we just need to keep a unique set containing the checksums of all
previously seen URLs. Considering 15 billion distinct URLs and 4 bytes per checksum, we
would need:
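15B * 4 bytes => 60 GB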
Can we use bloom filters for deduping? Bloom filters are a probabilistic data structure for
set membership testing that may yield false positives. A large bit vector represents the set.
An element is added to the set by computing ‘n’ hash functions of the element and setting
the corresponding bits. An element is deemed to be in the set if the bits at all ‘n’ of the
element’s hash locations are set. Hence, a document may incorrectly be deemed to be in
the set, but false negatives are not possible.
The disadvantage of using a bloom filter for the URL seen test is that each false positive will
cause the URL not to be added to the frontier, and therefore the document will never be
downloaded. The chance of a false positive can be reduced by making the bit vector larger.
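As a rough illustration, here is a minimal bloom filter sketch in Python (the bit-vector size and the number of hash functions are arbitrary example values, not recommendations):

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1_000_000, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive 'num_hashes' bit positions from different hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True may be a false positive; False is always correct (no false negatives).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Example: URL-seen test
seen = BloomFilter()
seen.add("http://example.com/page1")
print(seen.might_contain("http://example.com/page1"))  # True
print(seen.might_contain("http://example.com/other"))  # almost certainly False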
8. Checkpointing: A crawl of the entire Web takes weeks to complete. To guard against
failures, our crawler can write regular snapshots of its state to disk. An interrupted or
aborted crawl can easily be restarted from the latest checkpoint.
7. Fault tolerance
We should use consistent hashing for distribution among crawling servers. Consistent
hashing will not only help in replacing a dead host but also help in distributing load among
crawling servers.
All our crawling servers will be performing regular checkpointing and storing their FIFO
queues to disk. If a server goes down, we can replace it. Meanwhile, consistent hashing
should shift the load to other servers.
8. Data Partitioning
Our crawler will be dealing with three kinds of data: 1) URLs to visit 2) URL checksums for
dedupe 3) Document checksums for dedupe.
Since we are distributing URLs based on the hostnames, we can store these data on the
same host. So, each host will store its set of URLs that need to be visited, checksums of all
the previously visited URLs, and checksums of all the downloaded documents. Since we will
be using consistent hashing, we can assume that URLs will be redistributed from overloaded
hosts.
Each host will perform checkpointing periodically and dump a snapshot of all the data it is
holding into a remote server. This will ensure that if a server dies, another server can
replace it by taking its data from the last snapshot.
9. Crawler Traps
There are many crawler traps, spam sites, and cloaked content. A crawler trap is a URL or
set of URLs that cause a crawler to crawl indefinitely. Some crawler traps are unintentional.
For example, a symbolic link within a file system can create a cycle. Other crawler traps are
introduced intentionally. For example, people have written traps that dynamically generate
an infinite Web of documents. The motivations behind such traps vary. Anti-spam traps are
designed to catch crawlers used by spammers looking for email addresses, while other sites
use traps to catch search engine crawlers to boost their search ratings.
The AOPIC algorithm (Adaptive Online Page Importance Computation) can help mitigate
common types of bot traps. AOPIC solves this problem by using a credit system: every page
starts with the same amount of credit, and the crawler prefers to crawl the pages with the
most credit. When a page is crawled, a small "tax" is taken from its credit and given to a
virtual "Lambda" page that no real page links to, and the rest of its credit is distributed
equally among the pages it links to.
Since the Lambda page continuously collects the tax, eventually it will be the page with the
largest amount of credit, and we’ll have to “crawl” it. By crawling the Lambda page, we just
take its credits and distribute them equally to all the pages in our database.
Since bot traps only give credit to their internal links and rarely receive credit from outside,
they will continually leak credit (through taxation) to the Lambda page. The Lambda page
will distribute that credit evenly to all the pages in the database, and upon each cycle the
bot-trap pages will lose more and more credit, until they have so little that they almost
never get crawled again. This will not happen with good pages, because they often get
credit from backlinks found on other pages.
Let’s design Facebook’s Newsfeed, which would contain posts, photos, videos and status
updates from all the people and pages a user follows. Similar Services: Twitter Newsfeed,
Instagram Newsfeed, Quora Newsfeed Difficulty Level: Hard
Newsfeed is the constantly updating list of stories in the middle of Facebook’s homepage. It
includes status updates, photos, videos, links, app activity and ‘likes’ from people, pages,
and groups that a user follows on Facebook. In other words, it is a compilation of a complete
scrollable version of your and your friends’ life story from photos, videos, locations, status
updates and other activities.
For any social media site you design, whether Twitter, Instagram, or Facebook, you will need
some kind of newsfeed system to display updates from friends and followers.
Functional requirements:
1. Newsfeed will be generated based on the posts from the people, pages, and groups
that a user follows.
2. A user may have many friends and follow a large number of pages/groups.
3. Feeds may contain images, videos or just text.
4. Our service should support appending new posts, as they arrive, to the newsfeed for
all active users.
Non-functional requirements:
1. Our system should be able to generate any user’s newsfeed in real-time - maximum
latency seen by the end user could be 2s.
2. A post shouldn’t take more than 5s to make it to a user’s feed assuming a new
newsfeed request comes in.
Let’s assume on average a user has 300 friends and follows 200 pages.
Traffic estimates: Let’s assume 300M daily active users, with each user fetching their
timeline an average of five times a day. This will result in 1.5B newsfeed requests per day or
approximately 17,500 requests per second.
Storage estimates: On average, let’s assume, we would need to have around 500 posts in
every user’s feed that we want to keep in memory for a quick fetch. Let’s also assume that
on average each post would be 1KB in size. This would mean that we need to store roughly
500KB of data per user. To store all this data for all the active users, we would need 150TB
of memory. If a server can hold 100GB, we would need around 1500 machines to keep the
top 500 posts in memory for all active users.
4. System APIs
Once we have finalized the requirements, it’s always a good idea to define the
system APIs. This should explicitly state what is expected from the system.
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definition of the API for getting the newsfeed:
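For example (a possible signature assembled from the parameters below):
getUserFeed(api_dev_key, user_id, since_id, count, max_id, exclude_replies)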
Parameters:
api_dev_key (string): The API developer key of a registered account. This can be used to,
among other things, throttle users based on their allocated quota.
user_id (number): The ID of the user for whom the system will generate the newsfeed.
since_id (number): Optional; returns results with an ID higher than (that is, more recent
than) the specified ID.
count (number): Optional; specifies the number of feed items to try and retrieve, up to a
maximum of 200 per distinct request.
max_id (number): Optional; returns results with an ID less than (that is, older than) or equal
to the specified ID.
exclude_replies(boolean): Optional; this parameter will prevent replies from appearing in
the returned timeline.
Returns: (JSON) Returns a JSON object containing a list of feed items.
5. Database Design
There are three primary objects: User, Entity (e.g., page, group, etc.) and FeedItem (or Post).
Here are some observations about the relationships between these entities:
A User can follow other entities and can become friends with other users.
Both users and entities can post FeedItems which can contain text, images or videos.
Each FeedItem will have a UserID which would point to the User who created it. For
simplicity, let's assume that only users can create feed items, although, on Facebook, Pages
can post feed items too.
Each FeedItem can optionally have an EntityID pointing to the page or the group where that
post was created.
If we are using a relational database, we would need to model two relations: User-Entity
relation and FeedItem-Media relation. Since each user can be friends with many people and
follow a lot of entities, we can store this relation in a separate table. The “Type” column in
“UserFollow” identifies if the entity being followed is a User or Entity. Similarly, we can have
a table for FeedMedia relation.
6. High Level System Design
Feed generation: Newsfeed is generated from the posts (or feed items) of users and entities
(pages and groups) that a user follows. So, whenever our system receives a request to
generate the feed for a user (say Jane), we will perform the following steps:
1. Retrieve the IDs of all the users and entities that Jane follows.
2. Retrieve the latest, most popular, and relevant posts for those IDs. These are the
potential posts that we can show in Jane's newsfeed.
3. Rank these posts based on relevance to Jane. This represents Jane's current feed.
4. Store this feed in the cache and return the top posts (say 20) to be rendered on Jane's
feed.
One thing to notice here is that we generated the feed once and stored it in the cache. What
about new incoming posts from people that Jane follows? If Jane is online, we should have a
mechanism to rank and add those new posts to her feed. We can periodically (say every five
minutes) perform the above steps to rank and add the newer posts to her feed. Jane can
then be notified that there are newer items in her feed that she can fetch.
Feed publishing: Whenever Jane loads her newsfeed page, she has to request and pull feed
items from the server. When she reaches the end of her current feed, she can pull more
data from the server. For newer items either the server can notify Jane and then she can
pull, or the server can push these new posts. We will discuss these options in detail later.
1. Web servers: To maintain a connection with the user. This connection will be used to
transfer data between the user and the server.
2. Application server: To execute the workflows of storing new posts in the database
servers. We will also need some application servers to retrieve and push the
newsfeed to the end user.
3. Metadata database and cache: To store the metadata about Users, Pages and
Groups.
4. Posts database and cache: To store metadata about posts and their contents.
5. Video and photo storage, and cache: Blob storage, to store all the media included in
the posts.
6. Newsfeed generation service: To gather and rank all the relevant posts for a user to
generate newsfeed and store in the cache. This service will also receive live updates
and will add these newer feed items to any user’s timeline.
7. Feed notification service: To notify the user that there are newer items available for
their newsfeed.
Following is the high-level architecture diagram of our system. User B and C are following
User A.
a. Feed generation
Let’s take the simple case of the newsfeed generation service fetching most recent posts
from all the users and entities that Jane follows; the query would look like this:
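One way such a query might look, assuming the UserFollow table described above has the columns (UserID, EntityOrFriendID, Type); the column names here are illustrative:

SELECT FeedItemID FROM FeedItem WHERE UserID in (
    SELECT EntityOrFriendID FROM UserFollow WHERE UserID = <current_user_id>
)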
Here are issues with this design for the feed generation service:
Offline generation for newsfeed: We can have dedicated servers that are continuously
generating users’ newsfeed and storing them in memory. So, whenever a user requests for
the new posts for their feed, we can simply serve it from the pre-generated, stored location.
Using this scheme, a user's newsfeed is not compiled on load, but rather on a regular basis,
and is returned to users whenever they request it.
Whenever these servers need to generate the feed for a user, they will first query to see
what was the last time the feed was generated for that user. Then, new feed data would be
generated from that time onwards. We can store this data in a hash table, where the “key”
would be UserID and “value” would be a STRUCT like this:
Struct {
    LinkedHashMap<FeedItemID> feedItems;
    DateTime lastGenerated;
}
We can store FeedItemIDs in a data structure similar to a Linked HashMap, which will allow
us to not only jump to any feed item but also iterate through the map easily. Whenever
users want to fetch more feed items, they can send the last FeedItemID they currently see
in their newsfeed; we can then jump to that FeedItemID in our linked hash map and return
the next batch/page of feed items from there.
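A small sketch of this idea in Python, using an insertion-ordered dict in place of a linked hash map (the class and method names are illustrative only):

from collections import OrderedDict
from datetime import datetime, timezone

class UserFeed:
    """In-memory, pre-generated feed for one user."""
    def __init__(self):
        self.feed_items = OrderedDict()     # FeedItemID -> post, kept in ranked order
        self.last_generated = None

    def set_items(self, ranked_items):
        # ranked_items: iterable of (feed_item_id, post) pairs, already ranked
        self.feed_items = OrderedDict(ranked_items)
        self.last_generated = datetime.now(timezone.utc)

    def page_after(self, last_seen_id=None, page_size=20):
        """Skip past last_seen_id and return the next page of posts."""
        items = iter(self.feed_items.items())
        if last_seen_id is not None:
            for item_id, _ in items:
                if item_id == last_seen_id:
                    break
        page = []
        for _, post in items:
            page.append(post)
            if len(page) == page_size:
                break
        return page

For example, a client that last saw FeedItemID 117 would call page_after(117) to get the next 20 posts.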
How many feed items should we store in memory for a user’s feed? Initially, we can decide
to store 500 feed items per user, but this number can be adjusted later based on the usage
pattern. For example, if we assume that one page of a user's feed has 20 posts and most
users never browse more than ten pages of their feed, we can decide to store only 200
posts per user. For any user who wants to see more posts (more than what is stored in
memory), we can always query the backend servers.
Should we generate (and keep in memory) the newsfeed for all users? There will be a lot of
users who don't log in frequently. Here are a few things we can do to handle this. A
straightforward approach is to use an LRU-based cache that removes from memory users
who haven't accessed their newsfeed for a long time. A smarter solution can figure out the
login pattern of users to pre-generate their newsfeed, e.g., at what time of day is a user
active, and on which days of the week does a user access their newsfeed?
Let’s now discuss some solutions to our “live updates” problems in the following section.
b. Feed publishing
The process of pushing a post to all the followers is called a fanout. By analogy, the push
approach is called fanout-on-write, while the pull approach is called fanout-on-load. Let’s
discuss different options for publishing feed data to users.
1. “Pull” model or Fan-out-on-load: This method involves keeping all the recent feed
data in memory so that users can pull it from the server whenever they need it.
Clients can pull the feed data on a regular basis or manually whenever they need it.
Possible problems with this approach are a) New data might not be shown to the
users until they issue a pull request, b) It’s hard to find the right pull cadence, as
most of the time pull requests will result in an empty response if there is no new
data, causing waste of resources.
2. “Push” model or Fan-out-on-write: For a push system, once a user has published a
post, we can immediately push this post to all the followers. The advantage is that
when fetching feed, you don’t need to go through your friend’s list and get feeds for
each of them. It significantly reduces read operations. To efficiently handle this,
users have to maintain a Long Poll request with the server for receiving the
updates. A possible problem with this approach is that when a user has millions of
followers (a celebrity-user), the server has to push updates to a lot of people.
3. Hybrid: An alternate method to handle feed data could be to use a hybrid approach,
i.e., to do a combination of fan-out-on-write and fan-out-on-load. Specifically, we
can stop pushing posts from users with a high number of followers (a celebrity user)
and only push data for those users who have a few hundred (or thousand) followers.
For celebrity users, we can let the followers pull the updates. Since the push
operation can be extremely costly for users who have a lot of friends or followers,
disabling fanout for them can save a huge amount of resources. Another alternative
is that once a user publishes a post, we can limit the fanout to only her online
friends. Also, to get the benefits of both approaches, a combination of push-to-notify
and pull-for-serving end users is a great way to go; a purely push or pull model is
less versatile.
How many feed items can we return to the client in each request? We should have a
maximum limit for the number of items a user can fetch in one request (say 20). But we
should let clients choose to specify how many feed items they want with each request, as
the user may like to fetch a different number of posts depending on the device (mobile vs.
desktop).
Should we always notify users if there are new posts available for their newsfeed? It could
be useful for users to get notified whenever new data is available. However, on mobile
devices, where data usage is relatively expensive, it can consume unnecessary bandwidth.
Hence, at least for mobile devices, we can choose not to push data, instead, let users “Pull
to Refresh” to get new posts.
8. Feed Ranking
The most straightforward way to rank posts in a newsfeed is by the creation time of the
posts. But today’s ranking algorithms are doing a lot more than that to ensure “important”
posts are ranked higher. The high-level idea of ranking is first to select key “signals” that
make a post important and then find out how to combine them to calculate a final ranking
score.
More specifically, we can select features that are relevant to the importance of any feed
item, e.g., number of likes, comments, shares, time of the update, whether the post has
images/videos, etc., and then, a score can be calculated using these features. This is
generally enough for a simple ranking system. A better ranking system can significantly
improve itself by constantly evaluating if we are making progress in user stickiness,
retention, ads revenue, etc.
9. Data Partitioning
Let’s design a Yelp like service, where users can search for nearby places like restaurants,
theaters or shopping malls, etc., and can also add/view reviews of places. Similar Services:
Proximity server. Difficulty Level: Hard
Proximity servers are used to discover nearby attractions like places, events, etc. If you
haven’t used yelp.com before, please try it before proceeding. You can search for nearby
restaurants, theaters, etc., and spend some time understanding different options the
website offers. This will help you a lot in understanding this chapter better.
What do we wish to achieve from a Yelp like service? Our service will be storing
information about different places so that users can perform a search on them. Upon
querying, our service will return a list of places around the user.
Non-functional Requirements:
3. Scale Estimation
Let’s build our system assuming that we have 500M places and 100K queries per second
(QPS). Let’s also assume a 20% growth in the number of places and QPS each year.
4. Database Schema
Although a four-byte number can uniquely identify 500M locations, with future growth in
mind, we will go with 8 bytes for LocationID.
We also need to store reviews, photos, and ratings of a Place. We can have a separate table
to store reviews for Places:
1. LocationID (8 bytes)
2. ReviewID (4 bytes): Uniquely identifies a review, assuming any location will not have
more than 2^32 reviews.
3. ReviewText (512 bytes)
4. Rating (1 byte): how many stars a place gets out of ten.
Similarly, we can have a separate table to store photos for Places and Reviews.
5. System APIs
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definition of the API for searching:
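For example (a possible signature assembled from the parameters below):
search(api_dev_key, search_terms, user_location, radius_filter, maximum_results_to_return, category_filter, sort, page_token)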
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
search_terms (string): A string containing the search terms.
user_location (string): Location of the user performing the search.
radius_filter (number): Optional search radius in meters.
maximum_results_to_return (number): Number of business results to return.
category_filter (string): Optional category to filter search results, e.g., Restaurants, Shopping
Centers, etc.
sort (number): Optional sort mode: Best matched (0 - default), Minimum distance (1),
Highest rated (2).
page_token (string): This token will specify a page in the result set that should be returned.
Returns: (JSON)
A JSON containing information about a list of businesses matching the search query. Each
result entry will have the business name, address, category, rating, and thumbnail.
At a high level, we need to store and index each dataset described above (places, reviews,
etc.). For users to query this massive database, the indexing should be read efficient, since
while searching for nearby places users expect to see the results in real-time.
Given that the location of a place doesn’t change that often, we don’t need to worry about
frequent updates of the data. As a contrast, if we intend to build a service where objects do
change their location frequently, e.g., people or taxis, then we might come up with a very
different design.
Let's see the different ways to store this data and find out which method will suit our use
case best:
a. SQL solution
One simple solution could be to store all the data in a database like MySQL. Each place will
be stored in a separate row, uniquely identified by LocationID. Each place will have its
longitude and latitude stored separately in two different columns, and to perform a fast
search, we should have indexes on both these fields.
To find all the nearby places of a given location (X, Y) within a radius ‘D’, we can query like
this:
Select * from Places where Latitude between X-D and X+D and Longitude between Y-D and
Y+D
How efficient would this query be? We have estimated 500M places to be stored in our
service. Since we have two separate indexes, each index can return a huge list of places, and
performing an intersection on those two lists won’t be efficient. Another way to look at this
problem is that there could be too many locations between ‘X-D’ and ‘X+D’, and similarly
between ‘Y-D’ and ‘Y+D’. If we can somehow shorten these lists, it can improve the
performance of our query.
b. Grids
We can divide the whole map into smaller grids to group locations into smaller sets. Each
grid will store all the Places residing within a specific range of longitude and latitude. This
scheme would enable us to query only a few grids to find nearby places. Based on a given
location and radius, we can find all the neighboring grids and then query these grids to find
nearby places.
Let’s assume that GridID (a four-byte number) would uniquely identify grids in our system.
What could be a reasonable grid size? Grid size could be equal to the distance we would
like to query since we also want to reduce the number of grids. If the grid size is equal to the
distance we want to query, then we only need to search within the grid which contains the
given location and neighboring eight grids. Since our grids would be statically defined (from
the fixed grid size), we can easily find the grid number of any location (lat, long) and its
neighboring grids.
In the database, we can store the GridID with each location and have an index on it too for
faster searching. Now, our query will look like:
Select * from Places where Latitude between X-D and X+D and Longitude between Y-D and
Y+D and GridID in (GridID, GridID1, GridID2, …, GridID8)
Should we keep our index in memory? Maintaining the index in memory will improve the
performance of our service. We can keep our index in a hash table, where ‘key’ would be
the grid number and ‘value’ would be the list of places contained in that grid.
How much memory will we need to store the index? Let's assume our search radius is 10
miles; given that the total area of the earth is around 200 million square miles, we will have
about 2 million grids (200M square miles / 100 square miles per grid). We would need a
four-byte number to uniquely identify each grid, and since LocationID is 8 bytes, we would
need roughly 4GB of memory (ignoring hash table overhead) to store the index:
(4 * 2M) + (8 * 500M) ~= 4 GB
This solution can still be slow for those grids that have a lot of places, since our places are
not uniformly distributed among grids. We can have a thickly populated area with a lot of
places, and on the other hand, we can have areas which are sparsely populated.
This problem can be solved if we can dynamically adjust our grid size, such that whenever
we have a grid with a lot of places we break it down to create smaller grids. One challenge
with this approach could be, how would we map these grids to locations? Also, how can we
find all the neighboring grids of a grid?
Let's assume we don't want to have more than 500 places in a grid so that we can have
faster searching. So, whenever a grid reaches this limit, we break it down into four grids of
equal size and distribute places among them. This means thickly populated areas like
downtown San Francisco will have a lot of grids, and sparsely populated areas like the
Pacific Ocean will have large grids with places only around the coastlines.
What data-structure can hold this information? A tree in which each node has four children
can serve our purpose. Each node will represent a grid and will contain information about all
the places in that grid. If a node reaches our limit of 500 places, we will break it down to
create four child nodes under it and distribute places among them. In this way, all the leaf
nodes will represent the grids that cannot be further broken down. So leaf nodes will keep a
list of places with them. This tree structure in which each node can have four children is
called a QuadTree.
How will we build QuadTree? We will start with one node that would represent the whole
world in one grid. Since it will have more than 500 locations, we will break it down into four
nodes and distribute locations among them. We will keep repeating this process with each
child node until there are no nodes left with more than 500 locations.
How will we find the grid for a given location? We will start with the root node and search
downward to find our required node/grid. At each step, we will see if the current node we
are visiting has children. If it has, we will move to the child node that contains our desired
location and repeat this process. If the node does not have any children, then that is our
desired node.
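A minimal QuadTree sketch in Python (the 500-place limit comes from the text above; the class and method names, and the boundary handling, are illustrative assumptions):

MAX_PLACES = 500

class Place:
    def __init__(self, location_id, lat, lng):
        self.location_id, self.lat, self.lng = location_id, lat, lng

class QuadTreeNode:
    def __init__(self, min_lat, min_lng, max_lat, max_lng):
        self.min_lat, self.min_lng = min_lat, min_lng
        self.max_lat, self.max_lng = max_lat, max_lng
        self.places = []      # only leaf nodes keep places
        self.children = None  # four children once the node is split

    def contains(self, lat, lng):
        return (self.min_lat <= lat < self.max_lat and
                self.min_lng <= lng < self.max_lng)

    def insert(self, place):
        if self.children is not None:
            self._child_for(place.lat, place.lng).insert(place)
            return
        self.places.append(place)
        if len(self.places) > MAX_PLACES:
            self._split()

    def _split(self):
        mid_lat = (self.min_lat + self.max_lat) / 2
        mid_lng = (self.min_lng + self.max_lng) / 2
        self.children = [
            QuadTreeNode(self.min_lat, self.min_lng, mid_lat, mid_lng),
            QuadTreeNode(self.min_lat, mid_lng, mid_lat, self.max_lng),
            QuadTreeNode(mid_lat, self.min_lng, self.max_lat, mid_lng),
            QuadTreeNode(mid_lat, mid_lng, self.max_lat, self.max_lng),
        ]
        for p in self.places:           # redistribute places to the new leaves
            self._child_for(p.lat, p.lng).insert(p)
        self.places = []

    def _child_for(self, lat, lng):
        for child in self.children:
            if child.contains(lat, lng):
                return child
        return self.children[-1]        # fallback for points on the outer boundary

    def find_leaf(self, lat, lng):
        """Walk down to the leaf grid containing (lat, lng)."""
        node = self
        while node.children is not None:
            node = node._child_for(lat, lng)
        return node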
How will we find neighboring grids of a given grid? Since only leaf nodes contain a list of
locations, we can connect all leaf nodes with a doubly linked list. This way we can iterate
forward or backward among the neighboring leaf nodes to find out our desired locations.
Another approach for finding adjacent grids would be through parent nodes. We can keep a
pointer in each node to access its parent, and since each parent node has pointers to all of
its children, we can easily find siblings of a node. We can keep expanding our search for
neighboring grids by going up through the parent pointers.
Once we have nearby LocationIDs, we can query the backend database to find details about
those places.
What will be the search workflow? We will first find the node that contains the user’s
location. If that node has enough desired places, we can return them to the user. If not, we
will keep expanding to the neighboring nodes (either through the parent pointers or doubly
linked list), until either we find the required number of places or exhaust our search based
on the maximum radius.
How much memory will be needed to store the QuadTree? For each Place, if we cache only
LocationID and Lat/Long, we would need 12GB to store all places.
24 * 500M => 12 GB
Since each grid can have a maximum of 500 places and we have 500M locations, how many
total grids will we have?
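500M / 500 => 1M grids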
Which means we will have 1M leaf nodes and they will be holding 12GB of location data. A
QuadTree with 1M leaf nodes will have approximately 1/3rd internal nodes, and each
internal node will have 4 pointers (for its children). If each pointer is 8 bytes, then the
memory we need to store all internal nodes would be:
1M * 1/3 * 4 * 8 = 10 MB
So, total memory required to hold the whole QuadTree would be 12.01GB. This can easily fit
into a modern-day server.
How would we insert a new Place into our system? Whenever a new Place is added by a
user, we need to insert it into the database as well as into the QuadTree. If our tree resides
on one server, it is easy to add a new Place, but if the QuadTree is distributed among
different servers, first we need to find the grid/server of the new Place and then add it there
(discussed in the next section).
7. Data Partitioning
What if we have a huge number of places such that, our index does not fit into a single
machine’s memory? With 20% growth, each year, we will reach the memory limit of the
server in the future. Also, what if one server cannot serve the desired read traffic? To
resolve these issues, we must partition our QuadTree!
We will explore two solutions here (both of these partitioning schemes can be applied to
databases too):
a. Sharding based on regions: We can divide our places into regions (like zip codes), such
that all places belonging to a region will be stored on a fixed node. While storing, we will
find the region of each place to find the server and store the place there. Similarly, while
querying for nearby places, we can ask the region server that contains the user’s location. This
approach has a couple of issues:
1. What if a region becomes hot? There would be a lot of queries on the server holding
that region, making it perform slow. This will affect the performance of our service.
2. Over time some regions can end up storing a lot of places compared to others.
Hence maintaining a uniform distribution of places, while regions are growing, is
quite difficult.
To recover from these situations either we have to repartition our data or use consistent
hashing.
b. Sharding based on LocationID: Our hash function will map each LocationID to a server
where we will store that place. While building our QuadTree, we will iterate through all the
places and calculate the hash of each LocationID to find a server where it would be stored.
To find nearby places of a location we have to query all servers, and each server will return a
set of nearby places. A centralized server will aggregate these results to return them to the
user.
Will we have a different QuadTree structure on different partitions? Yes, this can happen,
since it is not guaranteed that we will have an equal number of places in any given grid on
all partitions. Though, we do make sure that all servers have approximately an equal number
of Places. This different tree structure on different servers will not cause any issues, though,
as we will be searching all the neighboring grids within the given radius on all partitions.
The remaining part of this chapter assumes that we have partitioned our data based on
LocationID.
What will happen when a QuadTree server dies? We can have a secondary replica of each
server, and if primary dies, it can take control after the failover. Both primary and secondary
servers will have the same QuadTree structure.
What if both primary and secondary servers die at the same time? We have to allocate a
new server and rebuild the same QuadTree on it. How can we do that, since we don’t know
what places were kept on this server? The brute-force solution would be to iterate through
the whole database and filter LocationIDs using our hash function to figure out all the
required places that will be stored on this server. This would be inefficient and slow; also,
while the server is being rebuilt, we will not be able to serve any queries from it, thus
missing some places that should have been seen by users.
How can we efficiently retrieve a mapping between Places and QuadTree server? We have
to build a reverse index that will map all the Places to their QuadTree server. We can have a
separate QuadTree Index server that will hold this information. We will need to build a
HashMap, where the ‘key’ would be the QuadTree server number and the ‘value’ would be
a HashSet containing all the Places being kept on that QuadTree server. We need to store
LocationID and Lat/Long with each place because through this information servers can build
their QuadTrees. Notice that we are keeping Places' data in a HashSet; this will enable us to
add/remove Places from our index quickly. So now whenever a QuadTree server needs to
rebuild itself, it can simply ask the QuadTree Index server for all the Places it needs to store.
This approach will surely be quite fast. We should also have a replica of the QuadTree Index
server for fault tolerance. If a QuadTree Index server dies, it can always rebuild its index by
iterating through the database.
9. Cache
To deal with hot Places, we can introduce a cache in front of our database. We can use an
off-the-shelf solution like Memcache, which can store all the data about hot places. Application
servers can quickly check whether the cache has that Place before hitting the backend database. Based
on clients’ usage pattern, we can adjust how many cache servers we need. For cache
eviction policy, Least Recently Used (LRU) seems suitable for our system.
10. Load Balancing (LB)
We can add an LB layer at two places in our system: 1) between Clients and Application
servers, and 2) between Application servers and Backend servers. Initially, a simple Round Robin
approach can be adopted; that will distribute all incoming requests equally among backend
servers. This LB is simple to implement and does not introduce any overhead. Another
benefit of this approach is if a server is dead, the load balancer will take it out of the
rotation and will stop sending any traffic to it.
A problem with Round Robin LB is that it won’t take server load into consideration. If a server is
overloaded or slow, the load balancer will not stop sending new requests to that server. To
handle this, a more intelligent LB solution would be needed that periodically queries
backend server about their load and adjusts traffic based on that.
11. Ranking
How about if we want to rank the search results not just by proximity but also by popularity
or relevance?
How can we return most popular places within a given radius? Let’s assume we keep track
of the overall popularity of each place. An aggregated number can represent this popularity
in our system, e.g., how many stars a place gets out of ten (this would be an average of the
different ratings given by users). We will store this number in the database, as well as in
the QuadTree. While searching for top 100 places within a given radius, we can ask each
partition of the QuadTree to return top 100 places having maximum popularity. Then the
aggregator server can determine top 100 places among all the places returned by different
partitions.
Remember that we didn’t build our system to update place’s data frequently. With this
design, how can we modify popularity of a place in our QuadTree? Although we can search a
place and update its popularity in the QuadTree, it would take a lot of resources and can
affect search requests and system throughput. Assuming popularity of a place is not
expected to reflect in the system within a few hours, we can decide to update it once or
twice a day, especially when the load on the system is minimum.
Our next problem, Designing Uber backend, discusses dynamic updates of the QuadTree in
detail.
Let’s design a ride-sharing service like Uber, which connects passengers who need a ride
with drivers who have a car. Similar Services: Lyft, Didi, Via, Sidecar etc. Difficulty level: Hard
Prerequisite: Designing Yelp
1. What is Uber?
Uber enables its customers to book drivers for taxi rides. Uber drivers use their personal
cars to drive customers around. Both customers and drivers communicate with each other
through their smartphones using the Uber app.
Drivers need to regularly notify the service about their current location and their availability
to pick passengers.
Passengers get to see all the nearby available drivers.
A customer can request a ride; nearby drivers are notified that a customer is ready to be
picked up.
Once a driver and customer accept a ride, they can constantly see each other’s current
location, until the trip finishes.
Upon reaching the destination, the driver marks the journey complete to become available
for the next ride.
We will take the solution discussed in Designing Yelp and modify it to make it work for the
above-mentioned “Uber” use cases. The biggest difference we have is that our QuadTree
was not built keeping in mind that there will be frequent updates to it. So, we have two
issues with our Dynamic Grid solution:
Since all active drivers are reporting their locations every three seconds, we need to update
our data structures to reflect that. If we have to update the QuadTree for every change in
the driver’s position, it will take a lot of time and resources. To update a driver to its new
location, we must find the right grid based on the driver’s previous location. If the new
position does not belong to the current grid, we have to remove the driver from the current
grid and move/reinsert the user to the correct grid. After this move, if the new grid reaches
the maximum limit of drivers, we have to repartition it.
We need to have a quick mechanism to propagate the current location of all the nearby
drivers to any active customer in that area. Also, when a ride is in progress, our system
needs to notify both the driver and passenger about the current location of the car.
Although our QuadTree helps us find nearby drivers quickly, a fast update in the tree is not
guaranteed.
Do we need to modify our QuadTree every time a driver reports their location? If we don’t
update our QuadTree with every update from the driver, it will have some old data and will
not reflect the current location of drivers correctly. If you recall, our purpose of building the
QuadTree was to find nearby drivers (or places) efficiently. Since all active drivers report
their location every three seconds, there will be a lot more updates happening to our tree
than queries for nearby drivers. So, what if we keep the latest position reported by all
drivers in a hash table and update our QuadTree a little less frequently? Let's assume
we guarantee that a driver’s current location will be reflected in the QuadTree within 15
seconds. Meanwhile, we will maintain a hash table that will store the current location
reported by drivers; let’s call this DriverLocationHT.
How much memory do we need for DriverLocationHT? We need to store the DriverID and
the driver's present and old locations in the hash table. So we need a total of 35 bytes to
store one record:
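One way to break down the 35 bytes: DriverID (3 bytes) + old latitude (8 bytes) + old
longitude (8 bytes) + new latitude (8 bytes) + new longitude (8 bytes) = 35 bytes. For the
one million drivers assumed below, DriverLocationHT would need about 1M * 35 bytes => 35 MB.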
How much bandwidth will our service consume to receive location updates from all
drivers? If we get DriverID and their location, it will be (3+16 => 19 bytes). If we receive this
information every three seconds from one million drivers, we will be getting 19MB per three
seconds.
1. As soon as the server receives an update for a driver’s location, they will broadcast
that information to all the interested customers.
2. The server needs to notify respective QuadTree server to refresh the driver’s
location. As discussed above, this can happen every 15 seconds.
How can we efficiently broadcast the driver's location to customers? We can have a Push
Model, where the server will push the positions to all the relevant users. We can have a
dedicated Notification Service that can broadcast the current location of drivers to all the
interested customers. We can build our Notification service on a publisher/subscriber model.
When a customer opens the Uber app on their cell phone, they query the server to find
nearby drivers. On the server side, before returning the list of drivers to the customer, we
will subscribe the customer for all the updates from those drivers. We can maintain a list of
customers (subscribers) interested in knowing the location of a driver and whenever we
have an update in DriverLocationHT for that driver, we can broadcast the current location of
the driver to all subscribed customers. This way our system makes sure that we always show
the driver’s current position to the customer.
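A toy sketch of that publisher/subscriber mapping in Python (DriverLocationNotifier and the callback style are illustrative assumptions, not the actual service design):

from collections import defaultdict

class DriverLocationNotifier:
    """Keeps driver -> interested customers and fans out location updates."""
    def __init__(self):
        self.subscribers = defaultdict(set)   # driver_id -> set of customer callbacks

    def subscribe(self, driver_id, customer_callback):
        # Called when a customer is shown this driver in their nearby-drivers list.
        self.subscribers[driver_id].add(customer_callback)

    def publish(self, driver_id, lat, lng):
        # Called whenever DriverLocationHT receives a new position for this driver.
        for notify in self.subscribers[driver_id]:
            notify(driver_id, lat, lng)

# Usage: push updates to every customer watching driver 42.
notifier = DriverLocationNotifier()
notifier.subscribe(42, lambda d, lat, lng: print(f"driver {d} is now at ({lat}, {lng})"))
notifier.publish(42, 37.77, -122.42)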
How much memory will we need to store all these subscriptions? As we have estimated
above, we will have 1M daily active customers and 500K daily active drivers. On average,
let's assume that five customers subscribe to one driver. Let's also assume we store all this
information in a hash table so that we can update it efficiently. We need to store driver and
customer IDs to maintain the subscriptions. Assuming we will need 3 bytes for DriverID and
8 bytes for CustomerID, we will need 21MB of memory.
(500K * 3) + (500K * 5 * 8 ) ~= 21 MB
How much bandwidth will we need to broadcast the driver's location to customers? For
every active driver we have five subscribers, so the total number of subscribers will be:
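500K * 5 => 2.5M subscribers
If we broadcast each 19-byte location update to all of a driver's subscribers every three
seconds, that is roughly 2.5M * 19 bytes ~= 47.5 MB per three seconds.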
How can we efficiently implement Notification service? We can either use HTTP long
polling or push notifications.
How will new publishers/drivers get added for a current customer? As proposed above,
customers will be subscribed to nearby drivers when they open the Uber app for the first
time; what will happen when a new driver enters the area the customer is looking at? To
add a new customer/driver subscription dynamically, we need to keep track of the area the
customer is watching. This would make our solution complicated; what if, instead of pushing
this information, clients pull it from the server?
How about if clients pull information about nearby drivers from the server? Clients can
send their current location, and the server will find all the nearby drivers from the QuadTree
to return them to the client. Upon receiving this information, the client can update their
screen to reflect current positions of the drivers. Clients can query every five seconds to
limit the number of round trips to the server. This solution looks much simpler compared to
the push model described above.
Do we need to repartition a grid as soon as it reaches the maximum limit? We can have a
cushion to let each grid grow a little bigger beyond the limit before we decide to partition it.
Let’s say our grids can grow/shrink extra 10% before we partition/merge them. This should
decrease the load for grid partition or merge on high traffic grids.
What if a Driver Location server or Notification server dies? We would need replicas of
these servers, so that if the primary dies the secondary can take control. Also, we can store
this data in some persistent storage like SSDs that can provide fast IOs; this will ensure that
if both primary and secondary servers die we can recover the data from the persistent
storage.
6. Ranking
How about if we want to rank the search results not just by proximity but also by popularity
or relevance?
How can we return top rated drivers within a given radius? Let’s assume we keep track of
the overall ratings of each driver in our database and QuadTree. An aggregated number can
represent this popularity in our system, e.g., how many stars a driver gets out of ten. While
searching for top 10 drivers within a given radius, we can ask each partition of the QuadTree
to return top 10 drivers with maximum rating. The aggregator server can then determine
top 10 drivers among all the drivers returned by different partitions.
7. Advanced Issues
9. What if a client gets disconnected when it was a part of a ride? How will we handle
billing in such a scenario?
10. How about if clients pull all the information as compared to servers always pushing
it?
Let's design an online ticketing system that sells movie tickets like Ticketmaster or
BookMyShow. Similar Services: bookmyshow.com, ticketmaster.com Difficulty Level:
Hard
Functional Requirements:
1. Our ticket booking service should be able to list down different cities where its
affiliate cinemas are located.
2. Once the user selects the city, the service should display the movies released in that
particular city.
3. Once the user selects a movie, the service should display the cinemas running that
movie and its available shows.
4. The user should be able to choose a show at a particular cinema and book their
tickets.
5. The service should be able to show the user the seating arrangement of the cinema
hall. The user should be able to select multiple seats according to their preference.
6. The user should be able to distinguish available seats from the booked ones.
7. Users should be able to put a hold on the seats for five minutes before they make a
payment to finalize the booking.
8. The user should be able to wait if there is a chance that the seats might become
available – e.g., when holds by other users expire.
9. Waiting customers should be serviced in a fair, first-come-first-served manner.
Non-Functional Requirements:
1. The system would need to be highly concurrent. There will be multiple booking
requests for the same seat at any particular point in time. The service should handle
this gracefully and fairly.
2. The core function of the service is ticket booking, which means financial transactions.
This means that the system should be secure and the database ACID compliant.
4. For simplicity, let’s assume our service does not require any user authentication.
5. The system will not handle partial ticket orders. Either user gets all the tickets they
want, or they get nothing.
7. To stop system abuse, we can restrict users not to book more than ten seats at a
time.
8. We can assume that traffic will spike on popular/much-awaited movie releases, and
that seats fill up pretty fast. The system should be scalable and highly available to
cope with these surges in traffic.
4. Capacity Estimation
Traffic estimates: Let’s assume that our service has 3 billion page views per month and sells
10 million tickets a month.
Storage estimates: Let's assume that we have 500 cities and, on average, each city has ten
cinemas. Each cinema has 2000 seats, and on average there are two shows every day.
Let’s assume each seat booking needs 50 bytes (IDs, NumberOfSeats, ShowID, MovieID,
SeatNumbers, SeatStatus, Timestamp, etc.) to store in the database. We would also need to
store information about movies and cinemas, let’s assume it’ll take 50 bytes. So, to store all
the data about all shows of all cinemas of all cities for a day:
500 cities * 10 cinemas * 2000 seats * 2 shows * (50+50) bytes = 2GB / day
5. System APIs
We can have SOAP or REST APIs to expose the functionality of our service. Following could
be the definition of the APIs to search movie shows and reserve seats.
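For example, a possible signature for the search API, assembled from the parameters below:
SearchMovies(api_dev_key, keyword, city, lat_long, radius, start_datetime, end_datetime, postal_code, includeSpellcheck, results_per_page, sorting_order)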
Parameters:
api_dev_key (string): The API developer key of a registered account. This will be used to,
among other things, throttle users based on their allocated quota.
keyword (string): Keyword to search on.
city (string): City to filter movies by.
lat_long (string): Latitude and longitude to filter by.
radius (number): Radius of the area in which we want to search for events.
start_datetime (string): Filter movies with a starting datetime.
end_datetime (string): Filter movies with an ending datetime.
postal_code (string): Filter movies by postal code / zipcode.
includeSpellcheck (Enum: “yes” or “no”): Yes, to include spell check suggestions in the
response.
results_per_page (number): How many results to return per page. Maximum is 30.
sorting_order (string): Sorting order of the search result. Some allowable values :
‘name,asc’, ‘name,desc’, ‘date,asc’, ‘date,desc’, ‘distance,asc’, ‘name,date,asc’,
‘name,date,desc’, ‘date,name,asc’, ‘date,name,desc’.
Returns: (JSON)
Here is a sample list of movies and their shows:
[
  {
    "MovieID": 1,
    "ShowID": 1,
    "Title": "Cars 2",
    "Description": "About cars",
    "Duration": 120,
    "Genre": "Animation",
    "Language": "English",
    "ReleaseDate": "8th Oct. 2014",
    "Country": "USA",
    "StartTime": "14:00",
    "EndTime": "16:00",
    "Seats":
    [
      {
        "Type": "Regular",
        "Price": 14.99,
        "Status": "Almost Full"
      },
      {
        "Type": "Premium",
        "Price": 24.99,
        "Status": "Available"
      }
    ]
  },
  {
    "MovieID": 1,
    "ShowID": 2,
    "Title": "Cars 2",
    "Description": "About cars",
    "Duration": 120,
    "Genre": "Animation",
    "Language": "English",
    "ReleaseDate": "8th Oct. 2014",
    "Country": "USA",
    "StartTime": "16:30",
    "EndTime": "18:30",
    "Seats":
    [
      {
        "Type": "Regular",
        "Price": 14.99,
        "Status": "Full"
      },
      {
        "Type": "Premium",
        "Price": 24.99,
        "Status": "Almost Full"
      }
    ]
  }
]
ReserveSeats(api_dev_key, session_id, movie_id, show_id, seats_to_reserve[])
Parameters:
api_dev_key (string): same as above
session_id (string): User's session ID to track this reservation. Once the reservation time
expires, the user's reservation on the server will be removed using this ID.
movie_id (string): Movie to reserve.
show_id (string): Show to reserve.
seats_to_reserve (number[]): An array containing the seat IDs to reserve.
Returns: (JSON)
Returns the status of the reservation, which would be one of the following: 1) "Reservation
Successful", 2) "Reservation Failed - Show Full", or 3) "Reservation Failed - Retry, as other
users are holding reserved seats".
6. Database Design
Here are a few observations about the data we are going to store:
At a high level, our web servers will manage users' sessions, and application servers will
handle all the ticket management, storing data in the databases as well as working with the
cache servers to process reservations.
First, let's try to build our service assuming it is being served from a single server.
9. If seats are reserved successfully, the user has five minutes to pay for the
reservation. After payment, the booking is marked complete. If the user is unable to
pay within five minutes, all their reserved seats are freed and become available to
other users.
a. ActiveReservationsService
We can keep all the reservations of a 'show' in memory in a Linked HashMap, in addition
to keeping all the data in the database. We need a Linked HashMap so that we can jump to
any reservation to remove it when the booking is complete. Also, since we will have an
expiry time associated with each reservation, the head of the Linked HashMap will always
point to the oldest reservation record, so that the reservation can be expired when the
timeout is reached.
To store every reservation for every show, we can have a HashTable where the ‘key’ would
be ‘ShowID’ and the ‘value’ would be the Linked HashMap containing ‘BookingID’ and
creation ‘Timestamp’.
In the database, we will store the reservation in the ‘Booking’ table, and the expiry time will
be in the Timestamp column. The ‘Status’ field will have a value of ‘Reserved (1)’ and as
soon as a booking is complete, the system will update the ‘Status’ to ‘Booked (2)’ and
remove the reservation record from the Linked HashMap of the relevant show. When a
reservation expires, we can either remove it from the Booking table or mark it 'Expired (3)'
in addition to removing it from memory.
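As a rough illustration, here is a minimal Java sketch of the in-memory bookkeeping described above, assuming a table keyed by ShowID whose values are Linked HashMaps of BookingID to creation timestamp. The class and method names (ActiveReservationsService, expireOldReservations, etc.) are illustrative, not part of the design itself.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ActiveReservationsService {
    // ShowID -> (BookingID -> reservation creation timestamp), kept in insertion order.
    // Because LinkedHashMap preserves insertion order, the first entry is always the oldest reservation.
    private final Map<Integer, LinkedHashMap<Long, Long>> reservationsByShow = new ConcurrentHashMap<>();

    private static final long RESERVATION_TIMEOUT_MS = 5 * 60 * 1000; // five minutes

    public void addReservation(int showId, long bookingId) {
        reservationsByShow
                .computeIfAbsent(showId, id -> new LinkedHashMap<>())
                .put(bookingId, System.currentTimeMillis());
    }

    // Called when a booking completes: jump directly to the reservation and drop it.
    public void completeBooking(int showId, long bookingId) {
        LinkedHashMap<Long, Long> reservations = reservationsByShow.get(showId);
        if (reservations != null) {
            reservations.remove(bookingId);
            // Also update Booking.Status to 'Booked (2)' in the database (not shown here).
        }
    }

    // Periodically expire reservations that have exceeded the timeout; the oldest sit at the head.
    public void expireOldReservations(int showId) {
        LinkedHashMap<Long, Long> reservations = reservationsByShow.get(showId);
        if (reservations == null) return;
        long now = System.currentTimeMillis();
        reservations.entrySet().removeIf(e -> now - e.getValue() > RESERVATION_TIMEOUT_MS);
        // Mark the corresponding rows 'Expired (3)' in the Booking table (not shown here).
    }
}

Since insertion order is preserved and every reservation has the same timeout, the head of each Linked HashMap is always the next reservation to expire, which is what keeps timeout-based expiry cheap.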
ActiveReservationsService will also work with the external financial service to process user
payments. Whenever a booking is completed or a reservation expires, WaitingUsersService
will get a signal so that any waiting customer can be served.
b. WaitingUsersService
Just like ActiveReservationsService, we can keep all the waiting users of a show in memory
in a Linked HashMap. We need a Linked HashMap so that we can jump to any user to
remove them from the HashMap when the user cancels their request. Also, since we are
serving in a first-come-first-serve manner, the head of the Linked HashMap would always be
pointing to the longest waiting user, so that whenever seats become available, we can serve
users in a fair manner.
We will have a HashTable to store all the waiting users for every show. The 'key' would be
'ShowID', and the 'value' would be a Linked HashMap containing 'UserIDs' and their wait-
start-time.
Clients can use Long Polling to keep themselves updated on their reservation status.
Whenever seats become available, the server can use this request to notify the user.
Reservation Expiration
On the server, ActiveReservationsService keeps track of the expiry (based on reservation
time) of active reservations. As the client will be shown a timer (for the expiration time)
which could be slightly out of sync with the server, we can add a buffer of five seconds on
the server to safeguard against a broken experience, such that the client never times out
after the server, which would prevent a successful purchase.
9. Concurrency
How do we handle concurrency, such that no two users are able to book the same seat? We
can use transactions in SQL databases to avoid any clashes. For example, if we are using an
SQL server, we can utilize Transaction Isolation Levels to lock the rows before we update
them. Here is the sample code:
BEGIN TRANSACTION;
-- Suppose we intend to reserve three seats (IDs: 54, 55, 56) for ShowID=99
SELECT * FROM Show_Seat WHERE ShowID = 99 AND ShowSeatID IN (54, 55, 56) AND Status = 0; -- 0 = free
-- If the number of rows returned by the above statement is three, update Show_Seat and
-- Booking and return success; otherwise, return failure to the user.
UPDATE Show_Seat ...
UPDATE Booking ...
COMMIT TRANSACTION;
Once the above database transaction is successful, we can start tracking the reservation in
ActiveReservationsService.
We'll also have a master-slave setup for the databases to make them fault-tolerant.
Database partitioning: If we partition by 'MovieID', then all the shows of a movie will be on
a single server. For a very hot movie, this could cause a lot of load on that server. A better
approach would be to partition based on ShowID; this way, the load gets distributed among
different servers.
Whenever a reservation expires, the ActiveReservationsService server holding it will:
1. Update the database to remove the Booking (or mark it expired) and update the seats'
Status in the 'Show_Seats' table.
2. Remove the reservation from the Linked HashMap.
3. Notify the user that their reservation has expired.
4. Broadcast a message to all WaitingUsersService servers that are holding waiting users
of that Show to figure out the longest-waiting user. The Consistent Hashing scheme will
tell us which servers are holding these users.
5. Send a message to the WaitingUsersService server holding the longest-waiting user to
process their request if the required seats have become available.
Whenever a booking is completed:
1. The server holding that reservation sends a message to all servers holding waiting
users of that Show, so that they can expire all those waiting users who need more
seats than are available.
2. Upon receiving the above message, all servers holding the waiting users will query
the database to find how many free seats are available now. A database cache would
greatly help here to run this query only once.
3. Expire all waiting users who want to reserve more seats than the available seats. For
this, WaitingUsersService has to iterate through the Linked HashMap of all the
waiting users.
Load Balancing
Load Balancer (LB) is another critical component of any distributed system. It helps to
spread the traffic across a cluster of servers to improve responsiveness and availability of
applications, websites or databases. LB also keeps track of the status of all the resources
while distributing requests. If a server is not available to take new requests, is not
responding, or has an elevated error rate, the LB will stop sending traffic to that server.
Typically, a load balancer sits between the client and the server, accepting incoming network
and application traffic and distributing it across multiple backend servers using
various algorithms. By balancing application requests across multiple servers, a load
balancer reduces individual server load and prevents any one application server from
becoming a single point of failure, thus improving overall application availability and
responsiveness.
To utilize full scalability and redundancy, we can try to balance the load at each layer of the
system. We can add LBs at three places:
1. Between the user and the web server.
2. Between web servers and an internal platform layer, like application servers or cache servers.
3. Between the internal platform layer and the database.
Benefits of Load Balancing
Users experience faster, uninterrupted service. Users won't have to wait for a single
struggling server to finish its previous tasks. Instead, their requests are immediately passed
on to a more readily available resource.
Service providers experience less downtime and higher throughput. Even a full server failure
won’t affect the end user experience as the load balancer will simply route around it to a
healthy server.
Load balancing makes it easier for system administrators to handle incoming requests while
decreasing wait time for users.
Smart load balancers provide benefits like predictive analytics that determine traffic
bottlenecks before they happen. As a result, the smart load balancer gives an organization
actionable insights. These are key to automation and can help drive business decisions.
System administrators experience fewer failed or stressed components. Instead of a single
device performing a lot of work, load balancing has several devices perform a little bit of
work.
Health Checks - Load balancers should only forward traffic to “healthy” backend servers. To
monitor the health of a backend server, “health checks” regularly attempt to connect to
backend servers to ensure that servers are listening. If a server fails a health check, it is
automatically removed from the pool, and traffic will not be forwarded to it until it responds
to the health checks again.
There is a variety of load balancing methods, which use different algorithms for different
needs.
Least Connection Method — This method directs traffic to the server with the fewest active
connections. This approach is most useful when there are a large number of persistent
connections in the traffic unevenly distributed between the servers.
Least Response Time Method — This algorithm directs traffic to the server with the fewest
active connections and the lowest average response time.
Least Bandwidth Method - This method selects the server that is currently serving the least
amount of traffic, measured in megabits per second (Mbps).
Round Robin Method — This method cycles through a list of servers and sends each new
request to the next server. When it reaches the end of the list, it starts over at the
beginning. It is most useful when the servers are of equal specification, and there are not
many persistent connections.
Weighted Round Robin Method — Weighted round-robin scheduling is designed to better
handle servers with different processing capacities. Each server is assigned a weight, an
integer value that indicates its processing capacity. Servers with higher weights receive new
connections before those with lower weights, and servers with higher weights get more
connections overall than those with lower weights.
IP Hash — This method calculates a hash of the IP address of the client to determine which
server receives the request.
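As an illustration, here is a minimal Java sketch of two of these selection methods, Round Robin and IP Hash, assuming a fixed list of backend addresses (the class name and server list are made up). Weighted Round Robin can be approximated with this structure by repeating a server in the list in proportion to its weight.

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LoadBalancerSketch {
    private final List<String> servers;               // e.g., ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    private final AtomicInteger counter = new AtomicInteger();

    public LoadBalancerSketch(List<String> servers) {
        this.servers = servers;
    }

    // Round Robin: cycle through the server list, one server per new request.
    public String pickRoundRobin() {
        int index = Math.floorMod(counter.getAndIncrement(), servers.size());
        return servers.get(index);
    }

    // IP Hash: hash the client's IP so the same client keeps landing on the same server.
    public String pickByIpHash(String clientIp) {
        int index = Math.floorMod(clientIp.hashCode(), servers.size());
        return servers.get(index);
    }
}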
Redundant Load Balancers
The load balancer can be a single point of failure; to overcome this, a second load balancer
can be connected to the first to form a cluster. Each LB monitors the health of the other,
and since both of them are equally capable of serving traffic and detecting failures, if the
main load balancer fails, the second load balancer takes over.
1. Smart Clients
One way to implement load balancing is through the client application. Developers can add
the load-balancing algorithm to the application or the database client. Such a client will take
a pool of service hosts and balance the load across them. It also detects hosts that are not
responding to avoid sending requests their way. Smart clients also have to discover
recovered hosts, deal with adding new hosts, etc. Smart clients look easy to implement and
manage, especially when the system is not large, but as the system grows, LBs need to
evolve into standalone servers.
2. Hardware Load Balancers
The most expensive, but very high-performance, way to do load balancing is to buy dedicated
hardware; however, such hardware solutions are expensive and not trivial to configure. As
such, even large companies with large budgets will often avoid using dedicated hardware
for all their load-balancing needs. Instead, they use them only as the first point of contact
for user requests to their infrastructure and use other mechanisms (smart clients or the
hybrid approach discussed in the next section) for load-balancing for traffic within their
network.
3. Software Load Balancers
If we want to avoid the pain of creating a smart client, and since purchasing dedicated
hardware is expensive, we can adopt a hybrid approach, called software load-balancers.
HAProxy is one of the popular open-source software LBs. The load balancer can be placed
between the client and the server or between two server-side layers. If we can control the
machine where the client is running, HAProxy could be running on the same machine. Each
service we want to load balance can have a locally bound port (e.g., localhost:9000) on that
machine, and the client will use this port to connect to the server. This port is, actually,
managed by HAProxy; every client request on this port will be received by the proxy and
then passed to the backend service in an efficient way (distributing load). If we can’t
manage the client’s machine, HAProxy can run on an intermediate server. Similarly, we can
have proxies running between different server-side components. HAProxy manages health
checks and will remove or add servers to those pools. It also balances requests across all the
servers in those pools.
For most systems, we should start with a software load balancer and move to smart clients
or hardware load balancing as the need arises.
Caching
Load balancing helps you scale horizontally across an ever-increasing number of servers, but
caching will enable you to make vastly better use of the resources you already have, as well
as making otherwise unattainable product requirements feasible. Caches take advantage of
the locality of reference principle: recently requested data is likely to be requested again.
They are used in almost every layer of computing: hardware, operating systems, web
browsers, web applications and more. A cache is like short-term memory: it has a limited
amount of space, but is typically faster than the original data source and contains the most
recently accessed items. Caches can exist at all levels in architecture but are often found at
the level nearest to the front end, where they are implemented to return data quickly
without taxing downstream levels.
Placing a cache directly on a request layer node enables the local storage of response data.
Each time a request is made to the service, the node will quickly return local, cached data if
it exists. If it is not in the cache, the requesting node will query the data from disk. The
cache on one request layer node could also be located both in memory (which is very fast)
and on the node’s local disk (faster than going to network storage).
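A minimal sketch of this "check the local cache first, otherwise query the slower store and remember the result" pattern is shown below; loadFromDisk is a hypothetical stand-in for whatever the origin read actually is.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class RequestNodeCache<K, V> {
    private final Map<K, V> localCache = new ConcurrentHashMap<>();
    private final Function<K, V> loadFromDisk;  // placeholder for the slower origin/disk lookup

    public RequestNodeCache(Function<K, V> loadFromDisk) {
        this.loadFromDisk = loadFromDisk;
    }

    public V get(K key) {
        // Return local, cached data if it exists; otherwise query the origin and cache the result.
        return localCache.computeIfAbsent(key, loadFromDisk);
    }
}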
What happens when you expand this to many nodes? If the request layer is expanded to
multiple nodes, it’s still quite possible to have each node host its own cache. However, if
your load balancer randomly distributes requests across the nodes, the same request will go
to different nodes, thus increasing cache misses. Two choices for overcoming this hurdle are
global caches and distributed caches.
2. Distributed cache
In a distributed cache, each of its nodes owns part of the cached data. Typically, the cache is
divided up using a consistent hashing function, such that if a request node is looking for a
certain piece of data, it can quickly know where to look within the distributed cache to
determine if that data is available. In this case, each node has a small piece of the cache,
and will then send a request to another node for the data before going to the origin.
Therefore, one of the advantages of a distributed cache is the ease by which we can
increase the cache space, which can be achieved just by adding nodes to the request pool.
3. Global Cache
A global cache is just as it sounds: all the nodes use the same single cache space. This
involves adding a server, or file store of some sort, faster than your original store and
accessible by all the request layer nodes. Each of the request nodes queries the cache in the
same way it would a local one. This kind of caching scheme can get a bit complicated
because it is very easy to overwhelm a single cache as the number of clients and requests
increase, but is very effective in some architectures (particularly ones with specialized
hardware that make this global cache very fast, or that have a fixed dataset that needs to be
cached).
There are two common forms of global caches depicted in the following diagram. First,
when a cached response is not found in the cache, the cache itself becomes responsible for
retrieving the missing piece of data from the underlying store. Second, it is the responsibility
of request nodes to retrieve any data that is not found in the cache.
[Diagram: the two common forms of global caches]
Most applications leveraging global caches tend to use the first type, where the cache itself
manages eviction and fetching data to prevent a flood of requests for the same data from
the clients. However, there are some cases where the second implementation makes more
sense. For example, if the cache is being used for very large files, a low cache hit percentage
would cause the cache buffer to become overwhelmed with cache misses; in this situation,
it helps to have a large percentage of the total data set (or hot data set) in the cache.
Another example is an architecture where the files stored in the cache are static and
shouldn’t be evicted. (This could be because of application requirements around that data
latency—certain pieces of data might need to be very fast for large data sets—where the
application logic understands the eviction strategy or hot spots better than the cache.)
CDNs are a kind of cache that comes into play for sites serving large amounts of static
media. In a typical CDN setup, a request will first ask the CDN for a piece of static media; the
CDN will serve that content if it has it locally available. If it isn’t available, the CDN will query
the back-end servers for the file and then cache it locally and serve it to the requesting user.
If the system we are building isn't yet large enough to have its own CDN, we can ease a
future transition by serving the static media off a separate subdomain
(e.g., static.yourservice.com) using a lightweight HTTP server like Nginx, and cut over the
DNS from our servers to a CDN later.
Cache Invalidation
While caching is fantastic, it does require some maintenance to keep the cache coherent
with the source of truth (e.g., the database). If the data is modified in the database, it should
be invalidated in the cache; if not, this can cause inconsistent application behavior.
Solving this problem is known as cache invalidation; there are three main schemes that are
used:
Write-through cache: Under this scheme data is written into the cache and the
corresponding database at the same time. The cached data allows for fast retrieval, and
since the same data gets written in the permanent storage, we will have complete data
consistency between cache and storage. Also, this scheme ensures that nothing will get lost
in case of a crash, power failure, or other system disruptions.
Although write through minimizes the risk of data loss, since every write operation must be
done twice before returning success to the client, this scheme has the disadvantage of
higher latency for write operations.
Write-around cache: This technique is similar to write through cache, but data is written
directly to permanent storage, bypassing the cache. This can reduce the cache being flooded
with write operations that will not subsequently be re-read, but has the disadvantage that a
read request for recently written data will create a “cache miss” and must be read from
slower back-end storage and experience higher latency.
Write-back cache: Under this scheme, data is written to cache alone, and completion is
immediately confirmed to the client. The write to the permanent storage is done after
specified intervals or under certain conditions. This results in low latency and high
throughput for write-intensive applications, however, this speed comes with the risk of data
loss in case of a crash or other adverse event because the only copy of the written data is in
the cache.
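To make the trade-off concrete, here is a minimal write-through sketch in Java; persistToDatabase is a placeholder for the real storage call, and the comments note how write-around and write-back would differ.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

public class WriteThroughCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final BiConsumer<K, V> persistToDatabase;   // placeholder for the permanent-storage write

    public WriteThroughCache(BiConsumer<K, V> persistToDatabase) {
        this.persistToDatabase = persistToDatabase;
    }

    public void put(K key, V value) {
        cache.put(key, value);                 // write-around would skip this line
        persistToDatabase.accept(key, value);  // write-back would defer this call and confirm immediately
    }

    public V get(K key) {
        return cache.get(key);
    }
}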
Following are some of the most common cache eviction policies:
1. First In First Out (FIFO): The cache evicts the first block accessed first, without any
regard to how often or how many times it was accessed before.
2. Last In First Out (LIFO): The cache evicts the block accessed most recently first
without any regard to how often or how many times it was accessed before.
3. Least Recently Used (LRU): Discards the least recently used items first.
4. Most Recently Used (MRU): Discards, in contrast to LRU, the most recently used
items first.
5. Least Frequently Used (LFU): Counts how often an item is needed. Those that are
used least often are discarded first.
6. Random Replacement (RR): Randomly selects a candidate item and discards it to
make space when necessary.
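As a concrete illustration of LRU, Java's LinkedHashMap can be configured in access order so that the eldest (least recently used) entry is evicted once a capacity limit is exceeded. This is only a sketch; the class name and capacity are arbitrary.

import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);  // accessOrder = true: iteration order is least- to most-recently used
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // evict the least recently used entry once over capacity
    }
}

For example, an LruCache with capacity 3 that has stored keys a, b, and c, and then served a read of a, will evict b (not a) when a fourth key is inserted.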
Data partitioning (also known as sharding) is a technique to break up a big database (DB)
into many smaller parts. It is the process of splitting up a DB/table across multiple machines
to improve the manageability, performance, availability and load balancing of an
application. The justification for data sharding is that, after a certain scale point, it is cheaper
and more feasible to scale horizontally by adding more machines than to grow it vertically
by adding beefier servers.
1. Partitioning Methods
There are many different schemes one could use to decide how to break up an application
database into multiple smaller DBs. Below are three of the most popular schemes used by
various large scale applications.
a. Horizontal partitioning: In this scheme, we put different rows into different tables. For
example, if we are storing different places in a table, we can decide that locations with ZIP
codes less than 10000 are stored in one table, and places with ZIP codes greater than 10000
are stored in a separate table. This is also called a range based sharding, as we are storing
different ranges of data in separate tables.
The key problem with this approach is that if the value whose range is used for sharding
isn't chosen carefully, the partitioning scheme will lead to unbalanced servers. In the
previous example, splitting locations based on their ZIP codes assumes that places will be
evenly distributed across the different ZIP codes. This assumption is not valid, as there will
be a lot of places in a densely populated area like Manhattan compared to its suburbs.
b. Vertical Partitioning: In this scheme, we divide our data so that tables related to a
specific feature are stored on their own server. For example, if we are building an
Instagram-like application, where we need to store data related to users, the photos they
upload, and the people they follow, we can decide to place user profile information on one
DB server, friend lists on another, and photos on a third server.
2. Partitioning Criteria
a. Key or Hash-based partitioning: Under this scheme, we apply a hash function to some
key attribute of the entity we are storing, which yields the partition number. For example,
suppose we have 100 DB servers and our ID is a numeric value that gets incremented by one
each time a new record is inserted. In this case, the hash function could be 'ID % 100', which
gives us the server number where we can store/read that record. This approach should
ensure a uniform allocation of data among servers. The fundamental problem with this
approach is that it effectively fixes the total number of DB servers, since adding new servers
means changing the hash function, which would require redistributing data and downtime
for the service. A workaround for this problem is to use Consistent Hashing.
b. List partitioning: In this scheme, each partition is assigned a list of values, so whenever
we want to insert a new record, we will see which partition contains our key and then store
it there. For example, we can decide all users living in Iceland, Norway, Sweden, Finland or
Denmark will be stored in a partition for the Nordic countries.
c. Round-robin partitioning: This is a very simple strategy that ensures uniform data
distribution. With 'n' partitions, the i-th tuple is assigned to partition (i mod n).
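A minimal sketch of these three criteria, with made-up server counts and a made-up country list purely for illustration:

import java.util.Set;

public class PartitionerSketch {
    private static final int NUM_SERVERS = 100;
    private static final Set<String> NORDIC = Set.of("Iceland", "Norway", "Sweden", "Finland", "Denmark");

    // a. Key/Hash-based: 'ID % 100' style mapping; adding servers changes the function,
    //    which is why consistent hashing is preferred in practice.
    public static int hashPartition(long id) {
        return (int) (id % NUM_SERVERS);
    }

    // b. List partitioning: route records to a partition based on a list of values.
    public static int listPartition(String country) {
        return NORDIC.contains(country) ? 0 : 1;  // 0 = Nordic partition, 1 = everything else
    }

    // c. Round-robin: the i-th tuple goes to partition (i mod n).
    public static int roundRobinPartition(long i, int n) {
        return (int) (i % n);
    }
}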
On a sharded database, there are certain extra constraints on the different operations that
can be performed. Most of these constraints are due to the fact that operations across
multiple tables, or multiple rows in the same table, will no longer run on the same server.
Below are some of the constraints and additional complexities introduced by sharding:
c. Rebalancing: There could be many reasons we have to change our sharding scheme:
1. The data distribution is not uniform, e.g., there are a lot of places for a particular ZIP
code that cannot fit into one database partition.
2. There is a lot of load on a shard, e.g., too many requests are being handled by the
DB shard dedicated to user photos.
In such cases, either we have to create more DB shards or rebalance existing shards, which
means changing the partitioning scheme and moving all existing data to new locations.
Doing this without incurring downtime is extremely difficult. Using a scheme like directory-
based partitioning does make rebalancing a more palatable experience, at the cost of
increasing the complexity of the system and creating a new single point of failure (i.e., the
lookup service/database).
Indexes
Indexes are well known when it comes to databases. Sooner or later there comes a time
when database performance is no longer satisfactory. One of the very first things you should
turn to when that happens is database indexing.
The goal of creating an index on a particular table in a database is to make it faster to search
through the table and find the row or rows that we want. Indexes can be created using one
or more columns of a database table, providing the basis for both rapid random lookups and
efficient access of ordered records.
A library catalog is a register that contains the list of books found in a library. The catalog is
organized like a database table generally with four columns: book title, writer, subject and
date of publication. There are usually two such catalogs: one sorted by the book title and
one sorted by the writer's name. That way, you can either think of a writer you want to read
and then look through their books, or look up a specific book title in case you don't know
the writer's name. These catalogs are like indexes for the database of books. They provide a
sorted list of data that is easily searchable by relevant information.
Simply put, an index is a data structure that can be perceived as a table of contents that
points us to the location where the actual data lives. So when we create an index on a
column of a table, we store that column and a pointer to the whole row in the index. Let's
assume a table containing a list of books; the following diagram shows what an index on the
'Title' column looks like:
[Diagram: an index on the 'Title' column pointing to the rows where each title is stored]
Just as with a traditional relational data store, we can also apply this concept to larger
datasets. The trick with indexes is that we must carefully consider how users will access the
data. In the case of data sets that are many terabytes in size but with very small payloads
(e.g., 1 KB), indexes are a necessity for optimizing data access. Finding a small payload in
such a large dataset can be a real challenge since we can’t possibly iterate over that much
data in any reasonable time. Furthermore, it is very likely that such a large data set is spread
over several physical devices—this means we need some way to find the correct physical
location of the desired data. Indexes are the best way to do this.
The downside is that indexes make it slower to add rows or make updates to existing rows
for that table since we not only have to write the data but also have to update the index. So
adding indexes can increase the read performance, but at the same time, decrease the write
performance.
Adding a new row to a table without indexes is simple. The database finds the next available
space in the table and inserts the new entry there. However, when adding a new row to a
table with one or more indexes, the database not only has to add the new entry to the table
but also has to add a new entry into each index on that table, making sure to insert the
entry into the correct spot in the index so that the data remains sorted correctly.
This performance degradation applies to all insert, update, and delete operations for the
table. For this reason, adding unnecessary indexes on tables should be avoided, and indexes
that are no longer used should be removed. To reiterate, adding indexes is about improving
the performance of search queries. If the goal of the database is to provide a data store that
is often written to and rarely read from, in that case, decreasing the performance of the
more common operation, which is writing, is probably not worth the increase in
performance we get from reading.
Proxies
A proxy server is an intermediary piece of hardware/software that sits between the client
and the back-end server. It receives requests from clients and relays them to the origin
servers. Typically, proxies are used to filter requests or log requests, or sometimes
transform requests (by adding/removing headers, encrypting/decrypting, or compression).
Another advantage of a proxy server is that its cache can serve a lot of requests. If multiple
clients access a particular resource, the proxy server can cache it and serve all clients
without going to the remote server.
Proxies are also extremely helpful when coordinating requests from multiple servers and
can be used to optimize request traffic from a system-wide perspective. For example, we
can collapse the same (or similar) data access requests into one request and then return the
single result to the user; this scheme is called collapsed forwarding.
Imagine there is a request for the same data across several nodes, and that piece of data is
not in the cache. If these requests are routed through the proxy, then all of them can be
collapsed into one, which means we will be reading the required data from the disk only
once.
Another great way to use the proxy is to collapse requests for data that is spatially close
together in storage (consecutive on disk). This strategy will decrease request latency. For
example, let's say a bunch of servers request parts of a file: part1, part2, part3, etc. We can
set up our proxy in such a way that it recognizes the spatial locality of the individual
requests, collapsing them into a single request and reading the complete file, which will
greatly minimize the reads from the data origin. Such a scheme makes a big difference in
request time when we are doing random accesses across TBs of data. Proxies are
particularly useful under high-load situations, or when we have limited caching, since
proxies can mostly batch several requests into one.
Queues
Queues are used to effectively manage requests in a large-scale distributed system. In small
systems with minimal processing loads and small databases, writes can be predictably fast;
however, in more complex and large systems writes can take an almost non-
deterministically long time. For example, data may have to be written in different places on
different servers or indices, or the system could simply be under high load. In such cases
where individual writes (or tasks) may take a long time, achieving high performance and
availability requires different components of the system to work in an asynchronous way; a
common way to do that is with queues.
Let’s assume a system where each client is requesting a task to be processed on a remote
server. Each of these clients sends their requests to the server, and the server tries to finish
the tasks as quickly as possible to return the results to the respective clients. In small
systems where one server can handle incoming requests just as fast as they come, this kind
of situation should work just fine. However, when the server gets more requests than it can
handle, then each client is forced to wait for other clients’ requests to finish before a
response can be generated.
This kind of synchronous behavior can severely degrade the client's performance; the client
is forced to wait, effectively doing zero work, until its request can be answered. Adding extra
servers to address high load does not solve the problem either; even with effective load
balancing in place, it is very difficult to ensure the fair and balanced distribution of work
required to maximize client performance. Further, if the server processing the requests is
unavailable, or fails, then the clients upstream will fail too. Solving this problem effectively
requires building an abstraction between the client’s request and the actual work
performed to service it.
A processing queue is as simple as it sounds: all incoming tasks are added to the queue, and
as soon as any worker has the capacity to process, they can pick up a task from the queue.
These tasks could represent a simple write to a database, or something as complex as
generating a thumbnail preview image for a document.
Queues are built on asynchronous communication protocols, meaning that when a client
submits a task to a queue, it is no longer required to wait for the results; instead, it needs
only an acknowledgment that the request was properly received. This
acknowledgment can later serve as a reference for the results of the work when the client
requires it. Queues have implicit or explicit limits on the size of data that may be
transmitted in a single request and the number of requests that may remain outstanding on
the queue.
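As a small illustration of this asynchronous hand-off, the sketch below uses a bounded in-memory queue: the client is acknowledged as soon as its task is accepted, and a worker drains tasks at its own pace. In a real system the queue would typically be a separate service such as RabbitMQ, and the task payload here is made up.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class TaskQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: an explicit limit on how many tasks may remain outstanding.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1000);

        // Worker: picks up a task from the queue whenever it has capacity to process one.
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String task = queue.take();            // blocks until a task is available
                    System.out.println("processing: " + task);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Client: enqueues a task and is acknowledged immediately; it does not wait for the result.
        boolean accepted = queue.offer("write user row to DB");
        System.out.println("request accepted: " + accepted);

        Thread.sleep(100);                                  // demo only: give the worker a moment
    }
}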
Queues are also used for fault tolerance as they can provide some protection from service
outages and failures. For example, we can create a highly robust queue that can retry
service requests that have failed due to transient system failures. It is preferable to use a
queue to enforce quality-of-service guarantees rather than to expose clients directly to
intermittent service outages, which would require complicated and often inconsistent
client-side error handling.
Queues play a vital role in managing distributed communication between different parts of
any large-scale distributed system. There are a lot of ways to implement them and quite a
few open source implementations of queues available like RabbitMQ, ZeroMQ, ActiveMQ,
and BeanstalkD.
Redundancy means the duplication of critical data or services with the intention of
increasing the reliability of the system. For example, if there is only one copy of a file stored
on a single server, then losing that server means losing the file. Since losing data is seldom a
good thing, we can create duplicate or redundant copies of the file to solve this problem.
This same principle applies to services too. If we have a critical service in our system,
ensuring that multiple copies or versions of it are running simultaneously can secure against
the failure of a single node.
Creating redundancy in a system can remove single points of failure and provide backups if
needed in a crisis. For example, if we have two instances of a service running in production,
and if one fails or degrades, the system can failover to the other one. These failovers can
happen automatically or can be done manually.
In the world of databases, there are two main types of solutions: SQL and NoSQL - or
relational databases and non-relational databases. Both of them differ in the way they were
built, the kind of information they store, and how they store it.
Relational databases are structured and have predefined schemas, like phone books that
store phone numbers and addresses. Non-relational databases are unstructured, distributed
and have a dynamic schema, like file folders that hold everything from a person’s address
and phone number to their Facebook ‘likes’ and online shopping preferences.
SQL
Relational databases store data in rows and columns. Each row contains all the information
about one entity, and columns are all the separate data points. Some of the most popular
relational databases are MySQL, Oracle, MS SQL Server, SQLite, Postgres and MariaDB.
NoSQL
Key-Value Stores: Data is stored in an array of key-value pairs. The ‘key’ is an attribute
name, which is linked to a ‘value’. Well-known key value stores include Redis, Voldemort
and Dynamo.
Document Databases: In these databases data is stored in documents, instead of rows and
columns in a table, and these documents are grouped together in collections. Each
document can have an entirely different structure. Document databases include CouchDB
and MongoDB.
Graph Databases: These databases are used to store data whose relations are best
represented in a graph. Data is saved in graph structures with nodes (entities), properties
(information about the entities) and lines (connections between the entities). Examples of
graph databases include Neo4J and InfiniteGraph.
Storage: SQL stores data in tables, where each row represents an entity, and each column
represents a data point about that entity; for example, if we are storing a car entity in a
table, different columns could be ‘Color’, ‘Make’, ‘Model’, and so on.
NoSQL databases have different data storage models. The main ones are key-value,
document, graph and columnar. We will discuss differences between these databases
below.
Schema: In SQL, each record conforms to a fixed schema, meaning the columns must be
decided and chosen before data entry and each row must have data for each column. The
schema can be altered later, but it involves modifying the whole database and going offline.
Whereas in NoSQL, schemas are dynamic. Columns can be added on the fly, and each ‘row’
(or equivalent) doesn’t have to contain data for each ‘column.’
Querying: SQL databases use SQL (Structured Query Language) for defining and
manipulating data, which is very powerful. In a NoSQL database, queries are focused on a
collection of documents. This is sometimes called UnQL (Unstructured Query Language),
and different databases have different syntaxes for it.
Scalability: In most common situations, SQL databases are vertically scalable, i.e., by
increasing the horsepower (higher Memory, CPU, etc.) of the hardware, which can get very
expensive. It is possible to scale a relational database across multiple servers, but this is a
challenging and time-consuming process.
On the other hand, NoSQL databases are horizontally scalable, meaning we can add more
servers easily in our NoSQL database infrastructure to handle large traffic. Any cheap
commodity hardware or cloud instances can host NoSQL databases, thus making it a lot
more cost-effective than vertical scaling. A lot of NoSQL technologies also distribute data
across servers automatically.
Most of the NoSQL solutions sacrifice ACID compliance for performance and scalability.
A few common reasons to use a NoSQL database:
1. Storing large volumes of data that often have little to no structure. A NoSQL
database sets no limits on the types of data we can store together and allows us to
add different new types as the need changes. With document-based databases, you
can store data in one place without having to define what “types” of data those are
in advance.
2. Making the most of cloud computing and storage. Cloud-based storage is an
excellent cost-saving solution but requires data to be easily spread across multiple
servers to scale up. Using commodity (affordable, smaller) hardware on-site or in the
cloud saves you the hassle of additional software, and NoSQL databases like
Cassandra are designed to be scaled across multiple data centers out of the box
without a lot of headaches.
3. Rapid development. NoSQL is extremely useful for rapid development as it doesn’t
need to be prepped ahead of time. If you’re working on quick iterations of your
system which require making frequent updates to the data structure without a lot of
downtime between versions, a relational database will slow you down.
CAP Theorem
CAP theorem states that it is impossible for a distributed software system to simultaneously
provide more than two out of three of the following guarantees (CAP): Consistency,
Availability and Partition tolerance. When we design a distributed system, trading off among
CAP is almost the first thing we want to consider. CAP theorem says while designing a
distributed system we can pick only two of:
Consistency: All nodes see the same data at the same time. Consistency is achieved by
updating several nodes before allowing further reads.
Availability: Every request gets a response (success or failure). Availability is achieved by
replicating the data across different servers.
Partition tolerance: The system continues to work despite message loss or partial failure. A
system that is partition-tolerant can sustain any amount of network failure that doesn't
result in a failure of the entire network. Data is sufficiently replicated across combinations of
nodes and networks to keep the system up through intermittent outages.
We cannot build a general data store that is continually available, sequentially consistent
and tolerant to any partition failures. We can only build a system that has any two of these
three properties. Because, to be consistent, all nodes should see the same set of updates in
the same order. But if the network suffers a partition, updates in one partition might not
make it to the other partitions before a client reads from the out-of-date partition after
having read from the up-to-date one. The only thing that can be done to cope with this
possibility is to stop serving requests from the out-of-date partition, but then the service is
no longer 100% available.
Consistent Hashing
The Distributed Hash Table (DHT) is one of the fundamental components used in distributed
scalable systems. Hash tables need a key, a value, and a hash function, where the hash
function maps the key to a location where the value is stored.
index = hash_function(key)
Suppose we are designing a distributed caching system. Given ‘n’ cache servers, an intuitive
hash function would be ‘key % n’. It is simple and commonly used. But it has two major
drawbacks:
1. It is NOT horizontally scalable. Whenever a new cache host is added to the system,
all existing mappings are broken. It will be a pain point in maintenance if the caching
system contains lots of data. Practically it becomes difficult to schedule a downtime
to update all caching mappings.
2. It may NOT be load balanced, especially for non-uniformly distributed data. In
practice, it can be easily assumed that the data will not be distributed uniformly. For
the caching system, this translates into some caches becoming hot and saturated while
others remain idle and almost empty.
In such situations, consistent hashing is a good way to improve the caching system.
Consistent hashing is a very useful strategy for distributed caching systems and DHTs. It
allows distributing data across a cluster in such a way as to minimize reorganization when
nodes are added or removed, hence making the caching system easier to scale up or scale
down.
In Consistent Hashing when the hash table is resized (e.g. a new cache host is added to the
system), only k/n keys need to be remapped, where k is the total number of keys and n is
the total number of servers. Recall that in a caching system using the ‘mod’ as the hash
function, all keys need to be remapped.
In consistent hashing objects are mapped to the same host if possible. When a host is
removed from the system, the objects on that host are shared by other hosts; and when a
new host is added, it takes its share from a few hosts without touching the others' shares.
How does it work?
As a typical hash function, consistent hashing maps a key to an integer. Suppose the output
of the hash function is in the range of [0, 256). Imagine that the integers in the range are
placed on a ring such that the values are wrapped around.
To add a new server, say D, keys that were originally residing at C will be split. Some of them
will be shifted to D, while other keys will not be touched.
To remove a cache or if a cache failed, say A, all keys that were originally mapping to A will
fall into B, and only those keys need to be moved to B, other keys will not be affected.
For load balancing, as we discussed in the beginning, the real data is essentially randomly
distributed and thus may not be uniform. This may make the keys on the caches unbalanced.
To handle this issue, we add "virtual replicas" for caches. Instead of mapping each cache to
a single point on the ring, we map it to multiple points on the ring, i.e., replicas. This way,
each cache is associated with multiple portions of the ring.
If the hash function "mixes well," then as the number of replicas increases, the keys will be
more balanced.
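Putting the pieces together, here is a rough Java sketch of a consistent-hash ring with virtual replicas; the replica count and the MD5-based hash are arbitrary illustrative choices, not prescribed by the technique.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int replicas;

    public ConsistentHashRing(int replicas) {
        this.replicas = replicas;  // number of virtual replicas per server, e.g. 100
    }

    public void addServer(String server) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(server + "#" + i), server);   // place each virtual replica on the ring
        }
    }

    public void removeServer(String server) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(server + "#" + i));        // only this server's keys move elsewhere
        }
    }

    public String serverFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no servers");
        // Walk clockwise: the first ring position at or after the key's hash owns the key.
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            // Use the first four bytes as an unsigned 32-bit ring position.
            return ((long) (d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16) | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}

With this structure, adding or removing a server only touches that server's replica positions on the ring, so only roughly k/n keys move, as described above.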
Ajax Polling
Polling is a standard technique used by the vast majority of AJAX applications. The basic idea
is that the client repeatedly polls (or requests) a server for data. The client makes a request
and waits for the server to respond with data. If no data is available, an empty response is
returned.
1. Client opens a connection and requests data from the server using regular HTTP.
2. The requested webpage sends requests to the server at regular intervals (e.g., 0.5
seconds).
3. The server calculates the response and sends it back, just like regular HTTP traffic.
4. Client repeats the above three steps periodically to get updates from the server.
The problem with polling is that the client has to keep asking the server for new data. As a
result, a lot of responses are empty, creating HTTP overhead.
HTTP Long-Polling
A variation of the traditional polling technique that allows the server to push information to
a client, whenever the data is available. With Long-Polling, the client requests information
from the server exactly as in normal polling, but with the expectation that the server may
not respond immediately. That’s why this technique is sometimes referred to as a “Hanging
GET”.
If the server does not have any data available for the client, instead of sending an empty
response, the server holds the request and waits until some data becomes available.
Once the data becomes available, a full response is sent to the client. The client then
immediately re-requests information from the server so that the server will almost always
have an available waiting request that it can use to deliver data in response to an event.
1. The client makes an initial request using regular HTTP and then waits for a response.
2. The server delays its response until an update is available, or until a timeout has
occurred.
3. When an update is available, the server sends a full response to the client.
4. The client typically sends a new long-poll request, either immediately upon receiving
a response or after a pause to allow an acceptable latency period.
5. Each Long-Poll request has a timeout. The client has to reconnect periodically after
the connection is closed, due to timeouts.
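A minimal Java sketch of the client side of long polling, assuming a placeholder status URL: the server is expected to hold each request until data is available or the timeout fires, after which the client immediately reconnects.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class LongPollingClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        URI uri = URI.create("https://example.com/reservation-status?session_id=abc123"); // placeholder URL

        while (true) {
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .timeout(Duration.ofSeconds(30))     // each long-poll request has a timeout
                    .GET()
                    .build();
            try {
                // The server holds this request open until an update is available or it times out.
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("update: " + response.body());
            } catch (HttpTimeoutException e) {
                // No update before the timeout; simply reconnect and poll again.
            }
            // Immediately (or after a short pause) issue the next long-poll request.
        }
    }
}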
WebSockets
WebSocket provides full-duplex communication channels over a single TCP connection. It
provides a persistent connection between a client and a server that both parties can use to
start sending data at any time. The client establishes a WebSocket connection through a
process known as the WebSocket handshake. If the process succeeds, then the server and
client can exchange data in both directions at any time. The WebSocket protocol enables
communication between a client and a server with lower overheads, facilitating real-time
data transfer from and to the server. This is made possible by providing a standardized way
for the server to send content to the browser without being asked by the client, and
allowing for messages to be passed back and forth while keeping the connection open. In
this way, a two-way (bi-directional) ongoing conversation can take place between a client
and a server.
Server-Sent Events (SSEs)
Under SSEs, the client establishes a persistent and long-term connection with the server. The
server uses this connection to send data to a client. If the client wants to send data to the
server, it would require the use of another technology/protocol to do so.
SSEs are best when we need real-time traffic from the server to the client or if the server is
generating data in a loop and will be sending multiple events to the client.
Scalability
Scalability is the capability of a system, process or a network to grow and manage increased
demand. Any distributed system that can continuously evolve in order to support the
growing amount of work is considered to be scalable.
A system may have to scale because of many reasons like increased data volume or
increased amount of work, e.g., number of transactions. A scalable system would like to
achieve this scaling without performance loss.
Horizontal vs. Vertical Scaling: Horizontal scaling means that you scale by adding more
servers into your pool of resources whereas Vertical scaling means that you scale by adding
more power (CPU, RAM, Storage, etc.) to an existing server.
With horizontal-scaling it is often easier to scale dynamically by adding more machines into
the existing pool; Vertical-scaling is usually limited to the capacity of a single server, scaling
beyond that capacity often involves downtime and comes with an upper limit.
Good examples of horizontal scaling are Cassandra and MongoDB, as they both provide an
easy way to scale horizontally by adding more machines to meet growing needs. Similarly, a
good example of vertical scaling is MySQL, as it allows for an easy way to scale vertically by
switching from smaller to bigger machines. However, this process often involves downtime.
Reliability
By definition, reliability is the probability that a system will perform its intended function
without failure over a given period. In simple terms, a distributed system is considered
reliable if it keeps delivering its services even when one or several of its software or
hardware components fail. Reliability represents one of the main characteristics of any
distributed system, since in such systems any failing machine can always be replaced by
another healthy one, ensuring the completion of the requested task.
Take the example of a large electronic commerce store (like Amazon), where one of the
primary requirements is that any user transaction should never be canceled due to a failure
of the machine that is running that transaction. For instance, if a user has added an item to
their shopping cart, the system is expected not to lose it. A reliable distributed system
achieves this through redundancy of both the software components and data. If the server
carrying the user’s shopping cart fails, another server that has the exact replica of the
shopping cart should replace it.
Obviously, redundancy has a cost, and a reliable system has to pay that to achieve such
resilience for services by eliminating every single point of failure.
Availability
By definition, availability is the time a system remains operational to perform its required
function, in a specific period. It is a simple measure of the percentage of time that a system,
service, or a machine remains operational under normal conditions. An aircraft that can be
flown for many hours a month without much downtime can be said to have a high
availability. Availability takes into account maintainability, repair time, spares availability,
and other logistics considerations. If an aircraft is down for maintenance, it is considered not
available during that time.
Reliability is availability over time considering the full range of possible real-world conditions
that can occur. An aircraft that can make it through any possible weather safely is more
reliable than one that has vulnerabilities to possible conditions.
Efficiency
Two cost units are generally used to measure the efficiency of a distributed operation:
The number of messages globally sent by the nodes of the system, regardless of the
message size.
The size of messages, representing the volume of data exchanged.
The complexity of operations supported by distributed data structures (e.g., searching for a
specific key in a distributed index) can be characterized as a function of one of these cost
units. Generally speaking, the analysis of a distributed structure in terms of ‘number of
messages’ is over-simplistic. It ignores the impact of many aspects, including the network
topology, the network load and its variation, the possible heterogeneity of the software and
hardware components involved in data processing and routing, etc. However, it is quite
difficult to develop a precise cost model that would accurately take into account all these
performance factors; therefore, we have to live with rough but robust estimates of the system
behavior.
Serviceability or Manageability
Early detection of faults can decrease or avoid system downtime. For example, some
enterprise systems can automatically call a service center (without human intervention)
when the system experiences a system fault.
In system design interviews, candidates are required to show their ability to develop a high-
level architecture of a large system. Designing software systems is a very broad topic, and
even a software engineer with years of experience at a top software company may not
claim to be an expert on system design. During such interviews, candidates have 30 to 40
minutes to answer questions like "How would you design a cloud storage system like
Dropbox?" or "How would you design a search engine?" In real life, companies spend not
weeks but months and hire a big team of software engineers to build such systems. Given
this, how can a person answer such a question in 40 minutes? Moreover, there is no set
pattern for such questions. Questions are flexible, unpredictable, usually open-ended, and
have no single correct answer.
These days, companies are less bothered about your pedigree, where you studied, or where
you come from, and far more concerned about what you can do on the job. For them, the
most important thing is your thought process and your mindset for looking into and
handling problems. On these counts, candidates are generally scared of design interviews.
But despite all this, I believe there is no need to be scared. You need to understand what the
companies want to learn about you during these 40 minutes, which is basically "your
approach and strategy to handle a problem" and how organized, disciplined, systematic, and
professional you are at solving it: your capacity to analyze an issue and your level of
professional mechanics in solving it step by step.
In short, the key is to understand the system design interview from the interviewer's
perspective. During the whole process, it is your discussion with the interviewer that is of
core importance.
There is no strictly defined process for the system design interview. Moreover, there are so
many things inherently unclear about large systems that, without clarifying at least a few of
them at the beginning, it would be impossible to work toward a solution. Any candidate who
does not realize this fact will fall into the trap of quickly jumping into finding a solution.
Any candidate who does not have experience in building systems might think such questions
grossly unfair. On top of that, there isn't one correct answer to such questions. The way you
answer the question says a lot about your professional competence and background
experience, and that is what the interviewer will evaluate you on.
Since the questions are intentionally weakly defined, jumping onto designing the solution
immediately, without fully understanding it, is liable to get you in trouble. Spend a few minutes
questioning the interviewer to comprehend the full scope of the system. Never assume
things that are not explicitly stated. For instance, the “URL shortening service” could be
serving just a few thousand users, but each could be sharing millions of URLs. It could also
mean handling millions of clicks on the shortened URLs, or just a few dozen. The service
may also require providing extensive statistics about each shortened URL (which will
increase our data size), or statistics may not be a requirement at all. Therefore, don’t forget
to make sure you gather all the requirements as the interviewer will not be listing them for
you.
The main difference between design interviews and the rest is that you are not given the full
detail of the problem; rather, you are required to explore the breadth and depth of a vaguely
defined problem. You are supposed to tease out the details and figure out the issue by
asking probing questions. Your questions for clarifying the problem reflect your evaluating
ability and competence, which would be an asset to the company.
In design and architecture interviews, the problems presented are quite significant. They
definitely cannot be solved in 40 minutes, which implies that the objective is to test the
technical depth and diversity the interviewee demonstrates during the interview. That also
speaks strongly to your would-be 'level' in the company. Your level in the company should
come from your analytical ability to sort out the problem, besides your ability to work in a
team (the behavioral and background side of the interview) and your capacity to perform
as a strong technical leader. In a nutshell, the basic idea of hiring at a level is to gauge a
person's ability to contribute value to the company's wants and needs. For that, you must
exhibit your strengths by showing reasonable technical breadth.
Try to learn from existing systems: how have they been designed? Another critical point to
keep in mind is that the interviewer expects the candidate's analytical ability and questioning
of the problem to be comparable to their experience. If you have a few years of software
development experience, you are expected to have certain knowledge and should avoid
asking basic questions that might have been appropriate coming from a fresh graduate. For
that, you should prepare sufficiently ahead of time. Try to go through real projects and
practices well in advance of the interview, as most questions are based on real-life products,
issues, and challenges.
Leading the conversation: It is not the final solution to the problem but the discussion process itself that matters in the interview, and it is the candidate who should lead the conversation, going both broad and deep into the components of the problem. Hence, take the interviewer along with you during the course of solving the problem, communicating with him or her step by step as you move along.
Solving by breaking down: Design questions might look complicated and intimidating at first, but whatever the complexity of the problem, a top-down, modular approach can help a lot in solving it. Break the problem into modules and then tackle each of them independently. Each component can then be explained as a sub-problem by reducing it to the level of a known algorithm. This strategy will not only make the design much clearer to you and your interviewer but also make evaluation much easier for the interviewer. While doing so, keep in mind that most problems presented in high-level design interviews do not have a single correct solution. The most important thing is how you make progress tackling the problem and the strategies you adopt.
Dealing with the bottlenecks: Working on the solution, you might confront some bottlenecks. This is very normal. While resolving them, your system might require a load balancer with many machines behind it to handle user requests, or the data might be so large that you need to distribute your database across multiple servers. It is also possible that the interviewer wants to take the interview in a particular direction; if so, move in that direction and go deep, leaving everything else aside. If you feel stuck somewhere, ask for a hint so that you can keep going. Keep in mind that every solution is a trade-off: changing one thing may worsen something else. What matters here is your ability to talk about these trade-offs and to measure their impact on the system, keeping all the constraints and use cases in mind. After finishing your high-level design and making sure that the interviewer is okay with it, you can make it more detailed. Usually, that means making your system scale.
3. Summary
Solving system design questions can be broken down into three steps:
Scoping the problem: Don't make assumptions; ask clarifying questions to understand the constraints and use cases.
Sketching up an abstract design: illustrating the building blocks of the system and the relationships between them.
Identifying and addressing the bottlenecks: using the fundamental principles of scalable system design.
4. Conclusion
Design interviews present formidable, open-ended problems that cannot be fully solved in the allotted time. Therefore, you should try to understand what your interviewer intends to focus on and spend sufficient time on it. Be aware that the discussion of a system design problem can go in different directions depending on the preferences of the interviewer. The interviewer might want to see how you create a high-level architecture covering all aspects of the system, or they could be interested in specific areas and diving deep into them. This means that you must handle the situation strategically, as even good candidates can fail the interview, not because they lack the knowledge, but because they lack the ability to focus on the right things while discussing the problem.
If you have no idea how to solve these kinds of problems, you can familiarize yourself with the typical patterns of system design by reading widely from engineering blogs and watching videos of tech talks from conferences. It is also advisable to arrange discussions, and even mock interviews, with experienced engineers at big tech companies.
Remember, there is no ONE right answer to the question, because any system can be built in different ways. What will be evaluated is your ability to reason about ideas and inputs.
Many systems design questions are intentionally left very vague and are literally given in the
form of Design Foobar. It’s your job to ask clarifying questions to better understand the
system that you have to build.
We’ve laid out some of these questions below; their answers should give you some
guidance on the problem. Before looking at them, we encourage you to take a few minutes to
think about what questions you’d ask in a real interview.
Question 1
Q: Are we designing the entire AlgoExpert platform or just a specific part of it, like the
coding workspace?
A: Since we only have about 45 minutes, you should just design the core user flow of the
AlgoExpert platform. The core user flow includes users landing on the home page of the
website, going to the questions list, marking questions as complete or in progress, and then
writing and running code in various languages for each question. Don't worry about payments or authentication; you can just assume that you have these services working already (by the way, we mainly rely on third-party services here, like Stripe, PayPal,
and OAuth2).
Question 2
Q: AlgoExpert doesn’t seem like a system of utmost criticality (like a hospital system or
airplane software); are we okay with 2 to 3 nines of availability for the system?
A: Yes, this seems fine–no need to focus too much on making the system highly available.
Question 3
Q: How many customers should we be building this for? Is AlgoExpert’s audience global or
limited to one country?
A: AlgoExpert’s website receives hundreds of thousands of users every month, and tens of
thousands of users may be on the website at any point in time. We want the website to feel
very responsive to people everywhere in the world, and the U.S. and India are the
platform’s top 2 markets that we especially want to cater to.
Question 4
Q: Does AlgoExpert make changes to its content (questions list and question solutions)
often?
A: Yes–every couple of days on average. And we like to have our changes reflected in
production globally within the hour.
Question 5
Q: How much of the code-execution engine behind the coding workspace should we be
designing? Do we have to worry about the security aspect of running random user code on
our servers?
A: You can disregard the security aspects of the code-execution engine and just focus on its
core functionality–the ability to run code in various languages at any given time with
acceptable latency.
Question 6
Q: While we’ll care about latency across the entire system, the code-execution engine
seems like the place where we’ll care about it most, since it’s very interactive, and it also
seems like the toughest part of our system to support low latencies; are we okay with
anywhere between 1 and 3 seconds for the average run-code latency?
Many systems design questions are intentionally left very vague and are literally given in the
form of Design Foobar. It’s your job to ask clarifying questions to better understand the
system that you have to build.
We’ve laid out some of these questions below; their answers should give you some
guidance on the problem. Before looking at them, we encourage you to take a few minutes to
think about what questions you’d ask in a real interview.
Question 1
A: We want to design a system that takes code, builds it into a binary (an opaque blob of
data–the compiled code), and deploys the result globally in an efficient and scalable way.
We don’t need to worry about testing code; let’s assume that’s already covered.
Question 2
Q: What part of the software-development lifecycle, so to speak, are we designing this for?
Is this process of building and deploying code happening when code is being submitted for
code review, when code is being merged into a codebase, or when code is being shipped?
A: Once code is merged into the trunk or master branch of a central code repository, engineers
should be able to (through a UI, which we’re not designing) trigger a build and deploy that
build. At that point, the code has already been reviewed and is ready to ship. So to clarify,
we’re not designing the system that handles code being submitted for review or being
merged into a master branch–just the system that takes merged code, builds it, and deploys
it.
Question 3
Q: Are we essentially trying to ship code to production by sending it to, presumably, all of
our application servers around the world?
A: Yes, exactly.
Question 4
Q: How many machines are we deploying to? Are they located all over the world?
Question 5
Q: This sounds like an internal system. Is there any sense of urgency in deploying this code?
Can we afford failures in the deployment process? How fast do we want a single
deployment to take?
A: This is an internal system, but we’ll want to have decent availability, because many
outages are resolved by rolling forward or rolling back buggy code, so this part of the
infrastructure may be necessary to avoid certain terrible situations. In terms of failure
tolerance, any build should eventually reach a SUCCESS or FAILURE state. Once a binary has
been successfully built, it should be shippable to all machines globally within 30 minutes.
Question 6
Q: So it sounds like we want our system to be available, but not necessarily highly available,
we want a clear end-state for builds, and we want the entire process of building and
deploying code to take roughly 30 minutes. Is that correct?
Question 7
Q: How often will we be building and deploying code, how long does it take to build code,
and how big can the binaries that we’ll be deploying get?
Question 8
Q: When building code, how do we have access to the actual code? Is there some sort of
reference that we can use to grab code to build?
A: Yes; you can assume that you’ll be building code from commits that have been merged
into a master branch. These commits have SHA identifiers (effectively arbitrary strings) that
you can use to download the code that needs to be built.
Design A Stockbroker
Design a stockbroker: a platform that acts as the intermediary between end-customers and
some central stock exchange.
Many systems design questions are intentionally left very vague and are literally given in the
form of Design Foobar. It’s your job to ask clarifying questions to better understand the
system that you have to build.
We’ve laid out some of these questions below; their answers should give you some
guidance on the problem. Before looking at them, we encourage you to take a few minutes to
think about what questions you’d ask in a real interview.
Question 1
Q: What do we mean exactly by a stock broker? Is this something like Robinhood or Etrade?
A: Yes, exactly.
Question 2
Q: What is the platform supposed to support exactly? Are we just supporting the ability for
customers to buy and sell stocks, or are we supporting more? For instance, are we allowing
other types of securities like options and futures to be traded on our platform? Are we
supporting special types of orders like limit orders and stop losses?
A: We’re only supporting market orders on stocks in this design. A market order means that,
given a placed order to buy or sell a stock, we should try to execute the order as soon as
possible regardless of the stock price. We also aren’t designing any “margin” system, so the
available balance is the source of truth for what can be bought.
Question 3
Q: Are we designing any of the auxiliary aspects of the stock brokerage, like depositing and
withdrawing funds, downloading tax documents, etc.?
Question 4
Q: Are we just designing the system to place trades? Do we want to support other trade-
related operations like getting trade statuses? In other words, how comprehensive should
the API that’s going to support this platform be?
A: In essence, you’re only designing a system around a PlaceTrade API call from the user, but
you should define that API call (inputs, response, etc.).
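Since the interviewer explicitly asks for the PlaceTrade call to be defined, below is a minimal sketch (in TypeScript style) of what that definition could look like. All type and field names here are illustrative assumptions, not part of the prompt.

// Illustrative sketch only; names and fields are assumptions.
type TradeType = "BUY" | "SELL";
type TradeStatus = "PLACED" | "FILLED" | "REJECTED";

interface PlaceTradeRequest {
  customerId: string;   // the authenticated customer placing the trade
  stockTicker: string;  // e.g., "AAPL"
  type: TradeType;
  quantity: number;     // number of shares; market order, so no price field
}

interface PlaceTradeResponse {
  tradeId: string;      // unique identifier the client can reference later
  status: TradeStatus;  // "PLACED" on success; the terminal status arrives later
  reason?: string;      // populated on rejection, e.g., insufficient balance
}

Because only market orders on stocks are supported, the request carries no price; the response simply acknowledges that the trade was placed, and the terminal FILLED or REJECTED status is determined later by the exchange.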
Question 5
Q: Where does a customer’s balance live? Is the platform pulling a customer’s money
directly from their bank account, or are we expecting that customers will have already
deposited funds into the platform somehow? In other words, are we ever directly
interacting with banks?
A: No, you won’t be interacting with banks. You can assume that customers have already
deposited funds into the platform, and you can further assume that you have a SQL table
with the balance for each customer who wants to make a trade.
Question 6
Q: How many customers are we building this for? And is our customer-base a global one?
A: Millions of customers, millions of trades a day. Let’s assume that our customers are only
located in 1 region – the U.S., for instance.
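As a rough back-of-the-envelope check (assuming about 5 million trades per day, since the answer only says "millions"), the average write load is modest:

// Assumed figure: 5 million trades/day ("millions" in the prompt).
const tradesPerDay = 5_000_000;
const secondsPerDay = 24 * 60 * 60;                       // 86,400
const avgTradesPerSecond = tradesPerDay / secondsPerDay;  // roughly 58 trades/second on average

Real traffic would likely spike well above this average around market open and close.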
Question 7
A: As high as possible; with this kind of service, people can lose a lot of money if the system
is down even for a few minutes.
Question 8
Q: Are we also designing the UI for this platform? What kinds of clients can we assume we
have to support?
A: You don’t have to design the UI, but you should design the PlaceTrade API call that a UI
would be making to your backend. Clients would be either a mobile app or a webapp.
Question 9
Q: So we want to design the API for the actual brokerage, that itself interacts with some
central stock exchange on behalf of customers. Does this exchange have an API? If yes, do
we know what it looks like, and do we have any guarantees about it?
A: Yes, the exchange has an API, and your platform’s API (the PlaceTrade call) will have to
interact with the exchange’s API. As far as that’s concerned, you can assume that the call to
the exchange to make an actual trade will take in a callback (in addition to the info about
the trade) that will get executed when that trade completes at the exchange level (meaning,
when the trade either gets FILLED or REJECTED, this callback will be executed). You can also
assume that the exchange’s system is highly available–your callback will always get
executed at least once.
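Since the callback is only guaranteed to be executed at least once, it may run more than once for the same trade, so the handler on our side should be idempotent. A hedged sketch of how this interaction could be modeled follows; the names and shapes are assumptions, not the exchange's actual API.

// Assumed shape of the exchange's API: trade info plus a completion callback.
type ExchangeTradeStatus = "FILLED" | "REJECTED";

interface ExchangeClient {
  executeTrade(
    trade: { tradeId: string; stockTicker: string; type: "BUY" | "SELL"; quantity: number },
    onComplete: (tradeId: string, status: ExchangeTradeStatus) => void
  ): void;
}

// Our callback: delivered at least once, so it must be safe to run repeatedly.
function handleTradeCompletion(tradeId: string, status: ExchangeTradeStatus): void {
  // Conditional update against the trades table (illustrative SQL), e.g.:
  //   UPDATE trades SET status = :status WHERE id = :tradeId AND status = 'PLACED';
  // A repeated callback matches zero rows and is effectively a no-op.
}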
We’ve laid out some of these questions below; their answers should give you some
guidance on the problem. Before looking at them, we encourage you to take a few minutes to
think about what questions you’d ask in a real interview.
Question 1
Q: Facebook News Feed consists of multiple major features, like loading a user’s news feed,
interacting with it (i.e., posting status updates, liking posts, etc.), and updating it in real time
(i.e., adding new status updates that are being posted to the top of the feed, in real time).
What part of Facebook News Feed are we designing exactly?
A: We’re designing the core functionality of the feed itself, which we’ll define as follows:
loading a user’s news feed and updating it in real time, as well as posting status updates. But
for posting status updates, we don’t need to worry about the actual API or the type of
information that a user can post; we just want to design what happens once an API call to
post a status update has been made. Ultimately, we primarily want to design the feed
generation/refreshing piece of the data pipeline (i.e., how/when does it get constructed, and
how/when does it get updated with new posts).
Question 2
Q: To clarify, posts on Facebook can be pretty complicated, with pictures, videos, special
types of status updates, etc… Are you saying that we’re not concerned with this aspect of
the system? For example, should we not focus on how we’ll be storing this type of
information?
A: That’s correct. For the purpose of this question, we can treat posts as opaque entities
that we’ll certainly want to store, but without worrying about the details of the storage, the
ramifications of storing and serving large files like videos, etc…
Question 3
Q: Are we designing the relevant-post curation system (i.e., the system that decides what
posts will show up on a user’s news feed)?
A: No. We’re not designing this system or any ranking algorithms; you can assume that you
have access to a ranking algorithm that you can simply feed a list of relevant posts to in
order to generate an actual news feed to display.
Question 4
Q: Are we concerned with showing ads in a user’s news feed at all? Ads seem like they
would behave a little bit differently than posts, since they probably rely on a different
ranking algorithm.
A: You can treat ads as a bonus part of the design; if you find a way to incorporate them in,
great (and yes, you’d have some other ads-serving algorithm to determine what ads need to
be shown to a user at any point in time). But don’t focus on ads to start.
Question 5
A: Yes – we’re serving a global audience, and let’s say that the news feed will be loaded in
the order of 100 million times a day, by 100 million different users, with 1 million new status
updates posted every day.
Question 6
Q: How many friends does a user have on average? This is important to know, since a user’s
status updates could theoretically have to show up on all of the user’s friends’ news feeds at
once.
A: You can expect each user to have, on average, 500 friends on the social network. You can
treat the number of friends per user as a bell-shaped distribution, with some users who
have very few friends, and some users who have a lot more than 500 friends.
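Combining the figures from Questions 5 and 6 gives a rough sense of the load if each new post is fanned out to every friend's feed (a sketch of the arithmetic, using only the numbers stated above):

const feedLoadsPerDay = 100_000_000;    // news-feed loads per day
const newPostsPerDay = 1_000_000;       // status updates posted per day
const avgFriendsPerUser = 500;          // average friend count
const secondsPerDay = 24 * 60 * 60;     // 86,400

const feedReadsPerSecond = feedLoadsPerDay / secondsPerDay;         // ~1,150 reads/second
const fanOutWritesPerDay = newPostsPerDay * avgFriendsPerUser;      // 500,000,000 feed writes/day
const fanOutWritesPerSecond = fanOutWritesPerDay / secondsPerDay;   // ~5,800 writes/second on average

Even as averages, these numbers suggest that the fan-out write path, rather than the read path, is where most of the scaling effort will go.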
Question 7
Q: How quickly does a status update have to appear on a news feed once it’s posted, and is
it okay if this varies depending on user locations with respect to the location of the user
submitting a post?
A: When a user posts something, you probably want it to show up on other news feeds fairly
quickly. This speed can indeed vary depending on user locations. For instance, we’d
probably want a local friend within the same region to see the new post within a few
seconds, but we’d likely be okay with a user on the other side of the world seeing the same
post within a minute.
Question 8
A: Your design shouldn’t be completely unavailable from a single machine failure, but this
isn’t a high availability requirement. However, posts shouldn’t ever just disappear. Once the
user’s client gets confirmation that the post was created, you cannot lose it.
We’ve laid out some of these questions below; their answers should give you some
guidance on the problem. Before looking at them, we encourage you to take a few minutes to
think about what questions you’d ask in a real interview.
Question 1
Q: Are we just designing the storage aspect of Google Drive, or are we also designing some
of the related products like Google Docs, Sheets, Slides, Drawings, etc.?
A: We’re just designing the core Google Drive product, which is indeed the storage product.
In other words, users can create folders and upload files, which effectively stores them in
the cloud. Also, for simplicity, we can refer to folders and files as “entities”.
Question 2
Q: There are a lot of features on Google Drive, like shared company drives vs. personal
drives, permissions on entities (ACLs), starred files, recently-accessed files, etc… Are we
designing all of these features or just some of them?
A: Let’s keep things narrow and imagine that we’re designing a personal Google Drive (so
you can forget about shared company drives). In a personal Google Drive, users can store
entities, and that’s all that you should take care of. Ignore any feature that isn’t core to the
storage aspect of Google Drive; ignore things like starred files, recently-accessed files, etc…
You can even ignore sharing entities for this design.
Question 3
Q: Since we’re primarily concerned with storing entities, are we supporting all basic CRUD
operations like creating, deleting, renaming, and moving entities?
A: Yes, but to clarify, creating a file is actually uploading a file, folders have to be created
(they can’t be uploaded), and we also want to support downloading files.
Question 4
Q: Are we just designing the Google Drive web application, or are we also designing a
desktop client for Google Drive?
A: We’re just designing the functionality of the Google Drive web application.
Question 5
Q: Since we’re not dealing with sharing entities, should we handle multiple users in a single
folder at the same time, or can we assume that this will never happen?
A: While we’re not designing the sharing feature, let’s still handle what would happen if
multiple clients were in a single folder at the same time (two tabs from the same browser,
for example). In this case, we would want changes made in that folder to be reflected to all
clients within 10 seconds. But for the purpose of this question, let’s not worry about
conflicts or anything like that (i.e., assume that two clients won’t make changes to the same
file or folder at the same time).
Question 6
A: This system should serve about a billion users and handle 15GB per user on average.
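A quick estimate of the raw storage implied by this answer (logical data only, before any replication or redundancy):

const users = 1_000_000_000;        // about a billion users
const gbPerUser = 15;               // average storage per user, in GB
const totalGb = users * gbPerUser;  // 15,000,000,000 GB, i.e., about 15 exabytes

Fifteen exabytes of logical data, before replication, makes it clear that file contents will have to be spread across a very large fleet of storage machines.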
Question 7
Q: What kind of reliability or guarantees does this Google Drive service give to its users?
A: First and foremost, data loss isn’t tolerated at all; we need to make sure that once a file is
uploaded or a folder is created, it won’t disappear until the user deletes it. As for
availability, we need this system to be highly available.
Both of these entities likely have other fields, but for the purpose of this question, those
other fields aren’t needed.
Many systems design questions are intentionally left very vague and are literally given in the
form of Design Foobar. It’s your job to ask clarifying questions to better understand the
system that you have to build.
We’ve laid out some of these questions below; their answers should give you some
guidance on the problem. Before looking at them, we encourage you to take a few minutes to
think about what questions you’d ask in a real interview.
Question 1
Q: To make sure that we're on the same page: a subreddit is an online community where
users can write posts, comment on posts, upvote / downvote posts, share posts, report
posts, become moderators, etc…–is this correct, and are we designing all of this
functionality?
A: Yes, that’s correct, but let’s keep things simple and focus only on writing posts, writing
comments, and upvoting / downvoting. You can forget about all of the auxiliary features like
sharing, reporting, moderating, etc…
Question 2
Q: So we’re really focusing on the very narrow but core aspect of a subreddit: writing posts,
commenting on them, and voting on them.
A: Yes.
Question 3
Q: I’m thinking of defining the schemas for the main entities that live within a subreddit and
then defining their CRUD operations – methods like Create/Get/Edit/Delete/List – is this in
line with what you’re asking me to do?
A: Yes, and make sure to include method signatures – what each method takes in and what
each method returns. Also include the types of each argument.
Question 4
Q: The entities that I’ve identified are Posts, Comments, and Votes (upvotes and
downvotes). Does this seem accurate?
A: Yes. These are the 3 core entities that you should be defining and whose APIs you’re
designing.
Question 5
A: Yes, you should also allow people to award posts. Awards are a special currency that can
be bought for real money and gifted to comments and posts. Users can buy some quantity
of awards in exchange for real money, and they can give awards to posts and comments
(one award per post / comment).
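Based on the entities confirmed in Questions 4 and 5, here is a minimal sketch of what the schemas and a few of the method signatures could look like. Every field and method name below is an illustrative assumption, not the single expected answer.

interface Post {
  id: string;
  creatorId: string;
  title: string;
  description: string;
  createdAt: number;     // epoch milliseconds
  votesCount: number;
  commentsCount: number;
  awardsCount: number;
  deletedAt?: number;    // set when the post is deleted
}

interface Comment {
  id: string;
  creatorId: string;
  postId: string;
  content: string;
  createdAt: number;
  votesCount: number;
  awardsCount: number;
  deletedAt?: number;
}

interface Vote {
  id: string;
  creatorId: string;
  targetId: string;      // id of the post or comment being voted on
  type: "UP" | "DOWN";
  createdAt: number;
}

// Example signatures; Get/Edit/Delete follow the same pattern as Create/List.
declare function createPost(creatorId: string, title: string, description: string): Post;
declare function listPosts(pageSize: number, pageToken?: string): { posts: Post[]; nextPageToken?: string };
declare function createComment(creatorId: string, postId: string, content: string): Comment;
declare function createVote(creatorId: string, targetId: string, type: "UP" | "DOWN"): Vote;
declare function buyAwards(userId: string, quantity: number): void;
declare function giveAward(giverId: string, targetId: string): void;  // one award per post or comment

Pagination on the List methods (a pageSize plus an opaque pageToken) is one reasonable way to keep responses bounded, but other choices would be equally defensible.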