System Design for Data Engineering
Core Concepts
Shwetank Singh
GritSetGrow - GSGLearn.com
DATA ENGINEERING - SYSTEM DESIGN
System design is the process of defining
the architecture, components, modules,
interfaces, and data for a system to
satisfy specified requirements. It involves
both high-level architecture and
detailed design.
Scalability is the ability of a system to
handle increased load without
compromising performance by adding
resources.
To approach a system design problem, define the
scope, outline a high-level architecture,
dive into detailed design for key
components, and address potential
bottlenecks and trade-offs.
Horizontal scaling (scaling out)
involves adding more machines to a
system, while vertical scaling (scaling
up) involves adding more power (CPU,
RAM) to an existing machine.
easier load distribution, and avoids the
limitations of a single machine.
Load balancing is the process of distributing
network traffic across multiple servers to
ensure no single server becomes
overwhelmed.
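One simple way to spread traffic is round-robin rotation. The sketch below is only an illustration under that assumption; the class name `RoundRobinBalancer` and the server names are hypothetical, and real load balancers usually also weigh server health and current load.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation so requests spread evenly."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(list(servers))

    def next_server(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical server names
assignments = [balancer.next_server() for _ in range(6)]
print(assignments)
```

Each of the three servers receives two of the six requests, so no single server takes the whole load.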
If one server fails, the load balancer can
redirect traffic to other available servers,
thus maintaining service availability.
WHAT IS CACHING?
Caching is the process of storing
frequently accessed data in a
temporary storage location to speed up
subsequent data retrievals.
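A minimal cache-aside sketch of this idea, assuming an in-memory dictionary as the temporary storage. `load_from_database` is a hypothetical stand-in for a slow lookup; no real database is involved.

```python
backend_calls = []  # records every trip to the slow backend
cache = {}          # temporary storage for results already fetched


def load_from_database(user_id):
    """Hypothetical slow lookup; stands in for a real database query."""
    backend_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id):
    # Serve from the cache when possible; otherwise fetch and remember.
    if user_id in cache:
        return cache[user_id]
    record = load_from_database(user_id)
    cache[user_id] = record
    return record


first = get_user(42)   # cache miss: goes to the backend
second = get_user(42)  # cache hit: served from memory
print(len(backend_calls))
```

The second retrieval never reaches the backend, which is where the speed-up comes from. A production cache would also need an eviction or expiry policy, which this sketch omits.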
caches, and distributed caches.
A content delivery network (CDN) is a
network of servers distributed
geographically to deliver static content
to users from the nearest server location,
reducing latency.
A CDN reduces the distance between the user
and the content, decreasing load times
and reducing bandwidth usage.
Database replication is the process of
copying data from one database server
(master) to another (slave) to ensure
high availability and fault tolerance.
high availability, and provides data
redundancy.
WHAT IS SHARDING?
Sharding is a database partitioning
technique that divides a large database
into smaller, more manageable pieces,
called shards, which can be distributed
across multiple servers.
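A common way to decide which shard holds a record is to hash its key. The sketch below assumes hash-based routing with a fixed shard list; the shard names, the `shard_for` helper, and the customer IDs are all hypothetical.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Map a key to a shard using a stable hash, so routing is repeatable."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


placement = {}
for customer_id in ("cust-1001", "cust-1002", "cust-1003", "cust-1004"):
    placement.setdefault(shard_for(customer_id), []).append(customer_id)

print(placement)
```

Because the mapping depends on `len(shards)`, changing the number of shards moves most keys, which is one reason re-sharding is costly with this simple scheme.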
complex queries across shards, and re-
sharding data when a shard grows too
large.
A message queue is a mechanism
for communication between processes,
allowing them to send and receive
messages asynchronously.
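The asynchronous hand-off can be sketched inside one process with Python's standard `queue` and `threading` modules. This is an illustration only: a real message queue would typically be a separate broker, and the event names here are made up.

```python
import queue
import threading

messages = queue.Queue()
processed = []


def consumer():
    """Pull messages off the queue until the sentinel (None) arrives."""
    while True:
        item = messages.get()
        if item is None:
            messages.task_done()
            break
        processed.append(item.upper())
        messages.task_done()


worker = threading.Thread(target=consumer)
worker.start()

# The producer enqueues and moves on; it does not wait for processing.
for event in ("order_created", "payment_received", "order_shipped"):
    messages.put(event)
messages.put(None)  # sentinel: no more work

messages.join()  # block until every queued item has been handled
worker.join()
print(processed)
```

The producer and consumer never call each other directly; the queue is the only thing they share.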
improves system resilience, and
decouples system components.
WHAT IS A MICROSERVICES ARCHITECTURE?
Microservices architecture is an
architectural style that structures an
application as a collection of loosely
coupled services, each with its own
functionality and data storage.
Example: breaking down a monolithic
e-commerce application into individual
services for inventory, payment, and
user management.
easier scaling, and better fault isolation.
WHAT IS A MONOLITHIC ARCHITECTURE?
A monolithic architecture is a single, unified
software application where all
components are interconnected and
interdependent.
individual components, tight coupling,
and challenges in maintaining and
deploying the application.
Eventual consistency is a consistency
model used in distributed systems
where updates to a database will
propagate to all nodes, but not
immediately.
Consistency in distributed systems can be
maintained using consensus
algorithms (e.g., Paxos, Raft), distributed
transactions, and ensuring idempotent
operations.
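The CAP theorem states that a distributed
system can guarantee at most two of
three properties at the same time:
Consistency, Availability, and Partition
tolerance.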
A key-value store is a type of NoSQL
database that stores data as a
collection of key-value pairs.
operations, are easy to scale, and
provide flexible data models.
A document store is a type of NoSQL
database that stores data as
documents, typically in JSON or BSON
format.
A graph database is a database designed to store
and query data modeled as graphs, with
nodes, edges, and properties.
relationships, enable efficient traversal
queries, and provide a flexible schema.
WHAT IS A COLUMN-FAMILY STORE?
A column-family store is a type of NoSQL
database that stores data in columns
rather than rows, optimized for read and
write operations.
horizontal scalability, and are suitable
for time-series data and real-time
analytics.
A distributed file system is a file
system that allows access to files from
multiple hosts, providing redundancy
and fault tolerance.
A data warehouse is a centralized
repository for storing large volumes of
structured and semi-structured data,
optimized for query and analysis.
ETL (Extract, Transform, Load) is a process
that involves extracting data from
various sources, transforming it to fit
operational needs, and loading it into a
target data store.
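The three ETL stages can be sketched with the standard library alone. Everything concrete here is an assumption for illustration: the CSV sample, the `orders` table, and the cleaning rules (drop rows with no amount, upper-case the currency).

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from files, APIs, or databases.
RAW_CSV = """order_id,amount,currency
1,19.99,usd
2,5.00,USD
3,,usd
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount and normalize types and casing."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append((int(row["order_id"]), float(row["amount"]), row["currency"].upper()))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory target store for the demo
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount, currency FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the stages as separate functions makes each one testable on its own and lets a scheduler retry a single stage.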
A data lake is a storage repository that
holds vast amounts of raw data in its
native format until it is needed.
the ability to store diverse data types.
Stream processing is the processing of
data in real-time as it flows from one
source to another, allowing immediate
insights and actions.
data latency, and supports event-driven
architectures.
Batch processing is the processing of
data in large blocks or batches at
scheduled intervals.
Batch processing deals with large
volumes of data processed at intervals,
while stream processing deals with
continuous data processing in real-time.
WHAT IS A LAMBDA ARCHITECTURE?
A lambda architecture is a data-processing
architecture designed to handle massive
quantities of data by taking advantage of
both batch and stream-processing methods.
Example: combining Apache Hadoop for batch
processing and Apache Storm for real-time
processing in a lambda architecture.
A lambda architecture consists of a batch layer
for processing large datasets, a speed
layer for real-time processing, and a
serving layer to merge the results.
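The three layers can be illustrated with a toy page-view counter. This is a sketch under simplifying assumptions: the event lists are invented, both layers run in one process, and merging is a plain sum of counts.

```python
from collections import Counter

historical_events = ["page_a", "page_b", "page_a"]  # hypothetical events already in bulk storage
recent_events = ["page_a", "page_c"]                # hypothetical events not yet batch-processed


def batch_layer(events):
    """Recompute a complete view from the full historical dataset."""
    return Counter(events)


def speed_layer(events):
    """Update counts one event at a time, as events arrive."""
    counts = Counter()
    for event in events:
        counts[event] += 1
    return counts


def serving_layer(batch_view, realtime_view):
    """Merge both views so queries see historical plus recent data."""
    return batch_view + realtime_view


merged = serving_layer(batch_layer(historical_events), speed_layer(recent_events))
print(dict(merged))
```

In this sketch the speed layer only covers what the batch layer has not yet absorbed, so adding the two views does not double-count.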
A kappa architecture simplifies the
lambda architecture by using only
stream processing to handle both real-
time and historical data.
A NoSQL database is a non-relational
database designed to handle large
volumes of unstructured or semi-
structured data with flexible schemas.
Types of NoSQL databases include key-value stores,
document stores, column-family stores,
and graph databases.
WHAT IS A RELATIONAL DATABASE?
A relational database is a type of
database that organizes data into
tables with rows and columns, using SQL
for data management.
transactions, and support complex
queries and relationships.
ACID stands for Atomicity, Consistency,
Isolation, and Durability, which are
properties that ensure reliable database
transactions.
BASE stands for Basically Available, Soft
state, and Eventual consistency, which
are properties of some NoSQL databases
that prioritize availability over strict
consistency.
A distributed system is a system whose
components are located on different
networked computers, which
communicate and coordinate to
achieve a common goal.
fault tolerance, data consistency, and
synchronization.
Fault tolerance is the ability of a system
to continue operating properly in the
event of the failure of some of its
components.
Fault tolerance can be achieved using load
balancers, redundant systems, failover
mechanisms, and data replication.
WHAT IS A CONSENSUS ALGORITHM?
A consensus algorithm is a process in
computer science used to achieve
agreement on a single data value
among distributed processes or
systems.
Raft is a consensus algorithm designed
to be understandable and
implementable, used to manage a
replicated log in distributed systems.
Paxos is a family of protocols for solving
consensus in a network of unreliable or
faulty processors.
Leader election is the process of
designating a single node as the
organizer of some task distributed
among several nodes.
WHAT IS A QUORUM?
A quorum is the minimum number of
votes needed for a distributed
transaction to be committed.
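Many systems use a simple majority as the quorum. The sketch below assumes that rule; the function names are hypothetical, and some systems configure read and write quorums differently.

```python
def majority_quorum(total_nodes):
    """Smallest number of votes that is more than half of the nodes."""
    return total_nodes // 2 + 1


def is_committed(votes, total_nodes):
    """A transaction commits only once the vote count reaches the quorum."""
    return votes >= majority_quorum(total_nodes)


print(majority_quorum(5), is_committed(2, 5), is_committed(3, 5))
```

With a majority rule, any two quorums overlap in at least one node, so two conflicting decisions cannot both gather enough votes.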
Write-ahead logging (WAL) is a technique used in
databases to ensure data integrity by
logging changes before applying them
to the database.
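The log-first ordering can be sketched with a toy in-memory store. This is a deliberately simplified illustration: `TinyStore` is hypothetical, the log is a Python list rather than a durable file, and real databases also flush the log to disk before acknowledging a write.

```python
class TinyStore:
    """Toy key-value store that logs each change before applying it."""

    def __init__(self, log=None):
        self.log = log if log is not None else []
        self.data = {}
        # Recovery: replay the log in order to rebuild the state.
        for key, value in self.log:
            self.data[key] = value

    def set(self, key, value):
        self.log.append((key, value))  # 1. record the change in the log
        self.data[key] = value         # 2. only then apply it to the data


store = TinyStore()
store.set("balance", 100)
store.set("balance", 80)

# Simulate a restart: only the log survives, and the state is rebuilt from it.
recovered = TinyStore(log=list(store.log))
print(recovered.data)
```

Because every change reaches the log first, the state can be reconstructed from the log alone after a crash.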
Strong consistency ensures that all
nodes in a distributed system see the
same data at the same time after a
write operation.
Data partitioning is the process of dividing a
database into smaller, more
manageable pieces, which can be
stored across multiple servers.
A distributed hash table (DHT) is a distributed
system that provides a lookup service
similar to a hash table, where data is
distributed across multiple nodes.
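Key-to-node lookup in such a system is often built on consistent hashing. The sketch below assumes that technique, which the slide itself does not name; `HashRing` and the node names are hypothetical, and each node gets a single point on the ring (no virtual nodes).

```python
import bisect
import hashlib


def _hash(value):
    """Stable 64-bit hash used to place both nodes and keys on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The owner is the first node clockwise from the key's position.
        index = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


keys = [f"key-{i}" for i in range(20)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {key: ring.node_for(key) for key in keys}

# Remove node-c and check which keys changed owner.
smaller = HashRing(["node-a", "node-b"])
after = {key: smaller.node_for(key) for key in keys}
unmoved = all(after[key] == before[key] for key in keys if before[key] != "node-c")
print(unmoved)
```

Only the keys that lived on the removed node change owner, which is what makes this scheme attractive when nodes join and leave.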
A gossip protocol is a method of peer-
to-peer communication where nodes
periodically exchange state information
to achieve eventual consistency.
WHAT IS A LEADER-FOLLOWER PATTERN?
The leader-follower pattern is a
replication strategy where one node
(leader) handles all writes, and other
nodes (followers) replicate the leader's
data.
A distributed lock is a mechanism used to
synchronize access to a shared resource
across multiple nodes in a distributed
system.
WHAT IS A HEARTBEAT IN DISTRIBUTED SYSTEMS?
A heartbeat is a periodic signal sent by a
node to indicate its presence and
operational status to other nodes in a
distributed system.
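A receiver can use heartbeats to decide which nodes are still alive. The sketch below assumes a simple timeout rule and passes timestamps in explicitly so the behavior is easy to follow; `HeartbeatMonitor` and the node names are hypothetical.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per node and applies a timeout rule."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record(self, node, now):
        # Called whenever a heartbeat from `node` arrives at time `now`.
        self.last_seen[node] = now

    def alive_nodes(self, now):
        # A node counts as alive if its last heartbeat is recent enough.
        return sorted(
            node for node, seen in self.last_seen.items() if now - seen <= self.timeout
        )


monitor = HeartbeatMonitor(timeout_seconds=5)
monitor.record("node-a", now=0)
monitor.record("node-b", now=0)
monitor.record("node-a", now=4)  # node-a keeps sending; node-b goes silent
alive = monitor.alive_nodes(now=8)
print(alive)
```

A short timeout detects failures quickly but risks declaring a slow node dead; a long timeout does the opposite.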
Partition tolerance is the ability of a
distributed system to continue
functioning even when network
partitions occur, isolating parts of the
system.
Synchronous replication requires that
data be written to all replicas before
acknowledging the write, while
asynchronous replication allows for
acknowledgement before all replicas
are updated.
Read replicas offload read traffic from
the primary database, improving
performance and availability.
the same across all nodes in a
distributed system after a write
operation.
WHAT IS A ROLLBACK?
A rollback is the process of reverting
changes made to a database during a
transaction if an error occurs, ensuring
data integrity.
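A rollback can be demonstrated with Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are hypothetical; the `CHECK` constraint is there only to force an error partway through the transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts; undo everything if any step fails."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, receiver)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, sender)
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()  # revert the credit that was already applied
        return False


ok = transfer(conn, "alice", "bob", 500)  # alice cannot cover 500
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to the receiver succeeds first, the debit then violates the constraint, and the rollback leaves both balances exactly as they were.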
A distributed cache is a caching
mechanism where data is stored across
multiple nodes to improve scalability
and performance.
A circuit breaker is a design
pattern used to detect failures and
prevent cascading failures in distributed
systems by stopping the flow of requests
to a failing service.
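A minimal sketch of the pattern, assuming the breaker opens after a fixed number of consecutive failures. `CircuitBreaker`, `CircuitOpenError`, and `flaky_service` are hypothetical names, and the half-open state and reset timer that production breakers normally have are left out for brevity.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker refuses to forward a request."""


class CircuitBreaker:
    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.failure_threshold

    def call(self, func, *args):
        if self.is_open:
            # Stop the flow of requests to the failing service.
            raise CircuitOpenError("circuit open; request not sent")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success resets the failure count
        return result


attempts = []  # records every call that actually reached the service


def flaky_service():
    attempts.append(1)
    raise ConnectionError("service unavailable")


breaker = CircuitBreaker(failure_threshold=3)
outcomes = []
for _ in range(5):
    try:
        breaker.call(flaky_service)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

print(outcomes, len(attempts))
```

After three failures the breaker rejects calls locally, so the struggling service sees only three of the five requests.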
The bulkhead pattern isolates
components of a system into separate
pools to prevent a failure in one
component from affecting others.
The fallback pattern provides an
alternative action or data source when a
service call fails, ensuring system
resilience.
WHAT IS IDEMPOTENCY?
Idempotency is the property that performing
an operation multiple times has the
same effect as performing it once, which
is crucial for reliable systems.
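One common way to make an operation idempotent is to attach a unique request ID and remember the result. The sketch below assumes that approach; `PaymentProcessor`, the starting balance, and the request ID are hypothetical.

```python
class PaymentProcessor:
    """Charges an account at most once per request ID."""

    def __init__(self):
        self.balance = 100
        self._results = {}  # request ID -> result of the first attempt

    def charge(self, request_id, amount):
        # A retried request returns the stored result instead of charging again.
        if request_id in self._results:
            return self._results[request_id]
        self.balance -= amount
        self._results[request_id] = self.balance
        return self.balance


processor = PaymentProcessor()
first = processor.charge("req-1", 30)
retry = processor.charge("req-1", 30)  # e.g. the client retried after a timeout
print(first, retry, processor.balance)
```

The retry is safe because the second call has no further effect on the balance, which is what lets clients retry after network failures without double-charging.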
Service discovery is the process of
automatically detecting services and
their endpoints in a distributed system,
facilitating communication between
components.
A proxy server is a server that acts as an intermediary
for requests from clients seeking
resources from other servers, providing
benefits like load balancing, caching,
and security.
A reverse proxy is a type of proxy server
that retrieves resources on behalf of a
client from one or more servers, often
used for load balancing and security.
An edge server is a server located at
the edge of a network that stores
cached content, delivering it to users
from a location closer to them.
WHAT IS A MICROSERVICES GATEWAY?
A microservices gateway is an API
gateway that manages and routes
requests to the appropriate
microservice, often handling
cross-cutting concerns like
authentication and rate limiting.