System Design for Data Engineering

The document provides an overview of system design in data engineering, covering key concepts such as scalable systems, horizontal and vertical scaling, load balancing, caching, and various database types. It discusses the importance of system architecture, the differences between monolithic and microservices architectures, and the benefits of using message queues and distributed systems. Additionally, it explores data processing methods like batch and stream processing, as well as the lambda and kappa architectures.


Data Engineering System Design
Core Concepts

Shwetank Singh
GritSetGrow - GSGLearn.com
DATA ENGINEERING - SYSTEM DESIGN

WHAT IS SYSTEM DESIGN?

System design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. It involves both high-level architecture and detailed design.

Example: Designing a news feed system for a social media application.

WHAT IS A SCALABLE SYSTEM?

A scalable system is one that can handle increased load by adding resources, without compromising performance.

Example: Horizontal scaling by adding more servers to handle increased traffic.

HOW DO YOU APPROACH A SYSTEM DESIGN INTERVIEW?

Understand the requirements, define the scope, outline a high-level architecture, dive into detailed design for key components, and address potential bottlenecks and trade-offs.

Example: Designing a URL shortener: start with the core functionality, then address storage, scalability, and fault tolerance.

WHAT IS THE DIFFERENCE BETWEEN HORIZONTAL AND VERTICAL SCALING?

Horizontal scaling (scaling out) involves adding more machines to a system, while vertical scaling (scaling up) involves adding more power (CPU, RAM) to an existing machine.

Example: Adding more servers to a web application (horizontal) vs. upgrading the server’s hardware (vertical).

WHAT ARE THE ADVANTAGES OF HORIZONTAL SCALING?

It improves fault tolerance, allows for easier load distribution, and avoids the limitations of a single machine.

Example: Distributing web traffic across multiple servers using a load balancer.

WHAT IS A LOAD BALANCER?

A load balancer distributes incoming network traffic across multiple servers to ensure no single server becomes overwhelmed.

Example: Using an AWS Elastic Load Balancer to manage web traffic.

HOW DOES A LOAD BALANCER IMPROVE SYSTEM RELIABILITY?

Because it distributes traffic across several servers, a load balancer can redirect requests to the remaining servers when one of them fails, thus maintaining service availability.

Example: In a web application with three servers, if one fails, the load balancer redirects traffic to the remaining two.
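
The three-server failover scenario can be sketched in a few lines. This is only an illustration: the class name `RoundRobinBalancer`, the server names, and the manual `mark_down` call are assumptions, and a real load balancer would detect failures through health checks rather than being told.

```python
from itertools import count


class RoundRobinBalancer:
    """Round-robin over the servers that are currently marked healthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._counter = count()

    def mark_down(self, server):
        # Hypothetical hook; a real balancer would call this from a health check.
        self.healthy.discard(server)

    def pick(self):
        # Keep the configured order, skipping servers marked down.
        candidates = [s for s in self.servers if s in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy servers available")
        return candidates[next(self._counter) % len(candidates)]


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_three = [lb.pick() for _ in range(3)]    # each server gets one request
lb.mark_down("web-2")                          # simulate a failure
after_failure = [lb.pick() for _ in range(4)]  # only web-1 and web-3 are used
```

After `web-2` is marked down, requests keep flowing to the two remaining servers, which is the availability property described above.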

WHAT IS CACHING?

Caching is a technique to store frequently accessed data in a temporary storage location to speed up subsequent data retrievals.

Example: Using Redis to cache database query results.
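
One common way to cache query results is the cache-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache. The sketch below uses an in-memory dictionary with a time-to-live as a stand-in for Redis; `TTLCache`, `get_user`, and `slow_query` are hypothetical names, not part of any library.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (a stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (self.clock() + self.ttl, value)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: try the cache, query the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.set(key, value)
    return value


db_calls = []

def slow_query(user_id):
    db_calls.append(user_id)  # record every real database hit
    return {"id": user_id, "name": "Ada"}

user_cache = TTLCache(ttl_seconds=60)
get_user(7, user_cache, slow_query)  # miss: goes to the database
get_user(7, user_cache, slow_query)  # hit: served from the cache
```

The expiry time bounds how stale a cached entry can get; choosing it is a trade-off between freshness and database load.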

WHAT ARE THE TYPES OF CACHES?

There are client-side caches, server-side caches, and distributed caches.

Example: Browser cache (client-side), Memcached (server-side), and Redis (distributed).

WHAT IS A CDN (CONTENT DELIVERY NETWORK)?

A CDN is a network of servers distributed geographically to deliver static content to users from the nearest server location, reducing latency.

Example: Using Cloudflare CDN to serve static assets like images and CSS files.

HOW DOES A CDN IMPROVE PERFORMANCE?

A CDN reduces the distance between users and the content, which decreases load times and reduces bandwidth usage.

Example: Serving a video from a server located closer to the user, resulting in faster loading times.

WHAT IS DATABASE REPLICATION?

Database replication is the process of copying data from one database server (the master, also called the primary) to one or more other servers (slaves, also called replicas) to ensure high availability and fault tolerance.

Example: Using MySQL replication to maintain a backup database server.

WHAT ARE THE BENEFITS OF DATABASE REPLICATION?

It improves read performance, ensures high availability, and provides data redundancy.

Example: A master-slave setup where the master handles writes and the slaves handle reads.

WHAT IS SHARDING?

Sharding is a database partitioning technique that divides a large database into smaller, more manageable pieces, called shards, which can be distributed across multiple servers.

Example: Splitting user data across different databases based on user ID ranges.
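
Routing by user ID range can be sketched as a sorted list of range boundaries plus a binary search. The boundaries and the database names (`users_db_0` and so on) are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds of each shard's user-ID range.
SHARD_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_NAMES = ["users_db_0", "users_db_1", "users_db_2"]


def shard_for(user_id: int) -> str:
    """Return the database that owns this user ID."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    idx = bisect_right(SHARD_BOUNDS, user_id)
    if idx >= len(SHARD_NAMES):
        raise ValueError(f"no shard configured for user_id {user_id}")
    return SHARD_NAMES[idx]
```

Range-based routing keeps neighboring IDs together, which helps range scans but can create hot shards when new users (the highest IDs) are the most active.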

WHAT ARE THE CHALLENGES OF SHARDING?

Challenges include data consistency, complex queries across shards, and re-sharding data when a shard grows too large.

Example: Implementing a consistent hashing algorithm to evenly distribute data across shards.
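
A minimal consistent-hashing sketch, assuming SHA-256 as the hash function and 100 virtual nodes per shard (both arbitrary choices for illustration). `HashRing`, `ring_position`, and the shard names are hypothetical.

```python
import hashlib
from bisect import bisect_right


def ring_position(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes  # virtual nodes per shard smooth the distribution
        self._points = []     # sorted (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        self._points.extend(
            (ring_position(f"{node}#{i}"), node) for i in range(self.vnodes)
        )
        self._points.sort()

    def node_for(self, key: str) -> str:
        if not self._points:
            raise RuntimeError("ring is empty")
        # Rebuilt on every lookup for brevity; a real ring would cache this list.
        positions = [p for p, _ in self._points]
        # First point clockwise from the key's position, wrapping around.
        idx = bisect_right(positions, ring_position(key)) % len(self._points)
        return self._points[idx][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add("shard-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the technique shows up in `moved`: when a shard is added, a key either stays where it was or moves to the new shard, so re-sharding touches only a fraction of the data instead of reshuffling everything.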

WHAT IS A MESSAGE QUEUE?

A message queue is a component used for communication between processes, allowing them to send and receive messages asynchronously.

Example: Using RabbitMQ to manage background tasks in a web application.

WHAT ARE THE BENEFITS OF USING A MESSAGE QUEUE?

It enables asynchronous processing, improves system resilience, and decouples system components.

Example: Processing user signup emails in the background while the main application handles user requests.
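
The signup-email scenario can be sketched with Python's standard `queue` and `threading` modules. This in-process queue is only a stand-in for a broker such as RabbitMQ; `handle_signup`, `email_worker`, and the addresses are hypothetical.

```python
import queue
import threading

email_queue = queue.Queue()
sent = []


def email_worker():
    """Background consumer: takes addresses off the queue and 'sends' email."""
    while True:
        address = email_queue.get()
        if address is None:  # sentinel value: shut the worker down
            email_queue.task_done()
            break
        sent.append(f"welcome email to {address}")
        email_queue.task_done()


def handle_signup(address):
    """Request handler: enqueue the slow work and return immediately."""
    email_queue.put(address)
    return "signup accepted"


worker = threading.Thread(target=email_worker, daemon=True)
worker.start()
responses = [handle_signup(a) for a in ("ada@example.com", "lin@example.com")]
email_queue.join()     # demo only: wait until the background work is finished
email_queue.put(None)  # stop the worker
worker.join()
```

The handler never waits for an email to be sent; the producer and the consumer share only the queue, which is the decoupling described above.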

WHAT IS A MICROSERVICES ARCHITECTURE?

Microservices architecture is an architectural style that structures an application as a collection of loosely coupled services, each with its own functionality and data storage.

Example: Breaking down a monolithic e-commerce application into individual services for inventory, payment, and user management.

WHAT ARE THE BENEFITS OF MICROSERVICES?

Benefits include improved modularity, easier scaling, and better fault isolation.

Example: Scaling the payment service independently of the inventory service in an e-commerce application.

WHAT IS A MONOLITHIC ARCHITECTURE?

A monolithic architecture is a single-tier software application where all components are interconnected and interdependent.

Example: A traditional web application where the frontend, backend, and database are tightly integrated.

WHAT ARE THE DRAWBACKS OF A MONOLITHIC ARCHITECTURE?

Drawbacks include difficulty in scaling individual components, tight coupling, and challenges in maintaining and deploying the application.

Example: Updating one part of the application requires redeploying the entire system.

WHAT IS EVENTUAL CONSISTENCY?

Eventual consistency is a consistency model used in distributed systems where updates to a database propagate to all nodes, but not immediately. If no new updates are made, all nodes eventually converge to the same value.

HOW DO YOU ENSURE DATA CONSISTENCY IN A DISTRIBUTED SYSTEM?

Techniques include using consensus algorithms (e.g., Paxos, Raft), distributed transactions, and ensuring idempotent operations.

Example: Using the Raft algorithm to manage leader election and data replication.

WHAT IS THE CAP THEOREM?

The CAP theorem states that a distributed data store can provide at most two of the following three guarantees at the same time: Consistency, Availability, and Partition tolerance. Because network partitions cannot be ruled out in practice, the real choice is between consistency and availability while a partition lasts.

Example: Choosing between strong consistency and availability in a distributed database during network partitions.

WHAT IS A KEY-VALUE STORE?

A key-value store is a type of NoSQL database that stores data as a collection of key-value pairs.

Example: Using Redis or Amazon DynamoDB to store session data.

WHAT ARE THE ADVANTAGES OF KEY-VALUE STORES?

They offer fast read and write operations, are easy to scale, and provide flexible data models.

Example: Storing user preferences in Redis for quick retrieval.

WHAT IS A DOCUMENT STORE?

A document store is a type of NoSQL database that stores data as documents, typically in JSON or BSON format.

Example: Using MongoDB to store user profiles with varying attributes.

WHAT IS A GRAPH DATABASE?

A graph database is designed to store and query data modeled as graphs, with nodes, edges, and properties.

Example: Using Neo4j to manage social network data with complex relationships.

WHAT ARE THE BENEFITS OF GRAPH DATABASES?

They excel at handling complex relationships, enable efficient traversal queries, and provide a flexible schema.

Example: Finding shortest paths and recommendations in a social network using Neo4j.

WHAT IS A COLUMN-FAMILY STORE?

A column-family store is a type of NoSQL database that groups related columns into column families and stores each family together, rather than storing complete rows contiguously. This layout is optimized for high-volume read and write operations.

Example: Using Apache Cassandra for time-series data storage.

WHAT ARE THE BENEFITS OF COLUMN-FAMILY STORES?

They offer high write throughput and horizontal scalability, and are well suited to time-series data and real-time analytics.

Example: Storing and retrieving logs in Apache Cassandra.

WHAT IS A DISTRIBUTED FILE SYSTEM?

A distributed file system (DFS) is a file system that allows access to files from multiple hosts, providing redundancy and fault tolerance.

Example: Using Hadoop Distributed File System (HDFS) for big data storage.

WHAT IS A DATA WAREHOUSE?

A data warehouse is a centralized repository for storing large volumes of structured and semi-structured data, optimized for query and analysis.

Example: Using Amazon Redshift or Google BigQuery for business analytics.

WHAT IS ETL (EXTRACT, TRANSFORM, LOAD)?

ETL is a process in data warehousing that involves extracting data from various sources, transforming it to fit operational needs, and loading it into a target data store.

Example: Using Apache NiFi to extract data from databases, transform it, and load it into HDFS.
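
The three ETL stages can be illustrated with the Python standard library alone. This sketch is not how Apache NiFi works; it uses an in-memory CSV string as the source and SQLite as the target, and the `orders` schema and sample rows are invented for the example.

```python
import csv
import io
import sqlite3

RAW_CSV = "order_id,amount,currency\n1,19.99,usd\n2,5.00,USD\n3,,usd\n"


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, convert to cents, normalize currency."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        clean.append((
            int(row["order_id"]),
            round(float(row["amount"]) * 100),
            row["currency"].upper(),
        ))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_cents INTEGER, currency TEXT)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, amount_cents, currency FROM orders ORDER BY order_id"
).fetchall()
```

Keeping the three stages as separate functions makes each one testable on its own, which tends to matter more as pipelines grow.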

WHAT IS A DATA LAKE?

A data lake is a storage repository that holds vast amounts of raw data in its native format until it is needed.

Example: Using Amazon S3 as a data lake for storing structured and unstructured data.

WHAT ARE THE BENEFITS OF A DATA LAKE?

Benefits include scalability, flexibility, and the ability to store diverse data types.

Example: Storing log files, images, and structured data together in Amazon S3.

WHAT IS STREAM PROCESSING?

Stream processing involves processing data in real-time as it flows from one source to another, allowing immediate insights and actions.

Example: Using Apache Kafka and Apache Flink to process real-time clickstream data.

WHAT ARE THE ADVANTAGES OF STREAM PROCESSING?

It enables real-time analytics, reduces data latency, and supports event-driven architectures.

Example: Real-time fraud detection in financial transactions using Apache Kafka and Apache Storm.

WHAT IS BATCH PROCESSING?

Batch processing involves processing data in large blocks or batches at scheduled intervals.

Example: Using Apache Hadoop to run nightly ETL jobs.

WHAT IS THE DIFFERENCE BETWEEN BATCH PROCESSING AND STREAM PROCESSING?

Batch processing deals with large volumes of data processed at intervals, while stream processing deals with continuous data processing in real-time.

Example: Using Apache Hadoop for batch processing and Apache Kafka for stream processing.
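
The difference can be seen by computing the same aggregate both ways. The sketch below is plain Python, not Hadoop or Kafka; the click events and the names `batch_page_counts` and `StreamingPageCounts` are hypothetical.

```python
from collections import Counter

clicks = [{"page": "home"}, {"page": "pricing"}, {"page": "home"}]


def batch_page_counts(events):
    """Batch: process the full, already-collected dataset in one pass."""
    return Counter(e["page"] for e in events)


class StreamingPageCounts:
    """Stream: update the result incrementally as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["page"]] += 1
        return dict(self.counts)  # an up-to-date result after every event


batch_result = batch_page_counts(clicks)
stream = StreamingPageCounts()
snapshots = [stream.on_event(e) for e in clicks]
```

Both approaches end with the same counts; the streaming version simply has a usable answer after each event instead of only after the whole batch has run.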

WHAT IS A LAMBDA ARCHITECTURE?

A lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods.

Example: Combining Apache Hadoop for batch processing and Apache Storm for real-time processing in a lambda architecture.

WHAT ARE THE COMPONENTS OF A LAMBDA ARCHITECTURE?

The components include a batch layer for processing large datasets, a speed layer for real-time processing, and a serving layer to merge the results.

Example: Using Hadoop for batch processing, Kafka for real-time data, and HBase as the serving layer.
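
How the three layers fit together can be sketched with page-view counts. This is a toy model in plain Python, not Hadoop, Kafka, or HBase; the function and class names and the sample events are assumptions.

```python
from collections import Counter


def batch_layer(master_dataset):
    """Batch layer: recompute a complete view from all historical data."""
    return Counter(e["page"] for e in master_dataset)


class SpeedLayer:
    """Speed layer: incremental view of events the last batch run has not seen."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1


def serving_layer_query(batch_view, speed_view, page):
    """Serving layer: merge both views at query time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)


historical = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
recent = [{"page": "home"}, {"page": "checkout"}]

batch_view = batch_layer(historical)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

home_views = serving_layer_query(batch_view, speed.view, "home")
checkout_views = serving_layer_query(batch_view, speed.view, "checkout")
```

Once a later batch run has absorbed the recent events, the speed layer's view for that period would be discarded so that nothing is counted twice.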

WHAT IS THE KAPPA ARCHITECTURE?

The kappa architecture simplifies the lambda architecture by using only stream processing to handle both real-time and historical data.

Example: Using Apache Kafka and Apache Flink to process all data as streams.

WHAT IS A NOSQL DATABASE?

A NoSQL database is a non-relational database designed to handle large volumes of unstructured or semi-structured data with flexible schemas.

Example: Using MongoDB or Cassandra for storing large-scale data with varying structures.

WHAT ARE THE TYPES OF NOSQL DATABASES?

Types include key-value stores, document stores, column-family stores, and graph databases.

Example: Redis (key-value), MongoDB (document), Cassandra (column-family), Neo4j (graph).

WHAT IS A RELATIONAL DATABASE?

A relational database is a type of database that organizes data into tables with rows and columns, using SQL for data management.

Example: Using MySQL or PostgreSQL for structured data storage.

WHAT ARE THE BENEFITS OF RELATIONAL DATABASES?

They provide strong consistency and ACID transactions, and support complex queries and relationships.

Example: Using SQL joins to query related data across multiple tables in PostgreSQL.

WHAT IS ACID COMPLIANCE?

ACID stands for Atomicity, Consistency, Isolation, and Durability, which are properties that ensure reliable database transactions.

Example: A banking system using ACID transactions to ensure accurate fund transfers.

WHAT IS BASE CONSISTENCY?

BASE stands for Basically Available, Soft state, and Eventual consistency, which are properties of some NoSQL databases that prioritize availability over strict consistency.

Example: Using Cassandra for high availability and eventual consistency in a distributed database.

WHAT IS A DISTRIBUTED SYSTEM?

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate to achieve a common goal.

Example: Using a microservices architecture with services running on multiple servers.

WHAT ARE THE CHALLENGES OF DISTRIBUTED SYSTEMS?

Challenges include network latency, fault tolerance, data consistency, and synchronization.

Example: Implementing consensus algorithms like Raft to manage distributed state.

WHAT IS FAULT TOLERANCE?

Fault tolerance is the ability of a system to continue operating properly in the event of the failure of some of its components.

Example: Using redundant servers and data replication to ensure high availability.

HOW DO YOU ACHIEVE HIGH AVAILABILITY?

Techniques include using load balancers, redundant systems, failover mechanisms, and data replication.

Example: Implementing database replication and load balancing for a web application.

WHAT IS A CONSENSUS ALGORITHM?

A consensus algorithm is a process in computer science used to achieve agreement on a single data value among distributed processes or systems.

Example: Using the Raft algorithm to elect a leader in a distributed system.

WHAT IS THE RAFT ALGORITHM?

Raft is a consensus algorithm designed to be understandable and implementable, used to manage a replicated log in distributed systems.

Example: Implementing Raft for leader election and log replication in a distributed database.

WHAT IS THE PAXOS ALGORITHM?

Paxos is a family of protocols for solving consensus in a network of unreliable or faulty processors.

Example: Using Paxos for achieving consensus in a distributed system with unreliable nodes.

WHAT IS LEADER ELECTION?

Leader election is the process of designating a single node as the organizer of some task distributed among several nodes.

Example: Using ZooKeeper for leader election in a distributed system.

WHAT IS A QUORUM?

A quorum is the minimum number of votes needed for a distributed transaction to be committed.

Example: Using a majority quorum (floor(n/2) + 1 out of n nodes) in a distributed database to ensure data consistency.
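
The majority-quorum arithmetic is short enough to write out. The function names are hypothetical. The `quorums_overlap` check (W + R > N) is the usual rule for replicated reads and writes; it is an addition here and does not come from the slide.

```python
def majority_quorum(n: int) -> int:
    """Smallest number of votes that is more than half of n nodes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return n // 2 + 1  # integer division, so 4 nodes need 3 votes, not 2.5


def is_committed(acks: int, n: int) -> bool:
    """A write commits once a majority of the n nodes has acknowledged it."""
    return acks >= majority_quorum(n)


def quorums_overlap(n: int, write_quorum: int, read_quorum: int) -> bool:
    """Every read set shares at least one node with every write set if W + R > N."""
    return write_quorum + read_quorum > n
```

Two majorities of the same cluster always share at least one node, which is why a majority vote cannot commit two conflicting decisions.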

WHAT IS A WRITE-AHEAD LOG (WAL)?

A write-ahead log is a technique used in databases to ensure data integrity by logging changes before applying them to the database.

Example: Using WAL in PostgreSQL to ensure data is not lost during a crash.
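
The log-first, apply-second ordering can be shown with a toy key-value store. This is a teaching sketch, not PostgreSQL's WAL: `TinyKV` is a hypothetical class, the log is a JSON-lines file, and torn writes and checkpoints are ignored.

```python
import json
import os
import tempfile


class TinyKV:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()  # recovery: rebuild state from the log
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to disk.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2. Only then apply it to the in-memory state.
        self.data[key] = value

    def close(self):
        self._log.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wal.log")
    db = TinyKV(path)
    db.put("balance", 100)
    db.put("balance", 80)
    db.close()                # pretend the process died here; memory is gone
    recovered = TinyKV(path)  # a fresh instance replays the log
    recovered_data = dict(recovered.data)
    recovered.close()
```

Because every change reaches durable storage before it is applied, replaying the log after a crash reproduces the last acknowledged state.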

WHAT IS STRONG CONSISTENCY?

Strong consistency ensures that all nodes in a distributed system see the same data at the same time after a write operation.

Example: Using a relational database like PostgreSQL to ensure strong consistency.

WHAT IS A DATA PARTITION?

Data partitioning involves dividing a database into smaller, more manageable pieces, which can be stored across multiple servers.

Example: Sharding a user database by user ID to distribute data across different servers.

WHAT IS A DISTRIBUTED HASH TABLE (DHT)?

A DHT is a decentralized distributed system that provides a lookup service similar to a hash table, where data is distributed across multiple nodes.

Example: Using a DHT for peer-to-peer file sharing in BitTorrent.

WHAT IS A GOSSIP PROTOCOL?

A gossip protocol is a method of peer-to-peer communication where nodes periodically exchange state information to achieve eventual consistency.

Example: Using a gossip protocol in Cassandra for node state information exchange.
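
A small simulation shows how repeated pairwise exchanges spread state to every node. It is a simplified push-pull variant with versioned entries, not Cassandra's actual protocol; `Node`, `gossip_round`, and the fixed random seed are assumptions made for the example.

```python
import random


class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}  # key -> (version, value)

    def update(self, key, value):
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def merge(self, other_state):
        # Keep whichever entry has the higher version.
        for key, (version, value) in other_state.items():
            if key not in self.state or self.state[key][0] < version:
                self.state[key] = (version, value)


def gossip_round(nodes, rng):
    """Each node exchanges state with one randomly chosen peer (push-pull)."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        peer.merge(node.state)
        node.merge(peer.state)


def converged(nodes):
    return all(n.state == nodes[0].state for n in nodes)


rng = random.Random(42)  # fixed seed so the run is repeatable
nodes = [Node(f"node-{i}") for i in range(5)]
nodes[0].update("node-0", "UP")
nodes[3].update("node-3", "UP")

rounds = 0
while not converged(nodes) and rounds < 1000:
    gossip_round(nodes, rng)
    rounds += 1
```

No node ever talks to all the others at once, yet after enough rounds every node holds the same state, which is the eventual consistency the definition refers to.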

WHAT IS A LEADER-FOLLOWER PATTERN?

The leader-follower pattern is a replication strategy where one node (leader) handles all writes, and other nodes (followers) replicate the leader's data.

Example: Using leader-follower replication in Kafka for high availability.

WHAT IS A DISTRIBUTED LOCK?

A distributed lock is a mechanism to synchronize access to a shared resource across multiple nodes in a distributed system.

Example: Using ZooKeeper to implement distributed locks for coordination.

WHAT IS A HEARTBEAT IN DISTRIBUTED SYSTEMS?

A heartbeat is a periodic signal sent by a node to indicate its presence and operational status to other nodes in a distributed system.

Example: Using heartbeats in Kubernetes to monitor node health.
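
Heartbeat-based failure detection reduces to remembering when each node was last heard from. The sketch below is a generic timeout detector, not Kubernetes's mechanism; `HeartbeatMonitor`, the 5-second timeout, and the injected fake clock are assumptions.

```python
import time


class HeartbeatMonitor:
    """Suspects a node is down when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.last_seen = {}  # node -> time of the most recent heartbeat

    def record_heartbeat(self, node):
        self.last_seen[node] = self.clock()

    def suspected_down(self):
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


now = [0.0]  # fake clock so the example runs instantly
monitor = HeartbeatMonitor(timeout_seconds=5, clock=lambda: now[0])
monitor.record_heartbeat("node-a")
monitor.record_heartbeat("node-b")
now[0] = 4.0
monitor.record_heartbeat("node-a")  # node-b stays silent
now[0] = 8.0
down = monitor.suspected_down()
```

A missed heartbeat only makes a node suspected: a slow network looks the same as a crashed node, so the timeout trades detection speed against false alarms.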

WHAT IS PARTITION TOLERANCE?

Partition tolerance is the ability of a distributed system to continue functioning even when network partitions occur, isolating parts of the system.

Example: Using eventual consistency to handle network partitions in a distributed database.

WHAT IS THE DIFFERENCE BETWEEN SYNCHRONOUS AND ASYNCHRONOUS REPLICATION?

Synchronous replication requires that data be written to all replicas before acknowledging the write, while asynchronous replication allows for acknowledgement before all replicas are updated.

Example: Synchronous replication in a financial transaction system for data integrity vs. asynchronous replication in a content delivery network for performance.

WHAT IS A READ REPLICA?

A read replica is a copy of a database that is used to offload read traffic from the primary database, improving performance and availability.

Example: Using read replicas in Amazon RDS to handle increased read traffic.

WHAT IS DATA CONSISTENCY?

Data consistency ensures that data is the same across all nodes in a distributed system after a write operation.

Example: Using strong consistency in a distributed database to ensure accurate data retrieval.

WHAT IS A ROLLBACK?

A rollback is the process of undoing changes made to a database during a transaction if an error occurs, ensuring data integrity.

Example: Using rollback in SQL transactions to revert changes if an error occurs during a multi-step process.
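
A two-step funds transfer makes the rollback visible. The sketch uses Python's built-in `sqlite3` module with an in-memory database; the `accounts` table, the account names, and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src; undo both steps if either one fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # debit fails after the credit ran
final = balances(conn)
```

In the failed transfer the credit to `bob` has already executed when the debit violates the constraint; the rollback undoes that credit, so no money is created.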

WHAT IS A DISTRIBUTED CACHE?

A distributed cache is a caching mechanism where data is stored across multiple nodes to improve scalability and performance.

Example: Using Redis Cluster to distribute cache data across multiple nodes.

WHAT IS A CIRCUIT BREAKER PATTERN?

The circuit breaker pattern is a design pattern used to detect failures and prevent cascading failures in distributed systems by stopping the flow of requests to a failing service.

Example: Implementing a circuit breaker in a microservices architecture to handle service failures.
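
A minimal circuit breaker, assuming a simple policy: open after a fixed number of consecutive failures, reject calls while open, and allow one trial call after a timeout. The thresholds, class names, and the fake clock are assumptions; production libraries offer more states and options.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


now = [0.0]
attempts = []

def flaky_service():
    attempts.append(now[0])
    raise ConnectionError("service unavailable")

breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky_service)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 10.0                           # the reset timeout has elapsed
recovered = breaker.call(lambda: "ok")  # trial call succeeds and closes the circuit
```

The third call never reaches the failing service; rejecting it immediately is what keeps one struggling dependency from tying up its callers.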

WHAT IS A BULKHEAD PATTERN?

The bulkhead pattern isolates components of a system into separate pools to prevent a failure in one component from affecting others.

Example: Using separate thread pools for different services in a microservices architecture to isolate failures.

WHAT IS A FALLBACK MECHANISM?

A fallback mechanism provides an alternative action or data source when a service call fails, ensuring system resilience.

Example: Returning cached data or a default response when a primary service call fails.
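
The cached-data-or-default idea fits in one small function. `fetch_with_fallback`, the price keys, and the plain dictionary used as a cache are hypothetical; catching every `Exception` is a simplification, and real code would usually catch only the errors it expects.

```python
def fetch_with_fallback(key, primary, cache, default=None):
    """Call the primary service; on failure serve the last known value or a default."""
    try:
        value = primary(key)
    except Exception:
        return cache.get(key, default)
    cache[key] = value  # remember the latest good answer
    return value


def broken_primary(key):
    raise TimeoutError("primary service timed out")


price_cache = {}
fresh = fetch_with_fallback("price:42", lambda k: 19.99, price_cache)
stale = fetch_with_fallback("price:42", broken_primary, price_cache)
missing = fetch_with_fallback("price:99", broken_primary, price_cache, default=0.0)
```

The caller always receives an answer, though it may be stale; whether stale data is acceptable depends on the use case.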

WHAT IS IDEMPOTENCY?

Idempotency ensures that performing an operation multiple times has the same effect as performing it once, which is crucial for reliable systems.

Example: Using unique transaction IDs to ensure that duplicate payment requests do not result in multiple charges.
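
Deduplicating on a transaction ID can be sketched as follows. `PaymentService` and its in-memory dictionary are hypothetical; a real service would keep the processed IDs in durable storage shared by all instances.

```python
class PaymentService:
    """Charges at most once per transaction ID, however often the request repeats."""

    def __init__(self):
        self._results = {}  # transaction_id -> result of the first attempt
        self.charges = []   # the charges that were actually made

    def charge(self, transaction_id, account, amount_cents):
        if transaction_id in self._results:
            # Duplicate request: return the stored result, do not charge again.
            return self._results[transaction_id]
        self.charges.append((account, amount_cents))
        result = {
            "transaction_id": transaction_id,
            "status": "charged",
            "amount_cents": amount_cents,
        }
        self._results[transaction_id] = result
        return result


payments = PaymentService()
first = payments.charge("txn-001", "alice", 2500)
retry = payments.charge("txn-001", "alice", 2500)  # e.g. the client retried after a timeout
```

Because the retry returns the stored result, a client can safely resend a request when it is unsure whether the first attempt went through.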

WHAT IS SERVICE DISCOVERY?

Service discovery is the process of automatically detecting services and their endpoints in a distributed system, facilitating communication between components.

Example: Using Consul or Eureka for service discovery in a microservices architecture.

WHAT IS A PROXY SERVER?

A proxy server acts as an intermediary for requests from clients seeking resources from other servers, providing benefits like load balancing, caching, and security.

Example: Using Nginx as a reverse proxy to distribute traffic to backend servers.

WHAT IS A REVERSE PROXY?

A reverse proxy is a type of proxy server that retrieves resources on behalf of a client from one or more servers, often used for load balancing and security.

Example: Using HAProxy to distribute incoming HTTP requests to multiple web servers.

WHAT IS A CDN EDGE SERVER?

A CDN edge server is a server located at the edge of a network that stores cached content, delivering it to users from a location closer to them.

Example: Using Cloudflare edge servers to deliver static content to users with low latency.

WHAT IS A MICROSERVICES GATEWAY?

A microservices gateway is an API gateway that manages and routes requests to the appropriate microservice, often handling cross-cutting concerns like authentication and rate limiting.

Example: Using Kong or Apigee as a microservices gateway to manage API traffic.
