Big Data Storage Concepts
2. Big Data Storage
Contents:
• Clusters
• File systems
• Distributed File System
• NoSQL
• Sharding
• Replication
• Combining Sharding and Replication
Introduction
• Data acquired from external sources is often not in a format or structure that can be directly processed.
• Data wrangling includes steps to filter, cleanse and otherwise prepare the data for downstream analysis.
• A copy of the data is first stored in its acquired format; after wrangling, the prepared data needs to be stored again.
Storage is required whenever:
• External datasets are acquired, or internal data will be used in a Big Data environment.
• Data is manipulated to be made amenable to data analysis.
• Data is processed via an ETL activity, or output is generated as a result of an analytical operation.
Clusters
• A cluster is a tightly coupled collection of servers, or nodes.
• The servers usually have the same hardware specifications and are connected via a network to work as a single unit.
• Each node in the cluster has its own dedicated resources, such as memory, a processor and a hard drive.
• A cluster can execute a task by splitting it into small pieces and distributing their execution onto different computers that belong to the cluster, as sketched below.
Fig: Cluster
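Below is a minimal, single-machine sketch in Python that mimics this divide-and-distribute pattern with a process pool; a real cluster would ship the pieces to networked nodes, and all names and data here are illustrative.

# Minimal sketch: split a task into pieces and run them in parallel,
# mimicking on one machine how a cluster distributes work across nodes.
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk):
    # Each "node" processes only its own piece of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data storage", "clusters and nodes"] * 1000  # toy dataset
    n_workers = 4
    # Split the task into smaller pieces, one per worker.
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    print(sum(partial_counts))  # combine the partial results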
2. File Systems
• File system: a method of storing and organizing files containing data on a storage device, such as flash drives, DVDs and hard drives.
o File: the atomic unit of storage.
• A file system provides a logical view of the stored data, presenting it as a tree structure of directories and files.
• Operating systems employ file systems to store and retrieve data, e.g., NTFS on Microsoft Windows and ext on Linux.
2. Big Data Storage:
Distributed File System (DFS)
• A DFS stores large files spread across the nodes of a cluster.
• The DFS presents a logical (local) view that enables the files to be accessed from multiple locations.
o Physically, the files are distributed throughout the cluster.
• Examples: Google File System (GFS) and Hadoop Distributed File System (HDFS).
NoSQL
• A Not-only SQL (NoSQL) database is a non-relational database that is highly scalable and fault-tolerant.
• Designed to house semi-structured and unstructured data.
• A NoSQL database provides an API-based query interface that can be called from within an application, as sketched below.
• NoSQL databases also support query languages other than Structured Query Language (SQL).
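As a minimal sketch of such an API-based query interface, the following uses the pymongo driver; the connection string, database name and collection contents are illustrative, and a locally running MongoDB instance is assumed.

# Sketch of an API-based NoSQL query (pymongo); assumes MongoDB on localhost.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["appdb"]["users"]  # illustrative database/collection names

# Store a semi-structured document; no fixed schema is required.
users.insert_one({"name": "alice", "interests": ["hiking", "chess"]})

# Query through the driver's API rather than through SQL.
for doc in users.find({"interests": "chess"}):
    print(doc["name"])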
2. Big Data Storage:
Sharding
• Sharding: the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
• The shards are distributed across multiple nodes.
• Each shard is stored on a separate node, and each node is responsible only for the data stored on it.
2. Big Data Storage:
Sharding
• Each shard shares the same schema, and all shards collectively represent the complete dataset.
Advantages:
• Enhanced storage capacity with horizontal scalability.
• High availability; read/write times are greatly improved.
• Partial tolerance toward failures: if a node fails, only the shard stored on that node is affected.
2. Big Data Storage:
Sharding
How sharding works:
1. Each shard can independently service reads and writes for the specific subset of data that it is responsible for.
2. Depending on the query, data may need to be fetched from both shards; a sketch follows.
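The following is a minimal in-memory sketch of hash-based sharding: a dict stands in for each node, a hash of the key selects the shard for reads and writes, and a query that is not keyed on the shard key must gather results from every shard. The two-shard layout and all names are illustrative.

# Minimal sketch of hash-based sharding with in-memory "nodes".
import hashlib

NUM_SHARDS = 2
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a node

def shard_for(key):
    # A deterministic hash keeps the same key on the same shard.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def write(key, value):
    shards[shard_for(key)][key] = value       # write touches a single shard

def read(key):
    return shards[shard_for(key)].get(key)    # keyed read touches one shard

def scan(predicate):
    # A query not keyed on the shard key must fetch from every shard.
    return [v for shard in shards for v in shard.values() if predicate(v)]

write(1, {"id": 1, "city": "Pune"})
write(2, {"id": 2, "city": "Delhi"})
print(read(2))                                 # single-shard read
print(scan(lambda v: v["city"] == "Pune"))     # scatter-gather read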
2. Big Data Storage:
Sharding by MongoDB
A MongoDB sharded cluster consists of the following components:
• shard: each shard contains a subset of the sharded data.
• mongos: the mongos acts as a query router, providing an interface between client applications and the sharded cluster.
• config servers: config servers store metadata and configuration settings for the cluster.
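As a sketch of how these components might be driven from an application, the commands below are issued through pymongo against the mongos router; a running sharded cluster is assumed, and the address, database, collection and shard key are illustrative.

# Sketch: enable sharding through the mongos query router with pymongo.
from pymongo import MongoClient

mongos = MongoClient("mongodb://localhost:27017")  # assumed mongos address

# Enable sharding for a database, then shard a collection on a hashed key;
# the config servers record this metadata, and mongos routes accordingly.
mongos.admin.command("enableSharding", "appdb")
mongos.admin.command("shardCollection", "appdb.users",
                     key={"user_id": "hashed"})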
2. Big Data Storage:
Sharding
Considerations before sharding:
• Queries requiring data from multiple shards impose performance penalties.
• Query patterns need to be taken into account so that commonly accessed data is co-located on a single shard, which helps counter such performance issues.
2. Big Data Storage:
Replication
• Replication stores multiple copies of a dataset, known as replicas, on multiple nodes.
• Provides scalability and availability.
• Achieves fault tolerance: data is not lost when an individual node fails.
• Two methods are used to implement replication:
o Master-slave
o Peer-to-peer
2. Big Data Storage:
Replication – Master-Slave
(Figure: master-slave replication)
2. Big Data Storage:
Replication – Master-Slave
Master-slave replication:
• Nodes are arranged in a master-slave configuration.
• All data is written to a master node.
• The data is then replicated over to multiple slave nodes.
• All write requests (insert, update and delete) occur on the master node,
o whereas read requests can be fulfilled by any slave node.
2. Big Data Storage:
Replication – Master-Slave
• Ideal for read-intensive loads: growing read demands can be managed by horizontal scaling.
• Writes are consistent, as all writes are coordinated by the master node.
o However, write performance suffers as the volume of writes increases.
• In the event that the master node fails:
o Reads are still possible via any of the slave nodes.
o Writes are not supported until a master node is re-established.
A minimal sketch of this write/read routing is given below.
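This is an in-memory sketch (no real networking) of the routing described above: writes go only to the master, which copies them to the slaves, while reads can be served by any slave. All class and key names are illustrative.

# Sketch of master-slave routing with in-memory nodes.
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, n_slaves=2):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(n_slaves)]

    def write(self, key, value):
        self.master.data[key] = value      # all writes occur on the master
        for slave in self.slaves:          # then replicate to every slave
            slave.data[key] = value

    def read(self, key):
        return random.choice(self.slaves).data.get(key)  # any slave serves reads

cluster = MasterSlaveCluster()
cluster.write("user:7", "active")
print(cluster.read("user:7"))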
2. Big Data Storage:
Replication – Master-Slave
• An issue: read inconsistency. Scenario: a slave is read before the latest write has propagated from the master, so stale data is returned.
• Solution? A voting system: a read is declared consistent only if the majority of the nodes hold the same version of the record. A sketch follows.
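A minimal sketch of such a voting read, assuming the reader can contact every replica; the replica contents below are illustrative, with one deliberately stale copy.

# Sketch of a voting read: accept a value only if a majority agree on it.
from collections import Counter

def voting_read(replicas, key):
    values = [r.get(key) for r in replicas]
    value, count = Counter(values).most_common(1)[0]
    if count > len(replicas) // 2:         # strict majority required
        return value
    raise RuntimeError("no majority: read is inconsistent")

replicas = [{"vote": "A"}, {"vote": "A"}, {"vote": "B"}]  # one stale replica
print(voting_read(replicas, "vote"))       # majority agrees on "A"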
2. Big Data Storage:
Replication – Peer-to-Peer
(Figure: peer-to-peer replication)
2. Big Data Storage:
Replication – Peer-to-Peer
Peer-to-peer replication:
• All nodes operate at the same level; there is no master-slave relationship between the nodes.
• Each node, known as a peer, is equally capable of handling reads and writes.
• Each write is copied to all peers.
2. Big Data Storage:
Replication – Peer-to-Peer
• Prone to write inconsistencies, due to simultaneous updates of the same data across multiple peers.
• Addressed by implementing either a pessimistic or an optimistic concurrency strategy:
o Pessimistic concurrency: a proactive strategy that uses locking to prevent conflicts. However, locking is detrimental to database availability.
o Optimistic concurrency: a reactive strategy that does not use locking, so the database remains available. It allows inconsistency to occur; peers may remain inconsistent for some time before attaining consistency.
A sketch of the optimistic approach is given below.
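A minimal sketch of the optimistic strategy using version checks: no locks are taken, and a write carrying a stale version number is rejected so that the caller can re-read and retry. Names are illustrative.

# Sketch of optimistic concurrency via per-record version numbers.
class Peer:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.store.get(key, (0, None))
        if current_version != expected_version:
            return False                   # conflict detected: caller retries
        self.store[key] = (current_version + 1, value)
        return True

peer = Peer()
version, _ = peer.read("counter")
print(peer.write("counter", 1, version))   # True: first writer succeeds
print(peer.write("counter", 2, version))   # False: stale version rejected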
2. Big Data Storage:
Replication – Peer-to-Peer
• An issue: read inconsistency. Scenario: a peer is read before an update has propagated to it, so stale data is returned.
• Solution? As with master-slave replication, a voting system: a read is declared consistent only if the majority of the peers hold the same version of the record.
2. Big Data Storage:
Replication by MongoDB
(Figure: a MongoDB replica set)
2. Big Data Storage:
Replication by MongoDB
• A replica set is a group of mongod instances that maintain the same data set.
o It contains several data-bearing nodes and, optionally, one arbiter node.
o One of the data-bearing nodes is deemed the primary node; the other nodes are deemed secondary nodes.
• The primary node receives all write operations.
• The primary records all changes to its data sets in its operation log, i.e., the oplog.
• The secondaries apply the operations from the oplog to their data sets so that they reflect the primary's data set.
• If the primary becomes unavailable, an eligible secondary holds an election to elect itself the new primary.
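A sketch of connecting to such a replica set from pymongo; the host names and the replica-set name rs0 are illustrative. Writes are always routed to the primary, while a secondary-preferred read preference lets secondaries serve reads.

# Sketch: connect to a replica set and allow reads from secondaries.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"
)
db = client.get_database(
    "appdb", read_preference=ReadPreference.SECONDARY_PREFERRED
)

db["events"].insert_one({"type": "login"})       # routed to the primary
print(db["events"].find_one({"type": "login"}))  # may be served by a secondary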
2. Big Data Storage:
Combining Sharding and Replication
(Figure: sharding combined with replication)
2. Big Data Storage:
Combining Sharding and Replication
Advantages of combining sharding and replication:
• Provides the advantages of both sharding and replication:
o Improves on the limited fault tolerance offered by sharding alone.
o Adds the increased availability and scalability of replication.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication
• When combined, multiple shards become slaves of a single master, and the master itself is a shard.
• This results in multiple masters; each master shard manages only its corresponding slave shards.
• Write consistency is maintained by each master shard.
• Replicas of shards are kept on multiple slave nodes to provide scalability and fault tolerance for read operations.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication
(Figure: sharding combined with master-slave replication)
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication
In the figure:
• Each node acts both as a master and as a slave for different shards.
• Writes (id = 2) to Shard A are regulated by Node A, as it is the master for Shard A.
• Node A replicates the data (id = 2) to Node B, which is a slave for Shard A.
• Reads (id = 4) can be served directly by either Node B or Node C, as they each contain Shard B.
A toy routing sketch of this layout is given below.
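A toy Python routing table mirroring the figure; the range-based shard key is illustrative and chosen so that id = 2 falls on Shard A and id = 4 on Shard B. The sketch only computes which nodes would handle each request.

# Toy routing for sharding combined with master-slave replication.
cluster = {
    "Shard A": {"master": "Node A", "slaves": ["Node B"]},
    "Shard B": {"master": "Node B", "slaves": ["Node C"]},
}

def shard_for(record_id):
    return "Shard A" if record_id <= 3 else "Shard B"  # illustrative shard key

def route_write(record_id):
    return cluster[shard_for(record_id)]["master"]     # writes hit the master

def route_read(record_id):
    shard = cluster[shard_for(record_id)]
    return [shard["master"]] + shard["slaves"]         # any holder serves reads

print(route_write(2))  # Node A regulates writes for Shard A
print(route_read(4))   # Shard B reads served by Node B or Node C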
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication
• When combined, each shard is replicated to multiple peers, and each peer is responsible for only a subset of the overall dataset.
• This helps achieve increased scalability and fault tolerance.
• As there is no master involved, there is no single point of failure, and fault tolerance for both read and write operations is supported.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication
(Figure: sharding combined with peer-to-peer replication)
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication
In the figure:
• Each node contains replicas of two different shards.
• Writes (id = 3) are replicated to both Node A and Node C (peers), as they are responsible for Shard C.
• Reads (id = 6) can be served by either Node B or Node C, as they each contain Shard B.
A toy routing sketch of this layout is given below.
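A toy routing table mirroring the figure; the shard key is illustrative and chosen so that id = 3 maps to Shard C and id = 6 to Shard B. Every peer holding a shard receives a copy of each write, and any one of them can serve a read.

# Toy routing for sharding combined with peer-to-peer replication.
peers_by_shard = {
    "Shard B": ["Node B", "Node C"],
    "Shard C": ["Node A", "Node C"],
}

def shard_for(record_id):
    return "Shard C" if record_id % 2 == 1 else "Shard B"  # illustrative key

def route_write(record_id):
    return peers_by_shard[shard_for(record_id)]  # write copied to all peers

def route_read(record_id):
    return peers_by_shard[shard_for(record_id)]  # any one peer serves the read

print(route_write(3))  # Shard C writes replicated to Node A and Node C
print(route_read(6))   # Shard B reads served by Node B or Node C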