Big Data Storage Concepts
2. Big Data Storage
Contents:
• Clusters
• File systems
• Distributed File System
• NoSQL
• Sharding
• Replication
• Combining Sharding and Replication
Introduction
• Data acquired from external sources is often not in a format or structure that can be directly processed.
• Data wrangling includes steps to filter, cleanse and otherwise prepare the data for downstream analysis.
• A copy of the data is first stored in its acquired format; after wrangling, the prepared data needs to be stored again.
Storage is required whenever:
• External datasets are acquired, or internal data will be used in a Big Data environment.
• Data is manipulated to be made amenable to data analysis.
• Data is processed via an ETL activity, or output is generated as a result of an analytical operation.
Clusters
• A cluster is a tightly coupled collection of servers, or nodes.
• The servers usually have the same hardware specifications and are connected via a network to work as a single unit.
• Each node in the cluster has its own dedicated resources, such as memory, a processor and a hard drive.
• A cluster can execute a task by splitting it into small pieces and distributing their execution onto different computers that belong to the cluster, as sketched below.
Fig: Cluster
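Below is a minimal, single-machine sketch in Python that mimics this divide-and-distribute pattern with a process pool; a real cluster would ship the pieces to networked nodes, and all names and data here are illustrative.

# Minimal sketch: split a task into pieces and run them in parallel,
# mimicking on one machine how a cluster distributes work across nodes.
from concurrent.futures import ProcessPoolExecutor

def count_words(chunk):
    # Each "node" processes only its own piece of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data storage", "clusters and nodes"] * 1000  # toy dataset
    n_workers = 4
    # Split the task into smaller pieces, one per worker.
    chunks = [lines[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    print(sum(partial_counts))  # combine the partial results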
2. File Systems
• File system: a method of storing and organizing files containing data on a storage device, such as flash drives, DVDs and hard drives.
o File: the atomic unit of storage.
• A file system provides a logical view of the stored data, presenting it as a tree structure of directories and files.
• Operating systems employ file systems to store and retrieve data, e.g., NTFS on Microsoft Windows and ext on Linux.
2. Big Data Storage:
Distributed File System (DFS)
• A DFS stores large files spread across the nodes of a cluster.
• The DFS presents a logical (local) view that enables the files to be accessed from multiple locations.
o Physically, the files are distributed throughout the cluster.
• Examples: Google File System (GFS) and Hadoop Distributed File System (HDFS).
NoSQL
• A Not-only SQL (NoSQL) database is a non-relational database that is highly scalable and fault-tolerant.
• Designed to house semi-structured and unstructured data.
• A NoSQL database provides an API-based query interface that can be called from within an application, as sketched below.
• NoSQL databases also support query languages other than Structured Query Language (SQL).
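As a minimal sketch of such an API-based query interface, the following uses the pymongo driver; the connection string, database name and collection contents are illustrative, and a locally running MongoDB instance is assumed.

# Sketch of an API-based NoSQL query (pymongo); assumes MongoDB on localhost.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["appdb"]["users"]  # illustrative database/collection names

# Store a semi-structured document; no fixed schema is required.
users.insert_one({"name": "alice", "interests": ["hiking", "chess"]})

# Query through the driver's API rather than through SQL.
for doc in users.find({"interests": "chess"}):
    print(doc["name"])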
2. Big Data Storage:
Sharding
• Sharding: the process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards.
• The shards are distributed across multiple nodes.
• Each shard is stored on a separate node, and each node is responsible only for the data stored on it.
2. Big Data Storage:
Sharding
• Each shard shares the same schema, and all shards collectively represent the complete dataset.
Advantages:
• Enhanced storage capacity with horizontal scalability.
• High availability; read/write times are greatly improved.
• Partial tolerance toward failures: if a node fails, only the shard stored on that node is affected.
2. Big Data Storage:
Sharding
How sharding works:
1. Each shard can independently service reads and writes for the specific subset of data that it is responsible for.
2. Depending on the query, data may need to be fetched from both shards; a sketch follows.
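The following is a minimal in-memory sketch of hash-based sharding: a dict stands in for each node, a hash of the key selects the shard for reads and writes, and a query that is not keyed on the shard key must gather results from every shard. The two-shard layout and all names are illustrative.

# Minimal sketch of hash-based sharding with in-memory "nodes".
import hashlib

NUM_SHARDS = 2
shards = [dict() for _ in range(NUM_SHARDS)]  # each dict stands in for a node

def shard_for(key):
    # A deterministic hash keeps the same key on the same shard.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def write(key, value):
    shards[shard_for(key)][key] = value       # write touches a single shard

def read(key):
    return shards[shard_for(key)].get(key)    # keyed read touches one shard

def scan(predicate):
    # A query not keyed on the shard key must fetch from every shard.
    return [v for shard in shards for v in shard.values() if predicate(v)]

write(1, {"id": 1, "city": "Pune"})
write(2, {"id": 2, "city": "Delhi"})
print(read(2))                                 # single-shard read
print(scan(lambda v: v["city"] == "Pune"))     # scatter-gather read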
2. Big Data Storage:
Sharding by MongoDB
A MongoDB sharded cluster consists of the following components:
• shard: each shard contains a subset of the sharded data.
• mongos: the mongos acts as a query router, providing an interface between client applications and the sharded cluster.
• config servers: config servers store metadata and configuration settings for the cluster.
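As a sketch of how these components might be driven from an application, the commands below are issued through pymongo against the mongos router; a running sharded cluster is assumed, and the address, database, collection and shard key are illustrative.

# Sketch: enable sharding through the mongos query router with pymongo.
from pymongo import MongoClient

mongos = MongoClient("mongodb://localhost:27017")  # assumed mongos address

# Enable sharding for a database, then shard a collection on a hashed key;
# the config servers record this metadata, and mongos routes accordingly.
mongos.admin.command("enableSharding", "appdb")
mongos.admin.command("shardCollection", "appdb.users",
                     key={"user_id": "hashed"})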
2. Big Data Storage:
Sharding
Considerations before sharding:
• Queries requiring data from multiple shards impose performance penalties.
• Query patterns need to be taken into account so that commonly accessed data is co-located on a single shard, which helps counter such performance issues.
2. Big Data Storage:
Replication
• Replication stores multiple copies of a dataset, known as replicas, on multiple nodes.
• Provides scalability and availability.
• Achieves fault tolerance: data is not lost when an individual node fails.
• Two methods are used to implement replication:
o Master-slave
o Peer-to-peer
2. Big Data Storage:
Replication – Master-Slave
(Figure: master-slave replication)
2. Big Data Storage:
Replication – Master-Slave
Master-slave replication:
• Nodes are arranged in a master-slave configuration.
• All data is written to a master node.
• The data is then replicated over to multiple slave nodes.
• All write requests (insert, update and delete) occur on the master node,
o whereas read requests can be fulfilled by any slave node.
2. Big Data Storage:
Replication – Master-Slave
• Ideal for read-intensive loads: growing read demands can be managed by horizontal scaling.
• Writes are consistent, as all writes are coordinated by the master node.
o However, write performance suffers as the volume of writes increases.
• In the event that the master node fails:
o Reads are still possible via any of the slave nodes.
o Writes are not supported until a master node is re-established.
A minimal sketch of this write/read routing is given below.
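This is an in-memory sketch (no real networking) of the routing described above: writes go only to the master, which copies them to the slaves, while reads can be served by any slave. All class and key names are illustrative.

# Sketch of master-slave routing with in-memory nodes.
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, n_slaves=2):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(n_slaves)]

    def write(self, key, value):
        self.master.data[key] = value      # all writes occur on the master
        for slave in self.slaves:          # then replicate to every slave
            slave.data[key] = value

    def read(self, key):
        return random.choice(self.slaves).data.get(key)  # any slave serves reads

cluster = MasterSlaveCluster()
cluster.write("user:7", "active")
print(cluster.read("user:7"))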
2. Big Data Storage:
Replication – Master-Slave
• An issue: read inconsistency. Scenario: a slave is read before the latest write has propagated from the master, so stale data is returned.
• Solution? A voting system: a read is declared consistent only if the majority of the nodes hold the same version of the record. A sketch follows.
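A minimal sketch of such a voting read, assuming the reader can contact every replica; the replica contents below are illustrative, with one deliberately stale copy.

# Sketch of a voting read: accept a value only if a majority agree on it.
from collections import Counter

def voting_read(replicas, key):
    values = [r.get(key) for r in replicas]
    value, count = Counter(values).most_common(1)[0]
    if count > len(replicas) // 2:         # strict majority required
        return value
    raise RuntimeError("no majority: read is inconsistent")

replicas = [{"vote": "A"}, {"vote": "A"}, {"vote": "B"}]  # one stale replica
print(voting_read(replicas, "vote"))       # majority agrees on "A"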
2. Big Data Storage:
Replication – Peer-to-Peer
(Figure: peer-to-peer replication)
2. Big Data Storage:
Replication – Peer-to-Peer
Peer-to-peer replication:
• All nodes operate at the same level; there is no master-slave relationship between the nodes.
• Each node, known as a peer, is equally capable of handling reads and writes.
• Each write is copied to all peers.
2. Big Data Storage:
Replication – Peer-to-Peer
• Prone to write inconsistencies, due to simultaneous updates of the same data across multiple peers.
• Addressed by implementing either a pessimistic or an optimistic concurrency strategy:
o Pessimistic concurrency: a proactive strategy that uses locking to prevent conflicts. However, locking is detrimental to database availability.
o Optimistic concurrency: a reactive strategy that does not use locking, so the database remains available. It allows inconsistency to occur; peers may remain inconsistent for some time before attaining consistency.
A sketch of the optimistic approach is given below.
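A minimal sketch of the optimistic strategy using version checks: no locks are taken, and a write carrying a stale version number is rejected so that the caller can re-read and retry. Names are illustrative.

# Sketch of optimistic concurrency via per-record version numbers.
class Peer:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self.store.get(key, (0, None))
        if current_version != expected_version:
            return False                   # conflict detected: caller retries
        self.store[key] = (current_version + 1, value)
        return True

peer = Peer()
version, _ = peer.read("counter")
print(peer.write("counter", 1, version))   # True: first writer succeeds
print(peer.write("counter", 2, version))   # False: stale version rejected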
2. Big Data Storage:
Replication – Peer-to-Peer
• An issue: read inconsistency. Scenario: a peer is read before an update has propagated to it, so stale data is returned.
• Solution? As with master-slave replication, a voting system: a read is declared consistent only if the majority of the peers hold the same version of the record.
2. Big Data Storage:
Replication by MongoDB
(Figure: a MongoDB replica set)
2. Big Data Storage:
Replication by MongoDB
• A replica set is a group of mongod instances that maintain the same data set.
o It contains several data-bearing nodes and, optionally, one arbiter node.
o One of the data-bearing nodes is deemed the primary node; the other nodes are deemed secondary nodes.
• The primary node receives all write operations.
• The primary records all changes to its data sets in its operation log, i.e., the oplog.
• The secondaries apply the operations from the oplog to their data sets so that they reflect the primary's data set.
• If the primary becomes unavailable, an eligible secondary holds an election to elect itself the new primary.
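A sketch of connecting to such a replica set from pymongo; the host names and the replica-set name rs0 are illustrative. Writes are always routed to the primary, while a secondary-preferred read preference lets secondaries serve reads.

# Sketch: connect to a replica set and allow reads from secondaries.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"
)
db = client.get_database(
    "appdb", read_preference=ReadPreference.SECONDARY_PREFERRED
)

db["events"].insert_one({"type": "login"})       # routed to the primary
print(db["events"].find_one({"type": "login"}))  # may be served by a secondary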
2. Big Data Storage:
Combining Sharding and Replication
(Figure: sharding combined with replication)
2. Big Data Storage:
Combining Sharding and Replication
Advantages of combining sharding and replication:
• Provides the advantages of both sharding and replication:
o Improves on the limited fault tolerance offered by sharding alone.
o Adds the increased availability and scalability of replication.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication
• When combined, multiple shards become slaves of a single master, and the master itself is a shard.
• This results in multiple masters; each master shard manages only its corresponding slave shards.
• Write consistency is maintained by each master shard.
• Replicas of shards are kept on multiple slave nodes to provide scalability and fault tolerance for read operations.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication
(Figure: sharding combined with master-slave replication)
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication
In the figure:
• Each node acts both as a master and as a slave for different shards.
• Writes (id = 2) to Shard A are regulated by Node A, as it is the master for Shard A.
• Node A replicates the data (id = 2) to Node B, which is a slave for Shard A.
• Reads (id = 4) can be served directly by either Node B or Node C, as they each contain Shard B.
A toy routing sketch of this layout is given below.
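A toy Python routing table mirroring the figure; the range-based shard key is illustrative and chosen so that id = 2 falls on Shard A and id = 4 on Shard B. The sketch only computes which nodes would handle each request.

# Toy routing for sharding combined with master-slave replication.
cluster = {
    "Shard A": {"master": "Node A", "slaves": ["Node B"]},
    "Shard B": {"master": "Node B", "slaves": ["Node C"]},
}

def shard_for(record_id):
    return "Shard A" if record_id <= 3 else "Shard B"  # illustrative shard key

def route_write(record_id):
    return cluster[shard_for(record_id)]["master"]     # writes hit the master

def route_read(record_id):
    shard = cluster[shard_for(record_id)]
    return [shard["master"]] + shard["slaves"]         # any holder serves reads

print(route_write(2))  # Node A regulates writes for Shard A
print(route_read(4))   # Shard B reads served by Node B or Node C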
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication
• When combined, each shard is replicated to multiple peers, and each peer is responsible for only a subset of the overall dataset.
• This helps achieve increased scalability and fault tolerance.
• As there is no master involved, there is no single point of failure, and fault tolerance for both read and write operations is supported.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication
(Figure: sharding combined with peer-to-peer replication)
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication
In the figure:
• Each node contains replicas of two different shards.
• Writes (id = 3) are replicated to both Node A and Node C (peers), as they are responsible for Shard C.
• Reads (id = 6) can be served by either Node B or Node C, as they each contain Shard B.
A toy routing sketch of this layout is given below.
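A toy routing table mirroring the figure; the shard key is illustrative and chosen so that id = 3 maps to Shard C and id = 6 to Shard B. Every peer holding a shard receives a copy of each write, and any one of them can serve a read.

# Toy routing for sharding combined with peer-to-peer replication.
peers_by_shard = {
    "Shard B": ["Node B", "Node C"],
    "Shard C": ["Node A", "Node C"],
}

def shard_for(record_id):
    return "Shard C" if record_id % 2 == 1 else "Shard B"  # illustrative key

def route_write(record_id):
    return peers_by_shard[shard_for(record_id)]  # write copied to all peers

def route_read(record_id):
    return peers_by_shard[shard_for(record_id)]  # any one peer serves the read

print(route_write(3))  # Shard C writes replicated to Node A and Node C
print(route_read(6))   # Shard B reads served by Node B or Node C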