
Big Data Storage Concepts
2. Big Data Storage

Contents:
• Clusters
• File systems
• Distributed File System
• NoSQL
• Sharding
• Replication
• Combining Sharding and Replication
Introduction
• Data acquired from external sources is often not in
a format or structure that can be directly
processed.
• Data wrangling includes steps to filter, cleanse and
otherwise prepare the data for downstream
analysis.
• A copy of the data is first stored in its acquired
format, and, after wrangling, the prepared data
needs to be stored again.
Storage is required whenever any of the following occurs:
• External datasets are acquired, or internal
data will be used in a Big Data environment.
• Data is manipulated to be made amenable for
data analysis.
• Data is processed via an ETL activity, or output
is generated as a result of an analytical
operation.
Clusters
• A cluster is a tightly coupled collection of servers,
or nodes.
• Servers usually have the same hardware
specifications and are connected together via a
network to work as a single unit.
• Each node in the cluster has its own dedicated
resources, such as memory, a processor, and a
hard drive.
• A cluster can execute a task by splitting it into small
pieces and distributing their execution onto
different computers that belong to the cluster.
Fig: cluster
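To make the idea concrete, here is a minimal Python sketch in which worker processes stand in for cluster nodes; a task is split into small pieces and the partial results are combined at the end. All names are illustrative and not tied to any cluster framework.

```python
# Minimal sketch: splitting a task into pieces and distributing them,
# with worker processes standing in for cluster nodes.
from multiprocessing import Pool

def process_piece(piece):
    # Each "node" works independently on its own piece of the task.
    return sum(piece)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the task into small pieces, one per node.
    pieces = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:           # 4 "nodes"
        partial_results = pool.map(process_piece, pieces)
    print(sum(partial_results))               # combine the partial results
```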
2. File Systems
• File system: a method of storing and organizing files
containing data on a storage device, such as flash drives, DVDs
and hard drives.
  o A file is the atomic unit of storage.
• A file system provides a logical view of the data stored,
presenting it as a tree structure of directories and files.
• Operating systems employ file systems to store and retrieve
data, e.g. NTFS on Microsoft Windows and ext on Linux.
2. Big Data Storage:
Distributed File System (DFS)
• A DFS stores large files spread across the nodes of a cluster.
• A logical view presented via the DFS enables the files to be
accessed from multiple locations.
  o Physically, the files are distributed throughout the cluster.
• Examples: Google File System (GFS) and Hadoop Distributed
File System (HDFS).
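A conceptual sketch of the idea in plain Python, not the real GFS/HDFS APIs: a file is cut into fixed-size blocks that physically live on different nodes, while a name table preserves the single logical view. The block size, node names and function names are all made up for illustration.

```python
# Conceptual DFS sketch: one logical file, physically distributed blocks.
BLOCK_SIZE = 4  # bytes, tiny for illustration (HDFS defaults to 128 MB)

nodes = {"node1": {}, "node2": {}, "node3": {}}
name_table = {}  # logical file name -> [(node, block_id), ...]

def dfs_write(filename, data):
    placement = []
    for i in range(0, len(data), BLOCK_SIZE):
        node = list(nodes)[(i // BLOCK_SIZE) % len(nodes)]  # round-robin
        block_id = f"{filename}_{i // BLOCK_SIZE}"
        nodes[node][block_id] = data[i:i + BLOCK_SIZE]
        placement.append((node, block_id))
    name_table[filename] = placement

def dfs_read(filename):
    # Logical view: the client sees one file, wherever the blocks live.
    return b"".join(nodes[n][b] for n, b in name_table[filename])

dfs_write("log.txt", b"a large file spread across the cluster")
assert dfs_read("log.txt") == b"a large file spread across the cluster"
```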
NoSQL
• A Not-only SQL (NoSQL) database is a non-relational
database that is highly scalable and fault-tolerant.
• Designed to house semi-structured and
unstructured data.
• A NoSQL database provides an API-based query
interface that can be called from within an
application.
• NoSQL databases also support query languages
other than Structured Query Language (SQL).
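As a sketch of such an API-based interface, the following uses the pymongo driver; it assumes a MongoDB server on localhost:27017, and the database, collection and fields shown are invented for the example.

```python
# Sketch of an API-based query interface via the pymongo driver
# (assumes a MongoDB server running on localhost:27017).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]                      # database and collection are
orders = db["orders"]                    # created lazily on first write

# Semi-structured documents: no fixed schema is enforced.
orders.insert_one({"user_id": 7, "items": ["book", "pen"], "total": 12.5})

# Queries are API calls made from within the application, not SQL strings.
for doc in orders.find({"user_id": 7}):
    print(doc["items"], doc["total"])
```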
2. Big Data Storage:
Sharding
• Sharding: the process of horizontally partitioning a large dataset
into a collection of smaller, more manageable datasets called
shards.
• Shards are distributed across multiple nodes.
• Each shard is stored on a separate node, and each node is
responsible for only the data stored on it.
2. Big Data Storage:
Sharding
• Each shard shares the same schema, and all shards collectively
represent the complete dataset.
Advantages:
• Enhanced storage capacity while allowing horizontal scalability.
• High availability: read/write times are greatly improved.
• Partial tolerance toward failures.
2. Big Data Storage:
Sharding

How sharding works:
1. Each shard can independently service reads and writes for the
specific subset of data that it is responsible for.
2. Depending on the query, data may need to be fetched from both
shards (both behaviours are illustrated in the sketch below).
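A minimal Python sketch of both points, assuming simple hash-based shard assignment; the helper names and data are made up.

```python
# Sketch of how sharding works: a hash of the shard key decides which
# shard owns a record, and some queries must touch more than one shard.
shards = [dict(), dict()]  # two shards, each held by a separate node

def shard_for(key):
    return shards[hash(key) % len(shards)]

def write(key, value):
    shard_for(key)[key] = value          # each shard services its own writes

def read(key):
    return shard_for(key)[key]           # single-shard read

def scan(predicate):
    # Depending on the query, data may need to be fetched from both shards.
    return [v for s in shards for v in s.values() if predicate(v)]

write("user:1", {"name": "Ana", "age": 34})
write("user:2", {"name": "Ben", "age": 29})
print(read("user:1"))                    # served by a single shard
print(scan(lambda v: v["age"] > 30))     # gathered across both shards
```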
2. Big Data Storage:
Sharding by MongoDB
A MongoDB sharded cluster consists of the following components:
• shard: Each shard contains a subset of the sharded data.
• mongos: The mongos acts as a query router, providing an interface
between client applications and the sharded cluster.
• config servers: Config servers store metadata and configuration
settings for the cluster.
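As a sketch of how these components are used from an application, the following pymongo snippet talks to a mongos router. It assumes a sharded cluster (shards, config servers, mongos) is already running, and the "shop.orders" namespace and shard key are illustrative choices, not prescribed by MongoDB.

```python
# Sketch: enabling sharding through a mongos query router with pymongo.
from pymongo import MongoClient

mongos = MongoClient("mongodb://localhost:27017")  # connect to mongos
mongos.admin.command("enableSharding", "shop")
mongos.admin.command(
    "shardCollection", "shop.orders",
    key={"user_id": "hashed"},  # hashed shard key spreads writes evenly
)
```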
2. Big Data Storage:
Sharding

Considerations before sharding:
• Queries requiring data from multiple shards impose performance
penalties.
• Query patterns need to be taken into account, so that commonly
accessed data is co-located on a single shard, which helps counter
such performance issues (see the sketch below).
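A small sketch of the co-location idea: sharding orders by user_id keeps all of one user's records on a single shard, so the common "all orders for one user" query never crosses shards. The data and key choice are hypothetical.

```python
# Sketch: choosing the shard key so commonly accessed data is co-located.
shards = [[], []]

def place(order):
    shards[hash(order["user_id"]) % len(shards)].append(order)

for o in [{"user_id": 7, "total": 10}, {"user_id": 7, "total": 25},
          {"user_id": 9, "total": 5}]:
    place(o)

def orders_for(user_id):
    # Touches exactly one shard, avoiding the cross-shard penalty.
    shard = shards[hash(user_id) % len(shards)]
    return [o for o in shard if o["user_id"] == user_id]

print(orders_for(7))  # served entirely from a single shard
```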
2. Big Data Storage:
Replication
• Replication stores multiple copies of a
dataset, known as replicas,
on multiple nodes.
• Provides scalability and availability.
• Achieves fault tolerance: ensures that data is not lost
when an individual node fails.
• Two methods are used to implement replication:
  o Master-slave
  o Peer-to-peer
2. Big Data Storage:
Replication – Master-Slave

Fig: master-slave replication
2. Big Data Storage:
Replication

Master-slave replication:
• Nodes are arranged in a master-slave configuration.
• All data is written to a master node.
• The data is then replicated over to multiple slave nodes.
• All write requests (insert, update and delete) occur on the master
node,
  o whereas read requests can be fulfilled by any slave node.
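A minimal sketch of this flow, with in-memory dictionaries standing in for nodes and replication assumed synchronous for simplicity.

```python
# Sketch of master-slave replication: all writes go to the master and
# are copied to the slaves; reads may be served by any slave.
import random

class Node:
    def __init__(self):
        self.data = {}

master = Node()
slaves = [Node(), Node()]

def write(key, value):
    master.data[key] = value                 # writes only on the master
    for slave in slaves:                     # then replicated to every slave
        slave.data[key] = value

def read(key):
    return random.choice(slaves).data[key]   # any slave can serve reads

write("user:1", "Ana")
print(read("user:1"))
```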
2. Big Data Storage:
Replication – Master-Slave
• Ideal for read-intensive loads: growing read demands can be
managed by horizontal scaling.
• Writes are consistent, as all writes are coordinated by the master
node,
  o but write performance suffers as the volume of writes increases.
• In the event that the master node fails:
  o Reads are still possible via any of the slave nodes.
  o Writes are not supported until a master node is re-established.
2. Big Data Storage:
Replication – Master-Slave
• An issue, read inconsistency: a read served by a slave before the
latest write has been replicated from the master returns stale data.
• Solution? A voting system: a read is declared consistent only if the
majority of the nodes return the same version of the record (see
the sketch below).
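A minimal sketch of such a voting system, assuming a read is accepted only when a strict majority of the queried replicas returns the same value; this is an illustration, not a production quorum protocol.

```python
# Sketch of a voting system for reads: a read is declared consistent
# only if a majority of the queried replicas agree on the value.
from collections import Counter

def voted_read(replicas, key):
    values = [r.get(key) for r in replicas]
    value, votes = Counter(values).most_common(1)[0]
    if votes > len(replicas) // 2:
        return value                      # majority agrees: consistent read
    raise RuntimeError("no majority: read is inconsistent")

# One stale replica has not yet received the latest write.
replicas = [{"x": 2}, {"x": 2}, {"x": 1}]
print(voted_read(replicas, "x"))          # -> 2
```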
2. Big Data Storage:
Replication: Peer-to-Peer

Fig: peer-to-peer replication
2. Big Data Storage:
Replication: Peer-to-Peer

Peer-to-peer replication:
• All nodes operate at the same level; there is no master-slave
relationship between the nodes.
• Each node, known as a peer, is equally capable of handling
reads and writes.
• Each write is copied to all peers.
2. Big Data Storage:
Replication: Peer-to-Peer
• Prone to write inconsistencies, due to simultaneous updates of
the same data across multiple peers.
• Addressed by implementing either a pessimistic or an optimistic
concurrency strategy (both are sketched below):
  o Pessimistic concurrency:
     A proactive strategy
     Uses locking
     However, this is detrimental to database availability
  o Optimistic concurrency:
     A reactive strategy
     Does not use locking, so the database remains available
     Allows inconsistency to occur: peers may remain inconsistent for
    some time before attaining consistency
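A short sketch contrasting the two strategies on a single record, with a lock for the pessimistic case and a version check for the optimistic one; the record layout is invented for illustration.

```python
# Sketch of pessimistic vs. optimistic concurrency on one record.
import threading

# Pessimistic: take a lock before updating, trading availability for safety.
lock = threading.Lock()
record = {"value": 0, "version": 0}

def pessimistic_update(new_value):
    with lock:                           # other writers must wait
        record["value"] = new_value
        record["version"] += 1

# Optimistic: no lock; the write succeeds only if the version is unchanged.
def optimistic_update(new_value, expected_version):
    if record["version"] != expected_version:
        return False                     # someone else won; caller retries
    record["value"] = new_value
    record["version"] += 1
    return True

pessimistic_update(1)
print(optimistic_update(2, expected_version=1))  # True
print(optimistic_update(3, expected_version=1))  # False: stale version
```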
2. Big Data Storage:
Replication: Peer-to-Peer
• An issue, read inconsistency: while a write is still propagating,
some peers have the update and others do not, so reads of the
same record can return different values.
• Solution? A voting system, as with master-slave replication: a read
is declared consistent only if the majority of the peers return the
same value.
2. Big Data Storage:
Replication by MongoDB

Fig: a MongoDB replica set
2. Big Data Storage:
Replication by MongoDB
• Replica set: a group of mongod instances that maintain the
same data set.
  o It contains several data-bearing nodes and optionally one
  arbiter node.
  o One of the data-bearing nodes is deemed the primary node;
  the other nodes are deemed secondary nodes.
• The primary node receives all write operations.
• The primary records all changes to its data sets in its operation log,
i.e. the oplog.
• The secondaries apply the operations in the oplog to their data sets
so that they reflect the primary's data set.
• If the primary is unavailable, an eligible secondary will hold an
election to elect itself the new primary.
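A sketch of connecting to such a replica set with pymongo; the replica-set name, host names and database are placeholders.

```python
# Sketch: connecting to a MongoDB replica set with pymongo (assumes a
# replica set named "rs0" with members on the hosts shown).
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://host1:27017,host2:27017,host3:27017/?replicaSet=rs0"
)

# Writes always go to the primary; this read preference lets reads be
# served by a secondary when one is available.
coll = client.get_database(
    "shop", read_preference=ReadPreference.SECONDARY_PREFERRED
)["orders"]
```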
2. Big Data Storage:
Combining Sharding and Replication

Fig: sharding combined with replication
2. Big Data Storage:
Combining Sharding and Replication

Advantages of combining sharding and replication:
• Provides the advantages of both sharding and replication:
  o Improves on the limited fault tolerance offered by sharding alone.
  o Adds the increased availability and scalability of replication.
2. Big Data Storage:
Combining Sharding and Replication

Combining Sharding and Master-Slave Replication
• When combined, multiple shards become slaves of a single
master, and the master itself is a shard.
• This results in multiple masters; each master shard manages
only its corresponding slave shards.
• Write consistency is maintained by each master shard.
• Replicas of shards are kept on multiple slave nodes to
provide scalability and fault tolerance for read operations.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Master-Slave Replication

Fig: sharding combined with master-slave replication
2. Big Data Storage:
Combining Sharding and Replication

Combining Sharding and Master-Slave Replication
In the figure:
• Each node acts both as a master and as a slave for different shards.
• Writes (id = 2) to Shard A are regulated by Node A, as it is the
master for Shard A.
• Node A replicates the data (id = 2) to Node B, which is a slave for
Shard A.
• Reads (id = 4) can be served directly by either Node B or Node C, as
they each contain Shard B.
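A minimal sketch of this combined scheme, assuming hash-based shard selection and synchronous master-to-slave copying; all structures are illustrative.

```python
# Sketch of sharding combined with master-slave replication: the shard
# key picks a shard, writes go to that shard's master, reads to a slave.
import random

def make_shard():
    return {"master": {}, "slaves": [{}, {}]}

shards = [make_shard(), make_shard()]      # e.g. Shard A and Shard B

def write(key, value):
    shard = shards[hash(key) % len(shards)]
    shard["master"][key] = value           # write coordinated by the master
    for slave in shard["slaves"]:          # replicated to the shard's slaves
        slave[key] = value

def read(key):
    shard = shards[hash(key) % len(shards)]
    return random.choice(shard["slaves"])[key]  # any replica of that shard

write("id:2", {"name": "Ana"})
print(read("id:2"))
```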
2. Big Data Storage:
Combining Sharding and Replication

Combining Sharding and Peer-to-Peer Replication
• When combined, each shard is replicated to multiple peers, and
each peer is only responsible for a subset of the overall dataset.
• This helps achieve increased scalability and fault tolerance.
• As there is no master involved, there is no single point of failure,
and fault tolerance for both read and write operations is
supported.
2. Big Data Storage:
Combining Sharding and Replication
Combining Sharding and Peer-to-Peer Replication

Fig: sharding combined with peer-to-peer replication
2. Big Data Storage:
Combining Sharding and Replication

Combining Sharding and Peer-to-Peer Replication
In the figure:
• Each node contains replicas of two different shards.
• Writes (id = 3) are replicated to both Node A and Node C (peers), as
they are responsible for Shard C.
• Reads (id = 6) can be served by either Node B or Node C, as they
each contain Shard B.
