Basic Principles
Big Data Management
Anis Ur Rahman
Faculty of Computer Science & Information Technology
University of Malaya
October 15, 2019
Borrowed from http://www.ksi.mff.cuni.cz/~svoboda/courses/171-NDBI040/
1 / 55
Lecture Outline
Different aspects of data distribution
1 Scaling
Vertical vs. horizontal
2 Distribution models
Sharding
Replication: master-slave vs. peer-to-peer architectures
3 CAP properties
Consistency, availability and partition tolerance
ACID vs. BASE guarantees
4 Consistency
Read and write quorums
2 / 55
Scalability
Outline
1 Scalability
2 Distribution Models
3 CAP Theorem
4 Consistency
3 / 55
Scalability
Scalability
What is scalability?
the capability of a system to handle growing amounts of data
and/or queries without losing performance, or
its potential to be enlarged in order to accommodate such growth
Two general approaches
Vertical scaling
Horizontal scaling
4 / 55
Scalability
Vertical Scalability
Vertical scaling (scaling up/down)
Adding resources to a single node in a system
e.g. increasing the number of CPUs, extending system memory,
using larger disk arrays, ...
i.e. larger and more powerful machines are involved
Traditional choice
In favor of strong consistency
Easy to implement and deploy
No issues caused by data distribution
...
Works well in many cases but ...
5 / 55
Scalability
Vertical Scalability: Drawbacks
Performance limits
Even the most powerful machine has a limit
Everything works well ... until we start approaching the limit
Higher costs
The cost of expansion grows disproportionately
In particular, it is higher than the combined cost of equivalent
commodity hardware
Proactive provisioning
New projects/applications might evolve rapidly
Upfront budget is needed when deploying new machines
So flexibility is severely limited
6 / 55
Scalability
Vertical Scalability: Drawbacks
Vendor lock-in
There are only a few manufacturers of large machines
Customer is made dependent on a single vendor
Their products, services, but also implementation details,
proprietary formats, interfaces, support, ...
i.e. it is difficult or impossible to switch to another vendor
Deployment downtime
Downtime is often unavoidable when scaling up
7 / 55
Scalability
Horizontal Scalability
Horizontal scaling (scaling out/in)
adding more nodes to a system
i.e. system is distributed across multiple nodes in a cluster
Choice of many NoSQL systems
Advantages
Commodity hardware, cost effective
Flexible deployment and maintenance
Often outperforms vertical scaling
Often no single point of failure
...
8 / 55
Scalability
Horizontal Scalability: Consequences
Significantly increases complexity
Complexity of management, programming model, ...
Introduces new issues and problems
Data distribution
Synchronization of nodes
Data consistency
Recovery from failures
...
And there are also plenty of false assumptions ...
9 / 55
Scalability
Horizontal Scalability: Fallacies
False assumptions
Network is reliable
Latency is zero
Bandwidth is infinite
Network is secure
Topology does not change
There is one administrator
Network is homogeneous
Transport cost is zero
Source: https://www.red-gate.com/simple-talk/blogs/
the-eight-fallacies-of-distributed-computing/
10 / 55
Scalability
Horizontal Scalability: Conclusion
A standalone node might still be a better option in certain cases
e.g. for graph databases
Simply because it is difficult to split and distribute graphs
In other words
It can make sense to run even a NoSQL database system on a
single node
No distribution at all is the simplest and often preferred scenario
But in general, horizontal scaling really opens new possibilities
11 / 55
Scalability
Horizontal Scalability: Architecture
What is a cluster?
A collection of mutually interconnected commodity nodes
Based on the shared-nothing architecture
Nodes do not share their CPUs, memory, hard drives, ...
Each node runs its own operating system instance
Nodes send messages to interact with each other
Nodes of a cluster can be heterogeneous
Data, queries, calculations, requests, workload, ... are all distributed
among the nodes within a cluster
12 / 55
Distribution Models
Outline
1 Scalability
2 Distribution Models
3 CAP Theorem
4 Consistency
13 / 55
Distribution Models
Distribution Models
Generic techniques of data distribution
1 Sharding
Idea. different data on different nodes
Motivation. increasing volume of data, increasing performance
2 Replication
Idea. the same data on different nodes
Motivation. increasing performance, increasing fault tolerance
The two techniques are orthogonal
i.e. we can use either of them alone, or combine both
Distribution model
the specific way in which sharding and replication are implemented
NoSQL systems often offer automatic sharding and replication
14 / 55
Distribution Models
Sharding
Sharding (horizontal partitioning)
Placement of different data on different nodes
What does different data mean? Usually aggregates
e.g. key-value pairs, documents, ...
Related pieces of data that are accessed together should also
be kept together
Specifically, operations involving data on multiple shards should be
avoided (if possible)
The questions are...
how to design aggregate structures?
how to actually distribute these aggregates?
15 / 55
Distribution Models
Sharding
Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
16 / 55
Distribution Models
Sharding
Objectives
Achieve uniform data distribution
Achieve balanced workload (read and write requests)
Respect physical locations
e.g. different data centers for users around the world
...
Unfortunately, these objectives...
may mutually contradict each other
may change in time
So, how to actually determine shards for aggregates?
17 / 55
Distribution Models
Sharding
Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
18 / 55
Distribution Models
Sharding
Sharding strategies
Based on mapping structures
Data is placed on shards in a random fashion
e.g. round-robin, ...
Knowledge of the mapping of individual aggregates to particular
shards must then be maintained
Usually maintained in a centralized index structure, with all its
disadvantages
Based on general rules
Each shard is responsible for storing certain data
Hash partitioning, range partitioning, ...
19 / 55
Distribution Models
Sharding
Key-based sharding
Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
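To make the idea concrete, a minimal Python sketch of key-based (hash) sharding; the shard list, the key format, and the choice of MD5 are assumptions for the example, not taken from any particular system.

import hashlib

# Hypothetical list of shard nodes; any stable hash function works the same way.
SHARDS = ["node-a", "node-b", "node-c", "node-d"]

def shard_for(key: str) -> str:
    # Hash the shard key so that every node computes the same placement,
    # then map the digest onto one of the shards.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("customer:42"))   # the same key always lands on the same shard

Note that adding or removing a node changes the modulus and remaps most keys, which is why real systems often use consistent hashing instead.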
20 / 55
Distribution Models
Sharding
Range-based sharding
Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
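Similarly, a minimal sketch of range-based sharding; the range boundaries and node names are made up for the example.

# Each shard stores keys below its (exclusive) upper bound;
# None marks the last, open-ended range.
RANGES = [("g", "node-a"), ("n", "node-b"), ("t", "node-c"), (None, "node-d")]

def shard_for(key: str) -> str:
    for upper, node in RANGES:
        if upper is None or key < upper:
            return node

print(shard_for("brown"))    # node-a
print(shard_for("smith"))    # node-c

Range sharding keeps related keys together and supports range scans, but skewed keys can concentrate the workload on a few shards.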
21 / 55
Distribution Models
Sharding
Directory-based sharding
Source:
https://www.digitalocean.com/community/tutorials/understanding-database-sharding
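And a minimal sketch of directory-based sharding; the directory contents are hypothetical, and in practice the lookup table lives in a dedicated service.

# Hypothetical central directory: shard key -> node.
DIRECTORY = {
    "eu-customers": "node-a",
    "us-customers": "node-b",
    "asia-customers": "node-c",
}

def shard_for(shard_key: str) -> str:
    return DIRECTORY[shard_key]          # unknown shard keys raise KeyError

def move_shard(shard_key: str, target_node: str) -> None:
    # Remapping only requires updating the directory entry
    # (plus migrating the data itself).
    DIRECTORY[shard_key] = target_node

This is flexible, but the directory is exactly the kind of centralized mapping structure whose drawbacks were mentioned above.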
22 / 55
Distribution Models
Sharding
Should I Shard?
The amount of application data grows to exceed the storage capacity
of a single database node.
The volume of writes or reads to the database surpasses what a single
node can handle,
resulting in slower response times or timeouts.
23 / 55
Distribution Models
Sharding
Why is sharding difficult?
Not only do we need to be able to determine the particular shard during
write requests
i.e. when a new aggregate is about to be inserted
So that we can actually decide where it should be
physically stored
but also during read requests
i.e. when existing aggregate(s) are about to be retrieved
So that we can actually find and return them efficiently (or detect
that they are missing)
And all that based only on the search criteria provided (e.g. key, id,
...), unless all the nodes are to be accessed
24 / 55
Distribution Models
Sharding
Why is sharding even more difficult?
Structure of the cluster may be changing
Nodes can be added or removed
Nodes may have incomplete/obsolete cluster knowledge
Nodes involved, their responsibilities, sharding rules, ...
Individual nodes may be failing
Network may be partitioned
Messages may not be delivered even though sent
25 / 55
Distribution Models
Replication
Replication
Placement of multiple copies of the same data (replicas) on
different nodes
Replication factor = number of such copies
Two approaches
1 Master-slave architecture
2 Peer-to-peer architecture
26 / 55
Distribution Models
Replication
Master-Slave Architecture
Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
27 / 55
Distribution Models
Replication
Master-Slave Architecture
One node is primary (master), all the others are secondary (slaves)
Master node bears all the management responsibility
All the nodes contain identical data
Read requests can be handled by both the master and the slaves
Suitable for read-intensive applications
More read requests to deal with → more slaves to deploy
When the master fails, read operations can still be handled
28 / 55
Distribution Models
Replication
Master-Slave Architecture
Write requests can only be handled by the master
Newly written data is propagated to all the slaves
Consistency issue
Luckily enough, at most one write request is handled at a time
But the propagation still takes some time during which obsolete
reads might happen
Hence certain synchronization is required to avoid conflicts
In case of master failure, a new one needs to be appointed
Manually (user-defined) or automatically (cluster-elected)
Since the nodes are identical, appointment can be fast
The master might therefore represent a bottleneck (because of its
performance or failures)
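To illustrate the routing of requests, a minimal in-memory sketch of a master-slave setup; the node names, the synchronous propagation, and the dictionary storage are simplifying assumptions for the example.

import random

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}

master = Node("master")
slaves = [Node("slave-1"), Node("slave-2")]

def write(key, value):
    # Only the master accepts writes ...
    master.data[key] = value
    # ... and propagates them to the slaves; here synchronously, while in
    # practice the propagation is delayed, opening an inconsistency window.
    for slave in slaves:
        slave.data[key] = value

def read(key):
    # Reads can be served by the master or by any slave.
    return random.choice([master] + slaves).data.get(key)

write("user:1", {"name": "Alice"})
print(read("user:1"))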
29 / 55
Distribution Models
Replication
Peer-to-Peer Architecture
Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
30 / 55
Distribution Models
Replication
Peer-to-Peer Architecture
All the nodes have equal roles and responsibilities
All the nodes contain identical data once again
Both read and write requests can be handled by any node
No bottleneck, no single point of failure
Both the operations scale well
More requests to deal with → more nodes to deploy
Consistency issues
Unfortunately, multiple write requests can be initiated
independently and executed at the same time
Hence synchronization is required to avoid conflicts
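Such conflicting concurrent writes are typically detected with version stamps or vector clocks (both revisited in the consistency section); a minimal vector-clock sketch follows, with the dictionary representation being an assumption for the example.

def increment(clock, node):
    # Each replica increments its own entry when it accepts a write.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    # True if clock a has seen everything that clock b has (component-wise >=).
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b):
        return "a is newer or equal"
    if descends(b, a):
        return "b is newer"
    return "concurrent writes - the conflict must be resolved"

v1 = increment({}, "node-1")       # {'node-1': 1}
v2 = increment(v1, "node-2")       # later write that has seen v1
v3 = increment(v1, "node-3")       # independent write based on the same v1
print(compare(v2, v1))             # a is newer or equal
print(compare(v2, v3))             # concurrent writes - the conflict must be resolved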
31 / 55
Distribution Models
Sharding and Replication
Observations with respect to the replication
Does the replication factor really need to correspond to the
number of nodes?
No, a replication factor of 3 is often the right choice
Consequences
Nodes will no longer contain identical data
Replica placement strategy will be needed
Do all the replicas really need to be successfully written when
write requests are handled?
No, but consistency issues have to be tackled carefully
Sharding and replication can be combined... but how?
32 / 55
Distribution Models
Sharding and Replication
Sharding and Master-Slave Replication
Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
33 / 55
Distribution Models
Sharding and Replication
Sharding and Peer-to-Peer Replication
Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
34 / 55
Distribution Models
Sharding and Replication
Combinations of sharding and replication
1 Sharding + master-slave replication
Multiple masters, each for different data
Roles of the nodes can overlap
Each node can be master for some data and/or slave for other
2 Sharding + peer-to-peer replication
Basically placement of anything anywhere (although certain rules
can still be applied)
35 / 55
Distribution Models
Sharding and Replication
Questions to figure out for any distribution model
Can all the nodes serve both read and write requests?
Which replica placement strategy is used?
How is the mapping of replicas maintained?
What level of consistency and availability is provided?
What extent of infrastructure knowledge do the nodes have?
...
36 / 55
CAP Theorem
Outline
1 Scalability
2 Distribution Models
3 CAP Theorem
4 Consistency
37 / 55
CAP Theorem
CAP Theorem
Assumptions
Distributed system with sharding and replication
Read and write operations on a single aggregate only
CAP properties
Properties of a distributed system:
Consistency, Availability, and Partition tolerance
CAP theorem
It is not possible to have a distributed system that would
guarantee all CAP properties at the same time.
Only 2 of these 3 properties can be enforced.
But what do these properties actually mean?
38 / 55
CAP Theorem
CAP Properties
Consistency
Read and write operations must be executed atomically
There must exist a total order on all operations
Each operation looks as if it was completed at a single instant
i.e. as if all operations were executed sequentially one by one on a
single standalone node
Practical consequence. after a write operation, all readers see
the same data
Since any node can be used for handling read requests, atomicity
of write operations means that changes must be propagated to all
the replicas
As we will see later on, other ways for such a strong consistency
exist as well
39 / 55
CAP Theorem
CAP Properties
Availability
If a node is working, it must respond to user requests
Every read or write request successfully received by a non-failing
node in the system must result in a response,
i.e. their execution must not be rejected
Partition tolerance
System continues to operate even when two or more sets of
nodes get isolated
The network is allowed to lose arbitrarily many messages sent
from one node to another
i.e. a connection failure must not shut the whole system down
40 / 55
CAP Theorem
CAP Theorem Consequences
If at most two properties can be guaranteed ...
1 CA = consistency + availability
Traditional ACID properties are easy to achieve
Examples. RDBMS, Google BigTable
Any single-node system, but even clusters (at least in theory)
However, should a network partition happen, all the nodes must be
forced to stop accepting user requests
2 CP = consistency + partition tolerance
Other examples. distributed locking
3 AP = availability + partition tolerance
New concept of BASE properties
Examples. Apache Cassandra, Apache CouchDB
Other examples. web caching, DNS
41 / 55
CAP Theorem
CAP Theorem Consequences
42 / 55
CAP Theorem
CAP Theorem Consequences
Partition tolerance is necessary in clusters
Why? Because it is difficult to detect network failures
Does it mean that only purely CP and AP systems are possible?
No...
The real meaning of the CAP theorem:
The real world does not need to be just black and white
Partition tolerance is a must, but we can trade off consistency
versus availability
Just a little bit relaxed consistency can bring a lot of availability
Such trade-offs are not only possible, but often work very well in
practice
43 / 55
CAP Theorem
CAP Theorem Example
You are asked to design a distributed cluster of 4 data nodes.
The replication factor is 2, i.e. any data written to the cluster must be written
on 2 nodes, so when one goes down, the second can still serve the data. Now try
to apply the CAP theorem to this requirement.
In a distributed system, two things may happen at any time: a node failure
(hard disk crash) or a network failure (the connection between two nodes goes
down).
1 CP [Consistency/Partition Tolerance] Systems
2 AP [Availability/Partition Tolerance] Systems
3 CA [Consistency/Availability] Systems
44 / 55
CAP Theorem
ACID Properties
Traditional ACID properties
1 Atomicity
Partial execution of transactions is not allowed (all or nothing)
2 Consistency
Transactions bring the database from one consistent (valid) state
to another
3 Isolation
Transactions executed in parallel do not see uncommitted effects
of each other
4 Durability
Effects of committed transactions must remain durable
45 / 55
CAP Theorem
BASE Properties
New concept of BASE properties
1 Basically Available
The system works basically all the time
Partial failures can occur, but there are no total system failures
2 Soft State
The system is in an unstable (in flux), non-deterministic state
Changes occur all the time
3 Eventual Consistency
Sooner or later the system will be in some consistent state
BASE is just a vague term, no formal definition
Proposed to illustrate design philosophies at the opposite ends of
the consistency-availability spectrum
46 / 55
CAP Theorem
ACID and BASE
ACID
Choose consistency over availability
Pessimistic approach
Implemented by traditional relational databases
BASE
Choose availability over consistency
Optimistic approach
Common in NoSQL databases
Allows levels of scalability that cannot be achieved with ACID
Current trend in NoSQL. strong consistency → eventual consistency
47 / 55
CAP Theorem
CAP Theorem Conclusion
48 / 55
Consistency
Outline
1 Scalability
2 Distribution Models
3 CAP Theorem
4 Consistency
49 / 55
Consistency
Consistency
Consistency in general...
Consistency is the lack of contradiction in the database
However, it has many facets ... e.g.
here we only assume atomic operations, each manipulating just a single
aggregate,
but set operations could also be considered, etc.
Strong consistency is achievable even in clusters, but eventual
consistency might often be sufficient
1 A news article that is one minute out of date does not matter
2 Even when an already unavailable hotel room is booked once
again, the situation can still be resolved in the real world
3 ...
50 / 55
Consistency
Consistency
Write consistency (update consistency)
Problem. write-write conflict
Two or more write requests on the same aggregate are initiated
concurrently
Context. peer-to-peer architecture only
Issue. lost update
Solution.
1 Pessimistic strategies
Preventing conflicts from occurring
Write locks, ...
2 Optimistic strategies
Conflicts may occur, but are detected and resolved later on
Version stamps, vector clocks, ...
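A minimal sketch of the optimistic approach with version stamps: a write is accepted only if the client presents the stamp it originally read, otherwise the write-write conflict is reported; this is a single-threaded toy store with locking omitted, not any particular system's API.

class Store:
    def __init__(self):
        self.data = {}                        # key -> (version stamp, value)

    def read(self, key):
        return self.data.get(key, (0, None))

    def write(self, key, expected_version, value):
        # Accept the write only if nobody has written since the client's read;
        # otherwise report the write-write conflict back to the client.
        current_version, _ = self.data.get(key, (0, None))
        if current_version != expected_version:
            return False                      # conflict detected, lost update avoided
        self.data[key] = (current_version + 1, value)
        return True

store = Store()
version, _ = store.read("cart:7")
print(store.write("cart:7", version, ["book"]))   # True, stamp becomes 1
print(store.write("cart:7", version, ["pen"]))    # False, stale stamp 0 is rejected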
51 / 55
Consistency
Consistency
Read consistency (replication consistency)
Problem. read-write conflict
Write and read requests on the same aggregate are initiated
concurrently
Context. both master-slave and peer-to-peer architectures
Issue. inconsistent read
When not treated, an inconsistency window will exist
Propagation of changes to all the replicas takes some time
Until this process is finished, inconsistent reads may happen
Even the initiator of the write request may read wrong data!
Session consistency/read-your-writes/sticky session
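A minimal sketch of the read-your-writes idea: the session remembers the highest version it has written and skips replicas that have not caught up yet; the replica API and the versioning scheme are assumptions for the example.

class Replica:
    def __init__(self):
        self.data = {}                       # key -> (version, value)

    def write(self, key, value):
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)
        return version

    def read(self, key):
        return self.data.get(key, (0, None))

class Session:
    # Sticky read-your-writes: remember the highest version we wrote per key.
    def __init__(self, replicas):
        self.replicas = replicas
        self.last_seen = {}

    def write(self, key, value):
        # The write goes to one replica; propagation to the others is delayed.
        self.last_seen[key] = self.replicas[0].write(key, value)

    def read(self, key):
        for replica in self.replicas:
            version, value = replica.read(key)
            if version >= self.last_seen.get(key, 0):
                return value                 # this replica is new enough for us
        raise RuntimeError("no sufficiently up-to-date replica found")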
52 / 55
Consistency
Strong Consistency
How many nodes need to be involved to get strong consistency?
Write quorum. W > N /2
Idea. only one write request can get the majority
W = number of nodes successfully participating in the write
N = number of nodes involved in replication (replication factor)
Read quorum. R > N − W
Idea. every read overlaps with the latest successful write
R = number of nodes participating in the read
Should the retrieved replicas differ, the newest
version is determined and then returned
When a quorum is not attained → the request cannot be
handled
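As a sketch, a tiny helper that checks both quorum conditions for a given configuration; the tuples tested at the end include the N = 3 configurations discussed on the next slide.

def strongly_consistent(n, w, r):
    # Write quorum: only one concurrent write can obtain a majority of N replicas.
    write_quorum_ok = w > n / 2
    # Read quorum: every read set overlaps with the latest successful write set.
    read_quorum_ok = r > n - w
    return write_quorum_ok and read_quorum_ok

for n, w, r in [(3, 3, 1), (3, 2, 2), (3, 1, 1)]:
    print((n, w, r), strongly_consistent(n, w, r))
# (3, 3, 1) and (3, 2, 2) satisfy both conditions; (3, 1, 1) does not.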
53 / 55
Consistency
Strong Consistency
Examples for replication factor N = 3
Write quorum W = 3 and read quorum R = 1
All the replicas are always updated
Can read any one of them
Write quorum W = 2 and read quorum R = 2
Typical configuration, reasonable trade-off
Consequence
Quorums can be configured to balance read and write workload
The higher the required write quorum, the lower the read quorum
can then be
54 / 55
Consistency
Lecture Conclusion
There is a wide range of options influencing...
Scalability – how well does the entire system scale?
Availability – when may nodes refuse to handle user requests?
Consistency – what level of consistency is required?
Latency – how long does it take to handle user requests?
Durability – is the committed data written reliably?
Resilience – can the data be recovered in case of failures?
It is good to know these properties and choose the right trade-off
55 / 55