Big Data Management and NoSQL Databases: Doc. RNDr. Irena Holubova, Ph.D.

This document provides an overview of NoSQL databases and some of their basic principles. It discusses how NoSQL databases sacrifice some ACID properties to improve performance and scalability. The document also covers key concepts like the CAP theorem, eventual consistency, horizontal scaling, and approaches to data distribution including sharding and replication. It provides examples of how these concepts work and the tradeoffs involved in designing distributed database systems.

NDBI040

Big Data Management and NoSQL Databases

Lecture 4. Basic Principles

Doc. RNDr. Irena Holubova, Ph.D.

[email protected]

http://www.ksi.mff.cuni.cz/~holubova/NDBI040/
NoSQL Overview
• Main objective: implement distributed state
  • Different objects stored on different servers
  • Same object replicated on different servers
• Main idea: give up some of the ACID properties
  • To improve performance
• Simple interface:
  • Write (= Put): needs to write all replicas
  • Read (= Get): may get only one
• Strong consistency → eventual consistency
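
The Put/Get interface above can be sketched in a few lines. This is only a minimal in-memory illustration: the class name `ReplicatedStore`, its methods, and the use of plain dictionaries as replicas are assumptions made for the example, not the API of any real NoSQL system.

```python
import random


class ReplicatedStore:
    """Toy key-value store: put() writes every replica, get() reads one."""

    def __init__(self, replica_count=3, seed=0):
        # Each replica is modelled as an independent dictionary (assumption).
        self.replicas = [{} for _ in range(replica_count)]
        self._rng = random.Random(seed)

    def put(self, key, value):
        # Write (= Put): the value has to reach all replicas.
        for replica in self.replicas:
            replica[key] = value

    def get(self, key):
        # Read (= Get): contacting a single replica is enough.
        replica = self._rng.choice(self.replicas)
        return replica.get(key)


store = ReplicatedStore()
store.put("a", 1)
value = store.get("a")
```

Because `put` updates every replica before returning, any replica chosen by `get` holds the latest value. Relaxing that requirement is what leads from strong to eventual consistency.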
Basic Principles
• Scalability
  • How to handle growing amounts of data without losing performance
• CAP theorem
• Distribution models
  • Sharding, replication, consistency, …
  • How to handle data in a distributed manner
Scalability
Vertical Scaling (scaling up)
• Traditional choice has been in favour of strong consistency
  • System architects have in the past gone in favour of scaling up (vertical scaling)
  • Involves larger and more powerful machines
• Works in many cases but…
• Vendor lock-in
  • Not everyone makes large and powerful machines
  • Those who do often use proprietary formats
  • Makes a customer dependent on a vendor for products and services
  • Unable to use another vendor
Scalability
Vertical Scaling (scaling up)
• Higher costs
  • Powerful machines usually cost a lot more than commodity hardware
• Data growth perimeter
  • Powerful and large machines work well until the data grows to fill them
  • Even the largest of machines has a limit
• Proactive provisioning
  • Applications have no idea of the final large scale when they start out
  • Scaling vertically = you need to budget for large scale upfront
Scalability
Horizontal Scaling (scaling out)
• Systems are distributed across multiple machines or nodes (horizontal scaling)
  • Commodity machines, cost effective
  • Often surpasses scalability of vertical approach
• Fallacies of distributed computing:
  • The network is reliable
  • Latency is zero
  • Bandwidth is infinite
  • The network is secure
  • Topology does not change
  • There is one administrator
  • Transport cost is zero
  • The network is homogeneous
https://blogs.oracle.com/jag/resource/Fallacies.html
CAP Theorem
Consistency
• After an update, all readers in a distributed system see the same data
• All nodes are supposed to contain the same data at all times
• Example:
  • A single database instance is always consistent
  • If multiple instances exist, all writes must be duplicated before the write operation is completed
CAP Theorem
Availability
• All requests (reads, writes) are always answered, regardless of crashes
• Example:
  • A single instance has an availability of 100% or 0%
  • Two servers may be available 100%, 50%, or 0%
Partition Tolerance
• The system continues to operate, even if two sets of servers get isolated
• Example:
  • A failed connection will not cause trouble if the system is partition tolerant
CAP Theorem
ACID vs. BASE
• Theorem: Only 2 of the 3 guarantees can be given in a “shared-data” system.
  • Proven in 2000, the idea is older
• (Positive) consequence: we can concentrate on two challenges
• ACID properties guarantee consistency and availability
  • pessimistic
  • e.g., database on a single machine
• BASE properties guarantee availability and partition tolerance
  • optimistic
  • e.g., distributed databases
CAP Theorem
Criticism
• Not really a “theorem”, since definitions are imprecise
  • The real proven theorem has more limiting assumptions
• CP makes no “sense”, because it suggests the system is never available
• No A vs. no C is asymmetric
  • No C = all the time
  • No A = only when the network is partitioned
CAP Theorem
Consistency
• A single-server system is a CA system
• Clusters have to be tolerant of network partitions
• CAP theorem: you can only get two out of three
• Reality: you can trade off a little Consistency to get some Availability
  • It is not a binary decision
BASE
• In contrast to ACID
  • Leads to levels of scalability that cannot be obtained with ACID
  • At the cost of (strong) consistency
Basically Available
• The system works basically all the time
• Partial failures can occur, but without total system failure
Soft State
• The system is in flux and non-deterministic
• Changes occur all the time
Eventual Consistency
• The system will be in some consistent state
• At some time in future
Strong Consistency
[Figure: timeline of operations per client]
John: read(a) = 1, write(a) = 2, read(a) = 2
George: read(a) = 1, read(a) = 2
Paul: read(a) = 1, read(a) = 2
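Eventual Consistency
[Figure: timeline of operations per client, with the inconsistency window marked]
John: read(a) = 1, write(a) = 2, read(a) = 1, read(a) = 2
Peter: read(a) = 1, read(a) = 2
Paul: read(a) = 1, read(a) = 1, read(a) = 2
Inconsistency window: the period after the write during which a read may still return 1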
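
The inconsistency window in the timeline above can be reproduced with a small simulation. This is a sketch under stated assumptions: the class `LazyReplicas`, its two-replica layout, and the explicit `propagate()` step are hypothetical stand-ins for asynchronous replication, not a model of any particular database.

```python
class LazyReplicas:
    """Two replicas; a write lands on replica 0 and is copied to replica 1 later."""

    def __init__(self, initial):
        self.replicas = [dict(initial), dict(initial)]
        self.pending = []  # writes not yet copied to replica 1

    def write(self, key, value):
        # The write is acknowledged as soon as one replica has it.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica):
        return self.replicas[replica][key]

    def propagate(self):
        # Copying the pending writes closes the inconsistency window.
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()


db = LazyReplicas({"a": 1})
db.write("a", 2)
stale = db.read("a", replica=1)  # old value: still inside the window
db.propagate()
fresh = db.read("a", replica=1)  # new value: the replicas have converged
```

A strongly consistent system would, in effect, run the propagation step before acknowledging the write, so no client could ever observe the stale value.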
Distribution Models
• Scaling out = running the database on a cluster of servers
• Two orthogonal techniques to data distribution:
  • Replication – takes the same data and copies it over multiple nodes
    • Master-slave or peer-to-peer
  • Sharding – puts different data on different nodes
• We can use either or combine them
Distribution Models
Single Server
• No distribution at all
  • Run the database on a single machine
• It can make sense to use NoSQL with a single-server distribution model
  • Graph databases
    • The graph is “almost” complete → it is difficult to distribute it
Distribution Models
Sharding
• Horizontal scalability → putting different parts of the data onto different servers
• Different people are accessing different parts of the dataset
Distribution Models
Sharding
• The ideal case is rare
• To get close to it we have to ensure that data that is accessed together is clumped together
• How to arrange the nodes:
  a. One user mostly gets data from a single server
  b. Based on a physical location
  c. Distributed across the nodes with equal amounts of the load
• Many NoSQL databases offer auto-sharding
• A node failure makes the shard’s data unavailable
  • Sharding is often combined with replication
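
Arrangement (a) above, where one user mostly gets data from a single server, can be sketched with hash-based routing. The function names `shard_for` and `shard_for_user`, the choice of SHA-256, and the sample keys are all assumptions for illustration; real systems use their own partitioning schemes.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a string key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_for_user(user_id, record_id, shard_count):
    # Route by user only, so all records of one user land on the same shard
    # (data that is accessed together is clumped together).
    return shard_for(user_id, shard_count)


shards = [[] for _ in range(4)]
records = [("alice", "order-1"), ("alice", "order-2"), ("bob", "order-3")]
for user, record in records:
    shards[shard_for_user(user, record, len(shards))].append((user, record))
```

A plain modulo over the shard count is the simplest option, but it moves most keys when the number of shards changes, which is one reason production systems tend to use more elaborate schemes.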
Distribution Models
Master-slave Replication
• We replicate data across multiple nodes
• One node is designated as primary (master), others as secondary (slaves)
• Master is responsible for processing any updates to that data
Distribution Models
Master-slave Replication
• For scaling a read-intensive dataset
  • More read requests → more slave nodes
  • The master fails → the slaves can still handle read requests
  • A slave can be appointed a new master quickly (it is a replica)
• Limited by the ability of the master to process updates
• Masters are appointed manually or automatically
  • User-defined vs. cluster-elected
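
The behaviour described above (writes go through the master, reads are served by slaves, a slave can be promoted after a master failure) can be sketched as follows. The class `MasterSlaveCluster` and its methods are hypothetical, and replication is done synchronously here purely to keep the example short.

```python
class MasterSlaveCluster:
    """Toy master-slave cluster: one writable master, read-only slaves."""

    def __init__(self, slave_count=2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self._next = 0  # round-robin counter for reads

    def write(self, key, value):
        if self.master is None:
            raise RuntimeError("no master available for writes")
        self.master[key] = value
        for slave in self.slaves:  # synchronous copy (simplifying assumption)
            slave[key] = value

    def read(self, key):
        if not self.slaves:
            raise RuntimeError("no slave available for reads")
        slave = self.slaves[self._next % len(self.slaves)]
        self._next += 1
        return slave.get(key)

    def fail_master(self):
        self.master = None

    def promote(self):
        # Appoint a slave as the new master; it already holds a replica.
        if not self.slaves:
            raise RuntimeError("no slave left to promote")
        self.master = self.slaves.pop(0)


cluster = MasterSlaveCluster()
cluster.write("a", 1)
cluster.fail_master()
value_during_outage = cluster.read("a")  # slaves still serve reads
cluster.promote()
cluster.write("a", 2)
value_after_failover = cluster.read("a")
```

Note that every write still passes through a single node, which is the write bottleneck the slides mention.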
Distribution Models
Peer-to-peer Replication
• Problems of master-slave replication:
  • Does not help with scalability of writes
  • Provides resilience against failure of a slave, but not of a master
  • The master is still a bottleneck
• Peer-to-peer replication: no master
  • All the replicas have equal weight
Distribution Models
Peer-to-peer Replication
• Problem: consistency
  • We can write at two different places: a write-write conflict
• Solutions:
  • Whenever we write data, the replicas coordinate to ensure we avoid a conflict
    • At the cost of network traffic
    • But we do not need all the replicas to agree on the write, just a majority
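
The majority rule can be illustrated with a short sketch. The function `majority_write` and the `reachable` parameter (the indices of replicas the writer can currently contact) are assumptions for this example; a real system would add versioning and an actual coordination protocol.

```python
def majority_write(replicas, reachable, key, value):
    """Apply a write on the reachable replicas; succeed only with a majority."""
    needed = len(replicas) // 2 + 1
    targets = sorted({i for i in reachable if 0 <= i < len(replicas)})
    if len(targets) < needed:
        return False  # no majority: the write is rejected, nothing is stored
    for i in targets:
        replicas[i][key] = value
    return True


# Three peers split by a network partition into {0, 1} and {2}.
replicas = [{}, {}, {}]
ok_left = majority_write(replicas, reachable=[0, 1], key="a", value="left")
ok_right = majority_write(replicas, reachable=[2], key="a", value="right")
```

Only the side that can reach two of the three peers gets its write accepted, so two conflicting writes cannot both succeed.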
Distribution Models
Combining Sharding and Replication
• Master-slave replication and sharding:
  • We have multiple masters, but each data item only has a single master
  • A node can be a master for some data and a slave for others
• Peer-to-peer replication and sharding:
  • A common strategy for column-family databases
  • A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is present on three nodes
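
One simple way to place each shard on three nodes is to hash the key to a starting node and take the next nodes in a fixed order. This placement rule, the function `replica_nodes`, and the node names are assumptions chosen for illustration; they are not the scheme of any specific database.

```python
import hashlib


def replica_nodes(key, nodes, replication_factor=3):
    """Pick `replication_factor` consecutive nodes, starting at the key's hash."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds cluster size")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Wrap around the node list so every key gets distinct nodes.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = replica_nodes("user:42", nodes)
```

With five nodes and a replication factor of 3, each node holds a replica of only some shards, so the data is both sharded and replicated.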
Consistency
Write (update) Consistency
• Problem: two users want to update the same record (write-write conflict)
  • Issue: lost update
• Pessimistic solutions (prevent conflicts from occurring) vs. optimistic solutions (let conflicts occur, but detect them and take action to sort them out)
  • Write locks, conditional update, save both updates and record that they are in conflict, …
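
A conditional update, one of the optimistic techniques listed above, can be sketched with a per-record version number. The names `VersionedStore`, `conditional_update`, and `VersionConflict` are hypothetical; the point is only that an update is rejected when the record has changed since it was read.

```python
class VersionConflict(Exception):
    """Raised when a record changed between the read and the update."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        # Unknown keys are reported as version 0 with no value.
        return self._data.get(key, (0, None))

    def conditional_update(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._data[key] = (current_version + 1, value)
        return current_version + 1


records = VersionedStore()
records.conditional_update("record", 0, "initial")

# Two users read the same version, then both try to update it.
v_john, _ = records.read("record")
v_paul, _ = records.read("record")
records.conditional_update("record", v_john, "john's edit")
try:
    records.conditional_update("record", v_paul, "paul's edit")
    conflict_detected = False
except VersionConflict:
    conflict_detected = True
```

The second update fails instead of silently overwriting the first, so the lost update is detected and the caller can re-read and retry.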
Consistency
Read Consistency
• Problem: one user reads, other writes (read-write conflict)
  • Issue: inconsistent read
• Relational databases support the notion of transactions
• NoSQL databases support atomic updates within a single aggregate
  • But not all data can be put in the same aggregate
  • Update that affects multiple aggregates leaves open a time when clients could perform an inconsistent read
    • Inconsistency window
• Another issue: replication consistency
  • A special type of inconsistency in case of replication
  • Ensuring that the same data item has the same value when read from different replicas
Consistency
Quorums
• How many nodes need to be involved to get strong consistency?
• Write quorum: W > N/2
  • N = the number of nodes involved in replication (replication factor)
  • W = the number of nodes participating in the write
    • The number of nodes confirming successful write
  • “If you have conflicting writes, only one can get a majority.”
• How many nodes you need to contact to be sure you have the most up-to-date change?
• Read quorum: R + W > N
  • R = the number of nodes we need to contact for a read
  • “Concurrent read and write cannot happen.”
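
The two quorum conditions are simple arithmetic and can be checked directly. The helper names below are hypothetical; N, W and R have the meanings given on the slide.

```python
def is_write_quorum(n, w):
    """Write quorum: W > N/2, i.e. a strict majority of the N replicas."""
    return w > n / 2


def is_read_quorum(n, w, r):
    """Read quorum: R + W > N, so every read set overlaps every write set."""
    return r + w > n


def min_read_nodes(n, w):
    """Smallest R that satisfies R + W > N for the given N and W."""
    return n - w + 1


# Replication factor 3 with majority writes.
n = 3
w = 2
r = min_read_nodes(n, w)
```

With N = 3 and W = 2, the smallest read quorum is R = 2, because 2 + 2 > 3. Any two nodes contacted for a read must then include at least one node that confirmed the latest write.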
References
• http://nosql-database.org/
• Pramod J. Sadalage – Martin Fowler: NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
• Eric Redmond – Jim R. Wilson: Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement
• Sherif Sakr – Eric Pardede: Graph Data Management: Techniques and Applications
• Shashank Tiwari: Professional NoSQL
