III Sharding Strategies

Data sharding and replication

Genoveva Vargas-Solar

French Council of Scientific Research (CNRS), LIG-LAFMIA, France
[email protected]
http://www.vargas-solar.com
NoSQL stores: availability and performance

- Replication
  - Copy data across multiple servers (each piece of data can be found on multiple servers)
  - Increases data availability
  - Faster query evaluation

- Sharding
  - Distribute different data across multiple servers
  - Each server acts as the single source of a data subset

- Replication and sharding are orthogonal techniques
Replication: pros & cons 3

n Data is more available n Increased updates cost


n Failure of a site containing E n Synchronisation: each replica
does not result in unavailability must be updated
of E if replicas exist
n Increased complexity of
n Performance concurrency control
n Parallelism: queries processed n Concurrent updates to distinct
in parallel on several nodes replicas may lead to
n Reduce data transfer for local inconsistent data unless
data special concurrency control
mechanisms are implemented
Sharding: why is it useful?

- Scaling applications by reducing the amount of data in any single database
- Segregating data
  - Sharing application data
  - Securing sensitive data by isolating it
- Improved read and write performance
  - A smaller data set per user group implies faster querying
  - Isolating data into smaller shards: the accessed data is more likely to stay in cache
  - More write bandwidth: writing can be done in parallel
  - Smaller data sets are easier to back up, restore and manage
- Massively parallel work
  - Parallel work: scale out across more nodes
  - Parallel backend: handling higher user loads
  - Share nothing: very few bottlenecks
- Decreased resilience, improved availability
  - If a box goes down, the others still operate
  - But: part of the data is missing

[Figure: load balancer in front of web servers and caches, with sharded MySQL masters for the resume database and the site database]
Sharding and replication

- Sharding with no replication: unique copy, distributed data sets
  - (+) Better concurrency levels (shards are accessed independently)
  - (-) Cost of checking constraints, of rebuilding aggregates
  - Ensure that queries and updates are distributed across shards

- Replication of shards
  - (+) Query performance (availability)
  - (-) Cost of updating, of checking constraints, complexity of concurrency control

- Partial replication (most of the time)
  - Only some shards are duplicated
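As a rough illustration of combining the two techniques, here is a minimal Python sketch of a shard map in which only some shards are replicated; the shard and node names are made up for illustration.

```python
# Minimal sketch: every shard lists the nodes that hold a copy of it;
# only some shards are replicated (partial replication).

SHARD_MAP = {
    "customers_eu": ["node1", "node4"],   # replicated shard (2 copies)
    "customers_us": ["node2", "node5"],   # replicated shard (2 copies)
    "audit_log":    ["node3"],            # unreplicated shard (single copy)
}

def nodes_for(shard: str):
    """Where reads can go; writes must reach every listed replica."""
    return SHARD_MAP[shard]
```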
Contact: Genoveva Vargas-Solar, CNRS, LIG-LAFMIA
[email protected]
http://www.vargas-solar.com/teaching/
NOSQL STORES: AVAILABILITY AND PERFORMANCE
Replication: master-slave

- Makes one node the authoritative copy (the master) that handles writes, while the slaves synchronize with the master and may handle reads
  - All updates are made to the master
  - Reads can be done from the master or the slaves
  - Changes propagate from the master to the slaves

- Helps with read scalability but does not help with write scalability
- Read resilience: should the master fail, the slaves can still handle read requests
- A master failure eliminates the ability to handle writes until either the master is restored or a new master is appointed
- The master is a bottleneck and a single point of failure
- The biggest complication is consistency
  - Possible write-write conflicts: attempts to update the same record at the same time from two different places

(By contrast, in peer-to-peer replication all replicas have the same weight, all of them can accept writes, and the loss of one of them does not prevent access to the data store.)
Master-slave replication management

- Masters can be appointed
  - Manually, when configuring the cluster of nodes
  - Automatically: when configuring a cluster of nodes, one of them is elected as master; the cluster can appoint a new master when the current master fails, reducing downtime

- Read resilience
  - Read and write paths have to be managed separately, so that a failure in the write path still allows reads to occur
  - Reads and writes are put on different database connections, if the database library supports it

- Replication inevitably comes with a dark side: inconsistency
  - Different clients reading from different slaves will see different values if changes have not been propagated to all slaves
  - In the worst case, a client cannot read a write it just made
  - Even if master-slave replication is only used for hot backups, if the master fails, any updates not yet passed on to the backup are lost
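As a rough illustration of keeping the read and write paths on separate connections, here is a minimal Python sketch; the `Database` helper and the connection objects are hypothetical placeholders, not the API of any particular driver.

```python
# Minimal sketch, assuming hypothetical connection objects with an
# `execute(statement, params)` method; not tied to any specific driver.

import random

class Database:
    """Separates the write path (master) from the read path (slaves)."""

    def __init__(self, master_conn, slave_conns):
        self.master = master_conn          # single authoritative copy
        self.slaves = list(slave_conns)    # read-only replicas

    def write(self, statement, params=()):
        # All updates go to the master; changes then propagate to the slaves.
        return self.master.execute(statement, params)

    def read(self, query, params=()):
        # Reads are spread across the slaves; if no slaves are configured,
        # fall back to the master.
        conn = random.choice(self.slaves) if self.slaves else self.master
        return conn.execute(query, params)
```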
Replication: peer-to-peer

- Allows writes to any node; the nodes coordinate to synchronize their copies
  - All nodes read and write all data
  - Nodes communicate their writes to each other
- The replicas have equal weight
- Deals with inconsistencies
  - Replicas coordinate to avoid conflicts, at a network traffic cost for coordinating writes
  - It is unnecessary to make all replicas agree on a write, only a majority
  - Survives the loss of a minority of the replica nodes
  - Alternatively, a policy to merge inconsistent writes
  - Full performance when writing to any replica
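A minimal sketch of the majority agreement mentioned above, assuming hypothetical replica objects with an `apply(key, value)` method:

```python
# Minimal sketch: a write succeeds once a majority of replicas acknowledge
# it, so the loss of a minority of nodes does not block writes.

def quorum_write(replicas, key, value):
    acks = 0
    for replica in replicas:
        try:
            replica.apply(key, value)   # hypothetical replica API
            acks += 1
        except ConnectionError:
            continue                    # an unreachable replica is tolerated
    return acks > len(replicas) // 2    # True only with a majority of acks
```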
Sharding

- Puts different data on separate nodes
  - Each shard reads and writes its own data
- Each user only talks to one server, so she gets rapid responses
- The load should be balanced out nicely between servers
- Ability to distribute both the data and the load of simple operations over many servers, with no RAM or disk shared among servers
- A way to horizontally scale writes and to improve read performance
- Requires application/data store support to ensure that
  - data that is accessed together is clumped together on the same node
  - the clumps are arranged on the nodes so as to provide the best data access
Sharding

Database laws
- Small databases are fast
- Big databases are slow
- Keep databases small

Principle
- Start with a big monolithic database
- Break it into smaller databases
- Spread them across many clusters
- Use a key value to decide where each record goes

Instead of having one million customers' information on a single big machine, put 100 000 customers on each of several smaller machines.
Sharding criteria

- Partitioning
  - Relational: handled by the DBMS (homogeneous DBMS)
  - NoSQL: based on ranges of the key value

- Federation
  - Relational
    - Combine tables stored in different physical databases
    - Easier with denormalized data
  - NoSQL
    - Store together data that are accessed together
    - Aggregates are the unit of distribution

Sharding

Architecture
- Each application server (AS) runs a database client
- Each shard server runs
  - a database server
  - replication agents and query agents that support parallel query functionality

Process
- Pick a dimension along which sharding is easy (customers, countries, addresses)
- Pick strategies that will last a long time, as repartitioning/re-sharding data is operationally difficult
- This is done according to two different principles
  - Partitioning: a partition is a structure that divides a space into two parts
  - Federation: a set of things that together compose a centralized unit but each individually maintains some aspect of autonomy

Example: customer data is partitioned by ID into shards, using an algorithm to determine which shard a customer ID belongs to.
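A minimal sketch of such an algorithm, with a made-up shard list; every application server can run the same function and agree on the placement without asking a central coordinator:

```python
# Minimal sketch: deterministically map a customer ID to one of the shards.

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # illustrative names

def shard_for_customer(customer_id: int) -> str:
    """Decide which shard a customer ID belongs to."""
    return SHARDS[customer_id % len(SHARDS)]

# All application servers compute the same answer for the same ID.
print(shard_for_customer(1234567))   # -> "shard-3"
```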
Replication: aspects to consider

- Conditioning factors to balance: performance, fault tolerance, transparency levels, availability

- Important elements to consider
  - Which data to duplicate
  - Where to place the copies
  - Duplication model (master-slave / peer-to-peer)
  - Consistency model (global vs. per copy)

→ Find a compromise!
PARTITIONING
A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TWO PARTS

Background: distributed relational databases
- External schemas (views) are often subsets of relations (e.g. contacts in Europe and contacts in America)

- Access is defined on subsets of relations: 80% of the queries issued in a region have to do with the contacts of that region

- Partitioning relations gives
  - a better concurrency level
  - fragments that are accessed independently

- Implications
  - Integrity constraints must be checked across fragments
  - Relations must be rebuilt for global queries

Fragmentation

- Horizontal
  - Groups tuples of the same relation (e.g. budget < 300 000, or budget >= 150 000)
  - Fragments that are not disjoint are more difficult to manage
- Vertical
  - Groups attributes of the same relation (e.g. separate budget from loc and pname in the project relation)
- Hybrid: a combination of both
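A minimal Python sketch of the two fragmentation styles above, on a made-up `project` relation (the attribute names budget, loc and pname follow the slide; the tuples are illustrative):

```python
# Minimal sketch of horizontal vs. vertical fragmentation.

projects = [
    {"pname": "CAD",   "budget": 120_000, "loc": "Grenoble"},
    {"pname": "Maps",  "budget": 280_000, "loc": "Paris"},
    {"pname": "Store", "budget": 310_000, "loc": "Lyon"},
]

# Horizontal fragments: groups of tuples of the same relation.
# These two fragments are NOT disjoint (budgets between 150 000 and
# 300 000 appear in both), which the slide warns is harder to manage.
frag_low  = [t for t in projects if t["budget"] < 300_000]
frag_high = [t for t in projects if t["budget"] >= 150_000]

# Vertical fragments: groups of attributes of the same relation
# (budget separated from loc, with the key pname kept in both).
frag_budget = [{"pname": t["pname"], "budget": t["budget"]} for t in projects]
frag_info   = [{"pname": t["pname"], "loc": t["loc"]} for t in projects]
```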
Fragmentation: rules

Vertical
- Clustering: grouping elementary fragments (e.g. budget and location information kept in two relations)
- Splitting: decomposing a relation according to affinity relationships among attributes

Horizontal
- Tuples of the same fragment must be statistically homogeneous
  - If t1 and t2 are tuples of the same fragment, then t1 and t2 have the same probability of being selected by a query
- Keep important conditions
  - If tuples where budget >= 150 000 are more likely to be selected, then this condition is a good candidate
- Complete
  - Every tuple (attribute) belongs to a fragment (no information loss)
- Minimum
  - If no application distinguishes between budget >= 150 000 and budget < 150 000, then these conditions are unnecessary

Sharding: horizontal partitioning

- The entities of a database are split into two or more sets (by row)
- In relational stores: the same schema on several physical databases/servers
  - Example: partition contacts into Europe and America shards, where the zip code indicates in which shard they will be found
  - Efficient if there exists some robust and implicit way to identify in which partition a particular entity is to be found
- Sharding as a last resort
  - Needs a sharding function: modulo, round robin, hash-partition, range-partition

[Figure: load balancer in front of web servers and caches; two MySQL masters, each with its own slaves, holding the odd-ID and the even-ID shards]
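Minimal sketches of two of the sharding functions named above, hash-partition and range-partition (modulo was illustrated earlier); the shard count and the zip-code boundary are made-up values:

```python
# Minimal sketch, with illustrative constants only.

import hashlib

NUM_SHARDS = 2  # e.g. the odd/even split shown in the figure

def hash_shard(key: str) -> int:
    # Hash-partition: a stable hash spreads arbitrary keys evenly.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(zip_code: str) -> str:
    # Range-partition: the zip code decides in which regional shard a
    # contact will be found (the boundary is purely illustrative).
    return "europe" if zip_code < "50000" else "america"
```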

FEDERATION
A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY

Federation: vertical sharding

- Principle
  - Partition data according to their logical affiliation
  - Put together data that are commonly accessed together
- The search load for the large partitioned entity can be split across multiple servers (logical and physical), and not only across multiple indexes on the same logical server
- Different schemas, systems and physical databases/servers
- Shards the components of a site, not only the data

[Figure: load balancer in front of web servers and caches; separate MySQL master/slave groups for the resume database and the site database, plus an internal user]
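A minimal sketch of routing by logical affiliation: each logical data set (here "resume" and "site", following the figure) gets its own backing database; the connection objects and the `FederatedRouter` helper are hypothetical:

```python
# Minimal sketch of vertical sharding (federation): different logical data
# sets live in different databases.

class FederatedRouter:
    """Routes each logical data set to its own backing database."""

    def __init__(self, connections):
        # e.g. {"resume": resume_db_conn, "site": site_db_conn}
        self.connections = connections

    def for_dataset(self, dataset: str):
        # All queries touching "resume" data go to the resume database,
        # all "site" data to the site database, and so on.
        return self.connections[dataset]
```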
NOSQL STORES: PERSISTENCY MANAGEMENT

«memcached»

- «memcached» is a memory management protocol based on a cache:
  - Uses the key-value notion
  - Information is completely stored in RAM

- The «memcached» protocol is used for:
  - Creating, retrieving, updating and deleting information from the database
  - Applications with their own «memcached» manager (Google, Facebook, YouTube, FarmVille, Twitter, Wikipedia)
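A minimal sketch of the create/retrieve/update/delete operations above, assuming the pymemcache client library and a memcached server running locally on the default port 11211:

```python
# Minimal sketch using pymemcache against a local memcached server.

from pymemcache.client.base import Client

client = Client(("localhost", 11211))

client.set("user:42", b"Alice")      # create / update a key-value pair in RAM
value = client.get("user:42")        # retrieve it (returns b"Alice")
client.delete("user:42")             # delete it
print(value)
```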
Storage on disc (1)

- For efficiency reasons, information is kept in RAM:
  - Working information is held in RAM in order to answer low-latency requests
  - Yet, this is not always possible or desirable

→ The process of moving data from RAM to disc is called "eviction"; this process is configured automatically for every bucket
Storage on disc (2)

- NoSQL servers support the storage of key-value pairs on disc:
  - Persistency: the store can be closed and reinitialized without having to load its data from another source
  - Hot backups: loaded data are stored on disc so that the store can be reinitialized in case of failure
  - Storage on disc: the disc is used when the quantity of data is larger than the physical size of the RAM; frequently used information is kept in RAM and the rest is stored on disc
Storage on disc (3)

- Strategies for ensuring persistence:
  - Each node maintains in RAM information about the key-value pairs it stores. Keys:
    - may not be found, or
    - may be stored in memory or on disc
  - The process of moving information from RAM to disc is asynchronous:
    - The server can continue processing new requests
    - A queue manages the requests to disc

→ In periods with many write requests, clients can be notified that the server is temporarily out of memory until information is evicted
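A minimal sketch of this asynchronous eviction, with a queue drained by a background worker; the file layout and the decision of when to evict are simplified assumptions:

```python
# Minimal sketch: writes land in RAM immediately, while a background worker
# drains a queue and persists items to disc off the hot path.

import json
import queue
import threading

ram = {}                       # working set kept in memory
to_disc = queue.Queue()        # pending RAM -> disc moves

def writer(key, value):
    ram[key] = value           # the server keeps answering requests
    to_disc.put((key, value))  # persistence happens later, asynchronously

def eviction_worker(path="store.jsonl"):
    with open(path, "a") as f:
        while True:
            key, value = to_disc.get()
            f.write(json.dumps({key: value}) + "\n")
            f.flush()
            to_disc.task_done()

threading.Thread(target=eviction_worker, daemon=True).start()
writer("user:42", "Alice")
to_disc.join()                 # wait for the queue to drain (for the example)
```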
NOSQL STORES: CONCURRENCY CONTROL

Multi-version concurrency control (MVCC)

- Objective: provide concurrent access to the database (and, in programming languages, to implement transactional memory)

- Problem: if someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data

- Locking: readers wait until the writer is done

- MVCC:
  - Each user connected to the database sees a snapshot of the database at a particular instant in time
  - Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed)
  - When an MVCC database needs to update an item of data, it marks the old data as obsolete and adds the newer version elsewhere → multiple versions are stored, but only one is the latest
  - Writes can be isolated by virtue of the old versions being maintained
  - Generally requires the system to periodically sweep through and delete the old, obsolete data objects
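A minimal Python sketch of the mechanism described above: every key keeps a list of timestamped versions, readers see the snapshot taken when they started, writers append new versions instead of overwriting, and the periodic cleanup of obsolete versions is omitted:

```python
# Minimal sketch of MVCC with a logical clock; not a production design.

import itertools

class MVCCStore:
    def __init__(self):
        self._clock = itertools.count(1)
        self._versions = {}  # key -> list of (commit_ts, value)

    def begin(self):
        # A reader's snapshot is simply the timestamp at which it started.
        return next(self._clock)

    def read(self, key, snapshot_ts):
        # Return the latest version committed at or before the snapshot.
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def write(self, key, value):
        # The old version stays in place; the new one is added elsewhere.
        commit_ts = next(self._clock)
        self._versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts

store = MVCCStore()
store.write("budget", 100_000)
snapshot = store.begin()
store.write("budget", 200_000)           # committed after the snapshot
print(store.read("budget", snapshot))    # -> 100000 (the snapshot value)
```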
