III Sharding Strategies
n Sharding
n Distribute different data across
multiple servers
n Each server acts as the single source
of a data subset
n Orthogonal techniques
Replication: pros & cons
n Replication of shards
n (+) Query performance (availability)
n (-) Cost of updating, of checking constraints, complexity of concurrency control
Contact: Genoveva Vargas-Solar, CNRS, LIG-LAFMIA
http://www.vargas-solar.com/teaching/
n All updates are made to the master
n Helps with read scalability but does not help with write scalability
n Read resilience
n Read and write paths have to be managed separately, so that reads can still occur when the write path fails
n Reads and writes are put in different database connections, if the database library supports it
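The separation of read and write paths can be sketched as follows. This is a minimal in-memory illustration, not a real driver: the `Node` and `ReplicatedStore` classes and the server names are hypothetical, and replication is done synchronously for simplicity.

```python
class Node:
    """In-memory stand-in for one database server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True


class ReplicatedStore:
    """Routes writes to the master and reads to the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def write(self, key, value):
        # Write path: every update goes through the master.
        if not self.master.up:
            raise ConnectionError("master is down: writes are unavailable")
        self.master.data[key] = value
        # Simplification: propagate synchronously to every reachable slave.
        for slave in self.slaves:
            if slave.up:
                slave.data[key] = value

    def read(self, key):
        # Read path: served by the first reachable slave, never by the master.
        for slave in self.slaves:
            if slave.up:
                return slave.data.get(key)
        raise ConnectionError("no slave available for reads")


master = Node("master")
store = ReplicatedStore(master, [Node("slave-1"), Node("slave-2")])
store.write("customer:1", "Ana")
master.up = False                                # failure in the write path
value_after_failure = store.read("customer:1")   # reads still succeed
```

Because the two paths use different connections, the failure of the master only blocks `write`; `read` keeps answering from the slaves.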
Replication: peer-to-peer
Sharding
n Partitioning
n Relational: handled by the DBMS (homogeneous DBMS)
n NoSQL: based on ranges of the key value
n Federation
n Relational
n Combine tables stored in different physical databases
n Easier with denormalized data
n NoSQL:
n Store together data that are accessed together
n Aggregates are the unit of distribution
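Partitioning on ranges of the key value can be illustrated with a small lookup. The boundaries and shard names below are assumptions chosen for the example; a real store would maintain them in its cluster metadata.

```python
from bisect import bisect_right

# Hypothetical range boundaries: keys below "h" go to shard-0,
# keys from "h" up to (not including) "p" go to shard-1, and so on.
BOUNDARIES = ["h", "p", "t"]
RANGE_SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(key: str) -> str:
    """Return the shard whose key range contains `key`."""
    return RANGE_SHARDS[bisect_right(BOUNDARIES, key)]
```

For instance, `shard_for("alice")` falls in the first range and `shard_for("zoe")` in the last one. Keys that are close in sort order land on the same shard, which favours range scans but can create hot spots.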
Sharding
Architecture
n Each application server (AS) is running DBS/client
n Each shard server is running a database server
Process
n Pick a dimension that helps sharding easily (customers, countries, addresses)
n Pick strategies that will last a long time, as repartitioning/re-sharding of data is operationally difficult
n This is done according to two different principles
Transparency levels vs. availability → find a compromise!
PARTITIONING
A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TWO PARTS
Background: distributed relational databases
n External schemas (views) are often subsets
of relations (contacts in Europe and
America)
n Relations partition
n Better concurrency level
n Fragments accessed independently
n Implications
n Check integrity constraints
n Rebuild relations
Fragmentation
n Horizontal
n Groups of tuples of the same relation
n Budget < 300 000 or >= 150 000
n Fragments that are not disjoint are more difficult to manage
n Vertical
n Groups attributes of the same relation
n Separate budget from loc and pname of the relation project
n Hybrid
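Both kinds of fragmentation can be tried on a toy version of the relation project. The tuples and the key attribute `pno` are assumptions made for this sketch; only `budget`, `loc` and `pname` come from the slide.

```python
# Hypothetical tuples of the relation project(pno, pname, budget, loc).
project = [
    {"pno": 1, "pname": "CRM", "budget": 100_000, "loc": "Paris"},
    {"pno": 2, "pname": "ERP", "budget": 200_000, "loc": "Lyon"},
    {"pno": 3, "pname": "BI", "budget": 400_000, "loc": "Grenoble"},
]

# Horizontal fragments: groups of tuples selected by a predicate.
low = [t for t in project if t["budget"] < 300_000]
high = [t for t in project if t["budget"] >= 150_000]

# The two predicates overlap, so some tuples live in both fragments.
overlap = [t["pno"] for t in low if t in high]

# Rebuilding is a union; duplicates must be removed because of the overlap.
rebuilt_h = sorted({t["pno"]: t for t in low + high}.values(),
                   key=lambda t: t["pno"])

# Vertical fragments: groups of attributes, each keeping the key pno.
v_budget = [{"pno": t["pno"], "budget": t["budget"]} for t in project]
v_descr = [{"pno": t["pno"], "pname": t["pname"], "loc": t["loc"]}
           for t in project]

# Rebuilding is a join on the key.
descr_by_key = {t["pno"]: t for t in v_descr}
rebuilt_v = [{**descr_by_key[t["pno"]], **t} for t in v_budget]
```

The overlap is what makes non-disjoint fragments harder to manage: an update to project 2 has to reach both horizontal fragments.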
Fragmentation: rules
Vertical
n Clustering
Horizontal
n Tuples of the same fragment must be statistically homogeneous
[Figure: two MySQL shards, one for odd IDs and one for even IDs, each with slaves 1 to n]
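The odd/even split of IDs shown in the figure amounts to a modulo on the identifier. A minimal sketch, where the shard names are hypothetical:

```python
# Hypothetical shard names for the two partitions of the figure.
ID_SHARDS = {0: "mysql-even", 1: "mysql-odd"}


def shard_for_id(customer_id: int) -> str:
    """Pick the shard from the parity of the ID."""
    return ID_SHARDS[customer_id % len(ID_SHARDS)]
```

With this rule `shard_for_id(42)` goes to the even shard and `shard_for_id(7)` to the odd one. Changing the number of shards changes the result of the modulo for most IDs, which is one reason re-sharding is operationally difficult.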
FEDERATION
A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH
INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY
FEDERATION: vertical sharding
n Principle
n Partition data according to their logical affiliation
n Put together data that are commonly accessed
n The search load for the large partitioned entity can be split across multiple servers (logical and physical) and not only according to multiple indexes in the same logical server
n Different schemas, systems, and physical bases/servers
n Shards the components of a site and not only data
[Figure: load balancer; Web 1 to 3; Cache 1 to 3; internal user; a resume database and a site database, each with a MySQL master and slaves 1 to n]
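Federation can be pictured as a routing table from a site component to its own database. The sketch below assumes two components named after the databases of the figure (resume and site); the server names and the round-robin choice of slave are hypothetical.

```python
# Hypothetical federation map: one independent database per component.
FEDERATION = {
    "resume": {"master": "resume-master",
               "slaves": ["resume-slave-1", "resume-slave-2"]},
    "site": {"master": "site-master",
             "slaves": ["site-slave-1", "site-slave-2"]},
}


def server_for(component: str, write: bool, request_id: int = 0) -> str:
    """Route a request to the database that owns the component."""
    db = FEDERATION[component]
    if write:
        return db["master"]
    # Reads are spread over the slaves of that component only.
    slaves = db["slaves"]
    return slaves[request_id % len(slaves)]
```

Each component can then evolve its schema, system and hardware independently of the others.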
Storage on disc (1)
n For efficiency reasons, information is stored in RAM:
n Working information is kept in RAM in order to answer requests with low latency
n Yet, this is not always possible or desirable
Ø The process of moving data from RAM to disc is called "eviction"; this process is configured automatically for every bucket
Storage on disc (2)
n NoSQL servers support the storage of key-value pairs on disc:
n Persistence: the server can be loaded with data, shut down and reinitialized without having to load the data from another source
n Hot backups: loaded data are stored on disc so that the server can be reinitialized in case of failures
n Storage on disc: the disc is used when the quantity of data is higher than the physical size of the RAM; frequently used information is maintained in RAM and the rest is stored on disc
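Keeping frequently used information in RAM and the rest on disc can be sketched with a least-recently-used policy. The policy and the `TieredStore` class are assumptions of this example, and a plain dictionary stands in for the disc.

```python
from collections import OrderedDict


class TieredStore:
    """Keeps the most recently used pairs in RAM, evicts the rest to disc."""

    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()   # most recently used keys are at the end
        self.disc = {}             # plain dict standing in for the disc

    def put(self, key, value):
        self.disc.pop(key, None)   # drop any stale copy on disc
        self.ram[key] = value
        self.ram.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)
            return self.ram[key]
        if key in self.disc:
            value = self.disc.pop(key)
            self.put(key, value)   # bring the pair back into RAM
            return value
        return None                # key not found

    def _evict(self):
        # Move the least recently used pairs to disc until RAM fits.
        while len(self.ram) > self.ram_capacity:
            old_key, old_value = self.ram.popitem(last=False)
            self.disc[old_key] = old_value


tiered = TieredStore(ram_capacity=2)
for k in ("a", "b", "c"):
    tiered.put(k, k.upper())
```

After the three writes only two pairs fit in RAM, so the oldest one has been moved to disc; reading it again brings it back and pushes another pair out.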
Storage on disc (3)
n Strategies for ensuring:
n Each node maintains in RAM information on the key-value pairs it stores.
Keys:
n may not be found, or
n they can be stored in memory or on disc
n The process of moving information from RAM to disc is asynchronous:
n The server can continue processing new requests
n A queue manages requests to disc
Ø In periods with many write requests, clients can be notified that the server is temporarily out of memory until information is evicted
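The behaviour described above can be simulated in a single thread. In this sketch `drain` stands in for the background process that empties the disc queue; the `CacheNode` class, its capacity and the `TemporaryOutOfMemory` error are hypothetical names, not the API of any particular server.

```python
from collections import deque


class TemporaryOutOfMemory(Exception):
    """Signals the client to retry once pending items have been evicted."""


class CacheNode:
    def __init__(self, ram_capacity):
        self.ram_capacity = ram_capacity
        self.ram = {}
        self.disc = {}
        self.disc_queue = deque()   # keys waiting to be moved to disc

    def put(self, key, value):
        if key not in self.ram and len(self.ram) >= self.ram_capacity:
            raise TemporaryOutOfMemory("retry after eviction")
        self.ram[key] = value
        self.disc_queue.append(key)  # returns at once; disc work is deferred

    def drain(self, max_items):
        """Background step: move up to max_items queued pairs to disc."""
        for _ in range(min(max_items, len(self.disc_queue))):
            key = self.disc_queue.popleft()
            if key in self.ram:
                self.disc[key] = self.ram.pop(key)

    def locate(self, key):
        """Tell whether a key is in RAM, on disc, or unknown to this node."""
        if key in self.ram:
            return "ram"
        if key in self.disc:
            return "disc"
        return "not found"


node = CacheNode(ram_capacity=2)
node.put("a", 1)
node.put("b", 2)
try:
    node.put("c", 3)            # RAM is full and nothing is evicted yet
    rejected = False
except TemporaryOutOfMemory:
    rejected = True
node.drain(max_items=1)         # the queue moves "a" to disc
node.put("c", 3)                # the retry now succeeds
```

A real server would run the drain step concurrently with request processing; the single-threaded version only shows the order of events.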
n Problem: if someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data
n MVCC (multiversion concurrency control):
n Each user connected to the database sees a snapshot of the database at a particular instant in time
n Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed)
n When an MVCC database needs to update an item of data, it marks the old data as obsolete and adds the newer version elsewhere → multiple versions are stored, but only one is the latest
n Writes can be isolated by virtue of the old versions being maintained
n Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects
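The versioning idea can be sketched with a store that keeps a chain of versions per key. This is a simplified illustration under several assumptions: a single global commit counter, no detection of write conflicts, and a hypothetical `MVCCStore` class.

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), oldest first
        self.clock = 0       # timestamp of the latest committed transaction

    def snapshot(self):
        """A reader remembers the commit timestamp at which it started."""
        return self.clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self.versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None

    def commit(self, writes):
        """Publish a dict of writes as new versions; old ones are kept."""
        self.clock += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.clock, value))

    def vacuum(self, oldest_active_snapshot):
        """Delete versions that no active snapshot can still see."""
        for key, chain in self.versions.items():
            visible = [v for v in chain if v[0] <= oldest_active_snapshot]
            keep_from = len(visible) - 1 if visible else 0
            self.versions[key] = chain[keep_from:]


db = MVCCStore()
db.commit({"x": 1})
reader = db.snapshot()              # the reader starts here
db.commit({"x": 2})                 # a writer commits a newer version
seen_by_reader = db.read("x", reader)
seen_by_new_reader = db.read("x", db.snapshot())
```

The older reader keeps seeing its snapshot while new readers see the latest version; `vacuum` plays the role of the periodic sweep once no snapshot needs the obsolete versions.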