CH 21
Database System Concepts - 7th Edition 21.2 ©Silberschatz, Korth and Sudarshan
Parallel/Distributed Data Storage History
▪ 1980s/1990s:
• Distributed database systems with tens of nodes
▪ 2000s:
• Distributed file systems with 1000s of nodes
▪ Millions of large objects (100s of megabytes)
▪ Web logs, images, videos, …
▪ Typically create/append only
• Distributed data storage systems with 1000s of nodes
▪ Billions to trillions of smaller (kilobyte to megabyte) objects
▪ Social media posts, email, online purchases, …
▪ Inserts, updates, deletes
• Key-value stores
▪ 2010s: Distributed database systems with 1000s of nodes
I/O Parallelism
▪ Reduce the time required to retrieve relations from disk by partitioning the
relations on multiple disks, on multiple nodes (computers)
• Our description focuses on parallelism across nodes
• Same techniques can be used across disks on a node
▪ Horizontal partitioning – tuples of a relation are divided among many
nodes such that some subset of the tuples resides on each node.
• Contrast with vertical partitioning, e.g. r(A,B,C,D) with primary key A
into r1(A,B) and r2(A,C,D)
• By default, the word partitioning refers to horizontal partitioning
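The difference between the two kinds of partitioning can be sketched in a few lines of Python. This is only an illustration: the relation contents, the helper names `horizontal_partition` and `vertical_partition`, and the choice of splitting tuples by position are assumptions made for the example, not part of the slides.

```python
# Relation r(A, B, C, D) with primary key A, held as a list of dicts (made-up data).
r = [
    {"A": 1, "B": "x", "C": 10, "D": True},
    {"A": 2, "B": "y", "C": 20, "D": False},
    {"A": 3, "B": "z", "C": 30, "D": True},
    {"A": 4, "B": "w", "C": 40, "D": False},
]

def horizontal_partition(tuples, n_nodes):
    """Divide whole tuples among nodes (here simply by position)."""
    parts = [[] for _ in range(n_nodes)]
    for i, t in enumerate(tuples):
        parts[i % n_nodes].append(t)
    return parts

def vertical_partition(tuples, attribute_groups):
    """Divide columns; every group keeps the key A so r can be rebuilt."""
    return [[{a: t[a] for a in group} for t in tuples]
            for group in attribute_groups]

h_parts = horizontal_partition(r, 2)
r1, r2 = vertical_partition(r, [("A", "B"), ("A", "C", "D")])

# Rebuild r from the vertical fragments by joining r1 and r2 on the key A.
r2_by_key = {t["A"]: t for t in r2}
rebuilt = [{**t, **r2_by_key[t["A"]]} for t in r1]
```

Each horizontal fragment holds complete tuples, while each vertical fragment holds every tuple but only some attributes.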
I/O Parallelism
Range Partitioning
I/O Parallelism (Cont.)
Comparison of Partitioning Techniques
Comparison of Partitioning Techniques (Cont.)
Round robin:
▪ Best suited for sequential scan of entire relation on each query.
• All nodes have almost an equal number of tuples; retrieval work is thus
well balanced between nodes.
▪ All queries must be processed at all nodes
Hash partitioning:
▪ Good for sequential access
• Assuming hash function is good, and partitioning attributes form a key,
tuples will be equally distributed between nodes
▪ Good for point queries on partitioning attribute
• Can look up a single node, leaving the others available for answering other
queries.
▪ Range queries are inefficient: they must be processed at all nodes
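The round-robin and hash comparisons above can be made concrete with a small Python sketch. It is illustrative only: `zlib.crc32` stands in for "a good hash function", nodes are numbered from 0, and the function names are assumptions made for this example.

```python
import zlib

def round_robin_node(i, n):
    """The i-th tuple in insertion order (0-based) goes to node i mod n."""
    return i % n

def hash_node(key, n):
    """Node chosen by hashing the partitioning attribute value.
    crc32 is only a stand-in for a good hash function."""
    return zlib.crc32(repr(key).encode()) % n

def nodes_for_point_query(scheme, key, n):
    """Nodes to search for tuples whose partitioning attribute equals key."""
    if scheme == "hash":
        return {hash_node(key, n)}
    return set(range(n))  # round-robin: the tuple could be on any node

def nodes_for_range_query(scheme, lo, hi, n):
    """Neither scheme keeps nearby values together, so all nodes are searched."""
    return set(range(n))

n = 4
keys = list(range(1000, 1020))
rr = [[] for _ in range(n)]   # round-robin placement
hp = [[] for _ in range(n)]   # hash placement
for i, k in enumerate(keys):
    rr[round_robin_node(i, n)].append(k)
    hp[hash_node(k, n)].append(k)
```

With these definitions, a point query under hash partitioning touches one node, while any query under round-robin, and any range query under either scheme, touches all of them.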
Comparison of Partitioning Techniques (Cont.)
Range partitioning:
▪ Provides data clustering by partitioning attribute value.
• Good for sequential access
• Good for point queries on partitioning attribute: only one node needs to
be accessed.
▪ For range queries on partitioning attribute, one to a few nodes may need to
be accessed
• Remaining nodes are available for other queries.
• Good if result tuples are from one to a few blocks.
• But if many blocks are to be fetched, they are still fetched from one to
a few nodes, and potential parallelism in disk access is wasted
▪ Example of execution skew.
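A minimal Python sketch of range partitioning shows why point queries and short range queries touch only one or a few nodes. The convention used here is an assumption for the example: a partitioning vector of n-1 sorted cut points, nodes numbered from 0, and a value equal to a cut point placed on the higher node.

```python
from bisect import bisect_right

def range_node(value, vector):
    """Node for a value: count how many cut points are <= value."""
    return bisect_right(vector, value)

def nodes_for_range_query(lo, hi, vector):
    """Only the nodes whose ranges overlap [lo, hi] need to be accessed."""
    return list(range(range_node(lo, vector), range_node(hi, vector) + 1))

vector = [20, 40, 60]                          # three cut points, four nodes
point = range_node(25, vector)                 # a point query hits one node
narrow = nodes_for_range_query(25, 35, vector)
wide = nodes_for_range_query(0, 100, vector)
```

A narrow range stays on one node, which leaves the others free for other queries. A query that fetches many blocks from that same narrow range still runs on a single node, which is the execution skew noted above.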
Handling Small Relations
▪ Partitioning not useful for small relations which fit into a single disk block
or a small number of disk blocks
• Instead, assign the relation to a single node, or
• Replicate relation at all nodes
▪ For medium sized relations, choose how many nodes to partition across
based on size of relation
▪ Large relations typically partitioned across all available nodes.
Types of Skew
▪ Data-distribution skew: some nodes have many tuples, while others may
have fewer tuples. Could occur due to
• Attribute-value skew.
▪ Some partitioning-attribute values appear in many tuples
▪ All the tuples with the same value for the partitioning attribute end
up in the same partition.
▪ Can occur with range-partitioning and hash-partitioning.
• Partition skew.
▪ Imbalance, even without attribute-value skew
▪ Badly chosen range-partition vector may assign too many tuples to
some partitions and too few to others.
▪ Less likely with hash-partitioning
Types of Skew (Cont.)
▪ Note that execution skew can occur even without data distribution skew
• E.g. relation range-partitioned on date, and most queries access
tuples with recent dates
▪ Data-distribution skew can be avoided with range-partitioning by creating
balanced range-partitioning vectors
▪ We assume for now that partitioning is static, that is, the partitioning vector is
created once and not changed
• Any change requires repartitioning
• Dynamic partitioning allows the partitioning vector to be changed in a
continuous manner
▪ More on this later
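One way to build a balanced range-partitioning vector is to sort the partitioning-attribute values and take a cut point after every 1/n of them. The Python sketch below assumes that approach; the data, the helper names, and the convention that a value equal to a cut point goes to the higher partition are all made up for illustration.

```python
from bisect import bisect_right

def balanced_vector(values, n):
    """Return n-1 cut points that split the sorted values into n equal-sized runs."""
    s = sorted(values)
    return [s[(i * len(s)) // n] for i in range(1, n)]

def partition_sizes(values, vector):
    """Count how many values fall into each range partition."""
    sizes = [0] * (len(vector) + 1)
    for v in values:
        sizes[bisect_right(vector, v)] += 1
    return sizes

# Made-up, unevenly spread attribute values.
values = list(range(1, 9)) + [100, 200, 300, 400]

naive = [100, 200, 300]                  # equal-width cut points
balanced = balanced_vector(values, 4)    # cut points taken from the data

naive_sizes = partition_sizes(values, naive)
balanced_sizes = partition_sizes(values, balanced)
```

On this data the equal-width vector puts 8 of the 12 values in the first partition, while the data-driven vector gives every partition 3 values. Heavy duplication of a single value (attribute-value skew) cannot be fixed this way, since all tuples with that value land in the same partition.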
Handling Skew in Range-Partitioning
Histograms
▪ Histogram on attribute age of relation person
[Figure: histogram; vertical axis: frequency, with ticks from 10 to 50]
Virtual Node Partitioning
▪ Key idea: pretend there are several times (10x to 20x) as many virtual
nodes as real nodes
• Virtual nodes are mapped to real nodes
• Tuples partitioned across virtual nodes using range-partitioning
vector
▪ Hash partitioning is also possible
▪ Mapping of virtual nodes to real nodes
• Round-robin: virtual node i mapped to real node (i mod n)+1
• Mapping table: mapping table virtual_to_real_map[] tracks which
virtual node is on which real node
▪ Allows skew to be handled by moving virtual nodes from more
loaded nodes to less loaded nodes
▪ Both data distribution skew and execution skew can be handled
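The two-level mapping can be sketched in Python as follows. The round-robin formula and the name `virtual_to_real_map` come from the slide; the node counts, the cut points, the 0-based numbering of virtual nodes, and the helper names are assumptions for the example.

```python
from bisect import bisect_right

N_REAL = 3
N_VIRTUAL = 12   # several times as many virtual nodes as real nodes

# Range-partitioning vector over virtual nodes: 11 cut points, 12 virtual nodes.
vector = [10 * i for i in range(1, N_VIRTUAL)]

def virtual_node(value):
    """Virtual node (0-based) that holds a partitioning-attribute value."""
    return bisect_right(vector, value)

def round_robin_real(i, n):
    """Initial placement: virtual node i on real node (i mod n) + 1."""
    return (i % n) + 1

# Mapping table: index is the virtual node, entry is the real node (1..n).
virtual_to_real_map = [round_robin_real(i, N_REAL) for i in range(N_VIRTUAL)]

def real_node(value):
    """Route a value through the virtual node to the real node."""
    return virtual_to_real_map[virtual_node(value)]

def move_virtual_node(v, new_real):
    """Handle skew by reassigning one virtual node; only the table changes."""
    virtual_to_real_map[v] = new_real
```

For example, calling `move_virtual_node(4, 3)` reroutes every value in the range of virtual node 4 to real node 3 without touching the partitioning vector.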
Handling Skew Using Virtual Node Partitioning
▪ Basic idea:
• If any normal partition would have been skewed, it is very likely the
skew is spread over a number of virtual partitions
• Skewed virtual partitions tend to get spread across a number of nodes,
so work gets distributed evenly!
▪ Virtual node approach also allows elasticity of storage
• If relation size grows, more nodes can be added and virtual nodes
moved to new nodes
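Elasticity can be illustrated with a simple greedy sketch in Python: when a real node is added, virtual nodes are moved to it from the most loaded node until the new node is close to the average load. The policy, the load figures, and the function names are assumptions for the example, not a method prescribed by the slides.

```python
def node_loads(vmap, vload):
    """Total load per real node, given virtual-to-real map and per-virtual-node load."""
    loads = {}
    for v, real in vmap.items():
        loads[real] = loads.get(real, 0) + vload[v]
    return loads

def add_real_node(vmap, vload, new_node):
    """Greedy sketch: move virtual nodes from the most loaded real node to the
    new node while the new node stays at or below the average load."""
    vmap = dict(vmap)                      # leave the caller's map untouched
    loads = node_loads(vmap, vload)
    loads[new_node] = 0
    target = sum(vload.values()) / len(loads)
    while True:
        donor = max(loads, key=loads.get)
        if donor == new_node:
            break
        candidates = [v for v, real in vmap.items()
                      if real == donor and loads[new_node] + vload[v] <= target]
        if not candidates:
            break
        v = max(candidates, key=vload.get)  # move the heaviest one that fits
        vmap[v] = new_node
        loads[donor] -= vload[v]
        loads[new_node] += vload[v]
    return vmap

# Made-up example: 8 virtual nodes of load 10 each on real nodes 1 and 2.
vload = {v: 10 for v in range(8)}
vmap = {v: (v % 2) + 1 for v in range(8)}
new_map = add_real_node(vmap, vload, new_node=3)
```

In this example only two virtual nodes move, and the maximum load per real node drops from 40 to 30. The rest of the data stays where it was.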
Dynamic Repartitioning
Dynamic Repartitioning
▪ Virtual nodes in such a scheme are often called tablets
▪ Example of initial partition table and partition table after a split of tablet 6
and move of tablet 1
[Figure: partition table before and after the tablet split and the tablet move]
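A partition table with tablet split and tablet move operations might be sketched in Python as below. The table layout (one entry per tablet with the lowest key it covers), the tablet and node names, and the key values are invented for illustration and do not reproduce the figure on the slide. Keys are assumed to be no smaller than the first entry's low key.

```python
from bisect import bisect_right

# Each entry covers keys from its "low" value up to the next entry's "low".
table = [
    {"low": 0,   "tablet": "t1", "node": "n1"},
    {"low": 100, "tablet": "t2", "node": "n2"},
    {"low": 200, "tablet": "t3", "node": "n1"},
]

def find(table, key):
    """Return the partition-table entry whose key range contains key."""
    lows = [e["low"] for e in table]
    return table[bisect_right(lows, key) - 1]

def move_tablet(table, tablet, node):
    """Tablet move: only the node recorded for the tablet changes."""
    for e in table:
        if e["tablet"] == tablet:
            e["node"] = node
            return
    raise KeyError(tablet)

def split_tablet(table, tablet, split_key, new_tablet):
    """Tablet split: keys >= split_key go to a new tablet on the same node."""
    for i, e in enumerate(table):
        if e["tablet"] == tablet:
            table.insert(i + 1, {"low": split_key, "tablet": new_tablet,
                                 "node": e["node"]})
            return
    raise KeyError(tablet)

split_tablet(table, "t2", 150, "t4")   # t2 keeps [100, 150), t4 gets [150, 200)
move_tablet(table, "t1", "n3")         # t1 is now served by node n3
```

A split leaves the new tablet on the same node at first. A later move can then shift it to a less loaded node.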
Routing of Queries
Replication
Basics: Data Replication
▪ Location of replicas
• Replication within a data center
▪ Handles machine failures
▪ Reduces latency if copy available locally on a machine
▪ Replication within/across racks
• Replication across data centers
▪ Handles data center failures (power, fire, earthquake, …), and
network partitioning of an entire data center
▪ Provides lower latency for end users if copy is available on nearby
data center
Updates and Consistency of Replicas
Protocols to Update Replicas
▪ Two-phase commit
• Coming up in Chapter 23
• Assumes all replicas are available
▪ Persistent messaging
• Updates are sent as messages with guaranteed delivery
• Replicas are updated asynchronously (after original transaction
commits)
▪ Eventual consistency
• Can lead to inconsistency on reads from replicas
▪ Consensus protocols
• Protocol followed by a set of replicas to agree on what updates to
perform in what order
• Can work even without a designated master
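The asynchronous option above can be illustrated with a toy Python sketch in which a primary copy commits locally and queues the update for its replicas. The class names are hypothetical, and the in-memory `deque` only stands in for a persistent message queue with guaranteed delivery. Two-phase commit and consensus protocols are not modelled here.

```python
from collections import deque

class Replica:
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

class Primary:
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.outbox = deque()   # stand-in for a persistent message queue

    def write(self, key, value):
        self.data[key] = value            # the transaction commits here
        self.outbox.append((key, value))  # replicas are updated later

    def deliver_pending(self):
        """Deliver queued updates to every replica, in commit order."""
        while self.outbox:
            key, value = self.outbox.popleft()
            for r in self.replicas:
                r.data[key] = value

replica = Replica()
primary = Primary([replica])
primary.write("x", 1)
stale = replica.read("x")     # update not delivered yet
primary.deliver_pending()
fresh = replica.read("x")     # replica has caught up
```

Between the commit and the delivery, a read from the replica returns an old value. This is the read inconsistency mentioned above, and it disappears once all messages have been applied.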
Parallel Indexing
▪ Local index
• Index built only on local data
▪ Global index
• Index built on all data, regardless of where it is stored
• Index itself is usually partitioned across nodes
▪ Global primary index
• Data partitioned on the index attribute
▪ Global secondary index
• Data partitioned on an attribute other than the index attribute
Global Primary and Secondary Indices
Global Secondary Index
• Partition r_i^s on K_i
• At each node containing a partition of r, create index on (K_p) if K_p is a
key, otherwise create index on (K_p, K_u)
• Update the relation r_i^s on any updates to r on attributes in r_i^s
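The steps above can be sketched in Python under some stated assumptions: r is partitioned on a key attribute K_p (called `id` here), the index relation stores (K_i, K_p) pairs with K_i called `dept`, and `zlib.crc32` stands in for the partitioning function. The attribute names, data, and helper functions are hypothetical.

```python
import zlib

N = 3  # number of nodes

def node_of(value):
    """Stand-in partitioning function (hash partitioning on any value)."""
    return zlib.crc32(repr(value).encode()) % N

# r is partitioned on id; each per-node dict doubles as the local index on id.
r_parts = [dict() for _ in range(N)]
# The index relation is partitioned on dept: dept -> set of ids.
ris_parts = [dict() for _ in range(N)]

def insert(t):
    r_parts[node_of(t["id"])][t["id"]] = t
    ris_parts[node_of(t["dept"])].setdefault(t["dept"], set()).add(t["id"])

def lookup_by_dept(dept):
    """One index partition yields the ids; each id then locates its tuple."""
    ids = ris_parts[node_of(dept)].get(dept, set())
    return [r_parts[node_of(i)][i] for i in sorted(ids)]

def update_dept(id_, new_dept):
    """An update to an indexed attribute of r must also update the index relation."""
    t = r_parts[node_of(id_)][id_]
    ris_parts[node_of(t["dept"])][t["dept"]].discard(id_)
    t["dept"] = new_dept
    ris_parts[node_of(new_dept)].setdefault(new_dept, set()).add(id_)

for row in [{"id": 1, "dept": "CS"}, {"id": 2, "dept": "EE"}, {"id": 3, "dept": "CS"}]:
    insert(row)
```

A lookup on `dept` reads one partition of the index relation and then only the partitions of r that hold matching tuples, rather than scanning every node.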
Distributed File Systems
Hadoop File System (HDFS)
Hadoop Distributed File System
Limitations of GFS/HDFS
Sharding
Sharding (recall from Chapter 10)
▪ Divide data amongst many cheap databases (MySQL/PostgreSQL)
▪ Manage parallel access in the application
• Partition tables map keys to nodes
• Application decides where to route storage or lookup requests
▪ Scales well for both reads and writes
▪ Limitations
• Not transparent
▪ Application needs to be partition-aware
▪ Application also needs to deal with replication
• (Not a true parallel database, since parallel queries and transactions
spanning nodes are not supported)
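Application-level routing of the kind described above might look like the following Python sketch. The class `ShardRouter`, the bucket-based partition table, and the shard names are hypothetical, and plain dicts stand in for the individual MySQL/PostgreSQL databases.

```python
import zlib

class ShardRouter:
    """Application-side routing: the application, not the database, picks the shard."""

    def __init__(self, partition_table):
        # partition_table maps bucket number (0..k-1) to a shard name.
        self.partition_table = partition_table
        # Each dict stands in for one independent database.
        self.shards = {name: {} for name in set(partition_table.values())}

    def shard_for(self, key):
        bucket = zlib.crc32(str(key).encode()) % len(self.partition_table)
        return self.partition_table[bucket]

    def put(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)

    def scan_all(self):
        """A query spanning shards has to be stitched together by the application."""
        for name in sorted(self.shards):
            yield from self.shards[name].items()

router = ShardRouter({0: "db_a", 1: "db_b", 2: "db_a", 3: "db_b"})
router.put("user:1", "alice")
router.put("user:2", "bob")
```

Every `put` and `get` goes to exactly one shard, which is why reads and writes scale. Anything that spans shards, such as `scan_all`, is left to the application, which is the lack of transparency listed under the limitations.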
Key Value Storage Systems
Typical Data Storage Access API
Data Storage Systems vs. Databases
Data Representation
Storing and Retrieving Data
Architecture of Key-Value Store
(modelled after Yahoo! PNUTS)
Geographically Distributed Storage
Index Structures in Key-Value Stores
Transactions in Key-Value Stores
Transactions in Key-Value Stores
Querying and Performance Optimizations
End of Chapter 21