5 Partitioning
What is partitioning?
In large systems we deal with so much data that a single table may grow too big,
or receive too many queries, for one machine to handle.
Partitioning splits the table into many chunks that are spread across different
database nodes.
Partitioning is often used in conjunction with replication.
Objectives of partitioning
On each node we want:
● A relatively similar amount of data
● A relatively similar amount of reads and writes to the data
Hash Range Partitioning
Take a hash of the key, and put it into the proper partition
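Example (figure): the key "jordan" goes through the hash function and comes out as
po23av, which falls in the partition covering the hash range f22476:r91mbb.
Partition hash ranges in the figure: :f22476, f22476:r91mbb, r91mbb: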
Hash Range Partitioning Tradeoffs
Pros:
● Keys are evenly distributed between nodes (assuming a good hash function)
Cons:
● Range queries on the partition key are no longer efficient: adjacent keys hash to
unrelated partitions, so every partition has to be checked
● A single key with a lot of activity will still lead to a hot spot
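To make the lookup and the range-query con concrete, here is a minimal Python sketch of hash range partitioning. The boundary values, the use of a truncated MD5 digest, and the names `BOUNDARIES`, `key_hash`, and `partition_for` are all assumptions chosen for illustration, not something a particular database prescribes.

```python
import bisect
import hashlib

# Hypothetical split points between partitions, as sorted 8-character hex hashes.
# Partition 0 holds hashes below "55555555", partition 1 holds hashes from
# "55555555" up to (not including) "aaaaaaaa", and partition 2 holds the rest.
BOUNDARIES = ["55555555", "aaaaaaaa"]

def key_hash(key: str) -> str:
    """Hash a key to a fixed-length lowercase hex string (illustrative choice: MD5)."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()[:8]

def partition_for(key: str) -> int:
    """Return the index of the partition whose hash range contains the key's hash."""
    return bisect.bisect_right(BOUNDARIES, key_hash(key))

# A contiguous range of keys is scattered over the partitions, which is why a
# range query on the partition key has to ask every partition.
keys = [f"user{i:04d}" for i in range(1000)]
touched = {partition_for(k) for k in keys}
```

Because neighbouring keys such as `user0001` and `user0002` hash to unrelated values, `touched` ends up containing every partition index even though the keys themselves are adjacent.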
Indexes in a partitioned database configuration
Recall: An index is additional metadata that maps values of a field to the
locations of the rows containing those values
Secondary Index
Fixed Number of Partitions
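In this example, we have 20 partitions no matter how many nodes there are
When a new node joins, take a few whole partitions (the white chunks in the figure)
from each existing server and pass them to the new server; the partitions themselves never change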
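The rebalancing step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the round-robin starting layout, the CRC32 key hash, the rule for picking which partition a donor gives up, and the names `partition_of`, `initial_assignment`, and `add_node` are all hypothetical.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 20  # fixed up front, as in the example above

def partition_of(key: str) -> int:
    """Key-to-partition mapping; it never changes when nodes join or leave."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def initial_assignment(nodes):
    """Spread the partitions over the nodes round-robin (illustrative choice)."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment, new_node):
    """Hand whole partitions to new_node until it holds roughly a fair share."""
    assignment = dict(assignment)
    num_nodes = len(set(assignment.values())) + 1
    target = NUM_PARTITIONS // num_nodes
    for _ in range(target):
        # Take one partition from whichever existing node currently holds the most.
        counts = Counter(n for n in assignment.values() if n != new_node)
        donor = counts.most_common(1)[0][0]
        partition = max(p for p, n in assignment.items() if n == donor)
        assignment[partition] = new_node
    return assignment

before = initial_assignment(["A", "B", "C", "D"])  # 5 partitions per node
after = add_node(before, "E")                      # 4 partitions per node
```

Only the partition-to-node table changes here; `partition_of` is untouched, so no key ever moves between partitions, only whole partitions move between nodes.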
Fixed Number of Partitions - Considerations
Choose a number of partitions that is reasonable:
● If too low, each partition will grow too big and we will not be able to scale the
application further (transferring a partition to another node will also take a
very long time)
● If too high, the fixed per-partition overhead on disk adds up
If the size of your dataset is going to vary significantly in the future, a fixed number of partitions may not be a good fit
Dynamic Partitioning
Certain databases will adjust partition ranges dynamically so that they can reduce
hot spots as the data access patterns change over time:
● Once a partition becomes too big, it is split into two pieces and one is
assigned to another node
● Automatic repartitioning can backfire: if the database incorrectly assumes a
node is down when the network is actually just slow, it will repartition and
put even more strain on the network
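The split step described above might look like the following Python sketch. It assumes key-range partitions kept in sorted order, a threshold counted in keys rather than bytes, and distinct keys; `MAX_KEYS`, `Partition`, `split`, and `insert` are hypothetical names used only for illustration.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional

MAX_KEYS = 4  # hypothetical split threshold; real systems would use a size in bytes

@dataclass
class Partition:
    lo: str                 # inclusive lower bound of the key range
    hi: Optional[str]       # exclusive upper bound, None means unbounded
    node: str               # node currently holding this partition
    keys: List[str] = field(default_factory=list)  # kept sorted

def split(p: Partition, other_node: str):
    """Cut a partition at its median key; the upper half goes to other_node."""
    mid = len(p.keys) // 2
    pivot = p.keys[mid]
    left = Partition(p.lo, pivot, p.node, p.keys[:mid])
    right = Partition(pivot, p.hi, other_node, p.keys[mid:])
    return left, right

def insert(partitions: List[Partition], key: str, spare_node: str) -> List[Partition]:
    """Insert a key into the partition covering it, splitting if it grows too big."""
    # Partitions are sorted by lower bound, so find the last one with lo <= key.
    idx = bisect.bisect_right([p.lo for p in partitions], key) - 1
    target = partitions[idx]
    bisect.insort(target.keys, key)
    if len(target.keys) > MAX_KEYS:
        partitions[idx:idx + 1] = split(target, spare_node)
    return partitions

partitions = [Partition("", None, "node-a")]
for k in ["a", "b", "c", "d", "e"]:
    insert(partitions, k, spare_node="node-b")
```

After the fifth insert the single partition exceeds the threshold and is split at `"c"`: `node-a` keeps the range up to `"c"` and `node-b` receives the range from `"c"` onward.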
Fixed number of partitions per node
● Each node has a certain number of partitions on it which grow in size
proportionally to the dataset
● If a new node joins the cluster it will split some of the partitions on existing
nodes into two pieces, and take those for itself
● This is very similar to the consistent hashing algorithm, and spreading the splits over
many partitions helps avoid unfair data splits
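A rough Python sketch of a node joining under this scheme is shown below. It assumes partitions are half-open ranges over a 32-bit hash space, that the new node picks the partitions to split at random, and that it takes the upper half of each; `PARTITIONS_PER_NODE`, `HASH_SPACE`, and `join` are hypothetical names.

```python
import random

PARTITIONS_PER_NODE = 4   # hypothetical fixed count per node
HASH_SPACE = 2 ** 32      # hypothetical size of the hash space

def join(partitions, new_node, rng):
    """Split PARTITIONS_PER_NODE randomly chosen partitions; new_node takes one half of each.

    partitions is a list of (start, end, node) half-open hash ranges in sorted order.
    """
    chosen = set(rng.sample(range(len(partitions)), PARTITIONS_PER_NODE))
    result = []
    for i, (start, end, node) in enumerate(partitions):
        if i in chosen:
            mid = (start + end) // 2
            result.append((start, mid, node))      # lower half stays where it was
            result.append((mid, end, new_node))    # upper half moves to the new node
        else:
            result.append((start, end, node))
    return result

rng = random.Random(0)
step = HASH_SPACE // PARTITIONS_PER_NODE
cluster = [(i * step, (i + 1) * step, "node-a") for i in range(PARTITIONS_PER_NODE)]
cluster = join(cluster, "node-b", rng)
cluster = join(cluster, "node-c", rng)
```

Every node ends up with the same number of partitions, and the total number of partitions grows with the number of nodes, so individual partitions stay a manageable size as the cluster grows.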
Sharding Summary
Unlike replication, which is always important to have (to increase availability),
partitioning adds a lot of complexity to a system and should generally be used only
when the dataset has gotten big enough that putting the whole table on a single
node is infeasible.
Index Choices:
Rebalancing Choices:
● Fixed number of partitions is simpler to reason about, but requires choosing a good number
● A changing number of partitions may scale better, but changing it automatically may lead to
unnecessary rebalancing and extra stress on our databases