
Partitioning PDF

This document discusses different techniques for partitioning data in a database to improve performance. It describes vertical partitioning, which splits a table into separate tables based on attribute groups, and horizontal partitioning, which divides a table into disjoint subsets of rows. The horizontal variants covered are range partitioning, which separates rows by ranges of a partition key such as date of birth, round robin partitioning, which assigns rows to partitions in turn, hash-based partitioning, and semantic partitioning. The goal of partitioning is to enable parallel processing of distinct data subsets to increase query speed.

Prof. Hasso Plattner

A Course in
In-Memory Data Management
The Inner Mechanics
of In-Memory Databases

August 30, 2013

This learning material is part of the reading material for Prof. Plattner’s
online lecture "In-Memory Data Management" taking place at
www.openHPI.de. If you have any questions or remarks regarding the
online lecture or the reading material, please send us a note at
openhpi-[email protected]. We are glad to further improve the material.
Chapter 9
Partitioning

9.1 Definition and Classification

Partitioning is the process of dividing a logical database into distinct,
independent datasets. Partitions are database objects themselves and can be
managed independently. The main reason to apply data partitioning is to
achieve data-level parallelism, which enables performance gains. A classic
example is a multi-core CPU that processes several distinct data areas in
parallel, with each core working on a separate partition. Since partitioning
is applied as a technical step to increase query speed, it should be
transparent¹ to the user. To ensure this transparency for the end user, a
view is required that shows the complete table as a union of the query
results from all involved partitions. With data-level parallelism it is
possible to increase the performance, availability, or manageability of
datasets. Which of these sometimes contradicting goals is favored usually
depends on the actual use case. Two short examples are given in Section 9.4.
Because data partitioning is a classical NP-complete² problem, finding the
best partitioning is a complicated task, even if the desired goal has been
clearly outlined [Kar72]. There are two main types of data partitioning,
horizontal and vertical partitioning, which are covered in detail in the
following sections.

¹ Transparent in IT means that something is completely invisible to the
user, not that the user can inspect the implementation through the cover.
Except for their effects, such as improvements in speed or usability,
transparent components should not be noticeable at all.
² NP-complete means that the problem is in NP and is at least as hard as
every other problem in NP. No polynomial-time algorithm is known for any
NP-complete problem.
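
The idea of data-level parallelism behind a transparent view can be
illustrated with a minimal Python sketch. The partition contents and the
names full_table_view, count_matching, and parallel_count are assumptions
made for this illustration and are not taken from any particular database.
Threads are used only to show the structure; a real engine would assign
partitions to cores or servers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Hypothetical partitions of one logical table; each row is (id, country).
partitions = [
    [(1, "Germany"), (5, "Japan")],
    [(2, "USA"), (6, "Spain")],
    [(3, "UK")],
    [(4, "Israel")],
]

def full_table_view(parts):
    """Present all partitions as one logical table (union of all partitions)."""
    return chain.from_iterable(parts)

def count_matching(partition, predicate):
    """Work done by one worker on its own partition."""
    return sum(1 for row in partition if predicate(row))

def parallel_count(parts, predicate):
    """Run the same query on every partition in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partial_counts = pool.map(lambda p: count_matching(p, predicate), parts)
        return sum(partial_counts)
```

The caller only sees full_table_view and parallel_count, so the number and
layout of the partitions can change without affecting the query.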


9.2 Vertical Partitioning

Vertical partitioning splits the data into attribute groups with replicated
primary keys. These groups are then distributed across two (or more) tables.
Attributes that are usually accessed together should be in the same table in
order to increase join and materialization performance. Such optimizations
can only be applied if actual usage data exists, which is one reason why
application development should always be based on real customer data and
workloads.

  Original table:  ID | First Name | Last Name | DoB | Gender | City | Country

  Partition A:     ID | First Name | Last Name | DoB | Gender
  Partition B:     ID | City | Country

Fig. 9.1: Vertical Partitioning

In row-based databases, vertical partitioning is possible in general, but
is not a common approach. Column-based databases automatically support
vertical partitioning, since each column can be regarded as a possible
partition.
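
As a minimal sketch of the splitting described above, the following Python
code divides rows into attribute groups and replicates the primary key in
each group. The function names vertical_partition and reconstruct, and the
representation of rows as dictionaries, are assumptions chosen for
illustration only.

```python
def vertical_partition(rows, key, attribute_groups):
    """Split each row (a dict) into one table per attribute group,
    replicating the primary key in every group."""
    tables = [[] for _ in attribute_groups]
    for row in rows:
        for table, group in zip(tables, attribute_groups):
            table.append({key: row[key], **{attr: row[attr] for attr in group}})
    return tables

def reconstruct(tables, key):
    """Rebuild full rows by joining the partial tables on the primary key."""
    merged = {}
    for table in tables:
        for part in table:
            merged.setdefault(part[key], {}).update(part)
    return list(merged.values())

customers = [
    {"ID": 1, "First Name": "John", "Last Name": "Dillan",
     "DoB": "1943/05/12", "Gender": "m", "City": "Berlin", "Country": "Germany"},
    {"ID": 2, "First Name": "Peter", "Last Name": "Black",
     "DoB": "1982/06/02", "Gender": "m", "City": "Austin", "Country": "USA"},
]

# Same attribute groups as in Fig. 9.1.
person, address = vertical_partition(
    customers, "ID",
    [["First Name", "Last Name", "DoB", "Gender"], ["City", "Country"]],
)
```

A query that only needs City and Country can now read the smaller address
table, while reconstruct shows the join that is required whenever attributes
from both groups are needed.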

9.3 Horizontal Partitioning

Horizontal partitioning is used more often in classic row-oriented
databases. To apply this partitioning, the table is split into disjoint
tuple groups by some condition. There are several sub-types of horizontal
partitioning.

The first partitioning approach we present here is range partitioning, which
separates tables into partitions by a predefined partitioning key that
determines how individual data rows are distributed to different partitions.
The partition key can consist of a single key column or multiple key
columns. For example, customers could be partitioned based on their date of
birth. If one is aiming for four partitions, each partition would cover a
range of about 25 years³. Because the implications of the chosen partition
key depend on the workload, it is not trivial to find the optimal solution.
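
The date-of-birth example can be sketched in a few lines of Python. The
partition numbering follows the legend of Fig. 9.2 (Partition 4 holds the
youngest customers); the function names and the use of a reference date to
compute the age are assumptions made for this sketch.

```python
from bisect import bisect_left
from datetime import date

# Inclusive upper age bound of each range, in ascending order.
UPPER_BOUNDS = [25, 50, 75, 100]
# Partition numbers for those ranges: 4 = ages 0-25, ..., 1 = ages 76-100.
PARTITION_IDS = [4, 3, 2, 1]

def age_on(dob, today):
    """Age in full years on the given reference date."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

def range_partition(dob, today):
    """Return the partition number for a customer's date of birth."""
    age = age_on(dob, today)
    if not 0 <= age <= UPPER_BOUNDS[-1]:
        raise ValueError(f"age {age} is outside the partitioned range")
    return PARTITION_IDS[bisect_left(UPPER_BOUNDS, age)]
```

For instance, with 2013-08-30 as the reference date, a customer born on
1990/01/20 is 23 years old and falls into Partition 4. A query restricted to
one age range then only has to touch a single partition.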
³ Based on the assumption that the company’s customers are alive today and
are between 0 and 100 years old.

  Partitioning along the age:
    Partition 1:  76 to 100
    Partition 2:  51 to 75
    Partition 3:  26 to 50
    Partition 4:   0 to 25

  Customer rows shown in the figure:

  ID  First Name  Last Name  DoB         Gender  City       Country
  1   John        Dillan     1943/05/12  m       Berlin     Germany
  2   Peter       Black      1982/06/02  m       Austin     USA
  3   Nina        Burg       1952/12/12  w       London     UK
  4   Lucy        Sehan      1990/01/20  w       Jerusalem  Israel
  5   Ariel       Shiva      1984/07/18  w       Tokio      Japan
  6   Sharon      Lokida     1982/02/24  m       Madrid     Spain

Fig. 9.2: Range Partitioning

The second horizontal partitioning type is round robin partitioning. With
round robin, a partitioning server does not use any tuple information as
partitioning criteria, so there is no explicit partition key. The algorithm
simply assigns tuples turn by turn to each partition, which automatically
leads to an even distribution of entries and should support load-balancing
to some extent.

However, since specific entries might be accessed far more often than
others, an even workload distribution cannot be guaranteed. Improvements
from intelligent data co-location or appropriate data placement are not
leveraged, because the data distribution does not depend on the data, but
only on the insertion order.
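
The turn-by-turn assignment can be written as a short Python sketch. The
function name round_robin_partition is an assumption for illustration; rows
can be any objects, since their content is never inspected.

```python
from itertools import cycle

def round_robin_partition(rows, num_partitions):
    """Assign rows to partitions turn by turn, in insertion order."""
    partitions = [[] for _ in range(num_partitions)]
    # cycle() walks through the partitions endlessly; zip() stops once
    # all rows have been placed.
    for row, target in zip(rows, cycle(partitions)):
        target.append(row)
    return partitions
```

Applied to six rows with the IDs 1 to 6 and four partitions, this yields the
distribution of Fig. 9.3: IDs 1 and 5 in the first partition, 2 and 6 in the
second, 3 in the third, and 4 in the fourth.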

  Partition  ID  First Name  Last Name  DoB         Gender  City       Country
  1          1   John        Dillan     1943/05/12  m       Berlin     Germany
  1          5   Ariel       Shiva      1984/07/18  w       Tokio      Japan
  2          2   Peter       Black      1982/06/02  m       Austin     USA
  2          6   Sharon      Lokida     1982/02/24  m       Madrid     Spain
  3          3   Nina        Burg       1952/12/12  w       London     UK
  4          4   Lucy        Sehan      1990/01/20  w       Jerusalem  Israel

Fig. 9.3: Round Robin Partitioning

The third horizontal partitioning type is hash-based partitioning. Hash
partitioning uses a hash function⁴ to specify the partition assignment for
each row.
The main challenge for hash-based partitioning is to choose a good hash
function that implicitly achieves locality or access improvements.
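
A minimal Python sketch of hash-based assignment follows. It uses CRC32 from
the standard library as the hash function and takes the result modulo the
number of partitions; both choices, like the function names, are assumptions
for illustration and not the scheme of a specific database. Python's
built-in hash() is avoided because its result for strings can differ between
interpreter runs.

```python
import zlib

def hash_partition(key, num_partitions):
    """Map a string key to a partition index in [0, num_partitions)."""
    digest = zlib.crc32(key.encode("utf-8"))  # stable across runs
    return digest % num_partitions

def partition_rows(rows, key_of, num_partitions):
    """Distribute rows by hashing the key that key_of extracts from each row."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash_partition(key_of(row), num_partitions)].append(row)
    return partitions
```

Rows with the same key value always land in the same partition, which is the
locality effect mentioned above. How evenly the rows spread depends on the
hash function and on the distribution of the key values.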

⁴ A hash function maps a potentially large amount of data with often variable length to
a smaller value of fixed length. In the figurative sense, hash functions generate a digital
fingerprint of the input data.

  Partition  ID  First Name  Last Name  DoB         Gender  City       Country  hash(Country)
  1          4   Lucy        Sehan      1990/01/20  w       Jerusalem  Israel   0x00
  2          1   John        Dillan     1943/05/12  m       Berlin     Germany  0x01
  3          3   Nina        Burg       1952/12/12  w       London     UK       0x03
  4          2   Peter       Black      1982/06/02  m       Austin     USA      0x02
  4          5   Ariel       Shiva      1984/07/18  w       Tokio      Japan    0x02

Fig. 9.4: Hash-Based Partitioning

The last partitioning type is semantic partitioning. It uses knowledge about
the application to split the data. For example, a database can be
partitioned according to the life-cycle of a sales order. All tables
required for the sales order represent one or more different life-cycle
steps, such as creation, purchase, release, delivery, or dunning of a
product. One possibility for suitable partitioning is to put all tables that
belong to a certain life-cycle step into a separate partition.
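
The grouping by life-cycle step can be sketched in Python as follows. The
table names are hypothetical and only serve as an illustration. As a
simplification, the sketch assumes that every table belongs to exactly one
step, although a table may represent several steps.

```python
from collections import defaultdict

# Hypothetical tables mapped to life-cycle steps of a sales order.
TABLE_TO_STEP = {
    "sales_order_header": "creation",
    "sales_order_item": "creation",
    "purchase_order": "purchase",
    "delivery_note": "delivery",
    "dunning_notice": "dunning",
}

def semantic_partition(table_to_step):
    """Group tables so that each life-cycle step forms one partition."""
    partitions = defaultdict(list)
    for table, step in table_to_step.items():
        partitions[step].append(table)
    return dict(partitions)
```

Unlike the other schemes, the assignment here is not computed from the data
but comes from application knowledge, which is encoded in the mapping.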

9.4 Choosing a Suitable Partitioning Strategy

There are a number of different optimization goals to be considered while
choosing a suitable partitioning strategy. For instance, when optimizing for
performance, it makes sense to have tuples of different tables that are
likely to be joined for further processing on one server. This way the join
can be done much faster due to optimal data locality, because there is no
delay for transferring the data across the network. In contrast, for
statistical queries like counts, tuples from one table should be distributed
across as many nodes as possible in order to benefit from parallel
processing.
To sum up, the best partitioning strategy depends very much on the
specific use case.
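
The join-locality case can be illustrated with a small Python sketch. It
assumes that both tables are distributed by a CRC32 hash of the join key, so
that matching tuples end up on the same node; this is one possible way to
achieve co-location, and the function names and tuple layouts are chosen for
illustration only.

```python
import zlib

def node_for(join_key, num_nodes):
    """Pick a node from the join key, identically for every table."""
    return zlib.crc32(str(join_key).encode("utf-8")) % num_nodes

def co_locate(rows, key_index, num_nodes):
    """Distribute the rows of one table by their join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[node_for(row[key_index], num_nodes)].append(row)
    return nodes

def local_joins(left_nodes, right_nodes, left_key, right_key):
    """Join node by node; no row has to be compared across nodes."""
    result = []
    for left_rows, right_rows in zip(left_nodes, right_nodes):
        index = {}
        for right_row in right_rows:
            index.setdefault(right_row[right_key], []).append(right_row)
        for left_row in left_rows:
            for right_row in index.get(left_row[left_key], []):
                result.append(left_row + right_row)
    return result
```

Because every join partner is found on the same node, each node can compute
its part of the join independently. For a count over a single table, by
contrast, any even distribution across many nodes is sufficient.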

9.5 References

[Kar72] R. Karp. Reducibility among combinatorial problems. In R. Miller
and J. Thatcher, editors, Complexity of Computer Computations, pages
85–103. Plenum Press, 1972.
