Enhancing Storage Efficiency and Performance: A Survey of Data Partitioning Techniques
JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 39(2): 346−368 Mar. 2024. DOI: 10.1007/s11390-024-3538-1
Abstract Data partitioning techniques are pivotal for optimal data placement across storage devices, thereby enhancing resource utilization and overall system throughput. However, the design of effective partition schemes faces multiple challenges, including considerations of the cluster environment, storage device characteristics, optimization objectives, and the balance between partition quality and computational efficiency. Furthermore, dynamic environments necessitate robust partition detection mechanisms. This paper presents a comprehensive survey structured around partition deployment environments, outlining the distinguishing features and applicability of various partitioning strategies while delving into how these challenges are addressed. We discuss partitioning features pertaining to database schema, table data, workload, and runtime metrics. We then delve into the partition generation process, segmenting it into initialization and optimization stages. A comparative analysis of partition generation and update algorithms is provided, emphasizing their suitability for different scenarios and optimization objectives. Additionally, we illustrate the applications of partitioning in prevalent database products and suggest potential future research directions and solutions. This survey aims to foster the implementation, deployment, and updating of high-quality partitions for specific system scenarios.
Keywords data partitioning, survey, partitioning feature, partition generation, partition update
The work was supported by the National Key Research and Development Program of China under Grant No. 2023YFB4503603,
the National Natural Science Foundation of China under Grant Nos. 62072460, 62076245, and 62172424, and the Beijing Natural
Science Foundation under Grant No. 4212022.
*Corresponding Author
horizontal partitioning (HP), vertical partitioning (VP), and irregular partitioning (IP), as detailed in Table 1. HP operates on a row-wise basis, keeping complete tuples within each partition, whereas VP functions column-wise, allowing incomplete yet consistent column data. IP, on the other hand, focuses on the data itself, without imposing strict restrictions on how it is partitioned. Thus, in terms of partition shape, both HP and VP divide the table space into rectangular areas, whereas IP allows partitions of arbitrary shapes, including rectangles. IP designs partition shapes tailored to query access patterns to achieve optimal query efficiency, making it ideal for online analytical processing (OLAP) and hybrid transactional/analytical processing (HTAP) applications. HP and VP also take partial record integrity into account to facilitate online transaction processing (OLTP), thereby making them suitable for any load scenario.

Data partitioning can be designed based on the database schema, data and load distribution, or a combination of these features. Schema-driven approaches examine the join relationships among tables to centrally allocate tuples involved in join operations. Data-driven approaches commonly employ domain and hash values of column values to create partitions. Query-driven approaches concentrate on mining nested filtering rules from queries to ensure each tuple is assigned to the most appropriate partition.

Other physical designs also significantly impact query latency, disk space usage, and more. To elucidate the role of partitioning, we next briefly describe how it differs from other design strategies.

Partition vs Storage Structure. Partitioning specifies which data should be stored in the same block file, while the storage structure determines how the data is organized within a block. For example, Parquet[1], a widely adopted column-store file format in HDFS (Hadoop Distributed File System)①, provides efficient data compression and encoding schemes to enhance the performance of read-intensive queries.

Partition vs Index. An index is an auxiliary data structure designed for quickly locating and retrieving tuples, such as 1-dimensional indexes (B-tree[2]) and n-dimensional indexes (KD-tree[3], R-tree[4]). However, its performance tends to degrade when handling high-dimensional data or certain types of queries. In contrast, partitioning performs well in these scenarios.

Partition vs Materialized View. Materialized view techniques[5, 6] adopt a space-for-time strategy, creating views that separate queried data copies from raw data and routing relevant queries to the most suitable view for faster execution. However, copying the complete query results requires additional storage space.

We present a detailed partitioning workflow and review a wide spectrum of existing partitioning studies. Some studies[7, 8] share a similar topic to ours; however, their focus lies on data-driven horizontal partitioning for specific environments (e.g., Hadoop clusters②). Our survey, in contrast, considers a broader range of generalized scenarios. We explore various partition types and place greater emphasis on partitioning requirements, design details, and the implementation process. We further delve into feature extraction and cost model design before partitioning, along with addressing the data and load update issues after partitioning.

This paper is organized as follows: Section 2 provides an overview of data partitioning, including its four-stage workflow and core modules. Sections 3–5 explore the development trajectory of partitioning, incorporating classical approaches to horizontal, vertical, and irregular partitioning, respectively. Section 6 summarizes the support for partitioning in industry-leading database products. Section 7 gives open problems in this field and potential solutions. Finally, we conclude the survey in Section 8.

2 Data Partitioning Overview

The partitioning workflow typically comprises four stages, as depicted in Fig.1. Stage 1, feature extraction, addresses the issue of what to use for partitioning.
Table 1. Comparison of Three Common Partition Types
Type | Partition Strategy | Partition Shape | OLTP | OLAP | HTAP
HP | Row-wise | Rectangular | ✔ | ✔ | ✔
VP | Column-wise | Rectangular | ✔ | ✔ | ✔
IP | Data-wise | Arbitrary | ✘ | ✔ | ✔
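As a toy illustration of the row-wise/column-wise distinction in Table 1, one might write the following sketch (the table contents, column names, and split predicate are invented for illustration; IP is omitted since its partitions have arbitrary shape):

```python
# A toy table of (id, name, age) tuples.
table = [(1, "Ann", 23), (2, "Bob", 31), (3, "Eve", 27)]

# Horizontal partitioning: row-wise, each partition keeps complete tuples.
hp = {0: [t for t in table if t[2] < 30],
      1: [t for t in table if t[2] >= 30]}

# Vertical partitioning: column-wise, each partition keeps entire columns;
# the id is replicated so tuples remain consistently reconstructable.
vp = {"ids_names": [(t[0], t[1]) for t in table],
      "ids_ages":  [(t[0], t[2]) for t in table]}

print(hp[0])           # → [(1, 'Ann', 23), (3, 'Eve', 27)]
print(vp["ids_ages"])  # → [(1, 23), (2, 31), (3, 27)]
```

Joining the two vertical partitions on id reconstructs the original tuples, which is the "incomplete yet consistent column data" property noted above.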
Fig.1. Data partitioning workflow. (a) Feature extraction. (b) Partition generation. (c) Partition deployment. (d) Partition update.
This stage entails analyzing the database (DB) schema, parsing representative queries, conducting column data statistics, and selecting system optimization metrics. Stage 2, partition generation, includes two subtasks: partition initialization, which quickly establishes initial partitions using a low-complexity algorithm, and partition optimization, where the initial solution is iteratively refined based on predefined cost models. Stage 3, partition deployment, involves routing data to partition files via automated write transactions based on the created partition structures. Stage 4, automatic partition update, adjusts partitions in a timely manner to sustain stable system performance amid data, load, and hardware resource uncertainties, which includes deciding update timings and formulating detailed update plans accordingly.

Consider a teaching system comprising three tables: student (S), course (C), and student course (SC). Before partitioning a table (e.g., S), we first analyze its entity-relationship (E-R) graph and common column data distributions, gathering query information and system metrics as necessary. Assuming the age column has been selected as the partition key, initial partitioning rules are derived from its value domain, and skewed partitions are further split according to the column histogram statistics. With the partitions, eight given tuples (T0, ..., T7) are distributed across three machines (M1, M2, M3). Subsequently, a service is established to continuously monitor the environment. When detecting an overload on M2, the partition boundaries for M2 ( [21, 23] ⇒ [21, 22] ) and M3 ( [24, 25] ⇒ [23, 25] ) are promptly adjusted, and a data migration plan is devised to move tuple T7 from M2 to M3.

Fig.2 displays a framework comprising five key modules used in the partitioning workflow. This survey concentrates on the modules highlighted in green.

1) Deployment Scenario. Partitioning optimization objectives, such as performance, manageability, and device costs, are greatly affected by system environments, user requirements, and the storage devices used. For instance, in a distributed database, partitioning tasks are more complex, necessitating the consideration of factors like multi-node clusters, node replicas, and network latency to ensure uniform partition access and reduce cross-node operations. Tables 2 and 3 offer categorizations and symbolic representations of common optimization objectives and database environments, respectively.

2) Partition Type. Before designing partitions, it is necessary to choose the partition type to use based on the given scenario, as shown in Table 1.

3) Cost Model. After identifying the deployment scenario and deciding the partition type, a cost model is created to assess the given partition scheme and its associated update plan. There are three types of cost estimation: optimizer-based models, simplifying cost design at the expense of accuracy; network-based learning models, offering high precision but requiring sufficient metric samples and extensive training overhead.
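A minimal sketch of the Stage-4 update in this running example might look as follows (only the M2/M3 boundary changes come from the text; the tuple ages and the single-key migration logic are assumptions made for illustration):

```python
# Stage 4 sketch: range boundaries are adjusted and a migration plan is
# derived by re-locating each tuple under the old and new boundaries.

def build_migration_plan(tuples, old_bounds, new_bounds):
    """Return (tuple_id, src, dst) moves implied by a boundary change."""
    def locate(age, bounds):
        for node, (lo, hi) in bounds.items():
            if lo <= age <= hi:
                return node
        return None

    plan = []
    for tid, age in tuples.items():
        src, dst = locate(age, old_bounds), locate(age, new_bounds)
        if src != dst:
            plan.append((tid, src, dst))
    return plan

old_bounds = {"M2": (21, 23), "M3": (24, 25)}
new_bounds = {"M2": (21, 22), "M3": (23, 25)}  # overloaded M2 shrinks
tuples = {"T6": 22, "T7": 23}                  # ages assumed for illustration

print(build_migration_plan(tuples, old_bounds, new_bounds))
# → [('T7', 'M2', 'M3')]
```

Under the assumed age of 23 for T7, the boundary shift alone implies exactly the migration described above: T7 moves from M2 to M3, while T6 stays put.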
Peng-Ju Liu et al.: Enhancing Storage Efficiency and Performance: Survey of Data Partitioning Techniques 349
Fig.2. Key modules in the partitioning workflow: the partition type to determine (horizontal, vertical, or irregular partitioning); the cost model to call (optimizer-based, function-based, or network-based); and the partition update module, comprising a monitoring service (query window (QW)-based, threshold (TH)-based, or control theory (CT)-based) and a data migration plan (random (RM)-based, rule (RE)-based, heuristic (HC)-based, RL-based, or MP-based).
Definition 1 (Static Horizontal Partitioning). Static horizontal partitioning aims to find a classifier ϕ(·) for a table with m tuples D = (e1, e2, ..., em) and n collected queries Q = (q1, q2, ..., qn). When a new tuple arrives, the classifier ϕ assigns it to the specified partition P in time, i.e., P = ϕ(e), ∀e ∈ D. The classifier partitions all tuples into k distinct partitions, represented as P = (P1, P2, ..., Pk), to achieve optimal system objectives such as low query latency and high system throughput. The total cost of process-

Database Schema. Depending on the given database schema, we can 1) classify tables into large/small ones based on the number of tuples, and static/dynamic ones based on data changes; 2) analyze the characteristics of numerical columns, including data type, constraints, indexes, and triggers; and 3) learn the foreign key relationships and constraints between tables to help construct co-partitions[19, 28, 32, 33].
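The classifier ϕ(·) of Definition 1 can be made concrete with a minimal sketch, assuming the simplest possible form of ϕ, a single-attribute range classifier (the boundaries and tuples below are invented; real systems learn ϕ from the query set Q):

```python
# Sketch of Definition 1: φ maps each tuple e to one of k partitions.

def make_classifier(boundaries):
    """Build φ for k = len(boundaries) + 1 range partitions on one key."""
    def phi(e):
        key = e[0]                     # assume the key is the first attribute
        for i, b in enumerate(boundaries):
            if key < b:
                return i
        return len(boundaries)         # last partition P_k
    return phi

phi = make_classifier([10, 20])        # k = 3 partitions
partitions = {0: [], 1: [], 2: []}
for e in [(5, "a"), (15, "b"), (25, "c")]:
    partitions[phi(e)].append(e)       # P = φ(e) for every arriving tuple

print(partitions)
# → {0: [(5, 'a')], 1: [(15, 'b')], 2: [(25, 'c')]}
```

The "in time" requirement of the definition is why ϕ must be cheap to evaluate per tuple; here it is a short linear scan over the boundary list.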
Fig.3. Timeline of HP research development, including general empirical-based approaches (round-robin, range, and hash), as well as on-axis studies (KD-tree[3], SOP[9], AQWA[10], Kangaroo[11], Ameoba[12], AdaptDB[13], QdTree[14], MTO[15], and PAW[16]) focused on centralized environments and off-axis studies (Rao[17], Agrawal06[18], REF[19], DYFRAM[20], Schism[21], MESA[22], Horticulture[23], DynPart[24], SWORD[25], E-Store[26], SOAP[27], PREF[28], Cumulus[29], Clay[30], NashDB[31], GPT[32], BaW[33], Advisor[34], and SAHARA[35]) on distributed environments.
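The three classical empirical strategies at the root of this timeline can be sketched as one-line assignment rules (the node count and range boundaries below are illustrative assumptions):

```python
# Sketch of round-robin, hash, and range partitioning over N nodes.
N = 3

def round_robin(i):
    """The i-th data row goes to the (i mod N)-th node."""
    return i % N

def hash_part(key):
    """Hash partitioning: suitable for unordered data."""
    return hash(key) % N

def range_part(key, bounds=(100, 200)):
    """Range partitioning over pre-defined boundaries."""
    for node, b in enumerate(bounds):
        if key < b:
            return node
    return len(bounds)

print([round_robin(i) for i in range(5)])   # → [0, 1, 2, 0, 1]
print(range_part(150))                      # → 1
```

Round-robin guarantees equi-sized partitions by construction, whereas range partitioning's quality depends entirely on how well the boundaries match the data distribution, matching the trade-offs discussed in Subsection 3.3.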
Table Data. When analyzing numerical column data, distribution types (e.g., uniform, skewed/hotspot[20, 23, 24, 26, 35], discrete) and domain statistical metrics (e.g., median[12, 13, 16], maximum, and minimum values[15, 16]) are considered. These can also be depicted using histogram technologies[15, 20].

Workload. In HP, important query logical features (e.g., filter conditions, join keys, operator cost estimates, and SQL keywords) and physical features (e.g., read-to-write ratios, occurrence frequencies, submission/completion times, and inserted/updated rows) can be extracted from query plans. Some studies count partition or tuple access frequencies[21, 25, 30] to identify hot and cold data[26, 35] by tracking query-tuple accesses. Furthermore, the load can be classified as either heavy or light based on the average query arrival rate.

Database Runtime Metric. OS-level metrics related to HP are chosen to monitor the database state, including resource usage (e.g., memory, CPU, disk), performance (e.g., query latency, throughput), and machine hotspots.

3.3 Partitioning Process in Centralized Databases

In this subsection, we discuss studies designed for centralized systems or those that neglect factors such as multi-node clusters, replicas, and network costs.

Empirical-Based. Range partitioning typically splits data based on a pre-defined range of values derived from partition keys. This method is suitable for data with prior statistics but requires careful selection of partition boundaries, which is difficult for large-scale datasets. Hash partitioning maps tuples to specific partitions using a hash function and is ideal for unordered data. Round-robin partitioning is a special type of hash partitioning that assigns data to the available N machine nodes in a circular fashion, i.e., assigning the i-th data row to the (i mod N)-th node, to ensure equi-sized balanced partitions. These traditional methods are data-driven and do not require prior load knowledge.

ML-Based. SOP[9] (Skipping-Oriented Partitioning) adopts the Apriori algorithm[36] to extract m representative filter predicates from the load, and converts each tuple into an m-bit one-hot feature vector with each bit indicating tuple-predicate satisfaction. These vectors are clustered into different blocks via the Ward algorithm[37], with each block generating a union vector (also known as a partition map) by performing bitwise OR operations on its vectors. These maps act as a classifier, partitioning new data and guiding incoming queries to skip unnecessary blocks. Kangaroo[11] utilizes grid and tree structures for partitioning. In a 2D table space, the grids are represented by two bit strings, with positions marked as 1 acting as the partition boundaries. Kangaroo then applies a genetic algorithm (GA) for partition initialization and merging, deriving the optimal partition scheme. Its tree-based approach replaces the grid with a tree representation within the GA process.

Greedy-Based. To address SOP's limitations, such as the exponential growth in execution time with more predicates, Yang et al.[14] proposed a greedily built query data routing tree (QdTree). QdTree is a binary tree created by selecting the predicate with the maximum split benefit as the split condition at each tree expansion step until no further splits are possible. Each leaf node maintains metadata for routing, with the path from root to leaf serving as the search process for assigning tuples to partitions. Ding et al.[15] extended QdTree to multi-table datasets with a multi-table optimizer (MTO), leveraging sideways information passing through joins. MTO periodically computes a reward value to decide the best repartition timing and then uses dynamic programming (DP) to find the optimal reorganization set of non-overlapping subtrees. Li et al.[16] proposed PAW (Partitioning Aware of Workload Variance), focusing on creating partitions adaptable to future load variances by scaling historical queries and employing multi-step splits to replace multiple one-step predicate splits in QdTree when splitting smaller nodes.

However, in a new environment where query logs are unavailable, query-driven physical design techniques become ineffective, leading to the database's cold start issue. Moreover, collecting representative queries is sometimes difficult; for instance, a study[13] on IoT startups revealed that, even after analyzing the first 80% of historical queries, the remaining 20% still contained 57% previously unseen queries. To tackle this issue, Aly et al.[10] developed adaptive query-workload-aware partitioning (AQWA). AQWA utilizes the KD-tree[3] structure to create initial partitions with an equal distribution of spatial points. It dynamically maintains update plans for all visited nodes, considering split gain and data migration costs. To support KNN queries, AQWA uses the MinDist and MaxDist indicators[38] along with virtual grid technology to compute query boundaries. Amoeba[12] initializes a heterogeneous binary tree, similar to a KD-tree, and dynamically modifies it for
incoming queries using three node update operations: swap, pushup, and rotate. AdaptDB[13] adapts Amoeba for join operations by splitting each Amoeba tree based on joined columns. It employs a greedy search strategy to co-partition joined blocks, yielding a hyper-join operation superior to shuffle-join. AdaptDB manages repartitioning via a fixed-length query window, refreshing the tree for new queries and reallocating old nodes.

Table 7 summarizes the horizontal partitioning techniques discussed above for centralized environments. The "Cost" column indicates whether a cost model is used or not. The "Deployment" column indicates whether the partitions have been deployed in a real database environment. The "Method Content" column uses various symbols to represent different partitioning stages: partition initialization (∇), partition optimization (〇), and partition update (⟳). These representations are applied to all subsequent tables.

3.4 Partitioning Process in Distributed Databases

Data-driven approaches are universally applicable to various database environments and can always achieve data balancing. However, the performance of query-driven approaches, tailored for E-CH/S environments, might be limited by new factors in E-DH/S environments. Thus, in this subsection, we introduce the studies specifically designed for distributed environments.

3.4.1 Disk Storage Environment

Optimizing data placement on hard and solid-state drives has greater potential for boosting system throughput, due to their slower read/write speeds compared with memory. Early partitioning studies[39–41] in E-DH/S environments relate to physical design tools offering layout suggestions for data and load balancing. However, they do not design a cost function for accurate evaluation of alternative solutions. Rao et al.[17] combined a rank-based method with cost estimations derived from query optimizer statistics to quickly recommend partition keys. Agrawal et al.[18] refined this by treating the workload as a sequence with temporal features, eliminating redundant and inefficient designs. Other similar studies[42, 43] utilize optimizer and load information, adopting greedy and heuristic-based strategies for effective partitioning. However, while the strategies mentioned above excel at large-scale data scans, they easily incur distributed (i.e., cross-node) calls during small transactions touching only a few tuples.

ML-Based. Schism[21] addresses this issue by minimizing distributed transactions. Fig.4 illustrates its partitioning process. 1) Data preparation: inputting table data and transaction information (omitted). 2) Partitioning: a hypergraph is created, with nodes representing tuples or tuple replicas. Replication edges connect a tuple to its replicas, while transaction edges connect all tuples accessed by the same transaction. A Metis partitioner[44] then splits the hypergraph into multiple balanced partitions with minimal cross-partition transactions. In the illustrated example with five tuples, we get partitions 0 and 1 after graph splitting. 3) Explanation and validation: decision trees are constructed based on tuple features within each partition to find predicate-based explanations for adapting new data. In Fig.4, the decision tree is
Table 7. Major Horizontal Partitioning Strategies for Centralized Environments
Category | Work | Baseline | Objective | Automatic | Cost | Deployment | Method Content
Empirical | Range, hash, round-robin | N/A | O1, O2 | ✘ | ✘ | ✔ | Partitioning by columns or data insertion order∇
ML | SOP[9] | SimpleRange | O3 | ✘ | ✔ | ✔ | Frequent itemset + Ward clustering∇
ML | Kangaroo[11] | Random | O4, O8 | ✘ | ✔ | ✔ | GA-based grid/tree generation∇; partition scheme initialization using DP〇
Greedy | AQWA[10] | Uniform grids | O4, O8 | M-TH+D-RE | ✔ | ✔ | Spatial data-based recursive KD-tree∇; greedy tree node split selection⟳
Greedy | Ameoba[12], AdaptDB[13] | FullScan, SOP[9] | O3, O5 | M-QW+D-RE/RM | ✔ | ✔ | Heterogeneous tree∇; heuristic group〇; predicate-based tree update⟳
Greedy | QdTree[14] | SOP[9] | O3 | ✘ | ✘ | ✔ | Greedy-based binary predicate tree∇
Greedy | MTO[15] | QdTree[14] | O3, O8 | M-TH+D-MP | ✘ | ✔ | QdTree∇; join-induced predicates〇; tree update using DP⟳
Greedy | PAW[16] | QdTree[14] | O3 | ✘ | ✘ | ✔ | Query deviation prediction + multi-group split∇; data replication〇
constructed with the a1 column serving as the decision point, using criteria such as a1 = 1, 2 ⩽ a1 < 4, and a1 ⩾ 4 for the decision branches. The leaf nodes indicate that tuples meeting the specified criteria are allocated to their respective partitions.

Fig.4. Graph partitioning process introduced in [21]. (a) Input: table data. (b) Hypergraph creation and partitioning. (c) Decision tree construction.

Nehme et al.[22] developed the MEMO-based search algorithm (MESA) for long-running analytical transactions touching large-scale tuples, whereas Schism adapts to small short-lived transactions. The MEMO structure is a search space for parallel query optimization. MESA generates MEMOs for each query and then quickly simulates and explores tree-style partition candidate configurations using a branch-and-bound strategy.

To adapt Schism to load changes, SWORD[25] compresses the hypergraph into virtual nodes, periodically monitors load variations, and sets a threshold on the distributed transaction ratio to determine repartition timings, employing virtual node swaps for incremental graph updates to minimize data movement. Cumulus[29] filters out infrequent transactions and predicts future transaction frequencies with an exponential moving average. It dynamically re-partitions data in a user-driven live migration to avoid potential hotspots, balancing the increase in repartitioning overhead against the decrease in distributed transaction costs.

Greedy-Based. DYFRAM[20] addresses the cold start problem by initially creating simple range partitions from equi-width data distribution histograms, then periodically evaluating whether to replicate partitions based on partition size limitations and cross-partition overheads. DynPart[24] is designed for continuously growing databases (e.g., observation and log data). As data volume increases, DynPart models the affinity between data and partitions based on given queries, proposing heuristic rules for efficiently distributing incoming data. Unlike SWORD's approach of isolating updated data during repartitioning, SOAP[27] integrates repartition operations into normal transactions for smooth partition management. SOAP employs a cost-based method to prioritize repartition transactions and utilizes a feedback model for scheduling their executions. NashDB[31] supports user-defined query prioritization and efficient resource use, combining economic models, dynamic programming, and the Munkres algorithm[45] to optimize node usage and minimize data migration costs.

Table 8 summarizes common horizontal partitioning techniques for distributed disk storage environments.

3.4.2 Distributed Partition Key Recommendation in Disk Storage Environments

Non-co-located joins cause excessive data transfer overhead among machine nodes, adversely affecting join performance. Co-partitioning tables using shared join keys can significantly reduce data shuffling. We term this problem Distributed Partition Key Recommendation (DKR). For example, in Spark SQL, data can be organized into multiple buckets according to the hash or range values of selected partition keys. Costa et al.[46] verified that creating a consistent number of buckets for join keys across two large tables can significantly boost join performance over traditional sort-merge joins.

Empirical-Based. When facing joins with reference constraints, the query executor requires copying partition keys and strategies from parent to child tables, and subsequently repeating partition merging, splitting, or key updates across all parent-child tables. Eadon et al.[19] proposed reference partitioning (REF), which enables partition maintenance operations performed on parent tables to be extended to child tables, ensuring that the migration of child tuples is handled as a single atomic operation when the partition key in the parent table is modified.

Greedy-Based. PREF[28] (Predicate-Based Reference Partitioning) improves REF by supporting co-partitioning of tables for any join predicate, not just foreign keys, through tuple duplication. A join graph is defined with each node denoting a table and each edge indicating a join over two tables. PREF assigns each edge a weight equal to the size of the smaller connected table, and extracts candidate key configurations from
Table 8. Major Horizontal Partitioning Strategies for Distributed Disk Storage Environments
Category | Work | Baseline | Objective | Automatic | Cost | Deployment | Method Content
ML | Schism[21] | Manual partitions | O1, O2, O6 | ✘ | ✘ | ✔ | Metis∇; decision tree〇
ML | MESA[22] | Rao et al.[17], Schism[21] | O3, O8 | ✘ | ✘ | ✘ | Memo-based search∇; pruning branch and bound tree〇
ML | SWORD[25] | Schism[21], simple hash partitions | O1, O2, O4, O7 | M-TH+D-HC | ✔ | ✔ | Graph compression/partition∇; node swapping/replication⟳
ML | Cumulus[29] | Schism[21] | O2, O7 | M-SD+D-RE | ✔ | ✔ | Multi-objective cost model; on-demand repartition⟳
Greedy | DYFRAM[20] | Optimal solution | O3, O7 | M-TH+D-HC | ✔ | ✘ | Histogram + rule-based replication/partitioning∇
Greedy | DynPart[24] | Schism[21] | O3, O7 | M-TH+D-HC | ✔ | ✘ | Single partition∇; affinity-based heuristic strategy〇
Greedy | SOAP[27] | SWORD[25] | O2, O4 | M-CT | ✔ | ✔ | PID-controller⟳
Greedy | NashDB[31] | SWORD[25], optimal solution | O2, O4, O5 | ✘ | ✔ | ✔ | Economic model∇; greedy Munkres algorithm⟳
each query, greedily merging them to minimize the graph weights. GPT[32] reduces the data redundancy in PREF. It first selects the vertices and edges to be added from the join graph by considering both the storage overhead and the shuffle-free query benefits, and then adopts multi-column partitioning to hash partition key values for each edge. BAW[33] (Best of All Worlds) is an assumption-free framework that uses exact integer linear programming and heuristic variants to transform the DKR problem into a graph matching problem, unlike prior studies[19, 28] that rely on many assumptions not generally applicable.

RL-Based. Hilprecht et al.[34] introduced a partition advisor using Q-learning[47] to automatically assess and recommend partition keys under varying loads. The advisor refines a network-centric cost model with actual runtimes and designs a training environment consisting of three parts. 1) State: a one-hot encoding of table attributes indicating whether the attribute at each position is a partition key. 2) Action: a candidate set that includes actions to replicate or (de-)activate edges between partition keys. 3) Reward function: utilizing the cost model to calculate the performance gain of each action as the reward, disregarding data migration overheads.

Table 9 summarizes the partition key recommendation techniques for distributed disk environments.

3.4.3 Main Memory Storage Environment

In modern OLTP systems with small, repetitive, and short-lived transactions, applications can keep their entire dataset in memory through widely shared server clusters, making it more feasible to develop new storage system prototypes than to add indexes to traditional disk-oriented DBMSs. H-Store[48] is such a main memory database that supports user-defined layout designs. The studies[23, 26, 30, 35] discussed below are all designed on H-Store, where network latency
Table 9. Major Distributed Partition Key Recommendation Strategies for Optimizing Join Operations
Category | Work | Baseline | Objective | Automatic | Cost | Deployment | Method Content
Empirical | REF[19] | N/A | O4, O12 | ✘ | ✘ | ✔ | Reference partitioning∇
Greedy | PREF[28] | REF[19] | O4, O12 | ✘ | ✔ | ✔ | Schema/query driven design∇
Greedy | GPT[32] | PREF[28] | O4, O12 | ✘ | ✔ | ✔ | Join graph + hash-based multi-column partitioning∇
Greedy | BAW[33] | Greedy matching | O4, O7 | ✘ | ✘ | ✔ | Integer linear programming∇; graph matching∇
DL | Hilprecht et al.[34] | PREF[28] | O4 | M-RL | ✔ | ✔ | Network-centric cost model + Q-learning algorithm⟳
and resource utilization have become critical factors. Reducing hardware expenses alongside improving
Horticulture[23] estimates the coordination and performance is also an important research topic. SA-
skew costs between machine nodes to achieve load balancing and reduce distributed transactions. To handle complex database schemas and larger numbers of partitions, it uses a large neighborhood search algorithm that converges to near-optimal partitioning solutions within a reasonable time overhead. However, it does not provide any partition update strategy. E-Store[26] dynamically reallocates resources to accommodate demand spikes and new transactions. It periodically collects metrics at the tuple, partition, and OS levels, identifies hot keys for hot tuple assignment, and evenly distributes cold data in large chunks across the remaining space. If CPU utilization exceeds a given threshold, E-Store scales cluster nodes and uses a two-tiered bin packing algorithm to optimize tuple-to-partition assignments. Clay[30] enhances E-Store by addressing the issue of tuples that span multiple blocks and are not colocated on the same cluster node. It adopts a two-tier partitioning with fine-grained mapping (Metis[44] for hypergraphs) for hot tuples and coarse-grained mapping (simple range/hash strategies) for cold tuples. When some partitions become overloaded, Clay employs a threshold-based sub-graph migration algorithm to update them. SAHARA[35] minimizes resource overhead while satisfying all performance objectives by leveraging query access skew to move cold data to cheaper storage layers, retaining only hot data in main memory.

Table 10 summarizes major horizontal partitioning techniques for distributed memory environments.

3.5 Cost Estimation for Horizontal Partition Scheme

Table 11 compares representative HP cost models. Notably, function-based cost models are prevalent, focusing on a wide range of elements including block skipping, join overhead, and hardware resources.

3.5.1 Centralized Environment

Most studies[9, 12-16] evaluate partition quality by calculating the number of scanned tuples using a skipping-based cost function. In the SOP[9] model, the given query set Q is initially encoded into n distinct feature vectors F = (F1, F2, ..., Fn). The number of queries satisfying Fi is represented as zi. A function f(P, Fi) returns the number of accessed tuples when
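The SOP description above is cut off mid-sentence by the source; under the common reading that each feature's accessed-tuple count f(P, Fi) is weighted by its query count zi, a skipping-based cost function can be sketched as follows (the aggregation form, names, and toy data are illustrative assumptions, not SOP's exact definition):

```python
# Hypothetical sketch of a skipping-based cost function in the spirit of SOP.
# A partition is skipped for feature F_i when no tuple in it can satisfy F_i;
# otherwise every tuple in that partition is counted as accessed.

def accessed_tuples(partitions, predicate):
    """f(P, F_i): tuples read when only non-skippable partitions are scanned."""
    total = 0
    for part in partitions:
        if any(predicate(t) for t in part):   # partition cannot be skipped
            total += len(part)                # the whole block is scanned
    return total

def workload_cost(partitions, features):
    """Sum of z_i * f(P, F_i) over all feature vectors (assumed aggregation)."""
    return sum(z * accessed_tuples(partitions, pred) for pred, z in features)

# Toy example: tuples are integers, features are range predicates.
P = [[1, 2, 3], [10, 11, 12], [20, 21, 22]]
features = [(lambda t: t < 5, 4),        # 4 queries match feature 1
            (lambda t: t >= 20, 2)]      # 2 queries match feature 2
print(workload_cost(P, features))        # 4*3 + 2*3 = 18
```

A good partition layout minimizes this sum by clustering tuples so that most partitions can be skipped for most features.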
Table 10. Major Horizontal Partitioning Strategies for Distributed Main Memory Environments

| Category | Work | Baseline | Objective | Automatic | Cost | Deployment | Method Content |
|---|---|---|---|---|---|---|---|
| Greedy | Horticulture[23] | Schism[21], manual partitions | O2, O6 | ✘ | ✔ | ✔ | Skew-aware model + large-neighborhood search∇ |
| Greedy | E-Store[26] | Optimal solution | O5 | M-TH+D-MP | ✘ | ✔ | Two-tiered partitioning∇; greedy/first-fit ⟳ |
| Greedy | SAHARA[35] | Unpartitioned state, DB-expert | O9, O11 | ✘ | ✔ | ✔ | Hot/cold data division∇; MaxMinDiff range partitionsΟ |
| ML | Clay[30] | E-Store[26], Metis[44] | O2, O5 | M-TH+D-RE | ✘ | ✔ | Tuple grouping + graph split∇; heuristic data migration plan ⟳ |
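The two-tiered hot/cold placement summarized for E-Store in Table 10 can be illustrated with a small sketch: hot tuples are placed greedily one by one on the least-loaded partition, and cold data is packed in large chunks with a first-fit policy. The data structures, weights, and capacity below are invented for illustration and do not reproduce E-Store's actual algorithm:

```python
# Illustrative sketch of a two-tiered placement in the spirit of E-Store:
# tier 1 places individual hot tuples greedily; tier 2 packs cold chunks
# first-fit. All thresholds and structures are assumptions.

def two_tier_place(hot, cold_chunks, n_parts, capacity):
    load = [0.0] * n_parts
    assign = {}
    # Tier 1: greedy per-tuple placement of hot keys, heaviest first.
    for key, weight in sorted(hot.items(), key=lambda kv: -kv[1]):
        target = min(range(n_parts), key=load.__getitem__)
        assign[key] = target
        load[target] += weight
    # Tier 2: first-fit placement of large cold chunks.
    chunks = {}
    for cid, size in cold_chunks.items():
        for p in range(n_parts):
            if load[p] + size <= capacity:
                chunks[cid] = p
                load[p] += size
                break
    return assign, chunks, load

hot = {"k1": 9.0, "k2": 7.0, "k3": 5.0}
cold = {"c1": 4.0, "c2": 4.0}
assign, chunks, load = two_tier_place(hot, cold, n_parts=2, capacity=20.0)
print(assign, chunks, load)
```

Separating the two tiers keeps the expensive per-tuple decisions confined to the small hot set, which is the core idea behind two-tiered bin packing.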
Table 11. Comparative Analysis of Major Horizontal Partitioning Cost Models in Diverse Environments

| Category | Cost Model | Objective | Environment | Characteristic |
|---|---|---|---|---|
| Optimizer | Rao et al.[17], MESA[22] | O3, O8 | E-DH/S | Adjusting query plan node costs for different partitions based on table/index statistics |
| Function | SOP[9], AdaptDB[13], MTO[15] | O3, O5 | E-C(D)H/S | Skipping-based block scan cost and join cost |
| Function | Horticulture[23] | O2, O6 | E-DM | Quantifying the effects of load skew on the cluster |
| Function | DYFRAM[20], SOAP[27], SWord[25], E-Store[26], Clay[30] | O5, O4, O6, O10 | E-DH/S, E-DM | Costs for dynamic environments (replication/repartition operations, cold/hot data) |
| Function | PREF[28], GPT[32], BAW[33] | O3, O12 | E-DH/S | Fine-grained cost designs for the PKR problem |
| Function | NashDB[31] | O1, O3 | E-DH/S | A monetary value function for tuples; converting the HP problem into an economic problem |
| Function | SAHARA[35] | O11 | E-DM | A novel objective for reducing hardware cost |
| Learning | Hilprecht et al.[34] | O4, O8 | E-DH/S | A network-centric cost model |
356 J. Comput. Sci. & Technol., Mar. 2024, Vol.39, No.2
To get the skew factor Fskew, Horticulture first computes the skew factor for each t-th interval (SKt) by dividing the average partition skew value by the ideal skew value, i.e.,

SK_t(P, Q) = log( (Σ_{i=1}^{|P|} (N_par^i / N_par) / ρ̄_txn) / N̂_par ) / log(1 / ρ̄_txn),

where N_par^i represents the number of transactions accessing the i-th partition, and ρ̄_txn represents the ideal transaction distribution, estimated as 1/N̂_par. Next, Horticulture accumulates the interval skew factors to obtain the final skew factor Fskew.

Finally, we discuss the partition update costs, which typically consider the cost savings of new partitions and the data migration expenses. They directly determine whether the repartition scheme is executed. The cost savings arise from lower transaction execution and resource costs, whereas the data migration expenses cover the overheads tied to partition and replica modifications.

3.6 Summary

We summarize the key characteristics of HP as follows. 1) It is crucial to survey the storage and deployment environments before designing partitions. 2) When query features are scarce, query-driven methods often utilize data features as a supplement to build finer-grained or size-constrained partitions. 3) Each partitioning strategy has unique strengths and weaknesses. Mathematical programming requires feasibility verification due to partitioning's NP-hard nature. Learning-based algorithms exhibit high performance but adapt poorly to environmental changes. Conversely, greedy algorithms offer more flexibility for existing partitioning constraints, but may lack stable performance, which could be improved with additional optimization phases.

Peng-Ju Liu et al.: Enhancing Storage Efficiency and Performance: Survey of Data Partitioning Techniques 357

4 Vertical Partitioning

Subsection 4.1 and Subsection 4.2 provide the definition and feature extraction of the vertical partitioning (VP) problem, respectively. Mainstream VP construction strategies for centralized and distributed environments are presented in Subsections 4.3 and 4.4, respectively, and their cost models are introduced in Subsection 4.5. Fig.5 depicts the development trajectory of VP methods.

Fig.5. Timeline of VP research development, including on-axis studies (Hoffer[49], Navathe84[50], Navathe89[51], OBP[52], GA[53], HillClimb[54], AutoPart[55], Agrawal04[56], VF[57], Lisbeth[58], AutoStore[59], Smopd[60], Dyvep[61], Smopdc[62], GSOP[63], HYF[64], ActiveDB[65], GridFormation[66], AutoVP[67], and SCVP[68]) based on centralized environments and off-axis studies (HYRISE[69], Trojan[70], CHAC[71], Peloton[72], and Casper[73]) based on distributed environments.

4.1 Formalization

Definition 3 (Static Vertical Partitioning). Static VP is a two-phase partitioning technique for processing the collected queries Q. Table data D is initially divided vertically into disjoint column groups CGs, which are subsequently split horizontally into k distinct partitions P = (P1, P2, ..., Pk) through two candidate strategies: 1) all CGs are split into P as a whole containing aligned tuples; 2) each CG is independently split into partitions and then merged into P. The objective of VP is to generate the optimal combination of CGs and P that minimizes the final processing cost C of Q. VP first identifies the optimal column groups (CGs*) by introducing an additional cost function Ccg to evaluate only the division of each CG, and then finds the optimal classifier (ϕ*) to generate P:

CGs* = arg min_{CGs} Σ_{CG∈CGs} Ccg(CG, Q),
P_i = ϕ(e, CGs*), ∀e ∈ D,
ϕ* = arg min_ϕ C(P ⇐ ϕ(D, CGs*), Q).    (1)

Definition 4 (Dynamic Vertical Partitioning). This concept exhibits parallel characteristics to dynamic HP and will not be defined repeatedly here.

4.2 Feature Extraction

Database Schema. 1) VP is essentially an extension of table splitting, and high-frequency column groups can be directly extracted from independent business scenarios in advance. 2) Distinguishing between indexed and non-indexed columns[52, 64, 72] considerably affects column grouping. 3) Small/large tables are categorized based on the number of attributes to verify algorithms' execution efficiency.

Table Data. When constructing column groups, examining attribute types, such as primary/constrained keys, can help reduce join costs. When creating range-based horizontal splits, the attribute distribution characteristic[73] serves as a crucial reference factor in determining partition keys.

Workload. Query features[50, 53, 60, 65] such as accessed attributes (projection, filter, and join columns), affected rows, selectivity, SQL keywords[57, 69, 73], and submission time are commonly extracted in VP. Here, accessed attributes are used to calculate the co-occurrence frequency between attributes; selectivity[72, 73] reflects the proportion of scanned tuples in the total table tuples, with higher selectivity typically indicating a greater query weight.

Database Runtime Metric. Similar to HP, the VP layout primarily focuses on key metrics like system throughput, processor stalls, and resource utilization.
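The co-occurrence frequencies mentioned under Workload above are typically collected into the attribute affinity matrix (AAM) consumed by the methods in Subsection 4.3; a minimal sketch of building one follows (the workload representation, a list of accessed-attribute sets with frequencies, is an assumption for illustration):

```python
# Minimal sketch: build an attribute affinity matrix (AAM) from a workload.
# Each query is modeled as the set of attributes it accesses plus a frequency;
# aam[i][j] accumulates how often attributes i and j are co-accessed.

from itertools import combinations

def affinity_matrix(attrs, workload):
    idx = {a: i for i, a in enumerate(attrs)}
    n = len(attrs)
    aam = [[0] * n for _ in range(n)]
    for accessed, freq in workload:
        for a, b in combinations(sorted(accessed), 2):
            i, j = idx[a], idx[b]
            aam[i][j] += freq
            aam[j][i] += freq
    return aam

attrs = ["a1", "a2", "a3"]
workload = [({"a1", "a2"}, 5),       # 5 queries touch a1 and a2 together
            ({"a2", "a3"}, 2),
            ({"a1", "a2", "a3"}, 1)]
aam = affinity_matrix(attrs, workload)
print(aam)  # a1-a2 affinity 6, a2-a3 affinity 3, a1-a3 affinity 1
```

Clustering algorithms such as BEA or spectral clustering then operate on this matrix to form candidate column groups.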
4.3 Partitioning Process in Centralized Databases

Recent research[66, 67] has highlighted the benefits of applying reinforcement learning algorithms for dynamic partition updates. This marks a major evolution from earlier methods[49-51, 53, 55-58, 64], which depended on the partitioning feature of attribute affinity. These methods aimed to enhance performance, but often at the cost of increased execution time. As the field progressed, there was a shift towards more efficient, lightweight, end-to-end partitioning methods[59-63, 65, 68], reflecting a continuous effort to balance a system design's efficiency and effectiveness.

... constraints. The PAX[75] (Partition Attributes Across) layout decomposes relations at the page level to avoid the join expenses of prior VP studies[49-53] that break down a table into multiple subtables. HillClimb[54] extends PAX by defining a finer page layout. Starting with PAX's single-column partitions, it merges the two partitions offering the largest query cost reduction in iterative rounds until no further reduction is possible. Fig.6 illustrates this process. The CGs |a1|a2|a3|a4|a5|a6| are the initial page layout. The first round merges a1 and a2 for their greatest merging benefit. Then the merging benefits of the valid candidate mergers are updated, and a5 and a6 are merged next. The process continues until reaching the optimal state |a1a2a3|a4a5a6|, with no feasible mergers left.

[Fig.6: the initial partitions and the merging benefits of candidate mergers]

ML-Based. Hoffer et al.[49] calculated an attribute affinity matrix (AAM) from column accesses and applied the BEA[74] algorithm to cluster the AAM into column groups (CGs). Navathe et al.[50] refined this approach by introducing a two-phase partitioning with a cost model to select appropriate cost types for index scans for optimal selection. SCVP[68] improves HYF's cost model by incorporating tuple reconstruction costs. Leveraging the cost independence property between CGs, SCVP first designs an estimation function for rapid calculation of CG division gains, making it suitable for large tables and heavy loads. It then applies spectral clustering on the AAM to form initial CGs and adopts a greedy search strategy to split and merge CGs based on frequent patterns.

Constructing a self-adaptive VP layout is crucial. AutoStore[59] introduces O2P (One-dimensional Online Partitioning) to monitor query changes through a query window and updates the AAM online. O2P uses the BEA algorithm to recluster only the CGs referenced by new queries, and designs a transforming benefit model to decide whether the repartition decision should be executed. SMOPD[60] improves on AutoStore by determining appropriate checkpoint intervals for repartitioning based on historical data analysis, employing AutoClust[76] for partition updates. SMOPD-C[62] further adapts SMOPD to distributed settings by updating the monitoring procedures. To solve the cold-start issue of VP, DYVEP[61] designs a statistics collector to monitor changes in query patterns and database schema, creating new partitions or triggering repartitioning when query latency increases or table attributes are deleted.

Empirical-Based. ActiveDB[65] uses 21 active rules to monitor both internal and external system and user activities. The first 15 rules gather query-related statistical indicator changes. Two rules estimate the current performance change to determine the necessity of partition updates. The final four rules use statistical features to create new partitions and assess their performance improvement threshold.

DL-Based. GridFormation[66] is the first learning-based agent using Q-learning[47] for online VP layout design. The state is defined as a collection of sets, each indicating a partition containing a list of tuple IDs. GridFormation's partitioning process follows a Markov decision process (MDP), with rewards calculated based on the touched partitions and the tuple access ratio of each query. AutoVP[67] redesigns the GridFormation agent to accelerate training, offering three optional DQN variants[77] and using HillClimb[54] and HDD[78] to evaluate temporary partitions. It simplifies the state representation to a 2D array, with each row corresponding to a query and each column to a table attribute. Rewards are based on the cost difference between the current state and HillClimb's ideal state, enabling faster experience learning and MDP processing.

Table 12 summarizes vertical partitioning techniques for centralized environments.

4.4 Partitioning Process in Distributed Databases

In big data systems, VP layouts are commonly built on page-level stores like PAX[75]. Trojan[70] defines an interestingness score to reflect how effectively a CG accelerates most queries, then solves a 0-1 knapsack problem to select the optimal CG combinations. Trojan achieves layout-aware replication by designing unique CGs for each replica, better adapting to the given queries. CHAC[71] (Column-oriented Hadoop Attribute) extracts frequent closed item sets from a frequency-weighted AAM to generate overlapping and non-overlapping candidate clustering solutions, and designs a cost model to select the optimal solution.

VP-based hybrid storage is customized for HTAP databases. HYRISE[69] measures the cache misses resulting from data movement from RAM to cache, mitigating cache pollution in update operations using non-temporal writes. It creates CG layouts that adapt to cache lines to accelerate read operations. Peloton[72] clusters queries by their co-accessed attributes via the k-means algorithm, selecting representative queries for each cluster by optimizer estimates and submission time. It then prioritizes these queries, using a greedy policy to extract CGs, and maintains recent query statistics in a time-series graph to periodically replace old CGs. Casper[73], a column layout that works with VP algorithms like HYRISE and Peloton, optimizes HTAP load processing in in-memory DBMSs. It estimates block read/write I/O costs for various transaction operations, aligns block sizes with cache lines, and tracks each operation's accesses via block domain histograms. This helps establish ILP equations to allocate data while satisfying constraints related to read/update latencies.

Table 13 summarizes vertical partitioning techniques for distributed environments.

4.5 Cost Estimation for Vertical Partition Scheme

This subsection reviews the common function-based cost models (see Table 14) employed for VP evaluation, covering two-phase partitioning and partition updates. They consider query execution on VP layouts, partition updates, and the impact of indexes, joins, and map-reduce operations. For non-PAX VP techniques, the additional tuple reconstruction cost for cross-partition queries is another crucial factor.
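The greedy merge loop used by HillClimb in Subsection 4.3, and mirrored by the greedy phases of several later methods, consumes exactly such cost functions. A minimal sketch with a stand-in seek-count cost (the cost function, data, and representation are all illustrative assumptions, not HillClimb's exact procedure):

```python
# Illustrative HillClimb-style merging: start from single-column groups and
# repeatedly merge the pair of groups with the largest cost reduction until
# no merger reduces the cost. `cost` is a caller-supplied stand-in.

from itertools import combinations

def hill_climb(columns, cost):
    layout = [frozenset([c]) for c in columns]
    while True:
        best_gain, best_pair = 0, None
        base = cost(layout)
        for g1, g2 in combinations(layout, 2):
            merged = [g for g in layout if g not in (g1, g2)] + [g1 | g2]
            gain = base - cost(merged)
            if gain > best_gain:
                best_gain, best_pair = gain, (g1, g2)
        if best_pair is None:       # no merger reduces cost: local optimum
            return layout
        g1, g2 = best_pair
        layout = [g for g in layout if g not in (g1, g2)] + [g1 | g2]

# Toy cost: each query pays one "seek" per column group it touches.
queries = [{"a1", "a2"}, {"a1", "a2"}, {"a5", "a6"}]
def seek_cost(layout):
    return sum(sum(1 for g in layout if g & q) for q in queries)

final = hill_climb(["a1", "a2", "a3", "a4", "a5", "a6"], seek_cost)
print(sorted(sorted(g) for g in final))
```

With this toy workload the loop merges a1 with a2 and a5 with a6, then stops, since no remaining merger lowers the seek count.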
Table 14. Comparative Analysis of Function-Based Vertical Partitioning Cost Models in Diverse Environments

| Cost Model | Objective | Environment | Characteristic |
|---|---|---|---|
| VF[57], GSOP[63] | O3 | E-CH/S | Incorporating the cost of tuple reconstruction across partitions |
| ACO[78] | O3 | E-CH/S | Cost designs for bandwidth-based disk access operations |
| AutoPart[55], HYF[64] | O3 | E-CH/S | Approximating the costs of index scans and block joins |
| AutoStore[59] | O3, O5 | E-CH/S | Considering the repartitioning potential benefit |
| CHAC[71], Trojan[70] | O3, O5 | E-DH/S | Finer cost estimations for map-reduce phases |
| DataMorphing[54], HYRISE[69] | O10 | E-DM | Estimating cache misses for diverse data access operations |
| Casper[73] | O3 | E-DM | Modeling costs for five distinct data access operations |
4.5.1 Centralized Environment

To determine when to repartition, some studies[59] use a fixed query window, while [60-62] employ dynamic windows based on query performance thresholds. The choice of the monitoring approach does not impact the modeling of repartitioning benefits. However, AutoStore[59] differs from other approaches by considering the potential benefits of new partitions rather than solely evaluating them based on historical loads. It introduces a transformation benefit Btf, resulting from
updating the current partitions P to a new scheme P′ when executing the n queries Q collected in the window. Btf is calculated as Ccg(Q, P) − Ccg(Q, P′), with Ccg denoting the query processing cost over a given vertical layout. AutoStore assumes the presence of multiple future windows with workloads similar to the current window, and estimates their frequency using an exponentially decaying model with a shape parameter γ, i.e., freq = 1/(1 − γ^(−n)). The potential benefit of updating the current partitions is then calculated as Br = freq × Btf − Ccg, and new partitions are deployed only if Br > 0.

Various approaches[57, 59, 63, 64, 78] calculate Ccg by breaking down the total query cost into the scan and tuple reconstruction costs of the accessed CGs. The scan cost counts the number of both random and sequential I/O blocks, with random I/O accounting for unclustered and clustered index scan costs plus index lookup costs. The tuple reconstruction cost considers only the join cost (e.g., hash and sort-merge joins) if the tuples between different CGs are not aligned; otherwise, a minimal tuple addressing cost is considered.

4.5.2 Distributed Environment

The VP layout is prevalent in distributed Hadoop environments. For example, both Trojan[70] and CHAC[71] consider the impact of column groups during the map phase as the main cost factor. However, unlike Trojan, CHAC estimates costs roughly, focusing solely on the data access volume and omitting disk read/write characteristics and network cost considerations. We introduce the Trojan cost model next.

To avoid tuple reconstruction, Trojan is based on PAX and considers data reading and network costs. The known parameters include the block size Sb, the number of machines n, the number of map tasks m, and the split size Ssplit. Ssplit determines the number of data slices, each handled by a single mapper. When processing a query q, the number of blocks read is denoted as Nb, and the number of map phases is calculated as Nmap = Nb × Sb/(Ssplit × m × n).

The read cost for each map phase includes both random I/O, Crand(q) = Frand × (Ssplit × |C′cg|/(Sbuffer × |Ccg|)), and sequential I/O, Cseq(q) = Ssplit × |C′cg|/(BWdisk × |Ccg|). Frand denotes the average random seek time (0.005 s); Ssplit is set to 256 MB; Ccg and C′cg represent the complete and accessed column sets, respectively; Sbuffer is the buffer size (512 KB); and BWdisk is the average disk bandwidth (100 MB/s). When local data is not available, the network cost Ctr arises from transferring data from one machine to another, i.e., Ctr = (1 − ptr) × (Ssplit/BWnet), where BWnet denotes the network bandwidth (1 GB/s) and ptr (0.97) is the probability that the required data is locally available, so remote accesses occur with probability 1 − ptr. Assuming a map initialization time of 0.1 s (Cinit), the total latency of query q over the Trojan layout is computed as (Ctr(q) + Crand(q) + Cseq(q) + Cinit) × Nmap.

4.6 Summary

Differing from HP, VP involves a two-phase process of column grouping and horizontal division of tuples, with each phase being NP-hard. In the first phase, mathematical programming algorithms efficiently identify CGs in small tables, while greedy and ML-based algorithms are preferred for large tables. In the second phase, partitions within each CG are typically generated using hash or range values of keys. Additionally, cost models play a crucial role in the VP process, calculating scan costs for CGs and cross-CG reconstruction costs for selecting candidate partitions. Despite its advantages, deploying and evaluating VP in real-world databases is challenging due to the limited native support for VP creation.

5 Irregular Partitioning

Irregular partitioning (IP) is a cutting-edge technique for handling analytical and mixed loads. However, deploying it poses challenges such as maintaining the storage structure, updating partitions, and coordinating query executors. Furthermore, there is a scarcity of relevant studies according to [8, 81]. In this section, we define the IP problem in Subsection 5.1. The partitioning features required by IP, discussed in Section 3 and Section 4, will not be reintroduced. Subsequently, we describe several classic IP techniques in Subsection 5.2 and provide a summary in Subsection 5.3. Fig.7 depicts a simple development trajectory of IP methods.

Fig.7. Timeline of IP research development, including on-axis studies (Teradata[79], GridTable[80], and Jigsaw[81]) based on centralized environments and off-axis studies (Proteus[82]) based on distributed environments.
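Looking back at Subsection 4.5.2, the Trojan read-cost formulas and constants quoted there can be transcribed directly into a small calculator. The helper's signature, units (seconds and megabytes), and the example numbers are illustrative choices rather than part of the original model:

```python
# Direct transcription of the Trojan-style read-cost formulas quoted above.
# Constants follow the text: F_rand = 0.005 s, S_split = 256 MB, S_buffer =
# 512 KB, BW_disk = 100 MB/s, BW_net = 1 GB/s, p_tr = 0.97, C_init = 0.1 s.

F_RAND, S_SPLIT_MB, S_BUFFER_MB = 0.005, 256.0, 0.5
BW_DISK, BW_NET = 100.0, 1024.0          # MB/s
P_TR, C_INIT = 0.97, 0.1                 # data-locality probability, map init (s)

def trojan_query_latency(n_blocks, block_mb, machines, map_tasks, n_cols, n_accessed):
    frac = n_accessed / n_cols                               # |C'_cg| / |C_cg|
    n_map = n_blocks * block_mb / (S_SPLIT_MB * map_tasks * machines)
    c_rand = F_RAND * (S_SPLIT_MB * frac / S_BUFFER_MB)      # random seeks
    c_seq = S_SPLIT_MB * frac / BW_DISK                      # sequential read
    c_tr = (1 - P_TR) * (S_SPLIT_MB / BW_NET)                # network transfer
    return (c_tr + c_rand + c_seq + C_INIT) * n_map

# Example: 4096 blocks of 64 MB on 8 machines with 4 map tasks per machine,
# and a query touching 2 of 10 columns.
print(round(trojan_query_latency(4096, 64.0, 8, 4, 10, 2), 3))
```

Plugging in different column-group layouts (which change the accessed fraction) makes the layout comparison performed by Trojan's optimizer concrete.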
Table 15. Summary of Irregular Partitioning Strategies in Centralized and Distributed Environments

| Category | Work | Baseline | Objective | Automatic | Cost Composition | Content |
|---|---|---|---|---|---|---|
| Empirical | Teradata[79] | Simple range partitions | O3, O10 | ✘ | Optimizer-based I/O costs; CPU metrics | Rowid-based storage + multi-level range partitioning∇ |
| Empirical | GridTable[80] | N/A | O4 | ✘ | Access and transition costs between grids | Three level-specific data manipulation operations |
| Greedy | Jigsaw[81] | Schism, Schism+Peloton | O3 | ✘ | Read I/Os of layouts; memory for hash tables | Segment partitioning∇; greedy mergingΟ |
| Greedy | Proteus[82] | TiDB | O4 | M-TH+D-RE | Costs for layout-aware/-agnostic storages | Layout creation rules∇; hybrid predictors for queries ⟳ |
... associated with irregular partition replication, maintenance, and joins.

6 Data Partitioning in Industry

Table 16 compares popular database products and their partitioning support. Most DBMSs, e.g., Redshift③, Firebolt④, Databricks⑤, GaussDB⑥, TiDB⑦, OceanBase⑧, and SingleStore⑨, offer user-defined HP strategies such as range, hash, key, list, and round-robin, where partition keys necessitate manual selection and updates. These systems prioritize balanced resource utilization among nodes and cluster scalability/parallelism through partitioning, rather than focusing solely on maximizing system performance. Besides, their simplicity enables DBAs to effortlessly create and manage partitions. Organizing data based on data distribution can also make it easier to conduct data analysis, particularly for time-series data.

In contrast, certain products (e.g., Vertica⑩, Greenplum⑪, and VoltDB⑫) incorporate load analysis into their partitioning design. VoltDB is an in-memory DBMS for fast data processing tasks like online gaming and IoT sensors. By analyzing historical load and data distribution, it scales transaction processing capacity, creating optimal range partitions. This ensures load balancing and allows high-frequency transactions to be executed locally.

Some products, e.g., ClickHouse⑬, StarRocks⑭, Apache Hudi⑮, Oracle Autonomous Database⑯, and Snowflake⑰, provide automated partition key selection and updates.
Table 16. Partitioning Support Comparison of Popular Database Products for OLAP, OLTP, and HTAP Scenarios

| Scenario | Type | Partitioning Strategy | Strategy Type | Automatic | Representative Product |
|---|---|---|---|---|---|
| OLAP | HP | Key, hash, range, list, round-robin | Data-driven | ✘ | Redshift③, Firebolt④, Databricks⑤, GaussDB⑥ |
| OLAP | HP | Round-robin, list, hash, range | Data-driven | M-TH | ClickHouse⑬, StarRocks⑭, Apache Hudi⑮ |
| OLAP | HP | Automatic interval/list | Data-/query-driven | M-SD | Oracle Autonomous Database⑯ |
| OLAP | HP | Auto clustering | Data-driven | M-TH | Snowflake⑰ |
| OLAP | HP&VP | Range, table projections + hash | Data-/query-driven | ✘ | Vertica⑩, Greenplum⑪ |
| OLTP | HP | Key, hash, range, list | Data-driven | ✘ | PostgreSQL, MySQL, Oracle, SQLServer |
| OLTP | HP | Key, hash, range, list | Data-/query-driven | ✘ | VoltDB⑫ |
| OLTP | VP | Sharding + table views | Data-/query-driven | ✘ | PostgreSQL, MySQL, Oracle, SQLServer |
| HTAP | HP | Key, hash, range, list | Data-driven | ✘ | TiDB⑦, OceanBase⑧ |
| HTAP | HP | Hash | Data-driven | ✘ | SingleStoreDB⑨ |
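The user-defined HP strategies listed in Table 16 ultimately reduce to a routing function from a partition-key value to a partition. A minimal illustration of range, hash, and list routing (mirroring the general idea, not any specific product's implementation):

```python
# Minimal illustration of the three most common user-defined HP strategies
# from Table 16: range, hash, and list routing of a partition-key value.

import bisect
import zlib

def range_route(key, bounds):
    """bounds are the upper bounds of ranges, e.g. [100, 200] -> 3 partitions."""
    return bisect.bisect_right(bounds, key)

def hash_route(key, n_parts):
    """Stable hash routing (zlib.crc32, so results are reproducible)."""
    return zlib.crc32(str(key).encode()) % n_parts

def list_route(key, value_lists):
    for pid, values in enumerate(value_lists):
        if key in values:
            return pid
    raise KeyError(f"no list partition accepts {key!r}")

print(range_route(150, [100, 200]))           # falls in the middle range
print(hash_route(42, 4) in range(4))          # always a valid partition id
print(list_route("EU", [{"US"}, {"EU", "UK"}]))
```

The manual effort the text describes lies not in these functions but in choosing the key and maintaining the bounds or value lists as the workload drifts.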
In the Oracle Autonomous Database service, automatic partitioning is a resource-intensive and time-consuming operation, invoked on demand rather than running periodically. Analyzing workload information, it automatically identifies candidate partitioning tables and recommends partitions for optimal I/O reduction using three strategies: automatic interval, automatic list, and hash. Snowflake creates micro-partitions via ZoneMap indexes and column distribution histograms. Data is organized in a natural order (the unclustered state), then clustered by the selected keys to prevent cluster key value duplication across partitions (the clustered state). As new data arrives, the number of duplicate key values across different partitions increases; each partition's depth is quantified by the count of its overlapping partitions. To preserve the overall data order, Snowflake prioritizes selecting micro-partitions with higher depths and sorts and merges them independently.

Vertica and Greenplum are among the few products natively supporting VP creation for efficient partition pruning. This is achieved by creating local column group projections on disk for partitioned tables and evenly distributing the projected data to partitions via hashing. By storing related data together, Vertica can more efficiently utilize system resources. Fine-grained projection replicas make it easier to achieve high availability and data recovery. Conversely, other products simulate VP by combining sharding and views, an approach that complicates the table schema, subtable data consistency, and query planning, and may lead to performance issues like overloaded shards and increased partition maintenance costs.

7 Open Problems

In this section, we explore the remaining challenges and potential solutions in the current data partitioning community.

Partitioning for Non-Numeric Columns. Query access patterns pertaining to non-numeric columns are often ignored, which greatly limits the optimization space of partitioning. A feasible solution to this dilemma involves transforming non-numeric column data into numeric data via data encoding. Date columns can be transformed into numeric values through timestamp functions, while enumeration columns are dictionary-encoded based on their semantic or alphabetical order. For more complex column values, a trie-based index tree[83] can be built, with a depth-first traversal to derive the encoding keys.

Block Allocation Within VP. Current research adopts simple data-driven methods to allocate tuples into blocks after obtaining the column groups (CGs). Although [53, 56] have considered load information, they still encounter convergence or performance issues. This inefficiency prevents the VP algorithm from achieving its optimal potential, even when the CG division is aligned with the column access patterns. A promising solution is to incorporate proven, effective query-driven HP algorithms like QdTree[14] into VP.

Reliability of Partition Updating. Monitoring services frequently rely on recently collected query logs to design new partitions; however, this method neglects the similarity between future and historical loads, making it challenging to estimate the updated partitions' potential performance. While [34, 59, 82] have tried to model special scenarios to calculate the future benefits of new partitions, these assumptions often prove unrealistic. This issue presents significant optimization potential in two aspects: firstly, improving the prediction accuracy of the future load for generating better new partitions; and secondly, reducing the number of problem assumptions.

Deep Learning Models for Cost Estimation. To the best of our knowledge, no public, network-centric cost model exists for partitioning. However, the learning and generalization capabilities of deep neural networks render them particularly suitable for such tasks. The main challenge lies in collecting sufficient training samples due to the high partition deployment cost and the vast partition solution space. A viable solution entails compressing or trimming the solution space by identifying the factors influencing the query plan. This could be achieved by using a pruned branch bounding tree for candidate partitions and removing the deployment and metric measurements of cold data. Subsequently, query plans and execution metrics for various partitions are collected to train an RNN-stacked tree network.

8 Conclusions

In this paper, we modularized the partitioning technique, emphasizing the significance of cluster and storage environments in formulating an efficient partitioning path. Our approach enhances the tracking of partitioning progress and clarifies the considerations necessary at each partitioning stage, ensuring optimal designs. Before partitioning, it is crucial to align cost models and partition types with specific environmental characteristics. Furthermore, the intricate relationship between data migration plans during partition updates and cluster configuration underscores the importance of a holistic approach.
We also classified partition generation strategies based on algorithm types, distinguishing key features such as model convergence and partition quality to aid in strategy selection. For future research, we would like to explore feasible solutions for addressing existing key challenges, including non-numeric column-based partitioning and the reliability of partition updating. We hope our framework and findings can contribute to the advancement of partitioning systems and provide practical insights for DBAs in various environments.

Conflict of Interest The authors declare that they have no conflict of interest.

References

[1] Melnik S, Gubarev A, Long J J et al. Dremel: A decade of interactive SQL analysis at web scale. Proceedings of the VLDB Endowment, 2020, 13(12): 3461-3472. DOI: 10.14778/3415478.3415568.
[2] Bayer R, McCreight E. Organization and maintenance of large ordered indices. In Proc. the 1970 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control, Nov. 1970, pp.107-141. DOI: 10.1145/1734663.1734671.
[3] Bentley J L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975, 18(9): 509-517. DOI: 10.1145/361002.361007.
[4] Guttman A. R-trees: A dynamic index structure for spatial searching. In Proc. the 1984 ACM SIGMOD International Conference on Management of Data, Jun. 1984, pp.47-57. DOI: 10.1145/602259.602266.
[5] Yuan H T, Li G L, Feng L, Sun J, Han Y. Automatic view generation with deep learning and reinforcement learning. In Proc. the 36th IEEE International Conference on Data Engineering, Apr. 2020, pp.1501-1512. DOI: 10.1109/ICDE48307.2020.00133.
[6] Han Y, Li G L, Yuan H T, Sun J. An autonomous materialized view management system with deep reinforcement learning. In Proc. the 37th IEEE International Conference on Data Engineering, Apr. 2021, pp.2159-2164. DOI: 10.1109/ICDE51399.2021.00217.
[7] Zhang H, Chen G, Ooi B C, Tan K L, Zhang M H. In-memory big data management and processing: A survey. IEEE Trans. Knowledge and Data Engineering, 2015, 27(7): 1920-1948. DOI: 10.1109/TKDE.2015.2427795.
[8] Mahmud M S, Huang J Z, Salloum S et al. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics, 2020, 3(2): 85-101. DOI: 10.26599/BDMA.2019.9020015.
[9] Sun L W, Franklin M J, Krishnan S, Xin R S. Fine-grained partitioning for aggressive data skipping. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp.1115-1126. DOI: 10.1145/2588555.2610515.
[10] Aly A M, Mahmood A R, Hassan M S, Aref W G, Ouzzani M, Elmeleegy H, Qadah T. AQWA: Adaptive query workload aware partitioning of big spatial data. Proceedings of the VLDB Endowment, 2015, 8(13): 2062-2073. DOI: 10.14778/2831360.2831361.
[11] Aly A M, Elmeleegy H, Qi Y, Aref W. Kangaroo: Workload-aware processing of range data and range queries in Hadoop. In Proc. the 9th ACM International Conference on Web Search and Data Mining, Feb. 2016, pp.397-406. DOI: 10.1145/2835776.2835841.
[12] Shanbhag A, Jindal A, Madden S, Quiane J, Elmore A J. A robust partitioning scheme for ad-hoc query workloads. In Proc. the 2017 Symposium on Cloud Computing, Sept. 2017, pp.229-241. DOI: 10.1145/3127479.3131613.
[13] Lu Y, Shanbhag A, Jindal A, Madden S. AdaptDB: Adaptive partitioning for distributed joins. Proceedings of the VLDB Endowment, 2017, 10(5): 589-600. DOI: 10.14778/3055540.3055551.
[14] Yang Z H, Chandramouli B, Wang C et al. Qd-tree: Learning data layouts for big data analytics. In Proc. the 2020 ACM SIGMOD International Conference on Management of Data, Jun. 2020, pp.193-208. DOI: 10.1145/3318464.3389770.
[15] Ding J L, Minhas U F, Chandramouli B et al. Instance-optimized data layouts for cloud analytics workloads. In Proc. the 2021 International Conference on Management of Data, Jun. 2021, pp.418-431. DOI: 10.1145/3448016.3457270.
[16] Li Z, Yiu M L, Chan T N. PAW: Data partitioning meets workload variance. In Proc. the 38th IEEE International Conference on Data Engineering, May 2022, pp.123-135. DOI: 10.1109/icde53745.2022.00014.
[17] Rao J, Zhang C, Megiddo N, Lohman G. Automating physical database design in a parallel database. In Proc. the 2002 ACM SIGMOD International Conference on Management of Data, Jun. 2002, pp.558-569. DOI: 10.1145/564691.564757.
[18] Agrawal S, Chu E, Narasayya V. Automatic physical design tuning: Workload as a sequence. In Proc. the 2006 ACM SIGMOD International Conference on Management of Data, Jun. 2006, pp.683-694. DOI: 10.1145/1142473.1142549.
[19] Eadon G, Chong E I, Shankar S, Raghavan A, Srinivasan J, Das S. Supporting table partitioning by reference in Oracle. In Proc. the 2008 ACM SIGMOD International Conference on Management of Data, Jun. 2008, pp.1111-1122. DOI: 10.1145/1376616.1376727.
[20] Hauglid J O, Ryeng N H, Nørvåg K. DYFRAM: Dynamic fragmentation and replica management in distributed database systems. Distributed and Parallel Databases, 2010, 28(2): 157-185. DOI: 10.1007/s10619-010-7068-1.
[21] Curino C, Jones E, Zhang Y, Madden S. Schism: A workload-driven approach to database replication and partitioning. Proceedings of the VLDB Endowment, 2010, 3(1/2): 48-57. DOI: 10.14778/1920841.1920853.
[22] Nehme R, Bruno N. Automated partitioning design in parallel database systems. In Proc. the 2011 ACM SIGMOD International Conference on Management of Data, Jun. 2011, pp.1137-1148. DOI: 10.1145/1989323.1989444.
[23] Pavlo A, Curino C, Zdonik S. Skew-aware automatic
... Jun. 2020, pp.143-157. DOI: 10.1145/3318464.3389704.
[35] Brendle M, Weber N, Valiyev M, May N, Schulze R,
database partitioning in shared-nothing, parallel OLTP Böhm A, Moerkotte G, Grossniklaus M. SAHARA: Mem-
systems. In Proc. the 2012 ACM SIGMOD International ory footprint reduction of cloud databases with automat-
Conference on Management of Data, May 2012, pp.61–72. ed table partitioning. In Proc. the 25th International Con-
DOI: 10.1145/2213836.2213844. ference on Extending Database Technology, Mar. 29–Apr.
[24] Liroz-Gistau M, Akbarinia R, Pacitti E et al. Dynamic 1, 2022. DOI: 10.5441/002/edbt.2022.02.
workload-based partitioning algorithms for continuously [36] Agrawal R, Srikant R. Fast algorithms for mining associa-
growing databases. In Transactions on Large-Scale Data- tion rules in large databases. In Proc. the 20th Interna-
and Knowledge-Centered Systems XII, Hameurlain A, tional Conference on Very Large Data Bases, Sept. 1994,
Küng J, Wagner R (eds.), Springer, 2013, pp.105–128. pp.487–499.
DOI: 10.1007/978-3-642-45315-1_5. [37] Ward J H Jr. Hierarchical grouping to optimize an objec-
[25] Quamar A, Kumar K A, Deshpande A. 2013. SWORD: tive function. Journal of the American Statistical Associa-
Scalable workload-aware data placement for transaction- tion, 1963, 58(301): 236–244. DOI: 10.1080/01621459.1963.
al workloads. In Proc. the 16th International Conference 10500845.
on Extending Database Technology, Mar. 2013, pp.430– [38] Roussopoulos N, Kelley S, Vincent F. Nearest neighbor
441. DOI: 10.1145/2452376.2452427. queries. In Proc. the 1995 ACM SIGMOD International
[26] Taft R, Mansour E, Serafini M, Duggan J, Elmore A J, Conference on Management of Data, May 1995, pp.71–79.
Aboulnaga A, Pavlo A, Stonebraker M. E-store: Fine- DOI: 10.1145/223784.223794.
grained elastic partitioning for distributed transaction [39] Sacca D, Wiederhold G. Database partitioning in a clus-
processing systems. Proceedings of the VLDB Endow- ter of processors. ACM Trans. Database Systems, 1985,
ment, 2014, 8(3): 245–256. DOI: 10.14778/2735508.2735514. 10(1): 29–56. DOI: 10.1145/3148.3161.
[27] Chen K J, Zhou Y L, Cao Y. Online data partitioning in [40] Copeland G, Alexander W, Boughter E, Keller T. Data
distributed database systems. In Proc. the 18th Interna- placement in Bubba. In Proc. the 1988 ACM SIGMOD
tional Conference on Extending Database Technology, International Conference on Management of Data, Jun.
Mar. 2015, pp.1–12. DOI: 10.5441/002/edbt.2015.02. 1988, pp.99–108. DOI: 10.1145/50202.50213.
[28] Zamanian E, Binnig C, Salama A. Locality-aware parti- [41] Stöhr T, Märtens H, Rahm E. Multi-dimensional database
tioning in parallel database systems. In Proc. the 2015 allocation for parallel data warehouses. In Proc. the 26th
ACM SIGMOD International Conference on Manage- International Conference on Very Large Data Bases, Sept.
ment of Data, May 2015, pp.17–30. DOI: 10.1145/2723372. 2000, pp.273–284.
2723718. [42] Bruno N, Chaudhuri S. An online approach to physical
[29] Fetai I, Murezzan D, Schuldt H. Workload-driven adap- design tuning. In Proc. the 23rd IEEE International Con-
tive data partitioning and distribution—The Cumulus ap- ference on Data Engineering, Apr. 2007, pp.826–835. DOI:
proach. In Proc. the 2015 IEEE International Conference 10.1109/ICDE.2007.367928.
on Big Data, Oct. 29–Nov. 1, 2015, pp.1688–1697. DOI: [43] Garcia-Alvarado C, Raghavan V, Narayanan S, Waas F
10.1109/BigData.2015.7363940. M. Automatic data placement in MPP databases. In
[30] Serafini M, Taft R, Elmore A J et al. Clay: Fine-grained Proc. the IEEE 28th International Conference on Data
adaptive partitioning for general database schemas. Pro- Engineering Workshops, Apr. 2012, pp.322–327. DOI: 10.
ceedings of the VLDB Endowment, 2016, 10(4): 445–456. 1109/ICDEW.2012.45.
DOI: 10.14778/3025111.3025125. [44] Karypis G, Kumar V. METIS: A software package for
[31] Marcus R, Papaemmanouil O, Semenova S, Garber S. partitioning unstructured graphs, partitioning meshes,
NashDB: An end-to-end economic method for elastic and computing fill-reducing orderings of sparse matrices.
database fragmentation, replication, and provisioning. In Technical Report, TR 97-061, Univeristy of Minnesota,
Proc. the 2018 International Conference on Management 1997. https://hdl.handle.net/11299/215346, Mar. 2024.
of Data, May 2018, pp.1253–1267. DOI: 10.1145/3183713. [45] Kuhn H W. The Hungarian method for the assignment
3196935. problem. In 50 Years of Integer Programming 1958-2008,
[32] Nam Y M, Kim M S, Han D. A graph-based database Jünger M, Liebling T M, Naddef D, Nemhauser G L, Pul-
partitioning method for parallel OLAP query processing. leyblank W R, Reinelt G, Rinaldi G, Wolsey L A (eds.),
In Proc. the 34th IEEE International Conference on Data Springer, 2010, pp.29–47. DOI: 10.1007/978-3-540-68279-
Engineering, Apr. 2018, pp.1025–1036. DOI: 10.1109/ICDE. 0_2.
2018.00096. [46] Costa E, Costa C, Santos M Y. Evaluating partitioning
[33] Parchas P, Naamad Y, Van Bouwel P, Faloutsos C, and bucketing strategies for hive-based big data ware-
Petropoulos M. Fast and effective distribution-key recom- housing systems. Journal of Big Data, 2019, 6(1): 34.
mendation for amazon redshift. Proceedings of the VLDB DOI: 10.1186/s40537-019-0196-1.
Endowment, 2020, 13(12): 2411–2423. DOI: 10.14778/3407 [47] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou
790.3407834. I, Wierstra D, Riedmiller M. Playing Atari with deep re-
[34] Hilprecht B, Binnig C, Röhm U. Learning a partitioning inforcement learning. arXiv: 1312.5602, 2013. https://arx-
advisor for cloud databases. In Proc. the 2020 ACM SIG- iv.org/abs/1312.5602, Mar. 2024.
Peng-Ju Liu et al.: Enhancing Storage Efficiency and Performance: Survey of Data Partitioning Techniques 367
[73] … Proceedings of the VLDB Endowment, 2019, 12(13): 2393–2407. DOI: 10.14778/3358701.3358707.
[74] McCormick W T, Schweitzer P J, White T W. Problem decomposition and data reorganization by a clustering technique. Operations Research, 1972, 20(5): 993–1009. DOI: 10.1287/opre.20.5.993.
[75] Ailamaki A, DeWitt D J, Hill M D, Skounakis M. Weaving relations for cache performance. In Proc. the 27th International Conference on Very Large Data Bases, Sept. 2001, pp.169–180.
[76] Li L Z, Gruenwald L. Autonomous database partitioning using data mining on single computers and cluster computers. In Proc. the 16th International Database Engineering & Applications Symposium, Aug. 2012, pp.32–41. DOI: 10.1145/2351476.2351481.
[77] van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning. In Proc. the 30th AAAI Conference on Artificial Intelligence, Feb. 2016, pp.2094–2100. DOI: 10.1609/aaai.v30i1.10295.
[78] Jindal A, Palatinus E, Pavlov V, Dittrich J. A comparison of knives for bread slicing. Proceedings of the VLDB Endowment, 2013, 6(6): 361–372. DOI: 10.14778/2536336.2536338.
[79] Al-Kateb M, Sinclair P, Au G, Ballinger C. Hybrid row-column partitioning in Teradata®. Proceedings of the VLDB Endowment, 2016, 9(13): 1353–1364. DOI: 10.14778/3007263.3007273.
[80] Pinnecke M, Durand G C, Broneske D, Zoun R, Saake G. GridTables: A One-Size-Fits-Most H2TAP data store. Datenbank-Spektrum, 2020, 20(1): 43–56. DOI: 10.1007/s13222-019-00330-x.
[81] Kang D H, Jiang R C, Blanas S. Jigsaw: A data storage and query processing engine for irregular table partitioning. In Proc. the 2021 International Conference on Management of Data, Jun. 2021, pp.898–911. DOI: 10.1145/3448016.3457547.
[82] Abebe M, Lazu H, Daudjee K. Proteus: Autonomous adaptive storage for mixed workloads. In Proc. the 2022 International Conference on Management of Data, Jun. 2022, pp.700–714. DOI: 10.1145/3514221.3517834.
[83] Wang J Y, Chai C L, Liu J B, Li G L. FACE: A normalizing flow based cardinality estimator. Proceedings of the VLDB Endowment, 2021, 15(1): 72–84. DOI: 10.14778/3485450.3485458.

Peng-Ju Liu received his B.S. degree in information management and information system from Dalian Maritime University, Dalian, in 2020. He is currently pursuing his Ph.D. degree at the School of Information, Renmin University of China, Beijing. His research interests include adaptable data partitioning, load forecasting, and learning-based query optimization.

Cui-Ping Li is currently a professor at Renmin University of China, Beijing. She received her Ph.D. degree from Chinese Academy of Sciences, Beijing, in 2003. Before that, she received her B.S. and M.S. degrees from Xi'an Jiaotong University, Xi'an, in 1994 and 1997, respectively. She received the Second Prize of the National Award for Science and Technology Progress in 2018. Her main research interests include social network analysis, social recommendation, and big data analysis.

Hong Chen is currently a professor at Renmin University of China, Beijing. She received her Ph.D. degree from Chinese Academy of Sciences, Beijing, in 2000. Before that, she received her B.S. and M.S. degrees from Renmin University of China, Beijing, in 1986 and 1989, respectively. She received the Second Prize of the National Award for Science and Technology Progress in 2018. Her research interests include database technology and high-performance computing.