ICDE 2018: A Graph-Based Database Partitioning Method for Parallel OLAP Query Processing
Abstract—As the amount of data to process increases, a scalable and efficient horizontal database partitioning method becomes more important for OLAP query processing in parallel database platforms. Existing partitioning methods have a few major drawbacks, such as a large amount of data redundancy and, despite that redundancy, a failure to support join processing without shuffle in many cases. We elucidate that the drawbacks arise from their tree-based partitioning schemes and propose a novel graph-based database partitioning method called GPT that improves query performance with lower data redundancy. Through extensive experiments using three benchmarks, we show that GPT significantly outperforms the state-of-the-art method in terms of both storage overhead and query performance.

1 INTRODUCTION

As the amount of data to process increases, a scalable and efficient parallel database platform becomes more important. A number of parallel database platforms exist, including Apache Spark [1], Apache Impala [2], SAP HANA [3], HP Vertica [4], and Greenplum [5]. To exploit parallel data processing for OLAP queries, they typically store data blocks over a cluster of machines, execute local operations in each machine, and then repartition (shuffle) the local processing results to handle join or aggregation. Here, repartitioning is an expensive remote operation involving network communication, and its cost tends to increase as the data size or the number of machines increases [6], [7], [8].

In order to avoid expensive join operations with shuffle, a number of methods have been proposed to horizontally partition a database in an offline manner [9], [10], [11], [12]. The methods in [9], [10] co-partition only the tables containing the common join keys. However, these methods are not particularly useful for a complex schema with many tables or for complex queries with join paths over multiple tables that use different join keys. The REF method [11] partitions a table R by a foreign key of R referring to another table S that is already partitioned by a primary or foreign key of S. The PREF method [12], which is the state-of-the-art method, generalizes the REF method by exploiting not only referential constraints but also join predicates (PREF-partitioning for short). PREF fully replicates manually selected small tables and partitions the remaining large tables. If a query workload is available, PREF uses a workload-driven (WD) algorithm that uses the query workload to automatically find the best partitioning scheme. Otherwise, it uses a schema-driven (SD) algorithm that uses the database schema. The PREF/SD algorithm usually returns a single tree as a result, where a node indicates a table to be partitioned and an edge indicates PREF-partitioning. The root of the tree is called a seed table, which is hash-partitioned. Each of the seed's descendant tables is partitioned by an edge with its parent table. The PREF/WD algorithm usually returns a set of trees, i.e., a forest, as a result. PREF/WD tends to generate many trees to maximize data-locality, where the same table might occur in multiple trees and, therefore, be duplicated many times.

Although PREF is the state-of-the-art partitioning method, it still has three major drawbacks. First, PREF/SD tends to cause a large number of tuple-level duplicates, and this tendency becomes more marked as the database schema becomes more complex. This large amount of duplicates causes the initial bulk loading of a database to be very slow. Second, PREF/WD tends to cause a large number of table-level duplicates. That is, it stores the same table many times across partitions. Third, PREF requires shuffle for query processing in many cases, despite its large data redundancy, and so, query performance tends to be degraded. Most of the drawbacks of PREF come from its tree-based partitioning scheme. In PREF, all edges in a tree or forest have a direction from source (i.e., referencing table) to destination (i.e., referenced table), which incurs so-called cumulative redundancy [12]. In addition, no cycles are allowed in the tree-based partitioning scheme, and so, join operations in complex queries cannot be processed without shuffle in many cases, but must be processed with shuffle. We present the above drawbacks in detail in Section 2.

To solve the above problems, we propose a novel graph-based database partitioning method called GPT. Intuitively, the GPT method determines an undirected multigraph from a schema graph or workload graph as its partitioning scheme. In the undirected multigraph, a vertex represents a table to be partitioned, and an edge represents a co-partitioning relationship. Since the partitioning scheme is a single graph where each table occurs only once, there are no table-level duplicates. For co-partitioning between two tables, we propose the hash-based multi-column (HMC) partitioning method. It is a kind of hash-based partitioning method that has no parent-child dependencies among tables. Since there is no dependency among tables, it does not incur cumulative redundancy. Consequently, it results in far fewer tuple-level duplicates.

The GPT method determines the undirected multigraph so as to contain many triangles of vertices (tables). Therefore, most join operations involving the tables in these triangles can be processed without network communication. Here, GPT determines the partitioning scheme so that these triangles have common shared vertices called hub tables, which improve the query performance while using less storage space. GPT also determines the partitioning scheme in a cost-based manner by considering the trade-off between the benefit of co-partitioning and the storage overhead it incurs.

Figure 1 shows the schema-driven partitioning schemes determined by PREF and GPT for TPC-DS, where each box indicates a table.

The partitioning scheme of PREF/WD has fewer tuple-level duplicates than does that of PREF/SD due to its lower depth
of each tree in the forest. Instead, however, it includes a large number of table-level duplicates. For instance, in Figure 2(c), the large fact table SS appears repeatedly in Tree#2, Tree#4, and Tree#5. Here, we note that the table SS is partitioned differently in each tree since the paths from the root table in each tree are different. The tables shown in gray indicate table duplicates. In general, as the query workload increases, PREF/WD determines more trees. Therefore, the number of table duplicates also increases.

In contrast, as shown in Figure 2(b), GPT/WD has no such table-level duplicates since its partitioning scheme is a single graph. The graph determined by GPT/WD is usually similar to that constructed by GPT/SD. We will explain how to determine such a graph, and how to partition a table considering its many adjacent tables in the graph, in Section 3.

2.4 Join Operations with Repartitioning

In spite of its large data redundancy, the tree-based partitioning schemes of PREF require shuffle during join operations in many cases, which can degrade query performance. A complex analytic query may contain cycles in the corresponding query join graph, and repartition operations are unavoidable when processing such cycles using PREF since it does not allow cycles in its partitioning scheme.

For example, Figure 2(a) shows the query join graph for the TPC-DS Q17 query, which involves five partitioned tables and nine join conditions. Each edge in the graph indicates one or more join conditions between the corresponding two end tables. When executing the Q17 query on top of the database partitioned by PREF/WD in Figure 2(c), Tree#2 is used (details in [12]). In Tree#2, the tables and join operations used for Q17 are drawn using thick boxes and thick lines, respectively. Some join operations such as SS-I and SS-D can be processed without shuffle since the database is already partitioned appropriately. However, when projecting the query join graph of Q17 onto Tree#2, two dotted red edges, CS-D and SR-D, do not exist in Tree#2, which means those two edges must be processed by join operations with shuffle.

In contrast, all the join operations of Q17 can be processed without shuffle on the database partitioned by GPT/WD. When projecting the query join graph of Q17 onto the partitioning scheme of GPT/WD in Figure 2(b), the query join graph becomes a subgraph of GPT/WD, and so, all joins can be processed in a single MapReduce round. As shown in Figure 5(b) in more detail, the three tables CS, SR, and SS used in Q17 are partitioned by their date and item columns due to the two "hub" tables D and I in GPT/WD. Here, the pair of tables CS and SR have a common partition column, item, although we do not co-partition those tables explicitly. GPT performs hash-based partitioning on the partition column(s) of each table, and so, the tables CS and SR are co-partitioned implicitly. As a result, the join condition CS.item = SR.item in Q17 can be processed without shuffle. We call this type of edge between CS and SR an indirect join edge. In fact, there are many other indirect join edges in Figure 2(b), but we omit them for simplicity. Our GPT method determines a partitioning scheme by considering such indirect join edges, which we will explain in Section 3. The method of generating and optimizing a query plan is beyond the scope of this paper. Instead, we present a basic query processing method for the database partitioned by GPT in Section 4.

3 GPT METHOD

In this section, we propose our graph-based database partitioning (GPT) method. GPT determines an undirected multigraph as a partitioning scheme for a given schema or workload graph. Section 3.1 introduces the input join graph and the output partitioning scheme. Section 3.2 presents the problem definition, Section 3.3 explains the triangles and hubs in the partitioning scheme, Section 3.4 proposes the partitioning algorithm, and Section 3.5 shows a case study using TPC-DS. We summarize the symbols used in the paper in Table I.

TABLE I
LIST OF SYMBOLS.

Symbol      Meaning
C(T)        a set of partitioning columns for a table T
P(T)        the horizontally partitioned table for table T
T[i]        the i-th column of T (i ∈ Z+)
||T||       the size of T (in bytes)
||P(T)||    the size of the partitioned table P(T) (in bytes)
N           the number of horizontal partitions

3.1 Join Graph and Partitioned Graph

We construct an input graph from a database schema or query workload. We simply call it a join graph and define it in Definition 1.

Definition 1: (Join graph) A join graph G = (V, E, l(e ∈ E), w(e ∈ E)) is an undirected and weighted multigraph. A vertex v ∈ V denotes a table. An edge e ∈ E denotes a (potential) join relationship between two tables R and S, specifically between R[i] and S[j], where i and j are column indexes of R and S, respectively. The labeling function l(e) returns the equi-join predicate for edge e, i.e., l(e) = (R[i], S[j]). The weight function w(e) returns the join frequency of the edge e.

A join graph is constructed using either a schema-driven approach or a workload-driven approach. The schema-driven (SD) approach generates a join graph GS = (VS, ES) based on the database schema S. The set of tables in S becomes VS, and the set of referential constraints in S becomes ES, which are considered as potential equi-join operations. The weight function w(e) of GS returns 1. The workload-driven (WD) approach generates a join graph GW = (VW, EW) based on the query workload W. The set of tables appearing in W becomes VW, and the set of equi-join predicates in W becomes EW. The weight function w(e) of GW returns the number of occurrences of the join predicate l(e) in the workload W, i.e., the join frequency of e in W.

We denote the resulting partitioning scheme as PG, which is a subgraph of the input join graph G (i.e., PG ⊆ G). To determine PG, we start from an empty PG s.t. PG.V = ∅ and PG.E = ∅. We regard adding a vertex v ∈ G.V to PG (i.e., v ∈ PG.V) as horizontally partitioning table v, and regard not adding a vertex v ∈ G.V to PG (i.e., v ∉ PG.V) as replicating table v across machines. In addition, we regard adding an edge e ∈ G.E to PG (i.e., e ∈ PG.E) as co-partitioning the two end tables of e according to l(e). We need to decide whether to add each v to PG or not, and also whether to add each e to PG or not.
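To make these structures concrete, the following is a minimal Python sketch (ours, not the authors' implementation) of a join graph as an undirected weighted multigraph; the TPC-DS-style table names, column indexes, and frequencies are purely illustrative.

from collections import defaultdict

class JoinGraph:
    """Undirected, weighted multigraph G = (V, E, l, w).
    An edge key (R, i, S, j) stands for the equi-join predicate R[i] = S[j]."""

    def __init__(self):
        self.vertices = set()
        self.w = defaultdict(int)            # w(e): join frequency of edge e

    def add_join(self, r, i, s, j, freq=1):
        self.vertices |= {r, s}
        # normalize so that (R[i], S[j]) and (S[j], R[i]) are the same edge
        e = (r, i, s, j) if (r, i) <= (s, j) else (s, j, r, i)
        self.w[e] += freq                    # SD graph: freq stays 1 per constraint

# Hypothetical workload-driven graph over three TPC-DS-style tables:
g = JoinGraph()
g.add_join("SS", 1, "D", 1, freq=5)          # e.g., store_sales.date = date_dim.date
g.add_join("SS", 2, "I", 1, freq=3)          # e.g., store_sales.item = item.item
print(g.vertices, dict(g.w))
# A partitioned graph PG is then a chosen subset of the vertices (Part-tables)
# plus a chosen subset of these edges (co-partitioning relationships).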
As described above, we categorize the vertices (i.e., tables) of G into two types: Part-tables and Rep-tables. For a Part-table T, we let C(T) be the subset of the columns of T used for partitioning T. We note that if e = (R[i], S[j]) exists in PG, then R[i] ∈ C(R) and S[j] ∈ C(S). When a vertex (table) T is of the Part type and has no edges, then C(T) = ∅, and we simply split T into fixed-size blocks and distribute them across machines randomly. When T is of the Rep type, we do not need to choose a set of partitioning columns for T. However, when T is of the Part type, we need to choose its partitioning columns carefully since they can affect both storage overhead and query performance.

3.2 Determination of Vertices

We consider that a good partitioned graph can improve the query performance largely using only a small amount of additional storage space. Without loss of generality, there are two criteria for evaluating the goodness of PG: space overhead from data redundancy and query performance improvement by co-partitioning. We try to find the optimal partitioned graph PG* that maximally satisfies these criteria under a certain cost function. However, since there are up to 2^|V| possible combinations in terms of vertices of PG and up to 2^|E| possible combinations in terms of edges of PG, finding PG* might be computationally prohibitive. For example, in the TPC-DS benchmark, |V| is greater than 20 and |E| is larger than 100, and furthermore, problems in many real applications exceed the size of the TPC-DS benchmark [13], [14].

To solve this problem, we use a heuristic approach consisting of the following two steps: determining the set of vertices to be added to PG, and then determining the set of edges among those vertices to be added to PG. We present the first step in this section, and the second step in Section 3.4.

As described above, determining a vertex v as a Part type means partitioning v horizontally, while determining v as a Rep type means fully replicating v across partitions. Fully replicating a table smaller than a certain fixed threshold is a widely used technique in parallel database systems [8]. PREF also uses a fixed threshold (e.g., 1000 tuples). However, GPT uses an adaptive threshold rather than a fixed one, where the decision to partition or replicate v is based on the sizes of v's adjacent tables.

Let adj(T) be the set of adjacent tables of a table T in a join graph. We can formulate the total cost of I/O operations for an equi-join among table T and adj(T) as shown in Eq.(1) for a T of the Part type, or as shown in Eq.(2) for a T of the Rep type. Here, we assume that the cost of an equi-join between T and S (S ∈ adj(T)) is proportional to the sum of the sizes of T and S. At this point, we do not yet know the type of S, and so regard it as a Part type for simplicity. Since S is of a Part type, no shuffle will be required for the join between T and S in either equation.

    PartCost(T) = ||P(T)|| · |adj(T)| + Σ_{S ∈ adj(T)} ||S|| · |C(S)|    (1)

    RepCost(T) = ||T|| · N · |adj(T)| + Σ_{S ∈ adj(T)} ||S|| · (|C(S)| − 1)    (2)

In Eq.(1), we assume that the partitioned table P(T) is scanned |adj(T)| times due to the joins with its adjacent tables, and each adjacent table S ∈ adj(T) is scanned once. The size of a partitioned table P(S) can reach up to ||S|| · |C(S)|, since the table S can be co-partitioned with its adjacent tables using its |C(S)| different columns, and no correlation exists among these columns. Likewise, in Eq.(2), we assume that the table T replicated over N partitions is scanned |adj(T)| times. Here, we can regard the size of a partitioned table P(S) as ||S|| · (|C(S)| − 1) since there is no edge between T and S, and thus, there is one less partitioning column.

In Eqs.(1)-(2), it is difficult to know |C(S)| for each S in advance. We can eliminate the term by calculating PartCost(T) − RepCost(T) as in Eq.(3).

    DiffCost(T) = (||P(T)|| − ||T|| · N) · |adj(T)| + Σ_{S ∈ adj(T)} ||S||    (3)

If DiffCost(T) ≥ 0 for a table T, we classify T as the Rep type. Otherwise (i.e., DiffCost(T) < 0), we classify T as the Part type. Intuitively, under a fixed N, when table T is relatively small and has a large number of adjacent tables whose sizes are relatively large, then DiffCost(T) tends to be larger than zero. In this case, T is classified as the Rep type.
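To illustrate how Eqs.(1)-(3) drive this vertex decision, here is a small self-contained sketch (ours, not the authors' implementation); all sizes are made-up byte counts.

def diff_cost(p_size, size, n, adj_sizes):
    """Eq.(3): DiffCost(T) = (||P(T)|| - ||T||*N) * |adj(T)| + sum(||S||)."""
    return (p_size - size * n) * len(adj_sizes) + sum(adj_sizes)

def classify(p_size, size, n, adj_sizes):
    # Rep if DiffCost(T) >= 0, Part otherwise.
    return "Rep" if diff_cost(p_size, size, n, adj_sizes) >= 0 else "Part"

# A small table with two large neighbors (hypothetical sizes, N = 10):
print(classify(p_size=2_000, size=1_000, n=10, adj_sizes=[10**9, 5 * 10**8]))  # -> Rep
# A huge fact table with the same neighbors is classified as Part instead:
print(classify(p_size=2 * 10**12, size=10**12, n=10, adj_sizes=[10**9, 5 * 10**8]))  # -> Part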
3.3 Triangles and Hubs

In real graphs, a hub vertex is one connected to many other vertices. Likewise, join graphs of real databases can include hub tables that are connected with many other tables [14]. We denote these tables as Hub-tables. We have observed that there are a lot of triangles of tables that share Hub-tables as common vertices in the join graphs. These triangles provide many opportunities to improve query performance via joins without shuffle. For instance, Figure 2(b) shows three explicit triangles of tables, (CS, CR, D), (SR, SS, D), and (WS, WR, D), that share a hub table D. Many more implicit triangles exist due to indirect join edges, but we omit them.

Fig. 3. Examples of triangle edges, an indirect join edge, and a Hub-table ({e, e1, e2, triEdges(e, T[k])} ⊂ E).

We explain the concepts of triangle edges, indirect join edge, and Hub-table using Figure 3, where, for simplicity, we assume each of the five tables R, S, T, X, and Y has only a single column. In Figure 3, we regard the triangle edges triEdges(e, T[k]) for a given edge e and a vertex T[k] as the two edges (R[i], T[k]) and (S[j], T[k]). Here, triEdges(e, T[k]) and e form a triangle together in the join graph G. The table T might have additional columns that satisfy the above condition, and thus form multiple triangles together with e. We denote the set of those {k} columns of T as triCols(e, T). In Figure 3, we regard the indirect join edge (X[m], Y[n]) as one that does not exist in G, but forms a triangle together with the two edges e1 and e2 via T[k]. Then, we informally define a Hub-table as a table that forms one or more triangles together with either an actual edge e ∈ E or an indirect join edge, similar to T in the figure.

We note that an edge such as e = (R[i], S[j]) might have multiple Hub-tables that form triangles with it. We denote
these as hub(e) as follows:

    hub(e) = {Ti | ∃k ∈ Ti : (e, triEdges(e, Ti[k])) forms a triangle}    (4)

In addition, we denote the set of all triangle edges that share the edge e as triEdges(e):

    triEdges(e) = ∪_{Ti ∈ hub(e), k ∈ triCols(e, Ti)} triEdges(e, Ti[k])    (5)

A Hub-table T can improve query performance through horizontally partitioning T on the column T[k] in many cases. Thus, we change the type of a Hub-table from the Rep type to the Part type. In detail, adding an actual edge e = (R[i], S[j]) to PG, i.e., co-partitioning on e, allows three joins, i.e., (R[i], S[j]), (R[i], T[k]), and (S[j], T[k]), to be processed without shuffle. Partitioning T on T[k] is also effective even when e is an indirect join edge. In Figure 3, co-partitioning on (R[i], T[k]) and (S[j], T[k]) allows processing the join operation e = (R[i], S[j]) without shuffle and without explicit co-partitioning on e. This approach can be particularly useful for an SD join graph, where e does not appear in the database schema, but appears in the query workload. Therefore, in general, when a Hub-table T has a higher degree, i.e., is shared among more triangles, partitioning T can further improve query performance.
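The triangle machinery can be made concrete with a small sketch (ours; tables are simplified to single columns as in Figure 3, so triCols(e, T) is implicit, and the table names are hypothetical):

def norm(u, v):
    """Undirected edge key."""
    return (u, v) if u <= v else (v, u)

def tri_edges(e, t):
    """triEdges(e, T): the two edges (R, T) and (S, T) closing a triangle over e = (R, S)."""
    r, s = e
    return [norm(r, t), norm(s, t)]

def hub(e, vertices, edges):
    """hub(e): tables T forming a triangle with e, in the spirit of Eq.(4)."""
    return {t for t in vertices - set(e)
            if all(x in edges for x in tri_edges(e, t))}

edges = {norm(*p) for p in [("CS", "D"), ("CS", "I"), ("SR", "D"), ("SR", "I")]}
print(hub(("CS", "SR"), {"CS", "SR", "D", "I"}, edges))  # -> {'D', 'I'}
# Note that ("CS", "SR") itself need not be in `edges`: it is exactly an
# indirect join edge, closed into a triangle via the hub tables D and I.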
3.4 Determination of Edges

Now, we discuss how to determine the set of edges among the vertices to be added to PG. Since determining the optimal set of edges for PG is still too difficult, we set a limit on the number of partitioning columns for each vertex, instead of allowing an arbitrary number of partitioning columns to be used. We denote the limit on the number of partitioning columns as κ. It is a user-defined parameter that can control the space overhead of PG as a single knob. Since κ limits the number of partitioning columns for each table, the maximum size of an entire partitioned database is approximately proportional to κ. As we use a higher κ, the size of the partitioned database increases, and at the same time, the opportunity for join operations without shuffle also increases.

Given κ, i.e., the space overhead parameter, we can improve the query performance by choosing a set of good edges to be added to PG. To evaluate the goodness of PG, we use the concept of the benefit of choosing the set of edges PG.E. We denote the cost function for this concept by benefit(PG.E), which means the sum of the amount of disk I/O for processing joins without shuffle under PG. Without loss of generality, we can say a PG that has a bigger benefit(PG.E) value is a better partitioning scheme. Thus, our goal is to find the optimal partitioned graph PG* that maximizes the benefit. We let the set of all possible PGs for a given κ be PG_κ. Then, we can define our problem as shown in Definition 2.

Definition 2: (Problem Definition) Given a database D and a join graph G = (V, E, l, w), the problem is finding the optimal partitioned graph PG* such that

    PG* = arg max_{PGi ∈ PG_κ} {benefit(PGi.E)}.    (6)

The purpose of our GPT method is to find a near-optimal partitioned graph PG (≈ PG*) that can improve the query performance largely using only a reasonable amount of additional storage compared with the original unpartitioned database. In particular, we use a bottom-up approach that adds edges one by one to the initial no-edge PG. In addition, we should consider adjusting the types of some Hub-vertices of PG from the Rep type to the Part type, as explained in Section 3.3. The GPT method both determines the edges and adjusts the vertices for PG in an intertwined manner.

In general, adding an edge e to PG increases the storage overhead due to tuple duplicates from co-partitioning. To measure the storage overhead, we adopt the definition of data redundancy (DR) for a database D in Eq.(7) [12]. A zero DR value means that ||P(D)|| is equal to ||D||, i.e., no additional storage overhead occurs from horizontal partitioning.

    DR(D) = ||P(D)|| / ||D|| − 1 = (Σ_{Ti ∈ D} ||P(Ti)||) / (Σ_{Ti ∈ D} ||Ti||) − 1    (7)
shuffle. In Eq.(10), adding e ∈ Et allows the join operations
Definition 2: (Problem Definition) Given a database D
corresponding to e ∈ triEdges(e) to be processed without
and a join graph G = (V ,E, l, w), the problem is finding
shuffle. We note that e is not considered as a benefit in Eq.(10)
the optimal partitioned graph P G∗ such that
since it is not an actual edge in a join graph. However, in case
P G∗ = arg max {benef it(P Gi .E)}. (6) of SD join graph, the join operations corresponding to e can
P Gi ∈PGκ
exist in the query workload, and so, the edge e itself can be beneficial.

For example, we assume that six tables exist, R, S, T, X, Y, and Z, as shown in Figure 4. Then, in the figure, the red edges are the targets of benefit(e1 ∈ Ea) in Eq.(8), the orange edge is the target of benefit(e2 ∈ Er) in Eq.(9), and the purple edges are the targets of benefit(e3 ∈ Et) in Eq.(10).

Fig. 4. Examples of three kinds of benefit(e).
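The three benefit formulas can be sketched as follows (our illustration; the sizes, weights, and triEdges map are hypothetical):

def e_size(e, sizes):
    """||e||: the sum of the sizes of the two end tables of e."""
    r, s = e
    return sizes[r] + sizes[s]

def benefit(e, kind, sizes, w, tri):
    """Eqs.(8)-(10): kind is 'intra' (Ea), 'inter' (Er), or 'indirect' (Et)."""
    own = e_size(e, sizes) * w.get(e, 1)
    tri_part = sum(e_size(t, sizes) * w.get(t, 1) for t in tri.get(e, []))
    if kind == "intra":      # Eq.(8): e itself plus its triangle edges
        return own + tri_part
    if kind == "inter":      # Eq.(9): e alone
        return own
    return tri_part          # Eq.(10): indirect edge, only its triangle edges count

sizes = {"R": 10, "S": 20, "T": 5}
tri = {("R", "S"): [("R", "T"), ("S", "T")]}
print(benefit(("R", "S"), "intra", sizes, w={}, tri=tri))     # 30 + 15 + 25 = 70
print(benefit(("R", "S"), "indirect", sizes, w={}, tri=tri))  # 15 + 25 = 40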
We present the GPT algorithm in Algorithm 1. Given a join graph G and the parameter κ, it produces a PG that can improve query performance largely while increasing DR only slightly. For brevity, we denote the Part-tables and Rep-tables of a join graph G as VPart and VRep, respectively (VPart ∪ VRep = V).

Algorithm 1 GPT: Graph-based database ParTitioning
Input: G = {V, E, w, l} // undirected multigraph
       κ // max # of partitioning columns per table
Variable: benefitQ // max-priority queue of (benefit(e), e)
Output: PG = {V, E} // partitioned graph (subgraph of G)
 1: // Step 1: initialization
 2: split V into VPart and VRep; // according to Eq.(3)
 3: add VPart to PG.V;
 4: Ea ← {e | e = (R.i, S.j) ∈ E ∧ R ≠ S ∧ R ∈ VPart ∧ S ∈ VPart};
 5: Er ← {e | e = (R.i, S.j) ∈ E ∧ R ≠ S ∧ R ∈ VPart ∧ S ∈ VRep};
 6: Et ← {e | e is an indirect join edge};
 7: // Step 2: building an initial benefitQ
 8: for each e ∈ Ea ∪ Er ∪ Et do
 9:   benefitQ.insert(benefit(e), e);
10: end for
11: // Step 3: adding edges and vertices to PG
12: while benefitQ ≠ ∅ do
13:   benefit, e ← benefitQ.extractMax();
14:   if (|C(R)| < κ) ∧ (|C(S)| < κ) s.t. (R, S) ∈ e then
15:     add hub(e) to PG.V;
16:     add e to PG.E;
17:     add triEdges(e) to PG.E;
18:     benefitQ.updateBenefit(adj(e));
19:   end if
20: end while
21: return PG;

In the initialization step (Lines 2-6), GPT sets VPart to the initial PG.V and classifies E into Ea, Er, and Et. Then, GPT builds a max-priority queue benefitQ that sorts and maintains all the edges by their benefit(e). In the main step (Lines 12-20), GPT extracts the edge e with the highest benefit from benefitQ and adds it to PG. Then, we check the κ constraint for the two end tables of e and add the edge to PG only when the constraint is satisfied. Here, if the edge e has Hub-tables as in Eq.(4), GPT also adds both hub(e) and triEdges(e) to PG. We note that adding triEdges(e) does not increase DR at all if the edge e exists. After adding e to PG, GPT identifies the set of adjacent edges of the two end vertices of e, i.e., adj(e). Then, if the sizes of the two end tables increase as a result of adding e, GPT updates, specifically decreases, the benefits of adj(e) to reflect the loss from data redundancy. GPT repeats this main step until benefitQ is empty.
3.5 A Case Study: TPC-DS Benchmark

In this section, we show the partitioning schemes determined by GPT for the TPC-DS benchmark. In Figure 5, GPT/SD and GPT/WD are quite similar to each other. PREF uses different algorithms to determine a partitioning scheme, depending on whether the input is a schema or a query workload. On the contrary, GPT uses the same algorithm, Algorithm 1, to determine the partitioning scheme regardless of the input. The join queries in OLAP workloads are typically derived from foreign key relationships in the corresponding schema [15], and so, both GPT/SD (using a join graph from the database schema) and GPT/WD (using that from the query workload) become similar. Thus, the GPT method can be especially useful when the query workload is not given, which will be shown in Section 5.3.

Fig. 5. Partitioning schemes of GPT for TPC-DS (κ = 2).

Both GPT/SD and GPT/WD have the same set of Part-tables since those Part-tables have a large portion in both inputs in terms of the cost model. However, they have different Hub-tables due to their different input join graphs. GPT/SD in Figure 5(a) has the tables I and C as hubs, while GPT/WD in Figure 5(b) has the tables I and D as hubs. The number of Rep-tables in GPT is greater than that of PREF. However, this is not an issue since replicating Rep-tables requires only a small amount of storage overhead (e.g., 0.2% of the whole partitioned database when the database size is 1 TB).

Both GPT/SD and GPT/WD include each table only once (i.e., no table-level duplicates), and also have none of the cumulative redundancy that PREF has, since they have no parent-child dependencies. Moreover, the graph-based partitioning schemes of GPT allow query processing to be performed without shuffle in most cases even for complex queries in the TPC-DS benchmark. That is partly due to the many indirect join edges that implicitly exist in the partitioning schemes. In Figure 5(b), GPT/WD contains 20 edges, and an additional 36 indirect join edges (a total of 56 edges). We omit the indirect join edges in the figure for simplicity. Instead, we present them in Table II. A total of 36 indirect join edges exist among the seven Part-tables and the two Hub-tables, D and I.

4 QUERY PROCESSING

We first propose our HMC partitioning method for co-partitioning each edge in the partitioning scheme in Section 4.1. Then, we present how the scan operator eliminates
duplicates efficiently in Section 4.2, and discuss the differences between GPT and PREF in terms of data redundancy and query performance in Section 4.3.

TABLE II
LIST OF INDIRECT JOIN EDGES IN GPT/WD (κ = 2).

no. edge                    no. edge                    no. edge
1  (CR.item, Inv.item)      2  (CR.date, Inv.date)      3  (Inv.item, SR.item)
4  (CR.item, WS.item)       5  (CR.date, WS.date)       6  (Inv.item, SS.item)
7  (CR.item, SR.item)       8  (CR.date, SR.date)       9  (Inv.item, WR.item)
10 (CR.item, SS.item)       11 (CR.date, SS.date)       12 (Inv.item, WS.item)
13 (CR.item, WR.item)       14 (CR.date, WR.date)       15 (Inv.date, SR.date)
16 (CS.item, WS.item)       17 (CS.date, WS.date)       18 (Inv.date, SS.date)
19 (CS.item, Inv.item)      20 (CS.date, Inv.date)      21 (Inv.date, WR.date)
22 (CS.item, SR.item)       23 (CS.date, SR.date)       24 (Inv.date, SS.date)
25 (CS.item, SS.item)       26 (CS.date, SS.date)       27 (SR.item, WS.item)
28 (CS.item, WR.item)       29 (CS.date, WR.date)       30 (SR.date, WR.date)
31 (SS.item, WS.item)       32 (SR.item, WR.item)       33 (SR.date, WS.date)
34 (SS.date, WR.date)       35 (SS.item, WR.item)       36 (SS.date, WR.date)

4.1 HMC Partitioning

In this section, we present our hash-based co-partitioning method, called HMC, for the edges in PG. We define HMC partitioning, which performs co-partitioning between two tables, in Definition 3. We let t.x be the value of column x of tuple t ∈ T.

Definition 3: (HMC partitioning) HMC partitioning partitions a table T horizontally by hashing the column values of its partitioning column(s) C(T). We denote the table partitioned by HMC partitioning as P(T) = ∪_{i=1}^{N} P_i(T), where P_i(T) is the i-th partition of P(T). For a hash function h(·) (1 ≤ h(·) ≤ N), a tuple t ∈ T is stored in the set of partitions {P_{h(t.c)}(T) | c ∈ C(T)}, where h(t.c) is the hash value of the column value t.c for the partitioning column c ∈ C(T).

Since h(·) is applied to each column value t.c (c ∈ C(T)) independently, a tuple t ∈ T might be duplicated in multiple partitions when |C(T)| > 1. A tuple t that has null values in some partitioning columns (i.e., ∃c ∈ C(T) : t.c = null) is duplicated in only the partitions {P_{h(t.c)}(T) | t.c ≠ null ∧ c ∈ C(T)}.

Our HMC partitioning method uses bitmap information called dup in order to eliminate tuple duplicates during query processing. Tuple duplicates are common in horizontal partitioning methods [12], and thus, an efficient method for eliminating duplicates is very important. For a partition P_i(T), we denote the bitmap vector of length |C(T)| for a tuple t ∈ P_i(T) as dup(P_i(T))[t]. In a bitmap table or a bitmap vector, 0 indicates a duplicate, while 1 indicates no duplicate. The content of the dup bitmaps for a tuple t ∈ T can be determined during the data loading of T. We let C(T) = {c_1, ..., c_m}, where m = |C(T)|. For a tuple t ∈ T copied to a partition P_i(T), the dup(P_i(T))[t] bitmap vector is determined as [b(1), ..., b(k), ..., b(m)], where b(k) = 1 if h(t.c_k) = i, and b(k) = 0 otherwise. We let the set of partition IDs where a tuple t ∈ T is duplicated be {p_1, ..., p_n}. Then, n bitmap vectors exist for a tuple t across partitions, where n ≤ |C(T)|. The sum of the 1s in these bitmap vectors is equal to |C(T)|. In a certain dup(P_i(T))[t], there might be more than one 1, when two or more column values of the partitioning columns have the same hash value h(·). When |C(T)| = 1, we do not need dup bitmaps for T since only 1s would exist in the bitmaps, that is, there are no duplicates.
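A runnable sketch of Definition 3 together with the dup bitmaps (ours; the modulo hash h and the example tuples mirror Figure 6 but are otherwise assumptions):

def hmc_partition(table, part_cols, n):
    """Sketch of HMC partitioning with dup bitmaps. Returns
    {partition_id: [(tuple, dup_bitmap), ...]}."""
    h = lambda v: (v - 1) % n + 1                # assumed hash, 1 <= h(.) <= N
    parts = {i: [] for i in range(1, n + 1)}
    for t in table:
        # a tuple is stored in P_h(t.c) for every non-null partitioning column c
        targets = {h(t[c]) for c in part_cols if t[c] is not None}
        for i in sorted(targets):
            dup = tuple(1 if t[c] is not None and h(t[c]) == i else 0
                        for c in part_cols)      # 1 = original, 0 = duplicate
            parts[i].append((t, dup))
    return parts

R = [(1, 3, 5), (2, 5, 7), (3, 5, 9)]
parts = hmc_partition(R, part_cols=[0, 1], n=3)
print(parts[1])   # [((1, 3, 5), (1, 0))], matching the example that follows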
Figure 6 shows an example of HMC partitioning for two tables R and S, where N = 3. The columns used for partitioning are shown in black. Table S has no dup bitmaps since it has only a single partitioning column. The first tuple of R, (1, 3, 5), is copied to both partitions P1(R) and P3(R), where the bitmap vectors in P1(R) and P3(R) are (1, 0) and (0, 1), respectively. The second tuple, (2, 5, 7), is copied only to partition P2(R), where its bitmap vector is (1, 1).

Fig. 6. Example of HMC partitioning (N = 3).

When storing each partition P_i(T) of T, we use the concept of a subpartition to efficiently eliminate duplicates in terms of disk I/O. We can divide P_i(T) into multiple disjoint subpartitions based on its bitmap information dup(P_i(T)). When |C(T)| = n, the number of possible bitmap vectors becomes 2^n. For a given table, the number of possible dup bitmap vectors is limited since the number of possible partitioning columns is also limited. For example, if |C(T)| = 2, the possible bitmap vectors are {00, 01, 10, 11} as binary strings. We create a subpartition for each distinct bitmap vector and store the tuples having the same dup bitmap vector in the same subpartition. Here, we note that GPT does not need to store dup bitmap vectors at all, since each subpartition already represents a unique bitmap vector for the tuples in the subpartition. We denote such a bitmap vector by bitV. As with dup, the length of bitV is |C(T)|, and we denote the bitV for a subpartition s ∈ P_i(T) as bitV(P_i(T))[s].

Figure 7 shows an example of subpartitions using the same table R used in Figure 6. We assume N = 3 and |C(R)| = 2. Then, a total of 3 × 2^2 = 12 subpartitions are created for R. For example, the tuple (1, 3, 5) is stored in subpartition 2 of P3(R) and subpartition 3 of P1(R). The tuple (2, 5, 7) is stored only in subpartition 4 of P2(R) since its bitmap vector is (1, 1), and its hash value is 2. Here, the bitV of subpartition 4 is (1, 1).

Fig. 7. Example of subpartitions (N = 3).
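Since each subpartition corresponds to exactly one bitV value, a subpartition can be addressed by reading the bitmap as a binary number, as in this sketch (ours); the resulting numbering is consistent with the example of Figure 7:

def subpartition_id(bitv):
    """Map a bitmap vector such as (1, 0) to a subpartition number 1..2^n.
    Each of the 2^n distinct bitV values gets its own subpartition, so the
    vector itself never has to be stored with the tuples."""
    n = len(bitv)
    return 1 + sum(b << (n - 1 - k) for k, b in enumerate(bitv))

# |C(R)| = 2 -> four subpartitions per partition:
print([subpartition_id(v) for v in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [1, 2, 3, 4]
# e.g., tuple (1, 3, 5) with bitV (1, 0) in P1(R) lands in subpartition 3,
# and with bitV (0, 1) in P3(R) it lands in subpartition 2, as in Figure 7.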
4.2 Duplicate Elimination

In this section, we present how to eliminate tuple duplicates to ensure the correctness of query results. If SQL queries are executed on partitioned tables, it is essential to eliminate tuple duplicates from the query results across partitions for correctness. Existing methods [12] usually rewrite the query
plan by adding repartitioning operations to eliminate duplicates. That approach is a kind of lazy elimination, since some tuple duplicates are carried through the pipeline of the plan in each machine and eliminated after shuffling via network communication. Carrying these unnecessary duplicates through repartitioning operations can cause extra query processing overhead.

The concept of subpartition presented in Section 4.1 allows us to eliminate duplicate tuples without carrying unnecessary duplicates and without repartitioning operations in most cases. Intuitively, we can selectively access the subpartitions that are not determined to be duplicates when reading a partition P_i(T) from storage, without false negatives or false positives. In detail, we read only the subpartitions of P_i(T) corresponding to the partitioning columns of T relevant to a given query. Since the scan operator is a low-end operation, there is no need to rewrite a query plan in principle; we just need to change the scan operator so as to access such subpartitions.

We explain how to make the scan operator aware of the bitmap information for the following two cases: (1) scanning a table irrelevant to a join and (2) scanning a table relevant to a join.

4.2.1 Scanning a Table Irrelevant to a Join (Single-Scan Mode): If a table T does not involve join operations with other tables, we can perform duplicate elimination by checking any particular bitmap column of bitV in each partition P_i(T). We denote the i-th bitmap column of bitV as bitV[·][i]. Although we can use any i-th bitmap column for 1 ≤ i ≤ |C(T)| for duplicate elimination, we just use the first bitmap column for simplicity. Then, for a subpartition s ∈ P_i(T), bitV[s][1] = 1 indicates that all tuples in the subpartition s are original tuples, and so, we read them from the subpartition s. In contrast, bitV[s][1] = 0 indicates that all tuples in the subpartition s are duplicated tuples, and so, we should not read them.

For instance, we consider scanning a partitioned table P(R) in the single-scan mode in Figure 7. We assume that the first bitmap column of bitV[s] (1 ≤ s ≤ 4) is used for scanning. At the storage level, the scan operator accesses only subpartitions 3 and 4 in Figure 7. They correspond to the set of tuples in orange in Figure 6, where no tuple duplicates exist.

4.2.2 Scanning a Table Relevant to a Join (Join Scan Mode): We assume the join predicate between R and S in a query Q is (r_1 = s_1) ∧ ··· ∧ (r_k = s_k), where the set {r_1, ..., r_k} is a subset of the columns of R, and the set {s_1, ..., s_k} is a subset of the columns of S. Then, we let C_Q(R) and C_Q(S) be the sets of partitioning columns used in the join predicate for tables R and S, respectively. That is, C_Q(R) = {r_1, ..., r_k} ∩ C(R), and C_Q(S) = {s_1, ..., s_k} ∩ C(S).

Join operation without shuffle: If C_Q(R) ∩ C_Q(S) ≠ ∅, then R and S are co-partitioned with each other, and so, we can perform the join operation without shuffle by using the co-partitioned column(s), i.e., C_Q(R) ∩ C_Q(S). The scan operator for R reads P(R) in a single-scan mode, but checks the bitmap column bitV(P_i(R))[·][j] s.t. j ∈ C_Q(R) ∩ C_Q(S), instead of bitV(P_i(R))[·][1]. Likewise, the scan operator for S also reads P(S) in a single-scan mode by checking bitV(P_i(S))[·][j] s.t. j ∈ C_Q(R) ∩ C_Q(S). Then, in each pair of partitions P_i(R), P_i(S), neither P_i(R) nor P_i(S) has duplicate tuples, and they are already co-partitioned on column j; thus, they do not require shuffle during the join, without false negatives or false positives. The remaining part of the join predicate, (r_1 = s_1) ∧ ··· ∧ (r_k = s_k) except (r_j = s_j), can be checked on the tuple pairs resulting from the join operation within each partition.

For instance, we consider a join between the two partitioned tables P(R) and P(S) in Figure 6. We assume the join condition is R[2] = S[1]. Then, C_Q(R) = {R[2]} and C_Q(S) = {S[1]}, and C_Q(R) ∩ C_Q(S) corresponds to the predicate R[2] = S[1]. In this join scan mode, the scan operator for P(R) uses the second bitmap column of bitV(P_i(R)), that is, it accesses subpartitions 2 and 4 in Figure 7. The scan operator for P(S) just scans P_i(S) (1 ≤ i ≤ 3) since there are no bitV bitmaps, i.e., no subpartitions. Then, the two tuples in the bold boxes of P(R) are successfully joined with the tuple of P(S).
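Both scan modes then reduce to picking which bitmap position must be 1, as the following sketch (ours, reusing hmc_partition and R from the sketch in Section 4.1) illustrates for the single-scan mode:

def scan(partition, col_idx):
    """Read only tuples whose dup/bitV bit for `col_idx` is 1 (no duplicates).
    Single-scan mode uses col_idx = 0 (the first bitmap column); join scan
    mode instead uses the index j of the co-partitioned column in C(T)."""
    return [t for (t, bitv) in partition if bitv[col_idx] == 1]

parts = hmc_partition(R, part_cols=[0, 1], n=3)
originals = [t for i in parts for t in scan(parts[i], 0)]
print(originals)   # each tuple of R exactly once: [(1, 3, 5), (2, 5, 7), (3, 5, 9)]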
Join operation with shuffle: If C_Q(R) ∩ C_Q(S) = ∅, then R and S are not co-partitioned with each other, and so, a repartitioning operation is unavoidable. This case is rare under the GPT method since the many triangles and indirect join edges in GPT's partitioning scheme tend to cover a given query join graph. The scan operator for R just reads P(R) in the single-scan mode described above. Likewise, the scan operator for S also reads P(S) in the single-scan mode. We note that no duplicate tuples are read from R and S. Then, the standard repartitioning operation performs the join operation between R and S, and only a minimal number of tuples are shuffled, as in an unpartitioned database. There is no need to eliminate duplicates during the join.

4.3 Comparison Analysis with PREF

In terms of data redundancy (DR), the DR of GPT increases proportionally with the parameter κ since each table can have up to κ partitioning columns, and the size of each table increases up to κ times under HMC partitioning. Here, we note that the size of the whole partitioned database stays the same regardless of whether the number of partitions N increases or the schema of the database becomes more complex. The number of edges in the partitioning scheme also does not affect the DR of GPT at all, since the number of partitioning columns is still limited to κ no matter how many adjacent edges each table has. In contrast, the DR of PREF/SD increases proportionally with the number of partitions N due to its reference partitioning, as shown in Figure 1(a), where a lot of D tuples are duplicated in every partition, and thus, a lot of WR tuples are also duplicated in every partition. The DR of PREF/WD tends to increase as the schema of the database becomes more complex, since the query trees are more diverse, and thus, a larger number of trees are found after merging. Each occurrence of a vertex (table) in the forest is stored independently due to its different reference partitioning. Thus, if a vertex T occurs M times in the forest, the total size of T in the partitioned database is at least M times larger than the size of the original T. If N > κ (for PREF/SD), or if the average frequency of a vertex in the forest, avg(M), is larger than κ (for PREF/WD), GPT can achieve a better DR than PREF.
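A quick arithmetic check of this break-even condition (our own, with hypothetical values):

# GPT: each table is stored at most kappa times      -> DR_GPT  <= kappa - 1
# PREF/WD: a table occurring M times in the forest is stored >= M times
kappa, avg_M = 2, 5                      # hypothetical workload statistics
dr_gpt_upper = kappa - 1                 # = 1
dr_pref_lower = avg_M - 1                # = 4
print(dr_pref_lower > dr_gpt_upper)      # True: avg(M) > kappa favors GPT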
In terms of query performance, GPT has the following two advantages over PREF: (1) scanning a smaller amount of data [...] multiple blocks, and reads only the necessary subpartitions having no duplicated tuples, which is determined by checking bitV, with respect to a given query. We will show the effects of (1) and (2) in Section 5.3.

5 EXPERIMENTAL EVALUATION

In this section, we present the experimental results in three parts. First, we compare GPT with the state-of-the-art partitioning method PREF [12] to show that GPT has both lower data redundancy and a shorter data loading time than PREF. Here, we evaluate GPT and PREF for both the schema-driven (SD) and workload-driven (WD) approaches, while varying the number of partitions and the scale of the database. Second, we compare the query performance on the database partitioned by GPT with that by PREF, in order to show that the graph-based partitioning of GPT outperforms the tree-based partitioning of PREF. Third, we evaluate the query performance while varying the κ parameter in order to show its characteristics.

5.1 Experimental Setup

Datasets/queries: For experiments using SQL queries, we use three different benchmarks: TPC-DS [16], IMDB [17], and BioWarehouse [14]. The first benchmark, TPC-DS [16], is widely used to evaluate the performance of OLAP queries running on parallel database systems. The size of the TPC-DS database is controlled by the scale factor (SF) parameter. SF=10 generates a database of approximately 10 GB, and SF=1000 generates a database of approximately 1000 GB, which are the typical scales used in the TPC-DS benchmark. Evaluating the query performance for complex, large join operations over a partitioned database can directly reveal the efficiency of a partitioning scheme. Thus, we use the TPC-DS queries that contain multiple join operations, at least one of which uses a large fact table as an operand, following the criteria in [18], which gives a total of 20 queries. The second benchmark, the Internet Movie DataBase (IMDB)¹, contains detailed information related to movies and consists of a total of 21 tables and 6.4 GB of data in text format². The schema of IMDB is less complex than that of TPC-DS. To evaluate query performance, we use 20 queries provided by the authors in [17], which contain two large tables (cast_info and movie_info) and more than eight join conditions. The third benchmark, BioWarehouse³, is a collection of heterogeneous bioinformatics datasets such as GenBank and NCBI Taxonomy. It contains 43 tables and 18.4 GB of data. The schema of BioWarehouse is more complex than that of TPC-DS. To evaluate query performance, we use five queries provided by the authors in [14]. We implemented all the queries on top of Hadoop and used κ = 2 for all the experiments related to GPT by default.

¹ http://www.imdb.com
² ftp://ftp.fu-berlin.de/pub/misc/movies/database/
³ http://biowarehouse.ai.sri.com
⁴ https://code.google.com/archive/p/xdb/

H/W setting: We conduct all the experiments on the same cluster of eleven machines (one master and 10 workers) by default. For the scalability experiments, we use a cluster of 21 machines (one master and 20 workers). Each machine is equipped with a six-core CPU, 32 GB of memory, and two types of secondary storage (a 4 TB HDD and a 1.2 TB PCI-E SSD). They are connected with a 1 Gbps interconnect by default.

S/W setting: We use HDFS in Apache Hadoop 2.4.1 to store the datasets for all systems. For the query processing of PREF and GPT, we use the MapReduce framework in Apache Hadoop 2.4.1. We assign 6 GB of memory for each map and reduce task such that up to five concurrent map/reduce tasks can be executed. To guarantee the data locality of the blocks of the same partition, we apply the custom block placement policy of HDFS as used in [19]. To obtain the partitioning schemes of PREF, we use the authors' implementation⁴.

5.2 Data Redundancy and Loading for TPC-DS

In this section, we evaluate the space overhead of both GPT and PREF by measuring their data redundancy (DR) with Eq.(7). Figure 8 shows the DR values of the databases partitioned by GPT and PREF while changing the scale of the database and the number of partitions. As Figure 8 shows, our GPT method significantly outperforms the PREF method in both the SD and WD approaches in all cases.

Fig. 8. Data redundancy of PREF and GPT.

We note that PREF results in a low DR for a relatively simple database schema such as TPC-H, as reported in [12], but it results in a very high DR for a relatively complex and more realistic database schema such as TPC-DS. In particular, the data redundancy of PREF/SD increases drastically as the scale of the database or the number of partitions increases, due to the phenomenon of cumulative redundancy explained in Section 2. Under PREF/WD, the data redundancy does not increase, since each tree is too small to incur cumulative redundancy, and only table-level duplicates exist among the multiple trees. However, it is much higher than that of GPT/WD due to those table-level duplicates. We note that the data redundancy of GPT is fairly stable regardless of both the scale of the database and the number of partitions, since it mainly depends on the number of partitioning columns (κ). As a database schema becomes more complex, i.e., a snowstorm schema [20] with hundreds of tables, we would expect the gap between GPT and PREF to become wider.

We also evaluate the performance of database bulk loading of GPT and PREF. Figure 9 shows the elapsed times of bulk
loading while changing the scale of the database and the number of partitions. In the figure, our GPT method significantly outperforms the PREF method in both the SD and WD approaches for all cases. These results are mainly due to the data redundancy shown in Figure 8. In more detail, in Figure 9(a), the loading times of both PREF and GPT increase proportionally to the scale of the database since the sizes of the partitioned databases increase. In Figure 9(b), the loading times of GPT remain fairly stable as the number of partitions increases since the sizes of the partitioned databases remain the same.

Fig. 9. Elapsed times of bulk loading of PREF and GPT.

5.3 Query Performance for TPC-DS

PREF vs. GPT: Figure 10 shows the query performance on the databases partitioned by PREF/SD, PREF/WD, GPT/SD, and GPT/WD for 20 TPC-DS queries. In this experiment, we use SF=1000 for TPC-DS and set N = 10. In Figure 10(a), GPT/SD significantly outperforms PREF/SD on most of the queries, although its data redundancy is much lower than that of PREF. We note that the Y-axis in this figure is log-scale. For some queries, the large sizes of PREF-partitioned tables tend to degrade the performance of query processing. For example, the two fact tables, Inv and WR, in Figure 1(a) are almost fully duplicated in every partition. Thus, the Q85 query, which includes a join operation between the two fact tables WS and WR, requires a large amount of I/O to scan the table WR, which severely degrades the query performance as shown in Figure 10(a). Compared to PREF/SD, GPT/SD improves the performance of the Q85 query by a factor of 122.

In Figure 10(b), GPT/WD still outperforms PREF/WD for most of the queries, despite its lower data redundancy (i.e., using smaller storage space). In Figure 10(c), GPT/WD improves the performance over PREF/WD by 48%. In that result, the DR of GPT/WD is only 0.92, while that of PREF/WD is 2.16. As a result, GPT/WD is 48% faster, and its storage overhead is 2.35 times smaller than that of PREF/WD. For some queries, the tree-based partitioning scheme of PREF/WD tends to degrade the performance largely due to shuffles during joins. For example, the Q17 query mentioned in Section 2 belongs to that case.

As shown in Figure 10(c), the gap in query performance between PREF/SD and PREF/WD is huge, whereas the performance gap between GPT/SD and GPT/WD is negligible. Query performance tends to depend heavily on the partitioning scheme, since the partitioning scheme represents all possible opportunities for join operations without shuffle in general. PREF/SD and PREF/WD are quite different from each other in Figures 1(a) and 2(c), while GPT/SD and GPT/WD are very similar to each other in Figure 5. These results mean that the GPT method can be very useful especially when the query workload is not available. In fact, GPT/SD outperforms PREF/SD by approximately nine times in Figure 10(c). We omit GPT/SD in the following experiments hereafter when it is not necessary.

Scalability: Figure 10(d) shows the elapsed times for processing the 20 TPC-DS queries while varying the number of machines. For this experiment, we use SF=1000 for TPC-DS and partition the database using GPT/WD. We set the number of partitions N to the number of machines. The result shows that the performance of GPT is quite scalable in terms of the number of machines. In GPT, most of the query processing is performed without shuffle, i.e., in a truly shared-nothing manner. In addition, the DR of the partitioned database is not affected by the number of machines, but only by κ. That means the amount of data to be processed on each machine decreases as the number of machines increases. As a result, the performance of GPT should be quite scalable in terms of the number of machines used.

Performance breakdown: Figure 10(e) shows the performance breakdown of our proposed method using three queries, Q17, Q29, and Q85. There are three possible configurations based on the two major techniques in the paper: GPT partitioning and subpartitioning. The major performance improvement comes from GPT partitioning, since it allows most of the join operations to be processed without shuffle. Subpartitioning further improves the performance by 1.26-1.69 times, since it avoids scanning duplicated tuples.

5.4 Results for IMDB and BioWarehouse

Figure 11 shows the partitioning schemes of GPT/WD (κ = 2) for IMDB and BioWarehouse. As in TPC-DS, the GPT method can find a single graph that includes some hub tables for each benchmark. Compared with Figure 5(b), Figure 11(a) is small due to the simpler IMDB schema, while Figure 11(b) is large due to the more complex BioWarehouse schema.

Figure 12 shows the comparison results between GPT/WD and PREF/WD in terms of data redundancy and query performance for the two benchmarks. GPT/WD outperforms PREF/WD in terms of both data redundancy and query performance. In Figure 12(a), the DR of PREF/WD for IMDB is quite large due to severe table-level duplicates from lots of trees (i.e., 10) and tuple-level duplicates from many FK-FK relationships between parent and child tables. In Figure 12(b), we use the sums of elapsed times to evaluate query performance. We note that the gaps between PREF/WD and GPT/WD for IMDB and BioWarehouse are larger than that for TPC-DS. This is because the queries in IMDB and BioWarehouse are more complex (e.g., contain more join conditions) than those in TPC-DS.

5.5 Characteristics of GPT

In this section, we evaluate the data redundancy and query performance of GPT while varying κ. Figure 13 shows the partitioning schemes of GPT/SD when κ = 1 and κ = 3. Each table has only a single partitioning column in Figure 13(a), whereas each table has up to three partitioning columns in Figure 13(b). Only a single Hub-table appears in Figure 13(a), whereas there are three Hub-tables in Figure 13(b) due to the increased number of partitioning columns.
Fig. 10. (a)-(c) Query performance of PREF and GPT in elapsed times; (d) scalability of GPT; (e) performance breakdown while varying optimization techniques.

Fig. 11. Partitioning schemes of GPT/WD for IMDB and BioWarehouse.

Fig. 12. Comparison using IMDB, TPC-DS (SF=100) and BioWarehouse.

Fig. 13. Partitioning schemes of GPT/SD (varying κ).
Figure 14(a) shows the data redundancy of GPT/SD and GPT/WD when using TPC-DS with SF=1000. As explained in Section 3.4, DR increases proportionally to κ. At κ = 1, there is no redundancy in Part-tables, but DR is non-zero due to [...]. [...] is 2, but actual DR values are less than 2 due to correlations among the partitioning columns. Figure 14(b) shows the query performance of GPT/WD under a wide range of H/W settings while varying κ. Here, we use TPC-DS with SF=1000. Among the three κ values, κ = 2 shows the best query performance with only a small DR (less than 1) for all H/W settings, which coincides with the explanation in Section 3.4. A lower κ value (i.e., κ = 1) can result in worse performance due to the repartitioning overhead, while a higher κ value (i.e., κ = 3) can result in worse performance due to the storage overhead. We note that using faster storage, i.e., PCI-E SSD, reduces the performance gaps among different κ values, since it reduces both the storage overhead and the read/write overhead of intermediate data during repartitioning. Figure 14(c) shows the query performance of GPT/WD under a wide range of H/W settings while varying both κ and the benchmark dataset. We note that the setting κ = 2 still shows the best overall performance across the different datasets.
Fig. 14. Data redundancy and query performance of GPT (varying κ).
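To illustrate why DR grows with κ, consider storing one copy of each tuple per partitioning column. The sketch below is our own simplification with a hypothetical table and columns; it is not the paper's HMC method and does not reproduce the DR formula of Section 3.4. It only shows the trend: the number of stored copies scales with κ, and copies whose hash targets coincide collapse into one, which is why actual DR stays below the upper bound.

# A rough, hypothetical sketch: each tuple is stored once per partitioning
# column, so the number of stored copies (and hence DR) grows with kappa.
NUM_PARTITIONS = 4

def partitioned_size(rows, part_cols):
    """Count stored copies when a tuple goes to one partition per column."""
    stored = 0
    for r in rows:
        # Coinciding targets collapse, mirroring correlated partitioning columns.
        stored += len({hash(r[c]) % NUM_PARTITIONS for c in part_cols})
    return stored

rows = [{"a": i, "b": 7 * i, "c": i % 5} for i in range(1000)]
for kappa in (1, 2, 3):
    stored = partitioned_size(rows, ["a", "b", "c"][:kappa])
    print(f"kappa={kappa}: stored={stored}, "
          f"extra copies per tuple={stored / len(rows) - 1:.2f}")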
6 RELATED WORK
Database Partitioning Scheme for OLAP: The major performance gain in database partitioning comes from parallel query processing without repartition operations [12], [21], [22]. To remove repartition operations, it is essential to decide the appropriate partitioning columns for tables. The methods in [10], [9] co-partition the tables by their join columns, so that a join operation between co-partitioned tables can be processed without shuffle. The REF method in [11] proposed reference partitioning, which considers the referential constraints in the table schema as partitioning predicates. The columns in a referential constraint become the partitioning columns of the tables, and therefore, the tables in the same referential constraint are co-partitioned. The PREF method in [12] partitions the tables based not only on referential constraints but also on join predicates. Although this approach allows the database to be partitioned by a larger number of constraints or predicates than the REF method does, it also tends to incur more data redundancy. The PREF method generates tree-structured partitioning schemes that have data dependencies between the parent and child tables. Such partitioning schemes trigger two kinds of drawbacks: high data redundancy and low query performance. Our graph-based partitioning schemes solve those drawbacks of the PREF method. Skipping-oriented database partitioning methods [23], [24] have also been proposed; they focus on scanning less data during query processing for relatively simple queries that have few join operations. AdaptDB [25] proposes an online partitioning method that focuses on continuously repartitioning small portions of data at runtime, but it still uses tree-based partitioning schemes.
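As a minimal sketch of this classic co-partitioning idea (our illustration with made-up data; the table and column names are hypothetical), hash-partitioning both tables on the join column sends matching tuples to the same partition, so each partition can be joined locally:

# A minimal sketch (hypothetical data) of co-partitioning by join column:
# both tables are hash-partitioned on the join key, so every matching pair
# falls within one partition and the join needs no shuffle.
NUM_PARTITIONS = 3

def hash_partition(rows, key):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for r in rows:
        parts[hash(r[key]) % NUM_PARTITIONS].append(r)
    return parts

orders = [{"o_custkey": i % 10, "o_id": i} for i in range(30)]
customer = [{"c_custkey": i, "c_name": f"cust{i}"} for i in range(10)]

o_parts = hash_partition(orders, "o_custkey")
c_parts = hash_partition(customer, "c_custkey")

# Partition-local join: only the i-th partitions are paired with each other.
result = [(o["o_id"], c["c_name"])
          for op, cp in zip(o_parts, c_parts)
          for o in op for c in cp
          if o["o_custkey"] == c["c_custkey"]]
print(len(result), "joined rows computed without shuffle")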
Database Partitioning Scheme for OLTP: A number of studies have been proposed to improve the performance of OLTP query processing [26], [27], [28], [29], [30]. For OLTP query processing, it is usually beneficial to use as few machines as possible for a single query, since the amount of data accessed per query is usually quite small and many queries typically need to be processed concurrently. Thus, parallel OLTP systems [26], [27], [28], [30] try to minimize the number of distributed transactions for given query workloads. To do that, they partition the database based on the query workloads such that the overheads of query processing across the partitions are not skewed, but balanced. To partition a database, they usually create a graph with a node per tuple and an edge between nodes accessed by the same transaction, and use an existing graph partitioner (e.g., METIS [31]) to split the graph into multiple balanced partitions that minimize the number of cross-partition transactions. The size of the target graph to be partitioned in these OLTP systems is quite large since it is a tuple-level graph, whereas that in our GPT is very small since it is a table-level graph.
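The tuple-level co-access graph these systems build can be sketched in a few lines. The following is our own toy illustration: the transactions are invented, and a naive greedy splitter stands in for a real partitioner such as METIS [31].

# A sketch (hypothetical workload) of the tuple-level co-access graph used
# by workload-driven OLTP partitioners. A real system would hand this graph
# to METIS; here a naive greedy split stands in for it.
from collections import defaultdict
from itertools import combinations

transactions = [          # each transaction = the set of tuple ids it accesses
    {"t1", "t2"}, {"t1", "t2"}, {"t2", "t3"},
    {"t4", "t5"}, {"t4", "t5", "t6"}, {"t5", "t6"},
]

edge_weight = defaultdict(int)
for txn in transactions:
    for u, v in combinations(sorted(txn), 2):
        edge_weight[(u, v)] += 1          # weight = co-access frequency

nodes = sorted({n for txn in transactions for n in txn})

# Greedy stand-in for METIS: place each node where it has the most
# already-placed neighbors, breaking ties toward the smaller partition.
parts = [set(), set()]
for n in nodes:
    def score(p):
        return sum(w for (u, v), w in edge_weight.items()
                   if (u == n and v in p) or (v == n and u in p))
    parts.sort(key=len)                   # smaller partition wins ties
    max(parts, key=score).add(n)

cut = sum(w for (u, v), w in edge_weight.items()
          if any(u in p and v not in p for p in parts))
print("partitions:", parts, "cross-partition weight:", cut)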
7 CONCLUSIONS
In this paper, we have proposed a novel graph-based database partitioning method called GPT that can largely improve query performance while using only a small amount of additional storage space. Different from the state-of-the-art partitioning method PREF, the GPT method determines an undirected multigraph, rather than a tree or a forest, as a partitioning scheme. GPT determines the partitioning scheme in a cost-based manner by considering the trade-off between data redundancy and the number of opportunities for join processing without shuffle. The resulting partitioning scheme contains many explicit or implicit triangles of tables that can cover a query join graph in many cases, allowing a query engine to process the join query without performing repartitioning. Each edge of the undirected multigraph is assumed to be co-partitioned by the proposed HMC partitioning method with subpartitions. This approach incurs no cumulative redundancy, and results in both less overhead for eliminating duplicates and faster initial bulk loading. Through extensive experiments using three benchmarks including TPC-DS, we have shown that the database partitioned by GPT has 2.35 times smaller storage overhead than that by the state-of-the-art method PREF, and at the same time, query performance using the database partitioned by GPT is 48% faster than that by PREF due to fewer join operations requiring shuffles.
Acknowledgments: This work was partly supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development), the DGIST R&D Program of the Ministry of Science and ICT (17-BD-0404), and the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2017R1E1A1A01077630).
REFERENCES
[1] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi et al., "Spark SQL: Relational data processing in Spark," in SIGMOD, 2015.
[2] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs et al., "Impala: A modern, open-source SQL engine for Hadoop," in CIDR, 2015.
[3] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA database: Data management for modern business applications," SIGMOD Record, 2012.
[4] A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandiver, L. Doshi, and C. Bear, "The Vertica analytic database: C-Store 7 years later," in VLDB, 2012.
[5] F. M. Waas, "Beyond conventional data warehousing: Massively parallel data processing with Greenplum database," in BIRTE (Informal Proceedings), 2008.
[6] X. Zhang, L. Chen, and M. Wang, "Efficient multi-way theta-join processing using MapReduce," in VLDB, 2012.
[7] W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann, "High-speed query processing over high-speed networks," in VLDB, 2015.
[8] S. Chu, M. Balazinska, and D. Suciu, "From theory to practice: Efficient join query evaluation in a parallel database system," in SIGMOD, 2015.
[9] S. Fushimi, M. Kitsuregawa, and H. Tanaka, "An overview of the system software of a parallel relational database machine GRACE," in VLDB, 1986.
[10] D. J. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H.-I. Hsiao, R. Rasmussen et al., "The Gamma database machine project," TKDE, 1990.
[11] G. Eadon, E. I. Chong, S. Shankar, A. Raghavan, J. Srinivasan, and S. Das, "Supporting table partitioning by reference in Oracle," in SIGMOD, 2008.
[12] E. Zamanian, C. Binnig, and A. Salama, "Locality-aware partitioning in parallel database systems," in SIGMOD, 2015.
[13] C. Loboz, S. Smyl, and S. Nath, "DataGarage: Warehousing massive performance data on commodity servers," in VLDB, 2010.
[14] T. J. Lee, Y. Pouliot, V. Wagner, P. Gupta, D. W. Stringer-Calvert, J. D. Tenenbaum, and P. D. Karp, "BioWarehouse: A bioinformatics database warehouse toolkit," BMC Bioinformatics, vol. 7, no. 1, 2006.
[15] A. Weininger, "Efficient execution of joins in a star schema," in SIGMOD, 2002.
[16] R. O. Nambiar and M. Poess, "The making of TPC-DS," in VLDB, 2006.
[17] V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann, "How good are query optimizers, really?" in VLDB, 2015.
[18] H. Ma, B. Shao, Y. Xiao, L. J. Chen, and H. Wang, "G-SQL: Fast query processing via graph exploration," in VLDB, 2016.
[19] M. Y. Eltabakh, Y. Tian, F. Özcan, R. Gemulla, A. Krettek, and J. McPherson, "CoHadoop: Flexible data placement and its exploitation in Hadoop," in VLDB, 2011.
[20] R. Ahmed, R. Sen, M. Poess, and S. Chakkappen, "Of snowstorms and bushy trees," in VLDB, 2014.
[21] J. Rao, C. Zhang, N. Megiddo, and G. Lohman, "Automating physical database design in a parallel database," in SIGMOD, 2002.
[22] D. DeWitt and J. Gray, "Parallel database systems: The future of high performance database systems," CACM, 1992.
[23] L. Sun, M. J. Franklin, J. Wang, and E. Wu, "Skipping-oriented partitioning for columnar layouts," in VLDB, 2016.
[24] S. Nishimura and H. Yokota, "QUILTS: Multidimensional data partitioning framework based on query-aware and skew-tolerant space-filling curves," in SIGMOD, 2017.
[25] Y. Lu, A. Shanbhag, A. Jindal, and S. Madden, "AdaptDB: Adaptive partitioning for distributed joins," in VLDB, 2017.
[26] C. Curino, E. Jones, Y. Zhang, and S. Madden, "Schism: A workload-driven approach to database replication and partitioning," in VLDB, 2010.
[27] A. Pavlo, C. Curino, and S. Zdonik, "Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems," in SIGMOD, 2012.
[28] A. Quamar, K. A. Kumar, and A. Deshpande, "SWORD: Scalable workload-aware data placement for transactional workloads," in EDBT, 2013.
[29] A. Turcu, R. Palmieri, B. Ravindran, and S. Hirve, "Automated data partitioning for highly scalable and strongly consistent transactions," TPDS, 2016.
[30] M. Serafini, R. Taft, A. J. Elmore, A. Pavlo, A. Aboulnaga, and M. Stonebraker, "Clay: Fine-grained adaptive partitioning for general database schemas," in VLDB, 2016.
[31] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, 1998.