
Locality-aware Partitioning in Parallel Database Systems

Erfan Zamanian†, Carsten Binnig*†, Abdallah Salama*

† Brown University, Providence, USA
* Baden-Wuerttemberg Cooperative State University, Mannheim, Germany

ABSTRACT

Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and a database schema. A common technique to reduce the network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated.

In this paper we present a novel partitioning scheme called predicate-based reference partitioning (or PREF for short) that allows sets of tables to be co-partitioned based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data-locality. One algorithm only needs the schema and data whereas the other algorithm additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.

1. INTRODUCTION

Motivation: Modern parallel database systems (such as SAP HANA [5], Greenplum [21] or Teradata [15]) and other parallel data processing platforms (such as Hadoop [22], Impala [1] or Shark [23]) horizontally partition large amounts of data in order to provide parallel data processing capabilities for analytical queries (e.g., OLAP workloads). One major challenge when horizontally partitioning data is to achieve a high data-locality when executing analytical queries since excessive data transfer can significantly slow down the query execution on commodity hardware [19].

Figure 1: Partitioned TPC-H Schema (simplified). [Figure: CUSTOMER is hash-partitioned by custkey, ORDERS is REF-partitioned by custkey, LINEITEM is REF-partitioned by orderkey, SUPPLIER is hash-partitioned by suppkey, and NATION is replicated.]

A common technique to reduce the network costs in analytical workloads, which was already introduced in the 1990's by the first parallel database systems, is to co-partition tables on their join keys in order to avoid expensive remote join operations [8, 10]. However, in complex schemata with many tables this technique is limited to only subsets of tables, which share the same join key. Moreover, fully replicating tables is only desirable for small tables. Consequently, with existing partitioning schemes remote joins are typically unavoidable for complex analytical queries with join paths over multiple tables using different join keys.

Reference partitioning [9] (or REF partitioning for short) is a more recent partitioning scheme that co-partitions a table by another table that is referenced by an outgoing foreign key (i.e., referential constraint). For example, as shown in Figure 1, if table CUSTOMER is hash partitioned on its primary key custkey, then table ORDERS can be co-partitioned using the outgoing foreign key (fk) to the CUSTOMER table. Thus, using REF partitioning, chains of tables linked via foreign keys can be co-partitioned. For example, table LINEITEM can also be REF partitioned by table ORDERS. However, other join predicates different from the foreign key or even incoming foreign keys are not supported by REF partitioning. For example, table SUPPLIER in Figure 1 can not be REF partitioned by the table LINEITEM.

Contributions: In this paper, we present a novel partitioning scheme called predicate-based reference partitioning (or PREF for short). PREF is designed for analytical workloads where data is loaded in bulks. The PREF partitioning scheme generalizes the REF partitioning scheme such that a table can be co-partitioned by a given join predicate that refers to another table (called partitioning predicate). In Figure 1, table SUPPLIER can thus be PREF partitioned by table LINEITEM using an equi-join predicate on the attribute suppkey as partitioning predicate. In order to achieve full data-locality with regard to the partitioning predicate, PREF might introduce duplicate tuples in different partitions. For example, when PREF partitioning the table SUPPLIER as described before and the same value for the suppkey attribute appears in multiple
partitions of the table LINEITEM, then the referencing tuple in table SUPPLIER will be duplicated to all corresponding partitions of SUPPLIER. That way, joins which use the partitioning predicate as join predicate can be executed locally per node. However, in the worst case, the PREF partitioning scheme might lead to full replication of a table. Our experiments show that this is only a rare case for complex schemata with a huge number of tables and can be avoided by our automatic partitioning design algorithms.

Furthermore, it is a hard problem to manually find the best partitioning scheme for a given database schema that maximizes data-locality using our PREF partitioning scheme. Existing automated design algorithms [14, 18, 20] are not aware of our PREF partitioning scheme. Thus, as a second contribution, we present two partitioning design algorithms that are aware of PREF. Our first algorithm is schema-driven and assumes that foreign keys in the schema represent potential join paths of a workload. Our second algorithm is workload-driven and additionally takes a set of queries into account. The main idea is to first find an optimal partitioning configuration separately for subsets of queries that share similar sets of tables and then incrementally merge those partitioning configurations. In our experiments, we show that by using the PREF partitioning scheme, our partitioning design algorithms outperform existing automated design algorithms, which rely on a tight integration with the database optimizer (i.e., to get the estimated costs for a given workload).

Outline: In Section 2, we present the details of the PREF partitioning scheme and discuss details about query processing and bulk loading. Afterwards, in Section 3, our schema-driven automatic partitioning design algorithm is presented. Section 4 then describes the workload-driven algorithm and discusses potential optimizations to reduce the search space. Our comprehensive experimental evaluation with the TPC-H [3] and the TPC-DS benchmark [2] is discussed in Section 5. We have chosen these two benchmarks as we wanted to show how our algorithms work for a simple schema with uniformly distributed data (TPC-H) and for a complex schema with skewed data (TPC-DS). Finally, we conclude with related work in Section 6 and a summary in Section 7.

2. PREDICATE-BASED REFERENCE PARTITIONING

In the following, we first present the details of our predicate-based reference partitioning scheme (or PREF for short) and then discuss important details of executing queries over PREF partitioned tables as well as bulk-loading those tables. In terms of notation we use capital letters for tables (e.g., table T) and small letters for individual tuples (e.g., tuple t ∈ T). Moreover, if a table T is partitioned into n partitions, the individual partitions are identified by Pi(T) (with 1 ≤ i ≤ n).

2.1 Definition and Terminology

The PREF partitioning scheme is defined as follows:

Definition 1 (PREF Partitioning Scheme). If a table S is partitioned into n partitions using an arbitrary horizontal partitioning scheme, then table R is PREF partitioned by that table S and a given partitioning predicate p, iff (1) for all 1 ≤ i ≤ n, Pi(R) = {r | r ∈ R ∧ (∃s ∈ Pi(S) : p(r, s))} holds and (2) ∀r ∈ R : ∃i, 1 ≤ i ≤ n, with r ∈ Pi(R). In PREF we call S the referenced table and R the referencing table. The referenced table could again be PREF partitioned. The seed table of a PREF partitioned table R is the first table T in the path of the partitioning predicates that is not PREF partitioned.

Condition (1) in the definition above means that a tuple r ∈ R is in a partition Pi(R) if there exists at least one tuple s ∈ Pi(S) that satisfies the given partitioning predicate p (i.e., p(r, s) evaluates to true) for the given i. A tuple s that satisfies p(r, s) is called a partitioning partner. Therefore, if p is satisfied for tuples in different partitions of S, then a copy of r will be inserted into all these partitions, which leads to duplicates (i.e., redundancy). Moreover, condition (2) means that each tuple r ∈ R must be assigned to at least one partition (even if there exists no tuple s ∈ Pi(S) in any partition of S that satisfies p(r, s)). In order to satisfy condition (2), we assign all tuples r ∈ R that do not have a partitioning partner in S in a round-robin fashion to the different partitions Pi(R) of R.

As mentioned before, any partitioning scheme can be used (e.g., hash, range, round-robin or even PREF) for the referenced table. For simplicity but without loss of generality, we use only the HASH and PREF partitioning schemes in the remainder of the paper. Moreover, only simple equi-join predicates (as well as conjunctions of simple equi-join predicates) are supported as partitioning predicates p since other join predicates typically result in full redundancy of the PREF partitioned table (i.e., a tuple is then likely to be assigned to each partition of R).

Example: Figure 2 shows an example of a database before partitioning (upper part) and after partitioning (lower part). In the example, the table LINEITEM is hash partitioned and thus has no duplicates after partitioning. The table ORDERS (o) is PREF partitioned by table LINEITEM (l) using a partitioning predicate on the join key (orderkey); i.e., ORDERS is the referencing table and LINEITEM the referenced table as well as the seed table. For the table ORDERS, the PREF partitioning scheme introduces duplicates to achieve full data-locality for a potential equi-join over the join key (orderkey). Furthermore, the table CUSTOMER (c) is PREF partitioned by ORDERS using the partitioning predicate on the join key (custkey); i.e., CUSTOMER is the referencing table and ORDERS the referenced table whereas LINEITEM is the seed table of the CUSTOMER table. Again, PREF partitioning the CUSTOMER table results in duplicates. Moreover, we can see that the customer (custkey=3), who has no order, is also added to the partitioned table CUSTOMER (in the first partition).

Thus, by using the PREF partitioning scheme, all tables in a given join path of a query can be co-partitioned as long as there is no cycle in the query graph. Finding the best partitioning scheme for all tables in a given schema and workload that maximizes data-locality under the PREF scheme, however, is a complex task and will be discussed in Sections 3 and 4.
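To make Definition 1 concrete, the following Python sketch PREF partitions a referencing table R by an already partitioned table S for a single equi-join predicate. This is our own illustration (the names pref_partition, r_attr, s_attr and the dict-based rows are not from the paper or from XDB): it implements condition (1) by copying a tuple to every partition of S that contains a partitioning partner, condition (2) by assigning tuples without any partner round-robin, and it flags all but the first copy with a dup attribute, which is used later for query processing.

from collections import defaultdict

def pref_partition(R, S_partitions, r_attr, s_attr):
    # R: list of dict rows; S_partitions: the already partitioned referenced table
    # (a list of partitions, each a list of dict rows). Returns the partitions of R.
    n = len(S_partitions)
    # join value -> partitions of S that contain a partitioning partner
    partner_partitions = defaultdict(set)
    for i, part in enumerate(S_partitions):
        for s in part:
            partner_partitions[s[s_attr]].add(i)

    R_partitions = [[] for _ in range(n)]
    round_robin = 0
    for r in R:
        targets = sorted(partner_partitions.get(r[r_attr], ()))
        if not targets:
            # condition (2): tuples without a partitioning partner are assigned
            # round-robin so that every tuple lands in some partition
            targets = [round_robin % n]
            round_robin += 1
        for k, i in enumerate(targets):
            # condition (1): one copy per partition with a partitioning partner;
            # all but the first copy are flagged as duplicates (the dup bitmap index)
            copy = dict(r)
            copy["dup"] = 1 if k > 0 else 0
            R_partitions[i].append(copy)
    return R_partitions

# Reproducing Figure 2: ORDERS is PREF partitioned by LINEITEM on orderkey.
lineitem = [[{"linekey": 0, "orderkey": 1}, {"linekey": 3, "orderkey": 2}],
            [{"linekey": 1, "orderkey": 4}, {"linekey": 4, "orderkey": 3}],
            [{"linekey": 2, "orderkey": 1}]]          # hash partitioned by linekey % 3
orders = [{"orderkey": 1, "custkey": 1}, {"orderkey": 2, "custkey": 1},
          {"orderkey": 3, "custkey": 2}, {"orderkey": 4, "custkey": 1}]
print(pref_partition(orders, lineitem, "orderkey", "orderkey"))

Running the example yields the partitioned ORDERS table of Figure 2, including the duplicated tuple with orderkey = 1 in the third partition.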
For query processing, we create two additional bitmap indexes when PREF partitioning a table R by S: The first bitmap index dup indicates for each tuple r ∈ R if it is the first occurrence (indicated by a 0 in the bitmap index) or if r is a duplicate (indicated by a 1 in the bitmap index). That way, duplicates that result from PREF partitioning can be easily eliminated during query processing, as will be explained in more detail in Section 2.2. The second index hasS indicates for each tuple r ∈ R if there exists a tuple s ∈ S which satisfies p(r, s). That way, anti-joins and semi-joins can be optimized. The example in Figure 2 shows these indexes for the two PREF partitioned tables. Details about how these indexes are used for query processing are discussed in the following Section 2.2.

Figure 2: A PREF partitioned Database. [Figure: the database D before partitioning with tables LINEITEM(linekey, orderkey), ORDERS(orderkey, custkey) and CUSTOMER(custkey, cname), and the partitioned database DP in which LINEITEM is hash-partitioned by linekey%3, ORDERS is PREF-partitioned on LINEITEM by o.orderkey=l.orderkey, and CUSTOMER is PREF-partitioned on ORDERS by c.custkey=o.custkey; the PREF partitioned tables carry the additional index columns dup, hasL and hasO.]

2.2 Query Processing

In the following, we discuss how queries need to be rewritten for correctness if PREF partitioned tables are included in a given SQL query Q. This includes adding operations for eliminating duplicates resulting from PREF partitioned tables and adding re-partitioning operations for correct parallel query execution. Furthermore, we also discuss rewrites for optimizing SQL queries (e.g., to optimize queries with anti-joins or outer joins). All these rewrite rules are applied to a compiled plan P of query Q. Currently, our rewrite rules only support SPJA queries (Selection, Projection, Join and Aggregation), while nested SPJA queries are supported by rewriting each SPJA query block individually.

Rewrite Process: The rewrite process is a bottom-up process, which decides for each operator o ∈ P if a distinct operation or a re-partitioning operation (i.e., a shuffle operation) must be applied to its input(s) before executing the operator o. Note that our distinct operator is not the same as the SQL DISTINCT operator. Our distinct operator eliminates only those duplicates which are generated by our PREF scheme. Duplicates from PREF partitioning can be eliminated using a disjunctive filter predicate that uses the condition dup=0 for each dup attribute of a tuple in an (intermediate) result. A normal SQL DISTINCT operator, however, can still be executed using the attributes of a tuple to find duplicates with the same values. In the rest of the paper, we always refer to the semantics of our distinct operator. Moreover, the re-partitioning operator also eliminates duplicates resulting from PREF partitioning before shuffling tuples over the network.

In order to decide if a distinct operation or a re-partitioning operation must be applied, the rewrite process annotates two properties to each intermediate result of an operator o ∈ P. In the following, we also use the variable o to refer to the intermediate result produced by operator o.

• Dup(o): defines if the (intermediate) result o is free of duplicates resulting from PREF partitioning (Dup(o) = 0) or not (Dup(o) = 1). For tables, we use the same notation; i.e., Dup(T) defines whether table T contains duplicates due to PREF partitioning or not. For a hash partitioned table T, we get Dup(T) = 0.

• Part(o): defines a partitioning scheme for the (intermediate) result o including the partitioning method Part(o).m (HASH, PREF, or NONE if neither of the other schemes holds), the list of partitioning attributes Part(o).A and the number of partitions Part(o).c. Again, for tables we use the same notation; i.e., Part(T) defines the partitioning scheme of table T.

In the following, we discuss the rewrite rules for each type of operator of an SPJA query individually. Moreover, we assume that the last operator of a plan P is always a projection operator that can be used to eliminate duplicates resulting from PREF partitioning. We do not discuss the selection operator since neither additional duplicate eliminations nor re-partitioning operators need to be added to its input (i.e., the selection operator can be executed without applying any of these rewrites).

Inner equi-join o = (oin1 ⨝_{oin1.a1=oin2.a2} oin2): The only rewrite rule we apply to an inner equi-join is to add additional re-partitioning operators over its inputs oin1 and oin2. In the following, we discuss three cases when no re-partitioning operator needs to be added; a simple compatibility check implementing them is sketched after this list.

(1) The first case holds if both inputs are hash partitioned and they use the same number of partitions (i.e., Part(oin1).c = Part(oin2).c holds). Moreover, Part(oin1).A = [a1] and Part(oin2).A = [a2] must hold as well (i.e., the join keys are used as partitioning attributes).

(2) The second case holds if Part(oin1).m = HASH and Part(oin2).m = PREF whereas the join predicate a1 = a2 must be the partitioning predicate of the PREF scheme. Moreover, the partitioning scheme Part(oin1) must be the one used for the seed table of the PREF scheme Part(oin2).

(3) The third case holds if Part(oin1).m = PREF and Part(oin2).m = PREF whereas the join predicate a1 = a2 must be the partitioning predicate of the PREF scheme of one input (called referencing input). The other input is called referenced input. Moreover, both PREF schemes must reference the same seed table.

For example, case (2) above holds for a join operation (l ⨝_{l.orderkey=o.orderkey} o) over the partitioned database in Figure 2. Moreover, case (3) holds for a join operation (o ⨝_{o.custkey=c.custkey} c) over the same schema.

Otherwise, if none of these three cases holds, the rewrite procedure applies re-partitioning operators to make sure that both inputs use a hash partitioning scheme where the join key is the partitioning attribute and both schemes use the same number of partitions. The re-partitioning operator also eliminates duplicates resulting from a PREF scheme as discussed before. If one input is already hash partitioned (using the join key as partitioning attribute), then we only need to re-partition the other input accordingly.
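The three cases above can be read as a compatibility check over the Part(o) descriptors of the two join inputs. The following Python sketch is our own illustration of that check (the Scheme class and its field names are hypothetical, not the paper's data structures); if it returns False, the rewrite procedure falls back to hash re-partitioning on the join keys as described above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Scheme:
    m: str                              # partitioning method: "HASH", "PREF" or "NONE"
    attrs: tuple = ()                   # Part(o).A, the partitioning attributes
    c: int = 0                          # Part(o).c, the number of partitions
    pred: Optional[frozenset] = None    # for PREF: the two attributes of the partitioning predicate
    seed: Optional["Scheme"] = None     # for PREF: the scheme used for the seed table

def equi_join_is_local(p1: Scheme, p2: Scheme, a1: str, a2: str) -> bool:
    join_pred = frozenset((a1, a2))
    # case (1): both inputs hash partitioned on the join keys with the same partition count
    if p1.m == "HASH" and p2.m == "HASH":
        return p1.c == p2.c and p1.attrs == (a1,) and p2.attrs == (a2,)
    # case (2): one input is partitioned like the seed table of the other (PREF) input,
    # and the join predicate is exactly the PREF partitioning predicate
    if p1.m == "HASH" and p2.m == "PREF":
        return p2.pred == join_pred and p2.seed == p1
    if p1.m == "PREF" and p2.m == "HASH":
        return p1.pred == join_pred and p1.seed == p2
    # case (3): both inputs PREF partitioned, the join predicate is the partitioning
    # predicate of the referencing input, and both reference the same seed table
    if p1.m == "PREF" and p2.m == "PREF":
        return join_pred in (p1.pred, p2.pred) and p1.seed == p2.seed
    return False

# Case (2) of Figure 2: LINEITEM (hash by linekey) joined with ORDERS (PREF on LINEITEM by orderkey).
l_scheme = Scheme(m="HASH", attrs=("linekey",), c=3)
o_scheme = Scheme(m="PREF", c=3, pred=frozenset({"l.orderkey", "o.orderkey"}), seed=l_scheme)
print(equi_join_is_local(l_scheme, o_scheme, "l.orderkey", "o.orderkey"))   # True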
After discussing the rewrite rules, we now present how the properties Dup(o) and Part(o) are set by the rewrite procedure. If we add a re-partitioning operation as discussed before, then we use the hash partitioning scheme of the re-partitioning operator to initialize Part(o) and set Dup(o) = 0 since we eliminate duplicates. In case that we do not add a re-partitioning operation (i.e., in the cases 1-3 before), we initialize Part(o) as follows: In case (1), we set Part(o) to be the hash partitioning scheme of one of the inputs (remember that both inputs use the same partitioning scheme) and Dup(o) = 0 since hash partitioned tables never contain duplicates. In cases (2) and (3), we use the PREF scheme of the referenced input to initialize Part(o). Moreover, in case (2), we always set Dup(o) = 0. In case (3), we set Dup(o) = 0 if the referenced input has no duplicates. Otherwise, we set Dup(o) = 1.

For example, for the intermediate result of the join (c ⨝_{c.custkey=o.custkey} o) where case (3) holds, the rewrite procedure initializes Part to be the PREF scheme of the ORDERS table and sets Dup to 1.

Other joins: While equi-joins can be executed on partitioned inputs (as discussed before), other joins such as cross products o = (oin1 × oin2) and theta joins o = (oin1 ⨝_p oin2) with arbitrary join predicates p need to be executed as remote joins that ship the entire smaller relation to all cluster nodes. For these joins, we set Part(o).m = NONE. Moreover, we also eliminate duplicates in both inputs and set Dup(o) = 0.

Furthermore, an outer join can be computed as the union of an inner equi-join and an anti-join. An efficient implementation for anti-joins is presented at the end of this section.

Aggregation o = χ_{GrpAtts,AggFuncs}(oin): If the input operator oin is hash partitioned and if the condition GrpAtts.startWith(Part(oin).A) holds (i.e., the list of group-by attributes starts with or is the same as the list of partitioning attributes), then we do not need to re-partition the input. Otherwise, a re-partitioning operator is added that hash partitions the input oin by the GrpAtts. Moreover, if Dup(oin) = 1 holds, the re-partitioning operator also eliminates duplicates (as described before). Finally, the rewrite process sets Part(o) to Part(oin) if no re-partitioning operator is added. Otherwise, it sets Part(o) to the hash partitioning scheme used for re-partitioning. Moreover, in any case it sets Dup(o) = 0.

Figure 3 shows an example of an aggregation query over the partitioned database shown in Figure 2. In that example, the output of the join is PREF partitioned and contains duplicates (as already discussed before). Thus, the input of the aggregation must be re-partitioned using a hash partitioning scheme on the group-by attribute c.cname (which is used as the partitioning scheme of its output). Moreover, the re-partitioning operator eliminates the duplicates resulting from PREF.

Figure 3: Rewrite Process for Plan P. [Figure: the SQL query Q (SELECT SUM(o.total) as revenue FROM Orders o JOIN Customer c ON o.custkey = c.custkey GROUP BY c.cname), its canonical plan, and the rewritten plan in which a Repart_hash(c.cname) operator is inserted between the PREF-partitioned join result (Dup=1, Sch=PREF) and the aggregation (Dup=0, Sch=HASH).]

Projection o = π_{Atts}(oin): For the projection operator the input oin is never re-partitioned. However, if Dup(oin) = 1 we add a distinct operation on input oin that eliminates duplicates using the dup indexes. For this operator, we set Part(o) = Part(oin) and Dup(o) = 0.

Further Rewrites for Query Optimization: Further rewrite rules can be applied for query optimization when joining a PREF partitioned table R with a referenced table S on p using the index hasS: (1) An anti-join over R and S can be rewritten by using a selection operation with the filter predicate hasS = 0 on R without actually joining S. (2) A semi-join over R and S can be rewritten by using a selection operation with the filter predicate hasS = 1 on R without actually joining S.

2.3 Bulk Loading

As discussed before, the PREF scheme is designed for data warehousing scenarios where new data is loaded in bulks. In the following, we discuss how inserts can be executed over a PREF partitioned table R that references a table S. We assume that the referenced table S has already been bulk loaded.

In order to insert a new tuple r into table R, we have to identify those partitions Pi(R) into which a copy of the tuple r must be inserted. Therefore, we need to identify those partitions Pi(S) of the referenced table S that contain a partitioning partner (i.e., a tuple s which satisfies the partitioning predicate p for the given tuple r). For example, in Figure 2 table CUSTOMER is PREF partitioned referencing table ORDERS. When inserting a customer tuple with custkey = 1 into table CUSTOMER, a copy must be inserted into all three partitions since all partitions of ORDERS have a tuple with custkey = 1.

For efficiently implementing the insert operation of new tuples without executing a join of R with S, we create a partition index on the referenced attribute of table S. The partition index is a hash-based index that maps unique attribute values to partition numbers i. For example, for the ORDERS table in Figure 2, we create a partition index on the attribute custkey that maps e.g. custkey = 1 to partitions 1 to 3. We show in our experiments in Section 5 that partition indexes help to efficiently execute bulk loading of new tuples.

Finally, updates and deletes over a PREF partitioned table are applied to all partitions. However, we do not allow updates to modify those attributes used in a partitioning predicate of a PREF scheme (neither in the referenced nor in the referencing table). Since join keys are typically not updated in data warehousing scenarios, this restriction does not limit the applicability of PREF.
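The partition index and the resulting insert routine can be sketched in Python as follows. This is only an illustration under the same simplifying assumptions as before (a single equi-join partitioning predicate, dict-based rows, and function names of our own choosing): the index maps each value of the referenced attribute to the partition numbers of S that contain it, so inserting into R needs no join with S.

from collections import defaultdict

def build_partition_index(S_partitions, s_attr):
    # hash-based index: attribute value -> set of partition numbers of S containing it
    index = defaultdict(set)
    for i, part in enumerate(S_partitions):
        for s in part:
            index[s[s_attr]].add(i)
    return index

def insert_into_pref_partitioned(R_partitions, new_tuple, r_attr, partition_index,
                                 fallback_partition=0):
    # copy the new tuple into every partition of R whose corresponding partition of S
    # holds a partitioning partner; without a partner, fall back to a single partition
    # (the paper assigns such tuples round-robin)
    targets = sorted(partition_index.get(new_tuple[r_attr], ()))
    if not targets:
        targets = [fallback_partition]
    for k, i in enumerate(targets):
        copy = dict(new_tuple)
        copy["dup"] = 1 if k > 0 else 0
        R_partitions[i].append(copy)

# Example of Section 2.3: the index on ORDERS.custkey maps custkey = 1 to all three
# partitions, so a new CUSTOMER tuple with custkey = 1 is copied into all of them.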
3. SCHEMA-DRIVEN AUTOMATED PARTITIONING DESIGN

In this section, we present our schema-driven algorithm for automated partitioning design. We first discuss the problem statement and then give a brief overview of our solution. Afterwards, we present important details on how we maximize data-locality while minimizing data-redundancy.

3.1 Problem Statement and Overview

The Problem Statement can be formulated as the following optimization problem: Given a schema S (including referential constraints) and the non-partitioned database D, define a partitioning scheme (HASH or PREF) for each table T ∈ S (called partitioning configuration) such that data-locality in the resulting partitioned database DP is maximized with regard to equi-join operations over the referential constraints, while data-redundancy is minimized. In other words, while the main optimization goal is maximizing data-locality under the given partitioning schemes, among those partitioning configurations with the same highest data-locality, the one with the minimum data-redundancy should be chosen.

Note that in the above problem statement, we do not consider full replication as a possible choice for a table. The reason is that full replication is only desirable for small tables, while PREF can find a middle ground between partitioning and full replication for other tables that can not be fully replicated. Furthermore, small tables that are candidates for full replication can be excluded from the given database schema before applying our design algorithm. In order to solve the above optimization problem, our algorithm executes the following three steps.

The first step is to create an undirected labeled and weighted graph GS = (N, E, l(e ∈ E), w(e ∈ E)) for the given schema S (called schema graph). While a node n ∈ N represents a table, an edge e ∈ E represents a referential constraint in S. Moreover, the labeling function l(e ∈ E) defines the equi-join predicate for each edge (which is derived from the referential constraint) and the weighting function w(e ∈ E) defines the network costs if a remote join needs to be executed over that edge. The weight w(e ∈ E) of an edge is defined to be the size of the smaller table connected to the edge e. The intuition behind this is that the network costs of a potential remote join over an edge e depend on the size of the smaller table, since this table is typically shipped over the network. It is clear that we ignore the selectivity of more complex queries (with selection operators and multiple joins) and thus w(e ∈ E) only represents an upper bound. However, our experiments show that w(e ∈ E) is a good proxy for the total workload costs even for workloads with complex queries. Figure 4 (left hand side) shows the schema graph resulting from our simplified version of the TPC-H schema for scaling factor SF = 1.

Figure 4: Schema-driven Partitioning Design. [Figure: the schema graph GS with weights (O-L: 1.5m, C-O: 150k, S-L: 10k, N-C: 25, N-S: 25), the maximum spanning tree MAST, and a partitioning configuration in which L is the seed table (SP) and O, C, S and N are PREF partitioned on L, O, L and C, respectively.]

As a second step, we extract a subset of edges Eco from GS that can be used to co-partition all tables in GS such that data-locality is maximized. For a given connected GS, the desired set of edges Eco is the maximum spanning tree (or MAST for short). The reason for using the MAST is that by discarding edges with minimal weights from the GS, the network costs of potential remote joins (i.e., over edges not in the MAST) are minimized and thus data-locality as defined above is maximized. A potential result of this step is shown in Figure 4 (center).

Typically, there exists more than one MAST with the same total weight for a connected GS. For example, in Figure 4, instead of discarding the edge between SUPPLIER and NATION, one could also discard the edge between CUSTOMER and NATION since this edge has the same weight. If different MASTs with the same total weight exist, then the following step must be applied for each MAST individually.

Finally, in the last step we enumerate all possible partitioning configurations that can be applied for the MAST to find out which partitioning configuration introduces the minimum data-redundancy. Minimizing data-redundancy is important since it has a direct effect on the runtime of queries (even if we can achieve maximal data-locality). The partitioning configurations which we enumerate in our algorithm all follow the same pattern: one table in the MAST is selected to be the seed table which uses a hash partitioning scheme. In general, it could use any of the existing partitioning schemes such as hash, round-robin, or range partitioning. As partitioning attribute, we use the join attribute in the label l(e) of the edge e ∈ E which is connected to the node representing the seed table and has the highest weight w(e). All other tables are recursively PREF partitioned on the seed table using the labels of the edges in the MAST as partitioning predicates. Figure 4 (right hand side) shows one potential partitioning configuration for the MAST, which uses the LINEITEM table as the seed table.

3.2 Maximizing Data-Locality

Data-locality (DL) for a given schema graph GS and the subset of edges Eco used for co-partitioning is defined as follows:

DL = ( Σ_{e ∈ Eco} w(e) ) / ( Σ_{e ∈ E} w(e) )

While DL = 1 means that Eco contains all edges in GS (i.e., no remote join is needed), DL = 0 means that Eco is empty (i.e., no table is co-partitioned by any other table). For example, if we hash partition all tables of a schema on their primary keys, then data-locality will be 0 (as long as the tables do not share the same primary key attributes).

In order to maximize data-locality for a given schema graph GS that has only one connected component, we extract the maximum spanning tree MAST based on the given weights w(e ∈ E). The set of edges in the MAST represents the desired set Eco since adding one more edge to a MAST will result in a cycle, which means that not all edges can be used for co-partitioning. If GS has multiple connected components, we extract the MAST for each connected component. In this case Eco represents the union over the edges of all maximum spanning trees.

One other solution (instead of extracting the MAST) is to duplicate tables (i.e., nodes) in the GS in order to remove cycles and allow one table to use different partitioning schemes. However, join queries could still potentially require remote joins. For example, if we duplicate table NATION in the GS of Figure 4 (left hand side), we can co-partition one copy of NATION by CUSTOMER and one copy of NATION by SUPPLIER. However, a query using the join path C − N − S then still needs a remote join either over the edge C − N or the edge N − S. Therefore, in our schema-driven algorithm we do not duplicate nodes at all.
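The two core ingredients of this step, extracting a maximum spanning tree and computing DL, can be sketched in a few lines of Python. The code below is our own Kruskal-style illustration over the simplified TPC-H schema graph of Figure 4 (edge weights as shown there for SF = 1); it is not the paper's implementation.

def maximum_spanning_tree(nodes, edges):
    # Kruskal on edges sorted by descending weight; edges are (weight, u, v, label) tuples.
    parent = {n: n for n in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    mast = []
    for w, u, v, label in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                        # the edge connects two components -> no cycle
            parent[ru] = rv
            mast.append((w, u, v, label))
    return mast

def data_locality(all_edges, co_edges):
    # DL = sum of weights of the co-partitioning edges / sum of all edge weights
    return sum(w for w, *_ in co_edges) / sum(w for w, *_ in all_edges)

# Simplified TPC-H schema graph of Figure 4 (weight = size of the smaller table, SF = 1).
nodes = ["C", "O", "L", "S", "N"]
edges = [(1_500_000, "O", "L", "o.orderkey = l.orderkey"),
         (150_000, "C", "O", "c.custkey = o.custkey"),
         (10_000, "S", "L", "s.suppkey = l.suppkey"),
         (25, "N", "C", "n.nationkey = c.nationkey"),
         (25, "N", "S", "n.nationkey = s.nationkey")]
mast = maximum_spanning_tree(nodes, edges)
print(len(mast), round(data_locality(edges, mast), 6))   # 4 edges kept; one 25-weight edge dropped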
3.3 Minimizing Data-Redundancy

The next step after maximizing data-locality is to find a partitioning configuration for all tables in the schema S which minimizes data-redundancy in the partitioned database DP. Therefore, we first define data-redundancy (DR) as follows:

DR = |DP| / |D| − 1 = ( Σ_{T ∈ S} |T^P| ) / ( Σ_{T ∈ S} |T| ) − 1

While |DP| represents the size of the database after partitioning, |D| represents the original size of the database before partitioning. |DP| is defined to be the sum of sizes of all tables T ∈ S after partitioning (denoted by T^P). Consequently, DR = 0 means that no data-redundancy was added to any table after partitioning, while DR = 1 means that 100% data-redundancy was added after partitioning (i.e., each tuple in D exists on average twice in DP). Fully replicating each table to all n nodes of a cluster thus results in data-redundancy n − 1.

In Listing 1, we show the basic version of our algorithm to enumerate different partitioning configurations (PCs) for a given MAST. For simplicity (but without loss of generality), we assume that the schema graph GS has only one connected component with only one MAST. Otherwise, we can apply the enumeration algorithm for each MAST individually.

Listing 1: Enumerating Partitioning Configurations
1  function findOptimalPC(MAST mast, Database D){
2    PartitionConfig optimalPC;
3    optimalPC.estimatedSize = MAX_INT;
4
5    for (each node nST in N(mast)){
6      // build new PC based on seed table
7      PartitionConfig newPC;
8      newPC.addScheme(nST, SP);
9      addPREF(mast, nST, newPC);
10
11     // estimate size of newPC
12     estimateSize(newPC, mast, D);
13     if (newPC.estimatedSize < optimalPC.estimatedSize)
14       optimalPC = newPC;
15   }
16   return optimalPC;
17 }
18
19 // recursively PREF partition tables
20 function addPREF(MAST mast, Node referring,
21                  PartitionConfig pc){
22   for (each node ref connected
23        to referring by edge e in mast){
24     if (pc.containsScheme(ref))
25       continue;
26     pc.addScheme(ref, PREF on referring by l(e));
27     addPREF(mast, ref, pc);
28   }
29 }

The enumeration algorithm (function findOptimalPC in Listing 1) gets a MAST and a non-partitioned database D as input and returns the optimal partitioning configuration for all tables in D. The algorithm therefore analyzes as many partitioning configurations as we have nodes in the MAST (lines 5-15). Therefore, we construct partitioning configurations (lines 7-9) that follow the same pattern: one table is used as the seed table that is partitioned by one of the seed partitioning schemes (or SP for short) such as hash partitioning and all other tables are recursively PREF partitioned on the edges of the MAST (see function addPREF). For each partitioning configuration newPC, we finally estimate the size of the partitioned database when applying newPC and compare it to the optimal partitioning configuration so far (lines 12-14). While seed tables in our partitioning design algorithms never contain duplicate tuples, PREF partitioned tables do. In order to estimate the size of a database after partitioning, the expected redundancy in all tables which are PREF partitioned must be estimated. Redundancy is cumulative, meaning that if a referenced table in the PREF scheme contains duplicates, the referencing table will inherit those duplicates as well. For example, in Figure 2 the duplicate orders tuple with orderkey = 1 in the ORDERS table results in a duplicate customer tuple with custkey = 1 in the referencing CUSTOMER table. Therefore, in order to estimate the size of a given table, all referenced tables up to the seed (redundancy-free) table must be considered. The details of the size estimation of tables after partitioning are explained in Appendix A.

3.4 Redundancy-free Tables

As described in Section 3.3, the final size of a given table after partitioning is determined by the redundancy factors of all the edges from the seed table to the analyzed table. In complex schemata with many tables, this might result in full or near full redundancy for PREF partitioned tables. This is because only one table is determined by the algorithm to be the seed table, while all other tables are PREF partitioned. In order to remedy this problem, our enumeration algorithm can additionally take user-given constraints as input which disallow data-redundancy for individual tables. Therefore, we adapt the enumeration algorithm described in Section 3.3 as follows: (1) We also enumerate partitioning configurations which can use more than one seed table. We start with configurations with one seed table and increase the number up to |S| seed tables until we satisfy the user-given constraints. Since the maximal data-locality for a MAST monotonically decreases with an increasing number of seed tables, we can stop the enumeration early once we find a partitioning configuration that satisfies the user-given constraints. This scheme will be the partitioning scheme with the maximal data-locality that also satisfies the user-given constraints. (2) We prune partitioning configurations early that add data-redundancy for tables where a user-given constraint disallows data-redundancy. That means for tables where we disallow data-redundancy, we can either use a seed partitioning scheme or a PREF partitioning scheme whose partitioning predicate refers to the primary key of a table that has no data-redundancy.
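The cumulative redundancy described in Section 3.3 can be summarized by the following Python sketch. It is a deliberately simplified illustration: it assumes that a per-edge redundancy factor (the expected blow-up of a referencing table along one MAST edge) is already known, whereas the paper derives these estimates from the data as described in its Appendix A; all names are ours, not the paper's.

def estimate_data_redundancy(table_sizes, seed_tables, referenced_by, edge_factor):
    # table_sizes: {table: #tuples before partitioning}
    # seed_tables: tables using a seed (e.g. hash) scheme -- they stay redundancy-free
    # referenced_by: {referencing table: referenced table} along the MAST edges
    # edge_factor(referencing, referenced): estimated blow-up factor of that edge (>= 1.0)
    def multiplier(table):
        if table in seed_tables:
            return 1.0
        referenced = referenced_by[table]
        # redundancy is cumulative along the path up to the seed (redundancy-free) table
        return multiplier(referenced) * edge_factor(table, referenced)

    size_before = sum(table_sizes.values())
    size_after = sum(size * multiplier(t) for t, size in table_sizes.items())
    return size_after / size_before - 1.0        # DR = |DP| / |D| - 1

# Example: with LINEITEM as seed, ORDERS PREF partitioned by LINEITEM and CUSTOMER by ORDERS,
# and made-up edge factors of 1.2 each, CUSTOMER inherits a factor of 1.2 * 1.2 = 1.44.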
4. WORKLOAD-DRIVEN AUTOMATED PARTITIONING DESIGN

In this section, we discuss our workload-driven automated partitioning design algorithm. Again, we start with the problem statement and then give a brief overview of our solution. Afterwards, we discuss the details of our algorithm.

4.1 Problem Statement and Overview

The Problem Statement can be formulated as the following optimization problem: Given a schema S, a workload W = {Q1, Q2, ..., Qn} and the non-partitioned database D, define a partitioning scheme (HASH or PREF) for the tables used by each query Qi ∈ W (called partitioning configuration) such that data-locality is maximized for each query Qi individually, while data-redundancy is globally minimized for all Qi ∈ W. Like schema-driven partitioning, the main optimization goal is to maximize data-locality under the given partitioning schemes; data-redundancy is only subordinate. In order to solve this optimization problem our algorithm executes the following three steps.

In the first step our algorithm creates a separate schema graph GS(Qi) for each query Qi ∈ W where edges represent the join predicates in a query. Afterwards, we compute the MAST(Qi) for each GS(Qi). That way, data-locality for each query Qi ∈ W is maximized, since one optimally partitioned minimal database DP(Qi) could be generated for each query individually. However, this would result in a very high data-redundancy since individual tables will most probably exist several times (using different partitioning schemes for different queries). For example, Figure 5 (left hand side) shows the MASTs resulting from four different queries in a workload W = {Q1, Q2, Q3, Q4}. Again, if different MASTs with the same total weight exist for one query, we can keep them all to find the optimal solution in the following steps.

In the second step, we merge MASTs of individual queries in order to reduce the search space of the algorithm. Given the MASTs, the merge function creates the union of nodes and edges in the individual MASTs. In this phase we merge a MAST(Qj) into a MAST(Qi) if the MAST of Qj is fully contained in the MAST of Qi (i.e., MAST(Qi) contains all nodes and edges with the same labels and weights of MAST(Qj)). Thus, no cycles can occur in this merge phase. The merged MAST is denoted by MAST(Qi+j). If MAST(Qj) is fully contained in different MASTs, we merge it into one of these MASTs. Moreover, at the end of the first merging phase, we determine the optimal partitioning configuration and estimate the total size of the partitioned database for each merged MAST (using function findOptimalPC in Listing 1). Figure 5 (center) shows a potential result of the first merging phase. This step effectively reduces the search space for the subsequent merging phase.

In the last step (i.e., a second merge phase), we use a cost-based approach to further merge MASTs. In this step, we only merge MAST(Qj) into MAST(Qi) if the result is acyclic and if we do not sacrifice data-locality while reducing data-redundancy (i.e., if |DP(Qi+j)| < |DP(Qi)| + |DP(Qj)| holds). Figure 5 (right hand side) shows a potential result of the second merging phase. In this example the MASTs of Q3 and Q4 are merged since the size of the resulting database DP(Q3+4) after merging is smaller than the sum of sizes of the individual partitioned databases DP(Q3) + DP(Q4). For query execution, a query can be routed to the MAST which contains the query and which has minimal data-redundancy for all tables read by that query.

Figure 5: Workload-driven Partitioning Design. [Figure: the maximum spanning trees MAST(Qi) of the four queries (Q1: C-O-L, Q2: O-L, Q3: L-S, Q4: S-N), the merged MASTs after the first phase (Q1+2: C-O-L; Q3: L-S; Q4: S-N), and after the second phase (Q1+2 unchanged; Q3+4: L-S-N, since |DP(Q3+4)| = 6.01m < |DP(Q3)| + |DP(Q4)| = 6.02m).]

4.2 Maximizing Data-Locality

In order to maximize data-locality for a given workload W, we first create a separate schema graph GS(Qi) for each query Qi ∈ W as described before. The schema graph for the workload-driven algorithm is defined the same way as described in Section 3 as GS = (N, E, l(e ∈ E), w(e ∈ E)) and can be derived from the query graph of a query Qi: A query graph is defined in the literature as an undirected labeled graph GQ = (N, E, l(e ∈ E)) where each node n ∈ N represents a table (used by the query), an edge e ∈ E represents a join predicate between two tables, and the labeling function l(e ∈ E) returns the join predicate for each edge.

Currently, when transforming a query graph GQ(Qi) into an equivalent schema graph GS(Qi), we only consider those edges which use an equi-join predicate as label. Note that this does not mean that queries in workload W can only have equi-join predicates. It only means that edges with non-equi join predicates are not added to the schema graph since these predicates result in high data-redundancy anyway when used for co-partitioning tables by PREF as discussed in Section 2. Moreover, for creating the schema graph GS, a weighting function w(e ∈ E) needs to be defined for the GS. This is trivial since the table sizes are given by the non-partitioned database D that is an input to the workload-driven algorithm as well. Note that in addition to table sizes, edge weights of GS could also reflect the costs of a query optimizer to execute the join, if this information is provided. However, then the merging function would need to be more complex (i.e., a simple union of nodes and edges is not enough since the same edge could have different weights). In the following, we assume that edge weights represent table sizes.

Once the schema graph GS(Qi) is created for each query Qi ∈ W, we can derive the maximum spanning tree MAST(Qi) for each GS(Qi). The MAST(Qi) represents the set of edges Eco(Qi) that can be used for co-partitioning tables in Qi. All edges that are in the query graph of Qi but not in the MAST(Qi) will result in remote joins. Data-locality DL for a query is thus defined in the same way as before in Section 3.2 as the fraction of the sum of weights in Eco(Qi) and the sum of weights of all edges in GS(Qi). As shown in Section 3, using the edges of a MAST for co-partitioning maximizes data-locality unless we additionally allow duplicating tables (i.e., nodes) in order to remove cycles in GS(Qi). Moreover, in contrast to the schema-driven algorithm, if a connected GS has different MASTs with the same total weight, our algorithm additionally finds the optimal partitioning configuration for each of the MASTs and estimates the size of the partitioned database as shown in Listing 1. For the subsequent merging phase, we only keep the MAST which results in a partitioned database with minimal estimated size.
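The containment-based first merge phase can be sketched as follows. This is our own illustration (MASTs are represented simply as edge sets whose elements encode label and weight, and the naming of merged MASTs is ad hoc), not the paper's implementation.

def first_merge_phase(masts):
    # masts: {query name: frozenset of edges}; an edge is e.g. (u, v, label, weight).
    # MAST(Qj) is merged into MAST(Qi) whenever all of its edges are contained in MAST(Qi);
    # the merged MAST keeps Qi's edges, so no cycles can arise in this phase.
    result = dict(masts)
    changed = True
    while changed:
        changed = False
        names = list(result)
        for qj in names:
            for qi in names:
                if qi != qj and qj in result and qi in result and result[qj] <= result[qi]:
                    merged_edges = result[qi]
                    del result[qi], result[qj]
                    result[qi + "+" + qj] = merged_edges
                    changed = True
                    break
            if changed:
                break
    return result

# Figure 5: Q2 (O-L) is fully contained in Q1 (C-O-L), so the first phase yields Q1+Q2,
# while Q3 (L-S) and Q4 (S-N) are left for the cost-based second phase.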
4.3 Minimizing Data-Redundancy

Merging the MASTs of different queries is implemented in two steps as described before: heuristics in the first merge phase effectively reduce the search space for the second, cost-based merge phase, which further reduces data-redundancy. The result after both merging phases is a set of MASTs and an optimal partitioning configuration for each MAST. If a table appears in different MASTs using different partitioning schemes in the partitioning configuration, we duplicate the table in the final partitioned database DP that we create for all MASTs. However, if a table appears in different MASTs and uses the same partitioning scheme, we do not duplicate this table in DP. Data-redundancy for a set of MASTs is thus defined as the fraction of the sum of the sizes of all partitioned tables and the size of the non-partitioned database D. In the following, we only discuss the cost-based merging in detail since the first merge step is trivial.

For cost-based merging, we first define the term merge configuration. A merge configuration is a set of merge expressions, which defines for each query Qi in a given set of queries if the MASTs are merged or not: Qi+j is a merge expression which states that the MASTs of Qi and Qj are merged, while {Qi, Qj} is a set of two merge expressions which states that the MASTs are not merged. Thus, the most simple merge expression is a single query Qi. For example, for a set of individual queries {Q1, Q2, Q3}, {Q1+2, Q3} is one potential merge configuration which holds two merge expressions where Q1 and Q2 are merged into one MAST. The problem statement for cost-based merging can thus be re-formulated to find the merge configuration which results in minimal data-redundancy for all queries in the workload W without sacrificing data-locality.

The search space of all merge configurations for n queries in a given workload W is the same as counting the number of non-empty partitions of a set, which is defined by the Bell number B(n) as follows [11]:

B(n) = Σ_{k=1}^{n} S(n, k)

S(n, k) is the Stirling number of the second kind [11], which counts the number of ways to partition a set of n elements into k nonempty subsets. Our first merge phase reduces the search space, since queries that are contained in other queries are merged (i.e., removed from the workload), which reduces n in the formula above. For example, for TPC-DS we can reduce the MASTs for 99 queries to 17 MASTs (i.e., connected components) after the first merging phase. However, for huge workloads the search space is typically still very big after the first merging phase.

Therefore, we use dynamic programming for efficiently finding the optimal merge configuration for a given workload W. We can use dynamic programming since the optimality principle holds for merge configurations: Let M be an optimal merge configuration for queries {Q1, Q2, ..., Qn}. Then, every subset MS of M must be an optimal merge configuration for the queries it contains. To see why this holds, assume that the merge configuration M contains a subset MS which is not optimal. That is, there exists another merge configuration MS' for the queries contained in MS with strictly lower data-redundancy. Denote by M' the merge configuration derived by replacing MS in M by MS'. Since M' contains the same queries as M, the data-redundancy of M' is lower than the data-redundancy of M. This contradicts the optimality of M.

We execute dynamic programming to find the optimal merge configuration with n queries. In our dynamic programming algorithm, to find the optimal merge configuration for level l (i.e., with l queries), we execute a binary merge step of an optimal merge configuration of level l − 1 with one individual query. Thus, in total dynamic programming must analyze 2n different merge configurations. Moreover, a binary merge step must enumerate all possible merge configurations of size l which can be constructed from both inputs. Each binary merge step for level l has to analyze maximally l resulting merge configurations. For example, if we want to enumerate all merge configurations of level l = 4 which result from merging one merge configuration of level l = 3 having two merge expressions {Q1+2, Q3} and a query Q4, we have to enumerate three resulting merge configurations {Q1+2, Q3, Q4}, {Q1+2+4, Q3}, and {Q1+2, Q3+4} but not, for example, {Q1+2+3+4}. Moreover, memoizing analyzed merge configurations also helps to prune the search space because the same merge configuration might be enumerated by different binary merge steps. For all merge configurations, the binary merge step has to check if the merge configuration is valid (i.e., no cycle occurs in the MAST). Finally, estimating the size for a merge configuration is done by estimating the size for each MAST separately (see Section 3) and then summing up the individual estimated sizes.

Example: Figure 6 shows an example of our dynamic programming algorithm for enumerating merge configurations for three queries. The left hand side shows the selected merge configurations whereas the right hand side shows the other enumerated merge configurations per level. In this example, the optimal merge configuration of the third level {Q1+2, Q3} builds on the optimal merge configuration {Q1+2} of the second level.

Figure 6: Enumerating Merge Configurations. [Figure: Level 1: {Q1}, {Q2}, {Q3}; Level 2: selected {Q1+2}, {Q1+3}, {Q2+3}, other configurations {Q1,Q2}, {Q1,Q3}, {Q2,Q3}; Level 3: selected {Q1+2, Q3}, other configurations {Q1+2+3}, {Q1+3, Q2}, {Q1, Q2+3}.]
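The dynamic programming scheme can be sketched as follows in Python. This is a simplified illustration (no memoization across levels, and the validity and size checks are passed in as callbacks standing in for the cycle test on the merged MAST and the size estimation of Section 3); merge expressions are modelled as tuples of query names, and all names are ours.

def binary_merge_step(config, query, is_valid, estimated_size):
    # Enumerate the configurations of level l obtained from a level l-1 configuration and
    # one new query: keep the query unmerged, or merge it into exactly one existing
    # merge expression (at most l candidates, as in the {Q1+2, Q3} + Q4 example).
    candidates = [config + [(query,)]]          # the unmerged candidate is always acyclic
    for i, expr in enumerate(config):
        candidates.append(config[:i] + [expr + (query,)] + config[i + 1:])
    valid = [c for c in candidates if is_valid(c)]
    return min(valid, key=estimated_size)

def optimal_merge_configuration(queries, is_valid, estimated_size):
    # Build the optimal configuration level by level, extending the optimal configuration
    # of level l-1 with one individual query (the optimality principle of Section 4.3).
    config = []
    for q in queries:
        config = binary_merge_step(config, q, is_valid, estimated_size)
    return config

# Usage sketch: optimal_merge_configuration(["Q1", "Q2", "Q3"], is_valid, estimated_size)
# could return [("Q1", "Q2"), ("Q3",)], i.e. the configuration {Q1+2, Q3} of Figure 6.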
5.1 Efficiency of Query Processing Variant# DL DR
Classical 1.0 1.21
Setup: In this experiment, we have been executing all 22
SD (wo small tables) 1.0 0.5
TPC-H queries on a database with SF = 10. We did not use SD (wo small tables,wo data-red.) 0.7 0.19
a higher SF since this SF can already show the effects of W D (wo small tables) 1.0 1.5
varying data-locality and data-redundancy. For the experi-
ment, we deployed XDB on an Amazon AWS cluster with 10 Table 1: Details of TPC-H Queries
EC2 nodes (m1.medium) which represent commodity machi-
nes with low computing power. Each m1.medium EC2 node
has 1 virtual CPUs (2 ECUs), 3.75 GB of RAM and 420 GB which uses a left outer join. After rewriting this query fi-
of local instance storage. Each node was running the follo- nishes in approximately 40s. The total runtime shows that
wing software stack: Linux, MySQL 5.6.16, and XDB using the partitioning configuration suggested by W D (wo small
Java 8. tables) outperforms all other variants. Moreover, both SD
variants also outperform CP .
For the TPC-H schema, we found that CP represents the
3500
Classical best partitioning configuration with minimal total runtime
3000 SD (wo small tables)
SD (wo small tables, wo redundancy) for all queries when not using PREF. Thus, CP in this ex-
Figure 7: Total runtime of all TPC-H queries

Results: For partitioning the TPC-H database, we compare the following variants, where Table 1 shows the resulting data-locality DL and data-redundancy DR for each variant:
• Classical Partitioning (CP): This represents the classical partition design in data warehousing [12], where one manually selects the biggest table LINEITEM and co-partitions it with the biggest connected table ORDERS by hash partitioning both on their join key. Moreover, all other tables are replicated to all nodes.
• SD (wo small tables): This represents our SD algorithm where we remove the small tables (i.e., NATION, REGION, and SUPPLIER) from the schema before applying the design algorithm and replicate those tables to all 10 nodes (as discussed in Section 3.1). The SD design algorithm then suggests to use the LINEITEM table as seed table.
• SD (wo data-redundancy, wo small tables): Compared to the variant before, we additionally disallow data-redundancy for all non-replicated tables. For this variant, the SD design algorithm suggests to use two seed tables (PART and CUSTOMER), where LINEITEM is PREF partitioned by ORDERS, ORDERS by CUSTOMER, and PARTSUPP by PART.
• WD (wo small tables): Our workload-driven partition design merges all 22 queries into 4 connected components in the first merge phase, which are then reduced to 2 connected components by our second, cost-based merge phase: one connected component has 4 tables with CUSTOMER as the seed table (while ORDERS, LINEITEM, and PART are PREF partitioned), and the other connected component also has 4 tables with PART as the seed table (while PARTSUPP, LINEITEM, and ORDERS are PREF partitioned).

Figure 7 shows the total runtime of all TPC-H queries. For all variants, we excluded the runtime of queries 13 and 22, since these queries did not finish within 1 hour in MySQL using any of the partitioning configurations (due to expensive remote operations). In fact, when using the optimizations that we presented in Section 2.2, we can rewrite query 13.

The CP variant in this experiment can be seen as a lower bound for existing design algorithms (such as [14, 18]) that are not aware of PREF. Consequently, by comparing with CP, we indirectly compare our design algorithms to those algorithms.

Figure 8 shows the runtime of each individual TPC-H query. The results show that whenever a query involves a remote operation, the runtime is higher (e.g., the runtime of queries 17 and 20 for SD wo redundancy is worse than for SD or WD). Furthermore, when no remote operation is needed but data-redundancy is high, as in CP, query performance also decreases significantly. This can be seen when comparing the runtime of queries 9, 11, 16, and 17 for CP with all other schemes. For example, query 9 joins 6 tables in total, of which 4 are fully replicated, PARTSUPP with 8m tuples being one of them.

However, when compared to WD, which has an even higher total data-redundancy, we see that this has no negative influence on the runtime of queries at all (which seems contradictory to the result for CP). The explanation is that for WD, each query has a separate database (i.e., only the tables needed by that query), which results in a minimal redundancy per query. In fact, the average data-redundancy over all individual databases is even a little lower than for SD (wo small tables). However, when taking the union of all individual databases (of all queries), the data-redundancy is even higher than for CP, as shown in Table 1.

Finally, Figure 9 shows the effectiveness of the optimizations that we presented in Section 2.2. Therefore, we execute different queries with (w) and without (wo) activating these optimizations. As database, we use the TPC-H database with SF = 10 partitioned using SD (wo small tables). Figure 9 shows the execution time for the following three queries: (1) the first query (left-hand side) counts distinct tuples in CUSTOMER (which contains duplicates), (2) the second query (center) executes a semi join of CUSTOMER and ORDERS (and counts all customers with orders), and (3) the third query (right-hand side) executes an anti join of CUSTOMER and ORDERS (and counts all customers without orders). The execution times show that with our optimizations the runtime is reduced by approximately two orders of magnitude for queries (1) and (2). Moreover, query (3) did not finish within 1 hour without optimization, while it took only 0.497 seconds to complete with optimization.
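For reference, the three probe queries can be written as plain SQL and timed over JDBC as in the following minimal sketch. This is our own illustration, not part of XDB: the connection URL, credentials, and schema name are placeholders, the column names follow the standard TPC-H schema, and running the statements against a single MySQL node (with the MySQL Connector/J driver on the classpath) does not exercise XDB's distributed, duplicate-aware operators.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OptimizationProbes {
    // Placeholder connection string; each XDB worker in our setup is a plain MySQL node.
    private static final String URL = "jdbc:mysql://node1:3306/tpch?user=bench&password=bench";

    private static final String[] QUERIES = {
        // (1) distinct count on a PREF-partitioned table that may contain duplicates
        "SELECT COUNT(DISTINCT c_custkey) FROM customer",
        // (2) semi join: customers that have at least one order
        "SELECT COUNT(*) FROM customer c WHERE EXISTS (SELECT 1 FROM orders o WHERE o.o_custkey = c.c_custkey)",
        // (3) anti join: customers without any order
        "SELECT COUNT(*) FROM customer c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.o_custkey = c.c_custkey)"
    };

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(URL);
             Statement stmt = con.createStatement()) {
            for (String sql : QUERIES) {
                long start = System.nanoTime();
                try (ResultSet rs = stmt.executeQuery(sql)) {
                    rs.next();
                    long count = rs.getLong(1);
                    double secs = (System.nanoTime() - start) / 1e9;
                    System.out.printf("%-40.40s -> %d (%.3f s)%n", sql, count, secs);
                }
            }
        }
    }
}

The DISTINCT, EXISTS, and NOT EXISTS forms are exactly the cases where duplicate-aware execution on PREF-partitioned tables pays off in the experiment above.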

Figure 8: Runtime for individual TPC-H queries

Figure 9: Effectiveness of Optimizations

Figure 10: Costs of Bulk Loading

5.2 Costs of Bulk Loading
Setup: In this experiment, we bulk loaded the TPC-H database with SF = 10 into XDB. For the cluster setup, we use the same 10 machines as in the experiment before (see Section 5.1).

Results: We report the elapsed time of bulk loading a partitioned TPC-H database for all partitioning schemes discussed in Section 5.1. While the Classical Partitioning (CP) scheme uses only hash partitioning and replication, all other schemes also use PREF partitioned tables that are bulk loaded using the procedure described in Section 2.3. Thus, our schemes (SD and WD) have to pay higher costs when inserting a tuple into a PREF partitioned table, since this requires a look-up operation on the referenced table. However, CP has a much higher data-redundancy (as shown before) and therefore has higher I/O costs.

The results in Figure 10 show that the total costs of SD (wo small tables) are only a little higher than for CP. In SD (wo small tables, wo redundancy) the costs are a factor 2x higher than for SD (wo small tables). The reason is that the biggest table LINEITEM is PREF partitioned, where each tuple needs a look-up operation. When disallowing redundancy in SD, it is a common pattern that the biggest table is PREF partitioned; the reason is that the biggest table is likely to have outgoing foreign keys that can be leveraged as partitioning predicates without adding redundancy. Finally, WD has the highest bulk loading costs, since it pays the costs of both higher redundancy and the look-ups for bulk loading PREF tables. When comparing Figure 7 (execution costs) and Figure 10 (loading costs), we see that the better query performance is often paid for by higher bulk loading costs, which is worthwhile in data warehousing scenarios.
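To illustrate where the per-tuple look-up cost (and the resulting redundancy) comes from, the following is a minimal in-memory sketch of placing the tuples of a PREF-partitioned table; it is not XDB's bulk loader from Section 2.3, and all names, keys, and partition assignments are hypothetical.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PrefPlacementSketch {

    /**
     * Places the tuples of a PREF-partitioned table: each tuple is copied into every
     * partition that already stores a matching join-key value of the referenced
     * (partitioning) table. The per-tuple look-up into referencedKeyToPartitions is
     * the extra bulk-loading cost discussed above; the copies are the data-redundancy.
     */
    static Map<Integer, List<int[]>> placePrefTuples(List<int[]> prefTuples, // tuple = {joinKey, payload}
                                                     Map<Integer, Set<Integer>> referencedKeyToPartitions,
                                                     int numPartitions) {
        Map<Integer, List<int[]>> partitions = new HashMap<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.put(p, new ArrayList<>());
        }
        for (int[] tuple : prefTuples) {
            Set<Integer> targets =
                referencedKeyToPartitions.getOrDefault(tuple[0], Collections.emptySet());
            for (int p : targets) {
                partitions.get(p).add(tuple); // one copy per matching partition
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical placement of the referenced table LINEITEM: orderkey 1 is stored
        // on partitions 0 and 2, orderkey 2 on partition 1, orderkey 3 on partition 2.
        Map<Integer, Set<Integer>> lineitemIndex = new HashMap<>();
        lineitemIndex.put(1, new HashSet<>(Arrays.asList(0, 2)));
        lineitemIndex.put(2, new HashSet<>(Collections.singletonList(1)));
        lineitemIndex.put(3, new HashSet<>(Collections.singletonList(2)));

        // ORDERS tuples to load, PREF partitioned by LINEITEM on orderkey.
        List<int[]> orders = Arrays.asList(new int[]{1, 101}, new int[]{2, 102}, new int[]{3, 103});

        Map<Integer, List<int[]>> placed = placePrefTuples(orders, lineitemIndex, 3);
        placed.forEach((p, ts) -> System.out.println("partition " + p + ": " + ts.size() + " ORDERS tuple(s)"));
        // The order with orderkey 1 is stored twice (partitions 0 and 2) -> redundancy.
    }
}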
5.3 Effectiveness of Partition Design
Setup: In this experiment, we use an Amazon EC2 machine of type m2.4xlarge with 8 virtual CPUs (26 ECUs), 68.4 GB of RAM, and 2 · 840 GB of local instance storage to run our partitioning algorithms. The partitioning design algorithms were implemented in Java 8 and we did not parallelize their computation. Compared to the experiments before, we also use the TPC-DS database in this experiment to show the effects of skew.

Results: We first report the actual data-locality and data-redundancy resulting from partitioning a TPC-H and a TPC-DS database of scaling factor SF = 10 into 10 partitions (i.e., for 10 nodes). We did not use a higher SF since the results for data-locality and data-redundancy would be very similar for our design algorithms with a higher SF. Afterwards, we show how data-redundancy evolves if we scale the number of nodes and partitions from 1 to 100 for both databases (using SF = 10 for all experiments). This shows how well scale-out scenarios are supported.

TPC-H (10 partitions): For partitioning the TPC-H database, we use all variants shown for the first experiment in Section 5.1. Figure 11(a) shows the data-locality and the actual data-redundancy that result for the different variants shown before. Additionally, we added two baselines: All Replicated (i.e., all tables are replicated) and All Hashed (i.e., all tables are hash partitioned on their primary key). While All Replicated (AR) achieves perfect data-locality (DL = 1) at the price of full data-redundancy (DR = 9 = n − 1), where n = 10 is the number of nodes, All Hashed (AH) has no data-redundancy (DR = 0) but at the same time achieves no data-locality (DL = 0). Like All Replicated, CP also achieves perfect data-locality (DL = 1) with less, but still high, data-redundancy. Our design algorithms also achieve high data-locality, however with much less data-redundancy. For example, SD (wo small tables) achieves perfect data-locality (DL = 1) with very little data-redundancy (DR = 0.5), while WD has a slightly higher data-redundancy (DR = 1.5). Moreover, when reducing data-redundancy to DR = 0.19 with SD (wo small tables, wo data-redundancy), we still achieve a reasonable data-locality of DL = 0.7.

(a) TPC-H
(b) TPC-DS
Figure 11: Locality vs. Redundancy

TPC-DS (10 partitions): For partitioning the TPC-DS database, we compare the following variants:
• CP (Naive and Individual Stars): This represents the classical partition design as described before. For TPC-DS we applied it in two variants: (Naive), where we only co-partition the biggest table with its connected biggest table and replicate all other tables, and (Individual Stars), where we manually split the TPC-DS schema into individual star schemata by separating each fact table and all its dimension tables into an individual schema (resulting in duplicate dimension tables at the cut) and then apply CP to each star.
• SD (Naive and Individual Stars, wo small tables): For SD we removed 5 small tables (each with less than 1000 tuples) and applied the SD algorithm in the two variants described before: (Naive), where we apply the SD algorithm to all tables, and (Individual Stars), where we apply SD to each individual star.
• WD (wo small tables): We applied our WD algorithm, which merged all 99 queries, representing 165 individual connected components (after separating SPJA sub-queries), into 17 connected components (i.e., MASTs) in the first merge phase; by dynamic programming we then reduced them to 7 connected components (i.e., the number of fact tables).

Figure 11(b) shows the actual data-locality and data-redundancy that result for the different variants shown before, as well as for the two baselines (All Replicated and All Hashed). Importantly, CP needs a higher data-redundancy of DR = 4.15 to achieve perfect data-locality than for TPC-H. This is due to replicating more tables of the TPC-DS schema. CP (Individual Stars) involves manual effort but therefore has a much lower data-redundancy of DR = 1.32. Moreover, while SD introduces even less data-redundancy (DR = 0.23), it also achieves a much lower data-locality (DL = 0.49) in its naive variant. SD (Individual Stars) mitigates this with almost the same DR and DL = 0.65. Finally, our WD algorithm results in perfect data-locality (without any manual effort) by adding a little more data-redundancy (DR = 1.4) compared to CP (Individual Stars).

TPC-H and TPC-DS (1-100 partitions): The goal of this experiment is to show the effect of scale-out on the data-locality and data-redundancy of all schemes discussed before. In this experiment, we partition the TPC-H and TPC-DS databases of SF = 10 into 1-100 partitions. For partitioning, we compare the best SD and WD variants to the best CP variant of our previous experiments. We do not show the two baselines All Replicated and All Hashed: while for All Replicated DR would grow linearly (i.e., DR = n), All Hashed always has DR = 0. Figure 12 shows the resulting data-redundancy (DR) for TPC-H and TPC-DS: the best CP scheme has a DR which grows more slowly than All Replicated but still has a linear growth rate. WD and SD have a sub-linear growth rate, which is much lower for large numbers of nodes. Consequently, this means for CP that each individual node has to store more data than for the other schemes when scaling out. Thus, scale-out scenarios are not well supported by CP, since the performance of query processing will decrease. Note that here we only show data-redundancy, since one can easily see that data-locality does not change with a varying number of nodes for any of the schemes. Therefore, since the data-redundancy of our approach grows much more slowly than for CP, while data-locality remains unchanged, an increasing number of nodes has a more positive effect on query processing in our approach than in the CP scheme.

(a) TPC-H
(b) TPC-DS
Figure 12: Varying # of Partitions and Nodes

5.4 Accuracy vs. Efficiency of Partitioning
Setup: We use the same setup as in the experiment before (see Section 5.3).
Figure 13: Accuracy vs. Runtime (SD)

Results: In this experiment, we show the accuracy of our data-redundancy (DR) estimates when partitioning a TPC-H database (SF = 10, wo skew) and a TPC-DS database (SF = 10, w skew) for varying sampling rates (i.e., 1-100%). To quantify accuracy, we calculate the approximation error as |Estimated(DR) − Actual(DR)|/Actual(DR). Moreover, we also analyze the runtime effort under different sampling rates (which includes the runtime to build histograms from the database). Figure 13 shows the results of this experiment for the SD (wo small tables) variant. We can see that a small sampling rate of 10% results in a very low approximation error of about 3% for TPC-H and 8% for TPC-DS, while the runtime effort is acceptable since it only needs to be executed once (101s for TPC-H and 246s for TPC-DS). The difference in approximation error between TPC-H and TPC-DS can be accounted for by the difference in the data distribution of these two benchmarks: while TPC-H is uniformly distributed, TPC-DS is highly skewed, which results in a higher approximation error. The results of WD are not shown in Figure 13 since it has the same approximation error as SD. Moreover, the runtime of WD is dominated by the merge phase, which leads to approximately a factor of 10x increase compared to SD.
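As a concrete reading of the error metric (the estimate below is hypothetical, the other numbers are from Section 5.3 and this experiment): with Actual(DR) = 0.5 for SD (wo small tables) on TPC-H and an estimate of, say, Estimated(DR) = 0.515, the approximation error is

|Estimated(DR) − Actual(DR)| / Actual(DR) = |0.515 − 0.5| / 0.5 = 0.03

which corresponds to the roughly 3% error observed at a 10% sampling rate.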
6. RELATED WORK
Horizontal Partitioning Schemes for Parallel Database Systems: Horizontally co-partitioning large tables on their join keys was already introduced in the 1990s by parallel database systems such as Gamma [8] and Grace [10] in order to avoid remote join operations. Today, co-partitioning is becoming even more important for modern parallel data management platforms such as Shark [23] in order to avoid expensive shuffle operations in MapReduce-based execution engines, since CPU performance has grown much faster than network bandwidth [19]. However, in complex schemata with many tables, co-partitioning on the join key is limited since only subsets of tables which share the same join key can be co-partitioned. Reference partitioning [9] (or REF partitioning for short) is an existing partitioning scheme to co-partition a table by another table referenced by an outgoing foreign key (i.e., a referential constraint). Using REF partitioning, chains of tables can be co-partitioned based on outgoing foreign keys. However, REF partitioning does not support incoming foreign keys. Our PREF partitioning generalizes REF to use an arbitrary equi-join predicate as the partitioning predicate. Another option to achieve high data-locality for joins is to fully replicate tables to all nodes in a cluster. However, when fully replicating tables, data parallelism typically decreases, since the complete query is routed to one copy and executed locally. Simple Virtual Partitioning (SVP) [4] and Adaptive Virtual Partitioning (AVP) [13] are two techniques that achieve data parallelism for fully replicated databases by splitting a query into sub-queries which read only subsets (i.e., virtual partitions) of the data by adding filter predicates. Compared to PREF, for multi-way joins SVP and AVP can only add a predicate to at most two co-partitioned tables, which results in expensive full table scans for the other tables in the join path. Moreover, for modern data management platforms with a huge number of (commodity) nodes and large data sets, full replication is also not desirable.

Automatic Design Tuning for Parallel Database Systems: While there exists a lot of work in the area of physical design tuning for single-node database systems, much less work exists for tuning parallel database systems [17, 16, 7, 14, 18]. Especially for automatically finding optimal partitioning schemes for OLAP workloads, we are only aware of a few approaches (e.g., those described in [14, 18, 20]). Compared to our automated design algorithms that build on PREF, these approaches rely only on existing partitioning schemes (such as hash, range-based, and round-robin) as well as replication, and decide which tables to co-partition and which to replicate. Moreover, the two approaches in [14, 18] are tightly coupled with the database optimizer. In this paper, however, we show that our partitioning design algorithms, which are independent of any database optimizer, can give much better query performance by efficiently using our novel PREF scheme (even when not knowing the workload in advance). Recently, different automatic partitioning design algorithms have been suggested for OLTP workloads [17, 16, 7]. However, the goal of these approaches is to cluster all data used by individual transactions on a single node in order to avoid distributed transactions. For OLAP, in contrast, it is desirable to distribute the data needed for one transaction (i.e., an analytical query) evenly over different nodes to allow data-parallel processing. Thus, many of the automatic partitioning design algorithms for OLTP are not applicable to OLAP workloads.

7. CONCLUSIONS AND OUTLOOK
In this paper, we presented PREF, a novel horizontal partitioning scheme that allows co-partitioning a set of tables by a given set of join predicates by introducing duplicates. Furthermore, based on PREF, we also discussed two automatic partitioning design algorithms that maximize data-locality while minimizing data-redundancy. While our schema-driven design algorithm uses only a schema as input and derives potential join predicates from the schema, the workload-driven algorithm additionally uses a set of queries as input. Our experiments show that while the schema-driven algorithm works reasonably well for small schemata, the workload-driven design algorithm is more effective for complex schemata with a bigger number of tables.

One potential avenue of future work is to adapt our automatic partitioning design algorithms to consider data-locality also for operations other than joins (e.g., aggregations). Moreover, it would also be interesting to adapt our partitioning design algorithms to dynamic data (i.e., updates) and to mixed workloads (OLTP and OLAP) as well as to pure OLTP workloads. We believe that our partitioning design algorithms can also be used to partition schemata for OLTP workloads (when we disallow data-redundancy for all tables), since the tuples used by a transaction can typically be described by a set of join predicates. Finally, partition pruning for PREF is another interesting avenue of future work.
8. REFERENCES
[1] Cloudera Impala. http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html.
[2] TPC-DS. http://www.tpc.org/tpcds/.
[3] TPC-H. http://www.tpc.org/tpch/.
[4] F. Akal, K. Böhm, and H.-J. Schek. OLAP Query Evaluation in a Database Cluster: A Performance Study on Intra-Query Parallelism. In ADBIS, pages 218–231, 2002.
[5] C. Binnig, N. May, and T. Mindnich. SQLScript: Efficiently Analyzing Big Enterprise Data in SAP HANA. In BTW, pages 363–382, 2013.
[6] C. Binnig, A. Salama, A. C. Müller, E. Zamanian, H. Kornmayer, and S. Lising. XDB: A novel database architecture for data analytics as a service. In SoCC, page 39, 2013.
[7] C. Curino, Y. Zhang, E. P. C. Jones, and S. Madden. Schism: A Workload-Driven Approach to Database Replication and Partitioning. PVLDB, 3(1):48–57, 2010.
[8] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma Database Machine Project. IEEE Trans. Knowl. Data Eng., 2(1):44–62, 1990.
[9] G. Eadon, E. I. Chong, S. Shankar, A. Raghavan, J. Srinivasan, and S. Das. Supporting table partitioning by reference in Oracle. In SIGMOD Conference, pages 1111–1122, 2008.
[10] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine GRACE. In VLDB, pages 209–219, 1986.
[11] R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, 2nd edition, 1994.
[12] H. Herodotou, N. Borisov, and S. Babu. Query optimization techniques for partitioned tables. In SIGMOD Conference, pages 49–60, 2011.
[13] A. A. B. Lima, M. Mattoso, and P. Valduriez. Adaptive Virtual Partitioning for OLAP Query Processing in a Database Cluster. JIDM, 1(1):75–88, 2010.
[14] R. V. Nehme and N. Bruno. Automated partitioning design in parallel database systems. In SIGMOD Conference, pages 1137–1148, 2011.
[15] M. T. Özsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
[16] A. Pavlo, C. Curino, and S. B. Zdonik. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In SIGMOD Conference, pages 61–72, 2012.
[17] A. Quamar, K. A. Kumar, and A. Deshpande. SWORD: Scalable workload-aware data placement for transactional workloads. In EDBT, pages 430–441, 2013.
[18] J. Rao, C. Zhang, N. Megiddo, and G. M. Lohman. Automating physical database design in a parallel database. In SIGMOD Conference, pages 558–569, 2002.
[19] W. Rödiger, T. Mühlbauer, P. Unterbrunner, A. Reiser, A. Kemper, and T. Neumann. Locality-Sensitive Operators for Parallel Main-Memory Database Clusters. In ICDE, 2014.
[20] T. Stöhr, H. Märtens, and E. Rahm. Multi-Dimensional Database Allocation for Parallel Data Warehouses. In VLDB, pages 273–284, 2000.
[21] F. M. Waas. Beyond Conventional Data Warehousing - Massively Parallel Data Processing with Greenplum Database. In BIRTE (Informal Proceedings), 2008.
[22] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 1st edition, 2009.
[23] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD Conference, pages 13–24, 2013.

APPENDIX
A. ESTIMATING REDUNDANCY
As discussed in Section 2, predicate-based reference partitioning might result in some redundancy in PREF partitioned tables. Since seed tables in our partitioning design algorithms never contain duplicate tuples (and thus no redundancy), in order to estimate the size of a database after partitioning, the expected redundancy of all tables that are PREF partitioned must be estimated. In the following, we explain our probabilistic method to estimate the size of a given PREF partitioned table.

Redundancy is cumulative: if a referenced table in the PREF scheme contains duplicates, the referencing table inherits those duplicates as well. This is best explained by the example in Figure 2. Table ORDERS has two copies of the tuple with orderkey=1. Table CUSTOMER, which is PREF partitioned by ORDERS, inherits this duplicate (i.e., the customer with custkey=1 is therefore stored redundantly in partitions 1 and 3). In other words, PREF partitioning can be viewed as walking a tree, where the nodes are the tables (the seed table being the root) and each referenced table is the parent of its referencing table(s). Redundancy in the parent table results in redundancy in the child table. Therefore, in order to estimate the redundancy in a PREF table, we have to take into account the redundancy of all tables along the path to the seed table.

To this end, we assign a redundancy factor (denoted by r) to each edge in the MAST. To find the redundancy factor of an edge, we use a histogram of the join key in the referenced table (where we can use sampling to reduce the runtime effort of building histograms). For example, assume that we want to estimate the size of the table ORDERS in Figure 2 after partitioning. For each value in column orderkey of table LINEITEM, we calculate the expected number of duplicates of the corresponding tuple in table ORDERS after partitioning (i.e., the number of partitions which contain a copy of that tuple). The idea behind this method is that tuples with a lower frequency in the histogram are expected to end up in fewer partitions (i.e., fewer duplicates) than tuples with a higher frequency. Therefore, by calculating the expected number of copies for each distinct value in the join key of the referenced table, and then adding them all up, we obtain an estimate of the size of the referencing table after partitioning.

We now formally explain our method. Let the random variable X denote the number of copies of a tuple after partitioning, let n be the number of partitions, and let f be the frequency of that tuple's join-key value in the histogram. Note that X can take any value between 1 and m = min(n, f): X = 1 results when all references to that tuple happen to be in the same partition, and therefore no replica is needed. On the other hand, X = min(n, f) is the maximum value, since the number of copies of a tuple is either n (i.e., full replication) or f, when f < n and each reference ends up in a separate partition. The expected number of copies of a tuple with frequency f is therefore as follows:

E_{f,n}[X] = 1 · P_{f,n}(X = 1) + 2 · P_{f,n}(X = 2) + · · · + m · P_{f,n}(X = m)
where P_{f,n}(X = x) is the probability that the tuple has x copies (meaning that it is replicated to x different partitions, out of the total n partitions). It can be calculated as follows:

P_{f,n}(X = x) = C(n, x) · x! · S(f, x) / n^f

In the formula above, C(n, x) denotes the binomial coefficient (n choose x) and S(f, x) is the Stirling number of the second kind, i.e., the number of ways to partition a set of f distinguishable objects (tuples) into x non-empty indistinguishable boxes (partitions). The numerator, therefore, is the number of ways to choose x partitions out of the total n partitions, i.e., C(n, x), and then to partition the f tuples into these x distinguishable partitions, which is why the Stirling number is multiplied by x!. The denominator is the total number of ways to put f distinct tuples into n distinct partitions.

The formula above requires a number of expensive recursions (because of the Stirling number). Since E_{f,n}[X] depends only on f and n, the entire computation can be done in a preprocessing phase. Therefore, instead of actually calculating the expected number of copies for each tuple at runtime, only a fast look-up in a pre-computed table is needed. Thus, the time complexity of finding E_{f,n}[X] at runtime is O(1).
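This preprocessing step can be sketched as follows (our own illustration in Java, not XDB's code): the table of E_{f,n}[X] values is computed once for all frequencies up to a chosen maximum, using exact integer arithmetic for C(n,x), x!, S(f,x), and n^f, so that the estimate for any tuple is a single O(1) array look-up at runtime. The occupancy expression in the sanity check, n · (1 − (1 − 1/n)^f), is a standard closed form for the expected number of non-empty partitions and is used here only to validate the table.

import java.math.BigDecimal;
import java.math.BigInteger;
import java.math.MathContext;

public class ExpectedCopiesTable {

    // Stirling numbers of the second kind, S(f, x) = x * S(f-1, x) + S(f-1, x-1), S(0, 0) = 1.
    static BigInteger[][] stirling(int maxF, int maxX) {
        BigInteger[][] s = new BigInteger[maxF + 1][maxX + 1];
        for (BigInteger[] row : s) java.util.Arrays.fill(row, BigInteger.ZERO);
        s[0][0] = BigInteger.ONE;
        for (int f = 1; f <= maxF; f++)
            for (int x = 1; x <= Math.min(f, maxX); x++)
                s[f][x] = BigInteger.valueOf(x).multiply(s[f - 1][x]).add(s[f - 1][x - 1]);
        return s;
    }

    static BigInteger binomial(int n, int x) {
        BigInteger r = BigInteger.ONE;
        for (int i = 0; i < x; i++)
            r = r.multiply(BigInteger.valueOf(n - i)).divide(BigInteger.valueOf(i + 1));
        return r;
    }

    static BigInteger factorial(int x) {
        BigInteger r = BigInteger.ONE;
        for (int i = 2; i <= x; i++) r = r.multiply(BigInteger.valueOf(i));
        return r;
    }

    // E_{f,n}[X] = sum_{x = 1..min(n,f)} x * C(n,x) * x! * S(f,x) / n^f
    static double expectedCopies(int f, int n, BigInteger[][] s) {
        BigDecimal nPowF = new BigDecimal(BigInteger.valueOf(n).pow(f));
        BigDecimal e = BigDecimal.ZERO;
        for (int x = 1; x <= Math.min(n, f); x++) {
            BigInteger ways = binomial(n, x).multiply(factorial(x)).multiply(s[f][x]);
            BigDecimal p = new BigDecimal(ways).divide(nPowF, MathContext.DECIMAL64);
            e = e.add(p.multiply(BigDecimal.valueOf(x)));
        }
        return e.doubleValue();
    }

    public static void main(String[] args) {
        int n = 10;         // number of partitions
        int maxFreq = 50;   // largest join-key frequency expected in the histogram
        BigInteger[][] s = stirling(maxFreq, n);

        // Preprocessing: E depends only on (f, n), so a pre-computed table gives O(1) look-ups.
        double[] lookup = new double[maxFreq + 1];
        for (int f = 1; f <= maxFreq; f++) lookup[f] = expectedCopies(f, n, s);

        for (int f : new int[]{1, 2, 4, 10, 50}) {
            double occupancy = n * (1 - Math.pow(1 - 1.0 / n, f)); // sanity check (closed form)
            System.out.printf("f=%2d  E[X]=%.4f  (occupancy check: %.4f)%n", f, lookup[f], occupancy);
        }
    }
}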
We are now ready to calculate the redundancy factor of an edge in the MAST. We define it as the after-partitioning size of a table divided by its original size. Let V_e denote the set of distinct values in the join key of the referenced table T_i over its outgoing edge e to the referencing table T_j (for example, the distinct values of column orderkey of table LINEITEM in Figure 2 are {1, 2, 3}). The redundancy factor is then defined as follows:

r(e) = ( Σ_{v ∈ V_e} E[v] ) / |T_j|

where E[v] is the expected number of copies E_{f,n}[X] for the frequency f of value v. Note that r(e) ranges between two extremes: 1 (no redundancy) and n (full redundancy). As mentioned before, the after-partitioning size of a referencing table is not determined only by the redundancy factor of the immediate edge coming from its referenced table, but also by the redundancy factors of all edges along the path from the seed table. Finally, we can now estimate the after-partitioning size of table T_i:

|T_i^P| = |T_i| · Π_{e ∈ path(T_RF, T_i)} r(e)

where path(T_RF, T_i) consists of all edges from the seed table T_RF to T_i in the MAST.

For example, assume that r(LINEITEM → ORDERS) turns out to be 2, meaning that |ORDERS^P| would be twice as big as the original size |ORDERS|. Now, if r(ORDERS → CUSTOMER) is 3, the after-partitioning size of table CUSTOMER would be estimated to be 2 · 3 = 6 times its original size.
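A small companion sketch of the two estimates above (again our own illustration: method names and the in-memory histogram are hypothetical, the frequencies are made up rather than the Figure 2 data, and the closed occupancy form is used as a shortcut for the E_{f,n}[X] lookup table of the previous sketch):

import java.util.LinkedHashMap;
import java.util.Map;

public class RedundancyEstimateSketch {

    // E_{f,n}[X]: expected number of partitions holding a copy of a tuple whose
    // join-key value occurs f times in the referenced table.
    static double expectedCopies(int f, int n) {
        return n * (1.0 - Math.pow(1.0 - 1.0 / n, f));
    }

    // r(e) = (sum over distinct join-key values v of E[v]) / |T_j|
    static double redundancyFactor(Map<Integer, Integer> joinKeyHistogram,
                                   long referencingTableSize, int n) {
        double estimatedPartitionedSize = 0.0;
        for (int f : joinKeyHistogram.values()) {
            estimatedPartitionedSize += expectedCopies(f, n);
        }
        return estimatedPartitionedSize / referencingTableSize;
    }

    // |T_i^P| = |T_i| * product of r(e) along the path from the seed table to T_i
    static double estimatedSizeAfterPartitioning(long originalSize, double... redundancyFactorsOnPath) {
        double size = originalSize;
        for (double r : redundancyFactorsOnPath) {
            size *= r;
        }
        return size;
    }

    public static void main(String[] args) {
        // Hypothetical histogram of LINEITEM.orderkey: key 1 occurs 4 times, key 2 twice, key 3 once.
        Map<Integer, Integer> histogram = new LinkedHashMap<>();
        histogram.put(1, 4);
        histogram.put(2, 2);
        histogram.put(3, 1);

        int n = 3;            // number of partitions
        long ordersSize = 3;  // |ORDERS|: one order per distinct orderkey
        System.out.printf("r(LINEITEM -> ORDERS) ~ %.2f%n", redundancyFactor(histogram, ordersSize, n));

        // The worked example from the text: r = 2 on the first edge and r = 3 on the second
        // edge blow CUSTOMER up to 2 * 3 = 6 times its original size.
        System.out.println(estimatedSizeAfterPartitioning(150_000, 2.0, 3.0));
    }
}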
