
Solving the Join Ordering Problem

via Mixed Integer Linear Programming

Immanuel Trummer and Christoph Koch


{firstname}.{lastname}@epfl.ch
École Polytechnique Fédérale de Lausanne
arXiv:1511.02071v1 [cs.DB] 6 Nov 2015

ABSTRACT

We transform join ordering into a mixed integer linear program (MILP). This allows us to address query optimization with mature MILP solver implementations that have evolved over decades and steadily improved their performance. They offer features such as anytime optimization and parallel search that are highly relevant for query optimization.

We present a MILP formulation for searching left-deep query plans. We use sets of binary variables to represent join operands and intermediate results, operator implementation choices, or the presence of interesting orders. Linear constraints restrict value assignments to the ones representing valid query plans. We approximate the cost of scan and join operations via linear functions, allowing us to increase approximation precision up to arbitrary degrees. Our experimental results are encouraging: we are able to find optimal plans for joins of 60 tables, a query size that is beyond the capabilities of prior exhaustive query optimization methods.

1. INTRODUCTION

From the developer's perspective, there are two ways of solving a hard optimization problem on a computer: either we write optimization code from scratch that is customized for the problem at hand, or we transform the problem into a popular problem formalism and use existing solver implementations. In principle, the first approach could lead to more efficient code, as it allows us to exploit specific problem properties. Also, we do not require a transformation that might blow up the size of the problem representation. In practice, however, our customized code competes against mature solver implementations for popular problem models that have been fine-tuned over decades [5], driven by a multitude of application scenarios. Using an existing solver reduces the amount of code that needs to be written, and we might obtain desirable features such as parallel optimization or anytime behavior (i.e., obtaining solutions of increasing quality as optimization progresses) automatically from the solver implementation. It is therefore in general advisable to consider and to evaluate both approaches for solving an optimization problem.

We apply this generic insight to the problem of database query optimization. For the last thirty years, the problem of exhaustive query optimization, more precisely the core problem of join ordering and operator selection [26], has typically been solved by customized code inside the query optimizer. Query optimizers consist of millions of code lines [34] and are the result of thousands of man-years worth of work [18]. The question arises whether this development effort is actually necessary or whether we can transform query optimization into another popular problem formalism and use existing solvers. We study that question in this paper.

We transform the join ordering problem into a mixed integer linear program (MILP). We select that formalism for its popularity. Integer programming approaches are currently the method of choice for solving thousands of optimization problems from a wide range of areas [20]. The corresponding software solvers have sometimes evolved over decades and reached a high level of maturity [5]. Commercial solvers such as Cplex¹ or Gurobi² are available for MILP, as well as open-source alternatives such as SCIP³.

¹ http://www.ibm.com/software/products/en/ibmilogcpleoptistud
² http://www.gurobi.com/
³ http://scip.zib.de/

Those solvers offer several features that are useful for query optimization. First of all, they possess the anytime property: they produce solutions of increasing quality as optimization progresses and are able to provide bounds on how far the current solution is from the optimum. Chaudhuri recently mentioned the development of anytime algorithms as one of the relevant research challenges in query optimization [7]. Mapping query optimization to MILP immediately yields an algorithm with that property (note that recently proposed anytime algorithms for multi-objective query optimization [31] are not applicable to traditional query optimization). Second, MILP solvers already offer support for parallel optimization, which is an active topic of research in query optimization as well [12, 34, 27]. Finally, the performance of MILP solvers has improved (hardware-independently) by more than a factor of 450,000 over the past twenty years [5]. It seems entirely likely that those advances can speed up query optimization as well (and, anticipating our experimental results, we indeed find classes of query optimization problems where a MILP-based approach treats
query sizes that are illusory for prior exhaustive query optimization algorithms).

In summary, by connecting query optimization to integer programming, we benefit from over sixty years of theoretical research and decades of implementation efforts. Even better, a mapping from query optimization to MILP does not only enable us to benefit from past research but also from all future research and development advances that originate in the fruitful area of MILP. Performance improvements have been steady in the past [5] and, as several major software vendors compete in that market, are likely in the future as well.

Given that integer programming transformations have been proposed for many optimization problems that connect to query optimization [1, 2, 10, 25, 35], it is actually surprising that no such mapping has been proposed for the join ordering problem itself so far. There are even sub-domains of query optimization, notably parametric query optimization [11, 15, 16] and multi-objective parametric query optimization [32], where it is common to approximate the cost of query plans via piecewise-linear functions. The purpose there is, however, to model the dependency of plan cost on unknown parameters, while traditional approaches such as dynamic programming are used to find the optimal join order. None of the aforementioned publications transforms the join ordering problem into a MILP, and the same applies to the additional related work that we discuss in Section 2.

A MILP is specified by a set of variables with either continuous or integer value domains, a set of linear constraints on those variables, and a linear objective function that needs to be minimized. An optimal solution to a MILP is an assignment from variables to values that minimizes the objective function. We sketch out next how we transform the join ordering problem into a MILP.

Left-deep query plans can be represented as follows (we simplify by not considering alternative operator implementations; the extensions are discussed later). For a given query, we can derive the total number of required join operations from the number of query tables. As we know the number of required joins in advance, we introduce for each join operand and for each query table a binary variable indicating whether the table is part of that join operand. We add linear constraints enforcing, for instance, that single tables are selected for the inner join operands (a particularity of left-deep query plans), that the outer join operands are the result of the prior join (except for the first join), or that join operands have no overlap. The result is a MILP where each solution represents a valid left-deep query plan.

This is not yet useful: we must associate query plans with cost in order to obtain the optimal plan from the MILP solver. The cost of a query plan depends on the cardinality (or byte size) of intermediate results. The cardinality of an intermediate result depends on the selected tables and on the evaluated predicates. We introduce a binary variable for each predicate and each intermediate result, indicating whether the predicate has been evaluated to reduce cardinality. Predicate variables are restricted by linear constraints that make it impossible to evaluate a predicate as long as not all query tables it refers to are present in the corresponding result. The cardinality of the join of a set of tables on which predicates have been evaluated is usually estimated by the product of table cardinalities and predicate selectivities. As we cannot directly represent a product via linear constraints, we focus on the logarithm of the cardinality: the logarithm of a product is the sum of the logarithms of the factors. Based on our binary variables representing selected tables and evaluated predicates, we calculate the logarithm of the cardinality for all intermediate results that appear in a query plan. Based on the logarithm of the cardinality, we approximate the cost of query plans via sets of linear constraints and via auxiliary variables.

We must approximate cost functions since the cost of standard operators is usually not linear in the logarithm of input and output cardinalities. We can, however, choose the approximation precision by choosing the number of constraints and auxiliary variables. This allows in principle arbitrary degrees of precision. Also note that there are entire sub-domains of query optimization in which it is standard to approximate plan cost functions via linear functions [11, 15, 16, 32]. Approximating plan cost via linear functions is therefore a widely-used approach.

Our goal here was to give a first intuition for how our transformation works, and we have therefore considered join order alone and in a simplified setting. Later we show how to extend our approach to represent alternative operator implementations, complex cost models taking into account interesting orders and the evaluation cost of expensive predicates, or richer query languages.

We formally analyze our transformation in terms of the resulting number of constraints and variables. In our experimental evaluation, we apply the Gurobi MILP solver to query optimization problems that have been reformulated as MILP problems. We compare against a classical dynamic programming based query optimization algorithm on different query sizes and join graph structures. Our results are encouraging: the MILP approach often generates guaranteed near-optimal query plans after a few seconds where dynamic programming based optimization does not generate any plans up to the timeout of one minute.

The original scientific contributions of this paper are the following:

• We show how to reformulate query optimization as a MILP problem.

• We analyze the problem mapping and express the number of variables and constraints as a function of the query dimensions.

• We evaluate our approach experimentally and compare against a classical dynamic programming based query optimizer.

The remainder of this paper is organized as follows. We discuss related work in Section 2. In Section 3, we introduce our formal problem model. Section 4 describes how we transform query optimization into MILP. We analyze how the size of the resulting MILP problem grows in the dimensions of the original query optimization problem in Section 6. In Section 7, we experimentally evaluate an implementation of our MILP approach in comparison with a classical dynamic programming based query optimization algorithm.
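The logarithmic encoding sketched above can be illustrated in a few lines of Python (a toy example with invented cardinalities and selectivities, not code from the paper):

```python
import math

# Invented cardinalities and predicate selectivities for a three-table join.
card = {"R": 1000, "S": 200, "T": 50}
sel = {("R", "S"): 0.01, ("S", "T"): 0.05}

# Direct estimate: product of table cardinalities and predicate selectivities.
direct = card["R"] * card["S"] * card["T"] * sel[("R", "S")] * sel[("S", "T")]

# Linear route: the logarithm of the product is the sum of the logarithms,
# which is exactly the form a linear constraint can express.
log_card = sum(math.log(c) for c in card.values()) \
         + sum(math.log(s) for s in sel.values())

assert math.isclose(math.exp(log_card), direct)  # both routes agree
```

Inside the MILP, each log(Card(t)) and log(Sel(p)) term is a constant coefficient multiplying a binary variable, so the sum stays linear.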

2. RELATED WORK

MILP representations have been proposed for many optimization problems in the database domain, including but not limited to multiple query optimization [10], index selection [25], materialized view design [35], selection of data samples [1], or partitioning of data for parallel processing [2]. In the areas of parametric query optimization and multi-objective parametric query optimization, it is common to model the cost of query plans by linear functions that depend on unknown parameters [11, 15, 16, 32]. None of those prior publications, however, formalizes the join ordering and operator selection problem as a MILP.

Query optimization algorithms can be roughly classified into exhaustive algorithms that formally guarantee to find optimal query plans and heuristic algorithms which do not possess those formal guarantees. Exhaustive query optimization algorithms are often based on dynamic programming [26, 33, 21, 22]. We compare against such an approach in our experimental evaluation.

Our MILP-based approach to query optimization can be used as an exhaustive query optimization algorithm, since we can configure the MILP solver to return a guaranteed-optimal solution. The MILP solver can, however, easily be configured to return solutions that are guaranteed near-optimal (i.e., the cost of the result plan is within a certain factor of the optimum) or to return the best possible plan within a given amount of time. This makes the MILP approach more flexible than typical exhaustive query optimization algorithms. Furthermore, MILP solvers possess the anytime property, meaning that they produce multiple plans of decreasing cost during optimization. The development of anytime algorithms for query optimization has recently been identified as a research challenge [7]. Transforming query optimization into MILP immediately yields anytime query optimization. Note that anytime algorithms for multi-objective query optimization [31] cannot speed up traditional query optimization with one plan cost metric.

The parallelization of exhaustive query optimization algorithms (not to be confused with query optimization for parallel execution) is currently an active research topic [12, 13, 27, 34]. MILP solvers such as Cplex or Gurobi are able to exploit parallelism, and transforming query optimization into MILP hence yields parallel query optimization as well. The development of parallel query optimizers for new database systems generally requires significant investments [27]; the amount of code to be written can be significantly reduced by using a generic solver as the optimizer core.

Various heuristic and randomized algorithms have been proposed for query optimization [3, 6, 17, 28, 30, 29]. In contrast to many exhaustive algorithms, most of them possess the anytime property and generate plans of improving quality as optimization progresses. Those approaches can, however, not give any formal guarantees at any point in time about how far the current solution is from the optimum. MILP solvers provide upper bounds during optimization on the difference between the cost of the current solution and the theoretical optimum. Such bounds can for instance be used to stop optimization once the distance reaches a threshold. Randomized algorithms do not offer that possibility, and the returned solutions may be arbitrarily far from the optimum.

3. MODEL AND ASSUMPTIONS

The goal of query optimization is to find an optimal or near-optimal plan for a given query. It is common to introduce new query optimization algorithms by means of simplified problem models. We also use a simple query and query plan model throughout most of the paper, while we discuss extensions to richer query languages and plan models as well.

In our simplified model, we represent a query as a set Q of tables that need to be joined, together with a set P of binary predicates that connect the tables in Q (extensions to nested queries, queries with aggregates, queries with projections, and queries with non-binary predicates will be discussed). For each binary predicate p ∈ P, we designate by T1(p), T2(p) ∈ Q the two tables that the predicate refers to. Predicates can only be evaluated on relations in which both tables they refer to have been joined.

We assume in the simplified problem model that one scan and one binary join operator are available. As we consider binary joins, a query with n tables requires n − 1 join operations. A query plan is defined by the operands of those n − 1 join operations, more precisely by the tables that are present in those operands. We consider left-deep plans. For left-deep query plans, the inner operand is always a single table; the outer operand is the result of the previous join (except for the outer operand of the first join, which is a single table).

Query plans are compared according to their execution cost. The execution cost of a plan depends on the cardinality of the intermediate results it produces. We write Card(t) ≥ 1 to designate the cardinality of table t and Sel(p) ∈ (0, 1] to designate the selectivity of predicate p. We assume in the simplified model that the cardinality of the join between several tables, after having evaluated a set of join predicates, corresponds to the product of the table cardinalities and the predicate selectivities. We hence assume uncorrelated predicates in the simplified model, while extensions to correlated predicates will be discussed. We generally assume that the execution cost of a query plan is the sum of the execution costs of all its operations. We will show how to represent various cost functions.

We translate the problem of finding a cost-minimal plan for a given query into a mixed integer linear programming problem (MILP). A MILP problem is defined by a set of variables (that can have either integer or continuous value domains), a set of linear constraints on those variables, and a linear objective function on those variables that needs to be minimized. A solution to a MILP is an assignment from variables to values from the respective domains such that all constraints are satisfied. An optimal solution minimizes the objective function value among all solutions.

4. JOIN ORDERING APPROACH

The join ordering problem is usually solved by algorithms that are specialized for that problem and run inside the query optimizer. We adopt a radically different approach: we translate the join ordering problem into a MILP problem that we solve by a generic MILP solver.

MILP is an extremely popular formalism that is used to solve a variety of problems inside and outside the database community. By mapping the join ordering problem into a MILP formulation, we benefit from decades of theoretical research in the area of MILP.
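To make the simplified problem model of Section 3 concrete, the following toy sketch enumerates all left-deep plans (table permutations) of a small query by brute force and scores them by the sum of intermediate-result cardinalities. The tables, cardinalities, and selectivities are invented for illustration, and exhaustive enumeration is of course exactly what the MILP formulation avoids:

```python
import math
from itertools import permutations

# Invented instance of the simplified model: tables Q with cardinalities,
# binary predicates with selectivities.
card = {"R": 1000, "S": 200, "T": 50}
sel = {frozenset({"R", "S"}): 0.01, frozenset({"S", "T"}): 0.05}

def result_card(tables):
    """Estimated cardinality of joining `tables`: product of the table
    cardinalities and of the selectivities of all applicable predicates."""
    c = 1.0
    for t in tables:
        c *= card[t]
    for pred, s in sel.items():
        if pred <= set(tables):  # predicate applies once both its tables are joined
            c *= s
    return c

def plan_cost(order):
    """Cost of a left-deep plan (a table permutation): sum of the
    cardinalities of all intermediate results (prefixes of length >= 2)."""
    return sum(result_card(order[:k]) for k in range(2, len(order) + 1))

best = min(permutations(card), key=plan_cost)
assert math.isclose(plan_cost(best), 5500.0)  # joins S and T first
```

The optimal order starts with the two small, predicate-connected tables, which is the behavior the cost-minimizing MILP must reproduce through its linear constraints.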

We also benefit from solver implementations that have reached a high level of maturity. By linking query optimization to MILP, we make sure that query optimization will from now on indirectly benefit from all theoretical advances and refined implementations that become available in the MILP domain.

We explain in the following our mapping from a join ordering problem to a MILP. We describe the variables and constraints by which we represent valid join orders in Section 4.1. We show how to model the cardinality of join operands in Section 4.2. In Section 4.3, we associate plans with cost values based on the operand cardinalities.

Note that we introduce our mapping by means of a basic problem model in this section, while we discuss extensions to the query language, plan space, and cost model in Section 5.

4.1 Join Order

A MILP program is characterized by variables with associated value domains, a set of linear constraints on those variables, and a linear objective function on those variables that needs to be minimized. Table 1 summarizes the variables that we require to model join ordering as a MILP problem, and Table 2 summarizes the associated constraints. We introduce them step by step in the following.

We start by discussing the variables and constraints that we need in order to represent valid left-deep query plans. Later we discuss the variables and constraints that are required to estimate the cost of query plans.

We represent left-deep query plans for a query Q as follows. For the moment, we assume that only one join operator and one scan operator are available; we discuss extensions in Section 5. Under those assumptions, a query plan is specified by the join operands. We introduce a set of binary variables tio_tj (short for Table In Outer join operand) with the semantics that tio_tj is one if and only if query table t ∈ Q appears in the outer join operand of the j-th join. We number joins from 0 to jmax, where jmax is determined by the number of query tables. Analogously, we introduce a set of binary variables tii_tj (short for Table In Inner join operand) indicating whether the corresponding table is in the inner operand of the j-th join.

The variables representing left-deep plans have binary value domains. Note that not all possible value combinations represent a valid left-deep plan. For instance, we could represent joins with empty join operands. Or we could build plans that join only a subset of the query tables and are therefore incomplete. We must impose constraints in order to restrict the considered value combinations to the ones representing valid and complete left-deep plans.

Left-deep plans are characterized by the particularity that the inner operand consists of only one table for each join. We capture that fact by the constraint Σ_t tii_tj = 1, which we need to introduce for each join j. A similar constraint restricts the table selections for the outer operand of the first join (join index j = 0), as only one table can be selected as the initial operand. For the following joins (join index j ≥ 1), the outer join operand is always the result of the previous join, which is another characteristic of left-deep plans. This translates into the constraints tio_tj = tii_t,j−1 + tio_t,j−1. The latter constraint actually excludes the possibility that the same table appears in both operands of a join (since the sum tii_t,j−1 + tio_t,j−1 cannot exceed the maximal value of one for tio_tj), except for the last join. We add the constraint tio_t,jmax + tii_t,jmax ≤ 1 for the last join (and optionally for the other joins as well).

The number of joins is one less than the number of query tables. We join two (different) tables in the first join. After that, each join adds one new table to the set of joined tables, since the outer operand contains all tables that have been joined so far, since the inner operand consists of one table, and since inner and outer join operands do not overlap. As a result, we can only represent complete query plans that join all tables.

We could have chosen a different representation of query plans with fewer variables. The problem is that we need to be able to approximate the cost of query plans based on that representation using linear functions. Our representation of query plans might at first seem unnecessarily redundant, but it allows us to impose the constraints that we discuss next. Also note that MILP solvers typically try to eliminate unnecessary variables and constraints in preprocessing steps. This makes it less important to reduce the number of variables and constraints at the cost of readability.

Example 1. We illustrate the representation of left-deep query plans for the join query R ⋈ S ⋈ T. Answering the query requires two join operations. Hence we introduce six variables tio_tj for t ∈ {R, S, T} and j ∈ {0, 1} to represent outer join operands and six variables tii_tj to represent inner join operands. The join order (R ⋈ S) ⋈ T is, for instance, represented by setting tio_R0 = tii_S0 = 1 and tio_R1 = tio_S1 = tii_T1 = 1 and setting the other variables representing join operands to zero. This assignment satisfies the two constraints that restrict inner operands to single tables (e.g., Σ_{t∈{R,S,T}} tii_t1 = 1 for the second join), it satisfies the constraint restricting the outer operand of the first join to a single table (Σ_{t∈{R,S,T}} tio_t0 = 1), and it satisfies the constraints making the outer operand of the second join equal to the union of the operands of the first join (e.g., tio_R1 = tio_R0 + tii_R0).

4.2 Cardinality

Our goal is to find query plans with minimal cost, and hence we must associate query plans with a cost value. The execution cost of a query plan depends heavily on the cardinality of intermediate results. We need to represent the cardinality of join operands and join results in order to calculate the cost of query plans. Inner operands always consist of a single table, and calculating their cardinality is straightforward: designating by ci_j (short for Cardinality of Inner operand) the cardinality of the inner operand of join number j, we simply set ci_j = Σ_t tii_tj · Card(t), where Card(t) is the cardinality of table t.

Calculating the cardinality of outer join operands is, however, non-trivial, as we can only use linear constraints: the cardinality of a join result is usually estimated as the product of the cardinalities of the join operands times the selectivities of all predicates that are applied during the join. The product is a non-linear function and does not directly translate into linear constraints.

We circumvent that problem via the following trick. While cardinality is actually defined as the product of table cardinality values and predicate selectivity values, we represent the logarithm of the cardinality instead.

Table 1: Variables for formalizing join ordering for left-deep query plans as an integer linear program.

Symbol           Domain   Semantics
tio_tj / tii_tj  {0, 1}   Whether table t is in the outer/inner operand of the j-th join
pao_pj           {0, 1}   Whether predicate p can be evaluated on the outer operand of the j-th join
lco_j            R        Logarithm of the cardinality of the outer operand of the j-th join
cto_rj           {0, 1}   Whether the cardinality of the outer operand of the j-th join reaches the r-th threshold
co_j / ci_j      R+       Approximated cardinality of the outer/inner operand of the j-th join

Table 2: Constraints for join ordering in left-deep plan spaces.

Constraint                                                         Semantics
Σ_t tio_t0 = 1 / ∀j: Σ_t tii_tj = 1                                Select one table for the outer operand of the first join / for all inner operands
∀j ∀t: tio_tj + tii_tj ≤ 1                                         The operands of the same join cannot overlap
∀j ≥ 1 ∀t: tio_tj = tii_t,j−1 + tio_t,j−1                          The result of the prior join is the outer operand of the next join
∀p ∀j: pao_pj ≤ tio_T1(p),j ; pao_pj ≤ tio_T2(p),j                 A predicate is applicable only if both referenced tables are in the outer operand
∀j: ci_j = Σ_t Card(t) · tii_tj                                    Determines the cardinality of the inner operand
∀j: lco_j = Σ_t log(Card(t)) · tio_tj + Σ_p log(Sel(p)) · pao_pj   Determines the logarithm of the outer operand cardinality from the selected tables and applicable predicates
∀j ∀r: lco_j − cto_rj · ∞ ≤ log(θ_r)                               Activates the threshold flag if the cardinality reaches the threshold
∀j: co_j = Σ_r cto_rj · δθ_r                                       Translates activated threshold flags into an approximate cardinality
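The join-order constraints of Table 2 can be sanity-checked in plain Python on the assignment from Example 1, the plan (R ⋈ S) ⋈ T (a verification sketch, not solver code):

```python
# tio[t][j] / tii[t][j] mirror the paper's variables tio_tj and tii_tj
# for Example 1's assignment representing the plan (R join S) join T.
tables, joins = ["R", "S", "T"], [0, 1]
tio = {"R": [1, 1], "S": [0, 1], "T": [0, 0]}
tii = {"R": [0, 0], "S": [1, 0], "T": [0, 1]}

# One table in the outer operand of the first join and in every inner operand.
assert sum(tio[t][0] for t in tables) == 1
for j in joins:
    assert sum(tii[t][j] for t in tables) == 1

# The operands of the same join never overlap.
for j in joins:
    for t in tables:
        assert tio[t][j] + tii[t][j] <= 1

# The result of the prior join is the outer operand of the next join.
for j in joins[1:]:
    for t in tables:
        assert tio[t][j] == tio[t][j - 1] + tii[t][j - 1]
```

Flipping any single variable in this assignment violates at least one of these assertions, which is exactly how the linear constraints exclude value combinations that do not represent valid complete plans.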

The logarithm of a product is the sum of the logarithms of the factors. More formally, given a set T ⊆ Q of query tables such that the set of predicates P is applicable to T (i.e., for each binary predicate in P the two tables it refers to are included in T) and designating by Card(t) for t ∈ T the cardinality of table t and by Sel(p) the selectivity of predicate p ∈ P, a cardinality estimate is given by Π_{t∈T} Card(t) · Π_{p∈P} Sel(p), and the logarithm of the cardinality estimate is Σ_{t∈T} log(Card(t)) + Σ_{p∈P} log(Sel(p)), which is a linear function.

We introduce the set of variables lco_j (short for Logarithmic Cardinality of Outer operand) which represents the logarithm of the cardinality of the outer operand of the j-th join. The aforementioned linear formula for calculating the logarithm of the cardinality depends on the selected tables as well as on the applicable predicates. The selected tables are directly given by the variables tio_tj. We introduce additional binary variables to represent the applicable predicates: variable pao_pj (short for Predicate Applicable in Outer join operand) captures whether predicate p is applicable in the outer operand of the j-th join. We currently consider only binary predicates (we discuss extensions later) and, as the inner operands consist of single tables, we do not need to introduce an analogous set of predicate variables for the inner operands.

We denote by T1(p) and T2(p) the first and the second table that predicate p refers to. A predicate is applicable to an operand whose table set T contains T1(p) and T2(p). We make sure that predicates cannot be applied if one of the two tables is missing by adding, for each predicate p and each join j, a pair of constraints of the form pao_pj ≤ tio_T1(p),j and pao_pj ≤ tio_T2(p),j. We currently assume that predicate evaluations do not incur any cost; extensions are discussed later. Under this assumption, applying a predicate has only beneficial effects, as it reduces the cardinality of intermediate results and therefore the cost of the following joins. This means that we only need to introduce constraints preventing the solver from using predicates that are inapplicable; we do not need to add constraints forcing the evaluation of predicates explicitly.

Using the variables capturing the applicability of predicates, we can now write the logarithm of the join operand cardinalities. For outer join operands, we set

    lco_j = Σ_t log(Card(t)) · tio_tj + Σ_p log(Sel(p)) · pao_pj

and thereby take into account table cardinalities as well as predicate selectivities.

Unfortunately, the cost of most operations within a query plan is not linear in the logarithm of the cardinality values. In the following, we show how to transform the logarithm of the cardinality values into an approximation of the raw cardinality values. This makes it possible to write cost functions that are linear in the cardinality of their input and output. This is sufficient for many but not for all standard operations. Similar techniques to the ones we describe in the following can, however, be used to represent for instance log-linear cost functions, as we describe in more detail in Section 4.3.

We must transform the logarithm of the cardinality into the cardinality itself. This is not a linear transformation, and hence we resort to approximation. We assume that a set Θ = {θ_r} of cardinality threshold values has been defined for integer indices r with 0 ≤ r ≤ rmax. In addition, we introduce a set of binary variables cto_rj (short for Cardinality Threshold reached by Outer operand), one for each join j and each cardinality threshold value θ_r.
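The effect of the last two constraints of Table 2, which turn the logarithmic cardinality lco_j back into an approximate raw cardinality co_j, can be emulated in plain Python. The decade thresholds θ_r = 10^r and the δθ_r increments below are an illustrative choice, not values prescribed by the paper:

```python
import math

thetas = [10.0 ** r for r in range(7)]  # theta_0 .. theta_6 (illustrative)
BIG_M = 1e9                             # stands in for the "infinity" constant
# Delta increments: activating flags 0..m yields co = theta_m.
deltas = [thetas[0]] + [thetas[r] - thetas[r - 1] for r in range(1, len(thetas))]

def approx_cardinality(lco):
    """Emulate what a cost-minimizing solver would do: cto_r = 1 exactly for
    the thresholds that exp(lco) reaches, then co = sum of active deltas."""
    cto = [1 if lco > math.log(t) else 0 for t in thetas]
    # Each big-M constraint lco - cto_r * BIG_M <= log(theta_r) is satisfied:
    for r, t in enumerate(thetas):
        assert lco - cto[r] * BIG_M <= math.log(t) + 1e-12
    return sum(c * d for c, d in zip(cto, deltas))

co = approx_cardinality(math.log(5000.0))
# With decade thresholds the estimate is within one order of magnitude.
assert co <= 5000.0 <= co * 10
```

Adding finer-grained thresholds costs one binary variable and one constraint per threshold and join, which is how approximation precision is traded against problem size.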
whether the cardinality of the outer operand reaches the corresponding threshold value. If threshold θr is reached then the corresponding threshold variable ctorj must take value one and otherwise value zero. To guarantee that the previous statement holds, we introduce constraints of the form lcoj − ctorj·∞ ≤ log(θr) for each join j, where ∞ is in practice a sufficiently large constant such that the constraint can be satisfied by setting the threshold variable ctorj to one. We do not explicitly enforce that the threshold variable is set to zero in case that the threshold is not reached. The constraints that we introduce next make however sure that the cardinality estimate, and therefore the cost estimate, increase with every threshold variable that is set to one. Hence the solver will set the threshold variables to zero wherever it can.

Based on the threshold variables, we can formulate a linear approximation of the raw cardinality. We introduce the set of variables coj representing the raw cardinality of the outer operand of the j-th join and set coj = Σr ctorj·δθr where the values δθr are chosen appropriately such that if threshold variables cto0j up to ctomj are set to one for some specific join j then the cardinality variable coj takes a value between θm and θm+1 (assuming that thresholds are indexed in ascending order such that ∀r: θr < θr+1). We can for instance set δθr = θr − θr−1 for r ≥ 1 and δθ0 = θ0.

Example 2. We illustrate how to calculate join operand cardinalities and continue the previous example with join query R ✶ S ✶ T. We have two joins and introduce therefore four variables (ci0, ci1, co0, and co1) representing operand cardinalities. Assume that tables R, S, and T have cardinalities 10, 1000, and 100 respectively. We calculate the cardinality of the two inner join operands by summing over the variables indicating the presence of a table in an inner operand, weighted by the cardinality values (e.g., ci0 = 10·tiiR0 + 1000·tiiS0 + 100·tiiT0). The cardinality of the outer operands can depend on predicates. Assume that one predicate p is defined between tables R and S. We introduce two variables, paop0 and paop1, indicating whether the predicate can be evaluated in the outer operand of the corresponding join. Predicates can be evaluated if both referenced tables are in the corresponding operand. We introduce four constraints (e.g., paop0 ≤ tioR0 and paop0 ≤ tioS0) forcing the value of the predicate variable to zero if at least one of the tables is not present. We introduce two variables storing the logarithm of the outer operand cardinality: lco0 and lco1. We assume that the selectivity of p is 0.1. Then the logarithmic cardinality for the first outer join operand is given by lco0 = 1·tioR0 + 3·tioS0 + 2·tioT0 − 1·paop0, assuming that the logarithm base is 10. To simplify the example, we assume that only two cardinality thresholds are considered: θ0 = 10 and θ1 = 1000. We introduce four variables ctorj with r ∈ {0, 1} and j ∈ {0, 1} indicating whether the cardinality of the outer join operand reaches each threshold for the first or second join. Each threshold variable is constrained by one constraint (e.g., lco0 − ∞·cto0,0 ≤ 1). Now we define the cardinality of the outer join operands by constraints such as co0 = 10·cto0,0 + (1000 − 10)·cto1,0. This provides a lower bound for the true cardinality. If we know for instance that cardinality values are upper-bounded by 100000 due to the query properties, we can also set co0 = 100·cto0,0 + (10000 − 100)·cto1,0. Then the difference between true and approximate cardinality is at most one order of magnitude.

4.3 Cost

Now we can for instance sum up the cardinalities over all intermediate results (Σj≥1 coj) and thereby obtain a simple cost metric that is equivalent to the Cout cost metric introduced by Cluet and Moerkotte [9]. Join orders minimizing that cost metric were shown to minimize cost according to the cost formulas of some of the standard join operators as well [9]. We will however show in the following how the cost of all standard join operators, namely hash join, sort-merge join, and block nested loop join, can be modeled directly.

The standard cost formula for a hash join operation is based on the number of pages that the two input operands consume on disk. We designate by pgoj the number of disk pages consumed by the outer operand of join number j and pgij is the analogous value for the inner operand. If a hash join operator is used for the join then its cost is given by 3·(pgoj + pgij). This is a linear formula but we must calculate the size of the operands in disk pages.

The byte size of an intermediate result, and therefore the number of consumed disk pages, depends not only on the cardinality but also on the columns that are present. For the moment, we make the simplifying assumption that each tuple has a fixed byte size. We show how to relax that restriction in the next section. Under this simplifying assumption, we can however express the disk pages of the outer operands as pgoj = ⌈coj·tupSize/pageSize⌉ where tupSize is the fixed byte size per tuple and pageSize the number of bytes per disk page. Factor tupSize/pageSize is a constant due to our simplifying assumption and hence we can set pgoj = coj·tupSize/pageSize to obtain the approximate number of disk pages. Alternatively, we could write pgoj = Σr ⌈θr·tupSize/pageSize⌉·(ctojr − ctoj,r+1) and approximate it using the threshold variables (the expression (ctojr − ctoj,r+1) yields value one only for the threshold variable with the highest threshold that is still set to one). Note that the factors of the form ⌈θr·tupSize/pageSize⌉ are constants. The second version has the advantage that we can explicitly control the approximation precision for pgoj by tuning the number of thresholds. The disk pages for the inner operands can be obtained in a simplified way as each inner operand consists of only one table: we simply set pgij = Σt tiitj·⌈Card(t)·tupSize/pageSize⌉.

The cost of sort-merge join operators can be approximated in a similar way. We assume here that both inputs must be sorted while we generalize in the next subsection. If both input operands need to be sorted first then the join cost is given by 2·pgoj·⌈log(pgoj)⌉ + 2·pgij·⌈log(pgij)⌉ + pgoj + pgij. We have already shown how to obtain the numbers of disk pages pgoj and pgij. The log-linear numbers of disk pages, pgoj·log(pgoj) and pgij·log(pgij), can be obtained in a similar way: we use the cardinality thresholds for the outer operand and simply sum over tables for the inner operand.

The cost function for the block nested loop join is given by ⌈pgoj/buffer⌉·pgij where buffer is the amount of buffer space dedicated to the outer operand. We assume here that pipelining is used while the generalization is straightforward. There are several options for approximating that cost function with linear constraints. We can approximate the join cost function by omitting the ceiling operator and obtain
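The threshold-based approximation from Example 2 can be sketched in a few lines; this is an illustrative reimplementation with our own function name, using the constants of the example:

```python
def cardinality_estimate(true_card, thresholds):
    """Piecewise-constant lower bound on a cardinality: cto_r = 1 iff
    true_card reaches thresholds[r]; the estimate is sum_r cto_r * delta_r
    with delta_0 = theta_0 and delta_r = theta_r - theta_{r-1}."""
    cto = [1 if true_card >= t else 0 for t in thresholds]
    deltas = [thresholds[0]] + [
        thresholds[r] - thresholds[r - 1] for r in range(1, len(thresholds))
    ]
    return sum(c * d for c, d in zip(cto, deltas))

# With thresholds 10 and 1000 as in Example 2, the estimate rounds the
# true cardinality down to the highest threshold that was reached.
assert cardinality_estimate(500, [10, 1000]) == 10
assert cardinality_estimate(5000, [10, 1000]) == 1000
```

In the MILP itself the cto values are of course solver-controlled binary variables; the sketch only shows which values the constraints admit and how the delta weights turn them into a cardinality estimate.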
pgoj/buffer·pgij. Similar to how we calculated the cardinality of the outer operands, we can switch to a logarithmic representation and write the logarithm of the join cost as log(pgoj) + log(pgij) − log(buffer). Then we can transform the logarithm of the join cost into the raw join cost value using a set of newly introduced threshold variables.

Another idea is to exploit the specific shape of the inner join operands. As only one table is selected for the inner join operand, we can express join cost by the formula Σt tiitj·pages(t)·blocksj where pages(t) = ⌈Card(t)·tupSize/pageSize⌉ designates the disk page size of table t and blocksj = ⌈pgoj/buffer⌉ ≈ pgoj/buffer is the number of iterations of the outer loop executed by the block nested loop join. This is a weighted sum over products between a binary variable (the variables tiitj indicating whether table t was selected for the inner operand of join number j) and a continuous variable (the variables blocksj). This formula is hence not directly linear, but the product between a binary variable and a continuous variable can be expressed by introducing one auxiliary variable and a set of constraints [4]. The only condition for this transformation is that the continuous variable is non-negative and upper-bounded by a constant. Both conditions hold (note that we generally only model a bounded cardinality range, which implies also an upper bound on the number of loop iterations). The advantage of the second representation is that we only need to introduce a number of variables and constraints that is linear in the number of tables (instead of linear in the number of thresholds as for the first possibility).

We have seen that join orders, the cardinality of intermediate results, and the cost of join operations according to standard cost formulas can all be represented as a MILP. In the next section we introduce several extensions of the problem model that we used so far.

5. EXTENSIONS

We introduced our mapping for query plans by means of a basic problem model that focuses on join order. We discuss extensions of the query language, of the query plan model, and of the cost model in this section.

Note that not all proposed extensions are necessary in each scenario: the basic model introduced in the last section allows us for instance to find join orders which minimize the sum of intermediate result sizes. Such join orders are optimal according to many standard operator cost functions [9]. It is therefore in many scenarios possible to obtain good query plans based on the join order that was calculated using the basic model. To transform a join order into a query plan, we choose optimal operator implementations based on the cardinality of the join operands, we evaluate predicates as early as possible (predicate push-down), and we project out columns as soon as they are not required anymore.

An alternative is to let the MILP solver make some of the decisions related to projection, predicate evaluation, and join operator selection. We show how this can be accomplished if desired. In addition, we discuss extensions of the cost and query model.

In Section 5.1, we discuss how to represent n-ary predicates, correlated predicates, and predicates that are expensive to evaluate. We show how to handle projections in Section 5.2 and in Section 5.3 we show how the MILP solver can choose between different operator implementations. We show how to handle interesting orders and other intermediate result properties in Section 5.4. In Section 5.5, we finally discuss how we can extend our approach to handle queries with aggregates and nested queries.

We sketch out the following extensions relatively quickly due to space restrictions. They however use ideas similar to the ones we applied in the last section. Our goal is less to provide a detailed model for each possible scenario but rather to demonstrate that the MILP formalism is flexible enough to cover the most relevant aspects of query optimization.

5.1 Predicate Extensions

So far we have considered binary predicates. We show how n-ary predicates can be modeled. Let p be an n-ary predicate. N-ary predicates refer to n tables and we designate by T1(p) to Tn(p) the tables on which p is evaluated. All tables that p refers to must be present in the operands in which p is evaluated. If paopj indicates whether predicate p can be evaluated in the outer operand of the j-th join then we must introduce the constraint paopj ≤ tioTi(p)j for each join and each i ∈ {1, . . . , n}. This forces variables paopj to zero if at least one table is not present. Note that we must introduce analogous predicate variables for the inner operands for all unary predicates.

In our basic model, we assume that predicates are uncorrelated. Then the accumulated selectivity of a predicate group always corresponds to the product of the selectivity values of the single predicates. In reality this is not always the case, even if it is a common simplification to assume uncorrelated predicates. Assume that there is a correlated group Pcor of predicates such that the accumulated selectivity of all predicates in Pcor differs significantly from their selectivity product. Then we introduce a new predicate g that represents the correlated predicate group. The selectivity Sel(g) is chosen in a way such that Sel(g)·Πp∈Pcor Sel(p) yields the correct selectivity, taking correlations into account. So the selectivity of g corrects the erroneous selectivity that is based on the assumption of independent predicates.

Now we just need to make sure that the predicate variable associated with g is set to one in all operands in which all predicates from Pcor are selected but not otherwise. We force paogj to one if all correlated predicates are present by requiring paogj ≥ 1 − |Pcor| + Σp∈Pcor paopj. We force paogj to zero if at least one of the correlated predicates is not activated by introducing |Pcor| constraints of the form paogj ≤ paopj for p ∈ Pcor. No other constraints need to be introduced for paogj but terms including paogj must be included in all expressions representing cardinality, byte size, etc.

So far we have assumed that predicate evaluation is not associated with cost. We only constrained the variables paopj to zero if required tables are not in the operand. We did not explicitly force them to one at any point since, as they reduce cardinality, their evaluation reduces cost and the MILP solver will generally choose to evaluate them as early as possible.

This model is not always appropriate. If predicate evaluations are expensive then it can be preferable to postpone their evaluation [8, 14, 19]. The predicate-related variables paopj influence the cardinality estimates of join operands. They capture whether the corresponding predicate was already evaluated as otherwise it cannot influence cardinality.
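The linearization of a product between a binary and a bounded continuous variable, which the model above relies on repeatedly [4], can be verified by brute force. The following sketch uses the standard big-M style constraint set and enumerates integer values only for the check; the function name is ours:

```python
def product_constraints_hold(x, y, z, upper):
    """Constraint set replacing z = x * y for binary x and continuous
    0 <= y <= upper: z <= upper * x, z <= y, z >= y - upper * (1 - x),
    and z >= 0."""
    return z <= upper * x and z <= y and z >= y - upper * (1 - x) and z >= 0

# For every binary x and every y in range, the only feasible z is x * y.
for x in (0, 1):
    for y in range(0, 11):
        feasible = [z for z in range(0, 11)
                    if product_constraints_hold(x, y, z, 10)]
        assert feasible == [x * y]
```

This is exactly why the boundedness condition matters: the constant upper bound plays the role of the big-M constant, and without it the first and third constraints cannot be stated.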
We cannot use those variables directly to incorporate the cost of predicate evaluations. The effect on cardinality of having evaluated a predicate once will persist for all future operations. The evaluation cost needs however only to be paid once. We introduce additional variables pcopj (short for Predicate evaluation Cost for Outer operand) and set pcopj = paop,j+1 − paop,j. Intuitively, the predicate was evaluated in the current join if it is evaluated in the input to the next join but not in the input of the current join. The sum Σj pcopj·coj yields the evaluation cost associated with predicate p (we can additionally weight by a factor that represents predicate evaluation cost per tuple). This is not a linear function as we multiply variables. We have however a product between a binary variable and a continuous variable again. As before, we can transform such expressions into a set of linear constraints and a new auxiliary variable [4].

Now that evaluation of predicates is not automatically desirable anymore, we must introduce additional constraints making sure that all predicates are evaluated at the end of query execution. Designating by jmax the index of the last join, we simply set paop,jmax+1 = 1 by convention. This means that each predicate that was not evaluated before the last join must be evaluated during the last join since pcop,jmax = 1 − paop,jmax. We finally introduce constraints making sure that no predicate is initially evaluated and we introduce constraints making sure that an evaluated predicate remains evaluated. The latter constraints are in fact optional since additional predicate evaluations increase the cost. Depending on the solver implementation, it can nevertheless be beneficial to add such constraints to reduce the search space size.

5.2 Projection

Our cost formulas have so far been based on cardinality alone as we have assumed a constant byte size per tuple. This is of course a simplification and we must in general take into account the columns that we project on and their byte sizes. We designate by L the set of columns over all query tables. By Byte(l) we denote the number of bytes per tuple that column l ∈ L requires. We introduce one variable clojl (short for CoLumn in Outer operand) for each join j and each column l ∈ L to indicate whether column l is present in the outer operand of join j (and analogous variables for the inner operands). Then a refined formula for the estimated number of bytes consumed by the outer operand is coj·Σl∈L clojl·Byte(l). This is the sum over products between a constant (Byte(l)), a binary variable (clojl), and a continuous variable that takes only non-negative values (coj). This formula can be expressed using only linear constraints via the same transformations that we used already before [4]. Special rules apply for the inner operand again: for the inner operand, we can estimate the byte size (or any derived measure such as the number of disk pages) by summing over the column variables, weighted by the column byte size as well as by the cardinality of the table that the column belongs to.

We must still constrain the variables clojl to make sure that only valid query plans can be represented. First of all we must connect columns to their respective tables. If the table associated with a column is not present then the column cannot be present either in a given operand. If column l is associated with table t then the constraint clojl ≤ tiotj forces the column variable to zero if the associated table is not present. Not selecting any columns would be the most convenient way for the optimizer to reduce plan costs. To prevent this from happening, we must enforce that all columns that the query refers to are in the final result. Also, we must enforce that all columns that predicates refer to are present once they are evaluated. We introduced variables indicating the immediate evaluation of a predicate during a specific join. Those are the variables that need to be connected to the columns they require via corresponding constraints. We must also make sure that a column cannot reappear in later joins after it has been projected out (otherwise that would be a convenient way of reducing intermediate result sizes while still satisfying the constraints requiring certain columns in the final result). Introducing constraints of the form clojl ≥ cloj+1,l satisfies that requirement.

5.3 Choosing Operator Implementations

We have already discussed the cost functions of different join operator implementations in the last section. So far we have however assumed that only one of those cost functions is used to calculate the cost for all joins. This allows us to select optimal operator implementations after a good join order, minimizing intermediate result sizes, has been found. We can however also task the MILP solver to pick operator implementations as we outline in the following.

Denote by I the set of join operator implementations. We have shown how to calculate join cost for each of the standard join operators. We can introduce a variable pjcji (short for Potential Join Cost) for each join j and for each operator implementation i ∈ I representing the cost of the join if that operator is used. We use the term potential since whether that cost is actually counted depends on whether or not the corresponding operator implementation is selected.

We introduce binary variables josji (short for Join Operator Selected) to indicate for each operator implementation i and join j whether the operator was used to realize the join. We require that exactly one implementation is selected for each join as expressed by the constraint Σi josji = 1 that we must introduce for each join. Having the potential cost for each join operator as well as information on which operator is selected, we can for each operator calculate the actual join cost ajcji. The actual join cost associated with one specific operator implementation is zero if that operator is not selected. Otherwise (if that operator is selected) the actual join cost corresponds to the potential join cost. We have the following relationship between potential and actual join cost: ajcji = josji·pjcji. Here we multiply a binary with a non-negative continuous variable and can apply the same linearization as before [4]. The sum of the actual join cost variables over all operator implementations yields the cost of each join operation.

5.4 Intermediate Result Properties

Alternative join operator implementations can sometimes produce intermediate results with different physical properties (while the contained data remains the same over all alternative implementations). Tuple orderings are perhaps the most famous example [26]. If tuples are produced in an interesting order then the cost of successive operations can be reduced (e.g., the sorting stage can be dropped for a
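The one-hot operator selection of Section 5.3 can be sketched as follows; the operator names and cost numbers are invented for illustration and the function is ours, not the paper's model:

```python
def join_cost(potential_costs, selected):
    """Cost of one join under the one-hot selection constraint
    sum_i jos_ji = 1: the actual cost ajc_ji = jos_ji * pjc_ji is zero
    for every unselected operator, so summing over all implementations
    yields the cost of the selected one."""
    assert sum(selected.values()) == 1, "exactly one implementation per join"
    return sum(selected[i] * potential_costs[i] for i in potential_costs)

pjc = {"hash": 120.0, "sort-merge": 300.0, "block-nested-loop": 950.0}
jos = {"hash": 1, "sort-merge": 0, "block-nested-loop": 0}
assert join_cost(pjc, jos) == 120.0
```

In the MILP, the multiplication jos·pjc is of course not written directly but replaced by the linearization with an auxiliary variable, exactly as described in the text.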
sort-merge join). Also, the distinction whether an intermediate result is written to disk or remains in main memory is a physical property of that result and influences the cost of successive operations.

Assume that we consider a set X of relevant intermediate result properties. Then we can introduce a binary variable ohpjx (short for Outer operand Has Property) indicating whether the outer operand of the j-th join has property x. Property x could for instance represent the fact that the corresponding result is materialized. Property x could also represent one specific tuple ordering.

The properties constrain the operator implementations that can be applied for the next join. We could for instance introduce one operator implementation representing a pipelined block nested loop join while another operator implementation represents a block nested loop join without pipelining. The applicability of the pipelined join would have to be restricted based on whether or not the corresponding input remains in memory. If implementation i requires property x in the outer join operand in order to become applicable then we can impose the constraint josji ≤ ohpjx to express that fact.

Operators such as the sort-merge join can be decomposed into different sub-operators (e.g., sorting the outer operand, sorting the inner operand, merging). This avoids having to introduce a new variable for each possible combination of situations (e.g., outer operand sorted and inner operand sorted, outer operand sorted but inner operand not sorted, etc.).

Whether an intermediate result has a certain physical property is determined by the operator which produces the result (and possibly by properties of the input to the producing operation). If a subset Ie ⊆ I of implementations produces results with a certain property x then we can set ohpj+1,x = Σi∈Ie josji. As only one of the operators is selected, the aforementioned constraint is valid and sets the left-hand side either to zero or to one. Certain properties such as interesting orders might be provided automatically by certain tables (if the data on disk has that order). Then we need additional constraints to connect properties to tables.

In summary, we have shown that all of the most important aspects of query optimization can be represented in the MILP formalism.

5.5 Extended Query Languages

We have already implicitly discussed several extensions to the query language in this section. We discussed how non-binary predicates and projection are supported. This gives us a system handling select-project-join (SPJ) queries.

It is generally common to introduce query optimization algorithms using SPJ queries for illustration. There are however standard techniques by which an optimization algorithm treating SPJ queries can be extended into an algorithm handling richer query languages.

The seminal paper by Selinger [26] describes how a complex SQL statement containing nested queries can be decomposed into several simple query blocks that use only selection, projection, and joins; the join order optimization algorithm is applied to each query block separately. Later, the problem of unnesting a complex SQL statement containing aggregates and sub-queries into simple SPJ blocks has been treated as a research problem on its own; corresponding publications focus on the unnesting algorithms and use join order optimization algorithms as a sub-function (e.g., [23]).

6. FORMAL ANALYSIS

State-of-the-art MILP solvers use a plethora of heuristics and optimization algorithms which makes it hard to predict the run time for a given MILP instance. It is however a reasonable assumption that optimization time tends to increase in the number of variables and constraints, even if preprocessing steps are sometimes able to eliminate redundant elements. The assumptions that we make here are supported by the experimental results that we present in the next section: we see a strong (even if not perfect) correlation between the number of variables and constraints and the MILP solver performance.

For the aforementioned reasons, we study in the following how the asymptotic number of variables and constraints in the MILP grows in the dimensions of the query optimization problem from which it was derived. We denote in the following by n = |Q| the number of query tables to join and by m = |P| the number of predicates. By l = |Θ| we denote the number of thresholds that are used to approximate cardinality values. The following theorems refer to the basic problem model that was presented in Section 4.

Theorem 1. The MILP has O(n·(n + m + l)) variables.

Proof. Given n tables to join, each complete query plan has O(n) joins. We require O(n) binary variables per join to indicate which tables form the join operands, we require O(m) binary variables per operand to indicate which predicates can be evaluated, and we require O(l) continuous variables per operand to calculate cardinality estimates.

Theorem 2. The MILP has O(n·(n + m + l)) constraints.

Proof. For each join operand we need O(n) constraints to restrict table selections, O(m) constraints to restrict predicate applicability, and O(l) constraints to force the threshold variables to the right value.

7. EXPERIMENTAL EVALUATION

Using existing MILP solvers as the basis for the query optimizer reduces coding overhead and automatically yields parallelized anytime query optimization due to the features of typical MILP solvers. In this section, we compare the performance of a MILP based optimizer to a classical dynamic programming based query optimization algorithm.

We describe and justify our experimental setup in Section 7.1 and discuss our results in Section 7.2.

7.1 Experimental Setup

We implemented a prototype of the MILP based optimizer that was introduced in the last sections. We transform query optimization problems into MILP problems and use the Gurobi solver (http://www.gurobi.com/) in version 5.6.3 to find optimal or near-optimal solutions to the resulting MILP problems. The MILP solution is read out and used to construct a corresponding query plan.
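The counting argument behind Theorems 1 and 2 can be made concrete with a small tally; the exact constants below are arbitrary (the theorems only bound the asymptotic shape), and the function is our own illustration:

```python
def basic_model_size(n, m, l):
    """Rough per-join tally for the basic model: O(n) joins, each with
    O(n) table-selection variables, O(m) predicate variables, and O(l)
    threshold variables per operand, matching the O(n * (n + m + l))
    bound of Theorem 1 (constraints tally the same way, Theorem 2)."""
    joins = n - 1  # a left-deep plan over n tables performs n - 1 joins
    return joins * (n + m + l)

# Size grows roughly quadratically in the number of tables when the
# predicate and threshold counts grow at most linearly with n.
assert basic_model_size(60, 59, 100) > 9 * basic_model_size(20, 19, 30)
```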
·104 ·104 justified if queries are executed on big data where choosing

Nr. Constraints
Nr. Variables 1.5 2 a sub-optimal plan can have devastating consequences [27].
1.5 During the 60 seconds of optimization time, we compare
1 1 optimization algorithms in regular intervals according to the
0.5 0.5 following criterion. We compare them based on the factor
0 0 by which the cost of the best plan found so far is higher
10 20 30 40 50 60 10 20 30 40 50 60 than the optimum at most. MILP solvers calculate such
bounds based on the integrality gap. The classical dynamic
Nr. query tables Nr. query tables
programming algorithm is not an anytime algorithm but
ILP (Low Precision) ILP (Medium Precision) after its execution finishes, the produced plan is optimal
ILP (High Precision) and hence the optimality factor is one.
We do not compare algorithms based on the cost overhead
that the generated plans have compared to the optimum.
Figure 1: Median number of variables and con-
Instead, we compare them based on an upper bound on the
straints of a MILP problem representing the opti-
relative cost overhead that the algorithm can formally guar-
mization of one query.
antee at a certain point in time. The actual cost overhead
is only known in hindsight after optimization has finished
(and for some of the query sizes we consider, calculating the
We compare this approach against the classical dynamic truly optimal query plans would cause high computational
programming algorithm by Selinger [26]. Dynamic program- overheads). The upper bound that we use as criterion is the
ming algorithms are very popular for exhaustive query op- only value that is known at optimization time and therefore
timization [21, 22] and are for instance used inside the op- the only value on which termination decisions can be based
timizer of the Postgres database system5 . on for instance (e.g., we could terminate optimization once
We compare the two aforementioned algorithms on ran- the query optimizer is certain that the current plan is not
domly generated queries. We generate queries according more expensive than the optimum by more than factor 2).
to the method proposed by Steinbrunn et al. [28] which is The comparison criterion that we use excludes any ran-
widely used to benchmark query optimization algorithms [28, domized or heuristic query optimization algorithms [3, 6,
6, 32]. We generate queries of different sizes (referring to 17, 28, 30, 29] from our experimental evaluation: such algo-
the number of tables to join) and with different join graph rithms cannot give any formal guarantees on the optimality
structures (chain graphs, star graphs, and cycle graphs [28]). of the produced plans. They cannot even give upper bounds
We allow cross products which increases the search space on the relative cost overhead of the generated plans.
size significantly compared to the case without cross prod- Our algorithms (for the MILP approach: the part that
ucts [24]. transforms query optimization into MILP) are implemented
We assume that hash joins are used and search the opti- in Java 1.7. The experiments were executed using the Java
mal join order. The MILP approach approximates the byte HotSpot(TM) 64-Bit Server Virtual Machine version on an
sizes of the intermediate results and therefore the cost of iMac with i5-3470S 2.90GHz CPU and 16 GB of DDR3
join operations. We evaluate three configurations of our algorithm that differ in the precision by which they approximate cardinality (higher approximation precision requires more MILP variables and constraints). Our first configuration offers high precision and approximates cardinality with a tolerance of factor 3. Our second configuration reduces approximation precision and has a tolerance factor of 10. Our third configuration reduces approximation precision further and has a tolerance factor of 100. Our most precise configuration uses 60 threshold variables per intermediate result for up to 40 table joins and 100 threshold variables per result for queries joining 50 and 60 tables. At the other end of the spectrum is the low-precision configuration, which uses 15 threshold variables per result for up to 40 tables and 25 variables for more than 40 tables.

We compare algorithms by the quality of the plans that they produce after a certain amount of optimization time. We allow up to 60 seconds of optimization time and compare the output generated by all algorithms at regular time intervals. The high amount of optimization time seems justified since we also compare the algorithms on very large queries. All compared algorithms need significantly less time than 60 seconds to produce optimal plans for small queries. Investing 60 seconds into optimization can however be well [...] RAM.

5 http://www.postgresql.org/

7.2 Experimental Results

We start by analyzing the size of the generated MILP problems. Figure 1 shows the number of constraints and variables. We show results for queries with a star-shaped join graph structure; the results for chain and cycle graph structures differ only marginally (the only difference is that cycle graphs require one additional predicate variable per intermediate result compared to star graphs). The ILP configuration with higher approximation precision requires more variables and constraints in all cases. For all configurations, the number of variables and constraints increases with the number of query tables.

Figure 2 shows performance results for left-deep plans. We allow cross product joins. The experimental setup was explained and justified in Section 7.1. The figure shows median values for 20 randomly generated queries. For 10 query tables, all compared algorithms find the optimal plan very quickly. For 20 query tables, the dynamic programming approach already takes more than six seconds on average to find the optimal plan while the MILP approach is faster. With 20 query tables we are reaching the limit of what is usually considered practical for dynamic programming algorithms. Also note that we allow cross product joins, which increases the size of the plan space significantly.
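The trade-off between the number of threshold variables per intermediate result and the cardinality tolerance factor can be illustrated with a back-of-the-envelope sketch. It assumes thresholds spaced geometrically over the cardinality range; the function names and the 10^18 cardinality bound are illustrative assumptions, not taken from the paper:

```python
import math


def tolerance_factor(max_card: float, num_thresholds: int) -> float:
    """Multiplicative error bound when num_thresholds thresholds are
    spaced geometrically over cardinalities in [1, max_card]."""
    return max_card ** (1.0 / num_thresholds)


def thresholds_needed(max_card: float, factor: float) -> int:
    """Smallest geometric threshold count keeping the cardinality
    approximation within the given multiplicative factor."""
    return math.ceil(math.log(max_card) / math.log(factor))


# With an (assumed) maximum intermediate cardinality of 10^18:
# a factor-3 tolerance needs 38 thresholds, while 60 thresholds
# tighten the tolerance to a factor of roughly 2.
```

This shows why fewer threshold variables (15 or 25 in the low-precision configuration) suffice once a coarse tolerance factor such as 100 is acceptable.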
[Figure 2 (plot grid; data not recoverable as text): one panel per join graph shape (chain, cycle, star) and query size (10, 20, 30, 40, 50, 60 tables). Each panel plots cost relative to the cost lower bound (Cost/LB, log scale) against optimization time from 6 to 60 seconds, with one curve each for DP, ILP (High Precision), ILP (Medium Precision), and ILP (Low Precision).]

Figure 2: Comparing dynamic programming based optimizer versus integer linear programming for left-deep query plans.
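The evaluation protocol of sampling each algorithm's best plan found so far at regular time intervals can be sketched as follows. The solver interface here is hypothetical (any anytime optimizer exposing a step and an incumbent cost would do); it is not the paper's implementation:

```python
class DummySolver:
    """Stands in for any anytime optimizer; each step() call may
    improve the incumbent (best-so-far) plan cost. The interface
    is a hypothetical illustration."""

    def __init__(self, costs):
        self._costs = iter(costs)
        self._best = float("inf")

    def step(self):
        # Take one optimization step; keep the best cost seen so far.
        self._best = min(self._best, next(self._costs, self._best))

    def best_cost(self):
        return self._best


def anytime_trace(solver, num_checkpoints, steps_per_checkpoint):
    """Run the solver and record the incumbent cost at regular
    checkpoints, mimicking a comparison at fixed time intervals."""
    trace = []
    for _ in range(num_checkpoints):
        for _ in range(steps_per_checkpoint):
            solver.step()
        trace.append(solver.best_cost())
    return trace


# Incumbent improves monotonically across checkpoints:
# anytime_trace(DummySolver([9, 7, 8, 3]), 2, 2)  ->  [7, 3]
```

For the setup above, such a trace would be collected once per algorithm and query, with checkpoints every 6 seconds up to the 60-second budget.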

For higher numbers of query tables, up to 60, the dynamic programming approach does not return any plan within one minute of optimization time. Note that increasing the number of tables by 10 increases the number of table sets that the dynamic programming approach must consider by a factor of 2^10 = 1024. It is therefore not surprising that this algorithm is not able to optimize queries with 30 tables and more.

All configurations of the MILP approach find optimal or at least guaranteed near-optimal plans for up to 40 tables, often already after a few seconds. For 50 and 60 table joins, all MILP configurations are able to find plans quickly for star join graphs. For cycle graphs, the low-precision configuration still finds optimal plans up to 60 tables while the medium-precision configuration finds near-optimal plans. Both configurations find optimal plans for 50 tables and chain graphs while this is not possible for queries with 60 tables and a chain graph structure. This means that optimization of chain and cycle queries seems to be more challenging for MILP approaches than optimization of star queries. Note that star queries are more difficult to optimize when excluding cross products and applying dynamic programming [24]; for MILP approaches it is apparently the opposite.

We conclude that the MILP approach not only matches but even outperforms traditional exhaustive query optimization algorithms for left-deep plan spaces by a significant margin.

8. CONCLUSION

Basing newly developed query optimizers on existing MILP solver implementations reduces the size of the optimizer code base and lets them benefit from features such as parallelization and anytime behavior that those solvers encapsulate.

We have demonstrated how to transform query optimization into MILP. Our experimental results show that MILP approaches can outperform traditional dynamic programming approaches significantly.

Generally it should be noted that the experimental results

in this paper are only snapshots and not intrinsic to the proposed mapping: as new MILP solver generations appear, the performance of our MILP based approach is likely to improve further without having to adapt the mappings.

9. ACKNOWLEDGMENT

This work was supported by ERC Grant 279804 and by a European Google PhD fellowship.

10. REFERENCES

[1] S. Agarwal, B. Mozafari, and A. Panda. BlinkDB: queries with bounded errors and bounded response times on very large data. In European Conf. on Computer Systems, pages 29–42, 2013.
[2] P. Beame, P. Koutris, and D. Suciu. Skew in parallel query processing. In PODS, pages 212–223, 2014.
[3] K. Bennett, M. Ferris, and Y. Ioannidis. A genetic algorithm for database query optimization. 1991.
[4] J. Bisschop. Integer Linear Programming Tricks. In AIMMS: Optimization Modeling, page 75ff, 2015.
[5] R. E. Bixby. A Brief History of Linear and Mixed-Integer Programming Computation. Documenta Mathematica, pages 107–121, 2012.
[6] N. Bruno. Polynomial heuristics for query optimization. In ICDE, pages 589–600, 2010.
[7] S. Chaudhuri. Query optimizers: time to rethink the contract? In SIGMOD, pages 961–968, 2009.
[8] S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Transactions on Database Systems, 24(2):177–228, 1999.
[9] S. Cluet and G. Moerkotte. On the complexity of generating optimal left-deep processing trees with cross products. In ICDT, pages 54–67, 1995.
[10] T. Dokeroglu, M. A. Bayr, and A. Cosar. Integer linear programming solution for the multiple query optimization problem. In Information Sciences and Systems, pages 51–60, 2014.
[11] S. Ganguly. Design and analysis of parametric query optimization algorithms. In VLDB, pages 228–238, 1998.
[12] W.-S. Han, W. Kwak, J. Lee, G. M. Lohman, and V. Markl. Parallelizing query optimization. In VLDB, pages 188–200, 2008.
[13] W.-S. Han and J. Lee. Dependency-aware reordering for parallelizing query optimization in multi-core CPUs. In SIGMOD, pages 45–58, 2009.
[14] J. M. Hellerstein and M. Stonebraker. Predicate migration: optimizing queries with expensive predicates. SIGMOD, 22(2):267–276, 1993.
[15] A. Hulgeri and S. Sudarshan. Parametric query optimization for linear and piecewise linear cost functions. In VLDB, pages 167–178, 2002.
[16] A. Hulgeri and S. Sudarshan. AniPQO: almost non-intrusive parametric query optimization for nonlinear cost functions. In VLDB, pages 766–777, 2003.
[17] Y. E. Ioannidis and Y. Kang. Randomized algorithms for optimizing large join queries. In SIGMOD Record, volume 19, pages 312–321, 1990.
[18] R. Kaushik, C. Ré, and D. Suciu. General database statistics using entropy maximization. In Database Programming Languages, pages 84–99, 2009.
[19] A. Kemper, G. Moerkotte, K. Peithner, and M. Steinbrunn. Optimizing disjunctive queries with expensive predicates. SIGMOD Record, 23(2):336–347, 1994.
[20] J. A. Lawrence and B. A. Pasternack. Applied Management Science. 1997.
[21] G. Moerkotte and T. Neumann. Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products. In VLDB, pages 930–941, 2006.
[22] G. Moerkotte and T. Neumann. Dynamic programming strikes back. In SIGMOD, pages 9–12, 2008.
[23] M. Muralikrishna. Improved unnesting algorithms for join aggregate SQL queries. In VLDB, pages 91–102, 1992.
[24] K. Ono and G. Lohman. Measuring the complexity of join enumeration in query optimization. In VLDB, pages 314–325, 1990.
[25] S. Papadomanolakis and A. Ailamaki. An integer linear programming approach to database design. In ICDEW, pages 442–449, 2007.
[26] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pages 23–34, 1979.
[27] M. A. Soliman, M. Petropoulos, F. Waas, S. Narayanan, K. Krikellas, R. Baldwin, L. Antova, V. Raghavan, A. El-Helw, Z. Gu, E. Shen, G. C. Caragea, C. Garcia-Alvarado, and F. Rahman. Orca: a modular query optimizer architecture for big data. In SIGMOD, pages 337–348, 2014.
[28] M. Steinbrunn, G. Moerkotte, and A. Kemper. Heuristic and randomized optimization for the join ordering problem. VLDBJ, 6(3):191–208, 1997.
[29] A. Swami. Optimization of large join queries: combining heuristics and combinatorial techniques. In SIGMOD, pages 367–376, 1989.
[30] A. Swami and A. Gupta. Optimization of large join queries. In SIGMOD, pages 8–17, 1988.
[31] I. Trummer and C. Koch. An incremental anytime algorithm for multi-objective query optimization. In SIGMOD, pages 1941–1953, 2015.
[32] I. Trummer and C. Koch. Multi-objective parametric query optimization. VLDB, 8(3):221–232, 2015.
[33] B. Vance and D. Maier. Rapid bushy join-order optimization with Cartesian products. SIGMOD, 25(2):35–46, 1996.
[34] F. M. Waas and J. M. Hellerstein. Parallelizing extensible query optimizers. In SIGMOD, page 871, 2009.
[35] J. Yang, K. Karlapalem, and Q. Li. Algorithms for materialized view design in data warehousing environment. In VLDB, pages 136–145, 1997.
