Query Optimization in Distributed Database Systems
Transmission cost
• Transmission requirements are neutral with respect to
systems; they are typically a function of the amount of data
transmitted among sites
• The optimization of a distributed query can be partitioned
into two independent problems: the distribution of the
access strategy among sites, which is done considering
transmission only, and the determination of local access
strategies at each site, which use traditional methods of
centralized databases
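• Transmission cost of sending x units of data between two sites:
TC(x) = C0 + C1 * x
where C0 is the fixed cost of initiating a transmission and C1 is the
cost per unit of data transmitted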
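A minimal sketch of the linear transmission cost model above. The default values of c0 and c1 are hypothetical constants picked only for illustration; they do not come from the slides.

```python
def transmission_cost(x: float, c0: float = 1.0, c1: float = 0.01) -> float:
    """TC(x) = C0 + C1 * x for x units of data.

    c0 and c1 are made-up example constants (assumption).
    """
    if x < 0:
        raise ValueError("amount of data must be non-negative")
    return c0 + c1 * x


# The fixed part C0 dominates for small transfers, the C1 * x part for large ones.
cost_small = transmission_cost(100)
cost_large = transmission_cost(10_000)
```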
Database Profile
Database profile:
• The number of tuples in each relation Ri (card(Ri))
• The size of each attribute A (size(A))
• The size of Ri (size(Ri)) is the sum of the sizes of its attributes
• For each attribute A in each relation Ri: the number of
distinct values appearing in Ri (val(A[Ri])), and the maximum
and minimum values of A
[Figure: two local database systems, LDBS1 and LDBS2; labels Supply1 and Dept1 appear under LDBS1, Supply2 and Dept2 under LDBS2]
Dept: card(Dept) = 30
        DEPTNUM   NAME   AREA   MGRNUM
size    2         15     1      7
val     30        30     6      30
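The Dept profile above can be held in a small data structure; the sketch below is one possible encoding (the class and field names are assumptions, not part of the slides). It derives size(Dept) as the sum of the attribute sizes.

```python
from dataclasses import dataclass


@dataclass
class AttributeProfile:
    size: int  # size(A)
    val: int   # val(A[R]), number of distinct values


@dataclass
class RelationProfile:
    card: int    # card(R), number of tuples
    attrs: dict  # attribute name -> AttributeProfile

    @property
    def size(self) -> int:
        # size(R) is the sum of the sizes of its attributes
        return sum(a.size for a in self.attrs.values())


dept = RelationProfile(
    card=30,
    attrs={
        "DEPTNUM": AttributeProfile(size=2, val=30),
        "NAME": AttributeProfile(size=15, val=30),
        "AREA": AttributeProfile(size=1, val=6),
        "MGRNUM": AttributeProfile(size=7, val=30),
    },
)
dept_size = dept.size  # 2 + 15 + 1 + 7
```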
Profile of partial results of algebraic
operations - SELECTION
Let S denote the result of performing a selection (a unary
operation) over a relation R
• Cardinality - to each selection we associate a selectivity
factor ρ, which is the fraction of tuples satisfying it.
In a simple selection attribute = value (A = v), ρ can be
defined as follows:
ρ = 1/val(A[R])
under the assumption that values are homogeneously
distributed. Thus
card(S) = ρ * card(R)
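As a worked example, assuming the selectivity of A = v is 1/val(A[R]) under a uniform distribution, the sketch below estimates the result of selecting one AREA value from Dept (card = 30, val(AREA) = 6). The function name is hypothetical.

```python
def selection_cardinality(card_r: int, val_a: int) -> float:
    """Estimate card(S) for the selection A = v over R.

    Assumes uniformly distributed values, so the selectivity is 1/val(A[R]).
    """
    if val_a <= 0:
        raise ValueError("val(A[R]) must be positive")
    return card_r / val_a  # selectivity * card(R)


# Dept: card(Dept) = 30, val(AREA[Dept]) = 6
dept_area_estimate = selection_cardinality(30, 6)
```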
• Size: selection does not affect the size of relations
size(S) = size(R)
• Distinct values: depends on the selection criterion.
Consider an attribute B which is not used in the selection
formula. The determination of val(B[S]) can be stated as follows:
given n = card(R) objects uniformly distributed over m =
val(B[R]) colors, how many different colors c = val(B[S])
are selected if we take just r = card(S) objects?
• Yao approximation:
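A minimal sketch of the colors problem, assuming the exact expectation usually attributed to Yao, c = m * (1 - C(n - n/m, r) / C(n, r)), where each color is carried by n/m objects and r objects are drawn without replacement. Any simplified approximation shown on the original slide may differ from this; the function name is hypothetical.

```python
def yao_distinct(n: int, m: int, r: int) -> float:
    """Expected number of distinct colors among r objects drawn without
    replacement from n objects spread evenly over m colors."""
    if m <= 0 or not 0 <= r <= n:
        raise ValueError("need m > 0 and 0 <= r <= n")
    per_color = n / m
    if r > n - per_color:
        # Too many objects drawn for any color to be missed entirely.
        return float(m)
    # Probability that one fixed color is missed by all r draws:
    # C(n - n/m, r) / C(n, r), computed as a running product.
    miss = 1.0
    for i in range(1, r + 1):
        miss *= (n - per_color - i + 1) / (n - i + 1)
    return m * (1.0 - miss)


# Dept example: n = card(Dept) = 30 tuples, m = val(AREA) = 6 values
few = yao_distinct(30, 6, 5)
many = yao_distinct(30, 6, 10)
```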
Profile of partial results of algebraic
operations - PROJECTION
Let S denote the result of performing a projection (a unary
operation) over a relation R
• Cardinality – projection affects the cardinality of
operands since duplicates are eliminated from the result.
This effect is difficult to evaluate; the following three rules
can be applied:
– If the projection involves a single attribute A, set
card(S) = val(A[R])
– If the product ∏_{Ai ∈ Attr(S)} val(Ai[R]) is less than card(R), where
Attr(S) are the attributes in the result of the projection, set
card(S) = ∏_{Ai ∈ Attr(S)} val(Ai[R])
– If the projection includes a key of R, set
card(S) = card(R)
• Note that if the system does not eliminate duplicates, the
cardinality of the result is the same as the cardinality of the
operand relation
• Size: the size of the result of a projection is reduced to the
sum of the sizes of attributes in its specification
• Distinct values : the distinct values of projected attributes
are the same as in the operand relation
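The three cardinality rules for projection can be sketched as a single function. The slides do not say what to use when none of the rules applies; falling back to card(R) as an upper bound is an assumption made here, and the function and parameter names are hypothetical.

```python
from math import prod


def projection_cardinality(card_r, vals, includes_key=False,
                           eliminate_duplicates=True):
    """Estimate card(S) for a projection of R.

    vals: val(Ai[R]) for each projected attribute Ai.
    """
    if not eliminate_duplicates or includes_key:
        return card_r            # nothing is removed
    if len(vals) == 1:
        return vals[0]           # single attribute: card(S) = val(A[R])
    product = prod(vals)
    if product < card_r:
        return product           # product of distinct-value counts
    return card_r                # assumption: fall back to the upper bound


# Dept: card = 30, val(AREA) = 6, val(NAME) = 30, DEPTNUM taken as the key
only_area = projection_cardinality(30, [6])
area_and_name = projection_cardinality(30, [6, 30])
with_key = projection_cardinality(30, [30, 6], includes_key=True)
```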
Profile of partial results of algebraic
operations – GROUP BY
Let S denote the result, let G denote the attributes on which
the grouping is performed, and let AF denote the aggregate
functions to be evaluated
• Cardinality – we give an upper bound on the cardinality
of S:
card(S) ≤ ∏_{Ai ∈ G} val(Ai[R])
• Size: for all attributes A appearing in G
size(R.A) = size (S.A)
• Distinct values : for all attributes A appearing in G
val(A[S]) = val(A[R])
Profile of partial results of algebraic
operations – UNION
Let T denote the union of R and S
• Cardinality – we have:
card(T) ≤ card(R) + card(S)
Equality holds when duplicates are not eliminated
• Size: we have
size(T) = size(R) = size(S)
• Distinct values: an upper bound is
val(A[T]) ≤ val(A[R]) + val(A[S])
Profile of partial results of algebraic
operations – DIFFERENCE
Let T denote the difference R - S
• Cardinality – we have:
max(0, card(R) - card(S)) ≤ card(T) ≤ card(R)
• Size: we have
size(T) = size(R) = size(S)
• Distinct values: an upper bound is
val(A[T]) ≤ val(A[R])
Profile of partial results of algebraic
operations – CARTESIAN PRODUCT
• Cardinality – every tuple of R is paired with every tuple of S, so:
card(T) = card(R) x card(S)
• Size: we have
size(T) = size(R) + size(S)
• Distinct values : the distinct values of attributes are the
same as in the operand relation
Profile of partial results of algebraic
operations – JOIN
• Cardinality – estimating precisely the cardinality of
T = R JN A=B S is very complex; we can give an upper bound
because card(T) ≤ card(R) x card(S), but this bound is
usually much higher than the actual cardinality. Assuming
that all the values of A in R also appear as values of B in S
and vice versa, and that both attributes are
uniformly distributed over the tuples of R and S, we have
card(T) = (card(R) x card(S))/val(A[R])
If one of the two attributes, say A, is a key of R, then
card(T) = card(S)
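A short sketch of the join estimate under the uniformity assumptions stated above. The cardinality of 500 used for the second relation is a made-up figure for illustration, and the function name is hypothetical. Note that when A is a key of R, val(A[R]) = card(R), so the general formula already reduces to card(S).

```python
def join_cardinality(card_r, card_s, val_a_r, a_is_key_of_r=False):
    """Estimate card(T) for T = R JN A=B S.

    Assumes matching value sets and uniform distribution of A and B.
    """
    if a_is_key_of_r:
        return float(card_s)
    return card_r * card_s / val_a_r


# Dept (card = 30, DEPTNUM assumed to be its key, val = 30) joined with a
# relation of 500 tuples (hypothetical cardinality).
general = join_cardinality(30, 500, 30)
key_case = join_cardinality(30, 500, 30, a_is_key_of_r=True)
upper_bound = 30 * 500
```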
• Size: we have
size(T) = size(R) + size(S)
In the case of natural join the size of the join attribute must
be subtracted from the size of the result
• Distinct values: if A is a join attribute, an upper bound is
val(A[T]) ≤ min(val(A[R]), val(B[S]))
if A is not a join attribute, an upper bound is
val(A[T]) ≤ val(A[R]) + val(B[S])
Profile of partial results of algebraic
operations – SEMIJOIN
Consider the semijoin T=R SJ A=B S
• Cardinality – the estimation of the cardinality of T is
similar to that of a selection operation; we denote with ρ
the selectivity of the semijoin operation, which measures
the fraction of the tuples of R which belong to the result.
The estimation is the following:
ρ = val(A[S]) / val(dom(A))
Given ρ, we have
card(T) = ρ * card(R)
• Size: The size of the result of a semijoin is the same as the
size of its first operand
size(T) = size(R)
• Distinct values: the number of distinct values of attributes
which do not belong to the semijoin specification can be
estimated using Yao’s formula with n = card(R),
m = val(A[R]), and r = card(T). If A is the only attribute
appearing in the semijoin specification, then
val(A[T]) = ρ * val(A[R])
where ρ is the selectivity of the semijoin
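A hedged sketch of the semijoin profile, assuming the selectivity is the fraction of the domain of A that appears in S, that is val(A[S]) / val(dom(A)). All numbers in the example are hypothetical, as is the function name.

```python
def semijoin_profile(card_r, size_r, val_a_r, val_a_s, val_dom_a):
    """Estimate the profile of T = R SJ A=B S.

    Assumption: selectivity = val(A[S]) / val(dom(A)).
    """
    rho = val_a_s / val_dom_a
    return {
        "selectivity": rho,
        "card": rho * card_r,    # card(T) = selectivity * card(R)
        "size": size_r,          # size(T) = size(R)
        "val_A": rho * val_a_r,  # val(A[T]) = selectivity * val(A[R])
    }


# Hypothetical figures: 100 tuples in R, 50 distinct A values in R,
# 20 distinct values in S, and a domain of 40 values.
profile = semijoin_profile(card_r=100, size_r=25, val_a_r=50,
                           val_a_s=20, val_dom_a=40)
```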
Architecture of a Query Processor
[Figure: architecture of a query processor; recoverable labels: Parser, Catalog, Base data, Query result]
• Parser: the query is parsed and translated into an internal
representation (flex and bison can be used to construct
an SQL parser)
• Query Rewrite: query rewrite transforms a query in order
to carry out optimizations that are good regardless of the
physical state of the system (elimination of redundant
predicates, unnesting of subqueries, simplification of
expressions). Query rewrite is carried out by a rule engine
• Query Optimizer (QO): this component carries out
optimizations that depend on the physical state of the
system. The QO decides which indexes and which methods to
use, and in which order to execute the operations of a query.
• Query optimizer: in a distributed system, the QO must also
decide at which site each operation is to be executed. The QO
enumerates alternative plans and chooses the best plan
using a cost estimation model
• Plan: specifies precisely how the query is to be executed.
The nodes are operators, and every operator carries out one
particular operation. The edges represent consumer-
producer relationships of operators.
• Plan Refinement: this component transforms the plan into
an executable plan. In DB2 this transformation involves
generating assembler-like code to evaluate
expressions and predicates efficiently
Query evaluation plan
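[Figure: example query evaluation plan tree. Operators shown: PJ A1 and NLJ A2=B2 at Site 0, scan, temp, two receive operators paired with two send operators, PJ B3, PJ A3, SL C=cos, Inxscan(A), Scan(B)]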
• Fragment reducers: a set of unary operations which apply
to the same fragment are collected into programs
• Binary operations: joins and unions
• Optimization graph: nodes represent reduced fragments,
and joins (unions) are represented by edges (hypernodes)
[Figure: optimization graph with nodes A and B connected by an edge labeled A2=B2]
Query Optimization (1)
• Plan enumeration with Dynamic Programming
Input: SPJ query q on relations R1, ..., Rn
Output: A query plan for q
for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do {
    for all S ⊆ {R1, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅
        for all O ⊂ S, O ≠ ∅ do {
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S - O))
            prunePlans(optPlan(S))
        }
    }
}
return optPlan({R1, ..., Rn})
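The enumeration can be sketched in Python as follows. This is an illustration under assumptions, not the algorithm as any particular system implements it: the cost model (sum of intermediate result sizes, free base scans), the pruning rule (keep only the cheapest plan, applied once per subset rather than after every split), and all names and input figures are hypothetical.

```python
from itertools import combinations


def optimize(cards, selectivity):
    """cards: {relation: cardinality};
    selectivity: {frozenset({r, s}): join selectivity factor}.
    A plan is a tuple (cost, cardinality, tree)."""
    rels = sorted(cards)

    def prune(plans):
        # Stand-in for prunePlans: keep only the cheapest plan.
        return [min(plans, key=lambda p: p[0])] if plans else []

    def join_plans(left, right, left_set, right_set):
        sel = 1.0
        for a in left_set:
            for b in right_set:
                sel *= selectivity.get(frozenset((a, b)), 1.0)
        out = []
        for lcost, lcard, ltree in left:
            for rcost, rcard, rtree in right:
                card = lcard * rcard * sel
                out.append((lcost + rcost + card, card, (ltree, rtree)))
        return out

    opt = {}
    for r in rels:  # access plans for single relations
        opt[frozenset([r])] = prune([(0.0, float(cards[r]), r)])
    for i in range(2, len(rels) + 1):
        for combo in combinations(rels, i):
            s = frozenset(combo)
            plans = []
            for k in range(1, i):  # all non-empty proper subsets O of S
                for sub in combinations(combo, k):
                    o = frozenset(sub)
                    plans += join_plans(opt[o], opt[s - o], o, s - o)
            opt[s] = prune(plans)
    return opt[frozenset(rels)][0]


best = optimize(
    {"R": 1000, "S": 100, "T": 10},
    {frozenset(("R", "S")): 0.01, frozenset(("S", "T")): 0.1},
)
best_cost, best_card, best_tree = best
```

With these made-up inputs, joining S and T first keeps the intermediate result small, which is why that order wins under the assumed cost model.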
Query Optimization (3)
• Optimization criteria:
– Classic cost model (total time, total resource
consumption) – estimate the cost of every individual
operator of the plan and then sum up these costs – this
model is useful to estimate the overall throughput of a
system
– Mean response time model – estimate the lowest
response time of a query
Query Execution Techniques
• Row blocking – the implementation of send and receive
operators is based on TCP/IP or UDP;
idea: ship tuples in a blockwise fashion
• Optimization of multicasts: send data sequentially
instead of sending data twice (NY → Berlin → Poznan)
• Joins with horizontally partitioned data –
(A1 ∪ A2) JN B or (A1 JN B) ∪ (A2 JN B)
If A and B are both partitioned, then we have more plans
• Semijoin and Bloomjoin programs
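The two plans for a join with horizontally partitioned data produce the same result, since the join distributes over the union of the fragments. A small check with made-up tuples (the relations are modeled as Python sets of pairs, joined on the first component; all names are hypothetical):

```python
def join(left, right):
    """Equijoin of two sets of (key, payload) pairs on the key."""
    return {(a, x, y) for (a, x) in left for (b, y) in right if a == b}


# A is horizontally partitioned into fragments A1 and A2 (example data).
A1 = {(1, "p"), (2, "q")}
A2 = {(3, "r")}
B = {(1, "u"), (3, "v"), (4, "w")}

union_then_join = join(A1 | A2, B)           # (A1 U A2) JN B
join_then_union = join(A1, B) | join(A2, B)  # (A1 JN B) U (A2 JN B)
```

Which plan is cheaper depends on where the fragments and B are stored, because the second form can run each partial join at a different site.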
Semijoin Programs
• The semijoin between R and S over two attributes A and B
(R SJ A=B S) keeps the tuples of R that have a match in S; it
satisfies the following property:
(R SJ A=B S) JN A=B S = R JN A=B S
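The property can be checked on a small example. In this sketch relations are sets of (join value, payload) pairs with made-up contents, and the helper names are hypothetical.

```python
def join(r, s):
    """R JN A=B S on the first component of each pair."""
    return {(a, x, y) for (a, x) in r for (b, y) in s if a == b}


def semijoin(r, s):
    """R SJ A=B S: the tuples of R whose join value occurs in S."""
    s_values = {b for (b, _) in s}
    return {(a, x) for (a, x) in r if a in s_values}


R = {(1, "r1"), (2, "r2"), (5, "r5")}
S = {(1, "s1"), (2, "s2"), (2, "s2b"), (7, "s7")}

reduced = semijoin(R, S)       # (5, "r5") has no match and is dropped
via_semijoin = join(reduced, S)
direct = join(R, S)
```

Only the reduced relation needs to be shipped to the site of S, which is what makes semijoins attractive when transmission cost dominates.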
Reducers
• Semijoin programs can be regarded as reducers, i.e.,
operations that can be applied to reduce the cardinality of
their operands
• Let RED(Q, R) denote the set of reducer programs that can
be built for a given relation R in a given query Q
• There is one reducer program, an element of RED(Q, R),
which reduces R more than all other programs; it is called
the full reducer
• The problem: find the full reducers for all the relations of a
query (a difficult task)
• Acyclic (tree queries) versus cyclic queries
• Is it possible to bound the length of the full
reducer?
• Tree queries – YES
The length of the full reducer is bounded by
n-1, where n is the number of nodes of the tree
• Cyclic queries – NO
The length of the ‘best’ reducer is linearly
bounded by the number of tuples of some relations of the
query
• Best reducer does not mean full reducer
Example (1)
R          S          T
A  B       B  C       C  A
1  a       a  x       x  2
2  b       b  y       y  3
3  c       c  z       z  4
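[Figure: cyclic query graph over R, S, T with edges B=B between R and S, C=C between S and T, and A=A between R and T]
[Figure: acyclic query graph over R, S, T with edges B=B between R and S and C=C between S and T]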
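The example relations can be used to see how semijoin reduction behaves on the two queries. The sketch below applies semijoins along every edge, in both directions, until nothing changes; the fixpoint loop and all names are assumptions made for illustration, not a procedure given in the slides.

```python
def semijoin(r, r_col, s, s_col):
    """Tuples of r whose value in column r_col occurs in column s_col of s."""
    values = {u[s_col] for u in s}
    return {t for t in r if t[r_col] in values}


def reduce_to_fixpoint(rels, edges):
    """Apply semijoins along all edges until no relation shrinks.
    Returns the reduced relations and the number of effective semijoins."""
    rels = dict(rels)
    steps = 0
    changed = True
    while changed:
        changed = False
        for x, xc, y, yc in edges:
            for a, ac, b, bc in ((x, xc, y, yc), (y, yc, x, xc)):
                reduced = semijoin(rels[a], ac, rels[b], bc)
                if reduced != rels[a]:
                    rels[a] = reduced
                    steps += 1
                    changed = True
    return rels, steps


relations = {
    "R": {(1, "a"), (2, "b"), (3, "c")},    # columns A, B
    "S": {("a", "x"), ("b", "y"), ("c", "z")},  # columns B, C
    "T": {("x", 2), ("y", 3), ("z", 4)},    # columns C, A
}
acyclic_edges = [("R", 1, "S", 0), ("S", 1, "T", 0)]
cyclic_edges = acyclic_edges + [("T", 1, "R", 0)]

acyclic_result, acyclic_steps = reduce_to_fixpoint(relations, acyclic_edges)
cyclic_result, cyclic_steps = reduce_to_fixpoint(relations, cyclic_edges)
```

For the acyclic query no semijoin removes anything. For the cyclic query the result is empty, but the semijoins only discover this by repeatedly removing a few tuples at a time, so the number of steps grows with the size of the relations.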
Testing the graph for cycles
• There are two cases in which cycles can be broken without
changing the meaning of the query
1. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.A), in which
R, S, T are relation names, and A, B, C are attributes, any
one of the edges can be dropped, as any edge can be
obtained from the remaining ones by transitivity.
2. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.D), we can
substitute (R.A=R.D) for (T.C=R.D) because, by
transitivity, T.C must equal R.A; the remaining graph
contains two edges, (R, S) and (S, T), and is acyclic, because
an interrelation clause has been substituted by an intrarelation
clause