queryoptimization-examples
• Introduction (Ch. 1) ⋆
⋆ Özsu and Valduriez, Principles of Distributed Database Systems (3rd Ed.), 2011 (Distributed DBMS, Ch. 8)
OUTLINE (TODAY)
DISTRIBUTED QUERY OPTIMIZATION
ELEMENTS OF THE OPTIMIZER
CENTRALIZED VS. DISTRIBUTED QUERY OPTIMIZATION
JOIN ORDERING IN THE DISTRIBUTED CONTEXT
• Join ordering is important in centralized query optimization
• Use of semijoins to reduce relation sizes (and thus communication costs) before
performing join operations
JOIN ORDERING – 2 RELATIONS
JOIN ORDERING – EXAMPLE
[Figure: join graph of distributed query, with EMP at Site 1, ASG at Site 2, PROJ at Site 3, and edges labeled ENO (EMP, ASG) and PNO (ASG, PROJ)]
Execution alternatives:
1. EMP → Site 2
Site 2 computes EMP' = EMP ⋈ ASG
EMP' → Site 3
• Semijoins can be used to reduce the sizes of operands to transfer (similar to what
selections do)
➡ Reduced communication costs
SEMIJOIN ALGORITHMS – SUM UP
• Using a semijoin is convenient if R ⋉_A S has high selectivity (selects few tuples) and/or the size of R is large
• It is bad otherwise, due to the additional transfer of Π_A(S)
• The cost of transferring Π_A(S) can be reduced by using bit arrays
• A disadvantage of using semijoins is the loss of indices
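The trade-off above can be sketched as a simple size comparison. This is a minimal sketch, assuming that the only cost is the number of bytes transferred and that R must reach the site of S: shipping R whole costs size(R), while the semijoin route costs size(Π_A(S)) + size(R ⋉_A S). The function name and the sample sizes are hypothetical.

```python
def semijoin_is_beneficial(size_r, size_proj_a_s, size_r_semijoin_s):
    """Compare bytes moved by the semijoin route against shipping R whole.

    size_r            -- bytes needed to ship R entirely
    size_proj_a_s     -- bytes needed to ship the join-attribute projection of S
    size_r_semijoin_s -- bytes needed to ship the reduced relation R ⋉_A S
    All three are assumed to be estimates supplied by the caller.
    """
    semijoin_bytes = size_proj_a_s + size_r_semijoin_s
    return semijoin_bytes < size_r


# Large R, selective semijoin: the extra transfer of the projection pays off.
print(semijoin_is_beneficial(10_000, 500, 1_000))   # True
# Small R, unselective semijoin: shipping R whole is cheaper.
print(semijoin_is_beneficial(1_000, 500, 900))      # False
```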
Bit arrays
• Let h be a hash function that distributes possible values for A into n buckets:
h : Dom(A) → { 0, … , n-1 }
• Recall:
o BA[i] = 1 iff ∃ value v for attribute A in S s.t. h(v) = i
o a tuple of R with value v for A belongs to R' iff BA[h(v)] = 1
R                 S
id_R  A           id_S  A
1     1           1     5
2     2           2     5
3     2           3     3
4     5           4     5
5     4           5     3
6     5
7     4
8     5
• h(x) = x mod 4
• n = 4 (4 buckets)
• h(1) = h(5) = 1
• BA[0] = 0 (no value v occurs in S.A s.t. h(v) = 0)
• BA[1] = 1 (due to occurrence of 5 for attribute A in S)
• BA[2] = 0 (no value v occurs in S.A s.t. h(v) = 2)
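The example can be replayed in a few lines of Python. This is an illustrative sketch using the R and S values from the tables above and h(x) = x mod 4; the function names are hypothetical. It also shows the price of the compact encoding: because h(1) = h(5) = 1, the R tuple with A = 1 passes the filter even though 1 does not occur in S.A.

```python
def h(x, n=4):
    """Hash function from the example: h(x) = x mod n."""
    return x % n


def build_bit_array(s_values, n=4):
    """BA[i] = 1 iff some value v of S.A hashes to bucket i."""
    ba = [0] * n
    for v in s_values:
        ba[h(v, n)] = 1
    return ba


def reduce_with_bit_array(r_tuples, ba):
    """Keep the tuples (id, v) of R whose bucket is set in BA."""
    return [(tid, v) for tid, v in r_tuples if ba[h(v, len(ba))] == 1]


R = [(1, 1), (2, 2), (3, 2), (4, 5), (5, 4), (6, 5), (7, 4), (8, 5)]
S = [(1, 5), (2, 5), (3, 3), (4, 5), (5, 3)]

BA = build_bit_array([a for _, a in S])
R_reduced = reduce_with_bit_array(R, BA)

# Exact semijoin, for comparison with the bit-array approximation.
s_values = {a for _, a in S}
R_exact = [(tid, v) for tid, v in R if v in s_values]

print(BA)         # [0, 1, 0, 1]
print(R_reduced)  # [(1, 1), (4, 5), (6, 5), (8, 5)]
print(R_exact)    # [(4, 5), (6, 5), (8, 5)]
```

Only the n-bit array travels to the site of R instead of the projected values of S.A, at the cost of possible false positives such as tuple 1.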
• A full reducer for a relation is the semijoin program that reduces the relation the most
• Finding the full reducer for a relation with an exhaustive brute-force approach:
➡ For cyclic queries, a full reducer cannot be found
✦ Solution: break the cycle
➡ With other queries: inefficient (NP-hard)
✦ Solution: only use semijoins when the problem is simple
✓ e.g., for chained queries, where relations are in sequence and each one joins with the next one
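For a chained query, one way to obtain full reduction is a backward pass of semijoins followed by a forward pass. The sketch below is illustrative only: relations are modeled as in-memory lists of dicts, the function names are hypothetical, and the sample tuples for EMP, ASG, and PROJ are made up.

```python
def semijoin(r, s, attr):
    """Return the tuples of r whose attr value also appears in s (r ⋉_attr s)."""
    values = {t[attr] for t in s}
    return [t for t in r if t[attr] in values]


def fully_reduce_chain(relations, join_attrs):
    """Reduce a chain R1 - R2 - ... - Rn.

    relations[i] joins relations[i + 1] on join_attrs[i].
    """
    rels = [list(r) for r in relations]
    # Backward pass: each relation is reduced by its right neighbor.
    for i in range(len(rels) - 2, -1, -1):
        rels[i] = semijoin(rels[i], rels[i + 1], join_attrs[i])
    # Forward pass: each relation is reduced by its left neighbor.
    for i in range(1, len(rels)):
        rels[i] = semijoin(rels[i], rels[i - 1], join_attrs[i - 1])
    return rels


# Hypothetical data for the chain EMP -ENO- ASG -PNO- PROJ.
EMP = [{"ENO": 1}, {"ENO": 2}, {"ENO": 3}]
ASG = [{"ENO": 1, "PNO": 10}, {"ENO": 2, "PNO": 20}, {"ENO": 4, "PNO": 10}]
PROJ = [{"PNO": 10}, {"PNO": 30}]

emp_r, asg_r, proj_r = fully_reduce_chain([EMP, ASG, PROJ], ["ENO", "PNO"])
print(emp_r)   # [{'ENO': 1}]
print(asg_r)   # [{'ENO': 1, 'PNO': 10}]
print(proj_r)  # [{'PNO': 10}]
```

After the two passes, every remaining tuple takes part in the final join result, so no further semijoin can shrink any of the three relations.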
DISTRIBUTED QUERY OPTIMIZATION
➡ Coordinator (master site) chooses
1. the execution site and
2. the transfer method
➡ Apprentice sites (where fragments are stored and queries are executed)
✦ Apprentices behave as in the case of centralized query optimization in optimizing
localized queries (over fragments) assigned to them
✓ Choose best join ordering, join algorithm, and access method for relations
CHOICES OF THE MASTER SITE
1. Choice of the execution sites
➡ E.g., R ⋈ S can be executed:
✦ at the site where R is stored
✦ at the site where S is stored
✦ at a third site (e.g., where a 3rd relation waits to be joined – allows for parallel transfer)
2. Transfer method
➡ ship-whole: relation is transferred to the join execution site entirely
✦ In some cases (e.g., for outer relations, or in the case of merge join) there is no need to store the relation: join as it arrives, in pipelined mode
➡ fetch-as-needed (only needed tuples are transferred, i.e., tuples selected by the join):
✦ equivalent to performing a semijoin of one relation with the tuples of the other one (to reduce the size of the former) before executing the join
✦ e.g., semijoin of the inner relation with respect to the outer one (only needed tuples of the inner relation are transferred)
✓ tuples of the outer relation are sent (only the join attribute) to the site of the inner relation
✓ matching tuples of the inner relation are sent to the site of the outer relation to execute the join
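The fetch-as-needed exchange can be sketched as follows. This is a single-process simulation, not a networked implementation: the two sites are plain Python lists, the index at the inner site is an assumption, and the counters only illustrate what would cross the network. All names and sample tuples are hypothetical.

```python
def fetch_as_needed_join(outer, inner, attr):
    """Join outer and inner at the outer site, fetching inner tuples on demand."""
    # Assumed index on the join attribute at the inner relation's site.
    inner_index = {}
    for t in inner:
        inner_index.setdefault(t[attr], []).append(t)

    result = []
    values_sent = 0      # join-attribute values sent to the inner site
    tuples_fetched = 0   # inner tuples sent back to the outer site
    for r in outer:
        values_sent += 1
        matches = inner_index.get(r[attr], [])  # lookup done at the inner site
        tuples_fetched += len(matches)
        for s in matches:
            result.append({**r, **s})           # join done at the outer site
    return result, values_sent, tuples_fetched


R = [{"A": 1, "x": "r1"}, {"A": 2, "x": "r2"}]
S = [{"A": 1, "y": "s1"}, {"A": 1, "y": "s2"}, {"A": 3, "y": "s3"}]

joined, sent, fetched = fetch_as_needed_join(R, S, "A")
print(joined)   # [{'A': 1, 'x': 'r1', 'y': 's1'}, {'A': 1, 'x': 'r1', 'y': 's2'}]
print(sent)     # 2
print(fetched)  # 2
```

The S tuple with A = 3 never leaves its site, which is the reduction that ship-whole does not provide. The price is one round trip per outer tuple.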
The choices of the master site produce 4 strategies (not all combinations are worth considering)
STRATEGY 1 – SHIP-WHOLE/INNER SITE
1. ship-whole/site of inner relation: move outer relation (R) to the site of the inner relation (S)
(a) Retrieve outer tuples
(b) Send them to the inner relation site
(c) Join them as they arrive
• CT(x): communication time to transfer x bytes
• LT(x): local processing time to perform op. x
• s = card(S ⋉_A R)/card(R): average number of tuples of S that match a tuple of R
STRATEGY 2 – SHIP-WHOLE/OUTER SITE
2. ship-whole/site of outer relation: move inner relation (S) to the site of outer
relation (R)
Cannot join as S arrives; it needs to be stored
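The two ship-whole strategies can be compared with a small cost model built from CT, LT, and s as defined for Strategy 1. This is a sketch under stated assumptions, not the slides' own formulas: costs are assumed to add up step by step, LT is assumed linear in the number of tuples handled, CT is assumed to be a fixed message cost plus a per-byte cost, and all constants and sample statistics are hypothetical.

```python
def lt(tuples, per_tuple=1.0):
    """Assumed local processing time: linear in the number of tuples handled."""
    return per_tuple * tuples


def ct(nbytes, per_message=10, per_byte=1):
    """Assumed communication time: fixed message cost plus per-byte cost."""
    return per_message + per_byte * nbytes


def cost_ship_whole_inner_site(card_r, size_r, s):
    """Strategy 1: retrieve R, ship it to S's site, join each tuple as it arrives."""
    return lt(card_r) + ct(size_r) + lt(s) * card_r


def cost_ship_whole_outer_site(card_r, card_s, size_s, s):
    """Strategy 2: retrieve S, ship it to R's site, store it, then join with R."""
    return (lt(card_s)          # retrieve S at its site
            + ct(size_s)        # transfer S
            + lt(card_s)        # store S in a temporary relation
            + lt(card_r)        # retrieve R
            + lt(s) * card_r)   # probe the stored copy of S per tuple of R


# Hypothetical statistics: R has wide tuples, S has narrow ones.
c1 = cost_ship_whole_inner_site(card_r=1000, size_r=100_000, s=0.5)
c2 = cost_ship_whole_outer_site(card_r=1000, card_s=1000, size_s=5_000, s=0.5)
print(c1)  # 101510.0
print(c2)  # 8510.0
```

With these numbers the extra cost of storing S is small compared with the cost of transferring the much larger R, so Strategy 2 comes out cheaper. Different statistics can reverse the outcome.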
STRATEGY 3 – FETCH-AS-NEEDED/OUTER SITE
4. move both inner (S) and outer (R) relations to another site
SELECT *
FROM EMP, PAY
WHERE SALARY > $a
where $a is a variable whose value is specified by the user at runtime
[Figure: query plan with a CP node over alternative subplans combining ⋈, σ SALARY > $a, and EMP]
Normally, pushing σ inside ⋈ is a good heuristic, but it can be bad if the selection rate of ⋈ is higher than that of σ