4-2-Query_Processing
Systems
M. Tamer Özsu
Patrick Valduriez
◼ Ship-whole vs ship-as-needed
◼ Search Strategy
❑ Explores the search space and selects the best plan, using the
cost model
❑ Exhaustive search, heuristic algorithms
◼ Cost model
❑ I/O cost + CPU cost + communication cost
❑ These might have different weights in different distributed
environments (LAN vs WAN)
❑ Can also maximize throughput
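The weighted cost model above can be illustrated with a minimal sketch. The weights, the plan names, and the cost figures are hypothetical and chosen only to show how the same two plans can rank differently in a LAN and in a WAN.

```python
from dataclasses import dataclass


@dataclass
class PlanCost:
    io: float    # I/O cost units (hypothetical)
    cpu: float   # CPU cost units (hypothetical)
    comm: float  # communication cost units (hypothetical)


# Assumed weights: communication is far more expensive in a WAN.
LAN_WEIGHTS = {"io": 1.0, "cpu": 0.1, "comm": 1.0}
WAN_WEIGHTS = {"io": 1.0, "cpu": 0.1, "comm": 20.0}


def total_cost(plan, weights):
    """Weighted sum: I/O cost + CPU cost + communication cost."""
    return (weights["io"] * plan.io
            + weights["cpu"] * plan.cpu
            + weights["comm"] * plan.comm)


def best_plan(plans, weights):
    """Return the name of the plan with the lowest weighted cost."""
    return min(plans, key=lambda name: total_cost(plans[name], weights))


plans = {
    "ship_data": PlanCost(io=100, cpu=200, comm=50),
    "local_heavy": PlanCost(io=300, cpu=400, comm=5),
}

print(best_plan(plans, LAN_WEIGHTS))
print(best_plan(plans, WAN_WEIGHTS))
```

With these assumed numbers the plan that ships more data wins in the LAN setting, while the plan that does more local work wins once communication is weighted heavily.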
[Figure: the optimizer takes the input query, generates equivalent QEPs, and selects the best QEP]
[Figure: alternative join trees over R1, R2, R3, R4]
Search Strategy
◼ Deterministic
[Figure: join trees R1 ⋈ R2; (R1 ⋈ R2) ⋈ R3; ((R1 ⋈ R2) ⋈ R3) ⋈ R4]
◼ Randomized
[Figure: join trees (R1 ⋈ R2) ⋈ R3 and (R1 ⋈ R3) ⋈ R2]
◼ Static
❑ Compilation ➔ optimize prior to the execution
❑ Difficult to estimate the size of the intermediate results
❑ Optimization cost can be amortized over many executions
◼ Dynamic
❑ Run time optimization
❑ Exact information on the intermediate relation sizes
❑ Have to re-optimize for multiple executions
◼ Hybrid
❑ Compile using a static algorithm
❑ If the error in the estimated sizes > threshold, re-optimize at run time
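The hybrid trigger can be sketched as a simple check. The use of relative error and the default threshold of 0.5 are assumptions made for illustration; the slides do not say how the error is measured.

```python
def needs_reoptimization(estimated_size, actual_size, threshold=0.5):
    """Return True when the size estimate is off by more than the threshold.

    The error is taken as the relative error of the estimate, which is an
    assumption of this sketch.
    """
    if estimated_size <= 0:
        # No usable estimate: fall back to run-time optimization.
        return True
    error = abs(actual_size - estimated_size) / estimated_size
    return error > threshold


# Hypothetical intermediate-result sizes observed at run time.
print(needs_reoptimization(estimated_size=1000, actual_size=1200))
print(needs_reoptimization(estimated_size=1000, actual_size=5000))
```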
◼ Centralized
❑ Single site determines the “best” schedule
❑ Simple
❑ Need knowledge about the entire distributed database
◼ Distributed
❑ Cooperation among sites to determine the schedule
❑ Need only local information
❑ Cost of cooperation
◼ Hybrid
❑ One site determines the global schedule
❑ Each site optimizes the local subqueries
SELECT ENAME,RESP
FROM EMP
NATURAL JOIN ASG
NATURAL JOIN PROJ
[Figure: R is stored at Site 1 and S at Site 2; S is shipped to Site 1 if size(R) > size(S)]
Consider
PROJ ⋈PNO ASG ⋈ENO EMP
[Figure: join graph of the query, with the relations stored at different sites]
Join Ordering – Example
◼ To select one of these, the following sizes must be known or
predicted:
size(EMP), size(ASG), size(PROJ),
size(EMP ⋈ ASG), and size(ASG ⋈ PROJ)
◼ Furthermore, if the response time is considered, the optimization
must consider that transfers can be done in parallel with strategy
5.
◼ An alternative is to use heuristics that consider only the sizes of
the operand relations by assuming, for example, that the
cardinality of the resulting join is the product of operand
cardinalities.
◼ In this case, relations are ordered by increasing sizes and the
order of execution is given by this ordering and the join graph.
◼ For instance, the order (EMP, ASG, PROJ) could use strategy 1,
while the order (PROJ, ASG, EMP) could use strategy 4.
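The size-based heuristic can be sketched as follows. The relation sizes are hypothetical, and the rule used to combine the ordering with the join graph (always pick the smallest remaining relation that joins with one already chosen) is one plausible reading, not necessarily the exact procedure intended.

```python
def order_by_size(sizes, join_graph):
    """Order relations by increasing size while following the join graph."""
    remaining = set(sizes)
    order = []
    while remaining:
        # Relations that join with something already in the order.
        connected = [r for r in remaining
                     if any(r in join_graph.get(o, ()) for o in order)]
        candidates = connected if connected else sorted(remaining)
        # Smallest candidate first; the name breaks ties deterministically.
        nxt = min(candidates, key=lambda r: (sizes[r], r))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Join graph of EMP, ASG, PROJ: EMP joins ASG on ENO, ASG joins PROJ on PNO.
join_graph = {"EMP": {"ASG"}, "ASG": {"EMP", "PROJ"}, "PROJ": {"ASG"}}

print(order_by_size({"EMP": 400, "ASG": 1000, "PROJ": 600}, join_graph))
print(order_by_size({"EMP": 400, "ASG": 1000, "PROJ": 100}, join_graph))
```

With EMP as the smallest relation the sketch yields the order (EMP, ASG, PROJ); with PROJ as the smallest it yields (PROJ, ASG, EMP).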
Semijoin based Algorithms
◼ Semijoin operation can be used to decrease the total time of join queries
◼ Consider the join of two relations:
❑ R[A] (located at site 1)
❑ S[A] (located at site 2)
◼ Alternatives:
1. Do the join R ⋈A S
2. Perform one of the semijoin equivalents
R ⋈A S ⇔ (R ⋉A S) ⋈A S
⇔ R ⋈A (S ⋉A R)
⇔ (R ⋉A S) ⋈A (S ⋉A R)
R
A B C
1 1 2
2 1 2
3 2 2
6 2 3
7 3 3
S
A D E
1 3 6
2 3 6
5 5 7
8 1 3
Semijoin based Algorithms
1. Do the join R ⋈A S
R ⋈A S
A B C D E
1 1 2 3 6
2 1 2 3 6
2. Perform one of the semijoin equivalents
R ⋈A S ⇔ (R ⋉A S) ⋈A S
⇔ R ⋈A (S ⋉A R)
⇔ (R ⋉A S) ⋈A (S ⋉A R)
R ⋉A S
A B C
1 1 2
2 1 2
S ⋉A R
A D E
1 3 6
2 3 6
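The results above can be reproduced with a small sketch over the tuples of R and S from the example. Relations are modeled as lists of tuples whose first component is the join attribute A; the helper names `join` and `semijoin` are introduced here only for illustration.

```python
# R(A, B, C) at site 1 and S(A, D, E) at site 2, as in the example.
R = [(1, 1, 2), (2, 1, 2), (3, 2, 2), (6, 2, 3), (7, 3, 3)]
S = [(1, 3, 6), (2, 3, 6), (5, 5, 7), (8, 1, 3)]


def join(r, s):
    """Equijoin on the first attribute; the join attribute is kept once."""
    return [tr + ts[1:] for tr in r for ts in s if tr[0] == ts[0]]


def semijoin(r, s):
    """Tuples of r whose first attribute value also appears in s."""
    keys = {ts[0] for ts in s}
    return [tr for tr in r if tr[0] in keys]


full = join(R, S)
r_reduced = semijoin(R, S)   # R semijoin S
s_reduced = semijoin(S, R)   # S semijoin R

# The three semijoin programs give the same result as the direct join.
assert join(r_reduced, S) == full
assert join(R, s_reduced) == full
assert join(r_reduced, s_reduced) == full

print(full)
print(r_reduced)
print(s_reduced)
```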
Semijoin based Algorithms
◼ Perform the join (R ⋈A S), with R at Site 1 and S at Site 2, assuming that size(R) > size(S).
❑ Send S to Site 1
❑ Site 1 computes R ⋈A S
◼ Consider the semijoin program (R ⋉A S) ⋈A S
❑ S' = ΠA(S)
❑ S' → Site 1
❑ Site 1 computes R' = R ⋉A S'
❑ R' → Site 2
❑ Site 2 computes R' ⋈A S
◼ Semijoin is better if size(ΠA(S)) + size(R ⋉A S) < size(S)
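The comparison above can be checked on the example relations with a minimal sketch. Size is measured here as the number of attribute values shipped, which is an assumption of this sketch; a real cost model would use bytes and a communication cost function.

```python
# R(A, B, C) at site 1 and S(A, D, E) at site 2, as in the example.
R = [(1, 1, 2), (2, 1, 2), (3, 2, 2), (6, 2, 3), (7, 3, 3)]
S = [(1, 3, 6), (2, 3, 6), (5, 5, 7), (8, 1, 3)]


def size(rel):
    """Assumed size measure: total number of attribute values."""
    return sum(len(t) for t in rel)


def compare_strategies(r, s):
    """Return (semijoin_cost, join_cost) in shipped attribute values."""
    proj_a = {ts[0] for ts in s}                    # projection of S on A
    r_reduced = [tr for tr in r if tr[0] in proj_a]  # R semijoin S
    semijoin_cost = len(proj_a) + size(r_reduced)    # ship both
    join_cost = size(s)                              # ship S whole
    return semijoin_cost, join_cost


semijoin_cost, join_cost = compare_strategies(R, S)
print(semijoin_cost, join_cost, semijoin_cost < join_cost)
```

On this data the semijoin program ships 4 values for the projection plus 6 values for the reduced R, against 12 values for shipping S whole, so the semijoin is the cheaper option under this size measure.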
Join versus Semijoin
Consider
ET (ENO, ENAME, TITLE, CITY)
AT (ENO, PNO, RESP, DUR, CITY)
PT (PNO, PNAME, BUDGET, CITY)
q': SELECT V1.A1 INTO R1'
FROM R1 V1
WHERE P1(V1.A1)
q12: SELECT ASG.ENO INTO GVAR
FROM ASG,JVAR
WHERE ASG.PNO=JVAR.PNO
q1 : SELECT EMP.ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO AND
ASG.PNO=PROJ.PNO AND
PROJ.PNAME="CAD/CAM"
Block Nested-Loop Join
◼ In the worst case, if there is enough memory only to hold one block of
each relation, the estimated cost is
❑ br * bs + br block transfers,
❑ 2 * br seeks,
where br and bs are the numbers of blocks of the outer relation r and the inner relation s.
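The worst-case estimate can be evaluated with a short sketch. The function name and the block counts in the usage lines are hypothetical.

```python
def bnl_worst_case(b_r, b_s):
    """Worst-case block nested-loop join cost, with r as the outer relation.

    b_r and b_s are the numbers of blocks of r and s; memory is assumed to
    hold only one block of each relation.
    """
    return {
        "block_transfers": b_r * b_s + b_r,
        "seeks": 2 * b_r,
    }


# Hypothetical sizes: the cost depends on which relation is the outer one.
print(bnl_worst_case(b_r=100, b_s=400))
print(bnl_worst_case(b_r=400, b_s=100))
```

Using the smaller relation as the outer one gives fewer block transfers and fewer seeks in this worst case.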
◼ Fetch as needed
+ CT (size (R))
+ CT(length(A)) * card(R)
◼ Given
❑ A set of sites S = {s1, s2, …, sn},
❑ A query Q = {q1, q2, q3, q4, …, qm} such that each subquery qi is the
processing unit that accesses one relation and communicates with its
neighboring subqueries
❑ For each qi in Q, a feasible allocation set of sites Sqi = {s1, s2, …, sk}
where each site stores a copy of the relation in qi
❑ Each site si has a load, denoted by load(si), which reflects the
number of queries currently submitted.
◼ The objective is to find an optimal allocation of Q to S such
that
❑ the unbalanced load of S is minimized
❑ the total communication cost is minimized
2-Step Algorithm
[Table: new load of each site after allocation; surviving values: 2 (q1), 2, 2, 2]
2-Step Algorithm Example
◼ Consider the following query: (R1) ⋈ R2 ⋈ R3 ⋈ R4
◼ The algorithm performs 4 iterations
◼ Iteration 2: the next subquery to be selected is q2.
❑ select q2 and allocate it to s2 (it could also be allocated to s4,
which has the same load as s2)
❑ set load(s2) = 3 (or load(s4) = 3 if s4 is chosen)
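The allocation step can be sketched as a greedy loop. This is a simplified illustration, not the full algorithm: it ignores communication cost, it assumes that the subquery with the fewest feasible sites is selected first, and the feasible sets and initial loads below are hypothetical values picked so that iteration 2 behaves as described above.

```python
def allocate(feasible, load):
    """Greedily assign each subquery to its least-loaded feasible site."""
    load = dict(load)          # work on a copy of the site loads
    allocation = {}
    pending = set(feasible)
    while pending:
        # Assumed selection rule: fewest feasible sites first, then by name.
        q = min(pending, key=lambda x: (len(feasible[x]), x))
        # Least-loaded feasible site; the site name breaks ties.
        site = min(feasible[q], key=lambda s: (load[s], s))
        allocation[q] = site
        load[site] += 1
        pending.remove(q)
    return allocation, load


# Hypothetical input: four subqueries, four sites.
feasible = {
    "q1": ["s1"],
    "q2": ["s2", "s4"],
    "q3": ["s1", "s2", "s3"],
    "q4": ["s1", "s3", "s4"],
}
initial_load = {"s1": 1, "s2": 2, "s3": 2, "s4": 2}

allocation, final_load = allocate(feasible, initial_load)
print(allocation)
print(final_load)
```

In this run q1 goes to s1 first, then q2 goes to s2 (s4 has the same load and would be an equally valid choice), which raises load(s2) to 3.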
Adaptive Query Processing – Definition
◼ Adaptive query processing is a form of dynamic query
processing, with
❑ a feedback loop between the execution environment
and the query optimizer in order to react to
unforeseen variations of runtime conditions.
◼ A query processing system is defined as adaptive if it
receives information from the execution environment and
determines its behavior according to that information in
an iterative manner.
❑ a general adaptive query processing process
◼ Q = σP(R) ⋈ S ⋈ T
◼ Assume that
❑ the only access method to relation T is through an index
on join attribute T.A,
❑ the second join can only be an index join over T.A.