Lecture 06
Agenda
• Recap of query optimization
• Transformation rules for P&D systems
• Memoization
• The cost difference between a good and a bad way of evaluating a query can
be enormous
– Example: computing the Cartesian product r × s and then applying the
selection r.A = s.B is much slower than performing a join on the same condition
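The gap in the example above can be made concrete with a small sketch. The relation contents and the function names `product_then_select` and `hash_join` are illustrative assumptions, and a hash join is only one of several possible join algorithms.

```python
from collections import defaultdict

def product_then_select(r, s):
    """Form r x s, then keep pairs with r.A == s.B: |r| * |s| comparisons."""
    return [(x, y) for x in r for y in s if x["A"] == y["B"]]

def hash_join(r, s):
    """Build a hash table on s.B, then probe it once per tuple of r."""
    index = defaultdict(list)
    for y in s:
        index[y["B"]].append(y)
    return [(x, y) for x in r for y in index.get(x["A"], [])]

# Hypothetical toy relations: r has 200 tuples, s has 200 tuples.
r = [{"A": i} for i in range(200)]
s = [{"B": i} for i in range(0, 400, 2)]

slow = product_then_select(r, s)   # inspects 200 * 200 pairs
fast = hash_join(r, s)             # one pass over s, one pass over r
```

Both functions return the same pairs; only the amount of work differs, which is the point of the slide.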
2. Selection operations are commutative:
$\sigma_{\theta_1}(\sigma_{\theta_2}(E)) = \sigma_{\theta_2}(\sigma_{\theta_1}(E))$
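A minimal sketch of why this rule matters to an optimizer: both orders give the same result, but the number of predicate evaluations differs. The data, the predicates `theta1` and `theta2`, and the evaluation counter are assumptions made for illustration.

```python
def select(pred, rows, counter):
    """Apply a selection and count how many predicate evaluations it costs."""
    out = []
    for row in rows:
        counter[0] += 1
        if pred(row):
            out.append(row)
    return out

rows = list(range(1000))              # hypothetical input E
theta1 = lambda x: x % 100 == 0       # very selective predicate
theta2 = lambda x: x % 2 == 0         # weakly selective predicate

evals_a, evals_b = [0], [0]
plan_a = select(theta1, select(theta2, rows, evals_a), evals_a)  # theta2 first
plan_b = select(theta2, select(theta1, rows, evals_b), evals_b)  # theta1 first
```

Here `plan_a` and `plan_b` hold the same rows, while `evals_b[0]` is smaller than `evals_a[0]` because the more selective predicate runs first.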
• Cost based
– Simulated annealing
– Randomized generation of candidate QEPs
– Problem: how to guarantee randomness?
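As a rough illustration of the randomized approach, the following sketch runs simulated annealing over left-deep join orders. The cardinalities, selectivities, cost function, and annealing parameters are all invented for the example; a real optimizer would plug in its own cost model and move set.

```python
import math
import random

# Hypothetical statistics: relation sizes and pairwise join selectivities.
CARD = {"r1": 1000, "r2": 50, "r3": 2000, "r4": 10}
SEL = {frozenset(("r1", "r2")): 0.01,
       frozenset(("r2", "r3")): 0.001,
       frozenset(("r3", "r4")): 0.05}

def cost(order):
    """Sum of intermediate result sizes of a left-deep plan."""
    size, total, joined = CARD[order[0]], 0.0, [order[0]]
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            size *= SEL.get(frozenset((prev, rel)), 1.0)  # 1.0 = cross product
        joined.append(rel)
        total += size
    return total

def anneal(relations, steps=2000, temp=1e6, cooling=0.995, seed=0):
    rng = random.Random(seed)          # fixed seed makes the run repeatable
    current = list(relations)
    best = list(current)
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]  # random swap
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse plans with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = list(current)
        temp = max(temp * cooling, 1e-9)
    return best, cost(best)

start = ["r1", "r2", "r3", "r4"]
best, best_cost = anneal(start)
```

The search never returns a plan worse than the starting order, but it gives no guarantee of finding the global optimum.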
Memoization Techniques
• How to generate alternative Query Evaluation Plans?
– Early generation systems centred around a tree representation of the
plan
– Hardwired tree rewriting rules are deployed to enumerate part of the
space of possible QEPs
– For each alternative the total cost is determined
– The best alternatives are retained for execution
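[Figure: plans built level by level. Level 1 plans combine pairs of relations (r2 r1, r1 r2, r2 r3, r3 r4, r1 r4); Level 2 plans extend them with a further relation such as r3.]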
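The level-by-level construction can be sketched with memoization: the best plan for each set of relations is computed once and reused by every larger plan that contains it. This is a simplified sketch restricted to left-deep plans, with made-up statistics; `size`, `best_plan`, and `left_deep_cost` are hypothetical helper names.

```python
from functools import lru_cache
from itertools import combinations, permutations

# Hypothetical statistics: relation sizes and pairwise join selectivities.
CARD = {"r1": 1000, "r2": 50, "r3": 2000, "r4": 10}
SEL = {frozenset(("r1", "r2")): 0.01,
       frozenset(("r2", "r3")): 0.001,
       frozenset(("r3", "r4")): 0.05}

def size(rels):
    """Estimated result size of joining a set of relations (order independent)."""
    result = 1.0
    for rel in sorted(rels):
        result *= CARD[rel]
    for a, b in combinations(sorted(rels), 2):
        result *= SEL.get(frozenset((a, b)), 1.0)
    return result

@lru_cache(maxsize=None)
def best_plan(rels):
    """Return (cost, plan) for a frozenset of relations; memoized per subset."""
    if len(rels) == 1:
        (only,) = rels
        return 0.0, only
    candidates = []
    for last in sorted(rels):
        sub_cost, sub_plan = best_plan(rels - {last})   # reuse lower level
        candidates.append((sub_cost + size(rels), (sub_plan, last)))
    return min(candidates, key=lambda c: c[0])

def left_deep_cost(order):
    """Cost of one explicit join order, used only as a cross-check."""
    return sum(size(order[:k]) for k in range(2, len(order) + 1))

memo_cost, memo_plan = best_plan(frozenset(CARD))
exhaustive = min(left_deep_cost(p) for p in permutations(CARD))
```

With four relations the memo table holds one entry per non-empty subset, whereas the exhaustive cross-check walks all 24 permutations.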
Distributed Query Processing
• For centralized systems, the primary criterion
for measuring the cost of a particular strategy
is the number of disk accesses.
• In a distributed system, other issues must be
taken into account:
– The cost of a data transmission over the network.
– The potential gain in performance from having
several sites process parts of the query in parallel.
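The two extra factors can be illustrated with a toy cost model. The weights `io_cost` and `byte_cost` and all figures below are assumptions chosen for the example, not measured values.

```python
def total_cost(site_io, bytes_shipped, io_cost=1.0, byte_cost=0.001):
    """Resources consumed: every disk access plus all network traffic."""
    return io_cost * sum(site_io) + byte_cost * bytes_shipped

def response_time(site_io, bytes_shipped, io_cost=1.0, byte_cost=0.001):
    """Elapsed time when sites work in parallel: slowest site plus transfer."""
    return io_cost * max(site_io) + byte_cost * bytes_shipped

# Hypothetical strategies: one site does everything, or three sites share
# the disk accesses and ship 50,000 bytes of partial results.
central = ([900], 0)
distributed = ([300, 300, 300], 50_000)

central_total = total_cost(*central)
central_time = response_time(*central)
distributed_total = total_cost(*distributed)
distributed_time = response_time(*distributed)
```

Under these assumed weights the distributed strategy consumes slightly more resources in total but finishes much sooner, which is the trade-off the slide points at.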
Transformation rules for distributed systems
• Primary horizontally fragmented table:
– Rule 9: Set union is commutative.
$E_1 \cup E_2 = E_2 \cup E_1$
– Rule 10: Set union is associative.
$(E_1 \cup E_2) \cup E_3 = E_1 \cup (E_2 \cup E_3)$
– Rule 12: The projection operation distributes over union.
$\Pi_L(E_1 \cup E_2) = (\Pi_L(E_1)) \cup (\Pi_L(E_2))$
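The three rules can be checked on a small horizontally fragmented table, with each fragment modelled as a Python set of tuples. The schema, the fragment contents, and the helper `project` are hypothetical.

```python
SCHEMA = ("id", "city", "balance")   # assumed schema of the fragmented table

# Hypothetical horizontal fragments, one per site.
E1 = {(1, "Delft", 100), (2, "Delft", 250)}
E2 = {(3, "Leiden", 100), (4, "Leiden", 75)}
E3 = {(5, "Utrecht", 300)}

def project(rows, attrs):
    """Set-semantics projection onto the named attributes."""
    cols = [SCHEMA.index(a) for a in attrs]
    return {tuple(row[c] for c in cols) for row in rows}

L = ["balance"]
rule9 = (E1 | E2) == (E2 | E1)
rule10 = ((E1 | E2) | E3) == (E1 | (E2 | E3))
rule12 = project(E1 | E2, L) == (project(E1, L) | project(E2, L))
```

Rule 12 is the one that pays off in a distributed setting: each site can project its own fragment before anything is shipped over the network.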
– Data complexity,
• Federated databases often come without proper statistical
summaries
Join and sorting
• Index joins are asymmetric: the roles of their two inputs cannot easily be swapped
– Combine the index join and its operands into a single unit in the process
• Ripple joins
– Break the space into smaller pieces and solve the join operation for
each piece individually
– The piece crossings are moments of symmetry
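A minimal sketch of the ripple idea, assuming the simplest "square" variant in which one tuple is drawn from each input per round. Each newly drawn tuple is joined with everything seen so far from the other input, so every pair is examined exactly once, and after each round the two inputs are in symmetric positions. The name `ripple_join` and the sample data are assumptions.

```python
_END = object()   # sentinel, so that None can be a legitimate tuple value

def ripple_join(r, s, predicate):
    """Alternate between the inputs; join each new tuple with the tuples
    already seen from the other side."""
    seen_r, seen_s = [], []
    it_r, it_s = iter(r), iter(s)
    done_r = done_s = False
    while not (done_r and done_s):
        if not done_r:
            x = next(it_r, _END)
            if x is _END:
                done_r = True
            else:
                seen_r.append(x)
                for y in seen_s:          # new r-tuple against old s-tuples
                    if predicate(x, y):
                        yield (x, y)
        if not done_s:
            y = next(it_s, _END)
            if y is _END:
                done_s = True
            else:
                seen_s.append(y)
                for x in seen_r:          # new s-tuple against old r-tuples
                    if predicate(x, y):
                        yield (x, y)

r = [1, 2, 3, 4]
s = [2, 4, 6]
matches = list(ripple_join(r, s, lambda x, y: x == y))
```

Because results are produced as the inputs are consumed, either input could be favoured in the next round without restarting the join.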
The Idea
[Figure: an Eddie with a tuple buffer sits next to a JOIN operator; tuples are fetched through "next" calls.]
Rivers and Eddies
Eddies are tuple routers that distribute arriving tuples to interested operators
– What are efficient scheduling policies?
• Fixed strategy? Random? Learning?
Static Eddies
• Delivery of tuples to operators can be hardwired in the Eddie to reflect a
traditional query execution plan
Naïve Eddie
• Operators are delivered tuples based on a priority queue
• Intermediate results get highest priority to avoid buffer congestion
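The naïve policy can be sketched with a heap as the tuple buffer: tuples that have already passed more operators get higher priority, so intermediate results leave the buffer first. This sketch assumes the operators are plain selection predicates and that pending operators are tried in a fixed order; `naive_eddy` is a hypothetical name.

```python
import heapq
from itertools import count

def naive_eddy(tuples, operators):
    """Route each tuple through all operators; output those that pass all."""
    seq = count()                       # tie-breaker keeps arrival order
    # Buffer entries: (priority, sequence number, tuple, operators passed).
    buffer = [(0, next(seq), t, frozenset()) for t in tuples]
    heapq.heapify(buffer)
    output = []
    while buffer:
        _, _, tup, done = heapq.heappop(buffer)
        pending = [i for i in range(len(operators)) if i not in done]
        op = pending[0]                 # fixed order among pending operators
        if operators[op](tup):
            done = done | {op}
            if len(done) == len(operators):
                output.append(tup)
            else:
                # More operators passed means a smaller key, hence earlier pop.
                heapq.heappush(buffer, (-len(done), next(seq), tup, done))
        # A tuple that fails a predicate is simply dropped.
    return output

evens_div3 = naive_eddy(range(20), [lambda t: t % 2 == 0, lambda t: t % 3 == 0])
```

Hardwiring `pending[0]` corresponds to the static case; a learning policy would replace that single line.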
Observations for selections
• Extended priority queue for the operators
– Receiving a tuple leads to a credit increment
– Returning a tuple leads to a credit decrement
– Priority is determined by “weighted lottery”
• Naïve Eddies exhibit back pressure in the tuple flow; production is limited by
the rate of consumption at the output
• Research challenges:
– Better learning algorithms to adjust flow
– Aggressive adjustments
– Remove pre-optimization
– Balance ‘hostile’ parallel environment
– Deploy eddies to control degree of partitioning (and replication)
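The credit scheme for selections listed above (credit on receipt, debit on return, weighted lottery) can be sketched as follows. This is a simplified illustration, not the original implementation: operators are plain predicates, each starts with one ticket, and the floor of one ticket is an assumption that keeps every operator eligible.

```python
import random

def lottery_eddy(tuples, operators, seed=0):
    """Route tuples through selections, choosing operators by ticket lottery."""
    rng = random.Random(seed)
    tickets = [1] * len(operators)      # assumed starting credit
    routed = [0] * len(operators)       # how many tuples each operator saw
    output = []
    for tup in tuples:
        pending = list(range(len(operators)))
        alive = True
        while pending and alive:
            weights = [tickets[i] for i in pending]
            op = rng.choices(pending, weights=weights, k=1)[0]
            tickets[op] += 1            # credit: operator received a tuple
            routed[op] += 1
            if operators[op](tup):
                tickets[op] = max(1, tickets[op] - 1)   # debit: tuple returned
                pending.remove(op)
            else:
                alive = False           # tuple consumed; the credit is kept
        if alive:
            output.append(tup)
    return output, routed

ops = [lambda t: t % 10 == 0,           # selective: drops most tuples
       lambda t: t % 2 == 0]            # less selective
result, routed = lottery_eddy(range(2000), ops)
```

An operator that drops most of what it receives keeps its credits and therefore wins the lottery more often, so the more selective predicate tends to see tuples first without any pre-optimization.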