Query Processing: Lecture 3
Query Processing
(Figure: the query processor)
Query Processing Components
● Query language that is used
⬥ SQL (Structured Query Language)
● Query execution methodology
⬥ The steps that the system goes through in executing
high-level (declarative) user queries
● Query optimization
⬥ How to determine the “best” execution plan?
Query Language
● Tuple calculus: { t | F(t) }
where t is a tuple variable, and F(t) is a well formed formula
● Example:
⬥ Get the numbers and names of all programmers.
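One way to write this query in tuple calculus is sketched below. It assumes the schema EMP(ENO, ENAME, TITLE) implied by the SQL and domain calculus versions of the same query; the bracket notation t[A] for the value of attribute A in tuple t is a common convention and an assumption here.

```latex
\{\, t[\mathrm{ENO}, \mathrm{ENAME}] \mid t \in \mathrm{EMP} \wedge t[\mathrm{TITLE}] = \text{``Programmer''} \,\}
```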
Query Language (cont.)
● Domain calculus: { x1, x2, ..., xn | F(x1, x2, ..., xn) }
where xi is a domain variable, and F(x1, x2, ..., xn) is a well
formed formula
● Example:
{ x, y | EMP(x, y, “Programmer”) }
Query Language (cont.)
● SQL is a tuple calculus language.
SELECT ENO, ENAME
FROM EMP
WHERE TITLE = 'Programmer'
DB example
Figure 3.3
Centralized Query Processing Alternatives
SELECT ENAME
FROM EMP E, ASG G
WHERE E.ENO = G.ENO AND RESP = 'manager'
● Strategy 1:
● Strategy 2:
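To illustrate why two equivalent evaluation strategies for this query can differ in cost, the sketch below compares two evaluation orders. Both orders are an assumption chosen for illustration (Cartesian product then filter, versus selection on RESP then join), and the small EMP and ASG tables are made up; only the column names come from the query.

```python
from itertools import product

# Made-up sample data; only the column names come from the query.
EMP = [
    {"ENO": "E1", "ENAME": "Ann"},
    {"ENO": "E2", "ENAME": "Bob"},
    {"ENO": "E3", "ENAME": "Cy"},
]
ASG = [
    {"ENO": "E1", "RESP": "manager"},
    {"ENO": "E2", "RESP": "analyst"},
    {"ENO": "E3", "RESP": "manager"},
    {"ENO": "E3", "RESP": "analyst"},
]


def product_then_select(emp, asg):
    """Assumed order A: Cartesian product, then filter, then project ENAME."""
    examined = 0
    names = []
    for e, g in product(emp, asg):
        examined += 1
        if e["ENO"] == g["ENO"] and g["RESP"] == "manager":
            names.append(e["ENAME"])
    return sorted(names), examined


def select_then_join(emp, asg):
    """Assumed order B: select managers first, then join on ENO, then project."""
    managers = [g for g in asg if g["RESP"] == "manager"]
    examined = 0
    names = []
    for e, g in product(emp, managers):
        examined += 1
        if e["ENO"] == g["ENO"]:
            names.append(e["ENAME"])
    return sorted(names), examined


names_a, pairs_a = product_then_select(EMP, ASG)
names_b, pairs_b = select_then_join(EMP, ASG)
print(names_a, pairs_a)
print(names_b, pairs_b)
```

Both orders return the same names, but the second examines fewer tuple pairs because the selection shrinks ASG before the join.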
Distributed Query Processing Plans
● By centralized optimization,
Distributed Query Plan I
Plan I: Transport all segments to the query site (Site 5) and
execute the query there.
Distributed Query Plan II
Plan II (Optimized):
● Site 1: ASG1’ = σ RESP=“manager” (ASG1), sent to Site 3
● Site 2: ASG2’ = σ RESP=“manager” (ASG2), sent to Site 4
● Site 3: EMP1’ = EMP1 ⋈ ENO ASG1’, sent to Site 5
● Site 4: EMP2’ = EMP2 ⋈ ENO ASG2’, sent to Site 5
● Site 5: Result = EMP1’ ∪ EMP2’
Costs of the Two Plans
● Assume:
⬥ size(EMP)=400, size(ASG)=1000, 20 tuples with RESP =“manager”
⬥ tuple access cost = 1 unit; tuple transfer cost = 10 units
⬥ ASG and EMP are locally clustered on attributes RESP and ENO, respectively.
● Plan 1
⬥ Transfer EMP to site 5: 400 * tuple transfer cost = 4,000
⬥ Transfer ASG to site 5: 1000 * tuple transfer cost = 10,000
⬥ Produce ASG’: 1000 * tuple access cost = 1,000
⬥ Join EMP and ASG’: 400 * 20 * tuple access cost = 8,000
⬥ Total cost = 23,000
● Plan 2
⬥ Produce ASG’: (10+10) * tuple access cost = 20
⬥ Transfer ASG’ to the sites of EMP: (10+10) * tuple transfer cost = 200
⬥ Produce EMP’: (10+10) * tuple access cost * 2 = 40
⬥ Transfer EMP’ to result site: (10+10) * tuple transfer cost = 200
⬥ Total cost = 460
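The arithmetic for the two plans can be checked with a short script. This is a minimal sketch that encodes the stated assumptions as constants; the function and constant names are hypothetical, and the even 10 + 10 split of the manager tuples across two sites is taken from the Plan 2 terms.

```python
# Assumptions copied from the cost figures for the two plans.
TUPLE_ACCESS = 1      # cost units per tuple access
TUPLE_TRANSFER = 10   # cost units per tuple transfer
SIZE_EMP = 400
SIZE_ASG = 1000
MANAGERS_PER_SITE = 10  # 20 manager tuples, split 10 + 10 across two sites


def plan_1_cost():
    """Ship everything to the query site, then select and join there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER
    produce_asg = SIZE_ASG * TUPLE_ACCESS
    join = SIZE_EMP * (2 * MANAGERS_PER_SITE) * TUPLE_ACCESS
    return transfer_emp + transfer_asg + produce_asg + join


def plan_2_cost():
    """Select and join at the data sites, ship only the small results."""
    selected = MANAGERS_PER_SITE + MANAGERS_PER_SITE
    produce_asg = selected * TUPLE_ACCESS
    transfer_asg = selected * TUPLE_TRANSFER
    produce_emp = selected * TUPLE_ACCESS * 2
    transfer_emp = selected * TUPLE_TRANSFER
    return produce_asg + transfer_asg + produce_emp + transfer_emp


cost_1 = plan_1_cost()
cost_2 = plan_2_cost()
print(cost_1, cost_2)
```

Under these assumptions the data transfers account for most of Plan 1's cost, which is why pushing the selection and join to the data sites pays off.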
Query Optimization Objectives
● Minimize a cost function
I/O cost + CPU cost + communication cost
● These might have different weights in different distributed
environments
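A weighted version of this cost function can be sketched as follows. The plans, their component costs, and the two weight settings are all made-up numbers for illustration; the point is only that the same two plans can rank differently when communication is weighted more heavily.

```python
# Hypothetical component costs for two candidate plans.
PLANS = {
    "A": {"io": 100, "cpu": 50, "comm": 10},
    "B": {"io": 40, "cpu": 30, "comm": 30},
}

# Hypothetical weights: communication is expensive in the first
# environment and comparable to local processing in the second.
WAN_WEIGHTS = {"io": 1, "cpu": 1, "comm": 10}
LAN_WEIGHTS = {"io": 1, "cpu": 1, "comm": 1}


def total_cost(plan, weights):
    """Weighted sum of I/O, CPU, and communication cost."""
    return sum(weights[k] * plan[k] for k in ("io", "cpu", "comm"))


def cheapest(plans, weights):
    """Name of the plan with the smallest weighted total cost."""
    return min(plans, key=lambda name: total_cost(plans[name], weights))


best_wan = cheapest(PLANS, WAN_WEIGHTS)
best_lan = cheapest(PLANS, LAN_WEIGHTS)
print(best_wan, best_lan)
```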
Communication Cost
● Wide area network
⬥ Communication cost will dominate
– Low bandwidth
– Low speed
– High protocol overhead
⬥ Most algorithms ignore all other cost components
❖ Heuristics
• Not optimal
• Restrict the solution space so that only a few
strategies are considered
• Perform unary operations (selection and
projection) first
• Reorder operations to reduce intermediate
relation size
• Replace a join by a series of semijoins to
minimize data communication.
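The last heuristic, replacing a join by semijoins, can be sketched as below. The data, the two-site layout, and the function names are hypothetical; the transfer counts simply add up shipped join-attribute values and shipped tuples, which is a simplification of a real communication cost model.

```python
# Hypothetical data: EMP lives at one site, ASG at another.
EMP = [(f"E{i}", f"name{i}") for i in range(1, 9)]               # (ENO, ENAME)
ASG = [("E2", "manager"), ("E5", "manager"), ("E2", "analyst")]  # (ENO, RESP)


def join_on_eno(emp, asg):
    """Plain nested-loop join of EMP and ASG on ENO."""
    return sorted(
        (eno, ename, resp)
        for (eno, ename) in emp
        for (a_eno, resp) in asg
        if eno == a_eno
    )


def join_by_shipping_emp(emp, asg):
    """Ship all of EMP to ASG's site and join there."""
    shipped = len(emp)
    return join_on_eno(emp, asg), shipped


def join_via_semijoin(emp, asg):
    """Reduce EMP with a semijoin before shipping it."""
    # Step 1: project ASG on the join attribute and ship it to EMP's site.
    join_values = {eno for (eno, _resp) in asg}
    shipped = len(join_values)
    # Step 2: semijoin, keeping only EMP tuples that have a join partner.
    reduced_emp = [t for t in emp if t[0] in join_values]
    # Step 3: ship the reduced EMP to ASG's site and finish the join there.
    shipped += len(reduced_emp)
    return join_on_eno(reduced_emp, asg), shipped


full_result, full_shipped = join_by_shipping_emp(EMP, ASG)
semi_result, semi_shipped = join_via_semijoin(EMP, ASG)
print(full_shipped, semi_shipped)
```

The semijoin version pays for an extra message carrying the join-attribute values, so it helps only when the reduction of EMP is large enough to outweigh that extra transfer.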
Query Optimization Granularity
● Single query at a time
⬥ Cannot use common intermediate results
Query Optimization Timing
● Static
⬥ Optimize at compilation time using statistics; appropriate
for exhaustive search, because the query is optimized once
but executed many times
⬥ Difficult to estimate the size of the intermediate results
⬥ Optimization cost can be amortized over many executions
● Dynamic
⬥ Optimize at execution time; the sizes of the intermediate
results are known accurately, but optimization is repeated
for every execution, which is expensive
Query Optimization Timing (cont.)
● Hybrid
⬥ Compile using a static algorithm
⬥ If the error in the estimated sizes exceeds a threshold,
re-optimize at run time
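The run-time check in the hybrid approach can be sketched as a simple predicate. The function name, the use of relative error, and the default threshold of 0.5 are all assumptions for illustration; the slide does not fix how the error is measured.

```python
def needs_reoptimization(estimated_size, actual_size, threshold=0.5):
    """Return True when the relative size-estimation error exceeds the threshold.

    Relative error is |actual - estimated| / estimated (an assumed definition).
    """
    if estimated_size <= 0:
        # No meaningful estimate: re-optimize as soon as any tuples appear.
        return actual_size > 0
    error = abs(actual_size - estimated_size) / estimated_size
    return error > threshold


# The statically compiled plan assumed 100 tuples in an intermediate result.
keep_plan = needs_reoptimization(estimated_size=100, actual_size=120)
replan = needs_reoptimization(estimated_size=100, actual_size=400)
print(keep_plan, replan)
```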
Statistics
● Relation
⬥ Cardinality
⬥ Size of a tuple
⬥ Fraction of tuples participating in a join with another relation
● Attributes
⬥ Cardinality of the domain
⬥ Actual number of distinct values
● Common assumptions
⬥ Independence between different attribute values
⬥ Uniform distribution of attribute values within their domain
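As a sketch of how these statistics and assumptions are typically used, the code below estimates the result size of equality selections. Under the uniformity assumption an equality predicate on an attribute with d distinct values selects 1/d of the tuples, and under the independence assumption the selectivities of several predicates multiply. The function names and the distinct-value counts are hypothetical.

```python
from fractions import Fraction


def equality_selectivity(distinct_values):
    """Selectivity of A = constant, assuming values are uniformly distributed."""
    return Fraction(1, distinct_values)


def estimate_selection_size(cardinality, distinct_counts):
    """Estimated result size of a conjunction of equality predicates.

    Assumes the predicates are on independent attributes, so their
    selectivities multiply.
    """
    selectivity = Fraction(1)
    for distinct_values in distinct_counts:
        selectivity *= equality_selectivity(distinct_values)
    return cardinality * selectivity


# Hypothetical statistics: 1000 tuples, 50 distinct values in one attribute,
# 4 distinct values in another.
one_predicate = estimate_selection_size(1000, [50])
two_predicates = estimate_selection_size(1000, [50, 4])
print(one_predicate, two_predicates)
```

Exact fractions are used here only to keep the arithmetic free of floating-point rounding.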
Decision Sites
● Query optimization may be performed by
⬥ Single site – centralized approach
– Single site determines the best schedule
– Simple
– Needs knowledge about the entire distributed database
⬥ All the sites involved – distributed approach
– Cooperation among sites to determine the schedule
– Needs only local information
– Cost of operation
⬥ Hybrid – one site makes the major decisions, in cooperation
with other sites that make local decisions
– One site determines the global schedule
– Each site optimizes the local subqueries
Network Topology
Network Topology (cont.)
Other Information to Exploit