Chapter 5: Overview of Query Processing
Acknowledgements: I am indebted to Arturas Mazeika for providing me with his slides for this course.
• Query processing: A 3-step process that transforms a high-level query (of relational
calculus/SQL) into an equivalent and more efficient lower-level query (of relational
algebra).
1. Parsing and translation
– Check syntax and verify relations.
– Translate the query into an equivalent
relational algebra expression.
2. Optimization
– Generate an optimal evaluation plan
(with lowest cost) for the query.
3. Evaluation
– The query-execution engine takes an
(optimal) evaluation plan, executes that
plan, and returns the answers to the
query.
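The three steps can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the nested-tuple plan encoding, the toy relations EMP and ASG with their contents, and the helper names `evaluate` and `run`. Step 1 is taken as already done (two equivalent algebra expressions are written by hand), and step 2 is imitated by running both plans and keeping the cheaper one, whereas a real optimizer would estimate the cost without executing anything.

```python
# Toy relations (invented data): lists of dicts.
EMP = [
    {"eno": 1, "ename": "Ann"},
    {"eno": 2, "ename": "Bob"},
    {"eno": 3, "ename": "Cy"},
]
ASG = [
    {"eno": 1, "resp": "Manager"},
    {"eno": 2, "resp": "Analyst"},
    {"eno": 3, "resp": "Manager"},
]
CATALOG = {"EMP": EMP, "ASG": ASG}


def evaluate(plan, stats):
    """Step 3: execute a plan bottom-up, counting the tuples each operator reads."""
    op = plan[0]
    if op == "rel":
        return CATALOG[plan[1]]
    if op == "select":
        _, attr, value, child = plan
        rows = evaluate(child, stats)
        stats["read"] += len(rows)
        return [r for r in rows if r[attr] == value]
    if op == "join":
        _, attr, left, right = plan
        lrows, rrows = evaluate(left, stats), evaluate(right, stats)
        stats["read"] += len(lrows) * len(rrows)  # naive nested-loop join
        return [{**l, **r} for l in lrows for r in rrows if l[attr] == r[attr]]
    raise ValueError(f"unknown operator: {op}")


def run(plan):
    """Return (answer, number of tuples read) for a plan."""
    stats = {"read": 0}
    return evaluate(plan, stats), stats["read"]


# Assumed output of step 1: two equivalent expressions for
# "employees whose responsibility is Manager".
plan_a = ("select", "resp", "Manager",
          ("join", "eno", ("rel", "EMP"), ("rel", "ASG")))
plan_b = ("join", "eno", ("rel", "EMP"),
          ("select", "resp", "Manager", ("rel", "ASG")))

# Stand-in for step 2: keep the candidate with the lowest measured cost.
best = min((plan_a, plan_b), key=lambda p: run(p)[1])
```

Both plans return the same answer, but `plan_b` reads fewer tuples because the selection shrinks ASG before the join.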
• Strategy 2:
– Move ASG1 and ASG2 to Site 5
– Move EMP1 and EMP2 to Site 5
– Select and join at Site 5
• Query optimization is a crucial and difficult part of overall query processing
• The objective of query optimization is to minimize the following cost function:
I/O cost + CPU cost + communication cost
• Ordering of the operators of relational algebra is crucial for efficient query processing
• Rule of thumb: move expensive operators to the end of query processing
• Cost of RA operations:
Operation                                          Complexity
Select, Project (without duplicate elimination)    O(n)
Project (with duplicate elimination), Group        O(n log n)
Join, Semi-join, Division, Set Operators           O(n log n)
Cartesian Product                                  O(n^2)
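To see why operator ordering matters, the complexities above can be plugged into a rough cost model. The sketch below is an assumption-laden illustration, not a real optimizer model: n is treated as the total number of input tuples, constants are ignored, the join output is assumed to have about n tuples, and the function names and the 1% selectivity are invented.

```python
import math


def select_cost(n):
    """O(n): one scan of the input."""
    return float(n)


def join_cost(n):
    """O(n log n), e.g. a sort-based join; guarded for tiny inputs."""
    return n * math.log2(n) if n > 1 else float(n)


def product_cost(n):
    """O(n^2): every tuple is paired with every other tuple."""
    return float(n) ** 2


def select_then_join(n, selectivity):
    """Cheap selection first, so the join only sees the surviving tuples."""
    kept = n * selectivity
    return select_cost(n) + join_cost(kept)


def join_then_select(n):
    """Join on the full input, then scan the result (assumed to hold about n tuples)."""
    return join_cost(n) + select_cost(n)


n = 1_000_000
early = select_then_join(n, selectivity=0.01)  # selection keeps 1% (assumed)
late = join_then_select(n)
worst = product_cost(n)

print(f"select then join : {early:,.0f}")
print(f"join then select : {late:,.0f}")
print(f"cartesian product: {worst:,.0f}")
```

Under these assumptions, applying the selection first is markedly cheaper than joining first, and both are far below the cost of a Cartesian product.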
• Statistics
• Decision sites
• Network topology
• Use of semijoins
• Statistics
– Relation/fragments
∗ Cardinality
∗ Size of a tuple
∗ Fraction of tuples participating in a join with another relation/fragment
– Attribute
∗ Cardinality of domain
∗ Actual number of distinct values
∗ Distribution of attribute values (e.g., histograms)
– Common assumptions
∗ Independence between different attribute values
∗ Uniform distribution of attribute values within their domain
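The two common assumptions translate directly into a simple estimate of result size. The following sketch assumes equality predicates only; the `RelationStats` class, its field names, and the sample numbers are hypothetical. Uniformity gives a selectivity of 1/(number of distinct values) per attribute, and independence lets the selectivities of several predicates be multiplied.

```python
from dataclasses import dataclass, field


@dataclass
class RelationStats:
    cardinality: int                 # number of tuples
    tuple_size: int                  # bytes per tuple
    distinct: dict = field(default_factory=dict)  # attribute -> distinct values


def eq_selectivity(stats, attr):
    """Uniformity assumption: each distinct value is equally frequent."""
    return 1.0 / stats.distinct[attr]


def estimate_cardinality(stats, attrs):
    """Independence assumption: selectivities of the predicates multiply."""
    selectivity = 1.0
    for attr in attrs:
        selectivity *= eq_selectivity(stats, attr)
    return stats.cardinality * selectivity


def estimate_bytes(stats, attrs):
    """Estimated size of the selection result in bytes."""
    return estimate_cardinality(stats, attrs) * stats.tuple_size


# Invented statistics for a relation with 10,000 tuples of 100 bytes each.
emp = RelationStats(cardinality=10_000, tuple_size=100,
                    distinct={"title": 4, "city": 50})

print(estimate_cardinality(emp, ["title"]))          # one predicate
print(estimate_cardinality(emp, ["title", "city"]))  # two predicates
print(estimate_bytes(emp, ["title", "city"]))
```

When attribute values are skewed or correlated, these estimates can be far off, which is one reason to collect histograms.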
• Decision sites
– Centralized
∗ Single site determines the "best" schedule
∗ Simple
∗ Knowledge about the entire distributed database is needed
– Distributed
∗ Cooperation among sites to determine the schedule
∗ Only local information is needed
∗ Cooperation comes with an overhead cost
– Hybrid
∗ One site determines the global schedule
∗ Each site optimizes the local sub-queries
• Network topology
– Wide area networks (WAN), point-to-point
∗ Characteristics
· Low bandwidth
· Low speed
· High protocol overhead
∗ Communication cost dominates; all other cost factors are ignored
∗ Global schedule to minimize communication cost
∗ Local schedules according to centralized query optimization
– Local area networks (LAN)
∗ Communication cost is less dominant
∗ Total cost function should be considered
∗ Broadcasting can be exploited (joins)
∗ Special algorithms exist for star networks
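The effect of the network on the choice of plan can be sketched with a weighted version of the cost function I/O cost + CPU cost + communication cost. All numbers below are invented, and splitting communication cost into a per-message and a per-byte part is an assumption of this sketch; only the relative weights matter.

```python
def total_cost(plan, unit):
    """Weighted sum of I/O, CPU, and communication (messages + bytes) cost."""
    return (unit["io"] * plan["io_ops"]
            + unit["cpu"] * plan["cpu_instr"]
            + unit["msg"] * plan["messages"]
            + unit["byte"] * plan["bytes_sent"])


# Hypothetical unit costs: communication is expensive on a WAN, cheap on a LAN.
WAN = {"io": 100, "cpu": 10, "msg": 10_000, "byte": 100}
LAN = {"io": 100, "cpu": 10, "msg": 100, "byte": 1}

# Two hypothetical plans for the same query.
local_heavy = {"io_ops": 500, "cpu_instr": 2_000, "messages": 2, "bytes_sent": 1_000}
network_heavy = {"io_ops": 100, "cpu_instr": 500, "messages": 10, "bytes_sent": 20_000}


def cheapest(unit):
    """Return the name of the plan with the lowest total cost for these unit costs."""
    plans = {"local_heavy": local_heavy, "network_heavy": network_heavy}
    return min(plans, key=lambda name: total_cost(plans[name], unit))


print("WAN:", cheapest(WAN))
print("LAN:", cheapest(LAN))
```

With these weights the plan that ships little data wins on the WAN, while on the LAN the plan with less local work wins even though it sends more data, which is why the total cost function should be considered there.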
• Use of semijoins
– Reduce the size of the join operands by first computing semijoins
– Particularly relevant when the main cost is the communication cost
– Improves the processing of distributed join operations by reducing the amount of data
exchanged between sites
– However, both the number of messages and the local processing time increase
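The trade-off can be made concrete with a small sketch of a semijoin-based join between two sites. The relations R (at site 1) and S (at site 2), their contents, and the helper names are assumptions for illustration; "shipping" is simulated by counting the items that would cross the network.

```python
def semijoin(r, s_keys, attr):
    """R semijoin S: keep only the tuples of r whose join value occurs in S."""
    return [t for t in r if t[attr] in s_keys]


def join(r, s, attr):
    """Hash join of r and s on attr."""
    index = {}
    for t in s:
        index.setdefault(t[attr], []).append(t)
    return [{**a, **b} for a in r for b in index.get(a[attr], [])]


# Invented data: R lives at site 1, S at site 2; the join runs at site 2.
R = [{"eno": i, "ename": f"e{i}"} for i in range(1, 101)]
S = [
    {"eno": 3, "pno": "p1"},
    {"eno": 7, "pno": "p1"},
    {"eno": 7, "pno": "p2"},
    {"eno": 42, "pno": "p3"},
]

# Strategy A: ship all of R to site 2 (1 message, 100 tuples).
direct_result = join(R, S, "eno")
direct_shipped, direct_messages = len(R), 1

# Strategy B: ship the join values of S to site 1, reduce R there,
# then ship only the matching tuples of R back to site 2 (2 messages).
s_keys = {t["eno"] for t in S}            # projection of S on the join attribute
r_reduced = semijoin(R, s_keys, "eno")    # extra local processing at site 1
semi_result = join(r_reduced, S, "eno")
semi_shipped, semi_messages = len(s_keys) + len(r_reduced), 2

print(direct_shipped, direct_messages)
print(semi_shipped, semi_messages)
```

Both strategies produce the same join result. The semijoin strategy ships far fewer items but needs an extra message and an extra pass over R, so it pays off mainly when communication cost dominates and the semijoin is selective.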
• Query processing transforms a high-level query (relational calculus) into an equivalent
lower-level query (relational algebra). The main difficulty is making the resulting
lower-level query efficient
• Query optimizers vary by search type (exhaustive search, heuristics) and by type of
algorithm (dynamic, static, hybrid). Different statistics are collected to support the query
optimization process