CT80A0000 – DATA-INTENSIVE SYSTEMS
DISTRIBUTED QUERY PROCESSING
Lecture
Jiri Musto, D.Sc.
QUERY PROCESSING
High-level user query → Query processor → Low-level data manipulation commands for the DBMS
QUERY PROCESSING COMPONENTS
Query language
E.g. SQL
Query execution
Steps of a given query
Query optimization
Manual or automatic (DBMS)
We assume a homogeneous distributed DBMS
OPTIMIZATION: SELECTING ALTERNATIVES
Each query can have multiple methods to reach the result
E.g. the value 4 can be reached as 2+2, 1+3, 2*2 or 2^2
Selecting the best option requires some metrics
Minimize the cost function (I/O, CPU, Communication)
Fastest, cheapest, transfer cost, total time, response time, etc.
Either done automatically by the system or manually by the user
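As a minimal sketch of selecting among alternatives, the cost function can be modelled as a weighted sum of I/O, CPU and communication cost. The plan names, cost figures and weights below are hypothetical and only illustrate how the choice shifts with the metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    name: str
    io: float    # disk I/O units (illustrative numbers)
    cpu: float   # CPU units
    comm: float  # data shipped between sites


def total_cost(plan, w_io=1.0, w_cpu=1.0, w_comm=1.0):
    """Weighted sum of the three cost components."""
    return w_io * plan.io + w_cpu * plan.cpu + w_comm * plan.comm


def cheapest(plans, **weights):
    """Return the plan that minimizes the cost function."""
    return min(plans, key=lambda p: total_cost(p, **weights))


# Two hypothetical alternatives that produce the same result.
plans = [
    PlanCost("ship-whole", io=10, cpu=5, comm=100),
    PlanCost("semijoin", io=20, cpu=10, comm=30),
]

best_wan = cheapest(plans)              # communication is expensive
best_lan = cheapest(plans, w_comm=0.1)  # communication is cheap
print(best_wan.name, best_lan.name)
```

With equal weights the plan that ships less data wins; when communication is weighted down, the plan with less local work wins instead.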
TYPES OF OPTIMIZERS
Exhaustive search
Cost-based
Optimal
Heuristics
Not optimal (near optimal)
Optimize individual operations
Find a solution with reasonable cost
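The difference between the two optimizer types can be sketched on join ordering. The cardinalities, selectivities and the cost measure (sum of intermediate result sizes over left-deep orders) are assumptions made for illustration, not a model of any particular DBMS.

```python
from itertools import permutations

# Hypothetical cardinalities and join selectivities, chosen for illustration.
CARD = {"Student": 1000, "Course": 50, "Grade": 5000}
SEL = {
    frozenset(("Student", "Grade")): 0.001,
    frozenset(("Course", "Grade")): 0.01,
}


def selectivity(a, b):
    # Pairs without a join predicate form a cross product (selectivity 1).
    return SEL.get(frozenset((a, b)), 1.0)


def plan_cost(order):
    """Cost of a left-deep join order = sum of intermediate result sizes."""
    joined = [order[0]]
    size = float(CARD[order[0]])
    cost = 0.0
    for rel in order[1:]:
        factor = 1.0
        for prev in joined:
            factor *= selectivity(prev, rel)
        size = size * CARD[rel] * factor
        cost += size
        joined.append(rel)
    return cost


def exhaustive_order(relations):
    """Exhaustive search: evaluate every order, return the cheapest."""
    return min(permutations(relations), key=plan_cost)


def greedy_order(relations):
    """Heuristic: start small, then always add the locally cheapest join."""
    remaining = list(relations)
    order = [min(remaining, key=lambda r: CARD[r])]
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining, key=lambda r: plan_cost(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


relations = list(CARD)
best = exhaustive_order(relations)
quick = greedy_order(relations)
print(best, plan_cost(best))
print(quick, plan_cost(quick))
```

The exhaustive search inspects n! orders, while the greedy heuristic inspects only a handful. The heuristic never beats the exhaustive result and is not guaranteed to match it on other inputs.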
ONE OR MULTIPLE QUERIES
Single query at a time
Cannot use common intermediate results
Easier to optimize
Multiple queries at a time
Efficient if many similar queries
Decision space is much larger
OPTIMIZATION STRATEGY
Static
Optimize prior to the execution
Difficult to estimate the size of intermediate results; estimation errors propagate through the plan
Dynamic
Run time optimization
Exact information on the intermediate relation sizes
Have to reoptimize for multiple executions
Hybrid
Compile using a static algorithm
If the error in the estimated sizes exceeds a threshold, reoptimize at run time
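The hybrid trigger can be sketched as a comparison between the size estimated at compile time and the size observed at run time. The function names, the relative-error measure and the default threshold of 0.5 are assumptions for illustration.

```python
def relative_error(estimated_rows, actual_rows):
    """Relative deviation of the observed size from the static estimate."""
    return abs(actual_rows - estimated_rows) / max(estimated_rows, 1)


def should_reoptimize(estimated_rows, actual_rows, threshold=0.5):
    """Keep the static plan unless the estimate is off by more than threshold."""
    return relative_error(estimated_rows, actual_rows) > threshold


# Estimate was close enough: keep the compiled plan.
print(should_reoptimize(estimated_rows=1000, actual_rows=1200))
# Estimate was far off: reoptimize the remaining part of the plan.
print(should_reoptimize(estimated_rows=1000, actual_rows=5000))
```

A low threshold behaves more like dynamic optimization, a high threshold more like static optimization.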
OPTIMIZATION DECISION SITES
Centralized
One site decides “best” schedule
Simple
Need knowledge about the entire distributed database
Distributed
Each site cooperates to determine schedule
Need only local information
Cost of cooperation
Hybrid
One site determines the global schedule
Each site optimizes the local schedules
QUERY PROCESSING METHODOLOGY
STEP 1 – QUERY DECOMPOSITION
Normalize and analyze the query
Bad queries are rejected
Simplify the query as much as possible
Remove irrelevant parameters
Restructure the query to get a more efficient option
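A minimal sketch of the decomposition step for a conjunctive selection. The schema, the tuple representation of predicates and the `decompose` helper are hypothetical. The sketch rejects references to unknown attributes, removes duplicate conjuncts and detects contradictory equalities.

```python
# Hypothetical schema used only for this example.
SCHEMA = {"student": {"id", "name", "year"}}


def decompose(relation, conjuncts):
    """Analyze and simplify a list of (attribute, operator, value) conjuncts.

    Raises ValueError for a bad query, returns None when the predicate can
    never be true, otherwise returns the simplified conjunct list.
    """
    known = SCHEMA[relation]
    for attr, _op, _value in conjuncts:
        if attr not in known:
            raise ValueError(f"unknown attribute: {attr}")

    # Remove duplicate conjuncts while preserving their order.
    simplified = list(dict.fromkeys(conjuncts))

    # Two different equality constants on one attribute contradict each other.
    equalities = {}
    for attr, op, value in simplified:
        if op == "=":
            if attr in equalities and equalities[attr] != value:
                return None
            equalities[attr] = value
    return simplified


print(decompose("student", [("year", "=", 2), ("year", "=", 2), ("name", "=", "Ann")]))
print(decompose("student", [("year", "=", 2), ("year", "=", 3)]))
```

Real decomposition also normalizes the predicate and rewrites the algebra tree; this sketch covers only the analysis and simplification bullets.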
STEP 2 – DATA LOCALIZATION
Which fragments / partitions are involved in the query?
Localize the query plan
Global parameters are changed to local ones
Optimize the localized plans
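Localization can be sketched for a horizontally fragmented relation: only fragments whose range overlaps the query predicate need to be accessed. The fragment names, sites and ranges are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    name: str
    site: str
    low: int   # inclusive lower bound of the fragmentation attribute
    high: int  # exclusive upper bound


def relevant_fragments(fragments, q_low, q_high):
    """Fragments whose [low, high) range overlaps the query range [q_low, q_high)."""
    return [f for f in fragments if f.low < q_high and q_low < f.high]


# Hypothetical fragmentation of a Student relation by student id.
fragments = [
    Fragment("student_1", "site_A", 0, 1000),
    Fragment("student_2", "site_B", 1000, 2000),
    Fragment("student_3", "site_C", 2000, 3000),
]

# Query: student id in [1500, 2200)
hits = relevant_fragments(fragments, 1500, 2200)
print([(f.name, f.site) for f in hits])
```

The global reference to the relation is replaced by the union of the surviving fragments, so site_A is never contacted for this query.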
STEP 3 – GLOBAL QUERY OPTIMIZATION
Find the best (not necessarily optimal) global schedule
Minimize a cost function
Join processing
Bushy vs. linear trees
Which relation to ship where?
Ship-whole vs ship-as-needed
How data is joined together
Semijoins or not, what join methods are used
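Whether a semijoin pays off can be sketched by comparing communication volume only. Assume R is stored at site 1, S at site 2, and the join is computed at site 1. Shipping S whole costs the size of S. The semijoin alternative ships the distinct join keys of R to site 2 and ships back only the matching rows of S. All sizes below are hypothetical.

```python
def ship_whole_cost(s_rows, s_row_bytes):
    """Bytes transferred when the whole of S is shipped to the site of R."""
    return s_rows * s_row_bytes


def semijoin_cost(r_distinct_keys, key_bytes, s_matching_rows, s_row_bytes):
    """Bytes transferred when join keys go out and only matching S rows come back."""
    return r_distinct_keys * key_bytes + s_matching_rows * s_row_bytes


# Selective join: few S rows match, so the semijoin ships far less data.
whole = ship_whole_cost(s_rows=100_000, s_row_bytes=100)
semi = semijoin_cost(r_distinct_keys=1_000, key_bytes=8,
                     s_matching_rows=5_000, s_row_bytes=100)
print(whole, semi)

# Unselective join with many keys: the semijoin ships more than S itself.
semi_bad = semijoin_cost(r_distinct_keys=200_000, key_bytes=8,
                         s_matching_rows=90_000, s_row_bytes=100)
print(whole, semi_bad)
```

The sketch ignores the extra local processing a semijoin needs, which a full cost model would also account for.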
QUERY OPTIMIZATION PROCESS
QEP = Query Execution Plan
Search/Solution space
Set of possible solutions (query trees)
Cost model
E.g. I/O cost + CPU cost + communication cost
Search strategy
How do we move inside the search space?
Exhaustive search, heuristic algorithms
Deterministic, randomized
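The size of the search space explains why exhaustive search stops being practical. As a sketch, assume left and right join inputs are distinguished and cross products are allowed. Then n relations give n! left-deep (linear) trees and (2(n-1))!/(n-1)! join trees of any shape.

```python
from math import factorial


def left_deep_trees(n):
    """Number of left-deep join orders for n relations."""
    return factorial(n)


def all_join_trees(n):
    """Number of join trees of any shape (bushy and linear) for n relations."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (3, 4, 6, 10):
    print(n, left_deep_trees(n), all_join_trees(n))
```

Restricting the search to linear trees, or moving through the space with a heuristic or randomized strategy, trades plan quality for optimization time.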
NORMAL QUERY PROCESSING ISSUES
The optimizer needs sufficient knowledge about runtime conditions
Runtime conditions should remain stable during query execution
Good for systems with few data sources and a controlled environment
What about changing environments?
Or large numbers of data sources?
Unpredictable runtime conditions?
EXAMPLE: QEP WITH BLOCKED OPERATOR
[Figure: query tree with three joins; Student and Project are joined first, then Course, then Grade]
• Student data cannot be accessed
• Whole process is blocked until regaining access
• What if this would be reorganized?
ADAPTIVE QUERY PROCESSING
Receive information from the execution environment
Modify process accordingly
Communication between the optimizer, the runtime environment and other components
Additional components
Monitoring (statistics, data, network, cost), assessment, reaction
Embedded in control operators of QEP
Tradeoff between reactiveness and overhead of adaptation
Change schedule, replace operators, modify behaviour
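The monitoring, assessment and reaction cycle can be sketched as a scheduler that consumes whichever inputs are currently reachable instead of blocking on the planned order. The sketch assumes the joins may be reordered freely. The function name and the per-tick availability sets are hypothetical.

```python
def run_adaptively(planned_order, availability_by_tick):
    """Process relations as they become available.

    Returns (executed, pending), where executed is a list of
    (tick, relation) pairs and pending holds relations never reached.
    """
    pending = list(planned_order)
    executed = []
    for tick, available in enumerate(availability_by_tick):
        # Monitoring: 'available' is what the sources report at this tick.
        # Assessment: which pending inputs are not blocked right now?
        ready = [rel for rel in pending if rel in available]
        # Reaction: change the schedule and consume the ready inputs first.
        for rel in ready:
            executed.append((tick, rel))
            pending.remove(rel)
        if not pending:
            break
    return executed, pending


planned = ["Student", "Project", "Course", "Grade"]
ticks = [
    {"Project", "Course", "Grade"},             # Student cannot be accessed
    {"Student", "Project", "Course", "Grade"},  # access regained
]
executed, pending = run_adaptively(planned, ticks)
print(executed, pending)
```

A static plan starting with Student would make no progress at tick 0; here the other three relations are processed while Student is unavailable. The price is the extra monitoring on every tick.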
CONCLUSION ON QUERY PROCESSING
There are multiple ways to organize query processing
Centralized, distributed, hybrid
Queries need to be optimized, either automatically or manually
There are multiple methods of searching for an optimal solution
The organization of query processing has an impact
Most often the best practical option is not the strictly optimal solution