6 Query Intro
Introduction
Background
Distributed Database Design
Database Integration
Semantic Data Control
Distributed Query Processing
  Overview
  Query decomposition and localization
  Distributed query optimization
Multidatabase Query Processing
Distributed Transaction Management
Data Replication
Parallel Database Systems
Distributed Object DBMS
Peer-to-Peer Data Management
Web Data Management
Current Issues
M. T. Özsu & P. Valduriez Ch.6/1
Distributed DBMS
query processor
queries.
Query optimization
How do we determine the best execution plan?
Selecting Alternatives
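SELECT ENAME
FROM   EMP, ASG
WHERE  EMP.ENO = ASG.ENO
AND    RESP = "Manager"
Strategy 1
  $\Pi_{\text{ENAME}}(\sigma_{\text{RESP}=\text{"Manager"} \wedge \text{EMP.ENO}=\text{ASG.ENO}}(\text{EMP} \times \text{ASG}))$
Strategy 2
  $\Pi_{\text{ENAME}}(\text{EMP} \bowtie_{\text{ENO}} (\sigma_{\text{RESP}=\text{"Manager"}}(\text{ASG})))$
Strategy 2 avoids the Cartesian product, so it may be better.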
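To make the difference concrete, here is a minimal Python sketch that evaluates both strategies over small made-up EMP and ASG relations and counts the intermediate tuples each one materializes. The tuples and the function names `strategy_1` and `strategy_2` are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical toy relations; the tuples are illustrative only.
EMP = [{"ENO": "E1", "ENAME": "J. Doe"},
       {"ENO": "E2", "ENAME": "M. Smith"},
       {"ENO": "E3", "ENAME": "A. Lee"}]
ASG = [{"ENO": "E1", "RESP": "Manager"},
       {"ENO": "E2", "RESP": "Analyst"},
       {"ENO": "E3", "RESP": "Manager"},
       {"ENO": "E3", "RESP": "Engineer"}]

def strategy_1(emp, asg):
    """Cartesian product first, then select and project."""
    pairs = list(product(emp, asg))  # |EMP| * |ASG| intermediate tuples
    names = {e["ENAME"] for e, a in pairs
             if a["RESP"] == "Manager" and e["ENO"] == a["ENO"]}
    return names, len(pairs)

def strategy_2(emp, asg):
    """Select on ASG first, then join the survivors with EMP on ENO."""
    managers = [a for a in asg if a["RESP"] == "Manager"]
    manager_enos = {a["ENO"] for a in managers}
    names = {e["ENAME"] for e in emp if e["ENO"] in manager_enos}
    return names, len(managers)

names_1, size_1 = strategy_1(EMP, ASG)
names_2, size_2 = strategy_2(EMP, ASG)
print(names_1 == names_2, size_1, size_2)
```

Both functions return the same set of names; on this toy input the first builds 12 intermediate tuples and the second only 2.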
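[Figure: two execution strategies for the query over fragments ASG1 (Site 1), ASG2 (Site 2), EMP1 (Site 3), and EMP2 (Site 4), with the result produced at Site 5]
First strategy: Site 1 computes $\text{ASG}_1' = \sigma_{\text{RESP}=\text{"Manager"}}(\text{ASG}_1)$ and Site 2 computes $\text{ASG}_2' = \sigma_{\text{RESP}=\text{"Manager"}}(\text{ASG}_2)$; Sites 3 and 4 produce $\text{EMP}_1'$ and $\text{EMP}_2'$; Site 5 computes $\text{result} = \text{EMP}_1' \cup \text{EMP}_2'$.
Second strategy: Sites 1 to 4 send ASG1, ASG2, EMP1, and EMP2 to Site 5, which computes $\text{result} = (\text{EMP}_1 \cup \text{EMP}_2) \bowtie_{\text{ENO}} \sigma_{\text{RESP}=\text{"Manager"}}(\text{ASG}_1 \cup \text{ASG}_2)$.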
Cost of Alternatives
Assume
size(EMP) = 400, size(ASG) = 1000
Strategy 1
  produce ASG': (10+10) × tuple access cost
  transfer ASG' to the sites of EMP: (10+10) × tuple transfer cost
  produce EMP': (10+10) × tuple access cost × 2
  transfer EMP' to result site: (10+10) × tuple transfer cost
  Total cost: 60 × tuple access cost + 40 × tuple transfer cost
Strategy 2
  transfer EMP to site 5: 400 × tuple transfer cost
  transfer ASG to site 5: 1000 × tuple transfer cost
  produce ASG': 1000 × tuple access cost
  join EMP and ASG': 400 × 20 × tuple access cost
  Total cost: 9000 × tuple access cost + 1400 × tuple transfer cost
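The itemized costs for the two strategies can be totaled with a short Python sketch. The unit costs are an assumption, since none are stated in the text: 1 unit per tuple access and 10 units per tuple transfer. The function names are likewise hypothetical.

```python
# Assumed unit costs (not given in the text): 1 per access, 10 per transfer.
ACCESS = 1
TRANSFER = 10

SIZE_EMP = 400
SIZE_ASG = 1000
SELECTED = 10 + 10  # tuples in ASG', as used in the itemized costs


def cost_strategy_1(access=ACCESS, transfer=TRANSFER):
    """Select at the ASG sites, join at the EMP sites, ship the results."""
    produce_asg = SELECTED * access
    ship_asg = SELECTED * transfer
    produce_emp = SELECTED * access * 2
    ship_emp = SELECTED * transfer
    return produce_asg + ship_asg + produce_emp + ship_emp


def cost_strategy_2(access=ACCESS, transfer=TRANSFER):
    """Ship EMP and ASG to site 5 and do all the work there."""
    ship_emp = SIZE_EMP * transfer
    ship_asg = SIZE_ASG * transfer
    produce_asg = SIZE_ASG * access
    join = SIZE_EMP * SELECTED * access
    return ship_emp + ship_asg + produce_asg + join


print(cost_strategy_1(), cost_strategy_2())
```

Under these assumed unit costs the totals are 460 for Strategy 1 and 23,000 for Strategy 2. Passing other values for `access` and `transfer` shows how sensitive the comparison is to the cost ratio.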
Assume
  relations of cardinality n
  sequential scan
Select, Project (without duplicate elimination): $O(n)$
Project (with duplicate elimination), Group: $O(n \log n)$
Join, Semi-join, Division, Set Operators: $O(n \log n)$
Cartesian Product: $O(n^2)$
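Two of these complexity classes can be illustrated with a rough Python sketch: a sort-based projection with duplicate elimination, and a nested-loop Cartesian product. The function names and sample rows are hypothetical, and sorting is only one possible way to eliminate duplicates.

```python
def project_distinct(rows, attr):
    """Project on one attribute and drop duplicates by sorting: O(n log n)."""
    values = sorted(row[attr] for row in rows)  # dominant O(n log n) step
    result = []
    for v in values:                            # single O(n) pass
        if not result or result[-1] != v:
            result.append(v)
    return result


def cartesian_product(r, s):
    """Pair every tuple of r with every tuple of s: O(n^2) when |r| = |s| = n."""
    return [(x, y) for x in r for y in s]


rows = [{"RESP": "Manager"}, {"RESP": "Analyst"}, {"RESP": "Manager"}]
print(project_distinct(rows, "RESP"))
print(len(cartesian_product(rows, rows)))
```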
Exhaustive search
  Cost-based
  Optimal
Heuristics
  Not optimal
  Regroup common sub-expressions
  Perform selection, projection first
  Replace a join by a series of semijoins
  Reorder operations to reduce intermediate relation size
  Optimize individual operations
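A minimal sketch of cost-based exhaustive search, assuming a toy cost model: every left-deep join order is enumerated and the one with the smallest sum of intermediate result sizes is kept. The relation PROJ, the cardinalities, and the selectivities are hypothetical values chosen for illustration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical cardinalities and pairwise join selectivities.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 50}
SEL = {frozenset(("EMP", "ASG")): Fraction(1, 400),
       frozenset(("ASG", "PROJ")): Fraction(1, 50)}


def plan_cost(order):
    """Sum of intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    size = CARD[order[0]]
    cost = 0
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            # A missing entry means no join predicate, i.e. a Cartesian product.
            size *= SEL.get(frozenset((prev, rel)), 1)
        joined.append(rel)
        cost += size
    return cost


def exhaustive_search(relations):
    """Enumerate all n! left-deep orders and return the cheapest one."""
    return min(permutations(relations), key=plan_cost)


best = exhaustive_search(list(CARD))
print(best, plan_cost(best))
```

The search is optimal with respect to the cost model, but the number of orders grows factorially with the number of relations, which is why heuristics are used to prune or replace it.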
Static
  Compilation: optimize prior to the execution
  Difficult to estimate the size of the intermediate results; errors propagate
  Can amortize over many executions
  Example: R*
Dynamic
  Run time optimization
  Exact information on the intermediate relation sizes
  Have to reoptimize for multiple executions
  Example: Distributed INGRES
Hybrid
  Compile using a static algorithm
  If the error in estimated sizes > threshold, reoptimize at run time
  Example: Mermaid
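The hybrid policy can be sketched in Python as a check of each compile-time estimate against the size observed at run time. The threshold value of 0.5, the function names, and the sample plan are assumptions made for illustration; none of them comes from a particular system.

```python
def relative_error(estimated, actual):
    """Relative error of a compile-time size estimate."""
    return abs(actual - estimated) / max(estimated, 1)


def run_hybrid(steps, threshold=0.5):
    """Walk a statically compiled plan and flag steps that need reoptimization.

    `steps` is a list of (name, estimated_size, actual_size) triples; in a real
    system the actual size is only known after the step has executed.
    """
    reoptimized_at = []
    for name, estimated, actual in steps:
        if relative_error(estimated, actual) > threshold:
            reoptimized_at.append(name)  # a real optimizer would replan here
    return reoptimized_at


plan = [("select ASG", 20, 22), ("join EMP", 20, 90)]
print(run_hybrid(plan))
```

Here only the join step exceeds the threshold, so the cost of reoptimizing is paid only when the static estimate was badly off.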
Relation
  Cardinality
  Size of a tuple
  Fraction of tuples participating in a join with another relation
Attribute
  Cardinality of domain
  Actual number of distinct values
Common assumptions
  Independence between different attribute values
  Uniform distribution of attribute values within their domain
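As a sketch of how these statistics and assumptions are typically combined, the following Python snippet estimates the size of a selection with two equality predicates. The statistics (1000 tuples, 50 distinct RESP values, 10 distinct DUR values) and the attribute DUR are hypothetical.

```python
def selectivity_eq(distinct_values):
    """Selectivity of `A = value` under the uniform distribution assumption."""
    return 1 / distinct_values


def selectivity_and(*selectivities):
    """Selectivity of a conjunction under the independence assumption."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def estimate_cardinality(card, *selectivities):
    """Estimated number of tuples that satisfy all predicates."""
    return card * selectivity_and(*selectivities)


# Hypothetical statistics: 1000 tuples, 50 distinct RESP values, 10 distinct DUR values.
estimate = estimate_cardinality(1000, selectivity_eq(50), selectivity_eq(10))
print(round(estimate, 2))
```

If the two attributes are in fact correlated, or their values are skewed, the estimate can be far from the true size, which is the error that propagates through a statically compiled plan.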
Centralized
  Single site determines the best schedule
  Simple
  Need knowledge about the entire distributed database
Distributed
  Cooperation among sites to determine the schedule
  Need only local information
  Cost of cooperation
Hybrid
  One site determines the global schedule
  Each site optimizes the local subqueries
Communication cost will dominate; ignore all other cost factors
Global schedule to minimize communication cost
Local schedules according to centralized query optimization
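When only communication cost counts, one simple global decision is where to assemble the data. The Python sketch below picks the site that minimizes the number of tuples shipped. The site names, relation sizes, and the per-tuple transfer cost of 10 are hypothetical.

```python
def transfer_cost(sizes_by_site, target_site, cost_per_tuple=10):
    """Communication cost of shipping every other site's tuples to target_site."""
    return sum(size * cost_per_tuple
               for site, size in sizes_by_site.items()
               if site != target_site)


def cheapest_site(sizes_by_site):
    """Site that minimizes total communication cost, ignoring local processing."""
    return min(sizes_by_site, key=lambda s: transfer_cost(sizes_by_site, s))


# Hypothetical placement: tuples stored at each site.
sizes = {"Site 1": 1000, "Site 3": 400, "Site 5": 0}
print(cheapest_site(sizes), transfer_cost(sizes, cheapest_site(sizes)))
```

The site holding the most data wins because its tuples never cross the network; local processing at that site would then be planned with centralized optimization techniques.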
[Figure: labels GLOBAL SCHEMA, FRAGMENT SCHEMA, STATS ON FRAGMENTS, LOCAL SCHEMAS]