Query Optimization


Today’s topic

Distributed Query Processing

Distributed DBMS Page 7-9. 1


Query Processing

high-level user query
  -> query processor
  -> low-level data manipulation commands


Query Processing Components

• Query language that is used
  - SQL
• Query execution methodology
  - The steps that one goes through in executing high-level user queries.
• Query optimization
  - How do we determine the “best” execution plan?


Selecting Alternatives

SELECT ENAME                 (Project)
FROM   EMP, ASG
WHERE  EMP.ENO = ASG.ENO     (Join)
AND    DUR > 37              (Select)

Strategy 1
  π_ENAME(σ_{DUR>37 ∧ EMP.ENO=ASG.ENO}(EMP × ASG))
Strategy 2
  π_ENAME(EMP ⋈_ENO (σ_{DUR>37}(ASG)))

Strategy 2 avoids the Cartesian product, so it is “better”.
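To make the difference between the two algebra expressions concrete, the following is a minimal Python sketch that evaluates both on a handful of made-up tuples and reports the size of the intermediate result each one builds. The tuples, the employee names, and the helper names `strategy1` and `strategy2` are illustrative assumptions, not data from the slides.

```python
from itertools import product

# Hypothetical sample tuples (not from the slides).
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")]   # (ENO, ENAME)
ASG = [("E1", "P1", 12), ("E2", "P1", 40),
       ("E3", "P2", 48), ("E3", "P3", 24)]                       # (ENO, PNO, DUR)

def strategy1():
    """Cartesian product first, then select and project."""
    pairs = list(product(EMP, ASG))                  # EMP x ASG
    names = {e[1] for e, a in pairs if a[2] > 37 and e[0] == a[0]}
    return names, len(pairs)                         # result, intermediate size

def strategy2():
    """Select on ASG first, then join with EMP on ENO."""
    long_asg = [a for a in ASG if a[2] > 37]         # sigma_{DUR>37}(ASG)
    enos = {a[0] for a in long_asg}
    names = {e[1] for e in EMP if e[0] in enos}      # join on ENO, project ENAME
    return names, len(long_asg)

names1, size1 = strategy1()
names2, size2 = strategy2()
print(names1 == names2, size1, size2)
```

Both functions return the same set of names, but the first materializes every EMP/ASG pair before filtering, while the second only keeps the ASG tuples that pass the selection.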

What is the Problem?

Fragments and their sites:
  Site 1: ASG1 = σ_{ENO≤“E3”}(ASG)
  Site 2: ASG2 = σ_{ENO>“E3”}(ASG)
  Site 3: EMP1 = σ_{ENO≤“E3”}(EMP)
  Site 4: EMP2 = σ_{ENO>“E3”}(EMP)
  Site 5: Result

Strategy 1 (distributed execution):
  Sites 1, 2: ASG1’ = σ_{DUR>37}(ASG1), ASG2’ = σ_{DUR>37}(ASG2);
              ASG1’ is sent to Site 3, ASG2’ to Site 4
  Sites 3, 4: EMP1’ = EMP1 ⋈_ENO ASG1’, EMP2’ = EMP2 ⋈_ENO ASG2’;
              EMP1’ and EMP2’ are sent to Site 5
  Site 5:     result = EMP1’ ∪ EMP2’

Strategy 2 (ship everything to the result site):
  Sites 1-4:  ASG1, ASG2, EMP1, EMP2 are sent to Site 5
  Site 5:     result2 = (EMP1 ∪ EMP2) ⋈_ENO σ_{DUR>37}(ASG1 ∪ ASG2)


Cost of Alternatives
• Assume:
  - size(EMP) = 400, size(ASG) = 1000
  - tuple access cost = 1 unit; tuple transfer cost = 10 units
  - 20 ASG tuples satisfy DUR > 37, 10 at each ASG site (this is what the (10+10) terms below stand for)
• Strategy 1
  - produce ASG': (10+10) × tuple access cost = 20
  - transfer ASG' to the sites of EMP: (10+10) × tuple transfer cost = 200
  - produce EMP': (10+10) × tuple access cost × 2 = 40
  - transfer EMP' to result site: (10+10) × tuple transfer cost = 200
  Total cost = 460
• Strategy 2
  - transfer EMP to site 5: 400 × tuple transfer cost = 4,000
  - transfer ASG to site 5: 1000 × tuple transfer cost = 10,000
  - produce ASG': 1000 × tuple access cost = 1,000
  - join EMP and ASG': 400 × 20 × tuple access cost = 8,000
  Total cost = 23,000
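The arithmetic on this slide can be reproduced with a short Python sketch. The unit costs and relation sizes are the ones stated above; the constant and function names are illustrative, and the figure of 10 qualifying ASG tuples per site is read off the (10+10) terms rather than stated explicitly.

```python
ACCESS = 1       # tuple access cost (from the slide)
TRANSFER = 10    # tuple transfer cost (from the slide)
EMP_SIZE = 400
ASG_SIZE = 1000
QUALIFYING = [10, 10]   # ASG tuples with DUR > 37 at sites 1 and 2 (inferred)

def strategy1_cost():
    """Select at the ASG sites, join at the EMP sites, union at site 5."""
    n = sum(QUALIFYING)
    produce_asg = n * ACCESS
    transfer_asg = n * TRANSFER
    produce_emp = n * ACCESS * 2
    transfer_emp = n * TRANSFER
    return produce_asg + transfer_asg + produce_emp + transfer_emp

def strategy2_cost():
    """Ship EMP and ASG whole to site 5 and do all the work there."""
    n = sum(QUALIFYING)
    transfer_emp = EMP_SIZE * TRANSFER
    transfer_asg = ASG_SIZE * TRANSFER
    produce_asg = ASG_SIZE * ACCESS
    join = EMP_SIZE * n * ACCESS
    return transfer_emp + transfer_asg + produce_asg + join

print(strategy1_cost(), strategy2_cost())   # 460 23000
```

Under these assumptions the strategy that filters before shipping is cheaper by a factor of 50, almost entirely because it transfers 40 tuples instead of 1,400.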


Query Optimization Objectives
Minimize a cost function
    I/O cost + CPU cost + communication cost
These might have different weights in different distributed environments
Wide area networks
• communication cost will dominate (80 – 200 ms)
  - low bandwidth
  - low speed
  - high protocol overhead
• most algorithms ignore all other cost components
Local area networks
• communication cost not that dominant (1 – 5 ms)
• total cost function should be considered


Complexity of Relational Operations

Assume
• relations of cardinality n
• sequential scan

Operation                                   Complexity
Select                                      O(n)
Project (without duplicate elimination)     O(n)
Project (with duplicate elimination)        O(n log n)
Group                                       O(n log n)
Join                                        O(n log n)
Semi-join                                   O(n log n)
Division                                    O(n log n)
Set Operators                               O(n log n)
Cartesian Product                           O(n^2)
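The gap between the two projection rows comes from duplicate elimination. A minimal Python sketch, assuming a sort-based approach (one common way to remove duplicates, not necessarily the one the slides have in mind) and using made-up ASG tuples:

```python
def project(rows, columns):
    """Projection without duplicate elimination: a single scan, O(n)."""
    return [tuple(row[c] for c in columns) for row in rows]

def project_distinct(rows, columns):
    """Projection with duplicate elimination: sort, then one scan, O(n log n)."""
    projected = sorted(project(rows, columns))
    result = []
    for t in projected:
        # After sorting, duplicates are adjacent, so compare with the last kept tuple.
        if not result or result[-1] != t:
            result.append(t)
    return result

# Hypothetical sample tuples.
ASG = [
    {"ENO": "E1", "PNO": "P1", "DUR": 12},
    {"ENO": "E2", "PNO": "P1", "DUR": 40},
    {"ENO": "E3", "PNO": "P2", "DUR": 48},
]
print(project(ASG, ["PNO"]))
print(project_distinct(ASG, ["PNO"]))
```

The sort dominates the cost of the second function, which is where the O(n log n) bound comes from.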


Query Optimization Issues – Types of Optimizers
• Exhaustive search
  - cost-based
  - optimal
  - combinatorial complexity in the number of relations
• Heuristics
  - not optimal
  - regroup common sub-expressions
  - perform selection, projection first
  - replace a join by a series of semijoins
  - reorder operations to reduce intermediate relation size
  - optimize individual operations
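The contrast between the two families can be sketched in Python for the join-ordering problem. This is only an illustration: the cost model (sum of intermediate result sizes for left-deep plans), the PROJ cardinality, and the join selectivities are assumptions made up for the example, and `plan_cost`, `exhaustive`, and `greedy` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import permutations

# EMP and ASG sizes follow the earlier slide; the rest is hypothetical.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}

def plan_cost(order):
    """Sum of intermediate result sizes for a left-deep join order."""
    joined = [order[0]]
    size = CARD[order[0]]
    cost = 0
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            # No join predicate between two relations means a Cartesian product.
            size *= SEL.get(frozenset((prev, rel)), 1)
        joined.append(rel)
        cost += size
    return cost

def exhaustive(relations):
    """Try every permutation: optimal under the model, but n! plans."""
    return min(permutations(relations), key=plan_cost)

def greedy(relations):
    """Heuristic: start small, always add the relation with the smallest next result."""
    remaining = set(relations)
    order = [min(remaining, key=CARD.get)]
    remaining.remove(order[0])
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: plan_cost(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)

RELATIONS = ["EMP", "ASG", "PROJ"]
print(exhaustive(RELATIONS), plan_cost(exhaustive(RELATIONS)))
print(greedy(RELATIONS), plan_cost(greedy(RELATIONS)))
```

With three relations the exhaustive search examines 6 orders; with ten it would examine 3,628,800, which is the combinatorial growth the slide refers to. The greedy version looks at far fewer candidates but carries no optimality guarantee in general, even though it happens to find a cheapest plan on this input.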


Query Optimization Issues – Optimization Granularity

• Single query at a time
  - cannot use common intermediate results
• Multiple queries at a time
  - efficient if many similar queries
  - decision space is much larger


Query Optimization Issues – Optimization Timing
• Static
  - compilation ⇒ optimize prior to the execution
  - difficult to estimate the size of the intermediate results ⇒ error propagation
  - can amortize over many executions
• Dynamic
  - run time optimization
  - exact information on the intermediate relation sizes
  - have to reoptimize for multiple executions
• Hybrid
  - compile using a static algorithm
  - if the error in estimated sizes > threshold, reoptimize at run time
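The hybrid rule can be sketched as a small check performed at run time. This is a minimal illustration only: measuring the error as a relative difference and the default threshold of 0.5 are assumptions, and `should_reoptimize` is a hypothetical helper name.

```python
def should_reoptimize(estimated_size, actual_size, threshold=0.5):
    """Return True when the relative estimation error exceeds the threshold."""
    if estimated_size <= 0:
        raise ValueError("estimated_size must be positive")
    relative_error = abs(actual_size - estimated_size) / estimated_size
    return relative_error > threshold

# The static plan assumed 1000 tuples in an intermediate result.
print(should_reoptimize(1000, 1200))   # small error: keep the compiled plan
print(should_reoptimize(1000, 5000))   # large error: reoptimize at run time
```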


Query Optimization Issues – Statistics

• Relation
  - cardinality
  - size of a tuple
  - fraction of tuples participating in a join with another relation
• Common assumptions
  - independence between different attribute values
  - uniform distribution of attribute values within their domain
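The two common assumptions translate directly into a size estimate. The sketch below is illustrative: under uniformity an equality predicate on an attribute with d distinct values selects 1/d of the tuples, and under independence the selectivities of a conjunction multiply. The distinct-value counts and the function names are assumptions made up for the example.

```python
from fractions import Fraction

def equality_selectivity(distinct_values):
    """Uniform distribution: each value accounts for 1/d of the tuples."""
    return Fraction(1, distinct_values)

def estimate_cardinality(cardinality, selectivities):
    """Independence: selectivities of conjunctive predicates multiply."""
    estimate = Fraction(cardinality)
    for s in selectivities:
        estimate *= s
    return estimate

# ASG has 1000 tuples; assume 50 distinct DUR values and 4 distinct PNO values.
sel_dur = equality_selectivity(50)
sel_pno = equality_selectivity(4)
print(estimate_cardinality(1000, [sel_dur, sel_pno]))   # 5
```

When the real data is skewed or the attributes are correlated, this estimate can be far off, which is the source of the error propagation that static optimization has to live with.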


Query Optimization Issues – Decision Sites
• Centralized
  - single site determines the “best” schedule
  - simple
  - need knowledge about the entire distributed database
• Distributed
  - cooperation among sites to determine the schedule
  - need only local information
  - cost of cooperation
• Hybrid
  - one site determines the global schedule
  - each site optimizes the local subqueries


Query Optimization Issues – Network Topology
• Wide area networks (WAN) – point-to-point
  - characteristics
    - low bandwidth
    - low speed
    - high protocol overhead
  - communication cost will dominate; ignore all other cost factors
  - global schedule to minimize communication cost
  - local schedules according to centralized query optimization
• Local area networks (LAN)
  - communication cost not that dominant
  - total cost function should be considered
  - broadcasting can be exploited (joins)
  - special algorithms exist for star networks


Query Optimization Issues – Replicated Fragments
• Process of localization
  - Distributed queries expressed on global relations are mapped into queries on physical fragments of relations by translating relations into fragments


Query Optimization Issues – Use of Semijoins
• Reduce the size of the operand relations
• Size of data exchanged between sites is reduced
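A minimal Python sketch of the idea, assuming EMP and ASG sit at different sites and are joined on ENO. The tuples and the `semijoin` helper are illustrative assumptions; the point is only that the site holding EMP needs the ENO values of ASG, not the whole relation, to discard EMP tuples that cannot contribute to the join.

```python
def semijoin(r, s, attr):
    """Return the tuples of r that match at least one tuple of s on attr."""
    keys = {t[attr] for t in s}          # only this projection crosses the network
    return [t for t in r if t[attr] in keys]

# Hypothetical sample tuples.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
ASG = [
    {"ENO": "E1", "PNO": "P1", "DUR": 12},
    {"ENO": "E1", "PNO": "P2", "DUR": 40},
    {"ENO": "E3", "PNO": "P3", "DUR": 48},
]

reduced_emp = semijoin(EMP, ASG, "ENO")
print(len(EMP), len(reduced_emp))        # 3 2
```

Only `reduced_emp` then has to be shipped to the join site. Whether this pays off depends on how much the semijoin shrinks the relation compared with the cost of sending the join-attribute values first.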


Distributed Query Processing Methodology

Input: Calculus Query on Distributed Relations

Control site
  1. Query Decomposition (uses the GLOBAL SCHEMA)
     output: Algebraic Query on Distributed Relations
  2. Data Localization (uses the FRAGMENT SCHEMA)
     output: Fragment Query
  3. Global Optimization (uses STATS ON FRAGMENTS)
     output: Optimized Fragment Query with Communication Operations

Local sites
  4. Local Optimization (uses the LOCAL SCHEMAS)
     output: Optimized Local Queries


Query Decomposition
• Decomposes a calculus query into an algebraic query on global relations in the following steps
  - Calculus query is rewritten in normalized form
  - Normalized query is analyzed semantically
  - Correct query is simplified
  - Calculus query is restructured as an algebraic query


Data Localization
• Input is an algebraic query on distributed relations
• Determines which fragments are involved in the query and transforms a distributed query into a fragment query
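Determining which fragments are involved can be sketched in Python using the ASG fragmentation from the earlier example (ASG1 holds ENO ≤ “E3”, ASG2 holds ENO > “E3”). The sketch handles only an equality predicate on ENO; the `FRAGMENTS` table and the `localize` helper are hypothetical names introduced for illustration.

```python
# Fragment name -> predicate that defines the fragment.
FRAGMENTS = {
    "ASG1": lambda eno: eno <= "E3",
    "ASG2": lambda eno: eno > "E3",
}

def localize(eno_value):
    """Return the fragments that can contain tuples with ENO = eno_value."""
    return [name for name, defines in FRAGMENTS.items() if defines(eno_value)]

print(localize("E5"))   # ['ASG2']
print(localize("E2"))   # ['ASG1']
```

A query such as σ_{ENO=“E5”}(ASG) therefore becomes a query on ASG2 alone, and the site holding ASG1 does not need to be contacted.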


Global Query Optimization
• Find an optimal execution strategy
• Find the best ordering of operations to minimize the cost


Local Query Optimization
• Done at all sites having fragments involved
• Each sub-query is optimized using the local schema of the site
• Uses algorithms of centralized systems


Restructuring
• Convert relational calculus to relational algebra
• Make use of query trees
• Example
  Find the names of employees other than J. Doe who worked on the CAD/CAM project for either 1 or 2 years.

  SELECT ENAME
  FROM   EMP, ASG, PROJ
  WHERE  EMP.ENO = ASG.ENO
  AND    ASG.PNO = PROJ.PNO
  AND    ENAME ≠ “J. Doe”
  AND    PNAME = “CAD/CAM”
  AND    (DUR = 12 OR DUR = 24)

  Query tree (root first, children indented):

  π_ENAME                              Project
    σ_{DUR=12 OR DUR=24}               Select
      σ_{PNAME=“CAD/CAM”}
        σ_{ENAME≠“J. DOE”}
          ⋈_PNO                        Join
            PROJ
            ⋈_ENO
              ASG
              EMP


Example
Recall the previous example:
Find the names of employees other than J. Doe who worked on the CAD/CAM project for either one or two years.

  SELECT ENAME
  FROM   PROJ, ASG, EMP
  WHERE  ASG.ENO = EMP.ENO
  AND    ASG.PNO = PROJ.PNO
  AND    ENAME ≠ “J. Doe”
  AND    PROJ.PNAME = “CAD/CAM”
  AND    (DUR = 12 OR DUR = 24)

Query tree, written as an expression:

  π_ENAME(σ_{DUR=12 OR DUR=24}(σ_{PNAME=“CAD/CAM”}(σ_{ENAME≠“J. DOE”}(PROJ ⋈_PNO (ASG ⋈_ENO EMP)))))


Equivalent Query

π_ENAME
  σ_{PNAME=“CAD/CAM” ∧ (DUR=12 ∨ DUR=24) ∧ ENAME≠“J. DOE”}
    ⋈_{PNO,ENO}
      ASG
      ×
        PROJ
        EMP


Restructuring

π_ENAME
  ⋈_PNO
    π_PNO
      σ_{PNAME="CAD/CAM"}
        PROJ
    π_{PNO,ENAME}
      ⋈_ENO
        π_{PNO,ENO}
          σ_{DUR=12 ∨ DUR=24}
            ASG
        π_{ENO,ENAME}
          σ_{ENAME≠"J. Doe"}
            EMP


Cost Functions
• Total Time (or Total Cost)
  - Reduce each cost (in terms of time) component individually
  - Do as little of each cost component as possible
  - Optimizes the utilization of the resources
• Response Time
  - Do as many things as possible in parallel
  - May increase total time because of increased total activity


Total Cost
Summation of all cost factors

Total cost = CPU cost + I/O cost + communication cost

CPU cost = unit instruction cost × no. of instructions

I/O cost = unit disk I/O cost × no. of disk I/Os

communication cost = message initiation + transmission
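The formula can be written as a small Python function. This is a sketch under assumptions: the unit costs are made-up defaults, and the communication term is split into a per-message initiation cost and a per-byte transmission cost, which is one natural reading of "message initiation + transmission". The name `total_cost` and its parameters are illustrative.

```python
def total_cost(instructions, disk_ios, messages, bytes_sent,
               c_cpu=1, c_io=100, c_msg=500, c_tr=2):
    """Sum the CPU, I/O and communication components (unit costs are hypothetical)."""
    cpu_cost = c_cpu * instructions
    io_cost = c_io * disk_ios
    comm_cost = c_msg * messages + c_tr * bytes_sent
    return cpu_cost + io_cost + comm_cost

# 1000 instructions, 10 disk I/Os, 2 messages carrying 300 bytes in total.
print(total_cost(1000, 10, 2, 300))   # 3600
```

Changing the unit costs is how the different weights for wide area and local area networks would be expressed in such a model.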


Total Cost Factors

• Wide area network
  - message initiation and transmission costs high
  - local processing cost is low
  - ratio of communication to I/O costs = 20:1
• Local area networks
  - communication and local processing costs are more or less equal
  - ratio = 1:1.6


Response Time

Elapsed time between the initiation and the completion of a query

Response time = CPU time + I/O time + communication time

CPU time = unit instruction time × no. of sequential instructions

I/O time = unit I/O time × no. of sequential I/Os

communication time = unit msg initiation time × no. of sequential msgs + unit transmission time × no. of sequential bytes
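The word "sequential" is what separates response time from total cost. As an illustration, assume two sites each send one message to a third site at the same time, carrying x and y bytes respectively, and consider only the communication component. The scenario, the unit values, and the function names below are assumptions made up for the example.

```python
C_MSG = 5   # unit message initiation time (hypothetical)
C_TR = 1    # unit transmission time per byte (hypothetical)

def total_comm_cost(x, y):
    """Every message and every byte counts, regardless of overlap."""
    return 2 * C_MSG + C_TR * (x + y)

def comm_response_time(x, y):
    """The two transfers run in parallel, so only the slower one counts."""
    return max(C_MSG + C_TR * x, C_MSG + C_TR * y)

print(total_comm_cost(100, 40))      # 150
print(comm_response_time(100, 40))   # 105
```

Splitting work across more sites can lower the second figure while raising the first, which is the trade-off noted on the Cost Functions slide.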
