0% found this document useful (0 votes)
291 views42 pages

ADB - Unit - III (Chapter-2) - Query Processing and Decomposition

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
291 views42 pages

ADB - Unit - III (Chapter-2) - Query Processing and Decomposition

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

Advanced Databases

UNIT: III (Chapter-2)


Query Processing and
Decomposition
Reference:
Chapter – 6 & 7
Principles of Distributed Database Systems, M.Tamer Ozsu, Patrick Valduriez, 3rd Edition,
Springer

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/1


Outline
• Objectives of Query Processing
• Characterization of query processors
• Layers of query processing
• Query decomposition
• Localization of distributed data

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/2


Query Processing in a DDBMS
high level user query

query
processor

Low-level data manipulation


commands for D-DBMS

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/3


Query Processing Components
• Query language that is used

➡ SQL

• Query execution methodology

➡ The steps that one goes through in executing high-level (declarative) user
queries.

• Query optimization

➡ How do we determine the “best” execution plan?

• We assume a homogeneous D-DBMS

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/4


Selecting Alternatives

EMP(ENO, ENAME, TITLE)


ASG(ENO, PNO, RESP, DUR)

SELECT ENAME
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = "Manager“

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/5


Selecting Alternatives
EMP(ENO, ENAME, TITLE)
SELECT ENAME
ASG(ENO, PNO, RESP, DUR)
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = "Manager"

Strategy 1
ENAME(RESP=“Manager”EMP.ENO=ASG.ENO(EMP×ASG))
Strategy 2
 ENAME(EMP ⋈ENO (RESP=“Manager” (ASG))

Strategy 2 avoids Cartesian product, so may be “better”

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/6


What is the Problem?
Site 1 Site 2 Site 3 Site 4 Site 5
ASG1=ENO≤“E3”(ASG) ASG2= ENO>“E3”(ASG) EMP1= ENO≤“E3”(EMP) EMP2= ENO>“E3”(EMP) Result

Site 5
Strategy-B
Strategy-A Site 5
result  EMP1'  EMP2' result= (EMP1 × EMP2)⋈ENOσRESP=“Manager”(ASG1× ASG2)

EMP1' EMP2'
Site 3 Site 4 ASG1 ASG2 EMP1 EMP2
EMP’1=EMP1 ⋈ENO ASG’1 EMP’2=EMP2 ⋈ENO ASG’2
Site 1 Site 2 Site 3 Site 4

ASG 1' ASG '2


Site 1 Site 2
ASG 1'  σ RESP "Manager"ASG 1 ASG '2  σ RESP  "Manager"ASG 2

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/7


Cost of Alternatives
• Assume
➡ size(EMP) = 400, size(ASG) = 1000 20 Managers in ASG
➡ tuple access cost = 1 unit tuple transfer cost = 10 units
• Strategy-A
➡ produce ASG': (10+10)  tuple access cost =20
➡ transfer ASG' to the sites of EMP: (10+10)  tuple transfer cost=200
➡ produce EMP': (10+10)  tuple access cost  2=40
➡ transfer EMP' to result site: (10+10)  tuple transfer cost= 200
Total Cost 460
• Strategy-B
➡ transfer EMP to site 5: 400  tuple transfer cost 4,000
➡ transfer ASG to site 5: 1000  tuple transfer cost 10,000
➡ produce ASG': 1000  tuple access cost 1,000
➡ join EMP and ASG': 400  20  tuple access cost 8,000
Total Cost 23,000
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/8
Objectives of Query Processing
• Minimize a cost function
I/O cost + CPU cost + Communication cost
These might have different weights in different distributed environments
• Wide area networks
➡ Communication cost may dominate or vary much
✦ bandwidth
✦ speed
✦ high protocol overhead
• Local area networks
➡ communication cost not that dominant
➡ total cost function should be considered
• Can also maximize throughput

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/9


Characteristics of Query Processors

• Languages
• Types of Optimizers
• Optimization Timing
• Statistics
• Decision Sites
• Network Topology
• Replicated Fragments
• Use of Semijoins
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/10
Characteristics of Query Processors
Complexity of Relational Operations

Operation Complexity

Select
• Assume Project O(n)
(without duplicate elimination)
➡ relations of cardinality n
➡ sequential scan Project
(with duplicate elimination) O(n  log n)
Group

Join
Semi-join O(n  log n)
Division
Set Operators

Cartesian Product O(n2)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/11


Characteristics of Query Processors
Types of Optimization
• Exhaustive search – all possible execution strategies are considered
➡ Cost-based

➡ Optimal

➡ Combinatorial complexity in the number of relations

• Heuristics
➡ Not optimal

➡ Regroup common sub-expressions

➡ Perform selection, projection first

➡ Replace a join by a series of semijoins

➡ Reorder operations to reduce intermediate relation size

➡ Optimize individual operations

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/12


Characteristics of Query Processors
Optimization Granularity
• Single query at a time

➡ Cannot use common intermediate results

• Multiple queries at a time

➡ Efficient if many similar queries

➡ Decision space is much larger

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/13


Characteristics of Query Processors
Optimization Timing
• Static
➡ Compilation  optimize prior to the execution
➡ Difficult to estimate the size of the intermediate results. error
propagation
➡ Can amortize over many executions
• Dynamic
➡ Run time optimization
➡ Exact information on the intermediate relation sizes
➡ Have to reoptimize for multiple executions
➡ Distributed INGRES
• Hybrid
➡ Compile using a static algorithm
➡ If the error in estimate sizes > threshold, reoptimize at run time
➡ Mermaid

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/14


Characteristics of Query Processors
Statistics
• Relation
➡ Cardinality
➡ Size of a tuple
➡ Fraction of tuples participating in a join with another relation
• Attribute
➡ Cardinality of domain
➡ Actual number of distinct values
• Common assumptions
➡ Independence between different attribute values
➡ Uniform distribution of attribute values within their domain

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/15


Characteristics of Query Processors
Decision Sites
• Centralized
➡ Single site determines the “best” schedule
➡ Simple
➡ Need knowledge about the entire distributed database

• Distributed
➡ Cooperation among sites to determine the schedule
➡ Need only local information
➡ Cost of cooperation

• Hybrid
➡ One site determines the global schedule
➡ Each site optimizes the local subqueries

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/16


Characteristics of Query Processors
Network Topology
• Wide area networks (WAN) – point-to-point
➡ Characteristics
✦ Low bandwidth
✦ Low speed
✦ High protocol overhead
➡ Communication cost will dominate; ignore all other cost factors
➡ Global schedule to minimize communication cost
➡ Local schedules according to centralized query optimization

• Local area networks (LAN)


➡ Communication cost not that dominant
➡ Total cost function should be considered
➡ Broadcasting can be exploited (joins)
➡ Special algorithms exist for star networks

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/17


Layering scheme for Distributed
Query Processing
Calculus Query on Distributed Relations

Query GLOBAL
Decomposition SCHEMA

Algebraic Query on Distributed


Relations
CONTROL
Data FRAGMENT
SITE Localization SCHEMA

Fragment Query

Global STATS ON
Optimization FRAGMENTS

Optimized Fragment Query


with Communication Operations

LOCAL Local LOCAL


Optimization SCHEMAS
SITES

Optimized Local Queries


Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/18
Query Decomposition
&
Localization of distributed data

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/19


Query Decomposition
Input : Calculus query on global relations
• Normalization
➡ manipulate query quantifiers and qualification
• Semantic Analysis
➡ detect and reject “incorrect” queries
➡ possible for only a subset of relational calculus
• Simplification
➡ eliminate redundant predicates
• Restructuring
➡ calculus query  algebraic query
➡ more than one translation is possible
➡ use transformation rules

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/20


Normalization
• Lexical and syntactic analysis
➡ check validity (similar to compilers)

➡ check for attributes and relations

➡ type checking on the qualification

• Put into normal form


➡ Conjunctive normal form

(p11 p12  …  p1n)  …  (pm1  pm2  …  pmn)


➡ Disjunctive normal form

(p11  p12  …  p1n)  …  (pm1  pm2  …  pmn)


➡ OR's mapped into union

➡ AND's mapped into join or selection


Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/21
Semantic Analysis
• prove incorrect queries
• Type incorrect
➡ If any of its attribute or relation names are not defined in the global schema
➡ If operations are applied to attributes of the wrong type
• Semantically incorrect
➡ Components do not contribute in any way to the generation of the result
➡ Only a subset of relational calculus queries can be tested for correctness
➡ Those that do not contain disjunction and negation
➡ To detect
✦ connection graph (query graph)
✦ join graph

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/22


Semantic Analysis – Example
SELECT ENAME,RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR ≥ 36
AND TITLE = "Programmer"

Query graph Join graph


DUR≥36

ASG ASG
EMP.ENO=ASG.ENO ASG.PNO=PROJ.PNO EMP.ENO=ASG.ENO ASG.PNO=PROJ.PNO

TITLE =
EMP RESP PROJ EMP PROJ
“Programmer”

ENAME
RESULT
PNAME=“CAD/CAM”

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/23


Semantic Analysis
If the query graph is not connected, the query may be wrong or
use Cartesian product
SELECT ENAME,RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND PNAME = "CAD/CAM"
AND DUR > 36
AND TITLE = "Programmer"

ASG

EMP RESP PROJ

ENAME
RESULT
PNAME=“CAD/CAM”

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/24


Simplification
• Why simplify?

➡ Remember the example


• How? Use transformation rules

➡ Elimination of redundancy
✦ idempotency rules
p1  ¬( p1)  false
p1  (p1p2)  p1
p1  false  p1

➡ Application of transitivity
➡ Use of integrity rules

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/25


Simplification – Example
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"
OR (NOT(EMP.TITLE = "Programmer")
AND (EMP.TITLE = "Programmer"
OR EMP.TITLE = "Elect. Eng.")
AND NOT(EMP.TITLE = "Elect. Eng."))


SELECT TITLE
FROM EMP
WHERE EMP.ENAME = "J. Doe"

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/26


Restructuring
• Convert relational calculus to relational ENAME Project
algebra
• Make use of query trees σDUR=12 OR DUR=24
• Example
Find the names of employees other than
J. Doe who worked on the CAD/CAM
σPNAME=“CAD/CAM” Select
project for either 1 or 2 years.
SELECT ENAME σENAME≠“J. DOE”
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO ⋈PNO
AND ASG.PNO = PROJ.PNO
AND ENAME≠ "J. Doe" ⋈ENO Join
AND PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24) PROJ ASG EMP
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/27
Example
Recall the previous example: ENAME
Project
Find the names of employees other
than J. Doe who worked on the DUR=12  DUR=24
CAD/CAM project for either one or
two years.
PNAME=“CAD/CAM” Select
SELECT ENAME
FROM PROJ, ASG, EMP ENAME≠“J. DOE”
WHERE ASG.ENO=EMP.ENO
AND ASG.PNO=PROJ.PNO ⋈PNO
AND ENAME ≠ "J. Doe"
AND PROJ.PNAME="CAD/CAM" ⋈ENO Join

AND (DUR=12 OR DUR=24)


PROJ ASG EMP
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/28
Equivalent Query
ENAME

PNAME=“CAD/CAM”  (DUR=12  DUR=24) ENAME≠“J. Doe”

⋈PNO,ENO

EMP PROJ ASG


Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/29
Data Localization
Input: Algebraic query on distributed relations
• Determine which fragments are involved
• Localization program
➡ substitute for each global query its materialization program

➡ optimize

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/30


Example
Assume ENAME

➡ EMP is fragmented into EMP1, EMP2, DUR=12 DUR=24


EMP3 as follows:
✦ EMP1= ENO≤“E3”(EMP) PNAME=“CAD/CAM”
✦ EMP2= “E3”<ENO≤“E6”(EMP)
ENAME≠“J. DOE”
✦ EMP3= ENO≥“E6”(EMP)
➡ ASG fragmented into ASG1 and ASG2 ⋈PNO
as follows:
✦ ASG1= ENO≤“E3”(ASG) ⋈ENO

✦ ASG2= ENO>“E3”(ASG) PROJ  


Replace EMP by (EMP1  EMP2  EMP3)
and ASG by (ASG1  ASG2) in any query EMP1EMP2 EMP3 ASG1 ASG2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/31
Provides Parallellism

⋈ENO ⋈ENO ⋈ENO ⋈ENO

EMP1 ASG1 EMP2 ASG2 EMP3 ASG1 EMP3 ASG2

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/32


Eliminates Unnecessary Work

⋈ENO ⋈ENO ⋈ENO

EMP1 ASG1 EMP2 ASG2 EMP3 ASG2

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/33


Reduction for PHF
• Reduction with selection
➡ Relation R and FR={R1, R2, …, Rw} where Rj=p (R)
j

pi(Rj)= if x in R: ¬(pi(x) pj(x))


➡ Example
SELECT *
FROM EMP
WHERE ENO="E5"
ENO=“E5” ENO=“E5”

EMP1 EMP2 EMP3 EMP2


Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/34
Reduction for PHF
• Reduction with join

➡ Possible if fragmentation is done on join attribute

➡ Distribute join over union

(R1 R2)⋈S  (R1⋈S)  (R2⋈S)

➡ Given Ri =p (R) and Rj = p (R)


i j

Ri ⋈Rj = if x in Ri, y in Rj: ¬(pi(x) pj(y))

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/35


Reduction for PHF
• Assume EMP is fragmented as ⋈ENO
before and
➡ ASG1: ENO ≤ "E3"(ASG)  
➡ ASG2: ENO > "E3"(ASG)
• Consider the query EMP1 EMP2 EMP3 ASG1 ASG2
SELECT *
FROM EMP,ASG
WHERE EMP.ENO=ASG.ENO 
• Distribute join over unions
• Apply the reduction rule ⋈ENO ⋈ENO ⋈ENO

EMP1 ASG1 EMP2 ASG2 EMP3 ASG2

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/36


Reduction for VF
• Find useless (not empty) intermediate relations

Relation R defined over attributes A = {A1, ..., An} vertically fragmented


as Ri =A'(R) where A' A:
D,K(Ri) is useless if the set of projection attributes D is not in A'
Example: EMP1=ENO,ENAME (EMP); EMP2=ENO,TITLE (EMP)

SELECT ENAME
FROM EMP
ENAME ENAME

⋈ENO

EMP1 EMP2 EMP1


Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/37
Reduction for DHF
• Rule :
➡ Distribute joins over unions
➡ Apply the join reduction for horizontal fragmentation
• Example
ASG1: ASG ⋉ENO EMP1
ASG2: ASG ⋉ENO EMP2
EMP1: TITLE=“Programmer” (EMP)
EMP2: TITLE=“Programmer” (EMP)
• Query
SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND EMP.TITLE = "Mech. Eng."
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/38
Reduction for DHF
Generic query ⋈ENO
TITLE=“Mech. Eng.”

 

ASG1 ASG2 EMP1 EMP2

Selections first ⋈ENO

 TITLE=“Mech. Eng.”

ASG1 ASG2 EMP2


Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/39
Reduction for DHF

Joins over unions

⋈ENO ⋈ENO

TITLE=“Mech. Eng.” TITLE=“Mech. Eng.”

ASG1 EMP2 ASG2 EMP2


Elimination of the empty intermediate relations
(left sub-tree) ⋈ENO

TITLE=“Mech. Eng.”

ASG2 EMP2
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/40
Reduction for Hybrid Fragmentation
• Combine the rules already specified:
➡ Remove empty relations generated by contradicting selections on horizontal
fragments;

➡ Remove useless relations generated by projections on vertical fragments;

➡ Distribute joins over unions in order to isolate and remove useless joins.

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/41


Reduction for Hybrid Fragmentation
Example
ENAME
Consider the following hybrid
fragmentation: ENAME
ENO=“E5”
EMP1= ENO≤"E4" (ENO,ENAME (EMP))

EMP2= ENO>"E4" (ENO,ENAME (EMP))


⋈ENO
 ENO=“E5”
EMP3= ENO,TITLE (EMP)

and the query 


EMP2
SELECT ENAME
FROM EMP
WHERE ENO="E5" EMP1 EMP2 EMP3

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.6/42

You might also like