DDBS Lecture5

Distributed Databases

Lecture 5

* Distributed Databases - CSC451 1


Recap

● Horizontal Fragmentation
● Derived Horizontal Fragmentation
● Vertical Fragmentation



Outline

● Distributed Query Processing
    ● SQL
    ● Query Optimization
    ● Query Processing Issues
        ● Optimization timing
        ● Statistics
        ● Decision sites
        ● Network Topologies


Distributed Query Processing

High level user query → Query Processor → Low level data processing


Distributed Query Processing Methodology



Problem?



Problem in DDBS?

Fragments stored at different sites

Site 1: ASG1 = σ DUR>37 (ASG)
Site 2: ASG2 = σ DUR<=37 (ASG)
Site 3: EMP1 = σ ENO<=E3 (EMP)
Site 4: EMP2 = σ ENO>E3 (EMP)
Site 5: Result


Problem in DDBS?

[Figure: Site 5 at the top; EMP1 (Site 3) and EMP2 (Site 4) below it; ASG1 (Site 1) and ASG2 (Site 2) at the bottom]


SQL
● A typical SQL query has the form:
(PROJECTION) select A1, A2, ..., An
(CART. PROD.) from r1, r2, ..., rm
(SELECTION) where P
● Each Ai represents an attribute
● Each ri represents a relation
● P is a predicate.
● The result of an SQL query is a relation
● SQL uses the terms schema, table, row and
column for database, relation, tuple and
attribute, respectively
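
The mapping of the three clauses to relational operators can be sketched in Python. This is a naive illustration, not how a real query processor evaluates SQL: the function name `sql_query`, the dictionary-based row format and the sample EMP and ASG rows are all assumptions made for this example.

```python
from itertools import product

def sql_query(select, relations, where):
    """Naively evaluate SELECT select FROM relations WHERE where.

    relations: dict mapping a relation name to a list of row dicts.
    where: predicate over one combined row (keys look like "EMP.ENO").
    """
    names = list(relations)
    result = []
    # FROM: Cartesian product of all listed relations
    for combo in product(*(relations[n] for n in names)):
        row = {f"{n}.{attr}": val
               for n, tup in zip(names, combo)
               for attr, val in tup.items()}
        # WHERE: selection with predicate P
        if where(row):
            # SELECT: projection onto the requested attributes
            result.append(tuple(row[a] for a in select))
    return result

# Hypothetical sample data, not taken from the lecture
EMP = [{"ENO": "E1", "ENAME": "J. Doe"}, {"ENO": "E2", "ENAME": "M. Smith"}]
ASG = [{"ENO": "E1", "DUR": 40}, {"ENO": "E2", "DUR": 12}]

rows = sql_query(
    select=["EMP.ENAME", "ASG.DUR"],
    relations={"EMP": EMP, "ASG": ASG},
    where=lambda r: r["EMP.ENO"] == r["ASG.ENO"] and r["ASG.DUR"] > 37,
)
print(rows)  # [('J. Doe', 40)]
```

The result is itself a list of tuples, which mirrors the point that the result of an SQL query is a relation.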

Query Optimization Objective

Minimize cost function

● I/O cost + CPU cost + communication cost
● Different weights apply to different distributed systems

Wide area networks

● Communication cost dominates
● Most algorithms ignore all other cost components

Local area networks

● Communication cost is not as dominant
● Total cost function is considered
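
A minimal sketch of such a weighted cost function follows. The weights, the function name `total_cost` and the two candidate plans are hypothetical values chosen only to show that the preferred plan can change between a WAN setting (communication cost only) and a LAN setting (all components counted).

```python
def total_cost(io, cpu, comm, w_io=1.0, w_cpu=1.0, w_comm=1.0):
    """Weighted sum of I/O, CPU and communication cost."""
    return w_io * io + w_cpu * cpu + w_comm * comm

# Hypothetical plans: (I/O cost, CPU cost, communication cost)
plans = {"A": (100, 50, 20), "B": (10, 10, 40)}

# WAN: communication dominates, other components are ignored (weight 0)
wan = {name: total_cost(*c, w_io=0.0, w_cpu=0.0) for name, c in plans.items()}
# LAN: the total cost function is considered
lan = {name: total_cost(*c) for name, c in plans.items()}

best_wan = min(wan, key=wan.get)
best_lan = min(lan, key=lan.get)
print(best_wan, best_lan)  # A B
```

With these made-up numbers, plan A wins under the WAN weighting because it ships less data, while plan B wins once local I/O and CPU work are counted.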

Query Processing Issues

● Processing of queries involves considering
    ● Optimization timing
    ● Statistics
    ● Decision sites
    ● Network Topology


1. Optimization timing

● Static
    ● Strategy of transmissions and local processing activities is fully determined before execution begins, at compile time
    ● Difficult to estimate the size of intermediate results
● Dynamic
    ● Run-time optimization
    ● Each step is decided only after seeing the results of previous steps
    ● Optimization is repeated for every execution, whereas a static strategy can be reused across multiple executions


Optimization timing - Static

● Input: database statistics such as
    ● relation sizes
    ● attribute sizes
    ● projected sizes of attributes, etc.
● Produces as output:
    ● a strategy for answering the query (a pattern of what transmissions to make, when and where; what local processing to do, when and where)


Optimization timing – Static…

● Involves 4 steps (Local Processing, Reduction, Transport, Completion)
    ● Local Processing
        ● Does all the processing (projections, selections and joins) that can be done initially at each site without data interchange between sites
        ● The end result of this phase is one participating relation at each participating site
    ● Reduction
        ● Selected "semijoins" are done to reduce the size of the participating relations by eliminating tuples that are not needed to answer the query
    ● Transport
        ● Send that one relation from each participating site (the result of the reduction phase) to the querying site
    ● Completion
        ● Finish processing using those relations to get the final answer (e.g., final projections, selections and joins)

● It is commonly assumed that all local processing has a cost of zero in response time and that all transmissions have a linear response-time cost (sending X bytes causes a delay of AX + B time units).


Optimization timing – Static…

● Consider the following distributed query. Assume R1 is at site 1 and R2 is at site 2, and the query arriving at site 3 is:

SELECT R1.A2, R2.A2
FROM R1, R2
WHERE R1.A1 = R2.A1

At site 1, R1:

A1 A2 A3 A4 A5 A6 A7 A8 A9
a  A  A  B  C  C  E  A  F
a  C  D  D  E  A  A  B  B
b  A  B  C  D  B  A  B  A
c  D  D  B  B  A  C  A  C
e  E  B  A  A  C  C  D  D

At site 2, R2:

A1 A2
d  1
e  2
g  3


Optimization timing – Static…

● Assume the response time for transmission of X bytes between any 2 sites is R(X) = X + 10 time units

Strategy-1: (No reduction phase)

1. Send R1 to site 3: 45 bytes sent, cost R(45) = 45 + 10 = 55.
2. Send R2 to site 3: 6 bytes sent, cost R(6) = 6 + 10 = 16.
3. Final join (costs 0) gives eEBAACCDD2.
● Response time = 55 + 16 = 71.
● However, if steps 1 and 2 are done in parallel, response time = max(55, 16) = 55.
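
The arithmetic of Strategy 1 can be checked with a few lines of Python. This is only a sketch of the response-time model stated on the slide; the names `transfer_time`, `serial_time` and `parallel_time` are chosen for this example.

```python
def transfer_time(num_bytes, a=1, b=10):
    """Linear response-time model R(X) = a*X + b; the example uses a=1, b=10."""
    return a * num_bytes + b

R1_BYTES = 45  # 5 tuples of 9 one-byte values
R2_BYTES = 6   # 3 tuples of 2 one-byte values

t_r1 = transfer_time(R1_BYTES)  # send R1 to site 3
t_r2 = transfer_time(R2_BYTES)  # send R2 to site 3

# Local processing (the final join) is assumed to cost 0
serial_time = t_r1 + t_r2
# If both transfers overlap, the slower one determines the response time
parallel_time = max(t_r1, t_r2)

print(serial_time, parallel_time)  # 71 55
```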


Optimization timing – Static…

Strategy-2: (Reduction: R2 A1-semijoin R1)

1. Project: R2[A1] = d, e, g. Cost = 0.
2. Send R2[A1] to site 1: R(3) = 3 + 10 = 13.
3. Do R2 A1-semijoin R1, giving eEBAACCDD. Cost = 0.
4. Send reduced R1 to site 3: R(9) = 9 + 10 = 19.
5. Send R2 to site 3: R(6) = 6 + 10 = 16.
6. Final join gives eEBAACCDD2.
● Response time = 13 + 19 + 16 = 48.
● If steps 2 and 5 are done in parallel, response time = max(13 + 19, 16) = 32.
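
The six steps of Strategy 2 can be traced in Python on the example data. This is a sketch under the assumption, implied by the byte counts on the slides, that every attribute value occupies one byte, so each tuple is written as a short string whose first character is A1. The helper names are hypothetical.

```python
# Example data: one byte per value, first character of each tuple is A1
R1 = ["aAABCCEAF", "aCDDEAABB", "bABCDBABA", "cDDBBACAC", "eEBAACCDD"]  # site 1
R2 = ["d1", "e2", "g3"]                                                # site 2

def transfer_time(num_bytes):
    """R(X) = X + 10 time units, as assumed in the example."""
    return num_bytes + 10

def size(tuples):
    """Total number of bytes in a collection of tuples."""
    return sum(len(t) for t in tuples)

# Step 1: project R2 on A1 at site 2 (local, cost 0)
r2_a1 = sorted({t[0] for t in R2})
# Step 2: ship R2[A1] to site 1
t_keys = transfer_time(size(r2_a1))
# Step 3: semijoin at site 1 keeps only R1 tuples whose A1 appears in R2[A1]
reduced_r1 = [t for t in R1 if t[0] in r2_a1]
# Step 4: ship the reduced R1 to site 3
t_r1 = transfer_time(size(reduced_r1))
# Step 5: ship R2 to site 3
t_r2 = transfer_time(size(R2))
# Step 6: final join at site 3 (local, cost 0)
result = [t1 + t2[1] for t1 in reduced_r1 for t2 in R2 if t1[0] == t2[0]]

serial_time = t_keys + t_r1 + t_r2
# Steps 2 to 4 form a chain; step 5 can overlap with that chain
parallel_time = max(t_keys + t_r1, t_r2)

print(result, serial_time, parallel_time)  # ['eEBAACCDD2'] 48 32
```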


Optimization timing – Static…

● So clearly the reduction phase can reduce the response time of a query.
● For static algorithms, the hard part is to decide at site 3 what strategy to use without knowing exactly what the data looks like at the other two sites


Optimization timing - Dynamic

● There is a need to estimate the results of the steps above, since the actual results are not known in advance at site 3
● The estimation method is important, because the situation can be very different from the above


Optimization timing - Dynamic

● Same query as before but different data:

SELECT R1.A2, R2.A2
FROM R1, R2
WHERE R1.A1 = R2.A1

At site 1, R1:

A1 A2 A3 A4 A5 A6 A7 A8 A9
d  A  A  B  C  C  E  A  F
d  C  D  D  E  A  A  B  B
e  A  B  C  D  B  A  B  A
g  D  D  B  B  A  C  A  C
e  E  B  A  A  C  C  D  D

At site 2, R2:

A1 A2
d  1
e  2
g  3


Optimization timing – Dynamic…

● Then Strategy 2 would be:

1. Project: R2[A1] = d, e, g. Cost = 0.
2. Send R2[A1] to site 1: R(3) = 3 + 10 = 13.
3. R2 A1-semijoin R1 keeps every tuple of R1:
   dAABCCEAF
   dCDDEAABB
   eABCDBABA
   gDDBBACAC
   eEBAACCDD
   Cost = 0.
4. Send reduced R1 to site 3: R(45) = 45 + 10 = 55.
5. Send R2 to site 3: R(6) = 6 + 10 = 16.


Optimization timing – Dynamic…

● Final join:
   dAABCCEAF1
   dCDDEAABB1
   eABCDBABA2
   gDDBBACAC3
   eEBAACCDD2
● Response time = 13 + 55 + 16 = 84

● In fact, Strategy 1 (response time 71) would be better for this data situation
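
To see how the better strategy depends on the data, both strategies can be run against both versions of R1. The sketch below assumes one byte per value and serial transmissions; `strategy_1`, `strategy_2` and the variable names are hypothetical helpers introduced for this comparison.

```python
def transfer_time(num_bytes):
    """R(X) = X + 10 time units, as assumed in the example."""
    return num_bytes + 10

def size(tuples):
    return sum(len(t) for t in tuples)

def strategy_1(r1, r2):
    """No reduction: ship both relations to the querying site."""
    return transfer_time(size(r1)) + transfer_time(size(r2))

def strategy_2(r1, r2):
    """Reduce R1 with a semijoin on A1 (first byte) before shipping it."""
    keys = {t[0] for t in r2}
    reduced = [t for t in r1 if t[0] in keys]
    return (transfer_time(size(keys))       # ship R2[A1] to site 1
            + transfer_time(size(reduced))  # ship reduced R1 to site 3
            + transfer_time(size(r2)))      # ship R2 to site 3

R2 = ["d1", "e2", "g3"]
FIRST_R1 = ["aAABCCEAF", "aCDDEAABB", "bABCDBABA", "cDDBBACAC", "eEBAACCDD"]
SECOND_R1 = ["dAABCCEAF", "dCDDEAABB", "eABCDBABA", "gDDBBACAC", "eEBAACCDD"]

first = (strategy_1(FIRST_R1, R2), strategy_2(FIRST_R1, R2))
second = (strategy_1(SECOND_R1, R2), strategy_2(SECOND_R1, R2))
print(first)   # (71, 48): the semijoin pays off
print(second)  # (71, 84): the semijoin only adds a transmission
```

On the second data set the semijoin removes nothing, so Strategy 2 pays for the extra transfer of R2[A1] without shrinking R1.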


Optimization timing – Dynamic…

● The question is: "How should reduction phase results be estimated?"
● Using SELECTIVITY THEORY (the selectivity of an attribute is the ratio of the number of values present to the number of values possible), the selectivity of R2.A1 is 3/26
● Using selectivity theory, the size of the semijoin R2 A1-semijoin R1 can be estimated as:
    ● (original size of R1) * (selectivity of incoming attribute, R2.A1), or 45 * 3/26 = 5.2
● Selectivity theory estimates that 5.2 bytes of R1 will survive the semijoin
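
The estimate can be reproduced with a short calculation. In this sketch the domain size of 26 is taken from the 3/26 figure on the slide (presumably the 26 lowercase letters, which is an assumption), and the function names are hypothetical.

```python
def selectivity(values_present, domain_size):
    """Number of distinct values present divided by number of values possible."""
    return len(set(values_present)) / domain_size

def estimated_semijoin_size(relation_bytes, incoming_selectivity):
    """Estimated bytes of the relation that survive the semijoin."""
    return relation_bytes * incoming_selectivity

sel_r2_a1 = selectivity(["d", "e", "g"], domain_size=26)   # 3/26
estimate = estimated_semijoin_size(45, sel_r2_a1)          # 45 * 3/26

print(round(estimate, 1))  # 5.2
```

For comparison, the actual reduced R1 was 9 bytes with the first data set and 45 bytes with the second, so the estimate can be far from what really happens.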




2. Statistics

Based on
● Relation
    ● Cardinality
    ● Size of tuple
    ● Fraction of tuples participating in a join with other relations
● Attribute
    ● Actual number of distinct values
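
As an illustration, these statistics can be computed for the R1 and R2 of the first example. The sketch assumes one byte per value with A1 as the first byte of each tuple; the function `relation_statistics` and its dictionary keys are hypothetical names.

```python
R1 = ["aAABCCEAF", "aCDDEAABB", "bABCDBABA", "cDDBBACAC", "eEBAACCDD"]
R2 = ["d1", "e2", "g3"]

def relation_statistics(relation, other):
    """Collect simple optimizer statistics for `relation`, joined with `other` on A1."""
    other_keys = {t[0] for t in other}
    matching = sum(1 for t in relation if t[0] in other_keys)
    return {
        "cardinality": len(relation),                  # number of tuples
        "tuple_size": len(relation[0]),                # bytes per tuple
        "distinct_A1": len({t[0] for t in relation}),  # distinct values of A1
        "join_fraction": matching / len(relation),     # share of tuples that join
    }

stats = relation_statistics(R1, R2)
print(stats)
# {'cardinality': 5, 'tuple_size': 9, 'distinct_A1': 4, 'join_fraction': 0.2}
```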


3. Decision Sites

Based on
● Centralized
    ● Single site determines the best schedule
    ● Simple
    ● Needs knowledge about the entire database
● Distributed
    ● Cooperation among sites to determine the schedule
    ● Needs only local information
    ● Cost of cooperation
● Hybrid
    ● One site determines the global schedule
    ● Each site optimizes the local subqueries


4. Network Topology

Based on
● WAN
    ● Global schedule to minimize communication cost
● LAN
    ● Broadcasting can be exploited


Summary

● Distributed Query Processing
    ● SQL
    ● Query Optimization
    ● Query Processing Issues
        ● Optimization timing
        ● Statistics
        ● Decision sites
        ● Network Topologies
