
QueryProcessing Lect 3

This document discusses distributed query processing. It explains that a distributed query processor must decompose high-level queries into data manipulation commands and consider communication costs to optimize query plans. Two example query plans over distributed relations are provided to demonstrate how minimizing network traffic can reduce query processing costs. The objectives, components, and optimization techniques of distributed query processing are outlined.

Uploaded by

ally

4. Distributed Query Processing

Lecture 3

Overview of Query Processing


2019-2020 3rd Sem2 NW
Morning/Evening/Dr. Salma

Query Processing

High-level user query
↓
Query Processor
↓
Low-level data manipulation commands

Query Processing Components
● Query language that is used
⬥ SQL (Structured Query Language)
● Query execution methodology
⬥ The steps that the system goes through in executing
high-level (declarative) user queries
● Query optimization
⬥ How to determine the “best” execution plan?

Query Language
● Tuple calculus: { t | F(t) }
where t is a tuple variable, and F(t) is a well formed formula
● Example:
⬥ Get the numbers and names of all programmers:
{ t[ENO, ENAME] | t ∈ EMP ∧ t[TITLE] = "Programmer" }

Query Language (cont.)
● Domain calculus: { x1, x2, ..., xn | F(x1, x2, ..., xn) }
where xi is a domain variable, and F(x1, x2, ..., xn) is a well-formed formula
● Example:
{ x, y | EMP(x, y, "Programmer") }

Variables are position sensitive!

Query Language (cont.)
● SQL is a tuple calculus language.

SELECT ENO, ENAME
FROM EMP
WHERE TITLE = 'Programmer'

End users use non-procedural (declarative) languages to express queries.
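As a rough illustration of the two calculus styles, the same query can be written as Python comprehensions. This is only a sketch: the sample rows and the `Emp` type are hypothetical and are not taken from the lecture's example database.

```python
from typing import NamedTuple

class Emp(NamedTuple):
    eno: str
    ename: str
    title: str

# Hypothetical sample rows, invented for this illustration.
EMP = [
    Emp("E1", "Ann", "Programmer"),
    Emp("E2", "Bob", "Analyst"),
    Emp("E3", "Cy", "Programmer"),
]

# Tuple calculus style: one variable t ranges over whole tuples.
tuple_style = [(t.eno, t.ename) for t in EMP if t.title == "Programmer"]

# Domain calculus style: x, y, z bind to attribute values by position,
# which is why the variables are position sensitive.
domain_style = [(x, y) for (x, y, z) in EMP if z == "Programmer"]

print(tuple_style)
print(domain_style)
```

Both comprehensions state what is wanted rather than how to fetch it, which mirrors the declarative nature of the SQL query above.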
Query Processing Objectives & Problems

● Query processor transforms queries into procedural operations to access data in an optimal way.

● Distributed query processor has to deal with query decomposition and data localization.

DB example

Figure 3.3

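Centralized Query Processing Alternatives
SELECT ENAME
FROM EMP E, ASG G
WHERE E.ENO = G.ENO AND RESP = 'manager'

● Strategy 1: Π ENAME (σ RESP=“manager” ∧ E.ENO=G.ENO (EMP × ASG))

● Strategy 2: Π ENAME (EMP ⋈ ENO σ RESP=“manager” (ASG))

Strategy 2 avoids the Cartesian product, so it is “better”.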
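The difference between the two strategies can be sketched in Python by counting the intermediate tuples each one builds. The sample data and the function names `strategy_1` and `strategy_2` are hypothetical; the relations are reduced to the attributes the query needs.

```python
from itertools import product

# Hypothetical sample data: EMP(ENO, ENAME) and ASG(ENO, RESP).
EMP = [("E1", "Ann"), ("E2", "Bob"), ("E3", "Cy")]
ASG = [("E1", "manager"), ("E2", "analyst"), ("E3", "manager"), ("E3", "analyst")]

def strategy_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    pairs = list(product(emp, asg))  # |EMP| * |ASG| intermediate tuples
    names = [e[1] for e, g in pairs if e[0] == g[0] and g[1] == "manager"]
    return names, len(pairs)

def strategy_2(emp, asg):
    """Selection on ASG first, then join on ENO, then projection on ENAME."""
    selected = [g for g in asg if g[1] == "manager"]  # small intermediate result
    name_by_eno = {eno: ename for eno, ename in emp}
    names = [name_by_eno[g[0]] for g in selected if g[0] in name_by_eno]
    return names, len(selected)

print(strategy_1(EMP, ASG))
print(strategy_2(EMP, ASG))
```

Both functions return the same names, but the first materializes every EMP/ASG pair before filtering, while the second only keeps the selected ASG tuples.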


Distributed Query Processing
● Query processor must consider the communication
cost and select the best site.
● The same query example, but relations ASG (G) and EMP (E) are
fragmented and distributed.

Distributed Query Processing Plans

● By centralized optimization,

● Two distributed query processing plans

Distributed Query Plan I
Plan I: Transport all fragments to the query site and execute there.

Site 1: ASG1, shipped to Site 5
Site 2: ASG2, shipped to Site 5
Site 3: EMP1, shipped to Site 5
Site 4: EMP2, shipped to Site 5
Site 5: Result = (EMP1 ∪ EMP2) ⋈ ENO σ RESP=“manager” (ASG1 ∪ ASG2)

This causes too much network traffic and is very costly.

Distributed Query Plan II
Plan II (Optimized):

Site 1: ASG1’ = σ RESP=“manager” (ASG1), shipped to Site 3
Site 2: ASG2’ = σ RESP=“manager” (ASG2), shipped to Site 4
Site 3: EMP1’ = EMP1 ⋈ ENO ASG1’, shipped to Site 5
Site 4: EMP2’ = EMP2 ⋈ ENO ASG2’, shipped to Site 5
Site 5: Result = EMP1’ ∪ EMP2’
Costs of the Two Plans
● Assume:
⬥ size(EMP) = 400, size(ASG) = 1000, 20 tuples with RESP = “manager”
⬥ tuple access cost = 1 unit; tuple transfer cost = 10 units
⬥ ASG and EMP are locally clustered on attributes RESP and ENO, respectively.
● Plan I
⬥ Transfer EMP to site 5: 400 * tuple transfer cost = 4,000
⬥ Transfer ASG to site 5: 1000 * tuple transfer cost = 10,000
⬥ Produce ASG’: 1000 * tuple access cost = 1,000
⬥ Join EMP and ASG’: 400 * 20 * tuple access cost = 8,000
Total cost = 23,000
● Plan II
⬥ Produce ASG’: (10+10) * tuple access cost = 20
⬥ Transfer ASG’ to the sites of EMP: (10+10) * tuple transfer cost = 200
⬥ Produce EMP’: (10+10) * tuple access cost * 2 = 40
⬥ Transfer EMP’ to result site: (10+10) * tuple transfer cost = 200
Total cost = 460
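The arithmetic behind the two totals can be reproduced with a few lines of Python. This is a sketch that mirrors the line items listed above; the constant and function names are chosen here for illustration.

```python
TUPLE_ACCESS = 1     # cost units per tuple accessed
TUPLE_TRANSFER = 10  # cost units per tuple transferred

SIZE_EMP = 400
SIZE_ASG = 1000
MANAGERS_PER_FRAGMENT = 10  # 20 "manager" tuples split over ASG1 and ASG2

def plan_1_cost():
    """Ship EMP and ASG to site 5, then select and join there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER
    produce_asg = SIZE_ASG * TUPLE_ACCESS
    join = SIZE_EMP * 2 * MANAGERS_PER_FRAGMENT * TUPLE_ACCESS
    return transfer_emp + transfer_asg + produce_asg + join

def plan_2_cost():
    """Select at the ASG sites, join at the EMP sites, ship only results."""
    selected = 2 * MANAGERS_PER_FRAGMENT
    produce_asg = selected * TUPLE_ACCESS
    transfer_asg = selected * TUPLE_TRANSFER
    produce_emp = selected * TUPLE_ACCESS * 2
    transfer_emp = selected * TUPLE_TRANSFER
    return produce_asg + transfer_asg + produce_emp + transfer_emp

print(plan_1_cost(), plan_2_cost())
```

Changing `TUPLE_TRANSFER` shows how strongly the gap between the plans depends on the relative price of communication.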
Query Optimization Objectives
● Minimize a cost function
I/O cost + CPU cost + communication cost
● These might have different weights in different distributed
environments

● Can also maximize throughput

Communication Cost
● Wide area network
⬥ Communication cost will dominate
– Low bandwidth
– Low speed
– High protocol overhead
⬥ Most algorithms ignore all other cost components

● Local area network
⬥ Communication cost is not as dominant
⬥ Total cost function should be considered
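A weighted total cost function makes the WAN versus LAN distinction concrete. The sketch below uses invented weights and plan figures (all hypothetical) to show how the preferred plan can flip between the two environments.

```python
from dataclasses import dataclass

@dataclass
class CostWeights:
    io: float
    cpu: float
    comm: float

def total_cost(io_units, cpu_units, comm_units, weights):
    """Weighted sum: I/O cost + CPU cost + communication cost."""
    return (weights.io * io_units
            + weights.cpu * cpu_units
            + weights.comm * comm_units)

# Hypothetical weights: communication is far more expensive on a WAN.
WAN = CostWeights(io=1.0, cpu=1.0, comm=100.0)
LAN = CostWeights(io=1.0, cpu=1.0, comm=2.0)

# Hypothetical plans as (I/O units, CPU units, communication units).
plan_a = (500, 200, 10)  # more local work, little data shipped
plan_b = (100, 50, 40)   # less local work, more data shipped

for name, w in (("WAN", WAN), ("LAN", LAN)):
    print(name, total_cost(*plan_a, w), total_cost(*plan_b, w))
```

With these numbers plan_a is cheaper under the WAN weights and plan_b is cheaper under the LAN weights, which is why the weights matter as much as the raw counts.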
Types of Query Optimization
• Query optimization aims at choosing the “best” point in the
solution space of all possible execution strategies.
● Exhaustive search
▪ Search the solution space, exhaustively predict the cost of each
strategy, and select the strategy with minimum cost.
▪ Cost-based
▪ Optimal
▪ Combinatorial complexity in the number of relations: the
problem becomes worse as the number of relations or
fragments increases (e.g., becomes greater than 5 or 6).
▪ Workable for small solution spaces
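A minimal sketch of exhaustive search over join orders is shown below. It assumes a toy cost model (the sum of estimated intermediate result sizes of a left-deep plan) and hypothetical cardinalities and selectivities; a real optimizer would use a richer cost function and plan space.

```python
from itertools import permutations

# Hypothetical statistics, chosen only for illustration.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 50}
SEL = {
    frozenset(("EMP", "ASG")): 0.0025,
    frozenset(("ASG", "PROJ")): 0.01,
}

def plan_cost(order):
    """Toy cost: sum of estimated intermediate sizes of a left-deep plan."""
    joined = [order[0]]
    size = float(CARD[order[0]])
    cost = 0.0
    for rel in order[1:]:
        size *= CARD[rel]
        for other in joined:
            # A missing entry means no join predicate (Cartesian product).
            size *= SEL.get(frozenset((other, rel)), 1.0)
        joined.append(rel)
        cost += size
    return cost

def exhaustive_search(relations):
    """Enumerate every join order (n! of them) and keep the cheapest."""
    return min(permutations(relations), key=plan_cost)

best = exhaustive_search(["EMP", "ASG", "PROJ"])
print(best, plan_cost(best))
```

Three relations give only 6 orders, but the count grows factorially, which is the combinatorial complexity noted above.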
Types of Query Optimization (cont.)

❖ Heuristics
• Not optimal
• Restrict the solution space so that only a few
strategies are considered
• Perform unary operations (selection and
projection) first
• Reorder operations to reduce intermediate
relation size
• Replace a join by a series of semijoins to
minimize data communication
Query Optimization Granularity
● Single query at a time
⬥ Cannot use common intermediate results

● Multiple queries at a time
⬥ Efficient if there are many similar queries
⬥ Decision space is much larger

Query Optimization Timing
● Static
⬥ Done at compilation time using statistics; appropriate
for exhaustive search; the query is optimized once but
executed many times.
⬥ Difficult to estimate the size of the intermediate results
⬥ Optimization cost can be amortized over many executions

● Dynamic
⬥ Done at execution time; accurate about the size of the
intermediate results, but repeated for every execution,
which is expensive.
Query Optimization Timing (cont.)
● Hybrid
⬥ Compile using a static algorithm
⬥ If the error in the estimated sizes exceeds a threshold,
re-optimize at run time
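The hybrid trigger can be sketched as a simple check on the relative estimation error. The function name `needs_reoptimization` and the default threshold of 0.5 are assumptions made for this example, not values from the lecture.

```python
def needs_reoptimization(estimated_size, actual_size, threshold=0.5):
    """Return True if the relative estimation error exceeds the threshold."""
    if estimated_size <= 0:
        # No meaningful estimate: re-optimize only if tuples actually appeared.
        return actual_size > 0
    error = abs(actual_size - estimated_size) / estimated_size
    return error > threshold

# The static plan estimated 100 tuples for an intermediate result.
print(needs_reoptimization(100, 120))  # small error: keep the static plan
print(needs_reoptimization(100, 400))  # large error: re-optimize at run time
```

A low threshold makes the scheme behave more like dynamic optimization, while a high one keeps it close to the static plan.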

Statistics
● Relation
⬥ Cardinality
⬥ Size of a tuple
⬥ Fraction of tuples participating in a join with another relation
● Attributes
⬥ Cardinality of the domain
⬥ Actual number of distinct values
● Common assumptions
⬥ Independence between different attribute values
⬥ Uniform distribution of attribute values within their domain
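The two common assumptions translate directly into simple size estimates. The sketch below is illustrative only: the function names are invented, and the figures of 50 distinct RESP values and 10 distinct DUR values are hypothetical.

```python
def eq_selectivity(distinct_values):
    """Uniformity: every distinct value is assumed equally likely."""
    return 1.0 / distinct_values

def and_selectivity(*selectivities):
    """Independence: selectivities of conjunctive predicates multiply."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result

def estimated_cardinality(cardinality, *selectivities):
    """Estimated number of tuples that satisfy all predicates."""
    return cardinality * and_selectivity(*selectivities)

# ASG has 1000 tuples; assume (hypothetically) 50 distinct RESP values.
managers = estimated_cardinality(1000, eq_selectivity(50))

# Add a second, independent predicate on an attribute with 10 distinct values.
both = estimated_cardinality(1000, eq_selectivity(50), eq_selectivity(10))

print(managers, both)
```

When attribute values are skewed or correlated, these estimates can be far off, which is one reason static optimization struggles with intermediate result sizes.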
Decision Sites
● Query optimization may be performed by
⬥ Single site – centralized approach
– Single site determines the best schedule
– Simple
– Needs knowledge about the entire distributed database
⬥ All the sites involved – distributed approach
– Cooperation among sites to determine the schedule
– Needs only local information
– Incurs the cost of cooperation
⬥ Hybrid – one site makes the major decisions in cooperation
with other sites making local decisions
– One site determines the global schedule
– Each site optimizes the local subqueries
Network Topology

● Wide Area Network (WAN) – point-to-point
⬥ Characteristics
– Low bandwidth
– Low speed
– High protocol overhead
⬥ Communication cost will dominate; ignore all other cost
factors
⬥ Global schedule to minimize communication cost
⬥ Local schedules according to centralized query
optimization

Network Topology (cont.)

● Local Area Network (LAN)
⬥ Communication cost is not as dominant
⬥ Total cost function should be considered
⬥ Broadcasting can be exploited
⬥ Special algorithms exist for star networks

Other Information to Exploit

● Using replication to minimize communication costs

● Using semijoins to reduce the size of operand relations, which cuts
down communication costs when the overhead is not significant.

Semijoin: a technique for processing a join between two tables that are stored at
different sites. The basic idea is to reduce the transfer cost by first sending only the
projected join column(s) to the other site, where they are joined with the second relation.
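The semijoin idea can be sketched as three steps and compared against shipping a whole relation. The data, the two-site setup, and the use of "attribute values shipped" as the transfer measure are assumptions made for this example.

```python
# Hypothetical data: EMP(ENO, ENAME) at site A, ASG(ENO, PNO, RESP) at site B.
EMP_SITE_A = [("E1", "Ann"), ("E2", "Bob"), ("E3", "Cy"), ("E4", "Di")]
ASG_SITE_B = [("E1", "P1", "manager"), ("E3", "P2", "manager")]

def join_on_eno(emp, asg):
    """Local join of EMP and ASG tuples on their first attribute (ENO)."""
    by_eno = {e[0]: e for e in emp}
    return [by_eno[g[0]] + g[1:] for g in asg if g[0] in by_eno]

def semijoin_plan(emp, asg):
    # Step 1 (site B): project the join column of ASG and ship it to site A.
    shipped_keys = {g[0] for g in asg}
    # Step 2 (site A): keep only the EMP tuples that have a matching key.
    reduced_emp = [e for e in emp if e[0] in shipped_keys]
    # Step 3: ship the reduced EMP to site B and finish the join there.
    result = join_on_eno(reduced_emp, asg)
    values_shipped = len(shipped_keys) + sum(len(e) for e in reduced_emp)
    return result, values_shipped

def ship_whole_relation(emp, asg):
    # Baseline: ship all of EMP to site B and join there.
    return join_on_eno(emp, asg), sum(len(e) for e in emp)

print(semijoin_plan(EMP_SITE_A, ASG_SITE_B))
print(ship_whole_relation(EMP_SITE_A, ASG_SITE_B))
```

Both plans return the same join result; the semijoin plan ships fewer values here because only two of the four EMP tuples have a match. When most tuples match, the extra round trip for the keys can outweigh the saving.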
