Unit2 1

Uploaded by

arjabkhadka93
Distributed Query Processing…

It is a process of transforming a high-level query (of relational
calculus/SQL) on a distributed database (i.e., a set of global relations)
into an equivalent and efficient lower-level query (of relational
algebra) on relation fragments.

• Distributed query processing is more complex than centralized query processing because of:
– Fragmentation/replication of relations
– Additional communication costs
– Parallel execution
Query Processing
Query Processing Example: Centralized Scenario
• Example: Transformation of an SQL query into a relational algebra (RA) query.
Relations: EMP(ENO, ENAME, TITLE), ASG(ENO, PNO, RESP, DUR)
Find the names of employees who have worked for more than 37 years.
– High-level query
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND DUR > 37;
– Two possible transformations of the query are:
Expression 1: Π_ENAME(σ_{DUR>37 ∧ EMP.ENO=ASG.ENO}(EMP × ASG))
Expression 2: Π_ENAME(EMP ⋈_ENO (σ_{DUR>37}(ASG)))
– Expression 2 avoids the large and expensive intermediate Cartesian
product and is therefore typically better.
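The difference between the two expressions can be made concrete with a small Python sketch. The rows below are invented purely for illustration (they are not part of the original example), and the function names `expression_1` and `expression_2` are hypothetical. Because only ENAME is projected, the join in the second function is evaluated as a lookup on the selected ENO values.

```python
# Toy relations; every row here is made up for illustration only.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe", "TITLE": "Elect. Eng."},
    {"ENO": "E2", "ENAME": "M. Smith", "TITLE": "Analyst"},
    {"ENO": "E3", "ENAME": "A. Lee", "TITLE": "Programmer"},
]
ASG = [
    {"ENO": "E1", "PNO": "P1", "RESP": "Manager", "DUR": 12},
    {"ENO": "E2", "PNO": "P1", "RESP": "Analyst", "DUR": 40},
    {"ENO": "E2", "PNO": "P2", "RESP": "Analyst", "DUR": 6},
    {"ENO": "E3", "PNO": "P3", "RESP": "Consultant", "DUR": 48},
]


def expression_1(emp, asg):
    """Build the full Cartesian product, then filter on both predicates."""
    product = [(e, a) for e in emp for a in asg]
    names = {e["ENAME"] for e, a in product
             if a["DUR"] > 37 and e["ENO"] == a["ENO"]}
    return names, len(product)


def expression_2(emp, asg):
    """Select on ASG first, then join the surviving tuples with EMP on ENO."""
    selected = [a for a in asg if a["DUR"] > 37]
    wanted_eno = {a["ENO"] for a in selected}
    names = {e["ENAME"] for e in emp if e["ENO"] in wanted_eno}
    return names, len(selected)


names_1, size_1 = expression_1(EMP, ASG)
names_2, size_2 = expression_2(EMP, ASG)
print(names_1 == names_2)  # both expressions yield the same set of names
print(size_1, size_2)      # intermediate result sizes: product vs. selection
```

On this toy data the Cartesian product holds |EMP| * |ASG| tuples before any filtering, while the selection-first plan only carries the ASG tuples that satisfy DUR > 37 into the join.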
Distributed scenario…

• We make the following assumptions about the data fragmentation
– Data is (horizontally) fragmented:
Site1: ASG1 = σ_{ENO≤"E3"}(ASG)
Site2: ASG2 = σ_{ENO>"E3"}(ASG)
Site3: EMP1 = σ_{ENO≤"E3"}(EMP)
Site4: EMP2 = σ_{ENO>"E3"}(EMP)
Site5: Result
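As a rough illustration of this fragmentation, the sketch below splits a toy ASG relation with the two ENO predicates and checks that the fragments are disjoint and that their union gives back the original relation. The helper `fragment` and the generated rows are assumptions made for this example; plain string comparison works here only because the toy ENO values have a single digit.

```python
def fragment(relation, predicate):
    """Horizontal fragmentation: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]


# Hypothetical ASG tuples E1..E6 (single-digit numbers, so string order is fine).
ASG = [{"ENO": f"E{i}", "DUR": 10 * i} for i in range(1, 7)]

asg1 = fragment(ASG, lambda t: t["ENO"] <= "E3")  # stored at Site 1
asg2 = fragment(ASG, lambda t: t["ENO"] > "E3")   # stored at Site 2

# Disjointness: no ENO value appears in both fragments.
overlap = {t["ENO"] for t in asg1} & {t["ENO"] for t in asg2}

# Reconstruction: the union of the fragments gives back ASG.
reconstructed = asg1 + asg2

print([t["ENO"] for t in asg1])
print([t["ENO"] for t in asg2])
print(overlap == set(), reconstructed == ASG)
```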
Example…

• Now consider the expression: Π_ENAME(EMP ⋈_ENO (σ_{DUR>37}(ASG)))
• Strategy 1 (partially parallel execution):
– Produce ASG1' = σ_{DUR>37}(ASG1) at Site 1 and move it to Site 3
– Produce ASG2' = σ_{DUR>37}(ASG2) at Site 2 and move it to Site 4
– Join ASG1' with EMP1 at Site 3 and move the result to Site 5
– Join ASG2' with EMP2 at Site 4 and move the result to Site 5
– Union the results at Site 5
• Strategy 2:
– Move ASG1 and ASG2 to Site 5
– Move EMP1 and EMP2 to Site 5
– Select and join at Site 5
Example…
• Calculate the cost of the two strategies under the following assumptions:
– Tuples are uniformly distributed to the fragments; 20 tuples satisfy DUR>37
– size(EMP) = 400, size(ASG) = 1000
– tuple access cost = 1 unit; tuple transfer cost = 10 units
– ASG and EMP have local indexes on DUR and ENO, respectively
• Strategy 1
– Produce ASG' (select DUR>37 on ASG1, ASG2): (10+10) * tuple access cost = 20
– Transfer ASG' to the sites of EMP: (10+10) * tuple transfer cost = 200
– Produce EMP' (join ASG' with EMP): (10+10) * tuple access cost * 2 = 40
– Transfer EMP' to the result site: (10+10) * tuple transfer cost = 200
– Total cost = 460
• Strategy 2
– Transfer EMP1, EMP2 to Site 5: 400 * tuple transfer cost = 4,000
– Transfer ASG1, ASG2 to Site 5: 1000 * tuple transfer cost = 10,000
– Select tuples from ASG1 ∪ ASG2: 1000 * tuple access cost = 1,000
– Join EMP and ASG': 400 * 20 * tuple access cost = 8,000
– Total cost = 23,000
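The arithmetic for both strategies can be checked with a short Python sketch. The constants come from the stated assumptions; the function and variable names are hypothetical and chosen only for this example.

```python
TUPLE_ACCESS = 1     # cost units per tuple access
TUPLE_TRANSFER = 10  # cost units per tuple transfer
SIZE_EMP = 400       # tuples in EMP
SIZE_ASG = 1000      # tuples in ASG
MATCHING = 20        # ASG tuples with DUR > 37

# Uniform distribution: half of the matching tuples sit in each ASG fragment.
PER_FRAGMENT = MATCHING // 2


def strategy_1_cost():
    """Select locally, ship the small results, join at the EMP sites."""
    matching = PER_FRAGMENT + PER_FRAGMENT
    produce_asg = matching * TUPLE_ACCESS       # indexed selection on DUR
    transfer_asg = matching * TUPLE_TRANSFER    # ASG' to Sites 3 and 4
    produce_emp = matching * TUPLE_ACCESS * 2   # join ASG' with EMP
    transfer_emp = matching * TUPLE_TRANSFER    # EMP' to Site 5
    return produce_asg + transfer_asg + produce_emp + transfer_emp


def strategy_2_cost():
    """Ship every fragment to Site 5, then select and join there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER
    select_asg = SIZE_ASG * TUPLE_ACCESS            # scan all shipped ASG tuples
    join = SIZE_EMP * MATCHING * TUPLE_ACCESS       # join EMP with ASG'
    return transfer_emp + transfer_asg + select_asg + join


print(strategy_1_cost())  # 460
print(strategy_2_cost())  # 23000
```

Under these assumptions Strategy 2 costs 50 times as much as Strategy 1, and most of the gap comes from shipping all 1,400 tuples to Site 5 and joining without the help of the local indexes.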
Layers of Query Processing
Query Decomposition…
• Query decomposition:
Maps a calculus query (SQL) to relational algebra operations (select,
project, join)
• Query decomposition consists of 4 steps:
1. Normalization: Transform the query to a normalized form
2. Analysis: Detect and reject "incorrect" queries; possible
only for a subset of relational calculus
3. Elimination of redundancy: Eliminate redundant
predicates
4. Rewriting: Transform the query to RA and optimize it
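To make step 3 more tangible, here is a minimal sketch that checks, by brute-force truth table, that a redundant WHERE clause is equivalent to a simpler one. The predicate is a hypothetical example, not one taken from these slides: p1 and p2 could stand for two conditions on TITLE and p3 for a condition on ENAME.

```python
from itertools import product


def original(p1, p2, p3):
    """Hypothetical qualification: (NOT p1 AND (p1 OR p2) AND NOT p2) OR p3."""
    return (not p1 and (p1 or p2) and not p2) or p3


def simplified(p1, p2, p3):
    """Candidate after eliminating the redundant (always false) conjunct."""
    return p3


# Compare both forms on all 2**3 truth assignments.
equivalent = all(
    original(*values) == simplified(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```

The first conjunct can never be true (it requires p1 OR p2 while both are false), so the whole qualification collapses to p3. A real decomposer would apply logical idempotency rules rather than enumerate assignments; the enumeration here only verifies the outcome.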
Data Localization…

• Data localization
– Input: Algebraic query on global conceptual
schema
– Purpose:
• Apply data distribution information to the algebra
operations and determine which fragments are
involved
• Substitute global query with queries on fragments
• Optimize the global query
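A minimal sketch of the idea, reusing the EMP fragmentation on ENO from the earlier example: a selection on the global relation EMP is rewritten into selections on only those fragments whose defining predicate can hold for the requested value. The names `FRAGMENTS`, `relevant_fragments` and `localize` are hypothetical, and the string comparison assumes single-digit employee numbers.

```python
# Fragment definitions from the example: EMP1 = σ ENO≤"E3"(EMP), EMP2 = σ ENO>"E3"(EMP).
FRAGMENTS = {
    "EMP1": lambda eno: eno <= "E3",
    "EMP2": lambda eno: eno > "E3",
}


def relevant_fragments(eno_value):
    """Return the fragments whose defining predicate the value satisfies."""
    return [name for name, holds in FRAGMENTS.items() if holds(eno_value)]


def localize(eno_value):
    """Rewrite σ ENO=v (EMP) as a union of selections on the relevant fragments."""
    parts = [f'σ ENO="{eno_value}"({name})'
             for name in relevant_fragments(eno_value)]
    return " ∪ ".join(parts)


print(localize("E5"))  # only EMP2 is involved
print(localize("E2"))  # only EMP1 is involved
```

Fragments that cannot contribute tuples are dropped from the localized query, so their sites never need to be contacted.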
Query Optimization…

• Query optimization is a crucial and difficult part of overall query processing
• The objective of query optimization is to minimize the following cost function:
I/O cost + CPU cost + communication cost
• Two different scenarios are considered:
– Wide area networks
• Communication cost dominates (low bandwidth, low speed, high protocol overhead)
• Most algorithms ignore all other cost components
– Local area networks
• Communication cost is not that dominant
• The total cost function should be considered
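The effect of the two scenarios on plan choice can be sketched as follows. The function `total_cost`, the two plans and all their cost figures are hypothetical; the sketch assumes that a WAN optimizer looks at communication cost only, while a LAN optimizer sums all three components.

```python
def total_cost(io, cpu, comm, network="lan"):
    """Cost of a plan; on a WAN only the communication component is counted."""
    if network == "wan":
        return comm
    return io + cpu + comm


# Two made-up plans, given as (io, cpu, comm) in arbitrary cost units.
PLANS = {
    "A": (50, 20, 200),    # little local work, more data shipped
    "B": (300, 100, 100),  # more local work, less data shipped
}


def best_plan(network):
    """Pick the plan with the lowest cost under the given network model."""
    return min(PLANS, key=lambda name: total_cost(*PLANS[name], network=network))


print(best_plan("wan"))  # compares communication cost only
print(best_plan("lan"))  # compares I/O + CPU + communication
```

With these numbers the cheaper plan differs between the two settings, which is why the cost model has to match the network.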
