Parallel and Distributed Query Processing Notes
The basic foundations of parallel query evaluation are the splitting and merging of parallel streams of data:

• Data streams arising from different disks or processors provide the input for each operator in the query.

• The results produced as an operator is evaluated in parallel over a set of nodes need to be merged at one node in order to produce the overall result of the operator.

• This result set may then be split again in order to parallelise subsequent processing in the query.
1.1 Sort
Given the frequency with which sorting is required in query processing, parallel sorting algorithms have
been much studied.
A commonly used parallel sorting algorithm is the parallel merge sort:
This first sorts each fragment of the relation individually on its local disk, e.g. using the external merge
sort algorithm we have already looked at.
Groups of fragments are then shipped to one node per group, which merges the group of fragments into
a larger sorted fragment. This process repeats until a single sorted fragment is produced at one of the
nodes.
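As an illustration, here is a minimal single-machine sketch of parallel merge sort in Python, using a multiprocessing pool to stand in for the nodes and heapq.merge for merging sorted runs. The fragment data and group size are invented for illustration, and the local sorts are done in memory rather than with an external merge sort.

```python
import heapq
from multiprocessing import Pool

def sort_fragment(fragment):
    # Phase 1: each "node" sorts its local fragment
    # (a stand-in for an external merge sort over a disk-resident fragment).
    return sorted(fragment)

def merge_group(group):
    # Phase 2: one node per group merges its group of sorted runs.
    return list(heapq.merge(*group))

def parallel_merge_sort(fragments, group_size=2):
    with Pool() as pool:
        runs = pool.map(sort_fragment, fragments)   # local sorts, in parallel
        while len(runs) > 1:                        # repeat until one sorted run remains
            groups = [runs[i:i + group_size] for i in range(0, len(runs), group_size)]
            runs = pool.map(merge_group, groups)    # merge groups, in parallel
    return runs[0]

if __name__ == "__main__":
    fragments = [[5, 1, 9], [4, 4, 2], [8, 0, 7], [3, 6]]   # one fragment per disk/node
    print(parallel_merge_sort(fragments))
```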
1.3 Join
R ⋈ S using Parallel Nested Loops or Index Nested Loops
First the ‘outer’ relation has to be chosen. In particular, if S has an appropriate index on the join
attribute(s) then R should be the outer relation.
All the fragments of R are then shipped to all nodes. So each node i now has a whole copy of R as well as its own fragment of S, Si.

The local joins R ⋈ Si are performed in parallel on all the nodes, and the results are finally shipped to a chosen node for merging.
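A minimal sketch of this 'broadcast' strategy, assuming relations are lists of Python dictionaries and the join is an equijoin on a single attribute; in a real system each local join would run on its own node and ship its partial result for merging.

```python
def local_join(R, S_i, attr):
    # Nested loops join of the full (broadcast) copy of R with the local fragment S_i.
    return [{**r, **s} for r in R for s in S_i if r[attr] == s[attr]]

def parallel_nested_loops_join(R, S_fragments, attr):
    # Ship all of R to every node, join locally, then merge the partial results.
    partial_results = [local_join(R, S_i, attr) for S_i in S_fragments]
    return [t for part in partial_results for t in part]

R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
S_fragments = [[{"id": 1, "y": 10}], [{"id": 2, "y": 20}, {"id": 3, "y": 30}]]
print(parallel_nested_loops_join(R, S_fragments, "id"))
```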
R ⋈ S using Parallel Sort-Merge Join (for natural/equijoins)
The first phase of this involves sorting R and S on the join attribute(s). These sorts can be performed
using the parallel merge sort operation described in Section 1.1.
The sorted relations are then partitioned across the nodes using range partitioning with the same sub-
ranges on the join attribute(s) for both relations.
The local joins of each pair of sorted fragments Ri ⋈ Si are performed in parallel, and the results are finally shipped to a chosen node for merging.
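The following sketch illustrates the idea under simplifying assumptions: relations are lists of dictionaries already sorted on the join attribute, the sub-range boundaries are given explicitly, and the per-node joins run sequentially here rather than in parallel.

```python
import bisect

def range_partition(tuples, attr, boundaries):
    # Assign each tuple to a node using the same sub-ranges for both relations.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        parts[bisect.bisect_right(boundaries, t[attr])].append(t)
    return parts

def merge_join(R_i, S_i, attr):
    # Local merge join of one pair of fragments, both already sorted on attr.
    out, j = [], 0
    for r in R_i:
        while j < len(S_i) and S_i[j][attr] < r[attr]:
            j += 1
        k = j
        while k < len(S_i) and S_i[k][attr] == r[attr]:
            out.append({**r, **S_i[k]})
            k += 1
    return out

R = [{"id": i} for i in (1, 3, 5, 7)]                 # already sorted on id
S = [{"id": i, "y": i * 10} for i in (3, 5, 8)]       # already sorted on id
boundaries = [4]                                      # node 0: id < 4, node 1: the rest
R_parts = range_partition(R, "id", boundaries)
S_parts = range_partition(S, "id", boundaries)
print([t for R_i, S_i in zip(R_parts, S_parts) for t in merge_join(R_i, S_i, "id")])
```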
R ⋈ S using Parallel Hash Join (for natural/equijoins only)
Each bucket of R and S is logically assigned to one node.
The first hashing phase, using the first hash function h1, is undertaken in parallel on all the nodes. Each tuple t from R or S is shipped to node i if the bucket assigned to it by h1 is the ith bucket.
The next phase is also undertaken in parallel on all nodes. On each node i, a hash table is created from the local fragment of R, Ri, using another hash function h2. The local fragment of S, Si, is then scanned and h2 is used to probe the hash table for matching records of Ri for each record of Si.
The results produced at each node are shipped to a chosen node for merging.
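A compact sketch of the two phases, simulating the nodes with in-memory lists; the hash functions h1 and h2 are arbitrary stand-ins, and the equality check after probing guards against collisions of h2.

```python
from collections import defaultdict

def h1(value, n_nodes):
    # Partitioning hash: decides which node's bucket a tuple belongs to.
    return hash(value) % n_nodes

def h2(value):
    # Local hash: used to build and probe the per-node hash table.
    return hash((value, "local"))

def parallel_hash_join(R, S, attr, n_nodes=3):
    # Phase 1: ship each tuple of R and S to the node owning its h1 bucket.
    R_at, S_at = defaultdict(list), defaultdict(list)
    for r in R:
        R_at[h1(r[attr], n_nodes)].append(r)
    for s in S:
        S_at[h1(s[attr], n_nodes)].append(s)

    # Phase 2: on each node, build a hash table on R_i with h2, then probe it with S_i.
    result = []
    for node in range(n_nodes):
        table = defaultdict(list)
        for r in R_at[node]:
            table[h2(r[attr])].append(r)
        for s in S_at[node]:
            for r in table[h2(s[attr])]:
                if r[attr] == s[attr]:          # guard against h2 collisions
                    result.append({**r, **s})
    return result                               # each node would ship its part for merging

R = [{"id": i} for i in range(5)]
S = [{"id": i, "y": i * 10} for i in range(0, 10, 2)]
print(parallel_hash_join(R, S, "id"))
```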
Compared with query optimisation on a single node, several additional factors now arise:

• The costs of partitioning and merging the data now need to be taken into account.
• If the data distribution is skewed, this will have an impact on the overall time taken to complete
the evaluation of an operator, so that too needs to be taken into account.
• The results being produced by one operator can be pipelined into the evaluation of another operator
that is executing at the same time on a different node.
For example, consider this left-deep join tree where all the joins are nested loops joins:
((R1 ⋈ R2) ⋈ R3) ⋈ R4

The tuples being produced by R1 ⋈ R2 can be used to ‘probe’ R3, and the resulting tuples can be used to probe R4, thus setting up a pipeline of three concurrently executing join operators on three different nodes (a minimal sketch of such a pipeline appears after this list).
• There is now the possibility of executing different operators of the query concurrently on different
nodes.
For example, with this bushy join tree:
((R1 ⋈ R2) ⋈ (R3 ⋈ R4))
the join of R1 with R2 and the join of R3 with R4 can be executed concurrently. The partial results
of these joins can also be pipelined into their parent join operator.
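As referenced in the pipelining bullet above, here is a minimal sketch of a pipeline of nested loops joins using Python generators: each operator emits result tuples as soon as they are produced, so the joins higher up the tree can start consuming them immediately. The relations and join attribute are invented for illustration, and the operators run in one process here rather than on separate nodes.

```python
def scan(relation):
    # Leaf operator: stream the tuples of a base relation.
    yield from relation

def nl_join(left_stream, right, attr):
    # Nested loops join operator: probes 'right' with each tuple arriving from
    # 'left_stream' and emits matches immediately, without waiting for all input.
    for l in left_stream:
        for r in right:
            if l[attr] == r[attr]:
                yield {**l, **r}

R1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
R2 = [{"id": 1, "b": 10}, {"id": 2, "b": 20}]
R3 = [{"id": 1, "c": 100}]
R4 = [{"id": 1, "d": 1000}]

# ((R1 join R2) join R3) join R4 as a pipeline: each operator could run on its
# own node, consuming tuples as soon as its child produces them.
pipeline = nl_join(nl_join(nl_join(scan(R1), R2, "id"), R3, "id"), R4, "id")
for t in pipeline:
    print(t)
```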
In uniprocessor systems only left-deep join orders are usually considered (and this still gives n! possible
join orders for a join of n relations).
In multi-processor systems, other join orders can result in more parallelisation, e.g. bushy trees, as
illustrated above. Even if such a plan is more costly in terms of the number of I/O operations performed,
it may execute more quickly than a left-deep plan due to the increased parallelism it affords.
So, in general, the number of candidate query plans is much greater in parallel database systems and
more heuristics need to be employed to limit the search space of query plans. For example, one possible
approach is for the query optimiser to first find the best plan for sequential evaluation of the query; and
then find the fastest parallel execution of that plan.
When relations are fragmented across sites, each relation referenced in a query is replaced in Step 2 by an expression that reconstructs it from its fragments; this expression is then simplified in Step 3. The simplifications that can be carried out in Step 3 in the case of horizontal partitioning include the following:
• eliminating fragments from the argument to a selection operation if they cannot contribute any
tuples to the result of that selection;
• distributing join operations over unions of fragments, and eliminating joins that can yield no tuples;
For example, suppose a table Employee(empID, site, salary, . . . ) is horizontally fragmented into four
fragments:
E1 = σ_{site='A' AND salary<30000} Employee
E2 = σ_{site='A' AND salary>=30000} Employee
E3 = σ_{site='B' AND salary<30000} Employee
E4 = σ_{site='B' AND salary>=30000} Employee
then the query σ_{salary<25000} Employee is replaced in Step 2 by

σ_{salary<25000} (E1 ∪ E2 ∪ E3 ∪ E4)

which simplifies to

σ_{salary<25000} (E1 ∪ E3)

since the selection condition salary < 25000 contradicts the predicate salary >= 30000 defining E2 and E4, so those two fragments cannot contribute any tuples to the result.
For example, suppose a table WorksIn(empID, site, project, . . . ) is horizontally fragmented into two
fragments:
W1 = σ_{site='A'} WorksIn
W2 = σ_{site='B'} WorksIn
then the query Employee ⋈ WorksIn is replaced in Step 2 by:

(E1 ∪ E2 ∪ E3 ∪ E4) ⋈ (W1 ∪ W2)

Distributing the join over the unions gives:

(E1 ⋈ W1) ∪ (E2 ⋈ W1) ∪ (E3 ⋈ W1) ∪ (E4 ⋈ W1) ∪
(E1 ⋈ W2) ∪ (E2 ⋈ W2) ∪ (E3 ⋈ W2) ∪ (E4 ⋈ W2)

Since site is a common attribute of Employee and WorksIn, joins between fragments from different sites can yield no tuples, so this simplifies to:

(E1 ⋈ W1) ∪ (E2 ⋈ W1) ∪ (E3 ⋈ W2) ∪ (E4 ⋈ W2)
One simplification that can be carried out in Step 3 in the case of vertical partitioning is that fragments in
the argument of a projection operation which have no non-key attributes in common with the projection
attributes can be eliminated.
For example, if a table Projects(projNum, budget, location, projName) is vertically partitioned into two
fragments:
P1 = π_{projNum,budget,location} Projects
P2 = π_{projNum,projName} Projects
then the query π_{projNum,location} Projects is replaced in Step 2 by:

π_{projNum,location} (P1 ⋈ P2)

which simplifies to

π_{projNum,location} P1

since P2 has no non-key attributes in common with the projection attributes.
Turning to joins of relations R and S stored at different sites, suppose we need to compute R ⋈ S at the site of S. Two methods for doing this are:

• the full-join method, and
• the semi-join method.
Full-join method
The simplest method for computing R ⋈ S at the site of S consists of shipping R to the site of S and doing the join there. The shipping cost is

c × pages(R)

where c is the cost of transmitting one page of data from the site of R to the site of S, and pages(R) is the number of pages that R consists of. On top of this come the local I/O costs of reading R at its own site, saving it at the site of S, and performing the join there.
If the result of this join were needed at a different site, then there would also be the additional cost of
sending the result of the join from site(S) to where it is needed.
Semi-join method
This is an alternative method for computing R ⋈ S at the site of S and consists of the following steps:
(i) Compute π_{R∩S}(S) at the site of S, where π_{R∩S} denotes projection on the common attributes of R and S.

(ii) Ship π_{R∩S}(S) to the site of R.
(iii) Compute R ⋉ S at the site of R, where ⋉ is the semi-join operator, defined as follows:

R ⋉ S = R ⋈ π_{R∩S}(S)

(iv) Ship R ⋉ S to the site of S.

(v) Compute (R ⋉ S) ⋈ S at the site of S. This gives the required result, since

R ⋈ S = (R ⋉ S) ⋈ S
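A minimal sketch of the semi-join data flow, with the two 'sites' represented simply as separate function calls; the relation contents and attribute names are invented for illustration.

```python
def project_common(S, common_attrs):
    # Step (i): projection of S onto the join attributes, computed at site(S).
    return {tuple(s[a] for a in common_attrs) for s in S}

def semi_join(R, proj_S, common_attrs):
    # Step (iii): keep only the tuples of R that will find a join partner in S.
    return [r for r in R if tuple(r[a] for a in common_attrs) in proj_S]

def join(R, S, common_attrs):
    # Step (v): the final join, performed at site(S) on the reduced R.
    return [{**r, **s} for r in R for s in S
            if all(r[a] == s[a] for a in common_attrs)]

accounts  = [{"cname": "ann", "balance": 10}, {"cname": "bob", "balance": 20}]
customers = [{"cname": "ann", "city": "London"}]

proj = project_common(customers, ["cname"])         # shipped from site(S) to site(R)
reduced_R = semi_join(accounts, proj, ["cname"])    # shipped from site(R) to site(S)
print(join(reduced_R, customers, ["cname"]))        # equals accounts JOIN customers
```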
Example 1. Suppose R is the relation accounts, occupying 1000 pages, S is the relation customer, occupying 500 pages, and the common join attribute is cname. Suppose we need to compute R ⋈ S at the site of S.

With the full-join method the cost is 1000 I/Os to read R, plus (c × 1000) to transmit R to the site of S, plus 1000 I/Os to save it there, plus (3 × (1000 + 500)) I/Os (assuming a hash join) to perform the join. This gives a total cost of:

(c × 1000) + 6500 I/Os
With the semi-join method we have the cost of:

(i) Computing π_{R∩S}(S) at site(S), i.e. 500 I/Os to scan S, generating 100 pages of just the cname values.

(ii) Shipping π_{R∩S}(S) to site(R), i.e. c × 100, and saving it there, i.e. 100 I/Os.

(iii) Computing R ⋉ S at site(R), i.e. 3 × (100 + 1000) I/Os, assuming a hash join.

(iv) Shipping the result of R ⋉ S to the site of S, i.e. c × 1000, and saving it there, i.e. 1000 I/Os (since cname in R is a foreign key referencing S, every tuple of R joins, so R ⋉ S is the whole of R).

(v) Computing (R ⋉ S) ⋈ S at site(S), i.e. 3 × (1000 + 500) I/Os.

This gives a total cost of (c × 1100) + 9400 I/Os.

So in this case the full-join method ((c × 1000) + 6500 I/Os) is cheaper: we have gained nothing by using the semi-join method since all the tuples of R join with tuples of S.
Example 2. Let R be as above (i.e., accounts) and let

S = σ_{city='London'}(customer)

Suppose again that we need to compute R ⋈ S at the site of S.
Suppose also that there are 100 different cities in customer, that there is a uniform distribution of
customers across cities, and a uniform distribution of accounts over customers. So S contains 500 tuples
on 5 pages.
With the full-join method we have a cost of (c × 1000) + 5015 I/Os, computed as in Example 1 but with pages(S) now 5. With the semi-join method we have a cost of:

(i) Computing π_{R∩S}(S) at site(S), i.e. 5 I/Os to scan S, generating 1 page of cname values.

(ii) Shipping π_{R∩S}(S) to site(R), i.e. c × 1, plus 1 I/O to save it there.

(iii) Computing R ⋉ S at site(R), i.e. 3 × (1 + 1000) I/Os, assuming a hash join.

(iv) Shipping R ⋉ S to the site of S, i.e. c × 10 since, due to the uniform distribution of accounts over customers, 1/100th of R will match the cname values sent to it from S, plus the cost of saving the result of R ⋉ S at the site of S, i.e. 10 I/Os.

(v) Computing (R ⋉ S) ⋈ S at site(S), i.e. 3 × (10 + 5) I/Os.
The overall cost is thus (c × 11) + 3064 I/Os versus (c × 1000) + 5015 I/Os for the full-join method.
So in this case the semi-join method is cheaper. This is because a significant number of tuples of R do
not join with S and so are not sent to the site of S.
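The cost figures in the two examples can be reproduced with a few lines of arithmetic. The sketch below keeps the shipping cost (the multiple of c) separate from the local I/O cost and follows the same assumptions as the examples: hash joins costing 3 × (pages of the two inputs), and shipped data being saved on arrival.

```python
def full_join_cost(pages_R, pages_S):
    # Full-join method: read R, ship it, save it at site(S), hash join there.
    # Returns (pages shipped, local I/Os).
    io = pages_R + pages_R + 3 * (pages_R + pages_S)
    return pages_R, io

def semi_join_cost(pages_R, pages_S, pages_proj, pages_matching):
    # Semi-join method: project S, ship the projection, semi-join at site(R),
    # ship the matching part of R back, save it, and join at site(S).
    shipped = pages_proj + pages_matching
    io = (pages_S                             # (i)  scan S to compute the projection
          + pages_proj                        # (ii) save the projection at site(R)
          + 3 * (pages_proj + pages_R)        # (iii) hash semi-join at site(R)
          + pages_matching                    # (iv) save the reduced R at site(S)
          + 3 * (pages_matching + pages_S))   # (v)  final hash join at site(S)
    return shipped, io

# Example 1: S = customer (500 pages); every tuple of R joins.
print(full_join_cost(1000, 500))              # -> (1000, 6500)
print(semi_join_cost(1000, 500, 100, 1000))   # -> (1100, 9400)

# Example 2: S = London customers (5 pages); only 1/100th of R joins.
print(full_join_cost(1000, 5))                # -> (1000, 5015)
print(semi_join_cost(1000, 5, 1, 10))         # -> (11, 3064)
```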
Bloom join method
In general, a Bloom filter is an efficient in-memory structure (usually a bit vector) used to approximate the contents of a set S in the following sense:

• given an item i, the filter reports either that i is definitely not in S or that i may be in S; false positives are therefore possible, but false negatives are not.
Apart from their use in distributed join processing, Bloom filters are also used in NoSQL systems such
as BigTable to avoid having to read an SSTable in full to find out that the key being searched for does
not appear in it.
The Bloom join method is similar to the semi-join method, as it too aims to reduce the amount of data
being sent from the site of R to the site of S.
However, rather than doing a projection on S and sending the resulting data to the site of R, a bit-vector
of a fixed size k is computed by hashing each tuple of S to the range [0..k − 1] (using the join attribute
values). The ith bit of the vector is set to 1 if some tuple of S hashes to i and is set to 0 otherwise.
Then, at the site of R, the tuples of R are also hashed to [0..k − 1] (using the same hash function and the
join attribute values), and only those tuples of R whose hash value corresponds to a 1 in the bit-vector
sent from S are retained for shipping to the site of S.
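A minimal sketch of this filtering step, using a single hash function for brevity (real Bloom filters typically combine several) and invented relations joined on cname; the tuples returned by the filter at site(R) are exactly those that would be shipped to site(S).

```python
def bloom_bits(values, k):
    # At site(S): build a k-bit vector by hashing each join value into [0..k-1].
    bits = [0] * k
    for v in values:
        bits[hash(v) % k] = 1
    return bits

def filter_R(R, attr, bits):
    # At site(R): keep only the tuples whose hash hits a set bit; only these are
    # shipped to site(S), where the real join then discards any false positives.
    k = len(bits)
    return [r for r in R if bits[hash(r[attr]) % k] == 1]

S = [{"cname": "ann"}, {"cname": "bob"}]
R = [{"cname": n, "balance": i} for i, n in enumerate(["ann", "bob", "carol", "dave"])]

bits = bloom_bits([s["cname"] for s in S], k=8)   # shipped from site(S) to site(R)
print(filter_R(R, "cname", bits))                 # candidate tuples shipped back to site(S)
```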
The cost of shipping the bit-vector from the site of S to the site of R is less than the cost of shipping
the projection of S in the semi-join method.
However, the size of the subset of R that is sent back to the site of S is likely to be larger (since only
approximate matching of tuples is taking place now), and so the shipping costs and join costs are likely
to be higher than with the semi-join method.