Parallel and Distributed Query Processing Notes
The basic foundations of parallel query evaluation are the splitting and merging of parallel streams of data:

• Data streams arising from different disks or processors provide the input for each operator in the query.

• The results produced as an operator is evaluated in parallel over a set of nodes need to be merged at one node in order to produce the overall result of the operator.

• This result set may then be split again in order to parallelise subsequent processing in the query.
1.1 Sort
Given the frequency with which sorting is required in query processing, parallel sorting algorithms have
been much studied.
A commonly used parallel sorting algorithm is the parallel merge sort:
This first sorts each fragment of the relation individually on its local disk, e.g. using the external merge
sort algorithm we have already looked at.
Groups of fragments are then shipped to one node per group, which merges the group of fragments into
a larger sorted fragment. This process repeats until a single sorted fragment is produced at one of the
nodes.
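As an illustration, here is a minimal single-machine sketch of parallel merge sort in Python, using a multiprocessing pool to stand in for the nodes and heapq.merge for merging sorted runs. The fragment data and group size are invented for illustration, and the local sorts are done in memory rather than with an external merge sort.

```python
import heapq
from multiprocessing import Pool

def sort_fragment(fragment):
    # Phase 1: each "node" sorts its local fragment
    # (a stand-in for an external merge sort over a disk-resident fragment).
    return sorted(fragment)

def merge_group(group):
    # Phase 2: one node per group merges its group of sorted runs.
    return list(heapq.merge(*group))

def parallel_merge_sort(fragments, group_size=2):
    with Pool() as pool:
        runs = pool.map(sort_fragment, fragments)   # local sorts, in parallel
        while len(runs) > 1:                        # repeat until one sorted run remains
            groups = [runs[i:i + group_size] for i in range(0, len(runs), group_size)]
            runs = pool.map(merge_group, groups)    # merge groups, in parallel
    return runs[0]

if __name__ == "__main__":
    fragments = [[5, 1, 9], [4, 4, 2], [8, 0, 7], [3, 6]]   # one fragment per disk/node
    print(parallel_merge_sort(fragments))
```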
1.3 Join
R ⋈ S using Parallel Nested Loops or Index Nested Loops
First the ‘outer’ relation has to be chosen. In particular, if S has an appropriate index on the join
attribute(s) then R should be the outer relation.
All the fragments of R are then shipped to all nodes. So each node i now has a whole copy of R as well as its own fragment of S, Si.

The local joins R ⋈ Si are performed in parallel on all the nodes, and the results are finally shipped to a chosen node for merging.
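A minimal sketch of this 'broadcast' strategy, assuming relations are lists of Python dictionaries and the join is an equijoin on a single attribute; in a real system each local join would run on its own node and ship its partial result for merging.

```python
def local_join(R, S_i, attr):
    # Nested loops join of the full (broadcast) copy of R with the local fragment S_i.
    return [{**r, **s} for r in R for s in S_i if r[attr] == s[attr]]

def parallel_nested_loops_join(R, S_fragments, attr):
    # Ship all of R to every node, join locally, then merge the partial results.
    partial_results = [local_join(R, S_i, attr) for S_i in S_fragments]
    return [t for part in partial_results for t in part]

R = [{"id": 1, "x": "a"}, {"id": 2, "x": "b"}]
S_fragments = [[{"id": 1, "y": 10}], [{"id": 2, "y": 20}, {"id": 3, "y": 30}]]
print(parallel_nested_loops_join(R, S_fragments, "id"))
```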
R ⋈ S using Parallel Sort-Merge Join (for natural/equijoins)
The first phase of this involves sorting R and S on the join attribute(s). These sorts can be performed
using the parallel merge sort operation described in Section 1.1.
The sorted relations are then partitioned across the nodes using range partitioning with the same sub-
ranges on the join attribute(s) for both relations.
The local joins of each pair of sorted fragments Ri ⋈ Si are performed in parallel, and the results are finally shipped to a chosen node for merging.
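The following sketch illustrates the idea under simplifying assumptions: relations are lists of dictionaries already sorted on the join attribute, the sub-range boundaries are given explicitly, and the per-node joins run sequentially here rather than in parallel.

```python
import bisect

def range_partition(tuples, attr, boundaries):
    # Assign each tuple to a node using the same sub-ranges for both relations.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        parts[bisect.bisect_right(boundaries, t[attr])].append(t)
    return parts

def merge_join(R_i, S_i, attr):
    # Local merge join of one pair of fragments, both already sorted on attr.
    out, j = [], 0
    for r in R_i:
        while j < len(S_i) and S_i[j][attr] < r[attr]:
            j += 1
        k = j
        while k < len(S_i) and S_i[k][attr] == r[attr]:
            out.append({**r, **S_i[k]})
            k += 1
    return out

R = [{"id": i} for i in (1, 3, 5, 7)]                 # already sorted on id
S = [{"id": i, "y": i * 10} for i in (3, 5, 8)]       # already sorted on id
boundaries = [4]                                      # node 0: id < 4, node 1: the rest
R_parts = range_partition(R, "id", boundaries)
S_parts = range_partition(S, "id", boundaries)
print([t for R_i, S_i in zip(R_parts, S_parts) for t in merge_join(R_i, S_i, "id")])
```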
R ⋈ S using Parallel Hash Join (for natural/equijoins only)
Each bucket of R and S is logically assigned to one node.
The first hashing phase, using the first hash function h1, is undertaken in parallel on all the nodes. Each tuple t from R or S is shipped to node i if the bucket assigned to it by h1 is the ith bucket.
The next phase is also undertaken in parallel on all nodes. On each node i, a hash table is created from the local fragment of R, Ri, using another hash function h2. The local fragment of S, Si, is then scanned and h2 is used to probe the hash table for matching records of Ri for each record of Si.
The results produced at each node are shipped to a chosen node for merging.
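A compact sketch of the two phases, simulating the nodes with in-memory lists; the hash functions h1 and h2 are arbitrary stand-ins, and the equality check after probing guards against collisions of h2.

```python
from collections import defaultdict

def h1(value, n_nodes):
    # Partitioning hash: decides which node's bucket a tuple belongs to.
    return hash(value) % n_nodes

def h2(value):
    # Local hash: used to build and probe the per-node hash table.
    return hash((value, "local"))

def parallel_hash_join(R, S, attr, n_nodes=3):
    # Phase 1: ship each tuple of R and S to the node owning its h1 bucket.
    R_at, S_at = defaultdict(list), defaultdict(list)
    for r in R:
        R_at[h1(r[attr], n_nodes)].append(r)
    for s in S:
        S_at[h1(s[attr], n_nodes)].append(s)

    # Phase 2: on each node, build a hash table on R_i with h2, then probe it with S_i.
    result = []
    for node in range(n_nodes):
        table = defaultdict(list)
        for r in R_at[node]:
            table[h2(r[attr])].append(r)
        for s in S_at[node]:
            for r in table[h2(s[attr])]:
                if r[attr] == s[attr]:          # guard against h2 collisions
                    result.append({**r, **s})
    return result                               # each node would ship its part for merging

R = [{"id": i} for i in range(5)]
S = [{"id": i, "y": i * 10} for i in range(0, 10, 2)]
print(parallel_hash_join(R, S, "id"))
```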
Compared with query optimisation on a single node, several additional factors now arise:

• The costs of partitioning and merging the data now need to be taken into account.
• If the data distribution is skewed, this will have an impact on the overall time taken to complete
the evaluation of an operator, so that too needs to be taken into account.
• The results being produced by one operator can be pipelined into the evaluation of another operator
that is executing at the same time on a different node.
For example, consider this left-deep join tree where all the joins are nested loops joins:
((R1 ⋈ R2) ⋈ R3) ⋈ R4

The tuples being produced by R1 ⋈ R2 can be used to ‘probe’ R3, and the resulting tuples can be used to probe R4, thus setting up a pipeline of three concurrently executing join operators on three different nodes (a minimal sketch of such a pipeline appears after this list).
• There is now the possibility of executing different operators of the query concurrently on different
nodes.
For example, with this bushy join tree:
((R1 ⋈ R2) ⋈ (R3 ⋈ R4))
the join of R1 with R2 and the join of R3 with R4 can be executed concurrently. The partial results
of these joins can also be pipelined into their parent join operator.
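As referenced in the pipelining bullet above, here is a minimal sketch of a pipeline of nested loops joins using Python generators: each operator emits result tuples as soon as they are produced, so the joins higher up the tree can start consuming them immediately. The relations and join attribute are invented for illustration, and the operators run in one process here rather than on separate nodes.

```python
def scan(relation):
    # Leaf operator: stream the tuples of a base relation.
    yield from relation

def nl_join(left_stream, right, attr):
    # Nested loops join operator: probes 'right' with each tuple arriving from
    # 'left_stream' and emits matches immediately, without waiting for all input.
    for l in left_stream:
        for r in right:
            if l[attr] == r[attr]:
                yield {**l, **r}

R1 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
R2 = [{"id": 1, "b": 10}, {"id": 2, "b": 20}]
R3 = [{"id": 1, "c": 100}]
R4 = [{"id": 1, "d": 1000}]

# ((R1 join R2) join R3) join R4 as a pipeline: each operator could run on its
# own node, consuming tuples as soon as its child produces them.
pipeline = nl_join(nl_join(nl_join(scan(R1), R2, "id"), R3, "id"), R4, "id")
for t in pipeline:
    print(t)
```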
In uniprocessor systems only left-deep join orders are usually considered (and this still gives n! possible
join orders for a join of n relations).
In multi-processor systems, other join orders can result in more parallelisation, e.g. bushy trees, as
illustrated above. Even if such a plan is more costly in terms of the number of I/O operations performed,
it may execute more quickly than a left-deep plan due to the increased parallelism it affords.
So, in general, the number of candidate query plans is much greater in parallel database systems and
more heuristics need to be employed to limit the search space of query plans. For example, one possible
approach is for the query optimiser to first find the best plan for sequential evaluation of the query; and
then find the fastest parallel execution of that plan.
When relations are fragmented across sites, each relation referenced in a query is replaced in Step 2 by an expression that reconstructs it from its fragments; this expression is then simplified in Step 3. The simplifications that can be carried out in Step 3 in the case of horizontal partitioning include the following:
• eliminating fragments from the argument to a selection operation if they cannot contribute any
tuples to the result of that selection;
• distributing join operations over unions of fragments, and eliminating joins that can yield no tuples;
For example, suppose a table Employee(empID, site, salary, . . . ) is horizontally fragmented into four
fragments:
E1 = σ_{site='A' AND salary<30000} Employee
E2 = σ_{site='A' AND salary>=30000} Employee
E3 = σ_{site='B' AND salary<30000} Employee
E4 = σ_{site='B' AND salary>=30000} Employee
then the query σ_{salary<25000} Employee is replaced in Step 2 by

σ_{salary<25000} (E1 ∪ E2 ∪ E3 ∪ E4)

which simplifies to

σ_{salary<25000} (E1 ∪ E3)

since the selection condition salary < 25000 contradicts the predicate salary >= 30000 defining E2 and E4, so those two fragments cannot contribute any tuples to the result.
For example, suppose a table WorksIn(empID, site, project, . . . ) is horizontally fragmented into two
fragments:
W1 = σ_{site='A'} WorksIn
W2 = σ_{site='B'} WorksIn
then the query Employee ⋈ WorksIn is replaced in Step 2 by:

(E1 ∪ E2 ∪ E3 ∪ E4) ⋈ (W1 ∪ W2)

Distributing the join over the unions gives:

(E1 ⋈ W1) ∪ (E2 ⋈ W1) ∪ (E3 ⋈ W1) ∪ (E4 ⋈ W1) ∪
(E1 ⋈ W2) ∪ (E2 ⋈ W2) ∪ (E3 ⋈ W2) ∪ (E4 ⋈ W2)

Since site is a common attribute of Employee and WorksIn, joins between fragments from different sites can yield no tuples, so this simplifies to:

(E1 ⋈ W1) ∪ (E2 ⋈ W1) ∪ (E3 ⋈ W2) ∪ (E4 ⋈ W2)
One simplification that can be carried out in Step 3 in the case of vertical partitioning is that fragments in
the argument of a projection operation which have no non-key attributes in common with the projection
attributes can be eliminated.
For example, if a table Projects(projNum, budget, location, projName) is vertically partitioned into two
fragments:
P1 = π_{projNum,budget,location} Projects
P2 = π_{projNum,projName} Projects
then the query π_{projNum,location} Projects is replaced in Step 2 by:

π_{projNum,location} (P1 ⋈ P2)

which simplifies to

π_{projNum,location} P1

since P2 has no non-key attributes in common with the projection attributes.
Turning to joins of relations R and S stored at different sites, suppose we need to compute R ⋈ S at the site of S. Two methods for doing this are:

• the full-join method, and
• the semi-join method.
Full-join method
The simplest method for computing R ⋈ S at the site of S consists of shipping R to the site of S and doing the join there. The shipping cost is

c × pages(R)

where c is the cost of transmitting one page of data from the site of R to the site of S, and pages(R) is the number of pages that R consists of. On top of this come the local I/O costs of reading R at its own site, saving it at the site of S, and performing the join there.
If the result of this join were needed at a different site, then there would also be the additional cost of
sending the result of the join from site(S) to where it is needed.
Semi-join method
This is an alternative method for computing R ⋈ S at the site of S and consists of the following steps:
(i) Compute π_{R∩S}(S) at the site of S, where π_{R∩S} denotes projection on the common attributes of R and S.

(ii) Ship π_{R∩S}(S) to the site of R.
(iii) Compute R ⋉ S at the site of R, where ⋉ is the semi-join operator, defined as follows:

R ⋉ S = R ⋈ π_{R∩S}(S)

(iv) Ship R ⋉ S to the site of S.

(v) Compute (R ⋉ S) ⋈ S at the site of S. This gives the required result, since

R ⋈ S = (R ⋉ S) ⋈ S
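A minimal sketch of the semi-join data flow, with the two 'sites' represented simply as separate function calls; the relation contents and attribute names are invented for illustration.

```python
def project_common(S, common_attrs):
    # Step (i): projection of S onto the join attributes, computed at site(S).
    return {tuple(s[a] for a in common_attrs) for s in S}

def semi_join(R, proj_S, common_attrs):
    # Step (iii): keep only the tuples of R that will find a join partner in S.
    return [r for r in R if tuple(r[a] for a in common_attrs) in proj_S]

def join(R, S, common_attrs):
    # Step (v): the final join, performed at site(S) on the reduced R.
    return [{**r, **s} for r in R for s in S
            if all(r[a] == s[a] for a in common_attrs)]

accounts  = [{"cname": "ann", "balance": 10}, {"cname": "bob", "balance": 20}]
customers = [{"cname": "ann", "city": "London"}]

proj = project_common(customers, ["cname"])         # shipped from site(S) to site(R)
reduced_R = semi_join(accounts, proj, ["cname"])    # shipped from site(R) to site(S)
print(join(reduced_R, customers, ["cname"]))        # equals accounts JOIN customers
```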
Example 1. Suppose R is the relation accounts, occupying 1000 pages, S is the relation customer, occupying 500 pages, and the common join attribute is cname. Suppose we need to compute R ⋈ S at the site of S.

With the full-join method the cost is 1000 I/Os to read R, plus (c × 1000) to transmit R to the site of S, plus 1000 I/Os to save it there, plus (3 × (1000 + 500)) I/Os (assuming a hash join) to perform the join. This gives a total cost of:

(c × 1000) + 6500 I/Os
With the semi-join method we have the cost of:

(i) Computing π_{R∩S}(S) at site(S), i.e. 500 I/Os to scan S, generating 100 pages of just the cname values.

(ii) Shipping π_{R∩S}(S) to site(R), i.e. c × 100, and saving it there, i.e. 100 I/Os.

(iii) Computing R ⋉ S at site(R), i.e. 3 × (100 + 1000) I/Os, assuming a hash join.

(iv) Shipping the result of R ⋉ S to the site of S, i.e. c × 1000, and saving it there, i.e. 1000 I/Os (since cname in R is a foreign key referencing S, every tuple of R joins, so R ⋉ S is the whole of R).

(v) Computing (R ⋉ S) ⋈ S at site(S), i.e. 3 × (1000 + 500) I/Os.

This gives a total cost of (c × 1100) + 9400 I/Os.

So in this case the full-join method ((c × 1000) + 6500 I/Os) is cheaper: we have gained nothing by using the semi-join method since all the tuples of R join with tuples of S.
Example 2. Let R be as above (i.e., accounts) and let

S = σ_{city='London'}(customer)

Suppose again that we need to compute R ⋈ S at the site of S.
Suppose also that there are 100 different cities in customer, that there is a uniform distribution of
customers across cities, and a uniform distribution of accounts over customers. So S contains 500 tuples
on 5 pages.
With the full-join method we have a cost of (c × 1000) + 5015 I/Os, computed as in Example 1 but with pages(S) now 5. With the semi-join method we have a cost of:

(i) Computing π_{R∩S}(S) at site(S), i.e. 5 I/Os to scan S, generating 1 page of cname values.

(ii) Shipping π_{R∩S}(S) to site(R), i.e. c × 1, plus 1 I/O to save it there.

(iii) Computing R ⋉ S at site(R), i.e. 3 × (1 + 1000) I/Os, assuming a hash join.

(iv) Shipping R ⋉ S to the site of S, i.e. c × 10 since, due to the uniform distribution of accounts over customers, 1/100th of R will match the cname values sent to it from S, plus the cost of saving the result of R ⋉ S at the site of S, i.e. 10 I/Os.

(v) Computing (R ⋉ S) ⋈ S at site(S), i.e. 3 × (10 + 5) I/Os.
The overall cost is thus (c × 11) + 3064 I/Os versus (c × 1000) + 5015 I/Os for the full-join method.
So in this case the semi-join method is cheaper. This is because a significant number of tuples of R do
not join with S and so are not sent to the site of S.
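The cost figures in the two examples can be reproduced with a few lines of arithmetic. The sketch below keeps the shipping cost (the multiple of c) separate from the local I/O cost and follows the same assumptions as the examples: hash joins costing 3 × (pages of the two inputs), and shipped data being saved on arrival.

```python
def full_join_cost(pages_R, pages_S):
    # Full-join method: read R, ship it, save it at site(S), hash join there.
    # Returns (pages shipped, local I/Os).
    io = pages_R + pages_R + 3 * (pages_R + pages_S)
    return pages_R, io

def semi_join_cost(pages_R, pages_S, pages_proj, pages_matching):
    # Semi-join method: project S, ship the projection, semi-join at site(R),
    # ship the matching part of R back, save it, and join at site(S).
    shipped = pages_proj + pages_matching
    io = (pages_S                             # (i)  scan S to compute the projection
          + pages_proj                        # (ii) save the projection at site(R)
          + 3 * (pages_proj + pages_R)        # (iii) hash semi-join at site(R)
          + pages_matching                    # (iv) save the reduced R at site(S)
          + 3 * (pages_matching + pages_S))   # (v)  final hash join at site(S)
    return shipped, io

# Example 1: S = customer (500 pages); every tuple of R joins.
print(full_join_cost(1000, 500))              # -> (1000, 6500)
print(semi_join_cost(1000, 500, 100, 1000))   # -> (1100, 9400)

# Example 2: S = London customers (5 pages); only 1/100th of R joins.
print(full_join_cost(1000, 5))                # -> (1000, 5015)
print(semi_join_cost(1000, 5, 1, 10))         # -> (11, 3064)
```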
Bloom join method
In general, a Bloom filter is an efficient in-memory structure (usually a bit vector) used to approximate the contents of a set S in the following sense:

• given an item i, the filter reports either that i is definitely not in S or that i may be in S; false positives are therefore possible, but false negatives are not.
Apart from their use in distributed join processing, Bloom filters are also used in NoSQL systems such
as BigTable to avoid having to read an SSTable in full to find out that the key being searched for does
not appear in it.
The Bloom join method is similar to the semi-join method, as it too aims to reduce the amount of data
being sent from the site of R to the site of S.
However, rather than doing a projection on S and sending the resulting data to the site of R, a bit-vector
of a fixed size k is computed by hashing each tuple of S to the range [0..k − 1] (using the join attribute
values). The ith bit of the vector is set to 1 if some tuple of S hashes to i and is set to 0 otherwise.
Then, at the site of R, the tuples of R are also hashed to [0..k − 1] (using the same hash function and the
join attribute values), and only those tuples of R whose hash value corresponds to a 1 in the bit-vector
sent from S are retained for shipping to the site of S.
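A minimal sketch of this filtering step, using a single hash function for brevity (real Bloom filters typically combine several) and invented relations joined on cname; the tuples returned by the filter at site(R) are exactly those that would be shipped to site(S).

```python
def bloom_bits(values, k):
    # At site(S): build a k-bit vector by hashing each join value into [0..k-1].
    bits = [0] * k
    for v in values:
        bits[hash(v) % k] = 1
    return bits

def filter_R(R, attr, bits):
    # At site(R): keep only the tuples whose hash hits a set bit; only these are
    # shipped to site(S), where the real join then discards any false positives.
    k = len(bits)
    return [r for r in R if bits[hash(r[attr]) % k] == 1]

S = [{"cname": "ann"}, {"cname": "bob"}]
R = [{"cname": n, "balance": i} for i, n in enumerate(["ann", "bob", "carol", "dave"])]

bits = bloom_bits([s["cname"] for s in S], k=8)   # shipped from site(S) to site(R)
print(filter_R(R, "cname", bits))                 # candidate tuples shipped back to site(S)
```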
The cost of shipping the bit-vector from the site of S to the site of R is less than the cost of shipping
the projection of S in the semi-join method.
However, the size of the subset of R that is sent back to the site of S is likely to be larger (since only
approximate matching of tuples is taking place now), and so the shipping costs and join costs are likely
to be higher than with the semi-join method.