
Join Processing in Database Systems with Large Main Memories


LEONARD D. SHAPIRO
North Dakota State University

We study algorithms for computing the equijoin of two relations in a system with a standard architecture but with large amounts of main memory. Our algorithms are especially efficient when the main memory available is a significant fraction of the size of one of the relations to be joined; but they can be applied whenever there is memory equal to approximately the square root of the size of one relation. We present a new algorithm which is a hybrid of two hash-based algorithms and which dominates the other algorithms we present, including sort-merge. Even in a virtual memory environment, the hybrid algorithm dominates all the others we study.

Finally, we describe how three popular tools to increase the efficiency of joins, namely filters, Babb arrays, and semijoins, can be grafted onto any of our algorithms.

Categories and Subject Descriptors: H.2.0 [Database Management]: General; H.2.4 [Database Management]: Systems - query processing; H.2.6 [Database Management]: Database Machines

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Hash join, join processing, large main memory, sort-merge join

1. INTRODUCTION
Database systems are gaining in popularity owing to features such as data
independence, high-level interfaces, concurrency control, crash recovery, and so
on. However, the greatest drawback to database management systems (other
than cost) is the inefficiency of full-function database systems, compared to
customized programs; and one of the most costly operations in database process-
ing is the join. Traditionally the most effective algorithm for executing a join (if
there are no indices) has been sort-merge [4]. In [6] it was suggested that the
existence of increasingly inexpensive main memory makes it possible to use
hashing techniques to execute joins more efficiently than sort-merge. Here we
extend these results.
Some of the first research on joins using hashing [14, 21] concerned multipro-
cessor architectures. Our model assumes a “vanilla” computer architecture, that
is, a uniprocessor system available in the market today. Although the lack of
parallel processing in such systems deprives us of much of the potential speed of

Author's address: Department of Computer Science, North Dakota State University, Fargo, ND 58105.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1986 ACM 0362-5915/86/0900-0239 $00.75

join processing on multiprocessor systems, our algorithms can be implemented on current systems, and they avoid the complex synchronization problems of
some of the more sophisticated multiprocessor algorithms.
Our algorithms require significant amounts of main memory to execute most
efficiently. We assume it is not unreasonable to expect that the database system
can assign several megabytes of buffer space to executing a join. (Current VAX systems can support 32 megabytes of real memory with 64K chips [2]; and it is argued in [10] that a system can be built with existing technology that will support tens of gigabytes of main memory, and which appears to a programmer to have a standard architecture.)
We will see that our algorithms are most effective when the amount of real
memory available to the process is close to the size of one of the relations.
The word “large” in the title refers to a memory size large enough that it is
not uncommon for all, or a significant fraction, of one of the relations to be
joined to fit in main memory. This is because the minimum amount of memory
required to implement our algorithms is approximately the square root of the
size of one of the relations (measured in physical blocks). This allows us to
process rather large relations in main memories which by today’s standards might
not be called large. For example, using our system parameters and 4 megabytes
of real memory as buffer space, we can join two relations, using our most efficient
algorithm, if the smaller of the two relations is at most 325 megabytes.
We show that with sufficient main memory, and for sufficiently large relations,
the most efficient algorithms are hash-based. We present two classes of hash-
based algorithms, one (simple hash) that is most efficient when most of one
relation fits in main memory and another (GRACE) that is most efficient when
much less of the smaller relation fits. We then describe a new algorithm, which
is a hybrid of simple and GRACE, that is the most efficient of those we study,
even in a virtual memory environment. This is in contrast to current commercial
database systems, which find sort-merge-join to be most efficient in many
situations and do not implement hash joins.
In Section 2 we present four algorithms for computing an equijoin, along with
their cost formulas. The first algorithm is sort-merge, modified to take advantage
of large main memory. The next is a very simple use of hashing, and another is
based on GRACE, the Japanese fifth-generation project’s database machine [14].
The last is a hybrid of the simple and GRACE algorithms. We show in Section
3 that for sufficiently large relations the hybrid algorithm is the most efficient,
sort-merge included, and we present some analytic modeling results. All our
hashing algorithms are based on the idea of partitioning: we partition each
relation into subsets which can (on the average) fit into main memory. In Section
3 we assume that all the partitions can fit into main memory, and in Section 4
discuss how to deal with the problem of partition overflow. In Section 5 we
describe the effect of using virtual memory in place of some main memory. In
Section 6 we discuss how to include in our algorithms other tools that have
become popular in database systems, namely selection filters, semijoin strategies,
and Babb arrays.
A description of these algorithms, similar to that in Section 2, and analytic
modeling results similar to those in the second half of Section 3, appeared in [6].

In [5] it is shown that hashing is preferable to nested-loop and sort-merge algorithms for a variety of relational algebra operations, results consistent with those we present. In [7] the results of [6] are extended to the multiprocessor environment, and experimental results are reported. These results support the analyses in [6] and in the present paper, and show that, in the cases reported, if a bit-filtering technique is used (see Section 6) the timings of all algorithms are similar above a certain memory size. A related algorithm, the nested-hash algorithm, is also studied there, and is shown to have performance comparable to the hybrid algorithm for large memory sizes, but to be inferior to hybrid for smaller memory sizes. In [22], the GRACE hash algorithm is studied in depth, including an analysis of the case when more than two phases of processing are needed and an analysis of various partitioning schemes. I/O accesses and CPU time are analyzed separately, and it is shown that GRACE hash-join is superior to merge-join.
1.1 Notation and Assumptions
Our goal is to compute the equijoin of two relations labeled R and S. We use M
to denote main memory. We do not count initial reads of R or S or final writes
of the join output because these costs are identical for all algorithms. After the
initial reads of R and S, those relations are not referenced again by our algo-
rithms. Therefore, all I/O in the join processing is of temporary relations. We
choose to block these temporary relations at one track per physical block, so each
I/O which we count will be of an entire track. Therefore, throughout this paper
the term “block” refers to a full track of data. Of course, R and S may be stored
with a different blocking factor. In all the cost formulas and analytic modeling
in this paper we use the labels given in Figure 1.
In our model we do not distinguish between sequential and random I/O. This
is justified because all reads or writes of any temporary file in our algorithms will
be sequential from or to that file. If only one file is active, I/O is sequential in
the traditional sense, except that because of our full-track blocking an I/O
operation is more likely to cause head movement. If more than one file is active
there will be more head movement, but the cost of that extra head movement is
assumed negligible. This is why we choose a full track as our blocking factor for
the temporary relations.
We assume that S is the larger relation, that is, |R| ≤ |S|. We use the "fudge factor", F, in Figure 1 to calculate values that are small increments of other values. For example, a hash table for R is assumed to occupy |R| * F blocks.
In our cost formulas we assume that all selections and projections of R and S
have already been done and that neither relation R nor S is ordered or indexed.
We assume no overlap between CPU and I/O processing. We assume each tuple
from S joins with at most one block of tuples from R. If it is expected that there
will be few tuples in the resulting join, it may be appropriate to process only
tuple IDs instead of projected tuples, and then at the end translate the TIDs that
are output into actual tuple values. We view this as a separate process from the
actual join; our formulas do not include this final step.
For the first four sections of this paper we assume that the memory manager
allocates a fixed amount of real memory to each join process. The process knows

comp      time to compare keys in main memory
hash      time to hash a key that is in main memory
move      time to move a tuple in memory
swap      time to swap two tuples in memory
IO        time to read or write a block between disk and main memory
F         incremental factor (see below)
|R|       number of blocks in R relation (similar for S and M)
{R}       number of tuples in R (similar for S)
{M}_R     number of R tuples that can fit in M (similar for S)

Fig. 1. Notation used in cost formulas in this paper.

how much memory is allocated to it, and can use this information in designing a
strategy for the join. The amount of real memory allocated is fixed throughout
the lifetime of the process. In Section 5 we discuss an alternative to this simple
memory management strategy.

2. THE JOIN ALGORITHMS


In this section we present four algorithms for computing the equijoin of relations
R and S. One is a modified sort-merge algorithm and the other three are based
on hashing.
Each of the algorithms we describe executes in two phases. In phase 1, the
relations R and S are restructured into runs (for sort-merge) or subsets of a
partition (for the three hashing algorithms). In phase 2, the restructured relations
are used to compute the join.

2.1 Sort-Merge-Join Algorithm


The standard sort-merge-join algorithm [4] begins by producing sorted runs of
tuples of S. The runs are on the average (over all inputs) twice as long as the
number of tuples that can fit into a priority queue in memory [15, p. 254]. This
requires one pass over S. In subsequent phases, the runs are sorted using an
n-way merge. R is sorted similarly. After R and S are sorted they are merged
together and tuples with matching join attributes are output. For a fixed relation
size, the CPU time to do a sort with n-way merges is independent of n, but I/O
time increases as n decreases and the number of phases increases. One should
therefore choose the merging factor n to be as large as possible so that the process
will involve as few phases as possible. Ideally, only two phases will be needed,
one to construct runs and the other to merge and join them. We show that if
|M| is at least √|S|, then only two phases are needed to accomplish the join.
Here are the steps of our version of the sort-merge algorithm in the case where there are at least √|S| blocks of memory for the process (see Figure 2, sort-merge-join).
In the following analysis we use average (over all inputs) values, for instance,
2 * |M| is the length of a run.
(1) Scan S and produce output runs using a heap or some other priority
queue structure. Do the same for R. A run will be 2 * |M| blocks long. Given
ACMTransactions
onDatabase
Systems,
Vol.11,No.3,September
m6.
Join Processing in Database Systems - 243

that |M| ≥ √|S|, the runs will be at least 2√|S| blocks in length. Therefore there will be at most

|S| / (2√|S|) = √|S| / 2

distinct runs of S on the disk. Since S is the larger relation, R has at most the same number of runs on disk. Therefore there will be at most √|S| runs of R and S altogether on the disk at the end of phase 1.
(2) Allocate one block of memory for buffer space for each run of R and S.
Merge all the runs of R and concurrently merge all the runs of S. As tuples of R
and S are generated in sorted order by these merges, they can be checked for a
match. When a tuple from R matches one from S, output the pair.
In step (2), one input buffer is required per run, and there are at most √|S| runs, so if |M| ≥ √|S|, there is sufficient room for the input buffers. Extra space is required for merging, but this is negligible since the priority queue contains only one tuple per run.
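To make the two phases concrete, here is a minimal sketch in Python. It is an illustration, not the paper's implementation: lists stand in for run files on disk, run_len is measured in tuples, the key functions (our own names) extract the joining attribute, and run generation sorts fixed-size chunks rather than using replacement selection, so runs come out |M| rather than 2 * |M| blocks long.

import heapq

def make_runs(relation, key, run_len):
    # Phase 1: produce sorted runs; each inner list stands in for a run file.
    return [sorted(relation[i:i + run_len], key=key)
            for i in range(0, len(relation), run_len)]

def sort_merge_join(r_runs, s_runs, r_key, s_key):
    # Phase 2: merge all runs of R and of S concurrently (one buffer per run)
    # and join tuples with matching keys as they emerge in sorted order.
    r_stream = heapq.merge(*r_runs, key=r_key)
    s_sorted = list(heapq.merge(*s_runs, key=s_key))
    out, j = [], 0
    for r in r_stream:
        while j < len(s_sorted) and s_key(s_sorted[j]) < r_key(r):
            j += 1                                # skip S tuples with smaller keys
        k = j
        while k < len(s_sorted) and s_key(s_sorted[k]) == r_key(r):
            out.append((r, s_sorted[k]))          # output each matching pair
            k += 1
    return out

# Example: join on the first field of each tuple.
R = [(1, 'a'), (3, 'b'), (2, 'c')]
S = [(2, 'x'), (3, 'y'), (3, 'z'), (4, 'w')]
key = lambda t: t[0]
print(sort_merge_join(make_runs(R, key, 2), make_runs(S, key, 2), key, key))
# [((2, 'c'), (2, 'x')), ((3, 'b'), (3, 'y')), ((3, 'b'), (3, 'z'))]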
If the memory manager allocates fewer than √|S| blocks of memory to the join process, more than two phases are needed. We do not investigate this case further; we assume |M| ≥ √|S|. If |M| is greater than √|S|, the extra blocks of real memory can be used to store runs between phases, thus saving I/O costs. This is reflected in the last term of the cost formula.
The cost of this algorithm is

({R} log₂{R} + {S} log₂{S}) * (comp + swap)    Manage priority queues in both phases.
+ (|R| + |S|) * IO                             Write initial runs.
+ (|R| + |S|) * IO                             Read initial runs.
+ ({R} + {S}) * comp                           Join results of final merge.
− min(|R| + |S|, (|M| − √|S|)) * 2 * IO        I/O savings if extra memory is available.

2.2 Hashing Algorithms


The simplest use of hashing as a join strategy is the following algorithm, which we call classic hashing: build a hash table, in memory, of tuples from the smaller relation R, hashed on the joining attribute(s). Then scan the other relation S sequentially. For each tuple in S, use the hash value for that tuple to probe the hash table of R for tuples with matching key values. If a match is found, output the pair, and if not then drop the tuple from S and continue scanning S.
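In code, classic hashing is a dictionary build over R followed by probes from S. A minimal sketch, assuming single-attribute join keys extracted by key functions of our own naming:

from collections import defaultdict

def classic_hash_join(R, S, r_key, s_key):
    # Build phase: hash table, in memory, over the smaller relation R.
    table = defaultdict(list)
    for r in R:
        table[r_key(r)].append(r)
    # Probe phase: scan S once; S tuples with no match are simply dropped.
    return [(r, s) for s in S for r in table.get(s_key(s), [])]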
This algorithm works best when the hash table for R can fit into real memory. When most of a hash table for R cannot fit in real memory, this classic algorithm can still be used in virtual memory, but it behaves poorly, since many tuples cause page faults. The three hashing algorithms we describe here each extend the classic hashing approach in some way so as to take into account the possibility that a hash table for R will not fit into main memory.
If a hash table for the smaller relation R cannot fit into memory, each of the
hashing algorithms described in this paper calculates the join by partitioning R
and S into disjoint subsets and then joining corresponding subsets. The size of
the subsets varies for different algorithms. For this method to work, one must
choose a partitioning of R and S so that computing the join can be done by just
joining corresponding subsets of the two relations. The first mention of this
method is in [11], and it also appears in [3]. Our use of it is closely related to the
description in [14].
The method of partitioning is first to choose a hash function h, and a partition of the values of h into, say, H₁, ..., Hₙ. (For example, the negative and nonnegative values of h constitute a partition of h values into two subsets, H₁ and H₂.) Then, one partitions R into corresponding subsets R₁, ..., Rₙ, where a tuple r of R is in Rᵢ whenever h(r) is in Hᵢ. Here by h(r) we mean the hash function applied to the joining attribute of r. Similarly, one partitions S into corresponding subsets S₁, ..., Sₙ with s in Sᵢ when h(s) is in Hᵢ. The subsets Rᵢ and Sᵢ are actually buckets, but we refer to them as subsets of the partition, since they are not used as ordinary hash buckets.
If a tuple r in R is in Rᵢ and it joins with a tuple s from S, then the joining attributes of r and s must be equal, thus h(r) = h(s) and s is in Sᵢ. This is why, to join R and S, it suffices to join the subsets Rᵢ and Sᵢ for each i.
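A sketch of this partitioning, taking Python's built-in hash for h and the residue classes modulo n as one simple choice of the partition H₁, ..., Hₙ of h's values:

def partition(relation, key, n):
    # Tuple t belongs to subset i exactly when h(key(t)) lands in H_i;
    # with residue classes as the H_i, that is hash(key(t)) mod n.
    subsets = [[] for _ in range(n)]
    for t in relation:
        subsets[hash(key(t)) % n].append(t)
    return subsets

Because matching tuples hash identically, partitioning R and S with the same function makes the join the union over i of join(Rᵢ, Sᵢ).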
In all the hashing algorithms we describe we are required to choose a partitioning into subsets of a specified size (e.g., such that R is partitioned into two subsets of equal size). Partitioning into subsets of specified size is not easy to accomplish if the distribution of the joining attribute of R is not well understood. In this section we describe the hashing algorithms as if bucket overflow never occurs; then in Section 4 we describe how to deal with the problem.
Each of the three algorithms we describe uses partitioning, as described above.
Each proceeds in two phases: the first is to partition each relation. The second
phase is to build hash table(s) for R and probe for matches with each tuple of S.
The first algorithm we describe, simple hashing, does as little of the first phase, partitioning, as possible on each step, and goes right into building a hash table and probing. It performs well when most of R can fit in memory. The next
algorithm, GRACE hash, does all of the first phase at once, then turns to the
second phase of building hash tables and probing. It performs relatively well

when little of R can fit in memory. The third algorithm, hybrid hash, combines
the two, doing all partitioning on the first pass over each relation and using
whatever memory is left to build a hash table. It performs well over a wide range
of memory sizes.

2.3 Simple Hash-Join Algorithm


If a hash table containing all of R fits into memory (i.e., if |R| * F ≤ |M|), the simple hash-join algorithm which we define here is identical to what we have
called classic hash-join. If there is not enough memory available, our simple
hash-join scans R repeatedly, each time partitioning off as much of R as can fit
in a hash table in memory. After each scan of R, S is scanned and, for tuples
corresponding to those in memory, a probe is made for a match (see Figure 3,
simple hash-join).
The steps of our simple hash-join algorithm are

(1) Let P = min(|M|, |R| * F). Choose a hash function h and a set of hash values so that P/F blocks of R tuples will hash into that set. Scan the (smaller) relation R and consider each tuple. If the tuple hashes into the chosen range, insert the tuple into a P-block hash table in memory. Otherwise, pass over the tuple and write it into a new file on disk.
(2) Scan the larger relation S and consider each tuple. If the tuple hashes into the chosen range, check the hash table of R tuples in memory for a match and output the pair if a match occurs. Otherwise, pass over the tuple and write it to disk. Note that if key values of the two relations are distributed similarly, there will be P/F * |S|/|R| blocks of the larger relation S processed in this pass.
(3) Repeat steps (1) and (2), replacing each of the relations R and S by the
set of tuples from R and S that were “passed over” and written to disk in the
previous pass. The algorithm ends when no tuples from R are passed over.
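The sketch below follows the three steps, with two idealizations worth flagging: the chosen range of hash values on each pass is a fixed residue class modulo the pass count (the algorithm proper sizes the range adaptively so the hash table just fills memory), and lists stand in for the disk files of passed-over tuples. The parameter capacity plays the role of {M}_R/F, the number of R tuples whose hash table fits in memory; the pass count corresponds to the quantity A in the cost analysis below.

import math

def simple_hash_join(R, S, r_key, s_key, capacity):
    out = []
    passes = max(1, math.ceil(len(R) / capacity))   # the quantity A below
    for p in range(passes):
        chosen = lambda k: hash(k) % passes == p    # chosen set of hash values
        table, passed_r = {}, []
        for r in R:                                 # step (1): scan R
            if chosen(r_key(r)):
                table.setdefault(r_key(r), []).append(r)
            else:
                passed_r.append(r)                  # passed over, "written to disk"
        passed_s = []
        for s in S:                                 # step (2): scan S
            if chosen(s_key(s)):
                out.extend((r, s) for r in table.get(s_key(s), []))
            else:
                passed_s.append(s)
        R, S = passed_r, passed_s                   # step (3): repeat on remainder
    return out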

This algorithm performs particularly well when most of R fits into main
memory. In that case, most of R (and S) are touched only once, and only what
cannot fit into memory is written out to disk and read in again. On the other
hand, when there is little main memory this algorithm behaves poorly, since in
that case there are many passes and both R and S are scanned over and over
again. In fact, this algorithm operates as specified for any amount of memory, but to be consistent with the other hash-based algorithms we assume it is undefined for less than √(F * |R|) blocks of memory.
We assume here and elsewhere that the same hash function is used both for
partitioning and for construction of the hash tables.
In the following formula and in later formulas we must estimate the number of compares required when the hash table is probed for a match. This amounts to estimating the number of collisions. We have chosen to use the term comp * F for the number of compares required. Although this term is too simple to be valid in general, when F is 1.4 (which is the value we use in our analytic modeling) it means that for a hash table with a load factor of 71 percent the estimated number of probes is 1.4. This is consistent with the simulations reported in [18].

The algorithm requires

⌈(|R| * F) / |M|⌉

passes to execute, where ⌈ ⌉ denotes the ceiling function. We denote this quantity by A. Note that on the ith pass, i = 1, ..., A − 1, there are

{R} − i * {M}_R/F

tuples of R passed over.


The cost of the algorithm is

(A * {R} − (A * (A − 1)/2) * {M}_R/F) * (hash + move)
    Hash and move R and passed-over tuples in R.

+ (A * {S} − (A * (A − 1)/2) * {M}_S/F) * (hash + move)
    Hash and move S and passed-over tuples in S.

− {S} * move
    On each pass, passed-over tuples of S are moved into the buffer and others result in probing the hash table for a match. The latter are not moved, yet they are counted as being moved in the previous term. This adjustment corrects that.

+ {S} * comp * F
    Check each tuple of S for a match.

+ ((A − 1) * |R| − (A * (A − 1)/2) * |M|/F) * 2 * IO
    Write and read passed-over tuples in R.

+ ((A − 1) * |S| − (A * (A − 1)/2) * (|M|/F) * (|S|/|R|)) * 2 * IO
    Write and read passed-over tuples in S.



2.4 GRACE Hash-Join Algorithm


As outlined in [14], the GRACE hash-join algorithm executes as two phases. The
first phase begins by partitioning R and S into corresponding subsets, such that
R is partitioned into sets of approximately equal size. During the second phase
of the GRACE algorithm the join is performed using a hardware sorter to execute
a sort-merge algorithm on each pair of sets in the partition.
Our version of the GRACE algorithm differs from that of [14] in two ways. First, we do the joining in the second phase by hashing, instead of using hardware sorters. Second, we use only √(F * |R|) blocks of memory for both phases; the rest is used to store as much of the partitions as possible so they need not be written to disk and read back again. The algorithm proceeds as follows, assuming there are √(F * |R|) blocks of memory (see Figure 4):
(1) Choose a hash function h, and a partition of its hash values, so that R will be partitioned into √(F * |R|) subsets of approximately equal size.¹ Allocate √(F * |R|) blocks of memory, each to be an output buffer for one subset of the partition of R.
(2) Scan R. Using h, hash each tuple and place it in the appropriate output buffer. When an output buffer fills, it is written to disk. After R has been completely scanned, flush all output buffers to disk.
(3) Scan S. Using h, the same function used to partition R, hash each tuple and place it in the appropriate output buffer. When an output buffer fills, it is written to disk. After S has been completely scanned, flush all output buffers to disk.
Steps (4) and (5) below are repeated for each set Rᵢ, 1 ≤ i ≤ √(F * |R|), in the partition of R, and its corresponding set Sᵢ.
(4) Read Rᵢ into memory and build a hash table for it.
We pause to check that a hash table for Rᵢ can fit in memory. Assuming that all the sets Rᵢ are of equal size, since there are √(F * |R|) of them, each of the sets Rᵢ will be

|R| / √(F * |R|) = √(|R|/F)

blocks in length. A hash table for each subset Rᵢ will therefore require

F * √(|R|/F) = √(F * |R|)

blocks of memory, and we have assumed at least this much real memory.
(5) Hash each tuple of Sᵢ with the same hash function used to build the hash table in (4). Probe for a match. If there is one, output the result tuple; otherwise proceed with the next tuple of Sᵢ.
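Steps (1) through (5) translate directly into code. The following is a minimal sketch under our own simplifying assumptions: Python's built-in hash plays the role of h, lists stand in for the on-disk subsets, and the partition count n (approximately √(F * |R|) in the text) is supplied by the caller.

def grace_hash_join(R, S, r_key, s_key, n):
    # Steps (1)-(2): partition R into n subsets with h(key) mod n.
    r_parts = [[] for _ in range(n)]
    for r in R:
        r_parts[hash(r_key(r)) % n].append(r)
    # Step (3): partition S with the same hash function h.
    s_parts = [[] for _ in range(n)]
    for s in S:
        s_parts[hash(s_key(s)) % n].append(s)
    out = []
    # Steps (4)-(5): for each pair (R_i, S_i), build a table and probe.
    for Ri, Si in zip(r_parts, s_parts):
        table = {}
        for r in Ri:
            table.setdefault(r_key(r), []).append(r)
        for s in Si:
            out.extend((r, s) for r in table.get(s_key(s), []))
    return out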
What if there are more or less than √(F * |R|) blocks of memory available? Just as with sort-merge-join, we do not consider the case when less than this minimum number of blocks is available, and if more blocks are available, we use them to store subsets of R and/or S so they need not be written to and read from disk.

¹ Our assumption that a tuple of S joins with at most one block of tuples of R is used here. If R contains many tuples with the same joining attribute value, then this partitioning may not be possible.

Fig. 4. GRACE hash-join.
This algorithm works very well when there is little memory available, because
it avoids repeatedly scanning R and S, as is done in simple hash. Yet when most
of R fits into memory, GRACE join does poorly since it scans both R and S
twice.
One advantage of using hash in the second phase, instead of sort-merge, as is
done by the designers of the GRACE machine, is that subsets of S can be of
arbitrary size. Only R needs to be partitioned into subsets of approximately equal
size. Since partition overflow can cause significant problems (see Section 4), this
is an important advantage.
The cost of this algorithm is

({R} + {S}) * (hash + move)                      Hash tuple and move to output buffer.
+ (|R| + |S|) * IO                               Write partitioned relations to disk.
+ (|R| + |S|) * IO                               Read partitioned sets.
+ {R} * (hash + move)                            Build hash tables in memory.
+ {S} * (hash + comp * F)                        Probe for a match.
− min(|R| + |S|, |M| − √(F * |R|)) * 2 * IO      I/O savings if extra memory is available.
2.5 Hybrid Hash-Join Algorithm
Hybrid hash combines the features of the two preceding algorithms, doing both
partitioning and hashing on the first pass over both relations. On the first pass,
instead of using memory as a buffer as is done in the GRACE algorithm, only as
many blocks (B, defined below) as are necessary to partition R into sets that can
fit in memory are used. The rest of memory is used for a hash table that is
processed at the same time that R and S are being partitioned (see Figure 5).

Fig. 5. Hybrid hash-join.

Here are the steps of the hybrid hash algorithm. If |R| * F ≤ |M|, then a hash table for R will fit in real memory, and hybrid hash is identical to simple hash in this case.

(1) Let

B = ⌈(|R| * F − |M|) / (|M| − 1)⌉.

There will be B + 1 steps in the hybrid hash algorithm. (To motivate the formula for B, we note that it is approximately equal to the number of steps in simple hash. The small difference is due to setting aside some real memory in phase 1 for a hash table for R₀.) First, choose a hash function h and a partition of its hash values which will partition R into R₀, ..., R_B, such that a hash table for R₀ has |M| − B blocks, and R₁, ..., R_B are of equal size. Then allocate B blocks in memory to B output buffers. Assign the other |M| − B blocks of memory to a hash table for R₀.
(2) Assign the ith output buffer block to Rᵢ for i = 1, ..., B. Scan R. Hash each tuple with h. If it belongs to R₀ it will be placed in memory in a hash table. Otherwise it belongs to Rᵢ for some i > 0, so move it to the ith output buffer block. When this step has finished, we have a hash table for R₀ in memory, and R₁, ..., R_B are on disk.
(3) The partition of R corresponds to a partition of S compatible with h, into sets S₀, ..., S_B. Assign the ith output buffer block to Sᵢ for i = 1, ..., B. Scan S, hashing each tuple with h. If the tuple is in S₀, probe the hash table in memory for a match. If there is a match, output the result tuple; otherwise drop the tuple. If the tuple is not in S₀, it belongs to Sᵢ for some i > 0, so move it to the ith output buffer block. Now R₁, ..., R_B and S₁, ..., S_B are on disk.

Repeat steps (4) and (5) for i = 1, ..., B.


(4) Read Rᵢ and build a hash table for it in memory.
(5) Scan Sᵢ, hashing each tuple, and probing the hash table for Rᵢ, which is in memory. If there is a match, output the result tuple; otherwise toss the S tuple.
We omit the computation that shows that a hash table for Rᵢ will actually fit in memory. It is similar to the above computation for the GRACE join algorithm.
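A minimal sketch of steps (1) through (5), with two simplifications of our own: the partition of hash values is taken to be residues modulo B + 1 (so R₀, ..., R_B come out roughly equal in size, rather than R₀ being sized to exactly |M| − B blocks), and lists stand in for the disk files.

def hybrid_hash_join(R, S, r_key, s_key, B):
    table = {}                                 # in-memory hash table for R0
    r_parts = [[] for _ in range(B)]
    for r in R:                                # step (2): one scan of R
        i = hash(r_key(r)) % (B + 1)
        if i == 0:
            table.setdefault(r_key(r), []).append(r)
        else:
            r_parts[i - 1].append(r)           # R_i goes to its output buffer
    out = []
    s_parts = [[] for _ in range(B)]
    for s in S:                                # step (3): one scan of S
        i = hash(s_key(s)) % (B + 1)
        if i == 0:                             # S0 probes the table immediately
            out.extend((r, s) for r in table.get(s_key(s), []))
        else:
            s_parts[i - 1].append(s)
    for Ri, Si in zip(r_parts, s_parts):       # steps (4)-(5): remaining pairs
        table = {}
        for r in Ri:
            table.setdefault(r_key(r), []).append(r)
        for s in Si:
            out.extend((r, s) for r in table.get(s_key(s), []))
    return out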
For the cost computation, denote by q the quotient |R₀|/|R|, namely the fraction of R represented by R₀. To calculate the cost of this join we need to know the size of S₀, and we estimate it to be q * |S|. Then the fraction of R and S sets remaining on the disk after step (3) is 1 − q. The cost of the hybrid hash join is

({R} + {S}) * hash                    Partition R and S.
+ ({R} + {S}) * (1 − q) * move        Move tuples to output buffers.
+ (|R| + |S|) * (1 − q) * IO          Write from output buffers.
+ (|R| + |S|) * (1 − q) * IO          Read subsets into memory.
+ ({R} + {S}) * (1 − q) * hash        Build hash tables for R and hash to probe for S during (4) and (5).
+ {R} * move                          Move tuples to hash tables for R.
+ {S} * comp * F                      Probe for each tuple of S.
At the cost of some complexity, each of the above algorithms could be improved
by not flushing buffers at the end of phase 1. The effect of this change is analyzed
in Section 5.

3. COMPARISON OF THE FOUR JOIN ALGORITHMS


We begin by showing that when R and S are sufficiently large the hybrid
algorithm dominates the other two hash-based join algorithms, then we show
that hybrid also dominates sort-merge for sufficiently large relations. In fact, we
also show that GRACE dominates sort-merge, except in some cases where R and S are close in size. Finally, we present results of analytic modeling of the four algorithms. Our assumption that R (and therefore S) are sufficiently large, along with our previous assumption that |M| is at least √|S| (sort-merge) or √(F * |R|) (hash-based algorithms), means that we can also assume |M| to be large. The precise definition of "large" depends on system parameters, but it will typically suffice that |R| be at least 1000 and |M| at least 5.
First we indicate why hybrid dominates simple hash-join. We assume less than half of a hash table for R fits in memory, because otherwise the hybrid and simple join algorithms are identical. Denoting |R| * F/2 − |M| by E, our assumption means that E > 0. If we ignore for a moment the space requirements for the output buffers for both simple and hybrid hash, the I/O and CPU costs for both methods are identical, except that some tuples written to disk in the simple hash-join are processed² more than once, whereas each is processed only once in hybrid

hash-join. In fact, |R| − 2 * |M|/F = 2 * E/F blocks of R will be processed more than once by simple hash (and similarly for some S tuples). So far hybrid is ahead by the cost of processing at least 2 * E/F blocks of tuples. Now consider the space requirements for output buffers, which we temporarily ignored above. Simple hash uses only one output buffer. Hybrid uses approximately (|R| * F/|M|) − 1 output buffers,³ that is, (|R| * F/|M|) − 2 more than simple hash-join uses, and |R| * F/|M| − 2 = 2E/|M|. Therefore hybrid must process the extra 2 * E/|M| blocks in the second phase, since space for them is taken up by buffers. In total, hybrid is ahead by the cost of processing (2 * E/F) − (2 * E/|M|) blocks, which is clearly a positive number. We conclude that hybrid dominates simple hash-join.

² By processing we mean, for tuples of R, one hash and one move of each tuple plus one read and one write for each block. For S tuples we mean one hash and one move or compare per tuple plus two I/Os per block.
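In symbols, with E = |R| * F/2 − |M| > 0 as above, the two effects net out as follows (a restatement of the preceding argument; it uses only the fact that |M|, measured in blocks, exceeds F = 1.4):

\[ \frac{2E}{F} - \frac{2E}{|M|} \;=\; 2E\left(\frac{1}{F} - \frac{1}{|M|}\right) \;>\; 0 \quad \text{whenever } |M| > F. \]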
Next we indicate why hybrid dominates GRACE. If we compare the cost formulas, they are identical in CPU time, except that some terms in the hybrid cost formula are multiplied by (1 − q). Since q ≤ 1, hybrid dominates GRACE in CPU costs. The two algorithms read and write the following number of blocks:

GRACE:  |R| + |S| − min(|R| + |S|, |M| − √(F * |R|)).
Hybrid: |R| + |S| − q * (|R| + |S|).

To show that I/O costs for GRACE are greater than those for hybrid, it suffices to prove that

q * (|R| + |S|) ≥ |M| − √(F * |R|).

Since |S| > 0, we can discard it in the preceding formula. Since q * |R| = |R₀| = |M| − B, it suffices to prove that

√(F * |R|) ≥ B.

But B was the least number of buffers necessary to partition R − R₀ into sets which fit in memory. From the description of the GRACE algorithm we know that √(F * |R|) buffers are always enough to partition all of R into sets which can fit in memory, so B cannot be more than √(F * |R|).
We have proved that hybrid dominates both the simple and the GRACE hash-join algorithms. Now we compare the hash-join algorithms with sort-merge. When a hash table for R can fit in main memory, it is clear that hybrid hash will outperform sort-merge. This is because when a hash table for R can fit in real memory there are no I/O costs, and CPU costs are, with slight rearranging:

Hybrid:  {R} * [hash + move] +
         {S} * [hash + comp * F].

Sort:    {R} * [comp + (log₂{R}) * (comp + swap)] +
         {S} * [comp + (log₂{S}) * (comp + swap)].

Since the times to hash and to compare are similar on any system, and swap is more expensive than move or comp * F, the log terms will force sort-merge to be more costly except when R is very small.

³ Here we have used the formula for B above, and have assumed |M| is large enough that |M| − 1 can be approximated by |M|.

comp     compare keys                  3 microseconds
hash     hash a key                    9 microseconds
move     move a tuple                  20 microseconds
swap     swap two tuples               60 microseconds
IO       read/write of a block         30 milliseconds
F        incremental factor            1.4
|R|      size of R                     800 blocks
|S|      size of S                     1600 blocks
{R}/|R|  number of R tuples per block
{S}/|S|  number of S tuples per block
         block size                    25,000 bytes

Fig. 6. System parameters used in modeling in this paper.

To show that GRACE typically dominates sort-merge, the previous argument can be extended as follows: first we show that GRACE typically has lower I/O costs than sort-merge. The runs generated by sort-merge and the subsets generated by GRACE are of the same size in total, namely |R| + |S|, so when memory is at the minimum (√|S| for sort-merge and √(F * |R|) for GRACE), I/O costs, which consist of writing and reading the runs or subsets, are identical. More real memory results in equal savings, so GRACE has higher I/O costs only if √(F * |R|) > √|S|, which is atypical since R is the smaller relation.
Next we compare the CPU costs of GRACE and sort-merge. The CPU cost of GRACE is

GRACE:  {R} * (2 * hash + 2 * move) +
        {S} * (2 * hash + comp * F + move).

As with hybrid's CPU time, the coefficients of {R} and {S} are similar except for the logarithm terms, which force sort-merge to be more costly. We conclude that GRACE dominates sort-merge except when R is small or when √(F * |R|) > √|S|.
We have modeled the performance of the four join algorithms by numerically
evaluating our formulas for a sample set of system parameters given in Figure 6.
We should note that all the modelings of the hash-based algorithms are somewhat
optimistic, since we have assumed no partition overflow. We discuss in Section
4 ways to deal with partition overflow.
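As an illustration of how the formulas are evaluated, the sketch below computes the Section 2.4 GRACE cost at the minimum memory size under the Figure 6 parameters. The tuples-per-block value is an assumption of ours (those Figure 6 entries did not survive reproduction here), so the printed figure indicates only the order of magnitude.

import math

comp, hash_t, move, swap, IO = 3e-6, 9e-6, 20e-6, 60e-6, 30e-3   # Figure 6 times (s)
F = 1.4
R_blocks, S_blocks = 800, 1600          # |R|, |S| in blocks of 25,000 bytes
tpb = 100                               # tuples per block: illustrative assumption
R_t, S_t = R_blocks * tpb, S_blocks * tpb        # {R}, {S}
M = math.ceil(math.sqrt(F * R_blocks))           # minimum memory, in blocks

grace = ((R_t + S_t) * (hash_t + move)           # partition: hash and move
         + (R_blocks + S_blocks) * IO            # write partitioned relations
         + (R_blocks + S_blocks) * IO            # read partitioned sets
         + R_t * (hash_t + move)                 # build hash tables in memory
         + S_t * (hash_t + comp * F)             # probe for a match
         - min(R_blocks + S_blocks, M - math.sqrt(F * R_blocks)) * 2 * IO)
print(f"GRACE cost at minimum memory: {grace:.0f} seconds")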
In Figure 7 we display the relative performance of the four join algorithms as
described above. As we have noted, each algorithm requires a minimum amount
of main memory. For the relations modeled in Figure 7, the minimum memory
size for sort-merge is 1.1 megabytes, and for hash-based algorithms the minimum
memory size is 0.8 megabytes.
Among hash algorithms, simple and our modification of GRACE join each
perform as expected, with simple doing well for high memory values and GRACE
for low memory. Hybrid dominates both, as we have shown above.
The curves for simple and hybrid hash-join level off at just above 20 megabytes,
when a hash table for R fits in main memory. It is an easy matter to modify the
GRACE hash algorithms so that, if this occurs, then GRACE defaults to the
simple algorithm; thus GRACE and simple would be identical above that point.


Fig. 7. CPU + I/O times of join algorithms with |R| = 20 megabytes, |S| = 40 megabytes. (a) Sort-merge, (b) simple hash, (c) GRACE hash, (d) hybrid hash. Simple and hybrid hash are identical after 13 megabytes.


Fig. 8. Maximum relation sizes for varying amounts of main memory. (a) Sort-merge (larger relation), (b) hash-based (smaller relation).

Our algorithms require a minimum number of blocks of real memory, either √|S| (sort-merge) or √(F * |R|) (hash-based algorithms). Therefore, for a given number of blocks of main memory there is a maximum relation size that can be processed by these algorithms. Figure 8 shows these maximum sizes when the block size is 25,000 bytes and F is 1.4. Note that the sort-merge curve represents the maximum size of the larger relation, while curve (b) shows the maximum size of the smaller relation for the hash-based algorithms.

Fig. 9. CPU + I/O time of the hybrid algorithm with 5, 10, 20, and 30 megabytes of real memory.
Figure 9 shows the performance of the hybrid algorithm for a few fixed-memory
sizes, as relation sizes vary. Note that when the relation R can fit in main
memory, the execution time is not very large: less than 12 seconds for relations
up to 20 megabytes (excluding, as we always do, the time required to read R and
S and write the result, but assuming no CPU I/O overlap). It is also clear from
Figure 9 that when the size of main memory is much less than the size of R,
performance degrades rapidly.

4. PARTITION OVERFLOW
In all the hashing algorithms that use partitioning, namely simple, GRACE, and
hybrid hash, we have made assumptions about the expected size of the subsets
of the partitions. For example, in the simple hash-join algorithm, when the
relation R cannot fit in memory, we assumed that we could choose a hash
function h and a partition of its hash values that will partition R into two
subsets, so that a hash table for the first subset would fit exactly into memory.
What happens if we guess incorrectly, and memory fills up with the hash table before we are finished processing R?
In [14] this problem is called “bucket overflow.” We use the term “partition
overflow” because we want to distinguish between the subsets of the partitions
of R and S, produced in the first phase of processing, and the buckets of the
hash table of R tuples, produced in the second phase, even though both are really
hash buckets.
The designers of GRACE deal with overflow by the use of “tuning” (i.e.,
beginning with very small partitions, and then, when the size of the smaller
partitions is known, combining them into larger partitions of the appropriate

size). This approach is also possible in our environment. We present other approaches below.
In all our hash-based algorithms, each tuple of S in phase 1 is either used for
probing and then discarded, or copied to disk for later use. In phase 2, the
remaining tuples on disk are processed sequentially. Since an entire partition of
S never needs to reside in main memory (as is the case for partitions of R), the
size of S-partitions is of no consequence. Thus we need only find an accurate
partitioning of R.
To partition R, we can begin by choosing a hash function h with the usual randomizing properties. If we know nothing about the distribution of the joining attribute values in R, we can assume a uniform distribution and choose a partition
of h's hash values accordingly. It is also possible to store statistics about the distribution of h-values. In [16] a similar distribution statistic is studied, where the identity function is used instead of a hash function, and it is shown that using sampling techniques one can collect such distribution statistics on all attributes of a large commercial database in a reasonable time.
Even with the problem reduced to partitioning R only, with a good choice of h
and with accurate statistics, overflow will occur. In the remainder of this section
we show how to handle the three kinds of partition overflow that occur in our
algorithms.

4.1 Partition Overflow on Disk


In two algorithms, GRACE and hybrid hash, partitions of R are created in disk files, partitions which will later be required to fit in memory. In both algorithms these partitions were denoted R₁, ..., Rₙ, where n = √(F * |R|) for GRACE and n = B for hybrid. After these partitions are created, it is possible that some of them will be so large that a hash table for them cannot fit in memory.
If one R partition on disk overflows, that partition can be reprocessed. It can be scanned and partitioned again, into two pieces, so that a hash table for each will fit in memory. Alternatively, an attempt can be made to partition it into one piece that will just fit, and another that can be added to a partition that turned out to be smaller than expected. Note that a similar adjustment must be made to the corresponding partition of S, so that the partitions of R and S will correspond pairwise to the same hash values.
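A sketch of the reprocessing step, splitting an overflowed pair of corresponding partitions in two with a secondary hash function (here a seeded variant of Python's hash, our own illustrative choice); both relations must be split by the same function so the subsets still correspond pairwise.

def split_overflowed(Ri, Si, r_key, s_key, seed=1):
    # Secondary hash: differs from the primary partitioning hash, and is
    # applied identically to the R side and the S side.
    h2 = lambda k: hash((seed, k)) % 2
    r_halves, s_halves = ([], []), ([], [])
    for r in Ri:
        r_halves[h2(r_key(r))].append(r)
    for s in Si:
        s_halves[h2(s_key(s))].append(s)
    # Two smaller corresponding pairs; retry with another seed if an R half
    # is still too large for its hash table to fit in memory.
    return (r_halves[0], s_halves[0]), (r_halves[1], s_halves[1])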

4.2 Partition Overflow in Memory: Simple Hash


In simple hash, as R is processed, a hash table of R tuples is built in memory. What if the hash table turns out to be too large to fit in memory? The simplest solution is to reassign some buckets, presently in memory, to the set of "passed-over" tuples on disk, then continue processing. This amounts to modifying the partitioning hash function slightly. Then the modified hash function will be used to process S.

4.3 Partition Overflow in Memory: Hybrid Hash


In hybrid hash, as R is processed in step 1, a hash table is created from tuples of R₀. What if R₀ turns out to be too large to fit into the memory that remains after some blocks are allocated to output buffers? The solution here is similar to

the simple hash case: reassign some buckets to a new partition on disk. This new
partition can be handled just like the others, or it can be spread over others if
some partitions are smaller than expected. All this is done before S is processed,
so the modified partitioning function can be used to process S.

5. MEMORY MANAGEMENT STRATEGIES


In this section we consider an alternate memory management strategy for our
algorithms. For simplicity we discuss only sort-merge and hybrid hash-join in
this section. The behavior of GRACE and simple hash are similar to the behaviors
we describe here. We begin in Section 5.1 by describing the weaknesses of the
“all real memory” model of the previous sections, where a process was allocated
a fixed amount of real memory for its lifetime and the amount of real memory
was known to the process. In Section 5.2 we consider virtual memory as an
alternative to this simple strategy, with at least a minimum amount of real
memory as a “hot set.” In Section 5.3 we describe how parts of the data space
are assigned to the hot set and to virtual memory, and in Section 5.4 we analyze
the impact of this new model on performance. Section 5.5 presents the results of
an analytic modeling of performance.
5.1 Problems with an All Real Memory Strategy
Until this section we have assumed a memory management strategy in which
each join operation (which we view as a single process) is assigned a certain
amount of real memory and that memory is available to it throughout its life.
Based on the amount of memory granted by the memory manager (denoted |M| in Section 2), a strategy will be chosen for processing the join. The key here is
knowledge of the amount of memory available. Each of the algorithms we have
described above depends significantly on the amount of memory available to the
process. There are several problems inherent in designing such a memory
manager.
(1) If only one process requests memory space, how much of available memory
should be allocated to it? In order to answer this question the memory manager
must predict how many and what kind of other processes will require memory
before this process completes.
(2) If several processes request memory, how should it be allocated among
them? This is probably a simple optimization problem if each process can present
to the memory manager an efficiency graph, telling the time the process will take
to complete given various possible memory allocations.
(3) If all of memory is taken by active processes, and a new process requests
memory, the new process will have to wait until memory is available. This is an
intolerable situation in many scenarios. Swapping out a process is not acceptable
in general since such large amounts of memory are involved.
As was shown in Figure 8, our algorithms can join huge relations with only a
few megabytes of memory. Thus one might argue that the relatively small
amounts of real memory needed are affordable in a system with a large main
memory. But as one can see from Figures 7 and 9, excellent performance is
achieved only when the amount of real memory is close to the size of the smaller

(hybrid) or larger (sort-merge) relation. In general it will not be possible to allocate to each join process an amount of memory near the size of the smaller
relation.

5.2 The Hot Set + Virtual Memory Model


One obvious solution to the problems just described is to assign each process all
the memory it requests, but in virtual memory, and to let active processes compete
for real memory via LRU or some other page-replacement algorithm.
If a process, which is executing a relational operator, is forced to compete for pages with other processes via the usual LRU algorithm, severe thrashing can result. This is pointed out in [17] and in [20], where a variety of relational operators are discussed. Sacco and Schkolnick [17] propose to assign each process a certain number of pages (the "hot-set size") which are not subject to demand paging. The hot-set size is estimated by the access planner, and is determined as the point below which a sharp increase in processing time occurs; as the hot-set size varies with each relation and each strategy, it must be estimated by the access planner. Stonebraker [20] proposes allowing the database system to override the usual LRU replacement algorithm when appropriate. We find that a combination of these two approaches best suits our needs.
The algorithms discussed in this paper lend themselves to a hot-set approach since below a certain real memory size (√(F * |R|) or √|S|) our algorithms behave very poorly.
Therefore, we adopt a similar strategy to that of [17], in that we expect each
process to have a certain number of “hot-set” pages guaranteed to it throughout
its lifetime. Those hot-set pages will be “wired down” in real memory for the
lifetime of the process. A facility for wiring down pages in a buffer is proposed
in [9]. The rest of the data space of the process will be assigned to virtual
memory. In the next section we describe what is assigned to the hot set and what
to virtual memory.
5.3 T and C
Recall that each of the algorithms sort-merge and hybrid hash-join operates in
two phases, first processing R and S and creating either runs or partitions, and
secondly reading these runs and partitions and processing them to create the
join. Each algorithm’s data space also splits into two pieces.
The first piece, which we denote T (for Tables), consists of a hash table or a
priority queue, plus buffers to input or output the partitions or runs. The second
piece of the algorithm’s data, which we denote C (for Cache), is the partitions or
runs generated during phase 1 and read during phase 2.
For sort-merge, as described in Section 2, T was fixed in size at √|S| blocks. If more than √|S| blocks of real memory were available for sort-merge, the remainder was used to store some or all of C, to save I/O costs. The blocks of C not assigned to real memory were stored on disk. For hybrid, T occupied as much real memory as was available (except that T was always between √(F * |R|) and F * |R| blocks).
Since all of T is accessed randomly, and in fact each tuple processed by either
algorithm generates a random access to T, we assign T to the hot set. This means

that the join process needs at least √(F * |R|) or √|S| blocks of real memory to hold T. By Figure 8, we can see that √|S| or √(F * |R|) blocks of memory is a reasonable amount. In the case of sort-merge, if additional space is available in the hot set, then some of C will be assigned there. For simplicity, henceforth in the case of sort-merge we let C refer to the set of runs stored in virtual memory.
For both hybrid and sort-merge join, C will be assigned to virtual memory.
The hot set + virtual memory model of this section differs from that of the
previous sections by substituting virtual memory for disk storage. This allows
the algorithms to run with relatively small amounts of wired-down memory, but
also to take advantage of other real memory shared with other processes.
To distinguish the algorithms we discuss here from those of Section 2, we
append the suffix RM (for Real Memory) to the algorithms of Section 2, which
use all real memory, and VM for the variants here, in which C is stored in Virtual
Memory.
5.4 What Are the Disadvantages of Placing C in Virtual Memory?
Let us suppose that the virtual memory in which C resides includes |C| * Q blocks of real memory, where Q ≤ 1. The quantity Q can change during the execution of the algorithm, but for simplicity we assume it to be constant.
What are the potential disadvantages of storing C in virtual memory? There are two. The first concerns the blocking factor of one track that we have chosen. In the VM algorithms, where C is assigned to virtual memory, during phase 1, in order to take advantage of all |C| * Q blocks of real memory, and not knowing what Q is, the algorithms should write all |C| blocks to virtual memory and let the memory manager page out |C| * (1 − Q) of those blocks. But if the memory manager pages out one page at a time, it may not realize the savings from writing one track at a time. This will result in a higher I/O cost. On the other hand, paging is supported by much more efficient mechanisms than normal I/O. For simplicity, we assume this trade-off results in no net change in I/O costs.
The second possible disadvantage of assigning C to virtual memory concerns the usual LRU paging criterion. At the end of phase 1, in the VM algorithms |C| * Q blocks of C will reside in real memory and |C| * (1 − Q) blocks on disk. Ideally, all |C| * Q blocks will be processed in phase 2 directly from real memory, without having to be written and then read back from disk. As we see below, the usual LRU paging algorithm plays havoc with our plans, paging out many of the |C| * Q blocks to disk before they can be processed, and leaving in memory blocks that are no longer of use. This causes a more significant problem. To analyze LRU's behavior more precisely, we must study the access pattern of C as it is written to virtual memory and then read back into T for processing. We must estimate how many of the |C| * Q blocks in real memory at the end of phase 1 will be paged out before they can be processed. We first consider Hybrid-VM.
In the special case of Hybrid-VM, when F * |R| ≤ 2 * |M|, that is, when only R₀ and perhaps R₁ are constructed, unnecessary paging can be avoided completely. Then C = R₁, or C is empty, so C consists of one subset and can be read back in phase 2 in any order. In particular, it can be read back in the opposite order from which it was written, thus reading all the in-memory blocks first. Therefore, for

Hybrid-VM, in case F * |R| ≤ 2 * |M|, and in case the process' resident size does not change, all of the |C| * Q blocks in real memory at the end of phase 1 will be processed before they are paged out.
In the remaining cases of Hybrid-VM, when F * |R| > 2 * |M|, there are more than two subsets Rᵢ constructed, that is, B > 1 (B is defined in Section 2.5). The B subsets are produced in parallel in phase 1 and read back serially in phase 2. This parallel/serial behavior will cause poor real-memory usage under LRU, as we shall see. To see this, consider the end of phase 1 in Hybrid-VM, which is the time at which all of C has been created. |C| * Q blocks of C are in real memory and |C| * (1 − Q) blocks are on disk, and the algorithm is about to read C for processing. After phase 2 has begun and C has been processed for a while, the part of C which remains in real memory consists of

C₁: These tuples were on disk at the end of phase 1, and have been read into real memory and processed.
C₂: These tuples were in real memory at the end of phase 1. They have been processed and are no longer needed by the algorithm in phase 2.
C₃: These tuples were in real memory at the end of phase 1. They have not yet been processed in phase 2.

What will happen next, assuming that phase 2 needs to read a block from disk, and therefore to page out a page in memory? If the system uses the usual LRU algorithm, then since the tuples in C₃ were all used less recently than those in C₁ and C₂, the memory manager will choose a page from C₃ to page out, which is exactly the opposite of what we would like! This "worst" behavior is pointed out in [20].
Figure 10 gives an intuitive picture of why only |C| * Q² blocks are read directly from real memory in the case we are discussing, namely Hybrid-VM with B > 1. Figure 10 shows the state of the system at the point in phase 2 of Hybrid-VM at which C₃ first becomes empty. After this point all unprocessed tuples of C are on disk, and so all requests for tuples from C will cause a page fault. Before this point requests for tuples from C might not cause a page fault, if those tuples were in real memory at the beginning of phase 2. The set D₁ denotes the location of the tuples of C₁ before they were written to disk. Since the number of bytes in C₁ and D₁ are equal, a little algebra shows that x must be Q * B, and then that there are |C| * Q² blocks in C₂. Thus only |C| * Q² blocks have been read directly from real memory.
This argument is based on the ideal picture of Figure 10. In practice, the sets Cᵢ and Dᵢ in Figure 10 have jagged edges, and the argument we have given is not precise. However, it can be shown that this argument is valid if B is large and if the subsets are created at uniform speed by the partitioning process in phase 1. This analysis indicates that in Hybrid-VM, at least when B is large, only |C| * Q² blocks are read directly from real memory.
A similar analysis is valid for sort-merge, based on the fact that runs in sort-merge are produced serially and read back in parallel, with a similar |C| * Q² conclusion.
Fig. 10. Hybrid hash-join in virtual memory with LRU.

Is there a way to avoid this poor paging behavior? One relatively simple technique, called "throw immediately" in [20], is to mark a page of C as "aged" after it has been read into T, so that when part of C needs to be paged out the system will take the artificially aged page instead of a yet unprocessed page. With this page-aging facility, a full |C| * Q blocks of C will be read directly from real memory and will generate I/O savings.
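A sketch of what such a facility might look like, assuming an LRU page pool in which the join can explicitly mark a processed page of C as aged; the class and its interface are our illustration, not a mechanism from the paper or from [20].

from collections import OrderedDict

class AgingLRUPool:
    # LRU page pool with a "throw immediately" hint.
    def __init__(self, frames):
        self.frames = frames
        self.pages = OrderedDict()              # least recently used first

    def reference(self, page_id):
        # Normal reference: the page becomes most recently used.
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        else:
            if len(self.pages) >= self.frames:
                self.pages.popitem(last=False)  # evict the LRU victim
            self.pages[page_id] = True

    def age(self, page_id):
        # Page-aging hint: a just-processed page of C moves to the LRU end,
        # so it is evicted next and unprocessed blocks of C stay resident.
        if page_id in self.pages:
            self.pages.move_to_end(page_id, last=False)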
5.5 Performance in the Hot Set + Virtual Memory Model
Figure 11 presents the results of an analytic modeling of Hybrid-VM, assuming
that 5 megabytes of real memory are allocated to the hot set where T resides,
and other real memory is used to support the virtual memory in which C resides.
In graph (a), we have assumed that only |C| * Q² blocks of C are read directly
from real memory, as would be the case if LRU were used. In (b) we assume
|C| * Q blocks of C are read from real memory, as would be the case if
page-aging were used.
According to Figure 11, the most efficient processing, with all real memory,
requires 28 megabytes and takes 16 seconds, compared to 54 megabytes and 25
seconds when virtual memory is used, with a hot-set size of 5 megabytes. More
memory is needed when virtual memory is used because in that case subsets from
both R and S are stored.
One way to explain the poorer performance of graph 11(b) compared to graph
11(c) is to view 11(b) as the result of reducing the size of the hot set, and therefore
T. When the hot set is large (e.g., when it can hold a hash table for R) no virtual
memory is needed, and performance is given by (c) at its minimum CPU + I/O
time. As the hot-set size decreases, performance degrades. At the minimum, when
the hot-set size is m, hybrid’s performance in the hot set + virtual memory
model is identical to that of GRACE, since the algorithms for GRACE and hybrid
in Section 2 are identical when I M ( = m.
The performance of sort-merge in the hot set + virtual memory model with
LRU plus page-aging is identical to the all real memory case, because sort-merge
uses real memory, beyond the √|S| blocks needed for T, to store C and
save I/O. Therefore, only √|S| blocks of real memory should be assigned to the
hot set for sort-merge.
Fig. 11. CPU + I/O time of the Hybrid algorithm for varying amounts of memory. |R| = 20 megabytes,
|S| = 40 megabytes. For (a) and (b), the hot-set size is 5 megabytes. (a) Hybrid-VM with LRU;
(b) Hybrid-VM with page-aging; (c) Hybrid-RM: all real memory.

We conclude that if page-aging is possible, then the performance of sort-merge
is unaffected in the hot set + virtual memory model, but the performance of the
hybrid join degrades, as the hot-set size decreases, to the performance of
GRACE. Since we have shown, in Section 3, that GRACE typically dominates
sort-merge, we conclude that hybrid typically dominates sort-merge, even in
the hot set + virtual memory model.

6. OTHER TOOLS
In this section we discuss three tools that have been proposed to increase the
efficiency of join processing, namely database filters, Babb arrays, and semijoins.
Our objective is to show that all of them can be used equally effectively with any
of our algorithms.
Database filters [19] are an important tool to make database managers more
efficient. Filters are a mechanism to process records as they come off the disk,
and send to the database only those which qualify. Filters can be used easily with
our algorithms, since we have made no assumption about how the selections and
projections of the relations R and S are made before the join.
Another popular tool is the Babb array [1]. This idea is closely related to the
concept of partitioning which we have described in Section 2. As R is processed,
a boolean array is built. Each bit in the array corresponds to a hash bucket, and
the bit is turned on when an R tuple hashes into that bucket. Then, as each tuple
s from S is processed, the boolean array is checked, and if s falls in a bucket for
which there are no R tuples, the s tuple can be discarded without checking R
itself. This is a very powerful tool when relatively few tuples qualify for the join.
The Babb array can easily be added to any of our algorithms. The first time R
is scanned, the array is built, and when S is scanned, some tuples can be discarded.
Its greatest cost is for space to store the array. Given a limited space in which to
store the array, another problem is to find a hash function to use in constructing
the array so that the array will carry maximum information. It is possible to use
several hash functions and an array for each, to increase the information, but
with limited space this alternative allows each hash function a smaller array and
therefore less information. Babb arrays are most useful when the join has a high
selectivity (i.e., when there are few matching tuples).
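As an illustration, here is a minimal sketch of a Babb array (our own code, not from [1]; the array size, the use of crc32 as the hash function, and the tuple format are assumptions):

import zlib

class BabbArray:
    """One bit per hash bucket; a set bit means some R tuple hashed there."""
    def __init__(self, nbits=1 << 16):
        self.nbits = nbits
        self.bits = bytearray(nbits // 8)

    def _bucket(self, key):
        # Any well-spread hash function works; crc32 is a convenient stand-in.
        return zlib.crc32(repr(key).encode()) % self.nbits

    def add(self, key):                       # called for each R tuple
        b = self._bucket(key)
        self.bits[b >> 3] |= 1 << (b & 7)

    def may_match(self, key):                 # probed for each S tuple
        b = self._bucket(key)
        return bool(self.bits[b >> 3] & (1 << (b & 7)))

# Build the array on the first scan of R; discard S tuples that fall in a
# bucket containing no R tuple. A false positive only means the tuple is
# rejected later by the join itself, so no join results are lost.
R = [(1, "a"), (7, "b")]                      # (join attribute, payload)
S = [(1, "x"), (2, "y"), (7, "z")]
arr = BabbArray()
for key, _ in R:
    arr.add(key)
survivors = [t for t in S if arr.may_match(t[0])]  # (2, "y") almost surely dropped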
Finally, we discuss the semijoin [2]. This is often regarded as an alternative
way to do joins, but as we shall see it is a special case of a more general tool. The
semijoin is constructed as follows.
(1) Construct the projection of R on its joining attributes. We denote this
projection by π(R).
(2) Join π(R) to S. The result is called the semijoin of S with R and is denoted
S ⋉ R. The semijoin of S with R is the set of S tuples that participate in the
join of R and S.
(3) Join S ⋉ R to R. The result is equal to the join of R and S.
These steps can be integrated into any of our algorithms. When first scanning
R, one constructs π(R) and, when first scanning S, one discards tuples whose
joining attribute values do not appear in π(R). If the join has a low selectivity,
then this will reduce significantly the number of S tuples to be processed, and
will be a useful tool to add to any of the above algorithms.
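A minimal sketch of this integration (our own illustration; the shared attribute name, bucket count, and partitioning hash h are assumptions) builds π(R) during the first scan of R and applies it during the first scan of S:

def partition_with_semijoin_filter(R, S, key, B, h):
    """Partition R and S into B buckets for a hash join; while partitioning,
    build pi(R) and use it to discard S tuples that cannot join."""
    pi_R = set()                              # projection of R on the join key
    R_parts = [[] for _ in range(B)]
    for r in R:                               # first scan of R: step (1)
        pi_R.add(r[key])
        R_parts[h(r[key]) % B].append(r)
    S_parts = [[] for _ in range(B)]
    for s in S:                               # first scan of S: step (2)
        if s[key] in pi_R:                    # keep: s is in S semijoin R
            S_parts[h(s[key]) % B].append(s)
    return R_parts, S_parts                   # step (3): join bucket by bucket

# Example use, with dict tuples and Python's hash as the partitioning function:
R = [{"k": 1, "a": "r1"}, {"k": 2, "a": "r2"}]
S = [{"k": 2, "b": "s1"}, {"k": 9, "b": "s2"}]     # the k = 9 tuple is discarded
R_parts, S_parts = partition_with_semijoin_filter(R, S, "k", B=4, h=hash)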
The most significant expense of the semijoin tool is space to store π(R). For
example, in some cases π(R) might be almost as large as R. Can we minimize
the space needed to store π(R)? One obvious candidate is a Babb array. In fact,
the Babb array and semijoins are just two specific examples of a more general
tool, which can be described as follows.
(1') Construct a structure o(R) which contains some information about the
relation π(R), where π(R) is defined in (1) above. In particular, o(R) must
contain enough information to tell when a given tuple is not in π(R).
(2') Scan S and discard those tuples which, given the information in o(R),
cannot participate in the join. Denote the set of undiscarded S tuples
by R ▷ S.
(3') Join R ▷ S to R. The result is equal to the join of R with S.
The semijoin tool takes o(R) equal to π(R), while the Babb array is another
representation of o(R), which may be much more compact than π(R). This more
general tool is a special case of the Tuneable Dynamic Filter described in [13].
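The following sketch (ours; the class and function names are invented for exposition) makes the common interface explicit: o(R) is anything that can answer "this value is definitely not in π(R)", with the exact set giving the semijoin and a bit array giving the Babb array.

class ExactInfo:
    """o(R) = pi(R) itself: the semijoin case, exact but possibly large."""
    def __init__(self, R, key):
        self.values = {r[key] for r in R}
    def cannot_join(self, v):
        return v not in self.values           # never wrong in either direction

class BabbInfo:
    """o(R) = a bit array: compact, but may answer 'maybe' for absent values."""
    def __init__(self, R, key, nbits=4096):
        self.nbits, self.bits = nbits, bytearray(nbits // 8)
        for r in R:
            b = hash(r[key]) % nbits
            self.bits[b >> 3] |= 1 << (b & 7)
    def cannot_join(self, v):
        b = hash(v) % self.nbits
        return not (self.bits[b >> 3] & (1 << (b & 7)))

def filter_S(S, key, info):                   # step (2'): keep possible matches
    return [s for s in S if not info.cannot_join(s[key])]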

7. CONCLUSIONS
We have defined and analyzed three hash-based equijoin algorithms, plus a
version of sort-merge that takes advantage of significant amounts of main
memory. These algorithms can also operate efficiently with relatively little main
memory. If the relations are sufficiently large, then one hash-based algorithm, a
hybrid of the other two, is proved to be the most efficient of all the algorithms
we study.
The hash-based join algorithms all partition the relations into subsets which
can be processed in main memory. Simple mechanisms exist to minimize overflow
of these partitions and to correct it when it occurs, but the quantitative effect of
these mechanisms remains to be investigated.
The algorithms we describe can operate in virtual memory with a relatively
small “hot set” of nonpageable real memory. If it is possible to age pages, marking
them for paging out as soon as possible, then sort-merge has the same performance
in the hot set plus virtual memory model as in the all real memory model,
while the performance of the hybrid algorithm degrades. If aging is not possible,
then the performance of both hybrid and sort-merge degrades. In fact, if a
fraction Q of the required virtual memory space is supported by real memory,
then the absence of an aging facility can result in performance equal to that with
only a fraction Q² of real pages. In the hot set plus virtual memory model, the
hybrid hash-based algorithm still has better performance than sort-merge for
sufficiently large relations.
Database filters, Babb arrays, and semijoin strategies can be incorporated into
any of our algorithms if they prove to be useful.
We conclude that, with decreasing main memory costs, hash-based algorithms
will become the preferred strategy for joining large relations.

REFERENCES
1. BABB, E. Implementing a relational database by means of specialized hardware. ACM Trans. Database Syst. 4, 1 (Mar. 1979).
2. BERNSTEIN, P. A. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst. 6, 4 (Dec. 1981), 602-625.
3. BITTON, D., BORAL, H., DEWITT, D., AND WILKINSON, W. Parallel algorithms for the execution of relational database operations. ACM Trans. Database Syst. 8, 3 (Sept. 1983), 324-353.
4. BLASGEN, M. W., AND ESWARAN, K. P. Storage and access in relational databases. IBM Syst. J. 16, 4 (1977).
5. BRATBERGSENGEN, K. Hashing methods and relational algebra operations. In Proceedings of the Conference on Very Large Data Bases (Singapore, 1984).
6. DEWITT, D., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. Implementation techniques for main memory database systems. In Proceedings of SIGMOD (Boston, 1984), ACM, New York.
7. DEWITT, D., AND GERBER, R. Multiprocessor hash-based join algorithms. In Proceedings of the Conference on Very Large Data Bases (Stockholm, 1985).
8. DIGITAL EQUIPMENT CORP. Product announcement, 1984.
9. EFFELSBERG, W., AND HARDER, T. Principles of database buffer management. ACM Trans. Database Syst. 9, 4 (Dec. 1984), 560-595.
10. GARCIA-MOLINA, H., LIPTON, R., AND VALDES, J. A massive memory machine. IEEE Trans. Comput. C-33, 5 (1984), 391-399.
11. GOODMAN, J. R. An investigation of multiprocessor structures and algorithms for data base management. Electronics Research Lab. Memo. UCB/ERL M81/33, Univ. of California, Berkeley, 1981.
12. KERSCHBERG, L., TING, P., AND YAO, S. Query optimization in Star computer networks. ACM Trans. Database Syst. 7, 4 (Dec. 1982), 678-711.
13. KIESSLING, W. Tunable dynamic filter algorithms for high performance database systems. In Proceedings of the International Workshop on High Level Computer Architecture (May 1984), 6.10-6.20.
14. KITSUREGAWA, M., ET AL. Application of hash to data base machine and its architecture. New Generation Comput. 1 (1983), 62-74.
15. KNUTH, D. The Art of Computer Programming: Sorting and Searching, Vol. 3. Addison-Wesley, Reading, Mass., 1973.
16. PIATETSKY-SHAPIRO, G., AND CONNELL, C. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of SIGMOD Annual Meeting (Boston, 1984), ACM, New York.
17. SACCO, G. M., AND SCHKOLNICK, M. A mechanism for managing the buffer pool in a relational database system using the hot-set model. Computer Science Res. Rep. RJ-3354, IBM Research Lab., San Jose, Calif., Jan. 1982.
18. SEVERANCE, D., AND DUEHNE, R. A practitioner's guide to addressing algorithms. Commun. ACM 19, 6 (June 1976), 314-326.
19. SLOTNICK, D. Logic per track devices. In Advances in Computers, Vol. 10, J. Tou, Ed., Academic Press, New York, 1970, 291-296.
20. STONEBRAKER, M. Operating system support for database management. Commun. ACM 24, 7 (July 1981), 412-418.
21. VALDURIEZ, P., AND GARDARIN, G. Join and semijoin algorithms for a multiprocessor database machine. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 133-161.
22. YAMANE, Y. A hash join technique for relational database systems. In Proceedings of the International Conference on Foundations of Data Organization (Kyoto, May 1985).

Received August 1984; revised December 1985; accepted December 1985.
