Join Processing in Database Systems with Large Main Memories
We study algorithms for computing the equijoin of two relations in a system with a standard architecture but with large amounts of main memory. Our algorithms are especially efficient when the main memory available is a significant fraction of the size of one of the relations to be joined; but they can be applied whenever there is memory equal to approximately the square root of the size of one relation. We present a new algorithm which is a hybrid of two hash-based algorithms and which dominates the other algorithms we present, including sort-merge. Even in a virtual memory environment, the hybrid algorithm dominates all the others we study.
Finally, we describe how three popular tools to increase the efficiency of joins, namely filters, Babb arrays, and semijoins, can be grafted onto any of our algorithms.
Categories and Subject Descriptors: H.2.0 [Database Management]: General; H.2.4 [Database Management]: Systems-query processing; H.2.6 [Database Management]: Database Machines
General Terms: Algorithms, Performance
Additional Key Words and Phrases: Hash join, join processing, large main memory, sort-merge join
1. INTRODUCTION
Database systems are gaining in popularity owing to features such as data
independence, high-level interfaces, concurrency control, crash recovery, and so
on. However, the greatest drawback to database management systems (other
than cost) is the inefficiency of full-function database systems, compared to
customized programs; and one of the most costly operations in database process-
ing is the join. Traditionally the most effective algorithm for executing a join (if
there are no indices) has been sort-merge [4]. In [5] it was suggested that the
existence of increasingly inexpensive main memory makes it possible to use
hashing techniques to execute joins more efficiently than sort-merge. Here we
extend these results.
Some of the first research on joins using hashing [14, 21] concerned multipro-
cessor architectures. Our model assumes a "vanilla" computer architecture, that
is, a uniprocessor system available in the market today, although the lack of
parallel processing in such systems deprives us of much of the potential speed of
multiprocessor algorithms. We assume that the join process knows
how much memory is allocated to it, and can use this information in designing a
strategy for the join. The amount of real memory allocated is fixed throughout
the lifetime of the process. In Section 5 we discuss an alternative to this simple
memory management strategy.
when little of R can fit in memory. The third algorithm, hybrid hash, combines
the two, doing all partitioning on the first pass over each relation and using
whatever memory is left to build a hash table. It performs well over a wide range
of memory sizes.
(1) Let P = min(|M|, |R| * F). Choose a hash function h and a set of hash
values so that P/F blocks of R tuples will hash into that set. Scan the (smaller)
relation R and consider each tuple. If the tuple hashes into the chosen range,
insert the tuple into a P-block hash table in memory. Otherwise, pass over the
tuple and write it into a new file on disk.
(2) Scan the larger relation S and consider each tuple. If the tuple hashes into
the chosen range, check the hash table of R tuples in memory for a match and
output the pair if a match occurs. Otherwise, pass over the tuple and write it to
disk. Note that if key values of the two relations are distributed similarly, there
will be P/F * |S| / |R| blocks of the larger relation S processed in this pass.
(3) Repeat steps (1) and (2), replacing each of the relations R and S by the
set of tuples from R and S that were "passed over" and written to disk in the
previous pass. The algorithm ends when no tuples from R are passed over.
(These three steps are sketched in code below.)
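As a concrete illustration, here is a minimal sketch of these three steps. It is not the paper's implementation: relations are lists of (key, payload) tuples, memory is counted in tuples rather than blocks, the spill lists stand in for the files written to disk, and the "chosen range" of hash values is simulated by picking a set of hash buckets that fills memory.

```python
import collections

def simple_hash_join(R, S, memory_tuples, F=1.4, num_buckets=1024):
    """Simple hash-join sketch: R, S are lists of (key, payload) tuples."""
    bucket = lambda key: hash(key) % num_buckets
    results = []
    while R:
        # Step (1): choose buckets whose R tuples fill a memory_tuples/F-tuple
        # hash table, build the table, and pass over (spill) the rest of R.
        room = int(memory_tuples / F)
        counts = collections.Counter(bucket(k) for k, _ in R)
        chosen = set()
        for b, n in sorted(counts.items(), key=lambda kv: kv[1]):
            if n <= room:
                chosen.add(b)
                room -= n
        if not chosen:                                 # one oversized bucket:
            chosen.add(min(counts, key=counts.get))    # partition overflow (Section 4)
        table, r_spill = collections.defaultdict(list), []
        for key, payload in R:
            if bucket(key) in chosen:
                table[key].append(payload)
            else:
                r_spill.append((key, payload))         # passed over: "written to disk"
        # Step (2): scan S; probe if the tuple is in the chosen range,
        # otherwise pass over it for a later pass.
        s_spill = []
        for key, payload in S:
            if bucket(key) in chosen:
                results.extend((key, r, payload) for r in table.get(key, ()))
            else:
                s_spill.append((key, payload))
        # Step (3): repeat on the passed-over tuples until R is exhausted.
        R, S = r_spill, s_spill
    return results

# Example: join two tiny relations on their first attribute.
R = [(1, "r1"), (2, "r2"), (3, "r3")]
S = [(1, "s1"), (3, "s2"), (4, "s3")]
print(simple_hash_join(R, S, memory_tuples=2))
```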
This algorithm performs particularly well when most of R fits into main
memory. In that case, most of R (and S) are touched only once, and only what
cannot fit into memory is written out to disk and read in again. On the other
hand, when there is little main memory this algorithm behaves poorly, since in
that case there are many passes and both R and S are scanned over and over
again. In fact, this algorithm operates as specified for any amount of memory,
but to be consistent with the other hash-based algorithms we assume it is
undefined for less than √(F * |R|) blocks of memory.
We assume here and elsewhere that the same hash function is used both for
partitioning and for construction of the hash tables.
In the following formula and in later formulas we must estimate the number
of compares required when the hash table is probed for a match. This amounts
to estimating the number of collisions. We have chosen to use the term
comp * F for the number of compares required. Although this term is too
simple to be valid in general, when F is 1.4 (which is the value we use in our
analytic modeling) it means that for a hash table with a load factor of 71 percent
the estimated number of probes is 1.4. This is consistent with the simulations
reported in [18].
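The estimate can be sanity-checked against a standard result; the assumption of hashing with separate chaining below is ours, not the paper's.

```latex
% Expected probes for a successful search with separate chaining at load
% factor \alpha (standard result; see, e.g., Knuth [15]):
\[
  E[\text{probes}] \;\approx\; 1 + \frac{\alpha}{2}.
\]
% With F = 1.4 the table is loaded to \alpha = 1/F \approx 0.71, giving
\[
  E[\text{probes}] \;\approx\; 1 + \frac{0.71}{2} \;\approx\; 1.36 \;\approx\; F,
\]
% so charging comp * F per probe is a serviceable approximation.
```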
The additional I/O cost of the later passes, in which passed-over tuples are
written to disk and reread, is

((A - 1) * |R| - (A * (A - 1) / 2) * (|M| / F)) * 2 * IO
+ ((A - 1) * |S| - (A * (A - 1) / 2) * (|M| / F) * (|S| / |R|)) * 2 * IO.
Each subset Ri will be approximately |R| / √(F * |R|) = √(|R| / F)
blocks in length. A hash table for each subset Ri will therefore require
F * √(|R| / F) = √(F * |R|)
blocks of memory, and we have assumed at least this much real memory.
(5) Hash each tuple of Si with the same hash function used to build the hash
table in (4). Probe for a match. If there is one, output the result tuple;
otherwise proceed with the next tuple of Si.
What if there are more or fewer than √(F * |R|) blocks of memory available? Just
as with sort-merge join, we do not consider the case when less than this minimum
² Our assumption that a tuple of S joins with at most one block of tuples of R is used here. If R
contains many tuples with the same joining attribute value, then this partitioning may not be possible.
number of blocks is available, and if more blocks are available, we use them to
store subsets of R and/or S so they need not be written to and read from disk.
This algorithm works very well when there is little memory available, because
it avoids repeatedly scanning R and S, as is done in simple hash. Yet when most
of R fits into memory, GRACE join does poorly since it scans both R and S
twice.
One advantage of using hash in the second phase, instead of sort-merge, as is
done by the designers of the GRACE machine, is that subsets of S can be of
arbitrary size. Only R needs to be partitioned into subsets of approximately equal
size. Since partition overflow can cause significant problems (see Section 4), this
is an important advantage.
The cost of this algorithm is
({R} + {S}) * (hash + move)       Hash tuple and move to output buffer.
+ (|R| + |S|) * IO                Write partitioned relations to disk.
+ (|R| + |S|) * IO                Read partitioned sets.
+ {R} * (hash + move)             Build hash tables in memory.
+ {S} * (hash + comp * F)         Probe for a match.
- min(|R| + |S|, |M| - √(F * |R|)) * 2 * IO    I/O savings if extra memory is available.
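The formula evaluates mechanically, as in the sketch below. The parameter values are illustrative stand-ins (not the paper's Figure 6 settings), and the reading of the min term as "extra memory beyond the √(F * |R|) minimum" is our interpretation of the savings described above.

```python
import math

def grace_cost(blocks_R, blocks_S, tuples_R, tuples_S, blocks_M,
               F=1.4, hash_t=3e-6, move_t=1e-6, comp_t=1e-6, io_t=0.02):
    """Evaluate the GRACE-join cost formula; |R|, |S|, |M| in blocks,
    {R}, {S} in tuples. Timing constants are illustrative placeholders."""
    extra = max(0.0, blocks_M - math.sqrt(F * blocks_R))  # memory beyond the minimum
    return ((tuples_R + tuples_S) * (hash_t + move_t)     # partition both relations
            + (blocks_R + blocks_S) * io_t                # write partitioned relations
            + (blocks_R + blocks_S) * io_t                # read partitioned sets
            + tuples_R * (hash_t + move_t)                # build hash tables in memory
            + tuples_S * (hash_t + comp_t * F)            # probe for a match
            - min(blocks_R + blocks_S, extra) * 2 * io_t) # savings from extra memory

# Example: 20-megabyte R and 40-megabyte S with 25,000-byte blocks
# (800 and 1,600 blocks), hypothetical tuple counts, 10 megabytes of memory.
print(grace_cost(800, 1600, 100_000, 200_000, 400))
```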
2.5 Hybrid Hash-Join Algorithm
Hybrid hash combines the features of the two preceding algorithms, doing both
partitioning and hashing on the first pass over both relations. On the first pass,
instead of using memory as a buffer as is done in the GRACE algorithm, only as
many blocks (B, defined below) as are necessary to partition R into sets that can
fit in memory are used. The rest of memory is used for a hash table that is
processed at the same time that R and S are being partitioned (see Figure 5).
Here are the steps of the hybrid hash algorithm. If |R| * F ≤ |M|, then a hash
table for R will fit in real memory, and hybrid hash is identical to simple hash
in this case.
(1) Let B = ⌈(|R| * F - |M|) / (|M| - 1)⌉.
There will be B + 1 steps in the hybrid hash algorithm. (To motivate the formula
for B, we note that it is approximately equal to the number of steps in simple
hash. The small difference is due to setting aside some real memory in phase 1
for a hash table for R0.) First, choose a hash function h and a partition of its
hash values which will partition R into R0, ..., RB, such that a hash table for R0
has |M| - B blocks, and R1, ..., RB are of equal size. Then allocate B blocks in
memory to B output buffers. Assign the other |M| - B blocks of memory to a
hash table for R0.
(2) Assign the ith output buffer block to Ri for i = 1, ..., B. Scan R. Hash
each tuple with h. If it belongs to R0 it will be placed in memory in a hash table.
Otherwise it belongs to Ri for some i > 0, so move it to the ith output buffer
block. When this step has finished, we have a hash table for R0 in memory, and
R1, ..., RB are on disk.
(3) The partition of R corresponds to a partition of S compatible with h, into
sets S0, ..., SB. Assign the ith output buffer block to Si for i = 1, ..., B. Scan
S, hashing each tuple with h. If the tuple is in S0, probe the hash table in memory
for a match. If there is a match, output the result tuple, otherwise drop the tuple.
If the tuple is not in S0, it belongs to Si for some i > 0, so move it to the ith
output buffer block. Now R1, ..., RB and S1, ..., SB are on disk. (These steps
are sketched in code below.)
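A minimal sketch of these steps follows, under simplifying assumptions of our own: relations are lists of (key, payload) tuples, memory is counted in tuples (at least two), the B output buffers become in-memory spill lists, and R is split uniformly across the B + 1 partitions rather than sizing R0 to exactly fill memory.

```python
import collections, math

def hybrid_hash_join(R, S, memory_tuples, F=1.4):
    # Step (1): B = 0 means a hash table for all of R fits in memory.
    excess = len(R) * F - memory_tuples
    B = 0 if excess <= 0 else math.ceil(excess / (memory_tuples - 1))
    part = lambda key: hash(key) % (B + 1)   # h and a partition of its values

    # Step (2): scan R; R0 builds the in-memory hash table, R1..RB spill.
    table = collections.defaultdict(list)
    r_parts = [[] for _ in range(B)]
    for key, payload in R:
        i = part(key)
        if i == 0:
            table[key].append(payload)
        else:
            r_parts[i - 1].append((key, payload))

    # Step (3): scan S; S0 probes the table immediately, S1..SB spill.
    results = []
    s_parts = [[] for _ in range(B)]
    for key, payload in S:
        i = part(key)
        if i == 0:
            results.extend((key, r, payload) for r in table.get(key, ()))
        else:
            s_parts[i - 1].append((key, payload))

    # Remaining steps: each spilled pair (Ri, Si) now fits in memory, so
    # build a hash table for Ri and probe it with Si, one pair at a time.
    for Ri, Si in zip(r_parts, s_parts):
        t = collections.defaultdict(list)
        for key, payload in Ri:
            t[key].append(payload)
        results.extend((key, r, payload)
                       for key, payload in Si for r in t.get(key, ()))
    return results
```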
³ Here we have used the formula for B above, and have assumed |M| is large enough that |M| - 1
can be approximated by |M|.
As with hybrid's CPU time, the coefficients of {R} and {S} are similar except
for the logarithm terms, which force sort-merge to be more costly. We
conclude that GRACE dominates sort-merge except when R is small or
when √(F * |R|) > √(|S|).
We have modeled the performance of the four join algorithms by numerically
evaluating our formulas for a sample set of system parameters given in Figure 6.
We should note that all the modelings of the hash-based algorithms are somewhat
optimistic, since we have assumed no partition overflow. We discuss in Section
4 ways to deal with partition overflow.
In Figure 7 we display the relative performance of the four join algorithms as
described above. As we have noted, each algorithm requires a minimum amount
of main memory. For the relations modeled in Figure 7, the minimum memory
size for sort-merge is 1.1 megabytes, and for hash-based algorithms the minimum
memory size is 0.8 megabytes.
Among hash algorithms, simple and our modification of GRACE join each
perform as expected, with simple doing well for high memory values and GRACE
for low memory. Hybrid dominates both, as we have shown above.
The curves for simple and hybrid hash-join level off at just above 20 megabytes,
when a hash table for R fits in main memory. It is an easy matter to modify the
GRACE hash algorithm so that, if this occurs, GRACE defaults to the
simple algorithm; thus GRACE and simple would be identical above that point.
Fig. 8. Maximum relation sizes for varying amounts of main memory. (a) Sort-merge (larger
relation); (b) hash-based (smaller relation). Block size is 25,000 bytes and F is 1.4. Note that the
sort-merge curve represents the maximum size of the larger relation, while curve (b) shows the
maximum size of the smaller relation for the hash-based algorithms.

Fig. 9. CPU + I/O time of the hybrid algorithm with 5, 10, 20, and 30 megabytes of real memory.
Figure 9 shows the performance of the hybrid algorithm for a few fixed-memory
sizes, as relation sizes vary. Note that when the relation R can fit in main
memory, the execution time is not very large: less than 12 seconds for relations
up to 20 megabytes (excluding, as we always do, the time required to read R and
S and write the result, but assuming no CPU I/O overlap). It is also clear from
Figure 9 that when the size of main memory is much less than the size of R,
performance degrades rapidly.
4. PARTITION OVERFLOW
In all the hashing algorithms that use partitioning, namely simple, GRACE, and
hybrid hash, we have made assumptions about the expected size of the subsets
of the partitions. For example, in the simple hash-join algorithm, when the
relation R cannot fit in memory, we assumed that we could choose a hash
function h and a partition of its hash values that will partition R into two
subsets, so that a hash table for the first subset would fit exactly into memory.
What happens if we guess incorrectly, and memory fills up with the hash table
before we are finished processing R?
In [14] this problem is called “bucket overflow.” We use the term “partition
overflow” because we want to distinguish between the subsets of the partitions
of R and S, produced in the first phase of processing, and the buckets of the
hash table of R tuples, produced in the second phase, even though both are really
hash buckets.
The designers of GRACE deal with overflow by the use of "tuning" (i.e.,
beginning with very small partitions, and then, when the size of the smaller
partitions is known, combining them into larger partitions of the appropriate
size).
the simple hash case: reassign some buckets to a new partition on disk. This new
partition can be handled just like the others, or it can be spread over others if
some partitions are smaller than expected. All this is done before S is processed,
so the modified partitioning function can be used to process S.
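The bucket-reassignment fix can be sketched as follows. This is our own illustration, assuming each partition is represented as a map from hash-bucket id to the R tuples in that bucket; the function name and tuple-based size threshold are hypothetical.

```python
def split_overflowed(partitions, target_tuples):
    """Reassign hash buckets out of any partition exceeding target_tuples
    into a new partition, before S is processed, so the modified
    partitioning function can be used for S as well."""
    size = lambda part: sum(len(tuples) for tuples in part.values())
    new_part = {}
    for part in partitions:
        # A single bucket larger than the target cannot be split further;
        # that case needs a new hash function (or spilling, as in Section 4).
        while size(part) > target_tuples and len(part) > 1:
            bucket_id, tuples = part.popitem()   # move one hash bucket out
            new_part[bucket_id] = tuples
    if new_part:
        partitions.append(new_part)  # or spread it over smaller partitions
    return partitions

# Example: the first partition overflows, so one bucket is reassigned.
parts = [{0: ["r1", "r2"], 1: ["r3"]}, {2: ["r4"]}]
print(split_overflowed(parts, target_tuples=2))
```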
after it has been read into T, so that when part of C needs to be paged out the
system will take the artificially aged page instead of a yet unprocessed page. With
this page-aging facility, a full |C| * Q blocks of C will be read directly from real
memory and will generate I/O savings.
5.5 Performance in the Hot Set + Virtual Memory Model
Figure 11 presents the results of an analytic modeling of Hybrid-VM, assuming
that 5 megabytes of real memory are allocated to the hot set where T resides,
and other real memory is used to support the virtual memory in which C resides.
In graph (a), we have assumed that only |C| * Q² blocks of C are read directly
from real memory, as would be the case if LRU were used. In (b) we assume
|C| * Q blocks of C are read from real memory, as would be the case if
page-aging were used.
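To make the Q versus Q² distinction concrete (the worked numbers are ours):

```latex
\[
  \text{blocks of } C \text{ read from real memory} \;=\;
  \begin{cases}
    |C| \cdot Q^{2} & \text{plain LRU (graph (a)),}\\[2pt]
    |C| \cdot Q     & \text{LRU with page-aging (graph (b)).}
  \end{cases}
\]
```

For example, with Q = 1/2, plain LRU reads only |C|/4 blocks from real memory while page-aging reads |C|/2, i.e., twice the I/O savings.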
According to Figure 11, the most efficient processing, with all real memory,
requires 28 megabytes and takes 16 seconds, compared to 54 megabytes and 25
seconds when virtual memory is used, with the hot-set size of 5 megabytes. More
memory is needed when virtual memory is used because in that case subsets from
both R and S are stored.
One way to explain the poorer performance of graph 11(b) compared to graph
11(c) is to view 11(b) as the result of reducing the size of the hot set, and therefore
T. When the hot set is large (e.g., when it can hold a hash table for R) no virtual
memory is needed, and performance is given by (c) at its minimum CPU + I/O
time. As the hot-set size decreases, performance degrades. At the minimum, when
the hot-set size is √(F * |R|), hybrid's performance in the hot set + virtual memory
model is identical to that of GRACE, since the algorithms for GRACE and hybrid
in Section 2 are identical when |M| = √(F * |R|).
The performance of sort-merge in the hot set + virtual memory model with
LRU plus page-aging is identical to the all real memory case, because sort-merge
uses real memory beyond the √(|S|) blocks needed for T to store C and
save I/O. Therefore, only √(|S|) blocks of real memory should be assigned to the
hot set for sort-merge.
We conclude that if page-aging is possible, then the performance of sort-merge
is unaffected in the hot set + virtual memory model, but the performance of the
Fig. 11. CPU + I/O time of the hybrid algorithm for varying amounts of memory. |R| = 20 megabytes,
|S| = 40 megabytes. For (a) and (b), hot-set size is 5 megabytes. (a) Hybrid-VM with LRU;
(b) Hybrid-VM with page-aging; (c) Hybrid-RM: all real memory.
6. OTHER TOOLS
In this section we discuss three tools that have been proposed to increase the
efficiency of join processing, namely database filters, Babb arrays, and semijoins.
Our objective is to show that all of them can be used equally effectively with any
of our algorithms.
Database filters [19] are an important tool to make database managers more
efficient. Filters are a mechanism to process records as they come off the disk,
and send to the database only those which qualify. Filters can be used easily with
our algorithms, since we have made no assumption about how the selections and
projections of the relations R and S are made before the join.
Another popular tool is the Babb array [1]. This idea is closely related to the
concept of partitioning which we have described in Section 2. As R is processed,
a boolean array is built. Each bit in the array corresponds to a hash bucket, and
the bit is turned on when an R tuple hashes into that bucket. Then, as each tuple
s from S is processed, the boolean array is checked, and if s falls in a bucket for
which there are no R tuples, the s tuple can be discarded without checking R
itself. This is a very powerful tool when relatively few tuples qualify for the join.
The Babb array can easily be added to any of our algorithms. The first time R
is scanned, the array is built, and when S is scanned, some tuples can be discarded.
Its greatest cost is for space to store the array. Given a limited space in which to
store the array, another problem is to find a hash function to use in constructing
the array so that the array will carry maximum information. It is possible to use
several hash functions and an array for each, to increase the information, but
with limited space this alternative allows each hash function a smaller array and
therefore less information. Babb arrays are most useful when the join has a high
selectivity (i.e., when there are few matching tuples).
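A Babb array amounts to only a few lines of code. The sketch below is our own minimal rendering, using a single hash function and a Python integer as the bit array; a production version would size the array to the available space and might use several hash functions, as discussed above.

```python
class BabbArray:
    """One bit per hash bucket: set while scanning R, checked while scanning S."""

    def __init__(self, num_bits=1 << 20):
        self.num_bits = num_bits
        self.bits = 0                      # arbitrary-precision int as bit array

    def add(self, key):
        """Called for each R tuple: turn on the bit for its hash bucket."""
        self.bits |= 1 << (hash(key) % self.num_bits)

    def might_match(self, key):
        """Called for each S tuple: False means the tuple cannot join and can
        be discarded; True means it must still be checked against R itself."""
        return (self.bits >> (hash(key) % self.num_bits)) & 1 == 1

# Usage: build the array on the first scan of R, then filter S early.
babb = BabbArray()
for key in ("a", "b"):                     # keys seen while scanning R
    babb.add(key)
print(babb.might_match("a"), babb.might_match("z"))   # True, almost surely False
```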
Finally, we discuss the semijoin [2]. This is often regarded as an alternative
way to do joins, but as we shall see it is a special case of a more general tool. The
semijoin is constructed as follows.
(1) Construct the projection of R on its joining attributes. We denote this
projection by π(R).
(2) Join π(R) to S. The result is called the semijoin of S with R and is denoted
S ⋉ R. The semijoin of S with R is the set of S tuples that participate in the
join of R and S.
(3) Join S ⋉ R to R. The result is equal to the join of R and S.
These steps can be integrated into any of our algorithms. When first scanning
R, one constructs π(R) and, when first scanning S, one discards tuples whose
joining attribute values do not appear in π(R). If the join has a low selectivity,
then this will reduce significantly the number of S tuples to be processed, and
will be a useful tool to add to any of the above algorithms.
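Folding the semijoin into the first scans might look like the sketch below (ours, with π(R) kept as an in-memory set; the process_r_tuple callback standing in for the chosen join algorithm's first-pass work is a hypothetical name).

```python
def scan_r_collecting_projection(R, process_r_tuple):
    """First scan of R: do the normal first-pass work (partitioning, hash
    table building) via the callback, and collect pi(R) on the side."""
    pi_R = set()
    for key, payload in R:
        pi_R.add(key)                  # build pi(R) on the joining attribute
        process_r_tuple(key, payload)  # whatever the join algorithm does here
    return pi_R

def semijoin_filter(S, pi_R):
    """First scan of S: discard tuples that cannot participate in the join."""
    return [(key, payload) for key, payload in S if key in pi_R]

# Usage with trivial first-pass work:
pi_R = scan_r_collecting_projection([(1, "r1"), (2, "r2")], lambda k, v: None)
print(semijoin_filter([(1, "s1"), (9, "s2")], pi_R))   # keeps only (1, "s1")
```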
The most significant expense of the semijoin tool is space to store π(R). For
example, in some cases π(R) might be almost as large as R. Can we minimize
the space needed to store π(R)? One obvious candidate is a Babb array. In fact,
the Babb array and semijoins are just two specific examples of a more general
tool, which can be described as follows.
(1') Construct a structure o(R) which contains some information about the
relation π(R), where π(R) is defined in (1) above. In particular, o(R) must
contain enough information to tell when a given tuple is not in π(R).
(2') Scan S and discard those tuples which, given the information in o(R),
cannot participate in the join. Denote the set of undiscarded S tuples
by S'.
(3') Join S' to R. The result is equal to the join of R with S.
The semijoin tool takes o(R) equal to π(R), while the Babb array is another
representation of o(R), which may be much more compact than π(R). This more
general tool is a special case of the Tuneable Dynamic Filter described in [13].
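The pattern in (1')-(3') amounts to an interface with two operations. The sketch below is our own framing with hypothetical names: the exact projection gives the semijoin, while the BabbArray class sketched earlier is the compact alternative.

```python
from typing import Hashable, Iterable, Protocol

class JoinFilter(Protocol):
    """The structure o(R) of step (1'): "maybe present" may err, but a
    "not present" answer must be correct."""
    def add(self, key: Hashable) -> None: ...
    def might_contain(self, key: Hashable) -> bool: ...

class ExactProjection:
    """o(R) = pi(R) itself: with this choice, steps (2')-(3') compute the
    semijoin exactly."""
    def __init__(self) -> None:
        self.keys: set = set()
    def add(self, key: Hashable) -> None:
        self.keys.add(key)
    def might_contain(self, key: Hashable) -> bool:
        return key in self.keys

def filter_s(S: Iterable, o_R: JoinFilter) -> list:
    """Step (2'): keep only S tuples that might participate in the join."""
    return [(key, payload) for key, payload in S if o_R.might_contain(key)]
```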
7. CONCLUSIONS
We have defined and analyzed three hash-based equijoin algorithms, plus a
version of sort-merge that takes advantage of significant amounts of main
memory. These algorithms can also operate efficiently with relatively little main
memory. If the relations are sufficiently large, then one hash-based algorithm, a
hybrid of the other two, is proved to be the most efficient of all the algorithms
we study.
The hash-based join algorithms all partition the relations into subsets which
can be processed in main memory. Simple mechanisms exist to minimize overflow
of these partitions and to correct it when it occurs, but the quantitative effect of
these mechanisms remains to be investigated.
The algorithms we describe can operate in virtual memory with a relatively
small “hot set” of nonpageable real memory. If it is possible to age pages, marking
them for paging out as soon as possible, then sort-merge has the same perform-
ance in the hot set plus virtual memory model as in the all real memory model,
while the performance of the hybrid algorithm degrades. If aging is not possible,
then the performance of both hybrid and sort-merge degrades. In fact, if a
fraction Q of the required virtual memory space is supported by real memory,
then the absence of an aging facility can result in performance equal to that with
only a fraction Q² of real pages. In the hot set plus virtual memory model, the
hybrid hash-based algorithm still has better performance than sort-merge for
sufficiently large relations.
Database filters, Babb arrays, and semijoin strategies can be incorporated into
any of our algorithms if they prove to be useful.
We conclude that, with decreasing main memory costs, hash-based algorithms
will become the preferred strategy for joining large relations.
REFERENCES
1. BABB, E. Implementing a relational database by means of specialized hardware. ACM Trans.
Database Syst. 4, 1 (Mar. 1979).
2. BERNSTEIN, P. A. Query processing in a system for distributed databases (SDD-1). ACM Trans.
Database Syst. 6, 4 (Dec. 1981), 602-625.
3. BITTON, D., BORAL, H., DEWITT, D., AND WILKINSON, W. Parallel algorithms for the execution
of relational database operations. ACM Trans. Database Syst. 8, 3 (Sept. 1983), 324-353.
4. BLASGEN, M. W., AND ESWARAN, K. P. Storage and access in relational databases. IBM
Syst. J. 16, 4 (1977).
5. BRATBERGSENGEN, K. Hashing methods and relational algebra operations. In Proceedings of
the Conference on Very Large Data Bases (Singapore, 1984).
6. DEWITT, D., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D.
Implementation techniques for main memory database systems. In Proceedings of SIGMOD
(Boston, 1984), ACM, New York.
7. DEWITT, D., AND GERBER, R. Multiprocessor hash-based join algorithms. In Proceedings of the
Conference on Very Large Data Bases (Stockholm, 1985).
8. DIGITAL EQUIPMENT CORP. Product announcement, 1984.
9. EFFELSBERG, W., AND HAERDER, T. Principles of database buffer management. ACM Trans.
Database Syst. 9, 4 (Dec. 1984), 560-595.
10. GARCIA-MOLINA, H., LIPTON, R., AND VALDES, J. A massive memory machine. IEEE Trans.
Comput. C-33, 5 (1984), 391-399.
11. GOODMAN, J. R. An investigation of multiprocessor structures and algorithms for data base
management. Memo. UCB/ERL M81/33, Electronics Research Lab., Univ. of California, Berkeley,
1981.
12. KERSCHBERG, L., TING, P., AND YAO, S. B. Query optimization in Star computer networks. ACM
Trans. Database Syst. 7, 4 (Dec. 1982), 678-711.
13. KIESSLING, W. Tunable dynamic filter algorithms for high performance database systems. In
Proceedings of the International Workshop on High Level Computer Architecture (May 1984),
6.10-6.20.
14. KITSUREGAWA, M., ET AL. Application of hash to data base machine and its architecture. New
Generation Comput. 1 (1983), 62-74.
15. KNUTH, D. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley,
Reading, Mass., 1973.
16. PIATETSKY-SHAPIRO, G., AND CONNELL, C. Accurate estimation of the number of tuples
satisfying a condition. In Proceedings of SIGMOD Annual Meeting (Boston, 1984), ACM,
New York.
17. SACCO, G. M., AND SCHKOLNICK, M. A mechanism for managing the buffer pool in a relational
database system using the hot-set model. Computer Science Res. Rep. RJ-3354, IBM Research
Lab., San Jose, Calif., Jan. 1982.
18. SEVERANCE, D., AND DUHNE, R. A practitioner's guide to addressing algorithms. Commun.
ACM 19, 6 (June 1976), 314-326.
19. SLOTNICK, D. Logic per track devices. In Advances in Computers, Vol. 10, J. Tou, Ed., Academic
Press, New York, 1970, 291-296.
20. STONEBRAKER, M. Operating system support for database management. Commun. ACM 24, 7
(July 1981), 412-418.
21. VALDURIEZ, P., AND GARDARIN, G. Join and semijoin algorithms for a multiprocessor database
machine. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 133-161.
22. YAMANE, Y. A hash join technique for relational database systems. In Proceedings of the
International Conference on Foundations of Data Organization (Kyoto, May 1985).