Frigo CacheOblivious FOCS99
Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give an Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

[Figure 1: The ideal-cache model. A CPU performing W work accesses an ideal cache of Z words, organized by optimal replacement into Z/L cache lines of length L; Q cache misses move lines between the cache and an arbitrarily large main memory.]
We introduce an “ideal-cache” model to analyze our algorithms, and we prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels. We also prove that any optimal cache-oblivious algorithm is also optimal in the previously studied HMM and SUMH models. Algorithms developed for these earlier models are perforce cache-aware: their behavior varies as a function of hardware-dependent parameters which must be tuned to attain optimality. Our cache-oblivious algorithms achieve the same asymptotic optimality, but without any tuning.

1 Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several “cache-oblivious” algorithms that use cache as effectively as “cache-aware” algorithms.

Before discussing the notion of cache obliviousness, we first introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines, each consisting of L consecutive words that are always moved together between cache and main memory. Cache designers typically use L > 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

Z = Ω(L²),   (1)

which is usually true in practice.

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [18, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is farthest in the future [7], and thus it exploits

(This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.)
temporal locality perfectly.

An algorithm with an input of size n is measured in the ideal-cache model in terms of its work complexity W(n)—its conventional running time in a RAM model [4]—and its cache complexity Q(n; Z, L)—the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache. When Z and L are clear from context, we denote the cache complexity as simply Q(n) to ease notation.

We define an algorithm to be cache aware if it contains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good performance has been obtained using cache-aware algorithms, but we shall exhibit several cache-oblivious algorithms that are asymptotically as efficient as their cache-aware counterparts.

To illustrate the notion of cache awareness, consider the problem of multiplying two n × n matrices A and B to produce their n × n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that n is “big,” i.e., n > L, in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [17, p. 45]. The idea is to view each matrix M as consisting of (n/s) × (n/s) submatrices M_ij (the blocks), each of which has size s × s, where s is a tuning parameter. The following algorithm implements this strategy:

BLOCK-MULT(A, B, C, n)
1 for i ← 1 to n/s
2   do for j ← 1 to n/s
3     do for k ← 1 to n/s
4       do ORD-MULT(A_ik, B_kj, C_ij, s)

where ORD-MULT(A, B, C, s) is a subroutine that computes C ← C + AB on s × s matrices using the ordinary O(s³) algorithm. (This algorithm assumes for simplicity that s evenly divides n, but in practice s and n need have no special relationship, which yields more complicated code in the same spirit.)

Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus make BLOCK-MULT a cache-aware algorithm. To minimize the cache complexity, we choose s to be the largest value such that the three s × s submatrices simultaneously fit in cache. From the tall-cache assumption (1), we can see that s = Θ(√Z). Thus, each of the calls to ORD-MULT runs with at most Z/L = Θ(s²/L) cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ(1 + n²/L + (n/√Z)³(Z/L)) = Θ(1 + n²/L + n³/(L√Z)), since the algorithm has to read n² elements, which reside on ⌈n²/L⌉ cache lines.

The same bound can be achieved using a simple cache-oblivious algorithm that requires no tuning parameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also succumb to remarkably simple algorithms, which are described in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are optimal.

The ideal-cache model makes the perhaps-questionable assumption that memory is managed automatically by an optimal cache replacement strategy. Although the current trend in architecture does favor automatic caching over programmer-specified data movement, Section 6 addresses this concern theoretically. We show that the assumptions of two hierarchical memory models in the literature, in which memory movement is programmed explicitly, are actually no weaker than ours. Specifically, we prove (with only minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the previously studied HMM and SUMH models.

2 Matrix multiplication

This section describes an algorithm for multiplying an m × n by an n × p matrix cache-obliviously using Θ(mnp) work and incurring Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache misses.
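As a concrete illustration of the cache-aware strategy above, the following is a minimal Python sketch of BLOCK-MULT and ORD-MULT; the function and parameter names are ours, matrices are plain lists of lists, and, like the pseudocode, the sketch assumes that s divides n.

```python
def ord_mult(A, B, C, i0, j0, k0, s):
    # Ordinary O(s^3) multiply-accumulate of s x s blocks:
    # C[i0:i0+s][j0:j0+s] += A[i0:i0+s][k0:k0+s] * B[k0:k0+s][j0:j0+s].
    for i in range(i0, i0 + s):
        for j in range(j0, j0 + s):
            acc = C[i][j]
            for k in range(k0, k0 + s):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc

def block_mult(A, B, C, n, s):
    # Cache-aware blocked multiplication: s is the tuning parameter,
    # ideally chosen so three s x s blocks fit in cache (s = Theta(sqrt(Z))).
    assert n % s == 0, "like the pseudocode, this sketch assumes s divides n"
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                ord_mult(A, B, C, i, j, k, s)
```

With s = 1 the code degenerates to the ordinary triply nested loop; tuning s to the particular cache is exactly what makes the algorithm cache-aware.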
A similar divide-and-conquer approach applied to Strassen’s algorithm [25] incurs Θ(1 + n²/L + n^{lg 7}/(L√Z)) cache misses.

[Figure 2: Layout of a 16 × 16 matrix in (a) row-major, (b) column-major, (c) blocked, and (d) bit-interleaved order.]

To multiply an m × n matrix A by an n × p matrix B, the algorithm halves the largest of the three dimensions and recurs. In case (a), we have m ≥ max{n, p}. Matrix A is split horizontally:

A = (A1 / A2),   AB = (A1 B / A2 B),

and each half is multiplied by B. In case (b), we have n ≥ max{m, p}. Both matrices are split:

A = (A1 A2),   B = (B1 / B2),   AB = A1 B1 + A2 B2.

In case (c), we have p ≥ max{m, n}. Matrix B is split vertically, and each half is multiplied by A:

AB = A (B1 B2) = (AB1 AB2).

For square matrices, these three cases together with the base case yield a divide-and-conquer algorithm; the base case occurs when m = n = p = 1, in which case the two elements are multiplied and added into the result matrix.

As in BLOCK-MULT, the matrices are held on Θ((mn + np + mp)/L) cache lines, accounting for the cache misses that occur when the matrices are brought into cache. The tall-cache assumption enters the analysis because the matrices are stored in row-major order; tall caches are also needed if matrices are stored in column-major order (Figure 2(b)). The cache complexity Q(m, n, p) of the algorithm satisfies the recurrence

Q(m, n, p) ≤
  O((mn + np + mp)/L)    if (mn + np + mp) ≤ αZ,
  2Q(m/2, n, p) + O(1)   if m ≥ n and m ≥ p,
  2Q(m, n/2, p) + O(1)   if n ≥ m and n ≥ p,
  2Q(m, n, p/2) + O(1)   otherwise,

where α is a sufficiently small constant.
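The three splitting cases translate directly into code. The following Python sketch (names and index bookkeeping are ours) recursively halves the largest dimension and multiplies 1 × 1 blocks at the base, mirroring the recurrence above; it is written for clarity, not performance.

```python
def rec_mult(A, B, C, m, n, p, ai=0, ak=0, bj=0):
    # Computes C[ai.., bj..] += A[ai.., ak..] * B[ak.., bj..] for an
    # m x n block of A and an n x p block of B, halving the largest
    # dimension at each step.
    if m == 1 and n == 1 and p == 1:
        C[ai][bj] += A[ai][ak] * B[ak][bj]       # base case
    elif m >= n and m >= p:                      # case (a): split A horizontally
        rec_mult(A, B, C, m // 2, n, p, ai, ak, bj)
        rec_mult(A, B, C, m - m // 2, n, p, ai + m // 2, ak, bj)
    elif n >= m and n >= p:                      # case (b): AB = A1 B1 + A2 B2
        rec_mult(A, B, C, m, n // 2, p, ai, ak, bj)
        rec_mult(A, B, C, m, n - n // 2, p, ai, ak + n // 2, bj)
    else:                                        # case (c): split B vertically
        rec_mult(A, B, C, m, n, p // 2, ai, ak, bj)
        rec_mult(A, B, C, m, n, p - p // 2, ai, ak, bj + p // 2)
```

In case (b) both recursive calls accumulate into the same block of C, which is exactly the sum A1 B1 + A2 B2.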
No tall-cache assumption need be set for the bit-interleaved layout, since submatrices of size Θ(√L) × Θ(√L) are cache-obliviously stored on one cache line. The advantages of bit-interleaved and related layouts have been studied in [14] and [9, 10]. One of the practical disadvantages of bit-interleaved layouts is that index computations are more involved than for the standard layouts.

3 Matrix transposition and FFT

Optimal work and cache complexity can be obtained with a divide-and-conquer strategy, however. If n ≥ m, we partition

A = (A1 A2),   B = (B1 / B2).

Then, we recursively execute TRANSPOSE(A1, B1) and TRANSPOSE(A2, B2). If m > n, we divide matrix A horizontally and matrix B vertically and likewise perform two transpositions recursively. The next two lemmas provide upper and lower bounds on the performance of this algorithm.

Lemma 1 The cache-oblivious matrix-transpose algorithm involves O(mn) work and incurs O(1 + mn/L) cache misses.

This bound is optimal. Using matrix transposition as a subroutine, we convert a variant [30] of the “six-step” fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses.

The discrete Fourier transform Y of an array X of n complex numbers is given by

Y[i] = Σ_{j=0}^{n−1} X[j] ω_n^{−ij},   (3)

where ω_n = e^{2π√−1/n} is a primitive nth root of unity, and 0 ≤ i < n.

If n can be factored as n = n1 n2, Equation (3) can be rewritten as

Y[i1 + i2 n1] = Σ_{j2=0}^{n2−1} [ ( Σ_{j1=0}^{n1−1} X[j1 n2 + j2] ω_{n1}^{−i1 j1} ) ω_n^{−i1 j2} ] ω_{n2}^{−i2 j2}.   (4)

Observe that both the inner and the outer summation in Equation (4) are DFTs. Operationally, the computation specified by Equation (4) can be performed by computing n2 transforms of size n1
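The recursive transpose just described can be sketched as follows (function names are ours). It writes the transpose of A into B, splitting the larger dimension until the base case; a practical implementation would coarsen the base case, but the recursive structure is what matters for the cache analysis.

```python
def transpose(A, B, m, n, ai=0, aj=0):
    # Writes B = A^T for the m x n block of A whose top-left corner is
    # (ai, aj); the corresponding block of B starts at (aj, ai).
    if m == 1 and n == 1:
        B[aj][ai] = A[ai][aj]
    elif n >= m:                       # split the wider dimension
        transpose(A, B, m, n // 2, ai, aj)
        transpose(A, B, m, n - n // 2, ai, aj + n // 2)
    else:                              # split the taller dimension
        transpose(A, B, m // 2, n, ai, aj)
        transpose(A, B, m - m // 2, n, ai + m // 2, aj)
```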
(the inner sum), multiplying the result by the factors ω_n^{−i1 j2} (called the twiddle factors [13]), and finally computing n1 transforms of size n2 (the outer sum).

We choose n1 to be 2^{⌈lg n/2⌉} and n2 to be 2^{⌊lg n/2⌋}. The recursive step then operates as follows.

1. Pretend that the input is a row-major n1 × n2 matrix A. Transpose A in place, i.e., use the cache-oblivious algorithm to transpose A onto an auxiliary array B, and copy B back onto A.
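Equation (4) and the recursive factorization above can be sketched directly in Python (for n a power of two, matching the choice of n1 and n2). The sketch builds the intermediate columns explicitly, so it shows only the arithmetic structure of the algorithm, not the in-place transpositions that give it its cache behavior; all names are ours.

```python
import cmath

def dft(x):
    # Direct O(n^2) DFT, Equation (3): Y[i] = sum_j X[j] * w_n^(-ij).
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n)
                for j in range(n)) for i in range(n)]

def fft(x):
    # Recursive Cooley-Tukey factorization following Equation (4),
    # with n1 = 2^ceil(lg n / 2) and n2 = 2^floor(lg n / 2).
    n = len(x)
    if n <= 2:
        return dft(x)
    lg = n.bit_length() - 1            # n is assumed to be a power of two
    n1 = 1 << ((lg + 1) // 2)
    n2 = n // n1
    # Inner sums: n2 transforms of size n1.
    inner = [fft([x[j1 * n2 + j2] for j1 in range(n1)]) for j2 in range(n2)]
    Y = [0j] * n
    for i1 in range(n1):
        # Twiddle factors w_n^(-i1*j2), then outer sums of size n2.
        col = [inner[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
               for j2 in range(n2)]
        out = fft(col)
        for i2 in range(n2):
            Y[i1 + i2 * n1] = out[i2]
    return Y
```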
In order to simplify the analysis of the cache complexity, assume a tall cache, in which case each transposition operation and the multiplication by the twiddle factors require at most O(1 + n/L) cache misses. Thus, the cache complexity satisfies the recurrence

Q(n) ≤ O(1 + n/L)                          if n ≤ αZ,
       n1 Q(n2) + n2 Q(n1) + O(1 + n/L)    otherwise,   (5)

whose solution is Q(n) = O(1 + (n/L)(1 + log_Z n)), which is asymptotically optimal for a Cooley-Tukey algorithm, matching the lower bound by Hong and Kung [19] when n is an exact power

4 Funnelsort

[Figure 3: Illustration of a k-merger. The k inputs feed left mergers L1, …, L√k, whose outputs are stored in buffers merged by a single right merger R.]

Funnelsort is similar to mergesort. In order to sort a (contiguous) array of n elements, funnelsort performs the following two steps:

1. Split the input into n^{1/3} contiguous arrays of size n^{2/3}, and sort these arrays recursively.

2. Merge the n^{1/3} sorted sequences using a n^{1/3}-merger, which is described below.

The k-merger operates by recursively merging sorted sequences that become progressively longer as the algorithm proceeds. Unlike mergesort, however, a k-merger stops working on a merging subproblem when the merged output sequence becomes
“long enough,” and it resumes working on another merging subproblem.

Since this complicated flow of control makes a k-merger a bit tricky to describe, we explain the operation of the k-merger pictorially. Figure 3 shows a representation of a k-merger, which has k sorted sequences as inputs. Throughout its execution, the k-merger maintains the following invariant.

Invariant Each invocation of a k-merger outputs the next k³ elements of the sorted sequence obtained by merging the k input sequences.

A k-merger is built recursively out of √k-mergers: the k inputs are partitioned into √k sets of √k sequences, which are merged by the left mergers L1, L2, …, L√k; the outputs of the left mergers are stored in buffers and merged by a single right √k-merger R. The output of this final √k-merger becomes the output of the whole k-merger. The reader should notice that the intermediate buffers are overdimensioned. In fact, each buffer can hold 2k^{3/2} elements, which is twice the number k^{3/2} of elements output by a √k-merger. This additional buffer space is necessary for the correct behavior of the algorithm.
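The recursive shape of the k-merger — √k left mergers feeding a right merger — can be sketched as below. The sketch merges eagerly, so it omits the buffers and the lazy “resume when a buffer runs low” scheduling that the cache analysis depends on; the names are ours.

```python
import heapq
import math

def k_merger(seqs):
    # Merge k sorted sequences by grouping them among ~sqrt(k) "left
    # mergers", merging each group recursively, and letting a final
    # "right merger" R combine the results (cf. Figure 3).
    k = len(seqs)
    if k <= 2:
        return list(heapq.merge(*seqs))
    g = max(2, math.isqrt(k))                  # number of left mergers
    buffers = [k_merger(seqs[i::g]) for i in range(g)]
    return list(heapq.merge(*buffers))         # the right merger R
```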
In order to output k³ elements, the k-merger invokes R k^{3/2} times. Before each invocation, however, the k-merger fills all buffers that are less than half full, i.e., all buffers that contain less than k^{3/2} elements. In order to fill buffer i, the algorithm invokes the corresponding left merger L_i once. Since L_i outputs k^{3/2} elements, the buffer contains at least k^{3/2} elements after L_i finishes.

It can be proven by induction that the work complexity of funnelsort is O(n lg n). The next theorem gives the cache complexity of funnelsort.

Theorem 3 Funnelsort sorts n elements incurring at most Q(n) cache misses, where

Q(n) = O(1 + (n/L)(1 + log_Z n)).

Proof. See Appendix B.

This upper bound matches the lower bound stated by the next theorem, proving that funnelsort is cache-optimal.

Theorem 4 The cache complexity of any sorting algorithm is Q(n) = Ω(1 + (n/L)(1 + log_Z n)).

Proof. Aggarwal and Vitter [3] show that there is an Ω((n/L) log_{Z/L}(n/Z)) bound on the number of cache misses made by any sorting algorithm.

5 Distribution sort

In this section, we describe a cache-oblivious distribution sort that uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses if the cache is tall. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 21, 28, 30], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm finds the pivots incrementally during the distribution. The algorithm sorts an array A of n elements as follows:

1. Partition A into √n contiguous subarrays of size √n. Recursively sort each subarray.

2. Distribute the sorted subarrays into q buckets B1, …, Bq of size n1, …, nq, respectively, such that
   (a) max{x | x ∈ B_i} ≤ min{x | x ∈ B_{i+1}} for all 1 ≤ i < q;
   (b) n_i ≤ 2√n for all 1 ≤ i ≤ q.
   (See below for details.)

3. Recursively sort each bucket.

4. Copy the sorted buckets to array A.
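The four steps of distribution sort can be sketched as follows. For brevity this sketch picks the bucket pivots up front from a sample instead of discovering them incrementally during the distribution as the paper’s algorithm does (a simplification that changes the cache analysis, not correctness), and all names are ours. Note that the bnum cursor only moves forward because each subarray is scanned in sorted order.

```python
import math

def distribution_sort(a):
    n = len(a)
    if n <= 16:
        return sorted(a)
    s = math.isqrt(n)
    # Step 1: partition into ~sqrt(n) contiguous subarrays, sort recursively.
    subs = [distribution_sort(a[i:i + s]) for i in range(0, n, s)]
    # Simplified up-front pivot choice (see lead-in).
    pivots = sorted(set(a[::s]))
    buckets = [[] for _ in range(len(pivots) + 1)]
    # Step 2: distribute; bnum never moves backward within a subarray.
    for sub in subs:
        bnum = 0
        for x in sub:
            while bnum < len(pivots) and x > pivots[bnum]:
                bnum += 1
            buckets[bnum].append(x)
    # Steps 3 and 4: sort each bucket recursively and concatenate.
    out = []
    for b in buckets:
        out.extend(sorted(b) if len(b) == n else distribution_sort(b))
    return out
```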
Distribution step The goal of Step 2 is to distribute the sorted subarrays of A into q buckets B1, B2, …, Bq. The algorithm maintains two invariants. First, at any time each bucket holds at most 2√n elements, and any element in bucket B_i is smaller than any element in bucket B_{i+1}. Second, every bucket has an associated pivot.

We keep state information for each subarray and bucket. The state of a subarray consists of the index next of the next element to be read from the subarray and the bucket number bnum where this element should be copied. By convention, bnum = ∞ if all elements in a subarray have been copied. The state of a bucket consists of the pivot and the number of elements currently in the bucket.

We would like to copy the element at position next of a subarray to bucket bnum. If this element is greater than the pivot of bucket bnum, we would increment bnum until we find a bucket for which the element is smaller than the pivot. Unfortunately, this basic strategy has poor caching behavior, which calls for a more complicated procedure.

The distribution step is accomplished by the recursive procedure DISTRIBUTE(i, j, m), which distributes elements from the ith through (i + m − 1)th subarrays into buckets starting from B_j. Given the precondition that subarrays i, i + 1, …, i + m − 1 have their bnum ≥ j, the execution of DISTRIBUTE(i, j, m) enforces the postcondition that the same subarrays have their bnum ≥ j + m. The following is the recursive implementation of DISTRIBUTE:

DISTRIBUTE(i, j, m)
1 if m = 1
2   then COPYELEMS(i, j)
3   else DISTRIBUTE(i, j, m/2)
4     DISTRIBUTE(i + m/2, j, m/2)
5     DISTRIBUTE(i, j + m/2, m/2)
6     DISTRIBUTE(i + m/2, j + m/2, m/2)

The procedure COPYELEMS(i, j) copies all elements from subarray i that belong to bucket j. If bucket j has more than 2√n elements after the insertion, it can be split into two buckets of size at least √n. For the splitting operation, we use the deterministic median-finding algorithm [12, p. 189] followed by a partition. The median-finding algorithm uses O(m) work and incurs O(1 + m/L) cache misses to find the median of an array of size m. (In our case, we have m = 2√n + 1.) In addition, when a bucket splits, the state of the affected subarrays and buckets must be updated.

Lemma 5 The distribute step uses O(n) work, incurs O(1 + n/L) cache misses, and uses O(n) stack space to distribute n elements.

Proof. See Appendix C.

Theorem 6 Distribution sort uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses to sort n elements.

Proof. The work done by the algorithm is given by

W(n) = √n W(√n) + Σ_{i=1}^q W(n_i) + O(n),

where each n_i ≤ 2√n and Σ n_i = n. The solution to this recurrence is W(n) = O(n lg n).

The cache complexity is described by the recurrence

Q(n) ≤ O(1 + n/L)                                       if n ≤ αZ,
       √n Q(√n) + Σ_{i=1}^q Q(n_i) + O(1 + n/L)         otherwise,

where α is a sufficiently small constant. In the base case, the input and all auxiliary space fit in cache, and the algorithm incurs O(1 + n/L) cache misses because it only needs to touch the memory locations once. In the case where n > αZ, the recursive calls in Steps 1 and 3 cause √n Q(√n) + Σ_{i=1}^q Q(n_i) cache misses and O(1 + n/L) is the cache complexity of Steps 2 and 4, as shown by Lemma 5. The theorem now follows by solving the recurrence.
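The recursion pattern of DISTRIBUTE can be sketched in isolation: it enumerates all (subarray, bucket) pairs in a cache-friendly quadrant order rather than row by row. Here the visit callback stands in for COPYELEMS, and m is assumed to be a power of two, matching the √n subarrays and buckets of the algorithm.

```python
def distribute(i, j, m, visit):
    # Visit the m x m grid of (subarray, bucket) pairs starting at
    # (i, j) in the quadrant order of DISTRIBUTE; visit(i, j) plays
    # the role of COPYELEMS(i, j).
    if m == 1:
        visit(i, j)
    else:
        h = m // 2
        distribute(i, j, h, visit)
        distribute(i + h, j, h, visit)
        distribute(i, j + h, h, visit)
        distribute(i + h, j + h, h, visit)
```

Collecting the visits for m = 4 shows that each of the 16 pairs is visited exactly once, in an order that keeps nearby subarrays and buckets together.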
6 Other cache models

In this section we show that cache-oblivious algorithms designed in the two-level ideal-cache model can be efficiently ported to other cache models. We show that algorithms whose complexity bounds satisfy a simple regularity condition (including all algorithms heretofore presented) can be ported to less-ideal caches incorporating least-recently-used (LRU) or first-in, first-out (FIFO) replacement policies [18, p. 378]. We argue that optimal cache-oblivious algorithms are also optimal for multilevel caches. Finally, we present simulation results proving that optimal cache-oblivious algorithms satisfying the regularity condition are also optimal (in expectation) in the previously studied SUMH [5, 28] and HMM [1] models. Thus, all the algorithmic results in this paper apply to these models, matching the best bounds previously achieved.

We define a cache complexity bound Q(n; Z, L) to be regular if

Q(n; Z, L) = O(Q(n; 2Z, L)).   (6)

We now show that optimal algorithms in the ideal-cache model whose cache complexity bounds are regular can be ported to these models to run using optimal work and incurring an optimal expected number of cache misses. The first lemma shows that the optimal and omniscient replacement strategy used by an ideal cache can be simulated efficiently by the LRU and FIFO replacement strategies.

Lemma 7 Consider an algorithm that causes Q*(n; Z, L) cache misses on a problem of size n using a (Z, L) ideal cache. Then, the same algorithm incurs Q(n; Z, L) ≤ 2Q*(n; Z/2, L) cache misses on a (Z, L) cache that uses LRU replacement. The same argument holds for FIFO caches.

Corollary 8 For algorithms with regular cache complexity bounds, the asymptotic number of cache misses is the same for LRU, FIFO, and optimal replacement.

Since previous two-level models do not support automatic replacement, to port a cache-oblivious algorithm to them, we implement an LRU (or FIFO) replacement strategy in software.

Lemma 9 A (Z, L) LRU-cache (or FIFO-cache) can be maintained using O(Z) primary memory locations such that every access to a cache line in primary memory takes O(1) expected time.

Proof. Given the address of the memory location to be accessed, we use a 2-universal hash function [20, p. 216] to maintain a hash table of the cache lines present in the primary memory; the hash table has Z/L entries, one per line held in primary memory.

Theorem 10 An optimal cache-oblivious algorithm with a regular cache-complexity bound can be implemented optimally in expectation in two-level models with explicit memory management.

Consequently, our cache-oblivious algorithms for matrix multiplication, matrix transpose, FFT, and sorting are optimal in two-level models.

6.2 Multilevel ideal caches

We now show that optimal cache-oblivious algorithms also perform optimally in computers with multiple levels of ideal caches. The ⟨(Z1, L1), (Z2, L2), …, (Zr, Lr)⟩ ideal-cache model consists of a hierarchy of r caches satisfying the inclusion property: for i = 1, 2, …, r − 1, the values stored in cache i are also
stored in cache i + 1. The performance of an algorithm in this model is measured by the number of cache misses it incurs at each of the r levels.

Theorem 11 An optimal cache-oblivious algorithm in the ideal-cache model incurs an asymptotically optimal number of cache misses on each level of a multilevel cache with optimal replacement.

Proof. The theorem follows directly from the definition of cache obliviousness and the optimality of the algorithm in the two-level ideal-cache model.

Theorem 12 An optimal cache-oblivious algorithm with a regular cache-complexity bound incurs an asymptotically optimal number of cache misses on each level of a multilevel cache with LRU, FIFO, or optimal replacement.

Proof. Follows from Corollary 8 and Theorem 11.

6.3 The SUMH model

In 1990 Alpern et al. [5] presented the uniform memory hierarchy model (UMH), a parameterized model for a memory hierarchy. In the UMH_{α,ρ,b(l)} model, for integer constants α, ρ > 1, the size of the ith memory level is Z_i = αρ^{2i} and the line length is L_i = ρ^i. A transfer of one ρ^l-length line between the caches on level l and l + 1 takes ρ^l/b(l) time. The bandwidth function b(l) must be nonincreasing and the processor accesses the cache on level 1 in constant time per access. An algorithm given for the UMH model must include a schedule that, for a particular set of input variables, tells exactly when each block is moved along which of the buses between caches. Work and cache misses are folded into one cost measure T(n). Alpern et al. prove that an algorithm that performs the optimal number of I/O’s at all levels of the hierarchy does not necessarily run in optimal time in the UMH model, since scheduling bottlenecks can occur when all buses are active. In the more restrictive SUMH model [28], however, only one bus is active at a time. Consequently, we can prove that optimal cache-oblivious algorithms run in optimal expected time in the SUMH model.

Lemma 13 A cache-oblivious algorithm with W(n) work and Q(n; Z, L) cache misses on a (Z, L) ideal cache can be executed in the SUMH_{α,ρ,b(l)} model in expected time

T(n) = O( W(n) + Σ_{i=1}^{r−1} (ρ^i/b(i)) · Q(n; Θ(Z_i), L_i) ),

where Z_i = αρ^{2i}, L_i = ρ^i, and Z_r is big enough to hold all elements used during the execution of the algorithm.

Proof. Use the memory at the ith level as a cache of size Z_i = αρ^{2i} with line length L_i = ρ^i and manage it with the software LRU described in Lemma 9. The rth level is the main memory, which is direct mapped and not organized by the software LRU mechanism. An LRU-cache of size Θ(Z_i) can be simulated by the ith level, since it has size Z_i. Thus, the number of cache misses at level i is 2Q(n; Θ(Z_i), L_i), and each takes ρ^i/b(i) time. Since only one memory movement happens at any point in time, and there are O(W(n)) accesses to level 1, the lemma follows by summing the individual costs.

Lemma 14 Consider a cache-oblivious algorithm whose work on a problem of size n is lower-bounded by W*(n) and whose cache complexity is lower-bounded by Q*(n; Z, L) on a (Z, L) ideal cache. Then, no matter how data movement is implemented in SUMH_{α,ρ,b(l)}, the time taken on a problem of size n is at least

T(n) = Ω( W*(n) + Σ_{i=1}^{r−1} (ρ^i/b(i)) · Q*(n; Θ(Z_i), L_i) ),

where Z_i = αρ^{2i}, L_i = ρ^i and Z_r is big enough to hold all elements used during the execution of the algorithm.

Proof. The optimal scheduling of the data movements does not need to obey the inclusion property, and thus the number of ith-level cache misses is at least as large as for an ideal cache of size Σ_{j=1}^i Z_j = O(Z_i). Since Q*(n; Z, L) lower-bounds the cache misses on a cache of size Z, at least Q*(n; Θ(Z_i), L_i) data movements occur at level i, each of which takes ρ^i/b(i) time. Since only one movement can occur at a time, the total cost is the maximum of the work and the sum of the costs at all the levels, which is within a factor of 2 of their sum.

Theorem 15 A cache-oblivious algorithm that is optimal in the ideal-cache model and whose cache complexity is regular can be executed in optimal expected time in the SUMH_{α,ρ,b(l)} model.

Proof. The theorem follows directly from regularity and Lemmas 13 and 14.
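The replacement policies compared in Lemma 7 (and the software cache of Lemma 9) are easy to simulate. The sketch below counts misses for LRU, via an ordered dictionary (a hash table plus recency order), and for the ideal farthest-in-the-future policy of Belady [7] on the same line-access trace; all names are ours, and the trace is an arbitrary sequence of line identifiers.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    # LRU cache of `capacity` lines; returns the number of misses.
    cache, misses = OrderedDict(), 0
    for line in trace:
        if line in cache:
            cache.move_to_end(line)          # refresh recency
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[line] = True
    return misses

def ideal_misses(trace, capacity):
    # Belady's optimal off-line policy: evict the line whose next
    # access lies farthest in the future.
    cache, misses = set(), 0
    for t, line in enumerate(trace):
        if line in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(l):
                for u in range(t + 1, len(trace)):
                    if trace[u] == l:
                        return u
                return float('inf')
            cache.remove(max(cache, key=next_use))
        cache.add(line)
    return misses
```

On any trace the ideal policy incurs at most as many misses as LRU at equal capacity, and comparing LRU at capacity Z with the ideal policy at capacity Z/2 illustrates the factor-of-2 relationship asserted by Lemma 7.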
6.4 The HMM model

Aggarwal, Alpern, Chandra and Snir [1] proposed the hierarchical memory model (HMM) in which an access to location x takes f(x) time. The authors assume that f is a monotonically nondecreasing function, usually of the form ⌈log x⌉ or ⌈x^α⌉. The final paper will show that optimal cache-oblivious algorithms run in optimal expected time in the HMM model.

7 Related work

In this section, we discuss the origin of the notion of cache-obliviousness. We also give an overview of other hierarchical memory models.

Our research group at MIT noticed as far back as 1994 that divide-and-conquer matrix multiplication was a cache-optimal algorithm that required no tuning, but we did not adopt the term “cache-oblivious” until 1997. This matrix-multiplication algorithm, as well as a cache-oblivious algorithm for LU-decomposition without pivoting, eventually appeared in [8]. Shortly after leaving our research group, Toledo [26] independently proposed a cache-oblivious algorithm for LU-decomposition, but with pivoting. For n × n matrices, Toledo’s algorithm uses Θ(n³) work and incurs Θ(1 + n²/L + n³/(L√Z)) cache misses.

Several hierarchical memory models appear in the literature. In the HMM model [1], the cost of an access to the element at location x is given by a cost function f(x). The BT model [2] extends HMM to support block transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows I/O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [29] and several parallel hierarchical memory models [30]. Vitter [27] provides a comprehensive survey of external-memory algorithms.

8 Conclusion

[All is well that ends]

Acknowledgments

Thanks to Bobby Blumofe, now of the University of Texas at Austin, who sparked early discussions at MIT about what we now call cache obliviousness. Thanks to David Wise of Indiana University, Sid Chatterjee of the University of North Carolina, Chris Joerg of Compaq Cambridge Research Center, and Bin Song of MIT for helpful discussions.
Appendix

A Analysis of matrix transposition

Lemma 1 The cache-oblivious matrix-transpose algorithm involves O(mn) work and incurs O(1 + mn/L) cache misses for an m × n matrix.

Proof. That the algorithm performs O(mn) work is straightforward. For the cache analysis, let Q(m, n) denote the cache complexity of transposing an m × n matrix, and let α be a constant sufficiently small that the base cases below fit in cache. We distinguish three cases.

Case I: max{m, n} ≤ αL.
Both matrices fit in O(1) + 2mn/L lines. From the choice of α, the number of lines required is at most Z/L. Therefore Q(m, n) = O(1 + mn/L).

Case II: m ≤ αL < n or n ≤ αL < m.
For this case, assume first that m ≤ αL < n. The transposition algorithm divides the greater dimension n by 2 and performs divide and conquer. At some point in the recursion, n is in the range αL/2 ≤ n ≤ αL, and the whole problem fits in cache: the n rows of m elements require at most O(1 + nm/L) cache misses to be read, and the cache complexity for this base case is O(1 + m). The cache complexity thus satisfies the recurrence

Q(m, n) ≤ O(1 + m)             if n ∈ [αL/2, αL],
          2Q(m, n/2) + O(1)    otherwise,

whose solution is Q(m, n) = O(1 + mn/L). The case n ≤ αL < m is analogous.

Case III: m, n > αL.
Like in Case II, at some point in the recursion both n and m are in the range [αL/2, αL]. The whole problem fits into cache and it can be solved with at most O(m + n + mn/L) cache misses. The cache complexity thus satisfies the recurrence

Q(m, n) ≤ O(m + n + mn/L)      if m, n ∈ [αL/2, αL],
          2Q(m/2, n) + O(1)    if m ≥ n,
          2Q(m, n/2) + O(1)    otherwise,

whose solution is Q(m, n) = O(1 + mn/L), provided that Z = Ω(L²). [Note to the program committee: we believe that this hypothesis can be relaxed; this result will appear in the final paper.]

B Analysis of funnelsort

The proof of Theorem 3 relies on several auxiliary lemmas. The first lemma bounds the space required by a k-merger.

Lemma 16 A k-merger can be laid out in O(k²) contiguous memory locations.

Proof. A k-merger requires O(k²) memory locations for the buffers, plus the space required by the √k-mergers. The space S(k) thus satisfies the recurrence

S(k) ≤ (√k + 1) S(√k) + O(k²),

whose solution is S(k) = O(k²).

In order to achieve the stated cache-complexity bound, the buffers in a k-merger must be maintained cache-efficiently, in the sense stated by the next lemma.

Lemma 17 Performing r insert and remove operations on a circular queue causes O(1 + r/L) cache misses as long as two cache lines are available for the buffer.

Proof. Associate the two cache lines with the head and the tail of the circular queue. Since a whole cache line is read during an insert (delete) operation, the next L − 1 insert (delete) operations do not cause a cache miss. The result follows.

The next lemma bounds the number of cache misses Q_M incurred by a k-merger.
Lemma 18 If Z = Ω(L²), then a k-merger operates with at most Q_M(k) cache misses, where

Q_M(k) = O(1 + k + k³/L + k³ log_Z k / L).

Proof. There are two cases: either k < α√Z or k ≥ α√Z, where α is a sufficiently small constant.

Assume first that k < α√Z. By Lemma 16, the data structure associated with the k-merger requires at most O(k²) = O(Z) contiguous memory locations, and therefore it fits into cache. The k-merger has k input queues, from which it loads O(k³) elements. Let r_i be the number of elements extracted from the ith input queue. The number of cache misses incurred in accessing the input queues is

Σ_{i=1}^k O(1 + r_i/L) = O(k + k³/L).

Accounting also for the misses incurred in writing the output and in touching the internal data structures, we obtain Q_M(k) = O(1 + k + k³/L), which proves the first case.

Assume now that k ≥ α√Z. We prove by induction that

Q_M(k) ≤ ck³ log_Z k / L − A(k),   (7)

where A(k) = k(1 + 2c log_Z k / L) is a o(k³) term. For the base case of the induction, the bound Q_M(k) = O(k³/L) holds; consequently, a big enough value of c can be found that satisfies Inequality (7).

For the inductive case, let k ≥ α√Z. The right merger R is invoked exactly k^{3/2} times. The total number l of invocations of left mergers is bounded by l ≤ k^{3/2} + 2√k. To see why, consider that every invocation of a left merger puts k^{3/2} elements into some buffer. Since k³ elements are output and the buffer space is 2k², the bound l ≤ k^{3/2} + 2√k follows.

Before invoking R, the algorithm must check every buffer to see whether it is empty. One such check requires at most √k cache misses, since there are √k buffers. This check is repeated exactly k^{3/2} times, leading to at most k² cache misses for all checks. These considerations lead to the recurrence

Q_M(k) ≤ (2k^{3/2} + 2√k) Q_M(√k) + k².

Application of the inductive hypothesis yields

Q_M(k) ≤ (2k^{3/2} + 2√k)(ck^{3/2} log_Z k / (2L) − A(√k)) + k²
       ≤ ck³ log_Z k / L + k²(1 + c log_Z k / L) − 2k^{3/2} A(√k)
       ≤ ck³ log_Z k / L − A(k),

where the last inequality uses A(√k) = √k(1 + c log_Z k / L). This verifies Inequality (7), completing the proof.

Proof of Theorem 3. If n < αZ for a small enough constant α, then the whole sort fits in cache and incurs O(1 + n/L) cache misses. Otherwise, by Lemma 18, the merge step incurs

Q_M(n^{1/3}) = O(1 + n^{1/3} + n/L + n log_Z n / L).

With the hypothesis Z = Ω(L²), we have n/L = Ω(n^{1/3}). Moreover, we also have n^{1/3} = Ω(1) and lg n = Ω(lg Z). Consequently, Q_M(n^{1/3}) = O(n log_Z n / L), and the theorem follows by solving the recurrence Q(n) ≤ n^{1/3} Q(n^{2/3}) + O(n log_Z n / L).
C Analysis of distribution sort

This appendix contains the proof of Lemma 5, which is used in Section 5.

Proof. See [12, p. 189] for the linear-time median-finding algorithm and the work analysis. The cache complexity is given by the same recurrence as the work complexity, with base case O(1 + m/L) if m ≤ αZ, where α is a sufficiently small constant. The result follows.

Lemma 5 The distribute step uses O(n) work, incurs O(1 + n/L) cache misses, and uses O(n) stack space to distribute n elements.

Proof. In order to simplify the analysis of the work used by DISTRIBUTE, assume that COPYELEMS uses O(1) work for procedural overhead; we account for the splitting of buckets separately.

For the cache analysis, we distinguish two cases. Let α be a sufficiently small constant such that the stack space used fits into cache.

Case I: n ≤ αZ.
The input and the auxiliary space of size O(n) fit into cache using O(1 + n/L) cache lines, so the distribution step reads and writes n contiguous elements, and additional O(1 + n/L) misses are incurred by restoring the state information.

Case II: n > αZ.
Let R(c, m) denote the cache misses incurred by an invocation of DISTRIBUTE(a, b, c) that copies m elements from subarrays to buckets. We again account for the splitting of buckets separately. We first prove that R satisfies the following recurrence:

R(c, m) ≤ O(L + m/L)              if c ≤ αL,
          Σ_{1 ≤ i ≤ 4} R(c/2, m_i)   otherwise,   (8)

where Σ_{1 ≤ i ≤ 4} m_i = m.

First, consider the base case c ≤ αL. An invocation of DISTRIBUTE(a, b, c) operates with c subarrays and c buckets, and the cache can hold a line from each of them along with the associated state information, for a total of O(L + m/L) cache misses: O(c) = O(L) cache misses are due to the initial access to each subarray and bucket, and O(1 + m/L) is the cache complexity of copying the m elements from contiguous to contiguous locations. This completes the proof of the base case. The recursive case, when c > αL, follows immediately from the algorithm. The solution for Equation (8) is R(c, m) = O(L + c²/L + m/L).
References

[1] A. Aggarwal, B. Alpern, A. K. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, pages 305–314, May 1987.

[2] A. Aggarwal, A. K. Chandra, and M. Snir. Hierarchical memory with block transfer. In 28th Annual Symposium on Foundations of Computer Science, pages 204–216, Los Angeles, California, 12–14 Oct. 1987. IEEE.

[3] A. Aggarwal and J. S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, Sept. 1988.

[4] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974.

[5] B. Alpern, L. Carter, and E. Feig. Uniform memory hierarchies. In Proceedings of the 31st Annual IEEE Symposium on Foundations of Computer Science, pages 600–608, Oct. 1990.

[6] D. H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing, 4(1):23–35, May 1990.

[7] L. A. Belady. A study of replacement algorithms for virtual storage computers. IBM Systems Journal, 5(2):78–101, 1966.

[8] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 297–308, Padua, Italy, June 1996.

[9] S. Chatterjee, V. V. Jain, A. R. Lebeck, and S. Mundhra. Nonlinear array layouts for hierarchical memory systems. In Proceedings of the ACM International Conference on Supercomputing, Rhodes, Greece, June 1999.

[10] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Parallel Algorithms and Architectures, June 1999.

[11] J. W. Cooley and J. W. Tukey. An algorithm for the machine computation of the complex Fourier series. Mathematics of Computation, 19:297–301, Apr. 1965.

[12] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw Hill, 1990.

[13] P. Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing, 19:259–299, Apr. 1990.

[14] J. D. Frens and D. S. Wise. Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 206–216, Las Vegas, NV, June 1997.

[15] M. Frigo. A fast Fourier transform compiler. In Proceedings of the ACM SIGPLAN'99 Conference on Programming Language Design and Implementation (PLDI), Atlanta, Georgia, May 1999.

[16] M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Seattle, Washington, May 1998.

[17] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1989.

[18] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., 2nd edition, 1996.

[19] J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebbling game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing, pages 326–333, Milwaukee, 1981.

[20] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[21] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the Fifth Symposium on Parallel Algorithms and Architectures, pages 120–129, Velen, Germany, 1993.

[22] J. E. Savage. Extending the Hong-Kung model to memory hierarchies. In D.-Z. Du and M. Li, editors, Computing and Combinatorics, volume 959 of Lecture Notes in Computer Science, pages 270–281. Springer Verlag, 1995.

[23] R. C. Singleton. An algorithm for computing the mixed radix fast Fourier transform. IEEE Transactions on Audio and Electroacoustics, AU-17(2):93–103, June 1969.
[24] D. D. Sleator and R. E. Tarjan. Amortized ef-
ficiency of list update and paging rules. Com-
munications of the ACM, 28(2):202–208, Feb.
1985.
[25] V. Strassen. Gaussian elimination is not op-
timal. Numerische Mathematik, 13:354–356,
1969.
[26] S. Toledo. Locality of reference in LU de-
composition with partial pivoting. SIAM
Journal on Matrix Analysis and Applications,
18(4):1065–1081, Oct. 1997.
[27] J. S. Vitter. External memory algorithms and
data structures. In J. Abello and J. S. Vitter,
editors, External Memory Algorithms and Visu-
alization, DIMACS Series in Discrete Math-
ematics and Theoretical Computer Science.
American Mathematical Society Press, Prov-
idence, RI, 1999.
[28] J. S. Vitter and M. H. Nodine. Large-scale
sorting in uniform memory hierarchies. Jour-
nal of Parallel and Distributed Computing, 17(1–
2):107–114, January and February 1993.
[29] J. S. Vitter and E. A. M. Shriver. Algorithms
for parallel memory I: Two-level memories.
Algorithmica, 12(2/3):110–147, August and
September 1994.
[30] J. S. Vitter and E. A. M. Shriver. Algorithms
for parallel memory II: Hierarchical multi-
level memories. Algorithmica, 12(2/3):148–
169, August and September 1994.
[31] S. Winograd. On the algebraic complexity of
functions. Actes du Congrès International des
Mathématiciens, 3:283–288, 1970.