Frigo CacheOblivious FOCS99

This paper presents several cache-oblivious algorithms that are asymptotically as efficient as cache-aware algorithms for problems like matrix transpose, FFT, sorting, and matrix multiplication. The paper introduces an "ideal-cache" model to analyze cache complexity and proves that optimal cache-oblivious algorithms designed for two memory levels are also optimal for multiple levels. These cache-oblivious algorithms achieve the same asymptotic optimality as previous cache-aware algorithms but without any tuning of hardware-dependent parameters.


Cache-Oblivious Algorithms

EXTENDED ABSTRACT SUBMITTED FOR PUBLICATION

Matteo Frigo   Charles E. Leiserson   Harald Prokop   Sridhar Ramachandran

MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139

Abstract

This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L, where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults.

We introduce an “ideal-cache” model to analyze our algorithms, and we prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels. We also prove that any optimal cache-oblivious algorithm is also optimal in the previously studied HMM and SUMH models. Algorithms developed for these earlier models are perforce cache-aware: their behavior varies as a function of hardware-dependent parameters which must be tuned to attain optimality. Our cache-oblivious algorithms achieve the same asymptotic optimality, but without any tuning.

[Figure 1: The ideal-cache model. A CPU performing W work accesses a cache of Z words, organized as Z/L lines of length L and incurring Q cache misses; the cache is backed by an arbitrarily large main memory and is organized by an optimal replacement strategy.]

1 Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several “cache-oblivious” algorithms that use cache as effectively as “cache-aware” algorithms.

Before discussing the notion of cache obliviousness, we first introduce the (Z, L) ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines, each consisting of L consecutive words that are always moved together between cache and main memory. Cache designers typically use L > 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

    Z = Ω(L²) ,    (1)

which is usually true in practice.

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [18, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is farthest in the future [7], and thus it exploits temporal locality perfectly.

(Footnote: This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.)
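The behavior of the ideal cache can be made concrete with a small simulator. The following sketch (our illustration, not part of the paper) replays a trace of word addresses through a (Z, L) cache and applies the optimal off-line policy of evicting the line whose next access is farthest in the future:

```python
def ideal_cache_misses(trace, Z, L):
    """Count misses of a (Z, L) ideal cache on a trace of word addresses.

    The cache holds Z/L lines and evicts the line whose next access
    lies farthest in the future (the optimal off-line strategy).
    """
    capacity = Z // L
    lines = [addr // L for addr in trace]   # map word addresses to cache lines
    cached, misses = set(), 0
    for t, line in enumerate(lines):
        if line in cached:
            continue                        # cache hit
        misses += 1                         # cache miss: fetch the line
        if len(cached) >= capacity:
            # Evict the cached line whose next use lies farthest ahead.
            def next_use(l):
                return next((u for u in range(t + 1, len(lines))
                             if lines[u] == l), float('inf'))
            cached.remove(max(cached, key=next_use))
        cached.add(line)
    return misses
```

With Z = 4 and L = 2, for example, the trace 0, 1, 2, 3, 0, 1 touches only two distinct cache lines, both of which fit, so it incurs exactly two misses.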

An algorithm with an input of size n is measured in the ideal-cache model in terms of its work complexity W(n)—its conventional running time in a RAM model [4]—and its cache complexity Q(n; Z, L)—the number of cache misses it incurs as a function of the size Z and line length L of the ideal cache. When Z and L are clear from context, we denote the cache complexity as simply Q(n) to ease notation.

We define an algorithm to be cache aware if it contains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good performance has been obtained using cache-aware algorithms, but we shall exhibit several cache-oblivious algorithms that are asymptotically as efficient as their cache-aware counterparts.

To illustrate the notion of cache awareness, consider the problem of multiplying two n × n matrices A and B to produce their n × n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that n is “big,” i.e., n > L, in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [17, p. 45]. The idea is to view each matrix M as consisting of (n/s) × (n/s) submatrices M_ij (the blocks), each of which has size s × s, where s is a tuning parameter. The following algorithm implements this strategy:

BLOCK-MULT(A, B, C, n)
1  for i ← 1 to n/s
2    do for j ← 1 to n/s
3         do for k ← 1 to n/s
4              do ORD-MULT(A_ik, B_kj, C_ij, s)

where ORD-MULT(A, B, C, s) is a subroutine that computes C ← C + AB on s × s matrices using the ordinary O(s³) algorithm. (This algorithm assumes for simplicity that s evenly divides n, but in practice s and n need have no special relationship, which yields more complicated code in the same spirit.)

Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus BLOCK-MULT is a cache-aware algorithm. To minimize the cache complexity, we choose s so that the three s × s submatrices simultaneously fit in cache. An s × s submatrix is stored on Θ(s + s²/L) cache lines. From the tall-cache assumption (1), we can see that s = Θ(√Z). Thus, each of the (n/s)³ calls to ORD-MULT runs with at most Z/L = Θ(s²/L) cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ(1 + n²/L + (n/√Z)³(Z/L)) = Θ(1 + n²/L + n³/(L√Z)), since the algorithm has to read n² elements, which reside on ⌈n²/L⌉ cache lines.

The same bound can be achieved using a simple cache-oblivious algorithm that requires no tuning parameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also succumb to remarkably simple algorithms, which are described in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are optimal.

The ideal-cache model makes the perhaps-questionable assumption that memory is managed automatically by an optimal cache-replacement strategy. Although the current trend in architecture does favor automatic caching over programmer-specified data movement, Section 6 addresses this concern theoretically. We show that the assumptions of two hierarchical memory models in the literature, in which memory movement is programmed explicitly, are actually no weaker than ours. Specifically, we prove (with only minor assumptions) that optimal cache-oblivious algorithms in the ideal-cache model are also optimal in the hierarchical memory model (HMM) [1] and in the serial uniform memory hierarchy (SUMH) model [5, 28]. Section 7 discusses related work, and Section 8 offers some concluding remarks.

2 Matrix multiplication

This section describes an algorithm for multiplying an m × n by an n × p matrix cache-obliviously using Θ(mnp) work and incurring Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache misses. These results require the tall-cache assumption (1) for matrices stored in a row-major layout, but the assumption can be relaxed for certain other layouts. We also discuss Strassen’s algorithm [25] for multiplying n × n matrices, which uses Θ(n^lg 7) work¹ and incurs Θ(1 + n²/L + n^lg 7/(L√Z)) cache misses.

¹We use the notation lg to denote log₂.
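BLOCK-MULT and ORD-MULT can be sketched directly. The rendering below is illustrative: it assumes s divides n and uses Python lists of lists in place of row-major arrays, so it exhibits the loop structure but not the actual memory layout:

```python
def ord_mult(A, B, C, i0, j0, k0, s):
    """Ordinary O(s^3) kernel: add the product of the s x s blocks of A and B
    anchored at (i0, k0) and (k0, j0) into the block of C anchored at (i0, j0)."""
    for i in range(i0, i0 + s):
        for j in range(j0, j0 + s):
            C[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, k0 + s))

def block_mult(A, B, C, n, s):
    """Blocked multiplication: every (i, j, k) block triple is handled by ORD-MULT."""
    assert n % s == 0, "for simplicity, s must evenly divide n"
    for i in range(0, n, s):
        for j in range(0, n, s):
            for k in range(0, n, s):
                ord_mult(A, B, C, i, j, k, s)
```

Choosing s near √Z, as in the analysis above, makes the three blocks touched by each ORD-MULT call fit in cache simultaneously.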
[Figure 2: Layout of a 16 × 16 matrix in (a) row-major, (b) column-major, (c) 4 × 4-blocked, and (d) bit-interleaved layouts.]

In [8], two of the authors analyzed an optimal divide-and-conquer algorithm for n × n matrix multiplication that contained no tuning parameters, but we did not study cache-obliviousness per se. That algorithm can be extended to multiply rectangular matrices. To multiply an m × n matrix A and an n × p matrix B, the algorithm halves the largest of the three dimensions and recurs according to one of the following three cases (writing [X; Y] for a matrix split horizontally and [X Y] for a matrix split vertically):

    (a) AB = [A1; A2] B = [A1 B; A2 B] ,
    (b) AB = [A1 A2] [B1; B2] = A1 B1 + A2 B2 ,
    (c) AB = A [B1 B2] = [A B1  A B2] .

In case (a), we have m ≥ max{n, p}. Matrix A is split horizontally, and both halves are multiplied by matrix B. In case (b), we have n ≥ max{m, p}. Both matrices are split, and the two halves are multiplied. In case (c), we have p ≥ max{m, n}. Matrix B is split vertically, and each half is multiplied by A. For square matrices, these three cases together are equivalent to the recursive multiplication algorithm described in [8]. The base case occurs when m = n = p = 1, in which case the two elements are multiplied and added into the result matrix.

It can be shown by induction that the work of this algorithm is O(mnp). Although this straightforward divide-and-conquer algorithm contains no tuning parameters, it uses cache optimally. To analyze the algorithm, we assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that no row of any of the matrices fits in a single cache line, that is, min{m, n, p} ≥ L. [The final version of this paper will contain the analysis for the general case.]

The following recurrence describes the cache complexity:

    Q(m, n, p) ≤
        O((mn + np + mp)/L)      if (mn + np + mp) ≤ αZ ,
        2 Q(m/2, n, p) + O(1)    if m ≥ n and m ≥ p ,
        2 Q(m, n/2, p) + O(1)    if n > m and n ≥ p ,
        2 Q(m, n, p/2) + O(1)    otherwise ,                 (2)

where α is a constant chosen sufficiently small to allow the three submatrices (and whatever small number of temporary variables there may be) to fit in the cache. The base case arises as soon as all three matrices fit in cache. Using reasoning similar to that for analyzing ORD-MULT within BLOCK-MULT, the matrices are held on Θ((mn + np + mp)/L) cache lines, assuming a tall cache. Thus, the only cache misses that occur during the remainder of the recursion are the Θ((mn + np + mp)/L) cache misses that occur when the matrices are brought into the cache. The recursive case arises when the matrices do not fit in cache, in which case we pay for the cache misses of the recursive calls, which depend on the dimensions of the matrices, plus O(1) cache misses for the overhead of manipulating submatrices. The solution to this recurrence is Q(m, n, p) = O(1 + (mn + np + mp)/L + mnp/(L√Z)), which is the same as the cache complexity of the cache-aware BLOCK-MULT algorithm for square matrices. Intuitively, the cache-oblivious divide-and-conquer algorithm uses cache effectively, because once a subproblem fits into the cache, no more cache misses occur for smaller subproblems.

We require the tall-cache assumption (1) in this analysis because the matrices are stored in row-major order. Tall caches are also needed if matrices are stored in column-major order (Figure 2(b)), but the assumption that Z = Ω(L²) can be relaxed for certain other matrix layouts. The s × s-blocked layout (Figure 2(c)), for some tuning parameter s, can be used to achieve the same bounds with the weaker assumption that the cache holds at least some sufficiently large constant number of lines. The cache-oblivious bit-interleaved layout (Figure 2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size Θ(√L) × Θ(√L) are cache-obliviously stored on one cache line.
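The three cases of the divide-and-conquer multiplication can be written down almost verbatim. The sketch below is our illustration: it copies the halves of the operands at each level instead of passing submatrix descriptors, which changes the constant factors but not the recursion:

```python
def rec_mult(A, B):
    """Multiply A (m x n) by B (n x p) by halving the largest dimension."""
    m, n, p = len(A), len(B), len(B[0])
    if m == n == p == 1:
        return [[A[0][0] * B[0][0]]]        # base case: scalar product
    if m >= n and m >= p:                   # case (a): split A horizontally
        h = m // 2
        return rec_mult(A[:h], B) + rec_mult(A[h:], B)
    if n >= m and n >= p:                   # case (b): split the shared dimension
        h = n // 2
        C1 = rec_mult([row[:h] for row in A], B[:h])
        C2 = rec_mult([row[h:] for row in A], B[h:])
        return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]
    h = p // 2                              # case (c): split B vertically
    C1 = rec_mult(A, [row[:h] for row in B])
    C2 = rec_mult(A, [row[h:] for row in B])
    return [r1 + r2 for r1, r2 in zip(C1, C2)]
```

Note that the dimensions need not be powers of 2; an odd dimension simply splits into unequal halves.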
The advantages of bit-interleaved and related layouts have been studied in [14] and [9, 10]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly.

For square matrices, the cache complexity Q(n) = Θ(1 + n²/L + n³/(L√Z)) of the cache-oblivious matrix multiplication algorithm matches the lower bound by Hong and Kung [19]. This lower bound holds for all algorithms that execute the Θ(n³) operations given by the definition of matrix multiplication

    c_ij = Σ_{k=1}^{n} a_ik b_kj .

No tight lower bounds for the general problem of matrix multiplication are known. By using an asymptotically faster algorithm, such as Strassen’s algorithm [25] or one of its variants [31], both the work and cache complexity can be reduced. Indeed, Strassen’s algorithm, which is cache oblivious, can be shown to have cache complexity O(1 + n²/L + n^lg 7/(L√Z)).

3 Matrix transposition and FFT

This section describes a cache-oblivious algorithm for transposing an m × n matrix that uses O(mn) work and incurs O(1 + mn/L) cache misses, which is optimal. Using matrix transposition as a subroutine, we convert a variant [30] of the “six-step” fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses.

The problem of matrix transposition is defined as follows. Given an m × n matrix A stored in a row-major layout, compute and store Aᵀ into an n × m matrix B also stored in a row-major layout. The straightforward algorithm for transposition that employs doubly nested loops incurs Θ(mn) cache misses on one of the matrices when mn ≥ Z, which is suboptimal.

Optimal work and cache complexities can be obtained with a divide-and-conquer strategy, however. If n ≥ m, we partition

    A = [A1 A2] ,    B = [B1; B2] .

Then, we recursively execute TRANSPOSE(A1, B1) and TRANSPOSE(A2, B2). If m > n, we divide matrix A horizontally and matrix B vertically and likewise perform two transpositions recursively. The next two lemmas provide upper and lower bounds on the performance of this algorithm.

Lemma 1  The cache-oblivious matrix-transpose algorithm involves O(mn) work and incurs O(1 + mn/L) cache misses for an m × n matrix.

Proof. See Appendix A.

Theorem 2  The cache-oblivious matrix-transpose algorithm is asymptotically optimal.

Proof. For an m × n matrix, the matrix-transposition algorithm must write to mn distinct elements, which occupy at least ⌈mn/L⌉ = Ω(1 + mn/L) cache lines.

As an example of application of the cache-oblivious transposition algorithm, in the rest of this section we describe and analyze a cache-oblivious algorithm for computing the discrete Fourier transform of a complex array of n elements, where n is an exact power of 2. The basic algorithm is the well-known “six-step” variant [6, 30] of the Cooley-Tukey FFT algorithm [11]. Using the cache-oblivious transposition algorithm, however, the FFT becomes cache-oblivious, and its performance matches the lower bound by Hong and Kung [19].

Recall that the discrete Fourier transform (DFT) of an array X of n complex numbers is the array Y given by

    Y[i] = Σ_{j=0}^{n−1} X[j] ω_n^{−ij} ,    (3)

where ω_n = e^{2π√−1/n} is a primitive nth root of unity, and 0 ≤ i < n.

Many known algorithms evaluate Equation (3) in time O(n lg n) for all integers n [13]. In this paper, however, we assume that n is an exact power of 2, and compute Equation (3) according to the Cooley-Tukey algorithm, which works recursively as follows. In the base case where n = O(1), we compute Equation (3) directly. Otherwise, for any factorization n = n1 n2 of n, we have

    Y[i1 + i2 n1] = Σ_{j2=0}^{n2−1} [ ( Σ_{j1=0}^{n1−1} X[j1 n2 + j2] ω_{n1}^{−i1 j1} ) ω_n^{−i1 j2} ] ω_{n2}^{−i2 j2} .    (4)

Observe that both the inner and the outer summation in Equation (4) are DFTs. Operationally, the computation specified by Equation (4) can be performed by computing n2 transforms of size n1 (the inner sum), multiplying the result by the factors ω_n^{−i1 j2} (called the twiddle factors [13]), and finally computing n1 transforms of size n2 (the outer sum).
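Equation (4) can be checked numerically against the definition in Equation (3). The sketch below (an illustration added here, not the paper's six-step code) evaluates both formulas directly; dft_factored follows Equation (4) term by term for any factorization n = n1·n2:

```python
import cmath

def dft(X):
    """Direct evaluation of Equation (3): Y[i] = sum_j X[j] * w_n^(-ij)."""
    n = len(X)
    return [sum(X[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]

def dft_factored(X, n1, n2):
    """Evaluate Equation (4) for the factorization n = n1 * n2."""
    n = n1 * n2
    Y = [0j] * n
    for i1 in range(n1):
        for i2 in range(n2):
            Y[i1 + i2 * n1] = sum(
                sum(X[j1 * n2 + j2] * cmath.exp(-2j * cmath.pi * i1 * j1 / n1)
                    for j1 in range(n1))                    # inner DFTs of size n1
                * cmath.exp(-2j * cmath.pi * i1 * j2 / n)   # twiddle factor
                * cmath.exp(-2j * cmath.pi * i2 * j2 / n2)  # outer DFT of size n2
                for j2 in range(n2))
    return Y
```

The two routines agree to floating-point accuracy for every factorization, e.g. n = 8 with n1 = 4, n2 = 2, because the exponent (i1 + i2·n1)(j1·n2 + j2)/n splits into i1·j1/n1 + i1·j2/n + i2·j2/n2 plus an integer.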
We choose n1 to be 2^⌈lg n/2⌉ and n2 to be 2^⌊lg n/2⌋. The recursive step then operates as follows.

1. Pretend that the input is a row-major n1 × n2 matrix A. Transpose A in place, i.e., use the cache-oblivious algorithm to transpose A onto an auxiliary array B, and copy B back onto A. Notice that if n1 = 2n2, we can consider the matrix to be made up of records containing two elements.

2. At this stage, the inner sum corresponds to a DFT of the n2 rows of the transposed matrix. Compute these n2 DFTs of size n1 recursively. Observe that, because of the previous transposition, we are transforming a contiguous array of elements.

3. Multiply A by the twiddle factors, which can be computed on the fly with no extra cache misses.

4. Transpose A in place, so that the inputs to the next stage are arranged in contiguous locations.

5. Compute n1 DFTs of the rows of the matrix, recursively.

6. Transpose A in place, so as to produce the correct output order.

It can be proven by induction that the work complexity of this FFT algorithm is O(n lg n). We now analyze its cache complexity. The algorithm always operates on contiguous data, by construction. In order to simplify the analysis of the cache complexity, assume a tall cache, in which case each transposition operation and the multiplication by the twiddle factors require at most O(1 + n/L) cache misses. Thus, the cache complexity satisfies the recurrence

    Q(n) ≤
        O(1 + n/L)                              if n ≤ αZ ,
        n1 Q(n2) + n2 Q(n1) + O(1 + n/L)        otherwise ,    (5)

for a sufficiently small constant α chosen such that a subproblem of size αZ fits in cache. This recurrence has solution

    Q(n) = O(1 + (n/L)(1 + log_Z n)) ,

which is asymptotically optimal for a Cooley-Tukey algorithm, matching the lower bound by Hong and Kung [19] when n is an exact power of 2. As with matrix multiplication, no tight lower bounds for cache complexity are known for the general problem of computing the DFT.

[Figure 3: Illustration of a k-merger. A k-merger is built recursively out of √k “left” √k-mergers L1, L2, …, L_√k, a series of buffers, and one “right” √k-merger R.]

4 Funnelsort

Although it is cache oblivious, an algorithm like the familiar two-way merge sort is not asymptotically optimal with respect to cache misses. The Z-way mergesort mentioned by Aggarwal and Vitter [3] is optimal in terms of cache complexity, but it is cache aware. This section describes a cache-oblivious sorting algorithm called “funnelsort.” This algorithm has an asymptotically optimal work complexity O(n lg n), and an optimal cache complexity O(1 + (n/L)(1 + log_Z n)) if the cache is tall.

Funnelsort is similar to mergesort. In order to sort a (contiguous) array of n elements, funnelsort performs the following two steps:

1. Split the input into n^{1/3} contiguous arrays of size n^{2/3}, and sort these arrays recursively.

2. Merge the n^{1/3} sorted sequences using an n^{1/3}-merger, which is described below.

Funnelsort differs from mergesort in the way the merge operation works. Merging is performed by a device called a k-merger, which inputs k sorted sequences and merges them. A k-merger operates by recursively merging sorted sequences that become progressively longer as the algorithm proceeds. Unlike mergesort, however, a k-merger stops working on a merging subproblem when the merged output sequence becomes “long enough,” and it resumes working on another merging subproblem.
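The recursive shape of funnelsort (split into roughly n^{1/3} runs of size roughly n^{2/3}, sort them recursively, then merge the runs) can be sketched as follows. In this illustration Python's heapq.merge stands in for the k-merger, so the sketch reproduces the recursion but not the lazy, buffered merging on which the cache analysis depends:

```python
import heapq

def funnelsort(a):
    """Sort by splitting into ~n^(1/3) runs of size ~n^(2/3), then k-way merging."""
    n = len(a)
    if n <= 8:
        return sorted(a)                  # small base case
    k = max(2, round(n ** (1 / 3)))       # number of runs, ~n^(1/3)
    size = -(-n // k)                     # ceil(n / k), run length ~n^(2/3)
    runs = [funnelsort(a[i:i + size]) for i in range(0, n, size)]
    return list(heapq.merge(*runs))       # stand-in for the k-merger
```

The recursion depth and run counts match funnelsort; only the merging device differs.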
Since this complicated flow of control makes a k-merger a bit tricky to describe, we explain the operation of the k-merger pictorially. Figure 3 shows a representation of a k-merger, which has k sorted sequences as inputs. Throughout its execution, the k-merger maintains the following invariant.

Invariant  The invocation of a k-merger outputs the first k³ elements of the sorted sequence obtained by merging the k input sequences.

A k-merger is built recursively out of √k-mergers in the following way. The k inputs are partitioned into √k sets of √k elements, and these sets form the input to the √k √k-mergers L1, L2, …, L_√k in the left part of the figure. The outputs of these mergers are connected to the inputs of √k buffers. Each buffer is a FIFO queue that can hold 2k^{3/2} elements. Finally, the outputs of the buffers are connected to the √k inputs of the √k-merger R in the right part of the figure. The output of this final √k-merger becomes the output of the whole k-merger. The reader should notice that the intermediate buffers are overdimensioned. In fact, each buffer can hold 2k^{3/2} elements, which is twice the number k^{3/2} of elements output by a √k-merger. This additional buffer space is necessary for the correct behavior of the algorithm, as will be explained below. The base case of the recursion is a k-merger with k = 2, which produces k³ = 8 elements whenever invoked.

A k-merger operates recursively in the following way. In order to output k³ elements, the k-merger invokes R k^{3/2} times. Before each invocation, however, the k-merger fills all buffers that are less than half full, i.e., all buffers that contain less than k^{3/2} elements. In order to fill buffer i, the algorithm invokes the corresponding left merger Li once. Since Li outputs k^{3/2} elements, the buffer contains at least k^{3/2} elements after Li finishes.

It can be proven by induction that the work complexity of funnelsort is O(n lg n). The next theorem gives the cache complexity of funnelsort.

Theorem 3  Funnelsort sorts n elements incurring at most Q(n) cache misses, where

    Q(n) = O(1 + (n/L)(1 + log_Z n)) .

Proof. See Appendix B.

This upper bound matches the lower bound stated by the next theorem, proving that funnelsort is cache-optimal.

Theorem 4  The cache complexity of any sorting algorithm is Q(n) = Ω(1 + (n/L)(1 + log_Z n)).

Proof. Aggarwal and Vitter [3] show that there is an Ω((n/L) log_{Z/L}(n/Z)) bound on the number of cache misses made by any sorting algorithm on their “out-of-core” memory model, a bound that extends to the ideal-cache model. The theorem can be proved by applying the tall-cache assumption Z = Ω(L²) and the trivial lower bounds of Q(n) = Ω(1) and Q(n) = Ω(n/L).

5 Distribution sort

In this section, we describe another cache-oblivious optimal sorting algorithm based on distribution sort. Like the funnelsort algorithm from Section 4, the distribution-sorting algorithm uses O(n lg n) work to sort n elements, and it incurs O(1 + (n/L)(1 + log_Z n)) cache misses if the cache is tall. Unlike previous cache-efficient distribution-sorting algorithms [1, 3, 21, 28, 30], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a “bucket splitting” technique to select pivots incrementally during the distribution.

Given an array A (stored in contiguous locations) of length n, the cache-oblivious distribution sort sorts A as follows:

1. Partition A into √n contiguous subarrays of size √n. Recursively sort each subarray.

2. Distribute the sorted subarrays into q buckets B1, …, Bq of size n1, …, nq, respectively, such that
   (a) max{x | x ∈ Bi} ≤ min{x | x ∈ Bi+1} for all 1 ≤ i < q;
   (b) ni ≤ 2√n for all 1 ≤ i ≤ q.
   (See below for details.)

3. Recursively sort each bucket.

4. Copy the sorted buckets back to array A.

A stack-based memory allocator is used to exploit spatial locality.
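Steps 1–4 can be sketched as follows. This is a simplified illustration: elements are appended to the first bucket whose pivot admits them by a plain linear scan (in place of the DISTRIBUTE recursion described below), and an overfull bucket is split at its median by sorting it, standing in for the deterministic median-finding and partition:

```python
def distribution_sort(A):
    """Sketch of the cache-oblivious distribution sort (Steps 1-4)."""
    n = len(A)
    if n <= 16:
        return sorted(A)
    s = max(1, int(n ** 0.5))             # subarray size ~ sqrt(n)
    subarrays = [distribution_sort(A[i:i + s])
                 for i in range(0, n, s)]                     # Step 1
    cap = 2 * s                           # bucket capacity 2*sqrt(n)
    buckets, pivots = [[]], [float('inf')]
    for sub in subarrays:                 # Step 2: distribute
        b = 0
        for x in sub:                     # sub is sorted, so b only advances
            while x > pivots[b]:
                b += 1
            buckets[b].append(x)
            if len(buckets[b]) > cap:     # split an overfull bucket at its median
                full = sorted(buckets[b])
                mid = len(full) // 2
                buckets[b:b + 1] = [full[:mid], full[mid:]]
                pivots[b:b + 1] = [full[mid - 1], pivots[b]]
    out = []
    for bucket in buckets:                # Steps 3-4: sort buckets, concatenate
        out.extend(distribution_sort(bucket))
    return out
```

Because every bucket ends up holding at most 2√n elements in a range disjoint from its successor's, concatenating the recursively sorted buckets yields the sorted array.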
Distribution step  The goal of Step 2 is to distribute the sorted subarrays of A into q buckets B1, B2, …, Bq. The algorithm maintains two invariants. First, at any time each bucket holds at most 2√n elements, and any element in bucket Bi is smaller than any element in bucket Bi+1. Second, every bucket has an associated pivot. Initially, only one empty bucket exists, with pivot ∞.

The idea is to copy all elements from the subarrays into the buckets while maintaining the invariants. We keep state information for each subarray and bucket. The state of a subarray consists of the index next of the next element to be read from the subarray and the bucket number bnum where this element should be copied. By convention, bnum = ∞ if all elements in a subarray have been copied. The state of a bucket consists of the pivot and the number of elements currently in the bucket.

We would like to copy the element at position next of a subarray to bucket bnum. If this element is greater than the pivot of bucket bnum, we would increment bnum until we find a bucket for which the element is smaller than the pivot. Unfortunately, this basic strategy has poor caching behavior, which calls for a more complicated procedure.

The distribution step is accomplished by the recursive procedure DISTRIBUTE(i, j, m), which distributes elements from the ith through (i + m − 1)th subarrays into buckets starting from Bj. Given the precondition that each subarray i, i + 1, …, i + m − 1 has its bnum ≥ j, the execution of DISTRIBUTE(i, j, m) enforces the postcondition that subarrays i, i + 1, …, i + m − 1 have their bnum ≥ j + m. Step 2 of the distribution sort invokes DISTRIBUTE(1, 1, √n). The following is a recursive implementation of DISTRIBUTE:

DISTRIBUTE(i, j, m)
1  if m = 1
2    then COPYELEMS(i, j)
3    else DISTRIBUTE(i, j, m/2)
4         DISTRIBUTE(i + m/2, j, m/2)
5         DISTRIBUTE(i, j + m/2, m/2)
6         DISTRIBUTE(i + m/2, j + m/2, m/2)

In the base case, the procedure COPYELEMS(i, j) copies all elements from subarray i that belong to bucket j. If bucket j has more than 2√n elements after the insertion, it can be split into two buckets of size at least √n. For the splitting operation, we use the deterministic median-finding algorithm [12, p. 189] followed by a partition. The median-finding algorithm uses O(m) work and incurs O(1 + m/L) cache misses to find the median of an array of size m. (In our case, we have m = 2√n + 1.) In addition, when a bucket splits, all subarrays whose bnum is greater than the bnum of the split bucket must have their bnum’s incremented. The analysis of DISTRIBUTE is given by the following lemma.

Lemma 5  Step 2 uses O(n) work, incurs O(1 + n/L) cache misses, and uses O(n) stack space to distribute n elements.

Proof. See Appendix C.

Theorem 6  Distribution sort uses O(n lg n) work and incurs O(1 + (n/L)(1 + log_Z n)) cache misses to sort n elements.

Proof. The work done by the algorithm is given by

    W(n) = √n W(√n) + Σ_{i=1}^{q} W(n_i) + O(n) ,

where each n_i ≤ 2√n and Σ n_i = n. The solution to this recurrence is W(n) = O(n lg n).

The space complexity of the algorithm is given by

    S(n) ≤ S(2√n) + O(n) ,

where the O(n) term comes from Step 2. The solution to this recurrence is S(n) = O(n).

The cache complexity of distribution sort is described by the recurrence

    Q(n) ≤
        O(1 + n/L)                                       if n ≤ αZ ,
        √n Q(√n) + Σ_{i=1}^{q} Q(n_i) + O(1 + n/L)       otherwise ,

where α is a sufficiently small constant such that the stack space used by a sorting problem of size αZ, including the input array, fits completely in cache. The base case n ≤ αZ arises when both the input array A and the contiguous stack space of size S(n) = O(n) fit in O(1 + n/L) cache lines of the cache. In this case, the algorithm incurs O(1 + n/L) cache misses to touch all involved memory locations once. In the case where n > αZ, the recursive calls in Steps 1 and 3 cause √n Q(√n) + Σ_{i=1}^{q} Q(n_i) cache misses, and O(1 + n/L) is the cache complexity of Steps 2 and 4, as shown by Lemma 5. The theorem now follows by solving the recurrence.
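The recursion in DISTRIBUTE visits the (subarray, bucket) pairs of the √n × √n grid in a Z-order-like traversal, which keeps its accesses local. The sketch below makes that order explicit; the visit callback stands in for COPYELEMS, m is assumed to be a power of 2, and the bnum bookkeeping of the real procedure is omitted:

```python
def distribute(i, j, m, visit):
    """Recursion pattern of DISTRIBUTE: quarter the (subarray, bucket) grid."""
    if m == 1:
        visit(i, j)                       # stand-in for COPYELEMS(i, j)
    else:
        h = m // 2
        distribute(i,     j,     h, visit)
        distribute(i + h, j,     h, visit)
        distribute(i,     j + h, h, visit)
        distribute(i + h, j + h, h, visit)
```

For m = 4 starting at (1, 1), the traversal first exhausts the 2 × 2 sub-grid (1,1), (2,1), (1,2), (2,2) before touching any other subarray or bucket, which is the locality the analysis of Lemma 5 exploits.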
6 Other cache models

In this section we show that cache-oblivious algorithms designed in the two-level ideal-cache model can be efficiently ported to other cache models. We show that algorithms whose complexity bounds satisfy a simple regularity condition (including all algorithms heretofore presented) can be ported to less-ideal caches incorporating least-recently-used (LRU) or first-in, first-out (FIFO) replacement policies [18, p. 378]. We argue that optimal cache-oblivious algorithms are also optimal for multilevel caches. Finally, we present simulation results proving that optimal cache-oblivious algorithms satisfying the regularity condition are also optimal (in expectation) in the previously studied SUMH [5, 28] and HMM [1] models. Thus, all the algorithmic results in this paper apply to these models, matching the best bounds previously achieved.

6.1 Two-level models

Many researchers, such as [3, 19, 29], employ two-level models similar to the ideal-cache model, but without an automatic replacement strategy. In these models, data must be moved explicitly between the primary and secondary levels "by hand." We define a cache complexity bound Q(n; Z, L) to be regular if

Q(n; Z, L) = O(Q(n; 2Z, L)) .   (6)

We now show that optimal algorithms in the ideal-cache model whose cache complexity bounds are regular can be ported to these models to run using optimal work and incurring an optimal expected number of cache misses.

The first lemma shows that the optimal and omniscient replacement strategy used by an ideal cache can be simulated efficiently by the LRU and FIFO replacement strategies.

Lemma 7 Consider an algorithm that causes Q*(n; Z, L) cache misses on a problem of size n using a (Z, L) ideal cache. Then, the same algorithm incurs Q(n; Z, L) ≤ 2Q*(n; Z/2, L) cache misses on a (Z, L) cache that uses either LRU or FIFO replacement.

Proof. Sleator and Tarjan [24] have shown that the number of cache misses on a (Z, L) cache using LRU replacement is (Z/(Z − Z* + 1))-competitive with optimal replacement on a (Z*, L) ideal cache if both caches start with an empty cache. It follows that the number of misses on a (Z, L) LRU-cache is at most twice the number of misses on a (Z/2, L) ideal-cache. The same argument holds for FIFO caches.

Corollary 8 For algorithms with regular cache complexity bounds, the asymptotic number of cache misses is the same for LRU, FIFO, and optimal replacement.

Since previous two-level models do not support automatic replacement, to port a cache-oblivious algorithm to them, we implement an LRU (or FIFO) replacement strategy in software.

Lemma 9 A (Z, L) LRU-cache (or FIFO-cache) can be maintained using O(Z) primary memory locations such that every access to a cache line in primary memory takes O(1) expected time.

Proof. Given the address of the memory location to be accessed, we use a 2-universal hash function [20, p. 216] to maintain a hash table of cache lines present in the primary memory. The Z/L entries in the hash table point to linked lists in a heap of memory containing Z/L records corresponding to the cache lines. The 2-universal hash function guarantees that the expected size of a chain is O(1). All records in the heap are organized as a doubly linked list in the LRU order (or singly linked for FIFO). Thus, the LRU (FIFO) replacement policy can be implemented in O(1) expected time using O(Z/L) records of O(L) words each.

Theorem 10 An optimal cache-oblivious algorithm with a regular cache-complexity bound can be implemented optimally in expectation in two-level models with explicit memory management.

Consequently, our cache-oblivious algorithms for matrix multiplication, matrix transpose, FFT, and sorting are optimal in two-level models.

6.2 Multilevel ideal caches

We now show that optimal cache-oblivious algorithms also perform optimally in computers with multiple levels of ideal caches. Moreover, Theorem 10 extends to multilevel models with explicit memory management.

The ⟨(Z1, L1), (Z2, L2), ..., (Zr, Lr)⟩ ideal-cache model consists of an arbitrarily large main memory and a hierarchy of r caches, each of which is managed by an optimal replacement strategy. The model assumes that the caches satisfy the inclusion property [18, p. 723], which says that for i = 1, 2, ..., r − 1, the values stored in cache i are also
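The behavior bounded by Lemma 7 can be observed with a small line-granularity LRU simulator. This sketch is purely illustrative (the function name, trace format, and word-addressed interface are our assumptions, not part of the paper):

```python
from collections import OrderedDict

def lru_misses(trace, Z, L):
    """Count the misses of a (Z, L) LRU cache on a sequence of word
    addresses.  The cache holds Z/L lines; address a lives on line a // L."""
    capacity = Z // L
    cache = OrderedDict()              # line number -> None, in LRU order
    misses = 0
    for addr in trace:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)    # mark most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses
```

For example, with Z = 4 and L = 2 the cache holds two lines, so the trace 0, 2, 4, 0 misses on every access: the third access evicts line 0, which the fourth access then reloads.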
stored in cache i + 1. The performance of an algorithm running on an input of size n is measured by its work complexity W(n) and its cache complexities Qi(n; Zi, Li) for each level i = 1, 2, ..., r.

Theorem 11 An optimal cache-oblivious algorithm in the ideal-cache model incurs an asymptotically optimal number of cache misses on each level of a multilevel cache with optimal replacement.

Proof. The theorem follows directly from the definition of cache obliviousness and the optimality of the algorithm in the two-level ideal-cache model.

Theorem 12 An optimal cache-oblivious algorithm with a regular cache-complexity bound incurs an asymptotically optimal number of cache misses on each level of a multilevel cache with LRU, FIFO, or optimal replacement.

Proof. Follows from Corollary 8 and Theorem 11.

6.3 The SUMH model

In 1990 Alpern et al. [5] presented the uniform memory hierarchy model (UMH), a parameterized model for a memory hierarchy. In the UMH_{α,ρ,b(l)} model, for integer constants α, ρ ≥ 1, the size of the ith memory level is Zi = αρ^{2i} and the line length is Li = ρ^i. A transfer of one ρ^l-length line between the caches on level l and l + 1 takes ρ^l/b(l) time. The bandwidth function b(l) must be nonincreasing, and the processor accesses the cache on level 1 in constant time per access. An algorithm given for the UMH model must include a schedule that, for a particular set of input variables, tells exactly when each block is moved along which of the buses between caches. Work and cache misses are folded into one cost measure T(n). Alpern et al. prove that an algorithm that performs the optimal number of I/O's at all levels of the hierarchy does not necessarily run in optimal time in the UMH model, since scheduling bottlenecks can occur when all buses are active. In the more restrictive SUMH model [28], however, only one bus is active at a time. Consequently, we can prove that optimal cache-oblivious algorithms run in optimal expected time in the SUMH model.

Lemma 13 A cache-oblivious algorithm with W(n) work and Q(n; Z, L) cache misses on a (Z, L)-ideal cache can be executed in the SUMH_{α,ρ,b(l)} model in expected time

T(n) = O( W(n) + ∑_{i=1}^{r−1} (ρ^i / b(i)) Q(n; Θ(Zi), Li) ) ,

where Zi = αρ^{2i}, Li = ρ^i, and Zr is big enough to hold all elements used during the execution of the algorithm.

Proof. Use the memory at the ith level as a cache of size Zi = αρ^{2i} with line length Li = ρ^i, and manage it with the software LRU described in Lemma 9. The rth level is the main memory, which is direct mapped and not organized by the software LRU mechanism. An LRU-cache of size Θ(Zi) can be simulated by the ith level, since it has size Zi. Thus, the number of cache misses at level i is 2Q(n; Θ(Zi), Li), and each takes ρ^i/b(i) time. Since only one memory movement happens at any point in time, and there are O(W(n)) accesses to level 1, the lemma follows by summing the individual costs.

Lemma 14 Consider a cache-oblivious algorithm whose work on a problem of size n is lower-bounded by W*(n) and whose cache complexity is lower-bounded by Q*(n; Z, L) on a (Z, L) ideal-cache. Then, no matter how data movement is implemented in SUMH_{α,ρ,b(l)}, the time taken on a problem of size n is at least

T(n) = Ω( W*(n) + ∑_{i=1}^{r−1} (ρ^i / b(i)) Q*(n, Θ(Zi), Li) ) ,

where Zi = αρ^{2i}, Li = ρ^i, and Zr is big enough to hold all elements used during the execution of the algorithm.

Proof. The optimal scheduling of the data movements does not need to obey the inclusion property, and thus the number of ith-level cache misses is at least as large as for an ideal cache of size ∑_{j=1}^{i} Zj = O(Zi). Since Q*(n, Z, L) lower-bounds the cache misses on a cache of size Z, at least Q*(n, Θ(Zi), Li) data movements occur at level i, each of which takes ρ^i/b(i) time. Since only one movement can occur at a time, the total cost is the maximum of the work and the sum of the costs at all the levels, which is within a factor of 2 of their sum.

Theorem 15 A cache-oblivious algorithm that is optimal in the ideal-cache model and whose cache complexity is regular can be executed in optimal expected time in the SUMH_{α,ρ,b(l)} model.

Proof. The theorem follows directly from regularity and Lemmas 13 and 14.
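The cost expression of Lemma 13 is straightforward to evaluate numerically. The following sketch plugs in the level parameters Zi = αρ^{2i} and Li = ρ^i; the function name, the caller-supplied W and Q, and the convention of taking all O()-hidden constants as 1 are our assumptions for illustration:

```python
def sumh_time(W, Q, r, alpha=1, rho=2, b=lambda l: 1):
    """Evaluate the bound of Lemma 13:
    T ~ W + sum_{i=1}^{r-1} (rho**i / b(i)) * Q(Z_i, L_i),
    with level sizes Z_i = alpha * rho**(2*i) and line lengths L_i = rho**i.
    W is the work; Q(Z, L) gives the cache misses at a (Z, L) level."""
    total = W
    for i in range(1, r):
        Z_i = alpha * rho ** (2 * i)
        L_i = rho ** i
        total += (rho ** i / b(i)) * Q(Z_i, L_i)
    return total
```

With a constant bandwidth function, each level i contributes its miss count weighted by the transfer time ρ^i of one line at that level.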
6.4 The HMM model

Aggarwal, Alpern, Chandra and Snir [1] proposed the hierarchical memory model (HMM) in which an access to location x takes f(x) time. The authors assume that f is a monotonically nondecreasing function, usually of the form ⌈log x⌉ or ⌈x^α⌉. The final paper will show that optimal cache-oblivious algorithms run in optimal expected time in the HMM model.

7 Related work

In this section, we discuss the origin of the notion of cache-obliviousness. We also give an overview of other hierarchical memory models.

Our research group at MIT noticed as far back as 1994 that divide-and-conquer matrix multiplication was a cache-optimal algorithm that required no tuning, but we did not adopt the term "cache-oblivious" until 1997. This matrix-multiplication algorithm, as well as a cache-oblivious algorithm for LU-decomposition without pivoting, eventually appeared in [8]. Shortly after leaving our research group, Toledo [26] independently proposed a cache-oblivious algorithm for LU-decomposition, but with pivoting. For n × n matrices, Toledo's algorithm uses Θ(n³) work and incurs Θ(1 + n²/L + n³/(L√Z)) cache misses. More recently, our group has produced an FFT library called FFTW [16], which in its most recent incarnation [15] employs a register-allocation and scheduling algorithm inspired by our cache-oblivious FFT algorithm. The general idea that divide-and-conquer enhances memory locality has been known for a long time [23].

Previous theoretical work on understanding hierarchical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an automatic replacement strategy. Hong and Kung [19] use the red-blue pebble game to prove lower bounds on the I/O-complexity of matrix multiplication, FFT, and other problems. The red-blue pebble game models temporal locality using two levels of memory. The model was extended by Savage [22] for deeper memory hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated a two-level memory in which a block of P contiguous items can be transferred in one step. They obtained tight bounds for matrix multiplication, FFT, sorting, and other problems. The hierarchical memory model (HMM) by Aggarwal et al. [1] treats memory as a linear array, where the cost of an access to the element at location x is given by a cost function f(x). The BT model [2] extends HMM to support block transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows I/O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for matrix multiplication, FFT, sorting, and other problems in both a two-level model [29] and several parallel hierarchical memory models [30]. Vitter [27] provides a comprehensive survey of external-memory algorithms.

8 Conclusion

[All is well that ends]

Acknowledgments

Thanks to Bobby Blumofe, now of the University of Texas at Austin, who sparked early discussions at MIT about what we now call cache obliviousness. Thanks to David Wise of Indiana University, Sid Chatterjee of the University of North Carolina, Chris Joerg of the Compaq Cambridge Research Center, and Bin Song of MIT for helpful discussions.
Appendix

A Analysis of matrix transposition

Lemma 1 The cache-oblivious matrix-transpose algorithm involves O(mn) work and incurs O(1 + mn/L) cache misses for an m × n matrix.

Proof. It is clear that the algorithm does O(mn) work. For the cache analysis, let Q(m, n) be the cache complexity of transposing an m × n matrix. We assume that the matrices are stored in row-major order, the column-major case having a similar analysis.

Let ε be a constant sufficiently small such that two submatrices of size m × n and n × m, where max{m, n} ≤ εL, fit completely in the cache even if each row is stored in a different cache line. We distinguish the following three cases.

Case I: max{m, n} ≤ εL.
Both matrices fit in O(1) + 2mn/L lines. From the choice of ε, the number of lines required is at most Z/L. Therefore Q(m, n) = O(1 + mn/L).

Case II: m ≤ εL < n or n ≤ εL < m.
For this case, assume first that m ≤ εL < n. The transposition algorithm divides the greater dimension n by 2 and performs divide and conquer. At some point in the recursion, n is in the range εL/2 ≤ n ≤ εL, and the whole problem fits in cache. Because the layout is row-major, at this point the input array has n rows and m columns, and it is laid out in contiguous locations, thus requiring at most O(1 + nm/L) cache misses to be read. The output array consists of nm elements in m rows, where in the worst case every row lies on a different cache line. Consequently, we incur at most O(m + nm/L) cache misses for writing the output array. Since n ≥ εL/2, the total cache complexity for this base case is O(1 + m).

These observations yield the recurrence

Q(m, n) ≤ { O(1 + m)           if n ∈ [εL/2, εL] ,
          { 2Q(m, n/2) + O(1)  otherwise ,

whose solution is Q(m, n) = O(1 + mn/L).

The case n ≤ εL < m is analogous.

Case III: m, n > εL.
As in Case II, at some point in the recursion both n and m are in the range [εL/2, εL]. The whole problem fits into cache, and it can be solved with at most O(m + n + mn/L) cache misses.

The cache complexity thus satisfies the recurrence

Q(m, n) ≤ { O(m + n + mn/L)    if m, n ∈ [εL/2, εL] ,
          { 2Q(m/2, n) + O(1)  if m ≥ n ,
          { 2Q(m, n/2) + O(1)  otherwise ,

whose solution is Q(m, n) = O(1 + mn/L).

B Analysis of funnelsort

In this appendix, we analyze the cache complexity of funnelsort. The goal of the analysis is to show that funnelsort on n elements requires at most Q(n) cache misses, where

Q(n) = O(1 + (n/L)(1 + log_Z n)) ,

provided that Z = Ω(L²). [Note to the program committee: we believe that this hypothesis can be weakened to Z = Ω(L^{1+ε}) for all ε > 0. If correct, this result will appear in the final paper.]

In order to prove this result, we need three auxiliary lemmas. The first lemma bounds the space required by a k-merger.

Lemma 16 A k-merger can be laid out in O(k²) contiguous memory locations.

Proof. A k-merger requires O(k²) memory locations for the buffers, plus the space required by the √k-mergers. The space S(k) thus satisfies the recurrence

S(k) ≤ (√k + 1) S(√k) + O(k²) ,

whose solution is S(k) = O(k²).

In order to achieve the bound on Q(n), it is important that the buffers in a k-merger be maintained as circular queues of size k. This requirement guarantees that we can manage the queue cache-efficiently, in the sense stated by the next lemma.

Lemma 17 Performing r insert and remove operations on a circular queue causes O(1 + r/L) cache misses if two cache lines are available for the buffer.

Proof. Associate the two cache lines with the head and tail of the circular queue. If a new cache line is read during an insert (delete) operation, the next L − 1 insert (delete) operations do not cause a cache miss. The result follows.

The next lemma bounds the number of cache misses QM incurred by a k-merger.
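The counting argument of Lemma 17 can be simulated directly: with a cache line dedicated to the head of the queue, a miss occurs only when the head crosses a line boundary, so r consecutive operations cost about r/L misses. A sketch (the function name and the single-pointer simplification, tracking only the head, are ours):

```python
def queue_line_misses(r, L):
    """Count cache-line misses for r inserts into an empty circular queue
    with line length L, when one cache line is dedicated to the head
    pointer (the tail behaves symmetrically for removals)."""
    cached_line = None
    misses = 0
    for pos in range(r):
        line = pos // L        # cache line holding queue slot `pos`
        if line != cached_line:  # head moved onto a new cache line
            misses += 1
            cached_line = line
    return misses
```

The count is ⌈r/L⌉, matching the O(1 + r/L) bound of the lemma.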
Lemma 18 If Z = Ω(L²), then a k-merger operates with at most QM(k) cache misses, where

QM(k) = O(1 + k + k³/L + k³ log_Z k / L) .

Proof. There are two cases: either k < α√Z or k ≥ α√Z, where α is a sufficiently small constant.

Assume first that k < α√Z. By Lemma 16, the data structure associated with the k-merger requires at most O(k²) = O(Z) contiguous memory locations, and therefore it fits into cache. The k-merger has k input queues, from which it loads O(k³) elements. Let ri be the number of elements extracted from the ith input queue. Since k < α√Z and L = O(√Z), there are at least Z/L = Ω(k) cache lines available for the input buffers. Lemma 17 applies, whence the total number of cache misses for accessing the input queues is

∑_{i=1}^{k} O(1 + ri/L) = O(k + k³/L) .

Similarly by Lemma 17, the cache complexity of writing the output queue is at most O(1 + k³/L). Finally, the algorithm incurs at most O(1 + k²/L) cache misses for touching its internal data structures. The total cache complexity is therefore QM(k) = O(1 + k + k³/L), completing the proof of the first case.

Assume now that k ≥ α√Z. In this second case, we prove by induction on k that, whenever k ≥ α√Z, we have

QM(k) ≤ ck³ log_Z k / L − A(k) ,   (7)

where A(k) = k(1 + 2c log_Z k / L) is an o(k³) term. This particular value of A(k) will be justified later in the analysis.

The base case of the induction consists of values of k such that αZ^{1/4} ≤ k < α√Z. (It is not sufficient to just consider k = Θ(√Z), since k can become as small as Θ(Z^{1/4}) in the recursive calls.) The analysis of the first case applies, yielding QM(k) = O(1 + k + k³/L). Because k² = Ω(√Z) = Ω(L) and k = Ω(1), the last term dominates, and QM(k) = O(k³/L) holds. Consequently, a big enough value of c can be found that satisfies Inequality (7).

For the inductive case, let k ≥ α√Z. The k-merger invokes the √k-mergers recursively. Since αZ^{1/4} ≤ √k < k, the inductive hypothesis can be used to bound the number QM(√k) of cache misses incurred by the submergers. The "right" merger R is invoked exactly k^{3/2} times. The total number l of invocations of "left" mergers is bounded by l ≤ k^{3/2} + 2√k. To see why, consider that every invocation of a left merger puts k^{3/2} elements into some buffer. Since k³ elements are output and the buffer space is 2k², the bound l ≤ k^{3/2} + 2√k follows.

Before invoking R, the algorithm must check every buffer to see whether it is empty. One such check requires at most √k cache misses, since there are √k buffers. This check is repeated exactly k^{3/2} times, leading to at most k² cache misses for all checks.

These considerations lead to the recurrence

QM(k) ≤ (2k^{3/2} + 2√k) QM(√k) + k² .

Application of the inductive hypothesis yields the desired bound, Inequality (7), as follows:

QM(k) ≤ (2k^{3/2} + 2√k) QM(√k) + k²
      ≤ 2(k^{3/2} + √k) (ck^{3/2} log_Z k / (2L) − A(√k)) + k²
      ≤ ck³ log_Z k / L + k²(1 + c log_Z k / L) − 2(k^{3/2} + √k) A(√k) .

If A(k) = k(1 + 2c log_Z k / L) (for example), Inequality (7) follows.

Theorem 3 If Z = Ω(L²), then funnelsort sorts n elements with at most Q(n) cache misses, where

Q(n) = O(1 + (n/L)(1 + log_Z n)) .

Proof. If n ≤ αZ for a small enough constant α, then the algorithm fits into cache. To see why, observe that only one k-merger is active at any time. The biggest k-merger is the top-level n^{1/3}-merger, which requires O(n^{2/3}) = O(n) space. The algorithm thus can operate in O(1 + n/L) cache misses.

If n > αZ, we have the recurrence

Q(n) = n^{1/3} Q(n^{2/3}) + QM(n^{1/3}) .

By Lemma 18, we have QM(n^{1/3}) = O(1 + n^{1/3} + n/L + n log_Z n / L).

With the hypothesis Z = Ω(L²), we have n/L = Ω(n^{1/3}). Moreover, we also have n^{1/3} = Ω(1) and lg n = Ω(lg Z). Consequently, QM(n^{1/3}) = O(n log_Z n / L) holds, and the recurrence simplifies to

Q(n) = n^{1/3} Q(n^{2/3}) + O(n log_Z n / L) .

The result follows by induction on n.
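Functionally, a k-merger just merges k sorted sequences into one; its cache efficiency comes from the buffered √k-way recursive layout. The following Python sketch realizes only the recursion pattern, with plain lists, no circular buffers, and illustrative names, so it reproduces the output of a k-merger but not its cache behavior:

```python
import math

def merge_two(a, b):
    """Standard two-way merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def k_merge(runs):
    """Merge k sorted runs by the k-merger recursion: split the k inputs
    into about sqrt(k) groups, merge each group recursively (the "left"
    mergers), then merge the group outputs (the "right" merger)."""
    k = len(runs)
    if k == 1:
        return runs[0]
    if k == 2:
        return merge_two(runs[0], runs[1])
    g = max(2, math.isqrt(k))            # about sqrt(k) groups
    groups = [runs[i::g] for i in range(g)]
    return k_merge([k_merge(grp) for grp in groups])
```

In the real k-merger, the intermediate results of the left mergers are produced lazily into circular buffers of size 2k^{3/2} rather than materialized whole, which is what the analysis of Lemma 18 charges for.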
C Analysis of distribution sort

This appendix contains the proof of Lemma 5, which is used in Section 5.

Lemma 19 The median of n elements can be found cache-obliviously using O(n) work and incurring O(1 + n/L) cache misses.

Proof. See [12, p. 189] for the linear-time median-finding algorithm and the work analysis. The cache complexity is given by the same recurrence as the work complexity with a different base case:

Q(m) = { O(1 + m/L)                            if m ≤ αZ ,
       { Q(⌈m/5⌉) + Q(7m/10 + 6) + O(1 + m/L)  otherwise ,

where α is a sufficiently small constant. The result follows.

Lemma 5 The distribute step uses O(n) work, incurs O(1 + n/L) cache misses, and uses O(n) stack space to distribute n elements.

Proof. In order to simplify the analysis of the work used by DISTRIBUTE, assume that COPYELEMS uses O(1) work for procedural overhead. We will account for the work due to copying elements and splitting of buckets separately. The work of DISTRIBUTE is described by the recurrence

T(c) = 4T(c/2) + O(1) .

It follows that T(c) = O(c²), where c = √n initially. The work due to copying elements is also O(n).

The total number of bucket splits is at most √n. To see why, observe that there are at most √n buckets at the end of the distribution step, since each bucket contains at least √n elements. Each split operation involves O(√n) work, and so the net contribution to the work is O(n). Thus, the total work used by DISTRIBUTE is W(n) = O(T(√n)) + O(n) + O(n) = O(n).

For the cache analysis, we distinguish two cases. Let α be a sufficiently small constant such that the stack space used fits into cache.

Case I: n ≤ αZ.
The input and the auxiliary space of size O(n) fit into cache using O(1 + n/L) cache lines. Consequently, the cache complexity is O(1 + n/L).

Case II: n > αZ.
Let R(c, m) denote the cache misses incurred by an invocation of DISTRIBUTE(a, b, c) that copies m elements from subarrays to buckets. We again account for the splitting of buckets separately. We first prove that R satisfies the following recurrence:

R(c, m) ≤ { O(L + m/L)             if c ≤ αL ,      (8)
          { ∑_{1≤i≤4} R(c/2, mi)   otherwise ,

where ∑_{1≤i≤4} mi = m.

First, consider the base case c ≤ αL. An invocation of DISTRIBUTE(a, b, c) operates with c subarrays and c buckets. Since there are Ω(L) cache lines, the cache can hold all the auxiliary storage involved and the currently accessed element in each subarray and bucket. In this case there are O(L + m/L) cache misses: O(c) = O(L) cache misses are due to the initial access to each subarray and bucket, and O(1 + m/L) is the cache complexity of copying the m elements from contiguous to contiguous locations. This completes the proof of the base case. The recursive case, when c > αL, follows immediately from the algorithm. The solution for Equation (8) is R(c, m) = O(L + c²/L + m/L).

We still need to account for the cache misses caused by the splitting of buckets. Each split causes O(1 + √n/L) cache misses due to median finding (Lemma 19) and partitioning of √n contiguous elements. Additional O(1 + √n/L) misses are incurred by restoring the cache. As proven in the work analysis, there are at most √n split operations.

By adding R(√n, n) to the split complexity, we conclude that the total cache complexity of the distribution step is O(L + n/L + √n(1 + √n/L)) = O(n/L).
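The work recurrence T(c) = 4T(c/2) + O(1) above is easy to check numerically. With the O(1) term and the base case both normalized to 1 (an assumed convention for illustration), it solves exactly to (4c² − 1)/3 = O(c²) for c a power of two:

```python
def distribute_work(c):
    """T(c) = 4 T(c/2) + 1 with T(1) = 1, for c a power of two:
    the work recurrence of DISTRIBUTE with unit constants."""
    if c == 1:
        return 1
    return 4 * distribute_work(c // 2) + 1
```

Since the distribution step starts with c = √n, this confirms the O((√n)²) = O(n) work bound used in the proof of Lemma 5.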
References [12] T. H. Cormen, C. E. Leiserson, and R. L.
[1] A. Aggarwal, B. Alpern, A. K. Chandra, and Rivest. Introduction to Algorithms. MIT Press
M. Snir. A model for hierarchical memory. and McGraw Hill, 1990.
In Proceedings of the 19th Annual ACM Sympo- [13] P. Duhamel and M. Vetterli. Fast Fourier
sium on Theory of Computing, pages 305–314, transforms: a tutorial review and a state of
May 1987. the art. Signal Processing, 19:259–299, Apr.
[2] A. Aggarwal, A. K. Chandra, and M. Snir. 1990.
Hierarchical memory with block transfer. [14] J. D. Frens and D. S. Wise. Auto-blocking
In 28th Annual Symposium on Foundations of matrix-multiplication or tracking blas3 per-
Computer Science, pages 204–216, Los Ange- formance from source code. In Proceedings
les, California, 12–14 Oct. 1987. IEEE. of the Sixth ACM SIGPLAN Symposium on
[3] A. Aggarwal and J. S. Vitter. The in- Principles and Practice of Parallel Programming,
put/output complexity of sorting and re- pages 206–216, Las Vegas, NV, June 1997.
lated problems. Communications of the ACM, [15] M. Frigo. A fast Fourier transform compiler.
31(9):1116–1127, Sept. 1988. In Proceedings of the ACM SIGPLAN’99 Con-
[4] A. V. Aho, J. E. Hopcroft, and J. D. Ull- ference on Programming Language Design and
man. The Design and Analysis of Computer Al- Implementation (PLDI), Atlanta, Georgia, May
gorithms. Addison-Wesley Publishing Com- 1999.
pany, 1974. [16] M. Frigo and S. G. Johnson. FFTW: An adap-
[5] B. Alpern, L. Carter, and E. Feig. Uniform tive software architecture for the FFT. In
memory hierarchies. In Proceedings of the Proceedings of the International Conference on
31st Annual IEEE Symposium on Foundations Acoustics, Speech, and Signal Processing, Seat-
of Computer Science, pages 600–608, Oct. 1990. tle, Washington, May 1998.
[6] D. H. Bailey. FFTs in external or hierarchical [17] G. H. Golub and C. F. van Loan. Matrix Com-
memory. Journal of Supercomputing, 4(1):23– putations. Johns Hopkins University Press,
35, May 1990. 1989.
[7] L. A. Belady. A study of replacement algo- [18] J. L. Hennessy and D. A. Patterson. Computer
rithms for virtual storage computers. IBM Architecture: A Quantitative Approach. Mor-
Systems Journal, 5(2):78–101, 1966. gan Kaufmann Publishers, INC., 2nd edition,
[8] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. 1996.
Leiserson, and K. H. Randall. An anal- [19] J.-W. Hong and H. T. Kung. I/O complexity:
ysis of dag-consistent distributed shared- the red-blue pebbling game. In Proceedings of
memory algorithms. In Proceedings of the the 13th Annual ACM Symposium on Theory of
Eighth Annual ACM Symposium on Parallel Al- Computing, pages 326–333, Milwaukee, 1981.
gorithms and Architectures (SPAA), pages 297–
308, Padua, Italy, June 1996. [20] R. Motwani and P. Raghavan. Randomized Al-
gorithms. Cambridge University Press, 1995.
[9] S. Chatterjee, V. V. Jain, A. R. Lebeck, and
S. Mundhra. Nonlinear array layouts for hi- [21] M. H. Nodine and J. S. Vitter. Determinis-
erarchical memory systems. In Proceedings of tic distribution sort in shared and distributed
the ACM International Conference on Supercom- memory multiprocessors. In Proceedings of
puting, Rhodes, Greece, June 1999. the Fifth Symposium on Parallel Algorithms
[10] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and Architectures, pages 120–129, Velen, Ger-
and M. Thottethodi. Recursive array lay- many, 1993.
outs and fast parallel matrix multiplication. [22] J. E. Savage. Extending the Hong-Kung
In Proceedings of the Eleventh ACM SIGPLAN model to memory hierarchies. In D.-Z. Du
Symposium on Parallel Algorithms and Architec- and M. Li, editors, Computing and Combina-
tures, June 1999. torics, volume 959 of Lecture Notes in Com-
[11] J. W. Cooley and J. W. Tukey. An algorithm puter Science, pages 270–281. Springer Verlag,
for the machine computation of the complex 1995.
Fourier series. Mathematics of Computation, [23] R. C. Singleton. An algorithm for comput-
19:297–301, Apr. 1965. ing the mixed radix fast Fourier transform.

IEEE Transactions on Audio and Electroacous-
tics, AU-17(2):93–103, June 1969.
[24] D. D. Sleator and R. E. Tarjan. Amortized ef-
ficiency of list update and paging rules. Com-
munications of the ACM, 28(2):202–208, Feb.
1985.
[25] V. Strassen. Gaussian elimination is not op-
timal. Numerische Mathematik, 13:354–356,
1969.
[26] S. Toledo. Locality of reference in LU de-
composition with partial pivoting. SIAM
Journal on Matrix Analysis and Applications,
18(4):1065–1081, Oct. 1997.
[27] J. S. Vitter. External memory algorithms and
data structures. In J. Abello and J. S. Vitter,
editors, External Memory Algorithms and Visu-
alization, DIMACS Series in Discrete Math-
ematics and Theoretical Computer Science.
American Mathematical Society Press, Prov-
idence, RI, 1999.
[28] J. S. Vitter and M. H. Nodine. Large-scale
sorting in uniform memory hierarchies. Jour-
nal of Parallel and Distributed Computing, 17(1–
2):107–114, January and February 1993.
[29] J. S. Vitter and E. A. M. Shriver. Algorithms
for parallel memory I: Two-level memories.
Algorithmica, 12(2/3):110–147, August and
September 1994.
[30] J. S. Vitter and E. A. M. Shriver. Algorithms
for parallel memory II: Hierarchical multi-
level memories. Algorithmica, 12(2/3):148–
169, August and September 1994.
[31] S. Winograd. On the algebraic complexity of
functions. Actes du Congrès International des
Mathématiciens, 3:283–288, 1970.
