Algorithms for Memory Hierarchies
Advanced Lectures
Series Editors
Volume Editors
Ulrich Meyer
Peter Sanders
Max-Planck-Institut für Informatik
Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany
E-mail: {umeyer,sanders}@mpi-sb.mpg.de
Jop Sibeyn
Martin-Luther-Universität Halle-Wittenberg, Institut für Informatik
Von-Seckendorff-Platz 1, 06120 Halle, Germany
E-mail:[email protected]
Cataloging-in-Publication Data applied for
A catalog record for this book is available from the Library of Congress.
CR Subject Classification (1998): F.2, E.5, E.1, E.2, D.2, D.4, C.2, G.2, H.2, I.2, I.3.5
ISSN 0302-9743
ISBN 3-540-00883-7 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH
http://www.springer.de
© Springer-Verlag Berlin Heidelberg 2003
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Boller Mediendesign
Printed on acid-free paper SPIN: 10873015 06/3142 543210
Preface
Algorithms that process large data sets have to take into account that the
cost of memory accesses depends on where the accessed data is stored. Tradi-
tional algorithm design is based on the von Neumann model which assumes
uniform memory access costs. Actual machines increasingly deviate from this
model. While waiting for a memory access, modern microprocessors can
execute 1000 register additions. For hard disk accesses this factor can reach
seven orders of magnitude. The 16 chapters of this volume introduce and
survey algorithmic techniques used to achieve high performance on memory
hierarchies. The focus is on methods that are interesting both from a practical
and from a theoretical point of view.
This volume is the result of a GI-Dagstuhl Research Seminar. The Ge-
sellschaft für Informatik (GI) has organized such seminars since 1997. They
can be described as “self-taught” summer schools where graduate students
in cooperation with a few more experienced researchers have an opportunity
to acquire knowledge about a current topic of computer science. The seminar
was organized as Dagstuhl Seminar 02112 from March 10, 2002 to March
14, 2002 in the International Conference and Research Center for Computer
Science at Schloss Dagstuhl.
Chapter 1 gives a more detailed motivation for the importance of al-
gorithm design for memory hierarchies and introduces the models used in
this volume. Interestingly, the simplest model variant — two levels of mem-
ory with a single processor — is sufficient for most algorithms in this book.
Chapters 1–7 represent much of the algorithmic core of external memory
algorithms and almost exclusively rely on this simple model. Among these,
Chaps. 1–3 lay the foundations by describing techniques used in more spe-
cific applications. Rasmus Pagh discusses data structures like search trees,
hash tables, and priority queues in Chap. 2. Anil Maheshwari and Norbert
Zeh explain generic algorithmic approaches in Chap. 3. Many of these tech-
niques such as time-forward processing, Euler tours, or list ranking can be
formulated in terms of graph theoretic concepts. Together with Chaps. 4 and
5 this offers a comprehensive review of external graph algorithms. Irit Ka-
triel and Ulrich Meyer discuss fundamental algorithms for graph traversal,
shortest paths, and spanning trees that work for many types of graphs. Since
even simple graph problems can be difficult to solve in external memory, it
makes sense to look for better algorithms for frequently occurring special
types of graphs. Laura Toma and Norbert Zeh present a number of aston-
ishing techniques that work well for planar graphs and graphs with bounded
tree width.
In Chap. 6 Christian Breimann and Jan Vahrenhold give a comprehensive
overview of algorithms and data structures handling geometric objects like
points and lines — an area that is at least as rich as graph algorithms. A
third area of again quite different algorithmic techniques are string problems
discussed by Juha Kärkkäinen and Srinivasa Rao in Chap. 7.
Chapters 8–10 then turn to more detailed models with particular empha-
sis on the complications introduced by hardware caches. Beyond this common
motivation, these chapters are quite diverse. Naila Rahman uses sorting as an
example for these issues in Chap. 8 and puts particular emphasis on the of-
ten neglected issue of TLB misses. Piyush Kumar introduces cache-oblivious
algorithms in Chap. 9 that promise to grasp multilevel hierarchies within a
very simple model. Markus Kowarschik and Christian Weiß give a practical
introduction into cache-efficient programs using numerical algorithms as an
example. Numerical applications are particularly important because they al-
low significant instruction-level parallelism so that slow memory accesses can
dramatically slow down processing.
Stefan Edelkamp introduces an application area of very different char-
acter in Chap. 11. In artificial intelligence, search programs have to handle
huge state spaces that require sophisticated techniques for representing and
traversing them.
Chapters 12–14 give a system-oriented view of advanced memory hierar-
chies. On the lowest level we have storage networks connecting a large num-
ber of inhomogeneous disks. Kay Salzwedel discusses this area with particular
List of Contributors
Editors

Ulrich Meyer
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Peter Sanders
Max-Planck Institut für Informatik
Stuhlsatzenhausweg 85
66123 Saarbrücken, Germany
[email protected]

Jop F. Sibeyn
Martin-Luther Universität Halle-Wittenberg
Institut für Informatik
Von-Seckendorff-Platz 1
06120 Halle, Germany
[email protected]

Authors

Christian Breimann
Westfälische Wilhelms-Universität
Institut für Informatik
Einsteinstr. 62
48149 Münster, Germany
[email protected]

Stefan Edelkamp
Albert-Ludwigs-Universität Freiburg
Institut für Informatik
Georges-Köhler-Allee, Gebäude 51
79110 Freiburg, Germany
[email protected]

Laura Toma
Duke University
Department of Computer Science
Durham, NC 27708, USA
[email protected]
1. Memory Hierarchies —
Models and Lower Bounds
Peter Sanders∗
The purpose of this introductory chapter is twofold. On the one hand, it serves
the rather prosaic purpose of introducing the basic models and notations used
in the subsequent chapters. On the other hand, it explains why these simple
abstract models can be used to develop better algorithms for complex real
world hardware.
Section 1.1 starts with a basic motivation for memory hierarchies and
Section 1.2 gives a glimpse on their current and future technological realiza-
tions. More theoretically inclined readers can skip or skim this section and
directly proceed to the introduction of the much simpler abstract models in
Section 1.3. Then we have all the terminology in place to explain the guiding
principles behind algorithm design for memory hierarchies in Section 1.4. A
further issue permeating most external memory algorithms is the existence of
fundamental lower bounds on I/O complexity described in Section 1.5. Less
theoretically inclined readers can skip the proofs but might want to remember
these bounds because they show up again and again in later chapters.
Parallelism is another important approach to high performance comput-
ing that has many interactions with memory hierarchy issues. We describe
parallelism issues in subsections that can be skipped by readers only inter-
ested in sequential memory hierarchies.
megabytes), or large scale numerical simulations. Even if the input and output
of an application are small, it might be necessary to store huge intermediate
data structures. For example, some of the state space search algorithms in
Chapter 11 are of this type.
How should a machine for processing such large inputs look? Data should
be cheap to store but we also want fast processing. Unfortunately, there are
fundamental reasons why we cannot get memory that is at the same time
cheap, compact, and fast. For example, no signal can propagate faster than
light. Hence, given a storage technology and a desired access latency, there is
only a finite amount of data reachable within this time limit. Furthermore, in
a cheap and compact storage technology there is no room for wires reaching
every single memory cell. It is more economical to use a small number of
devices that can be moved to access a given bit.
There are several approaches to escape this so called memory wall prob-
lem. The simplest and most widely used compromise is a memory hierarchy.
There are several categories of memory in a computer ranging from small and
fast to large, cheap, and slow. Even in a memory hierarchy, we can process
huge data sets efficiently. The reason is that although access latencies to the
huge data sets are large, we can still achieve large bandwidths by accessing
many close-by bits together and by using several memory units in parallel.
Both approaches can be modeled using the same abstract view: Access to
large blocks of memory is almost as fast as access to a single bit. The algo-
rithmic challenge following from this principle is to design algorithms that
perform well on systems with blocked memory access. This is the main sub-
ject of this volume.
hence even larger access latencies [399]. Often there are separate L1 caches
for instructions and data.
The second level (L2) cache is on the same chip as the first level cache but
it has quite different properties. The L2 cache is as large as the technology
allows because applications that fit most of their data into this cache can
execute very fast. The L2 cache has access latencies around ten clock cycles.
Communication between L1 and L2 cache uses block sizes of 16–32 bytes. For
accessing off-chip data, larger blocks are used. For example, the Pentium 4
uses 128 byte blocks [399].
Some processors have a third level (L3) cache that is on a separate set of
chips. This cache is made out of fast static¹ RAM cells. The L3 cache can
be very large in principle, but this is not always cost effective because static
RAMs are rather expensive.
The main memory is made out of high density cheap dynamic RAM
cells. Since the access speeds of dynamic RAMs have lagged behind processor
speeds, dynamic RAMs have developed into devices optimized for block ac-
cess. For example, RAMBUS RDRAM² chips allow blocks of up to 16 bytes
to be accessed in only twice the time to access a single byte.
The programmer is not required to know about the details of the hierarchy
between caches and main memory. The hardware cuts the main memory into
blocks of fixed size and automatically maps a subset of the memory blocks
to L3 cache. Furthermore, it automatically maps a subset of the blocks in L3
cache to L2 cache and from L2 cache to L1 cache. Although this automatic
cache administration is convenient and often works well, one is in for un-
pleasant surprises. In Chapter 8 we will see that sometimes a careful manual
mapping of data to the memory hierarchy would work much better.
The backbone of current data storage are magnetic hard disks because
they offer cheap non volatile memory [643]. In the last years, extremely high
densities have been achieved for magnetic surfaces that allow several giga-
bytes to be stored on the area of a postage stamp. The data is accessed by
tiny magnetic devices that hover as low as 20 nm over the surface of the
rotating disk. It takes very long to move the access head to a particular track
of the disk and to wait until the disk rotates into the correct position. With
up to 10 ms, disk access can be 10^7 times slower than an access to a register.
However, once the head starts reading or writing, data can be transferred at
a rate of about 50 megabytes per second. Hence, accessing hundreds of KB
takes only about twice as long as accessing a single byte. Clearly, it makes
sense to process data in large chunks.
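The arithmetic behind these numbers is easy to make explicit. The small calculation below uses the figures quoted above (10 ms access latency, about 50 megabytes per second transfer rate) as illustrative constants; it is a back-of-the-envelope sketch, not a disk model.

    LATENCY = 10e-3          # seconds to position the access head
    BANDWIDTH = 50e6         # bytes per second once the transfer starts

    def access_time(nbytes):
        """Time for one sequential access that transfers nbytes."""
        return LATENCY + nbytes / BANDWIDTH

    for size in (1, 100 * 1024, 1024 * 1024):
        print(f"{size:>8} bytes: {access_time(size) * 1000:6.2f} ms")
    # 1 byte and 100 KB differ by only about a factor of two, which is
    # why it pays to process data in large chunks.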
Hard disks are also used as a way to virtually enlarge the main mem-
ory. Logical blocks that are currently not in use are swapped to disk. This
mechanism is partially supported by the processor hardware that is able to
¹ Static RAM needs six transistors per bit which makes it more area consuming
but faster than dynamic RAM that needs only one transistor per bit.
² http://www.rambus.com
There are too many possible developments to explain or even perceive all of
them in detail but a few basic trends should be noted. The memory hierarchy
might become even deeper. Third level caches will become more common.
Intel has even integrated it on the Itanium 2 processor. In such a system,
an off-chip 4th level cache makes sense. There is also a growing gap between
the access latencies and capacities of disks and main memory. Therefore,
magnetic storage devices with smaller capacity but also lower access latency
have been proposed [669].
While storage density in CMOS-RAMs and magnetic disks will keep in-
creasing for quite some time, it is conceivable that different technologies will
get their chance in a longer time frame. There are some ideas available that
would allow memory cells consisting of single molecules [780]. Furthermore,
even with current densities, astronomically large amounts of data could be
stored using three-dimensional storage devices. The main difficulty is how to
write and read such memories. One approach uses holographic images stor-
ing large blocks of data in small three-dimensional regions of a transparent
material [716].
Regardless of the technology, it seems likely that block-wise access and
the use of parallelism will remain necessary to achieve high performance
processing of large volumes of data.
Parallelism
a main memory and a second level cache. The IBM Power 4 processor already
implements this technology. Several processors on different chips can share
main memory. Several processor boards can share the same network of disks.
Servers usually have many disk drives. In such systems, it becomes more and
more important that memory devices on all levels of the memory hierarchy
can work on multiple memory accesses in parallel.
On parallel machines, some levels of the memory hierarchy may be shared
whereas others are distributed between the processors. Local caches may hold
copies of shared or remote data. Thus, a read access to shared data may be
as fast as a local access. However, writing shared data invalidates all the
copies that are not in the cache of the writing processor. This can cause
severe overhead for sending the invalidations and for reloading the data at
subsequent remote accesses.
1.3 Modeling
We have seen that real memory hierarchies are very complex. We have mul-
tiple levels, all with their own idiosyncrasies. Hardware caches have replace-
ment strategies that vary between simplistic and strange [294], disks have
position dependent access delays, etc. It might seem that the best models are
those that are as accurate as possible. However, for algorithm design, this
leads the wrong way. Complicated models make algorithms difficult to design
and analyze. Even if we overcome these difficulties, it would be very difficult
to interpret the results because complicated models have a lot of parameters
that vary from machine to machine.
Attractive models for algorithm design are very simple, so that it is easy to
develop algorithms. They have few parameters so that it is easy to compare
the performance of algorithms. The main issue in model design is to find
simple models that grasp the essence of the real situation so that algorithms
that are good in the model are also good in reality.
In this volume, we build on the most widely used nonhierarchical model.
In the random access machine (RAM) model or von Neumann model [579],
we have a “sufficiently” large uniform memory storing words of size O(log n)
bits where n is the size of our input. Accessing any word in memory takes con-
stant time. Arithmetics and bitwise operations with words can be performed
in constant time. For numerical and geometric algorithms, it is sometimes
also assumed that words can represent real numbers accurately. Storage con-
sumption is measured in words if not otherwise mentioned.
Most chapters of this volume use a minimalistic extension that we will
simply call the external memory model. We use the notation introduced by
Aggarwal, Vitter, and Shriver [17, 755]. Processing works almost as in the
RAM model, except that there are only M words of internal memory that
can be accessed quickly. The remaining memory can only be accessed using
I/Os that move B contiguous words between internal and external memory.
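The following toy sketch shows how the model is typically used in analyses: external memory is treated as an array of blocks of B words, and an algorithm is charged one I/O for every block it touches. The class and the parameter values below are illustrative only; the M words of internal memory stay implicit because the scan keeps just one block and a counter.

    class ExternalMemory:
        """External data as an array of blocks of B words, with an I/O counter."""
        def __init__(self, data, B):
            self.B = B
            self.blocks = [data[i:i + B] for i in range(0, len(data), B)]
            self.io_count = 0

        def read_block(self, i):
            self.io_count += 1           # every block access costs one I/O
            return list(self.blocks[i])

    def external_sum(em):
        """Scan the input block by block: ceil(N/B) I/Os in total."""
        total = 0
        for i in range(len(em.blocks)):
            for x in em.read_block(i):
                total += x
        return total

    em = ExternalMemory(list(range(1000)), B=64)
    print(external_sum(em), "computed with", em.io_count, "I/Os")  # 499500, 16 I/Os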
[Figure: the external memory model — a CPU with fast memory of size M, connected to a large memory; data is moved in blocks of size B.]
Then a good choice of the block size is B = t0 . When we access less data
we are at most a factor two off by accessing an entire block of size B. When
we access L > B words, we are at most a factor two off by counting L/B
block I/Os.
In reality, the access latency depends on the current position of the disk
mechanism and on the position of the block to be accessed on the disk.
Although exploiting this effect can make a big difference, programs that op-
timize access latencies are rare since the details depend on the actual disk
used and are usually not published by the disk vendors. If other applications
or the operating system make additional unpredictable accesses to the same
disk, even sophisticated optimizations can be in vain. In summary, by picking
an appropriate block size, we can model the most important aspects of disk
drives.
Parallelism
Although we mostly use the sequential variant of the external memory model,
it also has an option to express parallelism. External memory is partitioned
into D parts (e.g. disks) so that in each I/O step, one block can be accessed
on each of the parts.
With respect to parallel disks, the model of Vitter and Shriver [755] de-
viates from an earlier model by Aggarwal and Vitter [17] where D arbitrary
blocks can be accessed in parallel. A hardware realization could have D read-
ing/writing devices that access a single disk or a D-ported memory. This
model is more powerful because algorithms need not care about the map-
ping of data to disks. However, there are efficient (randomized) algorithms
for emulating the Aggarwal-Vitter model on the Vitter-Shriver model [656].
Hence, one approach to developing parallel disk external memory algorithms
is to start with an algorithm for the Aggarwal-Vitter model and then add an
appropriate load balancing algorithm (also called declustering).
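As a minimal illustration of declustering, the sketch below stripes logical blocks over D disks round-robin, so that any D consecutive logical blocks can be fetched in a single parallel I/O step. This is only the simplest such mapping; the randomized emulation results cited above are considerably more refined.

    D = 4  # number of disks (example value)

    def disk_of(logical_block):
        return logical_block % D          # which disk holds the block

    def offset_on_disk(logical_block):
        return logical_block // D         # position of the block on that disk

    # Any D consecutive logical blocks land on D distinct disks and can be
    # fetched in one parallel I/O step:
    batch = [10, 11, 12, 13]
    print({disk_of(b) for b in batch})    # {0, 1, 2, 3}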
Vitter and Shriver also make provisions for parallel processing. There are
P identical processors that can work in parallel. Each has fast memory M/P
and is equipped with D/P disks. In the external memory model there are
no additional parameters expressing the communication capabilities of the
processors. Although this is an oversimplification, this is already enough to
distinguish many algorithms with respect to their ability to be executed on
parallel machines. The model seems suitable for parallel machines with shared
memory.
For discussing parallel external memory on machines with distributed
memory we need a model for communication cost. The BSP model [742] that
is widely accepted for parallel (internal) processing fits well here: The P pro-
cessors work in supersteps. During a superstep, the processors can perform
local communications and post messages to other processors to the commu-
nication subsystem. At the end of a superstep, all processors synchronize and
exchange all the messages that have been posted during the superstep. This
In Chapter 8 we will see more refined models for the fastest levels of the
memory hierarchy, including replacements strategies used by the hardware
and the role of the TLB. Chapter 10 contributes additional practical examples
from numeric computing. Chapter 15 will explain parallel models in more
detail. In particular, we will see models that take multiple levels of hierarchy
into account.
There are also alternative models for the simple sequential memory hier-
archy. For example, instead of counting block I/Os with respect to a block
size B, we could allow variable block sizes and count the number of I/Os
k and the total I/O volume h. The total I/O cost could then be accounted
as I/O k + gI/O v where — in analogy to the BSP model — I/O stands for
the I/O latency and gI/O for the ratio between I/O speed and computation
speed. This model is largely equivalent to the block based model but it might
be more elegant when used together with the BSP model and it is more ad-
equate to explain differences between algorithms with regular and irregular
access patterns [227].
Another interesting variant is the cache oblivious model discussed in
Chapter 9. This model is identical to the external memory model except
that the algorithm is not told the values of B and M . The consequence of
this seemingly innocent variant is that an I/O efficient cache oblivious al-
gorithm works well not only on any machine but also on all levels of the
memory hierarchy at the same time. Cache oblivious algorithms can be very
simple, i.e., we do not need to know B and M to scan an array. But even
cache oblivious sorting is quite difficult.
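As a small illustration of the cache-oblivious idea, the routine below sums an array without ever mentioning B or M; because it touches the array from left to right, an ideal cache incurs O(N/B + 1) block transfers for every block size B. A plain loop would do equally well here; the recursive, parameter-free structure is shown only because it is the typical shape of cache-oblivious algorithms. This is a sketch, not the formal treatment of Chapter 9.

    def co_sum(a, lo=0, hi=None):
        """Sum a[lo:hi] by recursive halving; no block or memory size appears."""
        if hi is None:
            hi = len(a)
        if hi - lo <= 1:
            return a[lo] if hi > lo else 0
        mid = (lo + hi) // 2
        return co_sum(a, lo, mid) + co_sum(a, mid, hi)

    print(co_sum(list(range(10))))  # 45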
Finally, there are interesting approaches to eliminate memory hierarchies.
Blocked access is only one way to hide access latency. Another approach is
pipelining where many independent accesses are executed in parallel. This
approach is more powerful but also more difficult to support in hardware.
Vector computers such as the NEC SX-6 support pipelined memory access
even to nonadjacent cells at full memory bandwidth. Several experimental
machines [2, 38] use massive pipelined memory access by the hardware to
run many parallel threads on a single processor. While one thread waits for
a memory access, the other threads can do useful work. Modern mainstream
processors also support pipelined memory access to a certain extent [399].
Parallelism
Permuting and Sorting: Too often, the data is not arranged in a way that
scanning helps. Then we can rearrange the data into an order where scanning
is useful. When we already know where to place each element, this means
permuting the data. When the permutation is defined implicitly via a total
ordering “<” of the elements, we have to sort with respect to “<”. Chapter 3
gives an upper bound of
sort(N) = Θ((N/B) log_{M/B}(N/B)) I/Os (1.2)
for sorting. In Section 1.5.1, we will see an almost identical lower bound for
permuting that is also a lower bound for the more difficult problem of sorting.
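To get a feeling for the size of the sorting bound (1.2), the following plugs in one set of example parameters (the values are illustrative, not taken from the text) and compares it with the N/B I/Os of a single scan.

    import math

    N = 10**10   # items
    M = 10**8    # items that fit in internal memory
    B = 10**5    # items per block

    scan_ios = N / B
    sort_ios = (N / B) * math.log(N / B, M / B)
    print(f"scan: {scan_ios:.2e} I/Os, sort: {sort_ios:.2e} I/Os, "
          f"ratio: {sort_ios / scan_ios:.2f}")
    # The log_{M/B}(N/B) factor is about 1.7 here: sorting amounts to
    # only a couple of passes over the data.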
Searching: Any pointer based data structure indexing N elements needs
access time
search(N) = Ω(log_B(N/M)) I/Os. (1.3)
This lower bound is explained in Section 1.5.2. In Chapter 2 we see a matching
upper bound for the simple case of a linear order. High dimensional problems
such as the geometric data structures explained in Chapter 6 can be more
difficult.
Arge and Bro Miltersen [59] give a more detailed account of lower bounds
for external memory algorithms.
We analyse the following problem. How many I/O operations are necessary
to generate a permutation of the input? A lower bound on permuting implies
a lower bound for sorting because for every permutation of a set of elements,
there is a set of keys that forces sorting to produce this permutation. The
lower bound was established in a seminal paper by Aggarwal and Vitter
[17]. Here we report a simplified proof based on unpublished lecture notes by
Albers, Crauser, and Mehlhorn [24].
To establish a lower bound, we need to specify precisely what a permuta-
tion algorithm can do. We make some restrictions but most of them can be
lifted without changing the lower bound significantly. We view the internal
memory as a bag being able to hold up to M elements. External memory is
an array of elements. Reading and writing external memory is always aligned
to block boundaries, i.e., if the cells of external memory are numbered, ac-
cess is always to cells i, . . . , i + B − 1 such that i is a multiple of B. At the
beginning, the first N/B blocks of external memory contain the input. The
internal memory and the remaining external memory contain no elements. At
the end, the output is again in the first N/B blocks of the external memory.
We view our elements as abstract objects, i.e., the only operation available on
them is to move them around. They cannot be duplicated, split, or modified
in any way. A read step moves B elements from a block of external memory to
internal memory. A write step moves any B elements from internal memory
into a block of external memory. In this model, the following theorem holds:
Theorem 1.1. Permuting N elements takes at least
t ≥ 2 · (N/B) · log(N/eB) / (log(eM/B) + 2 log(N/B)/B) I/Os.
For N = O((eM/B)^{B/2}) the term log(eM/B) dominates the denominator
and we get a lower bound for sorting of
2 · (N/B) · log(N/eB) / O(log(eM/B)) = Ω((N/B) log_{M/B}(N/B)),
which is the same as the upper bound from Chapter 3.
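The following evaluates the lower bound of Theorem 1.1 for one example parameter setting (values chosen purely for illustration) and compares it with the N/B I/Os needed merely to read the input once; the bound exceeds a single scan by only a small factor, in line with the Θ((N/B) log_{M/B}(N/B)) behaviour just discussed.

    import math

    N, M, B = 10**9, 10**7, 10**4
    e = math.e

    denominator = math.log2(e * M / B) + 2 * math.log2(N / B) / B
    lower = 2 * (N / B) * math.log2(N / (e * B)) / denominator
    print(f"permuting needs at least {lower:.2e} I/Os; "
          f"a single scan is {N / B:.2e} I/Os")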
The basic approach for establishing Theorem 1.1 is simple. We find an
upper bound c_t for the number of different permutations generated after t
I/O steps, looking at all possible sequences of t I/O steps. Since there are
N! possible permutations of N elements, t must be large enough such that
c_t ≥ N! because otherwise there are permutations that cannot be generated
using t I/Os. Solving for t yields the desired lower bound.
A state of the algorithm can be described abstractly as follows:
1. the set of elements in the main memory;
2. the set of elements in each nonempty block of external memory;
3. the permutation in which the elements in each nonempty block of external
memory are stored.
We call two states equivalent if they agree in the first two components (they
may differ in the third).
In the final state, the elements are stored in N/B blocks of B elements
each. Each equivalence class of final states therefore consists of (B!)^{N/B}
states. Hence, it suffices for our lower bound to find out when the num-
ber of equivalence classes of final states C_t reachable after t I/Os exceeds
N!/(B!)^{N/B}.
We estimate C_t inductively. Clearly, C_0 = 1.
Lemma 1.2. C_{t+1} ≤ C_t · (N/B) if the I/O operation is a read, and
C_{t+1} ≤ C_t · (N/B) · (M choose B) if the I/O operation is a write.
t ≥ 2 · (N/B) · log(N/eB) / (log(eM/B) + 2 log(N/B)/B).
Fig. 1.2. Pointer based searching.
2. Basic External Memory Data Structures
Rasmus Pagh
This chapter is a tutorial on basic data structures that perform well in mem-
ory hierarchies. These data structures have a large number of applications
and furthermore serve as an introduction to the basic principles of designing
data structures for memory hierarchies.
We will assume the reader to have a background in computer science that
includes a course in basic (internal memory) algorithms and data structures.
In particular, we assume that the reader knows about queues, stacks, and
linked lists, and is familiar with the basics of hashing, balanced search trees,
and priority queues. Knowledge of amortized and expected case analysis will
also be assumed. For readers with no such background we refer to one of the
many textbooks covering basic data structures in internal memory, e.g., [216].
The model we use is a simple one that focuses on just two levels of the
memory hierarchy, assuming the movement of data between these levels to
be the main performance bottleneck. (More precise models and a model that
considers all memory levels at the same time are discussed in Chapter 8 and
Chapter 9.) Specifically, we consider the external memory model described
in Chapter 1.
Our notation is summarized in Fig. 2. The parameters M , w and B de-
scribe the model. The size of the problem instance is denoted by N , where
N ≤ 2w . The parameter Z is query dependent, and is used to state output
sensitive I/O bounds. To reduce notational overhead we take logarithms to
always be at least 1, i.e., log_a b should be read as max(log_a b, 1).
Stacks and queues represent dynamic sets of data elements, and support op-
erations for adding and removing elements. They differ in the way elements
are removed. In a stack , a remove operation deletes and returns the set ele-
ment most recently inserted (last-in-first-out), whereas in a queue it deletes
and returns the set element that was first inserted (first-in-first-out).
Recall that both stacks and queues for sets of size at most N can be
implemented efficiently in internal memory using an array of length N and
a few pointers. Using this implementation on external memory gives a data
structure that, in the worst case, uses one I/O per insert and delete operation.
However, since we can read or write B elements in one I/O, we could hope to
do considerably better. Indeed this is possible, using the well-known technique
of a buffer.
An External Stack. In the case of a stack, the buffer is just an internal
memory array of 2B elements that at any time contains the k set elements
most recently inserted, where k ≤ 2B. Remove operations can now be imple-
mented using no I/Os, except for the case where the buffer has run empty.
In this case a single I/O is used to retrieve the block of B elements most
recently written to external memory.
One way of looking at this is that external memory is used to implement
a stack with blocks as data elements. In other words: The “macroscopic view”
in external memory is the same as the “microscopic view” in internal memory.
This is a phenomenon that occurs quite often – other examples will be the
search trees in Section 2.3 and the hash tables in Section 2.4.
Returning to external stacks, the above means that at least B remove
operations are made for each I/O reading a block. Insertions use no I/Os
except when the buffer runs full. In this case a single I/O is used to write the
B least recent elements to a block in external memory. Summing up, both
insertions and deletions are done in 1/B I/O, in the amortized sense. This is
the best performance we could hope for when storing or retrieving a sequence
of data items much larger than internal memory, since no more than B items
can be read or written in one I/O. A desired goal in many external memory
data structures is that when reporting a sequence of elements, only O(1/B)
I/O is used per element. We return to this in Section 2.3.
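A toy rendition of the external stack just described is given below: a 2B-element internal buffer, one block of the B least recently pushed buffered elements written out when the buffer runs full, and the most recently written block read back when the buffer runs empty. The "disk" is an ordinary list and I/Os are only counted; everything else is illustrative.

    class ExternalStack:
        def __init__(self, B):
            self.B = B
            self.buffer = []        # at most 2B most recently pushed elements
            self.disk = []          # full blocks on disk, oldest first
            self.io_count = 0

        def push(self, x):
            if len(self.buffer) == 2 * self.B:
                # write the B least recent buffered elements as one block
                self.disk.append(self.buffer[:self.B])
                self.buffer = self.buffer[self.B:]
                self.io_count += 1
            self.buffer.append(x)

        def pop(self):
            if not self.buffer:
                if not self.disk:
                    raise IndexError("pop from empty stack")
                self.buffer = self.disk.pop()   # read back one block
                self.io_count += 1
            return self.buffer.pop()

    s = ExternalStack(B=4)
    for i in range(20):
        s.push(i)
    assert [s.pop() for _ in range(20)] == list(range(19, -1, -1))
    print("I/Os for 40 operations:", s.io_count)   # far fewer than one per operation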
Exercise 2.1. Why does the stack not use a buffer of size B?
Problem 2.2. Above we saw how to implement stacks and queues having
a fixed bound on the maximum number of elements. Show how to efficiently
implement external stacks and queues with no bound on the number of ele-
ments.
Exercise 2.3. Argue that certain insertions and deletions will require N/B
I/Os if we insist on exactly B consecutive elements in every block (except
possibly the last).
To allow for efficient updates, we relax the invariant to require that, e.g.,
there are more than 2B/3 elements in every pair of consecutive blocks. This
increases the number of I/Os needed for a sequential scan by at most a factor
of three. Insertions can be done in a single I/O except for the case where the
block supposed to hold the new element is full. If either neighbor of the
block has spare capacity, we may push an element to this block. In case both
neighbors are full, we split the block into two blocks of about B/2 elements
each. Clearly this maintains the invariant (in fact, at least B/6 deletions
will be needed before the invariant is violated in this place again). When
deleting an element we check whether the total number of elements in the
block and one of its neighbors is 2B/3 or less. If this is the case we merge the
two blocks. It is not hard to see that this reestablishes the invariant: Each
of the two pairs involving the new block now have more elements than the
corresponding pairs had before.
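The sketch below implements the update rules just described: elements live in blocks of at most B, a block that overflows is split, and after a deletion a block is merged with its right neighbour when the pair holds at most 2B/3 elements. Positions are located by a linear walk and I/Os are not counted; both simplifications are made only to keep the sketch short.

    class BlockedList:
        def __init__(self, B):
            self.B = B
            self.blocks = [[]]

        def _locate(self, pos):
            for i, blk in enumerate(self.blocks):
                if pos < len(blk):
                    return i, pos
                pos -= len(blk)
            return len(self.blocks) - 1, len(self.blocks[-1])  # append position

        def insert(self, pos, x):
            i, j = self._locate(pos)
            blk = self.blocks[i]
            blk.insert(j, x)
            if len(blk) > self.B:                       # split an overfull block
                half = len(blk) // 2
                self.blocks[i:i + 1] = [blk[:half], blk[half:]]

        def delete(self, pos):
            i, j = self._locate(pos)
            del self.blocks[i][j]
            # merge with the right neighbour if the pair became too small
            if i + 1 < len(self.blocks) and \
               len(self.blocks[i]) + len(self.blocks[i + 1]) <= 2 * self.B // 3:
                self.blocks[i:i + 2] = [self.blocks[i] + self.blocks[i + 1]]

        def to_list(self):
            return [x for blk in self.blocks for x in blk]

    lst = BlockedList(B=4)
    for i in range(10):
        lst.insert(i, i)
    lst.delete(3)
    print(lst.to_list())   # [0, 1, 2, 4, 5, 6, 7, 8, 9]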
To sum up, a constant number of I/Os suffice to update a linked list. In
general this is the best we can hope for when updates may affect any part
of the data structure, and we want queries in an (eager) on-line fashion. In
the data structures of Section 2.1.1, updates concerned very local parts of
the data structure (the top of the stack and the ends of the queue), and we
were able to do better. Section 2.3.5 will show that a similar improvement is
possible in some cases where we can afford to wait for an answer of a query
to arrive.
Exercise 2.5. Show how to implement concatenation of two lists and split-
ting of a list into two parts in O(1) I/Os.
2.2 Dictionaries
Recall that N denotes the number of keys in the dictionary, and that B keys
(with associated information) can reside in each block of external memory.
There are two basic approaches to implementing dictionaries: Search trees
and hashing. Search trees assume that there is some total ordering on the
key set. They offer the highest flexibility towards extending the dictionary to
support more types of queries. We consider search trees in Section 2.3. Hash-
ing based dictionaries, described in Section 2.4, support the basic dictionary
operations in an expected constant number of I/Os (usually one or two). Be-
fore describing these two approaches in detail, we give some applications of
external memory dictionaries.
Dictionaries can be used for simple database retrieval as in the example above.
Furthermore, they are useful components of other external memory data
structures. Two such applications are implementations of virtual memory
and robust pointers.
Virtual Memory. External memory algorithms often do allocation and
deallocation of arrays of blocks in external memory. As in internal mem-
ory this can result in problems with fragmentation and poor utilization of
external memory. For almost any given data structure it can be argued that
fragmentation can be avoided, but this is often a cumbersome task.
A general solution that gives a constant factor increase in the number
of I/Os performed is to implement virtual memory using a dictionary. The
key space is K = {1, . . . , C} × {1, . . . , L}, where C is an upper bound of the
number of arrays we will ever use and L is an upper bound on the length of
any array. We wish the ith block of array c to be returned from the dictionary
when looking up the key (c, i). In case the block has never been written to, the
key will not be present, and some standard block content may be returned.
Allocation of an array consists of choosing c ∈ {1, . . . , C} not used for any
other array (using a counter, say), and associating a linked list of length 0
with the key (c, 0). When writing to block i of array c in virtual memory, we
associate the block with the key (c, i) in the dictionary and add the number i
to the linked list of key (c, 0). For deallocation of the array we simply traverse
the linked list of (c, 0) to remove all keys associated with that array.
In case the dictionary uses O(1) I/Os per operation (amortized expected)
the overhead of virtual memory accesses is expected to be a constant factor.
Note that the cost of allocation is constant and that the amortized cost of
deallocation is constant. If the dictionary uses linear space, the amount of
external memory used is bounded by a constant times the amount of virtual
memory in use.
Robust Pointers into Data Structures. Pointers into external memory
data structures pose some problems, as we saw in Section 2.1.2. It is often
2.3 B-trees
This section considers search trees in external memory. Like the hashing based
dictionaries covered in Section 2.4, search trees store a set of keys along with
associated information. Though not as efficient as hashing schemes for lookup
of keys, we will see that search trees, as in internal memory, can be used as
the basis for a wide range of efficient queries on sets (see, e.g., Chapter 6 and
Chapter 7). We use N to denote the size of the key set, and B to denote the
number of keys or pointers that fit in one block.
B-trees are a generalization of balanced binary search trees to balanced
trees of degree Θ(B) [96, 207, 416, 460]. The intuitive reason why we should
change to search trees of large degree in external memory is that we would
like to use all the information we get when reading a block to guide the search.
In a naïve implementation of binary search trees there would be no guarantee
that the nodes on a search path did not reside in distinct blocks, incurring
O(log N ) I/Os for a search. As we shall see, it is possible to do significantly
better. In this section it is assumed that B/8 is an integer greater than or
equal to 4.
The following is a modification of the original description of B-trees, with
the essential properties preserved or strengthened. In a B-tree all leaves have
the same distance to the root (the height h of the tree). The level of a B-tree
node is the distance to its descendant leaves. Rather than having a single key
in each internal node to guide searches to one of two subtrees, a B-tree node
guides searches to one of Θ(B) subtrees. In particular, the number of leaves
below a node (called its weight) decreases by a factor of Θ(B) when going
one level down the tree. We use a weight balance invariant, first described
for B-trees by Arge and Vitter [71]: Every node at level i < h has weight at
least (B/8)^i, and every node at level i ≤ h has weight at most 4(B/8)^i. As
shown in the following exercise, the weight balance invariant implies that the
degree of any non-root node is Θ(B) (this was the invariant in the original
description of B-trees [96]).
Exercise 2.7. Show that the weight balance invariant implies the following:
1. Any node has at most B/2 children.
2. The height of the B-tree is at most 1 + log_{B/8} N.
3. Any non-root node has at least B/32 children.
Note that B/2 pointers to subtrees, B/2 − 1 keys and a counter of the
number of keys in the subtree all fit in one external memory block of size B.
All keys and their associated information are stored in the leaves of the tree,
represented by a linked list containing the sorted key sequence. Note that
there may be fewer than Θ(B) elements in each block of the linked list if the
associated information takes up more space than the keys.
In a binary search tree the key in a node splits the key set into those keys
that are larger or equal and those that are smaller, and these two sets are
stored separately in the subtrees of the node. In B-trees this is generalized
as follows: In a node v storing keys k_1, . . . , k_{d_v−1} the ith subtree stores keys
k with k_{i−1} ≤ k < k_i (defining k_0 = −∞ and k_{d_v} = ∞). This means that
the information in a node suffices to determine in which subtree to continue
a search.
The worst-case number of I/Os needed for searching a B-tree equals
the worst-case height of a B-tree, found in Exercise 2.7 to be at most
1 + log_{B/8} N. Compared to an external binary search tree, we save roughly
a factor log B on the number of I/Os.
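The size of this saving is easy to make concrete. The small calculation below compares log₂ N with the height bound 1 + log_{B/8} N for two example block sizes; the numbers are illustrative.

    import math

    N = 10**9
    for B in (256, 4096):
        binary_ios = math.log2(N)              # one I/O per node in the worst case
        btree_ios = 1 + math.log(N, B / 8)     # height bound from Exercise 2.7
        print(f"B = {B:5d}: ~{binary_ios:.0f} I/Os (binary tree) "
              f"vs ~{btree_ios:.1f} I/Os (B-tree)")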
a block with a key larger than b the search is over). The number of I/Os
used for reporting Z keys from the linked list is O(Z/B), where Z/B is the
minimum number of I/Os we could hope for. The feature that the number
of I/Os used for a query depends on the size of the result is called output
sensitivity. To sum up, Z elements in a given range can be reported by a B-
tree in O(logB N + Z/B) I/Os. Many other reporting problems can be solved
within this bound.
It should be noted that there exists an optimal size (static) data struc-
ture based on hashing that performs range queries in O(1 + Z/B) I/Os [35].
However, a slight change in the query to “report the first Z keys in the range
[a; b]” makes the approach used for this result fail to have optimal output
sensitivity (in fact, this query provably has a time complexity that grows
with N [98]). Tree structures, on the other hand, tend to easily adapt to such
changes.
Insertions and deletions are performed as in binary search trees except for
the case where the weight balance invariant would be violated by doing so.
Inserting. When inserting a key x we search for x in the tree to find the
internal node that should be the parent of the leaf node for x. If the weight
constraint is not violated on the search path for x we can immediately insert
x, and a pointer to the leaf containing x and its associated information. If
the weight constraint is violated in one or more nodes, we rebalance it by
performing split operations in overweight nodes, starting from the bottom
and going up. To split a node v at level i > 0, we divide its children into
two consecutive groups, each of weight between 2(B/8)^i − 2(B/8)^{i−1} and
2(B/8)^i + 2(B/8)^{i−1}. This is possible as the maximum weight of each child is
4(B/8)^{i−1}. Node v is replaced by two nodes having these groups as children
(this requires an update of the parent node, or the creation of a new root if v
is the root). Since B/8 ≥ 4, the weight of each of these new nodes is between
(3/2)(B/8)^i and (5/2)(B/8)^i, which is Ω((B/8)^i) away from the limits.
Deleting. Deletions can be handled in a manner symmetric to insertions.
Whenever deleting a leaf would violate the lower bound on the weight of a
node v, we perform a rebalancing operation on v and a sibling w. If several
nodes become underweight we start the rebalancing at the bottom and move
up the tree.
Suppose v is an underweight node at level i, and that w is (one of) its
nearest sibling(s). In case the combined weight of v and w is less than (7/2)(B/8)^i
we fuse them into one node having all the children of v and w as children. In
case v and w were the only children of the root, this node becomes the new
root. The other case to consider is when the combined weight is more than
(7/2)(B/8)^i, but at most 5(B/8)^i (since v is underweight). In this case we make
w share some children with v by dividing all the children into two consecutive
As seen in Chapter 1 the bound of O(log_B N) I/Os for searching is the best
we can hope for if we consider algorithms that use only comparisons of keys
to guide searches. If we have a large amount of internal memory and are
willing to use it to store the top M/B nodes of the B-tree, the number of
I/Os for searches and updates drops to O(log_B(N/M)).
Exercise 2.12. How large should internal memory be to make O(log_B(N/M))
asymptotically smaller than O(log_B N)?
where internal space usage is O(M ) words and external space usage is
O(N/B) blocks of B words.
Theorem 2.14. Suppose there is a (static) dictionary for w bit keys us-
ing N O(1) blocks of memory that supports predecessor queries in t I/Os,
worst-case, using O(B) words of internal memory. Then the following bounds
hold:
1. t = Ω(min(log w/ log log w, log_{Bw} N)).
2. If w is a suitable function of N then t = Ω(min(log_B N, log N/ log log N)),
i.e., no better bound independent of w can be achieved.
Exercise 2.15. For what parameters are the upper bounds of Theorem 2.13
within a constant factor of the lower bounds of Theorem 2.14?
There are many variants of B-trees that add or enhance properties of basic
B-trees. The weight balance invariant we considered above was introduced in
the context of B-trees only recently, making it possible to associate expensive
auxiliary data structures with B-tree nodes at small amortized cost. Below we
summarize the properties of some other useful B-tree variants and extensions.
Problem 2.16. Show that, in the lower bound model of Aggarwal and Vit-
ter [17], merging two B-trees with Θ(N ) keys requires Θ(N/B) I/Os in the
worst case.
that there is a cost of a constant number of I/Os for each child – this is the
reason for making the number of children equal to the I/O-cost of reading
the buffer. Thus, flushing costs O(1/B) I/Os per operation in the buffer, and
since the depth of the tree is O(log_{M/B}(N/B)), the total cost of all flushes is
O((1/B) log_{M/B}(N/B)) I/Os per operation.
The cost of performing a rebalancing operation on a node is O(M/B)
I/Os, as we may need to flush the buffer of one of its siblings. However, the
number of rebalancing operations during N updates is O(N/M ) (see [416]),
so the total cost of rebalancing is O(N/B) I/Os.
Problem 2.17. What is the I/O complexity of operations in a “buffer tree”
of degree Q?
Optimality. It is not hard to see that the above complexities are, in a certain
sense, the best possible.
Exercise 2.19. Show that it is impossible to perform insertion and delete-
minimums in time o((1/B) log_{M/B}(N/B)) (Hint: Reduce from sorting, and use the
sorting lower bound – more information on this reduction technique can be
found in Chapter 6).
In internal memory it is in fact possible to improve the complexity of inser-
tion to constant time, while preserving O(log N ) time for delete-minimum
(see [216, Chapter 20] and [154]). It appears to be an open problem whether
it is possible to implement constant time insertions in external memory.
One way of improving the performance the priority queue described is to
provide “worst case” rather than amortized I/O bounds. Of course, it is not
possible for every operation to have a cost of less than one I/O. The best one
can hope for is that any subsequence of k operations uses O(1 + (k/B) log_{M/B}(N/B))
I/Os. Brodal and Katajainen [157] have achieved this for subsequences of
length k ≥ B. Their data structure does not support deletions.
A main open problem in external memory priority queues is the com-
plexity of the decrease-key operation (when the other operations have com-
plexity as above). Internally, this operation can be supported in constant
time (see [216, Chapter 20] and [154]), and the open problem is whether a
corresponding bound of O(1/B) I/Os per decrease-key can be achieved. The
currently best complexity is achieved by “tournament trees”, described in
Chapter 4, where decrease-key operations, as well as the other priority queue
operations, cost O((1/B) log(N/B)) I/Os.
We now consider hashing techniques, which offer the highest performance for
the basic dictionary operations. One aspect that we will not discuss here,
is how to implement appropriate classes of hash functions. We will simply
assume to have access to hash functions that behave like truly random func-
tions, independent of the sequence of dictionary operations. This means that
any hash function value h(x) is uniformly random and independent of hash
function values on elements other than x. In practice, using easily imple-
mentable “pseudorandom” hash functions that try to imitate truly random
functions, the behavior of hashing algorithms is quite close to that of this
idealized model. We refer the reader to [251] and the references therein for
more information on practical hash functions.
Several classic hashing schemes (see [460, Section 6.4] for a survey) perform
well in the expected sense in external memory. We will consider linear probing
and chaining with separate lists. These schemes need nothing but a single
hash function h in internal memory (in practice a few machine words suffice
for a good pseudorandom hash function). For both schemes the analysis is
beyond the scope of this chapter, but we provide some intuition and state
results on their performance.
Linear Probing. In external memory linear probing, a search for the key x
starts at block h(x) in a hash table, and proceeds linearly through the table
until either x is found or we encounter a block that is not full (indicating
that x is not present in the table). Insertions proceed in the same manner as
lookups, except that we insert x if we encounter a non-full block. Deletion
of a key x requires some rearrangement of the keys in the blocks scanned
when looking up x, see [460, Section 6.4] for details. A deletion leaves the
table in the state it would have been in if the deleted element had never been
inserted.
The intuitive reason that linear probing gives good average behavior is
that the pseudorandom function distributes the keys almost evenly to the
blocks. In the rare event that a block overflows, it will be unlikely that the
next block is not able to accommodate the overflow elements. More precisely,
if the load factor of our hash table is α, where 0 < α < 1 (i.e., the size of
the hash table is N/(αB) blocks), we have that the expected average number
of I/Os for a lookup is 1 + (1 − α)^{−2} · 2^{−Ω(B)} [460]. If α is bounded away
from 1 (i.e., α ≤ 1 − ε for some constant ε > 0) and if B is not too small,
the expected average is very close to 1. In fact, the asymptotic probability of
having to use k > 1 I/Os for a lookup is 2^{−Ω(B(k−1))}. In Section 2.4.4 we will
consider the problem of keeping the load factor in a certain range, shrinking
and expanding the hash table according to the size of the set.
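A sketch of block-level linear probing as just described is given below: a table of buckets of capacity B, where insertion and lookup start at bucket h(x) and walk forward until a non-full bucket is met. The hash function, table size, and load factor are illustrative, and deletions (with their rearrangement step) are omitted for brevity.

    class LinearProbingTable:
        def __init__(self, num_buckets, B):
            self.B = B
            self.buckets = [[] for _ in range(num_buckets)]

        def _h(self, x):
            return hash(x) % len(self.buckets)

        def insert(self, x):
            i = self._h(x)
            while len(self.buckets[i]) == self.B:   # skip full buckets
                i = (i + 1) % len(self.buckets)
            self.buckets[i].append(x)

        def lookup(self, x):
            i = self._h(x)
            while True:
                if x in self.buckets[i]:
                    return True
                if len(self.buckets[i]) < self.B:   # a non-full bucket ends the probe
                    return False
                i = (i + 1) % len(self.buckets)

    t = LinearProbingTable(num_buckets=100, B=8)
    for k in range(500):                # load factor 500/(100*8) = 0.625
        t.insert(k)
    print(t.lookup(123), t.lookup(9999))  # True False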
Chaining with Separate Lists. In chaining with separate lists we again
hash to a table of size approximately N/(αB) to achieve load factor α. Each
block in the hash table is the start of a linked list of keys hashing to that
block. Insertion, deletion, and lookups proceed in the obvious manner. As the
pseudorandom function distributes keys approximately evenly to the blocks,
almost all lists will consist of just a single block. In fact, the probability
that more than kB keys hash to a certain block, for k ≥ 1, is at most
e^{−αB(k/α−1)²/3} by Chernoff bounds (see, e.g., [375, Eq. 6]).
showed that such a function can be implemented using O(N log(B)/B) bits
of internal memory. (In the interest of simplicity, we ignore an extra term
that only shows up when the key set K has size 2^{B^{ω(N)}}.) If the number of
external blocks is only N/B and we want to be able to handle every possible
key set, this is also the best possible [525]. Unfortunately, the time and space
needed to evaluate Mairson’s hash functions is extremely high, and it seems
very difficult to obtain a dynamic version. The rest of this section deals with
more practical ways of implementing (dynamic) B-perfect hashing.
Extendible Hashing. A popular B-perfect hashing method that comes
close to Mairson’s bound is extendible hashing by Fagin et al. [285]. The
expected space utilization in external memory is about 69% rather than the
100% achieved by Mairson’s scheme.
Extendible hashing employs an internal structure called a directory to
determine which external block to search. The directory is an array of 2^d
pointers to external memory blocks, for some parameter d. Let h : K →
{0, 1}^r be a truly random hash function, where r ≥ d. Lookup of a key k is
performed by using h(k)_d, the function returning the d least significant bits
of h(k), to determine an entry in the directory, which in turn specifies the
external block to be searched. The parameter d is chosen to be the smallest
number for which at most B dictionary keys map to the same value under
h(k)_d. If r ≥ 3 log N, say, such a d exists with high probability. In case it
does not we simply rehash. Many pointers in the directory may point to the
same block. Specifically, if no more than B dictionary keys map to the same
value v under h_{d′}, for some d′ < d, all directory entries with indices having
v in their d′ least significant bits point to the same external memory block.
Clearly, extendible hashing provides lookups using a single I/O and con-
stant internal processing time. Analyzing its space usage is beyond the scope
of this chapter, but we mention some results. Flajolet [305] has shown that
the expected number of entries in the directory is approximately (4N/B) · N^{1/B}. If
B is just moderately large, this is close to optimal, e.g., in case B ≥ log N
the number of bits used is less than 8N log(N )/B. In comparison, the opti-
mal space bound for perfect hashing to exactly N/B external memory blocks
is (1/2) N log(B)/B + Θ(N/B) bits. The expected external space usage can be
shown to be around N/(B ln 2) blocks, which means that about 69% of the
space is utilized [285, 545].
Extendible hashing is named after the way in which it adapts to changes
of the key set. The level of a block is the largest d′ ≤ d for which all its keys
map to the same value under h_{d′}. Whenever a block at level d′ has run full,
it is split into two blocks at level d′ + 1 using h_{d′+1}. In case d′ = d we first
need to double the size of the directory. Conversely, if two blocks at level d′,
with keys having the same function value under h_{d′−1}, contain less than B
keys in total, these blocks are merged. If no blocks are left at level d, the size
of the directory is halved.
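The sketch below captures the mechanics described above: a directory of 2^d slots indexed by the d least significant hash bits, buckets of capacity B, a split of an overfull bucket on its next bit, and a doubling of the directory when the overfull bucket already uses all d bits. Merging, directory halving, and the rehash fallback for pathological hash values are omitted, and the hash function is illustrative.

    class ExtendibleHash:
        def __init__(self, B):
            self.B = B
            self.d = 0
            bucket = {"level": 0, "keys": []}
            self.directory = [bucket]                # 2^d entries

        def _index(self, key):
            return hash(key) & ((1 << self.d) - 1)   # d least significant bits

        def lookup(self, key):
            return key in self.directory[self._index(key)]["keys"]

        def insert(self, key):
            if self.lookup(key):
                return
            bucket = self.directory[self._index(key)]
            bucket["keys"].append(key)
            while len(bucket["keys"]) > self.B:
                if bucket["level"] == self.d:        # no spare bit: double directory
                    self.directory = self.directory * 2
                    self.d += 1
                self._split(bucket)
                bucket = self.directory[self._index(key)]

        def _split(self, bucket):
            level = bucket["level"] + 1
            b0 = {"level": level, "keys": []}
            b1 = {"level": level, "keys": []}
            for k in bucket["keys"]:                 # distribute by the new bit
                target = b1 if (hash(k) >> (level - 1)) & 1 else b0
                target["keys"].append(k)
            for i, b in enumerate(self.directory):   # redirect the affected slots
                if b is bucket:
                    self.directory[i] = b1 if (i >> (level - 1)) & 1 else b0

    eh = ExtendibleHash(B=4)
    for k in range(50):
        eh.insert(k)
    print(eh.d, eh.lookup(7), eh.lookup(77))   # 4 True False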
tie-breaking rule were discovered by Vöcking [756]). It can be shown that the
probability of an insertion causing an overflow is N/2^{2^{Ω((1−α)B)}} [115]. That
is, the failure probability decreases doubly exponentially with the average
number of free spaces in each block. The constant factor in the Ω is larger than
1, and it has been shown experimentally that even for very small amounts of
free space in each block, the probability of an overflow (causing a rehash) is
very small [159]. The effect of deletions in two-way chaining does not appear
to have been analyzed.
In the above we several times assumed that the load factor of our hash table
is at most some constant α < 1. Of course, to keep the load factor below
α we may have to increase the size of the hash table employed when the
size of the set increases. On the other hand we wish to keep α above a
certain threshold to have a good external memory utilization, so shrinking
the hash table is also occasionally necessary. The challenge is to rehash to
the new table without having to do an expensive reorganization of the old
hash table. Simply choosing a new hash function would require a random
permutation of the keys, a task shown in [17] to require Θ((N/B) log_{M/B}(N/B)) I/Os.
When N = (M/B)^{O(B)}, i.e., when N is not extremely large, this is O(N) I/Os.
Since one usually has Θ(N ) updates between two rehashes, the reorganization
cost can be amortized over the cost of updates. However, more efficient ways
of reorganizing the hash table are important in practice to keep constant
factors down. The basic idea is to introduce more “gentle” ways of changing
the hash function.
Linear Hashing. Litwin [508] proposed a way of gradually increasing and
decreasing the range of hash functions with the size of the set. The basic
idea for hashing to a range of size r is to extract b = log r bits from a
“mother” hash function. If the extracted bits encode an integer k less than r,
this is used as the hash value. Otherwise the hash function value k − 2^{b−1} is
returned. When expanding the size of the hash table by one block (increasing
r by one), all keys that may hash to the new block r + 1 previously hashed to
block r + 1 − 2^{b−1}. This makes it easy to update the hash table. Decreasing
the size of the hash table is done in a symmetric manner.
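The address computation of linear hashing fits in a few lines; the sketch below takes b = ⌈log₂ r⌉ bits of a mother hash value (one way to read the "b = log r bits" above) and folds addresses that fall beyond the current range r back by 2^{b−1}. Everything here is illustrative.

    def linear_hash_address(mother_hash, r):
        """Map a mother hash value to a block address in {0, ..., r-1}."""
        b = max(1, (r - 1).bit_length())    # number of bits currently extracted
        k = mother_hash & ((1 << b) - 1)    # b least significant bits
        return k if k < r else k - (1 << (b - 1))

    # With r = 6 (not a power of two), addresses 6 and 7 fold back to 2 and 3:
    print([linear_hash_address(h, 6) for h in range(8)])  # [0,1,2,3,4,5,2,3]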
The main problem with linear hashing is that when r is not a power
of 2, the keys are not mapped uniformly to the range. For example, if r is
1.5 times a power of two, the expected number of collisions between keys is
12.5% higher than that expected for a uniform hash function. Even worse,
the expected maximum number of keys hashing to a single bucket can be up
to twice as high as in the uniform case. Some attempts have been made to
alleviate these problems, but all have the property that the hash functions
used are not completely uniform, see [497] and the references therein. Another
problem lies in the analysis, which for many hashing schemes is complicated
Exercise 2.21. Show that when inserting N elements, each element will be
part of a rebuilding O(B logB N ) times.
Some data structures for sets support deletions, but do not recover the space
occupied by deleted elements. For example, deletions in a static dictionary can
be done by marking deleted elements (this is called a weak delete). A general
technique for keeping the number of deleted elements at some fraction of
the total number of elements is global rebuilding: In a data structure of N
elements (present and deleted), whenever αN elements have been deleted,
for some constant α > 0, the entire data structure is rebuilt. The cost of
rebuilding is at most a constant factor higher than the cost of inserting αN
elements, so the amortized cost of global rebuilding can be charged to the
insertions of the deleted elements.
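The following minimal sketch illustrates global rebuilding wrapped around a dictionary that supports only weak deletes; the threshold α and all class names are illustrative assumptions, and the inner structure is an ordinary in-memory dictionary rather than an external memory one.

```python
ALPHA = 0.5   # rebuild when this fraction of the stored elements is marked deleted

class WeakDeleteDict:
    """Stand-in for a (static) dictionary supporting only weak deletes."""
    def __init__(self, items=()):
        self.data = {k: [v, False] for k, v in items}   # value, deleted-flag
    def insert(self, k, v):
        self.data[k] = [v, False]
    def weak_delete(self, k):
        if k in self.data:
            self.data[k][1] = True          # mark, but do not reclaim space
    def lookup(self, k):
        entry = self.data.get(k)
        return None if entry is None or entry[1] else entry[0]
    def live_items(self):
        return [(k, v) for k, (v, dead) in self.data.items() if not dead]

class GloballyRebuiltDict:
    def __init__(self):
        self.inner = WeakDeleteDict()
        self.deleted = 0
    def insert(self, k, v):
        self.inner.insert(k, v)
    def delete(self, k):
        self.inner.weak_delete(k)
        self.deleted += 1
        # When alpha*N elements (present and deleted) are marked, rebuild from scratch.
        if self.deleted >= ALPHA * max(1, len(self.inner.data)):
            self.inner = WeakDeleteDict(self.inner.live_items())
            self.deleted = 0
    def lookup(self, k):
        return self.inner.lookup(k)
```

The cost of the rebuild is charged, as described above, to the insertions of the elements that were deleted since the last rebuild.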
Exercise 2.22. Discuss pros and cons of using global rebuilding for B-trees
instead of the deletion method described in Section 2.3.2.
2.6 Summary
This chapter has surveyed some of the most important external memory data
structures for sets and lists: Elementary abstract data structures (queues,
stacks, linked lists), B-trees, buffer trees (including their use for priority
queues), and hashing based dictionaries. Along the way, several important
design principles for memory hierarchy aware algorithms and data structures
have been touched upon: Using buffers, blocking and locality, making use
of internal memory, output sensitivity, data structures for batched dynamic
problems, the logarithmic method, and global rebuilding. In the following
chapters of this volume, the reader who wants to know more can find a wealth
of information on virtually all aspects of algorithms and data structures for
memory hierarchies.
Since the data structure problems discussed in this chapter are fundamental,
they are well studied. Some problems have resisted efforts to achieve external
memory results “equally good” as the corresponding internal memory results.
In particular, supporting fast insertion and decrease-key in external memory
priority queues (or showing that this is not possible) has remained a
challenging open research problem.
Acknowledgements. The surveys by Arge [55], Enbody and Du [280], and
Vitter [753, 754] were a big help in writing this chapter. I would also like to
acknowledge the help of Gerth Stølting Brodal, Ulrich Meyer, Anna Östlin,
Jan Vahrenhold, Berthold Vöcking, and last but not least the participants of
the GI-Dagstuhl-Forschungsseminar “Algorithms for Memory Hierarchies”.
3. A Survey of Techniques for Designing
I/O-Efficient Algorithms∗
Anil Maheshwari and Norbert Zeh
3.1 Introduction
3.2.1 Scanning
Fig. 3.1. Merging two sorted sequences. (a) The initial situation: The two lists are
stored on disk. Two empty input buffers and an empty output buffer have been
allocated in main memory. The output sequence does not contain any data yet.
(b) The first block from each input sequence has been loaded into main memory.
(c) The first B elements have been moved from the input buffers to the output
buffer.
Fig. 3.1. (continued) (d) The contents of the output buffer are flushed to the
output stream to make room for more data to be moved to the output buffer.
(e) After moving elements 5, 7, and 8 to the output buffer, the input buffer for the
first stream does not contain any more data items. Hence, the next block is read
from the first input stream into the input buffer.
array A′ from array A takes O(N/B) I/Os rather than Θ(N) I/Os, as would
be required to solve this task using direct disk accesses.
In our example we apply the scanning paradigm to a problem with one
input stream A and one output stream A′. It is easy to apply the above
buffering technique to a problem with q input streams S1, . . . , Sq and r out-
put streams S′1, . . . , S′r, as long as there is enough room to keep an input
buffer of size B per input stream Si and an output buffer of size B per
output stream S′j in internal memory. More precisely, q + r cannot be more
than M/B. Under this assumption the algorithm still takes O(N/B) I/Os,
where N = Σ_{i=1}^{q} |Si| + Σ_{j=1}^{r} |S′j|. Note, however, that this analysis includes
only the number of I/Os required to read the elements from the input streams
and write the output to the output streams. It does not include the I/O-
complexity of the actual computation of the output elements from the in-
put elements. One way to guarantee that the I/O-complexity of the whole
algorithm, including all computation, is O(N/B) is to ensure that only the
M −(q +r)B elements most recently read from the input streams are required
for the computation of the next output element, or the required information
about all elements read from the input streams can be maintained succinctly
in M − (q + r)B space. If this can be guaranteed, the computation of all
output elements from the read input elements can be carried out in main
memory and thus does not cause any I/O-operations to be performed.
An important example where the scanning paradigm is applied to more
than one input stream is the merging of k sorted streams to produce a single
sorted output stream (see Fig. 3.1). This procedure is applied repeatedly with
a parameter of k = 2 in the classical internal memory MergeSort algorithm.
The I/O-efficient MergeSort algorithm discussed in the next section takes
advantage of the fact that up to k = M/B streams can be merged in a linear
number of I/Os, in order to decrease the number of recursive merge steps
required to produce a single sorted output stream.
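As a concrete illustration, the following sketch merges k sorted runs using a priority queue (the internal memory refinement described in Section 3.2.2 below); plain Python iterables stand in for the buffered on-disk streams, so only the logic, not the I/O behaviour, is modelled.

```python
import heapq

def merge_runs(runs):
    """Merge k sorted iterables into one sorted output sequence."""
    heap = []
    iters = [iter(r) for r in runs]
    for i, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))      # smallest element of each run
    while heap:
        x, i = heapq.heappop(heap)                # DeleteMin
        yield x                                   # move x to the output run
        nxt = next(iters[i], None)                # read the next element of run i
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))        # Insert

# e.g. list(merge_runs([[2, 4, 7, 8], [1, 3, 5, 11]])) == [1, 2, 3, 4, 5, 7, 8, 11]
```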
3.2.2 Sorting
of two runs is done in internal memory. When merging two runs, choosing
the next element to be moved to the output run involves a single compar-
ison. When merging k > 2 runs, it becomes computationally too expensive
to find the minimum of elements x1 , . . . , xk in O(k) time because then the
running time of the merge phase would be O(kN logk (N/B)). In order to
achieve optimal running time in internal memory as well, the minimum of
elements x1 , . . . , xk has to be found in O(log k) time. This can be achieved by
maintaining the smallest elements x1 , . . . , xk , one from each run, in a priority
queue. The next element to be moved to the output run is the smallest in the
priority queue and can hence be retrieved using a DeleteMin operation. Let
the retrieved element be xi ∈ Si . Then after moving xi to the output run,
the next element is read from Si and inserted into the priority queue, which
guarantees that again the smallest unprocessed element from every run is
stored in the priority queue. This process is repeated until all elements have
been moved to the output run. The amount of space used by the priority
queue is O(k) = O(M/B), so that the priority queue can be maintained in
main memory. Moving one element to the output run involves the execution
of one DeleteMin and one Insert operation on the priority queue, which
takes O(log k) time. Hence, the total running time of the MergeSort al-
gorithm is O(N log M + (N log k) logk (N/B)) = O(N log N ). We summarize
the discussion in the following theorem.
performing distribution sort on multiple disks have also been proposed, in-
cluding BalanceSort [586], sorting using the buffer tree [52], and algorithms
obtained by simulating bulk-synchronous parallel sorting algorithms [244].
The reader may refer to these references for details.
For merge sort, it is required that each iteration in the merging phase
is carried out in O(N/(DB)) I/Os. In particular, each read operation must
bring Ω(D) blocks of data into main memory, and each write operation must
write Ω(D) blocks to disk. While the latter is easy to achieve, reading blocks
in parallel is difficult because the runs to be merged were formed in the
previous iteration without any knowledge about how they would interact with
other runs in subsequent merge operations. Nodine and Vitter [587] propose
an optimal deterministic merge sort for multiple disks. The algorithm first
performs an approximate merge phase that guarantees that no element is too
far away from its final location. In the second phase, each element is moved
to its final location. Barve et al. [92, 93] claim that their sorting algorithm
is the most practical one. Using their approach, each run is striped across
the disks, with a random starting disk. When merging runs, the next block
needed from each disk is read into main memory. If there is not sufficient
room in main memory for all the blocks to be read, then the least needed
blocks are discarded from main memory (without incurring any I/Os). They
derive asymptotic upper bounds on the expected I/O complexity of their
algorithm.
Fig. 3.2. (a) The expression tree for the expression ((4 / 2) + (2 ∗ 3)) ∗ (7 − 1).
(b) The same tree with its vertices labelled with their values.
internal vertex v with label ◦ ∈ {+, −, ∗, /}, left child x, and right child y,
val (v) = val (x) ◦ val (y). The goal is to compute the value of the root of T .
Cast in terms of the general DAG evaluation problem defined above, tree T
is a DAG whose edges are directed from children to parents, labelling φ is the
initial assignment of real numbers to the leaves of T and of operations to the
internal vertices of T , and labelling ψ is the assignment of the values val (v)
to all vertices v ∈ T . For every vertex v ∈ T , its label ψ(v) = val (v) is com-
puted from the labels ψ(x) = val (x) and ψ(y) = val (y) of its in-neighbors
(children) and its own label φ(v) ∈ {+, −, ∗, /}.
In order to be able to evaluate a DAG G I/O-efficiently, two assumptions
have to be satisfied: (1) The vertices of G have to be stored in topologically
sorted order. That is, for every edge (v, w) ∈ G, vertex v precedes vertex w.
(2) Label ψ(v) has to be computable from labels φ(v) and ψ(u1 ), . . . , ψ(uk )
in O(sort(k)) I/Os. The second condition is trivially satisfied if every vertex
of G has in-degree no more than M .
Given these two assumptions, time-forward processing visits the vertices
of G in topologically sorted order to compute labelling ψ. Visiting the vertices
of G in this order guarantees that for every vertex v ∈ G, its in-neighbors are
evaluated before v is evaluated. Thus, if these in-neighbors “send” their labels
ψ(u1 ), . . . , ψ(uk ) to v, v has these labels and its own label φ(v) at its disposal
to compute ψ(v). After computing ψ(v), v sends its own label ψ(v) “forward
in time” to its out-neighbors, which guarantees that these out-neighbors have
ψ(v) at their disposal when it is their turn to be evaluated.
The implementation of this technique due to Arge [52] is simple and ele-
gant. The “sending” of information is realized using a priority queue Q (see
Chapter 2 for a discussion of priority queues). When a vertex v wants to send
its label ψ(v) to another vertex w, it inserts ψ(v) into priority queue Q and
gives it priority w. When vertex w is evaluated, it removes all entries with
priority w from Q. Since every in-neighbor of w sends its label to w by queu-
ing it with priority w, this provides w with the required inputs. Moreover,
every vertex removes its inputs from the priority queue before it is evaluated,
and all vertices with smaller numbers are evaluated before w. Thus, at the
time when w is evaluated, the entries in Q with priority w are those with
lowest priority, so that they can be removed using a sequence of DeleteMin
operations.
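The following minimal in-memory sketch mirrors this technique, assuming the vertex IDs coincide with their topological numbers; Python's heapq stands in for the external buffer-tree priority queue, so only the access pattern, not the I/O behaviour, is captured, and all names are illustrative.

```python
import heapq

def time_forward_evaluate(vertices, out_edges, phi, combine):
    """vertices: IDs in topologically sorted order; out_edges[v]: out-neighbors of v;
    phi[v]: initial label; combine(phi_v, inputs): computes psi(v) from phi(v) and
    the labels received from the in-neighbors."""
    Q = []                                    # priority queue of (receiver, tie-break, value)
    counter = 0
    psi = {}
    for v in vertices:
        inputs = []
        while Q and Q[0][0] == v:             # DeleteMin: collect everything sent to v
            inputs.append(heapq.heappop(Q)[2])
        psi[v] = combine(phi[v], inputs)      # evaluate v
        for w in out_edges[v]:                # send psi(v) "forward in time" to w
            heapq.heappush(Q, (w, counter, psi[v]))
            counter += 1
    return psi

# e.g. summing the inputs along the edges 1 -> 3 and 2 -> 3:
# time_forward_evaluate([1, 2, 3], {1: [3], 2: [3], 3: []}, {1: 4, 2: 2, 3: 0},
#                       lambda p, ins: p + sum(ins))   ->   {1: 4, 2: 2, 3: 6}
```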
Using the buffer tree of Arge [52] to implement priority queue Q, In-
sert and DeleteMin operations on Q can be performed in O((1/B)·
logM/B (|E|/B)) I/Os amortized because priority queue Q never holds more
than |E| entries. The total number of priority queue operations performed by
the algorithm is O(|E|), one Insert and one DeleteMin operation per edge.
Hence, all updates of priority queue Q can be processed in O(sort(|E|)) I/Os.
The computation of labels ψ(v) from labels φ(v) and ψ(u1 ), . . . , ψ(uk ), for
all vertices v ∈ G, can also be carried out in O(sort(|E|)) I/Os, using the
above assumption that this computation takes O(sort(k)) I/Os for a single
vertex v. Hence, we obtain the following result.
Theorem 3.3. [52, 192] Given a DAG G = (V, E) whose vertices are
stored in topologically sorted order, graph G can be evaluated in O(sort(|V | +
|E|)) I/Os, provided that the computation of the label of every vertex v ∈ G
can be carried out in O(sort(deg− (v))) I/Os, where deg− (v) is the in-degree
of vertex v.
Theorem 3.4. [775] Every graph problem P that can be solved by a pre-
sortable local single-pass vertex labelling algorithm can be solved in O(sort(|V |+
|E|)) I/Os.
Theorem 3.5. [775] Given an undirected graph G = (V, E), a maximal in-
dependent set of G can be found in O(sort(|V | + |E|)) I/Os and linear space.
color c(v) ∈ {1, . . . , ∆+1} to vertex v that has not been assigned to any neigh-
bor of v. The algorithm is presortable and single-pass for the same reasons as
the maximal independent set algorithm. The algorithm is local because the
color of v can be determined as follows: Sort the colors c(u1 ), . . . , c(uk ) of v’s
in-neighbors u1 , . . . , uk . Then scan this list and assign the first color not in
this list to v. This takes O(sort(k)) I/Os.
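A small sketch of this "first free color" step, with an in-memory sort standing in for the O(sort(k)) I/O-efficient sort:

```python
def first_free_color(neighbor_colors):
    """Smallest color in {1, 2, ...} not occurring among the in-neighbor colors."""
    c = 1
    for x in sorted(neighbor_colors):   # sort, then scan
        if x == c:
            c += 1
        elif x > c:
            break
    return c

# e.g. first_free_color([1, 2, 4]) == 3
```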
List ranking and the Euler tour technique are two techniques that have been
applied successfully in the design of PRAM algorithms for labelling problems
on lists and rooted trees and problems that can be reduced efficiently to
one of these problems. Given the similarity of the issues to be addressed in
parallel and external memory algorithms, it is not surprising that the same
two techniques can be applied in I/O-efficient algorithms as well.
an adversary can easily arrange the vertices of L in a manner that forces the
internal memory algorithm to perform one I/O per visited vertex, so that
the algorithm performs Ω(N ) I/Os in total. On the other hand, the lower
bound for list ranking shown in [192] is only Ω(perm(N )). Next we sketch
a list ranking algorithm proposed in [192] that takes O(sort(N )) I/Os and
thereby closes the gap between the lower and the upper bound.
We make the simplifying assumption that multiplication over X is as-
sociative. If this is not the case, we determine the distance of every vertex
from the head of L, sort the vertices of L by increasing distances, and then
compute the prefix product using the internal memory algorithm. After ar-
ranging the vertices by increasing distances from the head of L, the internal
memory algorithm takes O(scan(N )) I/Os. Hence, the whole procedure still
takes O(sort(N )) I/Os, and the associativity assumption is not a restriction.
Given that multiplication over X is associative, the algorithm of [192]
uses graph contraction to rank list L as follows: First an independent set I
of L is found so that |I| = Ω(N ). Then the elements in I are removed from L.
That is, for every element x ∈ I with predecessor y and successor z in L, the
successor pointer of y is updated to succ(y) = z. The label of x is multiplied
with the label of z, and the result is assigned to z as its new label in the
compressed list. It is not hard to see that the weighted ranks of the elements
in L−I remain the same after adjusting the labels in this manner. Hence, their
ranks can be computed by applying the list ranking algorithm recursively to
the compressed list. Once the ranks of all elements in L − I are known, the
ranks of the elements in I are computed by multiplying their labels with the
ranks of their predecessors in L.
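The following in-memory sketch mirrors the contraction recursion just described, with addition playing the role of the associative multiplication and with a simple deterministic independent set (the elements at odd positions along the list) standing in for the 3-coloring or maximal-independent-set computation; it illustrates the recursion, not the sorting- and scanning-based I/O-efficient implementation.

```python
def list_rank(head, succ, weight):
    """Rank a list given by succ[] pointers: rank(v) = sum of weights from head to v."""
    if len(succ) <= 2:                       # small lists: rank by direct traversal
        ranks, r, v = {}, 0, head
        while v is not None:
            r += weight[v]
            ranks[v] = r
            v = succ[v]
        return ranks
    # One traversal records the order; the elements at odd positions form an
    # independent set I of size >= N/3 that never contains the head.
    order, v = [], head
    while v is not None:
        order.append(v)
        v = succ[v]
    I = set(order[1::2])
    # Contract: remove every x in I, add its weight to its successor z, and
    # redirect succ(pred(x)) to z.
    succ2, weight2 = dict(succ), dict(weight)
    for x in I:
        z = succ[x]
        if z is not None:
            weight2[z] = weight2[z] + weight[x]
        del succ2[x]
        del weight2[x]
    for y in order:
        if y not in I and succ[y] in I:
            succ2[y] = succ[succ[y]]
    ranks = list_rank(head, succ2, weight2)  # recurse on the compressed list
    # Ranks of the removed elements: rank of their predecessor plus their own weight.
    for i, x in enumerate(order):
        if x in I:
            ranks[x] = ranks[order[i - 1]] + weight[x]
    return ranks

# e.g. with succ = {'a': 'b', 'b': 'c', 'c': None} and unit weights,
# list_rank('a', succ, {'a': 1, 'b': 1, 'c': 1}) == {'a': 1, 'b': 2, 'c': 3}
```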
If the algorithm excluding the recursive invocation on the compressed list
takes O(sort(N )) I/Os, the total I/O-complexity of the algorithm is given by
the recurrence I(N ) = I(cN )+O(sort(N )), for some constant 0 < c < 1. The
solution of this recurrence is O(sort(N )). Hence, we have to argue that every
step, except the recursive invocation, can be carried out in O(sort(N )) I/Os.
Given independent set I, it suffices to sort the vertices in I by their suc-
cessors and the vertices in L − I by their own IDs, and then scan the resulting
two sorted lists to update the weights of the successors of all elements in I.
The successor pointers of the predecessors of all elements in I can be updated
in the same manner. In particular, it suffices to sort the vertices in L − I by
their successors and the vertices in I by their own IDs, and then scan the two
sorted lists to copy the successor pointer from each vertex in I to its prede-
cessor. Thus, the construction of the compressed list takes O(sort(N )) I/Os,
once set I is given.
In order to compute the independent set I, Chiang et al. [192] apply
a 3-coloring procedure for lists, which applies time-forward processing to
“monotone” sublists of L and takes O(sort(N )) I/Os; the largest monochro-
matic set is chosen to be set I. Using the maximal independent set algorithm
of Section 3.5.1, a large independent set I can be obtained more directly in
the same number of I/Os because a maximal independent set of a list has
size at least N/3. Thus, we have the following result.
Theorem 3.7. [192] A list of length N can be ranked in O(sort(N )) I/Os.
List ranking alone is of very limited use. However, combined with the
Euler tour technique described in the next section, it becomes a very powerful
tool for solving problems on trees that can be expressed as functions over a
traversal of the tree or problems on general graphs that can be expressed in
terms of a traversal of a spanning tree of the graph. An important application
is the rooting of an undirected tree T , which is the process of directing all
edges of T from parents to children after choosing one vertex of T as the root.
Given a rooted tree T (i.e., one where all edges are directed from parents to
children), the Euler tour technique and list ranking can be used to compute
a preorder or postorder numbering of the vertices of T , or the sizes of the
subtrees rooted at the vertices of T . Such labellings are used in many classical
graph algorithms, so that the ability to compute them is a first step towards
solving more complicated graph problems.
minimize the storage blow-up and at the same time minimize the number of
page faults incurred by a path traversal. Often there is a trade-off. That is, no
blocking manages to minimize both performance measures at the same time.
In this section we restrict our attention to graph layouts with constant storage
blow-up and bound the worst-case number of page faults achievable by these
layouts using an appropriate paging algorithm. Throughout this section we
denote the length of the traversed path by L. The traversal of such a path
requires at least L/B I/Os in any graph because at most B vertices can be
brought into main memory in a single I/O-operation.
The graphs we consider include lists, trees, grids and planar graphs. The
blocking for planar graphs generalizes to any class of graphs with small sep-
arators. The results presented here are described in detail in the papers of
Nodine et al. [585], Hutchinson et al. [419], and Agarwal et al. [7].
Blocking Lists. The natural approach for blocking a list is to store the
vertices of the list in an array, sorted in their order of appearance along the
list. The storage blow-up of this blocking is one (i.e., there is no blow-up
at all). Since every vertex is stored exactly once in the array, the paging
algorithm has no choice about the block to be brought into main memory
when a vertex is visited. Still, if the traversed path is simple (i.e., travels
along the list in only one direction), the traversal of a path of length L incurs
only L/B page faults. To see this, assume w.l.o.g. that the path traverses
the list in forward direction, i.e., the vertices are visited in the same order as
they are stored in the array, and consider a vertex v in the path that causes
a page fault. Then v is the first vertex in the block that is brought into main
memory, and the B − 1 vertices succeeding v in the direction of the traversal
are stored in the same block. Hence, the traversal of any simple path causes
one page fault every B steps along the path.
If the traversed path is not simple, there are several alternatives. Assuming
that M ≥ 2B, the same layout as for simple paths can be used; but the
paging algorithm has to be changed somewhat. In particular, when a page
fault occurs at a vertex v, the paging algorithm has to make sure that the
block brought into main memory does not replace the block containing the
vertex u visited just before v. Using this strategy, it is again guaranteed that
after every page fault, at least B − 1 steps are required before the next page
fault occurs. Indeed, the block containing vertex v contains all vertices that
can be reached from v in B − 1 steps by continuing the traversal in the same
direction, and the block containing vertex u contains all vertices that can
be reached from v in B steps by continuing the traversal in the opposite
direction. Hence, traversing a path of length L incurs at most L/B page
faults.
In the pathological situation that M = B (i.e., there is room for only one
block in main memory) and given the layout described above, an adversary
can construct a path whose traversal causes a page fault at every step. In
particular, the adversary chooses two adjacent vertices v and w that are in
Fig. 3.3. A layout of a list on a disk with block size B = 4. The storage blow-up
of the layout is two.
Fig. 3.4. A blocking of a binary tree with block size 7. The subtrees in the first
partition are outlined with dashed lines. The subtrees in the second partition are
outlined with solid lines.
shown in Fig. 3.3, we choose one vertex r of T as the root and construct two
partitions of T into layers of height logd B (see Fig. 3.4). In the first partition,
the i-th layer contains all vertices at distance between (i − 1) logd B and
i logd B − 1 from r. In the second partition, the i-th layer contains all vertices
at distance between (i − 1/2) logd B and (i + 1/2) logd B − 1 from r. Each
layer in both partitions consists of subtrees of size at most B, so that each
subtree can be stored in a block. Moreover, small subtrees can be packed into
blocks so that no block is less than half full. Hence, both partitions together
use at most 4N/B blocks, and the storage blow-up is at most four.
The paging algorithm now alternates between the two partitions similar to
the above paging algorithm for lists. Consider the traversal of a path, and let
v be a vertex that causes a page fault. Assume that the tree currently held in
main memory is from the first partition. Then v is the root or a leaf of a tree
in the first partition. Hence, the tree in the second partition that contains v
contains all vertices that can be reached from v in (logd B)/2 − 1 steps. Thus,
by loading this block into main memory, the algorithm guarantees that the
next page fault occurs after at least (logd B)/2 − 1 steps, and traversing a
path of length L causes at most 2L/(logd B) page faults.
If all traversed paths are restricted to travel away from the root of T ,
the storage blow-up can be reduced to two, and the number of page faults
can be reduced to L/ logd B. To see this, observe that only the first of the
above partitions is needed, and for any traversed path, the vertices causing
page faults are the roots of subtrees in the partition. After loading the block
containing that root into main memory, logd B − 1 steps are necessary in
order to reach a leaf of the subtree, and the next page fault occurs after
logd B steps. For traversals towards the root, Hutchinson et al. [419] show
that using O(N/B) disk blocks, a page fault occurs every Ω(B) steps, so that
a path of length L can be traversed in O(L/B) I/Os.
Frederickson shows that for every planar graph G, there exists a set S of
O(N/√B) vertices so that no connected component of G − S has size more
than B. Based on this result, the following graph representation can be used
to achieve the above result. First ensure that every connected component
of G − S is stored in a single block and pack small connected components
into blocks so that every block is at least half full. This representation of
G − S uses at most 2N/B disk blocks. The second part of the blocking
consists of the (logd B)/2-neighborhoods of the vertices in S. That is, for
every vertex v ∈ S, the vertices reachable from v in at most (logd B)/2 steps
are stored in a single block. These vertices fit into a single block because at
most d^{(logd B)/2} = √B vertices can be reached in that many steps from v.
Packing these neighborhoods into blocks so that every block is at least half
full, this second part of the blocking uses O(√B · |S|/B) = O(N/B) blocks.
Hence, the storage blow-up is O(1).
Now consider the exploration of an arbitrary path in G. Let v be a vertex
that causes a page fault. If v ∈ S, the paging algorithm brings the block con-
taining the (logd B)/2-neighborhood of v into main memory. This guarantees
that at least (logd B)/2 steps along the path are required before the next
page fault occurs. If v ∉ S, then v ∈ G − S, and the paging algorithm brings
the block containing the connected component of G − S that contains v into
main memory. As long as the path stays inside this connected component, no
further page faults occur. When the next page fault occurs, it has to happen
at a vertex w ∈ S. Hence, the paging algorithm brings the block containing
the neighborhood of w into main memory, and at least (logd B)/2 steps are
required before the next page fault occurs. Thus, at most two page faults oc-
cur every (logd B)/2 steps, and traversing a path of length L incurs at most
4L/(logd B) page faults. This is summarized in the following theorem.
3.8 Remarks
4. Elementary Graph Algorithms in External Memory
Irit Katriel and Ulrich Meyer
4.1 Introduction
can be found from the current node then DFS backtracks to the most recently
visited node with unvisited neighbor(s) and continues there. Similar to BFS,
DFS has proved to be a useful tool, especially in artificial intelligence [177].
Another well-known application of DFS is in the linear-time algorithm for
finding strongly connected components [713].
Graph connectivity problems include Connected Components (CC), Bi-
connected Components (BCC) and Minimum Spanning Forest (MST/MSF).
In CC we are given a graph G = (V, E) and we are to find and enumerate
maximal subsets of the nodes of the graph in which there is a path between
every two nodes. In BCC, two nodes are in the same subset iff there are
two edge-disjoint paths connecting them. In MST/MSF the objective is to
find a spanning tree of G (spanning forest if G is not connected) with a
minimum total edge weight. Both problems are central in network design;
the obvious applications are checking whether a communications network is
connected or designing a minimum cost network. Other applications for CC
include clustering, e.g., in computational biology [386] and MST can be used
to approximate the traveling salesman problem within a factor of 1.5 [201].
We use the standard model of external memory computation [755]:
There is a main memory of size M and an external memory consisting of
D disks. Data is moved in blocks of size B consecutive words. An I/O-
operation can move up to D blocks, one from each disk. Further details
about models for memory hierarchies can be found in Chapter 1. We will
usually describe the algorithms under the assumption D = 1. In the fi-
nal results, however, we will provide the I/O-bounds for general D ≥ 1 as
well. Furthermore, we shall frequently use the following notational short-
cuts: scan(x) := O(x/(D · B)), sort(x) := O(x/(D · B) · logM/B (x/B)), and
perm(x) := O(min{x/D, sort(x)}).
Organization of the Chapter. We discuss external-memory algorithms for
all the problems listed above. In Sections 4.2 – 4.7 we cover graph traversal
problems (BFS, DFS, SSSP) and Sections 4.8 – 4.13 provide algorithms for
graph connectivity problems (CC, BCC, MSF).
next vertex to be visited are kept in some data-structure Q (a queue for BFS,
a stack for DFS, and a priority-queue for SSSP). After a vertex v is extracted
from Q, the adjacency list of v, i.e., the set of neighbors of v in G, is examined
in order to update Q: unvisited neighboring nodes are inserted into Q; the
priorities of nodes already in Q may be updated.
The Key Problems. The short description above already contains the main
difficulties for I/O-efficient graph-traversal algorithms:
(a) Unstructured indexed access to adjacency lists.
(b) Remembering visited nodes.
(c) (The lack of) Decrease Key operations in external priority-queues.
Whether (a) is problematic or not depends on the sizes of the adjacency
lists; if a list contains k edges then it takes Θ(1 + k/B) I/Os to retrieve all
its edges. That is fine if k = Ω(B), but wasteful if k = O(1). In spite of
intensive research, so far there is no general solution for (a) on sparse graphs:
unless the input is known to have special properties (for example planarity),
virtually all EM graph-traversal algorithms require Θ(|V |) I/Os to access
adjacency lists. Hence, we will mainly focus on methods to avoid spending
one I/O for each edge on general graphs1 . However, there is recent progress
for BFS on arbitrary undirected graphs [542]; e.g., if |E| = O(|V|), the new
algorithm requires just O(|V|/√B + sort(|V|)) I/Os. While this is a major
step forward for BFS on undirected graphs, it is currently unclear whether
similar results can be achieved for undirected DFS/SSSP or BFS/DFS/SSSP
on general directed graphs.
Problem (b) can be partially overcome by solving the graph problems
in phases [192]: a dictionary DI of maximum capacity |DI| < M is kept
in internal memory; DI serves to remember visited nodes. Whenever the
capacity of DI is exhausted, the algorithms make a pass through the external
graph representation: all edges pointing to visited nodes are discarded, and
the remaining edges are compacted into new adjacency lists. Then DI is
emptied, and a new phase starts by visiting the next element of Q. This
phase-approach explored in [192] is most efficient if the quotient |V |/|DI| is
small2 ; O(|V |/|DI| · scan(|V | + |E|)) I/Os are needed in total to perform
all graph compactions. Additionally, O(|V | + |E|) operations are performed
on Q.
As for SSSP, problem (c) is less severe if (b) is resolved by the phase-
approach: instead of actually performing Decrease Key operations, several
priorities may be kept for each node in the external priority-queue; after a
node v is dequeued for the first time (with the smallest key) any further
appearance of v in Q will be ignored. In order to make this work, superfluous
1 In contrast, the chapter by Toma and Zeh in this volume (Chapter 5) reviews
improved algorithms for special graph classes such as planar graphs.
2 The chapters by Stefan Edelkamp (Chapter 11) and Rasmus Pagh (Chapter 2)
in this book provide more details about space-efficient data-structures.
elements still kept in the EM data structure of Q are marked obsolete right
before DI is emptied at the end of a phase; the marking can be done by
scanning Q.
Plugging-in the I/O-bounds for external queues, stacks, and priority-
queues as presented in Chapter 2 we obtain the following results:
BFS, DFS: O(|V| + (|V|/M) · scan(|V| + |E|)) I/Os
SSSP: O(|V| + (|V|/M) · scan(|V| + |E|) + sort(|E|)) I/Os
We turn to the basic BFS algorithm of Munagala and Ranade [567], MR BFS
for short. It is also used as a subroutine in more recent BFS approaches [542,
550]. Furthermore, MR BFS is applied in the deterministic CC algorithm of
[567] (which we discuss in Section 4.9).
Let L(t) denote the set of nodes in BFS level t, and let |L(t)| be the num-
ber of nodes in L(t). MR BFS builds L(t) as follows: let A(t) := N (L(t − 1))
Fig. 4.1. A phase in the BFS algorithm of Munagala and Ranade [567]. Level L(t)
is composed out of the disjoint neighbor vertices of level L(t − 1) excluding those
vertices already existing in either L(t − 2) or L(t − 1).
The Fast BFS algorithm of Mehlhorn and Meyer [542] refines the approach
of Munagala and Ranade [567]. It trades-off unstructured I/Os with increas-
ing the number of iterations in which an edge may be involved. Fast BFS
The total number of fringe nodes and neighbor nodes sorted and scanned dur-
ing the partitioning is at most Y := O(|V | + |E|). Therefore, the partitioning
requires
expected I/Os.
After the partitioning phase each node knows the (index of the) sub-
graph to which it belongs. With a constant number of sort and scan
operations Fast BFS can reorganize the adjacency lists into the format
F0 F1 . . . Fi . . . F|S|−1 , where Fi contains the adjacency lists of the nodes in
partition Si ; an entry (v, w, S(w), fS(w) ) from the adjacency list of v ∈ Fi
stands for the edge (v, w) and provides the additional information that w be-
longs to subgraph S(w) whose subfile FS(w) starts at position fS(w) within F .
The edge entries of each Fi are lexicographically sorted. In total, F occupies
O((|V | + |E|)/B) blocks of external storage.
The BFS Phase. In the second phase the algorithm performs BFS as de-
scribed by Munagala and Ranade (Section 4.3.1) with one crucial difference:
Fast BFS maintains an external file H (= hot adjacency lists); it comprises
unused parts of subfiles Fi that contain a node in the current level L(t − 1).
Fast BFS initializes H with F0 . Thus, initially, H contains the adjacency
list of the root node s of level L(0). The nodes of each created BFS level will
also carry identifiers for the subfiles Fi of their respective subgraphs Si .
When creating level L(t) based on L(t − 1) and L(t − 2), Fast BFS does
not access single adjacency lists like MR BFS does. Instead, it performs a
parallel scan of the sorted lists L(t − 1) and H and extracts N (L(t − 1));
In order to maintain the invariant that H contains the adjacency lists of all
vertices on the current level, the subfiles Fi of nodes whose adjacency lists
are not yet included in H will be merged with H. This can be done by first
sorting the respective subfiles and then merging the sorted set with H using
one scan. Each subfile Fi is added to H at most once. After an adjacency list
was copied to H, it will be used only for O(1/µ) expected steps; afterwards it
can be discarded from H. Thus, the expected total data volume for scanning
H is O(1/µ · (|V | + |E|)), and the expected total number of I/Os to handle H
and Fi is O(µ · |V| + sort(|V| + |E|) + (1/µ) · scan(|V| + |E|)). The final result
follows with µ = min{1, √(scan(|V| + |E|)/|V|)}.
Fig. 4.2. Using an Euler tour around a spanning tree of the input graph in order
to obtain a partition for the deterministic BFS algorithm.
are special algorithms for duplicate elimination, e.g. [1, 534]). Eventually,
the reduced subgraphs Si are used to create the reordered adjacency-list
files Fi ; this is done as in the randomized preprocessing and takes another
O(sort(|V | + |E|)) I/Os. Note that the reduced subgraphs Si may not be
connected any more; however, this does not matter as our approach only
requires that any two nodes in a subgraph are relatively close in the original
input graph.
The BFS-phase of the algorithm remains unchanged; the modified pre-
processing, however, guarantees that each adjacency-list will be part of the
external set H for at most 2/µ BFS levels: if a subfile Fi is merged with
H for BFS level L(t), then the BFS level of any node v in Si is at most
L(t) + 2/µ − 1. Therefore, the adjacency list of v in Fi will be kept in H for
at most 2/µ BFS levels.
Theorem 4.3 ([542]). External memory BFS on undirected graphs can be
solved using O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) I/Os in the worst
case.
In this section we review a data structure due to Kumar and Schwabe [485]
which proved helpful in the design of better EM graph algorithms: the I/O-
efficient tournament tree, I/O-TT for short. A tournament tree is a complete
binary tree, where some rightmost leaves may be missing. In a figurative
sense, a standard tournament tree models the outcome of a k-phase knock-
out game between |V | ≤ 2k players, where player i is associated with the i-th
leaf of the tree; winners move up in the tree.
The I/O-TT as described in [485] is more powerful: it works as a priority
queue with the Decrease Key operation. However, both the size of the data
structure and the I/O-bounds for the priority queue operations depend on
the size of the universe from which the entries are drawn. Used in connection
with graph algorithms, the static I/O-TT can host at most |V | elements with
Fig. 4.3. Principle of an I/O-efficient tournament tree. Signals are traveling from
the root to the leaves; elements move in opposite direction.
pairwise disjoint indices in {1, . . . , |V|}. Besides its index x, each element also
has a key k (priority). An element ⟨x1, k1⟩ is called smaller than ⟨x2, k2⟩ if
k1 < k2.
The I/O-TT supports the following operations:
(i) deletemin: extract the element ⟨x, k⟩ with smallest key k and replace it
by the new entry ⟨x, ∞⟩.
(ii) delete(x): replace ⟨x, oldkey⟩ by ⟨x, ∞⟩.
(iii) update(x, newkey): replace ⟨x, oldkey⟩ by ⟨x, newkey⟩ if newkey < oldkey.
Note that (ii) and (iii) do not require the old key to be known. This feature
will help to implement the graph-traversal algorithms of Section 4.5 without
paying one I/O for each edge (for example an SSSP algorithm does not have to
find out explicitly whether an edge relaxation leads to an improved tentative
distance).
Similar to other I/O-efficient priority queue data structures (see Chapter 2
of Rasmus Pagh for an overview) I/O-TTs rely on the concept of lazy batched
processing. Let M′ = c · M for some positive constant c < 1; the static
I/O-TT for |V| entries only has |V|/M′ leaves (instead of |V| leaves in the
standard tournament tree). Hence, there are O(log2(|V|/M′)) levels. Elements
with indices in the range {(i − 1) · M′ + 1, . . . , i · M′} are mapped to the i-
th leaf. The index range of internal nodes of the I/O-TT is given by the
union of the index ranges of their children. Internal nodes of the I/O-TT
keep a list of at least M′/2 and at most M′ elements each (sorted according
to their priorities). If the list of a tree node v contains z elements, then
they are the smallest z out of all those elements in the tree being mapped
to the leaves that are descendants of v. Furthermore, each internal node is
equipped with a signal buffer of size M′. Initially, the I/O-TT stores the
elements ⟨1, +∞⟩, ⟨2, +∞⟩, . . . , ⟨|V|, +∞⟩, out of which the lists of internal
nodes keep at least M′/2 elements each. Fig. 4.3 illustrates the principle of
an I/O-TT.
In the following we sketch how the I/O-efficient tournament tree of Section 4.4
can be used in order to obtain improved EM algorithms for the single source
shortest path problem. The basic idea is to replace the data structure Q for
the candidate nodes of IM traversal-algorithms (Section 4.2) by the EM
tournament tree. The resulting SSSP algorithm works for undirected graphs
with strictly positive edge weights.
The SSSP algorithm of [485] constructs an I/O-TT for the |V | vertices
of the graph and sets all keys to infinity. Then the key of the source node
is updated to zero. Subsequently, the algorithm operates in |V | iterations
similarly to Dijkstra’s approach [252]: iteration i first performs a deletemin
operation in order to extract an element ⟨vi, ki⟩; the final distance of the
extracted node vi is given by dist(vi) = ki. Then the algorithm issues
update(wj, dist(vi) + c(vi, wj)) operations on the I/O-TT for each adjacent
edge (vi, wj), vi ≠ wj, having weight c(vi, wj); in case of improvements the
new tentative distances will automatically materialize in the I/O-TT.
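For reference, the following sketch shows the same Dijkstra-like loop in internal memory, with an ordinary binary heap standing in for the I/O-TT; stale queue entries are filtered by an explicit "already settled" test, which is precisely the kind of per-edge check the I/O-TT (together with the correction described next) is designed to avoid. All names are illustrative.

```python
import heapq

def sssp(graph, source):
    """graph[v] = list of (w, c(v, w)); returns dist[] for all nodes reachable from source."""
    dist = {}
    pq = [(0, source)]                      # (key, node); all other keys are implicitly +inf
    while pq:
        k, v = heapq.heappop(pq)            # deletemin
        if v in dist:                       # stale entry for an already settled node
            continue
        dist[v] = k                         # final distance of v
        for w, c in graph[v]:
            if w not in dist:               # update(w, dist(v) + c(v, w))
                heapq.heappush(pq, (k + c, w))
    return dist
```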
However, there is a problem with this simple approach; consider an edge
(u, v) where dist(u) < dist(v). By the time v is extracted from the I/O-TT,
u is already settled; in particular, after removing u, the I/O-TT replaces the
extracted entry ⟨u, dist(u)⟩ by ⟨u, +∞⟩. Thus, performing update(u, dist(v)+
c(v, u) < ∞) for the edge (v, u) after the extraction of v would reinsert the
settled node u into the set Q of candidate nodes. In the following we sketch
how this problem can be circumvented:
A second EM priority-queue3, denoted by SPQ, supporting a sequence of
z deletemin and insert operations with (amortized) O(z/B · log2 (z/B)) I/Os
is used in order to remember settled nodes “at the right time”. Initially, SPQ
is empty. At the beginning of iteration i, the modified algorithm additionally
checks the smallest element ⟨u′i, k′i⟩ from SPQ and compares its key k′i with
the key ki of the smallest element ⟨ui, ki⟩ in the I/O-TT. Subsequently, only
the element with smaller key is extracted (in case of a tie, the element in
the I/O-TT is processed first). If ki < k′i then the algorithm proceeds as
described above; however, for each update(v, dist(u) + c(u, v)) on the I/O-TT
it additionally inserts ⟨u, dist(u) + c(u, v)⟩ into the SPQ. On the other hand,
if k′i < ki then a delete(u′i) operation is performed on the I/O-TT as well and a
new phase starts.
3 Several priority queue data structures are appropriate; see Chapter 2 for an
overview.
Fig. 4.4. Identifying spurious entries in the I/O-TT with the help of a second
priority queue SPQ.
In Fig. 4.4 we demonstrate the effect for the previously stated problem
concerning an edge (u, v) with dist(u) < dist(v): after node u is extracted
from the I/O-TT for the first time, ⟨u, dist(u) + c(u, v)⟩ is inserted into SPQ.
Since dist(u) < dist(v) ≤ dist(u) + c(u, v), node v will be extracted from I/O-
TT while u is still in SPQ. The extraction of v triggers a spurious reinsertion
of u into I/O-TT having key dist(v) + c(v, u) = dist(v) + c(u, v) > dist(u) +
c(u, v). Thus, u is extracted as the smallest element in SPQ before the re-
inserted node u becomes the smallest element in I/O-TT; as a consequence,
the resulting delete(u) operation for I/O-TT eliminates the spurious node u
in I/O-TT just in time. Extra rules apply for nodes with identical shortest
path distances.
As already indicated in Section 4.2, one-pass algorithms like the one just
presented still require Θ(|V |+(|V |+|E|)/B) I/Os for accessing the adjacency
lists. However, the remaining operations are more I/O-efficient: O(|E|) op-
erations on the I/O-TT and SPQ add another O(|E|/B · log2 (|E|/B)) I/Os.
Altogether this amounts to O(|V | + |E|/B · log2 (|E|/B)) I/Os.
The best known one-pass traversal-algorithms for general directed graphs are
often less efficient and less appealing than their undirected counterparts from
the previous sections. The key difference is that it becomes much more com-
plicated to keep track of previously visited nodes of the graph; the nice trick of
checking a constant number of previous levels for visited nodes as discussed
for undirected BFS does not work for directed graphs. Therefore we store
edges that point to previously seen nodes in a so-called buffered repository
tree (BRT) [164]: A BRT maintains |E| elements with keys in {1, . . . , |V |}
and supports the operations insert(edge, key) and extract all (key ); the latter
operation reports and deletes all edges in the BRT that are associated with
the specified key.
A BRT can be built as a height-balanced static binary tree with |V |
leaves and buffers of size B for each internal tree node; leaf i is associated
with graph node vi and stores up to degree(vi ) edges. Insertions into the
BRT happen at the root; in case of buffer overflow an inserted element (e, i)
is flushed down towards the i-th leaf. Thus, an insert operation requires
amortized O(1/B · log2 |V |) I/Os. If extract all (i) reports x edges then it
needs to read O(log2 |V |) buffers on the path from the root to the i-th leaf;
another O(x/B) disk blocks may have to be read at the leaf itself. This
accounts for O(x/B + log2 |V |) I/Os.
For DFS, an external stack S is used to store the vertices on the path
from the root node of the DFS tree to the currently visited vertex. A step of
the DFS algorithm checks the previously unexplored outgoing edges of the
topmost vertex u from S. If the target node v of such an edge (u, v) has not
been visited before then u is the father of v in the DFS tree. In that case,
v is pushed on the stack and the search continues for v. Otherwise, i.e., if v
has already been visited before, the next unexplored outgoing edge of u will
be checked. Once all outgoing edges of the topmost node u on the stack have
been checked, node u is popped and the algorithm continues with the new
topmost node on the stack.
Using the BRT the DFS procedure above can be implemented I/O effi-
ciently as follows: when a node v is encountered for the first time, then for
each incoming edge ei = (ui , v) the algorithm performs insert (ei , ui ). If at
some later point ui is visited then extract all (ui ) provides a list of all edges
out of ui that should not be traversed again (since they lead to nodes al-
ready seen before). If the (ordered) adjacency list of ui is kept in some EM
priority-queue P (ui ) then all superfluous edges can be deleted from P (ui )
in an I/O-efficient way. Subsequently, the next edge to follow is given by
extracting the minimum element from P (ui ).
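The following compact in-memory sketch follows this access pattern; a dictionary of lists stands in for the BRT and a sorted list per node for the priority queue P(·), so the I/O behaviour is not modelled and all names are illustrative.

```python
def dfs_preorder(graph, in_edges, root):
    """graph[u]: out-neighbors of u (every node has an entry); in_edges[v]: in-neighbors
    of v; returns the vertices reachable from root in DFS preorder."""
    brt = {}                                   # stand-in BRT: key u -> targets of edges (u, .) seen before
    P = {u: sorted(vs) for u, vs in graph.items()}   # P(u): ordered unexplored out-neighbors
    visited, order, stack = set(), [], []

    def visit(v):
        visited.add(v)
        order.append(v)
        stack.append(v)
        for u in in_edges.get(v, []):          # insert((u, v), u) into the BRT
            brt.setdefault(u, []).append(v)

    visit(root)
    while stack:
        u = stack[-1]
        superfluous = set(brt.pop(u, []))      # extract_all(u)
        if superfluous:                        # delete edges leading to visited nodes from P(u)
            P[u] = [x for x in P[u] if x not in superfluous]
        if P[u]:
            v = P[u].pop(0)                    # next unexplored outgoing edge (u, v)
            visit(v)                           # v is unvisited: extract_all(u) removed the rest
        else:
            stack.pop()                        # all outgoing edges of u checked: backtrack
    return order
```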
The algorithm takes O(|V | + |E|/B) I/Os to access adjacency lists. There
are O(|E|) operations on the n priority queues P (·) (implemented as exter-
nal buffer trees). As the DFS algorithm performs an inorder traversal of a
DFS tree, it needs to change between different P (·) at most O(|V |) times.
Therefore, O(|V | + sort(|E|)) I/Os suffice to handle the operations on all
P (·). Additionally, there are O(|E|) insert and O(|V |) extract all operations
on the BRT; the I/Os required for them add up to O((|V | + |E|/B) · log2 |V |)
I/Os.
The algorithm for BFS works similarly, except that the stack is replaced
by an external queue.
Theorem 4.6 ([164, 485]). BFS and DFS on directed graphs can be solved
using O((|V | + |E|/B) · log2 |V |) I/Os.
Problem          I/O-Bound
Undir. BFS       O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|))
Dir. BFS, DFS    O(min{|V| + (|V|/M) · scan(|V| + |E|), (|V| + |E|/(D·B)) · log2 |V|})
Undir. SSSP      O(min{|V| + (|V|/M) · sort(|V| + |E|), |V| + (|E|/(D·B)) · log2 |V|})
Dir. SSSP        O(|V| + (|V|/M) · sort(|V| + |E|))
algorithm for directed BFS must also feature novel strategies to remember
previously visited nodes. Maybe, for the time being, this additional compli-
cation should be left aside by restricting attention to the semi-external case;
first results for semi-external BFS on directed Eulerian graphs are given in
[542].
For BCC, we show an algorithm that transforms the graph and then
applies CC. BCC is clearly not easier than CC; given an input G = (V, E)
to CC, we can construct a graph G′ by adding a new node and connecting it
with each of the nodes of G. Each biconnected component of G′ is the union
of a connected component of G with the new node.
requires an additional constant number of sorts and scans of the edges. The
total I/Os for one iteration is then O(sort(|E|)). Since each iteration at least
halves the number of nodes, log2 (|V |B/|E|) iterations are enough, for a total
of O(sort(|E|) log(|V |B/|E|)) I/Os.
After log2 (|V |B/|E|) iterations, we have a contracted graph in which each
node represents a set of nodes from the original graph. Applying BFS on the
contracted graph gives a component label to each supernode. We then need
to go over the nodes of the original graph and assign to each of them the
label of the supernode it was contracted to. This can be done by sorting the
list of nodes by the id of the supernode that the node was contracted to and
the list of component labels by supernode id, and then scanning both lists
simultaneously.
The complexity can be further improved by contracting more edges per phase
at the same cost. More precisely, in phase i up to √Si edges adjacent to each
node will be contracted, where Si = 2^{(3/2)^i} = (Si−1)^{3/2} (fewer edges will be con-
tracted only if some nodes have become singletons, in which case they become
inactive). This means that the number of active nodes at the beginning of
phase i, |Vi|, is at most |V|/(Si−1 · Si−2) ≤ |V|/((Si)^{2/3} · (Si)^{4/9}) ≤ |V|/Si, and
log log(|V|B/|E|) phases are sufficient to reduce the number of nodes as de-
sired.
To stay within the same complexity per phase, phase i is executed on a
reduced graph Gi, which contains only the relevant edges: those that will be
contracted in the current phase. Then |Ei| ≤ |Vi| · √Si. We will later see how
this helps, but first we describe the algorithm:
Algorithm 2. Phase i:
1. For each active node, select up to d = √Si adjacent edges (fewer if the
node’s degree is smaller). Generate a graph Gi = (Vi, Ei) over the active
nodes with the selected edges.
2. Apply log d phases of Algorithm 1 to Gi .
3. Replace each edge (u, v) in E by an edge (R(u), R(v)) and remove re-
dundant edges and nodes as in Algorithm 1.
Complexity Analysis. Steps 1 and 3 take O(sort(|E|)) I/Os as in Algo-
rithm 1. In Step 2, each phase of Algorithm 1 takes O(sort(|Ei|)) I/Os. With
|Ei| ≤ |Vi| · √Si (due to Step 1) and |Vi| ≤ |V|/Si (as shown above), this is
O(sort(|V|/√Si)), and all log √Si phases need a total of O(sort(|V|)) I/Os.
Hence, one phase of Algorithm 2 needs O(sort(|E|)) I/Os as before, giving
a total complexity of O(sort(|E|) log log(|V |B/|E|)) I/Os for the node re-
duction. The BFS-based CC algorithm can then be executed in O(sort(|E|))
I/Os.
For D > 1, we perform node reduction phases until |V | ≤ |E|/BD, giving:
Theorem 4.7 ([567]). CC can be solved using
O(sort(|E|) · max{1, log log(|V |BD/|E|)}) I/Os.
A Boruvka [143] step selects for each node the minimum-weight edge in-
cident to it. It contracts all the selected edges, replacing each connected com-
ponent they define by a supernode that represents the component, removing
all isolated nodes, self-edges, and all but the lowest weight edge among each
set of multiple edges.
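A sketch of a single Boruvka step on an edge list is given below; union-find is used here merely to identify the components formed by the selected edges (the I/O-efficient version realizes the contraction with a constant number of sorts and scans), edge weights are assumed distinct, and isolated supernodes are not explicitly removed.

```python
def boruvka_step(nodes, edges):
    """edges: list of (weight, u, v) with distinct weights; returns the supernode map,
    the contracted edge list, and the selected MSF edges."""
    # 1. Every node selects the minimum-weight edge incident to it.
    best = {}
    for e in edges:
        w, u, v = e
        for x in (u, v):
            if x not in best or w < best[x][0]:
                best[x] = e
    selected = set(best.values())
    # 2. Contract the selected edges (union-find identifies the components they form).
    parent = {x: x for x in nodes}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for w, u, v in selected:
        parent[find(u)] = find(v)
    supernode = {x: find(x) for x in nodes}
    # 3. Relabel edges, drop self-loops, keep only the lightest of each multi-edge.
    lightest = {}
    for w, u, v in edges:
        a, b = supernode[u], supernode[v]
        if a == b:
            continue                           # self-loop after contraction
        key = (a, b) if a < b else (b, a)
        if key not in lightest or w < lightest[key][0]:
            lightest[key] = (w, a, b)
    return supernode, list(lightest.values()), selected
```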
Each Boruvka step reduces the number of nodes by at least a factor of two,
while contracting only edges that belong to the MSF. After i steps, each su-
pernode represents a set of at least 2^i original nodes, hence the number of su-
pernodes is at most |V|/2^i. In order to reduce the number of nodes to |E|/B,
log(|V |B/|E|) phases are necessary. Since one phase requires O(sort(|E|))
I/Os, this algorithm has complexity of O(sort(|E|) · max{1, log(|V|B/|E|)}).
As with the CC node reduction algorithm, this can be improved by com-
bining phases into superphases, where each superphase still needs
O(sort(|E|)) I/Os, and reduces more nodes than the basic step. Each su-
perphase is the same, except that the edges selected for Ei are not the
smallest-numbered √Si edges adjacent to each node, but the lightest √Si
edges. Then a superphase, which is equivalent to log √Si Boruvka steps, is
executed with O(sort(|E|)) I/Os. The total number of I/Os for node reduc-
tion is O(sort(|E|) · max{1, log log(|V |B/|E|)}). The output is the union of
the edges that were contracted in the node reduction phase and the MSF of
the reduced graph.
Theorem 4.8 ([567]). MSF can be solved using
O(sort(|E|) · max{1, log log(|V|BD/|E|)}) I/Os.
Munagala and Ranade [567] prove a lower bound of Ω (|E|/|V | · sort(|V |))
I/Os for CC, BCC and MSF. Note that |E|/|V | · sort(|V |) = Θ (sort(|E|)).
We have surveyed randomized algorithms that achieve this bound, but the
best known deterministic algorithms have a slightly higher I/O complexity.
Therefore, while both deterministic and randomized algorithms are efficient,
there still exists a gap between the upper bound and the lower bound in the
deterministic case.
Acknowledgements We would like to thank the participants of the GI-
Dagstuhl Forschungsseminar “Algorithms for Memory Hierarchies” for a
number of fruitful discussions and helpful comments on this chapter.
5. I/O-Efficient Algorithms for Sparse Graphs
Laura Toma and Norbert Zeh∗
5.1 Introduction
Massive graphs arise naturally in many applications. Recent web crawls, for
example, produce graphs with on the order of 200 million nodes and 2 billion
edges. Recent research in web modelling uses depth-first search, breadth-first
search, and the computation of shortest paths and connected components as
primitive routines for investigating the structure of the web [158]. Massive
graphs are also often manipulated in Geographic Information Systems (GIS),
where many problems can be formulated as fundamental graph problems.
When working with such massive data sets, only a fraction of the data can
be held in the main memory of a state-of-the-art computer. Thus, the trans-
fer of data between main memory and secondary, disk-based memory, and
not the internal memory computation, is often the bottleneck. A number of
models have been developed for the purpose of analyzing this bottleneck and
designing algorithms that minimize the traffic between main memory and
disk. The algorithms discussed in this chapter are designed and analyzed in
the parallel disk model (PDM) of Vitter and Shriver [755]. For a definition
and discussion of this model, the reader may refer to Chapter 1.
Despite the efforts of many researchers [1, 7, 52, 53, 60, 164, 192, 302, 419,
485, 521, 522, 550, 567, 737], the design of I/O-efficient algorithms for basic
graph problems is still a research area with many challenging open problems.
For most graph problems, Ω(perm(|V |)) or Ω(sort(|V |)) are lower bounds on
the number of I/Os required to solve them [53, 192], while the best known
algorithms for these problems on general graphs perform considerably more
I/Os. For example, the best known algorithms for DFS and SSSP perform
Ω(|V |) I/Os in the worst case; for BFS an algorithm performing o(|V |) I/Os
has been proposed only recently (see Table 5.1). While these algorithms are
I/O-efficient for graphs with at least B · |V | edges, they are inefficient for
sparse graphs.
In this chapter we focus on algorithms that solve a number of funda-
mental graph problems I/O-efficiently on sparse graphs. The algorithms we
discuss, besides exploiting the combinatorial and geometric properties of spe-
cial classes of sparse graphs, demonstrate the power of two general techniques
applied in I/O-efficient graph algorithms: graph contraction and time-forward
processing. The problems we consider are computing the connected and bi-
connected components (CC and BCC), a minimum spanning tree (MST), or
∗ Part of this work was done when the second author was a Ph.D. student at the
School of Computer Science of Carleton University, Ottawa, Canada.
Table 5.1. The best known upper bounds for fundamental graph problems on
undirected graphs. The algorithms are deterministic and use linear space.
Table 5.2. The problems that can be solved in O(sort(N )) I/Os and linear space
on sparse graphs and the sections where they are discussed. A left-arrow indicates
that the problem can be solved using the more general algorithm to the left.
The notation and terminology used in this chapter is quite standard. The
reader may refer to [283, 382] for definitions of basic graph-theoretic concepts.
For clarity, we review a few basic definitions and define the graph classes
considered in this chapter.
Given a graph G = (V, E) and an edge (v, w) ∈ E, the contraction of edge (v, w) is the operation of replacing vertices v and w with a new vertex x and every edge (u, y), where u ∈ {v, w} and y ∉ {v, w}, with an edge (x, y).
This may introduce duplicate edges into the edge set of G. These edges are
removed. We call graph G sparse if |E'| = O(|V'|) for any graph H = (V', E') that can be obtained from G through a series of edge contractions.¹
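For concreteness, the following minimal Python sketch spells out a single edge contraction, including the removal of duplicate edges; the adjacency-set representation, the integer vertex names, and the helper name contract_edge are illustrative assumptions, not part of the I/O-efficient algorithms discussed later, which perform many contractions at once by sorting and scanning.

def contract_edge(adj, v, w):
    # Contract edge (v, w): replace v and w by a new vertex x.
    # adj maps every vertex to the set of its neighbors.
    x = max(adj) + 1                        # name for the new vertex (assumption: integer vertices)
    neighbors = (adj[v] | adj[w]) - {v, w}  # taking the union removes duplicate edges
    adj[x] = neighbors
    for u in neighbors:                     # rewire every edge (u, v) or (u, w) to (u, x)
        adj[u] -= {v, w}
        adj[u].add(x)
    del adj[v], adj[w]
    return x

# Contracting edge (0, 1) of a triangle leaves a single edge between the
# new vertex 3 and vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
contract_edge(adj, 0, 1)                    # adj is now {2: {3}, 3: {2}}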
A planar embedding Ĝ of a graph G = (V, E) is a drawing of G in the
plane so that every vertex is represented as a unique point, every edge is
represented as a contiguous curve connecting its two endpoints, and no two
edges intersect, except possibly at their endpoints. A graph G is planar if it
has a planar embedding. Given an embedded planar graph, the faces of G
are the connected components of R2 \ Ĝ. The boundary of a face f is the set
of vertices and edges contained in the closure of f .
A graph G = (V, E) is outerplanar if it has a planar embedding one of
whose faces has all vertices of G on its boundary. We call this face the outer
face of G.
A grid graph is a graph whose vertices are a subset of the vertices of a √N × √N regular grid. Every vertex is denoted by its coordinates (i, j) and can be connected only to the (at most) eight vertices adjacent to it in the grid.
¹ The authors of [192] call these graphs “sparse under edge contraction”, thereby emphasizing the fact that the condition |E| = O(|V|) is not sufficient for a graph to belong to this class.
Fig. 5.1. A graph of treewidth three and a tree-decomposition of width three for the graph.
Fig. 5.2. The relationships between the different graph classes considered in this survey: sparse, planar, outerplanar, grid, and bounded treewidth.
5.3 Techniques
Before discussing the particular algorithms in this survey, we sketch the two
fundamental algorithmic techniques used in these algorithms: graph contrac-
tion and time-forward processing.
The I/O-efficient connectivity algorithm of [192] uses ideas from the PRAM
algorithm of Chin et al. [197] for this problem. First the graph contraction
technique from Section 5.3.1 is applied in order to compute a sequence G =
G0 , . . . , Gq of graphs whose vertex sets have geometrically decreasing sizes
and so that the vertex set of graph Gq fits into main memory. The latter
implies that the connected components of Gq can be computed using a simple
semi-external connectivity algorithm as outlined below. Given the connected
components of Gq , the connected components of G are computed by undoing
the contraction steps used to construct graphs G1 , . . . , Gq one by one and in
each step computing the connected components of Gi from those of Gi+1 .
The details of the algorithm are as follows:
In order to compute graph Gi+1 from graph Gi during the contraction
phase, every vertex in Gi = (Vi , Ei ) selects its incident edge leading to its
neighbor with smallest number. The selected edges form a forest Fi each of
whose trees contains at least two vertices. Every tree in Fi is then contracted
into a single vertex, which produces a new graph Gi+1 = (Vi+1 , Ei+1 ) with at
most half as many vertices as Gi. In particular, for 0 ≤ i ≤ q, |Vi| ≤ |V|/2^i.
Choosing q = log(|V |/M ), this implies that |Vq | ≤ M , i.e., the vertex set
of Gq fits into main memory. Hence, the connected components of Gq can be
computed using the following simple algorithm:
Load the vertices of Gq into main memory and label each of them as
being in a separate connected component. Now scan the edge set of Gq and
merge connected components whenever the endpoints of an edge are found
to be in different connected components. The computation of this algorithm
is carried out in main memory, so that computing the connected components
of Gq takes O(scan(|Vq | + |Eq |)) I/Os.
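This semi-external step can be sketched as follows; the union-find structure and the name edge_stream for the disk-resident edge list are illustrative assumptions and not the formulation of [192].

def semi_external_cc(vertices, edge_stream):
    # The vertices of G_q fit in main memory, so we keep a union-find
    # structure over them and merge components during one scan of the edges.
    parent = {v: v for v in vertices}        # every vertex starts in its own component

    def find(v):                             # find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v, w in edge_stream:                 # a single scan over the edges of G_q
        rv, rw = find(v), find(w)
        if rv != rw:
            parent[rv] = rw                  # merge the two components
    return {v: find(v) for v in vertices}    # component label of every vertex

labels = semi_external_cc([1, 2, 3, 4], iter([(1, 2), (3, 4)]))
# labels[1] == labels[2] and labels[3] == labels[4]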
To construct the connected components of graph Gi from those of
graph Gi+1 when undoing the contraction steps, all that is required is to
replace each vertex v of Gi with the tree in Fi it represents and assign v’s
component label to all vertices in this tree.
In [192] it is shown that the construction of Gi+1 from Gi as well
as computing the connected components of Gi from those of Gi+1 takes
O(sort(|Ei|)) I/Os. Hence, the whole connectivity algorithm takes Σ_{i=0}^{log(|V|/M)} O(sort(|Ei|)) I/Os. Since the graphs we consider are sparse, |Ei| = O(|Vi|) = O(|V|/2^i), so that Σ_{i=0}^{log(|V|/M)} O(sort(|Ei|)) = O(sort(|V|)).
That is, the contraction-based connectivity algorithm computes the con-
nected components of a sparse graph in O(sort(|V |)) I/Os.
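A single contraction round can be illustrated by the following in-memory sketch; labelling each contracted tree by its smallest vertex and the use of a small union-find structure are choices made only for this illustration.

def contraction_round(adj):
    # Every vertex selects the edge to its smallest-numbered neighbor; the
    # trees of the selected forest are then collapsed into single vertices.
    selected = {tuple(sorted((v, min(adj[v])))) for v in adj if adj[v]}
    parent = {v: v for v in adj}             # tiny union-find over the selected forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v, w in selected:
        rv, rw = find(v), find(w)
        if rv != rw:
            parent[max(rv, rw)] = min(rv, rw)  # root of each tree = its smallest vertex
    label = {v: find(v) for v in adj}
    # Build the contracted graph, dropping self-loops and duplicate edges.
    new_adj = {label[v]: set() for v in adj}
    for v in adj:
        for w in adj[v]:
            if label[v] != label[w]:
                new_adj[label[v]].add(label[w])
    return new_adj, label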
Tarjan and Vishkin [714] present an elegant parallel algorithm to compute the
biconnected components of a graph G by computing the connected compo-
nents of an auxiliary graph H. Given a spanning tree T of G, every non-tree
edge (v, w) of G (i.e., (v, w) ∈ E(G)\E(T )) defines a fundamental cycle, which
consists of the path from v to w in T and edge (v, w) itself. The auxiliary
graph H contains one vertex per edge of G. Two vertices in H are adjacent if
the corresponding edges appear consecutively on a fundamental cycle in G.
Using this definition of H, it is easy to verify that two edges of G are in the same biconnected component of G if and only if the two corresponding vertices in H are in the same connected component of H. In [192], Chiang et al. show that
the construction of H from G can be carried out in O(sort(|V | + |E|)) I/Os.
Since H has O(|E|) vertices and edges, the connected components of H can
be computed in O(sort(|E|)) I/Os using the connectivity algorithm from
Section 5.4.1. Hence, the biconnected components of G can be computed
in O(sort(|V |)) I/Os if G is sparse.
After covering connectivity problems, we now turn to the first two graph
searching problems: breadth-first search (BFS) and the single source shortest
path (SSSP) problem. Since BFS is the same as the SSSP problem if all edges
in the graph have unit weight, and both problems have an Ω(perm(|V |)) I/O
lower bound, we restrict our attention to SSSP-algorithms. Even though the
details of the SSSP-algorithms for different graph classes differ, their efficiency
is based on the fact that the considered graph classes have small separators.
In particular, a separator decomposition of a graph in each such class can be obtained I/O-efficiently (see Section 5.7), and the shortest path algorithms apply dynamic programming to such a decomposition in order to solve the SSSP problem.
Fig. 5.3. (a) A partition of a planar graph into the shaded subgraphs using the black separator vertices. (b) The boundary sets of the partition.
For every subgraph Gi of the partition, let Ri be the graph consisting of Gi, the separator vertices in ∂Gi, and the edges connecting them to Gi, and compute the distances in Ri between all pairs of vertices in ∂Gi. Then construct a complete graph R'i with vertex set ∂Gi and assign weight distRi(v, w) to every edge (v, w) ∈ R'i. Graph GR is the union of graphs R'1, . . . , R'q.
Assuming that M = Ω(B²), there is enough room in main memory to store one graph Ri and its compressed version R'i. Hence, graph GR can be computed from graph G by loading graphs R1, . . . , Rq into main memory, one at a time, computing for each graph Ri the compressed version R'i without incurring any I/Os and writing R'i to disk. As this procedure requires a single scan of the list of graphs R1, . . . , Rq, and these graphs have a total size of O(N), graph GR can be constructed in O(scan(N)) I/Os. Similarly, once the distances from s to all separator vertices are known, the computation of the distances from s to all non-separator vertices can be carried out in another scan of the list of graphs R1, . . . , Rq because the computation for the vertices in Ri is local to Ri.
From the above discussion it follows that the SSSP problem can be solved
in O(sort(N )) I/Os on G if it can be solved in that many I/Os on GR .
Since GR has only O(N/B) vertices and O(N/B² · B²) = O(N) edges, the SSSP problem on GR can be solved in O((N/B) log2 (N/B)) I/Os using the
shortest path algorithm described in Chapter 4. In order to reduce the I/O-
complexity of this step to O(sort(N )), Arge et al. [60, 68] propose a modified
version of Dijkstra’s algorithm, which avoids the use of a DecreaseKey
operation. This is necessary because the best known external priority queue
that supports this operation [485] takes O((N/B) log2 (N/B)) I/Os to process
a sequence of N priority queue operations, while there are priority queues
that do not support this operation, but can process a sequence of N Insert,
Delete, and DeleteMin operations in O(sort(N )) I/Os [52, 157].
In addition to a priority queue Q storing the unvisited vertices of GR ,
the algorithm of Arge et al. maintains a list L of the vertices of GR , each
labelled with its tentative distance from s. That is, for every vertex stored
in Q, its label in L is the same as its priority in Q. For a vertex not in Q, list L
stores its final distance from s. Initially, all distances, except that of s, are ∞.
Vertex s has distance 0. Now the algorithm repeatedly performs DeleteMin
operations on Q to obtain the next vertex to process. For every retrieved
vertex v, the algorithm loads the adjacency list of v into main memory and
updates the distances from s to v’s neighbors as necessary. (The adjacency
list of v fits into main memory because every vertex in S has degree O(B)
in GR . To see this, observe that each vertex in S is on the boundary of O(1)
subgraphs Gi because graph G has bounded degree, and each subgraph has at
most B boundary vertices.) In order to update these distances, the algorithm
retrieves the entries corresponding to v’s neighbors from L and compares the
current tentative distance of each neighbor w of v to the length of the path
from s to w through v. If the path through v is shorter, the distance from s
to w is updated in L and Q. Since the old tentative distance from s to w
is known, the update on Q can be performed by deleting the old copy of w
and inserting a new copy with the updated distance as priority. That is, the
required DecreaseKey operation is replaced by a Delete and an Insert
operation.
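The following compact in-memory sketch shows this Delete/Insert replacement; the external priority queue of [52, 157] is emulated here by a binary heap with lazy invalidation, and the dictionary-based graph representation is an assumption made only for the illustration.

import heapq

def sssp_no_decrease_key(graph, s):
    # graph maps every vertex to a list of (neighbor, weight) pairs.
    INF = float("inf")
    dist = {v: INF for v in graph}           # the list L of tentative distances
    dist[s] = 0
    heap = [(0, s)]                          # the priority queue Q
    deleted = set()                          # records the explicit Delete operations
    while heap:
        d, v = heapq.heappop(heap)           # DeleteMin
        if (d, v) in deleted or d > dist[v]:
            continue                         # this copy of v was deleted earlier
        for w, weight in graph[v]:           # scan v's adjacency list
            if d + weight < dist[w]:
                deleted.add((dist[w], w))    # Delete the old copy of w ...
                dist[w] = d + weight
                heapq.heappush(heap, (dist[w], w))  # ... and Insert a new one
    return dist

g = {"s": [("a", 2), ("b", 5)], "a": [("b", 1)], "b": []}
print(sssp_no_decrease_key(g, "s"))          # {'s': 0, 'a': 2, 'b': 3}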
Since graph GR has O(N/B) vertices and O(N ) edges, retrieving all ad-
jacency lists takes O(N/B + scan(N )) = O(scan(N )) I/Os. For the same
reason, the algorithm performs only O(N ) priority operations on Q, which
takes O(sort(N )) I/Os. It remains to analyze the number of I/Os spent on
accessing list L. If the vertices in L are not arranged carefully, the algo-
rithm may spend one I/O per access to a vertex in L, O(N ) I/Os in total.
In order to reduce this I/O-bound to O(N/B), Arge et al. use the fact that
there are only O(N/B 2 ) boundary sets, each of size O(B). If the vertices
in each boundary set are stored consecutively in L, the bound on the size
of each boundary set implies that the vertices in the set can be accessed in
O(1) I/Os. Moreover, every boundary set is accessed only O(B) times, once
per vertex on the boundaries of the subgraphs defining this boundary set.
Since there are O(N/B²) boundary sets, the total number of I/Os spent on loading boundary sets from L is hence O(B · N/B²) = O(N/B).
The algorithm described above computes only the distances from s to all
vertices in G. However, it is easy to augment the algorithm so that it computes
shortest paths in O(sort(N )) I/Os using an additional post-processing step.
The SSSP algorithm for planar graphs and grid graphs computes shortest
paths in three steps: First it encodes the distances between separator ver-
tices in a compressed graph. Then it computes the distances from the source
to all separator vertices in this compressed graph. And finally it computes
the distances from the source to all non-separator vertices using the dis-
tance information computed for the separator vertices on the boundary of
the subgraph Gi containing each such vertex. The shortest path algorithm
for outerplanar graphs and graphs of bounded treewidth [522, 775] applies
this approach iteratively, using the fact that a tree-decomposition of the
graph provides a hierarchical decomposition of the graph using separators of
constant size.
Assume that the given tree-decomposition D = (T, X) of G is nice in the sense defined in Section 5.2 and that s ∈ Xv, for all v ∈ T.² Then every
subtree of T rooted at some node v ∈ T represents a subgraph G(v) of G,
which shares only the vertices in Xv with the rest of G.
The first phase of the algorithm processes T from the leaves towards the
root and computes for every node v ∈ T and every pair of vertices x, y ∈ Xv ,
the distance from x to y in G(v). Since G(r) = G, for the root r of T , this
produces the distances in G between all vertices in Xr . In particular, the
² Explicitly adding s to all sets Xv to ensure the latter assumption increases the width of the decomposition by at most one.
distances from s to all other vertices in Xr are known at the end of the first
phase. The second phase processes tree T from the root towards the leaves
to compute for every node v ∈ T , the distances from s to all vertices in Xv .
During the first phase, the computation at a node v uses only the weights
of the edges between vertices in Xv and distance information computed for
the vertices stored at v’s children. During the second phase, the computation
at node v uses the distance information computed for the vertices in Xv
during the first phase of the algorithm and the distances from s to all vertices
in Xp(v) , where p(v) denotes v’s parent in T . Since the computation at every
node involves only a constant amount of information, it can be carried out
in main memory. All that is required is passing distance information from
children to parents in the first phase of the algorithm and from parents to
children in the second phase. This can be done in O(sort(N )) I/Os using
time-forward processing because tree T has size O(N ), and O(1) information
is sent along every edge.
To provide at least some insight into the computation carried out at the
nodes of T , we discuss the first phase of the algorithm. For a leaf v, G(v) is
the graph induced by the vertices in Xv . In particular, |G(v)| = O(1), and
the distances in G(v) between all vertices in Xv can be computed in main
memory. For a forget node v with child w, G(v) = G(w) and Xv ⊂ Xw ,
so that the distance information for the vertices in Xv has already been
computed at node w and can easily be copied to node v. For an introduce
node v with child w, Xv = Xw ∪ {x}. A shortest path in G(v) between two
vertices in Xv consists of shortest paths in G(w) between vertices in Xw and
edges between x and vertices in Xw . Hence, the distances between vertices
in Xv are the same in G(v) and in a complete graph G'(v) with vertex set Xv whose edges have the following weights: If y, z ∈ Xw, then edge (y, z) has weight distG(w)(y, z). Otherwise assume w.l.o.g. that y = x. Then the weight of edge (x, z) is the same in G'(v) as in G. The distances in G(v) between all vertices in Xv can now be computed by solving all pairs shortest paths on G'(v). This can be done in main memory because |G'(v)| = O(1). For a join node u with children v and w, a similar graph G'(u) of constant size is computed, which captures the lengths of the shortest paths between all vertices in Xu = Xv = Xw that stay either completely in G(v) or completely in G(w). The distances in G(u) between all vertices in Xu are again computed in main memory by solving all pairs shortest paths on G'(u).
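The computation at an introduce node can be made concrete by the following sketch; the parameter names dist_w and weight and the use of Floyd-Warshall are illustrative choices, since any all pairs shortest path computation on the constant-size graph G'(v) suffices.

def introduce_node_distances(X_w, x, dist_w, weight):
    # X_v = X_w + {x}; dist_w[(y, z)] is the distance in G(w) between y, z in X_w;
    # weight[(x, y)] is the weight of edge (x, y) in G (absent pairs mean "no edge").
    INF = float("inf")
    X_v = list(X_w) + [x]
    d = {(a, b): INF for a in X_v for b in X_v}
    for a in X_v:
        d[(a, a)] = 0.0
    for a in X_w:                            # edges (y, z) of G'(v) with y, z in X_w
        for b in X_w:
            if a != b:
                d[(a, b)] = dist_w[(a, b)]
    for a in X_w:                            # edges between x and X_w keep their weight from G
        d[(x, a)] = d[(a, x)] = weight.get((x, a), INF)
    for k in X_v:                            # Floyd-Warshall on the O(1)-size graph G'(v)
        for a in X_v:
            for b in X_v:
                if d[(a, k)] + d[(k, b)] < d[(a, b)]:
                    d[(a, b)] = d[(a, k)] + d[(k, b)]
    return d                                 # distances in G(v) between all vertices of X_v

# Example: X_w = {1, 2} with dist_{G(w)}(1, 2) = 4 and a new vertex 3 adjacent
# to 1 with weight 1 yields dist_{G(v)}(3, 2) = 5.
d = introduce_node_distances([1, 2], 3, {(1, 2): 4.0, (2, 1): 4.0}, {(3, 1): 1.0})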
The second phase of the algorithm proceeds in a similar fashion, using the
fact that a shortest path from s to a vertex x in Xv either stays completely
inside G(v), in which case the shortest path information between s and x
computed in the first phase is correct, or it consists of a shortest path from s
to a vertex y in Xp(v) followed by a shortest path from y to x in G(p(v)).
Since outerplanar graphs have treewidth 2, the algorithm sketched above
can be used to solve SSSP on outerplanar graphs in O(sort(N)) I/Os. Alternatively, one can derive a separator decomposition of an outerplanar graph directly from an outerplanar embedding and apply the shortest path algorithm from Section 5.5.1.
Fig. 5.4. (a) A planar graph G with its faces colored according to their levels.
Level-0 faces are white. Level-1-faces are light grey. Level-2 faces are dark grey.
(b) The corresponding partition of the graph into outerplanar subgraphs H0 (solid),
H1 (dotted), and H2 (dashed).
For the sake of simplicity, assume that the given planar graph G is bi-
connected. If this is not the case, a DFS-tree of G can be obtained in
O(sort(N )) I/Os by identifying the biconnected components of G using the
biconnectivity algorithm from Section 5.4.3 and merging appropriate DFS-
trees computed separately for each of these biconnected components.
In order to perform DFS in an embedded biconnected planar graph G, the
algorithm of [62], which follows ideas from [372], uses the following approach:
First the faces of G are partitioned into layers around a central face that
has the source of the DFS on its boundary (see Fig. 5.4a). The partition of
the faces of G into layers induces a partition of G into outerplanar graphs
of a particularly simple structure, so that DFS-trees of these graphs can be computed I/O-efficiently.
Fig. 5.5. (a) The face-on-vertex graph GF shown in bold. (b) Spanning tree T1 and
layer graph H2 are shown in bold. Attachment edges (ui , vi ) are thin solid edges.
The vertices in T1 are labelled with their DFS-depths. (c) The final DFS tree of G.
Fig. 5.6. O(sort(N)) I/O reductions between fundamental problems on planar graphs (including DFS and SSSP). An arrow indicates that the pointing problem can be solved in O(sort(N)) I/Os if the problem the arrow points to can be solved in that many I/Os.
Here we only sketch the separator algorithm for planar graphs [523], as it is key to the efficiency of the other algorithms, and refer the reader to [775] for the details of the more general separator algorithm.
The algorithm of [523] obtains an optimal h-partition of G by careful
application of the graph contraction technique, combined with a linear-time
internal memory algorithm for this problem. In particular, it first constructs
a hierarchy of planar graphs G = H0 , . . . , Hr whose sizes are geometrically
decreasing and so that |Hr | = O(N/B). The latter implies that applying the
internal memory algorithm to Hr in order to compute an optimal partition
of Hr takes O(N/B) = O(scan(N )) I/Os. Given the partition of Hr , the
algorithm now iterates over graphs Hr−1 , . . . , H0 , in each iteration deriving a
partition of Hi from the partition of Hi+1 computed in the previous iteration.
The construction of a separator Si for Hi starts with the set S'i of vertices in Hi that were contracted into the vertices in Si+1 during the construction of Hi+1 from Hi. Set S'i induces a preliminary partition of Hi, which is then refined by adding new separator vertices to S'i. The resulting set is Si.
The efficiency of the procedure and the quality of its output depend heav-
ily on the properties of the computed graph hierarchy. In [523] it is shown
that a graph hierarchy G = H0 , . . . , Hr with the following properties can be
constructed in O(sort(N )) I/Os:
(i) For all 0 ≤ i ≤ r, graph Hi is planar,
(ii) For all 1 ≤ i ≤ r, every vertex in Hi represents a constant number of vertices in Hi−1 and at most 2^i vertices in G, and
(iii) For all 0 ≤ i ≤ r, |Hi| = O(N/2^i).
Choosing r = log B, Property (iii) implies that |Hr | = O(N/B), as re-
quired by the algorithm. Properties (ii) and (iii) can be combined with an
appropriate choice of the size of the subgraphs in the partitions of graphs
Hr , . . . , H1 to guarantee that the final partition of G is optimal. In partic-
ular,
the algorithm makes sure that for 1 ≤ i ≤ r, separator Si induces an
h log2 B -partition of Hi , and only the refinement step computing S = S0
from S0 has the goal of producing an h-partition of G = H0 . Aleksandrov
and Djidjev [25] show that for any graph of√size N and any h > 0, their al-
gorithm computes a separator of size O N/ h that induces an h -partition
of the graph. Hence, since we use the algorithm of [25] to compute
√ Sr
and to
derive separator Si from Si , for 0 ≤ i < r, |Sr | = O |Hr |/ h log B , and
√
for i > 0, the construction of Si from Si adds O |Hi |/ h log B separator
√
vertices to Si . By Properties (ii) and (iii), this implies that |S0 | = O N/ h .
In order
√ to obtain an h-partition of G, the algorithm of [25] adds another
O N/ h separator vertices to S0 , so that S induces an optimal h-partition
of G.
The efficiency of the algorithm also follows from Properties (ii) and (iii).
We have already argued that for r = log B, |Hr | = O(N/B), so that the
linear-time separator algorithm takes O(scan(N )) I/Os to compute the initial
(h log² B)-partition of Hr. Property (ii) implies that separator S'i induces a (ch log² B)-partition of Hi, for some constant c ≥ 1. Under the assumption that M ≥ ch log² B, this implies that every connected component of Hi − S'i fits into main memory. Hence, the algorithm of [523] computes the connected components of Hi − S'i, loads each of them into main memory and applies the internal memory algorithm of [25] to partition it into subgraphs of size at most h log² B (or h, if i = 0).
Since Sr can be computed in O(scan(N)) I/Os and the only external memory computation required to derive Si from S'i is computing the connected components of Hi − S'i, the whole algorithm takes O(scan(N)) + Σ_{i=0}^{r−1} O(sort(|Hi|)) = Σ_{i=0}^{r−1} O(sort(N/2^i)) = O(sort(N)) I/Os.
In order to use the computed partition in the SSSP-algorithm from Section 5.5.1, it has to satisfy a few more stringent properties than optimality in the above sense. In particular, it has to be guaranteed that each of the O(N/h) subgraphs in the partition is adjacent to at most √h separator vertices and that there are only O(N/h) boundary sets as defined in Section 5.5.1. In [775], it is shown that these properties can be ensured using a post-processing step that takes O(sort(N)) I/Os and increases the size of the computed separator by at most a constant factor. The construction is based on ideas from [315].
For a grid graph G, the geometric information associated with its vertices makes it very easy to compute an h-partition of G. In particular, every vertex stores its coordinates (i, j) in the grid. Then the separator S is chosen to contain all vertices in rows and columns √h, 2√h, 3√h, . . . . Separator S has size O(N/√h) and partitions G into subgraphs of size at most h. That is, the computed partition is optimal. Since every vertex in a grid graph can be connected only to its eight neighboring grid vertices, each subgraph in the computed partition is adjacent to at most 4√h separator vertices. The number of boundary sets in the partition is O(N/h). Hence, this partition can be used in the shortest path algorithm from Section 5.5.1.
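A sketch of this construction is given below; placing row and column 0 into the separator as well and representing vertices by coordinate pairs are simplifications made only for the illustration.

import math

def grid_partition(vertices, h):
    # Vertices whose row or column index is a multiple of floor(sqrt(h)) form
    # the separator S; all other vertices are grouped into their grid cells.
    step = max(1, math.isqrt(h))
    separator, cells = [], {}
    for (i, j) in vertices:
        if i % step == 0 or j % step == 0:
            separator.append((i, j))
        else:
            cells.setdefault((i // step, j // step), []).append((i, j))
    return separator, cells

# Example: a 6x6 grid with h = 9, i.e., every third row and column becomes
# separator and each remaining cell contains at most four vertices.
S, cells = grid_partition([(i, j) for i in range(6) for j in range(6)], 9)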
Having small separators is a structural property that all the graph classes we
consider have in common. In this section we review I/O-efficient algorithms
to gather more specific information about each class. In particular, we sketch
algorithms for computing outerplanar and planar embeddings of outerplanar
and planar graphs and tree-decompositions of graphs of bounded treewidth.
These algorithms are essential, at least from a theoretical point of view, as
all of the algorithms presented in previous sections, except the separator
algorithm for planar graphs, require an embedding or tree-decomposition to
be given as part of the input.
In order to test whether a given graph is planar, the algorithm of [523] exploits
the fact that the separator algorithm from Section 5.7.1 does not require a
planar embedding to be given as part of the input. In fact, the algorithm
can be applied even without knowing whether G is planar. The strategy
of the planar embedding algorithm is to use the separator algorithm and
try to compute an optimal B²-partition of G whose subgraphs G1, . . . , Gq
have boundary size at most B. If the separator algorithm fails to produce
the desired partition in O(sort(N )) I/Os, the planar embedding algorithm
terminates and reports that G is not planar. Otherwise the algorithm first
tests whether each of the graphs G1 , . . . , Gq is planar. If one of these graphs
is non-planar, graph G cannot be planar. If graphs G1 , . . . , Gq are planar,
each graph Gi is replaced with a constraint graph Ci of size O(B). These
constraint graphs have the property that graph G is planar if and only if
the approximate graph A obtained by replacing each subgraph Gi with its
constraint graph Ci is planar. If A is planar, a planar embedding of G is
obtained from a planar embedding of A by locally replacing the embedding
of each constraint graph Ci with a consistent planar embedding of Gi .
This approach leads to an I/O-efficient algorithm using the following
observations: (1) Graphs G1, . . . , Gq have size at most B², and graphs C1, . . . , Cq have size O(B) each. Thus, the test of each graph Gi for planarity and the construction of the constraint graph Ci from Gi can be carried out in main memory, provided that M ≥ B², which has to be true already in order to apply the separator algorithm. (2) Graph A has size O(N/B) because it is constructed from O(N/B²) constraint graphs of size O(B), so that a lin-
ear time planarity testing and planar embedding algorithm (e.g., [144]) takes
O(scan(N )) I/Os to test whether A is planar and if so, produce a planar
embedding of A. The construction of consistent planar embeddings of graphs
G1 , . . . , Gq from the embeddings of graphs C1 , . . . , Cq can again be carried
out in main memory.
This seemingly simple approach involves a few technicalities that are dis-
cussed in detail in [775]. At the core of the algorithm is the construction of
the constraint graph Ci of a graph Gi . This construction is based on a careful
analysis of the structure of graph Gi and beyond the scope of this survey. We
refer the reader to [775] for details. However, we sketch the main ideas.
The construction is based on the fact that triconnected planar graphs
are rigid in the sense that they have only two different planar embeddings,
which can be obtained from each other by “flipping” the whole graph. The
construction of constraint graph Ci partitions graph Gi into its connected
components, each connected component into its biconnected components,
and each biconnected component into its triconnected components. The con-
nected components can be handled separately, as they do not interact with
each other. The constraint graph of a connected component is constructed
bottom-up from constraint graphs of its biconnected components, which in
turn are constructed from constraint graphs of their triconnected compo-
nents.
The constraint graph of a triconnected component is triconnected, and
its embedding contains all faces of the triconnected component so that other
parts of G may be embedded in these faces. The rest of the triconnected
component is compressed as far as possible while preserving triconnectivity
and planarity.
The constraint graph of a biconnected component is constructed from the
constraint graphs of its triconnected components by analyzing the amount of
interaction of these triconnected components with the rest of G. Depending
on these interactions, the constraint graph of each triconnected component
is either (1) preserved in the constraint graph of the biconnected component,
(2) grouped with the constraint graphs of a number of other triconnected
components, or (3) does not appear in the constraint graph of the biconnected
component at all because it has no influence on the embedding of any part
of G that is not in this biconnected component. In the second case, the group
of constraint graphs is replaced with a new constraint graph of constant size.
The constraint graph of a connected component is constructed in a similar
manner from the constraint graphs of its biconnected components.
The algorithms for BFS, DFS, and SSSP on special classes of sparse graphs
are a major step towards solving these problems on sparse graphs in general.
In particular, the results on planar graphs have answered the long stand-
ing question whether these graphs allow O(sort(N )) I/O solutions for these
problems. However, all these algorithms are complex because they are based
on computing separators. Thus, the presented results pose a new challenge,
namely that of finding simpler, practical algorithms for these problems.
Since the currently best known separator algorithm for planar graphs requires that M = Ω(B² log² B), the algorithms for BFS, DFS, and SSSP on
planar graphs inherit this constraint. It seems that this memory requirement
of the separator algorithm (and hence of the other algorithms as well) can be
removed or at least reduced if the semi-external single source shortest path
problem can be solved in O(sort(|E|)) I/Os on arbitrary graphs. (“Semi-
external” means that the vertices of the graph fit into main memory, but the
edges do not.)
For graphs of bounded treewidth, the main open problem is finding an
I/O-efficient DFS-algorithm. Practicality is not an issue here, as the chances
to obtain practical algorithms for these graphs are minimal, as soon as the
algorithms rely on a tree-decomposition.
For grid graphs, the presented shortest path algorithm uses a partition of
the graph into a number of cells that depends on the size of the grid. This
may be non-optimal if the graph is an extremely sparse subgraph of the grid.
An interesting question here is whether it is possible to exploit the geometric
information provided by the grid to obtain a partition of the same quality
as the one obtained by the separator algorithm for planar graphs, but with
much less effort, i.e., in a way that leads to a practical algorithm.
6. External Memory Computational Geometry
Revisited
Christian Breimann and Jan Vahrenhold∗
6.1 Introduction
are identical. The lower bound for this problem is Ω((N/B) logM/B (N/B))
[59], and—looking at the reduction from the opposite direction—a matching
upper bound for the Element Uniqueness problem for points can be obtained
by solving what is called the Closest Pair problem (see Problem 6.3). For a
given collection of points, this problem consists of computing a pair with min-
imal distance. This distance is non-zero if and only if the collection does not
contain duplicates, that is if and only if the answer to the Element Uniqueness
problem is negative.
A more general view is given by Figure 6.1. It shows that reducing (trans-
forming) a problem A to a problem B means transforming the input of A first,
then solving problem B, and transforming its solution back afterwards (see
Figure 6.1 (a)). Such a transformation is said to be a τ (N )-transformation
if and only if transforming both the input and the solution can be done in
O(τ (N )) time. If an algorithm for solving the problem B has an asymptotic
complexity of O(fB (N )), the problem A can be solved in O(fB (N ) + τ (N ))
time. In addition, if the intrinsic complexity of the problem A is Ω(fA (N ))
and if τ (N ) ∈ o(fA (N )), then B also has a lower bound of Ω(fA (N )) (see Fig-
ure 6.1 (b) and, e.g., the textbook by Preparata and Shamos [614, Chap. 1.4]).
Reduction via Duality In Section 6.3, which is entitled “Problems Involv-
ing Sets of Points”, we will discuss the following problem (Problem 6.2):
“Given a set S of N halfspaces in IRd , compute the common inter-
section of these halfspaces.”
At first, it seems surprising that this problem should be discussed in
a section devoted to problems involving sets of points. Using the concept
of geometric duality, however, points and halfspaces can be identified in a
consistent way: A duality transform maps points in IRd into the set G d of
non-vertical hyperplanes in IRd and vice versa. The classical duality transform
between points and hyperplanes is defined as follows:
D:   G^d → IR^d :  x_d = a_d + Σ_{i=1}^{d−1} a_i x_i   ↦   (a_1, . . . , a_d)
     IR^d → G^d :  (b_1, . . . , b_d)   ↦   x_d = b_d − Σ_{i=1}^{d−1} b_i x_i
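Written out for the plane (d = 2), the transform reads as follows; the helper names are illustrative only. The small check at the end demonstrates that D preserves incidence: a point lies on a line exactly if the dual point of the line lies on the dual line of the point.

def dual_of_line(a1, a2):
    # The non-vertical line y = a2 + a1*x is mapped to the point (a1, a2).
    return (a1, a2)

def dual_of_point(b1, b2):
    # The point (b1, b2) is mapped to the line y = b2 - b1*x,
    # returned here as its coefficient pair.
    return (-b1, b2)

def on_line(px, py, a1, a2):
    return py == a2 + a1 * px

# The point p = (1, 3) lies on the line y = 2 + x, and the dual point of the
# line, (1, 2), lies on the dual line of p, y = 3 - x.
p, line = (1, 3), (1, 2)
print(on_line(*p, *line))                                  # True
print(on_line(*dual_of_line(*line), *dual_of_point(*p)))   # True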
as the sweep-line moves to a point where the topology of the active objects
changes discontinuously: for example, an object must be inserted into the
sweep-line structure as soon as the sweep-line hits its leftmost point, and it
must be removed after the sweep-line has passed its rightmost point. The
sweep-line structure can be maintained in logarithmic time per update if the
objects can be ordered linearly, e.g., by the y-value of their intersection with
the sweep line.
For a finite set of objects, there are only finitely many points where the
topology of the active objects changes discontinuously, e.g., when objects are
inserted into or deleted from the sweep-line structure; these points are called
events and are stored in increasing order of their x-coordinates, e.g., in a
priority queue. Depending on the problem to be solved, there may exist ad-
ditional event types apart from insert and delete events. The data structure
for storing the events is called event queue, and maintaining it as a prior-
ity queue under insertions and deletions can be accomplished in logarithmic
time per update. That is, if the active objects can be ordered linearly, each
event can be processed in logarithmic time (excluding the time needed for
operations involving active objects). As a consequence, the plane sweeping
technique often leads to optimal algorithms, e.g., the Closest Pair problem
can be solved in optimal time O(N log2 N ) [398].
The straightforward approach for externalizing the plane sweeping tech-
nique would be to replace the (internal) sweep-line structure by a corre-
sponding external data structure, e.g., a B-tree [96]. A plane-sweep algo-
rithm with an internal memory time complexity of O(N log2 N ) then spends
O(N logB N ) I/Os. For problems with an external memory lower bound of
Ω((N/B) logM/B (N/B)), however, the latter bound is at least a factor of
B away from optimal.¹ The key to an efficient external sweeping technique
is to combine sweeping with ideas similar to divide-and-conquer, that is, to
subdivide the plane prior to sweeping. To aid imagination, consider the plane
subdivided into Θ(M/B) parallel (vertical) strips, each containing the same
¹ Often the (realistic) assumption M/B > B is made. In such a situation, an additional (non-trivial) factor of logB M > 2 is lost.
number of data objects (see Figure 6.2(b)).² Each of these strips is then
processed using a sweep over the data and eventually by recursion. How-
ever, in contrast to the description of the internal case, the sweep-line is
perpendicular to the y-axis, and sweeping is done from top to bottom. The
motivation behind this modified description is to facilitate the intuition be-
hind the novel ingredient of distribution sweeping, namely the subdivision
into vertical strips.
The subdivision proceeds using a technique originally proposed for distri-
bution sort [17], hence, the resulting external plane sweeping technique has
been christened distribution sweeping [345]. While in the situation of distri-
bution sort all partitioning elements have to be selected using an external
variant of the median find algorithm [133, 307], distribution sweeping can
resort to having an optimal external sorting algorithm at hand. The set of
all x-coordinates is sorted in ascending order, and for each (recursive) sub-
division of a strip, the Θ(M/B) partitioning elements can be selected from
the sorted sequence spending an overall number of O(N/B) I/Os per level of
recursion.
Using this linear partitioning algorithm as a subroutine, the distribution
sweeping technique can be stated as follows: Prior to entering the recursive
procedure, all objects are sorted with respect to the sweeping direction, and
the set of x-coordinates is sorted such that the partitioning elements can be
found efficiently. During each recursive call, the current data set is parti-
tioned into M/B strips. Objects that interact with objects from other strips
are found and processed during a sweep over the strips, while interactions
between objects assigned to the same strip are found recursively. The recur-
sion terminates when the number of objects assigned to a strip falls below M
and the subproblem can be solved in main memory. If the sweep for finding
inter-strip interactions can be performed using only a linear number of I/Os,
i.e., Θ(N/B) I/Os, the overall I/O complexity for distribution sweeping is
O((N/B) logM/B (N/B)).
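The overall control structure of distribution sweeping can be summarized by the following schematic in-memory skeleton; the callables sweep_fn and base_fn stand for the problem-specific sweep over the strips and the in-memory base case, and objects are assumed to carry an x attribute. All of this is illustrative and hides the careful I/O bookkeeping of the actual technique.

def distribution_sweep(objects, k, memory_size, sweep_fn, base_fn):
    xs = sorted(o.x for o in objects)        # in the real algorithm this is sorted once
    if len(objects) <= memory_size or len(set(xs)) <= 1:
        return [base_fn(objects)]            # the subproblem fits into main memory
    boundaries = [xs[(i * len(xs)) // k] for i in range(1, k)]  # k-1 partitioning x-values
    strips = [[] for _ in range(k)]
    for o in objects:                        # distribute the objects to the k strips
        strips[sum(1 for b in boundaries if o.x > b)].append(o)
    if max(len(s) for s in strips) == len(objects):
        return [base_fn(objects)]            # degenerate split; stop the recursion
    results = [sweep_fn(strips)]             # one sweep handles all inter-strip interactions
    for strip in strips:                     # interactions inside a strip: recurse
        if strip:
            results.extend(distribution_sweep(strip, k, memory_size,
                                              sweep_fn, base_fn))
    return results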
Fig. 6.3. R-tree for data rectangles A, B, C, . . . , I, K, L. The tree in this example
has maximum fanout B = 3.
Problem 6.1 (Convex Hull). Given a set S of N points in IRd , find the
smallest (convex) polytope enclosing S (see Figure 6.4(a)).
Among the earliest internal memory algorithms for computing the convex
hull in two dimensions was a sort-and-scan algorithm due to Graham [352].
This algorithm, called Graham’s Scan, is based upon the invariant that when
traversing the boundary of a convex polygon in counterclockwise direction,
any three consecutive points form a left turn. The algorithm first selects
a point p that is known to be interior to the convex hull, e.g., the center of
gravity of the triangle formed by three non-collinear points in S. All points in
S are then sorted by increasing polar angle with respect to p. The convex hull
is constructed by pushing the points onto a stack in sorted order, maintaining
the above invariant. As soon as the next point to be pushed and the topmost
two points on the stack do not form a left turn, points are repeatedly removed
from the stack until only one point is left or the invariant is fulfilled. After all
points have been processed, the stack contains the points lying on the convex
hull in clockwise direction. As each point can be pushed onto (removed from)
the stack only once, Θ(N ) stack operations are performed, and the (optimal)
internal memory complexity, dominated by the sorting step, is O(N log2 N ).
This algorithm is one of the rare cases where externalization is completely
straightforward [345]. Sorting can be done using O((N/B) logM/B (N/B))
I/Os [17], and an external stack can be implemented such that Θ(N ) stack
operations require O(N/B) I/Os (see Chapter 2). The external algorithm we
obtain this way has an optimal complexity of O((N/B) logM/B (N/B)).
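For reference, a compact in-memory version of Graham's Scan is given below; it sorts around the lowest point rather than around an interior point, which yields the same hull, and it is only meant to make the stack-based invariant explicit.

import math

def cross(o, a, b):
    # Positive if the turn o -> a -> b is a left (counterclockwise) turn.
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    p0 = min(points, key=lambda p: (p[1], p[0]))            # lowest point
    rest = sorted((p for p in points if p != p0),
                  key=lambda p: (math.atan2(p[1] - p0[1], p[0] - p0[0]),
                                 (p[0] - p0[0]) ** 2 + (p[1] - p0[1]) ** 2))
    stack = [p0]
    for p in rest:                                          # push points in angular order
        while len(stack) > 1 and cross(stack[-2], stack[-1], p) <= 0:
            stack.pop()                                     # pop until a left turn is restored
        stack.append(p)
    return stack                                            # hull in counterclockwise order

print(graham_scan([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]))
# [(0, 0), (2, 0), (2, 2), (0, 2)]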
In general, O(N ) points of S can lie on the convex hull, but there are
situations where the number Z of points on the convex hull is (asymptotically)
much smaller. An output-sensitive algorithm for computing the convex hull
in two dimensions has been obtained by Goodrich et al. [345]. Building upon
the concept of marriage-before-conquest [458], the authors combine external
versions of finding the median of an unsorted set [17] and of computing the
convex hull of a partially sorted point set [343] to obtain an optimal output-
sensitive external algorithm with complexity O((N/B) logM/B (Z/B)).
Independent of this particular problem, Hoel and Samet [402] claimed
that accessing disjoint decompositions of data space tends to be faster than
other decompositions for a wide range of hierarchical spatial index struc-
tures. Along these lines, Böhm and Kriegel [138] presented two algorithms
for solving the Convex Hull problem using spatial index structures. One al-
gorithm, computing the minimum and maximum values for each dimension
and traversing the index depth-first, is shown to be optimal in the number of
disk accesses as it reads only the pages containing points not enclosed by the
convex hull once. The second algorithm performs worse in terms of I/O but
needs less CPU time. It is unclear, however, how to extend these algorithms
to higher dimensions.
An approach to the d-dimensional Convex Hull problem is based on the
observation that the convex hull of S ⊂ IRd can be inferred from the inter-
section of halfspaces in the dual space (IRd )∗ [781] (see also Figures 6.4(a)
and (b)). For each point p ∈ S, the corresponding dual halfspace is given by
p∗ := {x ∈ (IR^d)∗ | Σ_{i=1}^{d} x_i p_i ≤ 1}. At least for d ∈ {2, 3}, the intersection of
halfspaces can be computed I/O-efficiently (see the following Problem 6.2),
and this results in corresponding I/O-efficient algorithms for the Convex Hull
problem in these dimensions.
problems (see also the survey by Smid [700]). In the external memory set-
ting, the (static) problem of finding the closest pair in a fixed set S of N
points can be solved by exploiting the reduction to the All Nearest Neighbors
problem (Problem 6.6), where for each point p ∈ S, we wish to determine
its nearest neighbor in S \ {p} (see Problem 6.5). Having computed this list
of N pairs of points, we can easily select two points forming a closest pair
by scanning the list while keeping track of the closest pair seen so far. As
we will discuss below, the complexity of solving the All Nearest Neighbors problem is O((N/B) logM/B (N/B)), which gives us an optimal algorithm for solving
the static Closest Pair problem.
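Once the All Nearest Neighbors output is available, the final selection step is a single scan, as the following small sketch shows (the list-of-pairs input format is an assumption of the illustration).

import math

def closest_pair_from_ann(ann_pairs):
    # ann_pairs contains one pair (p, nearest neighbor of p) per point.
    best, best_dist = None, math.inf
    for p, q in ann_pairs:                   # one scan over the N pairs
        d = math.dist(p, q)
        if d < best_dist:
            best, best_dist = (p, q), d
    return best, best_dist

pairs = [((0, 0), (1, 0)), ((1, 0), (0, 0)), ((5, 5), (1, 0))]
print(closest_pair_from_ann(pairs))          # (((0, 0), (1, 0)), 1.0)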
Handling the dynamic case is considerably more involved, as an insertion
or a deletion could change a large number of “nearest neighbors”, and con-
sequently, the reduction to the All Nearest Neighbors problem would require
touching at least the same number of objects.
Callahan, Goodrich, and Ramaiyer [168] introduced an external variant
of topology trees [316], and building upon this data structure, they managed
to develop an external version of the dynamic closest pair algorithm by Be-
spamyatnikh [121]. The data structure presented by Callahan et al. can be
used to dynamically maintain the closest pair spending O(logB N ) I/Os per
update.
The Closest Pair problem can also be considered in a bichromatic setting,
where each point is labeled with either of two colors, and where we wish to
report a pair with minimal distance among all pairs of points having different
colors [10, 351]. This problem can be generalized to the case of reporting the
K bichromatic closest pairs.
Problem 6.4 (K-Bichromatic Closest Pairs). Given a set S of N points
in IRd with S = S1 ∪ S2 and S1 ∩ S2 = ∅, find K closest pairs (p, q) ∈ S1 × S2 .
Some efficient internal memory algorithms for solving this problem have
been proposed [10, 451], but it seems that none of them can be externalized
efficiently. In the context of spatial databases, the K-Bichromatic Closest
Pairs problem can be seen as a special instance of a so-called θ-join which
is defined as follows: Given two sets S1 and S2 of objects and a predicate θ :
S1 × S2 → IB, compute all pairs (s1 , s2 ) ∈ S1 × S2 , for which θ(s1 , s2 ) = true.
In his approach to the K-Bichromatic Closest Pairs problem, Henrich [393] considered the special case |S2| = 1 and assumed that S1 is indexed hierarchically.
and query time [700], and not surprisingly, the external memory variant of
this problem is unsolved as well.
Berchtold et al. [112] proposed to use hierarchical spatial index struc-
tures to store the data points. They also introduced a different cost model
and compared the predicted and actual cost of solving the Nearest Neighbor
problem for real-world data using an X-tree [114] and a Hilbert-R-tree [287].
Brin [148] introduced the GNAT index structure which resembles a hierarchi-
cal Voronoi diagram (see Problem 6.11). He also gave empirical evidence that
this structure outperforms most other index structures for high-dimensional
data spaces.
The practical relevance of the nearest neighbor, however, becomes less
significant as the number of dimensions increases. For both real-world and
synthetic data sets in high-dimensional space (d > 10), Weber, Schek, and
Blott [759] as well as Beyer et al. [123] showed that under several distance
metrics the distance to the nearest neighbor is larger than the distance be-
tween the nearest neighbor and the farthest neighbor of the query point.
Their observation raises an additional quality issue: The exact nearest neigh-
bor of a query point might not be relevant at all. As an approach to cope with
this complication, Hinneburg, Aggarwal, and Keim [397] modified the Nearest
Neighbor problem by introducing the notion of important dimensions. They
introduced a quality criterion to determine which dimensions are relevant to
the specific proximity problem in question and examined the data distribu-
tion resulting from projections of the data set to these dimensions. Obviously,
their approach yields improvements over standard techniques only if the num-
ber of “important” dimensions is significantly smaller than the dimension of
the data space.
The All Nearest Neighbors problem, which can also be seen as a special
batched variant of the (single-shot) Nearest Neighbor problem, can be posed,
e.g., in order to find clusters within a point set. Goodrich et al. [345] pro-
posed an algorithm with O((N/B) logM/B (N/B)) I/O-complexity based on
the distribution sweeping paradigm: Their approach is to externalize a par-
allel algorithm by Atallah and Tsay [74] replacing work on each processor by
work within a single memory load. Recall that on each level of distribution
sweeping, only interactions between strips are handled, and that interactions
within a strip are handled recursively. In the situation of finding nearest
neighbors, the algorithm performs a top-down sweep keeping track of each
point whose nearest neighbor above does not lie within the same strip. The
crucial observation by Atallah and Tsay is that there are at most four such
points in each strip, and by choosing the branching factor of distribution
sweeping as M/(5B), the (at most) four blocks per strip containing these
points as well as the M/(5B) blocks needed to produce the input for the
recursive steps can be kept in main memory. Nearest neighbors within the
same strip are found recursively, and the result is combined with the result
of a second bottom-up sweep to produce the final answer.
In several applications, it is desirable to compute not only the exact near-
est neighbors but to additionally compute for each point the K points clos-
est to it. An algorithm for this so-called All K-Nearest Neighbors problem
has been presented by Govindarajan et al. [346]. Their approach (which
works for an arbitrary number d of dimensions) builds upon an exter-
nal data structure to efficiently maintain a well-separated pair decomposi-
tion [169]. A well-separated pair decomposition of a set S of points is a hierarchical clustering of S such that any two clusters on the same level of the hierarchy are farther apart than any two points within the same clus-
ter, and several internal memory algorithms have been developed building
upon properties of such a decomposition. The external data structure of
Govindarajan et al. occupies O(KN/B) disk blocks and can be used to
compute all K-nearest neighbors in O((KN/B) logM/B (KN/B)) I/Os. Their
method can also be used to compute the K closest pairs in d dimensions in
O(((N + K)/B) logM/B ((N + K)/B)) I/Os using O((N + K)/B) disk blocks.
(a) Reverse nearest neighbors for point p. (b) K-nearest neighbors via lifting.
containing p. It is easy to verify that the points corresponding to the balls that
contain p are exactly the points having p as their nearest neighbor in S ∪ {p}.
In the internal memory setting, at least the static version of the Reverse
Nearest Neighbor problem can be solved efficiently [524]. The main problem
when trying to efficiently solve the problem in a dynamic setting is that
updating S essentially involves finding nearest neighbors in a dynamically
changing point set, and—as discussed in the context of Problem 6.5—no
efficient solution with at most polylogarithmic space overhead is known.
set of candidate pairs during nearest neighbor search. It partitions the data
space into cells and stores unique bit strings for these cells in an (option-
ally compressed) array. During a sequential scan of this array, candidates are
determined by using the stored approximations, before these candidates are
further examined to obtain the final result.
Establishing a trade-off between used disk space and obtained query time,
Goldstein and Ramakrishnan [338] presented an approach to reduce query
time by examining some characteristics of the data and storing redundant
information. Following their approach the user can explicitly relate query
performance and disk space, i.e., more redundant information can be stored
to improve query performance and vice versa. With a small percentage of
only approximately correct answers in the final result, this approach leads to
sub-linear query processing for high dimensions.
The description of algorithms for the K-Nearest Neighbors problem con-
cludes our discussion of proximity problems, that is of selecting certain points
according to their proximity to one or more query points. The next two prob-
lems also consist of selecting a subset of the original data, namely the set
contained in a given query range. These problems, however, have been dis-
cussed in detail by recent surveys [11, 56, 754], so we only sketch the main
results in this area.
The main source for solutions to the halfspace range searching problem
in the external memory setting is the paper by Agarwal et al. [6]. The au-
thors presented a variety of data structures that can be used for halfspace
range searching classifying their solutions in linear and non-linear space data
structures. All proposed algorithms rely on the following duality transform
and the fact that it preserves the “above-below” relation.
D:   G^d → IR^d :  x_d = a_d + Σ_{i=1}^{d−1} a_i x_i   ↦   (a_1, . . . , a_d)
     IR^d → G^d :  (b_1, . . . , b_d)   ↦   x_d = b_d − Σ_{i=1}^{d−1} b_i x_i
In the linear space setting, the general problem for d > 3 can be solved
using an external version of a partition tree [535] spending for any fixed ε > 0
O((N/B)^{1−1/d+ε} + Z/B) I/Os per query. The expected preprocessing complexity is O(N log2 N) I/Os. For simplex range searching queries, that is for reporting all points in S lying inside a given query simplex with µ faces of all dimensions, O((µN/B)^{1−1/d+ε} + Z/B) I/Os are sufficient. For halfspace range
searching and d = 2, the query cost can be reduced to O(logB N +Z/B) I/Os
(using O(N log2 N logB N ) expected I/Os to preprocess an external version of
a data structure by Chazelle, Guibas, and Lee [184]). Using partial rebuild-
ing, points can also be inserted into/removed from S spending amortized
O(log2 (N/B) logB N ) I/Os per update.
If one is willing to spend slightly super-linear space, the query cost in the
three-dimensional setting can be reduced to O(logB N +Z/B) I/Os at the ex-
pense of an expected overall space requirement of O((N/B) log2 (N/B)) disk
blocks. This data structure externalizes a result of Chan [175] and can be con-
structed spending an expected number of O((N/B) log2 (N/B) logB N ) I/Os.
Alternatively, Agarwal et al. [6] propose to use external versions of shallow
partition trees [536] that use O((N/B) logB N ) space and can answer a query
spending O((N/B)^ε + Z/B) I/Os. This approach can also be generalized to an arbitrary number d of dimensions: a halfspace range searching query can be answered spending O((N/B)^{1−1/⌊d/2⌋+ε} + Z/B) I/Os. The exact complex-
ity of halfspace range searching is unknown—even in the well-investigated
internal memory setting, there exist several machine model/query type com-
binations where no matching upper and lower bounds are known [11].
structure that can be updated in O(logB N) I/Os per update and orthogonal range queries in O((N/B)^{1−1/d} + Z/B) I/Os per query. The external cross-
tree can be built in O((N/B) logM/B (N/B)) I/Os. In a different model that
excludes threaded data structures like the cross-tree, Kanth and Singh [444]
obtained similar bounds (but with amortized update complexity) by layering
B-trees and k-D-trees. Their paper additionally includes a proof of a matching
lower bound.
The Orthogonal Range Searching problem has also been considered in
the batched setting: Arge et al. [65] and Goodrich et al. [345] showed how
to solve the two-dimensional problem spending O((N/B) logM/B (N/B) +
Z/B) I/Os using linear space. Arge et al. [65] extended this result to higher
dimensions and obtained a complexity of O((N/B) logM/B^{d−1}(N/B) + Z/B)
I/Os. The one-dimensional batched dynamic problem, i.e., all Q updates are
known in advance, can be solved in O(((N + Q)/B) logM/B((N + Q)/B) + Z/B) I/Os [65], but no corresponding bound is known in higher dimensions.
Problems that are slightly less general than the Orthogonal Range Search-
ing problem are the (two-dimensional) Three-Sided Orthogonal Range Search-
ing and Two-Sided Orthogonal Range Searching problem, where the query
range is unbounded at one or two sides. Both problems have been consid-
ered by several authors [129, 421, 443, 624, 709, 750], most recently by Arge,
Samoladas, and Vitter [67] in the context of indexability [388]—see also more
specific surveys [11, 56, 754].
Another recent development in the area of range searching are algorithms
for range searching among moving objects. In this setting, each object is
assigned a (static) “flight plan” that determines how the position of an object
changes as a (continuous) function of time. Using external versions of partition
trees [535], Agarwal, Arge, and Erickson [5] and Kollios and Tsotras [463]
developed efficient data structures that can be used to answer orthogonal
range queries in one and two dimensions spending O((N/B)1/2+ε + Z/B)
I/Os. These solutions are time-oblivious in the sense that the complexity of
a range query does not depend on how far the point of time of the query
is in the future. Time-responsive solutions that answer queries in the near
future (or past) faster than queries further away in time have been proposed
by Agarwal et al. [5] and by Agarwal, Arge, and Vahrenhold [8].
We conclude this section by discussing the Voronoi diagram and its graph-
theoretic dual, the Delaunay triangulation. Both structures have a variety of
proximity-related applications, e.g., in Geographic Information Systems, and
we refer the interested reader to more specific treatments of how to work with
these structures [76, 275, 336].
Problem 6.11 (Voronoi Diagram). Given a set S of N points in IRd
and a distance metric d, compute for each point p ∈ S its Voronoi region
V (p, S) := {x ∈ IRd | d(x, p) ≤ d(x, q), q ∈ S \ {p}}.
Given the above definition, the Voronoi diagram consists of the union of all
N Voronoi regions which are disjoint except for a possibly shared boundary.
(a) Voronoi diagram via lifting. (b) Delaunay triangulation via lifting.
Fig. 6.9. Computing the Voronoi diagram and the Delaunay triangulation.
tree. Their data structure occupies linear space and can be used to answer
stabbing queries spending O(logB N +Z/B) I/Os per query. As in the internal
setting, the data structure can be made dynamic, and the resulting dynamic
data structure supports both insertions and deletions with O(logB N ) worst-
case I/O-complexity.
The externalization technique used by Arge and Vitter is of independent
interest, hence, we will present it in a little more detail. In order to obtain a
query complexity of O(logB N +Z/B) I/Os, the fan-out of the base tree has to
be in O(B^c) for some constant c > 0, and for reasons that will become clear
immediately, this constant is chosen as c = 1/2. As mentioned above, the
boundaries between the children of a node v are stored at v and partition the
interval associated with v into consecutive slabs, and a segment s intersecting
the boundary of such a slab (but of no slab corresponding to a child of v’s
parent) is stored at v. The slabs intersected by s form a contiguous subinterval
[s_l, s_r] of [s_1, s_{√B}]. In the situation of Figure 6.11(a), for example, the segment s intersects the slabs s_1, s_2, s_3, and s_4, hence, l = 1 and r = 4. The indices l and r induce a partition of s into three (possibly empty) subsegments: a left subsegment s ∩ s_l, a middle subsegment s ∩ [s_{l+1}, s_{r−1}], and a right subsegment s ∩ s_r.
Each of the √B slabs associated with a node v has a left and right struc-
ture that stores left and right subsegments falling into the slab. In the situa-
tion of the interval tree, these structures are lists ordered by the x-coordinates
of the endpoints that do not lie on the slab boundary. Handling of middle
subsegments is complicated by the fact that a subsegment might span more
than one slab, and storing the segment at each such slab would increase both
space requirement and update time. To resolve this problem, Arge and Vitter
introduced the notion of multislabs: a multislab is a contiguous subinterval of [s_1, s_{√B}], and it is easy to realize that there are Θ(√B · √B) = Θ(B) such
multislabs. Each middle subsegment is stored in a secondary data structure
corresponding to the (unique) maximal multislab it spans, and as there are
only Θ(B) multislabs, the node v can accommodate pointers to all these
structures in O(1) disk blocks.6
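To see why pointers to all multislab structures fit into O(1) blocks, consider the following small sketch (Python, hypothetical helper names): with k = √B slabs there are k(k + 1)/2 = Θ(B) contiguous slab ranges, and a middle subsegment spanning slabs l through r can be mapped to a unique index among them.

def num_multislabs(k):
    # Number of contiguous, nonempty ranges of k slabs: k*(k+1)/2, i.e., Theta(B) for k = sqrt(B).
    return k * (k + 1) // 2

def multislab_index(left, right, k):
    # Map the slab range [left, right] (0 <= left <= right < k) to a unique index in [0, k*(k+1)/2).
    # Ranges are enumerated by increasing left endpoint; left endpoint l contributes k - l ranges.
    assert 0 <= left <= right < k
    return left * k - left * (left - 1) // 2 + (right - left)

# Example: B = 16 gives k = 4 slabs and 10 multislabs, so a node can keep one pointer
# per multislab structure in a constant number of disk blocks.
k = 4
assert num_multislabs(k) == 10
assert sorted(multislab_index(l, r, k) for l in range(k) for r in range(l, k)) == list(range(10))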
As in the internal memory setting, a stabbing query with query value x is answered
by performing a search for x and querying all secondary structures of the
nodes visited along the path. As the tree is of height O(logB N ), and as
each left and right structure that contributes Z′ ≥ 0 elements to the answer set can be queried in O(1 + Z′/B) I/Os, the overall query complexity is
O(logB N + Z/B) I/Os.7
6 To ensure that the overall space requirement is O(N/B) disk blocks, multislab lists containing too few segments are grouped together into a special underflow structure [71].
7 Note that each multislab structure queried contributes all its elements to the answer set, hence, the complexity of querying O(√B logB N) multislab structures is O(Z/B).
The main problem with making the interval tree dynamic is that the in-
sertion of a new interval might augment the set of x-coordinates in S. As a
consequence, the base tree structure of the interval tree has to be reorganized,
and this in turn might require several segments to be moved between secondary
structures of different nodes. Using weight-balanced B-trees (see Chapter 2)
and a variant of the global rebuilding technique [599], Arge and Vitter ob-
tained a linear-space dynamic version of the interval tree that answers stab-
bing queries in O(logB N + Z/B) I/Os and can be updated in O(logB N )
I/Os worst-case.
multislab can be held in main memory, and since the number of multislabs is quadratic in the number of slabs, the number of slabs, that is, the fan-out of the base tree (and thus of the corresponding distribution sweeping process), is chosen as Θ(√(M/B)).9
To facilitate finding the segment immediately above another segment’s
endpoint, the segments in the multislab structures have to be sorted accord-
ing to the “above-below” relation. Given that the solution to the Endpoint
Dominance problem will be applied to solve the Segment Sorting problem
(Problem 6.13), this seems a prohibitive operation. Exploiting the fact, however, that the middle subsegments have their endpoints on a set of Θ(√(M/B))
slab boundaries, Arge et al. [70] demonstrated how these segments can be
sorted in a linear number of I/Os using only a standard (one-dimensional)
sorting algorithm. Extending the external segment tree by keeping left and
right subsegments in sorted order as they are distributed to slabs on the
next level and using a simple counting argument, it can be shown that such
an extended external segment tree can be constructed top-down spending
O((N/B) logM/B (N/B)) I/Os.10
The endpoint dominance queries are then filtered through the tree re-
membering for each query point the lowest dominating segment seen so far.
Filtering is done bottom-up reflecting the fact that the segment tree has
been built top-down. Arge et al. [70] built on the concept of fractional cas-
cading [182] and proposed to use segments sampled from the multislab lists
of a node v to each child (instead of the other way round) as bridges that
help finding the dominating segment in v once the dominating segment in the
nodes below v (if any) has been found. The number of sampled segments is
chosen such that the overall space requirement of the tree does not (asymp-
totically) increase and that, simultaneously for all multislabs of a node v,
all segments between two sampled segments can be held in main memory.
Then, Q queries can be filtered through the extended external segment tree
spending O(((N +Q)/B) logM/B (N/B)) I/Os, and after the filtering process,
all dominating segments are found.
A second approach is based upon the close relationship to the Trape-
zoidal Decomposition problem (Problem 6.15), namely that the solution for
the Endpoint Dominance problem can be derived from the trapezoidal de-
composition spending O(N/B) I/Os. As we will sketch, an algorithm derived
in the framework of Crauser et al. [228] computes the Trapezoidal Decom-
position of N non-intersecting segments spending an expected number of
9 Using a base tree with M/B fan-out does not asymptotically change the complexity as O((N/B) log_{√(M/B)} (N/B)) = O((N/B) logM/B (N/B)). More precisely, the smaller fan-out results in a tree with twice as many levels.
10 At present, it is unknown whether an extended external segment tree can be built efficiently in a multi-disk environment, that is, whether the complexity of building this structure is O((N/DB) logM/B (N/B)) I/Os for D ∈ O(1) [70].
O((N/B) logM/B (N/B)) I/Os, hence the Endpoint Dominance problem can
be solved spending asymptotically the same number of I/Os.
Arge et al. [70] demonstrate how the Segment Sorting problem (Prob-
lem 6.13) can be solved by reduction to the Endpoint Dominance problem
(Problem 6.14). Just as for computing the trapezoidal decomposition, two in-
stances of the Endpoint Dominance problem are solved, this time augmented
with horizontal segments at y = +∞ and y = −∞. Based upon the solu-
tion of these two instances, a directed graph G is created as follows: each
segment corresponds to a node, and if a segment u is dominated from above
(from below) by a segment v, the edge (u, v) (the edge (v, u)) is added to the
graph. The two additional segments ensure that each of the original segments
is dominated from above and from below, hence, the resulting graph is a pla-
nar (s, t)-graph. Computing the desired total order on S then corresponds
to topologically sorting G. As G is a planar (s, t)-graph of complexity Θ(N ),
this can be accomplished spending no more than O((N/B) logM/B (N/B))
I/Os [192].
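The following in-memory sketch (Python, hypothetical helper names) illustrates this reduction on a small scale: the endpoint dominance information is computed naively, the two sentinel segments at y = ±∞ are omitted, and the resulting acyclic dominance graph is sorted topologically with Kahn's algorithm; the external algorithm obtains the same information I/O-efficiently as described above.

from collections import defaultdict, deque

def y_at(seg, x):
    (x1, y1), (x2, y2) = seg
    t = (x - x1) / (x2 - x1)          # assume non-vertical, non-degenerate segments
    return y1 + t * (y2 - y1)

def dominance_edges(segments):
    # Edge (u, v) means "u is dominated from above by v", i.e., v is the segment
    # immediately above one of u's endpoints.
    edges = set()
    for i, u in enumerate(segments):
        for (qx, qy) in u:
            best, best_y = None, None
            for j, v in enumerate(segments):
                if i == j:
                    continue
                lo, hi = sorted((v[0][0], v[1][0]))
                if lo <= qx <= hi:
                    y = y_at(v, qx)
                    if y > qy and (best_y is None or y < best_y):
                        best, best_y = j, y
            if best is not None:
                edges.add((i, best))
    return edges

def topological_order(n, edges):
    # Kahn's algorithm; the graph is acyclic because the segments do not intersect.
    succ, indeg = defaultdict(list), [0] * n
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return order      # a total order consistent with the above-below relation

# Example: three non-intersecting segments, listed bottom to top.
segs = [((0, 0), (4, 1)), ((1, 2), (3, 2)), ((0, 5), (4, 4))]
print(topological_order(len(segs), dominance_edges(segs)))   # -> [0, 1, 2]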
one after the other, but in random order). Externalization is facilitated using
gradations (see, e.g., [566]), a concept originating in the design of parallel
algorithms. A gradation is a geometrically increasing random sequence of subsets ∅ = S0 ⊆ S1 ⊆ · · · ⊆ S. The randomized incremental construction with gradations refines the (intermediate) solution for Si by simultaneously adding all objects in Si+1 \ Si (that is, in parallel or blockwise, respectively). This
framework is both general and powerful enough to yield algorithms with ex-
pected optimal complexity for a variety of geometric problems. As discussing
the sophisticated details and the analysis of the resulting algorithms would be
beyond the scope of this survey, we will only mention these results whenever
appropriate and instead refer the interested reader to the original article [228].
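A gradation can be generated by repeated random subsampling, as in the following sketch (Python, hypothetical function name, sampling probability fixed to 1/2); each level is, in expectation, a constant factor larger than the previous one, which is all the framework needs.

import random

def gradation(objects, keep_probability=0.5, seed=None):
    # Returns subsets S_0 = {} <= S_1 <= ... <= S_r = objects (as lists),
    # each obtained by randomly subsampling the next larger one.
    rng = random.Random(seed)
    levels = [list(objects)]
    while levels[-1]:
        levels.append([x for x in levels[-1] if rng.random() < keep_probability])
    levels.reverse()
    return levels

# Example: the incremental construction would insert the objects of S_{i+1} \ S_i blockwise.
print([len(level) for level in gradation(range(32), seed=1)])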
(a) Vertical ray shooting from point p. (b) Point location query for point p.
nadan [188] and described a dynamic data structure for storing ν left and
right subsegments with O(logB ν) update time. Maintenance of the middle
segments is complicated by the fact that not all segments are comparable
according to the above-below relation (Problem 6.13), and that insertion of a
new segment might globally affect the total order induced by this (local) par-
tial order. Using level-balanced B-trees (see Chapter 2) and exploiting special
properties of monotone subdivisions, Agarwal et al. [4] obtained a dynamic
data structure for storing ν middle subsegments with O(log²B ν) update time. The global data structure uses linear space and can be used to answer a vertical ray-shooting query in a monotone subdivision spending O(log²B N) I/Os. The amortized update complexity is O(log²B N).
This result was improved by Arge and Vahrenhold [69] who applied the
logarithmic method (see Chapter 2) and an external variant of dynamic frac-
tional cascading [182, 183] to obtain the same update and query complexity
for general subdivisions.12 The analysis is based upon the (realistic) assump-
tion B² < M. Under the weaker assumption 2B < M, the amortized insertion bound becomes O(logB N · logM/B (N/B)) I/Os while all other
bounds remain the same.
A batched semidynamic version, that is, only deletions or only insertions
are allowed, and all updates have to be known in advance, has been proposed
by Arge et al. [65]. Using an external decomposition approach to the problem,
O(Q) point location queries and O(N) updates can be performed in O(((N + Q)/B) log²M/B ((N + Q)/B)) I/Os using O((N + Q)/B) space.
Usually, each edge in a planar partition stores the names of the two faces
of Π it separates. Then, algorithms for solving the Vertical Ray-Shooting
problem (Problem 6.17) can be used to answer point location queries with
constant additional work.
Most algorithms for vertical ray-shooting exploit hierarchical decomposi-
tions which can be generalized to a so-called trapezoidal search graph [680].
Using balanced hierarchical decompositions, searching then can be done ef-
ficiently in both the internal and external memory setting. As the query
points and thus the search paths to be followed are not known in advance,
external memory searching in such a graph will most likely result in unpre-
dictable access patterns and random I/O operations. The same is true for
using general-purpose tree-based spatial index structures.
It is well known that disk technologies and operating systems sup-
port sequential I/O operations more efficiently than random I/O opera-
tions [519, 739]. Additionally, for practical applications, it is often desirable
to trade asymptotically optimal performance for simpler structures if there is
12 The deletion bound can be improved to O(logB N) I/Os amortized.
same process is repeated for the set constructed from the blue segments and
the endpoints of the red segments. We now describe the work done on each
level of recursion during the distribution sweeping.
In the terminology of the description of the external interval tree (see
Problem 6.12), the algorithm first detects intersections between red middle
subsegments and blue left and right subsegments. The key to an efficient
solution is to explicitly construct the endpoints of the blue left and right
subsegments that lie on the slab boundaries and to merge them into the sorted
list of red middle subsegments and the (proper) endpoints of the blue left
and right subsegments. During a top-down sweep over the plane (in segment
order), blue left and right subsegments are then inserted into active lists
of their respective slab as soon as their topmost endpoint is encountered,
and for each red middle subsegment s encountered, the active lists of the
slabs spanned by s are scanned to produce red-blue pairs of intersecting
segments. As soon as a red middle subsegment does not intersect a blue left
or right subsegment, this blue segment cannot be intersected by any other red
segment, hence, it can be removed from the slab’s active list. An amortization
argument shows that all intersections can be reported in a linear number of
I/Os. An analogous scan is performed to report intersections between blue
middle subsegments and red left and right subsegments.
In a second phase, intersections between middle subsegments of different
colors are reported. For each multislab, a multislab list is created, and each
red middle subsegment is then distributed to the list of the maximal multislab
that it spans. An immediate consequence of the red segments being sorted
is that each multislab list is sorted by construction. Using a synchronized
traversal of the sorted list of blue middle subsegments and multislab lists
and repeating the process for the situation of the blue middle subsegments
being distributed, all red-blue pairs of intersecting middle subsegments can be
reported spending a linear number of I/Os. Intersections between non-middle
subsegments of different colors are found by recursion within the slabs. As
in the orthogonal setting, a linear number of I/Os is spent on each level
of recursion, hence, the overall I/O-complexity is O((N/B) logM/B (N/B) +
Z/B).
Since computing the trapezoidal decomposition of a set of segments yields
the Z intersection points without additional work, an algorithm with ex-
pected optimal O((N/B) logM/B (N/B) + Z/B) I/O-complexity can be de-
rived in the framework of Crauser et al. [228].
of O(((N + Z)/B) logM/B (N/B)) has been proposed by Arge et al. [70]. The
main idea is to integrate all phases of the deterministic solution described
for the Bichromatic Segment Intersection problem (see Problem 6.19) into
one single phase. The distribution sweeping paradigm is not directly ap-
plicable because there is no total order on a set of intersecting segments.
Arge et al. [70] proposed to construct an extended external segment tree on
the segments and (during the construction of this data structure) to break the
segments stored in the same multislab lists into non-intersecting fragments.
The resulting segment tree can then be used to detect intersections between
segments stored in different multislab lists. For details and the analysis of
this second phase, we refer the reader to the full version of the paper [70].
Since computing the trapezoidal decomposition of a set of segments yields
the Z intersection points without additional work, an algorithm with ex-
pected optimal O((N/B) logM/B (N/B) + Z/B) I/O-complexity can be de-
rived in the framework of Crauser et al. [228]. It remains an open problem,
though, to find a deterministic optimal solution for the Segment Intersection
problem.
Jagadish [426] developed a completely different approach to finding all
line segments that intersect a given line segment. Applying this algorithm
to all segments and removing duplicates, it can also be used to solve Prob-
lem 6.20. This algorithm, which has been shown experimentally to perform well
for real-world data sets [426], partitions the d-dimensional data space into
d partitions (one for each axis) and stores a small amount of data for each
line segment in the partition with whose axis this line segment defines the
smallest angle. The data stored is determined by using a modified version of
the Hough transform [408]. For simplicity, the planar case is considered here first,
before we show how to generalize it to higher dimensions. In the plane, each
line segment determines a line given by either y = m · x + b or x = m · y + b,
and at least one of these lines has a slope in [−1, 1]. This equation is taken
to map m and b to a point in (2-dimensional) transform space by a duality
transform. An intersection test for a given line segment works as follows. The
two endpoints are transformed into lines first. Assuming for simplicity, that
these lines intersect (the approach also works for parallel lines), we know that
these two lines divide the transform space into four regions. Transforming a
third point of the line segment, the two regions between the transformed lines
can be determined easily. The points contained in these regions (or, rather,
the segment supported by their dual lines) are candidates for intersecting
line segments. Whether they really intersect can be tested by comparing the
projections on the partition axis of both segments which have been stored
along with each point in transform space. For d-dimensional data space, only
small changes are necessary. After determining the partition axis, the projections of
each line segment on the d − 1 planes involving this axis are treated as above
resulting in d − 1 lines and a point in (2(d − 1))-dimensional transform space.
In addition, the interval of the projection on the partition axis is stored. Note
that this technique needs 2dN space to store the line segments. Unfortunately,
no asymptotic bounds for query time are given, but experiments show that
this approach is more efficient than using spatial index structures or trans-
forming the two d-dimensional endpoints into one point in 2d-dimensional
data space, which are both very common approaches. Some other problems
including finding all line segments passing through or lying in the vicinity of
a specified point can be solved by this technique [426].
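The planar candidate test can be sketched as follows (Python; the function names are hypothetical and the bookkeeping follows [426] only loosely). A stored segment is represented by the transform-space point (m, b) of its supporting line y = m·x + b; the dual of a query endpoint (px, py) is the line b = py − m·px, and a stored point lying between the duals of the two query endpoints means that the stored supporting line separates those endpoints. In this sketch the verification step computes the crossing point explicitly and checks it against the x-projections of both segments, a slight strengthening of the projection comparison described above.

def dual_point(seg):
    # Supporting line y = m*x + b of a non-vertical segment -> point (m, b) in transform space.
    (x1, y1), (x2, y2) = seg
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1

def separates(point, q1, q2):
    # Does the primal line of (m, b) pass between the query endpoints q1 and q2?
    # Equivalently: does (m, b) lie in the double wedge between their dual lines?
    m, b = point
    s1 = q1[1] - (m * q1[0] + b)
    s2 = q2[1] - (m * q2[0] + b)
    return s1 * s2 <= 0

def verified_intersection(stored_seg, query_seg):
    # The crossing x of the two supporting lines must lie in both x-projections.
    m, b = dual_point(stored_seg)
    mq, bq = dual_point(query_seg)
    if m == mq:
        return False                 # parallel supporting lines: ignored in this sketch
    x_star = (bq - b) / (m - mq)
    in_proj = lambda seg: min(seg[0][0], seg[1][0]) <= x_star <= max(seg[0][0], seg[1][0])
    return in_proj(stored_seg) and in_proj(query_seg)

def intersecting_segments(stored, query_seg):
    q1, q2 = query_seg
    candidates = [s for s in stored if separates(dual_point(s), q1, q2)]
    return [s for s in candidates if verified_intersection(s, query_seg)]

# Example: the first stored segment crosses the query segment, the second lies above it.
stored = [((0, 0), (4, 2)), ((0, 3), (4, 5))]
print(intersecting_segments(stored, ((1, 2), (3, 0))))   # -> [((0, 0), (4, 2))]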
The Segment Intersection problem has a natural extension: Given a set
of polygonal objects, report all intersecting pairs of objects. While at first it
seems that this extension is quite straightforward, we will demonstrate in the
next section that only special cases can be solved efficiently.
Fig. 6.17. Using rectangular bounding boxes for a spatial join operation: (a) bounding boxes of road features (Block Island, RI); (b) bounding boxes of roads and hydrography features (Block Island).
first used to query the multislab lists for overlap with other intervals be-
fore the middle subsegments are inserted into the multislab lists themselves
(left and right subsegments are treated recursively).13 A middle subsegment
is removed from the multislab lists when the sweep-line passes the lower
boundary of the original rectangle. Making sure that these deletions are per-
formed in a blocked manner, one can show that the overall I/O-complexity
is O((N/B) logM/B (N/B) + Z/B). In the case of rectangles, the reduction
to finding intersection of edges yields an efficient algorithm as the number
of intersecting pairs of objects is asymptotically the same as the number of
intersecting pairs of edges—in the more general case of polygons, this is not
the case (see Problem 6.22).
In the database community, this problem is considered almost exclusively
in the bichromatic case of the filter step of spatial join operations, and several
heuristics implementing the filter step and performing well for real-world
data sets have been proposed during the last decade. Most of the proposed
algorithms [101, 150, 151, 362, 401, 414, 603, 704] need index structures for
both data sets, while others only require one data set to be indexed [63,
365, 511, 527], but also spatial hash joins [512] and other non index-based
algorithms [64, 477, 606] have been presented. Moreover, other conservative
approximation techniques besides minimum bounding boxes—mainly in the
planar case—like convex hull, minimum bounding m-corner (especially m ∈
{4, 5}), smallest enclosing circle, or smallest enclosing ellipse [149, 150] as well
as four-color raster signatures [782] have been considered. Using additional
progressive approximations like maximum enclosed circle or rectangle leads
to fast identification of object pairs which can be reported without testing
the exact geometry [149, 150]. Rotem [639] proposed to transform the idea
of join indices [741] to n-dimensional data space using the grid file [583].
The central idea behind all approaches summarized above is to repeat-
edly reduce the working set by pruning or partitioning until it fits into main
memory where an internal memory algorithm can be used. Most index-based
algorithms exploit the hierarchical representation implicitly given by the in-
dex structures to prune parts of the data sets that cannot contribute to the
output of the join operator. In contrast, algorithms for non-indexed spatial
join try to reduce the working set by either imposing an (artificial) order
and then performing some kind of merging according to this order or by
hashing the data to smaller partitions that can be treated separately. The
overall performance of algorithms for the filter step, however, often depends
on subtle design choices and characteristics of the data set [63], and therefore
discussing these approaches in sufficient detail would be beyond the scope of
this survey.
The refinement step of the spatial join cannot rely on approximations of
the polygonal objects but has to perform computations on the exact repre-
13 In the bichromatic setting, two sets of multislab lists are used, one for each color.
sentations of the objects that passed the filter step. In this step, the problem
is to determine all pairs of polygonal objects that fulfill the join predicate.
(a) Two simple polygons may have Θ(N²) intersecting pairs of edges. (b) Two convex polygons may have Θ(N) intersecting pairs of edges.
Some effort has also been made to combine spatial index structures and
internal memory algorithms ([174, 584]) for finding line segment intersec-
tions [100, 480], but these results rely on practical considerations about the
input data. Another approach, which is also claimed to be efficient for real-world data sets [149], generates variants of R-trees, namely TR*-trees [671], for both data sets and uses them to compute the result afterwards.
6.6 Conclusions
In this survey, we have discussed algorithms and data structures that can be
used for solving large-scale geometric problems. While a lot of research has
been done both in the context of spatial databases and in algorithmics, one
of the most challenging problems is to combine the best of these two worlds,
that is, algorithmic design techniques and insights gained from experiments for
real-world instances. The field of external memory experimental algorithmics
is still wide open.
Several important issues in large-scale Geographic Information Systems
have not been addressed in the context of external memory algorithms, in-
cluding how to externalize algorithms on triangulated irregular networks or
how to (I/O-efficiently) perform map-overlay on large digital maps. We con-
clude this chapter by stating two prominent open problems for which optimal
algorithms are known only in the internal memory setting:
– Is it possible to triangulate a simple polygon given its vertices in coun-
terclockwise order along its boundary spending only a linear number of
I/Os?
– Is it possible to compute all Z pairs of intersecting line segments in a set of
N line segments in the plane using a deterministic algorithm that spends
only O((N/B) logM/B (N/B) + Z/B) I/Os?
7. Full-Text Indexes in External Memory
Juha Kärkkäinen∗ and S. Srinivasa Rao
7.1 Introduction
A full-text index is a data structure storing a text (a string or a set of strings)
and supporting string matching queries: Given a pattern string P , find all
occurrences of P in the text. The best-known full-text index is the suffix
tree [761], but numerous others have been developed. Due to their fast con-
struction and the wealth of combinatorial information they reveal, full-text
indexes (and suffix trees in particular) also have many uses beyond basic
string matching. For example, the number of distinct substrings of a string
or the longest common substrings of two strings can be computed in lin-
ear time [231]. Gusfield [366] describes several applications in computational
biology, and many others are listed in [359].
Most of the work on full-text indexes has been done on the RAM model,
i.e., assuming that the text and the index fit into the internal memory. How-
ever, the size of digital libraries, biosequence databases and other textual
information collections often exceeds the size of the main memory on most
computers. For example, the GenBank [107] database contains more than
20 GB of DNA sequences in its August 2002 release. Furthermore, the size
of a full-text index is usually 4–20 times larger than the size of the text it-
self [487]. Finally, if an index is needed only occasionally over a long period
of time, one has to keep it either in internal memory reducing the memory
available to other tasks or on disk requiring a costly loading into memory
every time it is needed.
In their standard form, full-text indexes have poor memory locality. This
has led to several recent results on adapting full-text indexes to external
memory. In this chapter, we review the recent work focusing on two issues,
full-text indexes supporting I/O-efficient string matching queries (and up-
dates), and external memory algorithms for constructing full-text indexes
(and for sorting strings, a closely related task).
We do not treat other string techniques in detail here. Most string
matching algorithms that do not use an index work by scanning the text
more or less sequentially (see, e.g., [231, 366]), and are relatively trivial to
adapt to an externally stored text. Worth mentioning, however, are algo-
rithms that may generate very large automata in pattern preprocessing, such
as [486, 533, 573, 735, 770], but we are not aware of external memory versions
of these algorithms.
∗ Partially supported by the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
7.2 Preliminaries
We begin with a formal description of the problems and the model of com-
putation.
The Problems Let us define some terminology and notation. An alphabet
Σ is a finite ordered set of characters. A string S is an array of characters,
S[1, n] = S[1]S[2] . . . S[n]. For 1 ≤ i ≤ j ≤ n, S[i, j] = S[i] . . . S[j] is a
substring of S, S[1, j] is a prefix of S, and S[i, n] is a suffix of S. The set of
all strings over alphabet Σ is denoted by Σ ∗ .
The main problem considered here is the indexed string matching problem (Problem 1): store a text T in a data structure so that, given a pattern string P, all occurrences of P in T can be reported efficiently.
All the full-text indexes described here have a linear space complexity.
Therefore, the focus will be on the time complexity of queries and updates
(Section 7.4), and of construction (Section 7.5).
Additionally, the string sorting problem will be considered in Section 7.5.5.
scan(N) = Θ(N/B)
sort(N) = Θ((N/B) logM/B (N/B))
search(N) = Θ(logB N)
In this section, we introduce some basic techniques. We start with the (for our
purposes) most important internal memory data structures and algorithms.
Then, we describe two external memory techniques that are used more than
once later.
1 With techniques such as hashing, this is nearly true even for the integer alphabet model. However, integer dictionaries are a complex issue and outside the scope of this article.
Fig. 7.1. Trie and compact trie for the set {potato, pottery, tattoo, tempo}
Most full-text indexes are variations of three data structures, suffix ar-
rays [340, 528], suffix trees [761] and DAWGs (Directed Acyclic Word
Graphs) [134, 230]. In this section, we describe suffix arrays and suffix trees,
which form the basis for the external memory data structures described here.
We are not aware of any adaptation of DAWG for external memory.
Let us start with an observation that underlies almost all full-text indexes.
If an occurrence of a pattern P starts at position i in a string S ∈ T , then P
is a prefix of the suffix S[i, |S|]. Therefore, we can find all occurrences of P
by performing a prefix search query on the set of all suffixes of the text: A
prefix search query asks for all the strings in the set that contain the query
string P as a prefix. Consequently, a data structure that stores the set of all
suffixes of the text and supports prefix searching is a full-text index.
The simplest data structure supporting efficient prefix searching is the
lexicographically sorted array, where the strings with a given prefix always
form a contiguous interval. The suffix array of a text T , denoted by SAT ,
is the sorted array of pointers to the suffixes of T (see Fig. 7.2). By a bi-
nary search, a string matching (prefix search) query can be answered with
O(log2 N ) string comparisons, which needs O(|P | log2 N ) time in the worst
case. Manber and Myers [528] describe how the binary search can be done in
O(|P | + log2 N ) time if additional (linear amount of) information is stored
about longest common prefixes. Manber and Myers also show how the suffix
array can be constructed in time O(N log2 N ). Suffix arrays do not support
efficient updates.
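A minimal in-memory sketch (Python, hypothetical function names) of a suffix array and of the plain O(|P| log2 N) binary search, without the longest-common-prefix refinement of Manber and Myers:

def suffix_array(text):
    # Sorted array of suffix start positions; built naively here (see Section 7.5 for efficient construction).
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    # The suffixes starting with 'pattern' form one contiguous interval of the suffix array.
    def first_index(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            probe = text[sa[mid]:sa[mid] + len(pattern)]
            if probe < pattern or (strict and probe == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo
    left, right = first_index(strict=False), first_index(strict=True)
    return sorted(sa[left:right])      # start positions of all occurrences

# Example (cf. Fig. 7.2): the two occurrences of "an" in "banana".
text = "banana"
sa = suffix_array(text)                   # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "an"))   # -> [1, 3]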
The trie is another simple data structure for storing a set of strings [460].
A trie (see Fig. 7.1) is a rooted tree with edges labeled by characters. A
node in a trie represents the concatenation of the edge labels on the path
from the root to the node. A trie for a set of strings is the minimal trie
whose nodes represent all the strings in the set. If the set is prefix free, i.e.,
no string is a proper prefix of another string, all the nodes representing the
strings are leaves. A compact trie is derived from a trie by replacing each
maximal branchless path with a single edge labeled by the concatenation of
the replaced edge labels (see Fig. 7.1).
Fig. 7.2. Suffix array SAT and suffix tree STT for the text T = {banana}. For the suffix tree, a sentinel character $ has been added to the end. Suffix links are shown with dashed arrows. Also shown are the answers to a string matching query P = an: in SAT the marked interval, in STT the subtree rooted at +. Note that the strings shown in the figure are not stored explicitly in the data structures but are represented by pointers to the text.
The suffix tree of a text T , denoted by STT , is the compact trie of the
set of suffixes of T (see Fig. 7.2). With suffix trees, it is customary to add
a sentinel character $ to the end of each string in T to make the set of
suffixes prefix free. String matching (prefix searching) in a suffix tree is done
by walking down the tree along the path labeled by the pattern (see Fig. 7.2).
The leaves in the subtree rooted at where the walk ends represent the set of
suffixes whose prefix is the pattern. The time complexity is O(|P |) for walking
down the path (under the constant alphabet model) and O(Z) for searching
the subtree, where Z is the size of the answer.
The suffix tree has O(N ) nodes, requires O(N ) space, and can be con-
structed in O(N ) time. Most linear-time construction algorithms, e.g. [540,
736, 761], assume the constant alphabet model, but Farach’s algorithm [288]
also works in the integer alphabet model. All the fast construction algorithms
rely on a feature of suffix trees called suffix links. A suffix link is a pointer
from a node representing the string aα, where a is a single character, to
a node representing α (see Fig. 7.2). Suffix links are not used in searching
but they are necessary for an insertion or a deletion of a string S in time
O(|S|) [297] (under the constant alphabet model).
Fig. 7.3. Pat tree PTT for T = {banana$} using native encoding and binary
encoding. The binary encoding of characters is $=00, a=01, b=10, n=11.
The first technique is the Patricia trie [557], which is a close relative of
the compact trie. The difference is that, in a Patricia trie, the edge labels
contain only the first character (branching character) and the length (skip
value) of the corresponding compact trie label. The Patricia trie for the set of
suffixes of a text T , denoted by PTT , is called the Pat tree [340]. An example
is given in Fig. 7.3.
The central idea of Patricia tries and Pat trees is to delay access to the
text as long as possible. This is illustrated by the string matching procedure.
String matching in a Pat tree proceeds as in a suffix tree except only the
first character of each edge is compared to the corresponding character in the
pattern P . The length/skip value tells how many characters are skipped. If
the search succeeds (reaches the end of the pattern), all the strings in the
resulting subtree have the same prefix of length |P |. Therefore, either all of
them or none of them have the prefix P . A single string comparison between
the pattern and some string in the subtree is required to find out which is
the case. Thus, the string matching time is O(|P | + Z) as with the suffix tree,
but there is now only a single contiguous access to the text.
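A small in-memory sketch (Python, hypothetical node representation: internal nodes carry only their string depth and the branching characters, leaves carry the string itself in place of a pointer into the text) illustrates the blind descent and the single verification:

def build_pat(strings):
    # Patricia-style trie for a prefix-free set of distinct strings.
    # Internal node: ("node", depth, children); leaf: ("leaf", string).
    return _build(sorted(strings), 0)

def _build(strings, depth):
    if len(strings) == 1:
        return ("leaf", strings[0])
    d = depth
    while len({s[d] for s in strings}) == 1:      # prefix-freeness guarantees s[d] exists here
        d += 1
    children, i = {}, 0
    while i < len(strings):                       # group the sorted strings by their character at position d
        j = i
        while j < len(strings) and strings[j][d] == strings[i][d]:
            j += 1
        children[strings[i][d]] = _build(strings[i:j], d + 1)
        i = j
    return ("node", d, children)

def _any_leaf(node):
    while node[0] == "node":
        node = next(iter(node[2].values()))
    return node[1]

def pat_prefix_match(root, pattern):
    # Blind search: only branching characters are compared on the way down;
    # a single comparison against one stored string decides the answer.
    node = root
    while node[0] == "node" and node[1] < len(pattern):
        child = node[2].get(pattern[node[1]])
        if child is None:
            return False                          # a branching character mismatches the pattern
        node = child
    return _any_leaf(node).startswith(pattern)    # the single access to the "text"

# Example: the suffixes of banana$ (cf. Fig. 7.2 and 7.3).
root = build_pat(["banana$", "anana$", "nana$", "ana$", "na$", "a$", "$"])
print(pat_prefix_match(root, "an"), pat_prefix_match(root, "ax"))   # -> True False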
Any string can be seen as a binary string through a binary encoding of
the characters. A prefix search on a set of such binary strings is equivalent
to a prefix search on the original strings. Patricia tries and Pat trees are
commonly defined to use the binary encoding instead of the native encoding,
because it simplifies the structure in two ways. First, every internal node has
degree two. Second, there is no need to store even the first bit of the edge
label because the left/right distinction already encodes for that. An example
is shown in Fig. 7.3.
The second technique is lexicographic naming introduced by Karp, Miller
and Rosenberg [450]. A lexicographic naming of a (multi)set S of strings is
an assignment of an integer (the name) to each string such that any order
comparison of two names gives the same result as the lexicographic order
comparison of the corresponding strings. Using lexicographic names, arbi-
trarily long strings can be compared in constant time without a reference to
Fig. 7.4. Lexicographic naming of the substrings of length three in banana$$ (ban→4, ana→2, nan→6, ana→2, na$→5, a$$→1), and of the suffixes of banana (banana→4, anana→3, nana→6, ana→2, na→5, a→1)
the actual strings. The latter property makes lexicographic naming a suitable
technique for external memory algorithms.
A simple way to construct a lexicographic naming for a set S is to sort
S and use the rank of a string as its name, where the rank is the number of
lexicographically smaller strings in the set (plus one). Fig. 7.4 displays two
examples that are related to the use of lexicographic naming in Section 7.5.
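A minimal sketch (Python, hypothetical function name) of this rank-based naming; equal strings receive equal names, so comparing names is equivalent to comparing the strings:

from bisect import bisect_left

def lex_names(strings):
    # Name of a string = 1 + number of lexicographically smaller strings in the multiset.
    srt = sorted(strings)
    return [bisect_left(srt, s) + 1 for s in strings]

# The two examples of Fig. 7.4:
print(lex_names(["ban", "ana", "nan", "ana", "na$", "a$$"]))        # -> [4, 2, 6, 2, 5, 1]
print(lex_names(["banana", "anana", "nana", "ana", "na", "a"]))     # -> [4, 3, 6, 2, 5, 1]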
Lexicographic naming has an application with linguistic texts, where
words can be considered as ‘atomic’ elements. As mentioned in the intro-
duction, inverted files are often preferred to full-text indexes in this case
because of their smaller space requirement. However, the space requirement
of full-text indexes (at least suffix arrays) can be reduced to the same level
by storing only suffixes starting at the beginning of a word [340] (making
them no longer full-text indexes). A problem with this approach is that most
fast construction algorithms rely on the inclusion of all suffixes. A solution
is to apply lexicographic naming to the set of distinct words and transform
the text into strings of names. Full-text indexes on such transformed texts
are called word-based indexes [46, 227].
particular, they introduce two index structures for two and three level mem-
ory hierarchy (that use main memory and one/two levels of external storage),
and present experimental and analytical results for these. These additional
index structures are much smaller in terms of space compared to the text and
the suffix array. Though these structures improve the performance only by a constant factor in theory, one can adjust the parameters of the structure to get good practical performance. Here, we briefly describe the structure for
a two-level hierarchy. One can use a similar approach for building efficient
indexes for a steeper hierarchy.
Two-Level Hierarchy. The main idea is to divide the suffix array into
blocks of size p, where p is a parameter, and move one element of each block
into main memory, together with the first few characters of the corresponding
suffix. This structure can be considered a reduced representation of the suffix
array and the text file, and is called Short Pat array or SPat array.2 The SPat array is a set of suffix array entries where each entry also carries a fixed number of characters from the text; this number is a parameter of the structure (see Fig. 7.5). Due to the additional information about the text in the SPat array,
a binary search can be performed directly without accessing the disk. As a
result, most of the searching work is done in main memory, thus reducing the
number of disk accesses. Searching for a pattern using this structure is done
in two phases:
First, a binary search is performed on the SPat array, with no disk accesses, to find the suffix array block containing the pattern occurrence. Additional disk accesses are necessary only if the pattern is longer than the stored prefixes and there are multiple entries in the SPat array that match the corresponding prefix of the pattern. Then, O(log2 r) disk accesses are needed, where r is the number of matching entries.
2 A suffix array is sometimes also referred to as a Pat array.
Second, the suffix array block encountered in the first phase is moved from
disk to main memory. A binary search is performed between main memory
(suffix array block containing the answer) and disk (text file) to find the first
and last entries that match the pattern. If the pattern occurs more than p
times in the text, these occurrences may be bounded by at most two SPat
array entries. In this case the left and right blocks are used in the last phase
of the binary search following the same procedure.
The main advantages of this structure are its space efficiency (little more
than a suffix array) and ease of implementation. See [83] for the analytical
and experimental results. This structure does not support updates efficiently.
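The two search phases can be illustrated by the following in-memory sketch (Python, hypothetical function names): the suffix array and the text play the role of the disk-resident data; the SPat array, one entry per block of p suffix array positions together with the first few characters of the corresponding suffix, stays in main memory.

from bisect import bisect_left, bisect_right

def build_spat(text, p, ell):
    sa = sorted(range(len(text)), key=lambda i: text[i:])             # the full suffix array ("on disk")
    spat = [(text[sa[j]:sa[j] + ell], j) for j in range(0, len(sa), p)]
    return sa, spat

def spat_search(text, sa, spat, pattern, ell):
    # Returns the interval [left, right) of suffix array positions whose suffixes start with 'pattern'.
    m = min(ell, len(pattern))
    keys = [k[:m] for k, _ in spat]
    # Phase 1: binary search on the in-memory SPat array only.
    lo_block, hi_block = bisect_left(keys, pattern[:m]), bisect_right(keys, pattern[:m])
    lo = spat[lo_block - 1][1] if lo_block > 0 else 0
    hi = spat[hi_block][1] if hi_block < len(spat) else len(sa)
    # Phase 2: binary searches inside the candidate range; only these probes touch the "disk".
    def first_index(strict):
        a, b = lo, hi
        while a < b:
            mid = (a + b) // 2
            probe = text[sa[mid]:sa[mid] + len(pattern)]
            if probe < pattern or (strict and probe == pattern):
                a = mid + 1
            else:
                b = mid
        return a
    return first_index(strict=False), first_index(strict=True)

# Example: the two occurrences of "an" in "banana".
text = "banana"
sa, spat = build_spat(text, p=2, ell=2)
left, right = spat_search(text, sa, spat, "an", ell=2)
print([sa[i] for i in range(left, right)])    # -> [3, 1]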
are independent, Clark and Munro show that the expected size of the CPT
can be made less than 3.5 + log2 N + log2 log2 N + O(log2 log2 log2 N/ log2 N )
bits per node. This is achieved by setting the skip field size to log2 log2 log2 N .
They also use some space saving techniques to reduce the storage requirement
even further, by compromising on the query performance.
External Memory Representation. To control the accesses to the ex-
ternal memory during searching, Clark and Munro use the method of de-
composing the tree into disk block sized pieces, each called a partition. Each
partition of the tree is stored using the CPT structure described above. The
only change required to the CPT structure for storing the partitions is that
the offset pointers in a block may now point to either a suffix in the text or
to a subtree (partition). Thus an extra bit is required to distinguish these
two cases. They use a greedy bottom-up partitioning algorithm and show
that such a partitioning minimizes the maximum number of disk blocks ac-
cessed when traversing from the root to any leaf. While the partitioning rules
described by Clark and Munro minimize the maximum number of external
memory accesses, these rules can produce many small pages and poor fill ra-
tios. They also suggest several methods to overcome this problem. They show
that the maximum number of pages traversed on any root-to-leaf path is at most 1 + H/√B + 2 logB N, where H is the height of the Pat tree. Thus searching using the CPT structure takes O(scan(|P| + Z) + search(N)) I/Os, assuming that the height H is O(√B logB N). Although H could be Θ(N)
in the worst case, it is logarithmic for a random text under some reasonable
conditions on the distribution [711].
Updates. The general approach to updating the static CPT representation
is to search each suffix of the modified document and then make appropriate
changes to the structure based on the path searched. While updating the
tree, it may become necessary to re-partition the tree in order to retain the
optimality. The solution described by Clark and Munro to insert or delete
a suffix requires time proportional to the depth of the tree, and operates on
the compact form of the tree. A string is inserted to or deleted from the text
by inserting/deleting all its suffixes separately. See [206] for details and some
experimental results.
Ferragina and Grossi [296] have introduced the string B-tree which is a combi-
nation of B-trees (see Chapter 2) and Patricia tries. String B-trees link exter-
nal memory data structures to string matching data structures, and overcome
the theoretical limitations of inverted files (modifiability and atomic keys),
suffix arrays (modifiability and contiguous space) and Pat trees (unbalanced
tree topology). It has the same worst case performance as B-trees but han-
dles unbounded length strings and performs powerful search operations such
as the ones supported by Pat trees. String B-trees have also been applied to
(external and internal) dynamic dictionary matching [298] and some other
internal memory problems [296].
String B-trees are designed to solve the dynamic version of the indexed
string matching problem (Problem 1). For simplicity, we mainly describe the
structure for solving the prefix search problem. As mentioned in Section 7.3.1,
a string matching query can be supported by storing the suffixes of all the
text strings, and supporting prefix search on the set of all suffixes.
String B-tree Data Structure. Given a set S = {s1 , . . . , sN } of N strings
(the suffixes), a string B-tree for S is a B-tree in which all the keys are stored
at the leaves and the internal nodes contain copies of some of these keys.
The keys are the logical pointers to the strings (stored in external memory)
and the order between the keys is the lexicographic order among the strings
pointed to by them. Each node v of the string B-tree is stored in a disk block
and contains an ordered string set Sv ⊆ S, such that b ≤ |Sv | ≤ 2b, where
b = Θ(B) is a parameter which depends on the disk block size B. If we denote
the leftmost (rightmost) string in Sv by L(v) (R(v)), then the strings in S
are distributed among the string B-tree nodes as follows (see Fig. 7.6 for an
example):
– Partition S into groups of b strings except for the last group, which may
contain from b to 2b strings. Each group is mapped into a leaf v (with string
set Sv ) in such a way that the left-to-right scanning of the string B-tree
leaves gives the strings in S in lexicographic order. The longest common
prefix length lcp(Sj , Sj+1 ) is associated with each pair (Sj , Sj+1 ) of Sv ’s
strings.
– Each internal node v of the string B-tree has d(v) children u1 , . . . , ud(v) ,
with b/2 ≤ d(v) ≤ b (except for the root, which has from 2 to b children).
The set Sv is formed by copying the leftmost and rightmost strings con-
tained in each of its children, from left to right. More formally, Sv is the
ordered string set {L(u1 ), R(u1 ), L(u2 ), R(u2 ), . . . , L(ud(v) ), R(ud(v) )}.
Since the branching factor of the string B-tree is Θ(B), its height is Θ(logB N ).
Each node v of the string B-tree stores the set Sv (associated with the
node v) as a Patricia trie (also called a blind trie). To maximize the number
b of strings stored in each node for a given value of B, these blind tries are
stored in a succinct form (the tree encoding of Clark and Munro, for example).
When a node v is transferred to the main memory, the explicit representation
of its blind trie is obtained by uncompressing the succinct form, in order to
perform computation on it.
Search Algorithm. To search for a given pattern P , we start from the root
of the string B-tree and follow a path to a leaf, searching for the position
of P at each node. At each internal node, we search for its child node u
whose interval [L(u), R(u)] contains P . The search at node v is done by first
following the path governed by the pattern to reach a leaf l in the blind trie.
If the search stops at an internal node because the pattern has been exhausted,
choose l to be any descendant leaf of that node. This leaf does not necessarily
identify the position of P in Sv , but it provides enough information to find
this position, namely, it points to one of the strings in Sv that shares the
longest common prefix with P . Now, we compare the string pointed to by l
with P to determine the length p of their longest common prefix. Then we
know that P matches the search path leading to l up to depth p, and the
mismatch character P [p + 1] identifies the branches of the blind trie between
which P lies, allowing us to find the position of P in Sv . The search is then
continued in the child of v that contains this position.
Updates. To insert a string S into the set S, we first find the leaf v and
the position j inside the leaf where S has to be inserted by searching for
the string S. We then insert S into the set Sv at position j. If L(v) or R(v)
change in v, then we extend the change to v’s ancestor. If v gets full (i.e.,
contains more than 2b strings), we split the node v by creating a new leaf u
and making it an adjacent leaf of v. We then split the set Sv into two roughly
equal parts of at least b strings each and store them as the new string sets
for v and u. We copy the strings L(v), R(v), L(u) and R(u) in their parent
node, and delete the old strings L(v) and R(v). If the parent also gets full,
then we split it. In the worst case the splitting can extend up to the root
and the resulting string B-tree’s height can increase by one. Deletions of the
strings are handled in a similar way, merging a node with its adjacent node
whenever it gets half-full. The I/O complexity of insertion or deletion of a
string S is O(scan(|S|) + search(N )).
For the dynamic indexed string matching problem, to insert a string S into
the text, we have to insert all its suffixes. A straightforward way of doing this
Recently, Ciriani et al. [205] have given a randomized data structure that sup-
ports lexicographic predecessor queries (which can be used for implementing
prefix searching) and achieves optimal time and space bounds in the amor-
tized sense. More specifically, given a set of N strings S1, . . . , SN and a sequence of m patterns P1, . . . , Pm, their solution takes O(Σ_{i=1}^{m} scan(|Pi|) + Σ_{i=1}^{N} ni logB (m/ni)) expected amortized I/Os, where ni is the number of times Si is the answer to a query. Inserting or deleting a string S takes
O(scan(|S|) + search(N )) expected amortized I/Os. The search time matches
the performance of string B-trees for uniform distribution of the answers, but
improves on it for biased distributions. This result is the analog of the Static
Optimality Theorem of Sleator and Tarjan [699] and is achieved by designing
a self-adjusting data structure based on the well-known skip lists [616].
There are several efficient algorithms for constructing full-text indexes in in-
ternal memory [288, 528, 540, 736]. However, these algorithms access memory
in a nearly random manner and are poorly suited for external construction.
String B-trees provide the possibility of construction by insertion, but the
construction time of O(N search(N )) I/Os can be improved with specialized
construction algorithms.
In this section, we show that the different forms of full-text indexes we have
seen are equivalent in the sense that any of them can be constructed from
another in O(sort(N )) I/Os. To be precise, this is not true for the plain
suffix array, which needs to be augmented with the longest common prefix
array: LCP[i] is the longest common prefix of the suffixes starting at SA[i − 1]
and SA[i]. The suffix array construction algorithms described below can be
modified to construct the LCP array, too, with the same complexity. The
transformation algorithms are taken from [290].
We begin with the construction of a suffix array SA (and the LCP array)
from a suffix tree ST (or a Pat tree PT which has the same structure differing
only in edge labels). We assume that the children of a node are ordered
lexicographically. First, construct the Euler tour of the tree in O(sort(N ))
I/Os (see Chapter 3). The order of the leaves of the tree in the Euler tour is
the lexicographic order of the suffixes they represent. Thus, the suffix array
can be formed by a simple scan of the Euler tour. Furthermore, let w be the
highest node that is between two adjacent leaves u and v in the Euler tour.
Then, w is the lowest common ancestor of u and v, and the depth of w is the
length of the longest common prefix of the suffixes that u and v represent.
Thus LCP can also be computed by a scan of the Euler tour.
The opposite transformation, constructing ST (or PT) given SA and LCP,
proceeds by inserting the suffixes into the tree in lexicographic order, i.e.,
inserting the leaves from left to right. Thus, a new leaf u always becomes
the rightmost child of a node, say v, on the rightmost path in the tree (see
Fig. 7.7). Furthermore, the longest common prefix tells the depth of v (the
insertion depth of u). The nodes on the rightmost path are kept in a stack with
the leaf on top. For each new leaf u, nodes are popped from the stack until
the insertion depth is reached. If there was no node at the insertion depth,
a new node v is created there by splitting the edge. After inserting u as the
child of v, v and u are pushed on the stack. All the stack operations can be
performed with O(scan(N )) I/Os using an external stack (see Chapter 2). The
Fig. 7.7. Inserting a new leaf u representing the suffix SA[i] into the suffix tree
construction numbers the nodes in the order they are created and represents
the tree structure by storing with each node its parent’s number. Other tree
representations can then be computed in O(sort(N )) I/Os.
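A compact in-memory rendering of this stack-based construction (Python, hypothetical representation: every node is a (parent, string depth) pair, numbered in creation order with the root as node 0; the external version replaces the stack by an external stack and stores only parent numbers, as described above):

def tree_from_sa_lcp(sa, lcp, n):
    # sa[i] = start of the i-th smallest suffix, lcp[i] = length of the longest common
    # prefix of the suffixes at sa[i-1] and sa[i] (lcp[0] = 0), n = text length.
    nodes = [(-1, 0)]                 # node 0: the root
    stack = [0]                       # the rightmost path, root at the bottom
    leaves = []
    for i in range(len(sa)):
        depth = lcp[i]                # insertion depth of the new leaf
        last = None
        while nodes[stack[-1]][1] > depth:          # pop nodes that are too deep
            last = stack.pop()
        top = stack[-1]
        if nodes[top][1] < depth:                   # no node at the insertion depth: split the edge
            v = len(nodes)
            nodes.append((top, depth))
            if last is not None:
                nodes[last] = (v, nodes[last][1])   # re-hang the popped subtree below v
            stack.append(v)
            top = v
        leaf = len(nodes)
        nodes.append((top, n - sa[i]))              # leaf depth = length of the suffix
        leaves.append(leaf)
        stack.append(leaf)
    return nodes, leaves

# Example: T = banana$ with 0-based positions (cf. Fig. 7.2).
sa = [6, 5, 3, 1, 0, 4, 2]
lcp = [0, 0, 1, 3, 0, 0, 2]
nodes, leaves = tree_from_sa_lcp(sa, lcp, 7)
print(len(nodes) - len(leaves))     # -> 4 internal nodes (root, "a", "ana", "na"), as in Fig. 7.2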
The string B-tree described in Section 7.4.3 can also be constructed from
the suffix array and the LCP array in O(sort(N )) I/Os with a procedure
similar to the suffix tree construction. The opposite transformation is also
similar.
Let us call the suffixes starting in Tk the new suffixes and the suffixes starting in Tk−1 . . . T1 the old suffixes. During the stage, a suffix starting at i is represented by the pair ⟨T[i . . . i + ℓ − 1], SA−1[i + ℓ]⟩, where ℓ is the length of a text piece.3 Since SA−1[i + ℓ] is a lexicographic name, this information is enough to determine the order of
3 The text is logically appended with ℓ copies of the character $ to make the pair well-defined for all suffixes.
suffixes. The first step loads into internal memory Tk, Tk−1, and the first ℓ entries of SA−1 of Tk−1 . . . T1, i.e., the part corresponding to Tk−1. Using this infor-
mation, the representative pairs are formed for all new suffixes and the suffix
array SATk of the new suffixes is built, all in internal memory.
The second step is performed by scanning Tk−1 . . . T1 and SA−1 of Tk−1 . . . T1 simultaneously. When processing a suffix starting at i, SA−1[i], SA−1[i + ℓ], and T[i, i + ℓ − 1] are in internal memory. The latter two are needed for
the representative pair of the suffix and the first is modified. For each i, the
algorithm determines using SATk how many of the new suffixes are lexico-
graphically smaller than the suffix starting at i, and SA−1 [i] is increased by
that amount. During the scan, the algorithm also keeps an array C of coun-
ters in memory. The value C[j] is incremented during the scan when an old
suffix is found to be between the new suffixes starting at SATk [j − 1] and
SATk [j]. After the scan, the ranks of the new suffixes are easy to compute
from the counter array C and SATk allowing the execution of the third step.
The algorithm performs O(N/M ) stages, each requiring a scan through
an array of size O(N ). Thus, the I/O complexity is O((N/M ) scan(N )). The
CPU complexity deserves a closer analysis, since, according to the exper-
iments in [227], it can be the performance bottleneck. In each stage, the
algorithm needs to construct the suffix array of the new suffixes and per-
form O(N ) queries. Using the techniques by Manber and Myers [528], the
construction requires O(M log2 M ) time and the queries O(N M ) time. In
practice, the query time is O(N log2 M ) with a constant depending on the
type of the text. Thus, the total CPU time is O(N²) in the worst case and O((N²/M) log2 M) in practice.
Despite the quadratic dependence on the length of the text, the algo-
rithm is fast in practice up to moderate sized texts, i.e., for texts with small
ratio N/M [227]. For larger texts, the doubling algorithm described next is
preferable.
r0:    r1:     r2:       r3:            SA−1:
4 b    4 ba    4 bana    4 banana$$     4 banana
1 a    2 an    3 anan    3 anana$$$     3 anana
5 n    5 na    6 nana    6 nana$$$$     6 nana
1 a    2 an    2 ana$    2 ana$$$$$     2 ana
5 n    5 na    5 na$$    5 na$$$$$$     5 na
1 a    1 a$    1 a$$$    1 a$$$$$$$     1 a
the triples are stored in the order of the last component. The following steps
are then performed:
1. sort the triples by the first two components (which is equivalent to sorting substrings of length 2^k)
2. scan to compute rk and update the triples to ⟨rk(i), rk−1(i + 2^{k−1}), i⟩
3. sort the triples by the last component
4. scan to update the triples to ⟨rk(i), rk(i + 2^k), i⟩
The algorithm does O(sort(N )) I/Os in each stage, and thus requires a total
of O(sort(N ) log2 N ) I/Os for constructing the suffix array.
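An in-memory rendering of the doubling idea (Python, hypothetical function name; the external sorting and scanning steps become ordinary sorting and scanning here):

def suffix_array_by_doubling(text):
    # After the stage with shift k, rank[i] is the name of the substring of length 2k starting at i.
    n = len(text)
    sa = list(range(n))
    if n <= 1:
        return sa
    rank = [ord(c) for c in text]     # r_0: any order-preserving naming of single characters
    k = 1
    while True:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)   # pair of names
        sa.sort(key=key)                                              # the sorting step
        new_rank = [0] * n
        for j in range(1, n):         # the scanning step: equal pairs receive equal names
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (1 if key(sa[j]) != key(sa[j - 1]) else 0)
        rank = new_rank
        if rank[sa[-1]] == n - 1:     # all names unique ("finished"): the order is final
            break
        k *= 2
    return sa

print(suffix_array_by_doubling("banana"))    # -> [5, 3, 1, 0, 4, 2]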
The algorithm can be improved using the observation that, if a name rk (i)
is unique, then rh (i) = rk (i) for all h > k. We call a triple with a unique
first component finished. Crauser and Ferragina [227] show how step 2 can
be performed without using finished triples allowing the exclusion of finished
triples from the sorting steps. This reduces the I/O complexity of stage k
to O(sort(Nk−1 ) + scan(N )), where Nk−1 is the number of unfinished triples
after stage k − 1. We show how step 4 can also be done without finished
triples, improving the I/O complexity further to O(sort(Nk−1 )).
With only the unfinished triples available in step 2, the new rank of a triple can no longer be computed as its rank in the sorted list. Instead, the new rank of a triple ⟨x, y, i⟩ is x + c, where c is the number of triples in the list with the first component x and the second component smaller than y. This works correctly because x = rk−1(i) already counts the smaller substrings that differ in the first 2^{k−1} characters, and all the triples that have the same first component are unfinished and thus on the list.
The newly finished triples are identified and marked in step 2 but not removed until step 4, which we describe next. When the scan in step 4 processes ⟨x, y, i⟩, the triples ⟨x′, y′, i′⟩, i′ = i + 2^{k−1}, and ⟨x″, y″, i″⟩, i″ = i + 2^k, are also brought into memory if they were unfinished after stage k − 1. The following three cases are possible:
1. If ⟨x″, y″, i″⟩ exists (is unfinished), the new triple is ⟨x, x″, i⟩.
2. If ⟨x′, y′, i′⟩ exists but ⟨x″, y″, i″⟩ does not, then ⟨x″, y″, i″⟩ was already finished before stage k, and thus y′ is its final rank. Then, ⟨x, y′, i⟩ is the new triple.
3. If ⟨x′, y′, i′⟩ does not exist, it was already finished before stage k. Then, the triple ⟨x, y, i⟩ must now be finished and is removed.
The finished triples are collected in a separate file and used for constructing
the suffix array in the end.
Let us analyze the algorithm. Let Nk be the number of non-unique text substrings of length 2^k (with each occurrence counted separately), and let s be the largest integer such that Ns > 0. The algorithm needs O(sort(Nk−1)) I/Os in stage k for k = 1, . . . , s + 1. Including the initial stage, this gives the I/O complexity O(sort(N) + Σ_{k=0}^{s} sort(Nk)). In the worst case, such as the text T = aaa...aa, the I/O complexity is still O(sort(N) log2 N). In practice, the number of unfinished suffixes starts to decrease significantly much before stage log2 N.
An algorithm for constructing the Pat tree using optimal O(sort(N )) I/Os has
been described by Farach-Colton et al. [290]. As explained in Section 7.5.1,
the bound extends to the other full-text indexes. The outline of the algorithm
is as follows:
1. Given the string T, construct a string T′ of half the length by replacing pairs of characters with lexicographic names.
2. Recursively compute the Pat tree of T′ and derive the arrays SAT′ and LCPT′ from it.
3. Let SAo and LCPo be the suffix and LCP arrays for the suffixes of T that start at odd positions. Compute SAo and LCPo from SAT′ and LCPT′.
4. Let SAe and LCPe be the suffix and LCP arrays for the suffixes of T that start at even positions. Compute SAe and LCPe from SAo and LCPo.
5. Construct the Patricia tries PTo and PTe of odd and even suffixes from the suffix and LCP arrays.
6. Merge PTo and PTe into PTT.
Below, we sketch how all the above steps except the recursive call can be
done in O(sort(N )) I/Os. Since the recursive call involves a string of length
N/2, the total number of I/Os is O(sort(N )).
The naming in the first step is done by sorting the pairs of characters
and using the rank as the name. The transformations in the second and fifth
step were described in Section 7.5.1. The odd suffix array SA_o is computed
by SA_o[i] = 2 · SA_{T′}[i] − 1. The value LCP_o[i] is first set to 2 · LCP_{T′}[i], and
then increased by one if T[SA_o[i] + LCP_o[i]] = T[SA_o[i − 1] + LCP_o[i]]. This
last step can be done by batched lookups using O(sort(N)) I/Os.
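A minimal in-memory sketch of this derivation, assuming 1-based positions and illustrative names (the external version performs the character comparisons as batched lookups):

#include <cstddef>
#include <string>
#include <vector>

// Given the suffix array and LCP array of the half-length string T' (1-based
// positions in T'), derive the suffix and LCP arrays of the odd suffixes of T.
void odd_arrays(const std::string& T,                 // T[1..N], T[0] unused
                const std::vector<long>& SA_T1,       // SA_{T'}
                const std::vector<long>& LCP_T1,      // LCP_{T'}
                std::vector<long>& SA_o,
                std::vector<long>& LCP_o) {
    const std::size_t m = SA_T1.size();
    SA_o.resize(m);
    LCP_o.resize(m);
    for (std::size_t i = 0; i < m; ++i)
        SA_o[i] = 2 * SA_T1[i] - 1;                   // position in T
    for (std::size_t i = 0; i < m; ++i) {
        long l = 2 * LCP_T1[i];                       // doubled LCP of T'
        // One extra matching character of T may extend the doubled LCP.
        if (i > 0 &&
            SA_o[i] + l < (long)T.size() && SA_o[i - 1] + l < (long)T.size() &&
            T[SA_o[i] + l] == T[SA_o[i - 1] + l])
            ++l;
        LCP_o[i] = l;
    }
}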
Each even suffix is a single character followed by an odd suffix. Let SA_o^{−1}
be the inverse of SA_o, i.e., a lexicographical naming of the odd suffixes. Then,
SA_e can be constructed by sorting pairs of the form (T[2i], SA_o^{−1}[2i + 1]). The
LCP of two adjacent even suffixes is zero if the first character does not match,
and one plus the LCP of the corresponding odd suffixes otherwise. However,
the corresponding odd suffixes may not be adjacent in SA_o. Therefore, to
compute LCP_e we need to perform LCP queries between O(N) arbitrary
does not matter much, because the unmerging will throw away the in-
correct subtree and replace it with the original subtrees from PTo and
PTe . For further details on unmerging, we refer to [290].
2. If the lengths are different, the longer edge is split by inserting a new
node v′ on it. The new node is then merged with the end node v″ of the
other edge to form a trunk node v of type 2. This could already be
overmerging, which would be corrected later as above. In any case, the
recursive merging of the subtrees continues, but there is a problem: the
initial character of the edge leading to the only child w′ of v′ (i.e., the
lower part of the split edge) is not known. Retrieving the character from
the text could require an I/O, which would be too expensive. Instead, the
algorithm uses the trunk markings. Suppose the correct procedure would
be to merge the edge (v′, w′) with an edge (v″, w″) at least partially.
Then w″ must be a marked node, and furthermore, w″ must be the only
marked child of v″, enabling the correct merge to be performed. If (v′, w′)
should not be merged with any child edge of v″, i.e., if v″ does not have a
child edge with the first character matching the unknown first character
of (v′, w′), then any merge the algorithm does is later identified in the
unmerging step and correctly unmerged. Then the algorithm still needs
to determine the initial character of the edge, which can be done in one
batch for all such edges.
Theorem 7.4 (Farach-Colton et al. [290]). The suffix array, suffix tree,
Pat tree, and string B-tree of a text of total length N can be constructed in
optimal O(sort(N )) I/Os.
Theorem 7.5 (Arge et al. [61]). The I/O complexity of sorting K strings
of total length N, with K_1 strings of length less than B of total length N_1
and K_2 strings of length at least B of total length N_2, is

O( min{ K_1 log_M K_1, (N_1/B) log_{M/B}(N_1/B) } + K_2 log_M K_2 + scan(N) ).
So far we have used the integer alphabet model, which assumes that each
character occupies a full machine word. In practice, alphabets are often small
and multiple characters can be packed into one machine word. For example,
DNA sequences could be stored using just two bits per character. Some of
the algorithms, in particular the doubling algorithms, can take advantage of
this. To analyze the effect, we have to modify the model of computation.
The packed string model assumes that characters are integers in the range
{1, . . . , |Σ|}, where |Σ| ≤ N . Strings are stored in packed form with each
machine word containing Θ(log_{|Σ|} N) characters. The main parameters of
the model are:
N = number of characters in the input strings
n = Θ(N/log_{|Σ|} N) = size of input in units of machine words
M = size of internal memory in units of machine words
B = size of disk blocks in units of machine words
Note that, while the size of the text is Θ(n), the size of a full-text index is
Θ(N ) machine words.
Under the packed string model, the results presented in this chapter re-
main mostly unaffected, since the algorithms still have to deal with Θ(N )
word-sized entities, such as pointers and ranks. There are some changes,
though. The worst case CPU complexity of the merging algorithm is reduced
by a factor Θ(log_{|Σ|} N) due to the reduction in the complexity of comparing
strings.
The doubling algorithm for both index construction and sorting can
be modified to name substrings of length Θ(log_{|Σ|} N) in the initial stage.
The I/O complexity of the index construction algorithm then becomes
O(sort(N) + Σ_{k=0}^{s} sort(n_k)), where n_k is the number of non-unique text sub-
strings of length 2^k log_{|Σ|} N. On a random text with independent, uniform
distribution of characters, this is O(sort(N )) with high probability [711]. The
I/O complexity of the sorting algorithm becomes O(sort(n + K)).
We have considered only the simple string matching queries. Performing more
complex forms of queries, in particular approximate string matching [574], in
external memory is an important open problem. A common approach is to
resort to sequential searching either on the whole text (e.g., the most widely
used genomic sequence search engine BLAST [37]) or on the word list of
an inverted file [51, 84, 529]. Recently, Chávez and Navarro [179] turned
approximate string matching into nearest neighbor searching in metric space,
and suggested using existing external memory data structures for the latter
problem (see Chapter 6).
8. Algorithms for Hardware Caches and TLB
Naila Rahman∗
∗ Supported by EPSRC grant GR/L92150
8.1 Introduction
Over the last 20 years or so CPU clock rates have grown explosively, and
CPUs with clock rates exceeding 2 GHz are now available in the mass mar-
ket. Unfortunately, the speed of main memory has not increased as rapidly:
today’s main memory typically has a latency of about 60 ns. This implies that
the cost of accessing main memory can be 120 times greater than the cost
of performing an operation on data which are in the CPU’s registers. Since
the driving force behind CPU technology is speed and that behind memory
technology is storage capacity, this trend is likely to continue. Researchers
have long been aware of the importance of reducing the number of accesses
to main memory in order to avoid having the CPU wait for data.
Several studies, for example [499], have shown that many computer programs
have good locality of reference. They may have good temporal locality, where a
memory location once accessed is soon accessed again, and have good spatial
locality, where an access to a memory location is followed by an access to
a nearby memory location. The hardware solution to the problem of a slow
main memory is to exploit the locality inherent in many programs by having
a memory hierarchy which inserts multiple levels of cache between the CPU
registers (or just CPU) and main memory. A cache is a fast memory which
holds the contents of some main memory locations. If the CPU requests the
contents of a main memory location, and the contents of that location are held
in some level of cache, the CPU’s request is answered by the cache itself (a
cache hit); otherwise it is answered by consulting the main memory (a cache
miss). A cache hit has little or no cost (penalty); 1-3 CPU cycles is fairly
typical, but a cache miss requires a main memory access, and is therefore very
expensive. To amortise the cost of a main memory access in case of a cache
miss, an entire block of consecutive main memory locations which contains
the location accessed is brought into cache on a miss. Programs which have
good spatial locality benefit from the fact that data are transferred from
main memory to cache in blocks, while programs which have good temporal
locality benefit from the fact that caches hold several blocks of data. Such
programs make fewer cache misses and consequently run faster. Caches which
are closer to the CPU are faster and smaller than caches further
from the CPU. The cache closest to the CPU is referred to as the L1 cache
and the cache at level i is referred to as the Li cache. Systems nowadays have
at least two levels of cache. For high performance, the L1 cache is almost
always on the same chip as the CPU.
There is an important related optimisation which can contribute as
much or more to performance, namely minimising misses in the translation-
lookaside buffer (TLB). Some papers from the early and mid 90’s (see
e.g. [556, 12]) note the importance of minimising TLB misses when imple-
menting algorithms, but there has been no systematic study of this opti-
misation, even though TLB misses are often at least as expensive as cache
misses.
The TLB is used to support virtual memory in multi-processing operating
systems [392]. Virtual memory means that the memory addresses accessed by
a process refer to its own unique logical address space. This logical address
space contains as many locations as can be addressed on the underlying
architecture, which far exceeds the number of physical main memory locations
in a typical system. Furthermore, there may be several active processes in a
system, each with its own logical address space. To allow this, most operating
systems partition main memory and the logical address space of each process
into contiguous fixed-size pages, and store only some pages from the logical
address space of each active process in main memory at a time. Owing to
its myriad benefits, virtual memory is considered to be “essential to current
computer systems” [392]. The disadvantage of virtual memory is that every
time a process accesses a memory location, the reference to the corresponding
logical page must be translated to a physical page reference. This is done by
looking up the page table, a data structure in main memory. Performing this
lookup on every access would lead to unacceptable slowdown. Note that if the logical page is
not present in main memory at all, it is brought in from disk. The time for
this is generally not counted in the CPU times, as some other task is allowed
to execute on the CPU while the I/O is taking place.
The TLB is used to speed up address translation. It is a fast associative
memory which holds the translations of recently-accessed logical pages. If a
memory access results in a TLB hit, there is no delay, but a TLB miss can be
significantly more expensive than a cache miss; hence locality at the page level
is also very desirable. In most computer systems, a memory access can result
in a TLB miss alone, a cache miss alone, neither, or both. Algorithms which
make few cache misses can nevertheless have poor performance if they make
many TLB misses. Fig. 8.1 shows large virtual memories for two processes, a
smaller physical memory and a TLB which holds translations for a subset of
the pages in physical memory.
The sizes of cache and TLB are limited by several factors including cost
and speed [378]. Cache capacities are typically 16KB to 4MB, which is con-
siderably smaller than the size of main memory. The sizes of TLBs are also
very limited, with 64 to 128 entries being typical. Hence simultaneous locality
at the cache and page level is important for internal-memory computation.
Fig. 8.1. Virtual memory, physical memory and TLB. Virtual memory for processes
A and B, each with 12 pages. Physical memory with 4 pages. The TLB holds virtual
(logical) page to physical page translations for 2 pages.
We refer to the caches, TLB and main memory of a computer system as
the internal memory hierarchy.
8.1.2 Overview
Fig. 8.2. A 2-way set-associative cache, with 4 sets. Memory block i maps to
cache set s. If all blocks in set s are occupied the cache selects a random or the
least-recently used block in set s for eviction.
Fig. 8.3. The tag, index and block offset fields of a main memory address. S =
(M/B)/a.
Virtual caches, where the cache is between the CPU and the memory
management unit, allow the virtual address to be used for cache accesses. On
a cache hit, this eliminates the need for an address translation and so reduces
the hit time. Note that on a cache miss, in order to access main memory,
an address translation would still be required. Unfortunately virtual caches
can considerably complicate the design of multi-tasking and multi-processor
operating systems, so most systems have physical caches.
8.2.2 TLB
A TLB entry holds the virtual to physical address translation for one virtual
memory page. This translation effectively provides the translations for all
memory addresses in the page. If the program accesses a memory location
which belongs to (logical) page i, and the TLB holds the translation for page
i, the contents of the TLB do not change. If the TLB does not hold the
translation for page i, the translation for page i is brought into the TLB, and
the translation for some other page is removed.
Like caches, a TLB has to address the issue of page translation placement,
identification and replacement. Again, this is strictly under hardware control.
The virtual page address is used to identify the page translation. TLB en-
tries are typically fully associative and use a random, LRU or pseudo-LRU
replacement policy.
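A minimal sketch of a fully associative TLB with true LRU replacement, which is a simplification of the pseudo-LRU policies mentioned above (class and parameter names are illustrative):

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

class TinyTLB {
    std::size_t capacity;                       // number of TLB entries (T)
    std::list<uint64_t> lru;                    // most recently used page at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> entries;
public:
    long hits = 0, misses = 0;
    explicit TinyTLB(std::size_t T) : capacity(T) {}
    void access(uint64_t page) {
        auto it = entries.find(page);
        if (it != entries.end()) {              // TLB hit: refresh the LRU position
            ++hits;
            lru.erase(it->second);
        } else {                                // TLB miss: evict the LRU entry if full
            ++misses;
            if (entries.size() == capacity) {
                entries.erase(lru.back());
                lru.pop_back();
            }
        }
        lru.push_front(page);
        entries[page] = lru.begin();
    }
};
// Example: a TinyTLB(64) fed page = address / page_size for every memory
// reference models the behaviour described above.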
In most systems TLB misses and cache misses happen independently, in that
a memory access may result in a cache miss, a TLB miss, neither, or both.
This is because caches are usually physically tagged, i.e., the values stored in
cache are stored according to their physical, and not their virtual (logical),
memory addresses. Hence, when a program accesses a memory location using
its logical address, the address first has to be translated to a physical address
before the cache can be checked. Furthermore, a memory access which results
in both a cache miss and a TLB miss pays the cache miss penalty plus the
TLB miss penalty (there is no saving, or loss, if both kinds of misses occur
simultaneously).
There are several ways of measuring the number of cache or TLB misses that
a program makes. However, any measurement is for a particular execution
of an algorithm on a particular platform; it does not tell us how the
algorithm might behave on a different platform or with different parameters.
Simple and precise analytical models allow us to study factors in an algorithm
or in a machine platform that affect the number of cache and TLB misses
and so allow us to predict behaviour as parameters vary.
Several models have been introduced to capture the memory hierarchy
in a computer system and have been used to analyse algorithms and design
new algorithms with improved performance. We briefly review some of these
models and discuss the advantages and limitations of using them to model
the internal memory hierarchy. We then introduce the Cache Memory Model
(CMM) which has been used by several researchers to design and analyse
algorithms for cache and main memory. We discuss the advantages and some
of the limitations of this model and then introduce the Internal Memory
Model (IMM) which models cache, TLB and main memory.
In the Single Disk Model (SDM) [17] there is a fast main memory, where
all computation takes place, and a single slow disk, which holds data and
the result of computation. Data is moved between main memory and disk
as blocks. A transfer of a block of data between disk and main memory is
referred to as an I/O step and the most important performance measure is
the number of I/O steps an algorithm performs. In this model it is assumed
that a block of data from disk can be placed at any block in main memory
and the algorithm has full control over which block is evicted in order to
accommodate the new block. In the Parallel Disk Model (PDM) [755] there
are multiple disks which are accessed by multiple processors. In this chapter,
we refer to the SDM or the PDM as external memory models (EMM).
A large number of algorithms and data structures have been analysed and
designed on these models, see the survey papers [754, 55].
The advantage of the EMMs is their simplicity, which eases the analysis and
design of algorithms. The main reason why these models cannot easily be
used directly to analyse and design algorithms for internal memory is the as-
sumption that EMM algorithms can implement their own policy for replacing
data in the faster (main) memory, whereas in a cache or TLB this is strictly
under hardware control.
Algorithms designed for the EMM need to know parameters of the memory
hierarchy in order to tune performance for specific memories. Cache obliv-
ious algorithms, analysed on the cache oblivious model [321], do not have
any parameters that are tuned for a specific memory level. The model was
introduced to consider a single level of cache and main memory. In this model
the cache, called an ideal cache, holds M/B blocks each of B words, is fully-
associative and uses an optimal offline replacement policy. The performance
measures in this model are the number of cache misses and the number of
instructions. The cache oblivious model and cache oblivious algorithms are
discussed in detail in Chapter 9.
In this section we describe two models, the Cache Memory Model , which
models cache and main memory, and the Internal Memory Model , which
models cache, TLB and main memory.
Cache Memory Model (CMM). Several researchers have used a model
where there is a random-access machine, consisting of a CPU and main mem-
ory, augmented with a cache [494, 489, 654, 621, 620, 619, 685, 543, 139]. The
model has the following parameters:
N = the number of data items in the problems,
B = the number of data items in a cache block,
M = the number of data items in the cache,
a = the associativity of the cache,
and the performance measures in this model are:
– the number of cache misses,
– the number of instructions.
The cache parameters are derived directly from the discussion in Sec-
tion 8.2. In this chapter we will generally refer to the number of blocks in
the cache, M/B, rather than the number of data items that can be held in
the cache. The model assumes that blocks are evicted from a cache set on a
least recently used (LRU) basis.
The advantage of counting instructions and cache misses separately is
that it allows the use of the coarse O-notation for simple operations while
analysing the number of cache misses more carefully, if necessary.
The above model simplifies the architecture of real machines considerably.
For example, the model considers only one level of cache while real machines
may have multiple levels of cache. Another example is that we do not distinguish
between reads and writes to memory. As discussed earlier, in a write-through
cache each write would access main memory or the next lower level cache.
Internal Memory Model (IMM). The CMM is useful for virtual caches,
where a virtual address can be used to index the cache. However most caches
are physical, which can only be indexed using a physical address, and, as
discussed in Section 8.2, these caches require an address translation before
checking for data. The IMM [619] extends the CMM to take account of virtual
memory and the TLB by adding the following parameters:
B_p = the number of data items in a page of memory,
T = the number of translations held by the TLB,
Fig. 8.4. The main memory, virtual memory, page table, cache and TLB. The TLB
holds translations for pages containing the memory blocks 0, . . . , 3 and 12, . . . , 15.
Note that the cache contains memory blocks 11, 18 and 19, which are contained in
pages for which translations are not held in the TLB.
(i) an L2 cache hit (2-3 CPU clock cycles), (ii) a memory access (≈ 30-100
cycles) or (iii) a trap to a software miss handler (hundreds of cycles) [738,
Chapter 6]. Another simplification is that TLBs almost always implement an
approximation to LRU replacement, rather than a true LRU policy.
Classifying Cache Misses. In [396], cache misses are classified as compul-
sory misses, capacity misses and conflict misses. These are as follows:
– A compulsory miss occurs on the very first access to a memory block, since
the block could not have been in cache.
– A capacity miss occurs on an access to a memory block that was previ-
ously evicted because the cache could not hold all the blocks being actively
accessed by the CPU.
– If all a ways in a cache set are occupied then an access to a memory block
that maps to that set will cause a block in the set to be evicted, even
though there may be unused blocks in other cache sets. The next access to
the evicted block will cause a conflict miss.
As an example, suppose we have an empty direct-mapped cache with
M/B blocks, where each block holds B items and we have an array DATA with
N = 2M items. In the direct-mapped cache memory address x is mapped to
cache block (x div B) mod (M/B). If we sequentially read all items in DATA
then we have N/B compulsory misses. The first access to DATA causes B
items to be loaded into a cache block, then after B accesses another B items
are brought into the cache. If we sequentially read all items in DATA again,
then we have N/B capacity misses. Now suppose the cache is empty and we
read DATA[0] and then DATA[M ], repeatedly N/2 times. DATA[0] and DATA[M ]
map to the same cache block, so we have 2 compulsory misses, for the first
accesses to DATA[0] and DATA[M ], and N − 2 conflict misses.
By definition a fully-associative cache does not have conflict misses. Sim-
ilarly, conflict misses do not occur in the EMM.
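A minimal direct-mapped cache simulator illustrating these kinds of misses on the two access patterns above (the parameter values are illustrative, not from the text):

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const long B = 8;              // items per cache block
    const long M = 1024;           // items in the cache, i.e. M/B = 128 blocks
    const long N = 2 * M;          // size of the array DATA
    const long NBLOCKS = M / B;

    // For each cache block remember which memory block it currently holds
    // (-1 means empty). Address x maps to cache block (x div B) mod (M/B).
    std::vector<long> cache(NBLOCKS, -1);
    long misses = 0;
    auto access = [&](long x) {
        long mem_block = x / B;
        long slot = mem_block % NBLOCKS;
        if (cache[slot] != mem_block) { ++misses; cache[slot] = mem_block; }
    };

    // Sequential scan of DATA: N/B compulsory misses.
    for (long i = 0; i < N; ++i) access(i);
    std::printf("first scan: %ld misses (N/B = %ld)\n", misses, N / B);

    // Scanning DATA again: the cache holds only M of the N items, so the
    // early blocks have been evicted, giving N/B capacity misses.
    misses = 0;
    for (long i = 0; i < N; ++i) access(i);
    std::printf("second scan: %ld misses (N/B = %ld)\n", misses, N / B);

    // Alternating DATA[0], DATA[M]: both map to the same cache block, so
    // every one of the N accesses misses (2 compulsory + N-2 conflict misses).
    std::fill(cache.begin(), cache.end(), -1);
    misses = 0;
    for (long i = 0; i < N / 2; ++i) { access(0); access(M); }
    std::printf("alternating accesses: %ld misses (N = %ld)\n", misses, N);
}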
Algorithms designed in the CMM and IMM model aim to reduce the number
of TLB and/or cache misses without an excessive increase in the number of
instructions. We say that an algorithm is cache-efficient if it makes few cache
misses. We say that it is cache optimal if the number of cache misses meets the
asymptotic lower bound for I/Os in the EMM for that problem. We similarly
define algorithms to be TLB-efficient or TLB-optimal.
Analyses in the CMM and IMM are usually backed up with experimental
evaluations. Running times are used because:
– The relative miss penalty is much lower for caches or the TLB than for
disks, so constant factors are important; asymptotic analysis alone
is not enough to determine the performance of algorithms.
– Cache misses, TLB misses and instruction counts do not tell us the running
times of algorithms. The models simplify the architecture of real machines
considerably. Experimental evaluations are used to validate the model.
Precise analysis in these models is often quite difficult. So an approximate
analysis may be used, which again has to be validated with empirical tech-
niques, such as measuring TLB and/or cache misses.
There are many hardware techniques for improving the cache performance of
computer systems. These techniques aim to reduce the cache miss rate, the
penalty of a cache miss, or the time for processing
a cache hit. See [392, 378] for further details.
Several techniques have been used by compilers to improve cache perfor-
mance [392]. Examples are:
– Pre-fetching is used to load data into cache before the data is needed, thus
reducing CPU stalls.
– If multiple arrays are accessed in the same dimension with the same indices
at the same time then this can cause conflict misses. By merging elements
from each array into an individual structure, which resides in one cache
block, the conflict misses are avoided. This improves spatial locality.
8.4.2 Algorithms
sort. For large records, they state that sorting keys and pointers to records
has better cache performance than sorting records.
LaMarca and Ladner [494, 493] evaluate the cache misses in Quicksort,
Mergesort, Heapsort and least-significant bit (LSB) radix sort algorithms us-
ing cache simulations and analysis. To reduce cache misses in Heapsort, they
suggest using d-ary heaps, where the root node occupies the last element of a
cache block, rather than the normal binary heaps. For Mergesort they suggest
using tiling and k-way merging. For Quicksort they suggest using a multi-
partition approach and Sedgewick’s technique of stopping the partitioning
when the partition size is small [678]. However, in order to reduce capacity
misses, they suggest insertion sorting these small partitions when they are
first encountered, rather than in one single pass over the data after all small
partitions have been created, as was proposed in [678]. They report that LSB
radix sort has poor cache performance, even though they did not analyse an
important source of conflict misses, due to concurrent accesses to multiple
locations in the destination array. Their new algorithms are derived almost
directly from EMM algorithms and hence reduce capacity misses, but not
conflict misses. They show that their algorithms outperform non-memory
tuned algorithms on their machine.
Xiao et al. [771] observe that the tiled and multi-way Mergesort algorithms
presented in [494] can have in the worst-case a large number of conflict cache
misses. They also observe that multi-way Mergesort can have a large number
of TLB misses. They suggest padding subarrays that are to be merged in or-
der to reduce cache misses in tiled Mergesort, and cache and TLB misses in
multi-way Mergesort. Using cache and TLB simulations they show that the
algorithms which use padding have fewer cache misses on several machines.
They also show that these new Mergesort implementations outperform exist-
ing Mergesort implementations.
An approximate analysis of the cache misses in an O(N ) time distribu-
tion (bucket) sorting algorithm for uniformly random floating-point keys is
given in [621]. Distribution sorting algorithms work as follows: In one pass
the algorithm permutes N keys into k classes, such that all keys in class i are
smaller than all keys in class i + 1, for i = 0, . . . , k − 1. After one pass of the
algorithm, the keys should have been permuted so that all elements of class
i lie consecutively before all elements of class i + 1, for i = 0, . . . , k − 2. Each
class is sorted recursively and the recursion ends when a class is ‘small’, at
which point the keys in the class may be sorted by say insertion sort. The
analysis in [621] considers the cache misses in one pass of the algorithm as
k varies and shows that for large k, which leads to fewer passes, the algo-
rithm makes many cache conflict misses. The study shows the trade-offs in
computation and memory access costs in a multi-pass algorithm and shows
how to derive a multi-pass algorithm which out-performs a single-pass algorithm,
the various Mergesort and Quicksort algorithms described in [494],
and the Heapsort described in [654]. An analysis of the cache misses in dis-
tribution sorting when the keys are independently and randomly drawn from
a non-uniform distribution is given in [619].
Agarwal [12] notes the importance of reducing both cache and TLB misses
in the context of sorting randomly distributed data. To achieve this in a
bucket sort implementation, the number of buckets is chosen to be less than
the number of TLB entries.
Jiménez-González et al. [434, 432] present two different algorithms for 32
and 64 bit integer keys, the Cache Conscious Radix sort and the Counting
Split Radix sort respectively. Both algorithms are memory hierarchy con-
scious algorithms based on radix sort. They distribute keys such that the size
of each class is smaller than the size of one of the levels of the cache hierar-
chy, the target cache level, and/or the memory that can be mapped by the
TLB structure. They note that the count array used in the different steps of
the algorithms must fit in the L1 cache to obtain an efficient cache-conscious
implementation. They also state that the size of each class should not exceed
the size of the L2 cache. For good TLB performance during distribution, they
also note that the number of classes should not exceed the number of TLB
entries. Their algorithms are skew conscious. That is, the algorithms recur-
sively sort each class as many times as necessary or perform a sampling of
the data at the beginning of the algorithm to obtain balanced classes. Each
class is individually sorted by radix sort. A simple model for computing the
number of bits of the key to sort on in order to obtain a good radix sort
algorithm is proposed and analysed.
Rahman and Raman [620] present an extensive study of the design and
implementation of integer sorting algorithms in the IMM. We discuss some
of these results in Section 8.7.
Searching. Acharya et al. [3] note that in a trie [460], a search tree data
structure, the nodes nearest the root are large (containing many keys and
pointers) and those further away are increasingly smaller. For large alphabets
they suggest using an array which occupies a cache block for a small node,
a bounded-depth B-tree for a larger node, and a hash table for a node which
exceeds the maximum size imposed by the bounded-depth B-tree. The B-trees
and hash tables hold data in arrays. They also note that only one pointer to
a child node is followed out of a node, whereas several keys may be compared,
so they suggest storing keys and pointers in separate arrays. For nodes with
small alphabets they again suggest storing the keys and pointers in separate
arrays, where the pointers are indexed using the characters in the alphabet.
They experimentally evaluate their tries against non-memory tuned search
trees and report significant speed improvements on several machines. Using
cache simulations, they show that their data structures have significantly
fewer cache misses than non-memory tuned search trees.
Rahman et al. [618] describe an implementation of very simple dynamic
cache oblivious search trees. They show that these search trees out-perform B-
trees. However, they also show that architecture-aware search trees which are
cache and TLB optimal out-perform the cache oblivious search trees. They
also discuss the problems associated with the design of dynamic architecture-
aware search trees which are cache and TLB optimal.
Priority Queues and Heaps. Sanders [654] notes that most EMM priority
queue data structures have high constant factors and this means that they
would not perform well if adapted to the CMM. A new EMM data structure,
the sequence heap, is described which has smaller constant factors in terms
of the number of I/Os and space utilisation than other EMM priority queue
data structures. The lower constant factors make sequence heaps suitable for
the CMM. This data structure uses k-way merging so, before it is used in
the CMM, the results from [543] are applied to select k appropriately for the
cache parameters. On random 32-bit integer keys and 32-bit values and when
the input is large, sequence heaps are shown to outperform implicit binary
heaps and aligned 4-ary heaps on several different machine architectures.
Bojesen et al. [139] analyse the cache misses during heap construction on
a fully-associative cache. They find that Floyd’s method [306] of repeatedly
merging small heaps to form larger heaps has poor cache performance and
that Williams’ method of repeated insertion [766] and Fadel et al.’s method of
layer-wise construction [284] perform better. They give new algorithms using
repeated insertions and repeated merging which have close to the optimal
number of cache misses. They note that divide and conquer algorithms are
good for hierarchical memory. They also note that traversing a tree in
depth-first order rather than breadth-first order improves locality.
8.6.1 Tiling
Divide and conquer algorithms designed on the RAM model typically deal
with two sets of data. For example Quicksort recursively partitions data into
two sets, and Mergesort repeatedly merges two sorted lists. These techniques
require Ω(lg N ) passes over the data and applying them directly to EMM
algorithms can lead to an unnecessarily large number of capacity misses. To
reduce the number of such misses, divide and conquer algorithms for the
EMM deal with k = O(M/B) sets of data, which reduces the number of
passes over the data to Ω(lg N/ lg k). For example, the analogue to Quicksort
in the EMM is distribution sorting, which recursively partitions data into k
sets, while an EMM Mergesort merges k sorted lists.
These EMM techniques imply that k sequences of data are accessed si-
multaneously and this can again cause problems in cache memory due to the
limited associativity of the cache. Consider the very simple example of ac-
cessing just k = 2 sequences in a direct mapped cache. If the start of the two
sequences map to the same cache block and there are round-robin accesses to
the two sequences then, other than the first accesses to each sequence, every
access will cause a cache conflict miss.
Mehlhorn and Sanders [543] analyse the number of cache misses in a set-
associative cache when an adversary makes N accesses to k sequences. Their
analysis assumes that the start of each sequence is uniformly and randomly
distributed in the cache. They show that in order to have O(N/B) cache
misses, asymptotically the same as the number of misses that must be made
in order just to read the data, the algorithm can use only k = M/B^{1+1/a}
sequences. This suggests that if EMM algorithmic techniques such as k-way
merging or partitioning into k sets are used in the CMM, then the parameter
k must be reduced by a factor of B^{1/a}.
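As a rough illustration with hypothetical parameters (a cache of M = 131072 items in blocks of B = 16 items), the following snippet evaluates this bound for a few associativities:

#include <cmath>
#include <cstdio>

int main() {
    const double M = 131072;   // items in the cache (illustrative)
    const double B = 16;       // items per block (illustrative)
    const double assoc[] = {1, 2, 4};
    for (double a : assoc) {
        // k = M / B^{1+1/a}: the number of concurrent sequences that still
        // allows O(N/B) misses on an a-way set-associative cache.
        double k = M / std::pow(B, 1.0 + 1.0 / a);
        std::printf("a = %.0f: k <= %.0f sequences (vs. M/B = %.0f blocks)\n",
                    a, k, M / B);
    }
}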
Rahman and Raman [619] analyse the cache misses on a direct-mapped
cache when distribution sort is applied to data independently drawn from a
non-uniformly random distribution. Their analysis also applies to multiple
sequence accesses and can be used to obtain tighter upper and lower bounds
on k when the probability distribution is known.
A large number of algorithms and data structures have been designed for the
EMM, see the survey papers [754, 55], which could potentially be used on
the CMM and IMM. However in the EMM block placement in fast memory
is under the control of the algorithm whereas in the IMM it is hardware
controlled. This can lead to conflict misses which would not have been con-
sidered in the EMM. Gannon and Jalby [324] show that cache conflict misses
could be avoided by copying frequently accessed non-contiguous data into
contiguous memory; this ensures that the copied data are mapped
to their own cache blocks. Since then the technique has been in common use,
for example, in [492] copying was used to reduce conflict misses in blocked
matrix multiplication.
Using the copying technique suggested in [324], Sen and Chatterjee [685]
give an emulation theorem that formalises the statement that an EMM al-
gorithm and its analysis can be converted to an equivalent algorithm and
analysis on the CMM. The same emulation theorem gives asymptotically the
same number of cache and TLB misses [620].
An emulation theorem that allows an algorithm and analysis for a 3-level
hierarchical memory model to be converted to an equivalent algorithm and
analysis on the IMM is given in [618]. This result can be used to obtain
simultaneous cache and TLB optimality by applying the emulation to optimal EMM algorithms.
In the EMM it is generally assumed that the cost of I/Os is much greater
than the cost of computation, hence the design of the algorithm is usually
motivated by the need to minimise the number of I/Os. However in the
CMM and the IMM the relative miss penalties are far smaller and we have
to consider computation costs. We will consider the implications of this in
the context of distribution sorting.
Using the analyses in [543, 619] and following the practice in the I/O
model for selecting the number of classes k such that the number of I/Os
are minimised would suggest that, on a direct-mapped cache, in one pass of
distribution sorting, the algorithm should use k = O(M/B^2) classes. However,
in practice, a fast multi-pass distribution sorting algorithm designed for
the CMM may use O(M/B) classes [621]. A very brief explanation for this is
demonstrated by the following example. On the Sun UltraSparc-II machine
the cost of an L2 cache miss is ≈ 30 CPU cycles and there are ≈ 30 com-
putations per key during one pass of the distribution sorting algorithm. An
algorithm which switches from one to two passes has a minimum additional
cost of N/B capacity misses, and 30 computations per key. The computation
cost translates to roughly 1 cache miss per key. So, on this machine, it is only
reasonable for the algorithm to switch from one to two passes, if the number
of cache misses in the first pass is more than N + 2N/B misses. The asymp-
totic analysis suggests the use of k = O(M/B^2) classes in order to obtain
O(N/B) misses in each pass, but clearly, given the cache miss penalty and
computation costs in this problem, this would have been non-optimal for the
overall running time.
This example demonstrates that, even if conflict misses are accounted
for, an algorithm designed for the EMM may not be directly applicable in
internal memory as the asymptotic I/O analysis would not lead to the optimal
parameter choices. The EMM algorithm could offer a good starting point but
would need further analysis to obtain good performance.
Algorithms designed for the EMM minimise the number of I/Os between two
levels of memory, disk and main memory. IMM algorithms have to minimise
misses between three memories. So an optimal EMM algorithm may not be
optimal on the IMM.
For given values of N and w, the only parameter that can be varied in
the algorithm is r. By increasing r we reduce the number of passes over
the data. The RAM model assumes unit cost for arithmetic operations and
During the count phase, LSB radix sort makes the following memory accesses:
1. A sequential read access to the source array, to move to the next key to
count.
2. One or more random read/write accesses to the count array(s) to incre-
ment the count values for one or more passes.
During the permute phase, the algorithm makes the following accesses:
1. A sequential read access to the source array, to find the next record to
move.
2. A random read/write access to the count array, to find where to move
the next record and to increment the count array location just read.
3. A random write to one of 2r active locations in the destination array, to
actually move the record.
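A minimal sketch of one LSB radix sort pass with radix 2^r on 32-bit keys, showing the count and permute phases just described (function and variable names are illustrative):

#include <cstddef>
#include <cstdint>
#include <vector>

void lsb_radix_pass(const std::vector<uint32_t>& src,
                    std::vector<uint32_t>& dst,
                    unsigned shift, unsigned r) {
    const uint32_t mask = (1u << r) - 1;
    std::vector<std::size_t> count(1u << r, 0);

    // Count phase: sequential read of the source array, random accesses to
    // the count array.
    for (uint32_t key : src) ++count[(key >> shift) & mask];

    // Turn the counts into the starting offset of each class in dst.
    std::size_t offset = 0;
    for (std::size_t& c : count) { std::size_t tmp = c; c = offset; offset += tmp; }

    // Permute phase: sequential read of the source array, random read/write
    // of the count array, and a random write to one of 2^r active positions
    // in the destination array.
    for (uint32_t key : src) dst[count[(key >> shift) & mask]++] = key;
}
// Sorting w-bit keys takes ceil(w/r) such passes, e.g. for r = 6 and w = 32:
// for (unsigned shift = 0; shift < 32; shift += 6) { lsb_radix_pass(a, b, shift, 6); std::swap(a, b); }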
Since many more random memory blocks are accessed at any one time
during the permute phase than during the count phase, the permute phase
leads to far more cache misses. Therefore, our parameter choices concentrate
on obtaining a cache efficient permute phase. Using the analysis in [619, 543]
we can determine that on a direct-mapped cache we should select k = 2^r
such that k = O(M/B^2). For larger values of k the number of conflict misses
in each pass would be asymptotically more than the number of compulsory
misses. Smaller values of k would increase the number of passes and so un-
necessarily increase the number of capacity misses.
On the Sun UltraSparc-II, when tuned for the CMM, the algorithm is
about 5% faster than when tuned for the RAM model. However the algorithm
does not out-perform a CMM tuned implementation of Quicksort.
The working set of pages is the set of pages an algorithm accesses at a par-
ticular time. If the program makes random accesses to these pages, and the
working set size is much larger than the size of the TLB, then the number
of TLB misses will be large. During the count phase, the algorithm accesses
the following working set of pages: (i) one active page in the source array
and (ii) W = 2^r/B_p count array pages. During the permute phase, the
algorithm accesses the following working set of pages: (i) one active page in
the source array, (ii) W = 2^r/B_p + min{2^r, N/B_p} count and destination
array pages. Since the working set of pages is larger for the permute phase
and since it makes more memory accesses, this phase will have more TLB
misses than the count phase. Thus, again our parameter choices concentrate
on the permute phase.
We now heuristically analyse the permute phase to calculate the number
of TLB misses for r = 6, . . . , 11. For all these values, the count array fits into
one page and we may assume that the count page and the current source
page, once loaded, will never be evicted. We further simplify the process of
accesses to the TLB and ignore disturbances caused when the source or one
of the destination pointers crosses a page boundary (as these are transient).
With these simplifications, TLB misses on accesses to the destination array
may be modelled as uniform random access to a set of 2^r pages, using an LRU
TLB of size T − 2. The probability of a TLB miss is then easily calculated
to be (2^r − (T − 2))/2^r.
This suggests that choosing r = 6 on the Sun UltraSparc-II still gives
a relatively low miss rate on average (miss probability 1/32), but choosing
r = 7 is significantly worse (miss probability 1/2).
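These probabilities can be checked directly from the formula, assuming T = 64 TLB entries (the value consistent with the figures quoted above):

#include <cstdio>

int main() {
    const int T = 64;                              // assumed number of TLB entries
    for (int r = 5; r <= 8; ++r) {
        int pages = 1 << r;                        // 2^r active destination pages
        double p = pages <= T - 2 ? 0.0
                 : double(pages - (T - 2)) / pages;   // (2^r - (T-2)) / 2^r
        std::printf("r = %d: miss probability = %.3f\n", r, p);
    }
}
// r = 6 gives 2/64 = 1/32 and r = 7 gives 66/128, roughly 1/2, matching the text.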
Experiments suggest that on random data, for r = 1, . . . , 5 a single per-
mute phase with radix r takes about the same time, as expected. Also, for
r = 7 the permute time is—as expected—considerably (about 150%) slower
than for r ≤ 5. However, even for r = 6 it is about 25% slower than r ≤ 5.
This is probably because in practice T is effectively 61 or 62—it seems that
the operating system reserves a few TLB entries for itself and locks them to
prevent them from being evicted. Even using the simplistic estimate above,
we should get a miss probability in the 1/13 to 1/16 range.
The choice r = 5 which guarantees good TLB performance turns out not
to give the best performance on random data: it requires seven passes for
sorting 32-bit data.
On the Sun UltraSparc-II, when tuned for the IMM, using r = 6, the
algorithm is about 55% faster than CMM tuned Quicksort and LSB radix
sort algorithms.
Note that for the algorithm to make an optimal O(N/B) cache misses
and O(N/B_p) TLB misses in each pass requires that T is not too small, i.e.,
we need log T = Θ(log(M/B)). See [620] for a more detailed discussion.
PLSB radix sort [620] is a variant of LSB radix sort which pre-sorts the
keys in small groups to increase locality and hence improves cache and TLB
performance. One pass of PLSB radix sort with radix r works in two stages.
First we divide the input array of N keys into contiguous segments of s ≤ N
keys each. Each segment is sorted using counting sort (a local sort) after
which we sort the entire array using counting sort (a global sort). In each
pass the time for sorting each of the N/s local sorts is O(s + 2^r) and
the time for the global sort is O(N + 2^r), so the running time for one pass of
PLSB radix sort is O(N + 2^r N/s).
The intuition for the algorithm is that each local sort groups keys of the
same class together and during the global sort we move sequences of keys
to successive locations in the sorted array, thus reducing TLB and cache
conflict misses between accesses to the destination array. The algorithm has
good temporal locality during the local sorts and good spatial locality during
the global sorts.
The segment size s is chosen such that the source and destination arrays
for a local sort both fit in cache, and map to non-conflicting cache locations.
The radix is chosen such that s/2^r = O(B); this ensures that there are an
optimal O(N/B) cache misses and O(N/B) TLB misses in each pass.
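A minimal sketch of one PLSB pass, assuming 32-bit keys and a caller-chosen segment size s and radix r (the helper counting_sort and the name plsb_pass are illustrative):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable counting sort of src[0..n) into dst[0..n) by the r-bit digit at 'shift'.
static void counting_sort(const uint32_t* src, uint32_t* dst, std::size_t n,
                          unsigned shift, unsigned r) {
    const uint32_t mask = (1u << r) - 1;
    std::vector<std::size_t> count(1u << r, 0);
    for (std::size_t i = 0; i < n; ++i) ++count[(src[i] >> shift) & mask];
    std::size_t offset = 0;
    for (std::size_t& c : count) { std::size_t t = c; c = offset; offset += t; }
    for (std::size_t i = 0; i < n; ++i) dst[count[(src[i] >> shift) & mask]++] = src[i];
}

// One pass of PLSB radix sort: local counting sorts on segments of s keys,
// followed by a global counting sort on the same digit (tmp has a.size() slots).
void plsb_pass(std::vector<uint32_t>& a, std::vector<uint32_t>& tmp,
               std::size_t s, unsigned shift, unsigned r) {
    const std::size_t n = a.size();
    for (std::size_t start = 0; start < n; start += s) {       // local sorts
        std::size_t len = std::min(s, n - start);
        counting_sort(&a[start], &tmp[start], len, shift, r);
    }
    counting_sort(tmp.data(), a.data(), n, shift, r);           // global sort
}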
On the Sun UltraSparc-II, using r = 11 and s = M/2 the algorithm is
twice as fast as CMM tuned Quicksort and LSB radix sort algorithms. The
algorithm is also 30% faster than IMM tuned LSB radix sort.
Chapter 16 discusses applying local sorting techniques in the context of
parallel sorting algorithms.
Table 8.1 summarises the running times obtained on the UltraSparc-II when
sorting 32-bit random integers using the various tuning techniques and pa-
rameter choices discussed above.
Table 8.1. Overall running times when sorting 32-bit unsigned integers using pre-
sorting LSB radix sort with an 11 bit radix (PLSB 11); cache and TLB tuned LSB
radix sort (LSB 6); cache tuned LSB radix sort (LSB 11); LSB radix sort tuned for
the RAM model (LSB 16); cache tuned Quicksort.
Timings(sec)
n PLSB 11 LSB 6 LSB 11 LSB 16 Quick
1 × 106 0.47 0.64 0.90 0.92 0.70
2 × 106 0.92 1.28 1.86 1.94 1.50
4 × 106 1.82 2.56 3.86 4.08 3.24
8 × 106 3.64 5.09 7.68 7.90 6.89
16 × 106 7.85 10.22 15.23 15.99 14.65
32 × 106 15.66 20.45 31.71 33.49 31.96
9. Cache Oblivious Algorithms∗
Piyush Kumar∗∗
9.1 Introduction
The cache oblivious model is a simple and elegant model to design algorithms
that perform well in hierarchical memory models ubiquitous on current sys-
tems. This model was first formulated in [321] and has since been a topic of
intense research. Analyzing and designing algorithms and data structures in
this model involves not only an asymptotic analysis of the number of steps
executed in terms of the input size, but also the movement of data optimally
among the different levels of the memory hierarchy. This chapter is aimed as
an introduction to the “ideal-cache” model of [321] and techniques used to de-
sign cache oblivious algorithms. The chapter also presents some experimental
insights and results.
A dream machine would be fast and would never run out of memory.
Since an infinite sized memory was never built, one has to settle for various
trade-offs in speed, size and cost. In both the past and the present, hardware
suppliers seem to have agreed on the fact that these parameters are well
optimized by building what is called a memory hierarchy (see Fig. 9.1, Chap-
ter 1). Memory hierarchies optimize the three factors mentioned above by
being cheap to build, trying to be as fast as the fastest memory present in
the hierarchy and being almost as cheap as the slowest level of memory. The
hierarchy inherently makes use of the assumption that the access pattern of
the memory has locality in it and can be exploited to speed up the accesses.
The locality in memory access is often categorized into two different types,
code reusing recently accessed locations (temporal) and code referencing data
items that are close to recently accessed data items (spatial) [392]. Caches use
both temporal and spatial locality to improve speed. Surprisingly many things
can be categorized as caches, for example, registers, L1, L2, TLB, Memory,
Disk, Tape etc. (Chapter 1). The whole memory hierarchy can be viewed
as levels of caches, each transferring data to its adjacent levels in atomic
units called blocks. When data that is needed by a process is in the cache, a
cache hit occurs. A cache miss occurs when data can not be supplied. Cache
misses can be very costly in terms of speed and can be reduced by designing
algorithms that use locality of memory access.
∗
An updated version of the chapter can be found at the webpage
http://www.compgeom.com/co-chap/
∗∗
Part of this work was done while the author was visiting MPI Saarbrücken. The
author is partially supported by NSF (CCR-9732220, CCR-0098172) and by the
grant from Sandia National Labs.
binary search trees both in theory and practice. In Section 9.7, a theoretically
optimal, randomized cache oblivious sorting algorithm along with the run-
ning times of an implementation is presented. In Section 9.8 we enumerate
some practicalities not caught by the model. Section 9.9 presents some of the
best known bounds of other cache oblivious algorithms. Finally we conclude
the chapter by presenting some related open problems in Section 9.10.
was shown in [321] that a fully associative LRU replacement policy can be
implemented in O(1) expected time using O(M/B) records of size O(B) in or-
dinary memory. Note that the above description about the cache oblivious
model proves that any optimal cache oblivious algorithm can also be opti-
mally implemented in the external memory model.
We now turn our attention to multi-level ideal caches. We assume that
all the levels of this cache hierarchy follow the inclusion property and are
managed by an optimal replacement strategy. Thus on each level, an opti-
mal cache oblivious algorithm will incur an asymptotically optimal number
of cache misses. From Lemma 9.1, this becomes true for cache hierarchies
maintained by LRU and FIFO replacement strategies.
Apart from not knowing the values of M, B explicitly, some cache oblivi-
ous algorithms (for example optimal sorting algorithms) require a tall cache
assumption. The tall cache assumption states that M = Ω(B^2), which is
usually true in practice. It is notable that regular optimal cache oblivious al-
gorithms are also optimal in SUMH [33] and HMM [13] models. Recently,
compiler support for cache oblivious type algorithms have also been looked
into [767, 773].
loads.
Chances are that the reader is already an expert in divide and conquer al-
gorithms. This paradigm of algorithm design is used heavily in both paral-
lel and external memory algorithms. It is not surprising that cache oblivious
algorithms make heavy use of this paradigm and many seemingly simple
algorithms that were based on this paradigm are already cache oblivious!
Proof. Choosing a random pivot makes at most one cache miss. Splitting the
input set into two output sets, such that the elements of one are all less than
the pivot and the other greater than or equal to the pivot makes at most
O(1 + N/B) cache misses by Exercise 1.
As soon as the size of the recursion fits into B, there are no more cache
misses for that subproblem (Q(B) = O(1)). Hence the average cache com-
plexity of a randomized quicksort algorithm can be given by the following re-
currence (which is very similar to the average case analysis presented in [215]).
Q(N) = (1/N) · Σ_{i=1..N−1} (Q(i) + Q(N − i)) + 1 + N/B,

which solves to O((N/B) log_2(N/B)).
In C++ matrices are stored in “row-major” order, i.e. the rightmost di-
mension varies the fastest (Chapter 10). In the above case, the number of
cache misses the code could do is O(N^2). The optimal cache oblivious matrix
transposition makes O(1 + N^2/B) cache misses. Before we go into the divide
and conquer based algorithm for matrix transposition that is cache oblivious,
let us see some experimental results (Fig. 9.2). The figure shows the run-
ning times of a blocked cache oblivious implementation: we stop the recursion
when the problem size becomes less than a certain block size and then use
the simple for loop implementation inside the block. In this experiment the
block sizes we chose were 32 bytes and 16 kilobytes. Note that using different
block sizes has little effect on the running time. This experiment was done
on Windows NT running on a 1 GHz/512 MB RAM notebook. The code was
compiled with g++ on cygwin.
Here is the C/C++ code for cache oblivious matrix transposition. The
following code takes as input a submatrix given by (x, y) − (x + delx, y + dely)
in the input matrix I and transposes it to the output matrix O. ElementType¹
can be any element type, for instance long.
¹ In all our experiments, ElementType was set to long.
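A minimal sketch of such a recursive transposition, assuming square row-major matrices of side width and an illustrative base-case cut-off (a sketch of the scheme, not the original listing):

typedef long ElementType;

// Transpose the submatrix (x, y) - (x + delx, y + dely) of I into O, where
// both I and O are square matrices of side 'width' stored in row-major order.
void transpose(const ElementType* I, ElementType* O, int width,
               int x, int delx, int y, int dely) {
    const int LEAFSIZE = 32;                 // illustrative base-case cut-off
    if (delx <= LEAFSIZE && dely <= LEAFSIZE) {
        for (int i = x; i < x + delx; ++i)
            for (int j = y; j < y + dely; ++j)
                O[j * width + i] = I[i * width + j];
        return;
    }
    // Split the longer dimension and recurse on the two halves.
    if (delx >= dely) {
        transpose(I, O, width, x, delx / 2, y, dely);
        transpose(I, O, width, x + delx / 2, delx - delx / 2, y, dely);
    } else {
        transpose(I, O, width, x, delx, y, dely / 2);
        transpose(I, O, width, x, delx, y + dely / 2, dely - dely / 2);
    }
}
// Usage: transpose(I, O, N, 0, N, 0, N) transposes an N x N matrix I into O.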
For N ≤ αB < P (Case II):
Q(N, P) ≤ O(1 + N)                 if αB/2 ≤ P ≤ αB,
Q(N, P) ≤ 2Q(N, P/2) + O(1)        otherwise.
Case III: P ≤ αB < N. Analogous to Case II.
Case IV: min{N, P} ≥ αB:
Q(N, P) ≤ O(N + P + NP/B)          if αB/2 ≤ N, P ≤ αB,
Q(N, P) ≤ 2Q(N, P/2) + O(1)        if P ≥ N,
Q(N, P) ≤ 2Q(N/2, P) + O(1)        otherwise.
Fig. 9.2. The graph compares a simple for loop implementation with a blocked
cache oblivious implementation of matrix transposition.
The above recurrence solves to Q(N, P) = O(1 + NP/B).
There is a simpler way to visualize the above mess. Once the recursion
makes the matrix small enough such that max(N, P) ≤ αB ≤ β√M (here
β is a suitable constant), or such that the submatrix (or the block) we need
to transpose fits in memory, the number of I/O faults is equal to the scan
of the elements in the submatrix. A packing argument of these not so small
submatrices (blocks) in the large input matrix shows that we do not do too
many I/O faults compared to a linear scan of all the elements.
Remark: Fig. 9.2 shows the effect of using blocked cache oblivious algorithm
for matrix transposition. Note that in this case, the simple for loop algorithm
is almost always outperformed. This comparison is not really fair. The cache
oblivious algorithm gets to use blocking whereas the naive for loop moves one
element at a time. A careful implementation of a blocked version of the simple
for loop might beat the blocked cache oblivious transposition algorithm in
practice. (see the timings of Algorithm 2 and 5 in [178]). The same remark
also applies to matrix multiplication. (Fig. 9.3)
in the array of allocated nodes, and then the B_i's are laid out. Every subtree
is recursively laid out.
Another way to see the algorithm is to run a breadth first search on the
top node of the tree and run it till √N nodes are in the BFS, see Fig. 9.4.
The figure shows the run of the algorithm for the first BFS when the tree
size is N. Then the tree consists of the part that is covered by the BFS and
trees hanging out. BFS can now be recursively run on each tree, including the
covered part. Note that in the second level of recursion, the tree size is √N
and the BFS will cover only N^{1/4} nodes since the same algorithm is run on
each subtree of size √N. The main idea behind the algorithm is to store recursive
sub-trees in contiguous blocks of memory.
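A minimal sketch of computing this layout for a complete binary tree stored as an array of nodes (the node representation, the explicit output order and the function names are illustrative assumptions):

#include <cstddef>
#include <vector>

struct Node { int key; std::size_t left, right; };   // child indices, 0 = none

// Collect the node indices exactly 'depth' levels below 'root'.
void collect_at_depth(const std::vector<Node>& tree, std::size_t root, int depth,
                      std::vector<std::size_t>& out) {
    if (root == 0) return;
    if (depth == 0) { out.push_back(root); return; }
    collect_at_depth(tree, tree[root].left, depth - 1, out);
    collect_at_depth(tree, tree[root].right, depth - 1, out);
}

// Append the nodes of the height-h subtree rooted at 'root' to 'order' in
// van Emde Boas order: first the top half of the tree, then each bottom subtree.
void veb_order(const std::vector<Node>& tree, std::size_t root, int h,
               std::vector<std::size_t>& order) {
    if (root == 0 || h <= 0) return;
    if (h == 1) { order.push_back(root); return; }
    int top = h / 2, bottom = h - top;
    veb_order(tree, root, top, order);                 // recursively lay out the top tree
    std::vector<std::size_t> bottom_roots;             // roots of the bottom trees
    collect_at_depth(tree, root, top, bottom_roots);
    for (std::size_t r : bottom_roots)
        veb_order(tree, r, bottom, order);             // then each bottom tree
}
// order[k] is the index of the node to be stored at position k of the new,
// contiguously laid out array (the copying step used in the experiments below).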
Let's now try to analyze the number of cache misses when a search is per-
formed. We can conceptually stop the recursion at the level of detail where
the subtrees have size ≤ B. Since these subtrees are stored contiguously,
each of them fits in at most two blocks (a contiguously stored subtree of size
at most B cannot span more than two memory blocks). The height of these
subtrees is log B. A search path from root to leaf crosses O(log N/ log B) =
O(log_B N) subtrees. So the total number of cache misses is bounded by
O(log_B N).
Exercise 6 Show that the Van Emde Boas layout can be at most a constant
factor of 4 away from an optimal layout (which knows the parameter B).
Show that the constant 4 can be reduced to 2 for average case queries. Get a
better constant factor than 2 in average case.
9.6.1 Experiments
We did a very simple experiment to see how, in real life, this kind of layout
would help. A vector was sorted and a binary search tree was built on it.
A query vector was generated with random numbers and searched on this
BST, which was laid out in pre-order. We chose pre-order rather than a
random layout because most people code a BST in pre-, post- or in-order
rather than laying it out randomly (which, incidentally, is very bad for
cache health).
Once this query was done, we laid out the BST using the Van Emde Boas
layout and gave it the same query vector. Before timing both trees, we made
sure that both had been queried enough to begin with; otherwise the order of
timing could also affect the search times (because the tree that is searched
last gets some help from the cache). The code written for this experimentation
is below 300 lines. The experiments reported in Fig. 9.5 were done on an
Itanium dual processor system with 2GB RAM (only one processor was used).
Currently the code copies the entire array in which the tree exists into another
array when it makes the tree cache oblivious. This could also be done in the same
array, though that would have complicated the implementation a bit. One way to do
this is to maintain pointers to and from the array of nodes to the tree structure,
and to swap nodes instead of copying them into a new array. Another way would be
to use a permutation table. We chose to copy the whole tree into a new array simply
because this seemed to be the easiest way to test the speedup given by cache
oblivious layouts. For more detailed experimental results on searching in cache
aware and cache oblivious search trees, the reader is referred to [490]. There is
a big difference between the graphs reported here for searching and those in [490].
One of the reasons might be that the node size was fixed at 4 bytes in [490],
whereas the experiments reported here use larger nodes.
9.7 Sorting
Sorting is a fundamental problem in computing. Sorting very large data sets is a
key routine in many external memory applications. We have already seen how to
sort optimally in the external memory model. In this section we outline some
theory and experimentation related to sorting in the cache oblivious model.
Excellent references for reading more on the influence of caches on the
performance of sorting are [494, 588], Chapter 16 and Chapter 8.
There are two optimal cache oblivious sorting algorithms known, funnel sort and
distribution sort. Funnel sort is derived from merge sort, and distribution sort
is a generalization of quicksort that distributes the input into buckets
determined by splitters.
Fig. 9.5. Comparison of van Emde Boas searches with pre-order searches on a
balanced binary tree. Like the previous experiment, this experiment was performed
on an Itanium with 48 byte node sizes.
The sorting procedure assumes the presence of three extra functions which are
cache oblivious: choosing a random sample in Step 2, the counting in Step 4, and
the distribution in Step 6. The random sampling step is used to determine
splitters for the buckets.
Fig. 9.6. Two Pass cache oblivious distribution sort, one level of recursion
Fig. 9.7. Two Pass cache oblivious distribution sort, two levels of recursion
Let us now peek at the analysis. For a good set of splitters,
Pr(∃i : c_i ≥ α√N) ≤ N · e^(−(1 − 1/α)² βα/2) (for some α > 1, β > log N) [131]. This follows from
Chernoff bound type arguments. Once we know that the subproblems we are
going to work on are bounded in size with high probability, the expected
cache complexity follows the recurrence:
Q(N) ≤ O(1 + N/B)                                       if N ≤ αM,
Q(N) ≤ 2√N · Q(√N) + Q(√N log N) + O(1 + N/B)           otherwise.      (9.2)
In theory, both the cache oblivious and the external memory models are nice to
work with because of their simplicity. A lot of the work done in the external
memory model has been turned into practical results as well. Before one gets
one's hands dirty implementing an algorithm in the cache oblivious or the
external memory model, one should be aware of practical issues that can hurt the
speed of the code but are not caught by the theoretical setup.
Here we list a few practical glitches that are shared by both the cache oblivious
and the external memory model. The ones that are not shared are marked²
accordingly. A reader who wants to use these models to design practical
algorithms, and especially one who wants to write code, should keep these issues
in mind. Code written and algorithms designed with the following points in mind
can be a lot faster than a direct implementation of an algorithm that is merely
optimal in either the cache oblivious or the external memory model.
TLBᵒ: TLBs are caches on page tables; they are usually small, with 128–256
entries, and behave much like any other cache. They can be implemented as fully
associative. The model does not take into account the fact that TLBs are not
tall. For the importance of the TLB for the performance of programs refer to the
section on cache oblivious models in Chapter 8.
Concurrency: The model does not talk about I/O and CPU concurrency, which
automatically loses a factor of about 2 in the constants. The need for speed
might drive future uniprocessor systems to diversify and look for alternative
solutions in terms of concurrency on a single chip; the hyper-threading³
introduced by Intel in its latest Xeons is a glaring example. On these kinds of
systems and on other multiprocessor systems, coherence misses might become an
issue. This is hard to capture in the cache oblivious model, and for most
algorithms that have been devised in this model, concurrency is still an open
problem. A parallel cache oblivious model would be most welcome for practitioners
who would like to apply cache oblivious algorithms to multiprocessor systems
(see Chapter 16).
Associativityᵒ: The assumption of a fully associative cache is not realistic. In
practice, caches are either direct mapped or k-way set associative (typically
k = 2, 4, 8). If two objects map to the same location in the cache and are
referenced in temporal proximity, the accesses become costlier than assumed in
the model (also known as the cache interference problem [718]). Also, k-way set
associative caches are implemented using more comparators (see Chapter 8).
² A superscript 'o' means that the issue only applies to the cache oblivious
model.
³ One physical processor Intel Xeon MP forms two logical processors which share
the CPU's computational resources. The software sees two CPUs and can distribute
the work load between them as on a normal dual processor system.
Instruction/Unified Caches: Rarely executed, special case code disrupts locality.
Loops with few iterations that call other routines make loop locality hard to
exploit, and plenty of loopless code hampers temporal locality. Issues related to
instruction caches are not modeled in the cache oblivious model. Unified caches,
in which instruction and data caches are merged (e.g., the L2 and L3 caches of
the latest Intel Itanium chips, or in the Intel PIII), are used in some machines.
These are another challenge to handle in the model.
Replacement Policyᵒ: Current operating systems do not page more than 4GB of
memory because of address space limitations. That means one would have to use
legacy code on these systems for paging. This problem makes portability of cache
oblivious code for big problems a myth! In the experiments reported in this
chapter, we could not do external memory experimentation because the OS did not
allow us to allocate arrays of more than a GB or so. One can overcome this
problem by writing one's own paging system on top of the OS to experiment with
cache oblivious algorithms on huge data sizes, but then it is not so clear
whether writing a paging system is easier than handling disks explicitly in an
application. This problem does not exist on 64-bit operating systems and should
go away with time.
Multiple Disksᵒ: For “most” applications where the data is huge and external
memory algorithms are required, using multiple disks is an option to increase
I/O efficiency. As of now, the cache oblivious model does not take the existence
of multiple disks in a system into account.
Write-through cachesᵒ: The L1 caches of many new CPUs are write through, i.e.
they transmit a written value to the L2 cache immediately [319, 392]. Write
through caches are simpler to manage and can always discard cached data without
any bookkeeping (read misses cannot result in writes). With write through caches
(e.g. DECStation 3100, Intel Itanium), one can no longer argue that there are no
misses once the problem size fits into cache! Victim caches, implemented in HP
and Alpha machines, are small buffers that reduce the effect of conflicts in
set-associative caches. These also should be kept in mind when designing code
for these machines.
Complicated Algorithmsᵒ and Asymptotics: For non-trivial problems the algorithms
can become quite complicated and impractical; a glaring instance is sorting. The
factors by which the data transfer speeds of different levels of memory differ
are constants! For instance, the speed difference between the L1 and L2 caches
on a typical Intel Pentium can be around 10. Using O() notation for an algorithm
that is trying to beat a constant of 10, and sometimes not even talking about
those constants while designing algorithms, can show up in practice (also see
Chapter 8). For instance there are “con-
We present here problems, related bounds, and references for interested readers.
Note that in the table, sort() and scan() denote the number of cache misses
incurred by optimal cache oblivious sorting and scanning, respectively.
(Table: data structure/algorithm, cache complexity, supported operations.)
Sorting strings in the cache oblivious model is still open. Optimal shortest
paths and minimum spanning forests still need to be explored in the model.
Optimal simple convex hull algorithms in d dimensions are open. There are many
problems that can still be explored in this model, both theoretically and
practically.
9.11 Acknowledgements
The author would like to thank Michael Bender, Matteo Frigo, Joe Mitchell,
Edgar Ramos, and Peter Sanders for discussions on cache obliviousness, and
MPI Informatik, Saarbrücken, Germany, for hosting him.
10. An Overview of Cache Optimization
Techniques and Cache-Aware Numerical
Algorithms∗
Markus Kowarschik and Christian Weiß
10.1 Introduction
In order to mitigate the impact of the growing gap between CPU speed and main
memory performance, today's computer architectures implement hierarchical memory
structures. The idea behind this approach is to hide both the low main memory
bandwidth and the latency of main memory accesses, which are slow in contrast to
the floating-point performance of the CPUs. At the top of the hierarchy sits a
small and expensive high speed memory, usually integrated within the processor
chip to provide data with low latency and high bandwidth: the CPU registers.
Moving
further away from the CPU, the layers of memory successively become larger
and slower. The memory components which are located between the proces-
sor core and main memory are called cache memories or caches. They are
intended to contain copies of main memory blocks to speed up accesses to
frequently needed data [378, 392]. The next lower level of the memory hier-
archy is the main memory which is large but also comparatively slow. While
external memory such as hard disk drives or remote memory components in a
distributed computing environment represent the lower end of any common
hierarchical memory design, this paper focuses on optimization techniques
for enhancing cache performance.
The levels of the memory hierarchy usually subset one another so that
data residing within a smaller memory are also stored within the larger mem-
ories. A typical memory hierarchy is shown in Fig. 10.1.
Efficient program execution can only be expected if the codes respect the
underlying hierarchical memory design. Unfortunately, today’s compilers can-
not introduce highly sophisticated cache-based transformations and, conse-
quently, much of this optimization effort is left to the programmer [335, 517].
This is particularly true for numerically intensive codes, which our pa-
per concentrates on. Such codes occur in almost all science and engineering
disciplines; e.g., computational fluid dynamics, computational physics, and
mechanical engineering. They are characterized both by a large portion of
floating-point (FP) operations as well as by the fact that most of their ex-
ecution time is spent in small computational kernels based on loop nests.
∗ This research is being supported in part by the Deutsche Forschungsgemeinschaft
(German Science Foundation), projects Ru 422/7–1,2,3.
Fig. 10.1. A typical memory hierarchy containing two on-chip L1 caches, one
on-chip L2 cache, and a third level of off-chip cache. The thickness of the intercon-
nections illustrates the bandwidths between the memory hierarchy levels.
On-chip level one (L1) caches are typically split into two parts; one only keeps
data, the other instructions. The latency of on-chip
caches is commonly one or two cycles. The chip designers, however, already
face the problem that large on-chip caches of new microprocessors running at
high clock rates cannot deliver data within one cycle since the signal delays
are too long. Therefore, the size of on-chip L1 caches is limited to 64 Kbyte or
even less for many chip designs. However, larger cache sizes with accordingly
higher access latencies start to appear.
The L1 caches are usually backed up by a level two (L2) cache. A few years
ago, architectures typically implemented the L2 cache on the motherboard,
using SRAM chip technology. Currently, L2 cache memories are typically
located on-chip as well; e.g., in the case of Intel’s Itanium CPU. Off-chip
caches are much bigger, but also provide data with lower bandwidth and
higher access latency. On-chip L2 caches are usually smaller than 512 Kbyte
and deliver data with a latency of approximately 5 to 10 cycles. If the L2
caches are implemented on-chip, an off-chip level three (L3) cache may be
added to the hierarchy. Off-chip cache sizes vary from 1 Mbyte to 16 Mbyte.
They provide data with a latency of about 10 to 20 CPU cycles.
Because of their limited size, caches can only hold copies of recently used
data or code. Typically, when new data are loaded into the cache, other data
have to be replaced. Caches improve performance only if cache blocks which
have already been loaded are reused before being replaced by others. The
reason why caches can substantially reduce program execution time is the
principle of locality of references [392] which states that recently used data
are very likely to be reused in the near future. Locality can be subdivided
into temporal locality and spatial locality. A sequence of references exhibits
temporal locality if recently accessed data are likely to be accessed again
in the near future. A sequence of references exposes spatial locality if data
located close together in address space tend to be referenced close together
in time.
order to guarantee low access latency, the question into which cache line the
data should be loaded and how to retrieve them henceforth must be handled
efficiently.
With respect to hardware complexity, the cheapest approach to implementing
block placement is direct mapping; the contents of a memory block can be
placed into exactly one cache line. Direct mapped caches have been among
the most popular cache architectures in the past and are still very common
for off-chip caches.
However, computer architects have recently focused on increasing the set
associativity of on-chip caches. An a-way set-associative cache is characterized
by a higher hardware complexity, but usually implies higher hit rates. The
cache lines of an a-way set-associative cache are grouped into sets of size a.
The contents of any memory block can be placed into any cache line of the
corresponding set.
Finally, a cache is called fully associative if the contents of a memory
block can be placed into any cache line. Usually, fully associative caches are
only implemented as small special-purpose caches; e.g., TLBs [392]. Direct
mapped and fully associative caches can be seen as special cases of a-way
set-associative caches; a direct mapped cache is a 1-way set-associative cache,
whereas a fully associative cache is C-way set-associative, provided that C is
the number of cache lines.
In a fully associative cache and in a k-way set-associative cache, a mem-
ory block can be placed into several alternative cache lines. The question
into which cache line a memory block is copied and which block thus has
to be replaced is decided by a (block) replacement strategy. The most com-
monly used strategies for today’s microprocessor caches are random and least
recently used (LRU). The random replacement strategy chooses a random
cache line to be replaced. The LRU strategy replaces the block which has
not been accessed for the longest time interval. According to the principle of
locality, it is more likely that a data item which has been accessed recently
will be accessed again in the near future.
Less common strategies are least frequently used (LFU) and first in, first
out (FIFO). The former replaces the memory block in the cache line which
has been used least frequently, whereas the latter replaces the data which
have been residing in cache for the longest time.
Finally, the optimal replacement strategy replaces the memory block
which will not be accessed for the longest time. It is impossible to implement
this strategy in a real cache, since it requires information about future cache
references. Thus, the strategy is only of theoretical value; for any possible
sequence of references, a fully associative cache with optimal replacement
strategy will produce the minimum number of cache misses among all types
of caches of the same size [717].
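To make these notions concrete, the following sketch simulates an a-way
set-associative cache with LRU replacement and counts hits and misses for an
address trace. The sizes (SETS, WAYS, BLOCK) and the function name access_cache
are illustrative choices, not the parameters of any particular machine.

  #include <stdio.h>

  #define SETS  64                /* number of sets                    */
  #define WAYS   4                /* associativity a                   */
  #define BLOCK 64                /* cache line size in bytes          */

  static unsigned long tag[SETS][WAYS];
  static unsigned long last_used[SETS][WAYS];
  static int valid[SETS][WAYS];
  static unsigned long now;       /* logical time for the LRU decision */

  static int access_cache(unsigned long addr) {       /* returns 1 on a hit */
      unsigned long block = addr / BLOCK;
      int set = (int)(block % SETS);
      unsigned long t = block / SETS;
      now++;
      for (int w = 0; w < WAYS; w++)
          if (valid[set][w] && tag[set][w] == t) {
              last_used[set][w] = now;                 /* hit: update LRU info */
              return 1;
          }
      int victim = 0;                                  /* miss: choose a line  */
      for (int w = 0; w < WAYS; w++) {
          if (!valid[set][w]) { victim = w; break; }   /* prefer an empty line */
          if (last_used[set][w] < last_used[set][victim]) victim = w;
      }
      valid[set][victim] = 1;                          /* replace the LRU line */
      tag[set][victim] = t;
      last_used[set][victim] = now;
      return 0;
  }

  int main(void) {
      long hits = 0, misses = 0;
      for (unsigned long i = 0; i < 1000000; i++)      /* sequential 8-byte accesses */
          if (access_cache(8 * i)) hits++; else misses++;
      printf("hits = %ld, misses = %ld\n", hits, misses);
      return 0;
  }

A direct mapped cache is obtained with WAYS = 1 and a fully associative cache
with SETS = 1; replacing the last_used bookkeeping by a random choice yields the
random replacement strategy.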
In general, profiling tools are used in order to determine if a code runs effi-
ciently, to identify performance bottlenecks, and to guide code optimization
[335]. One fundamental concept of any memory hierarchy, however, is to hide
the existence of caches. This generally complicates data locality optimiza-
tions; a speedup in execution time only indicates an enhancement of locality
behavior, but it is no evidence.
To allow performance profiling regardless of this fact, many microproces-
sor manufacturers add dedicated registers to their CPUs in order to count
certain events. These special-purpose registers are called hardware perfor-
mance counters. The information which can be gathered by the hardware
performance counters varies from platform to platform. Typical quantities
which can be measured include cache misses and cache hits for various cache
levels, pipeline stalls, processor cycles, instruction issues, and branch mis-
predictions. Some prominent examples of profiling tools based on hardware
performance counters are the Performance Counter Library (PCL) [117], the
Performance Application Programming Interface (PAPI) [162], and the Digi-
tal Continuous Profiling Infrastructure (DCPI) (Alpha-based Compaq Tru64
UNIX only) [44].
Another approach towards evaluating code performance is based on in-
strumentation. Profiling tools such as GNU gprof [293] and ATOM [282] in-
sert calls to a monitoring library into the program to gather information
for small code regions. The library routines may either include complex pro-
grams themselves (e.g., simulators) or only modify counters. Instrumentation
is used, for example, to determine the fraction of the CPU time spent in a
certain subroutine. Since the cache is not visible to the instrumented code
the information concerning the memory behavior is limited to address traces
and timing information.
Finally, cache performance information can be obtained by cache mod-
eling and simulation [329, 383, 710] or by machine simulation [636]. Simula-
tion is typically very time-consuming compared to regular program execution.
Thus, the cache models and the machine models often need to be simplified
in order to reduce simulation time. Consequently, the results are often not
precise enough to be useful.
Data access optimizations are code transformations which change the order
in which iterations in a loop nest are executed. The goal of these transforma-
tions is mainly to improve temporal locality. Moreover, they can also expose
parallelism and make loop iterations vectorizable. Note that the data access
words of an array are loaded into a cache line. If the array is larger than the
cache, accesses with large stride only use one word per cache line. The other
words which are loaded into the cache line are evicted before they can be
reused.
Loop interchange can also be used to enable and improve vectorization
and parallelism, and to improve register reuse. The different targets may
be conflicting. For example, increasing parallelism requires loops with no
dependencies to be moved outward, whereas vectorization requires them to
be moved inward.
Loop fusion also improves data locality. Assume that two consecutive
loops perform global sweeps through an array as in the code shown in Algo-
rithm 10.3.2, and that the data of the array are too large to fit completely in
cache. The data of array b which are loaded into the cache by the first loop
will not completely remain in cache, and the second loop will have to reload
the same data from main memory. If, however, the two loops are combined
with loop fusion, only one global sweep through the array b will be performed.
Consequently, fewer cache misses will occur.
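Since Algorithm 10.3.2 is not reproduced in this excerpt, the following sketch
uses illustrative arrays and loop bodies to show the transformation.

  #define N 1000000
  double a[N], b[N], c[N];

  void separate_loops(void) {            /* two global sweeps over b */
      for (int i = 0; i < N; i++) a[i] = b[i] + 1.0;
      for (int i = 0; i < N; i++) c[i] = 2.0 * b[i];
  }

  void fused_loop(void) {                /* after loop fusion: one sweep, b[i] is reused while in cache */
      for (int i = 0; i < N; i++) {
          a[i] = b[i] + 1.0;
          c[i] = 2.0 * b[i];
      }
  }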
Loop Blocking. Loop blocking (also called loop tiling) is a loop transformation
which increases the depth of a loop nest with depth n by adding additional
loops to the loop nest. The depth of the resulting loop nest will be anything
from n + 1 to 2n. Loop blocking is primarily used to improve data locality
by enhancing the reuse of data in cache [30, 705, 768].
The need for loop blocking is illustrated in Algorithm 10.3.3. Assume that
the code reads an array a with stride-1, whereas the access to array b is of
stride-n. Interchanging the loops will not help in this case since it would cause
the array a to be accessed with stride-n instead.
Tiling a single loop replaces it by a pair of loops. The inner loop of the
new loop nest traverses a block of the original iteration space with the same
increment as the original loop. The outer loop traverses the original iteration
space with an increment equal to the size of the block which is traversed by
the inner loop. Thus, the outer loop feeds blocks of the whole iteration space
to the inner loop which then executes them step by step. The change in the
Fig. 10.3. Iteration space traversal for original and blocked code.
A very prominent example for the impact of the loop blocking transfor-
mation on data locality is matrix multiplication [127, 461, 492, 764], see also
Section 10.4.2. In particular, the case of sparse matrices is considered in [577].
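As an illustration (the code of Algorithm 10.3.3 is not reproduced in this
excerpt), the following sketch blocks a two-dimensional loop nest in which a is
accessed with stride 1 and b with stride N. The tile sizes BI and BJ are
illustrative values that would have to be tuned so that the tiles of both arrays
fit into the cache.

  #define N  2048
  #define BI   64                              /* illustrative tile sizes; N is a multiple of both */
  #define BJ   64
  double a[N][N], b[N][N];

  void blocked_sweep(void) {
      for (int ii = 0; ii < N; ii += BI)       /* outer loops feed one tile at a time */
          for (int jj = 0; jj < N; jj += BJ)
              for (int i = ii; i < ii + BI; i++)    /* inner loops traverse the tile  */
                  for (int j = jj; j < jj + BJ; j++)
                      a[i][j] += b[j][i];      /* stride-1 access to a, stride-N access to b */
  }

Note that the depth of the loop nest grows from 2 to 4, as described above.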
Data Prefetching. The loop transformations discussed so far aim at reducing
the capacity misses which occur in the course of a computation. Misses which
are introduced by first-time accesses are not addressed by these optimizations.
Prefetching allows the microprocessor to issue a data request before the com-
putation actually requires the data [747]. If the data are requested early
enough the penalty of cold (compulsory) misses as well as capacity misses
not covered by loop transformations can be hidden².
Many modern microprocessors implement a prefetch instruction which is
issued as a regular instruction. The prefetch instruction is similar to a load,
with the exception that the data are not forwarded to the CPU after they
have been cached. The prefetch instruction is often handled as a hint for the
processor to load a certain data item, but the actual execution of the prefetch
is not guaranteed by the CPU.
Prefetch instructions can be inserted into the code manually by the pro-
grammer or automatically by a compiler [558]. In both cases, prefetching
involves overhead. The prefetch instructions themselves have to be executed;
i.e., pipeline slots will be filled with prefetch instructions instead of other
instructions ready to be executed. Furthermore, the memory addresses of the
prefetched data must be calculated and will be calculated again when the
load operation is executed which actually fetches the data from the memory
hierarchy into the CPU.
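As a small illustration, the following sketch inserts prefetch instructions by
hand using the __builtin_prefetch intrinsic of GCC and compatible compilers
(other compilers provide similar intrinsics). The prefetch distance PF_DIST is
an illustrative value that has to be tuned to the memory latency of the target
machine.

  #define PF_DIST 16                           /* prefetch this many iterations ahead */

  double sum(const double *x, int n) {
      double s = 0.0;
      for (int i = 0; i < n; i++) {
          if (i + PF_DIST < n)                 /* request x[i + PF_DIST] early         */
              __builtin_prefetch(&x[i + PF_DIST], 0, 1);   /* read access, low temporal locality */
          s += x[i];
      }
      return s;
  }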
Besides software-based prefetching, hardware schemes have been proposed
and implemented which add prefetching capability to a system without the
² For a classification of cache misses we refer to Chapter 8.
Data access optimizations have proven to be able to improve the data lo-
cality of applications by reordering the computation, as we have shown in
the previous section. However, for many applications, loop transformations
alone may not be sufficient for achieving reasonable data locality. Especially
for computations with a high degree of conflict misses³, loop transformations
are not effective in improving performance [632].
Data layout optimizations modify how data structures and variables are
arranged in memory. These transformations aim at avoiding effects like cache
conflict misses and false sharing [392], see Chapter 16. They are further in-
tended to improve the spatial locality of a code.
Data layout optimizations include changing base addresses of variables,
modifying array sizes, transposing array dimensions, and merging arrays.
These techniques are usually applied at compile time, although some opti-
mizations can also be applied at runtime.
Array Padding. If two arrays are accessed in an alternating manner as in
Algorithm 10.3.4 and the data structures happen to be mapped to the same
cache lines, a high number of conflict misses are introduced.
In the example, reading the first element of array a will load a cache
line containing this array element and possibly subsequent array elements for
further use. Provided that the first array element of array b is mapped to the
same cache line as the first element of array a, a read of the former element
will trigger the cache to replace the elements of array a which have just been
loaded. The following access to the next element of array a will no longer be
satisfied by the cache, thus forcing the cache to reload the data and in turn to
replace the data of array b. Hence, the array b elements must be reloaded,
and so on. Although both arrays are referenced sequentially with stride-1, no
reuse of data which have been preloaded into the cache will occur since the
data are evicted immediately by elements of the other array, after they have
³ See again Chapter 8.
been loaded.
Array Merging. Merging two arrays into an array of structures as shown in
Algorithm 10.3.5 will change the data layout such that the elements become
contiguous in memory.
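Since Algorithms 10.3.4 and 10.3.5 are not reproduced in this excerpt, the
following sketch uses two illustrative arrays to show the transformation from
separate arrays to a merged array of structures.

  #define N 1024

  /* original layout: a[i] and b[i] may be mapped to the same cache lines */
  double a[N], b[N];

  /* merged layout: a[i] and b[i] are adjacent and share a cache line */
  struct pair { double a, b; };
  struct pair merged[N];

  double sweep(void) {
      double s = 0.0;
      for (int i = 0; i < N; i++)
          s += merged[i].a * merged[i].b;   /* one interleaved stream instead of two conflicting ones */
      return s;
  }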
Array Transpose. This technique permutes the dimensions within multi-
dimensional arrays and eventually reorders the array as shown in Algo-
rithm 10.3.6 [202]. This transformation has a similar effect as loop inter-
change, see Section 10.3.1.
the benefits from copying the data. Hence a compile time strategy has been
introduced in order to determine when to copy data [719]. This technique is
based on an analysis of cache conflicts.
The BLAS library is divided into three levels. Level 1 BLAS do vector-vector
operations; e.g., so-called AXPY computations such as y ← αx + y and dot
products such as α ← β + xᵀy. Level 2 BLAS do matrix-vector operations; e.g.,
y ← α op(A)x + βy, where op(A) = A, Aᵀ, or Aᴴ. Finally, Level 3 BLAS
do matrix-matrix operations such as C ← αop(A)op(B) + βC. Dedicated
routines are provided for special cases such as symmetric and Hermitian ma-
trices. BLAS provides similar functionality for real and complex data types,
in both single and double precision.
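The following fragment shows one representative call of each BLAS level through
the C interface (CBLAS); the header name and the linking details depend on the
BLAS implementation that is installed on the system.

  #include <cblas.h>

  void blas_examples(int n, double alpha, double beta,
                     double *x, double *y, double *A, double *B, double *C) {
      /* Level 1: AXPY, y <- alpha*x + y */
      cblas_daxpy(n, alpha, x, 1, y, 1);

      /* Level 2: GEMV, y <- alpha*A*x + beta*y (A is n x n, row-major) */
      cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n, alpha, A, n, x, 1, beta, y, 1);

      /* Level 3: GEMM, C <- alpha*A*B + beta*C (all matrices n x n, row-major) */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n, alpha, A, n, B, n, beta, C, n);
  }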
LAPACK is another software library which is often used by numerical
applications [43]. LAPACK is based on the BLAS and implements routines
for solving systems of linear equations, computing least-squares solutions of
linear systems, and solving eigenvalue as well as singular value problems.
The associated routines for factorizing matrices are also provided; e.g., LU,
Cholesky, and QR decomposition. LAPACK handles dense and banded ma-
trices, see Section 10.4.4 below for a discussion of iterative solvers for sparse
linear systems. In analogy to the BLAS library, LAPACK implements sim-
ilar functionality for real and complex matrices, in both single and double
precision.
Our presentation closely follows the research efforts of the ATLAS⁶ project
[764]. This project concentrates on the automatic application of empirical
code optimization techniques for the generation of highly optimized platform-
specific BLAS libraries. The basic idea is to successively introduce source-to-
source transformations and evaluate the resulting performance, thus generat-
ing the most efficient implementation of BLAS. It is important to note that
ATLAS still depends on an optimizing compiler for applying architecture-
dependent optimizations and generating efficient machine code. A similar
tuning approach has guided the research in the FFTW project [320].
ATLAS mainly targets the optimizations of Level 2 and Level 3 BLAS
while relying on the underlying compiler to generate efficient Level 1 BLAS.
This is due to the fact that Level 1 BLAS basically contains no memory reuse
and high level source code transformations only yield marginal speedups.
On the contrary, the potential for data reuse is high in Level 2 and even
higher in Level 3 BLAS due to the occurrence of at least one matrix operand.
Concerning the optimization of Level 2 BLAS, ATLAS implements both reg-
ister blocking⁷ and loop blocking. In order to illustrate the application of
these techniques it is sufficient to consider the update operation y ← Ax + y,
where A is an n × n matrix and x, y are vectors of length n. This operation
can also be written as
⁶ ATLAS: Automatically Tuned Linear Algebra Software. More details are provided
on http://math-atlas.sourceforge.net.
⁷ The developers of ATLAS refer to the term register blocking as a technique to
explicitly enforce the reuse of CPU registers by introducing temporary variables.
y_i ← Σ_{j=1}^{n} a_{i,j} x_j + y_i ,      1 ≤ i ≤ n ,
see [764]. By keeping the current value y_i in a CPU register (i.e., by applying
register blocking), the number of read/write accesses to y can be reduced from
O(n²) to O(n). Furthermore, unrolling the outermost loop and hence updating k
components of the vector y simultaneously can reduce the number of accesses to x
by a factor of 1/k to n²/k. This is due to the fact that each x_j contributes to
each y_i. In addition, loop blocking can be introduced in order to reduce the
number of main memory accesses to the components of the vector x from O(n²) to
O(n) [764], see Section 10.3 for details. This means that loop blocking can be
applied in order to load x only once into the cache.
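A sketch of these two techniques for y ← Ax + y with an illustrative unrolling
factor of k = 2 is given below (row-major storage, n assumed even; this is a
simplified illustration, not ATLAS code).

  void matvec_unrolled(int n, const double *A, const double *x, double *y) {
      for (int i = 0; i < n; i += 2) {
          double y0 = y[i];              /* register blocking: keep y values in registers */
          double y1 = y[i + 1];
          for (int j = 0; j < n; j++) {
              double xj = x[j];          /* each loaded x[j] is reused for two rows of A  */
              y0 += A[i * n + j] * xj;
              y1 += A[(i + 1) * n + j] * xj;
          }
          y[i]     = y0;                 /* write y back only once per row */
          y[i + 1] = y1;
      }
  }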
While Level 2 BLAS routines require O(n²) data accesses in order to perform
O(n²) FP operations, Level 3 BLAS routines need O(n²) data accesses to execute
O(n³) FP operations, thus containing a higher potential for data
reuse. Consequently, the most significant speedups are obtained by tuning
the cache performance of Level 3 BLAS; particularly the matrix multiply.
This is achieved by implementing an L1 cache-contained matrix multiply
and partitioning the original problem into subproblems which can be com-
puted in cache [764]. In other words, the optimized code results from blocking
each of the three loops of a standard matrix multiply algorithm, see again
Section 10.3, and calling the L1 cache-contained matrix multiply code from
within the innermost loop. Fig. 10.5 illustrates the blocked algorithm. In or-
der to compute the shaded block of the product C, the corresponding blocks
of its factors A and B have to be multiplied and added.
Fig. 10.5. Blocked matrix multiply algorithm.
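The following sketch corresponds to Fig. 10.5: all three loops of the standard
algorithm are blocked and a cache-contained mini multiply is executed on each
block triple. The tile size NB is an illustrative value, the matrices are stored
row-major with dimensions assumed to be multiples of NB, and C is updated with
C ← C + A·B; this is a simplified illustration, not ATLAS's actual kernel.

  #define NB 64          /* illustrative tile size, chosen so that three NB x NB blocks fit in L1 */

  void blocked_mm(int M, int N, int K, const double *A, const double *B, double *C) {
      for (int i0 = 0; i0 < M; i0 += NB)
          for (int j0 = 0; j0 < N; j0 += NB)
              for (int k0 = 0; k0 < K; k0 += NB)
                  /* cache-contained multiply of one block of A with one block of B */
                  for (int i = i0; i < i0 + NB; i++)
                      for (int k = k0; k < k0 + NB; k++) {
                          double aik = A[i * K + k];
                          for (int j = j0; j < j0 + NB; j++)
                              C[i * N + j] += aik * B[k * N + j];
                      }
  }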
In order to leverage the speedups which are obtained by optimizing the cache
utilization of Level 3 BLAS, LAPACK provides implementations of block
algorithms in addition to the standard versions of various routines only based
on Level 1 and Level 2 BLAS. For example, LAPACK implements block
LU, block Cholesky, and block QR factorizations [43]. The idea behind these
algorithms is to split the original matrices into submatrices (blocks) and
process them using highly efficient Level 3 BLAS, see Section 10.4.2.
In order to illustrate the design of block algorithms in LAPACK we com-
pare the standard LU factorization of a non-singular n × n matrix A to the
corresponding block LU factorization. In order to simplify the presentation,
we initially leave pivoting issues aside. Each of these algorithms determines a
lower unit triangular n × n matrix⁸ L and an upper triangular n × n matrix U
such that A = LU . The idea of this (unique) factorization is that any linear
system Ax = b can then be solved easily by first solving Ly = b using a for-
ward substitution step, and subsequently solving U x = y using a backward
substitution step [339, 395].
Computing the triangular matrices L and U essentially corresponds to
performing Gaussian elimination on A in order to obtain an upper triangular
matrix. In the course of this computation, all elimination factors li,j are
stored. These factors li,j become the subdiagonal entries of the unit triangular
matrix L, while the resulting upper triangular matrix defines the factor U .
This elimination process is mainly based on Level 2 BLAS; it repeatedly
requires rows of A to be added to multiples of different rows of A.
The block LU algorithm works as follows. The matrix A is partitioned
into four submatrices A1,1 , A1,2 , A2,1 , and A2,2 . The factorization A = LU
can then be written as
( A1,1  A1,2 )     ( L1,1    0   ) ( U1,1  U1,2 )
( A2,1  A2,2 )  =  ( L2,1  L2,2 ) (   0   U2,2 ) ,            (10.1)
where the corresponding blocks are equally sized, and A1,1, L1,1, and U1,1 are
square submatrices. Hence, we obtain the following equations:
A1,1 = L1,1 U1,1 ,                        (10.2)
A1,2 = L1,1 U1,2 ,                        (10.3)
A2,1 = L2,1 U1,1 ,                        (10.4)
A2,2 = L2,1 U1,2 + L2,2 U2,2 .            (10.5)
⁸ A unit triangular matrix is characterized by having only 1's on its main diagonal.
According to Equation (10.2), L1,1 and U1,1 are computed using the stan-
dard LU factorization routine. Afterwards, U1,2 and L2,1 are determined
from Equations (10.3) and (10.4), respectively, using Level 3 BLAS solvers
for triangular systems. Finally, L2,2 and U2,2 are computed as the re-
sult of recursively applying the block LU decomposition routine to Ã2,2 =
A2,2 −L2,1 U1,2 . This final step follows immediately from Equation (10.5). The
computation of à can again be accomplished by leveraging Level 3 BLAS.
It is important to point out that the block algorithm can yield different nu-
merical results than the standard version as soon as pivoting is introduced;
i.e., as soon as a decomposition P A = LU is computed, where P denotes
a suitable permutation matrix [339]. While the search for appropriate pivots
may cover the whole matrix A in the case of the standard algorithm, the block
algorithm restricts this search to the current block A1,1 to be decomposed
into triangular factors. The choice of different pivots during the decomposi-
tion process may lead to different round-off behavior due to finite precision
arithmetic.
Further cache performance optimizations for LAPACK have been devel-
oped. The application of recursively packed matrix storage formats is an
example of how to combine both data layout as well as data access opti-
mizations [42]. A memory-efficient LU decomposition algorithm with partial
pivoting is presented in [726]. It is based on recursively partitioning the input
matrix.
x_i^{(k+1)} = a_{i,i}^{-1} ( b_i − Σ_{j<i} a_{i,j} x_j^{(k+1)} − Σ_{j>i} a_{i,j} x_j^{(k)} ) ,   1 ≤ i ≤ n .   (10.6)
If used as a linear solver by itself, the iteration typically runs until some
convergence criterion is fulfilled; e.g., until the Euclidean norm of the residual
r^(k) = b − Ax^(k) falls below some given tolerance.
For the discussion of optimization techniques we concentrate on the case
of a block tridiagonal matrix which typically results from the 5-point dis-
cretization of a PDE on a two-dimensional rectangular grid using finite dif-
ferences. We further assume a grid-based implementation⁹ of the method of
Gauss-Seidel using a red/black ordering of the unknowns.
For the sake of optimizing the cache performance of such algorithms, both
data layout optimizations as well as data access optimizations have been pro-
posed. Data layout optimizations comprise the application of array padding
in order to minimize the numbers of conflict misses caused by the stencil-
based computation [632] as well as array merging techniques to enhance the
spatial locality of the code [256, 479]. These array merging techniques are
based on the observation that, for each update, all entries ai,j of any matrix
row i as well as the corresponding right-hand side bi are always needed si-
multaneously, see Equation (10.6). Data access optimizations for red/black
Gauss-Seidel comprise loop fusion as well as loop blocking techniques. As we
have mentioned in Section 10.3, these optimizations aim at reusing data as
long as they reside in cache, thus enhancing temporal locality. Loop fusion
merges two successive passes through the grid into a single one, integrating
the update steps for the red and the black nodes. On top of loop fusion, loop
blocking can be applied. For instance, blocking the outermost loop means
beginning with the computation of x^(k+2) from x^(k+1) before the computation
of x^(k+1) from x^(k) has been completed, reusing the matrix entries a_{i,j},
⁹ Instead of using data structures to store the computational grids which cover
the geometric domains, these methods can also be implemented by employing matrix
and vector data structures.
the values b_i of the right-hand side, and the approximations x_i^{(k+1)} which
are still in cache.
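A sketch of one such fused red/black sweep for the 5-point stencil of the Laplace
equation on a square grid is shown below. The constant-coefficient setting, the
array names, and the grid size are illustrative simplifications; this is not the
code from the papers cited above.

  #define N 1024                                /* interior grid points per dimension */
  static double u[N + 2][N + 2], f[N + 2][N + 2];

  static void relax(int i, int j, double h) {   /* one Gauss-Seidel update of node (i,j) */
      u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] + u[i][j-1] + u[i][j+1] + h * h * f[i][j]);
  }

  /* One fused red/black sweep: while passing through the grid row by row, the
   * red nodes of row i ((i+j) even) and the black nodes of row i-1 ((i-1+j) odd)
   * are updated in the same pass, so the involved rows are reused while still in
   * cache instead of being swept twice.                                        */
  void fused_red_black_sweep(double h) {
      for (int i = 1; i <= N + 1; i++) {
          int jstart = (i % 2 == 0) ? 2 : 1;
          if (i <= N)                           /* red nodes of row i       */
              for (int j = jstart; j <= N; j += 2)
                  relax(i, j, h);
          if (i >= 2)                           /* black nodes of row i - 1 */
              for (int j = jstart; j <= N; j += 2)
                  relax(i - 1, j, h);
      }
  }

The result is identical to performing a full red pass followed by a full black
pass, since every black node only depends on red neighbors that have already
been updated when its row is processed.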
Again, the performance of the optimized codes depends on a variety of ma-
chine, operating system, and compiler parameters. Depending on the problem
size, speedups of up to 500% can be obtained, see [682, 762, 763] for details. It
is important to point out that these optimizing data access transformations
maintain all data dependencies of the original algorithm and therefore do not
influence the numerical results of the computation.
Similar research has been done for Jacobi’s method [94], which succes-
sively computes a new approximation x^(k+1) from the previous approximation
x^(k) as follows:
x_i^{(k+1)} = a_{i,i}^{-1} ( b_i − Σ_{j≠i} a_{i,j} x_j^{(k)} ) ,   1 ≤ i ≤ n .   (10.7)
It is obvious that this method requires the handling of an extra array since
the updates cannot be done in place; in order to compute the (k+1)-th iterate
for unknown xi , the k-th iterates of all neighboring unknowns are required,
see Equation (10.7), although there may already be more recent values for
some of them from the current (k + 1)-th update step.
Moreover, the optimization of iterative methods on unstructured grids
has also been addressed [257]. These techniques are based on partitioning
the computational domain into blocks which are adapted to the size of the
cache. The iteration then performs as much work as possible on the current
cache block and revisits previous cache blocks in order to complete the update
process. The investigation of corresponding cache optimizations for three-
dimensional problems has revealed that TLB misses become more relevant
than in the two-dimensional case [478, 633].
More advanced research on hierarchical memory optimization addresses
the design of new iterative numerical algorithms. Such methods cover do-
main decomposition approaches with domain sizes which are adapted to the
cache capacity [36, 358] as well as approaches based on runtime-controlled
adaptivity which concentrates the computational work on those parts of the
domain where the errors are still large and need to be further reduced by
smoothing and coarse grid correction in a multigrid context [642, 518]. Other
research addresses the development of grid structures for PDE solvers based on
highly regular building blocks, see [160, 418] for example. On the one hand
these meshes can be used to approximate complex geometries, on the other
hand they permit the application of a variety of optimization techniques to
enhance cache utilization, see Section 10.3 for details.
10.5 Conclusions
11.1 Introduction
Restricted main memory calls for the use of secondary memory, where ob-
jects are either scheduled by the underlying operating system or explicitly
maintained by the application program.
AI has been confronted with hierarchical memory problems (cf. Chapter 1) for
many years. As an example, take the garbage collection problem. Minsky [551]
proposes the first copying garbage collector for LISP, an algorithm using serial
secondary memory: the live data is copied out to a file on disk and then read
back in, into a contiguous area of the heap space. [122] extends [551] to
parallelize Prolog based on Warren's abstract machine, and modern copy collectors
in C++ [276] also refer to [551]. Moreover, garbage collection has a bad
reputation for thrashing caches [439].
Access times are graded across current memory structures: processor registers
are more quickly available than pre-fetched data, first-level and second-level
caches perform better than main memory, which in turn is faster than external
data on hard disks, optical hardware devices, and magnetic tapes. Last but not
least, there is access to data via local area networks and Internet connections.
The faster the access to the memorized data, the better the inference.
Access to the next lower level in the memory hierarchy is organized in pages or
blocks. Since the theoretical models of hierarchical memory differ, e.g., in the
number of disks that can be accessed concurrently, algorithms are often ranked
according to sorting complexity O(sort(N)), i.e., the number of block accesses
(I/Os) necessary to sort N numbers, and according to scanning complexity
O(scan(N)), i.e., the number of I/Os needed to read N numbers. The usual
assumption is that N is much larger than the block size B. Scanning complexity
equals O(N/B) in a single disk model. The first libraries for improved secondary
memory maintenance are LEDA-SM [226] and TPIE¹.
At the other end, recent hardware developments deviate significantly from the
traditional von Neumann architecture; e.g., the next generation of Intel
processors has three processor cache levels. Cache anomalies are well known;
¹ http://www.cs.duke.edu/TPIE
e.g. recursive programs like Quicksort often perform unexpectedly well when
compared to the state of the art.
Since the field of improved cache performance in AI is too young and moving too
quickly for a comprehensive survey, in this paper we stick to knowledge
exploration, in which memory restriction leads to a coverage problem: if the
algorithm fails to encounter a memorized result, it has to (re-)explore large
parts of the problem space. Implicit exploration corresponds to explicit graph
search in the underlying problem graph. Unfortunately, theoretical results in
external graph search are still too weak to be practical, e.g.,
O(|V| + sort(|V| + |E|)) I/Os for breadth-first search (BFS) [567], where |E| is
the number of edges and |V | is the number of nodes. One additional problem
in external single-source shortest path (SSSP) computations is the design of
performant external priority queues, for which tournament-trees [485] serve
as the current best (cf. Chapter 3 and Chapter 4).
Most external graph search algorithms include O(|V |) I/Os for restruc-
turing and reading the graph, an unacceptable bound for implicit search.
Fortunately, for sparse graphs efficient I/O algorithms for BFS and SSSP
have been developed (cf. Chapter 5). For example, on planar graphs, BFS and SSSP
can be performed with O(sort(|V|)) I/Os. For general BFS, the best known result
is O(√(|V| · scan(|V| + |E|)) + sort(|V| + |E|)) I/Os (cf. Chapter 4).
In contrast, most AI techniques improve internal performance and include
refined state-space representations, increased coverage and storage, limited
recomputation of results, heuristic search, control rules, and application-
dependent page handling, but close connections in the design of internal
space saving strategies and external graph search indicate a potential for
cross-fertilization.
We concentrate on single-agent search, game playing, and action plan-
ning, since in these areas, the success story is most impressive. Single-agent
engines optimally solve challenges like Sokoban [441] and Atomix [417], the
24-Puzzle [468], and Rubik’s Cube [466]. Nowadays domain-independent ac-
tion planners [267, 327, 403] find plans for very large and even infinite mixed
propositional and numerical, metric and temporal planning problems. Last
but not least, game playing programs challenge human supremacy for exam-
ple in Chess [410], American Checkers [667], Backgammon [720], Hex [49],
Computer Amazons [565], and Bridge [333].
Fig. 11.1. The effect of heuristics in A* and IDA* (right) compared to blind SSSP
(left).
are re-opened, i.e., re-inserted into the set of horizon nodes. Given an
admissible heuristic, A* yields an optimal cost path. Despite the reduction of
explored space, the main weakness of A* is its high memory consumption, which
grows linearly with the total number of generated states; the number of expanded
nodes, though much smaller than |V|, is still large compared to the main memory
capacity of M states.
Iterative Deepening A* (IDA*) [465] is a variant of A* with a sequence of
bounded depth-first search (DFS) iterations. In each iteration IDA* expands
all nodes having a total cost not exceeding threshold Θf , which is determined
as the lowest cost of all generated but not expanded nodes in the previous
iteration. The memory requirements in IDA* are linear in the depth of the
search tree. On the other hand IDA* searches the tree expansion of the graph,
which can be exponentially larger than the graph itself. Even on trees, IDA*
may explore Ω(|V|²) nodes, expanding only one new node in each iteration. Ac-
curate predictions on search tree growth [264] and IDA*’s exploration efforts
[469] have been obtained at least for regular search spaces. In favor of IDA*,
problem graphs are usually uniformly weighted with an exponentially grow-
ing search tree, so that many nodes are expanded in each iteration with the
last one dominating the overall search effort.
As computer memories got larger, one approach was to develop better search
algorithms that use the available memory resources. The first suggestion was to
memorize and update state information also for IDA*, in the form of a
transposition table [631]. Increased coverage compared to ordinary hashing has
been achieved by state compression and by suffix lists. State compression
minimizes the state description length. For example, the internal representation
of a state in the 15-Puzzle can easily be reduced to 64 bits, 4 bits for each
tile. Compression often reduces the binary encoding length to O(log |V|), so
that we might assume that for a constant c the states u to be stored are
assigned to a number φ(u) in [1, . . . , n = |V|^c]. For the 15-Puzzle the size
of the state space is 16!/2, so that c = 64/log(16!/2) = 64/44 ≈ 1.45.
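As an illustration of this 64-bit encoding, the following sketch packs a
15-Puzzle position, given as an array of 16 tile numbers (0 denoting the blank),
into a single 64-bit word using 4 bits per board position; the function names
are of course illustrative.

  #include <stdint.h>

  uint64_t pack_state(const int tiles[16]) {
      uint64_t s = 0;
      for (int pos = 0; pos < 16; pos++)
          s |= (uint64_t)(tiles[pos] & 0xF) << (4 * pos);   /* 4 bits per board position */
      return s;
  }

  int tile_at(uint64_t s, int pos) {          /* inverse: extract the tile at one position */
      return (int)((s >> (4 * pos)) & 0xF);
  }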
Fig. 11.3. Single bit-state hashing, double bit-state hashing, and hash-compact.
Suffix lists [271] have been designed for external memory usage, but show
a good space performance also for internal memorization. Let bin(φ(u)) be
the binary representation of an element u with φ(u) ≤ n to be stored. We split
bin(φ(u)) into p high-order bits and s = log n − p low-order bits. Furthermore,
φ(u)_{s+p−1}, . . . , φ(u)_s denotes the prefix of bin(φ(u)) and φ(u)_{s−1}, . . . , φ(u)_0
stands for the suffix of bin(φ(u)). The suffix list consists of a linear array P
and of a two-dimensional array L. The basic idea of suffix lists is to store a
common prefix of several entries as a single bit in P , whereas the distinctive
suffixes form a group within L. P is stored as a bit array. L can hold several
groups with each group consisting of a multiple of s + 1 bits. The first bit of
each (s + 1)-bit row in L serves as a group bit. The first s bit suffix entry of
a group has group bit one, the other elements of the group have group bit
zero. We place the elements of a group together in lexicographical order, see
Fig. 11.2. The space performance is far better than that of ordinary hashing and
very close to the information-theoretic bound. To improve the time performance
to amortized O(log |V |) for insertions and memberships, the algorithm buffers
states and inserts checkpoints for faster prefix-sum computations.
Bit-state hashing [224] and state compaction reduce the state vector size to a
selection of a few bits, allowing even larger table sizes. Fig. 11.3 illustrates
the mapping of a state u via the hash functions h, h1, and h2 and the compaction
function hc to the corresponding storage structures. This approach of partial
search necessarily sacrifices completeness, but often yields shortest paths in
practice [417]. While hash compact also applies to A*, single and double
bit-state hashing are better suited to IDA* search [271], since the priority of
a state and its predecessor pointer (to track the solution) are mandatory for A*.
In regular search spaces, with a finite set of different operators to be applied,
finite state machine (FSM) pruning [715] provides an alternative for duplicate
prediction in IDA*. FSM pruning pre-computes a string acceptor for move sequences
that are guaranteed to have shorter equivalents: the set of forbidden words. For
example, twisting two opposite sides of the Rubik's cube in one order always has
an equivalent in twisting them in the opposite order.
Fig. 11.4. The finite state machine to prune the Grid (left) and the heap-of-heap
data structure for localized A* (right). The main and the active heap are in internal
memory (IM), while the others reside on external memory (EM).
Fig. 11.5. Divide step in undirected frontier search (left) and backward arc look-
ahead in directed frontier search (right).
ner. These nodes are expanded and re-inserted into the queue if they are safe,
i.e., if D is not full and the f -value of the successor node is still smaller than
the maximal f -value in D. This is done until D eventually becomes empty.
The last expanded node then gives the bound for the next IDA* iteration. Let
E(i) be the number of expanded nodes in iteration i and R(i) = E(i)−E(i−1)
the number of newly generated nodes in iteration i. If l is the last iteration
then the number of expanded nodes in the algorithm is Σ_{i=1}^{l} i · R(l − i + 1).
Maximizing Σ_{i=1}^{l} i · R(l − i + 1) with respect to R(1) + . . . + R(l) = E(l) = |V|
and R(i) ≥ M for fixed |V| and l yields R(l) = 0, R(1) = |V| − (l − 2)M,
and R(i) = M for 1 < i < l. Hence, the objective function is maximized at
−Ml²/2 + (|V| + 3M/2)l − M. Maximizing over l yields l = |V|/M + 3/2 and
O(|V| + M + |V|²/M) nodes in total.
Frontier search [471] builds on the observation that the newly generated nodes
in any graph search algorithm form a connected horizon to the set of expanded
nodes, which can therefore be omitted to save memory. The technique refers to
Hirschberg's linear-space divide-and-conquer algorithm for computing maximal
common subsequences [400]. In other words, frontier search reduces a
(d + 1)-dimensional search problem to a d-dimensional one. It divides into three
phases: in the first phase, a goal t with optimal cost f* is searched; in the
second phase the search is re-invoked with bound f*/2, and by maintaining
shortest paths to the resulting fringe the intermediate state i on the path from
s to t is detected; in the last phase the algorithm is called recursively for the
two subproblems from s to i and from i to t. Fig. 11.5 depicts the recursion step
and indicates the necessity of storing virtual nodes in directed graphs to avoid
falling back behind the search frontier, where a node v is called virtual if
(v, u) ∈ E and u is already expanded.
Many external exploration algorithms perform variants of frontier search.
In the O(|V| + sort(|V| + |E|)) I/O algorithm of Munagala and Ranade [567],
the set of visited lists is reduced to one additional layer. In contrast to the
internal setting above, this algorithm performs a complete exploration and
uses external sorting for duplicate elimination.
Fig. 11.7. Operator abstractions for the relaxed planning and the pattern database
heuristic (left); single and disjoint PDB for subsets R and Q of all atoms F (right).
Fig. 11.8. Symbolic heuristic A* search with symbolic priority queue and estimate.
Fig. 11.9. Mini-max game search tree pruned by αβ and additional move ordering.
One research area of AI that has always dealt with given resource limitations is
game playing [666]. Take for example a two-player zero-sum game (with perfect
information) given by a set of states S, move rules to modify states, and two
players, called Player 0 and Player 1. Since one player is active at a time, the
entire state space of the game is Q = S × {0, 1}. A game has an initial state and
some predicate goal to determine whether the game has come to an end. We assume
that every path from the initial state to a final one is finite. For the set of
goal states G = {s ∈ Q | goal(s)} we define an evaluation function
v : G → {−1, 0, 1}, with −1 for a lost position, 1 for a winning position, and 0
for a draw. This function is extended to v̂ : Q → {−1, 0, 1}, assigning a game
theoretical value to each state in the game. More general settings are
multi-player games and negotiable games with incomplete information [628].
DFS dominates game playing and especially computer chess [531], for
which [387] provides a concise primer, including mini-max search, αβ prun-
ing, minimal-window and quiescence search as well as iterative deepening,
move ordering, and forward pruning. Since game trees are often too large to be
generated completely in time, static evaluation functions assign values to the
root nodes of unexplored subtrees. Fig. 11.9 illustrates a simple mini-max game
tree with leaf evaluation, and its reduction by αβ pruning and move ordering. In
a game tree of height h with branching factor b, the minimal traversed part of
the tree shrinks from size O(b^h) to O(√(b^h)). Quiescence search extends
evaluation beyond the exploration depth until a quiescent position is reached,
while forward pruning refers to different unsound cut-off techniques that break
off full-width search. Minimal window search is another inexact approximation of
αβ with higher cut-off rates.
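A sketch of depth-bounded mini-max search with αβ pruning, written in the common
negamax formulation, is shown below. The game interface (Position, Move,
generate_moves, do_move, undo_move, evaluate) is hypothetical and has to be
supplied by the concrete game; it is not taken from any of the systems cited here.

  #define MAX_MOVES 128

  typedef struct Position Position;      /* hypothetical game state type        */
  typedef int Move;                      /* hypothetical move encoding          */

  int  generate_moves(Position *p, Move moves[MAX_MOVES]);   /* returns number of legal moves */
  void do_move(Position *p, Move m);
  void undo_move(Position *p, Move m);
  int  evaluate(const Position *p);      /* static evaluation from the side to move */

  int alphabeta(Position *p, int depth, int alpha, int beta) {
      Move moves[MAX_MOVES];
      int n = generate_moves(p, moves);
      if (depth == 0 || n == 0)
          return evaluate(p);            /* leaf or terminal node: static evaluation */
      for (int i = 0; i < n; i++) {
          do_move(p, moves[i]);
          int score = -alphabeta(p, depth - 1, -beta, -alpha);
          undo_move(p, moves[i]);
          if (score >= beta)
              return beta;               /* cut-off: the opponent will avoid this line */
          if (score > alpha)
              alpha = score;             /* new best score for the side to move        */
      }
      return alpha;
  }

Good move ordering makes the beta cut-offs fire early, which is what yields the
reduction of the traversed tree mentioned above.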
As in single-agent search, transposition tables are memory-intensive containers
of search information for valuable reuse. The stored move always provides
information, but the memorized score is applicable only if the nominal depth
does not exceed the value of the cached draft.
Since the early 1950s, Chess has advanced from its “fruit-fly” status to one of
the main successes in AI, resulting in the defeat of the human world champion in
a tournament match. DeepThought [532] utilized IBM's DeepBlue architecture for a
massively parallelized, hardware-oriented αβ search scheme, evaluating and
storing billions of nodes within a second, with a fine-
11.7 Conclusions
The spectrum of research in memory limited algorithms for representing
and exploring large or even infinite problem spaces is enormous and encom-
passes large subareas of AI. We have seen alternative approaches to exploit
and memorize problem specific knowledge and some schemes that explic-
itly schedule external memory. Computational trade-offs under bounded re-
12.1 Introduction
[693, 331], which is simply the sum of the failure rates λ_i of the participating
disks¹.
The observations made above can be formalized as the data distribution problem.
It has to answer the following question: Where should the data be stored so that
any data request can be answered as fast as possible? Furthermore, the data
distribution has to ensure scalability: the more disks there are in a storage
network, the more data requests should be answered, or the faster the data
should be delivered. This can only be achieved if the data is evenly distributed
over all disks.
In recent years, modern storage systems move towards distributed so-
lutions. Such systems usually consist of common components connected by
standardized technology like Ethernet or FibreChannel. Hence, it is rather
unlikely that all components have the same characteristic concerning capac-
ity or performance. This leads to the heterogeneity problem. Not exploiting
the different properties of each component results in a waste of capacity and
performance.
As noted above, the data volume is growing annually. This implies a new
challenge to storage networks because they should easily adapt to changing
capacity/bandwidth demands. Allowing the addition of disks would be an
easy task if the space/access balance is ignored: one simply extends the system with new
disks and maps new data to the added disks. This will result in a distribution
that cannot increase the access performance for all data even when newer and
faster disks are added. The only way to achieve both goals, space balance and
the addition of disks, is to redistribute some data. This property is measured
by the adaptivity. Naturally, the less data is redistributed the more adaptive
is the underlying strategy.
The above discussion can be summarized by defining a number of require-
ments a storage network has to meet.
1. Space and Access Balance: To ensure good disk utilization and pro-
vide scalability the data should be evenly distributed over all disk drives.
The data requests are usually generated by a file system above the stor-
age network. Hence, the storage network has no knowledge about the
access pattern and must handle any possible request distribution.
2. Availability: Because of the increased sensitivity to disk failures storage
networks need to have some redundancy and implement mechanisms to
tolerate the loss of data.
3. Resource Efficiency: Redundancy implies a necessary waste of re-
sources. Nevertheless, a system should use all its resources in a useful
way. Some problems, e.g. adaptivity, are easily solved if large space re-
sources are available. But especially for large networks it is infeasible to
provide these resources.
¹ This is only true under the assumption that the failure rate is constant, which
corresponds to an exponentially distributed time to failure, and that the failures
are independent of each other.
In the next section we define a storage network and introduce all the relevant
notations. The remainder of the paper will give an overview of techniques and
algorithms used to achieve the above mentioned properties. We will focus on
rather new techniques handling heterogeneity and adaptivity because these
are the most challenging questions and they will become more important
in the future. The last section will summarize the paper and give a short
conclusion.
12.2 Model
[Figure: a storage network of n disks D1, . . . , Dn with capacities C1, . . . , Cn and bandwidths b1, . . . , bn, connected by a network.]
12.4 Availability
Availability describes the property to retrieve the data units even in the case
of failed disk drives and, hence, lost data units. It is a crucial feature of
storage networks because the probability of a failure in a collection of disks
is scaled by its size [607].
The problem of availability can be solved by using redundancy. The easiest
way is to store c additional copies of every data unit, like in a RAID 1 system (with
c = 1, which is called mirroring in the storage community) [607] or the PAST storage
server [641, 259]. Naturally, the distribution scheme has to ensure that the
copies are hosted by different disks. The system can tolerate up to c
failures and still remains operational (only with degraded performance), so that the
faulty disks can be replaced. The data that was hosted by a faulty
disk can then be rebuilt using one of the redundant copies.
Because a full replication scheme results not only in the necessity to up-
date all copies in order to keep the scheme consistent but also in a large
waste of resources, more space efficient methods are sought. One of the
common approaches is the use of parity information like in RAID level 4/5
[607, 185], Random RAID [128], Swarm [569, 385], and many video servers
[118, 659, 725, 752, 27]. The idea behind them is to use redundant informa-
tion to secure more than just one data unit. This can be done by deriving
the bit-wise parity information (i.e. taking a bit-wise XOR op-
eration) of a whole stripe and storing it in an extra parity unit. Assuming
the units reside on different disks, one disk failure can be tolerated. When
a disk failure occurs the faulty disk has to be replaced and the unavailable
data units must be rebuilt by accessing all other units in their stripe. During
this reconstruction phase, the next failure would be hazardous. Obviously,
this reconstruction should be done as fast as possible aiming to minimize the
maximum amount of data one has to read during a reconstruction [404]. This
can be done by choosing a stripe length l less than n. The resulting reconstruc-
tion load for each disk is given by the declustering ratio α = (l − 1)/(n − 1). If α = 1
(like in the RAID 5 layout), every surviving disk participates in the
reconstruction.
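A hedged sketch of the parity mechanism, with byte strings standing in for disk-resident data units:

```python
def parity_unit(units):
    """Bit-wise XOR of all units of a stripe (all units have equal length)."""
    parity = bytearray(len(units[0]))
    for unit in units:
        for i, byte in enumerate(unit):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct(surviving_units, parity):
    """Rebuild the single lost unit from the surviving units and the parity unit."""
    return parity_unit(list(surviving_units) + [parity])

stripe = [b"unit-one", b"unit-two", b"unit-3.."]     # the data units of one stripe
p = parity_unit(stripe)
assert reconstruct(stripe[:2], p) == stripe[2]       # one lost unit is recovered
```

XOR-ing the surviving units with the parity unit restores the lost one, which is why a single disk failure per stripe can be tolerated.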
Using redundancy imposes a new challenge to the data distribution be-
cause it has not only to ensure an even distribution of data units but also
an even distribution of parity information. The reason for this lies in the
different access pattern for data units and parity units. First of all, we have
to distinguish between read and write accesses. A read can be done by get-
ting any l units. A write operation to any data unit (call it small write) has
not only to access the written data unit but also the parity unit because the
parity information after a write may have changed. It follows that the access
pattern to the parity unit is different from all other data accesses. There
are a number of different approaches to handle this situation. In RAID level
4 [607] the parity units are all mapped onto the same disk. As long as the
whole stripe is written (call it full write) this does not impose any problems.
Unfortunately, this is not always the case. Small writes will put a lot of stress
on the disk storing the parity information. A solution to this is provided by
RAID level 5. Here, the parity units are spread over all participating disks
by permuting the position of the parity unit inside each stripe: in the i-th
stripe the parity unit is at position i mod l.
As long as the number of disks is equal to the stripe length l this approach
distributes the parity units evenly. The more general case, allowing the length
to differ from n, calls for another distribution scheme like parity declustering
[404]. The authors propose the use of complete and incomplete block designs.
A block design is the arrangement of ν distinct objects into b blocks or tuples,
each containing k elements, such that each object appears in exactly r blocks,
and each pair of objects appears in exactly λp blocks. For a complete block
design one includes all combinations of exactly k distinct elements from a
set of ν objects; note that there are (ν choose k) such combinations.
Furthermore, only three of the variables are free because b · k = ν · r and r(k − 1) =
λp (ν − 1) always hold.
If we now associate the blocks with the disk stripes and the objects with
disks the resulting layout (called block design table) distributes the stripes
evenly over the n disks (i.e. ν equals the number of disks n and k equals the
length of a stripe l). But such a block design table gives no information about
the placement of the redundant information. To balance the parity over the
whole array we build l different block design tables putting the parity unit in
each of them at a different position in the stripe. The full (or complete) block
design is derived by fusing these block design tables together. Obviously, this
approach is unacceptable if the number of disks becomes large relative to the
stripe length l, e.g. a full 41 disk array with stripe length 5 has 3,750,000
blocks. Hence, we have to look for small block designs of n objects with
a block size of l, called balanced incomplete block designs (BIBDs), which
do not need full tables to balance the parity units. There is no known general way
to derive such designs algorithmically, and for some combinations of
parameters a BIBD might not be known. Hall [376] presents a large number
of them and some theoretical techniques to find them. In this case, the use of
complete designs or the use of the next possible combination is recommended.
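A complete block design can be enumerated directly; the short sketch below reproduces the figure quoted above for ν = 41 disks and stripe length k = 5 (about 750,000 blocks per table, and l = 5 tables for the parity rotation):

```python
from itertools import combinations
from math import comb

n_disks, stripe_len = 41, 5

# One block design table: every stripe_len-subset of the disks is a stripe.
table = list(combinations(range(n_disks), stripe_len))
assert len(table) == comb(n_disks, stripe_len)    # 749,398 blocks

# The full design fuses l tables, one per parity position in the stripe.
print(stripe_len * len(table))                    # 3,746,990, i.e. ~3.75 million
```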
The approach of using parity information can be extended to tolerate
more than one disk failure. As an example, the EVENODD layout [130] can
survive two disk failures. This is done by taking the regular RAID 5 layout
and adding parity information for each diagonal on a separate disk. Such
an approach can even be extended to tolerate t arbitrary simultaneous disk
failures. Gibson et al. [390] showed that in this case, t · l^{1−1/t} disks worth of
parity information are needed. Taking fast reconstruction into account, these
schemes are only of theoretical interest.
The above mentioned techniques try to solve different problems, like avail-
ability and space balance. Combining them leads to new and more general
data distributions. In [675, 674] the EVENODD layout is used together with
12.5 Heterogeneity
Heterogeneity describes the ability to handle disks with different character-
istics efficiently, i.e. use the disk according to its capacity or its bandwidth.
This problem is easy to solve if all other properties are neglected but is rather
challenging if the space and access balance must also be guaranteed.
Heterogeneity has become more and more common in recent years. First
of all, with the growth of the data volume, storage systems quickly run out
of space. If a complete exchange of the system is unacceptable (due to the
higher cost) one has to expand the existing configuration. It is rather un-
likely that disks with the same characteristics can be found which results in
data requests. Each data element is replicated and randomly mapped to sepa-
rate disks. Then, the scheduling problem can be transformed into a network
flow problem. The requests are on the left hand side (source) and have an
edge (with weight ∞) to all disks which store a copy of the requested data
element. On the right hand side of the flow network (sink), the disks might
be served by disk controllers and I/O-busses each of which is modeled by an
edge with an appropriate weight to the disks/controllers it connects. Now, the
flow problem can be solved resulting in a scheduling algorithm for the data
requests. Heterogeneity can be introduced by adjusting the edge weights of
the flow network according to the given capacities. The disadvantage of this
approach lies in its batch-like behavior. For any collection of new requests a
new flow problem must be solved. Especially for large systems, the number of
requests that have to be collected for each batch is considerable.
In the next four sections we will introduce new approaches that try to
solve the heterogeneity problem.
12.5.1 AdaptRaid
AdaptRaid [221, 222] is a distribution scheme that extends the general RAID
layout so that heterogeneity (corresponding to different capacities) can be
handled. The basic idea is very simple. Larger disks are usually newer ones
and, thus, should serve more requests. Nevertheless, the placement of whole
stripes should be kept as long as possible to gain from parallelism. So far,
there are extensions to handle the case without data replication (RAID level
0) and with fully distributed parity information (RAID level 5).
AdaptRaid Level 0. The initial idea of AdaptRaid level 0 [221] is to put
as many stripes of length l = n on all disks as possible. As soon as some k
disks cannot store any more data units, stripes of length n − k are mapped
onto the remaining n − k disks. This process is repeated until all disks are
full (see Fig. 12.2).
Fig. 12.2. The initial idea is to place as many stripes over all disks as there is
capacity in the k smallest ones. When these disks saturate, the strategy is repeated
with the remaining n − k disks.
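A minimal sketch of this initial idea, assuming nothing but a list of per-disk capacities in data units (the later refinements UF and LIP are ignored):

```python
def initial_layout(capacities):
    """Return the stripes of the initial AdaptRaid0 idea: each stripe spans
    every disk that still has free capacity, so stripes shrink as small disks fill."""
    remaining, stripes = list(capacities), []
    while any(c > 0 for c in remaining):
        stripe = [d for d, c in enumerate(remaining) if c > 0]
        for d in stripe:
            remaining[d] -= 1                # one data unit per disk and stripe
        stripes.append(stripe)
    return stripes

# Example: three large and two small disks (capacities in units).
for s in initial_layout([4, 4, 4, 2, 2]):
    print(s)     # two stripes over all 5 disks, then two over the 3 large ones
```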
Fig. 12.3. Using the pattern defined by the initial idea, one distributes longer lines
over the whole disk by fusing many invocations of the pattern on the disk.
This approach has the major drawback that the access to data stored
in the upper part of the disks is faster than the access to data stored
in the lower part of the disk drives, because of the different stripe lengths.
Furthermore, the assumption that newer disks are faster is in general not true.
The following generalization will cope with both problems. First, we define
the utilization factor UF ∈ [0, 1] to be the ratio between the capacity of a
disk and its bandwidth. This factor is determined on a per-disk basis, where
the largest disk always has UFi = 1 and all other disks get a UF according
to their capacity, relative to the largest one. These values should be set
by the system administrator. To overcome the access problem, data patterns,
which capture the overall capacity configuration, are defined. These patterns
have a similar structure to the pattern of the initial idea, but are kept smaller
such that many of them can fit on a single disk. The size of these patterns
is measured by the second parameter – lines in a pattern (LIP ). In general,
this parameter is an indicator for the distribution quality of the layout and
is measured by the number of units hosted on the largest disk. The overall
data layout is defined by repeating these data patterns until the disks are full
(see Fig. 12.3).
The access to data elements is twofold. In the first phase, the correct
pattern has to be found which can easily be done because the size of a pattern
(in number of units) is known. Then, the requested data unit is found by
accessing the pattern itself. The space resources needed to access any data
element is proportional to the size of the pattern.
The performance of AdaptRaid0 was tested using a disk array simulator.
The tests have been made with varying ratios of fast and slow disks².
different scenarios have been tested. First, the performance is compared to
a RAID level 0. Naturally, AdaptRaid0 can shift the load from slower disks
to the faster ones resulting in a performance gain between 8% and 20% for
read accesses and 15% up to 35% for write accesses. The second scenario
compares AdaptRaid0 to a configuration that contains only fast disks. This
is reasonable because it gives an evaluation of the usefulness of keeping older
disks in the system and therefore of the effect of parallel disk accesses. The
² The faster disks possess roughly doubled performance parameters, like average
seek time, short seek time, and local cache.
Fig. 12.4. The left hand side shows the regular RAID 5 layout. Parity units are
distributed over the whole array by letting every stripe start on a different disk in
a round-robin manner. The right hand side shows the initial layout in AdaptRaid5.
Here, the RAID 5 strategy is applied on stripes of varying size, which leaves some
capacity unused (lost capacity).
This problem can be avoided by restricting the stripe length of each stripe
to integer divisors of the length of the largest stripe (see the leftmost layout
in Fig. 12.5). This leads to a heterogeneous layout which may waste some
capacity. But a careful distribution of the free units can significantly improve
this scheme. First, the layout should be independent of the number of disks
still participating. This can be achieved by letting each stripe start on a
different disk (see the center layout of Fig. 12.5). The nice effect of this
transformation is the even distribution of the wasted free space over all disks
instead of concentrating them on a few disk drives. Now, we can get rid of the
wasted capacity by using a ’Tetris’ like algorithm. The holes of free units are
just filled with subsequent data units from stripes further below (data units
Fig. 12.5. The left figure shows a placement in which the length of any stripe is
a divisor of the length of the largest stripe. Clearly, some disks may have unused
capacity. In the next step, the free space is distributed over all disks of this stripe
length, as shown in the center. This opens the possibility to apply a ’Tetris’-like
algorithm: the free spaces are erased by moving the subsequent data units on each
disk upwards. The resulting layout is depicted in the right figure. The arrows indicate
the movement of data units.
from these stripes ’move’ upwards). Because the free spaces are distributed
over all disks, whole stripes can be added at the end of the pattern. Still
the layout is regular, and only the size of one pattern is needed to locate any
data unit (see the rightmost picture in Fig. 12.5). As in AdaptRaid0,
we will generalize the solution by defining two parameters. As before, UF
is the utilization factor describing the load of a disk relative to the largest
disk. Similar to LIP, the parameter SIP – stripes in pattern – controls the
distribution of patterns over the whole array.
For the practical evaluation, the same simulator as for AdaptRaid0 has
been used. The new approach was tested against a RAID 5 and a homo-
geneous configuration of only fast disks³. Obviously, both approaches waste
some capacity in one way or the other. The performance was measured for
reads, small writes and full writes (writing a full stripe). It was observed
[222], that AdaptRaid5 is almost always the best choice and scales gradually
with an increasing number of fast disks. Only when large chunks of data are
read, the method is outperformed by the homogeneous configuration of only
fast disks. This is due to the fact that the slower disks cannot deliver the
data units appropriately. When full stripes are read they are the bottleneck
of the system. Nevertheless, when simulations of real workloads are used, the
performance gain of the new approach can be around 30% if only half of the
disks are fast disks.
logical view of the system. If the parity groups are fused together they form
a homogeneous system where the stripe length l is the number of extents in a
parity group. It is possible that an extent gets capacity and bandwidth from
different physical disks. The data distribution scheme simply stripes the data
units over the parity groups (see Fig. 12.6).
Fig. 12.6. HERA divides the capacity of the disks into logical extents 1, . . . , Dt
(in the depicted example, Dt = 12 extents on four physical disks form three parity
groups of length l = 4). Each physical disk Di has capacity for ki logical extents.
The logical extents are arranged in G parity groups, each of which appears homogeneous.
How many parity groups are possible? In order to ensure the fault toler-
ance of the system enforced by the used striping strategy, each extent in every
parity group has to come from a different physical disk. Let Dt denote the
number of logical extents in the system. Then, the number of parity groups
has to fulfill the following inequality:
G ≤ Dt / ki for all i, 0 ≤ i ≤ n.
To show the reliability of the system, the behavior is modeled by different
processes like the failure process and the reconstruction process. Because the
mean time to failure of any disk can be estimated, the time of these processes
can be estimated, too. Using a Markov model [331], the mean time to service
loss (MTTSL) can be derived for such a heterogeneous system. It was shown
that the reliability (measured in MTTSL) is only a factor of approximately 10
away from the best possible configuration, namely the clustering of identical
disks.
Nevertheless, this approach still needs the experienced hand of an admin-
istrator and the performance strongly relies on his decisions.
The RIO Mediaserver [658, 659] was built as a generic multimedia storage
system capable of efficient, concurrent retrieval of many types of media ob-
jects. It defines a randomized distribution strategy and supports real-time
data delivery with statistical delay guarantees. The used randomized distribu-
tion scheme is very simple. The data units are placed on a randomly chosen
disk at a randomly chosen position4 (see Fig. 12.7). Heterogeneity may occur
in different disk capacities and varying bandwidth properties. As noted be-
fore randomization ensures a good balance in the long run, especially when it
is used in conjunction with replication. In addition, RIO also exploits redun-
dancy to improve the short time balance by carefully scheduling the accesses
to data units.
Fig. 12.7. RIO assigns every data unit b to a random position on a randomly
chosen disk.
Let S and B denote the total capacity and bandwidth in the storage network, and let
Si/S and Bi/B be the relative capacity and relative bandwidth
of cluster i, respectively (see Fig. 12.8). The relative bandwidth space ratio
can then be defined as BSi = (Bi/B)/(Si/S). For a homogeneous system, BSi = 1.
The cluster with the lowest BSi will be the most stressed cluster, not only
because its disks have the slowest bandwidth but also because they host a
large chunk of the data units. The number of data units that can be stored
in the system is defined by U = S/(1 + r), where r denotes the replication rate.
In formal terms, we want to use replication to shift the stress from clusters
with BSi < 1 to clusters with BSi > 1. But how much replication is needed
to sustain a certain maximal load λmax ≤ B ?
Fig. 12.8. RIO groups the disks according to their BSR. Each of the t groups is
homogeneous in itself and has a collective capacity of Si and a collective bandwidth
of Bi.
12.6 Adaptivity
One motivation for the heterogeneity in Sect. 12.5 was the addition of newer
disks due to a growing space demand. But how does the introduction of new
disks affect the still operational system? Obviously, if we want to have better
performance when new disks are added, the data units must be evenly dis-
tributed over all disks. Hence, some portion of the existing data units has to
be redistributed. The efficiency of this process is described by adaptivity. A
distribution scheme is adaptive if it only has to redistribute a small number
of data units in case of a changing system. Such a change can be the enter-
ing or failing of disks or the introduction of newly accessed data units. To
capture heterogeneous requirements, we define the property of faithfulness.
A distribution scheme is faithful if after a change in the system it can always
ensure that the number of data units stored on a disk is according to its
capacity/bandwidth requirement. In the homogeneous case each disk should
get the same number of data units. For the more general case, let us denote
by di the relative capacity of each disk Di (relative to the total capacity S).
Then a distribution scheme is faithful if it can ensure that any disk Di gets
(di + ε) · m data units for an arbitrarily small ε. Faithfulness uses the ε-term to
allow randomized approaches to possess this property. Note that m denotes
the number of data units currently in the system.
The adaptivity can be measured by competitive analysis. For any sequence
of operations σ that represent changes in the system, we intend to compare
the number of (re-)placements of data units performed by the given scheme
with the number of (re-)placements of data units performed by an optimal
strategy that ensures faithfulness. A placement strategy will be called c-
competitive if for any sequence of changes σ, it requires the (re-)placement of
(an expected number of) at most c times the number of data units an optimal
adaptive and perfectly faithful strategy would need.
Looking at the distribution techniques of the last chapters, e.g. the strip-
ing algorithm and the random distribution, none of them is adaptive. Con-
sider a stripe of length n. Introducing a new disk literally changes the whole
distribution. If only a 1/(n + 1) fraction of the data units is redistributed, the
system needs to keep track of the redistributions and, in the long run, to
store every position of any data unit to access it. A similar argument can be
used for random distributions.
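The effect is easy to quantify for round-robin striping; the following sketch (unit count and disk count are made up) shows that nearly all data units change their disk when a single disk is added, while a faithful strategy only has to move about a 1/(n + 1) fraction:

```python
m, n = 10_000, 8     # data units and current number of disks (illustrative)

moved = sum(1 for i in range(m) if i % n != i % (n + 1))
print(moved / m)     # ~0.89: adding one disk relocates almost every unit

print(1 / (n + 1))   # ~0.11: the share an optimal faithful strategy must move
```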
Consistent hashing [446, 504] was developed for distributed web caching but
can also be applied here. The algorithm overcomes the drawback of conven-
tional hashing schemes by defining a hash function that is consistent in case
of a changing number of servers. We use the balls-into-bins analogy for the
description. The algorithm works as follows.
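As a hedged aside, consistent hashing is commonly presented as a hash ring: bins are hashed to several points of the unit interval and a ball is assigned to the next bin point after its own hash value. The sketch below follows that folklore formulation; the number of virtual copies per bin and the hash function are illustrative choices, not prescribed by [446, 504].

```python
import bisect, hashlib

def unit_hash(key):
    """Hash a string to the unit interval [0, 1)."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) / 16 ** len(digest)

class ConsistentHash:
    def __init__(self, bins, copies=100):
        # Each bin appears as several points so that the shares even out.
        self.points = sorted((unit_hash(f"{b}#{c}"), b)
                             for b in bins for c in range(copies))

    def lookup(self, ball):
        """A ball goes to the first bin point at or after its hash (wrapping around)."""
        i = bisect.bisect(self.points, (unit_hash(str(ball)),))
        return self.points[i % len(self.points)][1]

ring = ConsistentHash(["disk%d" % i for i in range(8)])
print(ring.lookup("data-unit-42"))
```

Adding a bin only relocates the balls that now fall between the new bin's points and their predecessors, which is the consistency property exploited here.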
certain that all bins get an even share. This is done by cutting the [0, 1)
interval into ranges and these ranges (and therefore all the balls that fall into
them) are mapped to bins by an assimilation function. If more than one range
is assigned to a bin, they are fused together and adjusted such that the range
on any bin is contiguous and starts at 0. The heights of the balls falling into
the ranges are adjusted accordingly. Let [0, 1/n]i denote the range mapped
to bin i when n bins are currently in the system. The assimilation function
maintains the invariant that the (accumulated) ranges have the same size.
Furthermore, it ensures that only a small portion of every range is remapped
if the number of bins changes. The definition of the assimilation function is
recursive in the number of bins. It starts by assigning the interval [0, 1) to
bin 1. Then, the change from n to n + 1 bins is defined by cutting off the
range [1/(n + 1), 1/n]i from every bin i with i ∈ {1, . . . , n} and fusing these
pieces together such that they are in reversed order (the range coming from bin 1
is topmost). This will be the new range for bin n + 1 (see also the second picture
in Fig. 12.9). Due to the fact that each bin gets an equal share of the [0, 1)
interval, it also gets an almost equal number of balls, with the accuracy depending
only on the hash function. Furthermore, the number of balls moved to a new bin
comes evenly from all other bins and is in the order of the number of balls that
have to be moved to ensure an even distribution.
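A quick check, ours rather than the chapter's, that the new bin indeed receives its fair share: each of the n old bins cedes the range [1/(n + 1), 1/n), so bin n + 1 obtains

```latex
n \cdot \Bigl( \frac{1}{n} - \frac{1}{n+1} \Bigr)
  \;=\; \frac{n}{n\,(n+1)}
  \;=\; \frac{1}{n+1} ,
```

which is exactly the share each of the n + 1 bins must hold afterwards.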
Fig. 12.9. The cut-and-paste strategy. First, the address space is hashed into a
[0, 1) interval. Then the uniform placement is done according to an assimilation
function which defines the transition from n bins to n + 1 bins. Heterogeneity is
achieved by defining a number of levels in which only disks with free capacity
participate.
This strategy will not only ensure the fast adaptation to a changed number
of bins but also provide an efficient way to access the balls. It suffices to
12.7 Conclusions
The constantly growing amount of data calls for flexible and efficient stor-
age systems. It has been shown that storage networks can live up to this
expectation. There is a large number of different techniques to exploit the
inherent parallelism of such networks to gain better performance. Especially
smaller systems like RAID arrays or multimedia data servers can be used
very efficiently.
Nevertheless, the systems are growing rather rapidly. This introduces new
problems like heterogeneity, adaptivity, and resource efficiency. For these
problems there exist only a few approaches, which were presented in this
paper. Most of them have not been practically tested yet.
In the last few years two very interesting application fields emerged. The
first is the area of Storage Area Networks (SAN). The fast growing data
volume especially in enterprises leads to very high cost of data maintenance.
This cost can be significantly reduced if a centralized storage system is used.
Nevertheless, such a storage system must allow for a growing data volume and
is very likely to be heterogeneous. Carefully managed storage networks
are capable of providing these features.
The second new field of interest are peer-to-peer networks (P2P). First
of all, these networks are highly heterogeneous (with respect to space and
connection requirements). Furthermore, availability is a big issue in these
overlay networks. Data will be stored in them and the main task will be to
retrieve the data. Furthermore, a P2P network is highly dynamic. Servers will
join and leave it without further notice, calling for efficient adaptivity.
For all these reasons, techniques used in storage networks will become
more popular in the near future.
13. An Overview of File System Architectures
Florin Isaila
13.1 Introduction
The ever increasing gap between processor and memory speeds on one side
and disk systems on the other side has exposed the I/O subsystems as a bot-
tleneck for the applications with intensive I/O requirements. Consequently,
file systems, as low-level managers of storage resources, have to offer flexible
and efficient services in order to allow a high utilization of disks.
A storage device is typically seen by users as a contiguous linear space
that can be accessed in blocks of fixed length. It is obvious that this simple
uniform interface can not address the requirements of complex multi-user,
multi-process or distributed environments. Therefore, file systems have been
created as a layer of software that implements a more sophisticated disk
management. The most important tasks of file systems are:
– Organize the disk in linear, non-overlapping files.
– Manage the pool of free disk blocks.
– Allow users to construct a logical name space.
– Move data efficiently between disk and memory.
– Coordinate access of multiple processes to the same file.
– Give the users mechanisms to protect their files from other users, as for
instance access rights or capabilities.
– Offer recovery mechanisms for the case that the file system becomes inconsistent.
– Cache frequently used data.
– Prefetch data predicted to be used in the near future.
In this chapter we investigate how different types of file systems address
these issues. The next section describes the access patterns of sequential and
parallel applications, as reported by several research groups. The access pat-
terns are interesting, because they are often used to motivate file system
design choices. Section 13.3 details some tasks of file systems, as enumerated
above. We continue by explaining general issues of distributed file systems in
the first part of Section 13.4. In the second part of Section 13.4, we address
parallel (13.4.5), shared (13.4.6) and grid file systems (13.4.7) and show how
file systems may handle operations for mobile computers (13.4.8). We sum-
marize in Section 13.5.
Several studies [663, 87, 598] analyzed the file access patterns of applications
running on uniprocessor machines and accessing files belonging to either a
local or a distributed file system. Their results were used for imple-
menting efficient local or distributed file systems.
– Most files are small, under 10K [663, 87]. Short files are used for directories,
symbolic links, command files, temporary files.
– Files are open for a short period of time. 75% of all files are open for
less than 0.5 seconds and 90% for less than 10 seconds [663, 87, 598].
– The lifetime of files is short. A distributed file system study [87] measured that
between 65% and 80% of all files lived less than 30 seconds. This has an
important impact on the caching policy. For instance, short-lived files may
eventually not be sent to disk at all, avoiding unnecessary disk accesses.
– Most files are accessed sequentially [663, 598, 87]. This suggests that a
sequential pre-fetching policy may be beneficial for the file system perfor-
mance.
– Reading is much more common than writing. Therefore, caching can bring
a substantial performance boost.
– File sharing is unusual. Especially write sharing occurs infrequently
[87]. This justifies the choice of relaxed consistency protocols.
– File access is bursty [87]. Periods of intense file system utilization alternate
with periods of inactivity.
Several studies of parallel I/O access patterns are available [582, 701, 695,
225]. Some of them focus on parallel scientific applications, which are typically
multiprocess applications that access a huge amount of data.
Some of their results are summarized below, illustrated by the parallel
access example from Figure 13.1. The figure shows two different ways of phys-
ically partitioning a two-dimensional 4x4 matrix over two disks attached to
two different I/O nodes: (a) by striping the columns and (b) by striping the
rows. The matrix is logically partitioned between four compute nodes, each
process accessing a matrix row. For instance, this kind of access can be used
by a matrix multiplication algorithm.
[Figure 13.1: a 4x4 matrix, logically partitioned row-wise among compute nodes 0–3, physically striped over Disk 0 and Disk 1 of two I/O nodes, (a) by columns and (b) by rows.]
performance penalty for large disks, due to large times devoted to linearly
scanning the bitmap for a free block.
Newer file systems use mainly two techniques for making disk space man-
agement more efficient: extents and B-tree structures. The main advantage
of extents is that they allow finding several free blocks at a time. The space
overhead for storing extents is lower than for bitmaps, as long as the file
system is not very fragmented. B-trees allow a quicker lookup for a free block
than lists, which have to be sequentially scanned. Extents and B-trees can be
used in conjunction. Indexing by extent size makes it possible to quickly allocate
chunks of contiguous blocks of a certain size. Indexing by extent position
allows to quickly locate blocks by their addresses. XFS is an example of a file
system that employs all these techniques.
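A toy sketch of extent-based allocation with the two indices mentioned above, by size and by position, kept here as sorted Python lists rather than the B-trees a file system such as XFS would use (freeing and merging of extents are omitted):

```python
import bisect

class ExtentAllocator:
    """Free space as extents (start, length), indexed by size and by position."""

    def __init__(self, extents):
        self.by_size = sorted((length, start) for start, length in extents)
        self.by_pos = sorted(extents)

    def allocate(self, want):
        """Best fit: the smallest free extent holding `want` contiguous blocks."""
        i = bisect.bisect_left(self.by_size, (want, -1))
        if i == len(self.by_size):
            raise RuntimeError("no contiguous run of %d free blocks" % want)
        length, start = self.by_size.pop(i)
        self.by_pos.remove((start, length))
        if length > want:                          # keep the remainder free
            rest = (start + want, length - want)
            bisect.insort(self.by_size, (rest[1], rest[0]))
            bisect.insort(self.by_pos, rest)
        return start

alloc = ExtentAllocator([(0, 100), (250, 40), (400, 10)])
print(alloc.allocate(32))    # -> 250 (best fit), leaving extent (282, 8) free
```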
Caching. When performing a read from a file, the data is brought block-wise
from the disk into the memory. A later file access may find parts of the data
in memory. This technique is called caching. We will refer to the memory
region used for this purpose in a file system as file cache. Caching improves
the performance of the applications which exhibit temporal locality of access,
i.e. in a program, once a block has been accessed, it is highly probable that
it will be accessed again in the near future. Performance measurements show
that this is the case with most applications. In the local file systems, caching
is used to improve local disk access times, providing copies of the low-speed
disks in the faster memory.
Prefetching. Prefetching is the technique of reading ahead from disk into
cache data blocks probable to be accessed in the near future. The prefetching
is motivated by several factors:
– the predictability of file access patterns of some applications
– the availability of file access hints
– the poor utilization of disk parallelism
– the large latencies of small non-sequential disk requests
On the other hand, prefetching may hurt performance by causing the
eviction of valuable disk blocks from the cache. In this respect, it is very
important to take prefetching decisions at the right time.
Prefetching can be done manually or automatically. The programmers
can manually insert prefetching hints into their programs. This approach
supposes that the source code is available. Additionally, this involves a high
cost of program understanding and modification.
There are several automatic prefetching approaches.
– Sequential read-ahead. This is the simplest type of prefetching. It is
based on the file access studies that have found out that the most fre-
quent type of access is sequential. This type of automatic prefetching is
implemented by most file systems.
A distributed file system typically presents the user a tree-like name space
that contains directories and files distributed over several machines that can
be accessed by a path. A file’s location is transparent when the user cannot
tell, just by looking at its path, if a file is stored locally or remotely. NFS [652]
The traditional distributed file system architecture was based on the client-
server paradigm [652, 409, 145]. A file server manages a pool of storage re-
sources and offers a file service to remote or local clients. A typical file server
has the following tasks: stores file data on its local disks, manages metadata,
caches metadata and data in order to quickly satisfy client requests, and
eventually manages data consistency by keeping track of clients that cache
blocks and updating or invalidating stale data.
Examples of file systems using a client-server architecture are NFS [652]
and AFS [409]. An NFS server exports a set of directories of a local file system
to remote authorized client machines. Each client can mount each directory
at a specific point in its name tree. Thereafter, the remote file system can be
accessed as if it is local. The mount point is automatically detected at file
name parsing. AFS’s Vice is a collection of file servers each of which stores a
part of the system tree. The client machines run processes called Venus that
cooperate with Vice for providing a single location-independent name space
and shared file access.
A centralized file server is a performance and reliability bottleneck. As an
alternative, a server-less architecture has been proposed [45]. The file system
consists of several cooperating components. There is no central bottleneck in
the system. The data and metadata are distributed over these components,
can be accessed from everywhere in the system and can be dynamically mi-
grated during operation. This also gives the opportunity of providing available
services by transferring the tasks of failed components to the remaining machines.
13.4.3 Scalability
13.4.4 Caching
Distributed file system caches have two main roles: the traditional role of
caching blocks of local disks and providing local copies of remote resources
(remote disks or remote file caches). In our discussion, we will use the at-
tribute local for a resource (e.g. a cache) or entity (e.g. a page of a cache) that is ac-
cessible only on a single machine and global for a resource that is accessible
by all machines in a network.
Cooperative Caches. In order to make a comparison between caching in
client-server and server-less architectures we consider a hybrid file caching
model in a distributed file system. In this model “server” represents a file
server for the client-server architecture and a storage manager, in the sense
of disk block repository, for the server-less architecture. “Client” is also used
in a broader sense meaning the counterpart of a server in the client-server
architecture, and a user of the file system in the server-less architecture.
In this hybrid model we identify six broad file caching levels of the disk
blocks, from the perspective of a client:
1. client local memory
2. server memory
3. other clients’ memory
4. server disk
5. client disk
6. other clients’ disks
In the client-server design only four levels are used: 1,2,4,5. For instance,
in order to access file data, NFS looks up the levels 1, 2 and 4 in this order and
AFS 1, 5, 2 and 4. The client has no way of detecting if the data is located in
the cache of some other client. This could represent a significant performance
penalty if the machines are inter-connected by a high-performance network.
Under the circumstances of the actual technologies, remote memory access
can be two to three orders of magnitude quicker than disk access. Therefore,
cached blocks in other clients’ memory could be fetched more efficiently.
Coordinating the file caches of many machines distributed on a LAN in
order to provide a more effective global cache is called cooperative caching.
This mechanism is very suitable to the cooperative nature of the server-
less architecture. Dahlin et al. [237] describe several cooperative caching
algorithms and results of their simulations.
There are three main aspects one has to take into consideration when
designing a cooperative caching algorithm. First, cooperative caching implies
that a node is able to look up for a block not only in the local cache but also
in remote caches of other nodes. Therefore, a cooperative caching algorithm
has to contain both a local and a global lookup policy. Second, when a block
has to be fetched into a full file cache, the node has to choose another block
for eviction. The eviction may be directed either to the local disk or to the
remote cache/disk of another node. Therefore, an algorithm has to specify
both a local and a global block replacement policy. Finally, when several nodes
cache copies of the same block, an algorithm has to describe a consistency
protocol. The following algorithms are designed to improve cache performance
for file system reads. Therefore, they do not address consistency problems.
Direct client cooperation. At file cache overflow, a client uses the file
caches of remote clients as an extension of its own cache. The disadvantage is
that a remote client is not aware of the blocks cached on behalf of some other
client. Therefore, it can request anew a block it already caches, resulting in
double caching. The lookup is simple: each client keeps information about
the location of its blocks. No replacement policy is specified.
Greedy forwarding. Greedy forwarding considers all the caches in the sys-
tem as a global resource, but it does not attempt to coordinate the contents
of these caches. Each client looks up the levels 1,2,3,4 in order to find a file
block. If the client does not cache the block it contacts the server. If the
server caches the block, it returns it. Otherwise, the server consults a struc-
ture listing the clients that are caching the block. If it finds a client caching
the block, it instructs that client to send the block to the original requester. The
server sends a disk request only in the case the block is not cached at all.
The algorithm is greedy, because there is no global policy, each client man-
aging its own local file cache, for instance by using a local “least recently
used” (LRU) policy, for block replacement. This can result in unnecessary
data duplication on different clients. Additionally, it can be noticed that the
server is always contacted in case of a miss and this can cause a substantial
overhead for a high system load.
Centrally coordinated caching. Centrally coordinated caching adds co-
ordination to the greedy forwarding algorithm. Besides the local file caches,
there is a global cache distributed over clients and coordinated by the server.
The fraction of memory each client dedicates to local and global cache is
statically established. The client looks up the levels 1,2,3,4 in order to find
a file block, in the same way as greedy forwarding does. Unlike greedy for-
warding, centrally coordinated caching has a global replacement policy. The
server keeps lists with the clients caching each block and evicts always the
least recently used blocks from the global cache. The main advantage of cen-
trally coordinated caching is the high global hit rate it can achieve due to
the central coordinated replacement policy. On the other side it decreases the
data locality if the fraction each client manages greedily is small.
N-Chance forwarding. This algorithm is different from greedy forwarding
in two respects. First, each client adjusts dynamically the cache fraction it
manages greedily based on activity. For instance, it makes sense that an idle
client dedicates its whole cache to the global coordinated cache. Second, the
algorithm considers a disk block that is cached at only one client as very
important and it tries to postpone its replacement. Such a block is called a
singlet. Before replacing a singlet, the algorithm gives it n chances to survive.
A recirculation count is associated with each block and is set to n at
the time the replacer finds out, by asking the server, that the block is a
singlet. Whenever a singlet is chosen for replacement, the recirculation count
is decremented, and, if it is not zero, the block is sent randomly to another
client and the server is informed about the new block location. The client
receiving the block places the block at the end of its local LRU queue, as if
it has been recently referenced. If the recirculation count becomes zero, the
block is evicted. N-Chance forwarding degenerates into greedy forwarding
when n = 0. There are two main advantages of N-Chance forwarding. First, it
provides a simple trade-off between global and local caches. Second, favoring
singlets provides better performance, because evicting a singlet is more
expensive than evicting a duplicate, since a duplicate can later be found in
another client’s cache. The N-Chance forwarding algorithm is employed by the xFS
distributed file system.
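A schematic sketch of the singlet-recirculation rule (the three clients, n = 2 chances, and the global holders() check that stands in for the server's bookkeeping are all illustrative; this is not xFS code):

```python
import random

N = 2                                    # chances granted to a singlet
caches = {c: [] for c in "ABC"}          # per-client LRU lists, oldest first
recirc = {}                              # (client, block) -> remaining chances

def holders(block):
    """Clients currently caching the block (the server's location table)."""
    return [c for c, blocks in caches.items() if block in blocks]

def evict(client):
    """N-Chance eviction of the locally least recently used block."""
    victim = caches[client].pop(0)
    chances = recirc.pop((client, victim), N)
    if not holders(victim) and chances > 0:          # the victim was a singlet
        peer = random.choice([c for c in caches if c != client])
        caches[peer].append(victim)                  # as if just referenced
        recirc[(peer, victim)] = chances - 1
    # otherwise the block is dropped: it is a duplicate or out of chances

caches["A"] = ["x", "y"]    # client A caches two blocks; "x" is the LRU victim
evict("A")                  # "x" is a singlet, so it is forwarded to B or C
```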
Hash-distributed caching. Hash-distributed caching differs from centrally
coordinated caching in that each block is assigned to a client cache by hash-
ing its address. Therefore, a client that does not find a block in its cache is
able to contact directly the potential holder, identified by hashing the block
address, and only in the case of a miss the server (lookup order: 1,3,2,4). The replace-
ment policy is the same as in the case of centrally coordinated caching. This
algorithm reduces significantly the server load, because each client is able to
bypass the server in the first lookup phase.
Weighted LRU. The algorithm computes a global weight for each page and
it replaces the page with the lowest value/cost ratio. For instance, a singlet is
more valuable than a block cached in multiple caches. The opportunity cost
of keeping an object in memory is the cache space it consumes until the next
time the block is referenced.
Semantics of File Sharing. Using caching comes at the cost of provid-
ing consistency for replicated file data. Data replication in several caches is
normally the direct consequence of file sharing among several processes. A
consistency protocol is needed when at least one of the sharing processes
writes the file. The distributed file systems typically guarantee a semantics
of file sharing.
The most popular model is UNIX semantics. If a process writes to a file,
a subsequent read of any process must see that modification. It is easy to
implement in the one-machine systems, because they usually have a central-
ized file system cache which is shared between processes. In a distributed file
system, caches located on different machines can contain the copy of the same
file block. According to UNIX semantics, if one machine writes to its copy,
a subsequent read of any other machine must see the modification, even if it
occurred a very short time ago. Possible solutions are invalidation or update
protocols. For instance, xFS uses a token-based invalidation protocol. Before
writing to its locally cached block copy, a process has to contact the block
manager, that invalidates all other cached copies before sending back the
token. Update or invalidation protocols may incur a considerable overhead.
Alternatively the need for a consistency protocol can be eliminated by con-
sidering all caches in the distributed system as a single large cache and not
allowing replication [220]. However, the drawback of this approach is that it
would reduce access locality.
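A minimal sketch of a token-granting block manager in the spirit of the protocol just described (the class names and the invalidate callback are our own, not xFS's interface):

```python
class Client:
    def __init__(self, name):
        self.name, self.cached = name, set()

    def invalidate(self, block):
        self.cached.discard(block)          # drop the stale cached copy

class BlockManager:
    """Grants one write token per block after invalidating all other copies."""

    def __init__(self):
        self.copies = {}                    # block -> set of caching clients

    def register_copy(self, block, client):
        self.copies.setdefault(block, set()).add(client)

    def request_write_token(self, block, writer):
        for client in self.copies.get(block, set()) - {writer}:
            client.invalidate(block)
        self.copies[block] = {writer}
        return True                         # the writer may now modify its copy

mgr = BlockManager()
alice, bob = Client("alice"), Client("bob")
for c in (alice, bob):
    c.cached.add("block-7")
    mgr.register_copy("block-7", c)
mgr.request_write_token("block-7", alice)
print("block-7" in bob.cached)              # -> False: bob's copy was invalidated
```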
In order to reduce the overhead of a UNIX semantics implementation,
relaxed semantics have been proposed. In the session semantics, guaranteed
by AFS, all the modifications made by a process to a file after opening it,
will be made visible to the other processes only after the process closes the
file.
Transaction semantics guarantees that a transaction is executed atom-
ically and all transactions are sequentialized in an arbitrary manner. The
operations not belonging to a transaction may execute in any order.
NFS semantics guarantees that all the modifications of a client will become
visible to other clients within 3 seconds for data and 30 seconds for metadata.
This semantics is based on the observation of access patterns studies that file
sharing for writing is rare.
For instance, files in GPFS [670] and PVFS [422] are split into equally-
sized blocks and the blocks are striped in a round-robin manner over the I/O
nodes. This simplifies the data structure used for file physical distribution,
but it can affect the performance of a parallel application due to a poor match
between access patterns and data placement, as we have shown in subsection
13.2.2.
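A sketch of such round-robin placement (the block size and the number of I/O nodes are made-up parameters):

```python
BLOCK_SIZE = 64 * 1024      # bytes per file block (illustrative)
IO_NODES = 4                # number of I/O nodes the file is striped over

def locate(offset):
    """Map a logical byte offset to (I/O node, local block index, offset in block)."""
    block = offset // BLOCK_SIZE
    return block % IO_NODES, block // IO_NODES, offset % BLOCK_SIZE

print(locate(1_000_000))    # -> (3, 3, 16960): node 3, its 4th local block
```

A strided logical access pattern that repeatedly hits the same value of block % IO_NODES keeps all but one node idle, which is exactly the kind of mismatch discussed above.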
Other parallel file systems subdivide files into several subfiles. The file is
still a linearly addressable sequence of bytes. The subfiles can be accessed
in parallel as long as they are stored on independent devices. The user can
either rely on a default file placement or control the file declustering. The
fact that the user is aware of the potential physical parallelism helps him in
choosing a right file placement for a given access pattern.
In the Vesta Parallel File System [211, 210], a data set can be partitioned
into two-dimensional rectangular arrays. The nCube parallel I/O system [241]
builds mapping functions between a file and disks using address bit permuta-
tions. The major deficiency of this approach is that all sizes must be powers
of two. A file in Clusterfile parallel file system [423] can be arbitrarily parti-
tioned into subfiles.
Logical Views. Some parallel file systems allow applications to logically
partition a file among several processors by setting views on it. A view is a
portion of a file that appears to have linear addresses. It can thus be stored
or loaded in one logical transfer. A view is similar to a subfile, except that
the file is not physically stored in that way. When an application opens a file
it has by default a view on the whole file. Subsequently, it might change the
view according to its own needs.
Views are used by parallel file systems (Vesta [211, 210] , PVFS [422],
Clusterfile [423]) and by libraries like MPI-IO [548]. MPI-IO allows setting
arbitrary views by using MPI datatypes. With Vesta, the applications may
set views only in two-dimensional rectangular patterns, which is obvi-
ously a limitation. Multidimensional views on a file may be defined in PVFS.
Like MPI-IO, Clusterfile allows for setting arbitrary views. An important ad-
vantage of using views is that it relieves the programmer from complex index
computation. Once the view is set the application has a logical sequential
view of the set of data it needs and can access it in the same manner it
accesses an ordinary file.
Setting a view gives the opportunity of early computation of mappings be-
tween the logical and physical partitioning of the file. The mappings are then
used at read/write operations for gathering/scattering the data into/from
messages. The advantage of this approach is that the overhead of computing
access indices is paid just once at view setting.
Views can also be seen as hints to the operating system. They actually
disclose potential future access patterns and can be used by I/O scheduling,
caching and pre-fetching policies. For example, these hints can help in order-
ing disk requests, laying out of file blocks on the disks, finding an optimal
machine they were attached to. Therefore, if the machine becomes unavail-
able due to a crash, overloading or any other reason, the disks also become
inaccessible. This may happen even though the reason for the unavailability had
nothing to do with the disk to be accessed. The advent of network-attached
storage (NAS) has allowed the separation of storage from computers. The
disks can be attached to the network and accessed by every entitled host
directly.
There are several advantages of this approach. First, the NAS allows the
separation of file data and metadata [332]. A server running on one machine
takes care of metadata, whereas the data is kept on NAS. The clients contact
the server when opening a file and receive an authorization token, which can
be subsequently used for accessing the disks bypassing the server. This makes
the servers more scalable, because they are relieved from data transfer duties.
Second, this may eliminate the need for expensive dedicated servers, because
a lighter load allows a machine to be used also for other purposes. Third,
the performance is boosted, because a client request does not have to go
through expensive software layers at the server (e.g. the operating system), but
it can contact the disks directly. Fourth, with a server-attached disk, the data
has to go through two interconnects: the server-internal I/O bus and the
external network. Network-attached disks reduce the number of network
transits from two to one. Fifth, if a host in the network fails, the disks continue
to be available for the remaining machines.
The challenge is how to manage concurrent access by several clients to
the disks. If the disks are smart (have a dedicated processor), the concurrent
access may be implemented at disks. Otherwise, an external lock manager
may be needed. Each client has to acquire a lock for a file or for a file portion
before effectively accessing the disks.
13.4.8 Mobility
The increasing development of mobile computing and the frequent poor con-
nectivity have motivated the need for weakly connected services. The file
system users should be able to continue working in case of disconnection or
weak connectivity and update themselves and the system after reintegration.
A mobile computer typically pre-fetches (hoards) data from a file system
server in anticipation of disconnection [145]. The data is cached on the local
disk of the mobile computer. If disconnection occurs the mobile user may
continue to use the locally cached files. The modification of the files can be
sent to the server only after reconnection. If several users modify the same file
concurrently, consistency problems may occur. The conflicts may be resolved
automatically at data reintegration by trying to merge the updates from
different sources. If the automatic process fails, the data consistency has to
be solved manually. However, as we have shown in the access pattern section,
the sequential applications rarely share a file for writing. Therefore, it is likely
that the conflicts occur rarely and the overhead of solving them is small.
13.5 Summary
14.1 Introduction
The exploitation of locality has proven to be paramount for getting the most
out of today’s computers. Both data and instruction locality can be exploited
from different perspectives, including hardware, base software and software
design.
From the hardware point of view, different mechanisms such as data vic-
tim caches or hardware prefetch [392] have been proposed and implemented
in different processors to increase the locality of the reference stream, or to
hide the latency of the accesses to distant memory levels.
From the base software point of view, compilers are able to extract local-
ity with the help of heuristics or profiles of the applications, both from the
instruction stream [627] and from the data references [392].
From the software design point of view, it is possible to design applica-
tions so that both the spatial and temporal locality of the data and instruc-
tion streams are exploited. However, while the programmer can influence the
data access pattern directly, it is more difficult for her/him to influence the
instruction stream.
Database management systems (DBMSs) are complex applications with
special characteristics such as large executables, very large data sets to be
managed, and complex execution patterns. The exploitation of locality in
such applications is difficult because they access many different types of data
and routines to execute a simple transaction. Thus, one simple transaction in
a DBMS may exercise all the levels of the memory hierarchy, from the first
level of cache to the external permanent storage.
Our objective is to understand how the exploitation of locality can be
exercised in the execution kernel of a DBMS on computers with a single
processor. We focus on the study of DBMS locality for the execution of large
read queries coming from Decision Support Systems and Data Warehousing
workloads. We analyse the different layers of the execution kernel of a DBMS,
the base software and the hardware. Furthermore, we study not only the basic
operations of DBMSs, such as joins or indexed data accesses, but also how
they influence the execution of the database in general.
Aspects related to logging, locking or the On-Line Transaction Processing
environment will not be considered, because the objective is to understand
how read queries on large amounts of relational data can be enhanced with
the exploitation of locality at different levels in a DBMS.
This chapter is organized in the following way. In Sect. 14.2, we start
by considering what to expect from the chapter and the minimum knowledge
assumed when writing it. In Sect. 14.3, we describe the internal structure of a
Database Management System and how it executes read queries. In Sect. 14.4,
we give evidence of locality in DBMSs, and in Sect. 14.5 we describe the
basic software techniques that can be used to expose such locality. Later,
in Sects. 14.6, 14.7, and 14.8, we explain how the techniques explained in
Sect. 14.5 can be used to exploit locality in each of the horizontal DBMS
layers. In Sects. 14.9 and 14.10, we give some insight into the hardware
and compilation issues. Finally, in Sect. 14.11, we summarize.
uni-processor computers. That is, we look into ways of reducing the number
of times a data item has to be brought from any of the levels of the memory
hierarchy (including main memory and disk) to a level closer to the processor.
We do not consider disk technology or how disks are implemented to exploit
locality during their accesses to physical disk sectors. Nor do we consider
techniques that merely hide the latency of the memories, because hiding the
latency does not fall into what we understand as algorithmic changes to improve
locality.
In the survey, and wherever possible, we only give the number of data
reads from the most significant levels of the memory hierarchy. We do so
because data reads are usually in the critical path of operations (some opera-
tions have dependences on a read), while data writes may be delayed because
they do not cause dependences.
Basic DBMS Concepts. We assume the reader has a basic knowledge
of the Relational algebra operators like Restriction, Projection, Join, etc.¹
Reading [694] may help to understand these concepts. We also assume the
reader knows how these operations map onto their implementations, for instance,
how a join can be implemented as a Hash Join, a Nested Loop Join
or a Merge Join. For an understanding of the different implementations of
relational operators we refer the reader to [347, 552].
To enable the reader to understand the implementation of a couple of join
operators, we describe the Nested Loop Join and the Hash Join implementations.
Let us suppose relations R and S with joining attributes R.r and
S.s and a join operation R ⋈ S with join condition R.r > S.s. The Nested
Loop Join takes every record of relation R and checks its join attribute against
the join attribute of each and every record of relation S, returning as result
the pairs of records that fulfil the join condition. Relation R is called the outer
relation and relation S the inner relation, because they are traversed as
in the execution of a nested loop where the traversal of R is governed by the
outer loop and the traversal of S by the inner loop.
The execution time of the Nested Loop Join operation is proportional to
the cardinality (number of records) of relation R (which we denote by |R|)
times the cardinality of relation S, |S|.
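To make the description concrete, the following sketch shows a minimal in-memory Nested Loop Join for the R.r > S.s condition used above. It is only an illustration of the nested-loop structure: the relations are modelled as Python lists of dictionaries, and all function and variable names are our own choices rather than the interface of any real DBMS.

    def nested_loop_join(R, S):
        """Minimal in-memory Nested Loop Join for the condition R.r > S.s.
        R is the outer relation, S the inner relation; the running time is
        proportional to |R| * |S|."""
        result = []
        for rec_r in R:                      # outer loop over the outer relation R
            for rec_s in S:                  # inner loop over the inner relation S
                if rec_r["r"] > rec_s["s"]:
                    result.append((rec_r, rec_s))
        return result

    # Example: the pairs of records with R.r > S.s.
    R = [{"r": 5}, {"r": 2}]
    S = [{"s": 1}, {"s": 4}]
    print(nested_loop_join(R, S))            # three pairs: (5,1), (5,4), (2,1)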
For the Hash Join implementation we assume that the join condition is
R.r = S.s.² The Hash Join implementation takes all the instances of one of the
joining attributes, say R.r, and builds a hash table using a hash function H.
This build phase has an execution time proportional to the cardinality of the
build relation R. Using the same hash function H, all the instances of attribute
S.s are probed against the hash table. The execution time of this probe phase
is proportional to the cardinality of the probe relation S.
¹ The application of a Restriction operation on a relation eliminates all the records
of that relation that do not fulfill the Restriction condition imposed. A Projection
operation obtains a new relation with the same number of records but only with
the desired attributes. A Join operation on two relations R and S, R ⋈ S, with
a join condition on one or more attributes of each of those relations, obtains as
a result the pairs of records that fulfill the join condition.
² Given the nature of the algorithms, the join condition of a Hash Join can only be
of the type R.r = S.s, while the Nested Loop Join condition can be of any type
(larger than, smaller than or equal to, etc.). Therefore, the Hash Join can only
be used in the case of equality or natural joins, while the Nested Loop Join can
be used with any join condition.
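The build and probe phases just described can be summarised in the following minimal sketch, where a Python dictionary stands in for the hash table built with H; as before, the names are illustrative only.

    from collections import defaultdict

    def hash_join(R, S):
        """Minimal in-memory Hash Join for the equality condition R.r = S.s."""
        buckets = defaultdict(list)
        for rec_r in R:                      # build phase: proportional to |R|
            buckets[rec_r["r"]].append(rec_r)
        result = []
        for rec_s in S:                      # probe phase: proportional to |S|
            for rec_r in buckets.get(rec_s["s"], []):
                result.append((rec_r, rec_s))
        return result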
Fig. 14.1. Layered structure of the query engine of a Database Management System.
Layers involved in the execution of read-only queries.
The Query Optimizer decides which operations implement the query (Join,
Index Scan, Sequential Scan, etc.) and the order in which they have
to be executed. Some current optimizers obtain the execution plans based
on, among other factors, estimates of the cardinality of the tables and the costs
of the different implementations of those operations. Some of the information
used by the optimizer is stored in the database catalog or data dictionary. The
database catalog holds information about the database, such as the structure
of the relations, the relations among tables, the number of attributes of each
record in a table, the names of the attributes on which an index has been
created, the characteristics of the results of database operations in previous
executions, etc.
Fig. 14.2. SQL statement (a) and final query execution plan (b). The statement (a) is:
    select s.name, s.age
    from staff s, department d, projects p
    where s.deptkey = d.deptkey and s.projkey = p.projkey
      and d.city = 'Barcelona' and p.name = 'Ursus' and s.age > 30
    order by s.name;
The plan (b) is a tree of Index Scan, Sequential Scan, Nested Loop Join, Hash Join,
and Sort nodes over the STAFF, PROJ, and DEPT relations.
The query Execution Engine is now responsible for the execution of the
Plan obtained by the optimizer. The components of the Execution Engine
used in read only queries are shown in the shaded box in Fig. 14.1. These
are the Executor, the Access Methods, the Buffer Manager, and the Storage
Manager modules. Therefore, after allocating memory for the data structures
required by the query and its nodes³, and after initializing those data struc-
tures, the Executor module takes control. The memory allocated for each
node is called the heap of that node. The size of the heap is usually fixed
during the execution of a query, and it can be decided at optimization time
by the user or by the database administrator.
Fig. 14.3 shows a sample of the routines that each layer of the Execution
Engine of the DBMS offers to the upper layers, and what type of data those
layers are expected to obtain from the layers immediately below. Fig. 14.3
serves to illustrate the following explanation.
³ Those data structures may be, for instance, the hash structure for the Hash Join
node, space to store intermediate results, etc.
Fig. 14.3. Functionalities and data offered to the upper layer of a DBMS query
engine: the Executor calls Access Methods routines (get_next_tuple, ReScan, ...)
and receives tuples; the Access Methods call Buffer Manager routines (read_buffer,
write_buffer, fix/unfix, ...) and receive buffers; the Buffer Manager calls Storage
Manager routines (read_block, write_block, ...) and receives blocks.
The Executor module processes the plan in a record by record data driven
execution form, starting from the topmost node of the plan. This is called
pipelined execution: when one node requires a record to process, it invokes
its immediate lower node or nodes and so on. For instance, the Nested Loop
Join operation invokes the inner relation node to retrieve all its records, one
record at a time, for each record of the outer relation node. In this case, the
first result may be obtained and passed to the upper node before reading the
whole inner relation. On the other hand, the Hash Join node requires all the
records of the build relation to create the hash data structure, which will be
stored in the heap of the Hash Join node. Only when the hash data structure
has been fully created, can the probe relation be invoked record by record to
check for joining records against the hash structure.⁴
⁴ As we will show later, there are cases, like the Hybrid Hash Join algorithm, that
partition the build and probe relations. This happens when the records of the
build relation do not fit in the heap. After partitioning the relations, only the
records of one partition (that supposedly fit in main memory) are required to
create the hash structure. Moreover, only the records of the corresponding probe
partition are required to be probed against a build partition.
Other nodes, like the Sort nodes, require the full input relation before
producing the first result of their operations.⁵ In this case, the Sort nodes are
said to be materialized operations, as opposed to pipelined ones.
With the pipelined execution strategy, it is possible (i) to execute a full
plan with the least amount of main memory space to store intermediate
results, and (ii) to obtain the first result as soon as possible.
Only the Scan nodes access records from the relations of the database.
Each time a parent node invokes a Scan, the Scan serves a record. However,
the Scan nodes cannot access the relations directly. Instead, they can only
call an Access Methods interface routine to obtain a record.
The Access Methods module abstracts the physical organization of records
in file pages from the Executor module. The Access Methods module offers
functionalities that allow the Executor module to read, delete, modify, or
insert records, and to create or destroy relations.
The generic routines offered by the Access Methods module for sequential
accesses are of the type “get next record(Relation)”, which returns a record
in sequential order from “Relation”. The generic routines for indexed accesses
are of the type “get next indexed record(Attribute, Relation)”, which returns
a record from “Relation” in the order dictated by the index on “Attribute”, or
“get next indexed record for value(Attribute, attr value, Relation)”, which
returns a record from “Relation” with “Attribute” value “attr value” through
an index on “Attribute”.
The Access Methods module accesses the database catalog in order to
obtain knowledge of the type of data file or index file used for the relation
being accessed.⁶ The Access Methods module also holds structures in memory
that describe the relations like the identifiers of the pages of those relations,
etc.
Whenever the Access Methods module needs to access a record, it calls
the Buffer Pool Manager to get hold of the page that contains the record,
uses the page, and finally releases it.
The Buffer Pool Manager, also called the Buffer Manager, is similar to an
operating system virtual memory manager: it has a Buffer Pool with slots to
store file pages. The objective of the Buffer Pool is to create, for the Access
Methods module, the illusion that the whole database is in main memory.
The Buffer Pool Manager offers a set of functionalities to the Access
Methods module. Those functionalities allow the Access Methods module to
state the intention of working with a page and to release the intention to
use the page. This is done with “Buffer Fix” and “Buffer Unfix” primitives,
respectively. With a fix on it, a page cannot be removed from the Buffer Pool.
⁵ Note that sorting can be performed in batches and then the batches are merged,
as we will see later. Each batch can be processed in a pipelined manner, but
in order to start the merge process, the last data item to be sorted has to be
inserted in a batch.
⁶ These can be record or multimedia files, B-trees, R-trees or other data structures.
Different basic techniques can be used to expose the locality of data in al-
gorithms. The particular structure of the DBMS Execution Engine makes it
feasible to use some of these techniques in certain layers of the engine.
In this section, we describe a number of basic techniques for the exploita-
tion of locality. Each technique is described, the layers of the DBMS Execu-
tion Engine where it can be used are detailed, and an example of its use in
the database area is given. In Sects. 14.6 and 14.7, a full explanation of the
application of those basic techniques is given.
The basic techniques are as follows:
– Blocking restructures an algorithm to expose the temporal locality of
groups of records that fit in a level of the memory hierarchy. With this
Blocking, the memory traffic between different levels of the memory hier-
archy is reduced. Blocking can be implemented in the Executor module of
the DBMS.
For instance, Blocking can be applied to the Nested Loop Join [687]. For the
in-memory case, instead of checking all the inner relation records against
each and every record of the outer relation, the DBMS can check a record
of the inner relation against a block of records of the outer relation. This
divides the number of loads of the whole inner relation by the size of the
block. The size of the block, called the Blocking Factor, has to accommo-
date the size of the target cache level. Blocking can also be used in the
external memory implementation of the Nested Loop Join, as we will see
later.
– Horizontal Partitioning distributes all the records to be processed into sub-
sets, with the aim that every record subset should fit in a certain level of
the memory hierarchy. The criterion for distributing the records is some
type of hash function on the values of an attribute or attributes of the re-
lation to be partitioned. An advantage of Horizontal Partitioning is that it
does not usually require post-processing after processing each record sub-
set. A disadvantage of Horizontal Partitioning is that the exact size of the
partitions is uncertain and depends on skew, the number of duplicate keys,
etc. Horizontal Partitioning can be implemented in the Executor module.
For instance, Horizontal Partitioning can be used in the Hash Join algo-
rithm when the build relation does not fit in the heap. In that case, the
build and probe relations are partitioned with the same hash function. The
number of different partitions is decided by the DBMS with the help of a
heuristic and depends on factors like the size of the memory, the expected
skew in the data set, etc. After the partitioning, pairs of partitions are
joined (one from each joining relation). The partitioning phase is expected
to chop the build relation so that each partition fits in the heap. This is
the Grace Hash Join algorithm [459].
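A rough sketch of this partitioning scheme is given below: both relations are partitioned with the same hash function, and corresponding partition pairs are then joined with an in-memory hash join, assuming that each build partition fits in the heap. The fixed number of partitions and all names are illustrative assumptions of the sketch; in a real DBMS the number of partitions is chosen by the heuristic mentioned above.

    from collections import defaultdict

    def partition(relation, attr, num_partitions):
        """Horizontal Partitioning: distribute the records into buckets using
        a hash function on the joining attribute."""
        parts = [[] for _ in range(num_partitions)]
        for rec in relation:
            parts[hash(rec[attr]) % num_partitions].append(rec)
        return parts

    def in_memory_hash_join(R_part, S_part):
        """Join one pair of partitions with an in-memory hash join (R.r = S.s)."""
        buckets = defaultdict(list)
        for rec_r in R_part:                 # build on the (small) R partition
            buckets[rec_r["r"]].append(rec_r)
        return [(rec_r, rec_s) for rec_s in S_part
                for rec_r in buckets.get(rec_s["s"], [])]

    def grace_style_hash_join(R, S, num_partitions=3):
        """Partition build and probe relations with the same hash function,
        then join corresponding partition pairs one at a time."""
        R_parts = partition(R, "r", num_partitions)
        S_parts = partition(S, "s", num_partitions)
        result = []
        for R_i, S_i in zip(R_parts, S_parts):
            result.extend(in_memory_hash_join(R_i, S_i))
        return result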
In this section, we explore the space of some basic database operations such
as Scan, Join, Aggregate and Sort. For each operation, we detail some of
the proposals that have been used to exploit data locality in the Executor
layer, using any of the techniques above. Locality is not only important at
the record level in this layer of the DBMS, but also at the heap level.
For instance, once the records are obtained from the lower nodes of the
execution plan, part of the heap stores those records and the rest of the heap
is devoted to storing the data extracted from them. As already stated, this is the
extraction of relevant data, and it may be applied to the operations explained below.
For the models below, we assume a cache level able to accommodate a
total of B1 records, a disk page able to accommodate B2 records, and a node
heap able to accommodate the equivalent of B3 disk pages.
Scan operations receive one record at a time from the Access Methods layer
and pass one record at a time to the upper nodes of the plan. In order to allow
Scan operations to pass more than one record at a time, Grouping can be
implemented in these operations. In this case, a loop that iterates as many
times as the number of records wanted by the upper nodes calls routines of the
Access Methods module, such as “get next record(Relation)”, to build a group.
The Scan operation then needs a number of record slots equal to the number
of records passed.
Grouping at this level still has to perform one Access Methods call per
record. This implies that Grouping may not exploit locality as well as it could.
For instance, Index Scan routines traverse the index for every Scan routine
call, which results in as many page accesses as the depth of the tree structure.
These pages are accessed in consecutive calls of the Index Scan routines, which
improves locality with respect to the non-Blocking approach. However, only
one traversal of the index pages would be enough if Grouping were implemented
at the Access Methods level, which is the option we prefer (see Sect. 14.7 for
details).
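As a small illustration of Grouping at the Scan level, the sketch below wraps a record-at-a-time Access Methods routine in a loop that builds a group of n records before passing it upwards; get_next_record and the end-of-relation convention are assumptions of the sketch, not the interface of an actual system.

    def scan_group(get_next_record, relation, n):
        """Grouping in a Scan node: call the record-at-a-time Access Methods
        routine n times and return the records as one group (n record slots)."""
        group = []
        for _ in range(n):
            rec = get_next_record(relation)   # one Access Methods call per record
            if rec is None:                   # assumed end-of-relation marker
                break
            group.append(rec)
        return group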
Fig. 14.4. Implementation of Hash Teams for a two-join query on a common join
attribute: relations R, S and T are partitioned with the same hash function H into
partitions PR0–PR2, PS0–PS2 and PT0–PT2; hash join HJ1 joins corresponding
partitions of R and S, and hash join HJ2 joins its result (e.g. J1R0) with the
partitions of T. Pxy stands for partition y of relation x; J1R0 stands for join 1
result of partition 0.
Merge Join. The joining process with Merge Join starts with the sorting
of one or both of the joining relations. When one of the joining relations has
an index on the joining attribute, only the non-indexed relation has to be
sorted [347].
In the non-indexed implementation, when the sorting of the relations has
finished, it is necessary to traverse the relations sequentially to join them.
The indexed implementation is similar to the indexed Nested Loop Join, except
that the non-indexed relation has to be sorted first.
Therefore, the weight of the Join operation is in the Sort nodes. Aspects
related to locality in the Sort operation are explained in Sect. 14.6.4. The rest
which may also reduce the amount of calls since it collapses several nodes of
the plan in one node.
14.6.4 Sorting
The Access Methods module implements different routines to allow the Ex-
ecutor module to access records of the Database either sequentially or with in-
dexes. As stated above, routines for accessing records like “get next record(Re-
lation)” or “get next indexed record(Attribute, Relation)” are typical.
The Access Methods layer follows different steps to return a record to the
Executor layer. For example, in an indexed access, the Access Methods mod-
ule checks the catalog for the type of index used with the specific attribute
in the relation to be read, searches the routine that performs the task for the
index type and traverses the index structure to obtain the record.
The key issue about the Access Methods layer is how data are stored in
physical pages. This is something that the Executor layer does not have to be
aware of, because it can only receive records. The physical storage determines
the patterns of access (hence the locality) by the Access Methods and the
Buffer Manager. We give further details of these aspects for data files and
index files below.
The three basic models for the storage of records in a file page are: the N-ary
Storage Model (NSM), the Decomposition Storage Model (DSM) and the
Partition Attributes Across (PAX) model. Fig. 14.5 shows the data layout
for the three strategies.
Fig. 14.5. Page layout under the three storage models: NSM (page header, tuples
stored from the front, free space, and an array of pointers growing from the end of
the page), DSM (pairs of attribute value and record identifier, e.g. att_1.1/rid_1.1),
and PAX (the tuples of a page partitioned into one minipage per attribute).
The N-ary Storage Model stores records contiguously starting from the
beginning of each disk page. In order to identify the offset to each record
in the page, NSM holds a pointer table starting at the end of the page and
growing towards the beginning. In this way, record and pointer table storage
do not interfere until the page is full. Entry X of the offset or pointer table
points to record X in the page. This is the model used by most DBMSs [22].
NSM aims at the exploitation of spatial locality at the record level, in the
belief that a significant number of the attributes of a record are accessed in one
single record access. However, in some cases this
model is not efficient in terms of the cache hierarchy. For example, when a
query only processes one attribute per record, the attributes next to it are
also fetched, with the cache line, up to the cache level closest to the processor.
Memory bandwidth is wasted and unnecessary data are stored in caches along
the cache hierarchy [22].
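The following toy sketch illustrates the NSM page organisation just described: record bytes are stored from the beginning of a fixed-size page and the offset (pointer) table grows from the end, so the two areas do not interfere until the page is full. The sizes and field names are illustrative assumptions, not the layout of any particular DBMS.

    class NSMPage:
        """Toy N-ary Storage Model page: records grow from the front, the
        offset table grows from the back; entry i points to record i."""

        def __init__(self, page_size=4096, slot_size=4):
            self.data = bytearray(page_size)
            self.slot_size = slot_size        # bytes reserved per offset entry
            self.free_start = 0               # beginning of free space
            self.free_end = page_size         # end of free space
            self.slots = []                   # (offset, length) of each record

        def insert(self, record: bytes) -> bool:
            """Store a record if it and one more offset entry still fit."""
            if self.free_start + len(record) + self.slot_size > self.free_end:
                return False                  # page is full
            self.data[self.free_start:self.free_start + len(record)] = record
            self.slots.append((self.free_start, len(record)))
            self.free_start += len(record)
            self.free_end -= self.slot_size
            return True

        def record(self, i: int) -> bytes:
            offset, length = self.slots[i]
            return bytes(self.data[offset:offset + length])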
In order to prevent this from occurring, the Decomposition Storage
Model [208] performs a Vertical Partitioning of the relation into a number
of vertical sub-relations equal to the number of attributes. Each sub-relation
holds two values, a record identifier and the attribute itself. A pointer table
is used to identify the record to which each attribute belongs within a page.
DSM aims at the exploitation of locality at the attribute level. This strategy
is useful when only a small number of attributes have to be fetched from each
record. However, a buffer page per attribute has to be maintained in the
Buffer Pool, and the cost of reconstructing a record is rather high. As
shown in [22], DSM deteriorates significantly when more than 4 attributes of
the same relation are involved in a query. To alleviate this problem,
the proposal in [217] performs a clustered partitioning of each relation based
on an attribute affinity graph, placing together attributes that often appear
together in queries.
As an alternative to the previous two models, the Partition Attributes
Across [22] model partitions records vertically within each page. The objective
is that each record is stored entirely within one page, while the instances of each
attribute are clustered in mini-pages. Each mini-page holds a presence bit for
fixed-size attributes and a pointer for variable-size attributes.
PAX aims at exploiting record and attribute locality at the same time.
By storing in a page the same data as NSM, but keeping each attribute in
separate mini-pages, PAX solves the cache hierarchy problem and, at the same
time, reduces the number of pages to be maintained in the Buffer Manager.
This gives a performance similar to that of NSM when external-memory
queries are executed, but significantly outperforms NSM and DSM for in-memory
queries [22].
Fig. 14.6. Example of a B-tree of depth three: a root indexing node (key 35), a
second level of indexing nodes (keys 17, 20 and 50, 52), and the leaf nodes holding
the indexed keys.
The use of B-trees in DBMSs has the following important features regard-
ing locality:
– The page structure of a B-tree raises the question of what pages of the
B-tree should be kept in main memory to minimize I/O.
– Each record access through a B-tree implies a number of page accesses
equal to the depth of the B-tree, h, plus one access to the page where the
record is stored in the database relation. In the example shown in Fig. 14.6,
there would be 4 page accesses.
The larger the number of records per indexing node, the smaller the po-
tential depth of the B-tree. A possible means of incrementing the number
of records per indexing node is compression, both in the case of fixed size
attributes and in the case of variable sized attributes.
– The complexity of the search for an attribute instance in a physical page
depends on the length and size variability of the attributes stored in a
page. Therefore, searching is also a key feature and may depend on how
compression is implemented.
In this section, we only focus on compression of the indexing attributes
[514, 349]. The first feature, locality at page level, is a Buffer Manager re-
sponsibility and is explained in Sect. 14.8.
The third feature, searching for an attribute instance among the set of
items of a node, can be performed with a binary search algorithm. Binary
search is discussed in [85].
We discuss a few compression techniques that have been proposed in the
literature. Those techniques were designed for variable length keys, but in
some cases may also be applied to fixed length keys. Fig. 14.7, which shows
the uncompressed original attribute vector on its left hand side, helps in the
explanation of these techniques:
– When the keys of a node have a set of common most significant bits, a
common prefix of y bits may be factored out from the attributes, as shown
in Fig. 14.7.a. In this case, the rest of the attribute will be smaller and the
binary search will be faster [97].
– When there are almost no common most significant bits, it is possible to
extract a few bits of each key, say z, forming a prefix vector so that compar-
isons can be made on smaller attribute sizes, as shown in Fig. 14.7.b [514].
A tie is resolved by searching on the attribute extensions that have the
same value in the prefix vector.
– A hybrid between the common prefix and the vector prefix can be imple-
mented to save comparisons. This may be done for long indexing keys that
have a common set of most significant bits plus a few different second most
significant bits. This is not shown in Fig. 14.7.
– The Vertical Partitioning approach, where the indexing attributes and
pointers are separated into two vectors, can also be considered, see Fig.
14.7.c. This can be applied to any of the above compression techniques,
and leads to cache conscious algorithms since spatial locality can be ex-
posed during the binary search on the attribute instances [638].
Fig. 14.7. Key compression in B-tree nodes, starting from a vector of x-bit attributes:
(a) a common prefix of y bits is factored out, leaving x−y-bit attributes; (b) a prefix
vector of z bits per key is kept next to the x−z-bit attribute extensions; (c) Vertical
Partitioning keeps the attributes and the pointers in separate vectors.
Other aspects that can be taken into account to improve the data locality of
main-memory B-trees include data alignment with cache lines and the
implementation of B-trees based on the cache line size [194, 349].
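As an illustration of the first (common prefix) technique described above, the sketch below factors the common prefix out of the sorted keys of a node and performs the binary search on the remaining, shorter suffixes. Representing keys as byte strings, and the helper names, are simplifying assumptions of the sketch.

    import os
    from bisect import bisect_left

    def compress_common_prefix(keys):
        """Factor the common prefix out of the sorted keys of a node."""
        prefix = os.path.commonprefix(keys)
        return prefix, [k[len(prefix):] for k in keys]

    def search(prefix, suffixes, key):
        """Binary search in a prefix-compressed node; comparisons are made
        on the shorter suffixes only."""
        if not key.startswith(prefix):
            return None                       # the key cannot be in this node
        suffix = key[len(prefix):]
        pos = bisect_left(suffixes, suffix)
        return pos if pos < len(suffixes) and suffixes[pos] == suffix else None

    prefix, suffixes = compress_common_prefix(
        [b"employee_ada", b"employee_bob", b"employee_eve"])
    print(search(prefix, suffixes, b"employee_bob"))   # -> 1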
The usual way for the Access Methods module to communicate with the
Executor module is record by record, as already stated. However, it is possible
to communicate by groups of records, which increases the locality at the
different levels now discussed:
– For a group of n records, there is only one Access Methods routine call,
as opposed to the n routine calls that would be necessary in the case of a
record by record access.
– For sequential accesses to relations, locality increases because consecutive
records are accessed in an atomic operation, thus increasing spatial locality.
A routine “get next record(Relation, n)” would be necessary.
– For accesses to indexed relations, locality also increases because the index is
traversed only once for a group of record reads. This increases the temporal
locality of the indexing nodes and the spatial locality of the leaf nodes.
In this case, two kinds of calls are necessary. First, when the indexed relation
is traversed completely in the order established by the index, a routine like
“get next indexed record(Attribute, Relation, n)”, with the same added parameter
as above, would return a block of n records. Second, when a
direct access is required to a set of records with the same value of the indexing
attribute, a routine like “get next indexed records for value(Attribute,
attr value, Relation, n)” would return a block of at most n records.
Grouping at the Access Methods layer may be used for the Blocking,
Run Generation and Horizontal Partitioning techniques. In these three cases,
Grouping allows the three techniques to process records in groups at the
Executor layer reducing the number of routine calls in Scan operations.
One last comment is necessary: let us recall that an index is intended for
the consecutive access of records that are not sorted by the indexing attribute
in the actual relation. In this case, blocking can be of special interest at the
Access Methods layer, because it may save additional I/O if the access is divided
into the following phases. First, n accesses to the index structure are performed
to collect the RIDs of the wanted records. Second, the collected RIDs are sorted.
Third, the records are accessed in the sorted RID order. With this strategy,
and provided that records with similar indexing attribute values are somewhat
clustered in physical pages, there may be an additional I/O reduction.
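The three phases can be sketched as follows; index_lookup and fetch_page are placeholders for the index traversal and the Buffer Manager page access, and a RID is assumed to be a (page identifier, slot) pair.

    def fetch_by_sorted_rids(keys, index_lookup, fetch_page):
        """Blocked index access: collect the RIDs through the index, sort them,
        then fetch the records in page order so that records clustered on the
        same page cause only one page read."""
        rids = [index_lookup(k) for k in keys]     # phase 1: n index accesses
        rids.sort()                                # phase 2: sort the RIDs
        records, current_id, page = [], None, None
        for page_id, slot in rids:                 # phase 3: sorted page order
            if page_id != current_id:
                page = fetch_page(page_id)         # at most one read per page
                current_id = page_id
            records.append(page[slot])
        return records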
A first approach is to consider the Buffer Pool as a single global area where the
different users¹¹ compete for memory pages. Global strategies have only one
replacement policy associated with the whole pool. We discuss some basic
replacement strategies in Sect. 14.8.2.
A second approach is to consider that the type of page or file is critical
for the locality behavior of the reference stream. Hence, the Buffer Pool is
structured in local pool areas that are exercised by different users.
A third approach is to consider the transaction as a key feature in local-
ity exploitation. Similar to the previous case, the Buffer Pool is separated
in different local areas, one per transaction. The areas are managed in an
independent way. This approach may use any of the two previous approaches
to manage the local transaction space. Therefore, we only concentrate on the
first two approaches.
There are three important aspects to the local strategies. First, the page
replacement strategy to be used for the local pool of each user. This is the
same problem as for the global strategies, but in this case a different page re-
placement strategy may suit each local buffer area better. Second, how many
pages are assigned to each local area and whether those pages are assigned
statically or dynamically. Finally, in the case of dynamic page assignation,
what must be done in the case of page starvation by one user. These aspects
are covered in Sect. 14.8.3.
One final aspect to be considered is how the Buffer Pool interacts with
the operating system. This is addressed in Sect. 14.8.4.
The management of the pages stored in the Buffer Pool requires a control of
(i) the pages that are fixed by the upper layers, and (ii) the pages that are
not being used and can be substituted by newly referenced pages.
The list or lists of pages being used and the list of candidates to be
replaced can be implemented in one data structure combining both types of
lists or in several data structures, one for each list. The data structures used
may be chained lists or hash structures.
For the following explanation, the victim pages are those that are not fixed by
a transaction when a referenced page has to displace a page from the Buffer
Pool.
The Random replacement algorithm takes one page at random as victim
for replacement. This strategy only requires a structure that maintains all
the possible victims.
The FIFO algorithm takes as victim the page with the first reference of
all those pages present in the buffer pool. This algorithm requires a list of all
the page descriptors sorted by the historic reference order. The page at the
top of the list is the one to be substituted if it is not fixed.
The LRU strategy takes the page that has been Least Recently Used
(referenced) as victim. This algorithm requires a linked list of page descriptors
so that when a page is unfixed, it is placed at the bottom of the list. When
a victim is necessary, the first page in the list is chosen.
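A minimal sketch of a Buffer Pool with fix/unfix primitives and LRU replacement is given below; the ordered dictionary plays the role of the linked list of page descriptors, and the sketch assumes that an unfixed page is always available when a victim is needed. All names are illustrative.

    from collections import OrderedDict

    class LRUBufferPool:
        """Toy Buffer Pool: a page becomes a replacement candidate when its
        fix count drops to zero, and the least recently unfixed page is the
        victim (as in the linked-list description above)."""

        def __init__(self, slots, read_block):
            self.slots = slots                 # number of page frames
            self.read_block = read_block       # Storage Manager read routine
            self.pages = {}                    # page_id -> [data, fix_count]
            self.lru = OrderedDict()           # unfixed pages, oldest first

        def fix(self, page_id):
            if page_id not in self.pages:
                if len(self.pages) >= self.slots:
                    victim, _ = self.lru.popitem(last=False)   # LRU victim
                    del self.pages[victim]
                self.pages[page_id] = [self.read_block(page_id), 0]
            self.lru.pop(page_id, None)        # a fixed page cannot be a victim
            self.pages[page_id][1] += 1
            return self.pages[page_id][0]

        def unfix(self, page_id):
            self.pages[page_id][1] -= 1
            if self.pages[page_id][1] == 0:
                self.lru[page_id] = True       # goes to the bottom of the list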
The LRU-K strategy [591] modifies the LRU strategy in the following
way. The victim page is the one whose backward K-distance is the maximum
over all the pages in the Buffer Pool, where the backward K-distance of a page
is the distance back to the Kth most recent reference to that page. As explained
in [591], this strategy requires a significant amount of memory to store the
K-distance history of all the pages in the buffer, and has the added problem of
having to traverse the vector of page distances to find the maximum distance
among the candidate victims. The data structures of LRU-K can be improved
to reduce data manipulation, as shown in [437] for the 2Q algorithm.
The LFU strategy takes the page with the smallest number of references
as victim. This requires a set of counters to keep track of the number of
references to the pages of the database. The strategy can be implemented in
two different ways. First, by only keeping track of the pages that are present
in the Buffer Pool. Second, by keeping track of all the pages including those
that have been cached in the Buffer Pool in the past.
Apart from the previous algorithms, the OPT and A0 algorithms are used
as a reference by several authors [277, 591]. The OPT algorithm replaces the
page with the longest forward reference distance. This is the optimum al-
gorithm, but it is impossible to implement because the knowledge of future
references would be needed. The A0 algorithm replaces the page whose proba-
bility of access is lowest. It is impossible to know this probability beforehand,
but it can be estimated using the previous reference behavior.
314 Josep-L. Larriba-Pey
The CLOCK strategy makes use of a used bit per page to give a second
chance to a candidate victim page. The used bit of a page is set to one when
the page is referenced. Candidate pages are found in a round robin traversal
of all the page descriptors. Finding a used bit equal to one forces a reset
to zero. The first page found with a used bit equal to zero is the victim of
choice.
The CLOCK strategy can be combined with the Random, FIFO and
LFU strategies. Every time a page is referenced, the bit is set to one. When a
victim has to be found, the list of potential victims is traversed in the order
determined by the strategy (instead of the round robin traversal), displacing
a page as explained above.
The CLOCK strategy is similar to the LRU strategy by virtue of the fact
that when a page is used, a recency bit gives it a second chance. Note that
the use of the CLOCK strategy combined with the LRU strategy makes little
sense.
The GCLOCK strategy generalizes the CLOCK strategy to the use of
a used counter instead of a used bit. In this way, different types of pages
can increment the used counter in different quantities, depending on how
important it is to keep them in main memory.
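A GCLOCK-style victim selection can be sketched as follows; capping the counter at one on every reference turns the sketch into plain CLOCK. The frame array and the weights are illustrative assumptions.

    class GClockReplacer:
        """GCLOCK victim selection: a round-robin hand decrements the used
        counter of unfixed frames and evicts the first one whose counter is
        already zero (the CLOCK second chance)."""

        def __init__(self, num_frames):
            self.counter = [0] * num_frames    # used bit (CLOCK) or counter (GCLOCK)
            self.fixed = [False] * num_frames
            self.hand = 0

        def reference(self, frame, weight=1):
            self.counter[frame] += weight      # important page types may use weight > 1

        def victim(self):
            """Return the frame to replace; assumes some frame is unfixed."""
            while True:
                f = self.hand
                self.hand = (self.hand + 1) % len(self.counter)
                if self.fixed[f]:
                    continue
                if self.counter[f] == 0:
                    return f                   # first unfixed frame with counter 0
                self.counter[f] -= 1           # second chance: decrement and go on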
Correlation of References
One important aspect of the reference stream is that references are usually
bursty [503]. The burstiness of references comes from the fact that, when a
record is referenced, for instance, a few of its attributes may be referenced in
a short lapse of time.
One possibility is to consider the references made to the same page in a
certain lapse of time as a single reference [634]. The work presented in that
paper is based on the LRU replacement strategy, and although it is proposed
for operating systems’ file managers, it can be useful in DBMSs.
The Buffer Pool may be structured in separate pool areas that are exercised
by different users. The pool areas are managed in different ways and employ
different replacement policies depending on the needs of each user.
The global replacement strategies do not take into account that different
data structures and database operations have different access patterns on the
database pages. A very good account of different local behavior strategies can
be found in [199]. In this section, we explain two of the strategies described in
that paper, the Hot Set algorithm, proposed in [646], and the DBMIN Buffer
Management algorithm, proposed in [199].
The Hot Set algorithm relies on a model with the same name. This model
calls hot sets those groups of pages that show a looping behavior within a
query. If all the hot sets of a query fit in the Buffer Pool, its processing will
be efficient, because all the pages on which the references iterate will stay in
memory.
For instance, the hot set of a Nested Loop Join algorithm would be the
number of pages of the inner relation plus one for the outer relation.
The Hot Set algorithm provides an independent set of buffer slots for each
query. Each of these sets is managed with the LRU replacement strategy,
and the number of slots is decided according to the Hot Set model of the
specific query. The number of slots assigned to a query varies dynamically as
a function of the need for new pages.
On the other hand, the DBMIN algorithm relies on the Query Locality Set
Model (QLSM). The QLSM classifies the page reference patterns based on
the reference behavior of basic database operations such as the Nested Loop
Join or the Index Scan. DBMIN makes use of different replacement strategies
depending on the type of pattern shown by the operation. The partitioning
scheme of the algorithm dynamically assigns the amount of Buffer Pool slots
needed at each moment.
The QLSM distinguishes among three types of reference patterns, Sequen-
tial, Random and Hierarchical. Sequential references are those performed by
(i) a sequential Scan, (ii) local re-scans,¹² and (iii) the Nested Loop Join.
Random references are those performed in independent accesses to a relation
to obtain (i) a single record or (ii) a clustered set of records. Finally, Hierar-
chical references are caused by index structures and include those performed
in the access to (i) a single record through an index, (ii) an Index Scan, (iii)
a clustered Scan after an index traversal and (iv) a Join where the inner
relation is indexed.
The DBMIN algorithm allocates and manages buffers on a per file instance
basis. The buffer pages linked to a file instance are referred to as the locality
set of that file instance. When a requested page is in the Buffer Pool but in
the locality set of a different file instance than that requested, the page is
given to the requester but remains in the locality set of its host. If it is not
in memory or it is in memory but not assigned to a locality set, it is assigned
to the requester’s locality set. Each locality set is assigned a replacement
strategy based upon the type of reference pattern incurred. For instance,
when a file is scanned with a Nested Loop pattern, its locality set is managed
with a Most Recently Used replacement policy, which takes as candidate for
replacement the last referenced page in the locality set. Additionally, when a
file is scanned through an index, the root of the index is the only page worth
keeping in memory because the fan-out of an index node is usually large.
¹² Like those performed by a Merge Join when records within the inner relation
are repeatedly scanned and matched with those that have the same value for the
joining attribute of the outer relation.
In this section, we use the results of the papers mentioned above to compare
the different replacement strategies described in this chapter. We compare all
the algorithms because they all have the same objective, i.e. the improvement
of locality in the Buffer Pool of a DBMS.
The following conclusions can be extracted from those papers:
– Among the FIFO, Random, LRU, CLOCK, and GCLOCK global replace-
ment strategies, the LRU and the CLOCK strategies have a more satisfac-
tory overall behavior [277].
– Among the LRU-K strategies, LRU-2 offers the best performance/cost trade-off.
It behaves better than the plain LRU strategy and does not have a significantly
worse performance than the LRU-3 strategy, which requires additional memory
space [591].
– Factoring out recent references improves the LRU strategy in Unix oper-
ating system environments [634].
– Compared to FIFO, Random, CLOCK, and the Hot Set algorithm, DBMIN
obtains the best results for different query mixes and degree of data sharing
among those mixes [199].
– The previous points lead to the conclusion that replacement strategies that
specialize in the type of access pattern, such as DBMIN, may lead to better
locality of reference exploitation.
Apart from the basic, static, buffer replacement policies explained above,
there are dynamic policies, based on Data Mining [733], to adaptively capture
the best replacement strategy depending on the most recent page references,
and policies based on queuing models [286].
by the OS, and then it will be written back to memory by the Buffer Manager
of the DBMS. This causes one extra read and one extra write of a page just to bring M
to the Buffer Pool.
There are different solutions to this problem [191]:
1. One possibility is to displace a page different from M from the DBMS
Buffer Pool. This implies that the Buffer Manager and the Virtual Mem-
ory Manager have to communicate in some way to avoid double paging
of those pages referenced solely with the purpose of replacement. How-
ever, this does not ensure the complete elimination of the double paging
anomaly.
2. Another possibility is to make sure that the complete Buffer Pool is
always resident in main memory. This can be achieved by (i) assuring
enough main memory so that there is no need to page out pages from
the Buffer Pool, (ii) making the Buffer Pool small enough so that it fits
in main memory, or (iii) fixing into main memory the Buffer Pool pages
from the Virtual Memory Manager itself.
3. A final approach would be to use the files that contain the database re-
lations as the external storage of the Virtual Memory. This would be
called a memory mapped system. With this system, and some commu-
nication between the Buffer Manager and the Virtual Memory Manager,
it is possible to avoid the double paging anomaly.
The role played by the hardware of modern computers is important for in-
creasing the locality of DBMSs. Different research works have focused on
ways of improving the exploitation of locality on in-memory DBMSs.
Among different aspects, the cache size and, more importantly, the cache
line size play a significant role both in PostgreSQL [730] and in Oracle [91]
for Decision Support Systems workloads. In some cases, the execution time of
complex queries can be penalized by more than 20%. In general, the optimum
cache line size ranges from 64 to 128 bytes.
Simultaneous Multithreading¹³ (SMT) processors have also been analyzed
in the DBMS context [510]. In this case, the results show that an execution
time improvement of more than 35% can be achieved with SMTs compared
to superscalar processors. The reason for this is the large amount of low level
parallelism that can be found in DBMSs, and the ability of SMTs to exploit
this type of parallelism and to hide the memory latency.
¹³ The ability of one processor to execute several program threads simultaneously.
One final consideration is how the compiler can collaborate to expose the
locality of DBMSs. Results in this area show that code-reordered DBMSs
running Decision Support Systems workloads achieve, with 32 Kbyte first-level
instruction caches, results as good as those of non-reordered codes with
instruction caches of 128 Kbytes or larger [572].
The optimizations that help to improve the performance of DBMS codes
reorganize the binary in two different ways. First, groups of instructions that
are executed in sequence are stored together. Second, the most frequently
executed instructions are stored in a special part of the binary in such a way
that no other instructions conflict with them in the instruction cache [627].
14.11 Summary
In this chapter, we have explored the structure and the different means of ex-
ploiting locality in Relational Database Management Systems. This has been
done for read queries on large data sets arising from the Data Warehousing
and Decision Support Systems areas.
DBMSs are complex codes with a layered structure. The Engine of a
DBMS is divided into the Executor, the Access Methods, the Buffer Man-
agement, and the Storage Management layers. We concentrated on the first
three layers and analyzed different means of exploiting data locality.
The Executor layer performs operations at a record level. The pipelined
execution of queries is implemented in this layer and has a direct influence
on the exploitation of locality for basic operations like Scan, Join, Aggregate
and Sort. We have seen that techniques like Blocking or Horizontal Parti-
tioning can achieve significant reductions in I/O and memory traffic in this
layer. These techniques provide significant improvements in the different
implementations of the Join operation: Nested Loop Join (Blocking) and Hash
Join (Hybrid Hash Join, Hash Teams).
The Access Methods module provides records to the Executor layer. The
Access Methods module is capable of managing the physical structure of the
data and index pages in a DBMS. By using variants of Vertical Partitioning, it
is possible to organize data files in such a way that I/O is reduced and locality
at the memory hierarchy is exploited better. In terms of index structure, a
key aspect is compression and how the index nodes of B-trees are structured
to reduce the search time for attributes.
The Buffer Management layer provides the pages that store data and
index structures to the Access Methods layer. The Buffer Manager has the
aim of reducing the number of page misses for a given amount of memory to
manage. We have seen that there are different page replacement techniques
that range from global policies to local policies. We have also explained that
the local replacement policies are more effective than the global policies in
reducing the miss ratio.
Finally, we have given hints about how the hardware and base software
influence the execution of DBMSs. On one hand, compilers can restructure
the code of DBMSs so that more instruction locality can be obtained. On
the other hand, the hardware may be enhanced (larger cache lines/blocks) so
that more locality can be extracted from Database Codes running Decision
Support Systems and Data Warehousing workloads.
Acknowledgements
The author would like to thank students Josep Aguilar, Roque Bonilla, Daniel
Jiménez-González, and Victor Muntés for their discussions on Database Man-
agement; Carlos Navarro, Alex Ramirez, and Xavi Serrano for the semi-
nal work they did in the Database group; Calisto Zuzarte, Adriana Zubiri,
and Berni Schiefer for their discussions on DBMS structure; and Christian
Breimann, Peter Sanders, and Martin Schmollinger for their useful comments
to improve this chapter.
15. Hierarchical Models and Software Tools for
Parallel Programming
Massimo Coppola and Martin Schmollinger
15.1 Introduction
Hierarchically structured architectures are becoming more and more pervasive
in the field of parallel and high-performance computing. While memory
hierarchies have been recognized for a long time, only in recent years have
hierarchical parallel structures gained importance, mainly as a result of
the trend towards cluster architectures and the high-performance application of
computational grids.
The similarity between the issues of managing memory hierarchies and
those of parallel computation has been pointed out before (see for instance
[213]). It is an open question whether a single, unified model of both aspects
exists, and whether it is theoretically tractable. Correspondingly, a programming en-
vironment which includes support for both hierarchies is still lacking. We
thus need well-founded models and efficient new tools for hierarchical paral-
lel machines, in order to connect algorithm design and complexity results to
high-performance program implementation.
In this chapter we survey theoretically relevant results, and we compare
them with existing software tools and programming models. One aim of the
survey is to show that there are promising results with respect to the theoreti-
cal computational models, developed by merging the concepts of bulk-parallel
computational models with those from the hierarchical memory field. A sec-
ond goal is to investigate whether software support has been realized, and what
is still missing, in order to exploit the full performance of modern high-
performance cluster architectures. Even in this case, solutions emerge from
combining results of a different nature, those employing hardware-provided
see that both at the theoretical level and on the application side, combination
of techniques from different fields is often promising, but still leaves many
open questions and unresolved issues.
The chapter is organized in three parts. The first one (Sect. 15.2) describes
the architectural background of current parallel platforms and supercomput-
ers. The basic architectural options of parallel architectures are explained,
showing that they naturally lead to a hierarchy concept associated with the
exploitation of parallelism. We discuss the technological reasons for, and fu-
ture expectations of current architectural trends.
The second part of the chapter gives an overview of parallel computational
models, exploring the connection among the so-called parallel bridging models
and external memory models, here mainly represented by the parallel disk
model (PDM) [754]. There is a similarity between problems in bulk parallelism
and block-oriented I/O. Both techniques try to efficiently exploit locality in
mapping algorithmic patterns to a hierarchical structure. We discuss the
issues of parallel computation models in Sect. 15.3. In Sect. 15.4 we get to
discuss parallel bridging models. We survey definitions and present some
models of the class. We describe their extensions to hierarchical parallelism,
and survey results on emulating their algorithms using sequential and parallel
external-memory models. At the end of Sect. 15.4 we describe two results that
exploit parallel hierarchical models for algorithm design.
The third part of the chapter shifts toward the practical approach to hi-
erarchical architectures. Sect. 15.5 gives an overview of software tools and
programming models that can be used for program implementation. We con-
sider libraries for parallel and external-memory programming, and combined
approaches. With respect to parallel software tools, we focus on the exist-
ing approaches which support hierarchy-aware program development. Sec-
tion 15.6 summarizes the chapter and draws conclusions.
In the following, we assume the reader is familiar with the basic concepts of
sequential computational architectures. We also assume the notions of process
and thread¹ are known. Table 15.1 summarizes some acronyms used in the
chapter.
Parallel architectures are made up from multiple processing and memory
units. A network connects the processing units and the memory banks (see
Fig. 15.1a). We refer the reader to [484], which is a good starting point to
understand the different design options available (the kind of network, which
modules are directly connected, etc.). We only sketch them here due to lack
of space.
Efficiency of communication is measured by two parameters, latency and
bandwidth. Latency is the time taken for a communication to complete, and
bandwidth is the rate at which data can be communicated. In a simple world,
these metrics are directly related. For communication over a network, how-
ever, we must take into account several factors like physical limitations, com-
munication startup and clean-up times, and the possible performance penalty
from many simultaneous communications through the network. As a very gen-
eral rule, latency depends on the network geometry and implementation, and
bandwidth increases with the length of the message, because of the lesser
impact of the fixed startup and clean-up times.
¹ Largely simplifying, a process is a running program with a set of resources, which
includes a private memory space; a thread is an activity within a process. A
thread has its own control flow but shares resources and memory space with
other threads in the same process.
Fig. 15.1. Generic structure of parallel architectures: processing units (P) with
caches (C) and memory modules (M), connected through a processor-to-memory
interconnection and/or an interprocessor network.
1. Worst-case latency is that of reading a block from memory into the external
cache. In the IA32 architecture, a cache line is 64 bytes.
2. Processors share the same memory bus through separate external caches. A
shared memory communication implies at least a L2 cache fault. The worst-
case accounts for other issues like acquiring hardware and software locks, and
thread scheduling.
3. From slow Ethernet up to Gbit Ethernet and Myrinet [744].
4. We do not include here using in parallel several disks, and multiple network
interfaces per node.
5. Multi-Gigabit geographic networks are already being built, but wide area networks
(WANs) have higher latencies [309, sec. 21.4].
Further network interconnections, ranging from local area networks to geographic
ones, add more levels to the hierarchy, with different communication bandwidths
and latencies [309, chapter 2].
Summing up, in modern parallel architectures we have the following hi-
erarchy of memory and communication layers.
– shared memory
– distributed memory
– local area network
– wide area network
Each one of these layers may exhibit hierarchical effects, depending on its
implementation choices.
The effects of the parallel hierarchy on latency and bandwidth are similar
to, and combine with, those of the ordinary memory hierarchy. A crucial observation
is that there is no strict order among the levels of these two hierarchies
that we could easily exploit to build a unitary model. For instance, we can
see in Table 15.2 that the communication layers (both shared memory and
distributed memory based ones) provide a bandwidth lower than main mem-
ory, and in some cases lower than that of local I/O. However, their latency is
usually much lower than that of mechanical devices like disks. Different ac-
cess patterns thus lead to different relative performances of communication
and I/O.
Assessing the present and future characteristics of the parallel hierarchy
[193] and devising appropriate programming models to exploit it are among
the main open issues in modern parallel/distributed computing research.
We have already seen in this book that the classical random access machine
(RAM) sequential model, the archetype of the von Neumann computer, does
not properly account for the cost of memory access within a hierarchy
of memories. The PDM model [754], and more complex multi-level compu-
tational models have been developed to increase prediction accuracy with
respect to the practical performance of algorithms.
The same has happened in the field of parallel algorithms. The classical
parallel random access machine (PRAM) computational model is made up
of a number of sequential RAM machines, each one with its local memory.
These “abstract processors” compute in parallel and can communicate by
reading and writing to a global memory which they all share. Several precise
assumptions are made to keep the model general.
– Unlimited resources: no bound is put on the number of processors, the size
of local or global memory.
– Parallel execution is fully synchronous; all active processors always complete
one instruction in one time unit.
– Unitary cost of memory access, both for the local and global memory.
– An unlimited number of simultaneous operations on the global memory is
allowed (though same-location collisions are forbidden).
These assumptions are appropriate for a theoretical model. They make it possible
to disregard the peculiarities of any specific architecture, and make the PRAM
an effective model in studying abstract computational complexity. However,
they are not realistic for the majority of physical architectures, as practical
bandwidth constraints, network traffic constraints and locality effects are
completely ignored.
Considering the issues discussed in Sect. 15.2, we see that real MIMD
machines are much more complex. Synchronous parallel execution is usually impossible on modern parallel computers, as is ensuring constant, uniform memory access times independent of machine size, amount of exchanged data and exploited parallelism. Indeed, optimal PRAM complexity is often misleading with respect to real computational costs.
Several variants of the PRAM model have been devised with the aim of
reconciling theoretical computational costs with real performance. They add
different kinds of constraints and costs on the basic operations. A survey on
these derived models is given in [354]. We do not even discuss the research
on communication and algorithmic performance of models which use a fixed
network structure (e.g. a mesh or hypercube). Despite the results on network
cross-simulation properties, network-specific algorithms are often too tied to
the geometry of the network, and show a sub-optimal behavior on other kinds
of interconnection.
We focus on a different research track, which started in recent years and involves the class of parallel bridging computational models.
Fig. 15.2. BSP symbols and parameters (left): p, the number of processors; L, the message latency / synchronization cost; g, the cost parameter for message routing. The BSP abstract architecture (right).
by barriers (i.e. all units must complete a superstep before any of them can proceed to the next superstep), we can estimate the length of a superstep in time units as w_t + g · h_t + L. This value becomes an upper bound if we
assume instead that synchronization is only enforced when actually needed
(e.g. before receiving a message).
We can analyze a BSP algorithm by computing w_t and h_t for each superstep. If the algorithm terminates in T supersteps, the local work W = Σ_t w_t and the communication volume H = Σ_t h_t of the algorithm lead to the cost estimate W + g · H + L · T. A sounder evaluation compares the performance of the algorithm with that of the best known sequential algorithm. Let T_seq be the sequential running time; we call c-optimal a BSP algorithm that solves the problem with W = c · T_seq/p and g · H + L · T = o(T_seq/p) for a constant c.
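As an illustration only (this sketch is ours, not part of the original text, and all names are assumptions), the following C function turns per-superstep work and h-relation sizes into the BSP cost estimate above.

#include <stdio.h>

/* Our sketch of the BSP cost estimate W + g*H + L*T.
 * w[t] and h[t] are the local work and the h-relation size of
 * superstep t; g and L are machine parameters. */
double bsp_cost(int T, const double *w, const double *h,
                double g, double L)
{
    double W = 0.0, H = 0.0;
    for (int t = 0; t < T; t++) {   /* W = sum of w_t, H = sum of h_t */
        W += w[t];
        H += h[t];
    }
    return W + g * H + L * (double)T;
}

int main(void)
{
    /* made-up numbers for two supersteps, just to show the call */
    double w[] = { 1e6, 5e5 }, h[] = { 1e3, 2e3 };
    printf("estimated cost: %.0f time units\n",
           bsp_cost(2, w, h, 4.0, 1e4));
    return 0;
}

Dividing the returned estimate by T_seq/p would then show how far a given algorithm is from c-optimality.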
Other parallel bridging models have been developed. Among them we
mention the CGM model, which is the closest to the BSP, the LogP and the
QSM models, which we describe below.
The Coarse-Grained Multicomputer. The coarse-grained multicomputer
(CGM) model [245] is based on supersteps too. The communication phase in
CGM is different from that of BSP, as it involves all processors in a global
communication pattern, and O(n) data are exchanged at each communica-
tion phase. In Fig. 15.3b the different patterns are the f, g, s functions. Only
two numeric parameters are used, the number of nodes p and the problem
size n. Each node has thus O(n/p) local memory.
In the original presentation the model was parametric, as the network
structure was left essentially unspecified. The communication phases were
allowed to be any global pattern (e.g. sorting, broadcasts, partial sums) which
could be efficiently emulated on various interconnection networks. To get the
actual algorithmic cost, one should substitute the routing complexity of the
parallel patterns on a given network (e.g. g(n, p) may be the complexity of
exchanging O(n/p) keys in a hypercube of diameter log_2 p).
The challenge in the CGM model is to devise a coarse-grain decomposition
of the problem into independent subproblems by exploiting a set of “portable”
global parallel routines. The best algorithms will usually require the smallest
possible number of supersteps. Over the years, in common use CGM has been simplified and has become close to BSP. In recent works [243, 244], the network geometry is no longer considered. CGM algorithms are defined as a special class of BSP algorithms, with the distinguishing feature that each CGM superstep employs a routing relation of size h = Θ(n/p).
LogP Model. In the LogP [233] model, processors communicate through
point-to-point messages, ignoring the network geometry like in BSP. Unlike
BSP and the other PBMs, there are no supersteps in LogP.
LogP models physical communication behavior. It uses four parameters:
l, an upper bound on communication latency; o, the overhead involved in a
communication; g, a time gap between sending two messages; and the physical
parallelism P . Messages are considered to be of small, fixed length, thus
introducing the need to split large communications. There are two flavors
of the model, stalling LogP, which imposes a network capacity constraint (a
processor can have no more than l/g messages in transit to it at the same
time, or senders will stall), and non-stalling LogP, which has no constraint.
Because of the unstructured and asynchronous programming model, the need to split messages into packets, and the capacity constraint, algorithm design and analysis with LogP are more complex than with the other PBMs. There are comparatively fewer results for LogP, even if most basic algorithms (broadcasts, summing) have been analyzed.
QSM Model. The queuing shared memory (QSM) [330] model can be seen
both as a PRAM evolution and as a shared-memory variant of the BSP. Like
a PRAM, a set of processors with private memories communicate by means
of a shared memory. Like in the BSP, QSM computation is globally divided
into phases. Read and write operations are posted to the shared memory, and they complete at the end of a phase. Concurrent reads or writes (but not both) to a memory location are allowed.
Each processor must also perform a certain amount of local computation within each phase. The cost of each phase is defined as max(m_op, g · m_rw, κ), where m_op is the largest amount of local computation in the phase, m_rw is the largest number of shared reads and writes from the same processor, and the gap parameter g is the overhead of each request. Latency is not explicitly considered; it is replaced by the maximum contention κ of the phase, i.e. the maximum number of colliding accesses on any location in that phase.
A large number of algorithms designed for variants of the PRAM can be
easily mapped on the QSM.
A Comparison of Parallel Bridging Models. Several results about emu-
lation among different parallel bridging models can be found in the literature.
Emulations are work-preserving if the product p · t (processors times execution time) on the emulating machine is the same as that on the machine being emulated, to within a constant factor. Work-preserving emulations typically increase the amount of parallel slackness (the emulating machine has fewer
processors than the emulated one), and are characterized by a certain slowdown. The slowdown is O(f) when we are able to map an algorithm running in time t on p processors to one running on p′ ≤ p/f processors in time t′ = O(t · (p/p′)). An ideal slowdown of 1 means that the emulation introduces at most a constant factor of inefficiency. Table 15.3 summarizes some asymptotic slowdown results taken from a recent survey by Ramachandran [622]. The fact that a collection of work-preserving emulations with small slowdown exists suggests that these models are to a good extent equivalent
in their applicability as cost models to real parallel machines.
However, some of the bridging models are better suited than others for the
role of programming models, as a more abstract view of the algorithm struc-
ture and communication pattern allows easier algorithm design and analysis.
From this point of view, LogP is probably the hardest PBM to use. It
leads to difficult, low-level analysis of communication behavior, and thus it
has been rarely used to evaluate complex algorithms. The QSM can be used
to evaluate the practical performance of many existing PRAM algorithms,
but is a low-level model too. QSM is a “flat” model, which disregards the
hierarchical structure of the computation, and it has an abstract but fine-
grain approach to communication cost.
The bulk parallel models (BSP and CGM) have been used more exten-
sively to code parallel algorithms. They proved to be easier to use when
designing algorithms, and actually several software tools have been designed
to directly implement BSP algorithms. In the same direction there are even more simplified models, like the one used in [690]. Aggregate computation cost is measured in terms of total work, total network traffic (the sum of message sizes) and total number of messages. Thus the three "weights" of these operations are the parameters of this model. It can be seen as a flat, close relative of BSP and CGM, and at least a class of algorithms based on computation phases with limited imbalance can be analyzed using this model.
Both the CGM and extensions of the original BSP model make it possible to represent hierarchically structured networks. Finally, the concepts of parallel
slackness, medium-grain parallelism and supersteps have been exploited to
develop efficient emulation of BSP and CGM algorithms in external memory,
showing the connections between the design of parallel algorithms and that of external-memory algorithms.
Fig. 15.4. Development path of parallel bridging models and their relationship with hierarchical memory models
Extensions of the BSP model have been proposed to account for (i) effects due to message length and (ii) the relationship between network size and parallel overhead in communications.
The BSP* Model. In real interconnection networks, communication time
is not independent of message length. The combination of bandwidth con-
straints, startup costs and latency effects is often modeled as an affine function of message length. BSP disregards this aspect of communication.
The number of exchanged messages roughly measures the congestion effects
on the network.
Counting non-local accesses is a first-order approximation that has been
successfully used in external memory models like the PDM. On the other
hand, PDM uses a block size parameter to measure the number of page I/O operations. To model this practical efficiency constraint for real communications, the BSP* model [75] was introduced in 1996.
BSP* adds a critical block size parameter b, which is the minimum size of data for a communication to fully exploit the available bandwidth. The cost function for communications is modified to account both for the number h_t of message start-ups in a phase, and for the communication volume s_t (the sum of the sizes of all messages). Each message is charged a constant overhead, and a time proportional to its length in blocks. Superstep cost is defined as w_t + g · (s_t/b + h_t) + L, often written as w_t + g* · (s_t + h_t · b) + L, where g* = g/b. The effect on performance evaluation is that algorithms that pack information when communicating are still rewarded as in BSP, but high communication volumes and long messages are not. Thus the BSP* model explicitly promotes both block-organized communications and a reduced amount of data transfers, in the same way external memory models do for I/O operations.
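A tiny C sketch of this cost function may make the role of the block size explicit; it is our own illustration, with parameter names chosen to mirror the text.

/* Our sketch of the BSP* superstep cost w_t + g*(s_t/b + h_t) + L,
 * where s_t is the total message volume of the superstep, h_t the
 * number of message start-ups and b the critical block size. */
double bsp_star_superstep(double w_t, double s_t, double h_t,
                          double g, double b, double L)
{
    return w_t + g * (s_t / b + h_t) + L;
}

For a fixed volume s_t, the cost decreases as messages are packed into fewer, block-sized start-ups, which is exactly the behavior the model rewards.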
D-BSP Model. The BSP model inherits from the PRAM the assumption
that model behavior is independent of its size. While we can easily account
for a general behavior by changing parameter values (e.g. adding processors
to a bus interconnection leads to larger values of g and L), there is no way
we can model more complex situations where the network properties change
according to the part of it that we are using. This is an intentional trade-off
of the BSP model, but it can lead to inaccurate cost estimates in some cases.
We mention two examples.
– Networks with a regular geometry, like meshes or hypercubes, can behave
quite differently if most of the communication traffic is local, as compared
to the general case.
– Modern clusters of multiprocessors and multiple-level interconnections cannot be properly modeled with any single choice of g and L, as shared-memory and physical message-passing communications over the different kinds of connections imply very different bandwidths and overheads.
De La Torre and Kruskal [240] introduce the decomposable BSP (D-BSP),
which rewards locality of computation by allowing hierarchical decomposition
Once per superstep, the computational context of each virtual processor is loaded into main memory in turn. A computational context consists of the memory image and message buffers of a virtual processor. Local computation and message exchange are emulated before switching to the next processor.
Efficient simulation requires picking the right number v of virtual BSP pro-
cessors with respect to the emulating machine. By implementing communi-
cation buffers using external memory data structures, we can derive efficient
external memory algorithms from a subclass of BSP algorithms, those that
require limited memory and communication bandwidth per processor. There
are interesting points to note.
– the use of a bandwidth gap G parameter, measuring the ratio between the
instruction execution speed and the I/O bandwidth,
– the introduction of a notion of x-optimality, close to that of c-optimality,
which relates the number of I/O operations of a sequential algorithm with
those of an emulated parallel algorithm for the same problem,
– the fact that BSP* messages have a cost depending on their length in blocks helps in determining a relationship between the parallel algorithm and its external-memory emulation.
Parallel emulation. In [242] the emulation results hold under more general
assumptions. To evaluate the cost of parallel emulation, the EM-BSP* model
is defined. EM-BSP* is a BSP* model extended with a secondary memory
which is local to the processing nodes, see Fig. 15.5. Alternatively, we can see
it as a PDM model augmented with a BSP* interconnection and a superstep
cost function. In addition to the L, g, b, p parameters of BSP*, we also find the local memory size M, the number (per processor) of local disks D, the I/O block transfer size B (which is borrowed from the PDM model) and the computation-to-I/O capacity ratio G, as in the simpler simulation.
The emulation of BSP* algorithms proceeds by supersteps, but each em-
ulating processor loads from disk a set of the virtual processors (with their
needed context data) at the same time, instead of a single one. The emulation
procedure can run sequentially in external memory, or in parallel, where the
emulating machine is modeled using EM-BSP*. A reorganization algorithm
is provided to perform BSP message routing in the external memories using
an optimal amount of I/O.
Like in [692], the result in [242] exploits the BSP* cost function to simplify
the emulation algorithm. BSP*, BSP and CGM algorithms (by reduction to
BSP*) can be emulated if they satisfy given bounds on message sizes and on the memory used by the processors.
The c-optimality criterion is refined, taking into account I/O, computa-
tion and communication time of the emulated algorithm. We thus have a
metric to compare EM-BSP* algorithms with the best sequential algorithms
known.
Fig. 15.5. The common structure of the combined parallel and external-memory
models, and a summary of parameters used in the models, beyond those from BSP.
In this section, we briefly survey two works which exploit mixed models of
computation to develop parallel-external memory algorithms. The first work
pre-dates most of the results we have previously presented. Aggarwal and
Plaxton [16] define a multi-level storage model, made up of a chain of hyper-
cubes of increasing dimension 0 ≤ d ≤ a. A set of four primitive operations
is defined on such a structure, which includes a scan operation, two routings
and a shift of sub-cube data. A hypercube of dimension b > a is then made up
of the smaller ones, using bounded-degree networks to connect them through
a subset of their nodes. The networks are supposed to compute prefix and
scan operations in O(b) = O(log p) time. Due to its complexity, the model
was never further developed despite a promising result on sorting.
Apart from the details of the model and of the sorting algorithm, it is interesting to note how [16] defines a parallel, hierarchical data space on which to compute. The authors choose a set of primitive operations that can be practically implemented both in external memory and in parallel.
This choice allows a certain degree of flexibility in choosing which levels of
the computation to map to external memory, and which ones to perform in
parallel.
Newer approaches, following the path of Fig. 15.4 (page 333), are based
on external-memory extensions of BSP-like models. Dehne and others [246]
Software tools of this first kind manage only two levels of a hierarchy. Since the management of parallelism and that of I/O are in principle completely separate,
these tools can be combined within the same environment to exploit archi-
tectures which correspond to the EM parallel models of Sect. 15.4.3 and
15.4.4. The work in the two separate fields of external memory and paral-
lel programming is mature enough to have already produced some widely
recognized standards.
Parallel Programming Libraries. There are two main parallel program-
ming paradigms, which fit the two extremes of the MIMD architectural class,
the distributed memory paradigm and the shared memory paradigm.
In the message passing paradigm each process has its local data, and
it communicates with other processes by exchanging messages. This shared
nothing approach corresponds to the abstraction of a DM-MIMD architec-
ture, if we map each process to a distinct processor.
In the shared-memory programming paradigm, all the data is accessible
to all processes, hence this shared-everything approach fits perfectly the SM-
MIMD class of architectures. The programmer, however, has to take care of race conditions in order to avoid deadlocks or inconsistencies.
For both paradigms, there is one official or de facto standard library, re-
spectively the message-passing interface (MPI) standard, and the OpenMP
programming model for shared memory programming. In both, MPI and
OpenMP, possible hierarchies in the parallel target machine are not consid-
ered. They assume independent processors, either connected by an intercon-
nection network or by a shared memory. Of course, there are approaches to
incorporate hierarchy sensitive methods in both libraries. We will present
some of them in Section 15.5.3.
Message-Passing-Interface MPI. In 1994, the MPI-Forum unified the most
important concepts of message-passing-based programming interfaces into
the MPI standard [547]. The current, upward compatible version of the stan-
dard is known as MPI-2 [548], and it specifies primitive bindings for languages
of the C and Fortran families.
In its simplest form, an MPI program starts one process per processor on
a given number of processors. Each process executes the same program code,
but it operates on its local data, and it receives a rank (a unique identifier)
during the execution, that becomes its address w.r.t. communications. Sub-
ject to the rank, a process can execute different parts of the program. This
single program multiple data (SPMD) model of execution actually allows a
generic MIMD programming model.
There are MPI implementations for nearly all platforms, which is the prerequisite for program portability. Key features of the MPI standard include the following, together with the I/O features described on page 342.
Point-to-point communication: The basic MPI communication mechanism is the exchange of messages between pairs of endpoint processes, regardless of the actual network structure that delivers the data. One process initiates a send operation and the other process has to start a receive operation in order to start the data transfer (a minimal C sketch is given after this list).
Several variants of the basic primitives are defined in the standard, which
differ in the communication protocol and the synchronous/asynchronous
behavior. For instance, we can choose to block or not until communica-
tion set-up or completion, or to use a specific amount of communication
buffers.
These different options are needed both to allow optimized implementa-
tion of the library and to allow the application programmer to overlap
communication and computation.
Collective operations: Collective communications involve a group of pro-
cesses, each one having to call the communication routine with matching
arguments, in order for the operation to execute.
Well-known examples of collective operations are the barrier synchro-
nization (processes wait for each other at a synchronization point), the
broadcast (spreading a message to a group of processes) or the scan op-
eration.
One-sided Communications: With one-sided communication, all communication parameters, for both the sender and the receiver side, are specified by one process, thus avoiding explicit intervention of the partner in the
communication. This kind of remote memory access separates communi-
cation and synchronization. Remote write, read and update operations
are provided this way, together with additional synchronization primi-
tives.
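The minimal C sketch below is our own illustration of the first two feature classes: a point-to-point exchange followed by a collective broadcast and a barrier. Only standard MPI calls are used; the program itself is made up.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                        /* point-to-point send */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                 /* matching receive */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* collective */
    MPI_Barrier(MPI_COMM_WORLD);            /* synchronization point */
    printf("rank %d sees value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}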
OpenMP. The OpenMP-API [593] is a standard for parallel shared memory
programming based on compiler directives. Directives are a way to param-
eterize a specific compiler behavior. They preserve program semantics, and
have to be ignored when unknown to a compiler. Thus they are coded as
#pragma statements in C and C++, and are put within comments in For-
tran. OpenMP directives allow the programmer to mark parallel regions in a sequential program. This approach facilitates an incremental parallelization of sequential programs.
The sequential part of the code is executed by one thread (the master thread) that forks new threads as soon as a parallel region starts and joins them at the end of the parallel region (fork-join model). OpenMP provides the following types of directives (a minimal example is given after the list).
Parallelism directives mark parallel regions in the program.
Work sharing directives within a parallel region divide the computation
among the threads. An example is the for/DO directive (each thread
executes a part of the iterations of the loop).
Data environment directives control the sharing of program variables that
are defined outside a parallel region (e.g. shared, private and reduction).
Synchronization directives (barrier, critical, flush) are responsible for syn-
chronized execution of several threads. Synchronization is necessary to
avoid deadlocks and data inconsistencies.
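The following C program is our own minimal example of these directive classes: a combined parallelism and work-sharing construct with a reduction data-environment clause; the implicit barrier at the end of the loop provides the synchronization.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* "parallel for" combines a parallelism directive with a
     * work-sharing directive; "reduction" is a data environment
     * clause; the implicit barrier at the end of the loop is the
     * synchronization point */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;

    printf("harmonic sum = %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}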
where the data is stored. Since the number of threads is not known at com-
pile time, calls to a data mapping runtime library are inserted to compute
loop bounds and data that must be communicated. In this setting, the data
is stored locally and check codes can be removed.
Collective communication optimization. Inter-node communication is neces-
sary to implement a reduction operation on variables defined in the data scope
attribute of a parallel region. It can be performed efficiently using a collective
communication library. The execution starts after the local reduction at the
end of parallel regions or after work-sharing directives.
Distributed OpenMP. A different approach to adapt OpenMP to SMP
clusters is suggested in [546]. The authors propose the distributed OpenMP.
This extension of OpenMP with data locality features provides a set of new
directives, library routines and environment variables. One data-distribution extension is the distribute directive, with which it is possible to partition an array over the node memories. For performance reasons, the threads should work on local array elements. Hence, the user must distribute the data in order to minimize remote data accesses.
on home directive in a parallel region. With this directive, it is possible to
perform a parallel loop over a distributed array without redistributing the
array. The threads of a node perform the iterations for the array elements
that reside in their local memory. Further extensions are library routines and environment variables that provide specific characteristics of the run-time instance of the SMP cluster, such as the number of nodes involved or the number of processors per node. Disadvantages are that programs get more complex, and the user has to take care of an efficient data decomposition. Since more information is provided to the compiler, after adding the new directives to an OpenMP program a redesign step and a performance-tuning phase have to be performed.
Hybrid Programming with MPI and OpenMP. The idea of the hybrid
programming model is to use message passing between the SMP nodes, and
shared memory programming inside the SMP nodes. The structure of this
model fits the architecture exactly; therefore, the model has the potential to produce programs with significant performance improvements. But it is also obvious that the model is more complicated to use, and that unpredicted performance problems may arise because of the simultaneous use of the two programming models. There are several possibilities for choosing libraries
for each model, but it is straightforward to combine the de facto standards
MPI and OpenMP. In the following we give an overview of the different ap-
proaches to the production of hybrid programs, with no emphasis on technical
details. We also survey some performance evaluations that compare hybrid
programs with pure MPI ones.
The general execution scheme uses one process in each node to handle communications by means of MPI primitives. Inside the process, multiple threads compute in parallel. The number of threads in a node is equal to the
number of processors in that node. The basis for the design of an efficient hybrid program is an efficient MPI program. According to [171], there are two approaches to incorporating OpenMP directives into MPI programs: the fine-grain and the coarse-grain approach.
Fine-Grain Parallelization. The hybrid fine-grain parallelization is done in-
crementally. The computational part of an MPI program is examined, and the
loop nests are parallelized with OpenMP directives. Therefore, the approach
is also called loop-level parallelization. Clearly, the loops must be profiled,
and only loop nests with a significant contribution to the global execution
time are selected for OpenMP parallelization.
Some loop nests cannot be parallelized directly. If their cost is not negligible, the developer can try to transform them into parallelizable loops. Techniques like loop exchange, loop permutation, and the introduction of temporary variables can often avoid false sharing and reduce the number of synchronizations.
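The C sketch below (ours, with a made-up array and problem size) illustrates the fine-grain style: the node-local loop of an MPI process is parallelized with an OpenMP directive, while communication stays in MPI, outside the parallel region.

#include <mpi.h>
#include <stdio.h>

#define N 1000000
static double a[N];

int main(int argc, char **argv)
{
    int provided, rank;
    double local = 0.0, global = 0.0;

    /* only the main thread calls MPI, so FUNNELED support suffices */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* loop-level (fine-grain) OpenMP parallelization of the
     * node-local computation */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = (double)(i + rank);
        local += a[i];
    }

    /* communication is still expressed with MPI */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}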
Performance of Fine-Grain Hybrid Programs. In [171, 172, 173, 200], investigations measuring the performance of fine-grain hybrid programs are presented. A comparison is shown of the performance achieved by a hybrid and a pure MPI version of the NAS benchmarks [86] on an SMP cluster. An impor-
tant subject of [171, 172, 173, 200] is the interpretation of the performance
measurements, in order to understand the behavior of the hybrid programs
and their performance. Experiments were made on a PC-based SMP clus-
ter with two processors per node and on IBM SP cluster systems with four
processors per node.
The comparison between the two kinds of models for SMP clusters shows no general advantage of one over the other. Depending on the characteristics of the application, some benchmarks perform better with the hybrid version, others with the pure MPI version. The following aspects influence the performance of the models.
Level of shared memory parallelization. The larger the fraction of the total computation that can be parallelized, the more interesting the hybrid approach becomes. The size of the parallelized sections (OpenMP) compared to the whole computation section must be significant.
Communication time. It depends on the communication pattern of an ap-
plication, and on the differences between the two models concerning latency,
bandwidth, and synchronization time. If more processes share one network interface, the latency of network accesses increases, but the aggregate bandwidth obtained per node increases too. If there is only one process per node, the latency is low, but a single process cannot transfer data fast enough to the network interface to fully exploit the maximum network bandwidth. Therefore, the pure MPI approach performs better if the application is bandwidth limited, and it is worse for latency limited applications.
Memory access patterns. The memory access patterns are different for the
two models. Whereas MPI allows multi-dimensional blocking to be expressed,
OpenMP does not. To achieve the same memory access patterns, rewriting
of loop nests is necessary, which may be very complex.
Performance balance of the main components. The balance among processors, memory and network can offset the communication/computation tradeoff. If the processors are so fast that communication becomes the bottleneck, then the actual communication pattern decides which model is best. If, on the other hand, computation is the bottleneck, then pure MPI always seems to be the best choice.
Coarse-Grain Parallelization. In this approach, a single program multiple data style is used to incorporate OpenMP threads into MPI programs. OpenMP is used to spawn threads immediately after the initialization of the MPI processes in the main program. Each thread then acts similarly to an MPI process. For threads, there are several issues to consider (a minimal sketch is given after the list):
– The data distribution between the threads is different from that of MPI
processes. Because of the shared memory, it is only necessary to calculate
the bounds of the arrays for each thread. There has to be a mapping from
array regions to threads.
– The work distribution between the threads is made according to the data distribution. Instead of an automatic distribution of the iterations, calculations of the loop boundaries depending on the thread number define the schedule.
– The coordination of the threads means managing critical sections, either with OpenMP directives like MASTER, or with thread library calls like omp_get_thread_num() used to construct conditional statements.
– Communication is still done by only one thread.
As far as we know, the coarse-grain approach has been proposed, but there
are no results yet. We can compare it with TOMPI, as both methods convert
MPI processes to threads. However, TOMPI programs on SMP clusters do
not share data structures common to all the processes, as they would do in
a coarse-grain parallelization.
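The C sketch below is our own illustration of this coarse-grain style (names and sizes are hypothetical): threads are spawned once, compute their own loop bounds from their thread number, and only the master thread calls MPI.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N];

int main(int argc, char **argv)
{
    int provided, rank;
    double local = 0.0, global = 0.0;

    /* threads are created once, right after MPI initialization */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* each thread derives its own array bounds, SPMD style */
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        int lo  = (int)((long long)N * tid / nth);
        int hi  = (int)((long long)N * (tid + 1) / nth);
        double mysum = 0.0;

        for (int i = lo; i < hi; i++) {
            a[i] = (double)(i + rank);
            mysum += a[i];
        }

        #pragma omp atomic              /* node-local reduction */
        local += mysum;

        #pragma omp barrier
        #pragma omp master              /* communication by one thread only */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        #pragma omp barrier             /* make the result visible to all */
    }

    if (rank == 0)
        printf("global sum = %f\n", global);
    MPI_Finalize();
    return 0;
}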
High-Level Programming Models. Besides the programming libraries
and paradigms above, there are some programming models for SMP clusters
that try to build a higher level of abstraction for the programmer. All these
models are based on the hybrid programming paradigm where threads are
used for the internal computation and message passing libraries are used to
perform communication between the nodes.
SIMPLE Model. The significant difference between SIMPLE [82] and the
manual hybrid programming approach above lies in the provided primitives
for communication and computation.
The computation primitives comprise data parallel loops, control primi-
tives to address threads or nodes directly, and memory management primi-
tives.
Data parallel loops: There are several parallel loop directives for executing
loops concurrently on one or more nodes of the SMP cluster, assuming
no data dependencies. The loop is partitioned implicitly among the threads without the need for explicit synchronization or communication between processors. Both block and cyclic partitioning are provided.
Control: With this class of primitives, it is possible to control which threads
are involved in the computation context. The execution of a block of code
can be restricted to one thread per node, all threads in one node, or to
only one thread in the SMP cluster.
Memory management: A heap for dynamic memory allocation is managed in
each processing node, and can be used by the threads of that node via
the node malloc and node free primitives.
SIMPLE provides three libraries for communication. There is an inter-node-communication library, an SMP node library for thread synchronization, and a SIMPLE communication library built on top of both. The SMP node library implements the three primitives reduce, barrier and broadcast. It is based on POSIX threads. Together with the functionality of the inter-
node-communication library, it is possible to implement the primitives bar-
rier, reduce, broadcast, allreduce, alltoall, alltoallv, gather, and scatter that
are assumed to be sufficient for the design of SIMPLE algorithms. The use
of these top-level primitives means using message passing between nodes and
shared memory communication within the nodes.
Hybrid-Parallel Programming with High Performance Fortran. High Perfor-
mance Fortran (HPF ) is a set of extensions to Fortran that enables users
to develop data-parallel programs for architectures where the distribution of
data impacts performance. Main features of HPF are directives for data distri-
bution within distributed memory machines and primitives for data parallel
and concurrent execution. HPF can be employed on both distributed memory
and shared memory machines, and it is possible to compile HPF programs
on SMP clusters. However, HPF does not provide primitives or directives
to exploit the parallel hierarchy of SMP clusters. Most HPF compilers just
ignore the shared memory within the nodes and treat the target system as a
distributed memory machine.
One exception is presented in [106]. Therein, HPF is extended with the
concept of processor mappings and the concept of hierarchical data mappings.
With these two concepts, it is possible for the programmer to consider the
hierarchical structure of SMP clusters. A product of this approach is the
Vienna Fortran Compiler [105]. It creates fine-grain hybrid programs using
MPI and OpenMP, starting from programs in an enriched HPF syntax.
Processor mappings: Besides the already existing abstract processor array that is used as the target of data distribution directives, abstract node
arrays are defined. Together with an extended version of the distribute
directive it is possible to construct the structure of an SMP cluster.
have their own data layout and data motion plan. Three classes of KeLP2 programming abstractions help to manage these mechanisms.
1. The Meta-Data represents the abstract structure of some facet of the
calculation. It describes the data decomposition and the communication
patterns.
2. Instantiators execute the program according to the information contained
in the meta-data.
3. The primitives for parallel control flow are iterators, which iterate over
all nodes, or over all processors of a specified node.
When comparing KeLP2 with SIMPLE, the latter provides lower-level
primitives, does not support data-decomposition, and does not overlap com-
munication and computation. KeLP2 is of narrower scope concerning the
application domain, but it nevertheless enables a parallel specification that
is less dependent on the implementation.
15.6 Conclusions
It is clear from Sect. 15.2 that we need theoretical models and new software
tools to fully exploit hierarchical architectures like large clusters of SMP and
future Computational Grid super-clusters.
The interaction among parallel bridging models and external memory
models has produced several results, which we surveyed in Sect. 15.3, 15.4.
The exploitation of locality effects in these two classes of models employs
very similar solutions, that involve block-oriented cooperation and abstract
modeling of the hierarchical structure. The intuition underlying theoretical
and performance results on bulk parallel models, and their theoretical lesson,
is that the simple exploitation of fine-grain parallelism at the algorithm level
is not the right way to obtain portable parallel programs in practice.
However, there is still a lot of work to do in order to meet the need for
appropriate computational models. Hierarchical-parallel models like those of
Fig. 15.5 are already close to the structure of modern SMP clusters, and
they are relatively simple to understand, yet algorithms can be quite hard
to analyze. Composed models usually employ the full set of parameters of
their parallel part, those of their memory part, and at least one more to assess the relative cost of I/O and communication operations. For the BSP
derivatives we have described, this leads to seven or eight parameters.
Excessive complexity of the analysis and poor intuitive understanding are limiting factors for the diffusion of computational models, as was pointed out in [213]. There is no answer yet to the questions "can all these parameters
be merged in some synthesis?” and “what are the four or five most important
parameters?”
For these reasons the impact of sophisticated parallel computational mod-
els is still limited, while simple disk I/O models like PDM have been quickly
Systems like SIMPLE and KeLP2 are close to this research path. Abstract,
high-level operations simplify program writing, while still providing the tools
with information about the best mapping to the architecture hierarchy.
Acknowledgments
16.1 Introduction
We analyze two different radix sort based solutions on the SGI Origin 2000 parallel computer: the Straight Forward Parallel version of radix sort, SF-Radix sort, and the Communication and Cache Conscious Radix sort, CCC-Radix or C3-Radix [430]. SF-Radix sort is not a very efficient algorithm, but it is a very good algorithm for didactic purposes. C3-Radix sort is a memory
conscious algorithm at both the sequential (cache conscious) and the parallel
(communication conscious) levels. C3-Radix shows a good speed-up¹ of 7.3 for 16 processors and 16 million keys, while previous work only achieved a speed-up of 2.5 for the same problem size and setup [430]. Furthermore, the
sequential algorithm used in C3-Radix has proved to be faster than previous
algorithms found in the literature [433].
The data sets we use to test our algorithms are N records formed by one
32-bit integer key and one 32-bit pointer.² The keys we sort are generated
at random with a uniform distribution and have no duplicates. The parallel
algorithms explained here may not be efficient for data distributions with
duplicates and/or with skew. In this case, we recommend reading [431, 432]
which focus on the specific problem of skew and duplicates.
In order to motivate this chapter, we advance some results in Figs. 16.1
and 16.2 using the visualization and analysis tool Paraver [604]. We use this
tool throughout this chapter in order to show the executions of the different
implementations of the sorting problem. Each window in Figs. 16.1 and 16.2
shows the time spent in sorting 16 million records on 8 R10000 processors
(one processor per horizontal line) of the SGI Origin 2000. The execution
time in those figures (as well as in the Paraver based figures below) starts
at the first small flag above each horizontal line and ends with the last flag.
Note that all the processors start at the same time and end at different times.
The results in Fig. 16.1 are for SF-Radix sort (picture at the top) and
for C3-Radix sort (picture at the bottom) using MPI. The most important
aspect of that figure is that the parallel algorithm chosen has a significant
impact on the overall performance. Data communication is shown with the
lightest shaded areas between marks in the figure. In both cases, the 2.00
s vertical dotted line falls in the middle of the first data communication.
We can see that SF-Radix sort performs four communication steps, while
C3-Radix sort performs just one.
The results in Fig. 16.2 are for two different implementations of C3-Radix
sort using OpenMP. We can see that the implementation at the top takes
more time than that at the bottom. However, both implementations only
differ slightly; we have laid special emphasis on the memory aspects of the
1. We measure the speed-up as the ratio between the execution time of the fastest sequential algorithm at hand and the parallel execution time.
2. We sort key-pointer records because, for instance, in database applications, sorting is done on records that may be very large. Therefore, to avoid moving large sets of data, the key is extracted from the record, a pointer is created to that record and they are both copied to a new vector of key and pointer tuples that is sorted by the key [588, 687].
Fig. 16.1. Comparison of two MPI implementations of SF-Radix sort (top) and
C3-Radix sort (bottom)
fastest algorithm at the bottom of the figure. Therefore, we can say that a
memory conscious implementation of an algorithm may also have a significant
impact on the performance. Thus, Figs. 16.1 and 16.2 give insight into the
results obtained by the techniques addressed in this chapter.
The chapter is structured as follows. In Sect. 16.2, we discuss the main
computer architecture aspects that affect the cost of memory accesses or com-
munication. In Sect. 16.3, we discuss and analyze the SF-Radix sort algorithm
for shared and distributed memory machines. Then, as a better solution for
the same problem, the shared and distributed versions of C3-Radix sort are
discussed and analyzed in detail in Sect. 16.4. In both Sects. 16.3 and 16.4,
we discuss different implementations of the algorithms to understand the in-
fluence of some implementation details on the exploitation of data locality,
and thus, on the performance of those implementations. Finally, in Sect. 16.5,
we set out our conclusions.
16.2.1 SM MIMD
In order to reduce the memory latency in shared memory systems, it is necessary to have several levels of cache memories forming a complex memory
hierarchy. From the point of view of a processor, there is a hierarchy of lo-
cal cache memories and a hierarchy of remote cache memories. This causes
some cache coherence problems among the cache hierarchies of the different
processors.
16.2.2 DM MIMD
The algorithms we analyze and compare in this chapter are tested on the SGI Origin 2000 at CEPBA³.
The SGI O2000 is a directory based NUMA parallel computer [234, 498].
It allows both the use of the shared and distributed memory programming
models. The O2000 is a distributed shared memory (DSM) computer, with
cache coherence maintained via a directory-based protocol.
The building block of the SGI O2000 is the 250 MHz MIPS R10000 pro-
cessor that has a memory hierarchy with private on-chip 32Kbyte first level
data and instruction caches and external 4Mbyte second level combined in-
struction and data cache.
One node card of the SGI O2000 is formed by 2 processors sharing a 128 Mbyte memory with a 720 MB/s peak bandwidth. Groups of 2 node cards are linked
to a router that connects to the interconnection network. In our case, the
interconnection network is formed by 4 routers connected as a 160GB/s global
peak bandwidth 2-d hypercube network. This hypercube has two express links
between the groups of two routers that are not connected directly.
3. CEPBA stands for "Centre Europeu de Paral.lelisme de Barcelona". More information on CEPBA, its interests and projects can be found at http://www.cepba.upc.es.
Fig. 16.3. Example for radix sort for 2 decimal digit keys on the O2000 machine
The radix sort algorithm is a simple sequential algorithm that can easily be
parallelized. We first explain the sequential radix sort, and then we present
two straight forward parallel solutions of radix sort, one for shared memory
and another one for distributed memory computers.
The idea behind radix sort is that b-bit keys are sorted on a per digit basis, iterating from the least significant to the most significant digits. The b bits of a key can be grouped forming m digits of b_i consecutive bits, where Σ_{i=0..m−1} b_i = b.
To sort the set of keys for each digit, radix sort may use any sorting
method. Our explanation relies on the counting algorithm [460] for that pur-
pose. Alg. 1 is an example of this version of radix sort.
The counting algorithm performs three steps for each digit: count, accu-
mulation and movement steps, which correspond to the three code sections
within the for loop starting at line 1 of Alg. 1.
Next, we explain the procedure of the radix sort algorithm for the sorting
of the least significant digit of a 2-decimal digit key with the help of Alg. 1,
and the example in Fig. 16.3. The processing of this digit corresponds to
iteration 1 of Alg. 1 and the left hand side of Fig. 16.3.
First, the count step computes a histogram of the possible values of the
least significant digit of vector S, on vector C (lines 3 to 6). In the example,
it is necessary to have a vector with 10 counters, one for each possible value
3: for i = 0 to N − 1 do
4: value ← get digit value(S[i], dig)
5: C[value] ← C[value] + 1
6: end for
7: tmp ← C[0]
8: C[0] ← 0
9: for i = 1 to nbuckets − 1 do
10: accum ← tmp + C[i − 1]
11: tmp ← C[i]
12: C[i] ← accum
13: end for
14: for i = 0 to N − 1 do
15: value ← get digit value(S[i], dig)
16: D[C[value]] ← S[i]
17: C[value] ← C[value] + 1
18: end for
(0 to 9) of one digit (in general, 2^{b_i} counters for b_i-bit digits). The counters show that, for instance, the least significant digit of the keys has 6 and 2 occurrences of values 1 and 7, respectively.
Second, the accumulation step computes a partial sum of the counters (lines 7 to 13). We show this sum in a different instance of vector C, labeled accum, in Fig. 16.3.
Third, the movement step reads vector S, writing each key to vector D
using the values of vector C (lines 14 to 18). During that process, 10 buckets
(in general, 2^{b_i} for a b_i-bit digit), one for each possible value of this digit,
are formed. In the example of Fig. 16.3, we show two different situations of
vector C after moving the first 9 keys, and after moving all the keys from
vector S to vector D. Subsequently, vectors S and D exchange their role (line
19). Then the same procedure is performed on the most significant digit of
the key during iteration 2. The final sorted vector is also shown in Fig. 16.3.
In general, note that after sorting digit k, we say that the N keys are sorted with respect to their Σ_{i=0..k−1} b_i least significant bits.
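For concreteness, the following C sketch (ours, not the tuned implementation of [433]) shows the count, accumulation and movement steps for 32-bit keys split into four 8-bit digits.

#include <string.h>
#include <stdint.h>

/* Our sketch of LSD radix sort with counting: m = 4 digits of
 * b_i = 8 bits each, i.e. 256 buckets per digit. */
void radix_sort(uint32_t *S, uint32_t *D, size_t N)
{
    enum { BITS = 8, NBUCKETS = 1 << BITS };
    size_t C[NBUCKETS];

    for (int dig = 0; dig < 4; dig++) {
        memset(C, 0, sizeof C);
        for (size_t i = 0; i < N; i++)              /* count step */
            C[(S[i] >> (dig * BITS)) & (NBUCKETS - 1)]++;

        size_t sum = 0;
        for (int v = 0; v < NBUCKETS; v++) {        /* accumulation step */
            size_t c = C[v];
            C[v] = sum;
            sum += c;
        }

        for (size_t i = 0; i < N; i++) {            /* movement step */
            uint32_t v = (S[i] >> (dig * BITS)) & (NBUCKETS - 1);
            D[C[v]++] = S[i];
        }

        uint32_t *tmp = S; S = D; D = tmp;          /* swap roles of S and D */
    }
    /* after an even number of digits, the sorted keys end up back in
     * the caller's original S vector */
}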
A discussion about how to improve this sequential algorithm with the
objective of exploiting data locality can be found in Chapter 8. Here we use
the fastest version known to us [433]. However, we focus on its parallelization,
and we distinguish between the shared and distributed memory versions of
the algorithm.
The shared parallel version of radix sort is very similar to the sequential
version adding some parallelization directives. Alg. 2 is an OpenMP version
of SF-Radix sort. The parallel directives used are quite intuitive and we do
not explain them here. More information can be found in [595].
Fig. 16.4 shows the parallel procedure for sorting the same previous ex-
ample with two processors. Processors P0 and P1 work with the white and
grey vector parts of vector S and D respectively in each iteration of Alg. 2,
lines 2 to 20 of the code. For instance, for the least significant digit, each
processor Ppid :
1. Performs the initialization of its counters in the global counter matrix Gl
(lines 3 to 5).
2. Computes a histogram (lines 6 to 10) of the number of keys for each
possible value of the least significant digit. Then, all processors synchro-
nize in a barrier before the next step. The omp for directive in line 6 of
the algorithm inserts a barrier after the corresponding piece of code by
default.
3. Computes the local accumulation vector LCacc with the partial sum call
(using the global counters Gl) as follows:
Fig. 16.4. Example of the shared memory SF-Radix sort for 2 decimal digit keys. LCacc_pid indicates the local accumulation vector of processor Ppid.
LCacc[i] = Σ_{ip=0..P−1} Σ_{j=0..i−1} Gl[j][ip] + Σ_{ip=0..pid−1} Gl[i][ip] ,   0 ≤ i ≤ nbuckets − 1 .
With this formula, processor Ppid knows the first place where the first key of each of its buckets should be written. So, the first key belonging to bucket i of processor Ppid is placed in the position of vector D just after all the keys belonging to buckets 0 to i − 1, and after all the keys of the same bucket i belonging to processors P0 to Ppid−1 (a C sketch of this computation is given after the list). For instance, in the loop in lines 12 to 17, processor P1 writes key 91 in position 4 of vector D, as its LCacc_1[1] indicates in Fig. 16.4, and processor P0 writes the key of vector S with value 1 in position 1 of vector D.
4. Moves keys from its part of vector S to vector D using its local counters
LCacc (lines 12 to 17). Each processor writes keys belonging to the same
bucket in independent places of vector D. For instance, during iteration
1 of the example of Fig. 16.4, and for bucket 1, processor P0 writes keys
with values 1, 21, and 81 in positions 1 to 3 of vector D, and processor
P1 writes keys with values 91, 61, and 71 in positions 4 to 6 of the same
vector.
Before the next step, all the processors synchronize in another barrier.
One of the processors then exchanges the roles of vectors S and D (lines
18 and 19), as in the sequential algorithm. After that, all the processors
need to synchronize with a barrier (the single directive in line 18 of the algorithm inserts a barrier after the corresponding piece of code by default) before starting the sorting of the next digit.
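The C fragment below is our own sketch of step 3 of the list above, assuming the global counter matrix Gl is stored row-major with Gl[bucket][processor] indexing, as in the formula.

#include <stddef.h>

/* Our sketch: processor `pid` computes its LCacc vector from the
 * global counter matrix Gl (element (i, ip) stored at Gl[i * P + ip],
 * where i is the bucket and ip the processor). */
void compute_lcacc(size_t *LCacc, const size_t *Gl,
                   int nbuckets, int P, int pid)
{
    size_t below = 0;           /* keys in buckets 0..i-1, over all processors */
    for (int i = 0; i < nbuckets; i++) {
        size_t same = 0;        /* keys of bucket i on processors 0..pid-1 */
        for (int ip = 0; ip < pid; ip++)
            same += Gl[(size_t)i * P + ip];
        LCacc[i] = below + same;
        for (int ip = 0; ip < P; ip++)      /* add all of bucket i for the next i */
            below += Gl[(size_t)i * P + ip];
    }
}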
Analysis
Table 16.1 shows the number of L2 misses, the number of invalidations, and
the execution time per processor for sorting 16 million key-pointer records
for two different implementations: the plain implementation explained above
and a memory conscious implementation, explained below.
The results are quite significant if we consider that an L2 miss penalty on the SGI Origin 2000 may be 75 processor cycles or more.⁴ The direct cost
of an invalidation is low. However, those invalidations may lead to L2 misses,
as happens in the plain implementation. In any case, both L2 misses and
invalidations are important, since we may not have a large number of invali-
dations but rather a large number of L2 misses. With the memory conscious
implementation, the number of L2 misses and the number of invalidations
have been dramatically reduced when compared to the plain implementa-
tion. This reduction yields a significant reduction of the execution time; the
memory conscious implementation is about 8.9 times faster than the plain
implementation.
Table 16.1. L2 miss and invalidation average per processor and execution time for
sorting 16M records with 8 processors with SF-Radix sort
SF-Radix sort L2 misses Invalidations Exec. Time (s)
Plain Impl. 9,828,950 9,725,467 51.29
Memory Conscious 1,534,000 251,744 5.76
will become Gl[pid][i] in the code of Alg. 2. Another solution (the memory
conscious solution shown in Table 16.1) is to use a local counter vector per
processor, combined with the use of the global counter matrix commented
above. With those data structures we only have to modify steps 1 and 2 of
Alg. 2, as follows: Now, each processor:
1. Initializes its local counters, and
2. Locally computes a histogram of the values of the keys with its local
counters. Therefore, no false sharing occurs. Then, each processor initializes the global counter matrix with its local counters. We may have some false sharing when updating the global counter matrix, but it is much less than before.
This memory conscious solution dramatically reduces the number of L2 misses and invalidations. First, the number of accesses to the global counter matrix has been reduced from N, the number of records, to nbuckets, the number of buckets. Second, we have swapped columns and rows so that false sharing between processors is reduced when accessing the global counter matrix.
Another possible drawback comes from the fact that, when a counter row
ends in the middle of a cache line and the next counter row starts right after
it, that cache line will also be the cause of false sharing misses. In order to
overcome that drawback, it is possible to pad some useless bytes after each row, so that the next row starts in a new cache line.
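A common way to express such padding in C is sketched below. This is our own illustration; the 128-byte line size and the bucket count are assumptions, and the array base should itself be line-aligned (e.g. with posix_memalign) for the guarantee to be complete.

#include <stddef.h>

#define NBUCKETS   256   /* assumed 2^8 buckets for an 8-bit digit */
#define CACHE_LINE 128   /* assumed L2 cache line size */
#define MAX_PROCS  16

/* one counter row per processor; the pad rounds the row up past a
 * cache-line boundary so that consecutive rows never share a line
 * (a full extra line is added when the row already fits exactly) */
struct padded_counters {
    size_t count[NBUCKETS];
    char   pad[CACHE_LINE - (sizeof(size_t) * NBUCKETS) % CACHE_LINE];
};

/* memory conscious layout: Gl[pid].count[bucket] */
static struct padded_counters Gl[MAX_PROCS];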
SM Algorithm Drawbacks
We have explained two versions of SF-Radix sort where the memory conscious
implementation improves the plain one. However, we still have some potential
performance problems with this algorithm.
The main problem is that data are moved from one processor’s cache to
another every time we sort a digit. For instance, the grey keys of vector D
in Fig. 16.4 are probably in processor P1's cache at the end of iteration 1;
it has just written them. Then, after swapping the role of the vectors and
synchronizing, iteration 2 starts and processor P0 reads the white vector part
of S (which played the role of D in the previous iteration). This means that
processor P0 will probably miss in the cache hierarchy and will have to fetch
those keys from processor P1's cache. That may happen for each digit.
Another potential problem is that processors have to synchronize several
times with a barrier every time they sort a digit. The larger the number of
digits, the bigger the problem may be. Barriers may impose a significant
overhead. First, executing a barrier has some run-time overhead that grows
quickly as the number of processors increases. Second, executing a barrier
requires all processors to be idle while waiting for the slowest processor; this
effect may result in poor processor utilization when there is load imbalance
among processors [732].
3: for i = 0 to N − 1 do
4: value ← get digit value(S[i], dig)
5: LC[value] ← LC[value] + 1
6: end for
7: tmp ← LC[0]
8: LC[0] ← 0
9: for i = 1 to nbuckets − 1 do
10: accum ← tmp + LC[i − 1]
11: tmp ← LC[i]
12: LC[i] ← accum
13: end for
14: for i = 0 to N − 1 do
15: value ← get digit value(S[i], dig)
16: D[LC[value]] ← S[i]
17: LC[value] ← LC[value] + 1
18: end for
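For reference, a plain C rendering of this counting pass (histogram, exclusive prefix sum, and key distribution) is sketched below. The digit width, the helper name get_digit_value, and the use of plain 32-bit keys instead of the chapter's key-pointer records are assumptions made for illustration only.

#include <stdint.h>

#define NBUCKETS 256    /* buckets per digit (assumed 8-bit digits)          */

static inline int get_digit_value(uint32_t key, int dig)
{
    return (int)((key >> (8 * dig)) & (NBUCKETS - 1));
}

/* One counting pass that sorts S into D by digit 'dig' (lines 3-18 above). */
void counting_pass(const uint32_t *S, uint32_t *D, long N, int dig)
{
    long LC[NBUCKETS] = {0};

    for (long i = 0; i < N; i++)          /* lines 3-6: histogram            */
        LC[get_digit_value(S[i], dig)]++;

    long accum = 0;                       /* lines 7-13: exclusive prefix    */
    for (int b = 0; b < NBUCKETS; b++) {  /* sum, equivalent to the tmp-     */
        long c = LC[b];                   /* based formulation above         */
        LC[b] = accum;
        accum += c;
    }

    for (long i = 0; i < N; i++) {        /* lines 14-18: distribution       */
        int v = get_digit_value(S[i], dig);
        D[LC[v]++] = S[i];
    }
}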
tively studied: the balanced all-to-all routing [657, 688, 689]. Here, for didactic purposes, we explain four simple implementations that we believe illustrate interesting aspects of data communication.
However, the performance differences between those implementations are
not significant for SF-Radix sort. The performance differences observed in
C3-Radix sort, which are more significant, are analyzed later in Sect. 16.4.3.
Bucket by Bucket Messages. This is a straightforward implementation, where the local buckets of a processor that have to be sent to or received from another processor are sent or received one by one synchronously. Therefore, a processor cannot continue doing anything else until each data communication has completed. In addition, we have to pay an overhead cost for each message, so this data communication may be very costly if the number of messages is large. We refer to this implementation as the plain solution.
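As a rough illustration only (the chapter does not show the actual code), the plain solution could be written with blocking MPI point-to-point calls along the following lines; the bucket ownership array, the buffer layout, and the counts are our assumptions:

#include <mpi.h>
#include <stdint.h>

/* Plain solution: one synchronous message per (bucket, processor) pair.
   send[b] holds this processor's keys of global bucket b, owner[b] the
   rank that must sort bucket b, and recv[b][p] the buffer for the part of
   bucket b coming from rank p (all sizes are known from the previously
   broadcast histograms). Every message pays the fixed start-up overhead,
   and the blocking calls prevent overlapping communication with
   computation.                                                             */
void bucket_by_bucket(uint32_t **send, const int *send_cnt,
                      uint32_t ***recv, int **recv_cnt,
                      const int *owner, int nbuckets, int pid, int nprocs)
{
    for (int b = 0; b < nbuckets; b++) {
        if (owner[b] != pid) {            /* ship my piece of bucket b       */
            MPI_Send(send[b], send_cnt[b], MPI_UINT32_T,
                     owner[b], b, MPI_COMM_WORLD);
        } else {                          /* I own bucket b: collect the     */
            for (int p = 0; p < nprocs; p++)      /* pieces one by one       */
                if (p != pid)
                    MPI_Recv(recv[b][p], recv_cnt[b][p], MPI_UINT32_T,
                             p, b, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
}

The other variants mentioned in Fig. 16.7 reduce this cost by grouping all buckets destined for the same processor into one message, by sending the groups asynchronously, or by replacing the loop with a single all-to-all collective operation.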
Fig. 16.5. Example for distributed SF-Radix sort for 2 decimal digit keys (the figure shows, for processors P0 and P1, the contents of vectors S and D and of the local counters LC0 and LC1 after the count and accumulate phases of iterations 1 and 2)
Algorithm Drawbacks
The distributed memory SF-Radix sort has a serious problem related to data communication. The algorithm may keep moving a key from one processor to another at every data communication step, and there are as many data communication steps as there are digits. Because of this, SF-Radix sort exhibits very little data locality.
As shown in Fig. 16.1 in the introduction, SF-Radix sort is not a good ap-
proach for solving our sorting problem. In this section, we analyze a better
parallel sorting approach, C3-Radix sort.
We start by explaining the Cache Conscious Radix sort [433], the sequen-
tial algorithm on which C3-Radix sort is based. Then we analyze the shared
and distributed memory versions of C3-Radix sort, and discuss some results on the SGI Origin 2000. With this analysis, we want to show how the computer architecture can still influence the performance of a parallel algorithm, even once the algorithm is conscious of the architecture of the target computer.
The Cache Conscious Radix sort, CC-Radix sort, is based on the same princi-
ple as radix sort. However, it starts sorting by the most significant digits. The
objective of CC-Radix sort is to improve the locality of the key-pointer vec-
tors S and D, and the counter vector C when sorting large data sets. This is
done by optimizing the use of radix sort as explained in [433]. Other works,
such as [12] and [47], use the same idea of sorting by the most significant
digits with objectives other than exploiting data locality.
With the help of the example in Fig. 16.6 we explain CC-Radix sort. First,
the algorithm performs the three steps of the counting algorithm on the most
significant digit. With this process, called Reverse sorting in [433], keys are
distributed in such a way that vector S may be partitioned into a total of $2^{b_{m-1}}$ buckets using the $b_{m-1}$ most significant bits of the key (10 buckets in the example, because we are working with decimal digits).
Note that, after sorting by the most significant digit, keys belonging to
different buckets are already sorted among these buckets. For instance, keys 1
and 4 (bucket 0) of the example are smaller than 10, 17 and 12 (bucket 1), and
the latter are smaller than 29, 26 and 21 (bucket 2), and so on. Therefore, we only have to sort each bucket individually to obtain a completely sorted
vector. Each bucket can be sorted by any sorting algorithm; we use the plain
radix sort version mentioned above.
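A small self-contained C sketch of this scheme is given below, assuming 32-bit keys split into four 8-bit digits and plain integer keys instead of the chapter's key-pointer records; the names and the digit width are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGIT_BITS 8
#define NBUCKETS   (1 << DIGIT_BITS)
#define NDIGITS    (32 / DIGIT_BITS)

static inline int digit(uint32_t key, int d)
{
    return (int)((key >> (DIGIT_BITS * d)) & (NBUCKETS - 1));
}

/* One counting pass on digit d, sorting S into D; if off is non-NULL, the
   bucket start offsets are returned in off[0..NBUCKETS].                   */
static void counting_pass(const uint32_t *S, uint32_t *D, long n, int d,
                          long *off)
{
    long C[NBUCKETS] = {0};
    for (long i = 0; i < n; i++)
        C[digit(S[i], d)]++;
    long accum = 0;
    for (int b = 0; b < NBUCKETS; b++) {          /* exclusive prefix sum   */
        long c = C[b];
        if (off) off[b] = accum;
        C[b] = accum;
        accum += c;
    }
    if (off) off[NBUCKETS] = accum;
    for (long i = 0; i < n; i++)
        D[C[digit(S[i], d)]++] = S[i];
}

/* Plain LSD radix sort of one bucket over the digits 0 .. msd-1.           */
static void lsd_radix(uint32_t *a, uint32_t *tmp, long n, int msd)
{
    for (int d = 0; d < msd; d++) {
        counting_pass(a, tmp, n, d, NULL);
        memcpy(a, tmp, (size_t)n * sizeof(uint32_t));
    }
}

/* CC-Radix sketch: one Reverse sorting step on the most significant digit,
   then each bucket (now small enough to stay in cache) is sorted on its
   remaining digits. The fully sorted keys end up in D.                     */
void cc_radix(uint32_t *S, uint32_t *D, long N)
{
    long off[NBUCKETS + 1];
    counting_pass(S, D, N, NDIGITS - 1, off);     /* Reverse sorting        */
    for (int b = 0; b < NBUCKETS; b++)
        lsd_radix(D + off[b], S + off[b], off[b + 1] - off[b], NDIGITS - 1);
}

The sketch only shows the basic one-level structure described in the text; [433] gives the full algorithm and its refinements.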
Actually, the idea behind this is a sample sort [236] where the sample keys have been statically chosen. In that sample sort, the sample is formed by sample keys with values 00, 10, 20, etc. However, in our case, the algorithm does not construct the sample explicitly.
The algorithm is very similar to Alg. 2, but now it has only one loop iteration of the code in lines 3 to 18, which is for the most significant digit. Then, each processor individually and locally sorts a set of buckets.
Fig. 16.6. Example for CC-Radix sort for 2 decimal digit keys (the figure shows the contents of vectors S and D and of the counter vector C after the count and accumulate phases of Reverse sorting, and the final state after each bucket has been sorted)
Consequently, the number of times data is moved from one processor’s cache to another has been reduced to only one. Therefore,
the shared memory C3-Radix sort can exploit the data locality much better
than the shared memory SF-Radix sort. In addition, the number of synchro-
nization points (three per digit to be sorted) has been reduced to those for
the most significant digit. Therefore, the shared memory C3-Radix sort al-
gorithm overcomes the drawbacks of the shared memory version of SF-Radix
sort.
Analysis
Table 16.2 shows the number of L2 misses, the number of invalidations, and
the execution time for two versions of the shared memory C3-Radix sort al-
gorithm. These two versions correspond to (i) the case where a global counter
matrix Gl is used to compute the histogram (plain implementation), and (ii)
the case where the histogram is computed using a local counter vector and
then a global matrix Gl is updated (memory conscious implementation). On
the one hand, we can see that a non-memory-conscious implementation may
cause a significant loss of performance, even with a memory conscious algo-
rithm like C3-Radix sort. The memory conscious implementation is about
8.5 times faster than the plain implementation.
On the other hand, if we compare the execution times and the number
of L2 misses and invalidations of the results in Table 16.2 to those in Ta-
ble 16.1, we can see that C3-Radix sort is significantly better than SF-Radix
sort. First, for the plain implementation, C3-Radix sort is 2.35 times faster
than SF-Radix sort. The reason for this can be found in the number of L2
misses and invalidations of both algorithms. The number of L2 misses and
invalidations of C3-Radix sort are only about 39% and 45% of the L2 misses
and invalidations of the SF-Radix sort algorithm, respectively.
As for the memory conscious implementation, the number of invalidations is similar to that of SF-Radix sort. However, the number of L2 misses is about half that of SF-Radix sort, and C3-Radix sort is 2.25 times faster than the memory conscious SF-Radix sort.
Table 16.2. Average number of L2 misses and invalidations per processor, and execution time, for sorting 16 million records with 8 processors with C3-Radix sort
C3-Radix sort L2 misses Invalidations Exec. Time (s)
Plain Impl. 3,875,000 4,338,310 21.69
Memory Conscious 833,559 311,237 2.55
The algorithm is very similar to Alg. 3. As happens with the shared memory
version, it performs only one iteration to sort the most significant digit (lines
2 to 23 in the code of Alg. 3), and then each processor individually and locally
sorts a set of buckets. Each processor performs five steps. The first two steps
are the same as the first two steps of the distributed memory SF-Radix sort.
Therefore, each processor Ppid :
1. Performs the three sequential steps of the counting algorithm on the most
significant digit.
2. Broadcasts its local counters.
3. Locally computes a bucket range distribution to get an even partition of the buckets among processors (a simple greedy partition is sketched after this list). If it is not possible to achieve a good load balance, each processor can start again at step 1 using the next most significant digit, that is, another Reverse sorting step using digit $b_{m-2}$. See [430] for more details.
4. Performs the only data communication step of this algorithm, taking into account the partitioning of the previous step. After this step, each processor has a set of global buckets in its local memory.
5. Locally sorts its global buckets. Each global bucket is sorted using CC-Radix sort. Only the $b - \sum_{i=1}^{rss} b_{m-i}$ least significant bits of the keys of each bucket need to be sorted, where rss is the number of Reverse sorting steps.
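As a concrete sketch of the bucket range distribution in step 3, the following C function greedily assigns consecutive buckets to processors so that each processor receives roughly N/nprocs keys. The interface and the greedy policy are our assumptions; the actual policy of C3-Radix sort is described in [430].

/* global_cnt[b] is the total number of keys in global bucket b (the sum of
   the broadcast local histograms). On return, processor p owns the buckets
   first[p] .. first[p+1]-1; first must have room for nprocs+1 entries.     */
void bucket_range_distribution(const long *global_cnt, int nbuckets,
                               int nprocs, long N, int *first)
{
    long target = (N + nprocs - 1) / nprocs;   /* keys per processor, approx. */
    long acc = 0;
    int p = 0;

    first[0] = 0;
    for (int b = 0; b < nbuckets; b++) {
        acc += global_cnt[b];
        if (acc >= (long)(p + 1) * target && p + 1 < nprocs)
            first[++p] = b + 1;                /* next processor starts here  */
    }
    while (++p <= nprocs)                      /* remaining processors (if    */
        first[p] = nbuckets;                   /* any) get empty ranges       */
}

Because every processor computes this partition locally from the same broadcast counters, all processors obtain the same bucket ranges without further communication.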
C3-Radix sort overcomes the main drawback mentioned above for the
distributed memory SF-Radix sort. C3-Radix sort performs only one step of
communication, independently of the number of digits of the keys. Therefore,
C3-Radix can exploit the data locality of the keys when sorting each bucket.
In contrast, SF-Radix sort cannot exploit the data locality because it performs as many communication steps as there are digits.
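With an all-to-all collective (the last of the solutions compared in Fig. 16.7 below), this single communication step becomes very compact. The sketch assumes that the keys in the send buffer are already grouped by destination rank and that all counts and displacements were derived from the broadcast histograms and the bucket range distribution.

#include <mpi.h>
#include <stdint.h>

/* Single data communication step of the distributed memory C3-Radix sort:
   rank p receives send_cnt[p] keys starting at send_dsp[p] in send_buf;
   afterwards recv_buf holds all keys of the global buckets assigned to
   this processor, ready to be sorted locally with CC-Radix sort.           */
void c3_single_exchange(const uint32_t *send_buf, const int *send_cnt,
                        const int *send_dsp,
                        uint32_t *recv_buf, const int *recv_cnt,
                        const int *recv_dsp)
{
    MPI_Alltoallv(send_buf, send_cnt, send_dsp, MPI_UINT32_T,
                  recv_buf, recv_cnt, recv_dsp, MPI_UINT32_T,
                  MPI_COMM_WORLD);
}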
Analysis
Fig. 16.7. Distributed memory execution times for different solutions of C3-Radix sort (from top to bottom: bucket by bucket, grouping, grouping and asynchronously sending, and all-to-all collective operation solution)
16.5 Conclusions
In this chapter, we consider the parallel sorting problem as a case study. To
study this problem, we analyze two different algorithms, the Straight For-
ward Radix sort, SF-Radix sort, and the Communication and Cache Con-
scious Radix sort, C3-Radix sort. We also explain different implementation
techniques to improve the memory hierarchy response, and how they can be
applied to both algorithms. We analyze both the shared memory and the
distributed memory implementations of those algorithms.
On the one hand, this chapter shows the impact of the implementation details
and the communication mechanisms on the performance of an algorithm.
For shared memory, the total execution time of different implementations
of SF-Radix sort and C3-Radix sort is closely related to the number of L2
misses and invalidations (those may lead to L2 misses) of those implemen-
tations. We have seen that the larger the number of L2 misses, the worse
the performance of the implementation. For instance, the memory conscious implementations of the shared memory SF-Radix sort and C3-Radix sort are respectively 8.9 and 8.5 times faster, and incur significantly fewer L2 misses, than the plain implementations of those algorithms. Some of the techniques explained in this chapter for reducing the number of true and false sharing misses are data placement, padding, and data alignment.
For distributed memory, the large number of synchronous messages sent
has an influence on the performance of C3-Radix sort. In this chapter, we
have shown that reducing the number of messages may improve the performance of the implementation, because it reduces the fixed per-message overhead. Asynchronous messages may also improve the performance of
the implementation if we can overlap communication with computation. For
instance, the solution that groups and sends messages asynchronously is 1.25
times faster than the plain implementation of C3-Radix sort, which sends a
large number of messages synchronously.
On the other hand, we show that C3-Radix sort, which is a memory con-
scious algorithm, performs much better than SF-Radix sort. Table 16.3 shows
the total execution times of SF-Radix sort and C3-Radix sort for the two pro-
gramming models (shared and distributed), and the improvements achieved
by C3-Radix sort (how many times faster the algorithm is). These substantial improvements (2.25 and 2.78 times faster) are caused by the reduction in the number of data communication steps to only one, and by the better exploitation of data locality by C3-Radix sort.

Table 16.3. Total execution times in seconds of the shared and distributed memory versions of SF-Radix sort and C3-Radix sort for sorting 16M records with 8 processors. The value in parentheses indicates the speedup of C3-Radix sort over SF-Radix sort

Programming Model SF-Radix sort C3-Radix sort
Shared Memory 5.76 2.55 (2.25)
Distributed Memory 6.56 2.36 (2.78)
Finally, we can observe that the distributed memory implementation per-
forms better than the shared memory implementation of C3-Radix sort for
the executions we have analyzed, as shown in Table 16.3. However, shared
memory algorithms are much simpler than distributed memory algorithms.
In any case, this relative performance may vary depending on the number of
processors, the size of data to be sorted, the amount of data communicated,
the computer architecture, the communication mechanisms used, etc.
This chapter concludes that the efficiency of the solutions for a given
problem depends on the degree of memory consciousness of the algorithm,
and on the algorithm implementation details chosen to solve the problem.
16.6 Acknowledgments
This work was supported by the Ministry of Education and Science of Spain under contracts TIC-0511/98 and TIC2001-0995-C02-01, by CIRI6, by CEPBA, by the “Direcció General de Recerca of the Generalitat de Catalunya” under grant 1998FI-00283-APTIND, and by an IBM CAS Fellowship grant.
We wish to thank Jop Sibeyn, Peter Sanders, and the anonymous reviewers for their valuable contributions to improving the quality of this chapter. We would also like to thank Núria Nicolau-Balasch for her help with the bibliography search, and CEPBA for letting us use Paraver, especially Jordi Caubet, who helped us a lot in using the trace tools of Paraver. Finally, we wish to thank Xavi Serrano, Victor Mora, Alex Muntada, and Jose A. Rodríguez, the system managers of our lab, who helped us round off this chapter by solving some technical problems.
6 CEPBA-IBM Research Institute, http://www.cepba.upc.es/ciri/
Bibliography
[34] B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierarchy
model of computation. Algorithmica, 12(2-3):72–109, 1994.
[35] S. Alstrup, G. S. Brodal, and T. Rauhe. Optimal static range reporting in one
dimension. In Proceedings of the 33rd Annual ACM Symposium on Theory
of Computing (STOC ’01), pages 476–482. ACM Press, 2001.
[36] M. Altieri, C. Becker, and S. Turek. On the realistic performance of linear
algebra components in iterative solvers. In H.-J. Bungartz, F. Durst, and
C. Zenger, editors, High Performance Scientific and Engineering Computing,
Proc. of the Int. FORTWIHR Conference on HPSEC, volume 8 of LNCSE,
pages 3–12. Springer, 1998.
[37] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic
local alignment search tool. Journal of Molecular Biology, 215(3):403–410,
1990.
[38] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and
B. Smith. The tera computer system. In International Conference on Super-
computing, pages 1–6, Sept. 1990. in ACM SIGARCH 90(3).
[39] N. M. Amato and E. A. Ramos. On computing Voronoi diagrams by divide-
prune-and-conquer. In Proceedings of the 12th Annual ACM Symposium on
Computational Geometry, pages 166–175, 1996.
[40] A. Amir, G. Benson, and M. Farach. Let sleeping files lie: Pattern matching in
Z-compressed files. Journal of Computer and System Sciences, 52(2):299–307,
1996.
[41] C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu,
and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of
workstations. IEEE Computer, 29(2):18–28, Feb. 1996.
[42] B. S. Andersen, J. A. Gunnels, F. Gustavson, and J. Waśniewski. A recursive
formulation of the inversion of symmetric positive definite matrices in packed
storage data format. In Proc. of the 6th Int. Conference on Applied Parallel
Computing, volume 2367 of LNCS, pages 287–296, Espoo, Finland, 2002.
Springer.
[43] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Don-
garra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and
D. Sorensen. LAPACK Users’ Guide. SIAM, 3rd edition, 1999.
http://www.netlib.org/lapack/lug.
[44] J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S. A.
Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E.
Weihl. Continuous profiling: Where have all the cycles gone? In Proc. of
the 16th ACM Symposium on Operating System Principles, pages 1–14, St.
Malo, France, 1997.
[45] T. Anderson, M. Dahlin, J. Neefe, D. Paterson, D. Roselli, and R. Wang.
Serverless network file systems. In Proceedings of the 15th Symposium on
Operating Systems Principles, pages 109–126, Copper Mountain Resort, Col-
orado, December 1995.
[46] A. Andersson, N. J. Larsson, and K. Swanson. Suffix trees on words. Algo-
rithmica, 23(3):246–260, 1999.
[47] A. Andersson and S. Nilsson. A new efficient radix sort. In FOCS: Symposium
on Foundations of Computer Science (FOCS), pages 714–721. IEEE, 1994.
[48] A. Andersson and M. Thorup. Tight(er) worst-case bounds on dynamic
searching and priority queues. In Proceedings of the 32nd Annual ACM Sym-
posium on Theory of Computing (STOC ’00), pages 335–342. ACM Press,
2000.
[138] C. Böhm and H.-P. Kriegel. Determining the convex hull in large multidi-
mensional databases. In Data Warehousing and Knowledge Discovery. Third
International Conference, DaWaK 2001, volume 2114 of Lecture Notes in
Computer Science, pages 294–306, 2001.
[139] J. Bojesen, J. Katajainen, and M. Spork. Performance engineering case study:
Heap construction. ACM Journal of Experimental Algorithmics, 5, 2000.
[140] W. J. Bolosky and M. L. Scott. False sharing and its effect on shared memory
performance. In Proceedings of the USENIX Symposium on Experiences with
Distributed and Multiprocessor Systems, pages 57–72, 1993.
[141] P. Boncz, S. Manegold, and M. Kersten. Database architecture optimized
for the new bottleneck: Memory access. In The VLDB Journal, pages 54–65,
1999.
[142] B. Bonet and H. Geffner. Planning with incomplete information as heuristic
search in belief space. In Artificial Intelligence Planning and Scheduling
(AIPS), pages 52–61, 2000.
[143] O. Boruvka. Über ein Minimalproblem. Pràce, Moravské Prirodovedecké
Spolecnosti, pages 1–58, 1926.
[144] J. Boyer and W. Myrvold. Stop minding your P’s and Q’s: A simplified O(n)
planar embedding algorithm. In Proceedings of the 10th Annual ACM-SIAM
Symposium on Discrete Algorithms, pages 140–146, 1999.
[145] P. J. Braam. The Coda distributed file system. Linux Journal, 1998.
[146] J. H. Breasted. The Edwin Smith Surgical Papyrus, volume 1–2. The Oriental
Institute of the University of Chicago, 1930 (Reissued 1991).
[147] W. Briggs, V. Henson, and S. McCormick. A Multigrid Tutorial. SIAM,
second edition, 2000.
[148] S. Brin. Near neighbor search in large metric spaces. In Proceedings of the 21st
International Conference on Very Large Data Bases, pages 574–584, 1995.
[149] T. Brinkhoff. Der Spatial Join in Geo-Datenbanksystemen. PhD thesis,
Ludwig-Maximilians-Universität München, 1994. (in German).
[150] T. Brinkhoff, H.-P. Kriegel, R. Schneider, and B. Seeger. Multi-step process-
ing of spatial joins. In Proceedings of the 1994 ACM SIGMOD International
Conference on Management of Data, volume 23.2 of SIGMOD Record, pages
197–206, June 1994.
[151] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Efficient processing of spatial
joins using R-trees. In Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, volume 22.2 of SIGMOD Record, pages
237–246, June 1993.
[152] A. Brinkmann, K. A. Salzwedel, and C. Scheideler. Efficient, distributed data
placement strategies for storage area networks. In Proceedings of the 12th
Annual Symposium on Parallel Algorithms and Architectures, pages 119–128.
ACM Press, 2000.
[153] A. Brinkmann, K. A. Salzwedel, and C. Scheideler. Compact, adaptive place-
ment schemes for non-uniform requirements. In Proceedings of the 14th An-
nual Symposium on Parallel Algorithms and Architectures, pages 53–62. ACM
Press, 2002.
[154] G. S. Brodal. Worst-case efficient priority queues. In Proceedings of the 7th
Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’96), pages
52–58, 1996.
[155] G. S. Brodal and R. Fagerberg. Funnel heap — a cache oblivious priority
queue. In Proc. 13th Annual International Symposium on Algorithms and
Computation, volume 2518 of LNCS, pages 219–228. Springer, 2002.
[211] P. F. Corbett and D. G. Feitelson. The Vesta parallel file system. ACM
Transactions on Computer Systems, 14(3):225–264, 1996.
[212] T. H. Cormen and A. Colvin. ViC*: A preprocessor for virtual-memory C*.
Technical report, Department of Computer Science, Dartmouth College, 1994.
[213] T. H. Cormen and M. T. Goodrich. A bridging model for parallel computa-
tion, communication, and I/O. ACM Computing Surveys, 28(4es):208–208,
Dec. 1996. Position Statement.
[214] T. H. Cormen and M. Hirschl. Early Experiences in Evaluating the Parallel
Disk Model with the ViC* Implementation. Technical Report PCS-TR96-293,
Dartmouth College, Computer Science, Sept. 1996.
[215] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms.
McGraw-Hill, 1990.
[216] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to
Algorithms. MIT Press, second edition, 2001.
[217] D. Cornell and P. Yu. An effective approach to vertical partitioning for
physical design of relational databases. IEEE Trans. Software Engineering,
16(2):248–258, 1990.
[218] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Closest
pair queries in spatial databases. Technical report, Data Engineering Lab,
Department of Informatics, Aristotle University of Thessaloniki, Greece, 1999.
[219] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Closest
pair queries in spatial databases. In Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data, volume 29.2 of SIGMOD
Record, pages 189–200, June 2000.
[220] T. Cortes, S. Girona, and L. Labarta. PACA: A distributed file system cache
for parallel machines. performance under UNIX-like workload. Technical
Report UPC-DAC-RR-95/20, Departament d’Arquitectura de Computadors,
Universitat Politecnica de Catalunya, 1995.
[221] T. Cortes and J. Labarta. A case for heterogenenous disk arrays. In Proceed-
ings of the International Conference on Cluster Computing, pages 319–325.
IEEE Computer Society Press, 2000.
[222] T. Cortes and J. Labarta. Extending heterogeneity to RAID level 5. In
Proceedings of the USENIX 2001, pages 119–132. USENIX Association, 2001.
[223] T. P. P. Council. www.tpc.org.
[224] C. Courcoubetis, M. Y. Vardi, P. Wolper, and M. Yannakakis. Memory-
efficient algorithms for the verification of temporal properties. Formal Meth-
ods in System Design, 1(2/3):275–288, 1992.
[225] P. E. Crandall, R. A. Aydt, A. A. Chien, and D. A. Reed. Input/Output char-
acteristics of scalable parallel applications. In Proceedings of Supercomputing
’95, 1995.
[226] A. Crauser. External Memory Algorithms and Data Structures in Theory and
Practice. PhD thesis, MPI-Informatik, Universität des Saarlandes, 2001.
[227] A. Crauser and P. Ferragina. A theoretical and experimental study on the
construction of suffix arrays in external memory. Algorithmica, 32(1):1–35,
2002.
[228] A. Crauser, P. Ferragina, K. Mehlhorn, U. Meyer, and E. Ramos. Ran-
domized external-memory algorithms for some geometric problems. Inter-
national Journal of Computational Geometry & Applications, 11(3):305–337,
June 2001.
[229] A. Crauser and K. Mehlhorn. LEDA-SM, A Platform for Secondary Memory
Computation. Max-Planck-Institut für Informatik, Saarbrücken, Germany,
Mar. 1999.
[247] J. del Rosario, R. Bordawekar, and A. Choudhary. Improved parallel I/O via
a two-phase run-time access strategy. In Proceedings of 7th International Par-
allel Processing Symposium Workshop on Input/Output in Parallel Computer
Systems, 1993.
[248] E. Demaine. A Threads-Only MPI Implementation for the Development of
Parallel Programs. In Proceedings of the 11th International Symposium on
High Performance Computing Systems, HPCS, pages 153–163, 1997.
[249] D. C. Dennet. Minds, machines, and evolution. In C. Hookway, editor, Cogni-
tive Wheels: The Frame Problem of AI, pages 129–151. Cambridge University
Press, 1984.
[250] D. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. Wood.
Implementation techniques for main memory database systems. In Proceed-
ings of the SIGMOD Int’l. Conference on the Management of Data, pages
1–8. ACM, 1984.
[251] M. Dietzfelbinger. Universal hashing and k-wise independent random vari-
ables via integer arithmetic without primes. In 13th Symposium on Theoret-
ical Aspects of Computer Science (STACS), volume 1046 of Lecture Notes in
Computer Science, pages 569–580. Springer-Verlag, 1996.
[252] E. W. Dijkstra. A note on two problems in connexion with graphs. Nu-
merische Mathematik, 1:269–271, 1959.
[253] J. F. Dillenburg and P. C. Nelson. Perimeter search (research note). Artificial
Intelligence, 65(1):165–178, 1994.
[254] W. Dittrich, D. Hutchinson, and A. Maheshwari. Blocking in parallel multi-
search problems. ACM Trans. Comput. Syst., 34:145–189, 2001.
[255] J. Dongarra, J. Du Croz, S. Hammarling, and I. Duff. A set of level 3 basic
linear algebra subprograms. ACM Transactions on Mathematical Software,
16(1):1–17, 1990.
[256] C. C. Douglas. Caching in with multigrid algorithms: Problems in two di-
mensions. Parallel Algorithms and Applications, 9:195–204, 1996.
[257] C. C. Douglas, J. Hu, M. Kowarschik, U. Rüde, and C. Weiß. Cache optimiza-
tion for structured and unstructured grid multigrid. Electronic Transactions
on Numerical Analysis, 10:21–40, 2000.
[258] J. R. Driscoll, N. Sarnak, D. D. Sleator, and R. E. Tarjan. Making data
structures persistent. Journal of Computer and System Sciences, 38(1):86–
124, 1989.
[259] P. Druschel and A. Rowstron. PAST: A large-scale, persistent peer-to-peer
storage utility. In Hot Topics in Operating Systems, pages 75–80, Schloss
Elmau, Germany, 2001. IEEE Computer Society Press.
[260] J. Eckerle. Memory-Limited Heuristic Search (German). PhD thesis, Univer-
sity of Freiburg, 1998. DISKI, Infix.
[261] M. Edahiro, I. Kokubo, and T. Asano. A new point-location algorithm and its
practical efficiency: Comparison with existing algorithms. ACM Transactions
on Graphics, 3(2):86–109, April 1984.
[262] S. Edelkamp. Suffix tree automata in state space search. In German Confer-
ence on Artificial Intelligence (KI), pages 381–385, 1997.
[263] S. Edelkamp. Planning with pattern databases. In European Conference on
Planning (ECP), pages 13–24, 2001.
[264] S. Edelkamp. Prediction of regular search tree growth by spectral analysis.
In German Conference on Artificial Intelligence (KI), pages 154–168, 2001.
[265] S. Edelkamp. Symbolic exploration in two-player games: Preliminary re-
sults. In Artificial Intelligence Planning and Scheduling (AIPS)–Workshop
on Model Checking, pages 40–48, 2002.
[287] C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In Proceed-
ings of the Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Princi-
ples of Database Systems, pages 247–252, 1989.
[288] M. Farach. Optimal suffix tree construction with large alphabets. In Pro-
ceedings of the 38th Annual Symposium on Foundations of Computer Science,
pages 137–143. IEEE, 1997.
[289] M. Farach and M. Thorup. String matching in Lempel-Ziv compressed strings.
Algorithmica, 20(4):388–404, 1998.
[290] M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-
complexity of suffix tree construction. Journal of the ACM, 47(6):987–1011,
2000.
[291] A. Felner. Finding optimal solutions to the graph-partitioning problem with
heuristic search. In Symposium on the Foundations of Artificial Intelligence,
2001.
[292] Z. Feng and E. Hansen. Symbolic heuristic search for factored Markov decision
processes. In National Conference on Artificial Intelligence (AAAI), pages
455–460, 2002.
[293] J. Fenlason and R. Stallman. GNU gprof. Free Software Foundation, Inc.,
Boston, Massachusetts, USA, 1998. http://www.gnu.org.
[294] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt,
H. Theiling, S. Thesing, and R. Wilhelm. Reliable and precise WCET de-
termination for a real-life processor. In Workshop on Embedded Systems
(EMSOFT), number 2211 in LNCS, pages 469–485. Springer, 2001.
[295] P. Ferragina and R. Grossi. Fast string searching in secondary storage: Theo-
retical developments and experimental results. In Proceedings of the 7th An-
nual Symposium on Discrete Algorithms, pages 373–382. ACM–SIAM, 1996.
[296] P. Ferragina and R. Grossi. The string B-tree: A new data structure for
string search in external memory and its applications. Journal of the ACM,
46(2):236–280, 1999.
[297] P. Ferragina, R. Grossi, and M. Montangero. Note on updating suffix tree
labels. Theoretical Computer Science, 201(1–2):249–262, 1998.
[298] P. Ferragina and F. Luccio. Dynamic dictionary matching in external memory.
Information and Computation, 146(2):85–99, 1998.
[299] P. Ferragina and G. Manzini. Opportunistic data structures with applications.
In Proceedings of the 41st Annual Symposium on Foundations of Computer
Science, pages 390–398. IEEE, 2000.
[300] P. Ferragina and G. Manzini. An experimental study of an opportunistic
index. Information Sciences, 135(1–2):13–28, 2001.
[301] J. Ferrante, V. Sarkar, and W. Trash. On estimating and enhancing cache
effectiveness. In U. Banerjee, editor, Proc. of the Fourth Int. Workshop on
Languages and Compilers for Parallel Computing, LNCS. Springer, 1991.
[302] E. Feuerstein and A. Marchetti-Spaccamela. Memory paging for connectivity
and path problems in graphs. In Proceedings of the International Sympo-
sium on Algorithms and Computation, volume 762 of LNCS, pages 416–425.
Springer, 1993.
[303] R. Fikes and N. Nilsson. Strips: A new approach to the application of theorem
proving to problem solving. Artificial Intelligence, 2:189–208, 1971.
[304] U. A. Finke and K. H. Hinrichs. Overlaying simply connected planar subdi-
visions in linear time. In Proceedings of the Eleventh Annual Symposium on
Computational Geometry, pages 119–126, New York, 1995. ACM Press.
[305] P. Flajolet. On the performance evaluation of extendible hashing and trie
searching. Acta Informatica, 20(4):345–369, 1983.
[347] G. Graefe. Query evaluation techniques for large databases. ACM Computing
Surveys, 25(2):73–170, 1993.
[348] G. Graefe, R. Bunker, and S. Cooper. Hash joins and hash teams in mi-
crosoft SQL server. In Proceedings of the 24th Int’l. Conference on Very
Large Databases, pages 86–97, 1998.
[349] G. Graefe and P. Larson. B-tree indexes and CPU caches. In Int’l. Conference
on Data Engineering, pages 349–358. IEEE, 2001.
[350] T. G. Graf. Plane-sweep construction of proximity graphs. PhD thesis, Fach-
bereich Mathematik, Westfälische Wilhelms-Universität Münster, Germany,
1994.
[351] T. G. Graf and K. H. Hinrichs. Algorithms for proximity problems on colored
point sets. In Proceedings of the Fifth Canadian Conference on Computational
Geometry, pages 420–425, 1993.
[352] R. L. Graham. An efficient algorithm for determining the convex hull of a
finite planar set. Information Processing Letters, 1(4):132–133, 1972.
[353] R. L. Graham, D. E. Knuth, and O. Patashnik. Concrete Mathematics.
Addison-Wesley, Reading, MA, 1989.
[354] A. Grama, V. Kumar, S. Ranka, and V. Singh. Architecture independent
analysis of parallel programs. In Alexandrov et al. [28], pages 599–608.
[355] E. D. Granston and H. A. G. Wijshoff. Managing pages in shared virtual
memory systems: Getting the compiler into the game. In Proceedings of the
7th international conference on Supercomputing, pages 11–20. ACM Press,
1993.
[356] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques.
Morgan Kaufmann, 1993.
[357] J. Griffioen and R. Appleton. Reducing file system latency using a predictive
approach. In Proceedings of USENIX Summer 1994 Technical Conference,
pages 197–207, 1994.
[358] W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. High performance
parallel implicit CFD. Parallel Computing, 27(4):337–362, 2001.
[359] R. Grossi and G. F. Italiano. Suffix trees and their applications in string algo-
rithms. Rapporto di Ricerca CS-96-14, Università “Ca’ Foscari” di Venezia,
Italy, 1996.
[360] R. Grossi and G. F. Italiano. Efficient cross-trees for external memory. In
J. Abello and J. S. Vitter, editors, External Memory Algorithms, volume 50 of
DIMACS Series in Discrete Mathematics and Theoretical Computer Science,
pages 87–106. American Mathematical Society, Providence, RI, 1999.
[361] R. Grossi and J. S. Vitter. Compressed suffix arrays and suffix trees with
applications to text indexing and string matching. In Proceedings of the 32nd
Annual Symposium on Theory of Computing, pages 397–406. ACM, 2000.
[362] O. Günther. Efficient computation of spatial joins. In Proceedings of the
Ninth International Conference on Data Engineering, pages 50–59, 1993.
[363] A. Gupta, W.-D. Weber, and T. Mowry. Reducing memory and traffic re-
quirements for scalable directory-based cache coherence schemes. In 1990
International Conference on Parallel Processing, volume I, pages 312–321,
St. Charles, Ill., 1990.
[364] P. Gupta, R. Janardan, and M. Smid. Efficient algorithms for counting and
reporting pairwise intersections between convex polygons. Information Pro-
cessing Letters, 69(1):7–13, January 1999.
[365] C. Gurret and P. Rigaux. The Sort/Sweep algorithm: A new method for R-
tree based spatial joins. In Proceedings of the 12th International Conference
on Scientific and Statistical Database Management, pages 153–165, 2000.
[407] J. Hopcroft and R. Tarjan. Efficient algorithms for graph manipulation. Com-
munications of the ACM, 16(6):372–378, 1973.
[408] P. V. C. Hough. Method and means for recognizing complex patterns. U. S.
Patent No. 3069654, 1962.
[409] J. H. Howard. An overview of the Andrew file system. In Proceedings of the
USENIX Winter Technical Conference, pages 23–26, 1988.
[410] F. Hsu, T. Anantharaman, M. Campbell, and A. Nowatzyk. A grandmaster
chess machine. Scientific American, 4:44–50, 1990.
[411] W. Hsu, A. Smith, and H. Young. Characteristics of production database
workloads and the TPC benchmarks. IBM Systems Journal, 40(3):781–802,
2001.
[412] A. J. Hu and D. L. Dill. Reducing BDD size by exploiting functional depen-
dencies. In Design Automation, pages 266–271, 1993.
[413] Y. Hu, H. Lu, A. Cox, and W. Zwaenepoel. OpenMP for Networks of SMPs.
In Proceedings of the 2nd Merged Symposium International Parallel and Dis-
tributed Symposium/Symposium on Parallel and Distributed Processing (IPP-
S/SPDP). IEEE, 1999.
[414] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. Spatial joins using R-
trees: Breadth-first traversal with global optimizations. In Proceedings of
the Twenty-third International Conference on Very Large Data Bases, pages
396–405, 1997.
[415] J. Huber, C. Elford, D. Reed, A. Chien, and D. Blumenthal. PPFS: A high
performance portable file system. In Proceedings of the 9th ACM Interna-
tional Conference on Supercomputing, pages 385–394, 1995.
[416] S. Huddleston and K. Mehlhorn. A new data structure for representing sorted
lists. Acta Informatica, 17(2):157–184, 1982.
[417] F. Hüffner, S. Edelkamp, H. Fernau, and R. Niedermeier. Finding optimal
solutions to Atomix. In German Conference on Artificial Intelligence (KI),
pages 229–243, 2001.
[418] F. Hülsemann, P. Kipfer, U. Rüde, and G. Greiner. gridlib: flexible and
efficient grid management for simulation and visualization. In Proc. of the
Int. Conference on Computational Science, Part III, volume 2331 of LNCS,
pages 652–661, Amsterdam, The Netherlands, 2002. Springer.
[419] D. Hutchinson, A. Maheshwari, and N. Zeh. An external memory data struc-
ture for shortest path queries. In Proceedings of the 5th ACM-SIAM Com-
puting and Combinatorics Conference, volume 1627 of LNCS, pages 51–60.
Springer, July 1999.
[420] IBM Corporation, http://www-124.ibm.com/developerworks/oss/jfs/. JFS
website.
[421] C. Icking, R. Klein, and T. A. Ottmann. Priority search trees in secondary
memory. In Graph-Theoretic Concepts in Computer Science. International
Workshop WG ’87, Proceedings, volume 314 of Lecture Notes in Computer
Science, pages 84–93, Berlin, 1988. Springer.
[422] W. B. Ligon III and R. B. Ross. An overview of the Parallel Virtual File System.
In Proceedings of the Extreme Linux Workshop, 1999.
[423] F. Isaila and W. Tichy. Clusterfile: A flexible physical layout parallel file
system. In Third IEEE International Conference on Cluster Computing, pages
37–44, Oct. 2001.
[424] P. Jackson. Introduction to Expert Systems. Addison Wesley Longman, 1999.
[425] G. Jacobson. Space efficient static trees and graphs. In Proceedings of the
30th Annual Symposium on Foundations of Computer Science, pages 549–554.
IEEE, 1989.
[562] M. Müller. Computer Go as a sum of local games. PhD thesis, ETH Zürich,
1995.
[563] M. Müller. Partial order bounding: A new approach to game tree search.
Artificial Intelligence, 129(1–2):279–311, 2001.
[564] M. Müller. Proof set search. Technical Report TR01-09, University of Alberta,
2001.
[565] M. Müller and T. Tegos. Experiments in computer amazons. More Games of
No Chance, Cambridge University Press, 42:243–257, 2002.
[566] K. Mulmuley. Computational Geometry: An Introduction Through Random-
ized Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1994.
[567] K. Munagala and A. Ranade. I/O-Complexity of Graph Algorithms. In Proc.
10th Ann. Symposium on Discrete Algorithms, pages 687–694. ACM-SIAM,
1999.
[568] M. Mundhenk, J. Goldsmith, C. Lusena, and E. Allender. Complexity of finite-horizon Markov decision process problems. Journal of the ACM, 4:681–
720, 2000.
[569] I. Murdock and J. H. Hartman. Swarm: A log-structured storage system
for Linux. In Proceedings of the FREENIX Track: 2000 USENIX Annual
Technical Conference, pages 1–10. USENIX Association, 2000.
[570] S. Näher and K. Mehlhorn. LEDA: A platform for combinatorial and geo-
metric computing. Communications of the ACM, 38(1):96–102, 1995.
[571] Namesys, http://www.reiserfs.org/. ReiserFS Whitepaper.
[572] C. Navarro, A. Ramirez, J. Larriba-Pey, and M. Valero. On the performance
of fetch engines running DSS workloads. In Proceedings of the EUROPAR
Conference, pages 591–595. Springer Verlag, 2000.
[573] G. Navarro. A partial deterministic automaton for approximate string match-
ing. In R. Baeza-Yates, editor, Proceedings of the 4th South American Work-
shop on String Processing, pages 95–111. Carleton University Press, 1997.
[574] G. Navarro. A guided tour to approximate string matching. ACM Computing
Surveys, 33(1):31–88, 2001.
[575] G. Navarro, T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Faster
approximate string matching over compressed text. In Proceedings of the
11th Data Compression Conference, pages 459–468. IEEE, 2001.
[576] G. Navarro and M. Raffinot. A general practical approach to pattern match-
ing over Ziv-Lempel compressed text. In M. Crochemore and M. Paterson,
editors, Proceedings of the 10th Annual Symposium on Combinatorial Pattern
Matching, number 1645 in LNCS, pages 14–36. Springer, 1999.
[577] J. J. Navarro, E. Garcia-Diego, J.-L. Larriba-Pey, and T. Juan. Block algo-
rithms for sparse matrix computations on high performance workstations. In
Proc. of the Int. Conference on Supercomputing, pages 301–308, Philadelphia,
Pennsylvania, USA, 1996.
[578] B. Nebel. Personal communication, 2002.
[579] J. v. Neumann. First draft of a report on the EDVAC. Technical report,
University of Pennsylvania, 1945. http://www.histech.rwth-aachen.de/www/quellen/vnedvac.pdf.
[580] A. Newell, V. C. Shaw, and H. A. Simon. Report on a general problem-solving
program. In Proceedings International Conference on Information Processing
(ICIP ’59), pages 256–264. Butterworth, 1960.
[581] N. Nieuwejaar and D. Kotz. The Galley parallel file system. Parallel Com-
puting, 23(4-5):447–476, 1997.
[582] N. Nieuwejaar, D. Kotz, A. Purakayastha, C. S. Ellis, and M. L. Best. File
access characteristics of parallel scientific workloads. IEEE Transactions on
Parallel and Distributed Systems, 7(10):1075–1089, October 1996.
[776] H. Zhang and M. Martonosi. A mathematical cache miss analysis for pointer
data structures. In Proc. 10th SIAM Conference on Parallel Processing for
Scientific Computing, 2001.
[777] W. Zhang. Depth-first branch-and-bound versus local search. In National
Conference on Artificial Intelligence (AAAI), pages 930–935, 2000.
[778] W. Zhang and P. Larson. A memory-adaptive sort (masort) for database
systems. In Proceedings of the CASCON Conference. IBM, 1996.
[779] Z. Zhang and X. Zhang. Cache-optimal methods for bit-reversals. In Proc.
Supercomputing’99, 1999.
[780] V. V. Zhirnov and D. J. C. Herr. New frontiers: Self-assembly and nanoelec-
tronics. IEEE Computer, 34(1):34–43, 2001.
[781] G. M. Ziegler. Lectures on Polytopes, volume 152 of Graduate Texts in Math-
ematics. Springer, New York, second edition, 1998.
[782] G. Zimbrão and J. M. de Souza. A raster approximation for the processing
of spatial joins. In Proceedings of the 24th Annual International Conference
on Very Large Data Bases, pages 558–569, 1998.
[783] R. Zimmermann and S. Ghandeharizadeh. HERA: Heterogeneous extension
of raid. In H. Arabnia, editor, Proceedings of the International Conference on
Parallel and Distributed Processing Techniques and Applications, volume 4,
pages 2159–2165. CSREA Press, 2000.
Index
vector computer, 8
Vienna Fortran Compiler (VFC), 350
virtual cache, 176, 178
virtual memory, 9, 18, 172, 175
– memory management unit, 175
Voronoi diagram, 129
web crawl, 85
web mining, 248
web modelling, 85
weight balance, 19
word-based index, 155
work-preserving emulation, 331
write
– full, 258, 265
– small, 258, 264, 265
write policy, 175
– write-back, 175
– write-through, 175