
Evaluation of a Cache-Oblivious Data Structure

Maks Verver
[email protected]

ABSTRACT
In modern computer hardware architecture, memory is organized in a hierarchy consisting of several types of memory with different memory sizes, block transfer sizes and access times. Traditionally, data structures are evaluated in a theoretical model that does not take the existence of a memory hierarchy into account. The cache-oblivious model has been proposed as a more accurate model. Although several data structures have been described in this model, relatively little empirical performance data is available. This paper presents the results of an empirical evaluation of several data structures in a realistic scenario and aims to provide insight into the applicability of cache-oblivious data structures in practice.
Keywords
cache efficiency, locality of reference, algorithms

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission.
9th Twente Student Conference on IT, Enschede, June 23rd, 2008.
Copyright 2008, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

1. INTRODUCTION
A fundamental part of theoretical computer science is the study of algorithms (formal descriptions of how computations may be performed) and data structures (descriptions of how information is organized and stored in computers). Traditionally, algorithms have been evaluated in a simplified model of computation. In this model it is assumed that a computer executes an algorithm in discrete steps. At each step it performs one elementary operation (e.g. comparing two numbers, adding one to another, storing a value in memory, et cetera). Each elementary operation is performed within a constant amount of time. In this model, both storing and retrieving data values in memory are considered elementary operations.

This model is close enough to the way computers work to be extremely useful in the development and analysis of data structures and algorithms that work well in practice. However, like every model, it is a simplification of reality. One of the simplifications is the assumption that data can be stored or retrieved at any location in a constant amount of time, which is why we will call this the uniform memory model. Advancements in software and hardware design over the last two decades have caused this assumption to become increasingly detached from reality.

The main cause for this is the increasing difference between processor and memory speeds. As processor speeds have increased greatly, the time required to transfer data between processor and main memory has become a bottleneck for many types of computations. Hardware architects have added faster (but small) cache memories at various points in the computer architecture to reduce this problem. Similar developments have taken place on the boundary between main memory and disk-based storage.

As a result, a modern computer system lacks a central memory storage with uniform performance characteristics. Instead, it employs a hierarchy of memory storage types. Figure 1 gives a typical example of such a hierarchy. The processor can directly manipulate only data contained in its registers. To access data in a lower level of the memory hierarchy, the data must be transferred upward through the hierarchy. Memory is typically transferred in blocks of data of a fixed size (although bulk transfers involving multiple blocks of data at once are also supported at the memory and disk level).

Figure 1: Schematic depiction of the memory hierarchy

In the memory hierarchy, every next level of storage is both significantly slower and significantly larger than the one above it, and with increasing memory sizes, the block size increases as well. Table 1 gives an overview of typical memory sizes, block sizes, and access times. The values for the (general purpose) registers, L1 cache and L2 cache are those for the Pentium M processor as given in [1]; other values are approximate.

Type        Total Size   Block Size   Access Time
Registers   32 bytes     4 bytes      1 cycle
L1 Cache    32 KB        64 bytes     3 cycles
L2 Cache    1 MB         64 bytes     9 cycles
RAM         ± 2 GB       1 KB         50-100 cycles
Disk        ± 300 GB     4 KB         5,000-10,000 cycles

Table 1: Sizes and access times of various types of storage

To conclude: the memory model used in real computer systems is quite a bit more complex than the assumed uniform memory model. Given this reality, the performance of many existing algorithms and data structures can be improved by taking the existence of a memory hierarchy into account. This has prompted research into new memory models that are more realistic. We will consider three classes of data structures and algorithms, depending on the assumptions that are made about the environment in their definition:

• Cache-unaware data structures and algorithms are designed for the traditional uniform memory model, with no consideration of the existence of a memory hierarchy.

• Cache-aware (or cache-conscious) data structures and algorithms are designed to perform optimally in the external memory model (described below) but require parametrization with some or all properties of the cache layout (such as block size or cache memory size).

• Cache-oblivious data structures and algorithms are designed to perform optimally in the cache-oblivious model (explained below), which does not allow parametrization with properties of the cache layout.

A great number of cache-unaware and cache-aware data structures have been developed, and these are also widely used in practice. Although some research has been done on cache-oblivious data structures, there is currently no evidence that they are used in practice. Consequently, it is unclear if they are suitable for practical use at all. The main goal of this paper is to give some insight into the practical merit of cache-oblivious data structures.

In the following pages we will give an overview of the different available memory models and explain why the cache-oblivious model is of particular interest. We will then describe what our goals for this research project were and how our work relates to previous research. A large part of the paper will be dedicated to a description of our research methodology. Finally, we will present and discuss our results, and draw a conclusion on the applicability of cache-oblivious data structures.

2. PREVIOUS WORK
The cache-oblivious memory model is not the only or the first model that was developed as a more realistic alternative to the uniform memory model. We will briefly describe some of the alternatives.

2.1 External Memory Model
One of the earliest models to take the differences in memory access cost into account is the external memory model, described by Aggarwal and Vitter [2]. They make a distinction between internal memory (which is limited in size) and external memory (which is virtually unlimited). The external memory is subdivided into blocks of a fixed size and only entire blocks of data can be transferred between the two memories; additionally, consecutive blocks can be transferred at reduced cost (so-called bulk transfer). Although Aggarwal and Vitter focus on magnetic disk as a storage medium for external memory, the model can be generalized to apply to every pair of adjacent levels in the memory hierarchy. In that case, the possibility of bulk transfer may have to be dropped.

We will call algorithms that are designed to minimize the number of transfers in a two-level memory hierarchy "cache-aware" (as opposed to traditional "cache-unaware" algorithms) or "cache-conscious". These algorithms typically rely on knowledge of the block size to achieve optimal performance.

2.2 Hierarchical Memory Model
The external memory model has the limitation that it only describes two levels of storage, while we have seen that in practice the memory hierarchy contains more than just two levels. Even though the external memory model can be used to describe any pair of adjacent levels, a particular algorithm can only be tuned to one. Aggarwal, Alpern, Chandra and Snir [3] addressed this shortcoming by introducing a hierarchical memory model in which the cost of accessing values at different memory addresses is described by a non-decreasing function of these addresses, which means that accessing data at higher addresses can be slower than at lower addresses. This is a very general model that can be applied to the real memory hierarchy, but it assumes that the application has full control over which data is placed where, which is usually not the case in practice. As a result, the applicability of their model to the design and evaluation of practical algorithms is limited.

2.3 Cache-Oblivious Memory Model
A different approach to generalizing the external memory model was taken by Prokop [4], who proposed the cache-oblivious model. In this model, there is an infinitely large external memory and an internal memory of size M which operates as a cache for the external memory. Data is transferred between the two in aligned data blocks of size B. In contrast with the hierarchical memory model, the application does not have explicit control over the transferring of blocks between the two memories. Instead, it is assumed that an optimal cache manager exists which minimizes the number of block transfers over the execution of the program. Additionally (and in contrast with the external memory model) the values of parameters like M and B are not known to the application, so they cannot be used explicitly when defining algorithms and data structures. Of course, analysis in terms of memory transfers does involve these parameters, so the number of memory transfers performed is still a function of M and B (and other parameters relevant to the problem).

Algorithms that perform optimally in this model are called "cache-oblivious" and they distinguish themselves from cache-aware algorithms in that they cannot rely on knowledge of the block size or other specific properties of the cache configuration. The key advantage of this class of algorithms is that, even though they are defined in a two-level memory model, they are implicitly tuned to all levels in the memory hierarchy at once. It has been conjectured that these algorithms may therefore perform better than algorithms that are tuned to a specific level in the hierarchy only.

The cache-oblivious model is very similar to the real memory hierarchy, which means that algorithms designed for this model can easily be implemented in practice. This property, combined with the promise of cache-efficiency across multiple levels of the memory hierarchy, makes it a promising model for the development of low-maintenance, high-performance data structures for use in real-world applications.
3. RESEARCH GOALS
Several cache-oblivious data structures and algorithms have been proposed. Complexity analysis shows that the proposed solutions are asymptotically optimal. However, in software engineering practice we are not only interested in computational complexity, but also in the practical performance of data structures and algorithms. Indeed, many algorithms that have suboptimal computational complexity are actually widely used because they perform well in practice (for example, sorting algorithms like Quicksort and Shell sort), and the converse is true as well: some algorithms, although theoretically sound, are unsuitable for practical use because of greater memory requirements, longer execution time, or difficulty of implementation (for example: linear-time construction of suffix arrays is possible, but in practice slower alternatives are often preferred that are easier to implement and require less memory).

This raises the question whether cache-oblivious data structures are actually preferable to traditional data structures in practice. To determine if cache-oblivious data structures have practical merit, empirical performance data is required, which is scarce, as existing research has focused mainly on theoretical analysis. This paper addresses the question by reporting on the performance of a newly implemented (but previously described) cache-oblivious data structure and two of its traditional counterparts (both cache-aware and cache-unaware data structures).

4. RELATED WORK
Many data structures and algorithms have been analyzed in the cache-oblivious model. Several new data structures and algorithms have been developed that perform optimally in this model as well. Prokop presents asymptotically optimal cache-oblivious algorithms for matrix transposition, fast Fourier transformation and sorting [4].

Demaine gives an introduction to the cache-oblivious memory model and an overview of a selection of cache-oblivious data structures and algorithms [5]. He also motivates the simplifications made in the cache-oblivious memory model, such as the assumption of full cache associativity and an optimal replacement policy.

Bender, Demaine and Farach-Colton designed a cache-oblivious data structure that supports the same operations as a B-tree [6], achieving optimal complexity bounds on search and nearly-optimal bounds on insertion. Later, Bender, Duan, Iacono and Wu simplified this data structure [7] while preserving the supported operations and complexity bounds and adding support for additional operations (finger searches in particular). This data structure will be explained in detail in Section 5.3.5. The authors note that a chief advantage of their data structure over the previously described one is that it is less complex and more easily implementable, and therefore more suitable for practical use.

Several data structures were proposed by Rahman, Cole and Raman [8], among them a cache-oblivious exponential search tree with a structure similar to the static search tree proposed by Prokop. In their experimental results the cache-oblivious tree performs worse than the (non-oblivious) alternatives. Nevertheless, they conclude that "cache-oblivious data structures may have significant practical importance".

Askitis and Zobel [9] propose a way to optimize separate-chaining hash tables for cache efficiency by storing the linked lists that contain the contents of a bucket in contiguous memory. Their experiments show a performance gain over traditional methods, especially when the hash table is heavily loaded.

Vitter presents a theoretical survey of algorithms evaluated in a parallel disk model [10], which is a refinement of the external memory model described by Aggarwal and Vitter, but only allows parallel transfer of multiple blocks from different disks, which is more realistic. Unfortunately, his survey lacks empirical results.

Olsen and Skov evaluated two cache-oblivious priority queue data structures in practice [11] and designed an optimal cache-oblivious priority deque. Their main result is that although the cache-oblivious data structures they examined make more efficient use of the cache, they do not perform better than traditional priority queue implementations.

From the available publications we can conclude that only a minority of the research on the cache-oblivious memory model compares the practical performance of newly proposed data structures with that of established data structures. Contrary to what theoretical analysis suggests, the practical results that are available so far fail to show the superior performance of cache-oblivious data structures. Therefore, additional research is needed to determine more precisely to what extent cache-oblivious data structures are useful as a building block for practical work; this paper will provide some insight in this regard.

5. RESEARCH METHODS
In order to gather empirical data, the research approach must be made more concrete. We will need to limit ourselves to a specific class of data structures, since algorithms and data structures offering different functionality cannot be compared in a meaningful way. Furthermore, we will need to select a proper scenario in which the data structures are evaluated, as conclusions on the practical merit of the data structures depend on the degree to which the test scenario is realistic.

We also need to define more accurately what we mean by practical performance. Our experiments are performed by running a test application (which will be described in detail below) and measuring two properties: primarily the execution time, and secondarily the memory in use. The rationale for selecting these metrics is that if the data structures perform identical functions and enough memory is available, the only observable difference in running a program using different data structures will be the execution time. Memory use is of secondary interest because in practice memory may be limited, which would preclude the use of data structures that require a large amount of memory to function.

5.1 State Space Search
The test application that we used to gather performance data implements a state space search algorithm. This is a suitable scenario for two reasons. First, it is commonly used as a practical component of formal methods for software verification, and therefore good algorithms are of great practical significance. Second, as we will explain below, the performance of state space search algorithms depends for a large part on the performance of the data structures that are used to implement them; therefore, research into efficient data structures is of particular interest to this application.

State space search can be used to verify the correctness of software programs. For this purpose, programs are first modeled using a formal language that is also used to specify properties of the program that should hold during its execution.
An executing program can be in a (possibly infinite) number of states, one of which is usually designated the initial state. If a transition from one state to another is possible (according to the rules of the formal language used), the latter state is said to be a successor state of the former. Generating the successors of a particular state is also called expanding the state. Of course, the execution model must include some form of non-determinism to allow more than one successor to exist for a single state. In practice, this non-determinism usually comes from processes that execute in parallel, where the precise interleaving of the execution of instructions in these processes is non-deterministic, unless synchronization primitives (such as channels, semaphores, atomic execution blocks, et cetera) are used to enforce a particular ordering.

The set of all states reachable by transitions from the initial state is called the state space of a model, and it is the goal of a state space search algorithm to generate all of these states, in order to check that desired properties hold in all of them. This approach usually requires the state space to be finite, although exhausting the entire state space is not necessary for our experiments.

The outline of a state space search algorithm is given in Program 1. Note that in addition to the initial state and a function to generate successors, a queue and a set data structure are used. The queue holds states that have been generated but not yet expanded and is used to determine what state to expand next. The set holds all states that have been generated so far and is used to prevent a single state from being expanded more than once.

Program 1: Pseudo-code for a simple state space search algorithm

    Queue queue = new Queue;
    Set generated = new Set;
    queue.insert(initial_state);
    while (!queue.empty()) {
        State state = queue.extract();
        for (State s : successors(state)) {
            if (generated.contains(s) == false) {
                generated.insert(s);
                queue.insert(s);
            }
        }
    }

Although other behavior is possible, our queue operates by a first-in, first-out principle, meaning that all states that are reachable in N steps from the initial state are expanded before any states that require more than N steps.

From the pseudo-code it is clear that the state space search algorithm does not accomplish much by itself; instead, the real work is done by the successor function and the queue and set data structures. Queues can easily be implemented efficiently (adding and removing states takes O(1) time). The efficiency of the successor function depends on the execution model used, but is typically linear in the number of states produced. In practice, therefore, the bottleneck of the algorithm is the set data structure.

5.2 Experimental Framework
For our experimental framework, we needed a collection of formal models that are representative of those typically used for formal verification, and a way to execute them. For the first part, we have looked at Spin [12], a widely used model checking tool. Spin uses a custom modeling language (called PROMELA) to specify models and is distributed with a collection of example models that are suitable for our experiments.

For the execution of these models we used the NIPS VM [13], a high-performance virtual machine for state space generation that is easily embeddable in our framework. Although the NIPS VM only executes bytecode in a custom format, a PROMELA compiler is also available to generate the required bytecode from the Spin models [14]. The NIPS VM is preferred to the Spin tools because it was designed to be embedded in other programs and as such is more easily integrated in our framework.

The NIPS VM represents program state as a binary string; the size of the state depends (amongst others) on the number of active processes in the program, which may change over the execution of the program. A typical state size is in the order of a few hundred bytes.

5.3 Data Structures
In the introduction we identified three classes of data structures. For our evaluation we have implemented a single representative data structure for each class:

• Cache-unaware: hash tables. Hash tables are widely used in practice and noted for good performance if the data set fits in main memory, although they have also been used as an index structure for data stored on disk.

• Cache-aware: B-trees. The B-tree is the standard choice for storing (ordered) data on disk, and depends on a page size being selected that corresponds with the size of data blocks that can be efficiently transferred.

• Cache-oblivious: the data structure proposed by Bender, Duan, Iacono and Wu. Since it provides functionality comparable to that of a B-tree, this seems like a fair candidate for a comparison. For brevity, this data structure will be referred to as a Bender set.

Cache-oblivious data structures are not yet commonly used in practice and to our knowledge there are no high-quality implementations publicly available. The Bender set therefore had to be implemented from scratch.

In contrast, both hash tables and B-trees are widely used and there are several high-quality implementations available as software libraries. It is, however, undesirable to use existing libraries for a fair comparison, for two reasons. First, many existing implementations support additional operations (such as locking, synchronization, atomic transactions, et cetera) which are not used in our experiments, but which may harm the performance of these implementations. Second, many established libraries have been thoroughly optimized while our newly implemented data structures have not. This may give the existing libraries an unfair advantage.

In an attempt to reduce the bias caused by differences in functionality and quality between existing and newly developed libraries, all data structures used in the performance evaluation have been implemented from scratch.
5.3.1 Set Operations
Dynamic set data structures may support various operations, such as adding and removing elements, testing if a value is an element of the set, finding elements nearby, succeeding or preceding a given value, counting elements in a range, et cetera. However, our state space search algorithm only needs two operations: insertion of new elements (in a set that is initially empty) and testing for the existence of elements. In fact, these two operations can be combined into a single operation. We call this operation insert(S,x). If an element x does not exist in S, then insert(S,x) inserts it and returns 0. If x is already an element of S, insert(S,x) returns 1 and no modifications are made. The inner loop of our state space algorithm can then be rewritten as follows:

    for (State s : successors(state)) {
        if (generated.insert(s) == 0) {
            queue.insert(s);
        }
    }

We now have a single operation that must be implemented by all set data structures. Recall that the values to be stored are virtual machine state descriptions, which are variable-length binary strings. All data structures must therefore support storing strings of varying length efficiently.
5.3.2 Common Implementation Details
All data structures were implemented in C using a common interface. For memory management, the data structures make use of a single memory-mapped data region, bypassing the C library's allocation functions and giving the implementer complete control over how data is arranged in memory.

In principle, this also means the operating system has control over when and which data pages are transferred from main memory to disk storage and back. However, our experiments (which were performed on a system without a swap file) were limited to using main memory only.

It should be noted that since we only need a limited subset of the functionality offered by the set data structures for our test application, we did not implement any operations that were not required to perform our experiments. However, we did not change the design of the data structures to take advantage of the reduced functionality. That means that additional operations could be implemented without structural changes to our existing implementation.
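The paper does not reproduce this interface, but a header-style sketch of what such a common C interface could look like is given below; the names and signatures are illustrative assumptions, not the code used in the experiments:

    /* Hypothetical common interface; each implementation (hash table,
     * B-tree, Bender set) would provide these entry points. */
    #include <stddef.h>

    typedef struct Set Set;

    /* Create an empty set backed by a single memory-mapped region. */
    Set *set_create(size_t region_size);

    /* The combined operation from Section 5.3.1: returns 0 if the len-byte
     * value at data was newly inserted, 1 if it was already present
     * (in which case the set is not modified). */
    int set_insert(Set *set, const void *data, size_t len);

    void set_destroy(Set *set);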
5.3.3 Hash Table Implementation
In its simplest form, a hash table consists of an index array (the index) with a fixed number of slots. A hash function is used to map values onto slot numbers. If the slot for a value is empty, it may be inserted there. Queries for the existence of an element in the hash table similarly check whether the queried value is stored at its designated slot.

When several values are inserted, some values may map to the same slot, which is problematic if each slot can only store one value. There are many different ways to resolve this collision problem; we use separate chaining, which means that slots do not store values directly, but instead each slot stores a pointer to a singly-linked list of values that map to that slot. This particular implementation technique is well-suited to the scenario where values may have different sizes (as the slots only need to store a fixed-size pointer and not a variable-size value) and maintains relatively good performance when the number of values stored exceeds the size of the index array [15]. Figure 2 shows a hash table (with an index size of four, storing three values) and the way it is stored in consecutive memory.

Figure 2: Depiction of a separate-chaining hash table

Our hash table implementation uses a fixed-size index which must be specified when creating the hash table, as this simplifies the implementation considerably. The index is stored at the front of the file, after which values are simply appended in the order in which they are added. Note that we do not support removing elements from the hash table, which means we do not have to deal with holes that would otherwise occur in the stored file.

For our experiments the FNV-1a hash function [16] is used (modulo the size of the index) to map values to slots.
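As an illustration, a minimal C sketch of this slot computation is given below. It assumes the 64-bit variant of FNV-1a (the paper does not state which width was used); the constants are the standard FNV-1a offset basis and prime:

    #include <stdint.h>
    #include <stddef.h>

    /* 64-bit FNV-1a: xor in each byte, then multiply by the FNV prime. */
    static uint64_t fnv1a(const unsigned char *data, size_t len) {
        uint64_t hash = 14695981039346656037ULL;  /* offset basis */
        for (size_t i = 0; i < len; i++) {
            hash ^= data[i];
            hash *= 1099511628211ULL;             /* FNV prime */
        }
        return hash;
    }

    /* Map a state (a binary string) to a slot in an index of nslots slots. */
    static size_t slot_for(const unsigned char *state, size_t len,
                           size_t nslots) {
        return (size_t)(fnv1a(state, len) % nslots);
    }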
5.3.4 B-tree Implementation
The B-tree data structure was first proposed by Bayer and McCreight [17] and is widely implemented and often described in textbooks. Our implementation is based on the description by Kifer, Bernstein and Lewis [18].

B-trees are similar to other search tree data structures in the sense that they store ordered values hierarchically in such a way that every subtree stores consecutive values. A key property of B-trees is that (unlike most in-memory tree structures) they do not fix the number of children per node, but instead organize data in pages of a fixed size, each containing as many values as will fit. This makes them especially suitable for purposes of on-disk storage, where reading and writing data in a few large blocks is relatively cheap compared to accessing several smaller blocks.

B-tree pages are ordered in a tree structure. Figure 3 depicts a B-tree of height two storing eight values. In each page the values are stored in lexicographical order, and the values in the first leaf page are lexicographically smaller than "John", the values in the second leaf page are between "John" and "Philip" and the values in the third leaf page are greater than "Philip". Note that since the page size is fixed, not all pages are completely filled.

Figure 3: Depiction of a B-tree

Since every page can store many values, the resulting tree is typically very shallow, which is beneficial as the number of pages that need to be retrieved is equal to the height of the tree (worst case).

New values are inserted into a leaf page which can be determined by traversing the tree. If this leaf page does not have enough free space to insert the new value, the page will have to be split: the median value is selected, and the old page is replaced by two new pages containing the values less than respectively greater than this median value, while the median value itself is moved to the parent page. When the top-most page needs to be split, a new (empty) top-level page is created and the height of the B-tree is increased by one. As a result, all leaf nodes in a B-tree are at the same depth, and all pages stored are at least half full, except possibly the root page.

B-trees can easily support the insertion of variable-length values, as long as each individual value fits in a single page. However, values with a size larger than the size of a single page must be handled separately. Our implementation does not support this and therefore requires all stored values to be smaller than the pages.
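For illustration, a sketch of how a lookup descends such a tree is given below; the Page type and the page_find helper are hypothetical stand-ins, since the paper does not show its B-tree code:

    #include <stddef.h>

    typedef struct Page Page;

    extern Page *btree_root;

    /* Hypothetical helper: searches the sorted values within one page;
     * returns 1 on a match, otherwise 0 and sets *child to the page to
     * descend into (NULL when called on a leaf). */
    int page_find(const Page *page, const void *value, size_t len,
                  Page **child);

    /* Returns 1 if the value is stored in the tree, 0 otherwise. The number
     * of pages visited is at most the height of the tree. */
    int btree_contains(const void *value, size_t len) {
        const Page *page = btree_root;
        while (page != NULL) {
            Page *child = NULL;
            if (page_find(page, value, len, &child))
                return 1;
            page = child;
        }
        return 0;
    }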
5.3.5 Bender Set Implementation
The first implementation challenge for the Bender set is that the description given by Bender et al. assumes that all values stored in the set are of a fixed size, which is not the case in our experimental framework. To work around this, we create several separate instances of the Bender set with different value sizes which are powers of two. When a value is to be inserted, its size is rounded up to a power of two and it is inserted in the corresponding set instance. This ensures that the amount of space wasted remains below 50% while still allowing values of various sizes to be inserted. The following description will be of a single Bender set instance, and it will therefore be assumed that values do have a fixed size.
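A helper of the kind this scheme requires might look like the following sketch (illustrative, not the paper's actual code):

    #include <stddef.h>

    /* Round a value size up to the next power of two. */
    static size_t round_up_pow2(size_t n) {
        size_t p = 1;
        while (p < n)
            p <<= 1;
        return p;
    }

    /* round_up_pow2(100) == 128, so a 100-byte state is stored in the
     * 128-byte set instance, wasting 28 bytes (less than 50%). */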
A Bender set has a capacity C that is a power of two. Its implementation consists of two parts: a sparse array storing the values in order, and a binary tree stored in van Emde Boas layout that is used to search for values efficiently. Both the tree and the array may contain special empty values.

Initially, the array stores only empty values and is partitioned into windows on several levels. On the highest level, there is a single window of size C. Each subsequent level has twice as many windows of half that size, and the lowest level has windows of size log2 C (rounded up to the next power of 2). For each level a maximum density is chosen, with the lowest level having density 1, the highest level having density δ, and the density for the intermediary levels obtained by interpolating linearly between 1 and δ.
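For illustration, the density thresholds could be computed as in the following sketch, where level 0 is the lowest (smallest-window) level; this is an assumption about how the interpolation is coded, not the paper's actual implementation:

    /* Maximum density for a given level, interpolating linearly between
     * 1 at the lowest level and delta at the highest level. */
    static double max_density(int level, int num_levels, double delta) {
        if (num_levels <= 1)
            return delta;
        return 1.0 + (delta - 1.0) * ((double)level / (num_levels - 1));
    }

    /* With delta = 0.5 and five levels this yields the thresholds
     * 1.0, 0.875, 0.75, 0.625 and 0.5, from the lowest level up. */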
A depiction of a Bender set storing four values (with a capacity for eight) is given in Figure 4, with the window population counts for the sparse array given on the left (density thresholds not shown) and the index tree on the right.

Figure 4: Depiction of a Bender set

The top-level density must be a number between 0 and 1; the optimal value depends on practical circumstances such as available memory and the relative cost of rebuilding the data structures. Whenever the fraction of non-empty values stored in a window (the window's population count divided by the size of the window) exceeds its maximum density, the window is said to be overflowing. Overflows are resolved when a higher-level window is rebalanced, meaning that the values in the window are redistributed evenly over the window, which will cause some of the values to be moved out of the lower-level windows which are too full.

To insert a value, the index tree is used to find the position of the smallest existing value that is greater than the new value (i.e. the value's successor); the new value will be inserted right before its successor. Then, the windows overlapping the goal position are considered from bottom to top. The lowest (smallest) window that can support another element without overflowing is selected, and then rebalanced (thereby resolving the overflow in the overlapping lower-level windows).

If all windows, including the topmost window spanning the entire array, would overflow upon insertion of the new element, the capacity of the data structure must be increased to make room for more elements. When this happens, the capacity is doubled, a new top level is created and then the entire array is rebalanced. Since increasing the capacity causes all values in the array to be moved and the entire index tree to be recreated, this is an expensive operation; however, it only occurs infrequently.

According to this description (which follows the paper by Bender et al.), a window is rebalanced every time an element is inserted. As an optimization, our implementation does not always rebalance a window. When there is free space between the successor and predecessor of the value to be inserted, the value is simply inserted in the middle. In our experiments, this optimization yielded a reduction in execution time.

A detail that is not specified in the paper by Bender et al. is how the population count for windows is kept. The simplest option is to keep no such information and simply scan part of the array whenever a population count is required. Another extreme is to keep population counts for all windows on all levels, and update these counts whenever values are inserted or moved. In our experiments a compromise seemed to work best: keep population counts only for the lowest-level windows, and recompute the counts for higher-level windows when required. This prevents a lot of recomputation, while keeping the additional costs of updating low.

Finally, the Bender set uses a complete binary tree as an index data structure to efficiently find the successor element for a given value. The tree is stored in memory in van Emde Boas layout to allow cache-oblivious traversal. This layout is named after the data structure described in [19] and determines an order in which the nodes of a tree can be stored in a contiguous block of memory, in such a way that traversing a tree with N nodes from root to leaf requires only O(log_B N + 1) pages to be fetched from memory.

In van Emde Boas layout, which is defined recursively, a tree with N nodes is divided in half vertically, creating a small subtree at the top and a number of subtrees (approximately √N) at the bottom.
Each subtree has a size of approximately √N nodes and is stored in van Emde Boas layout in a contiguous block of memory; these blocks are then stored consecutively.

Figure 5: A binary tree in van Emde Boas layout

In Figure 5 a binary tree is shown, with the nodes labeled with their positions according to the van Emde Boas layout. In this example, only two levels of recursion are needed. If we suppose that every page stores three nodes, then the highlighted path from node 1 to 12 visits only two pages.
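The recursion can be made concrete with the following runnable sketch, which enumerates the nodes of a complete binary tree (identified by their breadth-first indices: the root is 0 and the children of node i are 2i+1 and 2i+2) in van Emde Boas order. Splitting the height as h/2 for the top subtree is our assumption; the paper only says the tree is divided in half:

    #include <stdio.h>
    #include <stddef.h>

    static void veb_layout(size_t root, int height, size_t *out, size_t *pos) {
        if (height == 1) {              /* a single node */
            out[(*pos)++] = root;
            return;
        }
        int top = height / 2;           /* height of the top subtree */
        int bottom = height - top;      /* height of each bottom subtree */
        veb_layout(root, top, out, pos);
        /* The bottom subtrees are rooted at the 2^top nodes lying `top`
         * levels below `root`; in breadth-first numbering these are
         * (root+1)*2^top - 1 + i for i = 0 .. 2^top - 1. */
        size_t first = ((root + 1) << top) - 1;
        for (size_t i = 0; i < ((size_t)1 << top); i++)
            veb_layout(first + i, bottom, out, pos);
    }

    int main(void) {
        size_t order[15], pos = 0;
        veb_layout(0, 4, order, &pos);  /* complete tree with 15 nodes */
        for (size_t i = 0; i < pos; i++)
            printf("%zu ", order[i]);   /* breadth-first ids in vEB order */
        printf("\n");
        return 0;
    }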
In the index tree used for the Bender set, the leaf nodes store the same values as the sparse array (including empty values). Interior nodes store the maximum value of their two children (or an empty value, if both children store an empty value). This tree can be used in a similar way as a binary search tree: by examining the two children of a node, we can determine whether a searched-for value belongs in the left or right subtree. See the right side of Figure 4 for an example.

Note that the structure of the tree is static and only changes when the capacity of the set is increased (in which case the tree is recreated from scratch). Of course, the values stored in the tree have to be updated when the corresponding elements of the array change.
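A sketch of the resulting successor search is given below; the Node type and its helpers are hypothetical stand-ins for the actual implementation. The leaf reached holds the smallest stored value that is greater than or equal to the key:

    #include <stddef.h>

    typedef struct Node Node;

    const Node *left_child(const Node *n);
    const Node *right_child(const Node *n);
    int is_leaf(const Node *n);

    /* Hypothetical helper: compares the maximum value stored under `n`
     * with the key; empty values compare as smaller than everything. */
    int max_cmp(const Node *n, const void *key, size_t len);

    const Node *find_successor(const Node *root, const void *key, size_t len) {
        const Node *n = root;
        if (max_cmp(n, key, len) < 0)
            return NULL;                 /* key exceeds every stored value */
        while (!is_leaf(n)) {
            /* If the left subtree already contains a value >= key, the
             * successor is there; otherwise it is in the right subtree. */
            if (max_cmp(left_child(n), key, len) >= 0)
                n = left_child(n);
            else
                n = right_child(n);
        }
        return n;
    }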


5.4 Measurements
Measuring the memory usage of a process can be done by retrieving how much of the address space has been allocated by the process, called the virtual set size. This does not necessarily correspond one-to-one with memory being allocated for the process, but since the majority of the memory used by the data structures is mapped at exactly one location, this metric is suitable for our experiments.

There are several ways of measuring the time a process takes to execute. The simplest is counting the number of seconds elapsed since the start of the program (called wall clock time). This has the disadvantage that concurrently executing processes affect the timing of the process being measured, so on a busy system the reported time may vary over different executions.

A different way to measure time is using the statistics the kernel collects of how many seconds the processor spends executing instructions for the process (both in user space and in kernel space). These values are independent of what other processes are doing, which makes them more consistent across multiple executions. However, these values do not account for the time the system spends waiting (for example, for data to be read from disk) or the time spent by other processes on behalf of the executing process (for example, the kernel swap daemon on Linux). Since these factors may have a large effect on the total running time of the algorithm, it is undesirable to leave them out.

Therefore, we decided to measure wall clock time and deal with variations across multiple executions by running each experiment seven times and using the median value for further analysis.
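A minimal sketch of this measurement procedure is shown below, using gettimeofday() for wall clock time and the median of seven runs; run_experiment() is a placeholder, as the paper does not show its measurement code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static void run_experiment(void) { /* placeholder workload */ }

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        double samples[7];
        for (int i = 0; i < 7; i++) {
            struct timeval start, end;
            gettimeofday(&start, NULL);
            run_experiment();
            gettimeofday(&end, NULL);
            samples[i] = (double)(end.tv_sec - start.tv_sec)
                       + (end.tv_usec - start.tv_usec) / 1e6;
        }
        qsort(samples, 7, sizeof(samples[0]), cmp_double);
        printf("median wall clock time: %f s\n", samples[3]);
        return 0;
    }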
5.4.1 Framework Overhead
Finally, there is another factor that must be taken into account: the test framework uses several components (such as the NIPS VM and a queue) that are not being evaluated, yet which do contribute to the memory use and runtime of the process. In order to discount these factors, tests are performed using a mock set implementation that gives a near-minimal overhead. This works by first running with a real set implementation and logging all results (the return values of the insert(S,x) calls) to a file, and then running a second time while reading the stored values. In the second run, the mock set implementation does not have to actually store or retrieve data and consequently has negligible memory and runtime overhead.

In the results below, all values are reported relative to the values obtained using the mock set implementation, which means the overhead of the test framework is removed from the results. Although the extra memory used by the test framework is relatively small (less than 10% in all cases) and primarily caused by the queue data structure, the overhead in terms of execution time was quite significant: close to 70% in the worst case. This does mean that the set data structure accounts for at least 30% of the execution time of the search algorithm in all cases (and much more for the slower data structures).

5.4.2 System Configuration
The experiments were performed on a 64-bit Linux system (kernel version 2.6.18) with an Intel Xeon E5335 processor (2 GHz, 4 MB cache) and 8 GB of main memory (no swap file). Although the system is a multiprocessor system (and the E5335 is a dual-core processor), all code is single-threaded so only a single core is used when executing tests.

The following models are used for benchmarking:

• Eratosthenes is a parallel implementation of the sieve of Eratosthenes which is used to find prime numbers up to a given maximum. It has a single parameter N: only integers from 2 to N (exclusive) are tested. A new process is created for every prime number found, and as a result the state space can only be exhaustively searched for relatively small values of N (e.g. N < 40). Because processes are dynamically created, state size increases during execution.

• Leader2 is a leader election algorithm with a configurable number of processes competing for leadership (N). Since a constant number of processes will be created and these processes cannot make progress until all of them are created, almost all states have the same size.

• Peterson N is Peterson's solution to the multi-process mutual exclusion problem (using global variables instead of channels for communication) with a configurable number of processes (N). The processes are created atomically, so the state size is constant after initialization.
In Table 2 an overview is given of the parameters of the models used and the resulting properties of the state space search. Note that the number of transitions relative to the number of iterations gives an indication of the ratio of value look-ups (insertions of values that are already present in the set) to actual insertions, ranging approximately from 4 to 2.7 look-ups per insertion (for Peterson N, for instance: (37,434,411 - 10,000,000) / 10,000,000 ≈ 2.7).

Model          Parameters   Iterations    Transitions
Eratosthenes   N = 40       1,019,960     4,923,218
Leader2        N = 5        5,950,945     23,856,363
Peterson N     N = 4        10,000,000    37,434,411

Table 2: Properties of test cases used

6. RESULTS
Each of the three data structures has some configurable parameters. The hash table needs to be parametrized with the size of the index, the B-tree with the size of the pages, and the Bender set with the density parameter (δ). To determine which parameters to use, various different parameters were first tried on the first (and smallest) test case. In this case, a million relatively small states need to be stored.

Figure 6(a) shows the execution times for the hash table. The hash table with a small index (100,000 slots) performs well initially, but gets slower as it becomes too full, at which point each slot in the hash table has to store a long list of values that map to that slot. As Figure 7(a) shows, the only difference in memory usage of the different hash tables is in the (fixed) size of the index. We will use the hash table with an index size of 10 million slots for further experiments, as it performed best in this case, and seems suitable for other cases (which require storing more values) as well.

Figure 6(b) and Figure 7(b) show the execution time and memory usage for the B-tree with page sizes ranging from 1 kilobyte to 16 kilobyte. The execution platform maps memory in 4 KB pages, which would suggest that using pages smaller than 4 KB makes little sense. Indeed, the B-tree with 1 KB pages seems to perform worst, but if we take the memory use into account, this is most likely due to the fact that fewer items fit on a single page, which means a relatively large portion of the page remains unused. The difference between 4 KB and 16 KB pages is relatively small (both in execution time and memory usage); we select the 16 KB page size for further experiments because state sizes are larger in other test cases, in which case the 4 KB page size may cause similar problems as the 1 KB page size here.

Finally, in Figure 6(c) the execution times for the Bender set with several different density parameters are given. With a lower density, more space is required, but new insertions less frequently cause large windows to be rebalanced. The execution times for the density values of 0.5 and lower are almost the same, while larger density values are slower. Figure 7(c) shows that the lower the density value, the more memory is used. It appears that the advantage of low density is negated by the overhead of constructing increasingly large data structures. The set with δ = 0.125 cannot even finish the test case in the memory that is available. We select δ = 0.5 for future experiments, which strikes a good balance between execution time and memory requirements.

Now that we have established the parameters to use for the other test cases, we can run final experiments on all three test cases. The execution times for these cases are presented in Figure 8 and the corresponding memory usage is presented in Figure 9.

7. DISCUSSION
The three test cases paint a very similar picture: in all cases, the hash table offers the best performance in terms of both execution time and memory usage, followed by the B-tree. In all cases, the Bender set performs worst by a large margin.

It should be noted that the Bender set is the only data structure that shows abrupt changes in both the execution time and memory usage graphs. These jumps in the graphs occur whenever the capacity of the Bender set is increased; this causes the memory allocated for the set to be doubled and the array and tree structures to be recreated, which is a relatively expensive operation.

The difference in execution time between the hash table and B-tree increases as the size of the states becomes larger. This is explained by the fact that the height of the B-tree depends on the average number of values per page; larger values mean fewer values per page, and therefore a deeper tree, and as a result more pages to be fetched for a query. The hash table does not have such a limitation, as every value in a bucket is fetched independently, regardless of the size of these values.

With respect to memory usage the B-tree and hash table have similar requirements; the B-tree has slightly larger overhead per value stored, but initially the hash table uses more memory because of the space allocated for the index.

The performance of the Bender set does not depend on the page size, and as a result the difference between the Bender set and the B-tree is smallest when the B-tree's page size is largest. Unfortunately, even then it requires about three times as much time (and 5-6 times as much memory).

The increased memory requirements of the Bender set can in part be explained by our implementation of variable-length values, which wastes some space by storing them in fixed-size slots. This overhead should be around 25% on average.

The test cases used all have a relatively high ratio of value insertions to look-ups. This may explain the good performance of the hash table (for which insertions are barely more expensive than look-ups) as well as the bad performance of the Bender set (for which insertions are relatively expensive, especially when a window is rebalanced). Unfortunately, our experiments do not give enough data to determine how the relative performance of the data structures changes when this ratio changes.

7.1 Implementation Complexity
Since all data structures included in the experiments were implemented from scratch, our research also yields some insight into the implementation complexity of the different data structures.

Table 3 gives the number of lines of source code used to implement various parts of the test application, after removing comments and blank lines (which amount to about 25% of the code). Common code includes allocation functions, interface descriptions and comparison and hashing functions. The framework includes not just the search algorithm, but also the functionality to report various metrics while running a test case.
Purpose            Lines   Percentage
Common code        433     18.21%
Hash table         146     6.14%
B-tree             299     12.57%
Bender set         623     26.20%
Queue              184     7.74%
Search Framework   693     29.14%

Table 3: Source lines of code of the test application

Although not a perfect metric of implementation complexity, the lines of code required for the various components of the test application do give some insight into the relative complexity of the data structures. It is clear why hash tables are a popular choice: they are easy to implement yet perform very well. The Bender set not only uses more lines of code, but in our experience also required a greater amount of effort to implement correctly.

8. FUTURE WORK
It should be noted that in our experiments only part of the memory hierarchy was used (up to the use of main memory). This still involves several levels of processor cache, but it is not a very deep hierarchy. In a deeper memory hierarchy, with greater differences in access times between the levels, the cache-efficient data structures (the B-tree and the Bender set) should perform better. Specifically, using disk-based storage as the lowest level of storage seems a logical extension of our research.

In our experiments we did not measure cache efficiency specifically; instead, we measured total execution time only, which is affected by several different factors of which cache efficiency is only one. To better understand how different factors influence performance, it would be interesting to measure cache efficiency separately and report on actual cache hits and misses on different boundaries of the memory hierarchy.

Finally, we only evaluated a single cache-oblivious data structure in a single test environment (even though we used more than one model to perform experiments). To draw more general conclusions about the practical merits of cache-oblivious data structures, it will be necessary to perform experiments at a larger scale, comparing multiple cache-oblivious data structures and using scenarios that differ more.

9. CONCLUSION
The experimental results clearly show that the cache-oblivious data structure proposed by Bender, Duan, Iacono and Wu is outperformed by traditional data structures in our test scenario, in terms of both execution time and memory use. The advantage of more cache-friendly behavior does not appear to be large enough to compensate for the increased complexity of the data structure, which results in higher memory requirements and increased computational overhead. If the data structure does have asymptotic performance benefits, then realistic workloads on current hardware systems are not enough to reveal this. These findings are consistent with earlier results, such as those obtained by Rahman, Cole and Raman, and by Olsen and Skov (see Section 4).

Of course, this does not prove conclusively that cache-oblivious data structures are entirely without merit. Our experiments are limited in scope: only a single cache-oblivious data structure has been examined, in a single test scenario, on a single platform. In a different setting, different cache-oblivious data structures might compare favorably to their traditional counterparts. However, since the observed differences in performance are fairly large, it is unlikely that small changes are enough to close the performance gap.

We hope that future research on cache-oblivious data structures will not focus on theoretical performance alone, but will also compare performance in practice with existing alternatives. Although theoretical results are invaluable, newly developed data structures and algorithms should, preferably, have demonstrable practical merit as well.

10. REFERENCES
[1] Intel Corporation. IA-32 Intel Architecture Optimization Reference Manual, 2004.
[2] A. Aggarwal and J.S. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116-1127, 1988.
[3] A. Aggarwal, B. Alpern, A. Chandra, and M. Snir. A model for hierarchical memory. ACM Press, New York, NY, USA, 1987.
[4] H. Prokop. Cache-Oblivious Algorithms. Master's thesis, Massachusetts Institute of Technology, 1999.
[5] E.D. Demaine. Cache-oblivious algorithms and data structures. Lecture Notes from the EEF Summer School on Massive Data Sets, 2002.
[6] M.A. Bender, E.D. Demaine, and M. Farach-Colton. Cache-oblivious B-trees. SIAM Journal on Computing, 35(2):341-358, 2005.
[7] M.A. Bender, Z. Duan, J. Iacono, and J. Wu. A locality-preserving cache-oblivious dynamic dictionary. Journal of Algorithms, 53(2):115-136, 2004.
[8] N. Rahman, R. Cole, and R. Raman. Optimised predecessor data structures for internal memory. In Algorithm Engineering: 5th International Workshop, WAE 2001, Aarhus, Denmark, August 28-31, 2001, Proceedings, 2001.
[9] N. Askitis and J. Zobel. Cache-conscious collision resolution in string hash tables. In String Processing and Information Retrieval: 12th International Conference, SPIRE 2005, Buenos Aires, Argentina, November 2-4, 2005, Proceedings, 2005.
[10] J.S. Vitter. External memory algorithms and data structures: dealing with massive data. ACM Computing Surveys, 33(2):209-271, 2001.
[11] J.H. Olsen and S.C. Skov. Cache-Oblivious Algorithms in Practice. Master's thesis, University of Copenhagen, Copenhagen, Denmark, 2002.
[12] G.J. Holzmann. The Spin Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2004.
[13] M. Weber. An embeddable virtual machine for state space generation. In SPIN, pages 168-186, 2007.
[14] M. Weber. NIPS VM. http://www.cs.utwente.nl/~michaelw/nips/.
[15] R. Sedgewick. Algorithms in C++. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1992.
[16] L.C. Noll. Fowler/Noll/Vo (FNV) hash, 2004. http://isthe.com/chongo/tech/comp/fnv/.
[17] R. Bayer and E. McCreight. Organization and maintenance of large ordered indices. 1970.
[18] M. Kifer, A. Bernstein, and P.M. Lewis. Database Systems: An Application-Oriented Approach. Addison-Wesley, 2006.
[19] P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Theory of Computing Systems, 10(1):99-127, 1976.
Figure 6: Execution times per data structure on Eratosthenes: (a) hash table, with index sizes of 100,000, 1,000,000 and 10,000,000 slots; (b) B-tree, with page sizes of 1 KB, 4 KB and 16 KB; (c) Bender set, with δ = 0.125, 0.25, 0.5, 0.667 and 0.75. Each panel plots time (seconds) against iterations (×1000).

Figure 7: Memory usage per data structure on Eratosthenes: same panels and parameters as Figure 6, plotting memory (MB) against iterations (×1000).

Figure 8: Execution times per test case: (a) Eratosthenes; (b) Leader2; (c) Peterson N. Each panel compares the hash table, B-tree and Bender set, plotting time (seconds) against iterations (×1000).

Figure 9: Memory usage per test case: (a) Eratosthenes; (b) Leader2; (c) Peterson N. Each panel compares the hash table, B-tree and Bender set, plotting memory (MB) against iterations (×1000).
