Thesis
This work was supported by the National Science Foundation under grants CCF-1018188 and CCR-0122581, by
Intel via the Intel Labs Pittsburgh, the Intel Labs Academic Research Office for the Parallel Algorithms for Non-
Numeric Computing Program and the Intel Science and Technology Center for Cloud Computing (ISTC-CC), by
Microsoft as part of the Carnegie Mellon Center for Computational Thinking, and by IBM.
Keywords: Parallelism, Locality, Cost Models, Metrics, Schedulers, Parallel Algorithms,
Shared Memory, Cache Hierarchies, Performance of Parallel Programs, Programming Interfaces,
Run-time Systems, Parallel Cache Complexity, Effective Cache Complexity, PCC Framework,
Parallelizability
Abstract
Good locality is critical for the scalability of parallel computations. Many cost models that quantify the locality and parallelism of a computation with respect to specific machine models have been proposed. A significant drawback of these machine-centric cost models is their lack of portability. Since the design and analysis of good algorithms in most machine-centric cost models is a non-trivial task, lack of portability can lead to a significant waste of design effort. Therefore, a machine-independent, portable cost model for locality and parallelism that is relevant to a broad class of machines can be a valuable guide for the design of portable and scalable algorithms, as well as for understanding the complexity of problems.
This thesis addresses the problem of portable analysis by presenting program-centric metrics for measuring the locality and parallelism of nested-parallel programs written for shared memory machines – metrics based solely on the program structure, without reference to machine parameters such as processors, caches and connections. The metrics we present for this purpose are the parallel cache complexity (Q∗) for quantifying locality, and the effective cache complexity (Q̂_α) for quantifying both locality and the cost of load balancing, for a parameter α ≥ 0. These two program-centric metrics constitute the parallel cache complexity (PCC) framework. We show that the PCC framework is relevant by constructing a class of provably-good schedulers that map programs to the parallel memory hierarchy (PMH) model, represented by a tree of caches. We prove optimal performance bounds on the communication costs of our schedulers in terms of Q∗, and on running times in terms of Q̂_α. For many algorithmic problems, it is possible to design algorithms for which Q̂_α = O(Q∗) for a range of values of α > 0. The least upper bound on the values of α for which this asymptotic relation holds yields a new notion of parallelizability of algorithms that subsumes previous notions such as the ratio of work to depth. For such algorithms, our schedulers are asymptotically optimal in terms of load balancing cache misses across a hierarchy with parallelism β < α, which results in asymptotically optimal running times. We also prove bounds on the space requirements of dynamically allocated computations in terms of Q̂_α. Since the PMH machine model is a good approximation for a broad class of machines, our experimental results demonstrate that program-centric metrics can capture the cost of parallel computations on shared memory machines.
To show that the PCC framework is useful, we present optimal parallel algorithms for common building blocks such as Prefix Sums, Sorting and List Ranking. Our designs use the observation that algorithms that have low depth (preferably polylogarithmic depth, as in NC algorithms), reasonable “regularity” and optimal sequential cache complexity on the ideal cache model are also optimal according to the effective cache complexity metric. We present results indicating that these algorithms also scale well in practice.
Emphasizing the separation between algorithms and schedulers, we built a framework for an empirical study of schedulers and engineered a practical variant of our schedulers for the PMH model. A comparison of our scheduler against the widely used work-stealing scheduler on a 32-core machine with one level of shared cache reveals a 25–50% reduction in misses on the shared cache for most benchmarks. We present measurements that show the extent to which this translates into improvements in running times for memory-intensive benchmarks at different bandwidths.
Acknowledgments
I am very grateful to Guy Blelloch and Phillip Gibbons for introducing me to
this field and many of the problems addressed in this thesis. In addition to being
inspirations as researchers, they have been great mentors and extremely patient
teachers. It has been exciting to work with them, and many of the results in this
thesis have come out of discussions with them. I could not have asked for wiser
advisors.
I would like to thank Jeremy Fineman for many useful discussions, both techni-
cal and otherwise. He has been a great sounding board for many of my half-baked
ideas. I would like to thank Charles Leiserson for his useful comments and pointers
throughout the course of my thesis work. I would like to thank both Jeremy and Charles for their feedback, which has greatly helped me improve the presentation of
this thesis. I would like to thank Gary Miller for asking many pertinent and incisive
questions regarding the thesis.
I would like to thank my co-advisees Aapo Kyrola, Julian Shun and Kanat Tang-
wongsan for being great collaborators. I would like to thank Cyrus Omar, Umut Acar
and Kayvon Fatahalian for many useful discussions and comments about my work.
I would like to thank Pranjal Awasthi, Manuel Blum, Ashwathi Krishnan, Richard
Peng, Vivek Seshadri and Aravindan Vijayaraghavan for many general discussions
related to the thesis and their constructive feedback on my writing and talks.
Finally, I would like to thank my parents, my grandparents and my family for
encouraging me to study science, for being my first teachers, and for their unwavering support.
Contents
1 Introduction
  1.1 Problem Background
    1.1.1 Hardware Constraints of Parallel Machines
    1.1.2 Modeling Machines for Locality and Parallelism
    1.1.3 Portability of Programs and Analysis
  1.2 Problem Statement
  1.3 Solution Approach and Contributions
    1.3.1 Program-centric Cost Model for Locality and Parallelism
    1.3.2 Scheduler for Parallel Memory Hierarchies
    1.3.3 Performance Bounds for the Scheduler
    1.3.4 Algorithms: Theory and Practice
    1.3.5 Experimental Analysis of Schedulers
  1.4 Thesis statement
    2.7.3 Relation to other Machine Models
    2.7.4 Memory Consistency Model
  2.8 Reuse Distance
3 Program Model
  3.1 Directed Acyclic Graph
  3.2 Nested-Parallel DAGs – Tasks, Parallel Blocks and Strands
    3.2.1 Properties of Nested-Parallel DAGs
  3.3 Memory Allocation Policy
    3.3.1 Static Allocation
    3.3.2 Dynamic allocation
5 Schedulers
  5.1 Definitions
  5.2 Mapping Sequential Cache Complexity – Scheduling on One-level Cache Models
    5.2.1 Work-Stealing Scheduler
    5.2.2 PDF Scheduler
  5.3 Mapping PCC Cost Model – Scheduling on PMH
    5.3.1 Communication Cost Bounds
    5.3.2 Time bounds
    5.3.3 Space Bounds
7 Experimental Analysis of Space-Bounded Schedulers
  7.1 Experimental Framework
    7.1.1 Interface
    7.1.2 Implementation
    7.1.3 Measurements
  7.2 Schedulers Implemented
    7.2.1 Space-bounded Schedulers
    7.2.2 Schedulers implemented
  7.3 Experiments
    7.3.1 Benchmarks
    7.3.2 Experimental Setup
    7.3.3 Results
  7.4 Conclusion
8 Conclusions
  8.1 Future Work
Bibliography
List of Figures
7.1 Memory hierarchies
7.2 Interface for the program and scheduling modules
7.3 WS scheduler implemented in scheduler interface
7.4 Specification entry for a 32-core Xeon machine
7.5 Layout of L3 cache banks and C-boxes on Xeon 7560
7.6 Timings and L3 cache miss numbers for RRM
7.7 Timings and L3 cache miss numbers for RRM
7.8 L3 cache misses for varying number of processors
7.9 Timings and L3 cache misses for applications at full bandwidth
7.10 Timings and L3 cache misses for applications at one-fourth bandwidth
7.11 Empty queue times for Quad-Tree Sort
List of Tables
Chapter 1
Introduction
[Figure 1.1 plot: DRAM capacity (Gb) and latency (ns) for the dominant chips from SDR-200 (2000) through DDR3-1333 (2011).]
Figure 1.1: DRAM capacity and latency trends based on the dominant DRAM chips during this period [52, 104, 136, 144]. The overall DRAM access latency can be decomposed into
individual timing constraints. Two of the most important timing constraints in DRAM are tRCD
(row-to-column delay) and tRC (“row-conflict” latency). When DRAM is lightly loaded, tRCD is
often the bottleneck, whereas when DRAM is heavily loaded, tRC is often the bottleneck. Note
the relatively stagnant access latency. Figure adapted from [114].
with better variable (data) reuse and an ability to execute a greater number of instructions using
a smaller working set that fits in cache would show significant benefits. In fact, many common
algorithmic problems admit algorithms that have significant data reuse [72, 166].
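To make this concrete, the following C++ sketch (an illustration written for this discussion, not code from the thesis) computes the same matrix product two ways: a straightforward triple loop with little reuse per cache line, and a tiled version whose inner loops perform Θ(T³) operations on a working set of roughly 3·T·T values chosen to fit in cache.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<double>;  // row-major n x n matrix

    // Low reuse: for each (i, j) the inner loop streams an entire column of B,
    // so the working set of the inner loop is about n cache lines.
    void multiply_naive(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
          for (std::size_t k = 0; k < n; ++k)
            C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // High reuse: T x T tiles of A, B and C are each used Theta(T) times while
    // they stay cache-resident; T is chosen so ~3*T*T doubles fit in cache.
    void multiply_blocked(const Matrix& A, const Matrix& B, Matrix& C,
                          std::size_t n, std::size_t T) {
      for (std::size_t i0 = 0; i0 < n; i0 += T)
        for (std::size_t k0 = 0; k0 < n; k0 += T)
          for (std::size_t j0 = 0; j0 < n; j0 += T)
            for (std::size_t i = i0; i < std::min(i0 + T, n); ++i)
              for (std::size_t k = k0; k < std::min(k0 + T, n); ++k)
                for (std::size_t j = j0; j < std::min(j0 + T, n); ++j)
                  C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

Both functions perform the same 2n³ operations; only the order of the iterations, and hence the reuse of the cached working set, differs.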
A second approach to solving the memory access latency problem is prefetching — antici-
pating future memory accesses of a processor based on current access patterns and caching the
location before an instruction explicitly requests it. This approach has its limitations: it is not al-
ways possible to predict future accesses with precision, and aggressive prefetching may saturate
the bandwidth of the connections between cache and memory banks. Therefore, even when a
processor is equipped with a prefetcher, it is important to design programs that use caches well.
Moreover, prefetchers bloat the processor, increasing its die size and power consumption. Several hardware designs, such as GPUs, include only a small, less accurate prefetcher or leave it out altogether.
A third approach to the problems is to hide latency with multi-threading, also referred to as
hyper-threading on some architectures. Each core is built to handle several instruction pipelines
corresponding to different threads. When one pipeline halts due to slow memory access, other
pipelines are brought in and processed. Although multi-threading can hide latency to some extent, it results in greater processor complexity. Programming such processors requires great care to ensure that the pipelines do not compete for the limited cache available to each core.
None of these hardware fixes are definitive solutions to the bandwidth problem — the gap
between the rate at which processors can process data and the rate at which the connections
between processors, caches and memory can transfer it. This gap has been widening every
year with faster processor technologies [123]. Bandwidth, rather than the number of cores, is
the bottleneck for many communication-intensive programs. Working with limited bandwidth requires a fundamental rethinking of algorithm design to generate fewer data transfers, rather than further hardware fixes.
The performance landscape is further complicated by the arrangement of processors with
respect to the cache and memory banks. Hardware designs vary widely in how they budget
silicon area and power to processors and caches, the number of processors they include, and the
connections between modules on and across chips. Designs also vary in the assignment of caches to processors — some caches are exclusive to individual processors while others are shared by subsets of processors. Further variations include the arrangement of memory banks and the degree of coherence between them, i.e., whether the memory banks present a unified address space with certain consistency semantics or are exposed as fragmented pieces to be dealt with directly by the programmer.
Irrespective of these design choices, programmers are left to deal with two common and vital
issues when attempting to write fast and scalable parallel programs for any parallel machine:
1. Maximizing locality: Reducing memory access latency by effective management of caches
and economical usage of bandwidth between processing and memory components.
2. Load Balance: Maximizing utilization of processors available on the machine.
Managing the two issues simultaneously is a difficult problem that requires a good understanding of the machine and is not always solvable in an ad hoc manner. A systematic approach
to address this problem is to present a programmer with a good cost model for the machine that
estimates the effectiveness of their program with respect to managing these two issues. Simple
cost models like PRAM make the programmer take account of load balance costs while being
agnostic to locality. However, the performance of data-intensive applications is constrained more by the communication capabilities of the machine than by its computation capabilities. Contemporary chips such as the 8-core Nehalem can easily generate data requests at a rate two to three times greater than what the interconnect can support [133]. Placing multiple
such chips on the interconnect exacerbates the problem further, and the gap between computing
power and transfer capability is likely to widen in future generations of machines. Therefore, it
is necessary to construct a cost model that makes the programmer take account of locality issues
in addition to load balance.
[Figure 1.2 (panels (a) and (b)): one-level machine models, each with p processors and a memory with parameters M = ∞ and line size B.]
explicit routing specification. Programs are allowed to be, and often are, dynamic. That is, parts
of the program are revealed only after prerequisite portions of the program are executed. This
necessitates an online scheduling policy that assigns instructions to processors on the fly.
The output of the cost model for a given execution typically includes costs such as the amount
of data transferred for capturing locality, execution completion time to reflect load balance, the
maximum space footprint of the program, etc. For example, machine-centric cost models devel-
oped for machine models in Figure 1.2 measure:
• Locality: total number of cache misses under the optimal cache replacement policy [150]
and full associativity.
• Time: number of cycles to complete execution assuming that cache misses stall an instruc-
tion for C time units.
• Space: maximum number of distinct memory locations used by the program at any point
during its execution.
The output of the cost model matches up with actual performance on a machine to the extent
that the model is a realistic representation of the machine. Therefore, various machine and cost
models have been proposed for different categories of machines. The two key aspects in which
the underlying machine models differ are:
1. Whether the memory presents a unified address space (shared memory) or if the program
is required to handle accessing distinct memory banks (distributed memory).
2. The number of levels of caches or program synchronization points: while some models account for locality at only one level, others allow hierarchies, i.e., multiple levels of caches and synchronization.
Figure 1.3 classifies previous work along these two axes. The presentation of this thesis builds
on shared memory models of which a brief overview follows. Chapter 2 surveys these various
machine models in greater detail.
Shared Memory. Shared memory parallel machines present a single continuously addressable
memory address space. Contemporary examples include Intel Nehalem/SandyBridge/IvyBridge
with QPI, IBM Z system, IBM POWER blades, SGI Altix series and Intel MIC. The simpler
Figure 1.3: Machine-centric cost models for parallel machines that address locality. Models are
classified based on whether their memory is shared or distributed, and the number of levels of
locality and synchronization costs that are accounted for in the model.
interface makes them a popular choice for most programmers as they are easier to program than
distributed systems and graphics processing units. They are often modelled as PRAM machines
when locality is not a consideration.
Previously studied models for shared memory machines include the Private Cache model
[2, 48], the closely related parallel External Memory model (PEM) [20] and the Shared Cache
model [37]. All of these models include a large unified shared memory and one level of cache, but differ in the ownership of caches: the private-cache models assign each processor its own cache, while the shared-cache model has all processors share a single cache.
Locality on modern parallel machines cannot be accurately captured by a model with only one cache level, since access latencies across the hierarchy vary over several orders of magnitude. To reflect this more accurately, models with several cache levels are required. Further, the model
should include a description of the ownership or nearness relation between caches and proces-
sors. Such models include the Multi-BSP [164] and the Parallel Memory Hierarchy model [13].
The Parallel Memory Hierarchy model (PMH model, Figure 1.2(c)) describes the machine
as a symmetric tree of caches with the memory at the root and the processors at the leaves. Each
internal node is a cache; the closer the cache is to the processor in the tree, the faster and smaller
it is. Locality is measured in terms of the number of cache misses at each level in the hierarchy,
and parallelism in terms of program completion time.
The PMH model will be referred to multiple times through this thesis. It is one of the most
relevant and general machine models for various reasons:
• It models many modern shared memory parallel machines reasonably well. For example,
Figure 1.5 shows the memory hierarchies for the current generation desktop/servers from
Intel, AMD, and IBM all of which directly correspond to the PMH model.
• It is also a good approximation for other cache organizations. Consider, for example, a 2-D grid in which each node consists of a cache and a processor. It is possible to approximate it with a PMH by recursively partitioning the grid into finer sub-grids. Figure 1.6 depicts such a grid of dimension 4 × 4 and how such a partition might be modeled with a three-level cache hierarchy.
[Figure 1.4 diagram: an h-level tree of caches with the memory (M_h = ∞, line size B_h) at the root; a level-i cache has size M_i and line size B_i, each level has fan-out f_i, and each level has an associated access cost C_i.]
Figure 1.4: A Parallel Memory Hierarchy, modeled by an h-level tree of caches, with ∏_{i=1}^{h} f_i processors.
• Since it is possible to emulate any network of a fixed volume embedded in 2 or 3 dimen-
sions with a fat-tree (a tree with greater bandwidth towards the root) network with only
poly-logarithmic factor slowdown in routing delay [121], fat-tree networks can support
robust performance across programs. PMH models the fat-tree network with storage ele-
ments at each node [112, 169].
[Figure 1.5 diagrams: (a) 32-processor Intel Xeon 7500 with up to 1 TB of memory, 24 MB L3 caches and 32 KB L1 caches; (b) 48-processor AMD Opteron 6100 with up to 512 GB of memory, 12 MB L3 caches and 64 KB L1 caches; and an IBM system with up to 3 TB of memory, 196 MB L4 caches and 24 MB L3 caches.]
Figure 1.5: Memory hierarchies of current generation architectures from Intel, AMD, and IBM.
Each cache (rectangle) is shared by all processors (circles) in its subtree.
[Figure 1.6 diagram: a three-level tree of caches (L1, L2, L3, with processors at the leaves) approximating a grid of caches and processors.]
[Figure 1.7 diagram: program version 1, written for Machine 1 (p1 processors, M1’s locality), and program version 2, written for Machine 2 (p2 processors, M2’s locality).]
Figure 1.7: Programs designed in machine-centric cost models may not be portable, either as programs or in their analysis.
model. Such models have little utility if they do not represent costs on real machines. Therefore,
in addition to the model itself, it is critical to show general bounds on performance when map-
ping the program onto particular machine models. Ideally, one high-level program-centric model
for locality can replace many machine-centric models, allowing portability of programs and the
analysis of their locality across machines via machine-specific mappings. Such mappings make
use of a scheduler designed for a particular machine model. Schedulers should come with pro-
gram performance guarantees that are based solely on program-centric metrics for the program
and parameters of the machine.
There are certain natural starting points for program-centric metrics. The ratio of the number
of instructions (work) to the length of the critical path (depth) is a good metric for parallelism
in the PRAM model. However, non-uniform communication cost creates variations in execution
time of individual instructions and this ratio may not reflect the true extent of parallelism on
real machines. Independently, the Cache-Oblivious framework [86] is a popular framework for
measuring locality of sequential programs. In this framework, programs are measured against an
abstract machine with two levels of memory: a slow unlimited RAM and a fast ideal cache with limited size.
[Figure 1.8 diagram: a single portable program mapped by Schedule 1 to Machine 1 and by Schedule 2 to Machine 2.]
Figure 1.8: Separating programs from schedulers: programs should be portable, and schedulers should be used to map programs to target machines.
Locality for a sequential program is expressed in terms of the cache complexity
function Q(M, B), defined to be the number of cache misses on an ideal cache [86] of size M
and cache line size B. As long as a program does not use the parameters M and B, the metric is
program-centric and the bounds for the single level are valid simultaneously across all levels of
a multi-level cache hierarchy.
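As a small illustration of a cache-oblivious analysis (a sketch written for this discussion, not an example from the thesis), the divide-and-conquer sum below never mentions M or B, yet its cache complexity on an ideal cache is Q(n; M, B) = O(n/B + 1), and this single analysis holds simultaneously at every level of a multi-level hierarchy.

    #include <cstddef>

    // Divide-and-conquer sum over a contiguous array. In depth-first order the
    // input is read left to right, so each of the O(n/B) cache lines holding it
    // is brought into an ideal cache at most once (assuming M holds at least a
    // few lines), giving Q(n; M, B) = O(n/B + 1).
    double dc_sum(const double* a, std::size_t n) {
      if (n == 0) return 0.0;
      if (n == 1) return a[0];
      std::size_t mid = n / 2;
      return dc_sum(a, mid) + dc_sum(a + mid, n - mid);
    }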
There has been some success in putting together a program-centric cost model for parallel
programs that has relevance for simple machine models such as the one-level cache models in
Figure 1.2(a),(b) [2, 35, 40, 84]. The cache-oblivious framework (a.k.a., the ideal cache model)
can be used for analyzing the misses in the sequential ordering. For example, we could use the
depth-first order to define a sequential schedule (Figure 1.9) to obtain a cache complexity Q1
on the ideal cache model, say for fixed parameters M, B. One can show, for example, that any
nested-parallel computation with sequential cache complexity Q1 and depth D will cause at most
Q1 + O(pDM/B) cache misses in total when run with an appropriate scheduler on p processors,
each with a private cache of size M and block size B [2, 50].
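To see how such a bound is used (an illustrative substitution, not a theorem quoted from the thesis): for a nested-parallel computation with sequential cache complexity Q1(n; M, B) and depth D = O(log^2 n), the statement above gives, on p private caches of size M and line size B,

    total cache misses ≤ Q1(n; M, B) + O(p·M·log^2 n / B),

so the parallel overhead is additive and grows only polylogarithmically with n.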
Unfortunately, previous program-centric metrics have important limitations in terms of gener-
ality: they either apply to hierarchies of only private or only shared caches [2, 35, 40, 66, 84, 134],
require some strict balance criteria [38, 67], or require a joint algorithm/scheduler analysis
[38, 63, 64, 66, 67]. Constructing a program-centric metric relevant to realistic machine models
such as PMHs seems non-trivial.
[Figure 1.9: a nested-parallel DAG with its nodes numbered 1–16 in depth-first order.]
instructions would be known and each node in the DAG could be assigned a weight depending on
the time taken for the memory request initiated by the instruction to complete. This would give a
fairly accurate prediction of the parallelism of the program. However, this approach violates the
program-centric requirement on the cost model.
To retain the program-centric characteristic, we might place weights on nodes and sets of
nodes—supernodes—corresponding to the delay that might be realizable in practice in a sched-
ule that is asymptotically optimal. The delay on a node might correspond to access latency of the
instruction, and the delay on a supernode might correspond to the cost of warming up a cache for
the group of instructions that a schedule has decided to colocate on a cluster of processors sharing
a cache. However, what constitutes a good schedule for an arbitrary DAG might differ based on
the target machine. Even for a target machine, finding the optimal schedule may be computation-
ally hard. Further, it is intuitively difficult to partition arbitrary DAGs into supernodes to assign
weights.
This dissertation focuses only on nested-parallel DAGs—programs with dynamic nesting of
fork-join constructs but no other synchronizations—as they are more tractable. Although it might
seem restrictive to limit our focus to nested-parallel DAGs, this class is sufficiently expressive for many algorithmic problems [149]. It includes a very broad set of algorithms, including most
divide-and-conquer, data-parallel, and CREW PRAM-like algorithms.
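For concreteness, here is a minimal C++ sketch of a nested-parallel computation (written for this discussion; it is not code from the thesis, and std::async merely stands in for the fork-join runtime a real scheduler would provide): a divide-and-conquer quicksort whose only synchronization is the join of its two forked recursive calls.

    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <vector>

    // Nested-parallel quicksort: fork the two recursive calls, then join.
    // The calls work on disjoint ranges of v, so there are no data races.
    void nested_quicksort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
      if (hi - lo < 2048) {                               // small base case: run sequentially
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
      }
      int pivot = v[lo + (hi - lo) / 2];
      // Three-way partition: [lo, a) < pivot, [a, b) == pivot, [b, hi) > pivot.
      auto m1 = std::partition(v.begin() + lo, v.begin() + hi,
                               [pivot](int x) { return x < pivot; });
      auto m2 = std::partition(m1, v.begin() + hi,
                               [pivot](int x) { return x == pivot; });
      std::size_t a = static_cast<std::size_t>(m1 - v.begin());
      std::size_t b = static_cast<std::size_t>(m2 - v.begin());
      auto left = std::async(std::launch::async,          // fork
                             [&v, lo, a] { nested_quicksort(v, lo, a); });
      nested_quicksort(v, b, hi);
      left.get();                                         // join
    }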
The solution approach starts with a cost model to analyze the “cache complexity” of nested-
parallel DAGs. The cache complexity is a function that estimates the number of cache misses a
nested-parallel DAG is likely to incur on a variety of reasonable schedules, as a function of the
available cache size. Cache misses can be used as a proxy for objectives an execution needs to
optimize such as minimizing latency and bandwidth usage (see section 2.7.1 for more details).
[Figure 1.10 diagram: a nested-parallel DAG decomposed into maximal size-M tasks and glue nodes.]
Figure 1.10: Parallel cache complexity analysis.
The PCC model takes advantage of the fact that nested-parallel DAGs can be partitioned
easily and naturally at various scales. Any corresponding pair of fork and join points separates
the supernode nested within from the rest of the DAG. We call such a supernode, together with the corresponding fork and join points, a task.
We refer to the memory footprint of a task (total number of distinct cache lines it touches) as
its size. We say a task is size M maximal if its size is less than M but the size of its parent task
is more than M . Roughly speaking, the PCC analysis decomposes the program into a collection
of maximal subtasks that fit in M space, and “glue nodes” — instructions outside these subtasks
(fig. 4.3). This decomposition of a DAG is unique for a given M because of the tree-like and
monotonicity properties of nested-parallel DAGs described earlier. For a maximal size M task
t, the parallel cache complexity Q∗ (t; M, B) is defined to be the number of distinct cache lines
it accesses, counting accesses to a cache line from unordered instructions multiple times. The
model then pessimistically counts all memory instructions that fall outside of a maximal subtask
(i.e., glue nodes) as cache misses. The total cache complexity of a program is the sum of the
complexities of the maximal tasks, and the memory accesses outside of maximal tasks.
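The following C++ sketch mirrors this definition. The Task representation and its fields are hypothetical stand-ins invented for this illustration (a real analysis would derive them from the nested-parallel DAG); the recursion simply restates the rule above: a task that fits in M contributes its access count, and every access in glue nodes is charged as a miss.

    #include <cstddef>
    #include <vector>

    // Hypothetical task-tree representation of a nested-parallel computation;
    // these field names are invented for this illustration only.
    struct Task {
      std::size_t size_in_lines;    // distinct cache lines touched (the task's "size")
      std::size_t access_count;     // distinct lines touched, counting accesses to a line
                                    // from unordered instructions multiple times
      std::size_t glue_accesses;    // memory accesses on strands outside the child tasks
      std::vector<Task> children;   // the nested subtasks
    };

    // Q*(t; M, B) per the definition in the text: a task that fits in M lines is
    // maximal and contributes its access count; otherwise its glue accesses are
    // all charged as misses and its subtasks are decomposed recursively.
    std::size_t parallel_cache_complexity(const Task& t, std::size_t M_in_lines) {
      if (t.size_in_lines <= M_in_lines)
        return t.access_count;                   // maximal size-M task
      std::size_t q = t.glue_accesses;           // glue nodes: every access is a miss
      for (const Task& c : t.children)
        q += parallel_cache_complexity(c, M_in_lines);
      return q;
    }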
Although this may seem overly pessimistic, it turns out that for many well-designed parallel
algorithms, including quicksort, sample sort, matrix multiplication, matrix inversion, sparse-
matrix multiplication, and convex hulls, the asymptotic bounds are not affected, i.e., Q∗ = O(Q) (see Section 4.3.1). The higher baseline enables a provably efficient mapping to parallel
hierarchies for arbitrary nested-parallel computations.
The Q∗ metric captures only the locality of a nested-parallel DAG, but not its parallelism.
We need to extend the metric to capture the difficulty of load balancing the program on a ma-
chine model, say on a parallel hierarchy. Intuitively, programs in which the ratio of space to
parallelism is uniform and balanced across subtasks are easy to load balance. This is because on any given parallel memory hierarchy, the cache resources are linked to the processing
resources: each cache is shared by a fixed number of processors. Therefore any large imbalance
between space and processor requirements will require either processors to be under-utilized or
caches to be over-subscribed. Theorem 3 presents a lower bound that indicates that some form
of parallelism-space imbalance penalty is required.
The Q∗(n; M, B) metric was extended in our work [41] to a new cost metric for families of DAGs parameterized by the input size n—the effective cache complexity Q̂_α(n; M, B)—that
captures the extent of the imbalance (Section 4.3.2). The metric aims to estimate the degree
of parallelism that can be utilized by a symmetric hierarchy as a function of the size of the
computation. Intuitively, a computation of size S with “parallelism” α ≥ 0 should be able to use
p = O(S^α) processors effectively. This intuition works well for algorithms where parallelism is
polynomial in the size of the problem. Most notably, for regular recursive algorithms, the α here roughly corresponds to (log_b a) − c, where a is the number of parallel recursive subproblems, n/b is the size of subproblems for a size-n problem, and O(n^c · polylog n) is the depth (span) of an algorithm with input size n.
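As an illustrative instance of this correspondence (a worked example rather than a quoted result): a balanced divide-and-conquer algorithm that forks a = 2 subproblems of half the size (so b = 2) and has polylogarithmic depth (c = 0) gives

    α ≈ log_b a − c = log_2 2 − 0 = 1,

so a computation of size S should be able to use roughly p = O(S^α) = O(S) processors effectively.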
As in the case of parallel cache complexity, for many algorithmic problems it is possible to construct algorithms that are asymptotically optimal, i.e., well balanced according to this metric: the cost Q̂_α(n; M, B) for inputs of size n is asymptotically equal to the standard sequential cache cost Q∗(n; M, B). When the imbalance is higher, the cost metric would reflect this in the asymptotics—Q̂_α(n; M, B) = Ω(Q∗(n; M, B))—and prompt the algorithm designer to reconsider the algorithm.
in section 1.1.2.
where M_i is the size of each level-i cache, B is the uniform cache-line size, C_i is the cost of a level-i cache miss, and v_h is an overhead defined in Theorem 10. For any algorithm where Q̂_α(n; M_i, B) is asymptotically equal to the optimal sequential cache-oblivious cost Q∗(n; M, B) for the problem, and under conditions where v_h is constant, this is optimal across all levels of the cache. For example, a parallel sample sort (that uses imbalanced subtasks) has Q̂_α(n; M_i, B) = O((n/B) log_{M+2}(n/B)) (Theorem 25), which matches the optimal sequential cache complexity for sorting, implying optimality on parallel cache hierarchies using our scheduler.
• Space. (Section 5.3.3) In a dynamically allocated program, the space requirement depends
on the schedule. A bad parallel scheduler can require many times more space than the
space requirement of a sequential schedule. A natural baseline is the space S1 of the sequential depth-first schedule. We demonstrate a schedule that has good provable bounds on space: in a certain range of cost metrics and machine parameters, the extra space that the schedule needs over the sequential space requirement S1 is asymptotically smaller than S1.
Chapter 6 describes our contributions in terms of new algorithms. All the algorithms pre-
sented in the chapter are oblivious to machine parameters such as cache size, number of proces-
sors etc. The chapter presents various building block algorithms such as Scans (Prefix Sums),
Pack, Merge, and Sort from our work [40]. The algorithms are optimal in the effective cache complexity (Q̂_α) model, have poly-logarithmic depth, and are optimal in the sequential cache complexity model (Q1). Many of these algorithms are new. These optimal building blocks
are plugged into adaptations of existing algorithms to construct good graph algorithms (Section
6.5). We also designed new I/O-efficient combinatorial algorithms for the set-cover problem us-
ing these primitives [45] (Section 6.6). An algorithm for sparse matrix-vector multiplication is
also presented (Section 6.4). We used the algorithms presented in this thesis as sub-routines for
designing many baseline algorithms [149] for the problem-based benchmark suite (PBBS) [34].
These algorithms also perform extremely well in practice, in addition to being simple to
program (< 1000 lines in most cases). Experiments on a 40-core shared memory machine with
limited bandwidth are reported in Section 6.7. The algorithms demonstrate good scalability
because they have been designed to minimize data transfers and thus use bandwidth sparingly.
In many cases, the algorithms run at the maximum speed that the L3 – DRAM bandwidth permits.
1.4 Thesis statement
A program-centric approach to quantifying locality and parallelism in shared memory parallel
programs is feasible and useful.
Chapter 2
This chapter surveys models of computation — models based on circuits, machines, and programs — that quantify locality and/or parallelism. Approaches to developing complexity results
(lower bounds) and constructive examples (algorithms) will be highlighted.
Chronologically, the first machine-centric model to quantify locality in parallel machines is
the VLSI computation model pioneered by Thompson [160] which attracted a great deal of at-
tention from theorists [143] (Section 2.1). Although this line of work could arguably be linked
to earlier studies such as the comparison of the relative speeds of one-tape vs multi-tape Turing
machines [92, 93], VLSI complexity studies were the first to deal with practical parallel computa-
tion models – logic gates and wires. In the spirit of the Turing hypothesis, Hong [94] establishes
polynomial relations between sequential time (work), parallel time (depth) and space of several
variants of Multi-Tape Turing Machines, Vector Models and VLSI computation models.
In this model, computations are represented as circuits embedded in a few planar layers.
Greater locality in the computation allows for shorter wire lengths and a more compact circuit
layout. Greater parallelism leads to fewer time steps. The complexity of algorithmic problems is
explored in terms of the trade-off between area and time for VLSI solutions to the problem. Such results are usually expressed in the form AT^α = Ω(f(n)), where n is the problem size and α is typically set to 2 for 2-D VLSI following Thompson’s initial work [160]. A scalable solution to the problem would present a circuit with a small circuit area (greater locality) and a small number of time steps (greater parallelism), making the product AT^2 as close to the lower bound f(n) as possible.
Although the model provides valuable insights into the complexity of the problem and relates
graph partitioning to locality, it can be used only at a very low level of abstraction – fixed logic
gates and wires. Since programs today are written at a much higher level than circuits, a more
program-friendly cost model such as the PCC model is required. Quantitatively mapping VLSI
complexity bounds to PCC bounds and vice versa remains an interesting problem for future work.
The next step in the evolution of parallel machines was the fixed-connection network (Section 2.2). These machines are modeled as a collection of nodes connected by a network with a
specific topology. Each node includes a general purpose processor, a memory bank and a router.
Programming in this model would involve specifying a schedule and routing policy in addition
to the instructions. Performance of an algorithm is typically measured in terms of the number
of time steps required to solve a problem of a specific size. An algorithm with a structure better
suited for the machine would take less time to complete.
A fundamental problem with fixed-connection networks is the difficulty of porting algorithms
and their analyses between models. In other words, locality in these machines is topology-
specific. The relative strengths of networks, in terms of how well one network can emulate
another network and algorithms written for it, have been studied in detail. These embeddings
allow for portability of some analyses between topologies to within polylog-factor approxima-
tions. However, programming these machines is still inconvenient and porting the programs is
not easy. Lower bounds from the VLSI complexity model can be directly mapped to finite-
dimensional embeddings of fixed-connection networks.
To simplify programming and analysis on fixed-connection networks, cost models that ab-
stract away the details of the network topology in favor of costs for bandwidth, message latency,
synchronization costs, etc. have been proposed (Section 2.3). Among the early models are the
Bulk-Synchronous Parallel and the LogP model. Subsequently, models for distributed mem-
ory machines with multi-level hierarchies such as the Network-Oblivious model [32] have been
proposed. These machine-centric models have the advantage of being more general in their ap-
plicability. A simpler but slightly more program-centric model inspired by the models above – the Communication-Avoiding model – is widely used for developing scalable algorithms for distributed machines.
Pebbling games (Section 2.4) were suggested by Hong and Kung [102] as a tool to study I/O
requirements in algorithms expressed as dependence graphs, and are very useful for establishing
I/O lower bounds for algorithms. Pebbling games highlight the direct relation between partitions
of a DAG and I/O complexity, a theme that underlies most of the cost models for locality. A
major drawback of this model is that it does not quantify parallelism or the cost of load balancing
a computation.
Nevertheless, pebbling games are very versatile in their applications. In addition to modeling I/O complexity, they have been used for studying register allocation problems and space-time tradeoffs in computations. Pebbling games have been extended to multi-level hierarchies. Sav-
age’s textbook [147] surveys pebbling games in great detail in addition to many of the models
described up to this point.
Pebbling games are trivial for sequential programs whose DAGs are just chains. This sim-
plified game directly resembles a one-processor machine with a slow memory (RAM) and a
fast memory (cache). The first simplified model to reflect this was the External Memory Model
which was made more program-centric by the Cache-Oblivious model. Both models have attracted a great deal of algorithms research because of their simplicity. Section 2.5 highlights the
similarities and differences between the two. Another model for measuring locality in sequen-
tial programs is the Hierarchical Memory Model which models memory as one continuously
indexable array but with a monotonically increasing cost for accessing higher indices.
Parallel extensions to these sequential models that are well suited for shared memory ma-
chines have been studied. Section 2.6 presents an overview of some of these, including one-level
and multi-level cache models. Among the one-level cache models are the Parallel External-
Memory Model and the Shared Cache Model. The Multi-level models include the Multi-BSP
model and the Parallel Memory Hierarchy model. The PMH model is referred to throughout the dissertation.
The details of an adaptation of Parallel Memory Hierarchy model of Alpern, Carter and Fer-
rante [13] are described in Section 2.7. The PMH model is a realistic model that captures locality
and parallelism in modern shared memory machines. While many architectures directly resem-
ble the PMH model (Figure 1.5), others can be approximated well (Figure 1.6) with PMH. The
PCC framework presented in this dissertation, while being entirely program-centric, is inspired
by the PMH machine model. Section 5.3 demonstrates a scheduler that maps costs in the PCC
model to a PMH with good provable guarantees.
Finally, in Section 2.8, we survey reuse distance analysis which is extremely useful for pro-
filing the locality of sequential programs.
Lower Bounds. The central question in VLSI complexity is the tradeoff between A and T .
Most results assume that the design is when and where oblivious, i.e. the location and the timing
of the inputs and outputs are oblivious to the values of the data. The complexity of a problem
is presented as a lower bound on the product AT^α for some α in terms of the size of the input instance N. Results for two-dimensional circuits are often presented with α set to 2. For example, for comparison-based sorting, the first lower bound presented was AT^2 = Ω(N^2) [167], improved to Ω(N^2 log N) by Thompson and Angluin, and finally to Ω(N^2 log^2 N) by Leighton [119]. The complexity of multiplying two binary numbers of length N was demonstrated to be AT^2 = Ω(N^2) by Abelson and Andreae [1], which was generalized by Brent and Kung [53] to AT^{2α} = Ω(N^{1+α}). Thompson [157] established the complexity of Discrete Fourier Transforms to be AT^α = Ω(N^{1+α/2}). Savage [145] established the bounds for matrix multiplication: AT^2 = Ω((mp)^2) for multiplying matrices of dimensions m × n and n × p when (a − n)(b − n) < n^2/2, where a = max{n, p}, b = max{n, m}. Savage’s results can be easily extended to matrix inversion and transitive closure.
Lower Bound Techniques. Most of the proofs for lower bounds follow a two-step template.
• Establish an information transfer lower bound for the problem.
• Demonstrate that any embedded circuit transferring the established amount of information requires a certain area and time.
The first step involves finding a physical partition of the input and output wires on the VLSI
and establishing that any circuit for the problem needs to transfer a minimum amount of “in-
formation” across the partition. For instance, Leighton establishes that for comparison-based sorting, any partition of the chip that separates the output wires in half requires Ω(N log N) bits
of information across it (Theorem 4 of [119]).
The next step establishes the relation between the bisection bandwidth of the circuit graph
and the area required to embed it. Theorem 1 of Thompson [157] demonstrates that any circuit
which requires the removal of at least ω edges to bisect a subset of vertices (say the output nodes)
requires Ω(ω^2) area. Since it is not possible to transfer more than ω bits of information per cycle across the minimum cut, at least I/ω steps are needed to transfer I amount of information. Put together, these two imply AT^2 = Ω(I^2). Plugging in bounds for I in this inequality results in
lower bounds for the complexity of the problem.
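Putting the two steps together for comparison-based sorting (a worked instance of this template, using Leighton’s information bound quoted above): any bisection separating the output wires must carry I = Ω(N log N) bits, so

    AT^2 = Ω(I^2) = Ω(N^2 log^2 N),

which is the bound of Leighton [119] cited earlier.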
Approaches to establishing lower bounds on “information transfer” in the first step vary.
Most approaches are encapsulated by the crossing sequence argument of Lipton and Sedgewick
[124] and the information-theoretic approach of Yao [172]. In the crossing sequence argument
of [124], the circuit is bisected (in terms of area, input, output or any convenient measure) and
the number of different sequences that cross the wires and gates cut by the bisection during the
computation is determined. Lower bounds are established by arguing that a minimum number of
wires and time is required to support the required number of distinct crossing sequences for the
input space. In [172], the left and right halves of the chip are treated as two players with inputs l
and r respectively, playing a game (as in [173]) to compute a function f(l, r) by exchanging the
minimum amount of information, which is used to establish the bounds. These techniques are
compared in Aho, Ullman and Yannakakis [9] with suggestions for techniques to obtain stronger bounds.
A peculiar property of VLSI complexity is its non-composability [172]. While area and time-
efficient circuits may be constructed to evaluate two composable functions f and g individually,
lining up the inputs of one to the output of the other may not be easy. In fact, functions F and G
exist such that the AT^2 complexity of the composition F ◦ G is asymptotically greater than that of F and G.
Upper Bounds. Good circuit designs for most problems mentioned above exist. Thompson’s thesis presents circuits for Fourier Transforms and Sorting. Several designs with various tradeoffs are presented for FFT in [159], and for sorting in [158]. Leighton presents sorting circuits that match the lower bounds in [119]. Brent and Kung [53] present binary multiplication circuits with complexity AT^{2α} = O(n^{1+α} log^{1+2α} n). Kung and Leiserson present an optimal design [111] for matrix multiplication. [110] surveys and categorizes many of these algorithms.
To place some of these results in perspective, let us consider the lower bounds for matrix
multiplication and contrast two circuit designs for the problem with different degrees of locality.
For multiplying two n × n matrices A and B using the straightforward O(n^3) matrix multiplication, a circuit must have complexity AT^2 = Ω(n^4). Among the two circuit designs we consider,
one exposes the full parallelism of matrix multiplication (T = O(log n)) for poor locality while
the other trades off some parallelism (T = O(n)) for greater locality.
To compute the product in T = O(log n) steps would require a circuit that is a three-
dimensional mesh of trees, each node of the mesh performing a multiplication operation and
each node in the tree performing an addition operation. The input matrices would be plugged in
at two non-parallel faces, say A at the x = 0 face and B at the y = 0 face on a mesh aligned
along the 3D-axis at origin. In the first phase, the trees along the mesh lines perpendicular to
x = 0 face copy the element in A input on one of their leaf nodes to all of their leaf nodes.
Therefore, A_{jk}, initially placed at (0, j, k), would be copied to each of the grid nodes (i, j, k) for i ∈ {0, 1, . . . , n − 1} in the first phase in O(log n) steps. B is similarly copied along the
y-axis. After this phase, the elements at each grid element are point-wise multiplied and reduced
using the trees along the z-axis in another O(log n) steps. This circuit design has a bisection bandwidth of Ω(n^2) and thus needs n^4 total wire length when stretched out onto a few planar layers (following Thompson’s theorem that relates the bisection bandwidth of a graph and the area of its planar embeddings [157]). The AT^2 cost of this design is O(n^4 log^2 n), which is close to the lower bound. This design realizes the full extent of the parallelism while paying a heavy price in
terms of locality – wire lengths are O(n) on average.
Another circuit design for the problem uses a two-dimensional hexagonal array design with
n nodes along each edge that solves the problem in O(n) time steps [111] (see Figures 2 and 3
of [110] for an illustration). It may be possible to roughly view this design as a two-dimensional projection of the 3D mesh onto a plane perpendicular to its long diagonal, with each time step of the hexagonal array design corresponding to a slice of the 3D mesh parallel to the projected plane. The hexagonal array design is already two-dimensional and is embeddable with only constant-length wires, and thus has A = O(n^2). By trading off some parallelism, this design achieves
a significant improvement in locality for multiplying matrices. This solution also achieves the
optimal value for AT^2.
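To check this claim against the lower bound quoted above: with A = O(n^2) and T = O(n),

    AT^2 = O(n^2) · O(n)^2 = O(n^4),

matching the Ω(n^4) bound for the standard O(n^3) algorithm.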
Finally, consider the comparison sorting problem, which requires AT^2 = Ω(n^2). Any poly-logarithmic time solution to the problem needs a circuit with total wire length O(n^2/polylog(n)) for O(n/polylog(n)) gates, which corresponds to an average wire length of O(n/polylog(n)). This implies that solutions to this problem need relatively large bisection bandwidth. Circuits for this problem, like the O(log^2 n) time shuffle-exchange networks [25] or the O(log n) time AKS sorting network [10], do indeed have good expansion properties and large bisection bandwidths.
2.2 Fixed-Connection Networks
Parallel machines built for high-performance computing span several VLSI chips. They are
composed of multiple nodes, each node containing a general purpose processor, memory, and
a packet router. Once fabricated, the topology of the network between the nodes is not easily
changed. The cost of an algorithm on these machines is measured by the number of steps and
the number of processing elements required. The maximum number of processors that can be
used for a problem of size N while being work-efficient can be used as an indicator of the paral-
lelizability of the algorithm on the machine. The average number of hops that messages of the
algorithm travel over the network can be used as an indicator of the locality of the algorithm with
respect to the machine.
A good network should be capable of supporting efficient and fast algorithms for a variety
of problems. Some criteria for a good network are large bisection bandwidth, low diameter,
and good expansion. Alternatively, the number of steps required for a permutation operation –
sending one packet from every processor to another processor chosen from a random permutation
– is a good measure of the quality of the network.
Commonly used network topologies include meshes and the closely related tori (e.g. Cray
XC, IBM Bluegene), mesh of trees, hypercubes (e.g. CM-1, CM-2) and the closely related hyper-
cubic networks – Butterfly and Shuffle-Exchange networks, and fat-trees (e.g. CM-5). Leighton
[116] presents a survey of these topologies and algorithms for them. The relative merits of the
networks have been extensively studied. For example, hypercube and hypercubic networks are capable of emulating many topologies such as meshes or a mesh of trees very effectively. But, on the downside, hypercubes have larger node degrees than other networks and are difficult to embed in two or three dimensions. Among networks that can be embedded in a fixed amount of volume
in a constant number of dimensions (e.g., 2 or 3), it has been demonstrated that fat-trees are capable
of emulating any other network with only a polylogarithmic slowdown [121]. Therefore, opti-
mal algorithms for fat-trees achieve the best performance to within a polylogarithmic factor for
circuits of a given volume.
Valiant and Brebner [165] were the first to formalize this question. They presented a random-
ized O(h + log N)-step routing scheme for an h-relation on N-node hypercubes with O(log N)-
size buffers at each node. Building on this, routing schemes for shuffle-exchange and butterfly
networks were constructed [11, 162]. Pippenger [138] improved the result for Butterfly networks
by using constant size buffers.
These schemes enable efficient emulation of shared memories on top of distributed memory
machines by presenting a uniform address space with low asymptotic access costs for every
location. Karlin and Upfal [105] presented the first O(log N)-step shared memory emulation on butterfly networks with O(log N)-size priority queue buffers that supports the retrieval and storage of N data items. Further improvements culminated in Ranade’s O(log N)-step emulation with O(1)-size FIFO queues [142].
All of the results above deal with routing on specific networks. Leighton, Maggs and Rao
[118] addressed the problem of routing on any network. Central to the solution of this problem
is a routing schedule that schedules transfer of packets between source-destination pairs along
specified paths using a small buffer at each node. The general routing problem can be reduced to
finding a routing schedule along specific paths using ideas of random intermediate nodes from
Valiant [165]. For an instance of the routing problem where any pair of source-destination nodes
are separated by d hops (dilation) and any edge is used by at most c paths (congestion), Leighton,
Maggs and Rao [118] proved the existence of an optimal online routing strategy that takes at most
O(c + d) steps with high probability of success. This result was used to design efficient routing schemes for many fixed-connection networks in [117].
where w_p and h_p are the number of instructions executed and the number of messages sent or received at processor p, respectively.
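For reference, in Valiant’s standard formulation (restated here from the literature rather than quoted from this chapter), the cost of a single superstep is

    max_p w_p + g · max_p h_p + L,

where g is the per-message bandwidth cost (the gap) and L is the cost of the barrier synchronization; the cost of a program is the sum of its superstep costs.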
The aims of the LogP model are similar, although the model is asynchronous. Instead of
waiting for phase boundaries for communication as in the BSP model, processors in the LogP
model can request remote data at any time and use them as soon as they are made available.
The cost parameters in the LogP model are (i) an upper bound on the latency of a message: L,
(ii) overhead of sending or receiving a message: o (during which the processor is inactive), (iii) the
gap between two consecutive message transmissions: g (inverse of the bandwidth) and (iv) the
number of processors: P . The cost of a program is the maximum of the sum of costs of all
instructions and messages at any node. The LogGP model [12] is a refinement of the LogP model that distinguishes between the effective bandwidth available to small messages (parameter g) and large messages (parameter G).
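As a small worked example of this accounting (standard LogP reasoning, not a derivation from the thesis): a single point-to-point message occupies the sender for o cycles, is in flight for at most L cycles, and occupies the receiver for another o cycles, so it is delivered after at most

    o + L + o = L + 2o

cycles, while each processor can inject at most one new message every g cycles.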
Proofs of lower bounds on communication costs in this model relate closely to the correspondence between graph partitioning and communication complexity explored in pebbling games [102]. For example, Ballard et al. [24] demonstrate a lower bound on the number of
Games [102]. For example, Ballard et al. [24] demonstrate a lower bound on the number of
edges required to cut any Strassen-style computation DAG into chunks that use at most M words
(M being the size of local memory on each node), and relate the cut size to minimum commu-
nication cost. Irony et al. [99] use similar techniques to establish lower bounds for the standard
matrix multiplication with various degrees of replication. Ballard’s thesis [23] presents a more comprehensive survey of lower bounds for communication-avoiding algorithms.
place (routing), assigning Map and Reduce operations to processors (scheduling), load balancing
and fault tolerance.
The number of rounds of computation and the cost of each round determine the complexity of
a MapReduce program just as in the BSP model. Given the relatively low bisection bandwidths of
Ethernet-based clusters that commonly run the MapReduce framework, emphasis is often placed
on minimizing the number of computation rounds at the expense of greater data replication and
total number of instructions. Karloff, Suri and Vassilvitskii [106] present a simple program-
centric cost model to guide algorithm design with these considerations. In their model, each Map operation on a key-value pair is allowed to take sublinear (O(n^{1−ε})) time and space, where n is the length of the input to the round. Similarly, each Reduce operation on a collection of key-value pairs with the same key is allowed a similar time and space budget. The output of each round must fit within O(n^{2−2ε}) space. The MapReduce class consists of those problems that admit an
algorithm with a polylogarithmic number of rounds (as a function of the input size). Algorithms
made of Map and Reduce operations with sublinear time and space can be mapped to clusters
made of machines with small memories.
This particular model for MapReduce ignores the variation in the cost of shuffling data across
the interconnect in each round and differentiates between algorithms only in terms of their round
complexity. More quantitative studies relate communication costs to the replication rate – the
number of key-value pairs that an input pair produces, and parallelism to the reducer size – the
number of inputs that map to a reducer. Since communication costs increase with replication
rates, and parallelism decreases with increase in reducer size, a good MapReduce algorithm
would minimize both replication rates and reducer sizes. Trade-offs between the two for several
problems are presented by Afrati et al. [5]. As in the case of other cost models, a notion of
minimal algorithms that achieve optimal total communication, space, instructions, and parallel
time (rounds of load-balanced computation) is formalized by Tao et al. [155]. They present
algorithms for sorting and sliding aggregates. The sorting algorithm closely resembles sample
sort which has been shown to be optimal in many other models of locality [6, 40].
26
placed on x. In other words, if all the inputs required for x are in fast memory, node x can
be computed.
Store: A red pebble on a node can be replaced with a blue pebble. This corresponds to evicting
the data at the node from the fast memory and storing it in the slow memory.
A valid pebbling games corresponds to a valid execution of the DAG. The number of transi-
tions between red and blue pebbles correponds to the number of transfers between fast and slow
memories. The pebbling game that makes the least number of red-blue transitions with M red
pebbles gives the I/O complexity of the DAG with a fast memory of size M .
Lower Bounds. Lower bounds on the I/O complexity are established using the following tight
correspondence between red-blue pebbling games and M -paritions of the DAG. Define a M -
partition of the DAG to be a partition of the nodes in to disjoint partitions such that (i) the in-flow
into each partition is ≤ M , (ii) the out-flow out of each partition is ≤ M , and (iii) the contracted
graph obtained by replacing the partitions with a supernode is acyclic. Then any pebbling game
with M red pebbles and Q transfers can be related to a 2M -partition of the DAG into h partitions
with M h ≥ Q ≥ M (h − 1). Using this technique, [102] establishes the I/O complexity of the
FFT graph on n elements to be Ω(n logM n). [102] also establishes other techniques to prove
lower bounds based on number of vertex-disjoint paths between pairs of nodes. Note that lower
bounds established in this model correspond to the algorithms and not the problem.
Parallelism. A major drawback of pebbling games is that they do not model the cost of load
balance or parallelism. The description does not specify whether the game is played sequentially
or in parallel. Although the usual description of the game corresponds to a sequential execution,
natural extensions of the games can model parallel executions. Sequential pebbling games corre-
spond to I/O complexity in the sequential models in Section 2.5. Parallel executions of the DAG
on a shared cache of size M (Section 2.6.1) correspond to simultaneous pebbling across multiple
nodes in the DAG with a common pool of M red pebbles. Parallel executions of the DAG on the
PEM model (Section 2.6.1) with p processors each with a M -size private cache correspond to a
pebbling games with p different subtypes of red pebbles numbering M each. The compute rule
of the pebbling game would be modified to allow a node to pebbled red with subtype j only if
all its immediate predecessors have been pebbled red with the same subtype j.
Multi-level Pebbling Games An an extension of pebbling game to hierarchies called the Mem-
ory Hiererchy Game (MHG) has been suggested [146]. In a MHG, there are L different pebble
types corresponding to each level in a L-level memory hierarchy. The number of pebbles of type
j < L are limited and increase with j, their posistion in hierarchy. Computation can only be
performed on a node with predecessors pebbled with type-1 pebble corresponding to the fastest
level in the hierarchy. Transitions are allowed between adjacent types, and correspond to trans-
fers between adjacent levels in the hierarchy. I/O complexity at level i relates to the number
of transitions between level-(i − 1) and level-i pebbles. Complexity in this pebbling game re-
lates quite closely to cache complexity in the Cache-Oblivious framework applied to multi-level
hierarchy (Section 2.5.2).
27
2.5 Models for Sequential Algorithms
This section surveys cost models for locality of sequential algorithms. We start with the one-
level External Memory Model of Aggarwal and Vitter [6], and move to the program-centric
Cache-Oblivious framework and the ideal cache model [86] that can be used for capturing costs
on multi-level hierarchies. We finally describe explicitly multi-level cost models such as the
Hierarchical Memory Model [7].
28
The ideal cache assumptions make for an easy and portable algorithm analysis. Many of the
External-Memory algorithms can be adapted into this model, while some problems need newer
design. Demaine [72] presents a survey of some cache-oblivious algorithms and data-structures
designed for the ideal cache model and their analyses.
Multi-level Hierarchies. Since algorithms are oblivious to cache sizes, the analysis for the
algorithms is valid for multi-level cache hierarchies as well. On a h-level hierarchy with caches
of size Mi , i = 1, 2, . . . , h, the algorithm incurs Q(Mi , B) cache misses at level i.
Limitations. Some problems seem to admit better External-Memory algorithms than Oblivious
algorithms by virtue of the cache size awareness. For example, optimality of cache complexity
for many algorithms in the CO model depends on the tall-cache assumption – M ≤ B ε for
some ε > 1, usually 2. Brodal and Fagerberg [58] show that this assumption is in fact required
for optimal cache-oblivious algorithms to some problems such as sorting. Further, they prove
that even with the tall-cache assumption, permutations can not be generated optimally in the
Cache-Oblivious model.
Use for Parallel Algorithms Although the ideal cache model (and cache-oblivious framework)
has been described for sequential algorithms, it can be used to study cache complexities of par-
allel algorithms by imposing a sequential order on parallel programs. In particular, sequential
cache complexity in the ideal cache model under the depth-first order Q1 is a useful quantifier of
locality in nested-parallel programs. See section 5.2 for more details.
29
programmers as they are easier to program than distributed systems and GPUs. The semantics of
concurrent accesses to the shared address space [3] are specified by memory consistency policy.
30
with good bounds. Scheduling even slightly imbalanced computations is not considered in their
model.
Memory: Mh = ∞, Bh
Cost: Ch−1 fh
h
Mh−1 , Bh−1 Mh−1 , Bh−1 Mh−1 , Bh−1 Mh−1 , Bh−1
Cost: Ch−2
M1 , B 1 M1 , B 1 M1 , B 1 M1 , B 1 M1 , B 1
P P f P P P f P P P f P P P f P P P f P
1 1 1 1 1
fh fh−1 . . . f1
This section presents the details of an adaptation of the PMH model from [13] that we refer
to in the next few chapters.
31
2.7.1 Parameters
A PMH consists of a height-h tree of memory units, called caches which store copies of elements
from the main memory. The leaves of the tree are at level-0 and any internal node has level
one greater than its children. The leaves (level-0 nodes) are processors, and the level-h root
corresponds to an infinitely large main memory.
Each level in the tree is parameterized by four parameters: M i , B i , Ci , and fi .
• The capacity of each level-i cache is denoted by M i .
• Memory transfers between a cache and its child occur at the granularity of cache lines.
We use B i ≥ 1 to denote the line size of a level-i cache, or the size of contiguous data
transferred from a level-(i + 1) cache to its level-i child.
• If a processor accesses data that is not resident in its level-1 cache, a level-1 cache miss
occurs. More generally, a level-(i + 1) cache miss occurs whenever a level-i cache miss
occurs and the requested line is not resident in the parent level-(i + 1) cache; once the
data becomes resident in the level-(i + 1) cache, a level-i cache request may be serviced by
loading the size-Bi+1 line into the level-i cache. The cost of a level-i cache miss is denoted
by Ci ≥ 1, where this cost represents the amount of time to load the corresponding line
into the level-i cache under full load. Thus, Ci models both the latency and the bandwidth
constraints of the system (whichever is worse under full load). The cost ofPan access at a
processor that misses at all levels up to and including level-j is thus Cj0 = ji=0 Ci .
• We use fi ≥ 1 to denote the number of level-(i − 1) caches below a single level-i cache,
also called the fanout.
The miss cost Ch and line size B h are not defined for the root of the tree as there is no level-
(h + 1) cache. The leaves (processors) have no capacity (M 0 = 0), and they have B 0 = C0 = 1.
Also, Bi ≥ Bi−1 for 0 < i < h. Finally, we call the entire subtree rooted at a level-i Q cache a
level-i cluster, and we call its child level-(i − 1) clusters subclusters. We use pi = ij=1 fj to
denote the total number of processors in a level-i cluster.
2.7.2 Caches
2.7.2.1 Associativity
We assume that all the caches are fully associative just as in the ideal cache model [86]. In
fully associative caches, a memory location from memory can be mapped to any cache line. In
practice, hardware caches have limited associativity. In a 24-way associative cache such as in
Intel Xeon X7560, each memory location can be mapped to one of only 24 locations. However,
a fully associativity makes the machine model much simpler to work with without giving up on
accuracy. Pathological access patterns may be dealt with using hashing techniques if needed.
2.7.2.2 Inclusivity
We do not assume inclusive caches, meaning that a memory location may be stored in a low-level
cache without being stored at all ancestor caches. We can extend the model to support inclusive
caches, but then we must assume larger cache sizes to accommodate the inclusion.
32
P P … P P P … P
Z1 Z1 Z1 Z1
Z2 Z2 Z2 Z2
Zk Zk Zk Zk
Figure 2.2: Multi-level shared and private cache machine models from [40].
We assume that the number of lines in any non-leaf cache is greater than the sums of the
number of lines in all its immediate children, i.e. , M i /Bi ≥ fi M i−1 /Bi−1 for 1 < i ≤ h, and
M1 /B1 ≥ f1 . For all of our bounds to hold for inclusive caches, we would need M i ≥ 2fi M i−1 .
We assume that the cache has the online LRU (Least Recently Used) replacement policy. If a
cache is used by only one processor, this replacement is competitive with the optimal offline
strategy — furthest-in-the-future (FITF) — for any sequence of accesses [150]. However, the
scenario is completely different when multiple processors share a cache [91]. Finding the optimal
offline replacement cost where page misses have different costs (as in a multi-level hierarchy)
is NP-complete. Moreover, LRU is not a competitive policy when a cache is shared among
processors. Online replacement policies for shared caches that tend to work in practice [100, 101]
and those with theoretical guarantees [107] have been studied.
Despite the lack of strong theoretical guarantees for LRU in general, it suffices for the results
presented in Chapter 5.
33
Locality in parallel machines with topologies embeddable in finite and small number of di-
mensions may also be approximated with tree of caches. Loosely speaking, any arbitrary topol-
ogy where the number of processors and the storage capacity of a subset of the machine de-
pends only on the volume of the subset can be recursively bisected along each dimension, and
each level of recursion mapped to a level in the hierarchy. For example, Figure 1.6 depicts a
two-dimensional grid of size 4 × 4 and how it might be approximated with a tree of caches by
recursive partitioning.
34
In order to model spatial locality, the analysis must replace the distance metric between ref-
erences by the number of memory (or cache) blocks of a certain size B instead of the number
of locations [89]. However, requiring the block size parameter of a machine would make the
analysis less program-centric. An alternative is to run an analysis for different values of B, say
for powers of two, and store the histogram for each value of B. An approximate mapping from
program analysis to performance on a machine can be done by selecting the histogram corre-
sponding to a value of B that is most relevant to the machine. Reuse distance analysis has been
used for automatically generating better loop fusions and loop tilings [77, 139].
Since reuse distance is analyzed with reference to a specific input, varying the input may not
preserve the analysis. However, it may be possible to use training instances to predict data reuse
pattern across the range of inputs for a family of programs [78].
Reuse distance requires non-trivial program profiling as it involves keeping track of informa-
tion about the access time of every memory location. Low overhead algorithms and optimizations
for efficient, yet robust, profiling of reuse distance at varying scales (loops, routines, program)
has been studied in [78, 127, 175].
Other related metrics that are either equivalent or directly related to reuse distance such as
miss-rate, inter-miss time, footprint, volume fill time etc. have been considered for their ease of
collection or practicality. Quantitative relations between these measures were first considered in
the Working Set Theory studies of Denning [76], and explored further towards building a “higher
order theory” of locality [171].
A limitation of reuse distance analysis is that it does not seem to capture the parallelism of the
program. Further, it is not inherently defined for parallel programs. However, it may be extended
for estimating the communication costs of a parallel program by augmenting the program with
extra information – imposing a sequential order such as in many shared cache models, or some
notion of distance between references in parallel programs.
35
36
Chapter 3
Program Model
37
f
Task
Strand
f’
Parallel Block
Figure 3.1: Decomposing the computation: tasks, strands and parallel blocks. The circles
represent the fork and join points. f is a fork, and f 0 its corresponding join.
38
T1
T2
T3
Task
Node
Live node
T4
Completed node
Figure 3.2: Tree-like property of live nodes: A set of live nodes in a nested-parallel DAG cor-
respond to tasks that can be organized into a tree based on their nesting relation. In this case,
there are 3 lives nodes corresponding to four tasks that can be naturally organized into the tree
(T1 (T2 (T3 ) ()) (T4 )).
• Tree-like. Any feasible set of “live” nodes that are simultaneously executing (which forms
an anti-chain of nodes in the DAG) will always be drawn from a set of tasks that can be
organized into a tree based on their nesting relation (see figure 3.2.1).
• Monotonicity. The space footprint of task increases monotonically with levels of nesting:
a task t2 nested within a task t1 can never access more distinct memory locations than t1 .
Consider, for example, the simple nested-parallel version of Quicksort in Figure 3.3 pre-
sented in the NESL language [33, 148]. At each level of recursion, array A is split into three
smaller arrays lesser, equal and greater which are recursively sorted, and put together
before returning. Note that:
39
function Quicksort(A) =
if (#A < 2) then A
else
let pivot = A[#A/2];
lesser = {e in A| e < pivot};
equal = {e in A| e == pivot};
greater = {e in A| e > pivot};
result = {quicksort(v): v in [lesser,greater]};
in result[0] ++ equal ++ result[1];
Since A is not referenced after it is split, the memory locations assigned to A can be recycled
after the splits are computed. In fact, for an input of size n it is possible to have two memory
chunks of size n and use them to support variables A and lesser, equal, greater across all
levels of recursion. At each level of recursion the input variable would be mapped to one chunk,
and split into the other chunk. At the recursion level beneath, the mapping of input and output
variables to the memory chunks is reversed.
This scheme, in addition to saving space, provides scope for locality. In a recursive function
call f with input size that is smaller than a cache size by a certain constant, the locations corre-
sponding to the variables in f can be mapped to the cache and loaded only once into the cache.
All the variables in lower levels of recursion can be mapped to the same set of locations that very
mapped onto the cache. Each function call nested with in f can reuse the data in cache without
incurring additional cache misses.
40
function MM(A,B,C) =
if (#A <= k) then return MM-Iterative(A,B)
let
(A_11,A_12,A_21,A_22) = split(A);
(B_11,B_12,B_21,B_22) = split(B);
C_11 = MM(A_11},B_11) || C_21 = MM(A_21,B_11) ||
C_12 = MM(A_11,B_12) || C_22 = MM(A_21,B_12) ;
C_11 += MM(A_12,B_21) || C_21 += MM(A_22,B_21) ||
C_12 += MM(A_12,B_22) || C_22 += MM(A_22,B_22) ;
in join(C_{11},C_{12},C_{21},C_{22})
Figure 3.4: 4-way parallel matrix multiplication. Commands separated by “||” are executable
in parallel, while “;” represents sequential composition.
the full parallelism of Quicksort, as opposed to the O(n log n) locations that a naive allocation
would need. Section 4.3.1 deals with the locality of this program.
The static scheme avoids the runtime overhead of managing memory. However, it requires
finding a good mapping manually at compile time. This is quite often difficult for dynamic DAGs
and may force use of suboptimal mappings. However, it is possible to find a good mapping
for many of the nested-parallel algorithms described in this dissertation, the statically allocated
Quicksort presented in the Appendix being an example.
Another significant drawback of this approach is that it may not be possible to express the full
parallelism of a program without using too much space or losing locality. Consider the matrix
multiplication in Figure 3.4 for example. The eight recursive calls to matrix multiplication can
be invoked in parallel. However, a static allocation would need O(n3 ) space for multiplying two
n×n matrices, as there are O(n3 ) concurrent leaves in this recursion each of which would require
a distinct memory location. To avoid this, the program in Figure 3.4 restricts the parallelism of
the program. Four recursive calls are invoked in parallel followed by the other four. This allows
the static mapping to reuse memory locations better and requires only O(n2 ) memory locations.
To simplify presentation, the bounds on communication costs and running times of schedulers
we present in Section 5.3.1 and Section 5.3.2 are for statically allocated programs.
41
QuickSort (A)
alloc(AL,AR,temp)!
Filter
free(temp)!
QS(AL) QS(AR)
Filter Filter
alloc! alloc!
(ALL,ALLR..)! (ARL,ALRR..)!
free(AL)! free(AR)!
ALL ALR ARL ARR
… … … …
42
+n +1 +1 +1
Figure 3.6: Replacing a node with +n allocation with a parallel block with n parallel tasks each
with +1 allocation.
43
44
Chapter 4
45
defined in Section 4.2) can be relevant on simple machine models with one level of cache. Work,
depth and cs alone can predict performance on these simple machine models as we will see in
Section 5.2. However, it may not be easy to generalize such results to parallel machines with
a hierarchy of caches because of the schedule-specific nature of Q1 metric. In Section 4.3, we
will present a parallel cache complexity cost model for locality which is schedule-agnostic, and
realizable by many practical schedules. Further, we will extend the parallel cache complexity
metric to a metric called the effective cache complexity (Section 4.3.2) which brings together
both locality and parallelism under one metric. The metric leads to a new definition for quanti-
fying the parallelizability of an algorithm (Section 4.3.3) that is more practical and general than
depth (span) or the ratio of work to depth. The parallel and effective cache complexity metrics
together form our parallel cache complexity (PCC) framework.
46
Task t forks subtasks t1 and t2 ,
with κ = {l1 , l2 , l3 }
t1 accesses l1 , l4 , l5 incurring 2 misses
t2 accesses l2 , l4 , l6 incurring 2 misses
At the join point: κ0 = {l1 , l2 , l3 , l4 , l5 , l6 }
Figure 4.1: Example applying the PCC cost model (Definition 2) to a parallel block. Here,
Q∗ (t; M, B; κ) = 4.
Analyzing using a sequential ordering of the subtasks in a parallel block (as in most prior work1 )
is problematic for mapping to even a single shared cache, as the following theorem demonstrates
for the CO model:
Theorem 1 Consider a PMH comprised of a single cache shared by p > 1 processors, with
cache-line size B, cache size M ≥ pB, and a memory (i.e., h = 2). Then there exists a parallel
block such that for any greedy scheduler2 the number of cache misses is nearly a factor of p
larger than the sequential cache complexity in the CO framework.
Proof. Consider a parallel block that forks off p identical tasks, each consisting of a strand read-
ing the same set of M memory locations from M/B blocks. In the sequential cache complexity
analysis, after the first M/B misses, all other accesses are hits, yielding a total cost of M/B
misses.
Any greedy schedule on p processors executes all strands at the same time, incurring si-
multaneous cache misses (for the same line) on each processor. Thus, the parallel block incurs
p(M/B) misses. t
u
The gap arises because a sequential ordering accounts for significant reuse among the subtasks
in the block, but a parallel execution cannot exploit reuse unless the line has been loaded earlier.
To overcome this difficulty, we make use of two key ideas:
1. Ignore any data reuse among the subtasks since parallel schedules might not be able to
make use of such data reuse. Figure 4.3 illustrates the difference between the sequential
cache complexity which accounts for data reuse between concurrent tasks and the parallel
cache complexity which forwards cache state only between parts of the computation that
are ordered (have edges between them).
2. Flushing the cache at each fork and join point of any task that does not fit within the cache
as it is unreasonable to expect schedules to preserve locality at a scale larger than the cache.
For simplicity, we assume that program variables have been statically allocated. Therefore
the set of locations that a task accesses is invariant with respect to schedule. Furthermore, we
logically segment the memory space into cache lines which are contiguous chunks of size B.
Therefore, every program variable is mapped to an unique cache line. Let loc(t; B) denote the
1
Two prior works not using the sequential ordering are the concurrent cache-oblivious model [27] and the ideal
distributed cache model [84], but both design directly for p processors and consider only a single level of private
caches.
2
In a greedy scheduler, a processor remains idle only if there is no ready-to-execute task.
47
Case 1: Assuming this task fits
Carry forward cache state according in cache, i.e., S(t,B) ≤ M
to some sequential order
Q1 Q*
All three
subtasks
Memory start with
same state
M,B
Merge state
and carry
forward
P
67
Figure 4.2: Difference in cache-state forwarding in the sequential cache complexity of the CO
framework and the parallel cache complexity. Green arrows indicate forwarding of cache state.
set of distinct cache lines accessed by task t, and S(t; B) = |loc(t; B)| · B denote its size (also
let s(t; B) = |loc(t; B)| denote the size in terms of number of cache lines). Let Q(c; M , B; κ) be
the sequential cache complexity of c in the CO framework when starting with cache state κ.
Definition 2 [Parallel Cache Complexity] For cache parameters M and B the parallel
cache complexity of a strand s, parallel block b, or task t starting at state κ is defined
as:
strand:
Q∗ (s; M , B; κ) = Q(s; M , B; κ)
parallel block: For b = t1 kt2 k . . . ktk ,
k
X
Q∗ (b; M , B; κ) = Q∗ (ti ; M , B; κ)
i=1
task: For t = c1 ; c2 ; . . . ; ck ,
k
X
∗
Q (t; M , B; κ) = Q∗ (ci ; M , B; κi−1 ) ,
i=1
48
Problem Span Cache Complexity Q∗
Scan (prefix sums, etc.) O(log n) O( Bn )
Matrix Transpose (n × m matrix) [86] O(log(n + m)) O(d nm e)
√ √ √ l 1.5 m B√
Matrix Multiplication ( n × n matrix) [86] O( n) O( nB / M + 1)
√ √ √ l 1.5 m √
Matrix Inversion ( n × n matrix) O( n) O( nB / M + 1)
Quicksort [109] O(log2 n) O( Bn (1 + log Mn+1 ))
Sample Sort [40] O(log2 n) O( Bn dlogM +2 ne)
l m
Sparse-Matrix Vector Multiply [40] O(log2 n) O( m B
+ n
(M +1) 1− )
(m nonzeros, n edge separators)
Convex Hull (e.g., see [39]) O(log2 n) O( Bn dlogM +2 ne)
Barnes Hut tree (e.g., see [39]) O(log2 n) O( Bn (1 + log Mn+1 ))
Table 4.1: Parallel cache complexities (Q∗ ) of some algorithms. The bounds assume M =
Ω(B 2 ). All algorithms are work optimal and their cache complexities match the best sequential
algorithms.
49
they are tight: they asymptotically match the bounds given by the sequential ideal cache model,
which are asymptotically optimal. Table 4.3 presents the parallel cache complexities of a few
such algorithms, including both algorithms with polynomial span (matrix inversion) and highly
imbalanced algorithms (the block transpose used in sample sort).
n
X n
X n j≤i+M/3
X X 6
2 n
+ = Θ(n log + n).
i=1 j>i+M/3
j − i i=1 j>i M M +1
Completing the analysis (dividing this cost by B) entails observing that each recursive quicksort
scans the subarray in order, and thus whenever a comparison causes a cache miss, we can charge
Θ(B) comparisons against the same cache miss.
The rest of the algorithms in Table 4.3 can be similarly analyzed without difficulty, observing
that for the original analyses in the CO framework, the cache complexities of the parallel subtasks
were already analyzed independently assuming no data reuse.
50
parallelism to make use of a cache of appropriate size, that either processors will sit idle or addi-
tional misses will be required. This might be true even if there is plenty of parallelism on average
in the computation. The following lower-bound makes this intuition more concrete.
Theorem 3 (Lower Bound) Consider a PMH comprised of a single cache shared by p > 1 pro-
cessors with parameters B = 1, M and C, and a memory (i.e. , h = 2). Then for all r ≥ 1, there
exists a computation with n = rpM memory accesses, Θ(n/p) span, and Q∗ (M , B) = pM , such
that for any scheduler, the runtime on the PMH is at least nC/(C + p) ≥ (1/2) min(n, nC/p).
Proof. Consider a computation that forks off p > 1 parallel tasks. Each task is sequential (a
single strand) and loops over touching M locations, distinct from any other task (i.e. , a total of
M p locations are touched). Each task then repeats touching the same M locations in the same
order a total of r times, for a total of n = rM p accesses. Because M fits within the cache, only
a task’s first M accesses are misses and the rest are hits in the parallel cache complexity cost
model. The total cache complexity is thus only Q∗ (M, B) = M p for B = 1 and any r ≥ 1.
Now consider an execution (schedule) of this computation on a shared cache of size M
with p processors and a miss cost of C. Divide the execution into consecutive sequences of
M time steps, called rounds. Because it takes 1 (on a hit) or C ≥ 1 (on a miss) units of
time for a task to access a location, no task reads the same memory location twice in the same
round. Thus, a memory access costs 1 only if it is to a location in memory at the start of the
round and C otherwise. Because a round begins with at most M locations in memory, the
total number of accesses during a round is at most (M p − M )/C + M by a packing argument.
Equivalently, in a full round, M processor steps execute at a rate of 1 access per step, and the
remaining M p − M processor steps complete 1/C accesses per step, for an average “speed” of
1/p + (1 − 1/p)/C < 1/p + 1/C accesses per step. This bound holds for all rounds except the
first and last. In the first round, the cache is empty, so the processor speed is 1/C. The final round
may include at most M fast steps, and the remaining steps are slow. Charging the last round’s
fast steps to the first round’s slow steps proves an average “speed” of at most 1/p + 1/C accesses
per processor time step. Thus, the computation requires at least n/(p(1/p+1/C)) = nC/(C +p)
time to complete all accesses. When C ≥ p, this time is at least nC/(2C) = n/2. When C ≤ p,
this time is at least nC/(2p). t
u
The proof shows that even though there is plenty of parallelism overall and a fraction of at
most 1/r of the accesses are misses in Q∗ , an optimal scheduler either executes tasks (nearly)
sequentially (if C ≥ p) or incurs a cache miss on (nearly) every access (if C ≤ p).
This indicates that some cost must be charged to account for the space-parallelism imbalance.
We extend parallel cache complexity with a cost metric that charges for such imbalance, but does
not charge for imbalance in subtask size. When coupled with our scheduler in Section 5.3, the
metric enables bounds in the PCC framework to effectively map to PMH at runtime (Section
5.3.2), even for highly-irregular computations.
The metric aims to estimate the degree of parallelism that can be utilized by a symmetric
hierarchy as a function of the size of the computation. Intuitively, a computation of size S with
“parallelism” α ≥ 0 should be able to use p = O(S α ) processors effectively. This intuition
works well for algorithms whose extent of parallelism is polynomial in the size of the problem.
For example, in a regular recursive algorithms where a is the number of parallel recursive sub-
problems, n/b is the size of subproblems for a size-n problem, and Õ(nc ) is the span (depth) of
51
b
Q(b) b ) + Q(t
= Q(t b ) + Q(t
b )
1 2 3
c
Q(b)
s(t)α
s(t2 )α
s(t1 )α s(t3 )α
s(t)α
b
Parallel block b; area denotes Q(b)
c c c
Q(b) Q(t 2) Q(t 2)
s(t)α = s(t2 )α s(t2 )α
s(t1 )α s(t3 )α
s(t2 )α
s(t)α
52
space s(t;) (this is purely for convenience of definition and does not restrict the program in any
way).
Definition 4 [parallel cache complexity extended for imbalance] For cache parameters M
and B and parallelism α, the effective cache complexity of a strand s, parallel block b, or
task t starting at cache state κ is defined as:
strand: Let t be the nearest containing task of strand s
bα (s; M, B; κ) = Q∗ (s; M, B; κ) × s(t; B)α
Q
bα (b; M, B; κ) =
Q
( nl b mo
Qα (ti ;M,B;κ)
s(t; B)α maxi s(ti ;B)α
(depth dominated)
max P
b
i Qα (ti ; M, B; κ) (work dominated)
task: For t = c1 ; c2 ; . . . ; ck ,
k
X
bα (t; M, B; κ) =
Q bα (ci ; M, B; κi ) ,
Q
i=1
In the rule for parallel block, the depth dominated term corresponds to limiting the number
of processors available
l to do them work on each subtask ti to s(ti )α . This throttling yields an
effective depth of Q bα (ti )/s(ti )α for subtask ti . The effective cache complexity of the parallel
block b nested directly inside task t is the maximum effective depth of its subtasks multiplied by
the number of processors for the parallel block b, which is s(t; B)α (see Fig. 4.3).
Qbα is an attribute of an algorithm, and as such can be analyzed irrespective of the machine
and the scheduler. Section 4.3.4 illustrates the analysis for Q bα (·) and effective parallelism for
several algorithms. Note that, as illustrated in Fig. 4.3(top) and the analysis of algorithms in the
report, good effective parallelism can be achieved even when there is significant work imbalance
among subtasks.
Finally, the depth dominated term implicitly l b includes m the span so we do not need a separate
Qα (t;M,B;κ)
depth (span) cost in our model. The term s(t;B)α
behaves like the effective depth in that
for a task t = s1 ; b1 ; s2 ; b2 ; . . . ; sk , the effective depth of task t is the sum of the effective depths
of si s and bi .
Note that since work is just a special case of Q∗ , obtained by substituting M = 0, the
effective cache complexity metric can be used to compute effective work just like effective cache
complexity.
53
4.3.3 Parallelizability of an Algorithm
bα (n; M, B)),
We say that an algorithm is α-efficient for a parameter α ≥ 0 if Q∗ (n; M, B) = O(Q
i.e. , for any values of M > B > 0 there exists a constant cM,B such that
bα (n; M, B)
Q
lim ≤ cM,B ,
n→∞ Q∗ (n; M, B)
where n denotes the input size. This α-efficiency occurs trivially if the work term always domi-
nates, but can also happen if sometimes the depth term dominates. The least upper bound on the
set of α for which an algorithm is α-efficient specifies the parallelizability of the algorithm. For
example, the Map algorithm in Section 4.3.1 is α-efficient for all values in the range [0, 1) and
not for any values greater than or equal to 1. Therefore, the map algorithm has parallelizabe to
degree 1.
The lower the α, the more work-space imbalance the effective complexity can absorb and
still be work efficient, and in particular when α = 0 the balance term disappears. Therefore, a
sequential algorithm, i.e. a program with a trivial chain for DAG, will not be α-efficient for any
α > 0. Its parallelizability would be 0.
Qbα (n) = O(1) for cn < B, M > 0 because only the first level of recursion after cn < B
(and one other node in the recursion tree rooted at this level, in case this call accesses locations
across a boundary) incurs at most one cache miss (which costs O(1α )) and the cache state, which
includes this one cache line, is carried forward to lower levels preventing them from any more
cache misses. For α < 1, M > 0, the summation term dominates (second term in the max) and
the recursion solves to O(dn/Be), which is optimal. If M = 0, then O(n), for α < 1.
We will see the analysis for some matrix operations. Consider a recursive version of matrix
addition where each quadrant of the matrix is added in parallel. Here, a task (corresponding to a
recursive addition) comprises a strand for the fork and join points and a parallel block consisting
of four smaller tasks on a matrix one fourth the size. For an input of size n, the space of the
algorithm is cn for some constant c. Therefore,
b O(1), cn ≤ B, M > 0
Qα (n; M, B) = α b (4.2)
O(dcn/Be ) + 4Qα (n/4; M, B), B < s
54
which implies Q bα (n; M, B) = O(dn/Be) if α < 1, M > 0, and Q bα (n) = O(dn/Beα ) if
α ≥ 1, M > 0. These bounds imply that matrix addition is work efficient only when parallelism
is limited to α < 1. Further, when M = 0, Q bα (n; M, B) = O(n) for α < 1.
The matrix multiplication described in Figure 4.4 consists of fork and join points and two
parallel blocks. Each parallel block is a set of 4 parallel tasks each, each of which is a matrix
multiplication on matrices four times smaller. Note that if at some level of recursion, the first
parallel block fits in cache, then the next parallel block can reuse the same set of locations. For
an input of size n, the space of the algorithm is cn for some constant c. Therefore, when M > 0,
0, if κ includes all relevant locations
O(1) if κ does not include the relevant cache lines,
bM ult (n; M, B; κ) = cn ≤ B
Q cn α (4.3)
α
b M ult n b M ult n
O( B ) + 4Qα ( 4 ; M, B; κ), κp + 4Qα ( 4 ; M, B; κp ), B < cn ≤ M
α
O( cn bM
) + 8Q α
ult n
( 4 ; M, B; ∅), otherwise,
B
55
work and parallelism as indicated by the maximum α is consistent with the fact that under the
conventional definitions, this inversion algorithm has a depth of about s1/2 .
In all the above examples, calls to individual tasks in a parallel block are relatively balanced.
An example where this is not the case is a parallel deterministic cache oblivious sorting from [40]
outlined as COSORT(i, n) Figure 4.4. This sorting uses five primitives: prefix sums, matrix
transpose, bucket transpose, merging and simple merge sort. Theorem25 will present an analysis
b s
to show that COSORT(cdot, n) has Qα (n; M, B) = B logM +2 s for α < 1 − Θ(1/ log n)
demonstrating that the algorithm has a parallelizability of 1.
For a version of Quicksort in which a random pivot is repeatedly drawn until it partitions the
array into two parts none of which is smaller than a third of the array, the algorithm is α-efficient
log2 dn/3M e
for α < log1.5 1 + 2 log d2n/3M e .
2
56
function QuickSort(A)
if (#A ≤ 1) then return A b α (n; M, B) = O((n/B) log(n/(M + 1)))
Q
p = A[rand(#A)] ;
Les = QuickSort({a ∈ A|a < p}) k In the code both k and { } indicate parallelism.
Eql = {a ∈ A|a = p} k The bounds are expected case.
Grt = QuickSort({a ∈ A|a > p}) ;
return Les ++ Eql ++ Grt
function SampleSort(A)
if (#A ≤ 1) then return A b α (n; M, B) = O((n/B) logM +2 n)
Q
√
parallel for i ∈ [0, n, . . . , n)
√
SampleSort(A[i, ..., i + n)) ; This version is asymptotically optimal for cache
√
P [0, 1, . . . , n) = findPivots(A) ; misses. The pivots partition the keys into buck-
√
B[0, 1, . . . , n) = bucketTrans(A, P ) ; ets and bucketTrans places the keys from each
√
parallel for i ∈ [0, 1, . . . , n) sorted subset in A into the buckets B. Each
SampleSort(Bi ) ; √
bucket ends up with about n elements [40].
return flattened B
function MM(A,B,C)
√
if (#A ≤ k) then return MMsmall(A,B) b α (n; M, B) = O((n1.5 /B)/ M + 1)
Q
(A11 , A12 , A21 , A22 ) = split(A) ;
(B11 , B12 , B21 , B22 ) = split(B) ; √ √
Multiplying two n × n matrices. The split
C11 = MM(A11 , B11 ) k C21 = MM(A21 , B11 ) k operation has to touch any of the four elements
C12 = MM(A11 , B12 ) k C22 = MM(A21 , B12 ) ; at the center of the matrices A, B. The eight
C11 + = MM(A12 , B21 ) k C21 + = MM(A22 , B21 ) k recursive subtasks are divided in to two par-
C12 + = MM(A12 , B22 ) k C22 + = MM(A22 , B22 ) ; allel blocks of four tasks each. Can easily
return join(C11 , C12 , C21 , C22 ) be converted to √Strassen with Qb α (n; M, B) =
(log
(n 2 7)/2 /B)/ M + 1 and the same depth.
function MatInv(A)
√
if (#a ≤ k) then InvertSmall(A) b α (n; M, B) = O((n1.5 /B)/ M + 1)
Q
(A11 , A12 , A21 , A22 ) = split(A) ;
√ √
A−122 = MatInv(A22 ) ; Inverting a n × n matrix using the Schur
S = A11 − A12 A−1 22 A21 ; complement (S). The split operation has to
C11 = MatInv(S) ; touch any of the four elements at the center of
√
C12 = C11 A12 A−1 −1
22 ; C21 = −A22 A21 C11 ; matrix A. The depth is O( n) since the two
C22 = A−1 −1
22 + A22 A21 C11 A12 A22 ;
−1
recursive calls cannot be made in parallel. The
return join(C11 , C12 , C21 , C22 ) parallelism comes from the matrix multiplies.
Figure 4.4: Examples of quicksort, sample sort, matrix multiplication and matrix inversion. All
the results in this table are for 0 < α < 1. Therefore, these algorithms have parallelizability of 1.
57
function BHT(P, (x0 , y0 , s))
if (#P = 0) then return EMPTY b α (n; M, B) = O((n/B) log(n/(M + 1)))
Q
if (#P = 1) then return LEAF(P [0])
xm = x0 + s/2 ; ym = y0 + s/2 ; The two dimensional Barnes Hut code for con-
P1 = {(x, y, w) ∈ P |x < xm ∧ y < ym } k structing the quadtree. The bounds make some
P2 = {(x, y, w) ∈ P |x < xm ∧ y ≥ ym } k (reasonable) assumptions about the balance of
P3 = {(x, y, w) ∈ P |x ≥ xm ∧ y < ym } k the tree.
P4 = {(x, y, w) ∈ P |x ≥ xm ∧ y ≥ ym } k
T1 = BHT(P1 , (x0 , y0 , s/2)) k
T2 = BHT(P2 , (x0 , y0 + ym , s/2)) k
T3 = BHT(P3 , (x0 + xm , y0 , s/2)) k
T4 = BHT(P4 , (x0 + xm , y0 + ym , s/2)) k
C = CenterOfMass(T1 , T2 , T3 , T4 ) ;
return NODE(C, T1 , T2 , T3 , T4 )
function ConvexHull(P )
P 0 =SampleSort(P by x coordinate) ; b α (n; M, B) = O((n/B) logM +2 n)
Q
return MergeHull(P 0 )
The two dimensional Convex Hull code (for
function MergeHull(P ) finding the upper convex hull). The bridgeHulls
if (#P = 0) then return EMPTY routine joins two adjacent hulls by doing a dual
Touch P [n/2]; binary search and only requires O(log n) work.
HL = MergeHull(P [0, . . . , n/2)) k The cost is dominated by the sort.
HR = MergeHull(P [n/2, . . . , n)) ;
return bridgeHulls(HL , HR )
function SparseMxV(A, x)
parallel for i ∈ [0, . . . , n) b α (n, m; M, B) = O(m/B + n/(M + 1)1−γ )
Q
ri = sum({v × xj |(v, j) ∈ Ai })
return r Sparse vector matrix multiply on a matrix with n rows
and m non-zeros. We assume the matrix is in com-
pressed sparse row format and Ai indicates the ith row
of A. The bound assumes the matrix has been diagonal-
ized using recursive separators and the edge-separator size
is O(nγ ) [38]. The parallel for loop at the top should be
done in a divide and conquer fashion by cutting the in-
dex space in half. At each such fork, the algorithm should
touch at least one element from the middle row. The sum
inside the loop can also be executed in parallel.
Figure 4.5: Examples of Barnes Hut tree construction, convex hull, and a sparse matrix by dense
vector multiply. All the results in this table are for 0 < α < 1. Therefore, these algorithms have
parallelizability of 1.
58
Chapter 5
Schedulers
A schedule specifies the location and order of execution for each instruction in the DAG. Since
DAGs are allowed to be dynamic, schedulers should be able to make online decisions. A sched-
uler is designed for each class of machines, or a machine model keeping in the mind the con-
straints and the features of the machine. A good scheduler makes the right trade-off between load
balancing the processors and preserving the locality of an algorithm. The quality of a schedule
for a given machine and algorithm is measured in terms of running time, the maximum space it
requires, and the amount of communication it generates.
As proposed in our solution approach, we separate the task of algorithm and scheduler de-
sign. It is the task of the algorithm designer to design algorithms that are good in the program-
centric cost model. On the other hand, the scheduler is designed per machine model and is
required to translate the cost in the program-centric model to good performance on the machine.
Performance bounds are typically provided in terms of the program-centric cost model and the
parameters that characterize the machine.
This chapter starts with a definition of schedulers (Section 5.1) and surveys previous work
on scheduling for simple machine models (Section 5.2). We will then present new results on
scheduling for the PMH machine model starting with a definition for space-bounded schedulers
and proceeding through the construction of a specific class of space-bounded schedulers (Sec-
tion 5.3 with performance guarantees on communication costs in terms of the parallel cache
complexity (Section 5.3.1) and bounds on running time based on the extended cache complex-
ity metric (Section 5.3.2). We will end this chapter with bounds on the space requirements of
dynamically allocated programs in terms of their effective cache complexity (Section 5.3.3). By
demonstrating that program-centric metrics of the PCC framework can be translated into prov-
able performance guarantees on realistic machine models, this chapter validates the claim that
such cost models are relevant and useful.
5.1 Definitions
A scheduler defines where and when each node in a DAG is executed on a certain machine. We
differentiate between preemptive and non-preemptive schedulers based on whether a strand is
executed on the same processor or is allowed to migrate for execution on multiple processors.
59
We consider only non-preemptive schedulers.
A non-preemptive schedule defines three functions for each strand `. Here, we use P to
denote the set of processors on the machine, and L to denote the set of strands in the computation.
• Start time: start : L → Z, where start(`) denotes the time the first instruction of ` begins
executing,
• End time: end : L → Z, where end (`) denotes the time the last instruction of ` finishes –
this depends on both the schedule and the machine, and
• Location: proc : L → P , where proc(`) denotes the processor on which the strand is
executed.
A non-preemptive schedule cannot migrate strands across processors once they begin exe-
cuting, so proc is well-defined. We say that a strand ` is live at any time τ with start(`) ≤ τ <
end (`).
A non-preemptive schedule must also obey the following constraints on ordering of strands
and timing:
• (ordering): For any strands `1 ≺ `2 , start(`2 ) ≥ end (`1 ).
• (processing time): For any strand `, end (`) = start(`)+γhschedule,machinei (`). Here γ denotes
the processing time of the strand, which may vary depending on the specifics of the ma-
chine and the history of the schedule. The point is simply that the schedule has no control
over this value.
• (non-preemptive execution): No two strands may run on the same processor at the same
time, i.e., `1 6= `2 , proc(`1 ) = proc(`2 ) =⇒ [start(`1 ), end (`1 )) ∩ [start(`2 ), end (`2 )) =
∅.
We extend the same notation and terminology to tasks. The start time start(t) of a task t is a
shorthand for start(t) = start(`s ), where `s is the first strand in t. Similarly end (t) denotes the
end time of the last strand in t. The location function proc, however, is not defined for a task as
each strand inside the task may execute on a different processor.
When discussing specific schedulers, it is convenient to consider the time a task or strand
first becomes available to execute. We use the term spawn time to refer to this time, which is
exactly the instant at which the preceding fork or join finishes. Naturally, the spawn time is no
later than the start time, but schedule may choose not to execute the task or strand immediately.
We say that the task or strand is queued during the time between its spawn time and start time
and live during the time between its start time and finish time. Figure 5.1 illustrates the spawn,
start and end times of a task its initial strand. They are spawned and start at the same time by
definition. The strand is continuously executed till it ends, while a task goes through several
phases of execution and idling before it ends.
60
spawn at the same time
queued
start at the same time
end
Lifetime of a strand
nested immediately within
the task to the left.
live
executing
end
Lifetime of a task.
Figure 5.1: Lifetime of a task, and of a strand nested directly beneath it. Note that the task can
go through multiple execution and idle phases corresponding to the execution of its fragments,
but the strand executes continuously in one stretch.
— and presents performance guarantees on communication costs, time and space in terms of
the sequential cache complexity Q1 , work W and depth D. Note that W is superfluous as Q1
subsumes work. It can be obtained by setting M = 0, B = 1 in the sequential complexity:
W = Q1 (0, 1).
61
of cache misses across all caches QP for a p processor machine processor machines is bounded
by
62
• The computation completes in time not more than WPlat /p + DPlat < W/p + O((DCk +
P
log 1/δ)Ck Mk /Bk )+O( i Ci (Q1 (Bi−1 , Bi−1 )−Q1 (Mi , Bi )))/p with probability at least
1 − δ.
Proof. Results from [22] imply the statement about the number of steals. Because the schedule
involves not more than O(p(DPlat + log 1/δ)) steals with probability at least 1 − δ, all the caches
at level i incur a total of at most Q1 (Mi , Bi ) + O(p(DPlat + log 1/δ)Mi /Bi ) cache misses with
probability at least 1 − δ. To compute the running time of the algorithm, we count the time spent
by a processor waiting upon a cache miss towards the work and use the same proof as in [22]. In
other words, we use the notion of latency-added work P (WPlat ) defined above. Because this is not
more than W + O(p(DPlat + log 1/δ)Ck Mk /Bk + i Ci (Q1 (Mi−1 , Bi−1 ) − Q1 (Mi , Bi ))) with
probability at least 1 − δ, the claim about the running time follows. t
u
Thus, for constant δ, the parallel cache complexity at level i exceeds the sequential cache
complexity by O(pDPlat Mi /Bi ) with probability 1 − δ. This matches the earlier bounds presented
for the single-level case.
We can also consider a centralized work stealing scheduler. In centralized work stealing,
processors deterministically steal a node of least depth in the DAG; this has been shown to be
a good choice for reducing the number of steals [22]. The bounds in Theorem 5 carry over to
centralized work stealing without the δ terms, e.g., the parallel cache complexity exceeds the
sequential cache complexity by O(pDPlat Mi /Bi ).
1 1’
2 2’ D 1’
a a a a 1
3 3’ D
3 12K+1
2’
4 4’
2
p/3 p/3
Figure 5.2: DAGs used in the lower bounds for randomized and centralized work stealing.
The following lower bound shows that there exist DAGs for which Qp does indeed exceed
Q1 by the difference stated above.
Theorem 6 (Lower Bound) For a multi-level private-cache machine P with any given number
of processors p ≥ 4, cache sizes M1 < · · · < Mk ≤ M/3 for some a priori upper bound M ,
cache line sizes B1 ≤ · · · ≤ Bk , and cache latencies C1 < · · · < Ck , and for any given depth
D0 ≥ 3(log p + log M ) + Ck + O(1), we can construct a nested-parallel computation DAG with
63
binary forking and depth D0 , whose (expected) parallel cache complexity on P , for all levels i,
exceeds the sequential cache complexity Q1 (Mi , Bi ) by Ω(pDPlat Mi /Bi ) when scheduled using
randomized work stealing. Such a computation DAG can also be constructed for centralized
work stealing.
Proof. Such a construction is shown in Figure 5.2(a) for randomized work stealing. Based on
the earlier lemma, we know that there exist a constant K such that the number of steals is at most
KpDlat with probability at least 1 − (1/Dlat ). We construct the DAG such that it consists of a
binary fanout to p/3 spines of length D = D0 − 2(K + log(p/3) + log M ) each. Each of the
first D/2 nodes on the spine forks off a subdag that consists of 3(12K+1) identical parallel scan
structures of length M each. A scan structure is a binary tree forking out to M parallel nodes that
collectively read a block of M consecutive locations. The remaining D/2 nodes on the spine are
the joins back to the spine of these forked off subdags. Note that DPlat = D0 + Ck because each
path in the DAG contains at most one memory request.
For any Mi and Bi , the sequential cache complexity Q1 (Mi , Bi ) = (p/3)(Mi + L + i +
3(12K+1) (M − Mi + Bi ))(D/2)/Bi because the sequential execution executes the subdags one
by one and can reuse a scan segment of length Mi /Bi for all the identical scans to avoid repeated
misses on a set of locations. In other words, sequential execution gets (p/3)(3(12K+1) − 1)(Mi −
Bi ))(D/2)/Bi cache hits because it executes identical scans one after the other.
We argue that in the case of randomized work stealing, there are a large number of subdags
such that the probability that at least two scans from the subdag are repeated are executed by
disjoint set of processors is greater than some positive constant. This implies that the cache
complexity is Θ(pDMi /Bi ) higher that the sequential cache complexity.
1. Once the p/3 spines have been forked, each spine is occupied by at least one processor
till the stage where work along a spine has been exhausted. This property follows directly
from the nature of the work stealing protocol.
2. In the early stages of computation after spines have been forked, but before the computa-
tion enters the join phase on the spines, exactly p/3 processors have a spine node on the
head of their work queue. Therefore, the probability that a random steal with get a spine
node and hence a fresh subdag is 1/3.
3. At any moment during the computation, the probability that more than p/2 of the latest
steals of the p processors found fresh spine nodes is exponentially small in terms of p and
therefore less than 1/2. This follows from the last observation.
4. If processor p stole a fresh subdag A and started the scans in it, the probability that work
from the subdag A is not stolen by some other processor before p executes the first 2/3-rd
of the scan is at most a constant ch ∈ (0, 1). This is because, the event that p currently
chose a fresh subdag is not correlated with any event in the history, and therefore, with
probability at least 1/2, more than p/2 processors did not steal a fresh subdag in the latest
steal. This means that these processors which stole a stale subdag (a subdag already being
worked on by other processors) got less than 2/3-rd fraction of the subdag to work on
before they need to steal again. Therefore, by the time p finishes 2/3-rd of the work, there
would have been at least p/2 steal attempts. Since these steals are randomly distributed
over all processors, there is a probability of at least 1/16 that two of these steals where
64
from p. Two steals from p would cause p to lose work from it’s fresh subdag.
5. Since there are at most KpD steals with high probability, there no more pD/6 subdags
which incur more than 12K steals. Among nodes with fewer than 12K steals, consider
those described in the previous scenario where the processor p that started the subdag has
work stolen from it before p executes 2/3-rd of the subdag. At least (1/16)(5pD/6) such
subdags are expected in a run. Because there are 312K+1 identical scans in each such
subdag, at least one processor apart from p that has stolen work from this subdag gets to
execute one complete scan. This means that the combined cache complexity at the i-th
cache level for each subdag is at least Mi /Bi greater than the sequential cache complexity,
proving the lemma.
The construction for lower bounds on centralized work stealing is shown in Figure 5.2(b).
The DAG ensures that the least-depth node at each steal causes a scan of a completely different
set of memory locations. The bound follows from the fact that unlike the case in sequential
computation, cache access overlap in the pairs of parallel scans are never exploited. t
u
65
where Dl = DC, C being the cost of a cache miss. Since, the PDF schedule is greedy, its running
time is bounded by
T ≤ (W + C · QP )/p + Dl . (5.5)
Further, when space is dynamically allocated, the space requirement of the PDF schedule does
not exceed the space requirement of the sequential schedule it is based on by more than the
number of premature nodes:
Sp ≤ S1 + pDl . (5.6)
Since the PDF scheduler makes decisions after every node, its overhead can be very high.
Some coarsening of the granularity at which scheduling decisions are made such as in AsyncDF
(Section 3 of [134]) is needed to bound scheduler overheads while trading off slightly on space
requirements and locality. Further loosening of the scheduling policy to allow for Work-Stealing
at the lowest granularities while using PDF at higher levels (DFDequeues, Section 6 of [134])
may be also be useful.
66
In this section we formally define the class of space-bounded schedulers, and show (Theo-
rem 8) that such schedulers have cache complexity on the PMH machine model that matches
the parallel cache complexity. For simplify of presentation, we assume that program are stitcally
allocated and therefore the space of a program is scheduler independent. We identify, by con-
struction, specific schedules within that class that have provable guarantees on time and space
based the effective cache complexity.
Informally, a space-bounded schedule satisfies two properties:
• Anchored: Each task is anchored to a smallest possible cache that is bigger than the task—
strands within the task can only be scheduled on processors in the tree rooted at the cache.
• Bounded: A “maximal” live task “occupies” a cache X if it is either (i) anchored to X, or
(ii) anchored to a cache in a subcluster below X while its parent is anchored above X. A
live strand occupies cache X if it is live on a processor beneath cache X and the strand’s
task is anchored to an ancestor cache of X. The sum of sizes of live tasks and strands that
occupy a cache is restricted by the scheduler to be less than the size of the cache.
These two conditions are sufficient to imply good bounds on the number of cache misses at every
level in the tree of caches. A good space-bounded scheduler would also handle load balancing
subject to anchoring constraints to quickly complete execution.
Space-Bounded Schedulers. A “space-bounded scheduler” is parameterized by a global dila-
tion parameter 0 < σ ≤ 1 and machine parameters {Mi , Bi , Ci , fi }. We will need the following
terminology for the definition.
Cluster: For any cache Xi , recall that its cluster is the set of caches and processors nested below
Xi . We use P (Xi ) to denote the set of processors in Xi ’s cluster and $(Xi ) to denote the set of
caches in Xi ’s cluster. We use Xi ⊆ Xj to mean that cache Xi is in cache Xj ’s cluster.
Befitting Cache: Given a particular cache hierarchy and dilation parameter σ ∈ (0, 1], we say
that a level-i cache befits a task t if σMi−1 < S(t, Bi ) ≤ σMi .
Maximal Task: We say that a task t with parent task t0 is level-i maximal if and only if a level-i
cache befits t but not t0 , i.e., σMi−1 < S(t, Bi ) ≤ σMi < S(t0 , Bi ).
Anchored: A task t with strand set L(t) is said to be anchored to level-i cache Xi (or equivalently
to Xi ’s cluster), if and only if a) it runs entirely in the cluster, i.e., {proc(`)|` ∈ L(t)} ⊆ P (Xi ),
and b) the cache befits the task. Anchoring prevents the migration of tasks to a different cluster
or cache. The advantage of anchoring a task to a befitting cache is that once it load its working
set, it can reuse it without the risk of losing it from the cache. If a task is not anchored anywhere,
for notational convenience we assume it is anchored at the root of the tree.
As a corollary, suppose t0 is a subtask of t. If t is anchored to X and t0 is anchored to X 0 ,
then X 0 ⊆ X. Moreover, X 6= X 0 if and only if t is maximal. In general, we are only concerned
about where maximal tasks are anchored.
Cache occupying tasks: For a level-i cache Xi and time τ , the set of live tasks occupying cache
Xi at time τ , denoted by Ot(Xi , τ ), is the union of (a) maximal tasks anchored to Xi that are
live at time τ , and (b) maximal tasks anchored to any cache in $(Xi ) \ {Xi }, live at time τ , with
their immediate parents anchored to a cache above Xi in the hierarchy. The tasks in (b) are called
67
“skip level” tasks. Tasks described in (a) and (b) are the tasks that consume space in the cache
at time τ . Note that the set of tasks in (b) can be dropped from the Ot without an asymptotic
change in cache miss bounds if the cache hierarchy is “strongly” inclusive,i.e. , Mi > cfi Mi−1
at all levels i of PMH for some constant c > 1
Cache occupying strands: The set of live strands occupying cache Xi at time τ , denoted by
Ol(Xi , τ ), is the set of strands {`} such that (a) ` is live at time τ , i.e., start(`) ≤ τ < end (`),
(b) ` is processed below Xi , i.e., proc(`) ∈ P (Xi ), and (c) `’s task is anchored strictly above Xi .
A space-bounded scheduler for a particular cache hierarchy is a scheduler parameterized by
σ ∈ (0, 1] that satisfies the two following properties:
• Anchored Every subtask t of the root task with is anchored to a befitting cache.
• Bounded: At every instant τ , for every level-i cache Xi , the sum of sizes of cache occupy-
ing tasks and strands is less then Mi :
X X
S(t, Bi ) + S(`, Bi ) ≤ Mi .
t∈Ot(Xi ,τ ) `∈Ol(Xi ,τ )
68
strand can use only the σMj capacity of a level-(j < i) cache awarded to it by the space-bounded
scheduler. Then the number of misses is indeed as though the strand executed on a serial level-
(i − 1) memory hierarchy with σMj cache capacity at each level j. Hence, Q∗ (s; σMj , Bj ) is
an upper bound on the actual number of level-j cache misses incurred while executing the strand
s. (The actual number may be less because an optimal replacement policy may not partition the
caches and the cache state is not initially empty.)
Finally, to complete the proof for all memory-hierarchy levels j, we assume inductively
that the theorem holds for all maximal subtasks of t. The defintion of parallel cache com-
plexity assumes an empty initial level-j cache state for any maximal level-j subtask of t, as
S(t; Bj ) > σMj . Thus, the level-j cache complexity for t is defined as Q∗ (t; σMj , Bj ) =
P ∗ 0
t0 ∈A(t) Q (t ; σMj , Bj , ∅), where A(t) is the set of all level-i strands and nearest maximal sub-
tasks of t. Since the theorem holds inductively for those tasks and strands in A(t), it holds for t.
t
u
In contrast, there is no such optimality result based on sequential cache complexity – Theo-
rem 1 (showing a factor of p gap) readily extends to any greedy-space-bounded scheduler, using
the same proof.
69
(and hence unanchored), maximal level-(j < i) subtasks of t. The ready-task list contains only
maximal tasks, as tasks which are not maximal are implicitly anchored to the same cache as t.1
A strand or task is assigned to the appropriate list when it first becomes ready. We use
parent[t] to denote the nearest containing task of task/strand t, and maximal [t] to denote the
nearest maximal task containing task/strand t (maximal [t] = t if t is maximal). When the exe-
cution of a strand reaches a fork, we also keep a count on the number of outstanding subtasks,
called the join counter.
Initially, the main task t is anchored at the smallest cache U in which it fits, both R(t) and
S(t) are created, and t is allocated all subclusters of U (ignore for the greedy scheduler). The
leading strand of the task is added to the ready-strand list S(t), and all other lists are empty.
The scheduler operates in parallel, invoking either of the two following scheduling rules when
appropriate. We ignore the precise implementation of the scheduler, and hence merely assume
that the invocation of each rule is atomic. For both rule descriptions, let t be a maximal level-i
task, let U be the cache to which it is anchored, and let Ut be all the children of U for the greedy
scheduler.
strands We say that a cache Uj is strand-ready if there exists a cache-to-processor path of caches
Uj , Uj−1 , . . . , U1 , p descending from Uj down to processor P such that: 1) each Uk ∈
{Uj , . . . , U1 } has σMk space available, and 2) P is idle. We call Uj , Uj−1 , . . . , U1 , p the
ready path. If S(t) is not empty and V ∈ Ut is strand-ready, then remove a strand s from
S(t). Anchor s at all caches along the ready path descending from V , and decrease the
remaining capacity of each Uk on the ready path by σMk . Execute the strand on the ready
path’s processor.
tasks Suppose there exists some level-(j − 1) < (i − 1) strand-ready descendant cache Uj−1 of
V ∈ Ut , and let Uj be its parent cache. If there exists level-j subtask t0 ∈ R(t) such that
Uj has S(t0 ; Bj ) space available, then remove t0 from R(t). Anchor t0 at Uj , create R(t0 )
and S(t0 ), and reduce Uj ’s remaining capacity appropriately. Then schedule and execute
the first strand of t0 as above.
Observe that the implementation of these rules can either be top-down with caches pushing tasks
down or bottom up with idle processors “stealing” and pulling tasks down. To be precise about
where pieces of the computation execute, we’ve chosen both scheduling rules to cause a strand
to execute on a processor. This choice can be relaxed without affecting the bounds.
When P has been given a strand s to execute (by invoking either scheduling rule), it executes
the strand to completion. Let t = parent[s] be the containing task. When s completes, there are
two cases:
Case 1 (fork). When P completes the strand and reaches a fork point, t’s join counter is
set to the number of forked tasks. Any maximal subtasks are inserted into the ready-task list
R(maximal [t]), and the lead strand of any other (non maximal) subtask is inserted into the ready-
strand list S(maximal [t]).
Case 2 (task end). If P reaches the end of the task t, then the containing parent task parent[t]’s
join counter is decremented. If parent[t]’s join counter reaches zero, then the subsequent strand
in parent[t] is inserted into maximal [parent[t]]’s ready-strand list. Moreover, if t is a maximal
1
S
For this greedy scheduler, a pair of lists for each cache is sufficient, i.e. , R(U ) = t anchored to U R(t), and
similarly for S, but for consistency of notation with later schedulers we continue with a pair of lists per task here.
70
level-i task, the total space used in by the cache to which t was anchored is decreased by S(t; Bi )
to reflect t’s completion.
In either of these cases, P becomes idle again, and scheduling rules may be invoked.
Space-bounded schedulers like this greedy variant perform well for computations that are
very balanced. Chowdhury et al. [64] present analyses of a similar space-bounded scheduler
(that includes minor enhancements violating the greedy principle). These analyses are algorithm
specific and rely on the balance of the underlying computation.
We prove a theorem (Th. 9) to provides run time bounds of the same flavor for the greedy
space-bounded scheduler, leveraging many of the same assumptions on the underlying algo-
rithms. Along with our main theorem (Theorem 10), these analyses all use recursive application
of Brent’s theorem to obtain a total running time: small recursive tasks are assumed inductively
to execute quickly, and the larger tasks are analyzed using Brent’s theorem with respect to a
single-level machine of coarser granularity.
The following are the types of informal structural restrictions imposed on the underlying algo-
rithms to guarantee efficient scheduling with a greedy-space-bounded scheduler and previous
work. For more precise, sufficient restrictions, see the technical report [39].
1. When multiple tasks are anchored at the same cache, they should have similar structure
and work. Moreover, none of them should fall on a much longer path through the compu-
tation. If this condition is relaxed, then some anchored task may fall on the critical path.
It is important to guarantee each task a fair share of processing resources without leaving
many processors idle.
2. Tasks of the same size should have the same parallelism.
3. The nearest maximal descendant tasks of a given task should have roughly the same size.
Relaxing this condition allows two or more tasks at different levels of the memory hierar-
chy to compete for the same resources. Guaranteeing that each of these tasks gets enough
processing resources becomes a challenge.
In addition to these balance conditions, the previous analyses exploit preloading of tasks: the
memory used by a task is assumed to be loaded (quickly) into the cache before executing the
task. For array-based algorithms preloading is a reasonable requirement. When the blocks to be
loaded are not contiguous, however, it may be computationally challenging to determine which
blocks should be loaded. Removing the preloading requirement complicates the analysis, which
then must account for high-level cache misses that may occur as a result of tasks anchored at
lower-level caches.
We now formalize the structural restrictions described above and analyze a greedy space-
bounded scheduler with ideal dilation σ = 1. This analysis is intended both as an outline for the
main theorem in Section 5.3.2.5 and also to understand the limitations of the simple scheduler
that we are trying to ameliorate. These balance restrictions are necessary in order to prove good
bounds for the greed scheduler, and that are relaxed in Section 5.3.2.5.
Pi the total
For conciseness, define work of a maximal level-i task t for the given memory
∗
hierarchy as T W (t) = j=0 Cj Q (t; Mj , Bj ). For s a level-i strand or task that is not maximal,
71
P
T W (s) = i−1 ∗
j=0 Cj Q (s; Mj , Bj ), i.e. , assume all memory locations already reside in the level-i
cache.
We say that a task t is recursively large if any child subtask t0 of t (those nested directly
within t) uses at least half the space used by t, i.e. , S(t0 ; B) ≥ S(t; B)/2 for all B, and t0 is
also recursively large. Recursively large is quite restrictive, but it greatly simplifies both the
analysis and further definitions here, as we can assume all subtasks of a level-i task are at least
level-(i − 1).
Consider a level-i task t. We say that t is γi parallel at level-i if no more than a 1/γi fraction
of the total work from all descendant maximal level-i strands of t fall along a single path in t’s
subdag, and no more than a 1/γi fraction of total work from all descendant level-(i − 1) maximal
subtasks falls along a single path among level-(i − 1) tasks2 in t’s subdag. Moreover, we say that
t is λi task heavy if the total work from strands comprises at most a 1/λi factor of the total work
from level-(i − 1) subtasks.
72
Qk−1
(T W (tk−1 )/pk−1 ) j=1 (1 + fj /γj )(1 + pj−1 /λj ) (i.e. , takes this amount of time when run on a
single “processor”). Similarly, each level-k strand s is a serial computation having T W (s) work.
We thus interpret tk as a computation having
X k−1
T W (tk−1 ) Y fj pj−1 X
1+ 1+ + T W (s)
subtasks tk−1
pk−1 j=1 γj λj level-k strands s
k−1
T W (tk ) − Ck S(tk ; Bk )/Bk Y fj pj−1 X
≤ 1+ 1+ + T W (s)
pk−1 j=1
γj λ j
level-k strands s
k−1
pk−1 T W (tk ) − Ck S(tk ; Bk )/Bk Y fj pj−1
≤ 1+ 1+ 1+
λk pk−1 j=1
γj λj
work after preloading the cache, where the first step of the derivation follows from the definition
of Q∗ and T W , and the second follows from tk being λk task heavy. Since tk is also γk parallel,
it has a critical-path length that is at most a 1/γk factor of the work. Observing that the space-
bounded scheduler operates as a greedy scheduler with respect to the level-k cache, we apply
Brent’s theorem to conclude that the space bounded scheduler executes this computation in
k−1
1 1 pk−1 T W (tk ) − Ck S(tk ; Bk )/Bk Y fj pj−1
+ 1+ 1+ 1+
fk γk λk pk−1 j=1
γj λj
k
T W (tk ) − Ck S(tk ; Bk )/Bk Y fj pj−1
= 1+ 1+
pk j=1
γ j λj
time on the fk processors. Adding the Ck S(tk ; Bk )/pk Bk time necessary to load the level-k cache
completes the proof t
u
The (1 + fj /γj ) overheads arise from load imbalance and the recursive application of Brent’s
theorem, whereas the (1 + pj−1 /λj ) overheads stem from the fact that strands block other tasks.
For sufficiently parallel algorithms with short enough strands, the product of these overheads
reduces to O(1). This bound is then optimal whenever Q∗ (t; Mi , Bi ) = O(Q(t; Mi , Bi )) for all
i.
Our new scheduler in the next section relaxes all of these balance conditions, allowing for
more asymmetric computations. Moreover, we do not assume preloading. We use the effective
cache complexity analysis (Section 4.3.2) to facilitate analysis of less regular computations and
prove our performance bounds with respect to the effective cache complexity metric.
73
The main performance theorem for our scheduler is the following, which is proven in Sec-
tion 5.3.3. This theorem does not assume any preloading of the caches, but we do assume that
all block sizes are the same (except at level 0). Here, the machine parallelism β is defined as
the minimum value such that for all hierarchy levels i > 1, we have fi ≤ (Mi /Mi−1 )β , and
f1 ≤ (M1 /3B1 )β . Aside from the overhead vh (defined in the theorem), this bound is optimal
in the PCC framework for a PMH with 1/3-rd the given memory sizes. Here, k is a tunable
constant scheduler parameter with 0 < k < 1, discussed later in this section. Observe that the vh
overhead reduces significantly (even down to a constant) if the ratio of memory sizes is large but
the fanout is small (as in the machines in Figure 7.1), or if α β.3
Theorem 10 Consider an h-level PMH with B = Bj for all 1 ≤ j ≤ h, and let t be a task such
that S(t; B) > fh Mh−1 /3 (the desire function allocates the entire hierarchy to such a task) with
effective parallelism α ≥ β, and let α0 = min {α, 1}. The runtime of t is no more than:
Ph−1 b
j=0 Qα (t; Mj /3, Bj ) · Cj
· vh , where overhead vh is
ph
Y
h−1
1 fj
vh = 2 + .
j=1
k (1 − k)(Mj /Mj−1 )α0
Since much of the scheduler matches the greedy-space-bounded scheduler from Section 5.3.2.1,
only the differences are highlighted here. An operational description of the scheduler can be
found in the associated technical report [39].
There are three main differences between this scheduler and greedy-space-bounded scheduler
from Section 5.3.2.1. First, we fix the dilation to σ = 1/3 instead of σ = 1. Whereas reducing σ
worsens the bound in Theorem 8 (only by a constant factor for cache-oblivious algorithms), this
factor of 1/3 allows us more flexibility in scheduling.
Second, to cope with tasks that may skip levels in the memory hierarchy, we associate with
each cache a notion of how busy the descending cluster is, to be described more fully later. For
now, we say that a cluster is saturated if it is “too busy” to accept new tasks, and unsaturated
otherwise. The modification to the scheduler here is then restricting it to anchor maximal tasks
only at unsaturated caches.
Third, to allow multiple differently sized tasks to share a cache and still guarantee fairness,
we partition each of the caches, awarding ownership of specific subclusters to each task. Specif-
ically, whenever a task t is anchored at U , t is also allocated some subset Ut of U ’s level-(i − 1)
subclusters, essentially granting ownership of the clusters to t. This allocation restricts the sched-
uler further in that now t may execute only on Ut instead of all of U . This allocation is exclusive
in that a cluster may be allocated to only one task at a time, and no new tasks may be anchored at
any cluster V ∈ Ut except descendent tasks of t. Moreover, tasks may not skip levels through V ,
i.e. , a new level-(j < i − 1) subtask of a level-k > i task may not be anchored at any descendent
cache of V . Tasks that skipped levels in the hierarchy before V was allocated may have already
been anchored at or below V — these tasks continue running as normal, and they are the main
reason for our notion of saturation.
3
For example, vh < 10 on the Xeon 7500 as α → 1.
74
A level-i strand is allocated every cache to which it is anchored, i.e. , exactly one cache at
every level below i. In contrast, a level-i task t is anchored only to a level-i cache and allocated
potentially many level-(i−1) subclusters, depending on its size. We say that the size-s = S(t; Bi )
task t desires gi (s) level-(i − 1) clusters, gi to be specified later. When anchoring t to a level-i
cache U , let q be the number of unsaturated and unallocated subclusters of U . Select the most
unsaturated min{q, gi (s)} of these subclusters and allocate them to t.
For each cache, there may be one anchored maximal task that is underallocated, meaning that
it receives fewer subclusters than it desires. The only underallocated task is the most recent task
that caused the cache to transition from being unsaturated to saturated. Whenever a subcluster
frees up, allocate it to the underallocated task. If assigning a subcluster causes the underallocated
task to achieve its desire, it is no longer underallocated, and future free subclusters become
available to other tasks.
Scheduler details. We now describe the two missing details of the scheduler, namely the notion
of saturation, as well as the desire function gi , which specifies for a particular task size the
number of desired subclusters.
One difficulty is trying to schedule tasks with large desires on partially assigned clusters.
We continue assigning tasks below a cluster until that cluster becomes saturated. But what if
the last job has large desire? To compensate, our notion of saturation leaves a bit of slack,
guaranteeing that the last task scheduled can get some minimum amount of computing power.
Roughly speaking, we set aside a constant fraction of the subclusters at each level as a reserve.
The cluster becomes saturated when all other subclusters have been allocated. The last task
scheduled, the one that causes the cluster to become saturated, may be allocated subclusters
from the reserve.
There is some tradeoff in selecting the reserve constant here. If a large constant is reserved,
we may only allocate a small fraction of clusters at each level, thereby wasting a large fraction
of all processing power at each level. If, on the other hand, the constant is small, then the last
task scheduled may run too slowly. Our analysis will count the first against the work of the
computation and the second against the depth.
Designing a good function to describe saturation and the reserved subclusters is complicated
by the fact that task assignments may skip levels in the hierarchy. The notion of saturation thus
cannot just count the number of saturated or allocated subclusters — instead, we consider the
degree to which a subcluster is utilized. For a cluster U with subclusters V1 , V2 , . . . , Vfi (fi > 1),
define the utilization function µ(U ) as follows:
n P fi 0 o
min 1, 1
µ (V ) if U is a level-(≥ 2)
kfi i=1 i
cluster
µ(U ) =
x
min{1, f1 k } if U is a level-1 cluster with
x allocated processors
and (
1 if V is allocated
µ0 (V ) = ,
µ(V ) otherwise
where k ∈ (0, 1), the value (1 − k) specifying the fraction of processors to reserve. For a cluster
U with just one subcluster V , µ(U ) = µ(V ). To understand the remainder of this section, it
75
is sufficient to think of k as 1/2. We say that U is saturated when µ(U ) = 1 and unsaturated
otherwise.
It remains to define the desire function gi for level i in the hierarchy. A natural choice for
gi is gi (S) = dS/(Mi /fi )e = dSfi /Mi e. That is, associate with each subcluster a 1/fi fraction
of the space in the level-i cache — if a task uses x times this fraction of total space, it should
receive x subclusters. It turns out that this desire does not yield good scheduler performance with
respect to our notion of balanced cache complexity. In particular it does not give enough parallel
slackness to properly load-balance subtasks across subclusters.
0
Instead, we use gi (S) = min{fi , max{1, bf (3S/Mi )α c}}, where α0 = min{α, 1}. What
this says is that a maximal level-i task is allocated one subcluster when it has size S(t; Bi ) =
1/α0
Mi /(3fi ), and the number of subclusters allocated to t increases by a factor of 2 whenever
0
the size of t increases by a factor of 21/α . It reaches the maximum number of subclusters when
it has size S(t; Bi ) = Mi−1 /3. We define g(S) = gi (S)pi−1 if S ∈ (Mi−1 /3, Mi /3].
For simplicity we assumed in our model that all memory is preallocated, which includes
stack space. This assumption would be problematic for algorithms with α > 1 or for algorithms
which are highly dynamic. However, it is easy to remove this restriction by allowing temporary
allocation inside a task, and assume this space can be shared among parallel tasks in the analysis
of Q∗ . To make our bounds work this would require that for every cache we add an additional
number of lines equal to the sum of the sizes of the subclusters. This augmentation would account
even for the very worst case where all memory is temporarily allocated.
The analysis of this scheduler is in Section 5.3.3, summarized by Theorem 10. There are a
couple of challenges that arise in the analysis. First, while it is easy to separate the run time of a
task on a sequential machine in to a sum of the cache miss costs for each level, it is not as easy
on a parallel machine. Periods of waiting on cache misses at several levels at multiple processors
can be interleaved in a complex manner. Our separation lemma (lemma 13) addresses this issue
by bounding the run time by the sum of its cache costs at different levels (Q bα (t; M, Bi ) · Ci ).
Second, whereas a simple greedy-space-bounded scheduler applied to balanced tasks lends
itself to an easy analysis through an inductive application of Brent’s theorem, we have to tackle
the problem of subtasks skipping levels in the hierarchy and partially allocated caches. At a
high level, the analysis of Theorem 10 recursively decomposes a maximal level-i task into its
nearest maximal descendent level-j < i tasks. By inductively assuming that these tasks finish
“quickly enough,” we combine the subproblems with respect to the level-i cache analogous to
Brent’s theorem, arguing that a) when all subclusters are busy, a large amount of productive
work occurs, b) and when subclusters are idle, all tasks have been allocated sufficient resources
to progress at a sufficiently quick rate. Our carefully planned allocation and reservations of
clusters as described earlier in this section are critical to this proof.
5.3.2.6 Analysis
This section presents the analysis of our scheduler, proving several lemmas leading up to Theo-
rem 10. First, the following lemma implies that the capacity restriction of each cache is subsumed
by the scheduling decision of only assigning tasks to unallocated, unsaturated clusters.
Lemma 11 Any unsaturated level-i cluster U has at least Mi /3 capacity available and at least
one subcluster that is both unsaturated and unallocated.
76
Proof. The fact that an unsaturated cluster has an unsaturated, unallocated cluster follows from
the definition. Any saturated or allocated subcluster Vi has P µ0 (Vi ) = 1. Thus, forPunsaturated
cluster U with subclusters V1 , . . . , Vfi , we have 1 > (1/kfi ) fj=1 i
µ0 (Vi ) ≥ (1/fi ) fj=1
i
µ0 (Vi ),
and it follows that some µ0 (Vi ) < 1.
We now argue that if U is unsaturated, then it has at least Mi /3 capacity remaining. This fact
is trivial for fi = 1, as in that case at most one task is allocated. Suppose that tasks t1 , t2 , . . . , tk
are anchored to an unsaturated cluster and have desires x1 , x2 , . . . , xk . Since U is unsaturated
P k
i=1 xi ≤ fi −1, which implies xi ≤ fi −1 for all i. We will show that the ratio of Pk
space to desire,
S(ti ; B)/xi , is at most 2Mi /3fi for all tasks anchored to U , which implies i=1 S(ti ; B) ≤
2Mi /3.
0
Since a task with desire x ∈ {1, 2, . . . , fi − 1} has size at most (Mi /3)((x + 1)/fi )1/α , where
0
α0 = min{α, 1} ≤ 1, the ratio of its space to its desire x is at most (Mi /3x)((x + 1)/fi )1/α .
Letting q = 1/α0 ≥ 1, we have the space-to-desire ratio r bounded by
t
u
Latency added cost. Section 4.3.2 introduced effective cache complexity Q bα (·), which is al-
gorithmic measure. To analyze the scheduler, however, it is important to consider when cache
misses occur. To factor in the effect of the cache miss costs, we define the latency added effective
work, denoted by W c ∗ (·), of a computation with respect to the particular PMH. Latency added
α
effective work is only for use in the analysis of the scheduler, and does not need to be analyzed
by an algorithm designer.
The latency added effective work is similar to the effective cache complexity, but instead of
counting just instructions, we add the cost of cache misses at each instruction. The cost ρ(x) of
an instruction x accessing location m is ρ(x) = W (x)+Ci0 if the scheduler causes the instruction
x to fetch m from a level i cache on the given PMH. Using this per-instruction cost, we define
effective work Wc ∗ (.) of a computation using structural induction in a manner that is deliberately
α
bα (.).
similar to that of Q
Definition 12 (Latency added cost) For cost ρ(x) of instruction x, the latency added effective
work of a task t, or a strand s or parallel block b nested inside t is defined as:
strand: X
cα∗ (s) = s(t; B)α
W ρ(x). (5.9)
x∈s
77
task: For t = c1 ; c2 ; . . . ; ck ,
k
X
c ∗ (t) =
W c ∗ (ci ).
W (5.11)
α α
i=1
Because of the large number of parameters involved ({Mi , B, Ci }i etc.), it is undesirable
to compute the latency added work directly for an algorithm. Instead, we will show a nice
relationship between latency added work and effective work.
We first show that Wcα∗ (·) (and ρ(·), on which it is based) can be decomposed into a per (cache)
level costs Wcα (·) that can each be analyzed in terms of that level’s parameters ({Mi , B, Ci }).
(i)
We then show that these costs can be put together to provide an upper bound on W c ∗ (·). For
α
i ∈ [h − 1], W cα(i) (c) of a computation c is computed exactly like W c ∗ (c) using a different base
α
case: for each instruction x in c, if the memory access at x costs at least Ci0 , assign a cost of
ρi (x) = Ci to that node. Else, assign a cost of ρi (x) = 0. Further, we set ρ0 (x) = W (x), and
define Wcα(0) (c) in terms of ρo (·). It also follows from these definitions that ρ(x) = Ph−1 ρi (x)
i=0
for all instructions x.
Lemma 13 Separation Lemma: For an h-level PMH with B = Bj for all 1 ≤ j ≤ h and
computation A, we have
h−1
X
c ∗
Wα (A) ≤ cα(i) (A).
W
i=0
Proof. The proof is based on induction on the structure of the computation (in terms of its
decomposition in to block, tasks and strands). For the base case of the induction, consider the
sequential thread (or strand) s at the lowest level in the call tree. If S(s) denotes the space of
task immediately enclosing s, then by definition
! h−1
!
X XX
cα∗ (s) =
W ρ(x) · s(s; B)α ≤ ρi (x) · s(s; B)α (5.12)
x∈s x∈s i=0
h−1
! h−1
X X X
= ρi (x) · s(s; B) α
= c (i) (s).
W (5.13)
α
i=0 x∈s i=0
For a parallel block b inside task t consisting of tasks {ti }m i=1 , consider the equation 5.10
c ∗
for Wα (b) which is the maximum of m + 1 terms, the (m + 1)-th term being a summation.
Suppose that of these terms, the term that determines Wcα∗ (b) is the k-th term (denote this by Tk ).
Similarly, consider the equation 5.10 for evaluating each of W cα(l) (b) and suppose that the kl -th
(l)
term (denoted by Tkl ) on the right hand side determines the value of W cα(l) (b). Then,
h−1
X h−1
X Ph−1 c (l)
cα∗ (b)
W (l) (l) l=0 Wα (b)
= T k ≤ Tk ≤ Tk = , (5.14)
s(t; B)α l=0 l=0
l
s(t; B) α
78
which completes the proof. Note that we did not use the fact that some of the components
Ph−1 were
work or cache complexities. The proof only depended on the fact that ρ(x) = i=0 ρi (x) and
the structure of the composition rules given by equations 5.11, 5.10. ρ could have been replaced
with any other kind of work and ρi with its decomposition. t
u
The previous lemma indicates that the latency added work can be separated into costs per
cache level. The following lemma then relates these separated costs to effective cache complex-
bα (·).
ity Q
Lemma 14 Consider an h-level PMH with B = Bj for all 1 ≤ j ≤ h and a computation c.
If c is scheduled on this PMH using a space-bounded scheduler with dilation σ = 1/3, then
c ∗ (c) ≤ Ph−1 Q
W bα (c; Mi /3, B) · Ci .
α i=0
Proof. (Sketch) The function W cα(i) (·) is monotonic in that if it is computed based on function
ρ0i (·) instead of ρi (x), where ρ0i (x) ≤ ρi (x) for all instructions x, then the former estimate would
be no more than the latter. It then follows from the definitions of W cα(i) (·) and ρi (·), that W
cα(i) (c) ≤
Qbα (c; Mi /3, B) · Ci for all computations c, i ∈ {0, 1, . . . , h − 1}. Lemma 13 then implies that
for any computation c: W cα∗ (c) ≤ Ph−1 Q bα (c; Mi /3, B) · Ci . t
u
i=0
Finally, we prove the main lemma, bounding the running time of a task with respect to the
remaining utilization the clusters it has been allocated. At a high level, the analysis recursively
decomposes a maximal level-i task into its nearest maximal descendent level-j < i tasks. We as-
sume inductively that these tasks finish “quickly enough.” Finally, we combine the subproblems
with respect to the level-i cache analogous to Brent’s theorem, arguing that a) when all subclus-
ters are busy, a large amount of productive work occurs, b) and when subclusters are idle, all
tasks make sufficient progress. Whereas this analysis outline is consistent with a simple analysis
of the greedy scheduler and that in [64], here we address complications that arise due to partially
allocated caches and subtasks skipping levels in the hierarchy.
Lemma 15 Consider an h-level PMH with B = Bj for all 1 ≤ j ≤ h and a computation
to schedule with α ≥ β, and let α0 = min {α, 1}. Let Ni be a task or strand which has
been assigned a set Ut of q ≤ gi (S(Ni ; B)) level-(i − 1) subclusters by the scheduler. Letting
P
V ∈Ut (1 − µ(V )) = r (by definition, r ≤ |Ut | = q), the running time of Ni is at most:
c ∗ (Ni )
W α
· vi , where overhead vi is (5.15)
rpi−1
i−1
Y
1 fi
vi = 2 + . (5.16)
j=1
k (1 − k)(Mi /Mi−1 )α0
Proof. We prove the claim on run time using induction on the levels.
Induction: Assume that all child maximal tasks of Ni have run times as specified above. Now
look at the set of clusters Ut assigned to Ni . At any point in time, either:
1. all of them are saturated.
79
2. at least one of the subcluster is unsaturated and there are no jobs waiting in the queue
R(Ni ). More specifically, the job on the critical path (χ(Ni )) is running. Here, critical
path χ(Ni ) is the set of strictly ordered immediate child subtasks that have the largest sum
of effective depths. We would argue in this case that progress is being made along the
critical path at a reasonable rate.
Assuming q > 1, we will now bound the run time required to complete Ni by bounding the
number of cycles the above two phases use. Consider the first phase. A job x ∈ C(Ni ) (subtasks
of Ni ) when given an appropriate number of processors (as specified by the function g) can not
have an overhead of more than vi−1 , i.e., it uses at most Wcα∗ (x)vi−1 individual processor clock
cycles. Since in the first phase, at least k fraction of available subclusters under Ut are always
allocated (at least rpi−1 clock cycles put together) to some subtask of Ni , it can not last for more
than
X 1W
cα∗ (x) cα∗ (Ni )
1W
· vi−1 < · vi−1 number of cycles.
k rpi−1 k rpi−1
x∈C(Ni )
For the second phase, we argue that the critical path runs fast enough because we do not
underallocate processing resources for any subtask by more than a factor of (1 − k) as against
that indicated by the g function. Specifically, consider a job x along the critical path χ(Ni ).
Suppose x is a maximal level-j(x) task, j(x) < i. If the job is allocated subclusters below
a level-j(x) subcluster V , then V was unsaturated at the time of allocation. Therefore, when
P scheduler picked the gj(x) (S(x; B)) most unsaturated subclusters under V (call this set V),
the
v∈V µ(v) ≥ (1 − k)gj(x) (S(x; B)). When we run x on V using the subclusters V, its run time
is at most
cα∗ (x)
W c ∗ (x)
W · vj(x)−1
P · vj(x)−1 < α (5.17)
( v∈V µ(v))pj(x)−1 (1 − k)g(S(x; Bj(x) ))
cα∗ (x)
W s(x; B)α vj(x)−1
= (5.18)
s(x; B)α g(S(x; Bj(x) )) 1 − k
s(x;B) α
time. Amongst all subtasks x of Ni , the ratio g(S(x;Bj(x) ))
is maximum when when S(x; B) =
Mi−1 /3, where the ratio is (Mi−1 /3B)α /pi−1 . Summing the run times of all jobs along the
80
critical path would give us an upper bound for time spent in phase two. This would be at most
X Wcα∗ (x)
· vi−1 (5.19)
(1 − k)g(S(x; B))
x∈χ(Ni )
X c ∗ (x)
W s(x; B)α vi−1
α
= · · (5.20)
s(x; B)α g(S(x; B)) 1 − k
x∈χ(Ni )
X W c ∗ (x) α
≤ α · (Mi−1 /3B) · vi−1 (5.21)
s(x; B)α pi−1 1−k
x∈χ(Ni )
Putting together the run times of both the phases, we have an upper bound of
c ∗ (Ni ) c∗
W α 1 fi Wα (Ni )
vi−1 · + 0 = · vi . (5.26)
rpi−1 k (1 − k)(Mi /Mi−1 )α rpi−1
If q = 1, Ni would get allocated just one (i − 1)-subcluster V , and of course, all the (yet
unassigned) (i − 2) subclusters V below V . Then, we can view this scenario as Ni running on the
(i − 1)-level hierarchy. Memory accesses and cache latency costs are charged the same way as
c ∗ (Ni ). By inductive
before with out modification so that the effective work of Ni would still be Wα
hypothesis, we know that the run time of Ni would be at most
Wc ∗ (Ni )
α
P · vi−1
( V ∈V (1 − µ(V )))pi−2
c ∗ (Ni )
W P
which is at most α
rpi−1
· vi since V ∈V (1 − µ(V )) ≥ rfi−1 and vi−1 < vi .
Base case (i = 1): N1 has q = r processors available, all under a shared cache. If q = 1, the
claim is clearly true. If q > 1, since there is no further anchoring beneath the level-1 cache (since
M0 = 0), we can use Brent’s theorem on the latency added effective work to bound the run time:
81
c ∗ (N1 )
W c∗
Wα (N1 )
α
r
added to the critical path length, which is at most s(N1 ;B)
α . This sum is at most
cα∗ (N1 )
W q
c∗
Wα (N1 )
g(S(N1 ; B))
1+ ≤ 1+ (5.27)
r s(N1 ; B)α r s(N1 ; B)α
c ∗ (N1 )
W α S(N1 ; B)α f1
≤ 1+ · (5.28)
r s(N1 ; B)α (M1 /3)α
c ∗ (N1 )
W
≤ α × 2. (5.29)
r
t
u
Theorem 10 follows from Lemmas 14 and 15, starting on a system with no utilization.
82
Definition 16 (α-Shortened DAG and α-effective depth ) The α-shortened DAG Gxα,y of task
t evaluated at point x with respect to size y ≤ S1 (t, B) is constructed based on the decom-
position of the DAG G corresponding to t in to super-nodes of size at most y and glue-nodes.
Replace
l b 0 the super-node
m corresponding to every maximal subtask t0 of size (at most y) by a chain
of Qα (t ;x,B) nodes in Gx . For every maximal sequential chain of glue-nodes (glue-strand) s,
(y/B)α α,y
add Q (s; x, B; ∅) nodes in Gxα,y (one for every cache miss starting from a cold state). Prece-
∗
dence constraints between the nodes in Gxα,y are identical to those between the super-nodes and
glue-nodes they represent in G. The α-effective depth dxα,y of task t evaluated at point x with
respect to size y < S1 (t, B) is the depth of the α-shortened DAG Gxα,y .
Lemma 17 It follows from the composition l b rules m for effective cache complexity that the α-
x Qα (t;x,B)
effective depth dα,y of a task t is at most s (t,B)α , when y ≤ S1 (t, B).
1
Proof. We will fix y, and prove the claim by induction on the size of the task t. If S1 (t, B) = y,
the claim follows immediately.
For a task t = s1 ; b1 ; s2 ; b2 ; . . . ; sk with sequential space greater than y, it follows from the
composition rules for effective depth that
& ' k k−1
& '
bα (t; x, B; κ)
Q X X bα (ti ; x, B; κ)
Q
≥ Q∗ (si ; x, B) + , (5.30)
s(t; B)α i=1 i=i
s(t i ; B)α
where ti is the parallel task in block bi with the largest effective depth.
Iflwe inductively m assume that the α-effective depths of every task ti with S1 (t, B) ≥ y is at
b α (ti ;x,B)
Q
most s (t ,B)α , then we can use the above inequality to bound the α-effective depth of t. In the
1 i
α-shortened graph of t, each of the strands si would show up as Q∗ (si ; x, B) nodes corresponding
to the first term on the right hand side of the inequality 5.30. For those ti with sequential space
greater than y, the second term on the RHS of the inequality dominates the α-shortened DAG
of ti (inductive
lb assumption).
m l b ti with
For those m sequential space y, their depth in the α-shortened
Qα (ti ;x,B) Qα (ti ;x,B)
DAG (y/B)α is smaller than s (t ,B)α as s1 (ti , B) < y. The claim then follows.
1 i
t
u
Just as in the case of the PDF scheduler for the shared cache (Section 5.2.2), we will demon-
strate bounds on that the extra space added by premature nodes in the execution of a task an-
chored at level i. We assume that allocations in a maximal level-(i − 1) task incur the cost of
cache miss to be fetched from level i cache – i.e. , they incur Ci−1 cost. By adding the extra
space to the cache at level i, we can retain the same asymptotic bounds on communication cost
and time as presented in Theorems 8 and 10.
Lemma 18 Let ti be a level-i task. Suppose that Ci−1 = 1 and Cj = 0 for all j < i − 1, i.e.
cache misses to level i cost one cycle and accesses to all lower level caches are free. Then the
recursive PDF-based space-bounded schedule mapping ti to a machine with parallelism β < α
83
requires no more than
i−1 & ' α
1 bα (ti ; Mi , B)
Q Mi−1
2 × × fi Mi−1 + fi
1−k s(ti , B)α B
additional space over S1 (ti , B) at level i. When the sequential space of ti is exactly Mi , the
expression above simplifies to
! !
bα (ti ; Mi , B)
Q Qbα (ti ; Mi , B)
1−α 1−α
O × fi 1 + Mi−1 · Bα or O 1 + Mi−1 · Bα .
(Mi /Mi−1 )α fi
α/β−1
Mi−1
Proof. Let Hα := Gα,M i−1
be the α-shortened DAG of ti at point Mi−1 with respect to size Mi−1 .
Denote its effective depth by dα . The PDF schedule on Hα allows us to adapt the arguments in
the proof of Theorem 2.3 in [37] which bounds the number of premature nodes in a PDF schedule
for Hα . To do this, we adopt the following convention: for a maximal task t0 replaced by l > 1
nodes in Hα , we say that a node at depth i < l of t0 in Hα is completed in a space-bounded
schedule if there have been at least i × (Mi−1 /B)α cache misses from instructions in t0 . t0 is
completed to depth l if it has been completed. We partition the execution of t0 into phases that
can related to execution of levels in Hα . l
1
i−1 (Mi−1 /B)α m
Partition wall-clock time into continuous phases of 1−k × pi−1 clock cycles,
where k ∈ (0, 1) is the reservation parameter described in the previous section. If a super-node
is executed to depth i at the beginning of a phase and yet to be completed, it will be executed to
depth i + 1 by the end of the phase because of the allocation policy. Put another way, each phase
provides ample time for a (i − 1)-level subcluster to completely execute all the instructions in
any set of nodes from Hα corresponding to maximal tasks or glue nodes that fit in Mi−1 space.
Consider a snapshot of the execution and let C denote the set of nodes of Hα that have been
executed and let the longest sequential prefix of the sequential execution of ti contained in C be
C1 . A phase is said to complete level l in the Hα if it is the earliest phase in which all nodes at
depth l in C1 have been completed. We will bound the number of phases that complete a level
l ≤ dα by arguing that if a new premature node is started in phase i, either phase i or i + 1
completes a level.
Suppose that a node of Hα premature with respect to C1 was started in phase i. Let l be the
lowest level in Hα completed in C1 at the start of phase i. Then, at the beginning of phase i, all
nodes in C1 at level l + 1 are either executed, under execution or ready to be executed (denote
these three sets by Cl+1,i,d , Cl+1,i,e , and Cl+1,i,r respectively). A premature node with respect to
C1 can not be started unless all of the nodes in Cl+1,i,r have been started as they are ready for
execution and have a higher priority in the PDF order on Hα . Therefore, if a premature node is
started in phase i, all nodes in Cl+1,i,r have been started by the end of phase i, which implies they
will be completed by phase i + 1. Nodes in Cl+1,i,e will be completed by phase i. This proves our
claim that if a premature node is started in phase i, a new level of Hα in phase i or i + 1.
There are at most dα nodes in which a level of C1 in Hα is completed. Since a premature
node can be executed only in a phase that completes a level or in the phase before it, the number
of phases that start a new premature node with respect to C1 is at most 2dα . We will bound the
84
additional space that premature nodes take up in each such phase. Note that there will be phases
in which premature nodes are executed but not started. We account for the space added by each
such premature node in the phase that started it.
Suppose a phase contained new premature nodes. A premature super-node in the decomposi-
tion of ti that increases the space requirement by M units at some point during its execution must
have at least M cache misses or allocations. It costs at least M processor cycles since Ci−1 = 1,
i.e. every unit of extra space is paid for with a processor cycle. Therefore, the worst case scenario
in terms of extra space added by premature nodes is the following. Every processor allocates an
unit of space in a premature node every cycle until the last cycle. In the last cycle of the phase,
an additional set of premature nodes of the largest possible size are started. The contribution of
all but the last cycle of the phase to extra space is at most number of cycles per phase multiplied
by the number of processors, i.e.
i−1 ! i−1
α
1 (Mi−1 /B) 1
× × pi = × fi (Mi−1 /B)α .
1−k pi−1 1−k
In the last cycle of phase, the costliest way to schedule premature nodes is to schedule a Mi−1
size super-node at each level-(i − 1) subcluster for a total of fi Mi−1 space. Adding together the
extra space contributed by premature nodes across all phases gives an upper bound of
i−1 !
1
2dα × × fi (Mi−1 /B)α + fi Mi−1
1−k
& ' i−1 !
Qbα (ti ; Mi , B) 1
≤2 × × fi (Mi−1 /B)α + fi Mi−1
s(ti , B)α 1−k
i−1 & b '
1 Qα (ti ; Mi , B)
≤2 × × (fi (Mi−1 /B)α + fi Mi−1 ) .
1−k s(ti , B)α
The requirement that Ci−1 = 1 in the above lemma can be easily relaxed by letting phases
be longer by a factor of Ci−1 . However, allowing Cj to be non-trivial for j < i − 1 requires a
slightly different argument and results in a different bound.
85
To interpret this lemma, consider matrix multiplication with parallelizability 1.5. A bad
schedule might need superlinear space to execute this algorithm. However, the recursive PDF
schedule executes a matrix multiplication with inputs size Mi with the same guarantees on com-
munication costs and running time as Theorems 8 and 10 using only o(Mi ) extra space at level-i
cache Mi when the machine parallelism β is less than 1.
We now relax the condition that Cj = 0 for all j < i − 1 and bound the amount of extra space
needed to maintain communication and running time bounds. The bounds in the next lemma are
relatively weak compared to the previous lemma. However, we conjecture that the bound can be
improved.
Lemma 19 The recursive PDF-based space-bounded schedule that maps a level-i task ti on to
a machine with parallelism β < α would require no more than
i−1 i−1
& '! i−1 α !
1 X bα (ti ; Mj , B)
Q X Cj Mj
2 × α
× fi Mi−1 + fi
1−k j=0
s(t i , B) j=0
Ci−1 B
wall clock cycles. The second difference is that we have to relate the start of new premature
Mi−1 Mj
nodes to completion of levels not just in Hαi−1 := Gα,M i−1
but also in each of Hαj := Gα,M i−1
for
j ≤ i − 1. Further, we change the convention used in the previous proof: for a maximal task t0
replaced by l > 1 nodes in Hαj , we say that a node at depth i < l of t0 in Hαj is completed in a
space-bounded schedule if there have been at least i×(Mi−1 /B)α cache misses from instructions
in t0 to a level j cache. t0 is completed to depth l if it has been completed.
Let t0 be a maximal task in the decomposition of ti into maximal level-≤ (i − 1) subtasks and
glue nodes. If the super-node corresponding to t0 is under execution at the beginning of phase i,
then with the new phase duration and conventions, the super-node completes and another level
of t0 in at least one of {Hαj }j≤i−1 is completed by the end of phase i. A similar statement can be
made of the all the super-nodes under execution at the beginning of phase i — the maximal tasks
that these super-nodes correspond to will progress by at least one level in one of {Hαj }j≤i−1 .
In parallel with the previous proof, it can be argued that if phase i starts a new premature
node with respect to an execution snapshot C and its sequential prefixes C1j in each of Hαj , then
Pi−1 Mj
phase i or i + 1 completes another level in one of C1j . Therefore, there are at most 2 j=0 dα,Mi−1
phases in which new premature nodes start execution. A phase which starts a new premature
node adds at most
i−1 Xi−1 α !
1 Cj Mj
× fi Mi−1 + fi
1−k j=0
Ci−1 B
86
extra space over S1 (t, B) at the level-i cache. The Ci−1 in the denominator of the expression
comes because every allocation or cache miss to level i cache costs Ci−1 wall clock cycles. The
lemma follows by multiplying the number of phases that allow premature nodes by the maximum
amount of extra space that can be allocated in those phases.
t
u
87
88
Chapter 6
We advocated the advantage of designing portable algorithms with provable bounds in program-
centric cost models in Section 1.3. Such algorithms can be used across different machines with
an appropriate scheduler with good performance results. Two closely related program-centric
cost models were described as appropriate choices for measuring parallelism and locality in
algorithms. Good algorithms in these frameworks have the following properties.
• Low-depth and good sequential cache complexity. Design algorithms with low depth
(polylogarithmic in input size if possible) and minimal sequential cache complexity Q1 in
the Cache-Oblivious framework according to a depth-first schedule.
• Low effective cache complexity in the PCC framework. Design algorithms with mini-
bα for as large a value of α as possible.
mal effective cache complexity Q
This Chapter presents several polylogarithmic depth algorithms with good locality for many
common problems. The algorithms are oblivious to cache sizes, number of processors, and other
machine parameters which makes them portable.
Most algorithms in this chapter (except a few graph algorithms) are optimal in both the cost
models. This is not surprising because the cost models are closely related for divide and conquer
algorithms. Loosely speaking, a divide and conquer algorithm designed to be optimal in the
first cost model will be optimal in the second cost model if each divide step is “balanced”. One
trivial way to ensure balance is to divide the problem size evenly at each recursive divide. Of
course, this is not always feasible and algorithms with some irregularity in the division can also
be optimal in the second model as described in Section 4.3.2. For example, consider the analysis
of Samplesort in Section 6.3. Despite an unequal divide, the algorithm has optimal asymptotic
complexity Q bα for values of α up to 1 as a result of ample parallelism. In general, the greater
the parallelism in the algorithm, the greater the imbalance it can handle without an increase in
effective cache complexity. In other words, they are easier to load balance.
This chapter starts with parallel algorithms for building blocks including Scan, Pack, Merge,
and Sort. Section 6.4 presents algorithms for Sparse-Matrix Vector Multipliation. Section 6.5
uses Scan and Sort to design algorithms for List-Ranking, Euler Tours and other graph algo-
rithms. Section 6.6 presents algorithms for set cover. Tables 6.1 and 6.2 summarize these results.
89
Problem Depth Cache Complexity Parallelizability
Q1 and Q∗
Prefix Sums (Sec.6.1) O(log n) O( Bn ) 1
Merge (Sec.6.2) O(log n) O( Bn ) 1
Sort (deterministic)∗ (Sec.6.3.1) O(log2 n) O( Bn dlogM +2 ne) 1
Sort (randomized; bounds w.h.p.)∗ O(log1.5 n) O( Bn dlogM +2 ne) 1
(Sec.6.3.2)
Sparse-Matrix Vector Multiply O(log2 n) O( mB
+ n
M 1−
) <1
(m non-zeros, n separators)∗ (Sec.6.4)
Matrix Transpose (n × m matrix) [86] O(log (n + m)) O( nm
B
) 1
Table 6.1: Complexities of low-depth cache-oblivious algorithms. New algorithms are marked
(∗ ). All algorithms are work optimal and their cache complexities match the best sequential
algorithms. The fourth columnd presents the parallelism of the algorithm (see Section 4.3.3),
i.e. , the maximum value of α for which Q bα = O(Q∗ ). The bounds assume M = Ω(B 2 ). The
parallelizability of Sparse-matrix vector multipliaction depends on the precise balance of the
separators.
90
A[0:3] A[4:7]
+ +
A[0:1] A[2:3] A[4:5] A[6:7]
+ + + +
A
Figure 6.1: Prefix Sums. The first phase involves recursively summing up the elements and
storing the intermediate values.
Algorithm 1 UP(A, T, i, n)
if n = 1 then
return A[i]
else
k ← i + n/2
T [k] ← UP(A, T, i, n/2)
right ← UP(A, T, k, n/2)
return T [k] ⊕ right
end if
Algorithm 2 DOWN(R, T, i, n, s)
if n = 1 then
R[i] ← s
else
k ← i + n/2
DOWN(R, T, i, n/2, s)
DOWN(R, T, k, n/2, s ⊕ T [k]);
end if
91
Lemma 20 The prefix sum of an array of length n can be computed in W (n) = O(n), D(n) =
O(log n) and Q∗ (n; M, B) = Q1 (n; M, B) = O(dn/Be). Further its Q bα (s; M, B) = O(ds/Be)
for α < 1, M > 0.
Proof. There exists a positive constant c such that for a call to the function UP with n ≤ cM , all
the memory locations accessed by the function fit into the cache. Thus, Q∗ (n; M, B) ≤ M/B
for n ≤ cM . Also Q∗ (n; M, B) = O(1) + 2 · Q1 (n/2; M, B) for n > cM . Therefore, the
sequential cache complexity for UP is O(n/M + (n/(cM )) · (M/B)) = O(dn/Be). A similar
analysis for the cache complexity of DOWN shows that Q1 (n; M, B) = O(dn/Be). Both the
algorithms have depth O(log n) because the recursive calls can be made in parallel.
Each of these phases involves a simple balanced recursion where each task of size s adds two
integers apart from two recrsive tasks on an array half the size. Therefore, the recursion for Qbα
of each phase is the same as for the Map operation analyzed in Section 4.3.4. The bounds for
effective cache complexity follows. t
u
We note that the temporary array T can also be stored in preorder or postorder with the same
bounds, but that the index calculations are then a bit more complicated. What matters is that any
subtree is contiguous in the array. The standard heap order (level order) does not give the same
result.
Pack. The Scan operation can be used to implmenent the Pack operation. The Pack operation
takes as input an array A of length n and a binary predicate that can be applied to each element in
the array. It outputs an array with only those elements in array A that satisfy the predicate. To do
this, the Pack operation (i) applies the predicate to each element in A independently in parallel
marking a select bit on the element to 1 if selected for inclusion and 0 otherwise, (ii) computes a
prefix sum on the select bit, and (iii) transfers the selected elements to their position in the output
array according to the prefix sum. The works, depth and cache complexity of the Pack operation
are asymptotically the same as the of the Prefix Sum operation.
6.2 Merge
To make the usual merging algorithm√ cache oblivious, we use divide-and-conquer with a branch-
1/3
ing factor of n . (We need a o( n) branching factor in order to achieve optimal cache com-
plexity.) To merge two arrays A and B of sizes lA and lB (lA + lB = n), conduct a dual binary
search of the arrays to find the elements ranked {n2/3 , 2n2/3 , 3n2/3 , . . . } among the set of keys
from both arrays, and recurse on each pair of subarrays. This takes n1/3 · log n work, log n depth
and at most O(n1/3 log (n/B)) cache misses. Once the locations of pivots have been identified,
the subarrays, which are of size n2/3 each, can be recursively merged and appended.
Lemma 21 Two arrays of combined length n can be merged in W (n) = O(n), D(n) = O(log n)
and Q∗ (n; M, B) = Q1 (n; M, B) = O(dn/Be). Further its Q bα (s; M, B) = O(ds/Be) for
2
α < 1 when M > Ω(B ).
Proof. The cache complexity of Algorithm MERGE can be expressed using the recurrence
Q∗ (n; M, B) ≤ n1/3 (log (n/B) + Q∗ (n2/3 ; M, B)), (6.1)
92
Algorithm 3 MERGE ((A, sA , lA ), (B, sB , lB ), (C, sC ))
if lB = 0 then
Copy A[sA : sA + lA ) to C[sC : sC + lA )
else if lA = 0 then
Copy B[sB : sB + lB ) to C[sC : sC + lB )
else
∀k ∈ [1 : bn1/3 c], find pivots (ak , bk ) such that ak + bk = kdn2/3 e
and A[sA + ak ] ≤ B[sB + bk + 1] and B[sB + bk ] ≤ A[sA + ak + 1].
∀k ∈ [1 : bn1/3 c], MERGE((A, sA +ak , ak+1 −ak ), (B, sB +bk , bk+1 −bk ), (C, sC +ak +bk ))
end if
{This function merges A[sA : sA + lA ) and B[sB : sB + lB ) into array C[sC : sC + lA + lB )}
where the base case is Q∗ (n; M, B) = O(dn/Be) when n ≤ cM for some positive constant c.
When n > cM , Equation 6.1 solves to
Because M = Ω(B 2 ) and n > cM , the O(n/B) term in Equation 6.2 is asymptotically
larger than the n1/3 log (n/B) term, making the second term redundant. Therefore, in all cases,
Q∗ (n; M, B) = O(dn/Be). The analysis for Q1 is similar. The bounds on Q bα follow because
the recursion divides the problem into exactly equal parts.
The recurrence relation for the depth is:
which solves to D(n) = O(log n). It is easy to see that the work involved is linear. t
u
Mergesort. Using this merge algorithm in a mergesort in which the two recursive calls are paral-
bα (n; M, B) = O((n/B) log2 (n/M ))
lel gives an algorithm with depth O(log2 n) and cache complexity Q
for α < 1, which is not optimal. Blelloch et al. [38] analyze similar merge and mergesort algo-
rithms with the same (suboptimal) cache complexities but with larger depth. Next, we present a
sorting algorithm with optimal cache complexity (and low depth).
6.3 Sorting
In this section, we present the first cache-oblivious sorting algorithm that achieves optimal work,
polylogarithmic depth, and good sequential cache complexity. √ Prior cache-oblivious algorithms
with optimal cache complexity [55, 56, 57, 81, 86] have Ω( n) depth. In the PCC framework,
the algorithm has optimal cache complexity and is parallelizable up to α < 1.
Our sorting algorithm uses known algorithms for prefix sums and merging as subroutines.
A simple variant of the standard parallel prefix-sums algorithm has logarithmic depth and cache
complexity O(n/B). The only adaptation is that the input has to be laid out in memory such
that any subtree of the balanced binary tree over the input (representing the divide-and-conquer
93
recursion) is contiguous. For completeness, the precise algorithm and analysis are in Section 6.1
of the appendix. Likewise, a simple variant of a standard parallel merge algorithm also has
logarithmic depth and cache complexity O(n/B), as described next.
Algorithm 4 COSORT(A, n)
if n ≤ 10 then
return Sort A sequentially
end if √
h ← d ne
∀i ∈ [1 : h], Let Ai ← A[h(i − 1) + 1 : hi]
∀i ∈ [1 : h], Si ← COSORT(Ai , h)
repeat
Pick an appropriate sorted pivot set P of size h
∀i ∈ [1 : h], Mi ← SPLIT(Si , P)
{Each array Mi contains for each bucket j a start location in Si for bucket j and a length of
how many entries are in that bucket, possibly 0.}
L ← h × h matrix formed by rows Mi with just the lengths
LT ← TRANSPOSE(L)
∀i ∈ [1 : h], Oi ← PREFIX-SUM(LTi )
OT ← TRANSPOSE(O) {Oi is the ith row of O}
T
∀i, j ∈ [1 : h], Ti,j ← hMi,j h1i, Oi,j , Mi,j h2ii
{Each triple corresponds to an offset in row i for bucket j, an offset in bucket j for row i
and the length to copy.}
until No bucket is too big
Let B1 , B2 , . . . , Bh be arrays (buckets) of sizes dictated by T
B-TRANSPOSE(S, B, T , 1, 1, h)
∀i ∈ [1 : h], Bi0 ← COSORT(Bi , length(Bi ))
return B10 ||B20 || . . . ||Bh0
Our sorting algorithm (COSORT in Algorithm 4) first splits the set of elements into √n
subarrays of size √n and recursively sorts each of the subarrays. Then, samples are chosen to
determine pivots. This step can be done either deterministically or randomly. We first describe a
deterministic version of the algorithm for which the repeat and until statements are not needed;
Section 6.3.2 will describe a randomized version that uses these statements. For the deterministic
version, we choose every (log n)-th element from each of the subarrays as a sample. The sample
Algorithm 5 B-TRANSPOSE(S, B, T , is , ib , n)
if n = 1 then
Copy S_{is}[T_{is,ib}⟨1⟩ : T_{is,ib}⟨1⟩ + T_{is,ib}⟨3⟩)
to B_{ib}[T_{is,ib}⟨2⟩ : T_{is,ib}⟨2⟩ + T_{is,ib}⟨3⟩)
else
B-TRANSPOSE(S, B, T , is , ib , n/2)
B-TRANSPOSE(S, B, T , is , ib + n/2, n/2)
B-TRANSPOSE(S, B, T , is + n/2, ib , n/2)
B-TRANSPOSE(S, B, T , is + n/2, ib + n/2, n/2)
end if
Figure 6.2: Bucket transpose diagram: The 4×4 entries shown for T dictate the mapping from the 16 depicted segments of S to the 16 depicted segments of B. Arrows highlight the mapping for two of the segments.
set, which is smaller than the given data set by a factor of log n, is then sorted using the mergesort algorithm outlined above. Because mergesort is reasonably cache-efficient, using it on a set slightly smaller than the input set is not too costly in terms of cache complexity. More precisely, this mergesort incurs O(⌈n/B⌉) cache misses. We can then pick √n evenly spaced keys from the sample set P as pivots to determine bucket boundaries. To determine the bucket boundaries, the pivots are used to split each subarray using the cache-oblivious merge procedure. This procedure also takes no more than O(⌈n/B⌉) cache misses.
Once the subarrays have been split, prefix sums and matrix transpose operations can be used to determine the precise location in the buckets where each segment of the subarray is to be sent. We can use the standard divide-and-conquer matrix-transpose algorithm [86], which is work optimal, has logarithmic depth and has optimal cache complexity when M = Ω(B^2) (details in Table 6.1). The mapping of segments to bucket locations is stored in a matrix T of size √n × √n. Note that none of the buckets will be loaded with more than 2√n log n keys because of the way we select pivots.
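As a concrete illustration of this bookkeeping step, the sketch below computes, from the h × h matrix L of segment lengths, the triples ⟨offset in subarray i, offset in bucket j, length⟩ that the bucket transpose consumes. It is not the thesis code: the thesis performs the column step with the cache-oblivious TRANSPOSE and PREFIX-SUM primitives, whereas this sequential version takes the prefix sums directly.

#include <cstddef>
#include <vector>

struct Triple { std::size_t src_off, dst_off, len; };

// L[i][j] = number of keys of subarray i that fall into bucket j.
// Returns T[i][j] = <start of that segment within subarray i,
//                    start of that segment within bucket j, its length>.
std::vector<std::vector<Triple>>
bucket_offsets(const std::vector<std::vector<std::size_t>>& L) {
  std::size_t h = L.size();
  std::vector<std::vector<Triple>> T(h, std::vector<Triple>(h));
  for (std::size_t i = 0; i < h; ++i) {      // offsets within subarray i:
    std::size_t row_off = 0;                 // prefix sums along row i
    for (std::size_t j = 0; j < h; ++j) {
      T[i][j].src_off = row_off;
      T[i][j].len = L[i][j];
      row_off += L[i][j];
    }
  }
  for (std::size_t j = 0; j < h; ++j) {      // offsets within bucket j:
    std::size_t col_off = 0;                 // prefix sums down column j, i.e.
    for (std::size_t i = 0; i < h; ++i) {    // along row j of the transpose
      T[i][j].dst_off = col_off;
      col_off += L[i][j];
    }
  }
  return T;
}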
Once the bucket boundaries have been determined, the keys need to be transferred to the
buckets. Although a naive algorithm to do this is not cache-efficient, we show that the bucket
transpose algorithm (Algorithm B-TRANSPOSE in Algorithm 5) is. The bucket transpose is a
four way divide-and-conquer procedure on the (almost) square matrix T which indicates a set of
segments of subarrays (segments are contiguous in each subarray) and their target locations in
the bucket. The matrix T is cut in half vertically and horizontally and separate recursive calls are
assigned the responsibility of transferring the keys specified in each of the four parts. Note that
ordinary matrix transpose is the special case of T_{i,j} = ⟨j, i, 1⟩ for all i and j.
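A direct rendering of this four-way recursion follows. It is a sequential sketch rather than the thesis code: the Triple layout from the previous sketch is repeated for self-containment, n is assumed to be a power of two, and the four recursive calls, which Algorithm 5 runs in parallel, are issued one after the other.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Triple { std::size_t src_off, dst_off, len; };  // as in the previous sketch

// Transfers the segments described by the n-by-n block of T whose top-left
// corner is (is, ib): keys from subarray S[is + r] go to bucket Bkt[ib + c].
// Buckets must already be allocated at their final sizes.
void bucket_transpose(const std::vector<std::vector<int>>& S,
                      std::vector<std::vector<int>>& Bkt,
                      const std::vector<std::vector<Triple>>& T,
                      std::size_t is, std::size_t ib, std::size_t n) {
  if (n == 1) {
    const Triple& t = T[is][ib];
    std::copy(S[is].begin() + t.src_off,
              S[is].begin() + t.src_off + t.len,
              Bkt[ib].begin() + t.dst_off);
    return;
  }
  std::size_t h = n / 2;   // n assumed to be a power of two in this sketch
  bucket_transpose(S, Bkt, T, is,     ib,     h);   // the four calls are
  bucket_transpose(S, Bkt, T, is,     ib + h, h);   // parallel in Algorithm 5;
  bucket_transpose(S, Bkt, T, is + h, ib,     h);   // here they run one after
  bucket_transpose(S, Bkt, T, is + h, ib + h, h);   // the other
}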
Lemma 22 Algorithm B-TRANSPOSE transfers a matrix of √n × √n keys into bucket matrix B of √n buckets according to offset matrix T in O(n) work, O(log n) depth, and Q∗(n; M, B) = O(⌈n/B⌉) cache complexity.
Proof. It is easy to see that the work is O(n) because there is only constant work for each of the
O(log n) levels of recursion and each key is copied exactly once. Similarly, the depth is O(log n)
because we can use prefix sums to do the copying whenever a segment is larger than O(log n).
To analyze the cache complexity, we use the following definitions. For each node v in the re-
cursion tree of bucket transpose, we define the node's size s(v) to be n^2, the size of its submatrix
T , and the node’s weight w(v) to be the number of keys that T is responsible for transferring.
We identify three classes of nodes in the recursion tree:
1. Light-1 nodes: A node v is light-1 if s(v) < M/100, w(v) < M/10, and its parent node is of
size ≥ M/100.
2. Light-2 nodes: A node v is light-2 if s(v) < M/100, w(v) < M/10, and its parent node is of
weight ≥ M/10.
3. Heavy leaves: A leaf v is heavy if w(v) ≥ M/10.
The union of these three sets covers the responsibility for transferring all the keys, i.e., all leaves
are accounted for in the subtrees of these nodes.
From the definition of a light-1 node, it can be argued that all the keys that a light-1 node is
responsible for fit inside a cache, implying that the subtree rooted at a light-1 node cannot incur
more than M/B cache misses. It can also be seen that there cannot be more than 4n/(M/100) light-1 nodes, which leads to the fact that the sum of the cache complexities of all the light-1 nodes is no more than O(⌈n/B⌉).
Light-2 nodes are similar to light-1 nodes in that their target data fits into a cache of size M .
If we assume that they have a combined weight of n − W , then there are no more than 4(n −
W )/(M/10) of them, putting the aggregate cache complexity for their subtrees at 40(n−W )/B.
A heavy leaf of size w incurs ⌈w/B⌉ cache misses. There are no more than W/(M/10) of them, implying that their aggregate cache complexity is W/B + 10W/M < 11W/B. Therefore, the cache complexities of light-2 nodes and heavy leaves add up to another O(⌈n/B⌉).
Note that the validity of this proof does not depend on the size of the individual buckets.
The statement of the lemma holds even for the case where each of the buckets is as large as O(√n log n). □
Lemma 23 The effective cache complexity of Algorithm B-TRANSPOSE, which transfers a matrix of √n × √n keys into bucket matrix B of √n buckets according to offset matrix T, is Q̂α(n; M, B) = O(⌈n/B⌉) for α < 1.
Proof. Bucket transpose involves work that is linear in terms of space, but the recursion is highly imbalanced. A call (task) to bucket transpose with space s invokes four parallel bucket transposes on sizes s1, s2, s3, s4 such that Σ_{i=1}^{4} si = s. The only other work here is fork and join points. When this recursion reaches the leaf (of size s), an array copy (completely parallel) whose effective work is s (for α < 1) is done. For an internal node of this recursion, when
s > B, M > 0,

Q̂α^{BT}(s; M, B; ∅) = O(⌈s/B⌉^α)     (6.3)
    + max{ max_i { Q̂α^{BT}(si; M, B; ∅) · ⌈s/B⌉^α / ⌈si/B⌉^α }, Σ_{i=1}^{4} Q̂α^{BT}(si; M, B; ∅) }.     (6.4)
If we inductively assume that for α < 1, Q̂α^{BT}(si; M, B; ∅) ≤ k(⌈si/B⌉ − ⌈si/B⌉^{(1+α)/2}), then the i-th balance term is at most k(⌈s/B⌉^α ⌈si/B⌉^{1−α} − ⌈si/B⌉^{(α−1)/2}), which is dominated by the summation on the right when α < 1. Further, the summation is also less than k(⌈s/B⌉ − ⌈s/B⌉^{(1+α)/2}) when s > k1·B for some constant k1. Filling in the other cases of the recurrence shows that Q̂α^{BT}(s; M, B) = O(⌈s/B⌉) for α < 1, M > 0 (also Q̂α^{BT}(s; 0, B) = O(s) for α < 1). □
D(n) = O(log^2 n) + max_{i=1}^{√n} { D(ni) }

Q∗(n; M, B) = O(n/B) + √n · Q∗(√n; M, B) + Σ_{i=1}^{√n} Q∗(ni; M, B),

where the {ni}_i are such that their sum is n and none individually exceeds 2√n log n. The base case for the recursion for the cache complexity is Q∗(n; M, B) = O(⌈n/B⌉) for n ≤ cM for some constant c. Solving these recurrences proves the theorem. □
Theorem 25 The effective cache complexity of deterministic COSORT for α < 1 is Q̂α(n; M, B) = O(⌈n/B⌉ ⌈log_{M+2} n⌉).
Proof. Each call to such a COSORT with input size s starts with a completely balanced parallel block of √s recursive calls to COSORT with input size √s. After this, (s/log s) pivots are picked and sorted, the subarrays are split (same complexity as merging) with the pivots, offsets into target buckets are computed with prefix sums (and matrix transpose operations), and the elements are sent into buckets using bucket transpose. At the end, there is a parallel block sorting the √s buckets (not all of the same size) using COSORT. All the operations except the recursive COSORT calls have effective cache complexity that sums to O(⌈s/B⌉) (if starting with an empty cache) when α < 1. Therefore, for α < 1, s > M,
Q̂α^{SS}(s; M, B) = O(⌈s/B⌉) + √s · Q̂α^{SS}(√s; M, B)
    + max{ max_i { Q̂α^{SS}(si; M, B) · ⌈s/B⌉^α / ⌈si/B⌉^α }, Σ_i Q̂α^{SS}(si; M, B) },

where the maxima are taken over {si} with si ≤ √s log s and Σ_i si = s, and s ≥ M/k. The base case for this recursion is Q̂α^{SS}(s; M, B) = O(⌈s/B⌉) for s < M/k for some small constant k.
There exist constants c1 > c4 > 0 and c2, c3 such that the cache complexity of COSORT has the bounds

c4 ⌈s/B⌉ log_{M+2} s ≤ Q∗(s) ≤ c1 ⌈s/B⌉ log_{M+2} s − c2 ⌈s/B⌉ log log_{M+2} s − c3 ⌈s/B⌉.     (6.5)
Inductively assume that the recursion is work-dominated at every level below size s, so that Q̂α^{SS}(·) has the same complexity as in Equation 6.5. Without loss of generality, assume that s1 ≥ si ∀i ∈ {1, 2, . . . , ⌈√s⌉}. Then the recursion is work-dominated at this level if for some constant c′,

Σ_i Q̂α^{SS}(si; M, B) / ⌈s/B⌉^α ≥ (Q̂α^{SS}(s1; M, B) / ⌈s1/B⌉^α) · c′     (6.6)

⟺ Σ_i Q̂α^{SS}(si; M, B) / Q̂α^{SS}(s1; M, B) ≥ (⌈s/B⌉^α / ⌈s1/B⌉^α) · c′     (6.7)

⟺ 1 + Σ_{i>1} Q̂α^{SS}(si; M, B) / Q̂α^{SS}(s1; M, B) ≥ (⌈s/B⌉^α / ⌈s1/B⌉^α) · c′.     (6.8)
Fix the value of s1, let s′ = (s − s1)/√s, and let 0 < α < 1. The left-hand side attains its minimum value when all si except s1 are equal to s′. The right-hand side attains its maximum value when all si are equal to s′. Therefore, the above inequality holds if and only if

1 + (√s − 1) · (⌈s′/B⌉ log_{M+2} s′) / (⌈s1/B⌉ log_{M+2} s1) ≥ (⌈s/B⌉^α / ⌈s1/B⌉^α) · c′     (6.9)

⇐ 1 + (√s − 1) · (s′ log_{M+2} s′) / ((s1 + B) log_{M+2} s1) ≥ ((s + B)/s1)^α · c′     (6.10)

⇐ (√s − 1) · (s′ log_{M+2} s′) / (s1 log_{M+2} s1) ≥ (s/s1)^α · 4c′     (6.11)

⇐ (s/s1) · (log_{M+2} s′ / log_{M+2} s1) ≥ (s/s1)^α · 8c′     (6.12)

⇐ (s/s1)^{1−α} ≥ (log_{M+2} s1 / log_{M+2} s′) · 8c′     (6.13)

⇐ (s/s1)^{1−α} ≥ (log_{M+2} s1 / log_{M+2} s′) · 8c′     (6.14)

⇐ (√s / log s)^{1−α} ≥ (log_{M+2}(√s log s) / log_{M+2}(√s/2)) · 8c′     (6.15)

⇐ (√s / log s)^{1−α} ≥ (log_{M+2}(√s log s) / log_{M+2}(√s/2)) · 8c′,     (6.16)
which is true for α ≤ 1 − log_s c″ for some constant c″. When all levels of the recursion are work-dominated, Q̂α^{SS}(s) has the same asymptotic value as the cache complexity. This demonstrates that COSORT has a parallelizability of 1. □
the while loop, including the brute-force sort, requires O(n) work and incurs at most O(⌈n/B⌉) cache misses with high probability. Therefore,
E[W(n)] = O(n) + √n · E[W(√n)] + Σ_{i=1}^{√n} E[W(ni)],

where each ni < 2√n log n and the ni's add up to n. This implies that E[W(n)] = O(n log n). Similarly, for the cache complexity we have
E[Q∗(n; M, B)] = O(n/B) + √n · E[Q∗(√n; M, B)] + Σ_{i=1}^{√n} E[Q∗(ni; M, B)],

which solves to E[Q∗(n; M, B)] = O((n/B) log_{√M} n) = O((n/B) log_M n). To show the high
probability bounds for work and cache complexity, we can use Chernoff bounds because the
fan-out at each level of recursion is high.
Proving probability bounds on the depth is more involved. We prove that the sum of the depths of all nodes at level d in a recursion tree is O(log^{1.5} n / log log n) with probability at least 1 − 1/n^{log^2 log n}. We also show that the recursion tree has at most O(log log n) levels and the number of instances of the recursion tree is n^{O(1.1 log log n)}. This implies that the depth of the critical-path computation is at most O(log^{1.5} n) with high probability.
To analyze the depth of the dag, we obtain high probability bounds on the depth of each level
of recursion tree (we assume that the levels are numbered starting with the root at level 0). To
get sufficient probability at each level we need to execute the outer loop more times toward the
leaves where the size is small. Each iteration of the outer loop at node N of input size m at level
k in the recursion tree has depth log m and the termination probability of the loop is 1 − 1/m.
To prove the required bounds on depth, we prove that the sum of the depths of all nodes at level d in the recursion tree is at most O(log^{3/2} n / log log n) with probability at least 1 − 1/n^{O(log^2 log n)}, and that the recursion tree is at most 1.1 log_2 log_2 n levels deep. This will prove that the depth of the recursion tree is O(log^{3/2} n) with probability at least 1 − (1.1 log_2 log_2 n)/n^{O(log^2 log n)}.
Since the actual depth is a maximum over all "critical" paths, which we will argue are not more than n^{O(1.1 log log n)} in number (critical paths are shaped like the recursion tree), we can conclude that the depth is O(log^{3/2} n) with high probability.
The maximum number of levels in the recursion tree can be bounded using the recurrence relation X(n) = 1 + X(√n log n) with X(10) = 1. Using induction, it is straightforward to show that this solves to X(n) < 1.1 log_2 log_2 n. Similarly, the number of critical paths C(n) can be bounded using the relation C(n) < (√n)^{X(√n)} · (√n)^{X(√n log n)}. Again, using induction, this relation can be used to show that C(n) = n^{O(1.1 log log n)}.
To compute the sum of the depth of nodes at level d in the recursion tree, we consider two
cases: (1) when d > 10 log log log n and (2) otherwise.
Case 1: The size of a node one level deep in the recursion tree is at most O(√n log n) = O(n^{1/2+r}) for any r > 0. Also, the size of a node which is d levels deep is at most O(n^{(1/2+r)^d}), each costing O((1/2 + r)^d log n) depth per trial. Since there are 2^d nodes at level d in the recursion tree, and the failure probability of a loop in any node is no more than 1/2, we show that the probability of having to execute more than (2^d · log^{1/2} n)/((1 + 2r)^d · log log n) loops is small.
Since we are estimating the sum of 2^d independent variables, we use Chernoff bounds of the form

Pr[X > (1 + δ)µ] ≤ e^{−δ^2 µ},     (6.17)

with µ = 2 · 2^d and δ = (1/2)(log^{1/2} n/((1 + 2r)^d · log log n)) − 1. The resulting probability bound is asymptotically less than 1/n^{O(log^2 log n)} for d > 10 log log log n. Therefore, the contribution of nodes at level d in the recursion tree to the depth of the recursion tree is at most 2^d · (1/2 + r)^d log n · log^{1/2} n/((1 + 2r)^d · log log n) = log^{3/2} n / log log n with probability at least 1 − 1/n^{O(log^2 log n)}.
Case 2: We classify all nodes at level d into two kinds: the large ones, with size greater than log^2 n, and the smaller ones, with size at most log^2 n. The total number of nodes is 2^d < (log log n)^5. Consider the small nodes. Each small node can contribute a depth of at most 2 log log n to the recursion tree and there are at most (log log n)^{10} of them. Therefore, their contribution to the depth of the recursion tree at level d is asymptotically less than log n.
We use Chernoff bounds to bound the contribution of large nodes to the depth of the recur-
sion tree. Suppose that there are j large nodes. We show that with probability not more than 1/n^{O(log^2 log n)}, it takes more than 10·j loop iterations at depth d for j of them to succeed. For this, consider 10·j random independent trials, each with success probability at least 1 − 1/log^2 n. The expected number of failures is no more than µ = 10·j/log^2 n. We want to show that the probability that there are greater than 9·j failures in this experiment is tiny. Using Chernoff bounds in the form presented above with this µ and δ = 0.9·log^2 n − 1, we infer that this probability is asymptotically less than 1/n^{O(log^2 log n)}. Since j < 2^d, the depth contributed by the larger nodes is at most 2^d (1/2 + r)^d log n, asymptotically smaller than log^{3/2} n / log log n.
The above analysis shows that the combined depth of all nodes at level d is O(log^{3/2} n / log log n) with high probability. Since there are only 1.1 log_2 log_2 n levels in the recursion tree, this completes the proof. □
Algorithm BuildTree(V, E)
if |E| = 1 then
return V
end if
(Va , Vsep , Vb ) ← FindSeparator(V, E)
Ea ← {(u, v) ∈ E|u ∈ Va ∨ v ∈ Va }
Eb ← E − Ea
Va,sep ← Va ∪ Vsep
Vb,sep ← Vb ∪ Vsep
Ta ← BuildTree(Va,sep , Ea )
Tb ← BuildTree(Vb,sep , Eb )
return SeparatorTree(Ta , Vsep , Tb )
Algorithm SparseMxV(x,T )
if isLeaf(T ) then
T .u.value ← x[T .v.index] ⊗ T .v.weight
T .v.value ← x[T .u.index] ⊗ T .u.weight
{Two statements for the two edge directions}
else
SparseMxV(x, T .left) and SparseMxV(x, T .right)
for all v ∈ T .vertices do
v.value ← (v.left→value ⊕ v.right→value)
end for
end if
Figure 6.3: Cache-Oblivious Algorithms for Building a Separator Tree and for Sparse-Matrix
Vector Multiply
separators and present the first cache-oblivious, low cache complexity algorithm for the sparse matrix-vector multiplication problem on such graphs. We do not analyze the cost of finding the layout, which involves the recursive application of finding vertex separators, as it can be amortized across many solver iterations. Our algorithm for matrices with n^ε separators has linear work, O(log^2 n) depth, and O(m/B + n/M^{1−ε}) parallel cache complexity.
Let S be a class of graphs that is closed under the subgraph relation. We say that S satisfies an f(n)-vertex separator theorem if there are constants α < 1 and β > 0 such that every graph
G = (V, E) in S with n vertices can be partitioned into three sets of vertices Va , Vs , Vb such that
|Vs | ≤ βf (n), |Va |, |Vb | ≤ αn, and {(u, v) ∈ E|(u ∈ Va ∧ v ∈ Vb ) ∨ (u ∈ Vb ∧ v ∈ Va )} = ∅.
In our presentation we assume the matrix has symmetric non-zero structure (but not necessarily
symmetric values (weights)); if it is asymmetric we can always add zero weight reverse edges
while at most doubling the number of edges.
We now describe how to build a separator tree assuming we have a good algorithm Find-
Separator for finding separators. For planar graphs this can be done in linear time [125]. The
algorithm for building the tree is defined by Algorithm BuildTree in Figure 6.3. At each recur-
sive call it partitions the edges into two subsets that are passed to the left and right children. All
the vertices in the separator are passed to both children. Each leaf corresponds to a single edge.
We assume that FindSeparator only puts a vertex in the separator if it has an edge to each side and
always returns a separator with at least one vertex on each side unless the graph is a clique. If
the graph is a clique, we assume the separator contains all but one of the vertices, and that the
remaining vertex is on the left side of the partition.
Every vertex in our original graph of degree ∆ corresponds to a binary tree embedded in the
separator tree with ∆ leaves, one for each of its incident edges. To see this consider a single
vertex. Every time it appears in a separator, its edges are partitioned into two sets, and the vertex
is copied to both recursive calls. Because the vertex will appear in ∆ leaves, it must appear in
∆ − 1 separators, so it will appear in ∆ − 1 internal nodes of the separator tree. We refer to the
tree for a vertex as the vertex tree, each appearance of a vertex in the tree as a vertex copy, and
the root of each tree as the vertex root. The tree is used to sum the values for the SpMV-multiply.
We reorder the rows/columns of the matrix based on a preorder traversal of their root locations
in the separator tree (i.e. , all vertices in the top separator will appear first). This is the order we
will use for the input vector x and output vector y when calculating y = Ax. We keep a vector
R in this order that points to each of the corresponding roots of the tree. The separator tree is
maintained as a tree T in which each node keeps its copies of the vertices in its separator. Each
of these vertex copies will point to its two children in the vertex tree. Each leaf of T is an edge
and includes the indices of its two endpoints and its weight. In all internal vertex copies we keep
an extra value field to store a temporary variable, and in the leaves we keep two value fields,
one for each direction. Finally we note that all data for each node of the separator tree is stored
adjacently (i.e. , all its vertex copies are stored one after the other), and the nodes are stored in
preorder. This is important for cache efficiency.
Our algorithm for SpMV-multiply is described in Algorithm SparseMxV in Figure 6.3. This
algorithm will take the input vector x and leave the results of the matrix multiplication in the root
of every vertex. To gather the results up into a result vector y we simply use the root pointers
R to fetch each root. The algorithm does not do any work on the way down the recursion, but
when it gets to a leaf, the edge multiplies the values of its two endpoints by its weight, putting the results in its temporary value fields. Then, on the way back up the recursion, the algorithm sums these values. In particular, whenever it gets to an internal node of a vertex tree it adds the two children. Since
the algorithm works bottom up the values of the children are always ready when the parent reads
them.
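The sketch below illustrates this bottom-up evaluation. The structure and field names are ours (the thesis only specifies the algorithm in Figure 6.3), the recursive calls that run in parallel in the algorithm are sequential here, and the pointer wiring between vertex copies and their children is assumed to have been set up by BuildTree.

#include <cstddef>
#include <vector>

// A vertex copy holds a temporary value and pointers to the value slots of its
// two children in the vertex tree (wired up when the tree is built).
struct VertexCopy {
  double value = 0.0;
  const double* left = nullptr;
  const double* right = nullptr;
};

struct SepNode {
  bool is_leaf = false;
  // Leaf: one edge (u, v) with weight w; value_u and value_v hold the two
  // per-direction products.
  std::size_t u = 0, v = 0;
  double w = 0.0;
  double value_u = 0.0, value_v = 0.0;
  // Internal node: copies of its separator vertices plus two children.
  std::vector<VertexCopy> copies;
  SepNode* left = nullptr;
  SepNode* right = nullptr;
};

// Bottom-up SpMV over the separator tree: after the call, each vertex's root
// copy holds its entry of y = Ax (gathered through the root-pointer vector R,
// which is not shown here).
void sparse_mxv(const std::vector<double>& x, SepNode* T) {
  if (T->is_leaf) {
    T->value_u = x[T->v] * T->w;   // contribution of edge (u,v) to row u
    T->value_v = x[T->u] * T->w;   // contribution of edge (u,v) to row v
    return;
  }
  sparse_mxv(x, T->left);          // these two calls run in parallel in the
  sparse_mxv(x, T->right);         // algorithm of Figure 6.3
  for (VertexCopy& c : T->copies)  // sum the two children in the vertex tree
    c.value = *c.left + *c.right;
}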
Theorem 27 Let M be a class of matrices for which the adjacency graphs satisfy an n^ε-vertex separator theorem. Algorithm SparseMxV on an n × n matrix A ∈ M with m ≥ n non-zeros has O(m) work, O(log^2 n) depth and O(⌈m/B + n/M^{1−ε}⌉) parallel cache complexity.
Proof. For a constant k we say a vertex copy is heavy if it appears in a separator node with size
(memory usage) larger than M/k. We say a vertex is heavy if it has any heavy vertex copies. We
first show that the number of heavy vertex copies for any constant k is bounded by O(n/M^{1−ε})
and then bound the number of cache misses based on the number of heavy copies.
For a node of n vertices, the size X(n) of the tree rooted at the node is given by the recurrence relation

X(n) = max_{α ≤ α′ ≤ 1/2} { X(α′n) + X((1 − α′)n) + βn^ε }.

This recurrence solves to X(n) = k(n − n^ε) with k = β/(α^ε + (1 − α)^ε − 1). Therefore, there exists
a positive constant c such that for n ≤ cM , the subtree rooted at a node of n vertices fits into the
cache. We use this to count the number of heavy vertex copies H(n). The recurrence relation for
H(n) is:
H(n) = max_{α ≤ α′ ≤ 1/2} { H(α′n) + H((1 − α′)n) + βn^ε } if n > cM, and H(n) = 0 otherwise.
This recurrence relation solves to H(n) = k(n/(cM)^{1−ε} − βn^ε) = O(n/M^{1−ε}).
Now we note that if a vertex is not heavy (i.e. , light) it is used only by a single subtree
that fits in cache. Furthermore because of the ordering of the vertices based on where the roots
appear, all light vertices that appear in the same subtree are adjacent. Therefore the total cost of
cache misses for light vertices is O(n/B). We note that the edges are traversed in order so they
only incur O(m/B) misses. Now each of the heavy vertex copies can be responsible for at most
O(1) cache misses. In particular reading each child can cause a miss. Furthermore, reading the
value from a heavy vertex (at the leaf of the recursion) could cause a miss since it is not stored
in the subtree that fits into cache. But the number of subtrees that just fit into cache (i.e. , their
parents do not fit) and read a vertex u is bounded by one more than the number of heavy copies
of u. Therefore we can count each of those misses against a heavy copy. We therefore have a
total of O(m/B + n/M 1− ) misses.
The work is simply proportional to the number of vertex copies, which is less than twice m
and hence is bounded by O(m). For the depth we note that the two recursive calls can be made
in parallel and furthermore the for all statement can be made in parallel. Furthermore the tree
is depth O(log n) because of the balance condition on separators (both |Va,sep| and |Vb,sep| are at most αn + βn^ε). Since the branching of the for all takes O(log n) depth, the total depth is
bounded by O(log^2 n). □
6.5.1 List Ranking
A basic strategy for list ranking follows the three stage strategy proposed in [15] and adapted
into the Parallel-External Memory model by [21]:
1. Shrink the list to size O(n/ log n) through repeated contraction.
2. Apply Wyllie’s pointer jumping [170] on this shorter list.
3. Compute the place of the original nodes in the list by splicing back the elements thrown out in the compression and pointer jumping steps.
Stage 1 is achieved through finding independent sets in the list of size Θ(n) and removing
them to yield a smaller problem. This can be done randomly using the random mate technique
in which case, O(log log n) rounds of such reduction would suffice. Each round involves a Pack
operation (Section 6.1) to compute the elements still left.
Alternatively, we could use a deterministic technique similar to the one in Section III of [21]. Use two rounds of Cole and Vishkin's deterministic coin tossing [65] to find an O(log log n)-ruling set and then convert the ruling set to an independent set of size at least n/3 in O(log log n) rounds. Arge et al. [21] showed how this conversion can be made cache-efficient, and it is straightforward to change this algorithm to a cache-oblivious one by replacing the Scan and Sort primitives. To convert the O(log log n)-ruling set into a 2-ruling set, in parallel, start at each ruler, traverse the list iteratively, and assign a priority number to each node based on the distance from the ruler. Then sort the nodes based on the priority numbers so that the nodes are in contiguous chunks based on distance to the previous ruler in the list. Choose the chunks corresponding to priority numbers 0, 3, 6, 9, . . . to get a 2-ruling set.
Stage 2 uses O(log n) rounds of pointer jumping, each round essentially involving a sort
operation on O(n/ log n) elements in order to figure out the next level of pointer jumping. Thus,
the cache complexity of this stage is asymptotically the same as sorting and its depth is O(log n)
times the depth of sorting.
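For concreteness, here is a minimal array-based sketch of Wyllie's pointer jumping on the contracted list. The sequential loops stand in for the parallel rounds, and the cache-oblivious version described above re-sorts the nodes between rounds rather than chasing pointers directly.

#include <cstddef>
#include <vector>

// next[i] is the successor of node i (next[i] == i at the tail); rank[i] is
// initialized to the node's weight (e.g. 1, or the length of the segment it
// represents after contraction), with 0 at the tail.  After O(log n) rounds
// every node points at the tail and rank[i] is its distance to the end.
void wyllie_pointer_jumping(std::vector<std::size_t>& next,
                            std::vector<long long>& rank) {
  std::size_t n = next.size();
  bool changed = true;
  while (changed) {                           // O(log n) rounds
    changed = false;
    std::vector<std::size_t> next2(next);     // synchronous update: read the
    std::vector<long long> rank2(rank);       // old arrays, write the new ones
    for (std::size_t i = 0; i < n; ++i) {     // this loop is fully parallel
      if (next[i] != next[next[i]]) {
        rank2[i] = rank[i] + rank[next[i]];
        next2[i] = next[next[i]];
        changed = true;
      }
    }
    next.swap(next2);
    rank.swap(rank2);
  }
}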
Once this stage is completed, the original location of nodes in the list can be computed by
tracing back through each of the compression and pointer jumping steps.
Theorem 28 The deterministic list ranking outlined above has Q∗(n; M, B) = O(Q∗_sort(n; M, B)) sequential cache complexity, O(n log n) work, and O(Dsort(n) log n) depth. The effective cache complexity is Q̂α(n; M, B) = Q∗(n; M, B) for α < 1.
Proof. Consider a single contraction step of stage one. The first part of this, involving two deterministic coin tossing steps to compute an O(log log n)-ruling set, can be implemented with a small constant number of scans and sorts.
The second part, which iteratively creates a 2-ruling set, requires that in the i-th iteration, concerning the i-th group of size ni, we use a small constant number of scans and sorts on a problem of size ni. In addition, it involves writing at most ni elements into O(log log n) contiguous locations in memory. The cache complexity of this step is at most

c · Σ_{i=1}^{log log n} ( Q∗_sort(ni; M, B) + ni/B + 1 ) ≤ (c + 1) ( ⌈n/B⌉ log_{M+2} n + log log n )
for some constant c. A closer analysis using the tall cache assumption would show that the
second term O(log log n) is asymptotically smaller than the first. If it is the case that n < cM for
[Table 6.2 header: Problem, Depth, Cache Complexity (Q∗).]
Table 6.2: Low-depth cache-oblivious graph algorithms from Section 6.5. All algorithms are deterministic. DLR represents the depth of List Ranking. The bounds assume M = Ω(B^2). Dsort and Qsort are the depth and cache complexity of cache-oblivious sorting.
some suitable constant 0 < c < 1, then the entire problem fits in cache and the cache complexity is simply O(⌈n/B⌉). Suppose not; in order for O(log log n) to be asymptotically greater than O(n/B), it has to be the case that B = Ω(n/ log log n). However, by the tall cache assumption, B = o(M^{1−ε}) for some 0 < ε < 1 and consequently B = o(n^{1−ε}), which is a contradiction. Therefore, we can neglect the log log n term.
Since we need O(log log n) rounds of such contractions, each on a set of geometrically smaller size, the overall cache complexity of this first phase is just O(Q∗_sort(n; M, B)). Since we have established that the second phase involving pointer jumping has the same cache complexity, the result follows. It is straightforward to verify the bounds for work and depth.
To compute the effective cache complexity, observe that all steps except the iterative coloring to find the 2-ruling set are existing parallel primitives with effective parallelism 1. Since the iterative coloring on n elements involves a fork with O(n/ log log n) parallel calls, each with linear work on at most O(log log n) locations, the effective cache complexity term is work-dominated for α < 1 − o(1/ log n). The bounds on effective cache complexity follow. □
computing the Euler tour and reducing the problem to a range minima query problem [26, 31]. The batch version of the range minima query problem can be solved with cache-oblivious search trees [73].
Deterministic algorithms for Connected Components and Minimum Spanning Forest are similar and use tree contraction as their basic idea [61]. The connected components algorithm finds a forest on the graph, contracts the forest, and repeats the procedure. In each phase, every vertex picks the neighbor with the minimum label in parallel. The edges selected form a forest. Individual trees can be identified using the Euler tour technique and list ranking. Contracting the trees reduces the problem size by at least a constant factor. For the Minimum Spanning Forest, each node selects the least-weight edge incident on it instead of the neighbor with the lowest index. The cache bounds are slightly worse than those in [60]: log(|V|/√M) versus log(|V|/M). While [60] uses knowledge of M to transition to a different approach once the vertices in the contracted graph fit within the cache, in the cache-oblivious setting we need the edges to fit before we stop incurring misses.
All the algorithms in this section except the LCA query problem have parallelizability 1.
Algorithm 6.6.1 SetCover — Blelloch et al. parallel greedy set cover.
Input: a set cover instance (U, F, c) and a parameter ε > 0.
Output: an ordered collection of sets covering the ground elements.
i. Let γ = max_{e∈U} min_{S∈F} c(S), n = Σ_{S∈F} |S|, T = log_{1/(1−ε)}(n^3/ε), and β = n^2/(ε·γ).
ii. Let (A; A_0, . . . , A_T) = Prebucket(U, F, c) and U_0 = U \ (∪_{S∈A} S).
iii. For t = 0, . . . , T, perform the following steps:
1. Remove deleted elements from sets in this bucket: A′_t = {S ∩ U_t : S ∈ A_t}
2. Only keep sets that still belong in this bucket: A″_t = {S ∈ A′_t : c(S)/|S| > β · (1 − ε)^{t+1}}.
3. Select a maximal nearly independent set from the bucket: J_t = MaNIS_{(ε,3ε)}(A″_t).
4. Remove elements covered by J_t: U_{t+1} = U_t \ X_t where X_t = ∪_{S∈J_t} S
5. Move remaining sets to the next bucket: A_{t+1} = A_{t+1} ∪ (A′_t \ J_t)
iv. Finally, return A ∪ J0 ∪ · · · ∪ JT .
— MaNIS: Invoked in Step iii(3) of the set cover algorithm, MaNIS finds a subcollection of the sets in a bucket that are almost non-overlapping, with the goal of closely mimicking the greedy behavior. Algorithm 6.6.2 shows the MaNIS algorithm, reproduced from [43]. (The annotations on the side indicate which primitives in the PCC cost model we will use to implement them.) Conceptually, the input to MaNIS is a bipartite graph with the left vertices representing the sets and the right vertices representing the elements. The procedure starts with each left vertex picking a random priority (step 2). Then, each element identifies itself with the highest-priority set containing it (step 3). If "enough" elements identify themselves with a set, the set selects itself (step 4). All selected sets and the elements they cover are eliminated (steps 5(1), 5(2)), and the cost of the remaining sets is re-evaluated based on the uncovered elements. Only sets (A′) with costs low enough to belong to the correct bucket (which invoked this MaNIS) are selected in step 5(3), and the procedure continues with another level of recursion in step 6. The net result is that we choose nearly non-overlapping sets, and the sets that are not chosen are "shrunk" by a constant factor.
— Bucket Movement: The remaining steps in Algorithm 6.6.1 are devoted to moving sets between buckets, ensuring that the least-costly bucket contains only sets in a specific cost range. Step iii(4) removes elements covered by the sets returned by MaNIS; step iii(5) transfers the sets not selected by MaNIS to the next bucket; and step iii(2) selects only those sets with costs in the current bucket's range for passing to MaNIS.
covered. Since we only need one bit of information per element to indicate whether it is covered or not, this can be stored in O(|U|/ log n) words. Unlike in [42], we do not maintain back pointers from the elements to the sets.
— Prebucketing: This phase involves sorting the sets by cost and a filter to remove the costliest and the cheapest sets. The sets can then be partitioned into buckets with a merge operation. All operations have at most O(Q∗_sort(n)) cache complexity in the PCC framework and O(log^2 n) depth.
— MaNIS: We invoke MaNIS in step iii(3) of the set cover algorithm. Inside MaNIS, we store the remaining elements of U (right vertices) as a sequence of element identifiers. To implement step 3 of MaNIS, for each left vertex we copy its value xa to all the edges incident on it, then sort the edges by right vertex so that the edges incident on the same right vertex are contiguous. For each right vertex, we can now use a prefix "sum" with maximum to find the neighbor a with the maximum xa. In step 4, the "winning" edges (an edge connecting a right vertex with its chosen left vertex) are marked on the edge list for each right vertex we computed in step 3. The edge list is then sorted by left vertex. A prefix sum can then be used to compute the number of elements each set has "won". A compact representation of J and its adjacency list can be computed with a filter operation. In step 5(1), the combined adjacency list of the elements in J is sorted and duplicates are removed to get B. For step 5(2), we first evaluate the list A \ J. Then, we sort the edges incident on A \ J by their right vertices and merge with the remaining elements to identify which are contained in B, marking these edges accordingly. After sorting these edges by their left vertices, for each left vertex a ∈ A \ J, we filter and pack the "live edges" to compute N′_G(a). Steps 5(3) and 5(4) involve simple filter and sum operations. The most I/O-intensive operation (as well as the operation with maximum depth) in each round of MaNIS is sorting, which has at most O(Q∗_sort(|G|)) cache complexity in the PCC framework and O(log^2 |G|) depth. As analyzed in [43], for a bucket starting with nt edges, MaNIS runs for at most O(log nt) rounds, and after each round the number of edges drops by a constant factor; therefore, we have the following bounds:
Lemma 30 The cost of running MaNIS on a bucket with nt edges in the parallel cache complexity model is O(Q∗_sort(nt)), and the depth is O(Dsort(nt) log^2 nt). Its parallelizability is 1.
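To illustrate the kind of edge-list manipulation that steps 3 and 4 of MaNIS reduce to, the sketch below marks, for every element, the incident edge whose set has the largest random priority and then counts how many elements each set has won. The names are ours, and a sequential sort and scan stand in for the cache-oblivious sort and prefix "sum" with maximum used in the actual implementation.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Edge {
  std::size_t set;       // left vertex (the set)
  std::size_t elem;      // right vertex (the element)
  std::uint64_t xa;      // the set's random priority, copied onto the edge
  bool winner = false;   // marked if this edge won its element (step 4)
};

// For every element, marks the incident edge whose set has the largest
// priority, then counts how many elements each set has won.
std::vector<std::size_t> pick_winners(std::vector<Edge>& edges,
                                      std::size_t num_sets) {
  // Group the edges of each element together (sort by right vertex).
  std::sort(edges.begin(), edges.end(),
            [](const Edge& a, const Edge& b) { return a.elem < b.elem; });
  // One scan per group: a stand-in for the prefix "sum" with maximum.
  for (std::size_t i = 0; i < edges.size();) {
    std::size_t j = i, best = i;
    while (j < edges.size() && edges[j].elem == edges[i].elem) {
      if (edges[j].xa > edges[best].xa) best = j;
      ++j;
    }
    edges[best].winner = true;
    i = j;
  }
  std::vector<std::size_t> won(num_sets, 0);  // per-set count of won elements
  for (const Edge& e : edges)
    if (e.winner) ++won[e.set];
  return won;
}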
— Bucket Movement: We assume that At, A′_t and A″_t are stored in the same format as the input for MaNIS (see Algorithm 6.6.1). The right set of vertices of the bipartite graph is now a bitmap corresponding to the elements, indicating whether an element is alive or dead. Step iii(1) is similar to step 3 of MaNIS. We first sort S ∈ At to order the edges by element index, then merge this representation (with a vector of length O(|U|/ log n)) to match them with the elements bitmap, do a filter to remove deleted edges, and perform another sort to get them back ordered by set. Step iii(2) is simply a filter. The append operation in step iii(5) is no more expensive than a scan. In the PCC model, these primitives have cache complexity at most O(Q∗_sort(nt) + Q∗_scan(|U|/ log n)) for a bucket with nt edges. They all have O(Dsort(n)) depth.
To show the final cache complexity bounds, we make use of the following claim:
Claim 31 ([43]) Let nt be the number of edges in bucket t at the beginning of the iteration which processes this bucket. Then, Σ_t nt = O(n).
Therefore, we have O(Q∗_sort(n)) from prebucketing, O(Q∗_sort(n)) from MaNIS combined, and O(Q∗_sort(n) + Q∗_scan(|U|)) from bucket management combined (since there are log(n) rounds).
Algorithm 6.6.2 MaNIS(ε,3ε) (G)
Input: A bipartite graph G = (A, NG (a))
A is a sequence of left vertices (the sets), and NG (a), a ∈ A are the neighbors of each left vertex on the
right.
These are represented as contiguous arrays. The right vertices are represented implicitly as B = NG (A).
Output: J ⊆ A of chosen sets.
1. If A is empty, return the empty set.
2. For a ∈ A, randomly pick x_a ∈_R {0, . . . , n_G^7 − 1}. // map
3. For b ∈ B, let ϕ(b) be b's neighbor with maximum x_a. // sort and prefix sum
4. Pick vertices of A "chosen" by sufficiently many in B: // sort, prefix sum, sort and filter
5. Update the graph by removing J and its neighbors, and elements of A with too few remaining neighbors:
(1) B = N_G(J) (elements to remove) // sort
(2) N′_G = {{b ∈ N_G(a) | b ∉ B} : a ∈ A \ J} // sort, merge, sort and filter
(3) A′ = {a ∈ A \ J : |N′_G(a)| ≥ (1 − ε)D(a)} // filter
(4) N″_G = {{b ∈ N′_G(a)} : a ∈ A′} // filter
6. Recurse on the reduced graph: J_R = MaNIS_{(ε,3ε)}((A′, N″_G))
7. return J ∪ J_R
This simplifies to a cache complexity of Q∗(n; M, B) = O(Q∗_sort(n; M, B)) since |U| ≤ n. The depth is O(log^4 n), since set cover has O(log n) iterations to go through the buckets and invokes MaNIS, which has O(log n) rounds (w.h.p.), where each round is dominated by the sort. Every subroutine in the algorithm has parallelizability 1. Since the entire algorithm can be programmed with O(n) statically allocated space, the parallelizability of the algorithm is 1.
6.7 Performance
The algorithms presented in this chapter also perform and scale well in practice. The sample sort and the set cover algorithms were implemented in Intel Cilk+ and run on a 40-core machine consisting of four 2.4 GHz Intel Xeon E7-8870 processors with 256 GB of RAM connected with a 1066 MHz bus. Figures 6.4 and 6.5 demonstrate the scalability of the Sample Sort and the MaNIS-based set cover algorithms, respectively.
Furthermore, the algorithms described here are widely used in the algorithms written for the problems in the PBBS benchmark suite [34]. The performance of these algorithms is tabulated in Table 2 of Shun et al. [149], which is reproduced here in Table 6.3. The algorithms have good sequential performance and demonstrate good scalability.
[Plot: Sample Sort Scalability. x-axis: Number of cores (including HyperThreads); y-axis: speedup T/T_1.]
Figure 6.4: Plot of the speedup of samplesort on up to 40 cores with 2× hyperthreading. The algorithm achieves over 30× speedup on 40 cores without hyperthreading. The baseline for one core is 2.72 seconds and the parallel running time is 0.066 seconds.
[Plot: Speedup on the uk-union instance. x-axis: Number of Threads; y-axis: speedup.]
Figure 6.5: Speedup of the parallel M A NIS based set cover algorithm on an instance with 121
million sets, 133 million elements and 5.5 billion edges (set-element relations). The baseline for
one core is 278 seconds and the parallel running time is 11.9 seconds.
Application / Algorithm        1 thread    40 cores    T1/T40    TS/T40
Integer Sort
serialRadixSort 0.48 – – –
parallelRadixSort 0.299 0.013 23.0 36.9
Comparison Sort
serialSort 2.85 – – –
sampleSort 2.59 0.066 39.2 43.2
Spanning Forest
serialSF 1.733 – – –
parallelSF 5.12 0.254 20.1 6.81
Min Spanning Forest
serialMSF 7.04 – – –
parallelKruskal 14.9 0.626 23.8 11.2
Maximal Ind. Set
serialMIS 0.405 – – –
parallelMIS 0.733 0.047 14.1 8.27
Maximal Matching
serialMatching 0.84 – – –
parallelMatching 2.02 0.108 18.7 7.78
K-Nearest Neighbors
octTreeNeighbors 24.9 1.16 21.5 –
Delaunay Triangulation
serialDelaunay 56.3 – – –
parallelDelaunay 76.6 2.6 29.5 21.7
Convex Hull
serialHull 1.01 – – –
quickHull 1.655 0.093 17.8 10.9
Suffix Array
serialKS 17.3 – – –
parallelKS 11.7 0.57 20.5 30.4
Table 6.3: Weighted average of running times (seconds) over various inputs on a 40-core machine
with hyper-threading (80 threads). Time of the parallel version on 40 cores (T40 ) is shown relative
to both (i) the time of the serial version (TS ) and (ii) the parallel version on one thread (T1 ). In
some cases our parallel version on one thread is faster than the baseline serial version. Table
reproduced from [149].
Chapter 7
We advocated the separation of algorithm and scheduler design, and argued that schedulers are to
be designed for each machine model. Robust schedulers for mapping nested parallel programs to
machines with certain kinds of simple cache organizations such as single-level shared and private
caches were described in Section 5.2. They work well both in theory [37, 48, 49] and in practice [120,
134]. Among these, the work-stealing scheduler is particularly appealing for private caches
because of its simplicity and low overheads, and is widely deployed in various run-time systems
such as Cilk++. The Parallel Depth-First scheduler [37] is suited for shared caches, and practical versions of this scheduler have been studied. The cost of these schedulers in terms of cache
misses or running times can be bounded by the locality cost of the programs as measured in
certain abstract program-centric cost models [2, 41, 46, 49].
However, modern parallel machines have multiple levels of cache, with each cache shared
amongst a subset of cores (e.g., see Fig. 7.1(a)). We used the parallel memory hierarchy model, as represented by a tree of caches [13] (Fig. 7.1(b)), as a reasonably accurate and tractable model
for such machines [41, 80]. Previously studied schedulers for simple machine models may not
be optimal for these complex machines. Therefore, a variety of hierarchy-aware schedulers have
been proposed recently [41, 64, 68, 80, 140] for use on such machines.
Among these are the class of space-bounded schedulers [41, 64, 68]. We described a particular class of space-bounded schedulers in Section 5.3 along with performance bounds. To use
space-bounded schedulers, the computation needs to annotate each function call with the size of
its memory footprint. Roughly speaking, the scheduler tries to match the memory footprint of a
subcomputation to a cache of appropriate size in the hierarchy and then run the subcomputation
fully on the cores associated with that cache. Note that although space annotations are required,
the computation can be oblivious to the size of the caches and hence is portable across machines.
Under certain conditions these schedulers can guarantee good bounds on cache misses at ev-
ery level of the hierarchy and running time in terms of some intuitive program-centric metrics.
Cole and Ramachandran [68] describe such schedulers with strong asymptotic bounds on cache
misses and runtime for highly balanced computations. Recent work [41] describes schedulers that generalize to unbalanced computations as well.
In parallel with this line of work, certain variants of the work-stealing scheduler such as the
PWS and HWS schedulers [140] that exploit knowledge of memory hierarchy to a certain extent
have been proposed, but no theoretical bounds are known.
While space-bounded schedulers have good theoretical guarantees on the PMH model, there
has been no extensive experimental evidence to suggest that these (asymptotic) guarantees trans-
late into good performance on real machines with multi-level caches. Existing analyses of these
schedulers ignore the overhead costs of the scheduler itself and account only for the program run
time. Intuitively, given the low overheads and highly-adaptive load balancing of work-stealing
in practice, space-bounded schedulers would seem to be inferior on both accounts, but superior
in terms of cache misses. This raises the question as to the relative effectiveness of the two types
of schedulers in practice.
Experimental Framework and Results. We present the first experimental study aimed at addressing this question through a head-to-head comparison of work-stealing and space-bounded
schedulers. To facilitate a fair comparison of the schedulers on various benchmarks, it is neces-
sary to have a framework that provides separate modular interfaces for writing portable nested
parallel programs and specifying schedulers. The framework should be light-weight, flexible,
provide fine-grained timers, and enable access to various hardware counters for cache misses,
clock cycles, etc. Prior scheduler frameworks, such as the Sequoia framework [80] which im-
plements a scheduler that closely resembles a space-bounded scheduler, fall short of these goals
by (i) forcing a program to specify the specific sizes of the levels of the hierarchy it is intended
for, making it non-portable, and (ii) lacking the flexibility to readily support work-stealing or its
variants.
This chapter describes a scheduler framework that we designed and implemented, which
achieves these goals. To specify a program in the framework, the programmer uses the Fork-
Join (and Parallel-For built on top of Fork-Join) primitive. To specify the scheduler, one needs
to implement just three primitives describing the management of tasks at Fork and Join points:
add, get, and done. Any scheduler can be described in this framework as long as the schedule
does not require the preemption of sequential segments of the program. A simple work-stealing
scheduler, for example, can be described with only tens of lines of code in this framework. Fur-
thermore, in this framework, program tasks are completely managed by the schedulers, allowing
them full control of the execution.
The framework enables a head-to-head comparison of the relative strengths of schedulers in
terms of running times and cache miss counts across a range of benchmarks. (The framework
is validated by comparisons with the commercial CilkPlus work-stealing scheduler.) We present
experimental results on a 32-core Intel Nehalem series Xeon 7560 multi-core with 3 levels of cache. As depicted in Figure 7.1(a), each L3 cache is shared (among the cores on a socket)
while the L1 and L2 caches are exclusive to cores. We compare four schedulers—work-stealing,
priority work-stealing (PWS) [140], and two variants of space-bounded schedulers—on both
divide-and-conquer micro-benchmarks (scan-based and gather-based) and popular algorithmic
kernels such as quicksort, sample sort, matrix multiplication, and quad trees.
Our results indicate that space-bounded schedulers reduce the number of L3 cache misses
compared to work-stealing schedulers by 25–50% for most of the benchmarks, while incurring
up to 6% additional overhead. For memory-intensive benchmarks, the reduction in cache misses
[Figure: on the left, the memory hierarchy of the Intel machine (up to 1 TB of memory; four sockets, each with a 24 MB L3; 32 KB L1 caches per core); on the right, the abstract parallel hierarchy model, a tree of caches with parameters Mi, Bi, fanouts fi and costs Ci, rooted at a memory with Mh = ∞, Bh.]
Figure 7.1: Memory hierarchy of a current generation architecture from Intel, plus an example
abstract parallel hierarchy model. Each cache (rectangle) is shared by all cores (circles) in its
subtree.
overcomes the added overhead, resulting in up to a 25% improvement in running time for syn-
thetic benchmarks and about 5% improvement for algorithmic kernels. To better understand
the impact of memory bandwidth on scheduler performance, we quantify runtime improvements
over a 4-fold range in the available bandwidth per core and show further improvements in the
running times of kernels (up to 25%) when the bandwidth is reduced.
Finally, as part of our study, we generalize prior definitions of space-bounded schedulers to
allow for more practical variants, and explore the tradeoffs in a key parameter of such schedulers.
This is useful for engineering space-bounded schedulers, which were previously described only
at a high level suitable for theoretical analyses, into a form suitable for real machines.
7.1 Experimental Framework
We implemented a C++-based framework, with the following design objectives, in which nested parallel programs and schedulers can be built for shared-memory multi-core machines. Some of the thread-pool code has been adapted from an earlier implementation [108].
Modularity: The framework completely separates the specification of three components—programs,
schedulers, and description of machine parameters—for portability and fairness. The user can
choose any of the candidates from these three categories. Note, however, that some schedulers may not be able to execute programs without scheduler-specific hints (such as space annotations).
Clean Interface: The interface for specifying the components should be clean, composable, and
the specification built on the interface should be easy to reason about.
Hint Passing: While it is important to separate programs and schedulers, it is necessary to allow the program to pass useful hints to the scheduler. Because scheduling decisions are made
“online”, i.e. on the fly, the scheduler may not have a complete picture of the program until its
execution is complete. However, carefully designed algorithms allow static analyses that provide
estimates on various complexity metrics such as work, size, and cache complexity. Annotating
the program with these analyses and passing them to the scheduler at run time may help the
scheduler make better decisions.
Minimal Overhead: The framework itself should be light-weight with minimal system calls,
locking and code complexity. The control flow should pass between the functional modules
(program, scheduler) with negligible time spent outside. The framework should avoid generating
background memory traffic and interrupts.
Timing and Measurement: It should enable fine-grained measurements of the various modules.
Measurements include not only clock time, but also insightful hardware counters such as cache
and memory traffic statistics. In light of the earlier objective, the framework should avoid OS
system calls for these, and should use direct assembly instructions.
7.1.1 Interface
The framework has separate interfaces for the program and the scheduler.
Programs: Nested parallel programs, with no other synchronization primitives, are composed
from tasks using fork and join constructs. A parallel for primitive built with fork and
join is also provided. Tasks are implemented as instances of classes that inherit from the Job
class. Different kinds of tasks are specified as classes with a method that specifies the code to be
executed. An instance of a class derived from Job is a task containing a pointer to a strand nested
immediately within the task. The control flow of this function is sequential with a terminal fork
or join call. To adapt this interface for non-nested parallel programs such as futures [153], we
would need to add other primitives to the interface that the program uses to call the framework
beyond fork and join.
The interface allows extra annotations on a task such as its size, which is required by space-bounded schedulers. Such tasks inherit from a derived class of the Job class, with the extensions in the derived class specifying the annotations. For example, the class SBJob suited for space-bounded schedulers is derived from Job by adding two functions—size(uint block_size) and strand_size(uint block_size)—that allow the annotation of the job size.
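As an illustration of how a program might look against this interface, the sketch below writes a divide-and-conquer sum as a task with space annotations. The Job/SBJob stubs, the fork signature, and the run method are our guesses at the framework's API (only size and strand_size are named in the text), so this is a hypothetical example rather than actual framework code.

#include <cstddef>
#include <cstdint>

using uint = unsigned int;

// Hypothetical stand-ins for the framework's interface: only Job, SBJob,
// fork/join and the size()/strand_size() annotations are named in the text,
// so these declarations are our guesses, not the framework's actual API.
struct Job { virtual ~Job() {} virtual void run() = 0; };
struct SBJob : Job {
  virtual std::uint64_t size(uint block_size) = 0;
  virtual std::uint64_t strand_size(uint block_size) = 0;
};
void fork(Job* left, Job* right, Job* continuation);  // provided by the framework

// A divide-and-conquer sum written against this interface.
class SumJob : public SBJob {
 public:
  SumJob(const double* a, std::size_t n, double* out) : a_(a), n_(n), out_(out) {}

  // Space annotations consumed by space-bounded schedulers: the task reads n
  // doubles and writes one, rounded up to whole cache lines.
  std::uint64_t size(uint B) override {
    return (((n_ + 1) * sizeof(double)) / B + 1) * B;
  }
  std::uint64_t strand_size(uint B) override { return 2 * B; }

  void run() override {                   // the strand nested in this task
    if (n_ < 2048) {                      // sequential base case
      double s = 0;
      for (std::size_t i = 0; i < n_; ++i) s += a_[i];
      *out_ = s;
      return;
    }
    fork(new SumJob(a_, n_ / 2, &l_),     // terminal fork: two subtasks and a
         new SumJob(a_ + n_ / 2, n_ - n_ / 2, &r_),
         new AddJob(&l_, &r_, out_));     // continuation that adds the halves
  }

 private:
  struct AddJob : Job {                   // continuation executed after the join
    AddJob(double* x, double* y, double* out) : x_(x), y_(y), out_(out) {}
    void run() override { *out_ = *x_ + *y_; }
    double *x_, *y_, *out_;
  };
  const double* a_; std::size_t n_; double* out_;
  double l_ = 0, r_ = 0;
};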
[Figure: the nested parallel program (Fork/Join) interacts, through the framework's per-core threads, with the concurrent scheduler module via its add, get and done call-backs; the scheduler maintains its own shared structures (queues, etc.).]
Scheduler: The scheduler is a concurrent module that handles queued and live tasks (as defined
in Section 2) and is responsible for maintaining its own queues and other internal shared data
structures. The module interacts with the framework, which consists of a thread attached to each processing core on the machine, through an interface with three call-back functions.
• Job* get(ThreadIdType): This is called by the framework on behalf of a thread attached to a core when the core is ready to execute a new strand, after completing a previously live strand. The function may change the internal state of the scheduler module and return a (possibly null) Job so that the core may immediately begin execution on the strand. This function specifies proc for the strand.
• void done(Job*, ThreadIdType): This is called when a core completes executing a strand. The scheduler is allowed to update its internal state to reflect this completion.
• void add(Job*,ThreadIdType): This is called when a fork or join is encoun-
tered. In case of a fork, this call-back is invoked once for each of the newly spawned
tasks. For a join, it is invoked for the continuation task of the join. This function
decides where to enqueue the job.
Other auxiliary parameters to these call-backs have been dropped from the above description
for clarity and brevity. Fig. B.2 provides an example of a work-stealing scheduler. The Job* argument passed to these functions may be an instance of one of the derived classes of Job and carry additional information that is helpful to the scheduler.
Machine configuration: The interface for specifying machine descriptions accepts a de-
scription of the cache hierarchy: number of levels, fanout at each level, and cache and cache-line
size at each level. In addition, a mapping from the logical numbering of cores on the system to their left-to-right position as a leaf in the tree of caches must be specified. For example, Fig. B.3
void
WS_Scheduler::add (Job *job, int thread_id) {
_local_lock[thread_id].lock();
_job_queues[thread_id].push_back(job);
_local_lock[thread_id].unlock();
}
int
WS_Scheduler::steal_choice (int thread_id) {
return (int)((((double)rand())/((double)RAND_MAX))
*_num_threads);
}
Job*
WS_Scheduler::get (int thread_id) {
_local_lock[thread_id].lock();
if (_job_queues[thread_id].size() > 0) {
Job * ret = _job_queues[thread_id].back();
_job_queues[thread_id].pop_back();
_local_lock[thread_id].unlock();
return ret;
} else {
_local_lock[thread_id].unlock();
int choice = steal_choice(thread_id);
_steal_lock[choice].lock();
_local_lock[choice].lock();
if (_job_queues[choice].size() > 0) {
Job * ret = _job_queues[choice].front();
_job_queues[choice].erase(_job_queues[choice].begin());
++_num_steals[thread_id];
_local_lock[choice].unlock();
_steal_lock[choice].unlock();
return ret;
}
_local_lock[choice].unlock();
_steal_lock[choice].unlock();
}
return NULL;
}
void
WS_Scheduler::done (Job *job, int thread_id,
bool deactivate) {}
is a description of one Nehalem-EX series 4-socket × 8-core machine (32 physical cores) with 3
levels of caches as depicted in Fig. 7.1(a).
7.1.2 Implementation
The runtime system initially fixes a POSIX thread to each core. Each thread then repeatedly
performs a call (get) to the scheduler module to ask for work. Once assigned a task and a
specific strand inside it, it completes the strand and asks for more work. Each strand either ends
in a fork or a join. In either scenario, the framework invokes the done call back to allow the
scheduler to clean up. In addition, when a task invokes a fork, the add call-back is invoked to
let the scheduler add new tasks to its data structures.
All specifics of how the scheduler operates (e.g. , how the scheduler handles work requests,
whether it is distributed or centralized, internal data structures, where mutual exclusion occurs,
etc.) are relegated to scheduler implementations. Outside the scheduling modules, the runtime
system includes no locks, synchronization, or system calls except during the initialization and
cleanup of the thread pool.
7.1.3 Measurements
Active time and overheads: Control flow on each thread moves between the program and the
scheduler modules. Fine-grained timers in the framework break down the execution time into
five components: (i) active time—the time spent executing the program, (ii) add overhead,
(iii) done overhead, (iv) get overhead, and (v) empty queue overhead. While active time
depends on the number of instructions and the communication costs of the program, add, done
and get overheads depend on the complexity of the scheduler, and the number of times the
scheduler code is invoked by Forks and Joins. The empty queue overhead is the amount of
time the scheduler fails to assign work to a thread (get returns null), and reflects on the load
balancing capability of the scheduler. In most of the results in Section 7.3, we usually report
two numbers: active time averaged over all threads and the average overhead, which includes
measures (ii)–(v). Note that while we might expect this partition of time to be independent, it is
not so in practice—the background coherence traffic generated by the scheduler’s bookkeeping
may adversely affect active time. The timers have very little overhead in practice—less than 1%
in most of our experiments.
Measuring hardware counters: Multi-core processors based on newer architectures like Intel
Nehalem-EX and Sandybridge contain numerous functional components such as cores (which
includes the CPU and lower level caches), DRAM controllers, bridges to the inter-socket inter-
connect (QPI) and higher level cache units (L3). Each component is provided with a performance
monitoring unit (PMU)—a collection of hardware registers that can track statistics of events rel-
evant to the component.
For instance, while the core PMU on Xeon 7500 series (our experimental setup, see Fig. 7.1(a))
is capable of providing statistics such as the number of instructions, L1 and L2 cache hit/miss
statistics, and traffic going in and out, it is unable to monitor L3 cache misses which constitute
a significant portion of active time. This is because L3 cache is a separate unit with its own
PMU(s). In fact, each Xeon 7560 die has eight L3 cache banks on a bus that also connects
int num_procs = 32;                        // total number of cores
int num_levels = 4;                        // hierarchy levels: RAM, L3, L2, L1
int fan_outs[4] = {4,8,1,1};               // 4 sockets, 8 cores per shared L3, private L2 and L1
long long int sizes[4] = {0, 3*(1<<22), 1<<18, 1<<15};   // capacities in bytes (0 = unbounded RAM)
int block_sizes[4] = {64,64,64,64};        // 64-byte cache lines at every level
int map[32] = {0,4,8,12,16,20,24,28,       // mapping of hierarchy leaves to OS core ids,
               2,6,10,14,18,22,26,30,      // one row per socket
               1,5,9,13,17,21,25,29,
               3,7,11,15,19,23,27,31};
Figure 7.4: Specification entry for a 32-core Xeon machine depicted in Fig. 7.1(a)
Figure 7.5: Layout of 8 cores and L3 cache banks on a bidirectional ring in Xeon 7560. Each
L3 bank hosts a performance monitoring unit called C-box that measures traffic into and out of
the L3 bank.
DRAM and QPI controllers (see Fig. 7.5). Each L3 bank is connected to a core via buffered
queues. The address space is hashed onto the L3 banks so that a unique bank is responsible for
each address. To collect L3 statistics such as L3 misses, we monitor PMUs (called C-Boxes on
Nehalem-EX) on all L3 banks and aggregate the numbers in our results.
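Concretely, a machine-wide L3 miss count is obtained by summing the corresponding event over every C-box of every socket. The sketch below shows only the aggregation; read_cbox_llc_misses is a hypothetical wrapper around whatever uncore interface is available (the Intel PCM tool in our case), not an actual API call.

// Hypothetical wrapper: LLC miss count read from one C-box (one L3 bank).
extern long long read_cbox_llc_misses(int socket, int cbox);

// Aggregate L3 misses over all banks of all sockets (8 C-boxes per Xeon 7560 socket).
long long total_l3_misses(int num_sockets, int cboxes_per_socket) {
  long long total = 0;
  for (int s = 0; s < num_sockets; ++s)
    for (int c = 0; c < cboxes_per_socket; ++c)
      total += read_cbox_llc_misses(s, c);  // each bank covers a disjoint slice of the address space
  return total;
}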
Software access to core PMUs on most Intel architectures is well supported by several tools
including the Linux kernel, the Linux perf tool, and higher level APIs such as libpfm [137]. We
use the libpfm library for fine-grained access to the core PMU. However, access to uncore
PMUs, complex architecture-specific components like the C-box, is not supported by most
tools. Newer Linux kernels (3.7+) are incrementally adding software interfaces to these PMUs
at the time of this writing, but through this interface we are only able to make program-wide
measurements rather than fine-grained ones. For accessing the uncore counters, we adapt the
Intel PCM 2.4 tool [98].
7.2.1 Space-bounded Schedulers
Space-bounded schedulers were described in Section 5.3. We make a minor modification to the
boundedness property, introducing a new parameter µ that improves performance in practice.
A space-bounded scheduler for a particular cache hierarchy is a scheduler, parameterized by
σ ∈ (0, 1], that satisfies the following two properties:
• Anchored: Every subtask t of the root task is anchored to a befitting cache.
• Bounded: At every instant τ, for every level-i cache Xi, the sum of the sizes of the cache-occupying
tasks and strands is at most Mi:
\[
\sum_{t \in Ot(X_i,\tau)} S(t, B_i) \;+\; \sum_{\ell \in Ol(X_i,\tau)} S(\ell, B_i) \;\le\; M_i .
\]
Well behaved: We say that a scheduler is well behaved if the total number of cache misses at
level-i caches of a PMH incurred by scheduling a task t anchored to the root of the tree is at
most O(Q∗(t; σMi, Bi)), where Q∗ is the parallel cache complexity as defined in the PCC model
(Section 4.3). We assume that each cache uses the optimal offline replacement policy. Roughly
speaking, the cache complexity Q∗ of a task t in terms of a cache of size M and line size B is
defined as follows. Decompose the task into a collection of maximal subtasks that fit in M space,
and “glue nodes”, the instructions outside these subtasks. For a maximal subtask t0 (of size at most M), the parallel
cache complexity Q∗(t0; M, B) is defined to be the number of distinct cache lines it accesses,
counting accesses to a cache line from unordered instructions multiple times. The model then
pessimistically counts all memory instructions that fall outside of a maximal subtask (i.e., glue
nodes) as cache misses. The total cache complexity of a program is the sum of the complexities
of the maximal subtasks, and the memory accesses outside of maximal subtasks.
All space-bounded schedulers are well behaved. To see this, consider the following replacement
policy at each cache Xi in the hierarchy. Let the cache-occupying tasks at Xi cache their working
sets at Xi, and allow at most a µ fraction of the cache for each cache-occupying strand at Xi. The
cache-occupying tasks bring in their working sets exactly once, because the boundedness property
prevents cache overflows. The cache misses of each cache-occupying strand ℓ at cache Xi, belonging to a task
t(ℓ) anchored above level i, number at most 1/µ times Q∗(t(ℓ); σMi, Bi). Since every subtask of the
root task is anchored at a befitting cache, the cache misses are bounded by O(Q∗(·; σMi, Bi)) at
every level i under this replacement policy. An ideal replacement policy at Xi would perform
at least as well as such a policy. Note that while setting σ to 1 yields the best bounds on cache
misses, it also makes load balancing harder. As we will see later, a lower value for σ like 0.5
allows greater scheduling flexibility.
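To make the anchoring and boundedness rules concrete, the following sketch shows the test a space-bounded scheduler could apply before anchoring a task to a level-i cache. The Cache record and its occupancy bookkeeping are illustrative; they are not the exact structures used in our implementation.

// Sketch of the anchoring test for a space-bounded scheduler (illustrative types).
struct Cache {
  long long M;         // capacity M_i in bytes
  long long occupied;  // current total size of the occupying tasks and strands
};

// task_size is S(t, B_i), obtained from the task-size annotations on Jobs.
bool can_anchor(long long task_size, const Cache &X, double sigma) {
  bool befits  = task_size <= (long long)(sigma * X.M);  // maximality: t befits this cache
  bool bounded = X.occupied + task_size <= X.M;          // boundedness: no overflow of M_i
  return befits && bounded;
}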
scheduler call-back is issued to a thread, that thread can modify an internal node of the tree after
gathering all locks on the path to the node from the core it is mapped onto. This scheduler accepts
Jobs that are annotated with task and strand sizes. When a new Job is spawned at a fork, the
add call-back enqueues it at the cluster where its parent was anchored. For a new Job spawned
at a join, add enqueues it at the cluster where the Job that invoked the corresponding fork of this
join was anchored.
A basic version of such a scheduler would implement the logical queue at each cache as a single
queue. However, this presents two problems: (i) it is difficult to separate the tasks in a queue by
the level of cache that befits them, and (ii) a single queue can become a contention hotspot. To solve
problem (i), behind each logical queue we use separate “buckets” for each level of cache below,
holding the tasks that befit those levels. Cores looking for a task at a cache go through these
buckets from the top (heaviest tasks) to the bottom. We refer to this variant as the SB scheduler.
To solve problem (ii) involving queueing hotspots, we replace the top bucket with a distributed
queue — one queue for each child cache — like in the work-stealing scheduler. We refer to the
SB scheduler with this modification as the SB-D scheduler.
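A sketch of the resulting per-cache queue structure is shown below; it is illustrative, and omits the locks and task-size annotations the real implementation also stores.

#include <deque>
#include <vector>

class Job;   // task/strand descriptor, as elsewhere in the framework

// One logical queue per cache in the hierarchy.
struct CacheQueue {
  // buckets[l] holds tasks that befit the level-l caches below this cache;
  // cores looking for work scan from the top bucket (heaviest tasks) downwards (SB).
  std::vector< std::deque<Job*> > buckets;
  // SB-D only: the top bucket is replaced by one queue per child cache
  // so that a single queue does not become a contention hotspot.
  std::vector< std::deque<Job*> > top_per_child;
};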
In the course of our experimental study, we found that the following minor modification to
the boundedness property of space-bounded schedulers improves their performance. Namely, we
introduce a new parameter µ ∈ (0, 1] and modify the boundedness property to require that at every
instant τ, for every level-i cache Xi:
\[
\sum_{t \in Ot(X_i,\tau)} S(t, B_i) \;+\; \sum_{\ell \in Ol(X_i,\tau)} \min\{\mu M_i,\, S(\ell, B_i)\} \;\le\; M_i .
\]
Taking the minimum with µMi allows several large strands to be explored simultaneously
without their space measure consuming too much of the space bound. This helps the scheduler
quickly traverse the higher levels of recursion in the DAG and expose parallelism, leading to
better load balance.
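In code, the only change is how an anchored strand is charged against a cache's occupancy; a minimal sketch, reusing the illustrative bookkeeping of the earlier anchoring test:

// Occupancy charged for a strand under the modified boundedness rule:
// min{ mu * M_i, S(ell, B_i) }. A strand never charges more than a mu fraction of
// the cache, so several large strands can be anchored while the top levels of the
// recursion are still being unfolded.
long long strand_charge(long long strand_size, long long M_i, double mu) {
  long long cap = (long long)(mu * M_i);
  return strand_size < cap ? strand_size : cap;
}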
Priority Work-Stealing scheduler: PWS. Unlike in the basic WS scheduler, cores in the
PWS scheduler [140] choose the victims of their steals according to the “closeness” of the victims
in the socket layout. Deques at cores that are closer in the cache hierarchy are chosen with
higher probability than those farther away, to improve scheduling locality while retaining
the load-balancing properties of the WS scheduler. On our 4-socket machine, we set the probability
of an intra-socket steal to be 10 times that of an inter-socket steal.
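The weighted victim choice can be realized with a simple biased draw. The sketch below, which would play the role of steal_choice in the work-stealing scheduler of Appendix B, assumes thread ids are grouped socket by socket (4 sockets of 8 threads, as on our machine) and gives every intra-socket victim ten times the weight of an inter-socket victim; it is an illustration, not the PWS implementation itself.

#include <cstdlib>

// PWS-style victim selection: intra-socket victims are 10x more likely.
int pws_steal_choice(int thread_id, int num_threads, int threads_per_socket) {
  int my_socket = thread_id / threads_per_socket;
  int local  = threads_per_socket - 1;            // other threads on our socket
  int remote = num_threads - threads_per_socket;  // threads on the other sockets
  int r = rand() % (10 * local + remote);         // weight 10 per local victim, 1 per remote victim
  if (r < 10 * local) {                           // intra-socket steal
    int v = my_socket * threads_per_socket + r / 10;
    return (v >= thread_id) ? v + 1 : v;          // skip our own id
  }
  int v = r - 10 * local;                         // inter-socket steal
  return (v < my_socket * threads_per_socket) ? v : v + threads_per_socket;
}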
7.3 Experiments
We describe experiments on two types of benchmarks. The first is a pair of micro-benchmarks
designed specifically to let us measure various characteristics of the schedulers. We use
these benchmarks to verify that each scheduler delivers the performance we expect and to see
how the performance scales with bandwidth and with the number of cores per processor (socket). The
second is a set of divide-and-conquer algorithms for real problems (sorting,
matrix multiply, and nearest neighbors), used to understand the performance in
a more realistic setting. In all cases we measure both running time and level-3 (last-level) cache misses.
We have found that cache misses at the other levels do not vary significantly among the schedulers.
7.3.1 Benchmarks
Synthetic Benchmarks. We use two synthetic micro-benchmarks that mimic the behavior of
memory-intensive divide-and-conquer algorithms. Because of their simplicity, we use these
benchmarks to closely analyze the behavior of the schedulers under various conditions and verify
that we get the expected cache behavior on a real machine.
• Recursive Repeated Map (RRM). This benchmark takes two n-length arrays A and B
and a point-wise map function that maps elements of A to B. In our experiments, each
element of the arrays is a double and the map function simply adds one. RRM first does a
parallel point-wise map from A to B, and repeats the same operation multiple times. It
then divides A and B into two parts by some ratio (e.g., 50/50) and recursively calls the same
operation on each of the two parts. The recursion terminates when the subarray size reaches
a constant base-case size. The input parameters are the size of the arrays n, the
number of repeats r, the cut ratio f, and the base-case size. We set r = 3 and f = 50% in
the experiments here unless mentioned otherwise. RRM is a memory-intensive benchmark
since there is very little work done per memory operation. However, once a recursive call
fits in a cache (the cache size is at least 16n bytes for a call on n elements), no accesses are required past that
cache. A sketch of both micro-benchmarks appears after this list.
• Recursive Repeated Gather (RRG). This benchmark is similar to RRM but instead of
doing a simple map it does a gather. In particular at any given level of the recursion it
takes three n-length arrays A, B and I and for each location i sets B[i] = A[I[i] mod n].
The values in I are random integers. As with RRM, after repeating r times it splits the
arrays in two and recurses on each part. RRG is even more memory intensive than RRM
since its accesses are random instead of linear. Again, however, once a recursive call fits
in a cache, no accesses are required past that cache.
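The following sketch captures the structure of both micro-benchmarks. It is written sequentially for clarity; in the actual benchmarks the point-wise map and the two recursive calls are forked in parallel through the framework's Job interface.

// Sketch of RRM. RRG is identical except that the map step becomes the gather
// B[i] = A[I[i] % n] for a third array I of random integers.
void rrm(double *A, double *B, long n, int r, long base_case) {
  for (int k = 0; k < r; ++k)       // repeat the point-wise map r times
    for (long i = 0; i < n; ++i)    // a parallel map in the real benchmark
      B[i] = A[i] + 1.0;
  if (n <= base_case) return;       // recursion bottoms out at the base-case size
  long m = n / 2;                   // cut ratio f = 50%
  rrm(A, B, m, r, base_case);       // the two halves are processed in parallel
  rrm(A + m, B + m, n - m, r, base_case);
}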
Algorithms. Our other benchmarks are a set of algorithms for some full, albeit simple, prob-
lems. These include costs beyond just memory accesses and are therefore more representative of
real-world workloads.
• Matrix Multiplication. This benchmark is an 8-way recursive matrix multiplication. Matrix
multiplication has cache complexity Q(n) = ⌈2n²/B⌉ × ⌈n/√M⌉. The ratio of
instructions to cache misses is therefore very high, about B√M, making this a very
compute-intensive benchmark. We switch to a serial implementation for sizes of 30 × 30, which
fit within the L1 cache. A sketch of the recursion appears after this list.
• Quicksort. This is a parallel quicksort algorithm that parallelizes both the partitioning
and the recursive calls. It switches to a version that parallelizes only the recursive calls
for n < 128K, and to a serial version for n < 16K. These parameters worked well for all
the schedulers. Our quicksort is about 2x faster than the Cilk code found in
the Cilk+ guide [120], because it does the partitioning in parallel. The divide does not
partition the data exactly evenly, since the split depends on how
well the pivot divides the data. For an input of size n the program has cache complexity
Q(n; M, B) = O(⌈n/B⌉ log_2(n/M)) and is therefore reasonably memory intensive.
• Samplesort. This is the cache-optimal parallel Sample Sort algorithm described in [40]. The
algorithm splits the input of size n into √n subarrays, recursively sorts each subarray,
“block transposes” them into √n buckets, and recursively sorts these buckets. For this
algorithm, W(n) = O(n log n) and Q∗(n; M, B) = O(⌈n/B⌉ log_{2+M/B}(n/B)), making it
relatively cache-friendly, and in fact optimally cache-oblivious.
• Aware Samplesort. This is a variant of sample-sort that is aware of the cache sizes. In
particular it moves elements into buckets that fit into the L3 cache and then runs quicksort
on the buckets. This is the fastest sort we implemented and is in fact faster than any other
sort we found for this machine. In particular it is about 10% faster than the PBBS sample
sort [149].
• Quad-Tree. This generates a quad tree for a set of n points in two dimensions. It is
implemented by recursively partitioning the points into four sets along the midline in each
of the two dimensions. When the number of points falls below 16K we revert to a sequential
algorithm.
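For reference, a sketch of the 8-way recursion used by the matrix-multiplication benchmark is given below. It is sequential, row-major, and assumes the dimension halves evenly down to the base case; in the benchmark the eight recursive multiplications are forked in parallel, with the two products that accumulate into the same quadrant of C ordered after one another.

// C += A * B for n x n blocks embedded in matrices with leading dimension ld.
void matmul(const double *A, const double *B, double *C,
            long n, long ld, long cutoff) {
  if (n <= cutoff) {                      // base case (about 30 x 30: fits in L1)
    for (long i = 0; i < n; ++i)
      for (long k = 0; k < n; ++k)
        for (long j = 0; j < n; ++j)
          C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
    return;
  }
  long h = n / 2;                         // quadrant offsets within the larger matrix
  const double *A11 = A, *A12 = A + h, *A21 = A + h*ld, *A22 = A + h*ld + h;
  const double *B11 = B, *B12 = B + h, *B21 = B + h*ld, *B22 = B + h*ld + h;
  double       *C11 = C, *C12 = C + h, *C21 = C + h*ld, *C22 = C + h*ld + h;
  // Each quadrant of C is the sum of two block products: 8 recursive multiplies.
  matmul(A11, B11, C11, h, ld, cutoff);  matmul(A12, B21, C11, h, ld, cutoff);
  matmul(A11, B12, C12, h, ld, cutoff);  matmul(A12, B22, C12, h, ld, cutoff);
  matmul(A21, B11, C21, h, ld, cutoff);  matmul(A22, B21, C21, h, ld, cutoff);
  matmul(A21, B12, C22, h, ld, cutoff);  matmul(A22, B22, C22, h, ld, cutoff);
}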
Controlling bandwidth. Each of the four sockets on the machine has memory links to distinct
DRAM modules. The sockets are connected by the Intel QPI interconnect, which has very
high bandwidth. Memory requests from a socket to a DRAM module attached to another socket
pass through the QPI, the remote socket, and finally the remote memory link. Since the QPI has
high bandwidth, if we treat the RAM as a single functional module, the L3-to-RAM bandwidth
depends on the number of memory links used, which in turn depends on the mapping of pages
to DRAM modules. If all pages used by a program are mapped to DRAM modules connected
to one socket, the program effectively utilizes one-fourth of the memory bandwidth. On the other
hand, an even distribution of pages to DRAM modules across the sockets provides full bandwidth
to the program. By controlling the mapping of pages, we can control the bandwidth available to
the program. We report numbers with different bandwidth values in our results to highlight the
sensitivity of running time to L3 cache misses.
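One way to realize such a mapping from user code is through libnuma, as in the sketch below: binding all pages to a single node restricts the program to one socket's DRAM links (about one-fourth of the bandwidth on our machine), while interleaving spreads pages over all four sockets. This only illustrates the mechanism and is not necessarily the exact method used for our experiments.

#include <cstdlib>
#include <numa.h>   // libnuma; link with -lnuma

// Allocate 'bytes' either behind a single socket or interleaved over all sockets.
void *alloc_with_bandwidth(size_t bytes, bool full_bandwidth) {
  if (numa_available() < 0)
    return malloc(bytes);               // no NUMA support: fall back to a plain allocation
  return full_bandwidth
       ? numa_alloc_interleaved(bytes)  // pages spread round-robin over all DRAM modules
       : numa_alloc_onnode(bytes, 0);   // all pages behind node 0: roughly 1/4 of the links
}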
Monitoring the L3 cache. We picked the L3 cache for reporting as it is the only cache level that
is shared, and consequently the only level at which space-bounded schedulers can be expected
to outperform work-stealing schedulers. It is also the highest and most expensive cache level
before DRAM on our machine. Unlike total execution time, L3 misses could not be partitioned between
application code and scheduler code due to software limitations. Even if it
were possible to count them separately, the split would be difficult to interpret because of the non-trivial
interference between the data cached by the program and by the scheduler.
To count L3 cache misses, the uncore counters in the C-boxes were programmed using the
Intel PCM tool to count misses that occur for any reason (LLC_MISSES, event code 0x14,
umask 0b111) and L3 cache fills in any coherence state (LLC_S_FILLS, event code 0x16,
umask 0b1111). The two numbers agree to three significant digits in most cases; therefore,
only the L3 cache miss numbers are reported here.
To prevent excessive TLB misses, we use Linux hugepages of size 2MB to pre-allocate
the space required by the algorithms. We configured the system with a pool of 10,000 huge
pages by setting vm.nr_hugepages to that value using sysctl, and used the hugectl tool
to back memory allocations with hugepages.
7.3.3 Results
We set the values of σ and µ in the SB and SB-D schedulers to 0.5 and 0.2 respectively, after some
experimentation with the parameters. All numbers reported here are averages over at
least 5 runs, with the smallest and largest readings across runs removed.
Synthetic benchmarks. The number of L3 cache misses of RRM and RRG, along with their
active times and overheads at different bandwidth values are plotted in Figs. 7.6 and 7.7 respec-
tively. In addition to the four schedulers discussed in Section 7.2.2 (WS, PWS, SB and SB-D),
we ran the experiments using the Cilk Plus work stealing scheduler to show that our scheduler
performs reasonably compared to Cilk Plus. We could not separate overhead from time in Cilk
Plus since it does not supply such information.
First note that, as expected, the number of L3 misses of a scheduler does not significantly
depend on the bandwidth. The space-bounded schedulers perform significantly better in terms of
L3 cache misses. The active time is influenced by the number of instructions in the benchmark as
well as the number of L3 misses. Since the number of instructions is constant across schedulers,
the number of L3 cache misses is the primary factor influencing active time. The extent to
Figure 7.6: RRM on 10 million double elements: running time and L3 misses for the Cilk Plus, WS, PWS, SB, and SB-D schedulers at different bandwidth values (100%, 75%, 50%, and 25%). Left axis is time, right is cache misses.
Figure 7.7: RRG on 10 million double elements: running time and L3 misses for the Cilk Plus, WS, PWS, SB, and SB-D schedulers at different bandwidth values (100%, 75%, 50%, and 25%).
Figure 7.8: L3 cache misses (in millions) for RRM and RRG with a varying number of processors per socket (4×1, 4×2, 4×4, and 4×8), for the WS, PWS, SB, and SB-D schedulers.
misses than standard work stealing, it does not perform nearly as well as the space-bounded
schedulers. We also note that we found little difference between our two versions of space-bounded
schedulers.
These experiments indicate that the advantages of space-bounded schedulers over work-stealing
schedulers grow as (1) the number of cores per processor goes up, and (2) the
bandwidth per core goes down. At 8 cores there is about a 35% reduction in L3 cache misses;
at 64 cores we would expect over a 60% reduction.
Load balance and the dilation parameter σ. The choice of σ, which determines which tasks are
maximal, is an important parameter affecting the performance of space-bounded schedulers, especially
their load balance. If σ is set to 1, it is likely that a single task about the size of the cache gets
anchored there, leaving little room for other tasks or strands. This adversely affects load
balance, and we might expect to see greater empty-queue times (see Figure 7.11 for Quad-Tree
sort). If σ is set to a lower value like 0.5, each cache can allow more than one task or strand
to be anchored simultaneously, leading to better load balance. If σ is too low (close to 0), the
recursion level at which the space-bounded scheduler preserves locality drops, resulting in less
effective use of the shared caches.
7.4 Conclusion
We developed a framework for comparing schedulers, and deployed it on a 32-core machine
with 3 levels of caches. We used it to compare four schedulers, two each of the work-stealing
and space-bounded types. As predicted by the theory, space-bounded schedulers
demonstrate modest to significant improvements over work-stealing schedulers in
cache-miss counts on shared caches for most benchmarks. In memory-intensive benchmarks
with low ratios of instruction count to cache-miss count, the reduction in L3 misses achieved by
Figure 7.9: Active times, overheads and L3 cache misses at full bandwidth for the WS, PWS, SB, and SB-D schedulers on Quicksort (100M doubles), Samplesort (100M doubles), awareSampleSort (100M doubles), quadTreeSort (50M doubles), and MatMul (3.3K × 3.3K doubles). L3 misses for MatMul are multiplied by 10.
Figure 7.10: Active times, overheads and L3 cache misses at 25% bandwidth for the WS, PWS, SB, and SB-D schedulers on the same benchmarks and inputs as Fig. 7.9. L3 misses for MatMul are multiplied by 10.
Figure 7.11: Empty queue time (in milliseconds) for the SB and SB-D schedulers on Quad-Tree sort, for σ = 0.5, 0.7, 0.9, and 1.0.
space-bounded schedulers can translate into improved running time, despite their added overheads. On the
other hand, for compute-intensive benchmarks, or benchmarks with highly optimized cache complexities,
space-bounded schedulers do not yield much advantage over WS schedulers in terms of
active time, and most of that advantage is lost to the greater scheduler overheads. Reducing the
overheads of space-bounded schedulers would make their case stronger, and is an important
direction for future work.
Our experiments were run on a current-generation Intel multicore with only 8 cores per
socket, only 32 cores in total, and only one level of shared cache (the L3). The experiments make
it clear that as the core count per chip goes up, the advantage of space-bounded schedulers
increases significantly, due to the growing benefit of avoiding conflicts among the many unrelated
threads sharing the limited cache on the chip. A secondary factor is that the bandwidth per core
is likely to go down as the number of cores increases, again favoring space-bounded schedulers.
We also anticipate a greater advantage for space-bounded schedulers over work-stealing schedulers
when more cache levels are shared and when the caches are shared among a greater number of cores.
Such studies are left to future work, as and when the necessary hardware becomes available.
Chapter 8
Conclusions
We made a case for designing program-centric models for locality and parallelism and presented
the PCC framework as a suitable candidate. We designed schedulers that mapped program costs
in the PCC framework to good performance bounds on communication costs, time and space on
a Parallel Memory Hierarchy. This demonstrates that the PCC framework captures the cost of
parallel algorithms on shared memory machines since the parallel memory hierarchy model is a
good approximation for a broad class of shared memory machines. Therefore, we claim that a
program-centric cost model for quantifying locality and parallelism is feasible.
The usefulness of the model is demonstrated through the design of algorithms for many
common problems that are not only optimal according to the metrics in the PCC framework,
but also perform extremely well in practice. We also demonstrated that practical adaptations of
our theoretical designs for schedulers perform better than the state of the art for a certain class
of algorithms, and we anticipate their increasing relevance on larger-scale machines where the gap
between computation and communication speeds will widen.
requiring program annotations?
2. Are there problems for which it is not possible to design an algorithm that is both optimally
parallel and cache-efficient?
3. Are there schedulers capable of load-balancing irregularities beyond what is characterized
by the effective cache complexity?
4. How accurate are program-centric models at characterizing locality and parallelism in non-
nested parallel programs?
5. Writing algorithms with static allocation while retaining locality is a tedious task. Dynamic
memory allocation is a much more elegant model for writing parallel programs. However,
we do not completely understand the design of an allocation scheme along with a scheduler
that preserves locality at every level in the hierarchy while also retaining time and space
bounds. Designing such an allocator would add weight to our program-centric models.
6. Is there a quantitative relation between the effective cache complexity Q̂α of an algorithm
and the AT^α complexity of its corresponding VLSI circuit?
7. The cost models and schedulers in this thesis have been constructed with shared memory
machines in mind. Is it possible to adapt and generalize these ideas to distributed memory
machines?
Appendix A
A statically allocated quicksort algorithm written in Cilk. The map, scan, and filterLR routines are all
completely parallel and take pre-allocated memory chunks from quickSort. The total amount
of space required for quickSort is about 2·n·sizeof(E) + 3·n·sizeof(int), in addition to the
< n·sizeof(int) space required for stack variables. In this version, stack frames are allocated
automatically by the Cilk runtime system.
  PositionScanPlus lessScanPlus(LESS);
  PositionScanPlus moreScanPlus(MORE);
  // Scratch locations (in the pre-allocated stack-variable space) for the
  // sizes of the "less" and "more" partitions.
  less_pos_n = stack_var_space;
  more_pos_n = 1 + stack_var_space;
  // Parallel scans over the comparison results compute the partition positions.
  scan<int,PositionScanPlus>(less_pos_n, compared, less_pos,
                             n, lessScanPlus, 0);
  scan<int,PositionScanPlus>(more_pos_n, compared, more_pos,
                             n, moreScanPlus, 0);
  // Move the elements into the pre-allocated buffer B according to the partition.
  filterLR<E>(A, B, B + *less_pos_n, B + n - *more_pos_n,
              less_pos, more_pos, compared, 0, n);
  // Recursively sort the two partitions in parallel.
  cilk_spawn
    quickSort<E,BinPred>(B, *less_pos_n, f,
                         compared, less_pos, more_pos,
                         A, stack_var_space + 2, !invert);
  quickSort<E,BinPred>(B + *more_pos_n, *more_pos_n, f,
                       compared + *more_pos_n,
                       less_pos + *more_pos_n,
                       more_pos + *more_pos_n,
                       A + *more_pos_n,
                       stack_var_space + 2
                         + 2*(*more_pos_n)/QSORT_PAR_THRESHOLD,
                       !invert);
  cilk_sync;
  // If not inverting, copy the middle segment from B back to A with an identity map.
  if (!invert)
    map<E,E,Id<E> >(B + *less_pos_n, A + *less_pos_n,
                    n - *less_pos_n - *more_pos_n, Id<E>());
}
}
Appendix B
template <class AT, class BT, class F>
class Map : public SBJob {
AT* A; BT* B; int n; F f; int stage;
public:
Map (AT *A_, BT *B_, int n_, F f_, int stage_=0,
bool del=true)
: SBJob (del), A(A_), B(B_), n(n_), f(f_), stage(stage_) {}
lluint size (const int block_size) {
return round_up (n*sizeof(AT), block_size)
+ round_up (n*sizeof(BT), block_size);
}
lluint strand_size (const int block_size) {
if (n < _SCAN_BSIZE && stage==0)
return size (block_size);
else return STRAND_SIZE; // A small constant, ~100 bytes
}
void function () {
if (stage == 0) {
if (n<_SCAN_BSIZE) {
for (int i=0; i<n; ++i) B[i] = f(A[i]); //Sequential map
join ();
} else {
binary_fork(new Map<AT,BT,F>(A,B,n/2,f), // Left half
new Map<AT,BT,F>(A+n/2,B+n/2,n-n/2,f), //Right half
new Map<AT,BT,F>(A,B,n,f,1)); //Continuation
}
} else {
join ();
}
}
};
void
WS_Scheduler::add (Job *job, int thread_id) {
  // Push the new job onto this thread's own deque.
  _local_lock[thread_id].lock();
  _job_queues[thread_id].push_back(job);
  _local_lock[thread_id].unlock();
}

int
WS_Scheduler::steal_choice (int thread_id) {
  // Pick a victim uniformly at random among all threads.
  int choice = (int) ((((double)rand())/((double)RAND_MAX)) * _num_threads);
  // Guard against the rare rand() == RAND_MAX case, which would index out of range.
  return (choice == _num_threads) ? _num_threads - 1 : choice;
}

Job*
WS_Scheduler::get (int thread_id) {
  // Fast path: pop the most recently pushed job from our own deque (LIFO end).
  _local_lock[thread_id].lock();
  if (_job_queues[thread_id].size() > 0) {
    Job *ret = _job_queues[thread_id].back();
    _job_queues[thread_id].pop_back();
    _local_lock[thread_id].unlock();
    return ret;
  } else {
    _local_lock[thread_id].unlock();
    // Slow path: steal the oldest job (FIFO end) from a random victim's deque.
    int choice = steal_choice(thread_id);
    _steal_lock[choice].lock();
    _local_lock[choice].lock();
    if (_job_queues[choice].size() > 0) {
      Job *ret = _job_queues[choice].front();
      _job_queues[choice].erase(_job_queues[choice].begin());
      ++_num_steals[thread_id];
      _local_lock[choice].unlock();
      _steal_lock[choice].unlock();
      return ret;
    }
    _local_lock[choice].unlock();
    _steal_lock[choice].unlock();
  }
  return NULL;   // no work found; the calling thread retries
}

void
WS_Scheduler::done (Job *job, int thread_id, bool deactivate) {
  // Nothing to be done: work stealing keeps no extra state at forks and joins.
}
/****** 32 Core Teach ******/
int num_procs=32;
int num_levels = 4;
int fan_outs[4] = {4,8,1,1};
long long int sizes[4] = {0, 3*(1<<22), 1<<18, 1<<15};
int block_sizes[4] = {64,64,64,64};
int map[32] = {0,4,8,12,16,20,24,28,
2,6,10,14,18,22,26,30,
1,5,9,13,17,21,25,29,
3,7,11,15,19,23,27,31};
Figure B.3: Specification entry for a 32-core Xeon machine sketched in Fig. 7.1(a)
Bibliography
[1] Harold Abelson and Peter Andreae. Information transfer and area-time tradeoffs for VLSI
multiplication. Communications of the ACM, 23(1):20–23, January 1980. ISSN 0001-
0782. URL http://doi.acm.org/10.1145/358808.358814.
[2] Umut A. Acar, Guy E. Blelloch, and Robert D. Blumofe. The data locality of work steal-
ing. Theory of Computing Systems, 35(3):321–347, 2002.
[3] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tuto-
rial. IEEE Computer, 29(12):66–76, December 1995.
[4] Sarita V Adve and Mark D Hill. Weak ordering – a new definition. ACM SIGARCH
Computer Architecture News, 18(3a):2–14, 1990.
[5] Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, and Jeffrey D. Ullman. Upper and
lower bounds on the cost of a map-reduce computation. CoRR, abs/1206.4377, 2012.
[6] Alok Aggarwal and S. Vitter, Jeffrey. The input/output complexity of sorting and related
problems. Communications of the ACM, 31(9):1116–1127, September 1988. ISSN 0001-
0782. URL http://doi.acm.org/10.1145/48529.48535.
[7] Alok Aggarwal, Bowen Alpern, Ashok Chandra, and Marc Snir. A model for hierarchical
memory. In Proceedings of the 19th annual ACM Symposium on Theory of Computing,
STOC ’87, 1987. ISBN 0-89791-221-7.
[8] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Hierarchical memory with block
transfer. In Foundations of Computer Science, 1987., 28th Annual Symposium on, pages
204–216, 1987.
[9] Alfred V. Aho, Jeffrey D. Ullman, and Mihalis Yannakakis. On notions of information
transfer in VLSI circuits. In Proceedings of the 15th annual ACM Symposium on Theory
of Computing, STOC ’83, pages 133–139, New York, NY, USA, 1983. ACM. ISBN 0-
89791-099-0. URL http://doi.acm.org/10.1145/800061.808742.
[10] M. Ajtai, J. Komlós, and E. Szemerédi. An O(n log n) sorting network. In Proceedings of
the fifteenth annual ACM symposium on Theory of computing, STOC ’83, pages 1–9, New
York, NY, USA, 1983. ACM. ISBN 0-89791-099-0. URL http://doi.acm.org/
10.1145/800061.808726.
[11] Romas Aleliunas. Randomized parallel communication (preliminary version). In Proceed-
ings of the first ACM SIGACT-SIGOPS Symposium on Principles of distributed computing,
PODC ’82, pages 60–72, New York, NY, USA, 1982. ACM. ISBN 0-89791-081-8. URL
http://doi.acm.org/10.1145/800220.806683.
[12] Albert Alexandrov, Mihai F. Ionescu, Klaus E. Schauser, and Chris Scheiman. Loggp:
Incorporating long messages into the logp model for parallel computation. Jour-
nal of Parallel and Distributed Computing, 44(1):71 – 79, 1997. ISSN 0743-
7315. URL http://www.sciencedirect.com/science/article/pii/
S0743731597913460.
[13] Bowen Alpern, Larry Carter, and Jeanne Ferrante. Modeling parallel computers as mem-
ory hierarchies. In Programming Models for Massively Parallel Computers, 1993.
[14] Bowen Alpern, Larry Carter, Ephraim Feig, and Ted Selker. The uniform memory hierar-
chy model of computation. Algorithmica, 12, 1994.
[15] Richard J. Anderson and Gary L. Miller. Deterministic parallel list ranking. Algorithmica,
6:859–869, 1991.
[16] Lars Arge. The buffer tree: A technique for designing batched external data structures.
Algorithmica, 37(1):1–24, 2003.
[17] Lars Arge. External geometric data structures. In Kyung-Yong Chwa and J. Ian Munro,
editors, Computing and Combinatorics, volume 3106 of Lecture Notes in Computer Sci-
ence, pages 1–1. Springer Berlin Heidelberg, 2004. ISBN 978-3-540-22856-1. URL
http://dx.doi.org/10.1007/978-3-540-27798-9_1.
[18] Lars Arge and Norbert Zeh. Algorithms and Theory of Computation Handbook, chapter
External-memory algorithms and data structures, pages 10–10. Chapman & Hall/CRC,
2010. ISBN 978-1-58488-822-2. URL http://dl.acm.org/citation.cfm?
id=1882757.1882767.
[19] Lars Arge, Michael A. Bender, Erik D. Demaine, Bryan Holland-Minkley, and J. Ian
Munro. Cache-oblivious priority queue and graph algorithm applications. In Proceedings
of the 34th ACM Symposium on Theory of Computing, STOC ’02, 2002.
[20] Lars Arge, Michael T. Goodrich, Michael Nelson, and Nodari Sitchinava. Fundamental
parallel algorithms for private-cache chip multiprocessors. In SPAA ’08: Proceedings of
the 20th annual Symposium on Parallelism in algorithms and architectures, 2008.
[21] Lars Arge, Michael T. Goodrich, and Nodari Sitchinava. Parallel external memory graph
algorithms. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Sympo-
sium on, pages 1–11, 2010.
[22] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for mul-
tiprogrammed multiprocessors. In Proceedings of the 10th Annual ACM Symposium on
Parallel Algorithms and Architectures , Puerto Vallarta, 1998.
[23] Grey Ballard. Avoiding Communication in Dense Linear Algebra. PhD thesis, EECS
Department, University of California, Berkeley, Aug 2013. URL http://www.eecs.
berkeley.edu/Pubs/TechRpts/2013/EECS-2013-151.html.
[24] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Graph expansion and
communication costs of fast matrix multiplication. Journal of the ACM, 59(6):32:1–
32:23, January 2013. ISSN 0004-5411. URL http://doi.acm.org/10.1145/
2395116.2395121.
[25] K. E. Batcher. Sorting networks and their applications. In Proceedings of the spring joint
computer Conference, AFIPS ’68 (Spring), pages 307–314, New York, NY, USA, 1968.
ACM. URL http://doi.acm.org/10.1145/1468075.1468121.
[26] Michael A. Bender and Martín Farach-Colton. The LCA problem revisited. In Gaston H.
Gonnet and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, volume 1776
of Lecture Notes in Computer Science, pages 88–94. Springer Berlin Heidelberg, 2000.
ISBN 978-3-540-67306-4. URL http://dx.doi.org/10.1007/10719839_9.
[27] Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Bradley C. Kuszmaul. Concur-
rent cache-oblivious B-trees. In SPAA ’05: Proceedings of the 17th ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA), 2005.
[28] Michael A. Bender, Gerth S. Brodal, Rolf Fagerberg, Riko Jacob, and Elias Vicari. Opti-
mal sparse matrix dense vector multiplication in the I/O-model. In SPAA ’07: Proceedings
of the 16th annual ACM Symposium on Parallelism in algorithms and architectures, 2007.
[29] Michael A. Bender, Bradley C. Kuszmaul, Shang-Hua Teng, and Kebin Wang. Optimal
cache-oblivious mesh layout. Computing Research Repository (CoRR) abs/0705.1033,
2007.
[30] Bryan T. Bennett and Vincent J. Kruskal. LRU stack processing. IBM Journal of Research
and Development, 19(4):353–357, 1975.
[31] Omer Berkman and Uzi Vishkin. Recursive star-tree parallel data
structure. SIAM Journal on Computing, 22(2):221–242, 1993.
URL http://epubs.siam.org/doi/abs/10.1137/0222017,
arXiv:http://epubs.siam.org/doi/pdf/10.1137/0222017.
[32] Gianfranco Bilardi, Andrea Pietracaprina, Geppino Pucci, and Francesco Silvestri.
Network-oblivious algorithms. In Parallel and Distributed Processing Symposium, 2007.
IPDPS 2007. IEEE International, 2007.
[33] Guy E. Blelloch. NESL: A nested data-parallel language. Technical Report CMU-CS-92-
103, School of Computer Science, Carnegie Mellon University, 1992.
[34] Guy E. Blelloch. Problem based benchmark suite. www.cs.cmu.edu/˜guyb/pbbs/
Numbers.html, 2011.
[35] Guy E. Blelloch and Phillip B. Gibbons. Effectively sharing a cache among threads.
In SPAA ’04: Proceedings of the sixteenth annual ACM Symposium on Parallelism in
algorithms and architectures. ACM, 2004. ISBN 1-58113-840-7.
[36] Guy E. Blelloch and Robert Harper. Cache and I/O efficent functional algorithms. In
Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of
programming languages, POPL ’13, pages 39–50, New York, NY, USA, 2013. ACM.
ISBN 978-1-4503-1832-7. URL http://doi.acm.org/10.1145/2429069.
2429077.
[37] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for
languages with fine-grained parallelism. Journal of the ACM, 46(2):281–321, 1999.
[38] Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran,
Shimin Chen, and Michael Kozuch. Provably good multicore cache performance for
divide-and-conquer algorithms. In SODA ’08: Proceedings of the 19th annual ACM-SIAM
Symposium on Discrete algorithms, 2008.
[39] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri.
A cache-oblivious model for parallel memory hierarchies. Technical Report CMU-CS-
10-154, Computer Science Department, Carnegie Mellon University, 2010.
[40] Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low-depth cache
oblivious algorithms. In SPAA ’10: Proceedings of the 22th annusl Symposium on Paral-
lelism in Algorithms and Architectures, pages 189–199. ACM, 2010.
[41] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri.
Scheduling irregular parallel computations on hierarchical caches. In Symposium on Par-
allelism in Algorithms and Architectures, pages 355–366, 2011.
[42] Guy E. Blelloch, Anupam Gupta, Ioannis Koutis, Gary L. Miller, Richard Peng, and
Kanat Tangwongsan. Near linear-work parallel sdd solvers, low-diameter decomposi-
tion, and low-stretch subgraphs. In Proceedings of the 23rd ACM Symposium on Paral-
lelism in algorithms and architectures, SPAA ’11, pages 13–22, New York, NY, USA,
2011. ACM. ISBN 978-1-4503-0743-7. URL http://doi.acm.org/10.1145/
1989493.1989496.
[43] Guy E. Blelloch, Richard Peng, and Kanat Tangwongsan. Linear-work greedy parallel
approximate set cover and variants. In Proceedings of the 23rd ACM Symposium on Par-
allelism in algorithms and architectures, SPAA ’11, pages 23–32, New York, NY, USA,
2011. ACM. ISBN 978-1-4503-0743-7. URL http://doi.acm.org/10.1145/
1989493.1989497.
[44] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally de-
terministic parallel algorithms can be fast. SIGPLAN Not., 47(8):181–192, February 2012.
ISSN 0362-1340. URL http://doi.acm.org/10.1145/2370036.2145840.
[45] Guy E. Blelloch, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Parallel and i/o
efficient set covering algorithms. In Proceedinbgs of the 24th ACM symposium on Par-
allelism in algorithms and architectures, SPAA ’12, pages 82–90, New York, NY, USA,
2012. ACM. ISBN 978-1-4503-1213-4. URL http://doi.acm.org/10.1145/
2312005.2312024.
[46] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Harsha Vardhan Simhadri.
Program-centric cost models for locality. In MSPC ’13, 2013.
[47] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded
computations. In Proceedings of the twenty-fifth annual ACM Symposium on Theory of
Computing, STOC ’93, pages 362–371, New York, NY, USA, 1993. ACM. ISBN 0-
89791-591-7. URL http://doi.acm.org/10.1145/167088.167196.
[48] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by
work stealing. Journal of the ACM, 46(5), 1999.
[49] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H.
Randall. Dag-consistent distributed shared memory. In IPPS, 1996.
[50] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H.
Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Pro-
ceedings of the eighth annual ACM symposium on Parallel algorithms and architectures,
SPAA ’96, pages 297–308, New York, NY, USA, 1996. ACM. ISBN 0-89791-809-6.
URL http://doi.acm.org/10.1145/237502.237574.
[51] OpenMP Architecture Review Board. OpenMP application program interface. http:
//www.openmp.org/mp-documents/spec30.pdf, May 2008. version 3.0.
[52] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Communications
of the ACM, 54(5):67–77, May 2011.
[53] Richard P. Brent and H. T. Kung. The Area-Time complexity of binary multiplication.
Journal of the ACM, 28:521–534, 1981.
[54] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the
ACM, 21(2):201–206, April 1974. ISSN 0004-5411. URL http://doi.acm.org/
10.1145/321812.321815.
[55] Gerth S. Brodal and Rolf Fagerberg. Cache oblivious distribution sweeping. In ICALP
’02: Proceedings of the 29th International Colloquium on Automata, Languages, and
Programming, 2002.
[56] Gerth S. Brodal, Rolf Fagerberg, and G. Moruz. Cache-aware and cache-oblivious adap-
tive sorting. In ICALP ’05: Proceedings of the 32nd International Colloquium on Au-
tomata, Languages, and Programming, 2005.
[57] Gerth S. Brodal, Rolf Fagerberg, and K. Vinther. Engineering a cache-oblivious sorting
algorithm. ACM Journal of Experimental Algorithmics, 12, 2008. Article No. 2.2.
[58] Gerth Stølting Brodal and Rolf Fagerberg. On the limits of cache-obliviousness. In
Proceedings of the 35th annual ACM Symposium on Theory of Computing, STOC ’03,
pages 307–315, New York, NY, USA, 2003. ACM. ISBN 1-58113-674-9. URL http:
//doi.acm.org/10.1145/780542.780589.
[59] Bernard Chazelle and Louis Monier. A model of computation for VLSI with related
complexity results. Journal of the ACM, 32:573–588, 1985.
[60] Yi-Jen Chiang, Michael T. Goodrich, Edward F. Grove, Roberto Tamassia, Darren Erik
Vengroff, and Jeffrey Scott Vitter. External-memory graph algorithms. In SODA ’95: Pro-
ceedings of the 6th annual ACM-SIAM Symposium on Discrete algorithms, 1995. ISBN
0-89871-349-8.
[61] Francis Y. Chin, John Lam, and I-Ngo Chen. Efficient parallel algorithms for some graph
problems. Communications of the ACM, 25(9):659–665, September 1982. ISSN 0001-
0782. URL http://doi.acm.org/10.1145/358628.358650.
[62] Rezaul Alam Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elim-
ination paradigm: theoretical framework, parallelization and experimental evaluation. In
SPAA ’07: Proceedings of the 19th annual ACM Symposium on Parallel algorithms and
architectures, 2007. ISBN 978-1-59593-667-7.
[63] Rezaul Alam Chowdhury and Vijaya Ramachandran. Cache-efficient dynamic program-
ming algorithms for multicores. In SPAA ’08: Proceedings of the 20th ACM Symposium
on Parallelism in Algorithms and Architectures, 2008. ISBN 978-1-59593-973-9.
[64] Rezaul Alam Chowdhury, Francesco Silvestri, Brandon Blakeley, and Vijaya Ramachan-
dran. Oblivious algorithms for multicores and network of processors. In IPDPS ’10:
Proceedings of the IEEE 24th International Parallel and Distributed Processing Sympo-
sium, pages 1–12, 2010.
[65] R Cole and U Vishkin. Deterministic coin tossing and accelerating cascades: micro and
macro techniques for designing parallel algorithms. In STOC ’86: Proceedings of the 18th
annual ACM Symposium on Theory of Computing, 1986. ISBN 0-89791-193-8.
[66] Richard Cole and Vijaya Ramachandran. Resource oblivious sorting on multicores. In
Proceedings of the 37th international colloquium Conference on Automata, languages
and programming, ICALP’10, pages 226–237, Berlin, Heidelberg, 2010. Springer-Verlag.
ISBN 3-642-14164-1, 978-3-642-14164-5. URL http://dl.acm.org/citation.
cfm?id=1880918.1880944.
[67] Richard Cole and Vijaya Ramachandran. Efficient resource oblivious scheduling of mul-
ticore algorithms. manuscript, 2010.
[68] Richard Cole and Vijaya Ramachandran. Efficient resource oblivious algorithms for multi-
cores. Arxiv preprint arXiv.11034071, 2011. URL http://arxiv.org/abs/1103.
4071.
[69] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Cliff Stein. Introduction
to Algorithms, 2nd Edition. MIT Press, 2001.
[70] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice
Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: towards a realistic model
of parallel computation. In Proceedings of the fourth ACM SIGPLAN symposium on Prin-
ciples and practice of parallel programming, PPOPP ’93, pages 1–12, New York, NY,
USA, 1993. ACM. ISBN 0-89791-589-5. URL http://doi.acm.org/10.1145/
155332.155333.
[71] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large
clusters. In OSDI 2004, pages 137–150, 2004. URL http://www.usenix.org/
events/osdi04/tech/dean.html.
[72] Erik D. Demaine. Cache-oblivious algorithms and data structures. In Lecture Notes from
the EEF Summer School on Massive Data Sets. BRICS, 2002.
[73] Erik D. Demaine, Gad M. Landau, and Oren Weimann. On cartesian trees and range min-
imum queries. In Susanne Albers, Alberto Marchetti-Spaccamela, Yossi Matias, Sotiris
Nikoletseas, and Wolfgang Thomas, editors, Automata, Languages and Programming,
volume 5555 of Lecture Notes in Computer Science, pages 341–353. Springer Berlin Hei-
delberg, 2009. ISBN 978-3-642-02926-4. URL http://dx.doi.org/10.1007/
978-3-642-02927-1_29.
[74] James Demmel and Oded Schwartz. Communication avoiding algorithms – course web-
site, Fall 2011. URL http://www.cs.berkeley.edu/˜odedsc/CS294/.
[75] James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. Communication-
optimal parallel and sequential QR and LU factorizations. ArXiv e-prints, August 2008,
arXiv:0808.2664.
[76] Peter J. Denning. The working set model for program behavior. Communications of the
ACM, 11(5):323–333, May 1968. ISSN 0001-0782. URL http://doi.acm.org/
10.1145/363095.363141.
[77] Chen Ding and Ken Kennedy. Improving effective bandwidth through compiler enhance-
ment of global cache reuse. Journal of Parallel and Distributed Computing, 64(1):108–
134, January 2004. ISSN 0743-7315. URL http://dx.doi.org/10.1016/j.
jpdc.2003.09.005.
[78] Chen Ding and Yutao Zhong. Predicting whole-program locality through reuse distance
analysis. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Lan-
guage Design and Implementation, PLDI ’03, pages 245–257, New York, NY, USA, 2003.
ACM. ISBN 1-58113-662-5. URL http://doi.acm.org/10.1145/781131.
781159.
[79] D.L. Eager, J. Zahorjan, and E.D. Lazowska. Speedup versus efficiency in parallel sys-
tems. Computers, IEEE Transactions on, 38(3):408–423, 1989.
[80] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J Knight, Larkhoon Leem, Mike Hous-
ton, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J Dally, et al. Se-
quoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE Con-
ference on Supercomputing, page 83. ACM, 2006.
[81] G. Franceschini. Proximity mergesort: Optimal in-place sorting in the cache-oblivious
model. In Proceedings of the 15th ACM-SIAM Symposium on Discrete Algorithms, 2004.
[82] W. D. Frazer and A. C. McKellar. Samplesort: A sampling approach to minimal storage
tree sorting. Journal of the ACM, 17(3):496–507, July 1970. ISSN 0004-5411. URL
http://doi.acm.org/10.1145/321592.321600.
[83] Matteo Frigo and Victor Luchangco. Computation-centric memory models. In Proceed-
ings of the tenth annual ACM symposium on Parallel algorithms and architectures, SPAA
’98, pages 240–249, New York, NY, USA, 1998. ACM. ISBN 0-89791-989-0. URL
http://doi.acm.org/10.1145/277651.277690.
[84] Matteo Frigo and Volker Strumpen. The cache complexity of multithreaded cache oblivi-
ous algorithms. In SPAA ’06: Proceedings of the 18th annual ACM Symposium on Paral-
lelism in algorithms and architectures, 2006. ISBN 1-59593-452-9.
[85] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the
Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN ’98 Conference on
Programming Language Design and Implementation (PLDI), pages 212–223, Montreal,
Quebec, Canada, June 1998. Proceedings published ACM SIGPLAN Notices, Vol. 33,
No. 5, May, 1998.
[86] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-
oblivious algorithms. In 40th Annual Symposium on Foundations of Computer Science,
1999.
[87] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta,
and John Hennessy. Memory consistency and event ordering in scalable shared-memory
multiprocessors. SIGARCH Comput. Archit. News, 18(3a):15–26, May 1990. ISSN 0163-
5964. URL http://doi.acm.org/10.1145/325096.325102.
[88] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Tech-
nical Journal, 45, November 1966.
[89] Xiaoming Gu, Ian Christopher, Tongxin Bai, Chengliang Zhang, and Chen Ding. A
component model of spatial locality. In Proceedings of the 2009 International Sym-
posium on Memory management, ISMM ’09, pages 99–108, New York, NY, USA,
2009. ACM. ISBN 978-1-60558-347-1. URL http://doi.acm.org/10.1145/
1542431.1542446.
[90] Yi Guo, R. Barik, R. Raman, and V. Sarkar. Work-first and help-first scheduling poli-
cies for async-finish task parallelism. In IPDPS ’09: IEEE International Parallel and
Distributed Processing Symposium, pages 1–12, 2009.
[91] Avinatan Hassidim. Cache replacement policies for multicore processors. In ICS, pages
501–509, 2010.
[92] F. C. Hennie. One-tape, off-line turing machine computations. Information and Control,
8(6):553–578, Dec 1965.
[93] F. C. Hennie and R. E. Stearns. Two-tape simulation of multitape turing machines. Journal
of the ACM, 13(4):533–546, October 1966. ISSN 0004-5411. URL http://doi.acm.
org/10.1145/321356.321362.
[94] Jia-Wei Hong. On similarity and duality of computation (i). Information and Control, 62
(23):109 – 128, 1984. ISSN 0019-9958. URL http://www.sciencedirect.com/
science/article/pii/S0019995884800303.
[95] John Hopcroft, Wolfgang Paul, and Leslie Valiant. On time versus space. Journal of the
ACM, 24(2):332–337, April 1977. ISSN 0004-5411. URL http://doi.acm.org/
10.1145/322003.322015.
[96] Intel. Intel Cilk++ SDK programmer’s guide. https://www.clear.rice.edu/
comp422/resources/Intel_Cilk++_Programmers_Guide.pdf, 2009.
[97] Intel. Intel Thread Building Blocks reference manual. http://software.intel.
com/sites/products/documentation/doclib/tbb_sa/help/index.
htm#reference/reference.htm, 2013. Version 4.1.
[98] Intel. Performance counter monitor (PCM). http://www.intel.com/software/
pcm, 2013. Version 2.4.
[99] Dror Irony, Sivan Toledo, and Alexander Tiskin. Communication lower bounds for
distributed-memory matrix multiplication. Journal of Parallel and Distributed Comput-
ing, 64(9):1017–1026, September 2004. ISSN 0743-7315. URL http://dx.doi.
org/10.1016/j.jpdc.2004.03.021.
[100] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer. High performance
cache replacement using re-reference interval prediction (rrip). SIGARCH Comput. Archit.
News, 38(3):60–71, June 2010. ISSN 0163-5964. URL http://doi.acm.org/10.
1145/1816038.1815971.
[101] Aamer Jaleel, Hashem H. Najaf-abadi, Samantika Subramaniam, Simon C. Steely, and
Joel Emer. Cruise: cache replacement and utility-aware scheduling. In Proceedings of
the seventeenth international Conference on Architectural Support for Programming Lan-
guages and Operating Systems, ASPLOS XVII, pages 249–260, New York, NY, USA,
2012. ACM. ISBN 978-1-4503-0759-8. URL http://doi.acm.org/10.1145/
2150976.2151003.
[102] Hong Jia-Wei and H. T. Kung. I/o complexity: The red-blue pebble game. In Proceedings
of the 13th annual ACM Symposium on Theory of Computing, STOC ’81, pages 326–
333, New York, NY, USA, 1981. ACM. URL http://doi.acm.org/10.1145/
800076.802486.
[103] Christopher F. Joerg. The Cilk System for Parallel Multithreaded Computing. PhD thesis,
Department of Electrical Engineering and Computer Science, Massachusetts Institute of
Technology, Cambridge, Massachusetts, January 1996. Available as MIT Laboratory for
Computer Science Technical Report MIT/LCS/TR-701.
[104] T. S. Jung. Memory technology and solutions roadmap. http://www.sec.co.kr/
images/corp/ir/irevent/techforum_01.pdf, 2005.
[105] Anna R. Karlin and Eli Upfal. Parallel hashing: an efficient implementation of shared
memory. Journal of the ACM, 35(4):876–892, 1988.
[106] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation
for mapreduce. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on
Discrete Algorithms, SODA ’10, pages 938–948, Philadelphia, PA, USA, 2010. Soci-
ety for Industrial and Applied Mathematics. ISBN 978-0-898716-98-6. URL http:
//dl.acm.org/citation.cfm?id=1873601.1873677.
[107] Anil Kumar Katti and Vijaya Ramachandran. Competitive cache replacement strategies
for shared cache environments. In IPDPS, pages 215–226, 2012.
[108] Ronald Kriemann. Implementation and usage of a thread pool based on posix threads.
http://www.hlnum.org/english/projects/tools/threadpool/doc.html.
[109] Piyush Kumar. Cache oblivious algorithms. In Ulrich Meyer, Peter Sanders, and Jop
Sibeyn, editors, Algorithms for Memory Hierarchies. Springer, 2003.
[110] H. T. Kung. Let’s design algorithms for VLSI systems. In Proceedings of the Caltech Con-
ference On Very Large Scale Integration, pages 65–90. California Institute of Technology,
1979.
[111] Hsiang Tsung Kung and Charles E Leiserson. Algorithms for VLSI processor arrays.
Introduction to VLSI systems, pages 271–292, 1980.
[112] Edya Ladan-Mozes and Charles E. Leiserson. A consistency architecture for hierarchi-
cal shared caches. In SPAA ’08: Proceedings of the 20th ACM Symposium on Parallel
Algorithms and Architectures, pages 11–22, Munich, Germany, June 2008.
[113] Leslie Lamport. How to make a correct multiprocess program execute correctly on a
multiprocessor. IEEE Transactions on Computers, 46, 1997.
[114] Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur
Mutlu. Tiered-latency dram: A low latency and low cost dram architecture. 2013 IEEE
19th International Symposium on High Performance Computer Architecture (HPCA), 0:
615–626, 2013. ISSN 1530-0897.
[115] I-Ting Angelina Lee, Silas Boyd-Wickizer, Zhiyi Huang, and Charles E. Leiserson. Using
memory mapping to support cactus stacks in work-stealing runtime systems. In PACT ’10:
Proceedings of the 19th International Conference on Parallel Architectures and Compila-
tion Techniques, pages 411–420, Vienna, Austria, September 2010. ACM.
[116] F. Thomson Leighton. Introduction to parallel algorithms and architectures: array, trees,
hypercubes. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992. ISBN
1-55860-117-1.
[117] Frank T. Leighton, Bruce M. Maggs, Abhiram G. Ranade, and Satish B. Rao. Random-
ized routing and sorting on fixed-connection networks. Journal of Algorithms, 17(1):157
– 205, 1994. URL http://www.sciencedirect.com/science/article/
pii/S0196677484710303.
[118] Frank T. Leighton, Bruce M. Maggs, and Satish B. Rao. Packet routing and job-shop
scheduling in O(congestion + dilation) steps. Combinatorica, 14(2):167–186, 1994. ISSN
0209-9683. URL http://dx.doi.org/10.1007/BF01215349.
[119] Tom Leighton. Tight bounds on the complexity of parallel sorting. In Proceedings of the
16th annual ACM Symposium on Theory of Computing, STOC ’84, pages 71–80, New
York, NY, USA, 1984. ACM. ISBN 0-89791-133-4. URL http://doi.acm.org/
10.1145/800057.808667.
[120] Charles Leiserson. The Cilk++ concurrency platform. Journal of Supercomputing, 51,
2010.
[121] Charles E. Leiserson. Fat-Trees: Universal networks for hardware-efficient supercomput-
ing. IEEE Transactions on Computers, C–34(10), 1985.
[122] Charles E. Leiserson and Bruce M. Maggs. Communication-efficient parallel graph algo-
rithms for distributed random-access machines. Algorithmica, 3(1):53–77, 1988.
[123] Kevin Lim, Jichuan Chang, Trevor Mudge, Parthasarathy Ranganathan, Steven K. Rein-
hardt, and Thomas F. Wenisch. Disaggregated memory for expansion and sharing in blade
servers. In Proceedings of the 36th annual International Symposium on Computer Ar-
chitecture, ISCA ’09, pages 267–278, New York, NY, USA, 2009. ACM. ISBN 978-1-
60558-526-0. URL http://doi.acm.org/10.1145/1555754.1555789.
[124] Richard J. Lipton and Robert Sedgewick. Lower bounds for VLSI. In Proceedings of the
13th annual ACM Symposium on Theory of Computing, STOC ’81, pages 300–307, New
York, NY, USA, 1981. ACM. URL http://doi.acm.org/10.1145/800076.
802482.
[125] Richard J. Lipton and Robert E. Tarjan. A separator theorem for planar graphs. SIAM
Journal on Applied Mathematics, 36, 1979.
[126] Victor Luchangco. Precedence-based memory models. In Marios Mavronicolas and
Philippas Tsigas, editors, Distributed Algorithms, volume 1320 of Lecture Notes in Com-
puter Science, pages 215–229. Springer Berlin Heidelberg, 1997. ISBN 978-3-540-63575-
8. URL http://dx.doi.org/10.1007/BFb0030686.
[127] Gabriel Marin and John Mellor-Crummey. Cross-architecture performance predictions for
scientific applications using parameterized models. SIGMETRICS Performance Evalua-
tion Review, 32(1):2–13, June 2004. ISSN 0163-5999. URL http://doi.acm.org/
10.1145/1012888.1005691.
[128] R.L. Mattson, J. Gecsei, D.R. Slutz, and I.L. Traiger. Evaluation techniques for storage
hierarchies. IBM Systems Journal, 9(2):78–117, 1970.
[129] C.A. Mead and M. Rem. Cost and performance of VLSI computing structures. IEEE
Transactions on Electron Devices, 26(4):533–540, 1979.
[130] Carver Mead and Lynn Conway. Introduction to VLSI Systems. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1979. ISBN 0201043580.
[131] Microsoft. Task Parallel Library. http://msdn.microsoft.com/en-us/
library/dd460717.aspx, 2013. .NET version 4.5.
[132] Gary L. Miller and John H. Reif. Parallel tree contraction part 1: Fundamentals. In
Silvio Micali, editor, Randomness and Computation, pages 47–72. JAI Press, Greenwich,
Connecticut, 1989. Vol. 5.
[133] D. Molka, D. Hackenberg, R. Schöne, and M.S. Müller. Memory performance and cache
coherency effects on an Intel Nehalem multiprocessor system. In PACT ’09: Proceedings
of the 18th International Conference on Parallel Architectures and Compilation Techniques,
pages 261–270, September 2009.
[134] Girija J. Narlikar. Scheduling threads for low space requirement and good locality. In Pro-
ceedings of the 11th Annual ACM Symposium on Parallel Algorithms and Architectures,
June 1999.
[135] Girija J. Narlikar. Space-Efficient Scheduling for Parallel, Multithreaded Computations.
PhD thesis, Carnegie Mellon University, May 1999. Available as CMU-CS-99-119.
[136] David A. Patterson. Latency lags bandwidth. Communications of the ACM, 47(10):
71–75, October 2004. ISSN 0001-0782. URL http://doi.acm.org/10.1145/
1022594.1022596.
[137] Perfmon2. libpfm. http://perfmon2.sourceforge.net/, 2012.
[138] N. Pippenger. Parallel communication with limited buffers. In Proceedings of the 25th
Annual Symposium on Foundations of Computer Science, 1984, SFCS ’84, pages 127–
136, Washington, DC, USA, 1984. IEEE Computer Society. ISBN 0-8186-0591-X. URL
http://dx.doi.org/10.1109/SFCS.1984.715909.
[139] Apan Qasem and Ken Kennedy. Profitable loop fusion and tiling using model-driven em-
pirical search. In Proceedings of the 20th annual International Conference on Supercom-
puting, ICS ’06, pages 249–258, New York, NY, USA, 2006. ACM. ISBN 1-59593-282-8.
URL http://doi.acm.org/10.1145/1183401.1183437.
[140] Jean-Noël Quintin and Frédéric Wagner. Hierarchical work-stealing. In Proceedings of
the 16th International Euro-Par Conference on Parallel Processing: Part I, EuroPar’10,
pages 217–229, Berlin, Heidelberg, 2010. Springer-Verlag.
[141] S. Rajasekaran and J. H. Reif. Optimal and sublogarithmic time randomized parallel
sorting algorithms. SIAM Journal on Computing, 18(3):594–607, 1989. ISSN 0097-5397.
[142] Abhiram G. Ranade. How to emulate shared memory. Journal of Computer and System
Sciences, 42(3):307–326, 1991.
[143] Arnold L. Rosenberg. The VLSI revolution in theoretical circles. In Jan Paredaens, edi-
tor, Automata, Languages and Programming, volume 172 of Lecture Notes in Computer
Science, pages 23–40. Springer Berlin Heidelberg, 1984. ISBN 978-3-540-13345-2. URL
http://dx.doi.org/10.1007/3-540-13345-3_2.
[144] Samsung. DRAM Data Sheet. http://www.samsung.com/global/business/
semiconductor/product.
[145] John E. Savage. Area-time tradeoffs for matrix multiplication and related problems in VLSI
models. Journal of Computer and System Sciences, 22(2):230–242, 1981. ISSN 0022-
0000. URL http://www.sciencedirect.com/science/article/pii/
0022000081900295.
[146] John E. Savage. Extending the Hong-Kung model to memory hierarchies. In Proceedings of
the First Annual International Conference on Computing and Combinatorics, COCOON
’95, pages 270–281, London, UK, 1995. Springer-Verlag. ISBN 3-540-60216-X.
URL http://dl.acm.org/citation.cfm?id=646714.701412.
[147] John E. Savage. Models of Computation: Exploring the Power of Computing. Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1st edition, 1997. ISBN
0201895390.
[148] Scandal. Irregular algorithms in the NESL language. http://www.cs.cmu.edu/
~scandal/alg/algs.html, July 1994.
[149] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Har-
sha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: the problem based
benchmark suite. In Proceedings of the 24th ACM Symposium on Parallelism in Algo-
rithms and Architectures, SPAA ’12, pages 68–70, New York, NY, USA, 2012. ACM.
ISBN 978-1-4503-1213-4.
[150] Daniel D. Sleator and Robert E. Tarjan. Amortized efficiency of list update and paging
rules. Communications of the ACM, 28(2), 1985.
[151] Edgar Solomonik and James Demmel. Communication-optimal parallel 2.5D matrix mul-
tiplication and LU factorization algorithms. Technical Report UCB/EECS-2011-72, EECS
Department, University of California, Berkeley, June 2011. URL http://www.eecs.
berkeley.edu/Pubs/TechRpts/2011/EECS-2011-72.html.
[152] Daniel Spoonhower. Scheduling Deterministic Parallel Programs. PhD thesis, Carnegie
Mellon University, May 2009. Available as CMU-CS-09-126.
[153] Daniel Spoonhower, Guy E. Blelloch, Phillip B. Gibbons, and Robert Harper. Beyond
nested parallelism: tight bounds on work-stealing overheads for parallel futures. In SPAA
’09: Proceedings of the 21st annual Symposium on Parallelism in Algorithms and Archi-
tectures, pages 91–100, 2009.
[154] Kanat Tangwongsan. Efficient Parallel Approximation Algorithms. PhD thesis, Carnegie
Mellon University, 2011.
[155] Yufei Tao, Wenqing Lin, and Xiaokui Xiao. Minimal MapReduce algorithms. In Pro-
ceedings of the 2013 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’13, pages 529–540, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-
2037-5. URL http://doi.acm.org/10.1145/2463676.2463719.
[156] R. E. Tarjan and U. Vishkin. Finding biconnected components and computing tree func-
tions in logarithmic parallel time. In Proceedings of the 25th Annual Symposium on Foun-
dations of Computer Science, 1984, SFCS ’84, pages 12–20, Washington, DC, USA, 1984.
IEEE Computer Society. ISBN 0-8186-0591-X. URL http://dx.doi.org/10.
1109/SFCS.1984.715896.
[157] C. D. Thompson. Area-time complexity for VLSI. In Proceedings of the 11th annual
ACM Symposium on Theory of Computing, STOC ’79, pages 81–88, New York, NY, USA,
1979. ACM. URL http://doi.acm.org/10.1145/800135.804401.
[158] C.D. Thompson. The VLSI complexity of sorting. IEEE Transactions on
Computers, 32(12):1171–1184, 1983. ISSN 0018-9340. URL http://doi.
ieeecomputersociety.org/10.1109/TC.1983.1676178.
[159] C.D. Thompson. Fourier transforms in VLSI. IEEE Transactions on Computers, 32(11):
1047–1057, 1983. ISSN 0018-9340. URL http://doi.ieeecomputersociety.
org/10.1109/TC.1983.1676155.
[160] Clark Thompson. A Complexity Theory of VLSI. PhD thesis, Carnegie Mellon University,
1980. Available as CMU-CS-80-140.
[161] Pilar de la Torre and Clyde P. Kruskal. Submachine locality in the bulk synchronous setting.
In Luc Bougé, Pierre Fraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par’96
Parallel Processing, volume 1124 of Lecture Notes in Computer Science, pages 352–358.
Springer Berlin Heidelberg, 1996. ISBN 978-3-540-61627-6. URL http://dx.doi.
org/10.1007/BFb0024723.
[162] Eli Upfal. Efficient schemes for parallel communication. Journal of the ACM, 31(3):507–
517, June 1984. ISSN 0004-5411. URL http://doi.acm.org/10.1145/828.
1892.
[163] Leslie G. Valiant. A bridging model for parallel computation. Communications of the
ACM, 33(8), 1990. ISSN 0001-0782.
[164] Leslie G. Valiant. A bridging model for multi-core computing. In ESA ’08: Proceedings
of the 16th European Symposium on Algorithms, 2008.
[165] Leslie G. Valiant and Gordon J. Brebner. Universal schemes for parallel communication.
In Proceedings of the 13th annual ACM Symposium on Theory of Computing, STOC
’81, pages 263–277, New York, NY, USA, 1981. ACM. URL http://doi.acm.org/
10.1145/800076.802479.
[166] Jeffrey Scott Vitter. External memory algorithms and data structures: dealing with massive
data. ACM Computing Surveys, 33(2), 2001. ISSN 0360-0300.
[167] J. Vuillemin. A combinatorial limit to the computing power of VLSI circuits. IEEE
Transactions on Computers, 32(3):294–300, 1983. ISSN 0018-9340. URL http://
doi.ieeecomputersociety.org/10.1109/TC.1983.1676221.
[168] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, 1st edition, June 2009. ISBN
0596521979.
[169] Andrew W. Wilson Jr. Hierarchical cache/bus architecture for shared memory multipro-
cessors. In Proceedings of the 14th annual International Symposium on Computer Archi-
tecture, pages 244–252. ACM, 1987.
[170] J. C. Wyllie. The Complexity of Parallel Computation. PhD thesis, Cornell University,
1979.
[171] Xiaoya Xiang, Chen Ding, Hao Luo, and Bin Bao. HOTL: a higher order theory of locality.
In Proceedings of the 18th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems, ASPLOS ’13, pages 343–356, New York,
NY, USA, 2013. ACM. ISBN 978-1-4503-1870-9. URL http://doi.acm.org/10.
1145/2451116.2451153.
[172] Andrew C. Yao. The entropic limitations on VLSI computations (extended abstract). In
Proceedings of the 13th annual ACM Symposium on Theory of Computing, STOC ’81,
pages 308–311, New York, NY, USA, 1981. ACM. URL http://doi.acm.org/
10.1145/800076.802483.
[173] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing
(preliminary report). In Proceedings of the 11th annual ACM Symposium on The-
ory of Computing, STOC ’79, pages 209–213, New York, NY, USA, 1979. ACM. URL
http://doi.acm.org/10.1145/800135.804414.
[174] Norbert Zeh. I/O-efficient graph algorithms. http://web.cs.dal.ca/~nzeh/
Publications/summer_school_2002.pdf, 2002.
[175] Yutao Zhong, Steven G. Dropsho, Xipeng Shen, Ahren Studer, and Chen Ding. Miss
rate prediction across program inputs and cache configurations. IEEE Transactions on
Computers, 56(3):328–343, March 2007. ISSN 0018-9340.