A Primer on Hardware Prefetching
Babak Falsafi, EPFL, Switzerland
Thomas F. Wenisch, University of Michigan
Synthesis Lectures on Computer Architecture
Editor: Mark D. Hill, University of Wisconsin
ISSN: 1935-3235
ISBN: 978-1-60845-952-0
Morgan & Claypool Publishers
www.morganclaypool.com
A Primer on Hardware Prefetching
Synthesis Lectures on
Computer Architecture
Editor
Margaret Martonosi, Princeton University
Founding Editor Emeritus
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics per-
taining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Multithreading Architecture
Mario Nemirovsky, Dean M. Tullsen
January 2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, Wen-mei Hwu
November 2012
On-Chip Networks
Natalie Enright Jerger, Li-Shiuan Peh
2009
The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus, Ravi Rajwar
2006
Quantum Computing for Computer Architects
Tzvetan S. Metodi, Frederic T. Chong
2006
Copyright © 2014 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quota-
tions in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00581ED1V01Y201405CAC028
Babak Falsafi
EPFL
Thomas F. Wenisch
University of Michigan
Morgan & Claypool Publishers
ABSTRACT
Since the 1970s, microprocessor-based digital platforms have been riding Moore’s law, allowing for
a doubling of transistor density in the same area roughly every two years. However, whereas microprocessor
fabrication has focused on increasing instruction execution rate, memory fabrication technologies
have focused primarily on an increase in capacity with negligible increase in speed. This divergent
trend in performance between the processors and memory has led to a phenomenon referred to as
the “Memory Wall.”
To overcome the memory wall, designers have resorted to a hierarchy of cache memory levels,
which rely on the principle of memory access locality to reduce the observed memory access time
and the performance gap between processors and memory. Unfortunately, important workload
classes exhibit adverse memory access patterns that baffle the simple policies built into modern
cache hierarchies to move instructions and data across cache levels. As such, processors often spend
much time idling upon a demand fetch of memory blocks that miss in higher cache levels.
Prefetching—predicting future memory accesses and issuing requests for the corresponding
memory blocks in advance of explicit accesses—is an effective approach to hide memory access
latency. There have been a myriad of proposed prefetching techniques, and nearly every modern
processor includes some hardware prefetching mechanisms targeting simple and regular memory
access patterns. This primer offers an overview of the various classes of hardware prefetchers for
instructions and data proposed in the research literature, and presents examples of techniques in-
corporated into modern microprocessors.
KEYWORDS
hardware prefetching, next-line prefetching, branch-directed prefetching, discontinuity
prefetching, stride prefetching, address-correlated prefetching, Markov prefetcher, global history
buffer, temporal memory streaming, spatial memory streaming, execution-based prefetching
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.1 The Memory Wall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
    1.2 Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
        1.2.1 Predicting Addresses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
        1.2.2 Prefetch Lookahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
        1.2.3 Placing Prefetched Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3 Data Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
    3.1 Stride and Stream Prefetchers for Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
    3.2 Address-Correlating Prefetchers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
        3.2.1 Jump Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
        3.2.2 Pair-Wise Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
        3.2.3 Markov Prefetcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
        3.2.4 Improving Lookahead via Prefetch Depth . . . . . . . . . . . . . . . . . . . 19
        3.2.5 Improving Lookahead via Dead Block Prediction . . . . . . . . . . . . . 20
        3.2.6 Addressing On-Chip Storage Limitations . . . . . . . . . . . . . . . . . . . 21
        3.2.7 Global History Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
        3.2.8 Stream Chaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
        3.2.9 Temporal Memory Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
        3.2.10 Irregular Stream Buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    3.3 Spatially Correlated Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
        3.3.1 Delta-Correlated Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
        3.3.2 Global History Buffer PC-Localized/Delta-Correlating (GHB PC/DC) . . . 27
        3.3.3 Code-Correlated Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
        3.3.4 Spatial Footprint Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
        3.3.5 Spatial Pattern Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
        3.3.6 Stealth Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
        3.3.7 Spatial Memory Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
        3.3.8 Spatio-Temporal Memory Streaming . . . . . . . . . . . . . . . . . . . . . . . 32
    3.4 Execution-Based Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
        3.4.1 Algorithm Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
        3.4.2 Helper-Thread and Helper-Core Approaches . . . . . . . . . . . . . . . . . 33
        3.4.3 Run-Ahead Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
        3.4.4 Context Restoration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
        3.4.5 Computation Spreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
    3.5 Prefetch Modulation and Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
    3.6 Software Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Author Biographies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Preface
Since their inception in the 1970s, microprocessor-based digital platforms have been riding Moore’s
law, allowing for a doubling of transistor density in the same area roughly every two years. Microprocessors
and memory fabrication technologies, however, have been exploiting this increase in density in two
somewhat opposing ways. Whereas microprocessor fabrication has focused on increasing the rate
at which machine instructions execute, memory fabrication technologies have focused primarily on
an increase in capacity with negligible increase in speed. This divergent trend in performance be-
tween the processors and memory has led to a phenomenon referred to as the “Memory Wall” [1].
To overcome the memory wall, designers have resorted to a hierarchy of cache memory levels
where at each level access latency is traded off for capacity. Caches rely on the principle of mem-
ory access locality to reduce the observed memory access time and the performance gap between
processors and memory. Unfortunately, there are a number of important classes of workloads that
exhibit adverse memory access patterns that baffle the simple policies built into modern cache
hierarchies to move instructions and data across the cache levels. As such, processors often spend
much time idling upon a demand fetch of memory blocks that miss in higher cache levels.
Prefetching—predicting future memory accesses and issuing requests for the corresponding
memory blocks in advance of explicit accesses by a processor—is quite promising as an approach
to hide memory access latency. There have been a myriad of hardware and software approaches to
prefetching. A number of effective hardware prefetching mechanisms targeting simple and regular
memory access patterns have been incorporated into modern microprocessors to prefetch instruc-
tions and data.
This primer offers an overview of the various classes of hardware prefetchers for instructions
and data that have been proposed over the years, and presents examples of techniques incorporated
into modern microprocessors. Although the techniques covered in this book are by no means
comprehensive, they include important instances of techniques from each class, and as such the book
serves as a suitable survey for those who wish to familiarize themselves with the domain. We cover
prefetching for instruction and data caches, but many of the techniques we discuss may also be
applicable to prefetching memory translations into translation lookaside buffers (see, e.g., [2]).
This primer is broken down into four chapters. In Chapter 1, we present an introduction to
the memory hierarchy and general prefetching concepts. In Chapter 2, we describe techniques to
prefetch instructions. Chapter 3 covers techniques to prefetch data, and we give concluding remarks
in Chapter 4. The instruction prefetching techniques cover next-line prefetchers, branch-directed
prefetching, discontinuity prefetchers, and temporal instruction streaming. The data prefetchers in-
clude stride and stream-based data prefetchers, address-correlated prefetching, spatially correlated
prefetching, and execution-based prefetching.
We assume the reader is familiar with the basics of processor architecture and caches and
has some familiarity with more advanced topics like out-of-order execution. This book enumerates
the key issues in designing hardware prefetchers and provides high-level descriptions of a variety of
prefetching techniques. We refer the reader to the cited publications for more complete microar-
chitectural details and performance evaluations of the prefetching schemes.
We would like to acknowledge those who helped contribute to this book. Thanks to Mark
Hill for shepherding us through the writing and editorial process. Thanks to Cansu Kaynak and
Michael Ferdman for providing figures and comments on drafts of this book. Thanks to Margaret
Martonosi, Calvin Lin, and other anonymous reviewers for their detailed feedback that helped to
improve this book.
CHAPTER 1
Introduction
Figure 1.1: The growing disparity between processor and memory performance. From [3].
Computer architects have historically attempted to bridge this performance gap using a hier-
archy of cache memories. Figure 1.2 depicts the anatomy of a modern computer’s cache hierarchy.
The hierarchy consists of cache memories that trade off capacity for lower latency at each level. The
purpose of the hierarchy is to improve the apparent average memory access time by frequently han-
dling a memory request at the cache, avoiding the comparatively long access latency of DRAM. The
cache levels closer to the cores are smaller but faster. Each level provides a temporary repository for
recently accessed memory blocks to reduce the effective memory access latency. The more frequently
memory blocks are found in levels closer to the cores, the lower the access latency. We refer to the
cache(s) closest to the core as the L1 caches and then number cache levels successively, referring to
the final cache as the last level cache (LLC).
The hierarchy relies on two types of memory reference locality. Temporal locality is the tendency
of recently accessed memory to be accessed again soon. Spatial locality is the tendency of memory
near recently accessed locations to be accessed as well, because near-neighbor instructions and
data are often related.
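As a brief software-level illustration of these two forms of locality (our own example, not drawn from the book), consider the loop below: the accumulator is reused on every iteration, while the array is walked sequentially.

```cpp
#include <cstddef>

// Both forms of locality in one loop: `sum` is touched on every iteration
// (temporal locality), while a[i] walks consecutive addresses, so one
// fetched cache block serves several subsequent iterations (spatial locality).
double sum_array(const double* a, std::size_t n) {
    double sum = 0.0;                  // reused each iteration: temporal
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];                   // sequential accesses: spatial
    return sum;
}
```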
While exploiting locality is an extremely powerful way to reduce the effective memory access
latency, it relies on two basic premises that do not necessarily hold for all workloads, particularly
as cache hierarchies grow deeper. The first premise is that one cache size fits all
workloads and access patterns. In fact, the capacity demands of modern workloads vary drastically,
and differing workloads benefit from different trade-offs in the capacity and speed of cache hier-
archy levels. The second premise is that a single strategy for allocating and replacing cache entries
(typically allocating on demand and replacing entries that have not been recently used) is suitable
for all workloads. However, again, there is enormous variation in memory access patterns for which
a simple strategy for deciding which blocks to cache may fare poorly.
A myriad of techniques to overcome the Memory Wall have been proposed, from the algorithmic,
compiler, and system-software levels all the way down to hardware. These techniques range from
cache-oblivious algorithms and compiler-level code and data layout optimizations to
hardware-centric approaches. Moreover, many software-based techniques have
been proposed for prefetching. In this book, we focus on hardware-based techniques for prefetching
instructions and data. For a more comprehensive treatment of the memory system, we refer the
reader to the synthesis lecture by Jacob [4].
1.2 PREFETCHING
One way to hide memory access latency is to prefetch. Prefetching refers to the act of predicting
a subsequent memory access and fetching the required values ahead of the memory access to hide
any potentially long latency. In the limit, a memory access incurs no additional overhead and
memory appears as fast as a processor register. In practice, however,
prefetching may not always be timely or accurate. Late or inaccurate prefetches waste energy and,
in the worst case, can hurt performance.
To hide latency effectively, a prefetching mechanism must: (1) predict the address of a mem-
ory access (i.e., be accurate), (2) predict when to issue a prefetch (i.e., be timely), and (3) choose
where to place prefetched data (and, potentially, which other data to replace).
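To make these three requirements concrete, the following minimal C++ sketch (our own illustration; all names are hypothetical) frames a hardware prefetcher as a module that observes demand accesses and emits candidate prefetches annotated with the address, timing, and placement decisions:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical interface capturing the three decisions above.
struct PrefetchRequest {
    uint64_t block_addr;  // (1) which address to fetch (accuracy)
    unsigned lookahead;   // (2) how far ahead of demand to issue (timeliness)
    enum Placement { kL1, kL2, kStreamBuffer } dest;  // (3) where to place it
};

class Prefetcher {
public:
    virtual ~Prefetcher() = default;
    // Invoked on each demand access (or miss); returns zero or more
    // candidate prefetches for the memory system to issue.
    virtual std::vector<PrefetchRequest> observe(uint64_t pc,
                                                 uint64_t block_addr,
                                                 bool was_miss) = 0;
};
```

The prefetchers discussed in the remaining chapters can all be viewed as different policies behind this kind of interface.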
CHAPTER 2
Instruction Prefetching
Instruction fetch stalls are detrimental to performance for workloads with large instruction working
sets; when instruction supply slows down, the processor pipeline’s execution resources (no matter
how abundant) will be wasted. Whereas desktop and scientific workloads often exhibit small in-
struction working sets, conventional server workloads and emerging cloud workloads exhibit pri-
mary instruction working sets often far beyond what upper-level caches can accommodate. With
trends towards fast software development, scripting paradigms, and virtualized environments with
increasing software stack depth, primary instruction working sets are also growing fast. Modern
hardware instruction scheduling techniques, such as out-of-order execution, are often effective in
hiding some or all of the stalls due to data accesses and other long latency instructions. However,
out-of-order execution generally cannot hide instruction fetch latency. As such, instruction stalls
often account for a large fraction of overall memory stalls in servers.
Figure 2.2: Examples of instruction fetch: (a) sequential fetch, (b) discontinuity due to an if-statement
and a loop, and (c) discontinuities due to function calls. From [7].
Figure 2.2 compares examples of sequential fetch and discontinuities created by control flow.
Figure 2.2(a) depicts sequential fetch of instruction cache blocks. Sequential fetch can be covered
effectively with next-line prefetching. Figure 2.2(b) depicts two different types of discontinuity, one
due to an if-statement that is false and as such requires a fetch around one or more cache blocks,
and the other due to a loop. Figure 2.2(c) depicts discontinuities due to function calls.
2.2 FETCH-DIRECTED PREFETCHING
Branch-predictor-directed prefetchers [11, 12, 13, 14, 15] reuse existing branch predictors
to explore future control flow. These techniques use the branch predictor to recursively make future
predictions to find instruction-block addresses for prefetch. Because branch predictors are, to the
first order, decoupled from the rest of the pipeline, predictors can theoretically advance ahead of
execution to an arbitrary extent to predict future control flow.
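A minimal sketch of this idea follows, assuming a toy stand-in for the direction predictor and BTB (the names and structure are ours, not from any cited design):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy stand-in for the pipeline's direction predictor and BTB.
struct BranchPredictor {
    std::unordered_map<uint64_t, uint64_t> btb;  // branch PC -> target
    std::unordered_map<uint64_t, bool> taken;    // branch PC -> direction
    bool is_taken_branch(uint64_t pc) const {
        auto it = taken.find(pc);
        return it != taken.end() && it->second;
    }
    uint64_t target(uint64_t pc) const {
        auto it = btb.find(pc);
        return it != btb.end() ? it->second : pc + 4;  // fall through if unknown
    }
};

// Walk the predicted control-flow path `depth` instructions ahead of the
// fetch PC and collect the instruction-block addresses to prefetch.
std::vector<uint64_t> explore_future_blocks(const BranchPredictor& bp,
                                            uint64_t fetch_pc, int depth) {
    constexpr uint64_t kBlock = 64;  // cache block size (bytes)
    std::vector<uint64_t> blocks;
    uint64_t pc = fetch_pc;
    for (int i = 0; i < depth; ++i) {
        uint64_t blk = pc & ~(kBlock - 1);
        if (blocks.empty() || blocks.back() != blk)
            blocks.push_back(blk);  // one prefetch per distinct block
        // Follow the predicted path; a fixed 4-byte instruction width is assumed.
        pc = bp.is_taken_branch(pc) ? bp.target(pc) : pc + 4;
    }
    return blocks;
}
```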
Figure 2.5: Correct branch predictions required to achieve 4-miss lookahead. Data from [7].
Figure 2.7 (from [7]) illustrates the design of TIFS. L1 instruction cache misses are recorded
in an instruction miss log, a circular buffer maintained either in dedicated storage or within the L2
cache. A separate index table keeps a mapping from instruction block addresses to the location that
address was last recorded in the log. An L1-I miss to address C consults the index table (1), which
points to an instruction miss log entry (2). The stream of addresses following C is read from the log
and cache block addresses are sent to a streamed value buffer (3). The streamed value buffer requests
the blocks in the stream from L2 (4), which returns the contents (5). Later, on a subsequent L1-I
miss to D, the buffer returns the contents to the L1-I (6).
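The following C++ sketch models these TIFS structures in simplified form (a software approximation of the hardware described in [7]; the class and method names are hypothetical, and stale log slots are ignored):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified TIFS-style structures: a circular log of L1-I miss addresses
// plus an index table mapping each block address to its last log position.
class TemporalInstructionStreamer {
    std::vector<uint64_t> miss_log_;              // circular instruction miss log
    std::size_t head_ = 0;
    std::unordered_map<uint64_t, std::size_t> index_;  // block addr -> log slot
public:
    explicit TemporalInstructionStreamer(std::size_t log_entries)
        : miss_log_(log_entries, 0) {}

    // Record an L1-I miss: append it to the log and update the index table.
    void record_miss(uint64_t block_addr) {
        miss_log_[head_] = block_addr;
        index_[block_addr] = head_;
        head_ = (head_ + 1) % miss_log_.size();
    }

    // On a subsequent miss to the same block, replay the stream of addresses
    // that followed it last time; in hardware these feed a streamed value
    // buffer, which prefetches the blocks from L2.
    std::vector<uint64_t> lookup_stream(uint64_t block_addr,
                                        std::size_t max_len) const {
        std::vector<uint64_t> stream;
        auto it = index_.find(block_addr);
        if (it == index_.end()) return stream;
        for (std::size_t i = 1; i <= max_len; ++i)
            stream.push_back(miss_log_[(it->second + i) % miss_log_.size()]);
        return stream;
    }
};
```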
TIFS subsumes the sequential access predictions of next-line prefetchers. However, its per-
formance benefit is greater because the predictions are more accurate and more timely. Increased
accuracy comes from the fact that TIFS uses history to determine how many of the upcoming
consecutive blocks should be prefetched.
TIFS improves lookahead in several ways. First, it operates at the granularity of cache blocks
rather than individual instructions, addressing a key limitation of helper-thread approaches. Be-
cause of this, it skips over local loops and minor control flow within a cache block. TIFS is able to
support any number of discontinuous branches and indirect branch targets by separately recording
the discontinuities as part of the instruction stream. Furthermore, because it records extended se-
quences of instruction cache misses, it can quickly predict far into the future, providing substantially
higher lookahead. For example, a next-line predictor is able to correctly prefetch a function body
only after the first instruction block of that function is accessed. However, TIFS is able to predict
and prefetch the same blocks earlier by predicting the function call and its sequential accesses prior
to entering the function itself, while the caller is still executing code leading up to the call.
CHAPTER 3
Data Prefetching
Data miss patterns arise from the inherent structure that algorithms and high-level programming
constructs impose to organize and traverse data in memory. Whereas instruction miss patterns in
conventional von Neumann computer systems tend to be quite simple, following either sequential
patterns or repetitive control transfers in a well-structured control flow graph, data access patterns
can be far more diverse, particularly in pointer-linked data structures that enable multiple travers-
als. Moreover, whereas code tends to be static and hence easy to prefetch (with the exception of
recent virtualization and just-in-time compilation mechanisms, which tend to thwart instruction
prefetching), data structures morph over the course of execution, causing traversal patterns to
change. This greater complexity in access patterns has led to a rich and diverse design space for data
prefetching schemes, much broader than that for instruction prefetchers.
We divide the design space of data prefetchers into four broad categories. First are prefetch-
ers that rely on simple stride patterns, which directly generalize next-line instruction prefetching
concepts to data. Second are those that rely on repetitive traversal sequences, often exploiting the
pointer relationships among addresses. Third are those that rely on regular (yet potentially non-
strided) data structure layouts. Finally, there are mechanisms that explore ahead of the conventional
out-of-order instruction window, and hence do not rely on regularity or repetition in the memory
access address stream.
Figure 3.1: Baer and Chen’s reference prediction table. From [29].
A second key implementation issue is to decide how many blocks to prefetch when a strided
stream is detected. This parameter, often referred to as the prefetch degree or prefetch depth, is ideally
large enough that the prefetched data arrive before being referenced by the processor, but not so
large that blocks are replaced before access or cause undue pollution for short streams. Hur and Lin
propose simple state machines that track histograms of recent stream lengths and can adaptively
determine the appropriate prefetch depth for each distinct stream, enabling stream prefetchers to
be effective even for short streams of only a few addresses [33].
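A behavioral sketch of such a PC-indexed stride prefetcher with a configurable prefetch degree appears below (our simplification in the spirit of the reference prediction table [29]; the confidence threshold and table management are illustrative choices, not from the original design):

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_map>
#include <vector>

// PC-indexed stride detection: each entry tracks the last address and
// stride observed for a load PC and issues prefetches once the stride
// has been confirmed twice.
class StridePrefetcher {
    struct Entry { uint64_t last_addr = 0; int64_t stride = 0; int conf = 0; };
    std::unordered_map<uint64_t, Entry> table_;  // keyed by load PC
    int degree_;                                 // prefetch degree (depth)
public:
    explicit StridePrefetcher(int degree) : degree_(degree) {}

    std::vector<uint64_t> observe(uint64_t pc, uint64_t addr) {
        std::vector<uint64_t> prefetches;
        Entry& e = table_[pc];
        int64_t stride =
            static_cast<int64_t>(addr) - static_cast<int64_t>(e.last_addr);
        if (e.last_addr != 0 && stride == e.stride) {
            e.conf = std::min(e.conf + 1, 3);    // stride seen again: confirm
        } else {
            e.stride = stride;                   // (re)train on a new stride
            e.conf = 0;
        }
        e.last_addr = addr;
        if (e.conf >= 2 && e.stride != 0)        // steady state: issue degree_ blocks
            for (int d = 1; d <= degree_; ++d)
                prefetches.push_back(addr + d * e.stride);
        return prefetches;
    }
};
```

An adaptive variant in the spirit of Hur and Lin [33] would further vary the degree per stream based on histograms of recently observed stream lengths.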
Conventionally, stride prefetchers place the data they fetch directly into the cache hierarchy.
However, if stride prefetchers are aggressive, they may pollute the cache, displacing useful data.
Jouppi [34] describes an alternative organization wherein stream prefetchers place data in separate
buffers, called stream buffers, which are accessed immediately after or in parallel with the L1 cache.
By placing data in a stream buffer, a low accuracy stream (where many data are fetched but not
used) does not displace useful data in the cache, reducing the risk of inaccurate prefetching. How-
ever, erroneous prefetches still consume energy and bandwidth. Palacharla and Kessler evaluate a
memory system organization where stream buffers entirely replace the second-level data cache [35].
Each stream buffer holds cache blocks from a single stream. Accesses from the processor
interrogate the stream buffer contents, typically in parallel with accesses to the L1 cache. A hit
in a stream buffer typically causes the requested block to be transferred to the L1 cache and an
additional block from the stream to be fetched. In some variants, stream buffers are strictly FIFO
and only the head of each stream buffer may be accessed. In other variants, stream buffers are asso-
ciatively searched. When the stride detection mechanism observes a new stream, an entire stream
buffer is cleared and re-allocated (discarding any unreferenced blocks from a stale stream), typically
according to a round-robin or least-recently-used scheme.
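The sketch below models a single FIFO stream buffer of this kind (a simplified software model in the spirit of [34]; a real implementation would track in-flight fills and probe all buffers in parallel):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// A single FIFO stream buffer: it holds consecutive blocks of one stream,
// and only the head entry is checked on a processor access.
class StreamBuffer {
    std::deque<uint64_t> blocks_;   // prefetched block addresses, in order
    uint64_t next_addr_ = 0;        // next block to fetch for this stream
    static constexpr uint64_t kBlock = 64;
public:
    // Reallocate the buffer to a newly detected stream, discarding any
    // unreferenced blocks from the stale stream.
    void allocate(uint64_t start_addr, std::size_t depth) {
        blocks_.clear();
        next_addr_ = start_addr;
        for (std::size_t i = 0; i < depth; ++i) fetch_next();
    }
    // Probe on an L1 miss: a head hit pops the block (transferring it to
    // the L1 in a real design) and tops the buffer up with the next block.
    bool probe(uint64_t block_addr) {
        if (blocks_.empty() || blocks_.front() != block_addr) return false;
        blocks_.pop_front();
        fetch_next();
        return true;
    }
private:
    void fetch_next() {  // stands in for issuing a prefetch to memory
        blocks_.push_back(next_addr_);
        next_addr_ += kBlock;
    }
};
```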
Additional implementation concerns and optimizations for stream prefetchers have been
analyzed by Zhang and McKee [36] and Iacobovici and co-authors [37].
The Markov prefetcher design is inspired by conceptualizing a Markov model of the off-chip
access sequence. Each state in the model corresponds to a trigger address, with possible successor
states corresponding to subsequent miss addresses. Transition probabilities in the first-order Mar-
kov model correspond to the likelihood of each successor miss. The objective of the lookup table
is to store the successors with the highest transition probabilities for the most frequently encoun-
tered triggers. However, existing hardware proposals do not explicitly calculate trigger or transition
probabilities; both the trigger addresses and the successors for each are managed heuristically using
least-recently used (LRU) replacement.
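A software sketch of such a correlation table follows (our own simplification; a hardware table would use fixed set-associative storage rather than unbounded maps):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// Correlation table keyed by trigger miss address; each entry keeps the
// few most recent successors, an LRU-managed stand-in for the
// highest-probability Markov transitions.
class MarkovPrefetcher {
    std::unordered_map<uint64_t, std::deque<uint64_t>> successors_;
    uint64_t last_miss_ = 0;
    bool have_last_ = false;
    static constexpr std::size_t kWays = 4;  // successors kept per trigger
public:
    // Observe a miss: record it as a successor of the previous miss, then
    // return the predicted successors of this miss as prefetch candidates.
    std::vector<uint64_t> observe_miss(uint64_t addr) {
        if (have_last_) {
            auto& succ = successors_[last_miss_];
            for (auto it = succ.begin(); it != succ.end(); ++it)
                if (*it == addr) { succ.erase(it); break; }
            succ.push_front(addr);               // move-to-front LRU update
            if (succ.size() > kWays) succ.pop_back();
        }
        last_miss_ = addr;
        have_last_ = true;
        auto it = successors_.find(addr);
        if (it == successors_.end()) return {};
        return {it->second.begin(), it->second.end()};
    }
};
```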
Two factors limit the effectiveness of Markov prefetchers: (1) lookahead and memory-lev-
el-parallelism are limited because the prefetcher attempts to predict only the next miss and (2) cov-
erage is limited by on-chip correlation table capacity. We next discuss several proposals to address
each of these limitations.
Figure 3.3: Address-correlating global history buffer (GHB G/AC). From [68].
By varying the key stored in the index table and the link pointers between history buffer
entries, the GHB design can exploit a variety of properties that relate trigger events to predicted
prefetch streams. Nesbit and Smith introduce a taxonomy of GHB variants of the form GHB X/Y,
where X indicates how streams are localized (i.e., how link pointers connect history buffer entries
that should be prefetched consecutively) and Y indicates the correlation method (i.e., how the
lookup process locates a candidate stream) [68, 69]. Localization can be global (G) or per-PC (PC).
Under global localization, consecutively recorded history buffer entries form a stream. The pointer
associated with each history table entry either points to earlier occurrences of the same miss address
(facilitating higher prefetch width as discussed above) or is unused. Under per-PC localization,
both the index table and link pointers connect history buffer entries based on the PC of the trigger
access; a stream is formed by following the link pointers connecting consecutive misses issued by
the same trigger PC. The correlation method may be address correlating (AC) or delta correlating
(DC). In this section, we discuss the global address correlating variant (GHB G/AC), where the
index table maps miss addresses to history buffer locations. In Section 3.3.2, we discuss GHB PC/
DC (“program counter-localized delta correlation”), which instead locates entries and records his-
tory based on the stride between consecutive misses and localizes miss histories on a per-PC basis.
The literature discusses several other alternatives for localization and correlation [68, 69].
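The sketch below models the GHB G/AC structures and lookup (simplified from the design in [68]; real implementations bound the index table and validate link pointers against FIFO wrap-around, which this toy model omits):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Global history buffer with address correlation (GHB G/AC): a circular
// FIFO of miss addresses plus an index table mapping each address to its
// newest slot. Call lookup() before record_miss() so the index table
// still points at the trigger address's previous occurrence.
class GhbGac {
    struct Entry {
        uint64_t addr = 0;
        long prev = -1;   // link to an earlier occurrence of the same
    };                    // address (enables fetching multiple streams)
    std::vector<Entry> ghb_;                   // circular history buffer
    std::size_t head_ = 0;
    std::unordered_map<uint64_t, long> index_; // miss addr -> newest slot
public:
    explicit GhbGac(std::size_t entries) : ghb_(entries) {}

    // Under global localization (G), the predicted stream is simply the
    // run of misses recorded right after the trigger's last occurrence.
    std::vector<uint64_t> lookup(uint64_t trigger, std::size_t width) const {
        std::vector<uint64_t> stream;
        auto it = index_.find(trigger);
        if (it == index_.end()) return stream;
        std::size_t pos = static_cast<std::size_t>(it->second);
        for (std::size_t i = 1; i <= width; ++i)  // stale slots possible: toy model
            stream.push_back(ghb_[(pos + i) % ghb_.size()].addr);
        return stream;
    }

    void record_miss(uint64_t addr) {
        auto it = index_.find(addr);
        ghb_[head_] = {addr, it != index_.end() ? it->second : -1};
        index_[addr] = static_cast<long>(head_);
        head_ = (head_ + 1) % ghb_.size();
    }
};
```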
One challenge under the GHB organization is to determine when a stream ends, that is,
when the prefetcher should no longer fetch additional addresses indicated in the history buffer.
Many proposals that build on the GHB organization (e.g., [42]) make no effort to predict the end
of a stream. Instead, they allocate a stream buffer [34] for each successful index table lookup and
continue to follow the stream while it continues to provide prefetch hits. Stream buffers allocated to
streams that are no longer useful are recycled, for example, via least-recently-used replacement.
Wenisch discusses adaptively adjusting the stream prefetch rate as a stream is followed [42].
Figure 3.4: Sampling index table updates is effective for long and short, frequent streams. From [67].
IBM recently announced that the IBM Blue Gene/Q includes a new prefetching scheme,
called list prefetching [71], that bears many similarities to temporal memory streaming and is, to our
knowledge, the only publicly disclosed commercial implementation of such a prefetcher. The list
prefetching engine can prefetch from a recorded miss stream located in main memory. The address
list can either be provided via a software API or recorded automatically by hardware. However,
the list prefetcher does not provide an off-chip index table; software must assist the prefetcher in
recording, locating and initiating streams. Hardware then manages timeliness of prefetch and small
deviations between the recorded stream and L1 misses.
Figure 3.5: Delta-correlating global history buffer (GHB G/DC). From [68].
Although GHB PC/DC has demonstrated remarkable effectiveness with only limited stor-
age (256-entry index and history buffer tables) for SPEC benchmarks, to date, its effectiveness
has not been studied with workloads that have large code and data footprints (such as commercial
server or cloud computing applications) and its scaling behavior is unknown.
Several innovative variants of the GHB PC/DC prefetcher were evaluated in the First Data
Prefetching Championship and described in a special issue of the Journal of Instruction Level Par-
allelism in 2011 [31, 74, 75, 76, 77, 78].
Upon a trigger event (e.g., an L1 cache miss), the prefetcher constructs a lookup key from
the trigger event and searches for this key in a pattern history table, which associates the key with
a spatial pattern, a representation of the relative offsets to prefetch. Trigger events, lookup keys,
spatial pattern encoding, pattern history table organization, and the mechanisms used to train the
prefetcher vary among specific prefetcher designs.
The lookup key typically includes some or all of the bits from the PC of the trigger access.
Several studies [81, 82, 83] show that additionally including low-order bits of the data address, in
particular, the offset within the region, improves prefetch accuracy. These low-order data address
bits serve to distinguish among accesses to objects with similar layouts that are aligned differently
with respect to region boundaries—separate entries are recorded in the predictor tables for each
possible alignment. Alternatively, Ferdman and co-authors propose storing only a single pattern
using the PC as the lookup key and instead use the low-order bits to rotate the pattern to the
appropriate alignment [84, 85].
The simplest representation for spatial patterns is a bit vector representing which portions
(e.g., cache lines) of the region should be prefetched. This bit vector, combined with the base ad-
dress of the region (taken from the data address requested by the trigger event), provides the list of
addresses to prefetch.
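A sketch of this lookup path appears below (our illustration; training, i.e., recording which blocks of a region are actually touched, is omitted, and the key construction is one plausible choice rather than any specific published design):

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t kBlock = 64;                 // cache block size (bytes)
constexpr std::size_t kBlocksPerRegion = 32;    // 2 KB spatial regions
using Pattern = std::bitset<kBlocksPerRegion>;  // one bit per block in region

// Pattern history table: lookup key -> bit vector of blocks to prefetch.
std::unordered_map<uint64_t, Pattern> pattern_history_table;

// Combine the trigger PC with the block offset within the region, so that
// differently aligned objects with similar layouts get separate entries.
uint64_t make_key(uint64_t pc, uint64_t addr) {
    uint64_t offset = (addr / kBlock) % kBlocksPerRegion;
    return (pc << 5) | offset;  // 5 bits encode one of 32 offsets
}

// On a trigger access, expand the stored bit vector into block addresses
// relative to the base of the trigger's spatial region.
std::vector<uint64_t> predict(uint64_t pc, uint64_t addr) {
    std::vector<uint64_t> prefetches;
    auto it = pattern_history_table.find(make_key(pc, addr));
    if (it == pattern_history_table.end()) return prefetches;
    uint64_t region_bytes = kBlock * kBlocksPerRegion;
    uint64_t region_base = addr / region_bytes * region_bytes;
    for (std::size_t i = 0; i < kBlocksPerRegion; ++i)
        if (it->second.test(i))
            prefetches.push_back(region_base + i * kBlock);
    return prefetches;
}
```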
In the following subsections, we briefly summarize key aspects of specific code-correlated
prefetcher designs.
Figure 3.7: Structures for storage-efficient training in spatial memory streaming. From [82].
Whereas the active generation table is effective in reducing the storage requirements for
training the SMS prefetcher, the required pattern history table size remains large, on the order of
64KB, for maximum effectiveness. Hence, follow-on work by Burcea and co-authors proposed vir-
tualizing the pattern history table. Instead of a large dedicated table, the virtualized approach stores
prefetcher meta-data in the last level cache, using a small, dedicated meta-data cache to accelerate
access [89].
CHAPTER 4
Concluding Remarks
Hardware prefetching has been a subject of academic research and industrial development for over
40 years. Nevertheless, because of the scaling trends that continue to widen the gap between proces-
sor performance and memory access latency, the importance of hardware prefetching and the need
to hide memory system latency has only grown—further innovation remains critical.
In this primer, we have surveyed the myriad of prefetching techniques that have been de-
veloped and highlighted the principal program behaviors on which these techniques are based.
We hope this book serves as an introduction to the field, as an overview of the vast literature on
hardware prefetching, and as a catalyst to spur new research efforts.
A number of challenges remain to be addressed in future work. Instruction fetch remains a
fundamental bottleneck especially in servers with complex software stacks and ever-growing on-
chip instruction working sets. Although instruction footprints can often fit entirely on chip in large
last-level caches, cycle time constraints place severe limits on the capacity of L1 instruction caches,
and access latency to larger caches remains exposed. Advanced proposals for temporal instruction
streaming have in recent years achieved phenomenal accuracies and coverage (> 99.5%) even in the
presence of complex software stacks. The key to wider adoption of these proposals is techniques
to reduce on-chip meta-data storage to practical levels.
Another key challenge to instruction prefetching arises from developments in programming
languages and software engineering that often complicate or even thwart the techniques we have
discussed. Object-oriented programming practices, dynamic dispatch, and managed runtimes all
lead to an increase in the use of frequent, short function calls, indirection through function pointers,
register-indirect branches and multi-way control transfers. Dynamic code generation/optimization,
interpreted languages, and just-in-time compilation lead to environments where the control struc-
ture of a program may be obscured and instruction addresses change meaning over time. Virtualiza-
tion and operating system layering similarly complicate and obscure control flow through frequent
virtual-machine exits and traps to emulate privileged functionality.
On the hardware front, processors are increasingly supporting multiple concurrent hard-
ware threads that must share already over-subscribed instruction cache capacity. Prefetchers must
be enhanced to share limited capacity and bandwidth among threads, disambiguate instruction
streams issuing from each thread, and consider the interaction of prefetching policies and thread
prioritization/fetch policies.
A key remaining challenge for data prefetching is low accuracy and coverage across a broad
spectrum of workloads. While the emergence of data-intensive workloads and large-scale in-mem-
ory data services is placing ever-growing demands on effective data prefetching, the growth in
memory capacity is outstripping even the most advanced history-based prefetching techniques we
cover in this primer, diminishing repetitive history patterns and driving meta-data storage
requirements to prohibitive levels. Future advances in data prefetching must capture repetitive
access patterns with lower meta-data storage requirements and higher accuracy.
Prefetching techniques are beginning to emerge for graphics processing units and other
forms of specialized accelerators, which may have markedly different code and data access patterns
than conventional processors. In the case of graphics processors, memory access stalls and thread/
warp scheduling interact in complex ways, creating new opportunities for synergistic designs.
A fundamental challenge that has emerged in the past decade is that power has become a
first-class constraint due to the slowdown of Dennard scaling [130, 131] and the leveling off of supply
voltages. On the one hand, prefetchers eliminate stalls, which can lead to energy efficiency gains
due to more efficient use of hardware resources. On the other hand, most prefetchers require aux-
iliary hardware structures, which consume energy. Moreover, prefetchers often fetch incorrect blocks,
which can waste substantial energy. Indeed, many of the simpler (but widely deployed) designs are
wildly inaccurate; over half the blocks they retrieve may never be accessed. Advances in prefetching
must target energy efficiency as a first-class constraint in conjunction with other key metrics such
as accuracy and coverage in evaluating the effectiveness of a prefetcher design.
Bibliography
[1] W. A. Wulf and S. A. McKee. “Hitting the Memory Wall: Implications of the
Obvious.” ACM SIGARCH Computer Architecture News, v. 23 no. 1, 1995. DOI:
10.1145/216585.216588. xiii
[2] D. Lustig, A. Bhattacharjee, and M. Martonosi. “TLB Improvements for Chip
Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs.”
ACM Transactions on Architecture and Code Optimization, v. 10, no. 1, 2013. DOI:
10.1145/2445572.2445574. xiii
[3] J. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 4th ed.
Morgan Kaufmann, 2007. 1
[4] B. Jacob. “The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't
Fake It.” Synthesis Lectures on Computer Architecture, v. 4, no. 1, 2009. DOI: 10.2200/
S00201ED1V01Y200907CAC007. 2
[5] A. J. Smith. “Sequential Program Prefetching in Memory Hierarchies.” Computer, v. 11,
no. 12, 1978. DOI: 10.1109/C-M.1978.218016. 7, 15
[6] D. W. Anderson, F. J. Sparacio, and R. M. Tomasulo. “The IBM System/360 Model 91:
Machine Philosophy and Instruction-Handling.” IBM Journal of Research and Develop-
ment, v. 11 no. 1, 1967. DOI: 10.1147/rd.111.0008. 8
[7] M. Ferdman, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Temporal Instruction
Fetch Streaming.” In Proc. of the 41st Annual ACM/IEEE International Symposium on
Microarchitecture, 2008. DOI: 10.1109/MICRO.2008.4771774. 8, 10, 12
[8] P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso. “Performance of Data-
base Workloads on Shared-Memory Systems With Out-Of-Order Processors.” In Proc.
of the 8th International Conference on Architectural Support for Programming Languages and
Operating Systems, 1998. DOI: 10.1145/291069.291067. 8
[9] A. Ramirez, O. J. Santana, J. L. Larriba-Pey and M. Valero. “Fetching Instruction
Streams.” In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchi-
tecture, 2002. 8
[10] O. J. Santana, A. Ramirez, and M. Valero. “Enlarging Instruction Streams.” IEEE Trans-
actions on Computers, v. 56, no. 10, 2007. DOI: 10.1109/TC.2007.70742. 8, 11
[11] I-C. K. Chen, C-C. Lee, and T. N. Mudge. “Instruction Prefetching Using Branch Pre-
diction Information.” In Proc. of the IEEE International Conference on Computer Design,
1997. DOI: 10.1109/ICCD.1997.628926. 9
[12] G. Reinman, B. Calder, and T. Austin. “Fetch Directed Instruction Prefetching.” In Proc.
of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, 1999. 9
[13] A. V. Veidenbaum, Q. Zhao, and A. Shameer. “Non-Sequential Instruction Cache
Prefetching for Multiple–Issue Processors.” International Journal of High Speed Comput-
ing, v. 10, no. 1, 1999. DOI: 10.1142/S0129053399000065. 9
[14] R. Panda, P. V. Gratz, and D. A. Jiménez. “B-Fetch: Branch Prediction Directed Prefetch-
ing for In-Order Processors.” In Proc. of the 18th International Symposium on High-Perfor-
mance Computer Architecture, 2012. DOI: 10.1109/L-CA.2011.33. 9
[15] T. Sherwood, S. Sair, and B. Calder. "Predictor-Directed Stream Buffers." In Proc. of
the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2000. DOI:
10.1145/360128.360135. 9
[16] J. Pierce, and T. N. Mudge. “Wrong-Path Instruction Prefetching.” In Proc. of the 29th
Annual ACM/IEEE International Symposium on Microarchitecture, 1996. 11
[17] V. Srinivasan, E. S. Davidson, G. S. Tyson, M. J. Charney, and T. R. Puzak. “Branch
History Guided Instruction Prefetching.” In Proc. of the 7th International Symposium on
High-Performance Computer Architecture, 2001. DOI: 10.1109/HPCA.2001.903271. 11
[18] Y. Zhang, S. Haga, and R. Barua. “Execution History Guided Instruction Prefetch-
ing.” In Proc. of the 16th Annual International Conference on Supercomputing, 2002. DOI:
10.1145/514191.514220. 11
[19] Q. Jacobson, E. Rotenberg, and J. E. Smith. “Path-Based Next Trace Prediction.” In Proc.
of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, 1997. DOI:
10.1109/MICRO.1997.645793. 11
[20] M. Annavaram, J. M. Patel, and E. S. Davidson. “Call Graph Prefetching for Data-
base Applications.” ACM Transactions on Computer Systems, v. 21, no. 4, 2003. DOI:
10.1145/945506.945509. 11
[21] L. Spracklen, Y. Chou, and S. G. Abraham. “Effective Instruction Prefetching in Chip
Multiprocessors for Modern Commercial Applications.” In Proc. of the 11th Interna-
tional Symposium on High-Performance Computer Architecture, 2005. DOI: 10.1109/
HPCA.2005.13. 11
[22] T. M. Aamodt, P. Chow, P. Hammarlund, H. Wang, and J. P. Shen. “Hardware Support
for Prescient Instruction Prefetch.” Proc. of the 10th International Symposium on High-Per-
formance Computer Architecture, 2004. DOI: 10.1109/HPCA.2004.10028. 12
[23] C.-K. Luk and T. C. Mowry. “Cooperative Prefetching: Compiler and Hardware Support
for Effective Instruction Prefetching In Modern Processors.” In Proc. of the 31st an-
nual ACM/IEEE International Symposium on Microarchitecture, 1998. DOI: 10.1109/
MICRO.1998.742780. 12
[24] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. “Slipstream Processors: Improving
Both Performance and Fault Tolerance.” In Proc. of the 9th International Conference on
Architectural Support for Programming Languages and Operating Systems, 2000. DOI:
10.1145/356989.357013. 12
[25] C. Zilles and G. Sohi. “Execution-Based Prediction Using Speculative Slices.” In
Proc. of the 28th Annual International Symposium on Computer Architecture, 2001. DOI:
10.1145/379240.379246. 12
[26] A. Kolli, A. Saidi, and T. F. Wenisch. “RDIP: Return-Address-Stack Directed Instruction
Prefetching.” In Proc. of the 46th Annual IEEE/ACM International Symposium on Microar-
chitecture, 2013. DOI: 10.1145/2540708.2540731. 13
[27] M. Ferdman, C. Kaynak, and B. Falsafi. “Proactive Instruction Fetch.” In Proc. of the
44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011. DOI:
10.1145/2155620.2155638. 14
[28] C. Kaynak, B. Grot, and B. Falsafi. “Shift: Shared History Instruction Fetch for Lean-
Core Server Processors.” In Proc. of the 46th Annual IEEE/ACM International Symposium
on Microarchitecture, 2013. DOI: 10.1145/2540708.2540732. 14
[29] J.-L. Baer and T.-F. Chen. “An Effective On-Chip Preloading Scheme to Reduce Data
Access Penalty.” In Proc. of Supercomputing, 1991. DOI: 10.1145/125826.125932. 15, 16
[30] F. Dahlgren and P. Stenstrom. “Effectiveness of Hardware-Based Stride and Sequential
Prefetching in Shared-Memory Multiprocessors.” In Proc. of the 1st IEEE Symposium on
High-Performance Computer Architecture, 1995. DOI: 10.1109/HPCA.1995.386554. 16
[31] Y. Ishii, M. Inaba and K. Hiraki. “Access Map Pattern Matching for High Performance
Data Cache Prefetch.” Journal of Instruction-Level Parallelism, v. 13, 2011. 16, 28
[32] S. Sair, T. Sherwood, and B. Calder. “A Decoupled Predictor-Directed Stream Prefetch-
ing Architecture.” IEEE Transactions on Computers, v. 52, no. 3, 2003. DOI: 10.1109/
TC.2003.1183943. 16
[33] I. Hur and C. Lin. “Memory Prefetching Using Adaptive Stream Detection.” In Proc. of
the 39th Annual ACM/IEEE International Symposium on Microarchitecture, 2006. DOI:
10.1109/MICRO.2006.32. 16, 35
[34] N. P. Jouppi. “Improving Direct-Mapped Cache Performance by the Addition of a Small
Fully-Associative Cache and Prefetch Buffers.” In Proc. of the 17th Annual International
Symposium on Computer Architecture, 1990. DOI: 10.1145/325164.325162. 16, 24
[35] S. Palacharla and R. E. Kessler. “Evaluating Stream Buffers As a Secondary Cache Place-
ment.” In Proc. of the 21st Annual International Symposium on Computer Architecture, 1994.
17
[36] C. Zhang and S. A. McKee. “Hardware-Only Stream Prefetching and Dynamic Access
Ordering.” In Proc. of the 14th Annual International Conference on Supercomputing, 2000.
DOI: 10.1145/335231.335247. 17
[37] S. Iacobovici, L. Spracklen, S. Kadambi, Y. Chou and S. G. Abraham. “Effective Stream-
Based and Execution-Based Data Prefetching.” In Proc. of the 18th Annual International
Conference on Supercomputing, 2004. DOI: 10.1145/1006209.1006211. 17
[38] J.-L. Baer and G. R. Sager. “Dynamic Improvement of Locality in Virtual Mem-
ory Systems.” IEEE Transactions on Software Engineering, v. 1, 1976. DOI: 10.1109/
TSE.1976.233801. 17
[39] M. J. Charney and A. P. Reeves. “Generalized Correlation-Based Hardware Prefetching.”
Technical Report EE-CEG-95-1, School of Electrical Engineering, Cornell University,
Feb. 1995. 17
[40] M. J. Charney. Correlation-Based Hardware Prefetching. Ph.D. diss., Cornell Uni-
versity, 1996. 17
[41] T. M. Chilimbi and M. Hirzel. “Dynamic Hot Data Stream Prefetching for Gener-
al-Purpose Programs.” In Proc. of the Conference on Programming Language Design and
Implementation, 2002. DOI: 10.1145/512529.512554. 17, 19, 20, 24, 36
[42] T. F. Wenisch, S. Somogyi, N. Hardavellas, J. Kim, A. Ailamaki, and B. Falsafi. “Temporal
Streaming of Shared Memory.” In Proc. of the 32nd Annual International Symposium on
Computer Architecture, June 2005. DOI: 10.1109/ISCA.2005.50. 17, 20, 22
[43] C.-K. Luk and T. C. Mowry. “Compiler Based Prefetching for Recursive Data Struc-
tures.” In Proc. of the 7th International Conference on Architectural Support for Programming
Languages and Operating Systems, 1996. DOI: 10.1145/237090.237190. 17, 36
[44] A. Roth, A. Moshovos, and G. S. Sohi. “Dependence Based Prefetching for Linked Data
Structures.” In Proc. of the 8th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems, 1998. DOI: 10.1145/291069.291034. 17, 33
[45] A. Roth and G. S. Sohi. “Effective Jump Pointer Prefetching for Linked Data Structures.”
In Proc. of the 26th Annual International Symposium on Computer Architecture, 1999. DOI:
10.1109/ISCA.1999.765944. 17, 33
[46] J. Collins, S. Sair, B. Calder, and D. M. Tullsen. “Pointer Cache Assisted Prefetching.” In
Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002.
DOI: 10.1109/MICRO.2002.1176239. 17
[47] R. Cooksey, S. Jourdan, and D. Grunwald. “A Stateless, Content-Directed Data Prefetch-
ing Mechanism.” In Proc. of the 10th International Conference on Architectural Support for
Programming Languages and Operating Systems, 2002. DOI: 10.1145/605397.605427. 18
[48] E. Ebrahimi, O. Mutlu, and Y. N. Patt. “Techniques for Bandwidth-Efficient Prefetching
of Linked Data Structures in Hybrid Prefetching Systems.” In Proc. of the 15th Inter-
national Symposium on High Performance Computer Architecture, 2009. DOI: 10.1109/
HPCA.2009.4798232. 18
[49] D. Joseph and D. Grunwald. “Prefetching Using Markov Predictors.” In Proc. of
the 24th Annual International Symposium on Computer Architecture, 1997. DOI:
10.1145/264107.264207. 18, 19
[50] D. Joseph and D. Grunwald. “Prefetching Using Markov Predictors.” IEEE Transactions
on Computers, v. 48 no. 2, 1999. DOI: 10.1109/12.752653. 18
[51] A.-C. Lai, C. Fide, and B. Falsafi. “Dead-Block Prediction and Dead-Block Correlating
Prefetchers.” In Proc. of the 28th Annual International Symposium on Computer Architecture,
2001. DOI: 10.1145/379240.379259. 19, 20, 21
[52] Y. Solihin, J. Lee, and J. Torrellas. “Using a User-Level Memory Thread for Correlation
Prefetching.” In Proc. of the 29th Annual International Symposium on Computer Architecture,
May 2002. DOI: 10.1109/ISCA.2002.1003576. 20, 22
[53] Y. Solihin, J. Lee, and J. Torrellas. “Correlation Prefetching with a User-Level Memory
Thread.” IEEE Transactions on Parallel and Distributed Systems, v. 14, no. 6, 2003. DOI:
10.1109/TPDS.2003.1206504. 20
[54] M. Ferdman and B. Falsafi. “Last-Touch Correlated Data Streaming.” In IEEE Interna-
tional Symposium on Performance Analysis of Systems and Software, 2007. DOI: 10.1109/
ISPASS.2007.363741. 20, 21, 22
[55] T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos. “Temporal Streams
in Commercial Server Applications.” In Proc. of the IEEE International Symposium on
Workload Characterization, 2008. DOI: 10.1109/IISWC.2008.4636095. 20, 22, 32
[56] Y. Chou, B. Fahs, and S. Abraham. “Microarchitecture Optimizations for Exploiting
Memory-Level Parallelism.” In Proc. of the 31st Annual International Symposium on Com-
puter Architecture, 2004. DOI: 10.1145/1028176.1006708. 20
[57] Y. Chou. “Low-Cost Epoch-Based Correlation Prefetching for Commercial Applica-
tions.” In Proc. of the 40th Annual ACM/IEEE International Symposium on Microarchitec-
ture, 2007. DOI: 10.1109/MICRO.2007.39. 20
[58] N. Kohout, S. Choi, D. Kim, and D. Yeung. “Multi-Chain Prefetching: Effective Ex-
ploitation of Inter-Chain Memory Parallelism for Pointer-Chasing Codes.” In Proc. of the
International Conference on Parallel Architectures and Compilation Techniques, 2001. DOI:
10.1109/PACT.2001.953307. 24
[59] P. Díaz and M. Cintra. “Stream Chaining: Exploiting Multiple Levels of Correlation
in Data Prefetching.” In Proc. of the 36th Annual International Symposium on Computer
Architecture, 2009. DOI: 10.1145/1555754.1555767. 24
[60] A.-C. Lai and B. Falsafi. “Selective, Accurate, and Timely Self-Invalidation Using Last-
Touch Prediction.” In Proc. of the 27th Annual International Symposium on Computer Ar-
chitecture, 2000. DOI: 10.1145/339647.339669. 20, 28
[61] Z. Hu, S. Kaxiras, and M. Martonosi. “Timekeeping in the Memory System: Predicting
and Optimizing Memory Behavior.” In Proc. of the 29th Annual International Symposium
on Computer Architecture, 2002. DOI: 10.1109/ISCA.2002.1003579. 20, 21
[62] H. Liu, M. Ferdman, J. Huh, and D. Burger. “Cache Bursts: A New Approach for
Eliminating Dead Blocks and Increasing Cache Efficiency.” In Proc. of the 41st An-
nual ACM/IEEE International Symposium on Microarchitecture, 2008. DOI: 10.1109/
MICRO.2008.4771793. 20, 21
[63] T. R. Puzak. Analysis of Cache Replacement Algorithms. Ph.D. diss., University of Massachusetts,
Amherst, 1985. 20
[64] A. Mendelson, D. Thiebaut, and D. K. Pradhan. “Modeling Live and Dead Lines
in Cache Memory Systems.” IEEE Transactions on Computers, v. 42, no. 1, 1993. DOI:
10.1109/12.192209. 20
[65] D. A. Wood, M. D. Hill, and R. E. Kessler. “A Model for Estimating Trace-Sample Miss
Ratios.” In Proc. of the 1991 ACM SIGMETRICS Conference on Measurement and Model-
ing of Computer Systems, 1991. DOI: 10.1145/107971.107981. 20
[66] Z. Hu, M. Martonosi, and S. Kaxiras. “TCP: Tag Correlating Prefetchers.” In Proc. of the
9th IEEE Symposium on High-Performance Computer Architecture, 2003. DOI: 10.1109/
HPCA.2003.1183549. 21
[67] T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos. “Practical Off-Chip
Meta-Data for Temporal Memory Streaming.” In Proc. of the 15th International Symposium
on High Performance Computer Architecture, 2009. DOI: 10.1109/HPCA.2009.4798239.
22, 25
[68] K. J. Nesbit and J. E. Smith. “Data Cache Prefetching Using a Global History Buffer.” In
Proc. of the 10th IEEE Symposium on High-Performance Computer Architecture, 2004. DOI:
10.1109/HPCA.2004.10030. 22, 23, 24, 26, 27, 28
[69] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. “AC/DC: An Adaptive Data Cache
Prefetcher.” In Proc. of the 13th International Conference on Parallel Architectures and Com-
pilation Techniques, 2004. DOI: 10.1109/PACT.2004.1342548. 23, 24, 27
[70] T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi and A. Moshovos. “Making Ad-
dress-Correlated Prefetching Practical.” IEEE Micro, v. 30, no. 1, 2010. DOI: 10.1109/
MM.2010.21. 25
[71] I. Chung, C. Kim, H.-F. Wen, and G. Cong. “Application Data Prefetching on the IBM
Blue Gene/Q Supercomputer.” In Proc. of the International Conference on High Performance
Computing, Networking, Storage and Analysis, 2012. DOI: 10.1109/SC.2012.19. 25
[72] A. Jain and C. Lin. “Linearizing Irregular Memory Accesses for Improved Correlated
Prefetching.” In Proc. of the 46th Annual ACM/IEEE International Symposium on Microar-
chitecture, 2013. DOI: 10.1145/2540708.2540730. 22, 25
[73] G. B. Kandiraju and A. Sivasubramaniam. “Going the Distance for TLB Prefetching: An
Application-Driven Study.” In Proc. of the 29th Annual International Symposium on Com-
puter Architecture, 2002. DOI: 10.1109/ISCA.2002.1003578. 27
[74] M. Grannaes, M. Jahre, and L. Natvig. “Storage Efficient Hardware Prefetching Using
Delta Correlating Prediction Tables.” Journal of Instruction-Level Parallelism, v. 13, 2011.
28
[75] M. Dimitrov and H. Zhou. “Combining Local and Global History for High Perfor-
mance Data Prefetching.” Journal of Instruction-Level Parallelism, v. 13, 2011. 28
[76] G. Liu, Z. Huang, J-K. Peir, X. Shi, and L. Peng. “Enhancements for Accurate and
Timely Streaming Prefetcher.” Journal of Instruction-Level Parallelism, v. 13, 2011. 28
[77] L. M. Ramos, J. L. Briz, P. E. Ibáñez, and V. Viñals. “Multi-Level Adaptive Prefetching
Based on Performance Gradient Tracking.” Journal of Instruction-Level Parallelism, v. 13,
2011. 28
[78] A. Sharif and H.-H. Lee. “Data Prefetching by Exploiting Global and Local Access Pat-
terns.” Journal of Instruction-Level Parallelism, v. 13, 2011. 28
[79] S. S. Mukherjee and M. D. Hill. “Using Prediction to Accelerate Coherence Protocols.”
In Proc. of the 25th Annual International Symposium on Computer Architecture, 1998. DOI:
10.1109/ISCA.1998.694773. 28
[80] S. Kaxiras and J. R. Goodman. “Improving CC-NUMA Performance Using Instruc-
tion-Based Prediction.” In Proc. of the 5th International Symposium on High-Performance
Computer Architecture, 1999. DOI: 10.1109/HPCA.1999.744359. 28
[81] S. Kumar and C. Wilkerson. “Exploiting Spatial Locality in Data Caches Using Spatial
Footprints.” In Proc. of the 25th Annual International Symposium on Computer Architecture,
1998. DOI: 10.1145/279358.279404. 28, 30
[82] S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory
Streaming.” In Proc. of the 33rd Annual International Symposium on Computer Architecture,
2006. DOI: 10.1109/ISCA.2006.38. 29, 31, 32
[83] C. F. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos. “Accurate and Complexity-Effective
Spatial Pattern Prediction.” In Proc. of the 10th IEEE Symposium on High-Performance
Computer Architecture, 2004. DOI: 10.1109/HPCA.2004.10010. 29, 30
[84] M. Ferdman, S. Somogyi, and B. Falsafi. “Spatial Memory Streaming with Rotated Pat-
terns.” 1st JILP Data Prefetching Championship, 2009. 29
[85] S. Somogyi, T. F. Wenisch, M. Ferdman, and B. Falsafi. “Spatial Memory Streaming.”
Journal of Instruction-Level Parallelism, v. 13, 2011. 29
[86] A. Seznec. “Decoupled Sectored Caches: Conciliating Low Tag Implementation Cost
and Low Miss Ratio.” In Proc. of the 21st Annual International Symposium on Computer
Architecture, 1994. DOI: 10.1145/191995.192072. 30
[87] M. D. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. “Gated-Vdd: A
Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories.” In
Proc. of the International Symposium on Low Power Electronics and Design, 2000. DOI:
10.1145/344166.344526. 30
[88] J. F. Cantin, M. H. Lipasti, and J. E. Smith. “Stealth Prefetching.” In Proc. of the 12th
International Conference on Architectural Support for Programming Languages and Operating
Systems, 2006. DOI: 10.1145/1168857.1168892. 31
[89] I. Burcea, S. Somogyi, A. Moshovos, and B. Falsafi. “Predictor Virtualization.” In Proc. of
the 13th International Conference on Architectural Support for Programming Languages and
Operating Systems, 2008. DOI: 10.1145/1346281.1346301. 32
[90] T. F. Wenisch. Temporal Memory Streaming. Ph.D. diss., Carnegie Mellon University, 2007.
32
[91] S. Somogyi, T. F. Wenisch, A. Ailamaki, and B. Falsafi. “Spatio-Temporal Memory
Streaming.” In Proc. of the 36th Annual International Symposium on Computer Architecture,
2009. DOI: 10.1145/1555754.1555766. 32, 33
[92] M. Annavaram, J. M. Patel, and E. S. Davidson. “Data Prefetching by Dependence Graph
Precomputation.” In Proc. of the 28th Annual International Symposium on Computer Archi-
tecture, 2001. DOI: 10.1109/ISCA.2001.937432. 33
[93] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen.
“Speculative Precomputation: Long-Range Prefetching of Delinquent Loads.” In
Proc. of the 28th Annual International Symposium on Computer Architecture, 2001. DOI:
10.1145/379240.379248. 33
[94] J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen. “Dynamic Speculative Precompu-
tation.” In Proc. of the 34th Annual ACM/IEEE International Symposium on Microarchitec-
ture, 2001. 33
[95] I. Ganusov and M. Burtscher. “Future Execution: A Hardware Prefetching Technique for
Chip Multiprocessors.” In Proc. of the 14th International Conference on Parallel Architectures
and Compilation Techniques, 2005. DOI: 10.1109/PACT.2005.23. 33
[96] I. Ganusov and M. Burtscher. “Future Execution: A Prefetching Mechanism that Uses
Multiple Cores to Speed Up Single Threads.” ACM Transactions on Architecture and Code
Optimization, v. 3, no. 4, 2006. DOI: 10.1145/1187976.1187979. 33
[97] J. Lee, C. Jung, D. Lim, and Y. Solihin. “Prefetching with Helper Threads for Loosely
Coupled Multiprocessor Systems.” IEEE Transactions on Parallel and Distributed Systems,
v. 20, no. 9, 2009. DOI: 10.1109/TPDS.2008.224. 33
[98] W. Zhang, D. M. Tullsen, and B. Calder. “Accelerating and Adapting Precomputation
Threads for Efficient Prefetching.” In Proc. of the 13th International Symposium on High
Performance Computer Architecture, 2007. DOI: 10.1109/HPCA.2007.346187. 33
[99] R. S. Chappell, F. Tseng, A. Yoaz, and Y. N. Patt. “Microarchitectural Support for Pre-
computation Microthreads.” In Proc. of the 35th Annual ACM/IEEE International Sympo-
sium on Microarchitecture, 2002. DOI: 10.1109/MICRO.2002.1176240. 33
[100] M. Kamruzzaman, S. Swanson, and D. M. Tullsen. “Inter-Core Prefetching for Multicore
Processors Using Migrating Helper Threads.” In Proc. of the 16th International Conference
on Architectural Support for Programming Languages and Operating Systems, 2011. DOI:
10.1145/1950365.1950411. 33
[101] A. Roth and G. S. Sohi. “Speculative Data-Driven Multithreading.” In Proc. of the 7th In-
ternational Symposium on High-Performance Computer Architecture, 2001. DOI: 10.1109/
HPCA.2001.903250. 33
[102] J. Dundas and T. N. Mudge. “Improving Data Cache Performance by Pre-Executing
Instructions under a Cache Miss.” In Proc. of the 11th Annual International Conference on
Supercomputing, 1997. DOI: 10.1145/263580.263597. 34
[103] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. “Runahead Execution: An Alternative
to Very Large Instruction Windows for Out-Of-Order Processors.” In Proc. of the 9th In-
ternational Symposium on High-Performance Computer Architecture, 2003. DOI: 10.1109/
HPCA.2003.1183532. 34
[104] O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt. “Runahead Execution: An Effective Al-
ternative to Large Instruction Windows.” IEEE Micro, v. 23, no. 6, 2003. DOI: 10.1109/
MM.2003.1261383. 34
[105] O. Mutlu, H. Kim, and Y. N. Patt. “Techniques for Efficient Processing in Runahead
Execution Engines.” In Proc. of the 32nd Annual International Symposium on Computer
Architecture, 2005. DOI: 10.1109/ISCA.2005.49. 34
[106] O. Mutlu, H. Kim, and Y. N. Patt. “Efficient Runahead Execution: Power-Efficient Mem-
ory Latency Tolerance.” IEEE Micro, v. 26, no. 1, 2006. DOI: 10.1109/MM.2006.10. 34
[107] S. T. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and M. Upton. “Continual Flow
Pipelines.” In Proc. of the 11th International Conference on Architectural Support for Programming
Languages and Operating Systems, 2004. DOI: 10.1145/1024393.1024407. 34
[108] A. Hilton, S. Nagarakatte, and A. Roth. “iCFP: Tolerating All-Level Cache Misses in
In-Order Processors.” In Proc. of the 15th International Symposium on High Performance
Computer Architecture, 2009. 34
[109] H. Cui and S. Sair. “Extending Data Prefetching to Cope with Context Switch
Misses.” In Proc. of the International Conference on Computer Design, 2009. DOI: 10.1109/
ICCD.2009.5413144. 34, 35
[110] D. Daly and H. W. Cain. “Cache Restoration for Highly Partitioned Virtualized Sys-
tems.” In Proc. of the 18th Annual International Symposium on High Performance Computer
Architecture, 2012. DOI: 10.1109/HPCA.2012.6169029. 34
[111] J. Zebchuk, H. W. Cain, X. Tong, V. Srinivasan and A. Moshovos. “RECAP: A
Region-Based Cure for the Common Cold (Cache).” In Proc. of the 19th An-
nual International Symposium on High Performance Computer Architecture, 2013. DOI:
10.1145/2370816.2370887. 34
[112] K. Chakraborty, P. M. Wells, and G. S. Sohi. “Computation Spreading: Employing Hard-
ware Migration to Specialize CMP Cores On-The-Fly.” In Proc. of the 12th International
Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
DOI: 10.1145/1168857.1168893. 35
[113] I. Atta, P. Tozun, A. Ailamaki, and A. Moshovos. “SLICC: Self-Assembly of Instruction
Cache Collectives for OLTP Workloads.” In Proc. of the 45th Annual ACM/IEEE
International Symposium on Microarchitecture, 2012. DOI: 10.1109/MICRO.2012.26. 35
[114] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt. “Feedback Directed Prefetching: Improving
the Performance and Bandwidth-Efficiency of Hardware Prefetchers.” In Proc. of the 13th
International Symposium on High Performance Computer Architecture, 2007. DOI: 10.1109/
HPCA.2007.346185.
[115] E. Ebrahimi, O. Mutlu, C. J. Lee, and Y. N. Patt. “Coordinated Control of Multiple Prefetch-
ers in Multi-Core Systems.” In Proc. of the 42nd Annual ACM/IEEE International Sym-
posium on Microarchitecture, 2009. DOI: 10.1145/1669112.1669154. 35
[116] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. “Improving Memory Bank-Level Parallel-
ism in the Presence of Prefetching.” In Proc. of the 42nd Annual ACM/IEEE International
Symposium on Microarchitecture, 2009. DOI: 10.1145/1669112.1669155. 35
[117] W.-F. Lin, S. K. Reinhardt, and D. Burger. “Reducing DRAM Latencies with an Integrated
Memory Hierarchy Design.” In Proc. of the 7th International Symposium on High Perfor-
mance Computer Architecture, 2001. DOI: 10.1109/HPCA.2001.903272. 35
[118] C.-J. Wu, A. Jaleel, M. Martonosi, S. Steely, Jr., and J. Emer. “PACMan: Prefetch-
Aware Cache Management for High Performance Caching.” In Proc. of the 44th
Annual ACM/IEEE International Symposium on Microarchitecture, 2011. DOI:
10.1145/2155620.2155672. 35
[119] S. Verma, D. M. Koppelman, and L. Peng. “Efficient Prefetching with Hybrid Schemes
and Use of Program Feedback to Adjust Prefetcher Aggressiveness.” Journal of Instruc-
tion-Level Parallelism, v. 13, 2011. 35
[120] S. Chen, A. Ailamaki, P. B. Gibbons, and T. C. Mowry. “Improving Hash Join Perfor-
mance through Prefetching.” In Proc. of the 20th International Conference on Data Engi-
neering, 2004. DOI: 10.1109/ICDE.2004.1319989. 36
[121] S. Chen, P. B. Gibbons, and T. C. Mowry. “Improving Index Performance through
Prefetching.” In Proc. of the ACM SIGMOD International Conference on Management of
Data, 2001. DOI: 10.1145/375663.375688. 36
[122] T. C. Mowry, M. S. Lam, and A. Gupta. “Design and Evaluation of a Compiler Algorithm
for Prefetching.” In Proc. of the 5th International Conference on Architectural Support for Pro-
gramming Languages and Operating Systems, 1992. DOI: 10.1145/143365.143488. 36
[123] Z. Wang, D. Burger, K. S. McKinley, S. K. Reinhardt, and C. C. Weems. “Guided Region
Prefetching: A Cooperative Hardware/Software Approach.” In Proc. of the 30th Annual
International Symposium on Computer Architecture, 2003. DOI: 10.1145/859618.859663. 36
[124] D. Koufaty, X. Chen, D. Poulsen, and J. Torrellas. “Data Forwarding in Scalable
Shared-Memory Multiprocessors.” In Proc. of the 9th Annual International Conference on
Supercomputing, 1995. DOI: 10.1145/224538.224569. 36
[125] C.-K. Luk and T. C. Mowry. “Memory Forwarding: Enabling Aggressive Layout Op-
timizations by Guaranteeing the Safety of Data Relocation.” In Proc. of the 26th Annual
International Symposium on Computer Architecture, 1999. DOI: 10.1145/300979.300987. 36
[126] J. H. Ahn, W. J. Dally, B. Khailany, U. J. Kapasi, and A. Das. “Evaluating the Imagine
Stream Architecture.” In Proc. of the 31st Annual International Symposium on Computer
Architecture, 2004. 36
[127] W. J. Dally, F. Labonte, A. Das, P. Hanrahan, J.-H. Ahn, J. Gummaraju, M. Erez, N.
Jayasena, I. Buck, T. J. Knight, and U. J. Kapasi. “Merrimac: Supercomputing with
Streams.” In Proc. of Supercomputing, 2003. DOI: 10.1145/1048935.1050187. 36
[128] M. I. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, A. A. Lamb, C. Leger, J.
Wong, H. Hoffmann, D. Maze, and S. Amarasinghe. “A Stream Compiler for Com-
munication-Exposed Architectures.” In Proc. of the 10th International Conference on
Architectural Support for Programming Languages and Operating Systems, 2002. DOI:
10.1145/605397.605428. 36
[129] J. Gummaraju and M. Rosenblum. “Stream Programming on General-Purpose Proces-
sors.” In Proc. of the 38th Annual ACM/IEEE International Symposium on Microarchitec-
ture, 2005. DOI: 10.1109/MICRO.2005.32. 36
[130] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. “Dark Silicon
and the End of Multicore Scaling.” In Proc. of the 38th Annual International Symposium
on Computer Architecture, 2011. DOI: 10.1145/2000064.2000108. 40
[131] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. “Toward Dark Silicon in Serv-
ers.” IEEE Micro, v. 31, no. 4, 2011. DOI: 10.1109/MM.2011.77. 40
Author Biographies
Babak Falsafi is a Professor in the School of Computer and Communication Sciences at EPFL,
and the founding director of the EcoCloud research center, targeting future energy-efficient and
environmentally friendly cloud technologies. He has made numerous contributions to computer
system design and evaluation including: a scalable multiprocessor architecture that laid the founda-
tion for the Sun (now Oracle) WildFire servers; snoop filters; temporal stream prefetchers that are
incorporated into IBM BlueGene/P and BlueGene/Q; and computer system simulation sampling
methodologies that have been in use by AMD and HP for research and product development.
His most notable contribution has been to be the first to show that, contrary to conventional
wisdom, the multiprocessor memory programming models known as memory consistency models,
which are prevalent in all modern systems, are neither necessary nor sufficient to achieve high
performance. He is a
recipient of an NSF CAREER award, IBM Faculty Partnership Awards, and an Alfred P. Sloan
Research Fellowship. He is a fellow of IEEE.
Thomas Wenisch is an Associate Professor of Computer Science and Engineering at the
University of Michigan, specializing in computer architecture. His prior research includes memory
streaming for commercial server applications, store-wait-free multiprocessor memory systems,
memory disaggregation, and rigorous sampling-based performance evaluation methodologies. His
ongoing work focuses on computational sprinting, memory persistency, data center architecture,
energy-efficient server design, and accelerators for medical imaging. Wenisch received the NSF
CAREER award in 2009 and the University of Michigan Henry Russel Award in 2013. He re-
ceived his Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University.