0% found this document useful (0 votes)
14 views12 pages

HELIX-RC - An Architecture-Compiler Co-Design For Automatic Parallelization of Irregular Programs (Campanoni14-Isca)

This document describes a new architecture-compiler co-design called HELIX-RC that aims to improve parallelization of irregular programs. It proposes lightweight architectural enhancements for fast inter-core communication to support a compiler that parallelizes loop iterations across multiple cores. Simulations show an average 6.85x speedup on SPEC CINT2000 benchmarks compared to 1-4x speedup from prior work.

Uploaded by

metar62276
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views12 pages

HELIX-RC - An Architecture-Compiler Co-Design For Automatic Parallelization of Irregular Programs (Campanoni14-Isca)

This document describes a new architecture-compiler co-design called HELIX-RC that aims to improve parallelization of irregular programs. It proposes lightweight architectural enhancements for fast inter-core communication to support a compiler that parallelizes loop iterations across multiple cores. Simulations show an average 6.85x speedup on SPEC CINT2000 benchmarks compared to 1-4x speedup from prior work.

Uploaded by

metar62276
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

HELIX-RC: An Architecture-Compiler Co-Design

for Automatic Parallelization of Irregular Programs

Simone Campanoni Kevin Brownell Svilen Kanev Timothy M. Jones+ Gu-Yeon Wei David Brooks

Harvard University University of Cambridge+


Cambridge, MA, USA Cambridge, UK

Abstract overheads to support misspeculation and must therefore target


Data dependences in sequential programs limit paralleliza- relatively large loops to amortize penalties.
tion because extracted threads cannot run independently. Al- In contrast to existing parallelization solutions, we propose
though thread-level speculation can avoid the need for precise an alternate strategy that instead targets small loops, which are
dependence analysis, communication overheads required to syn- much easier to analyze via state-of-the-art control and data flow
chronize actual dependences counteract the benefits of paral- analysis, significantly improving accuracy. Furthermore, this
lelization. To address these challenges, we propose a lightweight ease of analysis enables transformations that simply re-compute
architectural enhancement co-designed with a parallelizing com- shared variables in order to remove a large fraction of actual
piler, which together can decouple communication from thread dependences. This strategy increases TLP and reduces core-to-
execution. Simulations of these approaches, applied to a pro- core communication. Such optimizations do not readily translate
cessor with 16 Intel Atom-like cores, show an average of 6.85× to TLS because the complexity of TLS-targeted code typically
performance speedup for six SPEC CINT2000 benchmarks. spans multiple procedures in larger loops. Finally, our data shows
parallelizing small hot loops yields high program coverage and
1. Introduction produces meaningful speedups for the non-numerical programs
in the SPEC CPU2000 suite.
In today’s multicore era, program performance largely depends
on the amount of thread level parallelism (TLP) available. While Targeting small loops presents its own set of challenges. Even
some computing problems often translate to either inherently after extensive code analysis and optimizations, small hot loops
parallel or easy-to-parallelize numerical programs, sequentially will retain actual dependences, typically to share dynamically
designed, non-numerical programs with complicated control allocated data. Moreover, since loop iterations of small loops
(e.g., execution path) and data flow (e.g., aliasing) are much tend to be short in duration (less than 25 clock cycles on average),
more common, but difficult to analyze precisely. These non- they require frequent, memory-mediated communication. At-
numerical programs are the focus of this paper. While conven- tempting to run these iterations in parallel demands low-latency
tional wisdom is that non-numerical programs cannot make good core-to-core communication for memory traffic, something not
use of multiple cores, research in the last decade has made steady available in commodity multicore processors.
progress towards extracting TLP from complex, sequentially- To meet these demands, we present HELIX-RC, a co-designed
designed programs such as the integer benchmarks from SPEC architecture-compiler parallelization framework for chip multi-
CPU suites [3, 6, 27, 45, 48]. To further extend this body of processors. The compiler identifies what data must be shared
research, this paper presents lightweight architectural enhance- between cores and the architecture proactively circulates this data
ments for fast inter-core communication in order to support along with synchronization signals among cores. Rather than
advances in a custom compiler framework that parallelizes loop waiting for a request, this proactive communication immediately
iterations across multiple cores within a chip multiprocessor. circulates shared data, as early as possible—decoupling commu-
Performance gains sought by parallelizing loop iterations of nication from computation. HELIX-RC builds on the HCCv1
non-numerical programs depend on two key factors: (i) accuracy compiler, developed for the first iteration of HELIX [6, 7], that
of the data dependence analysis and (ii) speed of communica- automatically generates parallel code for commodity multicore
tion provided by the underlying computer architecture to satisfy processors. Because performance improvements from HCCv1
the dependences. Unfortunately, complex control and data flow saturate at four cores due to communication latency, we pro-
in non-numerical programs—both exacerbated by ambiguous pose ring cache—an architectural enhancement that facilitates
pointers and ambiguous indirect calls—make accurate data de- low-latency core-to-core communication—to satisfy inter-thread
pendence analysis difficult. In addition to actual dependences memory dependences and relies on guarantees provided by the
that require communication between cores, a compiler must co-designed HCCv3 compiler to keep it lightweight.
conservatively handle apparent dependences never realized at HELIX-RC automatically parallelizes non-numerical program
runtime. While thread level speculation (TLS) avoids the need with unmatched performance improvements. Across a range of
for accurate data dependence analysis by speculating that some SPEC CINT2000 benchmarks, decoupling communication en-
apparent dependences are not realized [29, 38, 39], TLS suffers ables a three-fold improvement in performance when compared

978-1-4799-4394-4/14/$31.00 c 2014 IEEE


to HCCv1, on a simulated multicore processor consisting of 16 16
14 HCCv1 HCCv2 Numerical
Atom-like, in-order cores with a ring cache with 1KB per node

Program speedup
Programs
12
of memory (32× smaller than the L1 data cache). The proposed Non-Numerical
10
system offers an average speedup of 6.85× when compared to Programs
8
running un-parallelized code on a single core. Detailed evalua- 6
tions show that even with a conservative ring cache configuration, 4
HELIX-RC is able to achieve 95% of the speedup possible with 2
0
unlimited resources (i.e., unbounded bandwidth, instantaneous

ip

er

ke

rt

a
cf
f

eo n

n
ol
vp

ip

es

ea

ea
ea

a
gz

m
rs

ua
tw

9.
bz
5.

am

m
1.
inter-core communication, and unconstrained size). Moreover,

pa

m
m
4.

17
0.

eq
17

6.

7.
18

eo

eo
16

8.
7.

30

25

3.

17
18
19

G
18
simulations for a HELIX-RC system comprising 16 out-of-order

FP
IN
cores show 3.8× performance speedup for the same set of non-
numerical programs. This result confirms HELIX-RC’s ability Figure 1: Improving the HCCv1 compiler alone does not improve
to extract TLP on top of the instruction level parallelism (ILP) performance for SPEC CINT2000 benchmarks.
provided by an out-of-order processor.
The remainder of this paper further describes the motivation communication latency.1 The engineering improvements of
for and results of implementing HELIX-RC. We first review HCCv2 significantly raised speedups for numerical programs
the limitations of compiler-only improvements and identify co- (SPEC CFP2000) over HCCv1 from 2.4× to 11×. HCCv2
design opportunities to improve TLP of loop iterations. Next, successfully parallelized the numerical programs because the
we explore the speedups obtained by decoupling communication accuracy of the data dependence analysis is high for loops at
from computation with compiler support. After describing the almost any level of the loop nesting hierarchy. Furthermore, the
overall HELIX-RC approach, we delve deeper into both the com- improved compiler removed the remaining actual dependences
piler and the hardware enhancement. Finally, we use a detailed among registers (e.g., via parallel reduction) to generate loops
simulation framework to evaluate the performance of HELIX-RC with long iterations that can run in parallel on different cores.
and analyze its sensitivity to architectural parameters. Unfortunately, non-numerical programs (SPEC CINT) are
not as compliant to compiler improvements and saw little to no
2. Background and Opportunities
benefit from HCCv2. Because core-to-core communication in
2.1. Limits of compiler-only improvements conventional systems is expensive, the compiler must parallelize
large loops (the larger the loop with loop-carried dependences,
To understand what limits the performance of parallel code ex- the less frequently cores synchronize), which limits the accuracy
tracted from non-numerical programs, we started with HCCv1 [6, of dependence analysis and thereby limits TLP extraction. This
7], a state-of-the-art parallelizing compiler. is why HELIX-RC focuses on small (hot) loops to parallelize this
HCCv1. This first generation compiler automatically gener- class of programs. Our hypothesis is that modest architectural
ates parallel threads from sequential programs by distributing enhancements, co-designed with a compiler that targets small
successive loop iterations across adjacent cores within a single loops, can successfully parallelize non-numerical programs.
multicore processor, similar to conventional DOACROSS par-
allelism [10]. Since there are data dependences between loop 2.2. Opportunity
iterations (i.e., loop-carried dependences), some segments of There is an opportunity to aggressively parallelize non-numerical
a loop’s body—called sequential segments—must execute in programs based on the following insights: (i) small loops are
iteration order on the separate cores to preserve the semantics of easier to analyze with high accuracy; (ii) predictable computa-
sequential code. Synchronization operations mark the beginning tion means most of the required communication updates shared
and end of each sequential segment. memory locations; (iii) we can efficiently satisfy communication
HCCv1 includes a large set of code optimizations (e.g., code demands of actual dependences for small loops with low-latency,
scheduling, method inlining, loop unrolling), most of which core-to-core communication; and (iv) proactive communication
are specifically tuned to extract TLP. Despite this, performance efficiently hides communication latencies.
improvements obtained by the original HCCv1 compiler saturate Accurate data dependence analysis is possible for small
at four cores, due to high core-to-core communication latency. loops in non-numerical programs. The accuracy of data de-
HCCv2. We first improved code analysis and transformation. pendence analysis increases for smaller loops because (i) there
Specifically, we increased the accuracy of both data dependence is less code—therefore less complexity—to analyze and (ii) the
and induction variable analysis, and we added other transforma- number of possible aliases of a pointer in the code scales down
tions to extract more parallelism (e.g., scalar expansion, scalar with code size. In other words, we can avoid conservative pointer
renaming, parallel reductions, and loop splitting [1]). We call aliasing assumptions that lower accuracy for large loops.
this improved compiler HCCv2. To evaluate the accuracy of data dependence analysis for small
Figure 1 compares speedups for HCCv1 and HCCv2 based on loops using modern compilers, we started with a state-of-the-art
simulations of parallel code generated by each when targeting
a 16-core processor with an optimistic 10-cycle core-to-core 1 Details of this experiment are in Section 6.
0% 48% 81% 100% 15% 100%
Mem Register
VLLPA +flow sensitive +path based +data type +lib calls Re-compute
Mem

Figure 2: Accuracy of data dependence analysis for small hot Figure 3: Predictability of variables reduces register communi-
loops in SPEC CINT2000 benchmarks. cation.

analysis called VLLPA [13]. Figure 2 shows the starting accu- plot shows that more than half of the loop iterations complete
racy (i.e., average number of actual data dependences compared within 25 clock cycles. The plot also delineates the measured
to all dependences identified for our set of loops) of this analysis core-to-core round trip communication latencies for three mod-
is 48%. To improve accuracy, we extended VLLPA (i) to be ern multicore processors. Even for the shortest-latency machine,
fully flow sensitive [8], which tracks values of both registers and Ivy Bridge, 75 cycles is much too long for the majority of these
memory locations according to their position in the code; (ii) short loops. Of course, a conventional region-extending trans-
to be path-based, which names runtime locations by how they formation such as loop unrolling could lengthen the duration
are accessed from program variables [11]; (iii) to exploit data of these inner loops, but this would also increase the lengths of
type and type casting information to conservatively eliminate sequential segments, reducing exploitable parallelism.
incompatible aliases; and (iv) to exploit standard library call se- Proactive communication achieves low latency by decou-
mantics. Figure 2 shows that these extensions increase accuracy pling communication from computation. A compiler must
for small loops to 81%. As a result, most of the loop-carried data conservatively assume dependences exist between all iterations
dependences identified by the compiler are actual and therefore for most of loop-carried dependences in non-numerical programs.
require core-to-core communication. Because of the complexity of control and data flow in such pro-
Most required communication is to update shared mem- grams, a compiler cannot easily infer the distance between a loop
ory locations. Sharing data among loop iterations requires core- iteration that generates data and the ones that consume it. For
to-core communication to propagate new values when loop it- conventional synchronization approaches [6, 25, 26, 43, 47, 48],
erations run on different cores. However, if new values are this assumption of dependences between all subsequent iterations
predictable (e.g., incrementing a shared variable at every itera- leads to sequential chains that severely limit the performance
tion), communication can be avoided. We extended the variable sought by running loop iterations in parallel.3
analysis in HCCv1 to capture the following predictable variables: These sequential chains, which include both communica-
(i) induction variables where the update function is a polyno- tion and computation, have two sources of inefficiency. First,
mial up to the second order; (ii) accumulative, maximum, and adjacent-core synchronization often turns out to not be necessary
minimum variables; (iii) variables set but not used until after for every link of these chains. Second, when data forwarding
the loop; and (iv) variables set in every iteration even when is initiated lazily (at request time), it blocks computation while
the updated value is not constant.2 If a variable falls into any waiting for data transfers between cores.
of these categories, each core can re-compute its correct value Finally, for loops parallelized by HELIX-RC, most communi-
independently. cation is not between successive loop iterations. Hence, because
Exploiting the predictability of variables, again for small loops HELIX-RC distributes successive loop iterations to adjacent
in non-numerical programs, allows the compiler to remove a cores, most communication is not between adjacent cores. Fig-
large fraction of the communication required to share registers. ure 4b charts the distribution of undirected distances between
Figure 3 compares a naive solution that propagates new values data-producing cores and the first consumer core on a platform
for all loop-carried data dependences (100%) versus a solution with 16 cores organized in a ring. Only 15% of those transfers
that exploits variable predictability. By re-computing variables, are between adjacent cores. Moreover, Figure 4c shows that
the majority of the remaining communication is to share memory most of the shared values (86%) from these loops are consumed
locations rather than registers. by multiple cores. Since consumers of shared values are not
Communication for small hot loops must be fast. While known at compile time, HELIX-RC implements a mechanism
the simplicity of small loops allows for easy analysis, small loops that proactively broadcasts data and signals to all other cores.
have short iterations—typically less than 100 clock cycles long. Such proactive communication, which does not block computa-
However, because these short iterations require (at least) some tion, is the cornerstone of the HELIX-RC approach.
communication to run in parallel, efficient parallel execution
demands a low-latency core-to-core communication mechanism. 3. The HELIX-RC Solution
To better understand this need for fast communication, Fig-
ure 4a plots a cumulative distribution of average iteration execu- The goal of HELIX-RC is to decouple all communication re-
tion times on a single Atom-like core (described in Section 6) for quired to efficiently run iterations of small hot loops in parallel.
the set of hot loops from SPEC CINT2000 benchmarks chosen This is realized by decoupling value forwarding from value gen-
for parallelization by HELIX-RC. The shaded portion of the eration and by decoupling signal transmission from synchroniza-
tion. We now show how HELIX-RC achieves such decoupling.
2 This is an example of code replication. Details about this transformation are

outside the scope of this paper. 3 Others have called this chain a critical forwarding path [40, 48].
4 sequential segment or not, a core simply counts the number of
5
3
12%
9%
executed wait and signal instructions. If more waits have
100 39% 6% 6+
12%
1
been executed than matching signals, then the executing code
Percentage of loop iterations

22%
Hop
2
belongs to a sequential segment.
Atom
Memory. The ring cache is a connected ring of nodes, one
Measured cache
coherence latency (b) Distance per core. Each ring node has a cache array that satisfies both
50
Nehalem 5
loads and stores received from its attached core. HELIX-RC
Ivy 34%
6+
does not require other changes to the existing memory hierarchy
Bridge 9%
4 12%
16%
because the ring cache orchestrates interactions with it. To
21% 8%
1
0 3
2
Core avoid any changes to conventional cache coherence protocols,
0 25 75 110 260
Loop iteration execution time (cycles) the ring cache permanently maps each memory address to a
(a) Short loop iterations (c) Consumers unique ring node. All accesses from the distributed ring cache to
the next cache level (L1) go through the associated node for a
Figure 4: Small hot loops have short iterations that send data
corresponding address.
over multiple hops and to multiple cores.
3.2. Decoupling communication from computation
3.1. Approach
Having seen the main components of HELIX-RC, we describe
HELIX-RC is a co-design of its compiler (HCCv3) and archi- how they interact to efficiently decouple communication from
tectural enhancements (the ring cache). HCCv3 distinguishes computation.
parallel code (i.e., code outside any sequential segment) from Shared data communication. HELIX-RC decouples com-
sequential code (i.e., code within sequential segments) by using munication of variables and other shared data locations by prop-
two instructions that extend the instruction set. The ring cache agating new shared data through the ring cache as soon as it
is a ring network that connects together ring nodes attached to is generated. Once a ring node receives a store, it records the
each core in the processor to operate during sequential segments new value and proactively forwards its address and value to an
as a distributed first-level cache that precedes the private L1 adjacent node in the ring cache, all without interrupting the ex-
cache. Because it relies on compiler-guaranteed properties of the ecution of the attached core. The value then propagates from
code, the hardware support can be simple and efficient. The next node to node through the rest of the ring without interrupting
paragraphs summarize the main components of HELIX-RC. the computation of any core—decoupling communication from
ISA. We introduce a pair of instructions—wait and signal— computation.
that mark the beginning and end of a sequential segment. Each Synchronization. Given the difficulty of determining which
of these instructions has an integer value as a parameter that iden- iteration depends on which in non-numerical programs, compil-
tifies the particular sequential segment. The wait instruction ers typically make the conservative assumption that an iteration
blocks execution of the core that issued it (e.g., wait 3) until all depends on all of its predecessor iterations. Therefore, a core
other cores have finished executing the corresponding sequen- cannot execute sequential code until it is unblocked by its prede-
tial segment, which they signify by executing the appropriate cessor [6, 25, 40]. Moreover, an iteration unblocks its successor
signal instruction (e.g., signal 3). only if both it and its predecessors have executed this sequen-
Compiler. HCCv3 takes sequential programs and parallelizes tial segment or if they are not going to. This execution model
loops that are most likely to speed up performance when their leads to a chain of signal propagation across loop iterations that
iterations execute in parallel. Only one loop runs in parallel at includes unnecessary synchronization: even if an iteration is not
a time and its successive iterations run on cores organized as a going to execute sequential code, it still needs to synchronize
unidirectional ring. with its predecessor before unblocking its successor.
To satisfy loop-carried data dependences, HCCv3 keeps the HELIX-RC removes these synchronization overheads by en-
execution of sequential segments in iteration order by inserting abling an iteration to detect the readiness of all predecessor
wait and signal instructions to delimit the entry and exit points iterations, not just one. Therefore, once an iteration forgoes
of these segments. In this way, HCCv3 guarantees that accesses executing the sequential segment, it immediately notifies its suc-
to a variable or another memory location that might need to be cessor without waiting for its predecessor. Unfortunately, while
shared between cores are always within sequential segments. HELIX-RC removes unnecessary synchronization, it increases
Moreover, shared variables (normally allocated to registers in the number of signals that can be in flight simultaneously.
sequential code) are mapped to specially-allocated memory lo- HELIX-RC relies on the new signal instruction to handle
cations. Hence, their accesses within sequential segments occur synchronization signals efficiently. Synchronization between a
via memory operations. producer and a consumer includes (i) the producer generating a
Core. A core forwards all memory accesses within sequential signal, (ii) the consumer requesting that signal, and (iii) signal
segments to its local ring node. All other memory accesses transmission between the two. On a conventional multicore,
(not within a sequential segment) go through the private L1 which relies on a pull-based memory hierarchy for communica-
cache. To determine whether the executing code is part of a tion, signal transmission is inherently lazy, and signal request and
Core 0 Core 1 Core 2
... signal wait wait
Core 0 Core 1
signal
Core 2
signal wait
signal stall

wait 1; wait 1; Data forwarding signal


load
a=load; signal 1; Sequential chain signal
1 a = a+1; Sequential code load
store a; Parallel code
sequential data stall
signal 1;
segment
signal

...
Time
(a) Parallel code (b) Coupled communication (c) Decoupled communication

Figure 5: Example illustrating benefits of decoupling communication from computation.

transmission get serialized. On the other hand, in HELIX-RC, architecture. In this section, we focus on compiler-guaranteed
signal instructs the ring cache to proactively forward a signal code properties that enable a lightweight ring cache design, and
to all other nodes in the ring without interrupting any of the cores, follow up with code optimizations that make use of the ring
thereby decoupling signal transmission from synchronization. cache.
Code example. Given the importance of these decoupling Code properties.
mechanisms to fully realize performance benefits, let’s explore • Only one loop can run in parallel at a time. Apart from a
how HELIX-RC implements them using a concrete example. dedicated core responsible for executing code outside parallel
The code in Figure 5(a), abstracted for clarity, represents a small loops, each core is either executing an iteration of the current
hot loop from 175.vpr of SPEC CINT2000 that is responsible for loop or waiting for the start of the next one.
55% of the total execution time of that program. It contains a • Successive loop iterations are distributed to threads in a round-
sequential segment with two possible execution paths. The left robin manner. Since each thread is pinned to a predefined core,
path contains an actual dependence where instances of instruc- and cores are organized in a unidirectional ring, successive
tion 1 in an iteration use values from previous iterations. The iterations form a logical ring.
right path does not depend on prior data. Because the compiler • Communication between cores executing a parallelized loop
cannot predict the execution path of a particular iteration (due occurs only within sequential segments.
to complex control flow), it must assume that instruction 1, in • Different sequential segments always access different shared
any given iteration, depends on the previous iteration. Therefore, data. HCCv3 only generates multiple sequential segments
it must synchronize all successive iterations by inserting wait when there is no intersection of shared data. Consequently,
and signal instructions on every execution path. Figure 5(b) instances of distinct sequential segments may run in parallel.
highlights this sequential chain in red. Now, assume only iter- • At most two signals per sequential segment emitted by a given
ations 0 and 2, running on cores 0 and 2, respectively, execute core can be in flight at any time. Hence, only two signals per
instruction 1. Then, this sequential chain is unnecessarily long segment need to be tracked by the ring cache.
because of the superfluous wait in iteration 1. Each iteration
This last property eliminates unnecessary wait instructions
waits (via the wait instruction) for the signal generated by the
while keeping the architectural enhancement simple. Eliminat-
signal instruction of the previous iteration. Also, iterations
ing waits allows a core to execute a later loop iteration than its
that update a (iterations 0 and 2) must load previous values first
successor (significantly boosting parallelism). Future iterations,
(using a regular load). Hence, two sets of stalls slow down the
however, produce signals that must be buffered. The last code
chain. First, iteration 1 performs unnecessary synchronization
property prevents a core from getting more than one “lap” ahead
(signal stalls), because it only contains parallel code. Second,
of its successor. So when buffering signals, each ring cache
lazy forwarding of the shared data leads to data stalls, because
node only needs to recognize two types—those from the past
the transfer only begins when requested, at a load, rather than
and those from the future.
when generated, at a store.
Code optimizations. In addition to the optimizations of
HELIX-RC proactively communicates data and synchroniza-
HCCv2, HCCv3 includes ones that are essential for best per-
tion signals between cores, which leads to the more efficient
formance of non-numerical programs on a ring-cache-enhanced
scenario shown in Figure 5(c). The sequential chain now
architecture: aggressive splitting of sequential segments into
only includes the delay required to satisfy the dependence—
smaller code blocks; identification and selection of small hot
communication updating a shared value.
loops; and elimination of unnecessary wait instructions.
4. Compiler Sizing sequential segments poses a tradeoff. Additional seg-
ments created by splitting run in parallel with others, but extra
The decoupled execution model of HELIX-RC described so segments entail extra synchronization, which adds communica-
far is possible given the tight co-design of the compiler and tion overhead. Thanks to decoupling, HCCv3 can split more
L1 Cache Reads/Writes
Data and Signals Remote L1
Request/Reply

DL1 Loads
Cache ReadPort from Core
Cache array Stores/Signals
from Core
WritePort
Ring
node Data and
Signals Data and
Signal buffer Signals
Signal S Signal 1
Past Credits
...
Core Future
Core Link
Credits
Buffers
Control

Figure 6: Ring cache architecture overview. From left to right: overall system; single core slice; ring node internal structure.

aggressively than HCCv2 to significantly increase TLP. Note shared between iterations moves between cores from current to
that segments cannot be split indefinitely—each shared location future iterations. These properties imply that the data involved
must belong to only one segment. in timing-critical dependences that potentially limit overall per-
To identify small hot loops that are most likely to speed up formance are both produced and consumed in the same order as
when their iterations run in parallel, HCCv3 includes a profiler to loop iterations. Furthermore, a ring network topology captures
capture the behavior of the ring cache. Whereas HCCv1 relies on this data flow, as sketched in Figure 6. The following paragraphs
an analytical performance model to select the loops to parallelize, describe the structure and purpose of each ring cache component.
HCCv3 profiles loops on representative inputs. During profiling, Ring node structure. The internal structure of a per-core
instrumentation code emulates execution with the ring cache, ring node is shown in the right half of Figure 6. Parts of this
resulting in an estimate of time saved by parallelization. Finally, structure resemble a simple network router. Unidirectional links
HCCv3 uses a loop nesting graph, annotated with the profiling connect a node to its two neighbors to form the ring backbone.
results, to choose the most promising loops. Bidirectional connections to the core and private L1 cache allow
injection of data into and extraction of data from the ring. There
5. Architecture Enhancements are three separate sets of data links and buffers. A primary set
Adding a ring cache to a multicore architecture enables the proac- forwards data and signals between cores. Two other sets manage
tive circulation of data and signals that boost parallelization. This infrequent traffic for integration with the rest of the memory
section describes the design of the ring cache and its constituent hierarchy (see Section 5.2). Separating these three traffic types
ring nodes. The design is guided by the following objectives: simplifies the design and avoids deadlock. Finally, signals move
Low-latency communication. HELIX-RC relies on fast in lockstep with forwarded data to ensure that a shared memory
communication between cores in a multicore processor for syn- location is not accessed before the data arrives.
chronization and for data sharing between loop iterations. Since In addition to these router-like elements, a ring node also
low-latency communication is possible between physically ad- contains structures more common to caches. A set associative
jacent cores in modern processors, the ring cache implements a cache array stores all data values (and their tags) received by
simple unidirectional ring network. the ring node, whether from a predecessor node or from its
Caching shared values. A compiler cannot easily guarantee associated core. The line size of this cache array is kept at
whether and when shared data generated by a loop iteration one machine word. While the small line is contrary to typical
will be consumed by other cores running subsequent iterations. cache designs, it ensures there will be no false data sharing by
Hence, the ring cache must cache shared data. Keeping shared independent values from the same line.
data on local ring nodes provides quick access for the associated The final structural component of the ring node is the signal
cores. As with data, it is also important to buffer signals in each buffer, which stores signals until they are consumed.
ring node for immediate consumption. Node-to-node connection. The main purpose of the ring
Easy integration. The ring cache is a minimally-invasive cache is to proactively provide many-to-many core communica-
extension to existing multicore systems, easy to adopt and inte- tion in a scalable and low-latency manner. In the unidirectional
grate. It does not require modifications to the existing memory ring formed by the ring nodes, data propagates by value circula-
hierarchy or to cache coherence protocols. tion. Once a ring node receives an (address, value) pair, either
With these objectives in mind, we now describe the internals of from its predecessor, or from its associated core, it stores a local
the ring cache and its interaction with the rest of the architecture. copy in its cache array and propagates the same pair to its suc-
cessor node. The pair eventually propagates through the entire
5.1. Ring Cache Architecture
ring (stopping after a full cycle) so that any core can consume
The ring cache architecture relies on properties of compiled code the data value from its local ring node, as needed.
including: (i) parallelized loop iterations execute in separate This value circulation mechanism allows the ring cache to
threads on separate cores, arranged in a logical ring; and (ii) data communicate between cores faster than reactive systems (like
most coherent cache hierarchies). In a reactive system, data Benchmark Phases Parallel loop coverage
HELIX-RC HCCv2 HCCv1
transfer begins once the receiver requests the shared data, which Integer benchmarks
adds transfer latency to an already latency-critical code path. 164.gzip 12 98.2% 42.3% 42.3%
In contrast, a proactive scheme overlaps transfer latencies with 175.vpr 28 99% 55.1% 55.1%
197.parser 19 98.7% 60.2% 60.2%
computation to lower the receiver’s perceived latency. 300.twolf 18 99% 62.4% 62.4%
The ring cache prioritizes the common case, where data gen- 181.mcf 19 99% 65.3% 65.3%
erated within sequential segments must propagate to all other 256.bzip2 23 99% 72.3% 72.1%
Floating point benchmarks
nodes as quickly as possible. Assuming no contention over 183.equake 7 99% 99% 77.1%
the network and single-cycle node-to-node latency, the design 179.art 11 99% 99% 84.1%
shown in Figure 6 allows us to bound the latency for a full trip 188.ammp 23 99% 99% 60.2%
around the ring to N clock cycles, where N is the number of 177.mesa 8 99% 99% 64.3%

cores. Each ring node prioritizes data received from the ring and Table 1: Characteristics of parallelized benchmarks.
stalls injection from its local core.
In order to eliminate buffering delays within the node that 5.2. Memory Hierarchy Integration
are not due to L1 traffic, the number of write ports in each
node’s cache array must match the link bandwidth between two The ring cache is a level within the cache hierarchy and as such
nodes. While this may seem like an onerous design constraint must not break any consistency guarantees that the hierarchy
for the cache array, Section 6.3 shows that just one write port is normally provides. Consistency between the ring cache and the
sufficient to reap more than 99% of the ideal-case benefits. conventional memory hierarchy results from the following invari-
To ensure correctness under network contention, the ring ants: (i) shared memory can only be accessed within sequential
cache is sometimes forced to stall all messages (data and signals) segments through the ring cache (compiler-enforced); (ii) only
traveling along the ring. The only events that can cause con- a uniquely assigned owner node can read or write a particular
tention and stalls are ring cache misses and evictions, which may shared memory location through the L1 cache on a ring cache
then need to fetch data from a remote L1 cache. While these ring miss (ring cache-enforced); and (iii) the cache coherence proto-
stalls are necessary to guarantee correctness, they are infrequent. col preserves the order of stores to a memory location through a
The ring cache relies on credit-based flow control [17] and is particular L1 cache.5
deadlock free. Each ring node has at least two buffers attached to Sequential consistency. To preserve the semantics of a paral-
the incoming links to guarantee forward progress. The network lelized single-threaded program, memory operations on shared
maintains the invariant that there is always at least one empty values require sequential consistency. The ring cache meets this
buffer per set of links somewhere in the ring. That is why a node requirement by leveraging the unidirectional data flow guaran-
only injects new data from its associated core into the ring when teed by the compiler. Sequential consistency must be preserved
there is no data from a predecessor node to forward. when ring cache values reach lower-level caches, but the con-
Node-core integration. Ring nodes are connected to their sistency model provided by conventional memory hierarchies
respective cores as the closest level in the cache hierarchy (Fig- is weaker. We resolve this difference by introducing a single
ure 6). The core’s interface to the ring cache is through regular serialization point per memory location, namely a unique owner
loads and stores for memory accesses in sequential segments. node responsible for all interactions with the rest of the mem-
As previously discussed, wait and signal instructions delin- ory hierarchy. When a shared value is moved between the ring
eate code within a sequential segment. A thread that needs to cache and L1 caches (owing to occasional ring cache load misses
enter a sequential segment first executes a wait, which only re- and evictions), only its owner node can perform the required
turns from the associated ring node when matching signals have L1 cache accesses. This solution preserves existing consistency
been received from all other cores executing prior loop iterations. models with minimal impact on performance.
The signal buffer within the ring node enforces this. Specialized Cache flush. Finally, to guarantee coherence between paral-
core logic detects the start of the sequential segment and routes lelized loops and serial code between loop invocations, each ring
memory operations to the ring cache.4 Finally, executing the node flushes the dirty values of memory locations it owns to L1
corresponding signal marks the end of the sequential segment. once a parallel loop has finished execution. This is equivalent
The wait and signal instructions require special treatment to executing a distributed fence at the end of loops. In a multi-
in out-of-order cores. Since they may have system-wide side program scenario, signal buffers must also be flushed/restored at
effects, these instructions must issue non-speculatively from program context switches.
the core’s store queue and regular loads and stores cannot 6. Evaluation
be reordered around them. Our implementation reuses logic
from load-store queues for memory disambiguation and holds a By co-designing the compiler along with the architecture,
lightweight local fence in the load queue until the wait returns HELIX-RC more than triples the performance of parallelized
to the senior store queue. This is not a concern for in-order cores. code when compared to a compiler-only solution (HCCv2). This
4 This feature may add one multiplexer delay to the critical delay path from 5 Most cache coherence protocols (including Intel, AMD, and ARM imple-

the core to L1. mentations) provide this minimum guarantee.


16
16 HCCv2
HCCv2 HELIX-RC 14 decoupled reg. communication
14
Numerical
decoupled reg. comm. and synch.
Programs

memory communication
12 decoupled reg. and memory comm.

Benefit of decoupling
Program speedup
12 HELIX-RC (decoupled all communication)
Non-Numerical 10
Program speedup

Programs
10 8
Benefits of
decoupling
8 6 synchronization

4
6
2
4
0

ip

er

2
cf
f

n
ol
vp
2

ip

ea
gz

m
rs

tw

bz
5.

1.
pa

m
4.

0.
17

6.
18

eo
16

7.

30

25
19

G
T
0

IN
ip

pr

er

eo 2

ke

eo a
cf
18 lf

eo n

n
ar
o

ip

es
ea

ea

ea
gz

m
v

rs

ua
tw

9.
bz
5.

am

m
1.
pa

m
4.

17
0.

eq
17

6.

7.
Figure 8: Breakdown of benefits of decoupling communication
16

8.
7.

30

25

3.

17
18
19

G
18
T

FP from computation.
IN

Figure 7: HELIX-RC triples the speedup obtained by HCCv2.


Speedups are relative to sequential program execution.
to simulate conventional hardware, and later (Section 6.2) show
section investigates HELIX-RC’s performance benefits and their that low latency alone is not enough to compensate for the lazy
sensitivity to ring cache parameters. We confirm that the majority nature of its coherence protocol.
of speedups come from decoupling all types of communication Simulated ring cache. We extended XIOSim to simulate the
and synchronization. We conclude by analyzing the remaining ring cache as described in Section 5. Unless otherwise noted, it
overheads of the execution model. has the following configuration: a 1KB 8-way associative array
size, one-word data bandwidth, five-signal bandwidth, single-
6.1. Experimental Setup cycle adjacent core latency, and two cycles of core-to-ring-node
We ran experiments on two sets of architectures. The first relies injection latency to minimally impact the already delay-critical
on a conventional memory hierarchy to share data among the path from the core to the L1 cache. We use a simple bit mask as
cores. The second relies on the ring cache. the hash function to distribute memory addresses to their owner
Simulated conventional hardware. Unless otherwise noted, nodes. To avoid triggering the cache coherence protocol, all
we simulate a multicore in-order x86 processor by adding words of a cache line have the same owner. Lastly, XIOSim
multiple-core support to the XIOSim simulator. The single- simulates changes made to the core to route memory accesses
core XIOSim models have been extensively validated against an either to the attached ring node or to the private L1.
TM
Intel R Atom processor [19]. We use XIOSim because it is a Benchmarks. We use 10 out of the 15 C benchmarks from
publicly-available simulator that is able to simulate fine-grained the SPEC CPU2000 suite: 4 floating point (CFP2000) and 6
microarchitectural events with high precision. For one of the integer benchmarks (CINT2000). For engineering reasons, the
sensitivity experiments, we also simulate out-of-order cores mod- data dependence analysis that HCCv3 relies on [13] requires
eled after Intel Nehalem using the models from Zesto [21]. either too much memory or too much time to handle the rest.
The simulated cache hierarchy has two levels: a per-core This limitation is orthogonal to the results described in this paper.
32KB, 8-way associative L1 cache and a shared 8MB 16-bank Compiler. We extended the ILDJIT compilation frame-
L2 cache. We vary the core count from 1 to 16, but do not vary work [5], version 1.1, to use LLVM 3.0 for backend machine
the amount of L2 cache with the number of cores, keeping it at code generation. We generated both single- and multi-threaded
8MB for all configurations. Also scaling cache size would make versions of the benchmarks. The single-threaded programs are
it difficult to distinguish the benefits of parallelizing a workload the unmodified versions of benchmarks, optimized (O3) and
from the benefits of fitting its working set into the larger cache, generated by LLVM. This code outperforms GCC 4.8.1 by 8%
causing misleading results. Finally, we use DRAMSim2 [33] for on average and underperforms ICC 14.0.0 by 1.9%.6 The multi-
cycle-accurate simulation of memory controllers and DRAM. threaded programs were generated by HCCv3 and HCCv2 to run
We extended XIOSim with a cache coherence protocol as- on ring-cache-enhanced and conventional architectures, respec-
suming an optimistic cache-to-cache latency of 10 clock cycles. tively. Both compilers produce code automatically and do not
This 10-cycle latency is optimistically low even compared to require any human intervention. During compilation, they use
research prototypes of low-latency coherence [23]. In fact, it is SPEC training inputs to select the loops to parallelize.
the minimum reasonably possible with a 4×4 2D mesh network. Measuring performance. We compute speedups relative to
(Running microbenchmarks in our testbed, we found that Intel 6 As an aside, automatic parallelization features of ICC led to a geomean
Ivy Bridge is 75 cycles, Intel Sandy Bridge is 95 cycles, and Intel slowdown of 2.6% across SPEC CINT2000 benchmarks, suggesting ICC cannot
Nehalem is 110 cycles.) We only use this low-latency model parallelize non-numerical programs.
16

Program speedup
180 510 14 4-way OoO 2-way IO
160 Communication Computation 12
% Execution Time

140 C
Slow 10 2-way OoO
down
120 C C 8
100
C C
C 6
4
80 2
60 Speed 0
up
40 R 1

Seq.
20 R R R R R
R
0 0

ip

er

2
cf
f

n
ol
vp

ip

ea
ip

er

2
cf
f

gz
n

m
rs
ol

tw
vp

ip

bz
ea

5.
gz

m
rs

1.
pa

m
tw

4.
bz

0.
5.

17
1.

6.
pa

18
m
4.

eo
16
0.

7.

30
17

6.
18

25
eo
16

7.

30

19

G
25
19

T
IN
T
IN
Figure 10: Speedup obtained by changing the complexity of the
Figure 9: While code generated by HCCv3 speeds up with a ring core from a 2-way in-order to a 4-way out-of-order.
cache (R), it slows down on conventional hardware (C).

tional multicore (left bars), performs no better than sequential


sequential simulation. Both single- and multi-threaded runs use execution (100%), even with the optimistic 10-cycle core-to-core
reference inputs. To make simulation feasible, we simulate mul- latency. These results further stress the importance of selecting
tiple phases of 100M instructions as identified by SimPoint [14]. loops based on the core-to-core latency of the architecture.
6.2. Speedup Analysis Sequential segments. While more splitting offers higher TLP
(more sequential segments can run in parallel), more splitting
In our 16-core processor evaluation system, HELIX-RC boosts also requires more synchronization at run time. Hence, the high
the performance of sequentially-designed programs (CINT2000), synchronization cost for conventional multicores discourages
assumed not to be amenable to parallelization. Figure 7 shows aggressive splitting of sequential segments.8 In contrast, the ring
that HELIX-RC raises the geometric mean of speedups for these cache enables aggressive splitting to maximize TLP.
benchmarks from 2.2× for HCCv2 without ring cache to 6.85×. To analyze the relationship between splitting and TLP, we
HELIX-RC not only maintains the performance increases of computed the number of instructions that execute concurrently
HCCv2 (compared to HCCv1) on numerical programs (SPEC for the following two scenarios: (i) conservative splitting con-
CFP2000), but also increases the geometric mean of speedups strained by a contemporary multicore processor with high syn-
for CFP2000 benchmarks from 11.4×7 to almost 12×. chronization penalty (100 cycles) and (ii) aggressive splitting for
We now turn our attention to understanding where the HELIX-RC with low-latency communication (<10 cycles) pro-
speedups come from. vided by the ring cache. In order to compute TLP independent of
Communication. Speedups obtained by HELIX-RC come both the communication overhead and core pipeline advantages,
from decoupling both synchronization and data communication we used a simple abstracted model of a multicore system that
from computation in loop iterations, which significantly reduces has no communication cost and is able to execute one instruction
communication overhead, allows the compiler to split sequential at a time. Using the same set of loops chosen by HELIX-RC
segments into smaller blocks, and cuts down the critical path and used in Figure 7, TLP increased from 6.4 to 14.2 instruc-
of the generated parallel code. Figure 8 compares the speedups tions with aggressive splitting. Moreover, the average number
gained by multiple combinations of decoupling synchroniza- of instructions per sequential segment dropped from 8.5 to 3.2
tion, register-, and memory-based communication. As expected, instructions.
fast register transfers alone do not provide much speedup since Coverage. Despite all the loop-level speedups possible via de-
most in-register dependences can be satisfied by re-computing coupling communication and aggressively splitting of sequential
the shared variables involved (Section 2). Instead, most of the segments, Amdahl’s law states that program coverage dictates
speedups come from decoupling communication for both syn- the overall speedup of a program. Prior parallelization tech-
chronization and memory-carried actual dependences. To the niques have avoided selecting loops with small bodies because
best of our knowledge, HELIX-RC is the only solution that communication would slow down execution on conventional
accelerates all three types of transfers for actual dependences. processors [6, 39]. Since HELIX-RC does not suffer from this
In order to assess the impact of decoupling communication problem, the compiler can freely select small hot loops to cover
from computation for CINT2000 benchmarks, we executed the almost the entirety of the original program. Table 1 shows that
parallel code generated by HCCv3—assuming a decoupling ar- HELIX-RC achieves >98% for all of the benchmarks evaluated.
chitecture like a ring cache—on a simulated conventional system
6.3. Sensitivity to Architectural Parameters
which does not decouple. The loops selected under this assump-
tion do require frequent communication (every 24 instructions Speedup results so far assumed the default configuration (in
on average). Figure 9 shows that such code, running on a conven- Section 6.2) for the ring cache. We now investigate the impact of
different architectural parameters on speedup. In the next set of
7 These speedups are possible even with a cache coherence latency of conven-

tional processors (e.g., 75 cycles). 8 This is the rationale behind DOACROSS parallelization [10].
16 16 16 16

14 16 cores 4 cores 14 1 cycle 16 14 Unbounded 2 14 Unbounded 1 KB


Program speedup

12 8 cores 2 cores 12 4 32 12 4 Signals 1 12 32 KB 256 B


10 10 8 10 10

8 8 8 8

6 6 6 6
4 4 4
4
2 2 2
2
0 0 0
0

ip

2
f

cf
ip

er

2
f

cf

ol
vp

se
ip

2
f

cf
ol

ip
vp
ip

gz
f

cf

m
ip

ol
vp

se
gz

m
ol

tw
ip
vp

se

rs

gz

m
.tw
ip

bz
5.

ar
gz

tw
bz

1.
5.
.tw

4.
bz
1.

5.

ar
a

0.
4.
bz

1.
5.

ar

17

p
4.

6.
1.

18
17

.p

0.
0
4.

6.

16
18

17

7.

30
6.
16

18
17

.p

30
6.

16

25
18

7.
7

30
16

25
30

19
25
7

19
25

19
19

(a) Core count. (b) Adjacent node link latency. (c) Signal bandwidth. (d) Node memory size.

Figure 11: Sensitivity to core count and ring cache parameters. Only SPEC CINT benchmarks are shown.

164.gzip 40.8% 8.1% 9.6% 4.5% 0.0% 18.1% 18.8% 3.0x


175.vpr 11.9% 0.4% 74.2% 12.4% 0.0% 0.5% 0.5% 6.1x
197.parser 31.3% 24.3% 15.3% 5.0% 0.3% 11.6% 12.2% 7.3x
300.twolf 0.1% 0.2% 41.8% 1.4% 31.8% 0.0% 24.6% 7.6x
181.mcf 37.7% 10.4% 5.5% 1.2% 3.2% 20.9% 21.2% 8.7x
256.bzip2 3.4% 3.4% 51.6% 0.1% 1.1% 19.7% 20.7% 12.0x
183.equake 0.2% 0.0% 9.1% 1.5% 87.7% 0.0% 1.5% 10.1x
179.art 0.2% 0.0% 47.7% 24.8% 16.1% 0.0% 11.3% 10.5x
188.ammp 64.1% 8.0% 6.3% 7.4% 8.9% 2.2% 3.1% 12.5x
177.mesa 29.3% 0.9% 3.7% 58.4% 7.3% 0.0% 0.3% 15.1x
Dependence Communication Low Trip Iteration Memory Wait/Signal Additional HELIX-RC
Waiting Count Imbalance Instructions Instructions Speedup

Figure 12: Breakdown of overheads that prevent achieving ideal speedup.

experiments we sweep one parameter of the ring cache at a time signals (up to 4) is negligible.
while keeping all others constant at the default configuration. Memory size. Figure 11d shows the impact of memory size.
Core type. While HELIX-RC successfully improves TLP for The finite-size cases assume LRU replacement. Reducing cache
in-order cores, one may ask how ILP provided by more com- array size within the ring node only impacts 197.parser, which
plex cores impacts speedups. Figure 10 (upper) plots speedups has the largest ring cache working set.
for two additional core types—4-way and 2-way out-of-order
(OoO) cores—in addition to the default 2-way in-order core 6.4. Analysis of Overhead
(IO) for CINT2000 benchmarks. The lower graph plots the se-
To understand areas for improvement, we categorize every over-
quential execution time normalized to that of the 4-way OoO
head cycle (preventing ideal speedup) based on a set of simulator
core. Although the OoO cores can extract more ILP for the same
statistics and the methodology presented by Burger et al. [4]. Fig-
workloads (the 4-way OoO core is on average 1.9× faster than
ure 12 shows the results of this categorization for HELIX-RC,
the default 2-way IO core), HELIX-RC speeds up most of the
again implemented on a 16-core processor.
benchmarks except for 164.gzip. Thoroughly characterizing and
Most importantly, the small fraction of communication over-
accounting for this tradeoff between HELIX-RC-extracted TLP
heads suggests that HELIX-RC successfully eliminates the core-
and ILP with different architectures is the subject of future work.
to-core latency for data transfer in most benchmarks. For several
Core count. Figure 11a shows that HELIX-RC efficiently
benchmarks, notably 175.vpr, 300.twolf, 256.bzip2, and 179.art,
scales parallel performance with core count, from 2 to 16.
the major source of overhead is the low number of iterations
Link latency. Figure 11b shows the speedups obtained versus
per parallelized loop (low trip count). While many hot loops
the minimum communication latency between adjacent ring
are frequently invoked, low iteration count (ranging from 8 to
nodes. As expected, HELIX-RC performance degrades for
20) leads to idle cores. Other benchmarks such as 164.gzip,
longer latencies for most of the benchmarks. It is important
197.parser, 181.mcf, and 188.ammp suffer from dependence wait-
to note that current technologies can satisfy single-cycle adja-
ing due to large sequential segments. Finally, HCCv3 must
cent core latencies, confirmed by commercial designs [46] and
sometimes add a large number of wait and signal instructions
CACTI [24] wire models of interconnect lengths for dimensions
(i.e., many sequential segments) to increase TLP, as seen for
in modern multicore processors.
164.gzip, 197.parser, 181.mcf, and 256.bzip2.
Link bandwidth. A ring cache uses separate dedicated wires
for data and signals to simplify design. Simulations confirm that 7. Related Work
a minimum data bandwidth of one machine word (hence, single
write port) sufficiently sustains more than 99.9% of the perfor- To compare HELIX-RC to a broad set of related work, Table 2
mance obtained by a data link with unbounded bandwidth for all summarizes different parallelization schemes proposed for non-
benchmarks. In contrast, reducing signal bandwidth can degrade numerical programs organized with respect to the types of com-
performance, as shown in Figure 11c, due to synchronization munication decoupling implemented (register vs. memory) and
stalls. However, the physical overhead of adding additional the types of dependences targeted (actual vs. false). HELIX-RC
Actual dependences False dependences execute fine- and coarse-grained parallel numerical programs.
Register HELIX-RC, Multiscalar, TRIPS, T3 HELIX-RC, Multiscalar, TRIPS, T3
Memory HELIX-RC HELIX-RC, TLS-based approaches,
However, without an efficient broadcast mechanism, iWarp’s fast
Multiscalar, TRIPS, T3 communication cannot reach the speedups offered by HELIX-
RC.
Table 2: Only HELIX-RC decouples communication for all types Automatic parallelization of non-numerical programs.
of dependences. Several automatic methods to extract TLP have demonstrated
respectable speedups on commodity multicore processors for
covers the entire design space and is the only one to decouple non-numerical programs [6, 16, 27, 29, 30, 43, 49]. All of these
memory accesses from computation for actual dependences. methods transform loops into parallel threads. Decoupled soft-
Multiscalar register file. Multiscalar processors [38] extract ware pipelining (DSWP) [27] reduces sensitivity to communica-
both ILP and TLP from an ordinary application. While a ring tion latency by restructuring a loop to create a pipeline among the
cache’s structure resembles a Multiscalar register file, there are extracted threads with unidirectional communication between
fundamental differences. For the Multiscalar register file, there pipeline stages. Demonstrated both on simulators and on real sys-
is a fixed and relatively small number of shared elements that tems, DSWP performance is largely insensitive to latency. How-
must be known at compile time. Furthermore, the Multiscalar ever, significant restructuring of the loop makes speedups diffi-
register file cannot handle memory updates by simply mapping cult to predict and generated code can sometimes be slower than
memory to a fixed number registers without a replacement mech- the original. Moreover, DSWP faces the challenges of selecting
anism. In contrast, the ring cache does not require compile-time appropriate loops to parallelize and keeping the pipeline bal-
knowledge to handle an arbitrary number of elements shared anced at runtime. While DSWP-based approaches focus more on
between cores (i.e., memory locations allocated at runtime) and restructuring loops to hide communication latency [16, 27, 30],
can readily handle register updates by deallocating a register to HELIX-RC proposes an architecture-compiler co-design strategy
a memory location. In other words, HELIX-RC proposes to use that selects the most appropriate loops for parallelization.
a distributed cache to handle both register and memory updates. Combining DSWP with HELIX-RC has the potential to yield
Cache coherence protocols. The ring cache addresses an en- significantly better performance than either alone. DSWP cannot
tirely different set of communication demands. Cache coherence easily scale beyond four cores [31] without being combined with
protocols target relatively small amounts of data shared infre- approaches that exploit parallelism among loop iterations [16]
quently between cores. Hence, cores can communicate lazily, (e.g., DOALL [22]). While DSWP + DOALL can scale be-
but the resulting communication almost always lies in the critical yond several cores, DOALL parallelism is not easy to find in
sequential chain. In contrast, the ring cache targets frequent and non-numerical code. Instead, DSWP + HELIX-RC presents an
time-critical data sharing between cores. opportunity to parallelize a much broader set of loops.
On-chip networks. While on-chip-networks (OCNs) can Several TLS-based techniques [15, 18, 20, 39, 48, 49], includ-
take several forms, they commonly implement reactive co- ing STAMPede, Stanford Hydra, and POSH, combine hardware-
herence protocols [34, 37, 41, 44, 46] that do not fulfill the assisted thread-level speculation (TLS) with compiler optimiza-
low-latency communication requirements of HELIX-RC. Scalar tions to manage dependences between loop iterations execut-
operand networks [12, 42] somewhat resemble a ring cache to ing in different threads. When the compiler identifies sources
enable tight coupling between known producers and consumers and destinations of frequent dependences, it synchronizes us-
of specific operands, but they suffer from the same limitations ing wait and signal primitives; otherwise, it uses speculation.
as the Multiscalar register file. Hence, HELIX-RC implements HELIX-RC, on the other hand, optimizes code assuming all
a relatively simple OCN, but supported by compiler guarantees dependences are actual. While we believe adding speculation
and additional logic to implement automatic forwarding. may help HELIX-RC, Figure 7 shows decoupled communica-
Off-chip networks. Networks that improve bandwidth be- tion already yields significant speedups without misspeculation
tween processors have been studied extensively [36, 46]. While overheads.
they work well for CMT parallelization techniques [9, 28] that
require less frequent data sharing, there is less overall parallelism. 8. Conclusion
Moreover, networks that target chip-to-chip communication do
not meet the very different low-latency core-to-core communica- Decoupling communication from computation makes non-
tion demands of HELIX-RC [17]. Our results show HELIX-RC numerical programs easier to parallelize automatically by compil-
is much more sensitive to latency than to bandwidth. ing loop iterations as parallel threads. While numerical programs
Non-commodity processors. Multiscalar [38], TRIPS [35], can often be parallelized by compilation alone, non-numerical
and T3 [32] are polymorphous architectures that target paral- programs greatly benefit from a combined compiler-architecture
lelism at different granularities. They differ from HELIX-RC in approach. Our HELIX-RC prototype shows that a minimally-
that (i) they require a significantly larger design effort and (ii) invasive architecture extension co-designed with a parallelizing
they only decouple register-to-register communication and/or compiler can liberate enough parallelism to make good use of
false memory dependence communication by speculating. 16 cores for non-numerical benchmarks commonly thought not
An iWarp system [2] implements special-purpose arrays that to be parallelizable.
Acknowledgements [24] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi.
CACTI 6.0: A tool to model large caches. Technical Report 85, HP
We thank the anonymous reviewers for their feedback on nu- Laboratories, 2009.
[25] Alexandru Nicolau, Guangqiang Li, and Arun Kejariwal. Techniques for
merous manuscripts. Moreover, we would like to thank Glenn efficient placement of synchronization primitives. In PPoPP, 2009.
Holloway for his invaluable contributions to the HELIX project. [26] Alexandru Nicolau, Guangqiang Li, Alexander V. Veidenbaum, and Arun
This work was possible thanks to the sponsorship of the Royal Kejariwal. Synchronization optimizations for efficient execution on multi-
cores. In ICS, 2009.
Academy of Engineering, EPSRC and the National Science [27] Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Au-
tomatic thread extraction with decoupled software pipelining. In MICRO,
Foundation (award number IIS-0926148). Any opinions, find- 2005.
ings, and conclusions or recommendations expressed in this [28] David K. Poulsen and Pen-Chung Yew. Data prefetching and data forward-
ing in shared memory multiprocessors. In ICPP, 1994.
material are those of the authors and do not necessarily reflect [29] Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and
the views of our sponsors. David I. August. Speculative parallelization using software multi-threaded
transactions. In ASPLOS, 2010.
[30] Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges,
References and David I. August. Parallel-stage decoupled software pipelining. In
CGO, 2008.
[1] Randy Allen and Ken Kennedy. Optimizing compilers for modern archi- [31] Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August.
tectures. Morgan Kaufmann, 2002. Performance scalability of decoupled software pipelining. In ACM TACO,
[2] Shekhar Borkar, Robert Cohn, George Cox, Sha Gleason, Thomas Gross, 2008.
H. T. Kung, Monica Lam, Brian Moore, Craig Peterson, John Pieper, Linda [32] Behnam Robatmil, Dong Li, Hadi Esmaeilzadeh, Sibi Govindan, Aaron
Rankin, P. S. Tseng, Jim Sutton, John Urbanski, and Jon Webb. iWarp: An Smith, Andrew Putnam, Doug Burger, and Stephen W. Keckler. How
integrated solution to high-speed parallel computing. In Supercomputing, to Implement Effective Prediction and Forwarding for Fusable Dynamic
1988. Multicore Architectures. In HPCA, 2013.
[3] Matthew J. Bridges, Neil Vachharajani, Yun Zhang, Thomas Jablin, and [33] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A Cy-
David I. August. Revisiting the sequential programming model for multi- cle Accurate Memory System Simulator. In IEEE Computer Architecture
core. In MICRO, 2007. Letters, 2011.
[4] Doug Burger, James R. Goodman, and Alain Kägi. Memory bandwidth [34] Daniel Sanchez, Richard M. Yoo, and Christos Kozyrakis. Flexible archi-
limitations of future microprocessors. In ISCA, 1996. tectural support for fine-grain scheduling. In ASPLOS, 2010.
[5] Simone Campanoni, Giovanni Agosta, Stefano Crespi Reghizzi, and An- [35] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu,
drea Di Biagio. A Highly Flexible, Parallel Virtual Machine: Design and Changkyu Kim, Jaehyuk Huh, Nitya Ranganathan, Doug Burger,
Experience of ILDJIT. In Software: Practice and Experience, 2010. Stephen W. Keckler, Robert G. McDonald, and Charles R. Moore. TRIPS:
[6] Simone Campanoni, Timothy M. Jones, Glenn Holloway, Vijay A polymorphous architecture for exploiting ILP, TLP, and DLP. In ACM
Janapa Reddi, Gu-Yeon Wei, and David Brooks. HELIX: Automatic TACO, 2004.
[36] Steven L. Scott. Synchronization and Communication in the T3E Multi-
Parallelization of Irregular Programs for Chip Multiprocessing. In CGO, processor. In ASPLOS, 1996.
2012. [37] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash,
[7] Simone Campanoni, Timothy M. Jones, Glenn Holloway, Gu-Yeon Wei,
and David Brooks. HELIX: Making the Extraction of Thread-Level Paral- Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert
lelism Mainstream. In IEEE Micro, 2012. Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan.
[8] Ramkrishna Chatterjee, Barbara G. Ryder, and William A. Landi. Relevant Larrabee: a many-core x86 architecture for visual computing. In ACM
Context Inference. In POPL, 1999. Transactions on Graphics, 2008.
[9] Lynn Choi and Pen-Chung Yew. Compiler and hardware support for [38] Gurindar S. Sohi, Scott E. Breach, and T. N. Vijaykumar. Multiscalar
cache coherence in large-scale multiprocessors: Design considerations and processors. In ISCA, 1995.
performance study. In ISCA, 1996. [39] J. Gregory Steffan, Christopher Colohan, Antonia Zhai, and Todd C.
Mowry. The STAMPede approach to thread-level speculation. In ACM
[10] Ron Cytron. DOACROSS: Beyond vectorization for multiprocessors. In Transactions on Computer Systems, 2005.
ICPP, 1986. [40] J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C.
[11] Alain Deutsch. A storeless model of aliasing and its abstractions using Mowry. Improving value communication for thread-level speculation. In
finite representations of right-regular equivalence relations. In ICCL, 1992. HPCA, 2002.
[12] Paul Gratz, Changkyu Kim, Karthikeyan Sankaralingam, Heather Hanson, [41] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae
Premkishore Shivakumar, Stephen W. Keckler, and Doug Burger. On-Chip Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee,
Interconnection Networks of the TRIPS Chip. In IEEE Micro, 2007. Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman,
[13] Bolei Guo, Matthew J. Bridges, Spyridon Triantafyllis, Guilherme Ottoni, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Ararwal.
Easwaran Raman, and David I. August. Practical and accurate low-level The RAW microprocessor: A computational fabric for software circuits
pointer analysis. In CGO, 2005. and general-purpose programs. In IEEE Micro, 2002.
[14] Greg Hamerly, Erez Perelman, and Brad Calder. How to use simpoint to [42] Michael Bedford Taylor, Walter Lee, Saman P. Amarasinghe, and Anant
pick simulation points. In ACM SIGMETRICS Performance Evaluation Agarwal. Scalar Operand Networks. In IEEE Transactions on Parallel
Review, 2004. Distributed Systems, 2005.
[15] Lance Hammond, Benedict A. Hubbert, Michael Siu, Manohar K. Prabhu, [43] Georgios Tournavitis, Zheng Wang, Björn Franke, and Michael F. P.
Michael K. Chen, and Kunle Olukotun. The Stanford Hydra CMP. In O’Boyle. Towards a holistic approach to auto-parallelization. In PLDI,
IEEE Micro, 2000. 2009.
[16] Jialu Huang, Arun Raman, Thomas B. Jablin, Yun Zhang, Tzu-Han Hung, [44] Rob F. van der Wijngaart, Timothy G. Mattson, and Werner Haas. Light-
and David I. August. Decoupled software pipelining creates parallelization weight communications on Intel’s single-chip cloud computer processor.
opportunities. In CGO, 2010. In SIGOPS Operating Systems Review, 2011.
[17] Natalie Enright Jerger and Li-Shiuan Peh. On-Chip Networks. Synthesis [45] Hans Vandierendonck, Sean Rul, and Koen De Bosschere. The paralax
Lectures on Computer Architecture. Morgan & Claypool, 2009. infrastructure: Automatic parallelization with a helping hand. In PACT,
[18] Troy A. Johnson, Rudolf Eigenmann, and T. N. Vijaykumar. Speculative 2010.
[46] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce
thread decomposition through empirical optimization. In PPoPP, 2007. Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown,
[19] Svilen Kanev, Gu-Yeon Wei, and David Brooks. XIOSim: power- III, and Anant Agarwal. On-chip interconnection architecture of the tile
performance modeling of mobile x86 cores. In ISLPED, 2012. processor. In IEEE Micro, 2007.
[20] Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau, [47] Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan, and Todd C.
and Josep Torrellas. POSH: A TLS compiler that exploits program structure. Mowry. Compiler optimization of scalar value communication between
In PPoPP, 2006. speculative threads. In ASPLOS, 2002.
[21] Gabriel H Loh, Samantika Subramaniam, and Yuejian Xie. Zesto: A [48] Antonia Zhai, J. Gregory Steffan, Christopher B. Colohan, and Todd C.
cycle-level simulator for highly detailed microarchitecture exploration. In Mowry. Compiler and hardware support for reducing the synchronization
ISPASS, 2009. of speculative threads. In ACM TACO, 2008.
[22] Stephen F. Lundstrom and George H. Barnes. A controllable MIMD [49] Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Scott Mahlke.
architecture. In Advanced computer architecture, 1986. Uncovering hidden loop level parallelism in sequential applications. In
[23] Milo M. K. Martin. Token coherence. PhD thesis, University of Wisconsin- HPCA, 2008.
Madison, 2003.

You might also like