Abstract—Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for a significant fraction of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. When non-predictable memory access patterns are found, compilers fail to generate code because of the incoherence between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of memory aliasing problems. Coherence is ensured by a software/hardware co-design where the compiler identifies potentially incoherent memory accesses and the hardware diverts them to the correct copy of the data. The coherence protocol introduces overheads of 0.26% in execution time and of 2.03% in energy consumption to enable the usage of the hybrid memory system, which outperforms cache-based systems with a speedup of 38% and an energy reduction of 27%.
Index Terms—Coherence protocol, Local memories, Scratchpad memories, Hybrid memory system
the copy of this data in the SM. This is key to ensure there is no interaction with the cache coherence protocol, and it is what allows the proposal to work by only monitoring events inside the core. This constraint is easily ensured when a compiler maps private data to the LM, because the data distribution is already specified in the parallelization model. If the architecture is programmed by hand, the programmer is responsible for not accessing the data mapped to one core from another core without using synchronization primitives. The next sections explain the compiler and hardware support for the coherence protocol, show an example of how everything works together, and describe how the system manages the copies of the data in such a way that the conditions for the correctness of the coherence protocol are always fulfilled.
In Figure 3 the code is transformed in exactly the same way as explained in Section 2.2. The only difference is the existence of a new class of memory accesses, the potentially incoherent ones, like ptr. Since the compiler does not have to do any transformations for them, the resulting code is the same as in Figure 2.

Phase 3 - Code generation: In this phase the compiler generates the assembly code for the target architecture:

• For regular accesses the compiler generates memory instructions that directly access the LM. This is accomplished by using as source operands the base address of a LM buffer and an offset.
• For irregular accesses the compiler generates memory instructions that directly access the SM. This is accomplished by using as source operands a base address in the SM and an offset.
• For potentially incoherent accesses the compiler generates guarded memory instructions with an initial SM address. This is accomplished by using as source operands a base address in the SM and an offset. When it is executed, the guarded memory instruction accesses the directory using the SM address and is diverted to the corresponding memory. The implementation of the guarded memory instruction is discussed later in this section.

Figure 3 shows the assembly code that the third phase emits for the body of the innermost loop. In the statement that uses regular accesses (line 10), a conventional load (ld in line 11) and a conventional store (st in line 12) are emitted to, respectively, read a value from _b and write it in _a. When these instructions are executed, their addresses will guide the memory accesses to the LM. Similarly, in the statement that uses an irregular access to store the zero value in random positions of c (line 13), the compiler emits a conventional store (st in line 15) with an address that will access the SM at execution time. Finally, to increment the value that is accessed via potentially incoherent accesses (line 16), the compiler emits a guarded load (gld in line 17) to read the value and a guarded store (gst in line 19) to write the value after incrementing it. When these two guarded memory instructions are executed, the initial SM addresses based on ptr will be used to look up the directory, and they will be changed to LM addresses if a copy of the data exists there.
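At the C level, the loop body just described can be sketched as follows. This is a reconstruction from the surrounding text, not the paper's actual Figure 3, and names such as CHUNK and rnd are assumptions:

/* Hedged reconstruction of the running example. _a and _b are LM buffers
 * holding chunks of a and b; c and ptr live in the SM address space;
 * rnd holds some non-strided indices (assumed name).                    */
for (int _i = 0; _i < CHUNK; _i++) {
    _a[_i] = _b[_i];   /* line 10: regular accesses, lowered to ld/st on the LM */
    c[rnd[_i]] = 0;    /* line 13: irregular access, a plain st to the SM       */
    ptr[_a[_i]]++;     /* line 16: potentially incoherent, lowered to gld + gst */
}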
One special case has to be treated separately. When the compiler determines a write access is potentially incoherent, it also has to ensure that the access does not alias with data that is mapped to the LM as read-only. If it cannot ensure this, emitting a single guarded store can lead to an erroneous execution. This is caused by a typical optimization of tiling transformations that consists of not triggering a write-back to the SM of a chunk of data that is mapped as read-only. With this optimization, what will happen at execution time is that the guarded store will hit in the directory and the modification will be done to the LM. Since the buffer will not be written back to the SM, when the buffer is reused to map new data the corresponding dma-get operation will overwrite the contents of the buffer, and the modifications done by the potentially incoherent store will be lost. A naive solution to this problem is to disable the tiling optimization, forcing the write-back to always be performed and thus incurring high performance penalties. A more efficient solution is to make the modifications in the two memories. A simple way to do this is for the compiler to generate a double store: one irregular store that will update the copy in the SM and one potentially incoherent store that will trigger a lookup in the directory and will update the copy in the LM if it exists. Note that if the lookup in the directory of the potentially incoherent store misses, there will be two stores of the same data to the same SM address. The overhead of this unnecessary second store is small. The performance impact is low because the two stores are independent, so they can both be issued in the same cycle. The increase in power consumption is also small, since the Load/Store Queue [11] will collapse the second store with the first one if it is not yet committed, resulting in a single cache access and thus not paying the cost of an extra memory access. Note also that, in the presence of a potentially incoherent store, the compiler almost always generates a double store since, in general, it is unable to ensure that the aliasing is not with some read-only data. This happens because the compiler is typically unable to determine the accessible address range of a potentially incoherent access, and therefore it is also unable to ensure there is no read-only data in this potentially unbounded accessible address range.

The final code of Figure 3 shows how the compiler generates the double store. For the increment of a random position of ptr (line 16), the value is read with a guarded load (gld in line 17), incremented, and finally written with a double store. The double store consists of a guarded store (gst in line 19) that will modify the copy in the LM if it exists and a conventional store (st in line 20) with the same source operands that will always update the value in the SM.

The implementation of the guarded memory operations is highly architecture-dependent. The trivial implementation is to duplicate all memory instructions with a guarded form. As this might produce many new opcodes, it may be unacceptable for some ISAs, especially RISC architectures. One alternative is to use unused bits of the binary representation of memory instructions, as happens in PowerPC [23]. Another option is to provide a reduced set of guarded memory instructions and restrict the compiler to these. In CISC architectures like x86 [24], where most instructions can access memory, instruction prefixes can be used to implement the guard. A generic solution for any ISA is to extend the instruction set with only a single instruction that performs the computation of the address using the directory and leaves the result in a register that is consumed by the memory instruction, conceptually converting the guarded memory access into a coherence-aware address calculation plus a normal memory operation.
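As a concrete illustration of this last option, the added instruction could behave like the following C sketch; the name guard_translate and the dir_lookup helper are hypothetical, chosen here only to show the split into an address calculation plus a normal memory operation:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: returns true and writes the LM address if the
 * directory holds a copy of the data at sm_addr (see Section 3.2). */
bool dir_lookup(uint64_t sm_addr, uint64_t *lm_addr);

/* Conceptual semantics of the single added instruction: translate an SM
 * address through the directory and leave the result in a register that
 * an ordinary load or store then consumes. */
uint64_t guard_translate(uint64_t sm_addr)
{
    uint64_t lm_addr;
    if (dir_lookup(sm_addr, &lm_addr))
        return lm_addr;   /* a copy exists in the LM: divert the access */
    return sm_addr;       /* no copy: the access proceeds to the SM     */
}

With this split, gld and gst reduce to guard_translate followed by an ordinary ld or st, so no new memory opcodes are required.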
3.2 Hardware Design
The only hardware support needed for the coherence protocol is a directory that keeps track of the contents of the LM. This section explains how the directory is configured, updated and used in the address generation. Then some considerations about its access time, its double buffering support and its side effects on the hybrid memory system are discussed.
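The configuration details follow, but as orientation, a minimal sketch of the state such a directory plausibly holds (one entry per mappable LM buffer, recording which SM range is replicated where; the entry layout and entry count are assumptions) is:

#include <stdint.h>
#include <stdbool.h>

#define DIR_ENTRIES 16   /* assumed: one entry per mappable LM buffer */

/* Each valid entry records that the SM range [sm_base, sm_base + size)
 * is currently replicated at lm_base in the local memory. Entries are
 * filled by dma-get (LM-map) and overwritten when a buffer is reused. */
struct dir_entry {
    bool     valid;
    uint64_t sm_base;   /* base address of the mapped chunk in the SM */
    uint64_t lm_base;   /* base address of its copy in the LM         */
    uint64_t size;      /* bytes covered by the mapping               */
};

static struct dir_entry directory[DIR_ENTRIES];

/* Lookup performed by guarded accesses: on a hit the SM address is
 * rewritten into the corresponding LM address; on a miss it is kept. */
bool dir_lookup(uint64_t sm_addr, uint64_t *out_addr)
{
    for (int i = 0; i < DIR_ENTRIES; i++) {
        const struct dir_entry *e = &directory[i];
        if (e->valid && sm_addr >= e->sm_base && sm_addr < e->sm_base + e->size) {
            *out_addr = e->lm_base + (sm_addr - e->sm_base);
            return true;
        }
    }
    return false;
}

In hardware the lookup would of course be an associative comparison rather than a loop; the sketch only fixes the information the directory needs in order to divert guarded accesses.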
Fig. 5: Example of the hardware handling memory operations in the hybrid memory system with the coherence protocol. The code is divided into four pieces, each of them corresponding to a step in the execution diagram. Every step shows how the data is moved between memories and which memory serves the memory accesses triggered by the corresponding piece of code.
from a in the SM to _a in the LM. It is assumed the caches are empty, so the data is transferred from the main memory. If some cache kept the valid copy of the data, the coherent DMA transfers would read the data from the cache. Similarly, the second statement of this step (line 5) maps a chunk of b to _b, which at execution time provokes the data movement from the main memory to the LM represented by MAP5.

Once the control phase has been executed, step 2 takes place. This step is the assignment from _b to _a done with regular accesses (line 10). In assembly language the assignment is done using a conventional load (line 11) and a conventional store (line 12). At execution time, the load to _b is served directly by the LM, as represented by ld11, and the store to _a is also directly sent to the LM, as st12 represents. These direct accesses to the LM happen because the addresses of the memory operations are in the range reserved for the LM.

In step 3 an irregular store sets to zero a random position of c (line 13). The store is irregular because it does not expose a strided access pattern, and the example assumes the compiler can ensure the access does not alias with any regular access. The assembly code for this statement consists of placing the value zero in a register (line 14) and then storing this value to memory using a conventional store to some position of c (line 15). This store, labeled as st15, is served by the L1 cache at execution time, since the address it modifies is not in the LM address space. Assuming the caches are empty, the L1 cache requests the cache line from the upper levels of the hierarchy (not shown in the figure) and these forward the request to the main memory. The cache line is then sent to the requesters, reaching the L1 cache so the modification can be done.

Finally, step 4 increments an element of ptr using potentially incoherent loads and stores (line 16). These memory accesses are potentially incoherent because they are not strided and the example assumes the compiler does not succeed in ensuring they do not alias with any regular access. In addition, the potentially incoherent write access needs to be treated with a double store. The instructions for this statement are a guarded load of some element of ptr (line 17), the increment (line 18), a guarded store that will write the new value in the LM in case it exists (line 19) and a conventional store that will always write the new value in the SM (line 20). When these instructions are executed, the guarded load gld17 does a lookup in the directory. If it hits, the access is diverted to the LM, as gld17H shows. This happens, for instance, if ptr equals a and ptr[_a[_i]] is a position of a that has been mapped to the LM in step 1. Otherwise, if the directory lookup misses, the load labeled as gld17M is served by the L1 cache, which requests the cache line if needed. This happens, for instance, if ptr equals a but ptr[_a[_i]] is a position of a that has not been mapped to the LM in step 1. After loading the value, it is incremented and written to memory with the guarded store gst19. The execution of the guarded store, analogous to the guarded load, first does a lookup in the directory. If there is a copy of the data in the LM, the lookup hits, the address is changed to point to the LM and the access goes there, as gst19H shows. If there is no copy in the LM, the lookup misses and the SM address is preserved, so the L1 cache serves the access, as gst19M shows. With this mechanism, the valid copy of the data is always accessed. To prevent losing the modifications when there is aliasing with read-only data in the LM, the irregular store st20 modifies the copy in the SM. The address of the store guides the operation to the L1 cache, which requests the cache line if necessary.
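The hit/miss behavior of step 4 can be condensed into a small behavioral model; this is a sketch of the semantics only, not of the hardware, where dir_lookup is the hypothetical helper sketched in Section 3.2 and load32/store32 stand for plain memory accesses:

#include <stdint.h>
#include <stdbool.h>

bool    dir_lookup(uint64_t sm_addr, uint64_t *lm_addr);  /* see Section 3.2 */
int32_t load32(uint64_t addr);                            /* plain ld        */
void    store32(uint64_t addr, int32_t v);                /* plain st        */

/* Semantics of the step-4 sequence: gld17, the increment, gst19 and st20. */
void increment_element(uint64_t sm_addr)
{
    uint64_t a, t;

    a = dir_lookup(sm_addr, &t) ? t : sm_addr;  /* gld17: hit -> LM (gld17H),
                                                   miss -> L1 (gld17M)       */
    int32_t v = load32(a) + 1;                  /* line 18: the increment    */

    a = dir_lookup(sm_addr, &t) ? t : sm_addr;  /* gst19: same directory test */
    store32(a, v);                              /* gst19H or gst19M           */
    store32(sm_addr, v);                        /* st20: always updates the SM */
}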
3.4 Correctness of the Coherence Protocol
This section shows the correctness of the coherence protocol. The two previous sections described how memory operations are handled; here it is shown that, when two replicas of the same data exist, only two situations can arise: either both versions are identical, or the version in the LM is always more recent than the version in the cache hierarchy. Then it is shown that whenever replicated data is evicted to main memory, the version in the LM is always the one transferred, invalidating the cache version. This is always guaranteed unless both versions are identical, in which case the system supports the eviction indistinctly.

Fig. 6: State diagram of the possible replication states of the data. A piece of data can be in main memory only (MM), replicated only in the LM (LM), replicated only in the cache hierarchy (CM), or replicated in the LM and in the cache hierarchy at the same time (LM-CM). Creating and discarding copies of the piece of data cause transitions between the states.
3.4.1 Data States and Operations
Figure 6 shows the possible actions and states of data in the system. The state diagram is conceptual; it is not implemented in hardware. The MM state indicates the data is in main memory and has no replica in the cache hierarchy or in the LM. The LM state indicates that only one replica exists, and it is located in the LM. In the CM state only one replica exists, in the cache hierarchy. In the LM-CM state two replicas exist, one in the LM and the other in the cache hierarchy.

Actions prefixed with "LM-" correspond to LM control actions, activated by software. There is a distinction between LM-map and LM-unmap, although both actions correspond to the execution of a dma-get, which unmaps the previous contents of a LM buffer and maps new contents instead. LM-map indicates that a dma-get transfers the data to the LM. LM-unmap indicates that a dma-get has been performed that overwrites the data in question, so it is no longer mapped to the LM. LM-writeback corresponds to the execution of a dma-put that transfers the data from the LM to the SM. Actions prefixed with "CM-" correspond to hardware-activated actions in the cache hierarchy. CM-access corresponds to the placement of the cache line that contains the data in the cache hierarchy. CM-evict corresponds to the replacement of the cache line, with its write-back to main memory if needed.

The MM→LM transition occurs when the software causes an LM-map action. Switching back to the MM state occurs when an LM-unmap action happens due to a dma-get mapping new data to the buffer. Notice that an LM-writeback action does not imply a switch to the MM state, as transferring data to the main memory does not unmap the data from the LM. Transitions between the MM and CM states happen according to the execution of load and store operations that cause CM-access and CM-evict actions. Notice that unless the data reaches the LM-CM state, no coherence problem can appear due to the use of a LM. DMA transfers are coherent with the SM, ensuring the system coherence as long as the data switches between the LM and MM states. Similarly, the cache coherence protocol ensures the system coherence when the data switches between the MM and CM states. In both cases, never more than one replica is generated.

The LM-CM state is reachable from both the LM and the CM states. In the LM state, a guarded instruction will never cause a replica in the caches, since the access goes through the directory, and this will divert the access to the LM. It is impossible to have unguarded memory instructions to the SM because the compiler never emits them unless it is sure that there is no aliasing, which cannot happen in this state. In the LM state, only the execution of a double store can cause the transition to the LM-CM state. The double store is composed of a guarded store and a store to the SM (st_guarded and st_sm). The st_sm is served by the cache hierarchy, so a replica of the data is generated and updated in the cache, while the st_guarded modifies the LM replica with the same value, so two replicas generated through a LM→LM-CM transition are always identical. The transition CM→LM-CM happens due to an LM-map action, and the DMA coherence ensures the two versions are identical. Once in the LM-CM state, the double store updates both versions, while st_guarded and st_lm modify the LM version, and st_sm is never generated on its own.

In conclusion, only two possibilities exist for having two replicas of data. Each one is represented by one path reaching the LM-CM state from the MM state. In both cases, the two versions are either identical or the version in the LM is the valid one. The next section shows the valid version is always selected at the moment of evicting the data to main memory.

3.4.2 Data Eviction
The state diagram shows that the eviction of data can only occur from the LM and CM states. There is no direct transition from the LM-CM state to the MM state, which means that eviction of data can only happen when one replica exists in the system. This is a key point to ensure coherence. In case the data is in the LM-CM state, its eviction can only occur if first one of the replicas is discarded, which corresponds to a transition to the LM or CM states. According to the previous section, it is ensured that in the LM-CM state the two replicas are identical or, if not, the version in the LM is the valid one. Consequently, the eviction discards the cache version unless both versions are identical, in which case either version can be evicted. This behavior is guaranteed by the transitions exiting the LM-CM state. When a LM-writeback action is triggered by a dma-put, the associated DMA transfer invalidates the version of the data that is in the cache hierarchy. The CM-evict transition is caused by an access to some other data in the SM that causes a replacement of the cache line that holds the current data, leaving just one replica, the one in the LM, and thus transitioning to the LM state. Once the LM state is reached, at some point the program will execute a dma-put operation to write back the data to the SM. Finally, the transition LM-CM→CM, caused by an LM-unmap action, corresponds to the case where the program explicitly discards the copy in the LM when new data is mapped to the buffer that holds it. The programming model imposes that this will only happen when both versions are identical, because if the version in the LM had modifications it would be written back before being replaced. So, after the LM-unmap, the only replica of the data is in the cache hierarchy and it is valid, and the cache coherence protocol will ensure the transfer of the cache line to the main memory is done coherently.

In conclusion, the system always evicts the valid version of the data. When two replicas exist, first the invalid one is discarded and then the DMA and the cache coherence mechanisms correctly manage the eviction of the valid replica.
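Because the diagram is conceptual, its transitions can be written down compactly as a transition function. The sketch below mirrors Figure 6 and the eviction discussion above; it is a behavioral model, not hardware, and encoding the st_sm half of a double store as a CM-access in the LM state is an assumption of the sketch:

enum state  { MM, LM, CM, LM_CM };
enum action { LM_MAP, LM_UNMAP, LM_WRITEBACK, CM_ACCESS, CM_EVICT };

/* Transition function mirroring Figure 6. Pairs not listed leave the
 * state unchanged (e.g. LM_WRITEBACK in the LM state: the data stays
 * mapped in the LM after a dma-put). */
enum state next(enum state s, enum action a)
{
    switch (s) {
    case MM:    if (a == LM_MAP)       return LM;
                if (a == CM_ACCESS)    return CM;
                break;
    case LM:    if (a == LM_UNMAP)     return MM;
                if (a == CM_ACCESS)    return LM_CM; /* st_sm of a double store */
                break;
    case CM:    if (a == CM_EVICT)     return MM;
                if (a == LM_MAP)       return LM_CM; /* coherent dma-get        */
                break;
    case LM_CM: if (a == LM_WRITEBACK) return LM;    /* dma-put invalidates the
                                                        cache copy              */
                if (a == CM_EVICT)     return LM;    /* only the LM copy left   */
                if (a == LM_UNMAP)     return CM;
                break;
    }
    return s;
}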
TABLE 1: PTLsim configuration parameters.

Parameter        | Description
Pipeline         | Out-of-order, 4 instructions wide
Branch predictor | Hybrid 4K selector, 4K G-share, 4K Bimodal; 4K BTB 4-way; RAS 32 entries
Functional units | 3 INT ALUs, 3 FP ALUs, 2 load/store units
Register file    | 256 INT registers, 256 FP registers
L1 I-cache       | 32 KB, 8-way set-associative, 2 cycles latency
L1 D-cache       | 32 KB, 8-way set-associative, write-through, 2 cycles latency
L2 cache         | 256 KB, 24-way set-associative, write-back, 15 cycles latency
L3 cache         | 4 MB, 32-way set-associative, write-back, 40 cycles latency
Prefetcher       | IP-based stream prefetcher [30], [31] to L1, L2 and L3
Local memory     | 32 KB, 2 cycles latency

TABLE 2: Scheme of the microbenchmark. The microbenchmark is a simple loop that can be configured in four modes. For each mode it is assumed some memory references are potentially incoherent, so guarded memory instructions are emitted for them, marked here with "(guarded)".

Source code:
int a[N];
int c;
for(i=0; i<N-1; i++) {
    a[i+1] = a[i] + c;
}

Mode     | Assembly code
Baseline | mov a(,esi,4),ebx
         | add edi,ebx
         | mov ebx,a+4(,esi,4)
RD       | mov a(,esi,4),ebx   (guarded)
         | add edi,ebx
         | mov ebx,a+4(,esi,4)
WR       | mov a(,esi,4),ebx
         | add edi,ebx
         | mov ebx,a+4(,esi,4) (guarded)
         | mov ebx,a+4(,esi,4)
RD/WR    | mov a(,esi,4),ebx   (guarded)
         | add edi,ebx
         | mov ebx,a+4(,esi,4) (guarded)
         | mov ebx,a+4(,esi,4)
4 EVALUATION
This section evaluates the coherence protocol for the hybrid memory system. A microbenchmark and a set of real benchmarks are used to study the overhead of the proposal in terms of execution time and energy consumption. Then a comparison against a cache-based system is presented.

4.1 Experimental Framework
The proposal has been evaluated using PTLsim [28], extending it with a LM, a DMAC and the directory of the coherence protocol. For the energy results, Wattch [29] has been integrated into the simulator. Single-core simulations are presented because the coherence protocol is per core. Table 1 shows the parameters of the simulated speculative out-of-order core.

Six memory-intensive HPC benchmarks from the NAS benchmark suite [32] are used for the evaluation. The benchmarks have been compiled using GCC 4.6.3 with the -O3 optimization flag on. SimPoint [33] has been used to identify the simulation points, and at least 150 million x86 instructions have been simulated for each benchmark.

The outcome of the alias analysis performed by GCC on every memory reference has been checked to generate the guarded memory instructions. The references that GCC is not able to determine the aliasing for are the potentially incoherent accesses. Once these accesses have been identified, the source code of the benchmarks has been modified by hand to generate the guarded memory instructions using assembly macros, as sketched below. x86 instruction prefixes are used to implement the guarded instructions, as explained in Section 3.1.
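As an illustration of what such a macro might look like, the sketch below uses GCC inline assembly; the choice of 0x3E (a segment-override prefix) as the guard marker is purely an assumption for illustration, since the text does not state which prefix encodes the guard:

/* Hypothetical macros marking guarded accesses with an x86 instruction
 * prefix byte that the simulator interprets as the guard. */
#define GUARDED_LOAD(dst, src) \
    __asm__ volatile(".byte 0x3e; movl %1, %0" : "=r"(dst) : "m"(src))

#define GUARDED_STORE(dst, src) \
    __asm__ volatile(".byte 0x3e; movl %1, %0" : "=m"(dst) : "r"(src))

A potentially incoherent read such as v = *p; would then be written as GUARDED_LOAD(v, *p); in the modified benchmark source.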
4.2 Overhead of the Coherence Protocol
A microbenchmark that stresses the coherence protocol is used to facilitate the study of its performance overheads. Table 2 shows its characteristics. The microbenchmark is a loop that makes a sequence of load/add/store instructions that can be configured in four modes. In the baseline mode no guarded instructions are generated for any access. The RD mode assumes the read access a[i] is potentially incoherent, so a guarded load is generated. The guarded memory instructions are marked in the assembly code of Table 2. The WR mode assumes the write access to a[i+1] is potentially incoherent and it cannot be ensured that a write-back to the SM will be performed, so a double store is emitted. The RD/WR mode is a combination of the RD and the WR modes. To model all possible scenarios in terms of the ratio of accesses that are potentially incoherent, the percentage of memory operations that need to be guarded can also be adjusted.

Figure 7 shows the overhead in execution time of the proposal in the microbenchmark. Three lines appear in the figure, one per mode of the microbenchmark. The X axis shows the percentage of references that are potentially incoherent with respect to the total number of references. The overhead of each mode is shown as a ratio computed against the baseline mode of the microbenchmark.
Fig. 7: Overhead in all microbenchmark modes. The X axis shows the percentage of guarded instructions.

Fig. 8: Overhead in real benchmarks.
TABLE 3: Activity in the memory subsystem for the hybrid memory and the cache-based systems (accesses in thousands).

Benchmark | Mode            | Guarded refs | AMAT | L1 hit ratio | L1 acc. | L2 acc. | L3 acc. | LM acc. | Dir. acc.
CG        | Hybrid coherent | 1/7 (14%)    | 3.15 | 90.52        | 19319   | 26376   | 10597   | 30235   | 10566
CG        | Cache-based     | 0            | 4.31 | 82.23        | 70371   | 62822   | 84202   | 0       | 0
EP        | Hybrid coherent | 1/20 (5%)    | 2.14 | 99.93        | 37152   | 10266   | 228     | 3862    | 3519
EP        | Cache-based     | 0            | 2.37 | 98.93        | 43814   | 13219   | 797     | 0       | 0
FT        | Hybrid coherent | 4/34 (11%)   | 2.60 | 96.61        | 912779  | 761009  | 110186  | 1155150 | 55118
FT        | Cache-based     | 0            | 4.95 | 78.54        | 1379688 | 789765  | 352269  | 0       | 0
IS        | Hybrid coherent | 2/5 (25%)    | 6.27 | 74.00        | 140663  | 194465  | 74647   | 73400   | 25714
IS        | Cache-based     | 0            | 7.93 | 64.10        | 169425  | 182716  | 127692  | 0       | 0
MG        | Hybrid coherent | 1/60 (1.66%) | 2.24 | 99.71        | 605269  | 252799  | 35588   | 798562  | 19377
MG        | Cache-based     | 0            | 3.89 | 90.65        | 827239  | 238099  | 127176  | 0       | 0
SP        | Hybrid coherent | 0/497 (0%)   | 2.41 | 98.37        | 331832  | 162441  | 24159   | 235024  | 0
SP        | Cache-based     | 0            | 4.73 | 79.59        | 407952  | 164515  | 82301   | 0       | 0
Fig. 9: Reduction in execution time. Each bar is broken down into Work, Synch and Control phases.

with one difference. The hybrid memory system has a 32KB LM and the directory of the coherence protocol. For fairness, the capacity of the L1 of the cache-based system is increased to 64KB, matching the 32KB of LM plus the 32KB of L1 in the hybrid memory system. Table 3 summarizes the statistics of the memory subsystem that are the dominating factors of the improvements. This table is used throughout this section to explain the differences between the two architectures. For each benchmark the table shows the ratio of references that are potentially incoherent, the average memory access time (AMAT), the L1 hit ratio and the number of accesses to all the components of the memory subsystem in thousands. The accounting of accesses includes hits, misses, lookups and invalidations provoked by memory instructions, prefetchers, placement of cache lines by the MSHRs, write-through and write-back policies, and bus requests of the DMA commands.

The immediate consequence of the coherence protocol is that any computational loop can be executed on the hybrid memory system. The benchmarks that benefit from this achievement are all but SP. In Table 3 this is reflected in the column of the number of guarded references. All benchmarks but SP have potentially incoherent references for which the compiler generates guarded accesses. Without the coherence protocol the usage of the hybrid memory system would not be possible in these cases, so the performance and energy consumption benefits it provides would not be exploited.

The reduction in execution time the hybrid memory system achieves when compared to a cache-based system can be observed in Figure 9. For each benchmark two bars are presented. The leftmost bar is the execution time of the cache-based system and the rightmost bar is the execution time of the hybrid memory system. Both bars are normalized to the cache-based system execution time and show the weight of each execution phase, considering as work time the whole execution time of the cache-based system. All benchmarks but EP present some degree of reduction. The reductions are mainly due to the reduction of execution time of the work phase, more than 35% in all cases. This big reduction in the work phase is caused by the better management of memory references in the hybrid memory system. First, the irregular accesses that reuse data along the execution of the benchmarks have a much higher L1 hit ratio in the hybrid memory system. This is because the hybrid memory system uses the LM to serve the regular accesses and the L1 to serve the irregular ones, so the data placed in the L1 is much less often evicted than in the cache-based system, where every access is served by the L1, so the data brought for irregular accesses is evicted when new data needs to be brought for regular references, causing misses when irregular accesses reuse data. The second important observation is that the hybrid memory system imposes an execution model that does extra work in the control and synchronization phases, but in the work phase it is able to execute the strided accesses without cache misses, since they are served by the LM. In the cache-based system, when a lot of strided memory references are being used, they cause collisions in the history tables of the prefetchers, and the big amount of prefetched data also causes conflict misses in the whole cache hierarchy. These two situations are reflected in the AMAT and the L1 hit ratios shown in Table 3.

MG and SP show a very similar behaviour, with respective reductions of 39% and 40% (or speedups of 1.64x and 1.66x). The big amount of regular references they have provokes conflict misses and collisions in the prefetchers in the cache-based system, which cause important penalties compared to the execution time spent in control phases in the hybrid memory system. CG, FT and IS show reductions of 26%, 24% and 36% (or speedups of 1.34x, 1.30x and 1.55x), respectively. These loops have fewer strided references but their critical path contains a potentially incoherent access with a high degree of reuse. These memory references almost always miss in the L1 in the cache-based system, while they are served very efficiently in the hybrid memory system. EP presents no speedup at all. In both architectures all accesses are served very efficiently, with similar AMATs and L1 hit ratios of 99.9% and 98.9%. An irregular store causes this difference in the hit ratio but
[Figure: energy consumption, broken down into CPU, Caches, LM and Others.]

5 RELATED WORK
ence problems because, with regular memory instructions, the accelerator cores can only access their LMs and the general-purpose core can only access the cache hierarchy. Whenever a modification has to be visible to other cores, DMAs are used, so the coherence is ensured. In the hybrid memory system this approach is extended to support coherence at the memory instruction level, because a core can access both memories.

D. Tang et al. [37] introduce on-chip storage to separate IO data from CPU data. Although with different motivations, this work faces similar coherence problems as the ones the proposed coherence protocol addresses. The introduction of the DMA-cache creates potential incoherences that are solved by a refinement of the MOESI and ESI cache coherence protocols. In the coherent hybrid memory system, data invalidation only happens along a dma-put, and a memory access to the cache hierarchy can never modify the contents of the LM.

6 CONCLUSIONS
The hybrid memory system, which consists of adding a local memory alongside the cache hierarchy, is a promising solution to the lack of scalability and the power consumption problems of future cache-coherent multicore and manycore architectures. One of the main problems of the hybrid memory system is the incoherence between the two storages, for which this paper proposes a novel hardware/software coherence protocol.

The protocol admits data replication in the two storages and avoids keeping them coherent. Instead, it ensures that the valid copy of the data is always accessed. The design consists of a hardware directory that keeps track of the contents of the local memory and guarded memory instructions that the compiler selectively emits for potentially incoherent memory accesses. Guarded instructions access the directory and then are diverted to the storage where the correct copy of the data resides. The main achievement of the coherence protocol is that the compiler algorithm to generate code for the hybrid memory system is straightforward and always safe, because it is not limited by memory aliasing problems.

The proposed coherence protocol introduces average overheads of 0.26% in execution time and of 2.03% in energy consumption to enable the usage of the hybrid memory system. This system, compared to a cache-based system, provides an average speedup of 38% and an energy reduction of 27%.

ACKNOWLEDGMENTS
We thankfully acknowledge the support of the Spanish Ministry of Education (TIN2007-60625 and CSD2007-00050), the Generalitat de Catalunya (2009-SGR-980), the HiPEAC Network of Excellence (contracts EU FP7/ICT 217068 and 287759), and the BSC-IBM collaboration agreement.

REFERENCES
[1] J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, M. Horowitz, and C. Kozyrakis, "Comparing Memory Systems for Chip Multiprocessors," SIGARCH Computer Architecture News, pp. 358–368, 2007.
[2] R. Murphy, "On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance," in IISWC '07: Proceedings of the 10th International Symposium on Workload Characterization. IEEE Computer Society, 2007, pp. 35–43.
[3] A. Ros, M. E. Acacio, and J. M. García, Parallel and Distributed Computing. IN-TECH, 2010, ch. Cache Coherence Protocols for Many-Core CMPs.
[4] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems," in CODES '02: Proceedings of the 10th International Symposium on Hardware/Software Codesign. ACM, 2002, pp. 73–78.
[5] J. Kahle, "The Cell Processor Architecture," in MICRO 38: Proceedings of the 38th International Symposium on Microarchitecture. IEEE Computer Society, 2005, pp. 3–4.
[6] P. N. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture." White paper, 2009.
[7] M. Gonzàlez, N. Vujic, X. Martorell, E. Ayguadé, A. E. Eichenberger, T. Chen, Z. Sura, T. Zhang, K. O'Brien, and K. O'Brien, "Hybrid Access-Specific Software Cache Techniques for the Cell BE Architecture," in PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 292–302.
[8] W. Landi and B. G. Ryder, "A Safe Approximate Algorithm for Interprocedural Aliasing," in PLDI '92: Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation. ACM, 1992, pp. 473–489.
[9] A. Deutsch, "Interprocedural May-Alias Analysis for Pointers: Beyond k-limiting," in PLDI '94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation. ACM, 1994, pp. 230–241.
[10] R. P. Wilson and M. S. Lam, "Efficient Context-Sensitive Pointer Analysis for C Programs," in PLDI '95: Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation. ACM, 1995, pp. 1–12.
[11] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann, 2002.
[12] R. Bertran, M. Gonzàlez, X. Martorell, N. Navarro, and E. Ayguadé, "Local Memory Design Space Exploration for High-Performance Computing," The Computer Journal, pp. 786–799, 2010.
[13] H. Cook, K. Asanovic, and D. A. Patterson, "Virtual Local Stores: Enabling Software-Managed Memory Hierarchies in Mainstream Computing Environments," Electrical Engineering and Computer Sciences Department, University of California at Berkeley, Tech. Rep. UCB/EECS-2009-131, 2009.
[14] M. Kistler, M. Perrone, and F. Petrini, "Cell Multiprocessor Communication Network: Built for Speed," IEEE Micro, pp. 10–23, 2006.
[15] T. B. Berg, "Maintaining I/O Data Coherence in Embedded Multicore Systems," IEEE Micro, pp. 10–19, 2009.
[16] "MPI: A Message-Passing Interface Standard. 2003."
[17] "OpenMP Application Program Interface. Version 3.0. May 2008."
[18] S. Seo, J. Lee, and Z. Sura, "Design and Implementation of Software-Managed Caches for Multicores with Local Memory," in HPCA '09: Proceedings of the 15th International Conference on High-Performance Computer Architecture. IEEE Computer Society, 2009, pp. 55–66.
[19] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo, "Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine Architecture," IBM Systems Journal, pp. 59–84, 2006.
[20] A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind, "Optimizing Compiler for the CELL Processor," in PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 2005, pp. 161–172.
[21] Y. Paek, J. Hoeflinger, and D. Padua, "Efficient and Precise Array Access Analysis," ACM Transactions on Programming Languages and Systems, pp. 65–109, 2002.
[22] T. Chen, T. Zhang, Z. Sura, and M. G. Tallada, "Prefetching Irregular References for Software Cache on Cell," in CGO '08: Proceedings of the 6th International Symposium on Code Generation and Optimization. ACM, 2008, pp. 155–164.
[23] "Power ISA. Version 2.06 Revision B. IBM. July 2010."
[24] "Intel 64 and IA-32 Architectures Software Developer's Manual. January 2011."
[25] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A Tool to Understand Large Caches," 2009.
[26] R. C. Murphy and P. M. Kogge, "On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Transactions on Computers, pp. 937–945, 2007.
[27] J. Weinberg, M. O. McCracken, E. Strohmaier, and A. Snavely, "Quantifying Locality In The Memory Access Patterns of HPC Applications," in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2005, pp. 50–62.
[28] M. T. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator," in ISPASS '07: Proceedings of the 7th International Symposium on Performance Analysis of Systems and Software. IEEE Computer Society, 2007, pp. 23–34.
[29] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," in ISCA '00: Proceedings of the 27th International Symposium on Computer Architecture. ACM, 2000, pp. 83–94.
[30] T.-F. Chen and J.-L. Baer, "Effective Hardware-Based Data Prefetching for High-performance Processors," IEEE Transactions on Computers, pp. 609–623, 1995.
[31] J. Doweck, "Inside Intel Core Microarchitecture and Smart Memory Access. An In-Depth Look at Intel Innovations for Accelerating Execution of Memory-Related Instructions." White paper, 2006.
[32] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS Parallel Benchmarks," in SC '91: Proceedings of the 1991 Conference on Supercomputing. IEEE Computer Society, 1991, pp. 158–165.
[33] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in ASPLOS '02: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2002, pp. 45–57.
[34] "NVIDIA CUDA C Programming Guide. Version 4.2. April 2012."
[35] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A Modular Reconfigurable Architecture," in ISCA '00: Proceedings of the 27th International Symposium on Computer Architecture. ACM, 2000, pp. 161–171.
[36] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel, "Cohesion: An Adaptive Hybrid Memory Model for Accelerators," IEEE Micro, pp. 42–55, 2011.
[37] D. Tang, Y. Bao, W. Hu, and M. Chen, "DMA Cache: Using On-Chip Storage to Architecturally Separate I/O Data from CPU Data for Improving I/O Performance," in HPCA '10: Proceedings of the 16th International Conference on High-Performance Computer Architecture. IEEE Computer Society, 2010, pp. 1–12.

Lluc Alvarez received a bachelor's degree in Computer Systems from Universitat de les Illes Balears in 2006 and a master's degree in Computer Architecture from Universitat Politècnica de Catalunya (UPC) in 2009. Since 2010 he has been a PhD student in the Computer Architecture Department at UPC and a resident student at the Barcelona Supercomputing Center. His main research interests are computer microarchitecture and memory hierarchies of multicore architectures for high-performance computing.

Lluís Vilanova is a PhD student at the Barcelona Supercomputing Center and the Computer Architecture Department at the Universitat Politècnica de Catalunya, from where he also received his bachelor's degree in Computer Science in 2006 and his master's degree in Computer Architecture in 2008. His interests cover computer architecture and operating systems.

Marc Gonzàlez received the Engineering degree in Computer Science in 1996 and the Computer Science PhD degree in December 2003. He currently holds an Associate Professor position in the Computer Architecture Department of the Technical University of Catalonia. His research activity is linked to the Barcelona Supercomputing Center (BSC) as a collaborator. His main interests are both parallel programming and computer architecture, specifically for hybrid multi-core systems. Besides, he has worked on power and energy modeling techniques for multi-core processors and on parallel programming models, with special interest in the OpenMP and OpenCL paradigms. To date, he has published more than 40 refereed papers in journals and conferences.

Xavier Martorell received the M.S. and Ph.D. degrees in Computer Science from the Technical University of Catalunya (UPC) in 1991 and 1999, respectively. He has been an associate professor in the Computer Architecture Department at UPC since 2001, teaching operating systems. His research interests cover the areas of parallelism, runtime systems, compilers and applications for high-performance multiprocessor systems. Since 2005 he has been the manager of the team working on Parallel Programming Models at the Barcelona Supercomputing Center. He has participated in several European projects dealing with parallel environments (Nanos, Intone, POP, SARC, ACOTES). He is currently participating in the European HiPEAC2 Network of Excellence and the ENCORE European project.

Nacho Navarro has been an Associate Professor at the Universitat Politecnica de Catalunya (UPC), Barcelona, Spain, since 1994, and is a Senior Researcher at the Barcelona Supercomputing Center (BSC), serving as manager of the Accelerators for High Performance Computing group. He holds a Ph.D. degree in Computer Science from UPC. His current interests include GPGPU computing, multi-core computer architectures, hardware accelerators, dynamic reconfigurable logic support, memory management and runtime optimizations. He is also doing research on massively parallel computing at the University of Illinois (IMPACT Research Group). Prof. Navarro is a member of IEEE, the IEEE Computer Society, the ACM and the HiPEAC NoE.

Eduard Ayguadé received the Engineering degree in Telecommunications in 1986 and the Ph.D. degree in Computer Science in 1989, both from the Universitat Politècnica de Catalunya (UPC), Spain. Since 1987 he has been lecturing on computer organization and architecture and parallel programming models. Currently, and since 1997, he is a full professor in the Computer Architecture Department at UPC. His research interests cover the areas of processor microarchitecture, multicore architectures, and programming models and their architectural support. He has published more than 100 papers on these topics and participated in several research projects in the framework of the European Union and research collaborations with companies. He is associate director for research on computer sciences at the Barcelona Supercomputing Center (BSC-CNS).