
Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories

Lluc Alvarez, Member, IEEE, Lluís Vilanova, Marc Gonzàlez, Member, IEEE, Xavier Martorell, Member, IEEE, Nacho Navarro, Member, IEEE, Eduard Ayguadé

Abstract—Cache coherence protocols limit the scalability of multicore and manycore architectures and are responsible for an important amount of the power consumed in the chip. A good way to alleviate these problems is to introduce a local memory alongside the cache hierarchy, forming a hybrid memory system. Local memories are more power-efficient than caches and do not generate coherence traffic, but they suffer from poor programmability. When non-predictable memory access patterns are found, compilers do not succeed in generating code because of the incoherence between the two storages. This paper proposes a coherence protocol for hybrid memory systems that allows the compiler to generate code even in the presence of memory aliasing problems. Coherence is ensured by a software/hardware co-design where the compiler identifies potentially incoherent memory accesses and the hardware diverts them to the correct copy of the data. The coherence protocol introduces overheads of 0.26% in execution time and 2.03% in energy consumption to enable the usage of the hybrid memory system, which outperforms cache-based systems with a speedup of 38% and an energy reduction of 27%.

Index Terms—Coherence protocol, Local memories, Scratchpad memories, Hybrid memory system

1 INTRODUCTION

Upcoming multicore and manycore architectures are expected to include a significant number of cores, as a result of the replication of general purpose and accelerator cores. As an immediate consequence, the memory subsystem has to evolve into some novel organization that overcomes the problems of traditional cache-based schemes. Two of the major concerns are the important amount of power consumed in the cache hierarchy and the lack of scalability of current cache coherence protocols, which constrain the sharing and the size of caches when cores are replicated beyond certain levels [1], [2], [3].

A possible solution to the power consumption and scalability problems of cache coherence protocols is the introduction of local memories (LMs), also known as scratchpad memories [4]. The main advantages of LMs are that they offer access delays similar to best-case cache delays in a much more power-efficient way and that they do not generate coherence traffic. The drawback is that LMs introduce programmability difficulties due to the explicit data transfers they require, so usually programmers rely on compiler transformations that generate code to manage the LM. Despite this limitation, LMs have been successfully introduced in the high performance computing (HPC) domain in several ways. In the Cell B.E. [5], accelerator cores access their private LM with memory instructions and use explicit DMA transfers to move data between memories. A more recent trend is to introduce a LM alongside the cache hierarchy, forming a hybrid memory system. This approach is currently used in GPGPUs [6].

One of the main problems of the hybrid memory system is the potential replication of data between the two storages. Compilers succeed in generating code for LMs when the computation is based on predictable memory access patterns [7] but, when non-predictable memory access patterns are found, compilers need to ensure correctness by applying complex analyses such as memory aliasing [8], [9], [10]. When compilers cannot ensure that there is no aliasing between two memory references that may target copies of the same data in the LM and in the cache hierarchy, they must conservatively avoid using the LM. This problem happens because the copies of data in the LM and in the cache hierarchy are incoherent.

The main contribution of this paper is a novel coherence protocol for hybrid memory systems that achieves the programmability of a cache-based system by safely enabling the use of the LM in the presence of memory aliasing problems. A coherent memory view of the two storages is ensured by a simple hardware/software mechanism implemented by two components: (1) a per-core hardware directory that keeps track of which data is mapped to the LM and (2) guarded instructions for memory operations that the compiler selectively places in potentially incoherent data accesses. At execution time the guarded memory instructions access the directory to identify which memory keeps the correct copy of the data and are diverted to it. The proposal allows the compiler to use a straightforward algorithm to generate code for the hybrid memory system. The evaluation shows that, compared to a compiler that is able to resolve all memory aliasing problems, the proposal introduces average overheads of 0.26% in execution time and 2.03% in energy consumption. These overheads are outweighed by the benefits coming from the ability to generate code for the hybrid memory system, which provides an average speedup of 38% and an average energy saving of 27% when compared to a cache-based system.

The rest of this paper is organized as follows: Section 2 gives some background on how a LM is integrated in a core and how the resulting architecture is programmed. Section 3 explains the design of the coherence protocol and Section 4 presents its evaluation. Section 5 comments on related work and Section 6 summarizes the main conclusions of this work.

• L. Alvarez, L. Vilanova, M. Gonzàlez, X. Martorell, N. Navarro and E. Ayguadé are with the Department of Computer Architecture, Universitat Politècnica de Catalunya and the Barcelona Supercomputing Center, Barcelona, 08034 Spain. E-mail: [email protected]

2 BACKGROUND AND MOTIVATION

This section explains the hybrid memory system, its execution model and the coherence problem it exposes.

Fig. 1: Overview of the hybrid memory system. The architecture consists of extending a regular core with a Local Memory (LM) and a programmable DMA controller (DMAC).
2.1 Baseline Architecture

The hybrid memory system consists of extending a core with a LM and a DMA controller (DMAC), as Figure 1 shows. The LM is integrated into the core at the same level as the L1 cache and is used to store private data only. A range of the virtual address space is devoted to the LM, and this range is direct-mapped to the physical address space of the LM. The CPU needs three registers to keep track of the address mapping of the LM: a register for the base address of the virtual address range, a register for the base address of its physical address range and a register for the size of the LM. The CPU is able to access the LM using regular loads and stores to its virtual address range. In order to distinguish which memory has to serve a memory instruction, a range check is performed on the virtual address, prior to any MMU [11] action. If the virtual address is in the range reserved for the LM, the MMU is bypassed and a physical address that points to the LM is generated. This scheme is the preferred one to integrate a LM alongside the cache hierarchy [12], [13] because it has two important benefits. First, since no paging is used for the LM, memory accesses to the LM do not need to access the TLB, so they are extremely power-efficient and they have a deterministic latency. Second, it allows the introduction of the LM in a very simple way, because only three registers are required to configure the LM and there is no interference with the cache hierarchy. In addition, the typical size of a LM is extremely small compared to the size of the RAM and of the virtual address space of a 64-bit machine, so the virtual and physical address ranges reserved for the LM occupy a very minor portion of the whole address spaces.

The DMAC is in charge of transferring data between the LM and the system memory (SM, which includes caches and main memory). It offers three operations: (1) dma-get transfers data from the SM to the LM, (2) dma-put transfers data from the LM to the SM and (3) dma-synch waits for the completion of certain DMA transfers. These operations are explicitly triggered by software using memory instructions to non-cacheable memory-mapped I/O registers in the DMAC. DMA transfers are coherent with the SM [14], [15] by inspecting the cache hierarchy at every bus request. The bus requests generated by a dma-get look for the data in the caches. If the data is in some cache, it is copied from there to the LM, otherwise it is copied from the main memory. The bus requests generated by a dma-put copy the data from the LM to the main memory and invalidate the cache line in the whole cache hierarchy, if it exists.

2.2 Execution Model

One of the big challenges of the hybrid memory system is to be more efficient than a cache-based system while offering exactly the same level of programmability. Since the introduction of a LM requires the software to explicitly manage the data, the only way to offer the programmer a system that is as programmable as a cache-based system is to give the compiler the responsibility of generating the code that manages the LM. In order to do so, the idea is that the programmer writes conventional parallel code and the compiler, first, identifies what data is better suited to be mapped to the LM and, then, generates code to manage this mapping transparently.

The first thing to be done by the compiler is to identify which data is private to each core, so it can be mapped to its LM. Typically programming models rely on the programmer to know how the data of a parallel program is distributed. In distributed memory architectures, programming models such as MPI [16] require the user to explicitly partition the data. Allocations are private to each computational task and the programmer adds explicit data transfers and synchronization points between tasks when needed. In shared memory architectures the programmer guides the partitioning. In OpenMP [17] the programmer adds code annotations to specify if the data is private or shared between threads and how the iteration space of a loop is split between the threads. Thus, in both cases, the data distribution between computational entities is solved by the inherent properties of the programming models themselves.

The data assigned to each core is then mapped to its LM, inducing a particular execution model. In the case of a computational loop, the code is converted into a two-level nested iterative structure that uses blocking [7], as Figure 2 shows. Each outermost iteration maps chunks of data to the LM and computes a subset of iterations. It executes three phases to do so: (1) a control phase that moves chunks of data between the LM and the SM, (2) a synchronization phase that waits for the DMA transfers to finish and (3) a work phase where the computation for the current chunk of data is performed. The three phases repeat until the whole iteration space is computed. These code transformations are usually done by run-time libraries [7], [18] or compilers [19], [20].

Automatic code transformations decide which data is mapped to the LM by analyzing the memory accesses [21]. Regular accesses are those that expose predictable access patterns (e.g., with a constant stride). These are mapped to the LM.
Unpredictable memory accesses are difficult to map to the LM [7], so they are served by the cache hierarchy. These are called irregular accesses. In the original code in Figure 2, the accesses to a and b are regular accesses, and the accesses to c and ptr are irregular.

Fig. 2: Code transformation for the hybrid memory system and three-phase execution model of the transformed code.

In the control phase, chunks of data are moved between the LM and the SM. In order to do this task in a simple and efficient way, the compiler declares as many buffers in the LM as regular accesses appear in the loop. All buffers have the same size, determined by the size of the LM and the number of regular accesses. In Figure 2 there are two regular accesses (a and b) so two buffers (_a and _b) would be allocated in the LM, each one of them occupying half the storage. In every instance of the control phase, for each regular access, the chunk of data that is needed in the next work phase is mapped to its corresponding LM buffer (MAP statements in Figure 2), potentially sending back to the SM the previously used chunk. Even in case of mapping a chunk of data to the LM for writing only, the transfer of the chunk from the SM to the LM is done because otherwise, if only part of the chunk was modified, the write-back to the SM would update the unmodified parts of the copy in the SM with garbage.

The work phase is like the original loop, but with two differences. First, every instance of the work phase consumes a subset of the original iteration space. The amount of iterations depends on the stride of the regular accesses and the size of the LM buffers. Second, the original regular accesses (a and b) are substituted with their LM buffer counterparts (_a and _b) while irregular accesses are left untouched (c and ptr).
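Since Figure 2 itself is not reproduced in this extract, the following C sketch illustrates the transformation just described. The map() and dma_synch() calls, the CHUNK factor and the index function f() are illustrative names standing in for the runtime's DMA interface, not the paper's actual API; the loop body matches the example used throughout the paper (a regular copy, an irregular store to c and a potentially incoherent update through ptr):

    /* Illustrative declarations: _a and _b name the LM buffers the
       compiler allocates in the LM address range. */
    extern int a[], b[], c[], *ptr;
    extern int _a[], _b[];
    extern void map(int *lm_buf, int *sm_addr, unsigned long bytes);
    extern void dma_synch(void);
    extern int f(int i);            /* some non-strided index function */
    #define N     100000
    #define CHUNK 1024

    void kernel_original(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = b[i];            /* regular (strided) accesses       */
            c[f(i)] = 0;            /* irregular: never aliases a or b  */
            ptr[a[i]]++;            /* potentially incoherent           */
        }
    }

    void kernel_transformed(void)
    {
        for (int ii = 0; ii < N; ii += CHUNK) {
            /* 1. control phase: remap the LM buffers, writing the previous
                  chunk back to the SM and fetching the next one. */
            map(_a, &a[ii], CHUNK * sizeof(int));
            map(_b, &b[ii], CHUNK * sizeof(int));
            /* 2. synchronization phase: wait for the DMA transfers. */
            dma_synch();
            /* 3. work phase: regular accesses use the LM buffers,
                  the other accesses are left untouched. */
            for (int _i = 0; _i < CHUNK; _i++) {
                _a[_i] = _b[_i];
                c[f(ii + _i)] = 0;
                ptr[_a[_i]]++;
            }
        }
    }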
2.3 The Coherence Problem

The coherence problem in the hybrid memory system appears when two incoherent copies of the same data can be accessed during the computation. When some data is mapped to the LM, a copy of the data is created. For regular accesses, the compiler generates memory operations that access the copy in the LM while, for irregular accesses, it generates memory operations that access the copy in the SM. Since the memories are incoherent, modifications are not visible between paths, so the execution can be incorrect.

Compiler-based solutions for this situation are inefficient. All approaches rely on memory aliasing analyses [8], [9], [10]. In Figure 2 this means predicting when, if ever, any instance of the accesses to c or ptr aliases with any instance of the accesses to a or b. Current algorithms are not able to solve this problem in the general case, so compilers adopt restrictive solutions in its presence. The naive one is to discard the usage of the LM in the presence of a potentially incoherent access. A potentially incoherent access is an irregular access that the compiler cannot ensure will never access data in the SM that is mapped to the LM. Another option is to introduce fine-grained DMA transfers surrounding the potentially incoherent accesses [7], adding big overheads because DMA transfers of small sizes are inefficient. Software caching is another solution [22], [7]. These approaches keep track of the contents of the LM with a software directory and perform a costly associative search on it prior to every potentially incoherent access to determine if the access has to go to the LM or to the SM.

This paper proposes an efficient mechanism that ensures coherence in hybrid memory systems. The solution avoids the limitations stemming from the inability to solve the memory aliasing problem, bringing the optimization opportunities to a new level where automated optimization tools no longer have to back off their code transformations due to coherence issues.

3 DESIGN

The main idea of the coherence protocol is to avoid maintaining two coherent copies of the data but, instead, to ensure that memory accesses always use the valid copy of the data. The resulting design is open to data replication between the LM and the cache hierarchy. The system guarantees that, first, in case of data replication, either the copies are identical or the copy in the LM is the valid one and, second, a valid copy of the data is always accessed. For data transfers this is ensured by using coherent DMA transfers and by guaranteeing that, at the eviction of replicated data, the invalid copy is always discarded first and then the valid version is evicted. For data accesses, potentially incoherent accesses are diverted to the memory that keeps the valid copy. In order to do so, a directory is introduced to keep track of what data is mapped to the LM. The DMAC updates the directory entries when it executes dma-get commands. The compiler identifies potentially incoherent memory accesses and emits guarded memory instructions for them. The execution of a guarded memory instruction triggers a lookup in the directory, diverting the access to the memory that keeps the valid copy of the data.

The proposed coherence protocol is independent of the cache coherence protocol. It is per core and it ensures coherence between the caches and the LM of that core, without interacting with other cores or with the cache coherence protocol. It can be integrated in a multicore with the hybrid memory system by simply replicating the per-core hardware support in every core. This is because the LMs in the hybrid memory system are used to store per-core private data only. One core cannot access the LM of another core and, when a core maps data to its LM, another core should not access
the copy of this data in the SM. This is key to ensure there
is no interaction with the cache coherence protocol and it
is what allows the proposal to work by only monitoring
events inside the core. This constraint is easily ensured when
a compiler maps private data to the LM because the data
distribution is already specified in the parallelization model.
If the architecture is programmed by hand, the programmer
is responsible for not accessing the data mapped to one core
from another core without using synchronization primitives.
The next sections explain the compiler and hardware support for the coherence protocol, show an example of how everything works together and describe how the system manages the copies of the data, in such a way that the conditions for the correctness of the coherence protocol are always fulfilled.

3.1 Compiler Support


With the proposed coherence protocol the compiler algorithm that transforms the code as shown in Figure 2 is straightforward and safe, even in the presence of memory aliasing problems. The compiler support, as shown in Figure 3, consists of three phases: classification of memory references, code transformation and code generation.
Phase 1 - Classification of memory references: In this phase the compiler identifies which memory accesses are suitable to be mapped to the LM and which to the SM.
It does so by classifying the memory references according
to their access patterns and possible aliasing hazards. This
last analysis is done using the alias analysis function, which
receives two pointers as inputs and gives an outcome with
three possible values: the pointers alias, the pointers do not
alias or the pointers may alias. The information generated in
this phase is added to the intermediate representation of the
compiled code and is used in the next phases. The classes of
memory references are:
• Regular accesses are those that expose a strided access
pattern. They access the LM.
• Irregular accesses are those that do not expose a strided
access pattern and the compiler determines they do not
alias with any regular access. They access the SM.
• Potentially incoherent accesses are those that do not expose a strided access pattern and the compiler determines they alias or may alias with some regular access. They access the directory and then the SM or the LM (see the sketch after this list).
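The classification rule in the list above can be condensed into a small decision function. This is a minimal sketch assuming a three-valued alias query like the one just described; the names alias_result and classify are illustrative, not the paper's:

    #include <stdbool.h>

    /* Three-valued outcome of the alias analysis described above. */
    enum alias_result { NO_ALIAS, MAY_ALIAS, MUST_ALIAS };

    enum access_class { REGULAR, IRREGULAR, POTENTIALLY_INCOHERENT };

    /* worst_vs_regulars: the most pessimistic alias outcome of this
       reference against every regular reference in the loop. */
    enum access_class classify(bool strided, enum alias_result worst_vs_regulars)
    {
        if (strided)
            return REGULAR;                /* mapped to the LM            */
        if (worst_vs_regulars == NO_ALIAS)
            return IRREGULAR;              /* served by the SM            */
        return POTENTIALLY_INCOHERENT;     /* guarded: directory lookup,
                                              then LM or SM               */
    }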
In the example shown in Figure 3, the compiler classifies a and b as regular accesses because they expose a strided access pattern. Accesses c and ptr do not follow a strided access pattern so, depending on the outcome of the alias analysis, they are categorized as irregular or as potentially incoherent accesses. The example assumes the compiler succeeds in ensuring that c does not alias with any regular access and that it is unable to do so for ptr, so it classifies c as an irregular access and ptr as a potentially incoherent access.

Fig. 3: Example of application of the three phases of the compiler support. In the first phase the memory references are classified as regular, irregular or potentially incoherent. In the second phase the code is transformed to follow the execution model for the hybrid memory system. In the third phase the assembly code is emitted, generating guarded memory instructions for the potentially incoherent memory instructions (lines 17 and 19) and the double store if needed (line 20).

Phase 2 - Code transformation: In this phase the compiler transforms the code for regular accesses as explained in Section 2.2. These are typical transformations to manage LMs using tiling [20], [7]. For irregular and potentially incoherent accesses nothing is done in this phase.
In Figure 3 the code is transformed in exactly the same way as explained in Section 2.2. The only difference is the existence of a new class of memory accesses, the potentially incoherent ones, like ptr. Since the compiler does not have to do any transformations for them, the resulting code is the same as in Figure 2.

Phase 3 - Code generation: In this phase the compiler generates the assembly code for the target architecture:

• For regular accesses the compiler generates memory instructions that directly access the LM. This is accomplished by using as source operands the base address of a LM buffer and an offset.
• For irregular accesses the compiler generates memory instructions that directly access the SM. This is accomplished by using as source operands a base address in the SM and an offset.
• For potentially incoherent accesses the compiler generates guarded memory instructions with an initial SM address. This is accomplished by using as source operands a base address in the SM and an offset. When it is executed, the guarded memory instruction accesses the directory using the SM address and is diverted to the corresponding memory. The implementation of the guarded memory instruction is discussed later in this section.
Figure 3 shows the assembly code that the third phase emits for the body of the innermost loop. In the statement that uses regular accesses (line 10), a conventional load (ld in line 11) and a conventional store (st in line 12) are emitted to, respectively, read a value from _b and write it in _a. When these instructions are executed, their addresses will guide the memory accesses to the LM. Similarly, in the statement that uses an irregular access to store the zero value in random positions of c (line 13), the compiler emits a conventional store (st in line 15) with an address that will access the SM at execution time. Finally, to increment the value that is accessed via potentially incoherent accesses (line 16), the compiler emits a guarded load (gld in line 17) to read the value and a guarded store (gst in line 19) to write the value after incrementing it. When these two guarded memory instructions are executed, the initial SM addresses based on ptr will be used to look up the directory and they will be changed to LM addresses if a copy of the data exists there.
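Figure 3's emitted assembly is not reproduced in this extract, but the line numbers the text cites (10 to 20) allow a plausible reconstruction. The mnemonics ld/st/gld/gst follow the text; registers and addressing modes are illustrative:

    10:  ; _a[_i] = _b[_i]        (statement with regular accesses)
    11:      ld   r1, _b(r_i)     ; conventional load, LM address range
    12:      st   r1, _a(r_i)     ; conventional store, also served by the LM
    13:  ; c[...] = 0             (statement with an irregular access)
    14:      mov  r2, 0
    15:      st   r2, c(r_j)      ; conventional store, served by the SM (cache)
    16:  ; ptr[_a[_i]]++          (statement with potentially incoherent accesses)
    17:      gld  r3, ptr(r_k)    ; guarded load: directory lookup, diverted to LM or SM
    18:      add  r3, r3, 1
    19:      gst  r3, ptr(r_k)    ; guarded store: updates the LM copy if one exists
    20:      st   r3, ptr(r_k)    ; conventional store: always updates the SM copy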
One special case has to be treated separately. When the compiler determines a write access is potentially incoherent, it also has to ensure that the access does not alias with some data that is mapped to the LM as read-only. If it cannot ensure this, emitting a single guarded store can lead to an erroneous execution. This is caused by a typical optimization of tiling transformations that consists of not triggering a write-back to the SM of a chunk of data that is mapped as read-only. With this optimization, what will happen at execution time is that the guarded store will hit in the directory and the modification will be done to the LM. Since the buffer will not be written back to the SM, when the buffer is reused to map new data, the corresponding dma-get operation will overwrite the contents of the buffer and the modifications done by the potentially incoherent store will be lost. A naive solution to this problem is to disable the tiling optimization, forcing the write-back to always be performed and so incurring high performance penalties. A more efficient solution is to make the modifications in the two memories. A simple way to do it is that the compiler generates a double store: one irregular store that will update the copy in the SM and one potentially incoherent store that will trigger a lookup in the directory and will update the copy in the LM if it exists. Note that if the lookup in the directory of the potentially incoherent store misses there will be two stores of the same data to the same SM address. The overhead of this unnecessary second store is small. The performance impact is low because the two stores are independent, so they both can be issued in the same cycle. The increase in power consumption is also small since the Load/Store Queue [11] will collapse the second store with the first one if it is not yet committed, having one single cache access and so not paying the cost of an extra memory access. Note also that, in the presence of a potentially incoherent store, the compiler almost always generates a double store since, in general, it is unable to ensure that the aliasing is not with some read-only data. This happens because typically the compiler is unable to determine the accessible address range of a potentially incoherent access, therefore it is also unable to ensure there is no read-only data in this potentially unbounded accessible address range.

The final code of Figure 3 shows how the compiler generates the double store. For the increment of a random position of ptr (line 16), the value is read with a guarded load (gld in line 17), incremented and finally written with a double store. The double store consists of a guarded store (gst in line 19) that will modify the copy in the LM if it exists and a conventional store (st in line 20) with the same source operands that will always update the value in the SM.

The implementation of the guarded memory operations is highly architecture-dependent. The trivial implementation is to duplicate all memory instructions with a guarded form. As this might produce many new opcodes, it may be unacceptable for some ISAs, especially RISC architectures. One alternative is to take unused bits of the binary representation of memory instructions, as happens in PowerPC [23]. Another option is to provide a smaller set of guarded memory instructions and restrict the compiler to these. In CISC architectures like x86 [24], where most instructions can access memory, instruction prefixes can be used to implement the guard. A generic solution for any ISA is to extend the instruction set with a single instruction that performs the computation of the address using the directory and leaves the result in a register that is consumed by the memory instruction, conceptually converting the guarded memory access into a coherence-aware address calculation plus a normal memory operation.

3.2 Hardware Design

The only hardware support needed for the coherence protocol is a directory that keeps track of the contents of the LM. This section explains how the directory is configured, updated and used in the address generation. Then some considerations about its access time, its double buffering support and its side effects on the hybrid memory system are discussed.
Fig. 4: Scheme and operations of the directory for the coherence protocol. The directory is updated at every dma-get. In the address generation stage of the guarded memory instructions, the directory is looked up to generate a SM or a LM address.

Configuration: The directory can be configured to work with any LM buffer size. When the compiler transforms the code it partitions the LM into equally sized buffers and informs the hardware of the LM buffer size through a memory-mapped register. A directory entry is assigned to each of these LM buffers to map the starting address of the copy of the data in the SM (i.e., the directory tag) to the starting address of the LM buffer where the data is mapped. Since all LM buffers are equally sized, the base address of a LM buffer is equivalent to the buffer number and, thus, to the index of a directory entry. The buffer size is used to set the values of the Base Mask and Offset Mask internal registers. These registers make it possible to decompose any address into a base address and an address offset, so the directory can be operated with any buffer size.

Update: Every dma-get operation updates the directory. The destination LM address of the transfer is used to identify the base address of the LM buffer and the source SM address is used to set the tag of the corresponding directory entry.
Address generation: The directory is used in the address generation as shown in Figure 4. The Address Generation Unit (AGU) [11] first generates a potentially incoherent SM address (Incoherent address). Notice that this is a SM address because it is generated by a potentially incoherent access. Two bit-wise AND operations between the Incoherent address and the Base Mask and Offset Mask registers split the address into an Incoherent base address and an Incoherent address offset. The Incoherent base address is used to do a lookup in the directory. If it hits, the instruction is accessing data in the SM that has a copy in the LM, so the access has to be diverted to the LM. The base address of the corresponding LM buffer is retrieved from the directory (LM base addr) and a bit-wise OR with the Incoherent address offset is done, resulting in the Coherent address. If the lookup misses there is no copy in the LM, so the original SM address is preserved by performing a bit-wise OR between the SM base addr and the Incoherent address offset.
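The following C sketch models this datapath in software, purely for illustration (the struct layout and function are assumptions consistent with the description, not a hardware specification):

    #include <stdint.h>
    #include <stdbool.h>

    #define DIR_ENTRIES 32   /* the directory is restricted to 32 entries */

    /* One entry per LM buffer: the tag is the SM base address set by the
       last dma-get; the Presence bit is described under "Double buffering
       support" below. */
    struct dir_entry {
        uint64_t sm_base;    /* tag */
        uint64_t lm_base;    /* base address of the LM buffer */
        bool     valid;
        bool     present;
    };

    /* base_mask / offset_mask are derived from the LM buffer size that the
       compiler configures through the memory-mapped register. */
    uint64_t coherent_address(const struct dir_entry dir[DIR_ENTRIES],
                              uint64_t incoherent_addr,
                              uint64_t base_mask, uint64_t offset_mask)
    {
        uint64_t base   = incoherent_addr & base_mask;    /* Incoherent base   */
        uint64_t offset = incoherent_addr & offset_mask;  /* Incoherent offset */

        /* Associative (CAM) lookup on the tags. */
        for (int e = 0; e < DIR_ENTRIES; e++) {
            if (dir[e].valid && dir[e].sm_base == base) {
                /* Hit: a copy exists in the LM. If !dir[e].present the
                   hardware stalls with an internal exception until the
                   dma-get completes. */
                return dir[e].lm_base | offset;           /* Coherent address */
            }
        }
        return base | offset;    /* Miss: keep the original SM address */
    }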
Access time: The directory is restricted to 32 entries to keep the access time low. According to CACTI [25], with a process technology of 45nm, the latency of the directory is 0.348 ns. Taking into account that this latency would be significantly lower with current process technologies, that current CPUs work at frequencies between 2GHz and 3GHz and that the directory is accessed just after an extremely simple operation in the AGU, it is feasible to generate the address and to do the lookup in the same cycle. Having 32 entries constrains the software to use 32 LM buffers at most, so loops can only map 32 regular references to the LM. This is not a big limitation since loops with more than 32 regular references are rare. If a loop needs more than 32 buffers the compiler can simply not map the exceeding regular accesses to the LM.

Double buffering support: The directory contains a Presence bit that indicates if the data of a LM buffer is currently being transferred into the LM by a dma-get. This bit is reset when the dma-get is triggered. If a guarded memory access hits the directory entry and this bit is unset, an internal exception is generated until the bit is set at the dma-get completion. This ensures correctness when a guarded memory access accesses data that is being transferred to the LM using double buffering.

As a final remark, the introduction of the hardware directory does not undermine the benefits of the hybrid memory system. The number of CAM lookups is kept low because only accesses that are not regular trigger them: if they are potentially incoherent accesses they go through the directory and then to either the cache or the LM; if they are irregular accesses they are served directly by the cache. Regular accesses are directly served by the LM without any CAM lookup. Since in HPC applications the vast majority of memory accesses are regular [26], [27], the directory is rarely accessed and the benefits of the hybrid memory system are preserved.

3.3 Example of Operation

The cooperation between the compiler support and the hardware additions for the coherence protocol achieves that the memory accesses are always served by a memory that keeps a valid copy of the data. This section shows an example of how the whole mechanism operates together to do so.

Figure 5 shows an example of operation. The leftmost part of the figure shows the final code generated by the compiler after applying the three-phase transformations explained in Section 3.1. This code is the same as the resulting code of Figure 3. The rightmost part of the figure shows how the hardware executes the code. The execution is divided in four steps that correspond to four pieces of code. For every step the figure explains which memory serves the memory operations triggered by the corresponding piece of code. Note that the memory operations are labeled indicating the instruction and the line of the code that triggers them (e.g., MAP4 represents the memory operation triggered by the statement MAP of the 4th line of code). Note also that, for simplicity, the cache hierarchy in the figure only shows the first level of cache.

Fig. 5: Example of the hardware handling memory operations in the hybrid memory system with the coherence protocol. The code is divided in four pieces, each one of them corresponding to a step in the execution diagram. Every step shows how the data is moved between memories and what memory serves the memory accesses triggered by the corresponding piece of code.

The execution starts with step 1. Its first statement (line 4) maps a chunk of a to the LM buffer _a. At execution time, the transition MAP4 shows how a DMA transfer makes a copy from a in the SM to _a in the LM.
It is assumed the caches are empty, so the data is transferred from the main memory. If some cache kept the valid copy of the data, the coherent DMA transfers would read the data from the cache. Similarly, the second statement of this step (line 5) maps a chunk of b to _b, which at execution time provokes the data movement from the main memory to the LM represented by MAP5.

Once the control phase has been executed, step 2 takes place. This step is the assignment from _b to _a done with regular accesses (line 10). In assembly language the assignment is done using a conventional load (line 11) and a conventional store (line 12). At execution time, the load to _b is served directly by the LM, as represented by ld11, and the store to _a is also directly sent to the LM, as st12 represents. These direct accesses to the LM happen because the addresses of the memory operations are in the range reserved to the LM.

In step 3 an irregular store sets to zero a random position of c (line 13). The store is irregular because it does not expose a strided access pattern and the example assumes the compiler can ensure the access does not alias with any regular access. The assembly code for this statement consists of placing the value zero in a register (line 14) and then storing this value to memory using a conventional store to some position of c (line 15). This store, labeled as st15, is served by the L1 cache at execution time, since the address it modifies is not in the LM address space. Assuming the caches are empty, the L1 cache requests the cache line from the upper levels of the hierarchy (not shown in the figure) and these forward the request to the main memory. The cache line is then sent to the requesters, reaching the L1 cache so the modification can be done.

Finally, step 4 increments an element of ptr using potentially incoherent loads and stores (line 16). These memory accesses are potentially incoherent because they are not strided and the example assumes the compiler does not succeed in ensuring they do not alias with any regular access. In addition, the potentially incoherent write access needs to be treated with a double store. The instructions for this statement are a guarded load of some element of ptr (line 17), the increment (line 18), a guarded store that will write the new value in the LM in case it exists (line 19) and a conventional store that will always write the new value in the SM (line 20). When these instructions are executed, the guarded load gld17 does a lookup in the directory. If it hits, the access is diverted to the LM as gld17H shows. This happens, for instance, if ptr equals a and ptr[_a[_i]] is a position of a that has been mapped to the LM in step 1. Otherwise, if the directory lookup misses, the load labeled as gld17M is served by the L1 cache, which requests the cache line if needed. This happens, for instance, if ptr equals a but ptr[_a[_i]] is a position of a that has not been mapped to the LM in step 1. After loading the value, it is incremented and written to memory with the guarded store gst19. The execution of the guarded store, analogous to the guarded load, first does a lookup in the directory. If there is a copy of the data in the LM the lookup hits, the address is changed to point to the LM and the access goes there as gst19H shows. If there is no copy in the LM the lookup misses and the SM address is preserved, so the L1 cache serves the access as gst19M shows. With this mechanism, the valid copy of the data is always accessed. To prevent losing the modifications when there is aliasing with read-only data in the LM, the irregular store st20 modifies the copy in the SM. The address of the store guides the operation to the L1 cache, which requests the cache line if necessary.
3.4 Data Coherence Management

This section shows the correctness of the coherence protocol. The two previous sections described how memory operations are diverted to one memory or another when replication exists, considering that the valid copy of the data is in the LM. This section shows this situation is always ensured. First, the different states and actions that apply to data in the system are described. According to this, it is shown that whenever data is replicated in the LM and in the cache hierarchy, only two situations can arise: either both versions are identical, or the version in the LM is always more recent than the version in the cache hierarchy. Then it is shown that whenever replicated data is evicted to main memory, the version in the LM is always the one transferred, invalidating the cache version. This is always guaranteed unless both versions are identical, in which case the system supports the eviction indistinctly.

Fig. 6: State diagram of the possible replication states of the data. A piece of data can be in main memory only (MM), replicated only in the LM (LM), replicated only in the cache hierarchy (CM), or replicated in the LM and in the cache hierarchy at the same time (LM-CM). Creating and discarding copies of the piece of data cause transitions between the states.

3.4.1 Data States and Operations

Figure 6 shows the possible actions and states of data in the system. The state diagram is conceptual; it is not implemented in hardware. The MM state indicates the data is in main memory and has no replica in either the cache hierarchy or the LM. The LM state indicates that only one replica exists, and it is located in the LM. In the CM state only one replica in the cache hierarchy exists. In the LM-CM state two replicas exist, one in the LM and the other in the cache hierarchy.

Actions prefixed with "LM-" correspond to LM control actions, activated by software. There is a distinction between LM-map and LM-unmap although both actions correspond to the execution of a dma-get, which unmaps the previous contents of a LM buffer and maps new contents instead. LM-map indicates that a dma-get transfers the data to the LM. LM-unmap indicates that a dma-get has been performed that overwrites the data in question, so it is no longer mapped to the LM. LM-writeback corresponds to the execution of a dma-put that transfers the data from the LM to the SM. Actions prefixed with "CM-" correspond to hardware-activated actions in the cache hierarchy. CM-access corresponds to the placement of the cache line that contains the data in the cache hierarchy. CM-evict corresponds to the replacement of the cache line, with its write-back to main memory if needed.

The MM→LM transition occurs when the software causes an LM-map action. Switching back to the MM state occurs when an LM-unmap action happens due to a dma-get mapping new data to the buffer. Notice that an LM-writeback action does not imply a switch to the MM state, as transferring data to the main memory does not unmap the data from the LM. Transitions between the MM and CM states happen according to the execution of load and store operations that cause CM-access and CM-evict actions. Notice that unless the data reaches the LM-CM state, no coherence problem can appear due to the use of a LM. DMA transfers are coherent with the SM, ensuring the system coherence as long as the data switches between the LM and MM states. Similarly, the cache coherence protocol ensures the system coherence when the data switches between the MM and CM states. In both cases, never more than one replica is generated.

The LM-CM state is reachable from both the LM and the CM states. In the LM state, a guarded instruction will never cause a replica in the caches since the access goes through the directory, and this will divert the access to the LM. It is impossible to have unguarded memory instructions to the SM because the compiler never emits them unless it is sure that there is no aliasing, which cannot happen in this state. In the LM state, only the execution of a double store can cause the transition to the LM-CM state. The double store is composed of a guarded store and a store to the SM (st_guarded and st_sm). The st_sm is served by the cache hierarchy, so a replica of the data is generated and updated in the cache, while the st_guarded modifies the LM replica with the same value, so two replicas generated through a LM→LM-CM transition are always identical. The transition CM→LM-CM happens due to an LM-map action, and the DMA coherence ensures the two versions are identical. Once in the LM-CM state, the double store updates both versions, while st_guarded and st_lm modify the LM version and st_sm will never be generated.

In conclusion, only two possibilities exist for having two replicas of data. Each one is represented by one path reaching the LM-CM state from the MM state. In both cases, the two versions are either identical or the version in the LM is the valid one. The next section shows the valid version is always selected at the moment of evicting the data to main memory.
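Since Figure 6 is only available here as a caption, the transitions it depicts — those named in this subsection and in the eviction discussion that follows — can be summarized as a transition function. This is a conceptual sketch, like the diagram itself, not hardware:

    /* Conceptual replication states of a piece of data (Figure 6). */
    enum repl_state { MM, LM, CM, LM_CM };

    /* LM- actions are software (DMA) actions, CM- actions are cache
       hardware actions, DOUBLE_STORE is the st_guarded + st_sm pair. */
    enum action { LM_MAP, LM_UNMAP, LM_WRITEBACK,
                  CM_ACCESS, CM_EVICT, DOUBLE_STORE };

    enum repl_state next_state(enum repl_state s, enum action a)
    {
        switch (s) {
        case MM:
            if (a == LM_MAP)    return LM;    /* dma-get creates the LM replica  */
            if (a == CM_ACCESS) return CM;    /* load/store places a cache line  */
            break;
        case LM:
            if (a == LM_UNMAP)     return MM;    /* buffer reused for new data   */
            if (a == DOUBLE_STORE) return LM_CM; /* st_sm creates an identical
                                                    copy in the cache            */
            /* LM_WRITEBACK stays in LM: writing back does not unmap */
            break;
        case CM:
            if (a == CM_EVICT) return MM;     /* line written back to memory     */
            if (a == LM_MAP)   return LM_CM;  /* coherent dma-get replicates it  */
            break;
        case LM_CM:
            if (a == LM_WRITEBACK) return LM; /* dma-put invalidates the cache copy */
            if (a == CM_EVICT)     return LM; /* only the (valid) LM replica remains */
            if (a == LM_UNMAP)     return CM; /* LM copy discarded; versions identical */
            break;
        }
        return s;  /* all other actions leave the state unchanged */
    }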
3.4.2 Data Eviction

The state diagram shows that the eviction of data can only occur from the LM and CM states. There is no direct transition from the LM-CM state to the MM state, which means that eviction of data can only happen when one replica exists in the system. This is a key point to ensure coherence. In case the data is in the LM-CM state, its eviction can only occur if first one of the replicas is discarded, which corresponds to a transition to the LM or CM states. According to the previous section, it is ensured that in the LM-CM state the two replicas are identical or, if not, the version in the LM is the valid one. Consequently, the eviction discards the cache version unless both versions are identical, in which case either version can be evicted. This behavior is guaranteed by the transitions exiting the LM-CM state. When a LM-writeback action is triggered by a dma-put, the associated DMA transfer invalidates the version of the data that is in the cache hierarchy. The CM-evict transition is caused by an access to some other data in the SM that causes a replacement of the cache line that holds the current data, leaving just one replica, the one in the LM, and thus transitioning to the LM state. Once the LM state is reached, at some point the program will execute a dma-put operation to write back the data to the SM. Finally, the transition LM-CM→CM caused by a LM-unmap action corresponds to the case where the program explicitly discards the copy in the LM when new data is mapped to the buffer that holds it. The programming model imposes that this will only happen when both versions are identical, because if the version in the LM had modifications it would be written back before being replaced. So, after the LM-unmap, the only replica of the data is in the cache hierarchy and it is valid, and the cache coherence protocol will ensure the transfer of the cache line to the main memory is done coherently.

In conclusion, the system always evicts the valid version of the data. When two replicas exist, first the invalid one is discarded and, then, the DMA and the cache coherence mechanisms correctly manage the eviction of the valid replica.
4 EVALUATION

This section evaluates the coherence protocol for the hybrid memory system. A microbenchmark and a set of real benchmarks are used to study the overhead of the proposal in terms of execution time and energy consumption. Then a comparison against a cache-based system is presented.

4.1 Experimental Framework

The proposal has been evaluated using PTLsim [28], extending it with a LM, a DMAC and the directory of the coherence protocol. For the energy results Wattch [29] has been integrated into the simulator. Single-core simulations are presented because the coherence protocol is per core. Table 1 shows the parameters of the simulated speculative out-of-order core.

TABLE 1: PTLsim configuration parameters.

Parameter        | Description
Pipeline         | Out-of-order, 4 instructions wide
Branch predictor | Hybrid 4K selector, 4K G-share, 4K Bimodal; 4K BTB 4-way, RAS 32 entries
Functional units | 3 INT ALUs, 3 FP ALUs, 2 load/store units
Register file    | 256 INT registers, 256 FP registers
L1 I-cache       | 32 KB, 8-way set-associative, 2 cycles latency
L1 D-cache       | 32 KB, 8-way set-associative, write-through, 2 cycles latency
L2 cache         | 256 KB, 24-way set-associative, write-back, 15 cycles latency
L3 cache         | 4 MB, 32-way set-associative, write-back, 40 cycles latency
Prefetcher       | IP-based stream prefetcher [30], [31] to L1, L2 and L3
Local memory     | 32 KB, 2 cycles latency

Six memory-intensive HPC benchmarks from the NAS benchmark suite [32] are used for the evaluation. The benchmarks have been compiled using GCC 4.6.3 with the -O3 optimization flag on. SimPoint [33] has been used to identify the simulation points and at least 150 million x86 instructions have been simulated for each benchmark.

The outcome of the alias analysis performed by GCC on every memory reference has been checked to generate the guarded memory instructions. The references that GCC is not able to determine the aliasing for are the potentially incoherent accesses. Once these accesses have been identified, the source code of the benchmarks has been modified by hand to generate the guarded memory instructions using assembly macros. x86 instruction prefixes are used to implement the guarded instructions as explained in Section 3.1.
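The paper states only that hand-written assembly macros and x86 instruction prefixes implement the guard; it does not show the macros. The sketch below is a hypothetical illustration of that approach: the prefix byte 0x3E (a segment override that is otherwise inert for this mov) is an assumed choice, not the paper's, and the simulator would be the component interpreting it as a guard:

    #include <stdint.h>

    /* Hypothetical guard prefix emitted before the memory instruction. */
    #define GUARDED_LOAD32(dst, addr)                      \
        __asm__ __volatile__(".byte 0x3e\n\t"              \
                             "movl (%1), %0"               \
                             : "=r"(dst) : "r"(addr) : "memory")

    #define GUARDED_STORE32(src, addr)                     \
        __asm__ __volatile__(".byte 0x3e\n\t"              \
                             "movl %0, (%1)"               \
                             : : "r"(src), "r"(addr) : "memory")

    /* Double store for a potentially incoherent write: the guarded store
       updates the LM copy if it exists, the plain store always updates
       the SM copy. */
    #define DOUBLE_STORE32(src, addr)                      \
        do {                                               \
            GUARDED_STORE32(src, addr);                    \
            *(volatile uint32_t *)(addr) = (src);          \
        } while (0)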
4.2 Overhead of the Coherence Protocol

A microbenchmark that stresses the coherence protocol is used to facilitate the study of its performance overheads. Table 2 shows its characteristics. The microbenchmark is a loop that makes a sequence of load/add/store instructions and can be configured in four modes. In the baseline mode no guarded instructions are generated for any access. The RD mode assumes the read access a[i] is potentially incoherent, so a guarded load is generated. The WR mode assumes the write access to a[i+1] is potentially incoherent and it cannot be ensured a write-back to the SM will be performed, so a double store is emitted. The RD/WR mode is a combination of the RD and the WR modes. To model all possible scenarios in terms of the ratio of accesses that are potentially incoherent, the percentage of memory operations that need to be guarded can also be adjusted.

TABLE 2: Scheme of the microbenchmark. The microbenchmark is a simple loop that can be configured in four modes. For each mode it is assumed some memory references are potentially incoherent, so guarded memory instructions are emitted for them (marked here with a leading "g"; the original table marks them in bold).

C code:
    int a[N];
    int c;
    for (i = 0; i < N-1; i++) {
        a[i+1] = a[i] + c;
    }

Mode     | Assembly code
Baseline |   mov a(,esi,4),ebx
         |   add edi,ebx
         |   mov ebx,a+4(,esi,4)
RD       | g mov a(,esi,4),ebx
         |   add edi,ebx
         |   mov ebx,a+4(,esi,4)
WR       |   mov a(,esi,4),ebx
         |   add edi,ebx
         | g mov ebx,a+4(,esi,4)
         |   mov ebx,a+4(,esi,4)
RD/WR    | g mov a(,esi,4),ebx
         |   add edi,ebx
         | g mov ebx,a+4(,esi,4)
         |   mov ebx,a+4(,esi,4)

Figure 7 shows the overhead in execution time of the proposal in the microbenchmark. Three lines appear in the figure, one per microbenchmark mode. The X axis shows the percentage of references that are potentially incoherent with respect to the total number of references. The overhead of each mode is shown as a ratio computed against the baseline mode of the microbenchmark.

Fig. 7: Overhead in all microbenchmark modes (X axis: % of guarded instructions; Y axis: overhead ratio over the baseline mode).

The RD mode line shows no overhead at all. The only differences in the execution of a guarded load and a non-guarded load are that the prefix has to be decoded and that a lookup in the directory is triggered. Both operations fit in the cycle time, so there is no performance overhead for guarded loads. In the WR and the RD/WR modes a linear overhead can be observed as the percentage of potentially incoherent accesses grows. The overhead is caused by the extra store added. When the double store is used at every write access it adds an overhead of 28%, which is provoked by an increase in executed instructions of 26%. The double store also adds pressure to the Load/Store Queue, although not enough to become a bottleneck. The overhead decreases to less than 10% when 35% or less of the write accesses are guarded and need the double store, which provokes an increase of 9% in executed instructions. Notice that in the WR and RD/WR modes, if the compiler could ensure the potentially incoherent write access aliases with some data in the LM that will be written back to the SM, a single guarded store would be generated instead of the double store, and the overhead would be zero as in the case of a single guarded load.

In conclusion, the coherence protocol adds no performance overhead when the potentially incoherent memory accesses are for reading data or when they are for writing and the double store is not needed. Only the double store adds overhead, reaching a maximum of 28% in the microbenchmark. In real situations it is common that the number of potentially incoherent write accesses is low with respect to the total number of memory accesses and the computation is more complex than the one performed in the microbenchmark, so the expected overheads are far from this reported upper bound.

In order to study the overheads in real benchmarks, the hybrid memory system extended with the coherence protocol is compared against an incoherent hybrid memory system with an oracle compiler. In this baseline architecture the potentially incoherent accesses are left unguarded and are always served by the memory that has the valid copy of the data.

Figure 8 shows the overhead introduced by the coherence protocol in terms of execution time and energy consumption in real benchmarks. The performance overhead in CG, MG and SP is zero because the compiler does not find any potentially incoherent write access that needs to be treated with a double store. This happens only in FT and IS, which present overheads of 1.03% and 0.44%, respectively, and in EP, which presents no overhead. FT uses 34 strided references, 2 potentially incoherent read references and 2 potentially incoherent write references (treated with a double store) to do complex operations on floating point data. The cost of the computation and the small percentage of references that need to be treated with the double store keep the overhead low. In IS the computation is very simple and the double store is used in 2 out of 5 references, so the extra store provokes a non-negligible increase in the number of executed instructions. These extra instructions barely affect the performance because most of the time the out-of-order engine is able to issue the potentially incoherent store and the irregular store in the same cycle, effectively hiding the performance penalty caused by the double store. A similar situation happens in EP, which has 3 strided references, 16 local variables and 1 potentially incoherent write reference for which the double store is used. In this case the issue of the two stores is always done in the same cycle, which is why the overhead is zero. The resulting average overhead of the benchmarks is negligible, 0.26%.

Fig. 8: Overhead in real benchmarks (execution time and energy).

Figure 8 also shows the energy consumption overhead is less than 2% in all benchmarks except IS. These benchmarks have many strided references and do complex computations, so the directory is very seldom accessed and, moreover, the energy it consumes is much lower than the energy consumed by other components such as the memory subsystem, ALUs and issue queues, resulting in a very low overhead. In IS the overhead is 5%. The overhead generated by the directory is around 1.8%; the remaining 3.2% is caused by the execution of the double store. The average overhead in energy consumption of all benchmarks is 2.03%.

In conclusion, the coherence protocol adds a very low overhead in performance and in energy consumption. In 3 of the 6 benchmarks the double store is not needed, so there are no performance penalties and the utilization of the directory generates an increase in energy consumption of less than 2%. When the double store is needed the increase in the number of instructions provokes a very minor performance degradation and a slightly higher energy consumption.

4.3 Comparison with Cache-Based Architectures

The immediate result of the coherence protocol is that any computational kernel can now be executed on the hybrid memory system no matter the restrictions coming from coherence problems. In order to show the usefulness of this achievement, this section evaluates the benefits in performance and energy consumption of the coherent hybrid memory system when compared to a cache-based system.

The coherent hybrid memory system and the cache-based system studied in this section have the same characteristics but
with one difference. The hybrid memory system has a 32KB LM and the directory of the coherence protocol. For fairness, the capacity of the L1 of the cache-based system is increased to 64KB, matching the 32KB of LM plus the 32KB of L1 in the hybrid memory system. Table 3 summarizes the statistics of the memory subsystem that are the dominating factors of the improvements. This table is used throughout this section to explain the differences between the two architectures. For each benchmark the table shows the ratio of references that are potentially incoherent (guarded), the average memory access time (AMAT), the L1 hit ratio and the number of accesses to all the components of the memory subsystem, in thousands. The accounting of accesses includes hits, misses, lookups and invalidations provoked by memory instructions, prefetchers, placement of cache lines by the MSHRs, write-through and write-back policies and bus requests of the DMA commands.

TABLE 3: Activity in the memory subsystem for the hybrid memory and the cache-based systems. Accesses are in thousands.

Name | Mode            | Guarded Refs | AMAT | L1 Hit ratio | L1 Acc. | L2 Acc. | L3 Acc. | LM Acc. | Directory Acc.
CG   | Hybrid coherent | 1/7 (14%)    | 3.15 | 90.52 |   19319 |  26376 |  10597 |   30235 | 10566
CG   | Cache-based     | 0            | 4.31 | 82.23 |   70371 |  62822 |  84202 |       0 |     0
EP   | Hybrid coherent | 1/20 (5%)    | 2.14 | 99.93 |   37152 |  10266 |    228 |    3862 |  3519
EP   | Cache-based     | 0            | 2.37 | 98.93 |   43814 |  13219 |    797 |       0 |     0
FT   | Hybrid coherent | 4/34 (11%)   | 2.60 | 96.61 |  912779 | 761009 | 110186 | 1155150 | 55118
FT   | Cache-based     | 0            | 4.95 | 78.54 | 1379688 | 789765 | 352269 |       0 |     0
IS   | Hybrid coherent | 2/5 (25%)    | 6.27 | 74.00 |  140663 | 194465 |  74647 |   73400 | 25714
IS   | Cache-based     | 0            | 7.93 | 64.10 |  169425 | 182716 | 127692 |       0 |     0
MG   | Hybrid coherent | 1/60 (1.66%) | 2.24 | 99.71 |  605269 | 252799 |  35588 |  798562 | 19377
MG   | Cache-based     | 0            | 3.89 | 90.65 |  827239 | 238099 | 127176 |       0 |     0
SP   | Hybrid coherent | 0/497 (0%)   | 2.41 | 98.37 |  331832 | 162441 |  24159 |  235024 |     0
SP   | Cache-based     | 0            | 4.73 | 79.59 |  407952 | 164515 |  82301 |       0 |     0
SP show a very similar behaviour, with respective reductions
The immediate consequence of the coherence protocol is
of 39% and 40% (or speedups of 1.64x and 1.66x). The big
that any computational loop can be executed on the hybrid
amount of regular references they have provoke conflict misses
memory system. The benchmarks that take benefit of this
and collisions in the prefetchers in the cache-based system,
achievement are all but SP. In Table 3 this is reflected in the
which cause important penalties compared to the execution
column of the number of guarded references. All benchmarks
time spent in control phases in the hybrid memory system.
but SP have potentially incoherent references for which the
CG, FT and IS show reductions of 26%, 24% and 36% (or
compiler generates guarded accesses. Without the coherence
speedups of 1.34x, 1.30x and 1.55x), respectively. These loops
protocol the usage of the hybrid memory system would not
have fewer strided references but their critical path contains
be possible in these cases, so the performance and energy
a potentially incoherent access with a high degree of reuse.
consumption benefits it provides would not be exploited.
These memory references almost always miss in the L1 in
The reduction in execution time the hybrid memory system the cache-based system, while they are served very efficiently
achieves when compared to a cache-based system can be in the hybrid memory system. EP presents no speedup at all.
observed in Figure 9. For each benchmark two bars are In both architectures all accesses are served very efficiently,
presented. The leftmost bar is the execution time of the cache- with similar AMATs and L1 hit ratios of 99.9% and 98.9%.
based system and the rightmost bar is the execution time of An irregular store causes this difference in the hit ratio but
the hybrid memory system. Both bars are normalized to the
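As a reading aid, the reductions in execution time and the speedups quoted in the rest of this section are two views of the same normalized measurement:

$\mathrm{speedup} = \dfrac{T_{cache\text{-}based}}{T_{hybrid}} = \dfrac{1}{1 - \mathrm{reduction}}, \qquad \text{e.g.,}\ \dfrac{1}{1 - 0.39} \approx 1.64\times$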
MG and SP show a very similar behaviour, with respective reductions of 39% and 40% (or speedups of 1.64x and 1.66x). The big amount of regular references they have provokes conflict misses and collisions in the prefetchers in the cache-based system, which cause important penalties compared to the execution time spent in control phases in the hybrid memory system. CG, FT and IS show reductions of 26%, 24% and 36% (or speedups of 1.34x, 1.30x and 1.55x), respectively. These loops have fewer strided references but their critical path contains a potentially incoherent access with a high degree of reuse. These memory references almost always miss in the L1 in the cache-based system, while they are served very efficiently in the hybrid memory system. EP presents no speedup at all. In both architectures all accesses are served very efficiently, with similar AMATs and L1 hit ratios of 99.9% and 98.9%. An irregular store causes this difference in the hit ratio, but this access is not in the critical path, so the difference in performance is 1%. On average, the speedup in all benchmarks is 1.38x, or a reduction in execution time of 28%.

[Fig. 10: Reduction in energy consumption. Bars: normalized energy (CPU / Caches / LM / Others) for CG, EP, FT, IS, MG, SP and AVG.]

Figure 10 shows, for each benchmark, the energy consumption of the cache-based system in the leftmost bar and of the hybrid memory system in the rightmost bar. Both bars are normalized to the cache-based system energy consumption and show the weight of each component of the processor on the total consumption. The energy components comprise the whole microarchitecture of the core (CPU), the three levels of the cache hierarchy (Caches), the LM (LM) and the prefetchers, the DMA controller and the buses that connect all these components (Others). All benchmarks show reductions in energy consumption, ranging from 12% to 41%. In IS, MG and SP important savings come from the CPU. This is provoked by the reduction of cache misses, which cause energy penalties in the pipeline in the form of re-executed instructions. All benchmarks present energy reductions in the cache hierarchy, with CG and FT achieving the highest benefits. The energy consumed in the cache hierarchy decreases in all cases because, first, the hybrid memory system does fewer accesses to all the levels of the hierarchy because it uses the LM instead and, second, cache misses and data prefetches are more frequent in the cache-based system, provoking energy consumption due to cache line lookups and placements. This number of saved accesses is much larger than the activity that the hybrid memory system provokes in the caches in the form of cache line lookups and invalidations due to the DMA transfers. Altogether, the resulting number of accesses to any level of the cache hierarchy decreases with the hybrid memory system, as can be observed in Table 3. Furthermore, the energy savings in the cache hierarchy are much bigger than the energy consumed by the LM in the hybrid memory system, which has a weight of less than 5%, and by the DMA engine, which also has a weight of less than 5%. The average savings in energy consumption in the benchmarks is 27%.

In conclusion, the hybrid memory system outperforms cache-based systems because it serves data very efficiently: the strided accesses are served by the LM, so the cache hierarchy is less frequently accessed and it can be devoted to the data accessed by irregular and potentially incoherent accesses, avoiding evictions of data that is going to be reused. Moreover, fewer collisions in the history tables of the prefetchers happen due to the lower activity in the caches. This lower activity directly translates to less energy consumption, which is complemented with energy savings in the CPU due to the reduction of re-executed instructions caused by cache misses.

5 RELATED WORK

The idea of adding a LM alongside the cache hierarchy is not novel. This organization is found in commercial products like the NVIDIA Fermi [6]. In this platform the global memory (that is cached) and the LM are incoherent, and the architecture does not provide any mechanism to solve the coherence problem between the two storages. Instead, it relies on the programmer to explicitly manage the two memories. CUDA [34] provides keywords for the declaration of the variables to specify which memory will store them, so data replication does not happen. If two copies of data exist, it is the programmer who has to explicitly declare and manage them, since neither the hardware nor the compiler give any support for coherence management between both memories.

Bertran et al. [12] propose to add a LM alongside the cache hierarchy in general purpose cores, but they do not solve the coherence problem between the two storages. Instead, they give the compiler the responsibility to discard loop transformations in case of coherence problems, restricting the effective utilization of the hybrid memory system.

Some works [35], [13] propose memory organizations that can be configured as caches, LMs or a combination of both. With such approaches, when the memory is logically configured as a hybrid memory system, the resulting system encounters the same coherence problem that this paper solves. The authors of Virtual Local Memories [13] allow part of the cache to be configured as a LM. When they do so, they reserve for the LM a portion of the virtual address space that is direct-mapped to the physical address space, and they offer the programmer a high-level API to move data between the LM and the SM with a DMA engine, ending up with a scheme that is identical to the one proposed in this paper. The authors of that work bypass the coherence problem by leaving the responsibility of managing the copies of data to the programmer. The coherence protocol for the hybrid memory system could be directly applied to their proposal to allow the compiler to generate code that manages the Virtual Local Memory. The memory hierarchy of the Smart Memories Architecture [35] also has the possibility to be configured as a combination of LM and caches. The authors focus on the hardware details that allow the configurability, but do not mention how the resulting configuration would be exposed to the upper layers of the system. If the Smart Memories Architecture adopted the same scheme that the hybrid memory system and the Virtual Local Memories assume, the proposed coherence protocol could also be directly applied to that work.

Cohesion [36] allows the software to dynamically select which cache lines are cache coherent by enabling and disabling the cache coherence protocol for specific lines. This approach faces the same problem as the hybrid memory system because it opens the door to incoherent copies of data, relying on the programmer to explicitly manage them.

This paper relies on previous works on DMA coherence [15]. The IBM Cell architecture [5], [14] ensures DMA coherence by doing lookups in the cache hierarchy when DMA transfers are performed. In the Cell architecture only DMA transfers can generate data replication and there are no coherence problems because, with regular memory instructions, the accelerator cores can only access their LMs and the general purpose core can only access the cache hierarchy. Whenever a modification has to be visible to other cores, DMAs are used, so coherence is ensured. In the hybrid memory system this approach is extended to support coherence at the memory instruction level, because a core can access both memories.
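The following sketch illustrates, again as a toy C model rather than the actual hardware, the kind of coherent dma-put just described: before the DMA engine writes LM data back to memory, every destination cache line is looked up in the cache hierarchy and invalidated, so the transfer leaves a single valid copy. The function names and the line size are assumptions:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE 64

/* In hardware this would look up the line in L1/L2/L3 and invalidate it;
 * here it is modeled as a no-op stub. */
static void cache_lookup_and_invalidate(uintptr_t line) { (void)line; }

/* Coherent dma-put: invalidate all cached copies of the destination range,
 * then perform the transfer (modeled here with memcpy). */
void dma_put(const void *lm_src, void *mem_dst, size_t bytes)
{
    if (bytes == 0)
        return;
    uintptr_t first = (uintptr_t)mem_dst & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t last  = ((uintptr_t)mem_dst + bytes - 1) & ~(uintptr_t)(CACHE_LINE - 1);
    for (uintptr_t line = first; line <= last; line += CACHE_LINE)
        cache_lookup_and_invalidate(line);   /* keep the caches coherent */
    memcpy(mem_dst, lm_src, bytes);          /* the DMA transfer itself */
}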
D. Tang et al. [37] introduce on-chip storage to separate IO data from CPU data. Although with different motivations, this work faces coherence problems similar to the ones the proposed coherence protocol addresses. The introduction of the DMA-cache creates potential incoherences that are solved by a refinement of the MOESI and ESI cache coherence protocols. In the coherent hybrid memory system, data invalidation only happens along a dma-put, and a memory access to the cache hierarchy can never modify the contents of the LM.

6 CONCLUSIONS

The hybrid memory system, which consists of adding a local memory alongside the cache hierarchy, is a promising solution to the lack of scalability and the power consumption problems of future cache coherent multicore and manycore architectures. One of the main problems of the hybrid memory system is the incoherence between the two storages, for which this paper proposes a novel hardware/software coherence protocol.

The protocol admits data replication in the two storages and avoids keeping them coherent. Instead, it ensures that the valid copy of the data is always accessed. The design consists of a hardware directory that keeps track of the contents of the local memory and guarded memory instructions that the compiler selectively emits for potentially incoherent memory accesses. Guarded instructions access the directory and then are diverted to the storage where the correct copy of the data resides. The main achievement of the coherence protocol is that the compiler algorithm to generate code for the hybrid memory system is straightforward and always safe, because it is not limited by memory aliasing problems.

The proposed coherence protocol introduces average overheads of 0.26% in execution time and of 2.03% in energy consumption to enable the usage of the hybrid memory system. This system, compared to a cache-based system, provides an average speedup of 38% and an energy reduction of 27%.

ACKNOWLEDGMENTS

We thankfully acknowledge the support of the Spanish Ministry of Education (TIN2007-60625 and CSD2007-00050), the Generalitat de Catalunya (2009-SGR-980), the HiPEAC Network of Excellence (contracts EU FP7/ICT 217068 and 287759), and the BSC-IBM collaboration agreement.

REFERENCES

[1] J. Leverich, H. Arakida, A. Solomatnikov, A. Firoozshahian, M. Horowitz, and C. Kozyrakis, "Comparing Memory Systems for Chip Multiprocessors," SIGARCH Computer Architecture News, pp. 358–368, 2007.
[2] R. Murphy, "On the Effects of Memory Latency and Bandwidth on Supercomputer Application Performance," in IISWC '07: Proceedings of the 10th International Symposium on Workload Characterization. IEEE Computer Society, 2007, pp. 35–43.
[3] A. Ros, M. E. Acacio, and J. M. García, Parallel and Distributed Computing. IN-TECH, 2010, ch. Cache Coherence Protocols for Many-Core CMPs.
[4] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad Memory: A Design Alternative for Cache On-chip Memory in Embedded Systems," in CODES '02: Proceedings of the 10th International Symposium on Hardware/Software Codesign. ACM, 2002, pp. 73–78.
[5] J. Kahle, "The Cell Processor Architecture," in MICRO 38: Proceedings of the 38th International Symposium on Microarchitecture. IEEE Computer Society, 2005, pp. 3–4.
[6] P. N. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture," White paper, 2009.
[7] M. Gonzàlez, N. Vujic, X. Martorell, E. Ayguadé, A. E. Eichenberger, T. Chen, Z. Sura, T. Zhang, K. O'Brien, and K. O'Brien, "Hybrid Access-Specific Software Cache Techniques for the Cell BE Architecture," in PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 292–302.
[8] W. Landi and B. G. Ryder, "A Safe Approximate Algorithm for Interprocedural Aliasing," in PLDI '92: Proceedings of the ACM SIGPLAN 1992 Conference on Programming Language Design and Implementation. ACM, 1992, pp. 473–489.
[9] A. Deutsch, "Interprocedural May-Alias Analysis for Pointers: Beyond k-limiting," in PLDI '94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation. ACM, 1994, pp. 230–241.
[10] R. P. Wilson and M. S. Lam, "Efficient Context-Sensitive Pointer Analysis for C Programs," in PLDI '95: Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation. ACM, 1995, pp. 1–12.
[11] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann, 2002.
[12] R. Bertran, M. Gonzàlez, X. Martorell, N. Navarro, and E. Ayguadé, "Local Memory Design Space Exploration for High-Performance Computing," The Computer Journal, pp. 786–799, 2010.
[13] H. Cook, K. Asanovic, and D. A. Patterson, "Virtual Local Stores: Enabling Software-Managed Memory Hierarchies in Mainstream Computing Environments," Electrical Engineering and Computer Sciences Department, University of California at Berkeley, Tech. Rep. UCB/EECS-2009-131, 2009.
[14] M. Kistler, M. Perrone, and F. Petrini, "Cell Multiprocessor Communication Network: Built for Speed," IEEE Micro, pp. 10–23, 2006.
[15] T. B. Berg, "Maintaining I/O Data Coherence in Embedded Multicore Systems," IEEE Micro, pp. 10–19, 2009.
[16] "MPI: A Message-Passing Interface Standard," 2003.
[17] "OpenMP Application Program Interface, Version 3.0," May 2008.
[18] S. Seo, J. Lee, and Z. Sura, "Design and Implementation of Software-Managed Caches for Multicores with Local Memory," in HPCA '09: Proceedings of the 15th International Conference on High-Performance Computer Architecture. IEEE Computer Society, 2009, pp. 55–66.
[19] A. E. Eichenberger, J. K. O'Brien, K. M. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. K. Gschwind, R. Archambault, Y. Gao, and R. Koo, "Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine™ Architecture," IBM Systems Journal, pp. 59–84, 2006.
[20] A. E. Eichenberger, K. O'Brien, K. O'Brien, P. Wu, T. Chen, P. H. Oden, D. A. Prener, J. C. Shepherd, B. So, Z. Sura, A. Wang, T. Zhang, P. Zhao, and M. Gschwind, "Optimizing Compiler for the CELL Processor," in PACT '05: Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 2005, pp. 161–172.
[21] Y. Paek, J. Hoeflinger, and D. Padua, "Efficient and Precise Array Access Analysis," ACM Transactions on Programming Languages and Systems, pp. 65–109, 2002.
[22] T. Chen, T. Zhang, Z. Sura, and M. G. Tallada, "Prefetching Irregular References for Software Cache on Cell," in CGO '08: Proceedings of the 6th International Symposium on Code Generation and Optimization. ACM, 2008, pp. 155–164.
[23] "Power ISA, Version 2.06 Revision B," IBM, July 2010.
[24] "Intel 64 and IA-32 Architectures Software Developer's Manual," January 2011.
[25] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A Tool to Understand Large Caches," 2009.
[26] R. C. Murphy and P. M. Kogge, "On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications," IEEE Transactions on Computers, pp. 937–945, 2007.
[27] J. Weinberg, M. O. McCracken, E. Strohmaier, and A. Snavely, "Quantifying Locality In The Memory Access Patterns of HPC Applications," in SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2005, pp. 50–62.
[28] M. T. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator," in ISPASS '07: Proceedings of the 7th International Symposium on Performance Analysis of Systems and Software. IEEE Computer Society, 2007, pp. 23–34.
[29] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," in ISCA '00: Proceedings of the 27th International Symposium on Computer Architecture. ACM, 2000, pp. 83–94.
[30] T.-F. Chen and J.-L. Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers, pp. 609–623, 1995.
[31] J. Doweck, "Inside Intel Core Microarchitecture and Smart Memory Access: An In-Depth Look at Intel Innovations for Accelerating Execution of Memory-Related Instructions," White paper, 2006.
[32] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS Parallel Benchmarks," in SC '91: Proceedings of the 1991 Conference on Supercomputing. IEEE Computer Society, 1991, pp. 158–165.
[33] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in ASPLOS '02: Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2002, pp. 45–57.
[34] "NVIDIA CUDA C Programming Guide, Version 4.2," April 2012.
[35] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz, "Smart Memories: A Modular Reconfigurable Architecture," in ISCA '00: Proceedings of the 27th International Symposium on Computer Architecture. ACM, 2000, pp. 161–171.
[36] J. H. Kelm, D. R. Johnson, W. Tuohy, S. S. Lumetta, and S. J. Patel, "Cohesion: An Adaptive Hybrid Memory Model for Accelerators," IEEE Micro, pp. 42–55, 2011.
[37] D. Tang, Y. Bao, W. Hu, and M. Chen, "DMA Cache: Using On-Chip Storage to Architecturally Separate I/O Data from CPU Data for Improving I/O Performance," in HPCA '10: Proceedings of the 16th International Conference on High-Performance Computer Architecture. IEEE Computer Society, 2010, pp. 1–12.

Lluc Alvarez received a bachelor's degree in Computer Systems from Universitat de les Illes Balears in 2006 and a master's degree in Computer Architecture from Universitat Politècnica de Catalunya (UPC) in 2009. Since 2010 he has been a PhD student in the Computer Architecture Department at UPC and a resident student at the Barcelona Supercomputing Center. His main research interests are computer microarchitecture and memory hierarchies of multicore architectures for high-performance computing.

Lluís Vilanova is a PhD student at the Barcelona Supercomputing Center and the Computer Architecture Department at the Universitat Politècnica de Catalunya, from where he also received his bachelor's degree in Computer Science in 2006 and his master's degree in Computer Architecture in 2008. His interests cover computer architecture and operating systems.

Marc Gonzàlez received the Engineering degree in Computer Science in 1996 and the Computer Science PhD degree in December 2003. He currently holds an Associate Professor position in the Computer Architecture Department of the Technical University of Catalonia. His research activity is linked to the Barcelona Supercomputing Center (BSC) as a collaborator. His main interests are both parallel programming and computer architecture, specifically for hybrid multi-core systems. Besides, he has worked on power and energy modeling techniques for multi-core processors and on parallel programming models, with special interest in the OpenMP and OpenCL paradigms. To date, he has published more than 40 refereed papers in journals and conferences.

Xavier Martorell received the M.S. and Ph.D. degrees in Computer Science from the Technical University of Catalunya (UPC) in 1991 and 1999, respectively. He has been an associate professor in the Computer Architecture Department at UPC since 2001, teaching on operating systems. His research interests cover the areas of parallelism, runtime systems, compilers and applications for high-performance multiprocessor systems. Since 2005 he has been the manager of the team working on Parallel Programming Models at the Barcelona Supercomputing Center. He has participated in several European projects dealing with parallel environments (Nanos, Intone, POP, SARC, ACOTES). He is currently participating in the European HiPEAC2 Network of Excellence and the ENCORE European project.

Nacho Navarro has been an Associate Professor at the Universitat Politècnica de Catalunya (UPC), Barcelona, Spain, since 1994, and is a Senior Researcher at the Barcelona Supercomputing Center (BSC), serving as manager of the Accelerators for High Performance Computing group. He holds a Ph.D. degree in Computer Science from UPC. His current interests include GPGPU computing, multi-core computer architectures, hardware accelerators, dynamic reconfigurable logic support, memory management and runtime optimizations. He is also doing research on massively parallel computing at the University of Illinois (IMPACT Research Group). Prof. Navarro is a member of IEEE, the IEEE Computer Society, the ACM and the HiPEAC NoE.

Eduard Ayguadé received the Engineering degree in Telecommunications in 1986 and the Ph.D. degree in Computer Science in 1989, both from the Universitat Politècnica de Catalunya (UPC), Spain. Since 1987 he has been lecturing on computer organization and architecture and parallel programming models. Since 1997 he has been a full professor in the Computer Architecture Department at UPC. His research interests cover the areas of processor microarchitecture, multicore architectures, and programming models and their architectural support. He has published more than 100 papers on these topics and participated in several research projects in the framework of the European Union and in research collaborations with companies. He is associate director for research on computer sciences at the Barcelona Supercomputing Center (BSC-CNS).