In-Memory Data Parallel Processor
Daichi Fujiki, Scott Mahlke, Reetuparna Das
consists of 13 instructions. The key challenge is developing a simple yet powerful ISA and programming framework that can allow diverse data-parallel programs to leverage the underlying massive computational efficiency.

The proposed programming model seeks to utilize the underlying parallelism in the hardware by merging the concepts of data-flow and vector processing (or SIMD). Data-flow explicitly exposes the Instruction Level Parallelism (ILP) in the program, while vector processing exposes the Data Level Parallelism (DLP). Google's TensorFlow [1] is a popular programming model for machine learning. We observe that TensorFlow's programming semantics is a perfect marriage of data-flow and vector processing that can be applied to more general applications. Thus, our proposed programming framework uses TensorFlow as the input.

We develop a TensorFlow compiler that generates binary code for our in-memory data-parallel processor. TensorFlow (TF) programs are essentially Data-Flow Graphs (DFGs) where each operator node can have multi-dimensional vectors, or tensors, as operands. A DFG that operates on one element of a vector is referred to as a module by the compiler. The compiler transforms the input DFG into a collection of data-parallel modules with identical machine code. Our execution model is coarse-grain SIMD. At runtime, a code module is instantiated many times and processes independent data elements. The programming model and compiler support restricted communication between modules: reduce, scatter, and gather. Our compiler explores several interesting optimizations such as unrolling of high-dimensional tensors, merging of DFG nodes to utilize n-ary ReRAM operations, pipelining compute and write-backs, maximizing ILP within a module using VLIW-style scheduling, and minimizing communication between arrays.

For general-purpose computation, we need to support a variety of compute operations (e.g., division, exponent, square root). These operations can be directly expressed as nodes in TensorFlow's DFG. Unfortunately, ReRAM arrays cannot support them natively due to their limited analog computation capability. Our compiler performs an instruction lowering step in the code-generation phase to translate higher-level TensorFlow operations to the in-memory compute ISA. We discuss how the compiler can efficiently support complex operations (e.g., division) using techniques such as the Newton-Raphson method, which iteratively applies a set of simple instructions (add/multiply) to an initial seed from the look-up table and refines the result. The compiler also transforms other non-arithmetic primitives (e.g., square and convolution) to the native memory SIMD ISA.

In summary, this paper offers the following contributions:

• We design a processor architecture that re-purposes resistive memory to support data-parallel in-memory computation. In the proposed architecture, memory arrays store data and act as vector processing units. We extend the ReRAM memory array to support in-situ operations beyond the dot product and design a simple ISA with limited compute capability.

• We develop a compiler that transforms DFGs in Google's TensorFlow to a set of data-parallel modules and generates module code in the native memory ISA. The compiler implements several optimizations to exploit underlying hardware parallelism and unique features/constraints of ReRAM-based computation.

• Although the in-memory compute ISA is simple and limited in functionality, we demonstrate that with a good programming model and compiler, it is possible to off-load a large fraction of general-purpose computation to memory. For instance, we are able to execute in memory an average of 87% of the PARSEC applications studied.

• Our experimental results show that the proposed architecture can provide an overall speedup of 7.5× over a state-of-the-art multicore CPU for the PARSEC applications evaluated. It also provides a speedup of 763× over a state-of-the-art GPU for the Rodinia kernel benchmarks evaluated. The proposed architecture operates with a thermal design power (TDP) of 415 W, improves the energy efficiency of the benchmarks by 230×, and reduces the average power by 1.26×.

2 Processor Architecture

We propose an in-memory data-parallel processor on a ReRAM substrate. This section discusses the proposed microarchitecture, the ISA, and the implementation of the ISA.

2.1 Micro-architecture

The proposed in-memory processor adopts a tiled architecture as shown in Figure 1. A tile is composed of clusters of memory nodes, a few instruction buffers, and a router. Each cluster consists of a few memory arrays, a small register file, and a look-up table (LUT). Each memory array is shown in Figure 1 (b). Internally, a memory array in the proposed architecture consists of multiple rows of resistive bit-cells, a set of digital-analog converters (DACs) feeding both the word-lines and bit-lines, a sample and hold circuit (S+H), a shift and adder (S+A), and analog-digital converters (ADCs). The process of reading and writing to ReRAM memory arrays remains unchanged. We refer the reader to the ReRAM literature for details [39, 42]. The memory arrays are capable of both data storage and computation. We explain the compute capabilities of the memory arrays and the role of the digital components (e.g., register file, S+A, LUT) in Section 2.2.

The tiles are connected by an H-Tree router network. The H-Tree network is chosen to suit the communication patterns typical in our programming model (Section 3), and it also provides high-bandwidth communication for external I/O.
Figure 1. In-Memory Processor Architecture. (a) Hierarchical Tiled Structure (b) ReRAM array Structure
The clusters inside a tile are connected by a router or a crossbar topology. A shared bus facilitates communication inside a cluster. A hierarchical topology inside the tile limits the network power consumption, while providing sufficient bandwidth for the infrequent communication typical in data-parallel applications.

Each memory array can be thought of as a vector processing unit with a few SIMD lanes. The processor adopts a SIMD execution model. Each array is mapped to a specific instruction buffer. All arrays mapped to the same instruction buffer execute the same instruction. Every cycle, one instruction is read out of each instruction buffer and multi-casted to the memory arrays in the tile. The execution model is discussed in detail in Section 4.

The processor evaluated in this paper consists of 4,096 tiles, 8 clusters per tile, and 8 memory arrays per cluster. Each array can store 4KB of data and has 8 SIMD lanes of 32 bits each. Consequently, the processor has an aggregate SIMD width of two million lanes, an aggregate memory capacity of 1GB, and 494 mm2 area. The resolutions of the ADC and DAC are set to 5 and 2 bits, respectively.

2.2 Instruction Set Architecture

The proposed Instruction Set Architecture (ISA) is simple and compact. Compared to a standard SIMD ISA, the in-memory ISA does not support complex (e.g., division) and specialized (e.g., shuffle) instructions because these are hard to do in-situ in memory. Instead, the compiler transforms complex instructions into a set of lut, add, and mul instructions as discussed later. The ISA consists of 13 instructions as shown in Table 1. Each ReRAM array executes the instruction locally, hence the operand addressing modes reference rows inside the array or local registers. The instructions can have a size of up to 34 bytes. We now discuss the functionality and implementation of the individual instructions.

1) add  The add instruction is an n-ary operation that adds the data in rows specified by <mask>. The <mask> is a 128-bit mask which is set for each row in the array that participates in the addition. Figure 2 (a) shows an add operation. The mask is fed to the word-line DACs, which apply Vdd ('11') or Vdd/2 ('10') to the word-lines. A '1' in the mask activates a row. Each bit-cell in a ReRAM array can be abstractly thought of as a variable resistor. Addition is performed inside the array by summing up the currents generated by the conductance (= resistance⁻¹) of each bit-cell. A sample and hold (S+H) circuit receives the bit-line current and feeds it to the ADC unit, which outputs the digital value for the current. The result from each bit-line represents the partial sum for the bits stored in that bit-line. A word or data element is stored across multiple bit-lines. An external digital shifter and adder (S+A) combines the partial sums from the bit-lines. The final result
is written back to the <dst> memory row or register. Each of the ReRAM crossbar (XB), ADC, and S+A stages takes 1 cycle, resulting in 3 cycles in total.

Opcode       Format                   Cycles
add          <mask><dst>              3
dot          <mask><reg_mask><dst>    18
mul          <src><src><dst>          18
sub          <mask><mask><dst>        3
shift{l|r}   <src><dst><imm>          3
mask         <src><dst><imm>          3
mov          <src><dst>               3
movs         <src><dst><mask>         3
movi         <dst><imm>               1
movg         <gaddr><gaddr>           Variable
lut          <src><dst>               4
reduce_sum   <src><gaddr>             Variable

Table 1. In-Memory Compute ISA. The instructions use operand addresses specified by either <src>, <dst>, or <gaddr>. The <src> and <dst> fields are 8-bit local addresses (1 bit indicates memory/register + 7-bit row number/register number). The <gaddr> is a 4-byte global address (12-bit tile # + 6-bit array # + 7-bit row # + reserved bits). The <imm> field is a 16-byte immediate value.

2) dot  The dot instruction is also an n-ary operation, which emulates a dot product over the data in rows specified by <mask>. A dot product is a sum of products. The sum is done using current summation over the bit-line as explained earlier. Each row computes a product by streaming in the multiplicand via the word-line DAC in a serial manner as shown in Figure 2 (b). The multiplicands are stored in the register file, and the individual registers are specified using the <reg_mask> field.

Robust current summation over ReRAM bit-lines has been demonstrated in prior works [20, 43]. We adapt the dot-product architecture from ISAAC [39] for our add and dot instructions. We refer the reader to these works for further implementation details.

3) mul  The mul instruction is a 2-ary operation that performs element-wise multiplication over elements stored in the two <src> memory rows and stores the result in <dst>. To implement this instruction we utilize the row of DACs at the top of the array feeding the bit-lines (Figure 1 (c)). The multiplicand is streamed in through the DACs serially, 2 bits at a time, and the product is accumulated over the bit-lines as shown in Figure 2 (c). The word-line DACs are set to Vdd ('11').

Note that element-wise multiplication was not supported in prior works on memristor-based accelerators, and it is a new feature we designed for supporting general-purpose data-parallel computation. Since the dot product uses the same multiplicand for all elements stored in a row, it cannot be utilized for element-by-element multiplication. We solve this problem by using an additional set of DACs for feeding the bit-lines. As in ISAAC, the operation is pipelined into 3 stages: XB, ADC, and S+A, processing 2 bits per cycle, resulting in 18 cycles in total for 32-bit data.

4) sub  The sub instruction performs element-wise subtraction over elements stored in the two sets of memory rows (minuends and subtrahends) specified by the <mask>s and stores the result in <dst>. Subtraction in ReRAM arrays has not been explored before. We support this operation by draining the current via the word-line as shown in Figure 2 (d). The output voltage of the word-line DAC of the subtrahend row is set to ground, allowing for current drain. Hence the remaining current over the bit-line represents the difference between the minuend and the subtrahend. For this operation we reverse the voltage across the memristor bit-cell. Fortunately, several reports on fabricated ReRAM demonstrate the symmetric V/I properties of memristors with reverse voltage across the terminals [36, 44].

5) lut  The lut instruction sends the value stored in <src> as an address to the lookup table (LUT), and writes the data read from the LUT back to <dst>. The multi-purpose LUT is implemented to support high-level instructions. The LUT is utilized for nonlinear functions such as sigmoid, and for the initial seeding of division and transcendental functions (Section 5.1). The LUT has 512 entries of 8-bit numbers to satisfy the precision requirements of the arithmetic algorithms implemented [16]. The LUT is a small SRAM structure which operates at a much higher frequency than the ReRAM arrays and is hence shared by multiple arrays. Its contents are initialized by the host at runtime. lut takes 4 cycles, adding 1 cycle on top of the basic XB, ADC, S+A pipeline.

6) mov, movi, movg, movs  The mov family of instructions facilitates movement of data between memory rows of an array, registers, and even across arrays via global addressing (<gaddr>). The global addresses are handled by the network, hence the latency of global moves (movg) is variable. Immediate values can be stored to <dst> as well via the movi instruction. These instructions are implemented using traditional memristor read/write operations. The selective mov (movs) instruction selectively moves data to elements in <dst> based on an 8-bit mask. Recall that any <dst> row can store 8 32-bit elements in the prototype architecture.

7) reduce_sum  The reduce_sum instruction sums up the values in the <src> row of different arrays. The reduction is executed outside the arrays. This instruction utilizes the H-tree network and the adders in the routers to reduce values across the tiles.

8) shift / mask  The shift instruction shifts each of the vector elements in <src> by <imm> bits. The mask instruction logically ANDs each of the vector elements in <src> with <imm>. These instructions utilize the digital shift and adder (S+A) outside the arrays.
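To make the add mechanism concrete, the following is a minimal Python sketch of its functional behavior (an illustrative model of our own, not the hardware implementation; it assumes the prototype's 2-bit cells and 32-bit words): each selected row contributes its 2-bit slices as current on the corresponding bit-lines, the ADCs digitize the per-column sums, and the external shift-and-add unit recombines them into the 32-bit result.

    def inmem_add(rows, mask):
        """Functional model of the n-ary add: rows are 32-bit words (one SIMD
        lane shown); mask selects the rows that participate."""
        column_sums = [0] * 16                       # one partial sum per 2-bit bit-line group
        for word, selected in zip(rows, mask):
            if not selected:
                continue                             # word-line held at Vdd/2: no contribution
            for col in range(16):
                column_sums[col] += (word >> (2 * col)) & 0b11   # 2-bit cell sensed as current
        result = 0
        for col, partial in enumerate(column_sums):  # shift-and-add (S+A) outside the array
            result += partial << (2 * col)
        return result & 0xFFFFFFFF                   # wrap to 32 bits

    # Adding three selected rows with a single instruction:
    assert inmem_add([5, 7, 9, 100], [1, 1, 1, 0]) == 21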
Discussion  Our goal is two-fold. First, keep the instruction set as simple as possible to reduce design complexity and retain area efficiency (hence memory density). Second, expose all compute primitives which can be done in-situ inside the memory array without reading the data out. The proposed ISA does not include any looping, branch, or jump instructions. We rely on the compiler to unroll loops wherever necessary. Our SIMD programming model ensures small code size, in spite of unrolling. Control flow is facilitated via condition computation and selective moves (Section 3). The compute instructions in the ISA are restricted to add, sub, dot, and mul. Our programming model, based on TensorFlow, supports a rich set of compute operations. Our compiler transforms them to a combination of ISA instructions (Section 5.1) and hence enables general-purpose computation.

2.3 Precision and Signed Arithmetic

Floating-point operations need normalization based on the exponent, hence in-memory computation on floating-point operands incurs significant complexity. We adopt a fixed-point representation. We give the programmer the flexibility to decide the position of the decimal point to trade off between precision and range, but the responsibility to prevent bit overflow and underflow is left to the programmer. We developed a testing tool that can calculate the dynamic range of the input that assures the required precision. Note that under the condition that overflow/underflow does not happen, a fixed-point representation gives better accuracy compared to floating point. Section 6 discusses the impact on application output.

For general-purpose computation, it is important to support negative values. Prior work [39] uses a biased representation for numbers, and then normalizes the bias via a subtraction outside the memory arrays. This approach is perhaps reasonable for CNN dot products, because the overhead of subtraction outside the array for normalizing the bias is compensated by multi-row addition within the array. In general, data-parallel programs' additions need not span multiple rows (often 2 rows are sufficient). In such a scenario, subtraction outside the array needs an additional array read, which offsets the benefit of biased addition inside the array.

We observe that for b-bit bit-cells (i.e., 2^b resistance levels), current summation followed by the shift-and-add across bit-lines outputs the correct results as long as negative numbers are stored in 2^b's-complement notation. In our prototype design, arrays have 2-bit bit-cells, hence addition over negative numbers stored as 4's complement will yield correct results. Furthermore, it can be mathematically proved that 4's complement is exactly equal to 2's complement in base-4 representation. Thus there is no need for conversion between number formats. The same principle holds true for multiplication as long as the DAC used for streaming in the multiplicand has the same resolution as the resistance level of the ReRAM bit-cells. In our design, 2-bit DACs are required.
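As a quick sanity check of this equivalence (our own illustrative snippet, not a tool from the paper), the following Python loop verifies that, for 8-bit words viewed as four base-4 digits (2-bit cells), the 4's-complement encoding of every value is bit-identical to its 2's-complement encoding:

    for value in range(-128, 128):
        twos = value & 0xFF                # 2's-complement encoding in 8 bits
        fours = (value + 4**4) % 4**4      # 4's-complement encoding in four base-4 digits
        assert twos == fours               # identical bit patterns, no conversion needed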
3 Programming Model

We choose Google's TensorFlow [1] as the programming front-end for the proposed in-memory processor. Using TensorFlow, programmers write the kernels which will be offloaded to the memory. TensorFlow expresses the kernel as a Data Flow Graph (DFG). Since TensorFlow is available for a variety of programming languages (e.g., Python, C++, Java, Go), programmers can easily plug the TensorFlow kernels into their code. Also, since TensorFlow supports a variety of target hardware systems (e.g., CPU, GPU, distributed systems), programmers can easily validate the functionality of the kernel and scale the system depending on the input size.

TensorFlow (TF) offers a suitable programming paradigm for data-parallel in-memory computing. First, nodes in TF's DFGs can operate on multi-dimensional matrices. This feature embeds the SIMD programming model and facilitates easy exposure of Data Level Parallelism (DLP) to the compiler. Second, irregular memory accesses are restricted by not allowing subscript notation. This feature benefits both programmers and compilers. Programmers do not have to convert high-level data processing operations (e.g., vector addition) into low-level procedural representations (e.g., a for-loop with memory accesses). The compiler can fully understand the memory access pattern. Third, the DFG naturally exposes Instruction Level Parallelism (ILP). This can be directly used by a compiler for Very Long Instruction Word (VLIW) style scheduling to further utilize the underlying parallelism in the hardware without implementing complex out-of-order execution support. Finally, TensorFlow supports a persistent memory context in nodes of the DFG. This is useful in our merged memory and compute architecture for storing persistent data across kernel invocations.

Our programming model and compilation framework support the following TensorFlow primitives (see Table 2 for the list of supported TF nodes):

Input nodes  The proposed system supports three kinds of input: Placeholder, Const, and Variable. Placeholder is a non-persistent input and will not be used for future module invocations. Const is used to pass constants whose values are known at compile time. Scalar constants are included in the ISA, and vector constants are stored in either the register file or an array based on the type of their consumer node in the DFG. Variable is the input with a persistent memory context, whose data can be used and updated in future kernel invocations. Variables are initialized at kernel launch time.

Operations  The framework supports a variety of complex operation nodes including transcendental functions. We discuss the process of lowering these operation nodes into the native memory ISA in Section 5.1.
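As a concrete illustration of these primitives, the sketch below is a minimal TensorFlow (1.x graph API) kernel of our own construction, not code from the paper: a Placeholder carries per-invocation input, a Const carries a compile-time scalar, and a Variable holds a persistent accumulator that survives across kernel invocations.

    import tensorflow as tf  # TensorFlow 1.x graph API

    N = 1000000
    x = tf.placeholder(tf.float32, shape=[N], name="x")   # Placeholder: non-persistent input
    a = tf.constant(2.5, name="a")                         # Const: known at compile time
    acc = tf.Variable(tf.zeros([N]), name="acc")           # Variable: persistent memory context

    y = a * x + acc              # element-wise DFG nodes; no subscripts or explicit loops
    update = tf.assign(acc, y)   # persistent state is updated for the next invocation

In this style, each element-wise node of the DFG would become a module operating on one element, instantiated N times across the memory arrays.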
Figure 4. Compilation Flow.
Figure 5. Node Merging.
Figure 6. IB Expansion.
The compiler tool-chain is developed using Python 3.6 and C++. The compiler front-end uses TensorFlow's core framework to parse the TensorFlow graph. The TensorFlow nodes supported at this time are listed in Table 2.

5.1 Supporting Complex Operations

The target memory ISA is quite simple and supports a limited number of compute instructions as described in Section 2.2. Natively, the arrays can execute dot product, addition, multiplication, and subtraction. However, general-purpose computation requires supporting a diverse set of operations including division, exponents, transcendental functions, etc. We support these complex operations by converting them into a set of lut, addition, and multiplication instructions based on algorithms used in Intel IA-64 processors [14, 19].

The compiler uses either the Newton-Raphson or the Maclaurin-Goldschmidt method, which iteratively applies a set of instructions to an initial seed from the look-up table and refines the result. Our implementation chooses the best algorithm based on the precision requirement. We could have used simpler algorithms (e.g., SRT division), but we employ iterative algorithms because (1) bit shifts cannot be supported in the array, so for each bit-shift operation the values would need to be read out and written back, (2) supporting bit-wise logical operations (and, or) is challenging because of the multi-level resistive bit-cells, and (3) simple algorithms often require more space, which is challenging for the data carefully aligned in the array.
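As an illustration of this lowering (a minimal sketch under our own naming, not the compiler's actual code generation), division a/b can be expressed with only a LUT seed plus multiplies and subtractions using the Newton-Raphson recurrence x_{k+1} = x_k * (2 - b * x_k), which converges quadratically to 1/b:

    def lut_seed(b, entry_bits=8):
        """Emulate a coarse LUT lookup: an 8-bit approximation of 1/b
        (assumes b lies in a normalized range so the seed is nonzero)."""
        scale = 1 << entry_bits
        return round(scale / b) / scale

    def recip(b, iterations=3):
        x = lut_seed(b)                  # lut instruction: initial seed
        for _ in range(iterations):
            x = x * (2.0 - b * x)        # lowers to mul/sub/mul per iteration
        return x

    def divide(a, b):
        return a * recip(b)              # a / b  ==  a * (1/b)

    assert abs(divide(1.0, 3.0) - 1.0 / 3.0) < 1e-9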
Finally, the compiler also lowers convolution nodes in the DFG to the native memory ISA. Prior works [39] have mapped convolution filter weights to the array and performed the dot-product computation by streaming in the input features. Because filters used in general-purpose programs are typically small (e.g., 3x3 for HotSpot and the Sobel filter), we map the input data to the array and stream in the filter. This approach reduces buffering for the input data and improves array utilization. Furthermore, the compiler decomposes the convolution into a series of matrix-vector dot-products done simultaneously on different input matrix slices, thereby reducing the convolution time significantly.

5.2 Compiler Optimizations

Node Merging  A node merging pass is introduced to fill the gap between the capabilities of the target in-memory architecture and the expressibility of the programming language. The proposed in-memory ISA can support compute operations over n operands. The node merging pass promotes a series of 2-operand compute nodes in the DFG of a module to a single compute node with many operands, as shown in Figure 5. The maximum number of operands n is limited by the number of array rows and the resolution of the ADCs. ADCs consume a significant fraction of chip power, and their power consumption is proportional to their resolution. Our compiler can generate code for an arbitrary resolution n, and the chip architects can choose a suitable n based on the power budget.

The node merging pass also combines certain combinations of nodes to reduce intermediate writes to the memory arrays. For example, a node which feeds its results to a multiplication node need not write back the results to memory. This is because the multiplicand is directly streamed into the array from registers.
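The core of the merging rule can be sketched in a few lines of Python (an illustrative simplification of the pass, using our own names): the leaves of a binary add chain are regrouped into n-ary add nodes whose fan-in is capped by the ADC-limited operand count.

    def merge_add_chain(operands, max_operands):
        """Group the leaves of a binary add chain into n-ary add nodes."""
        return [operands[i:i + max_operands]
                for i in range(0, len(operands), max_operands)]

    # ((a + b) + c) + d with an ADC budget of 4 operands becomes one 4-ary add;
    # with a budget of 2 it is split into two 2-ary adds whose results are then combined.
    assert merge_add_chain(["a", "b", "c", "d"], 4) == [["a", "b", "c", "d"]]
    assert merge_add_chain(["a", "b", "c", "d"], 2) == [["a", "b"], ["c", "d"]]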
Instruction Block Scheduler  Independent Instruction Blocks (IBs) inside a module can be co-scheduled to maximize ILP, as shown in the third row of Figure 3. Our compiler adapts the Bottom-Up-Greedy (BUG) algorithm [15] for scheduling IBs. BUG was first used in the Bulldog VLIW compiler [15] and has been adapted in various schedulers for VLIW/data-flow architectures, e.g., the Multiflow compiler [29] and the compiler for the tiled data-flow architecture WaveScalar [30]. Our implementation of the BUG algorithm first traverses the DFG through a bottom-up path, collecting candidate assignments of the instructions. Once the traversal path reaches the input (define) node, it traverses a top-down path to make a final assignment, minimizing the data transfer latency by taking both the operand location and the successor location into consideration. We modify the original BUG algorithm to introduce the notion of in-memory computing, where a functional unit is identical to the data location. We also modified the algorithm to take into account read/write latency, network resource collision latency, and operation latency.

Instruction Block (IB) Expansion  Instruction Blocks that use multi-dimensional vectors as operands can be expanded into several instruction blocks with lower-dimension vectors to further exploit ILP and DLP. For example, consider a program that processes 2D matrices of dimension sizes [2, 1024]. The compiler will first convert the program to a module which will be instantiated 1,024 times and executed in parallel. Each module will have an IB that processes 2D vectors. The expansion pass will further decompose the module into 2 IBs that process 1D scalar values.

The expansion pass traverses the nodes in a module's DFG in a bottom-up, breadth-first order and detects the subtrees that process multi-dimensional vectors of the same size. The subtree regions detected are expanded. To ensure the dimensions are consistent between the subtree regions, pack and unpack pseudo-operations are inserted between these regions. Pack and unpack operations are later converted to mov instructions. A simplified example is shown in Figure 6.

Pipelining  A significant fraction of the compute instructions goes through two phases: compute and write-back. Unfortunately, these two phases are serialized, since an array cannot compute and write simultaneously. Our compiler breaks this bottleneck by pipelining these phases and ensuring the destination addresses for the write-backs are in a separate array. By using two arrays, one array computes while writing back the previous result to the other array. In the worst case, this optimization lowers the utilization of the arrays by half. Thus, this optimization is beneficial when the number of modules needed for the input data is lower than the aggregate SIMD capacity of the memory chip.

Balancing Inter-Module and Intra-Module Parallelism  Some of the optimizations discussed above attempt to improve performance by exposing parallelism inside a module. Because of Amdahl's law, increasing the number of IBs in a module will not result in linear speedup. Depending on the data characteristics, the SIMD slots assigned to a module may not be fully utilized in every cycle. In fact, expanding a module could slow down the total execution time when the number of IBs across all module instances exceeds the aggregate SIMD slots in the memory chip. In such a scenario, multiple iterations may be needed to process all module instances, resulting in a performance loss.

Our compiler can generate code for arbitrary upper bounds on the number of IBs per module, and can flexibly tune the intra-module parallelism with respect to the inter-module parallelism. We develop a simple analytical model to compute the approximate execution time given the number of IBs per module and the number of module instances. The number of module instances is dependent on the input data size, and is only known at runtime. Thus, the optimal code is chosen at runtime based on the analytical model and streamed in to the memory chip from the host.
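The text does not give the model's exact form, so the following is a hedged Python sketch of the kind of cost estimate described (function names and the cost structure are our assumptions): a module with more IBs finishes faster, but if all module instances together need more IBs than the chip has SIMD slots, extra serialized passes are required.

    import math

    def estimate_time(num_modules, ibs_per_module, total_simd_slots, ib_latency):
        """Approximate execution time (arbitrary units) for one kernel invocation."""
        slots_needed = num_modules * ibs_per_module
        passes = math.ceil(slots_needed / total_simd_slots)  # serialized iterations if oversubscribed
        return passes * ib_latency(ibs_per_module)

    def choose_ibs_per_module(num_modules, total_simd_slots, ib_latency, max_ibs):
        """Pick the intra-module parallelism that minimizes the estimate at runtime."""
        return min(range(1, max_ibs + 1),
                   key=lambda k: estimate_time(num_modules, k, total_simd_slots, ib_latency))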
that process multi-dimensional vectors of the same size. The
subtree regions detected are expanded. To ensure the dimen- 6 Methodology
sions are consistent between the sub-tree regions, pack and Benchmarks We use a subset of benchmarks from PAR-
unpack pseudo operations are inserted between these re- SEC multi-threaded CPU benchmark suite [8] and Rodinia
gions. Pack and unpack operations are later converted to GPU benchmark suite [11] as listed in Table 3. We re-write
mov instructions. A simplified example is shown in Figure 6. the kernels of the benchmarks in TensorFlow code and then
Pipelining A significant fraction of the compute instruc- generate in-memory ISA code using our compiler. We choose
tions goes through two phases: compute and write-back. to port the applications which could be easily transformed
Unfortunately, these two phases are serialized, since an ar- to Structure of Array (SoA) code for the ease of porting to
ray cannot compute and write simultaneously. Our compiler TensorFlow and a data-parallel architecture. We leave the
breaks this bottleneck by pipelining these phases and en- remaining benchmarks to future work. For the benchmarks
suring the destination address for the write-backs are in a which use floating point numbers in the kernel, we assess the
separate array. By using two arrays, one array computes effect of converting it into fixed point numbers. By tuning
while writing back the previous result to the other array. In the decimal point placement, we ensure that the input data
the worst case, this optimization lowers the utilization of is in the dynamic range of fixed point numbers. We ensure
arrays by half. Thus, this optimization is beneficial when the that the quality of result requirement defined by the bench-
number of modules needed for the input data is lower than mark is met. We use the native dataset for each benchmark
the aggregate SIMD capacity of the memory chip. and compare it with the native execution on the CPU and
GPU baseline systems. The size of the input for each kernel
Balancing Inter-Module and Intra-Module Parallelism
invocation ranges from 8MB to 2GB.
Some of the optimizations discussed above attempt to im-
prove performance by exposing parallelism inside a module. Area and Power Model All power/area parameters are
Because of Amdahl’s law, increasing the number of IBs in summarized in Table 4. We use CACTI to model energy and
a module will not result in linear speedup. Depending on area for registers and LUTs. The energy and area model for
ReRAM processing unit, including ReRAM crossbar array,
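For reference, a fixed-point conversion of this kind can be sketched as below (our own illustrative helpers, not the paper's testing tool; the fraction-bit count q is a per-kernel choice):

    def to_fixed(x, q):
        """Encode a float as a 32-bit fixed-point value with q fraction bits."""
        return int(round(x * (1 << q))) & 0xFFFFFFFF

    def from_fixed(v, q):
        if v & 0x80000000:             # interpret the 32-bit pattern as signed
            v -= 1 << 32
        return v / (1 << q)

    assert abs(from_fixed(to_fixed(3.14159, 16), 16) - 3.14159) < 1e-4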
Area and Power Model  All power/area parameters are summarized in Table 4. We use CACTI to model energy and area for the registers and LUTs. The energy and area models for the ReRAM processing unit, including the ReRAM crossbar array, sample-and-hold circuits, and shift-and-add circuits, are adapted from ISAAC [39]. We employ the energy and power model in [2] for the on-chip interconnects and assume an activity factor of 5% for TDP (given that the network operates at 2 GHz and the memory at 20 MHz). The benchmarks show an order of magnitude lower utilization of the network. ADC/DAC energy and power are scaled for 5-bit and 2-bit precision [27]. While state-of-the-art ReRAM devices support 4 to 6 resistance levels [6], strong non-uniform analog resistance due to process variation makes it challenging to program ReRAM for analog convolution, resulting in convolution errors [12]. We conservatively limit the number of cell levels to two and use multiple cells in a row to represent one data element.

Component           Params       Spec        Power      Area (mm2)
ADC                 resolution   5 bits      64 mW      0.0753
                    frequency    1.2 GSps
                    number       64 × 2
DAC                 resolution   2 bits      0.82 mW    0.0026
                    number       64 × 256
S+H                 number       64 × 128    0.16 mW    0.00025
ReRAM Array         number       64          19.2 mW    0.0016
S+A                 number       64          1.4 mW     0.0015
IR                  size         2KB         1.09 mW    0.0016
OR                  size         2KB         1.09 mW    0.0016
Register            size         3KB         1.63 mW    0.0024
XB                  bus width    16B         1.51 mW    0.0105
                    size         10 × 10
LUT                 number       8           6.8 mW     0.0056
Inst. Buf           size         8 × 2KB     5.83 mW    0.0129
Router              flit size    16          0.82 mW    0.00434
                    num_port     9
S+A                 number       1           0.05 mW    0.000004
1 Tile Total                                 101 mW     0.12
Inter-Tile Routers  number       584         0.81 W     2.50
Chip total                                   416 W      494 mm2

Table 4. In-Memory Processor Parameters

Performance Model  To determine IMP performance, we develop a cycle-accurate simulator which uses an integrated network simulator [22]. Note that ReRAM arrays execute instructions in order, instruction latency is deterministic, network communication is rare, and the compiler schedules instructions statically after accounting for network delay. Thus the estimated performance for IMP is highly accurate.

7 Results

7.1 Configurations Studied

In this section we evaluate the proposed In-Memory Processor (IMP) and compare it to state-of-the-art CPU and GPU baselines. We use an Intel Xeon E5-2697 v3 multi-socket server as the CPU baseline and an Nvidia Titan XP as the GPU baseline. The IMP configuration evaluated (shown in Table 4) has 4,096 tiles and 64 128×128 ReRAM arrays in each tile.

Parameter    CPU (2-sockets)       GPU (1-card)   IMP
SIMD slots   448                   3840           2097152
Frequency    3.6 GHz               1.58 GHz       20 MHz
Area         912.24 mm2            471 mm2        494 mm2
TDP          290 W                 250 W          416 W
Memory       7MB L2; 70MB L3;      3MB L2;        1GB RRAM
             64GB DRAM             12GB DRAM

Table 5. Comparison of CPU, GPU, and IMP Parameters

Table 5 compares important system parameters of the three configurations analyzed. IMP has a significantly higher degree of parallelism: it enjoys 546× (4681×) more SIMD slots than the GPU (CPU). The massive parallelism comes at a lower frequency; IMP is 80× (180×) slower than the GPU (CPU) in terms of clock cycle period. IMP is approximately area neutral compared to the GPU, and about 2× lower area than the 2-socket CPU system. The TDP of IMP is significantly higher; however, we will show that IMP has lower average power consumption and energy consumption (Section 7.3).

7.2 Operation Study

Figure 7 presents the operation throughput of CPU, GPU, and IMP, measured by profiling microbenchmarks of add, multiply, divide, sqrt, and exponential operations. We compile the microbenchmarks with the -O3 option and parallelize them using OpenMP for the CPU. We find IMP achieves orders of magnitude improvement over the conventional architectures. The reason is twofold: massive parallelism and reduction in data movement. The proposed IMP architecture has 546× (4681×) more SIMD slots compared to the GPU (CPU), as shown in Table 5. Although IMP has a lower frequency, it more than compensates for this disadvantage by avoiding data movement. The CPU and GPU have to pay a significant penalty for reading the data out of off-chip memory and passing it along the on-chip memory hierarchy to the compute units.

IMP speedup is especially high for the simple operations. The largest operation throughput is achieved by addition (2,460× over CPU and 374× over GPU), which has the smallest latency in IMP. On the other hand, division and transcendental functions take many cycles to produce their results. For example, it takes 62 cycles for division and 115 cycles for exponential, while addition takes only 3 cycles. Therefore, the throughput gain becomes smaller for complex operations. While the CPU and IMP per-operation throughput reduces for higher-latency operations, GPU throughput increases. This is because GPU performance is bounded by the memory access time, and the unary operators (exponential and square root) have less data to transfer from the GPU memory.

Figures 8 and 9 show the operation latency of addition and multiplication for different input sizes.
Figure 7. Operation throughput (log scale).
Figure 8. Addition Latency.
Figure 9. Multiplication Latency.
Figure 10. Operation energy.
Figure 11. Kernel speedup.
Figure 12. CPU Application performance.
We compare the execution time of single-threaded CPU, multi-threaded CPU (OpenMP), and GPU. IMP offers the highest operation performance among the three architectures, even for the smallest input size (4KB).

Figure 10 shows the energy consumption for each operation. Because of the high operation latency and the large energy consumption of the ADCs, we observe higher energy consumption for the complex operations relative to the GPU. Ultimately, the instruction mix of the application will determine the energy efficiency of the IMP architecture.

7.3 Application Study

In this section we study the application performance. First, we analyze kernel performance, shown in Figure 11. For the CPU benchmarks, the figure shows performance for the hot kernels in the PARSEC benchmarks. We assume that the non-kernel code of the PARSEC benchmarks is executed on the CPU. Note that this data transfer overhead is taken into account in the results for IMP. The GPU benchmarks from Rodinia are relatively small, hence we regard them as application kernels. We observe a 41× speedup for the CPU benchmarks and a 763× speedup for the GPU benchmarks.

The GPU benchmarks obtain a higher performance improvement on IMP because of the opportunity to use dot-product operations and higher data-level parallelism. On the other hand, the speedup for kmeans is limited to 23×. kmeans deals with Euclidean distance calculations over 34-dimensional vectors, and this incurs many element-wise multiplications. Although kmeans shows significant DLP available in the distance calculations, we could not fully utilize the DLP of the application because of the capacity limitation of IMP's SIMD slots. This series of multiplications in the distance calculation increases its critical latency and limits the speedup. As suggested by the operation throughput evaluation in Figure 7, IMP achieves higher performance especially when the kernel has significant DLP and many simple operations. We observe that in general mul, add, and local mov instructions are most common, while movg, reduce_sum, and lut are less frequent. For example, the blackscholes kernel has 14% add, 21% mul, and 58% local move instructions; the rest are mask and lut.

The performance results for the overall PARSEC applications are presented in Figure 12. For this result, we assume two scenarios: (1) IMP (memory) assumes IMP is integrated into the memory hierarchy and the memory region for the kernel is allocated in IMP. (2) IMP (accelerator) is a configuration where IMP is used as an accelerator and requires data copies as GPUs do. While we believe IMP (memory) is the correct configuration, IMP (accelerator) is a near-term, easier configuration which can be a first step towards integrating IMP in host servers.

On average, IMP (accelerator) yields a 5.55× speedup and IMP (memory) provides 7.54× for the Region of Interest (ROI). We observe that the 41× kernel speedup does not translate to a similar application speedup due to Amdahl's law. Figure 12 also shows the breakdown of the execution time, which is divided into kernel, data loading, communication on the NoC, and the non-kernel part of the ROI. The non-kernel part is mainly composed of time for barriers and unparallelized parts of the program. It can be seen that 88% of the execution time can be off-loaded to IMP. We also observe that a large fraction of the execution time on ReRAM is used for data loading (4× the kernel time at maximum).
Figure 13. Energy consumption.
Figure 14. Average power.
Figure 15. Compiler optimizations.
Thus, as suggested before, an in-memory accelerator is better coupled with the existing memory hierarchy to avoid the data loading overhead. We also find the NoC time is not the bottleneck, because of the efficient reduction scheme supported by the reduction tree network integrated in the NoC.

Figure 13 shows the total energy consumption of the entire application (thus it includes both kernel and non-kernel energy for PARSEC). We find 7.5× and 440× energy efficiency improvements for the CPU benchmarks and GPU benchmarks, respectively. This energy reduction is partly due to the energy efficiency of IMP for the kernels' instruction mix and partly due to the reduced execution time.

Figure 14 shows the average power consumption of the benchmarks. The TDP of IMP is high when compared to the GPU and CPU (Table 5). ADCs are the largest contributor to peak power. The required resolution for the ADCs is a function of the maximum number of operands supported for n-ary instructions in our ISA. To contain the TDP, we limit the ADC resolution to 5 bits, thereby limiting the number of operands for n-ary instructions (add, dot). While this may significantly affect the performance of a customized dot-product-based machine learning accelerator, it is not a serious limitation for general-purpose computation. Although IMP's TDP is high due to the ADC power consumption, the average power consumption depends on the instructions' requirements for ADC resolution. For example, the ADCs consume less power for instructions with fewer operands. We find that the average resolution for the ADC is 2.07 bits (the maximum resolution is 5 bits). Overall, the average power consumption for IMP is estimated to be 70.1 W. The average power consumption measured for the benchmarks on the baseline is 81.3 W.

7.4 Effect of Compiler Optimizations

We introduce three optimization targets to the compiler and evaluate how each optimization affects the results. The first optimization target is MaxDLP, which creates one IB per module to maximize DLP. This policy is useful when the data size is larger than the SIMD slots IMP offers. However, the module does not have an opportunity to exploit ILP in the program. Also, IB expansion is not applied for this policy.

The second optimization target is MaxILP, which fully utilizes the ILP and lets IB expansion expand all multi-dimensional data in the module. This creates the largest number of IBs per module and the shortest execution time for a single module. However, because of the sequential part of the IB, array utilization becomes lower. This policy can increase the overall execution time when the kernel is invoked multiple times due to insufficient SIMD slots in IMP.

The third optimization target, MaxArrayUtil, maximizes the array utilization considering the number of SIMD slots needed by the input data. For example, if the incoming data consumes 30% of the total SIMD slots in IMP, each module can use 3 IBs to fully utilize all the arrays while avoiding multiple kernel invocations. The compiler optimizes under the constraint of the maximum IBs available per module.

Table 6 shows the maximum IB latency and the number of IBs per module. Figure 15 presents the execution time of the different optimization policies normalized to MaxDLP (the baseline). MaxArrayUtil represents the best possible performance provided by the compiler optimizations under the resource constraints imposed by IMP. Overall it provides an average speedup of 2.3×.

Two other optimizations not captured by the above graph are node merging and pipelining. On average, the module latency is reduced by 13.8% with node merging and 20.8% with pipelining.

7.5 Memory Lifetime

We evaluate the memory lifetime by calculating the write intensity of the benchmarks (last row in Table 6). Based on the assumption in [26], we consider the ReRAM cells to wear out beyond 10^11 writes. The compiler balances the writes to
the arrays by assigning and using ReRAM rows in a round-robin manner. Assuming the arrays are continuously used for kernel computation (but not while the host is processing), the median expected lifetime is 17.9 years.

8 Related Work

To the best of our knowledge this is the first work that demonstrates the feasibility of general-purpose computing in ReRAM-based memory. Below we discuss some of the closely related works.

ReRAM Computing  Since ReRAMs were introduced in [42], several works have leveraged their dot-product computation functionality for neuromorphic computing [25, 34]. Recently, ISAAC [39] and PRIME [13] use ReRAMs to accelerate several Convolutional Neural Networks (CNNs). ISAAC proposes a full-fledged CNN accelerator with a carefully designed pipelining and precision handling scheme. PRIME studies a morphable ReRAM-based main memory architecture for CNN acceleration. PipeLayer [41] further supports training and testing of CNNs by introducing an efficient pipelining scheme. Aside from CNN acceleration, ReRAM arrays have been used for accelerating Boltzmann machines [9] and perceptron networks [45]. While it has been shown that analog computation in ReRAM can substantially accelerate machine learning workloads, none have targeted general-purpose computing exploiting the analog computation functionality of ReRAM. Another interesting work, Pinatubo [28], has modified peripheral sense-amplifier circuitry to accomplish logical operations like AND and OR. While this approach appears promising for building complex arithmetic operations, doing arithmetic on multi-bit ReRAM cells using bitwise operations comes with several challenges. Orthogonal to this work, we extend the set of supported operations at low cost.

ReRAM has also been explored to implement logic using Majority-Inverter Graph (MIG) logic [7, 32, 40]. In this approach each ReRAM bit-cell acts as a majority gate. Since the resistive bit-cell is acting as a logic gate, it cannot store data during computation. Let us refer to this approach as ReRAM bit-cell as logic. A critical difference between this approach and ours is that we leverage in-situ operations, where operations are performed in memory over the bit-lines without reading data out. The ReRAM bit-cell as logic approach is a flavor of near-memory computing where input data is read out of memory and fed to another memory location which acts as a logic unit, thus requiring data movement.

Furthermore, operations using majority gates can be extremely slow, requiring a huge number of memory accesses to implement even simple functions. For example, a multiply is implemented using ≈56,000 majority-gate operations (each majority-gate operation requires one memory cycle) and 419 ReRAM cells [40]. Our approach implements a multiply in 18 memory cycles without requiring any additional ReRAM cells. While we demonstrated that the IMP architecture/programming framework can work with large real-world general-purpose data-parallel applications, the ReRAM-as-logic approach [7] has been demonstrated only for sequential micro-kernels (e.g., hamming, sqrt, square, etc.) with no comparison to CPU or GPU systems.

Near-Memory Computing  Past processing-in-memory (PIM) solutions move compute near the memory [4, 5, 10, 17, 18, 24, 31, 33, 35, 38, 46, 47]. The proposed architecture leverages an emerging style of in-memory computing referred to as bit-line computing [3]. Since bit-line computing re-purposes memory structures to perform computation in-situ, it is intrinsically more efficient than near-memory computing, which augments logic near memory. More importantly, it unlocks massive parallelism at near-zero silicon cost.

Recent works have leveraged bit-line computing in SRAM [3, 21, 23] and DRAM [37, 38]. These works have demonstrated only a handful of compute operations (bit-wise logical, match, and copy), making them limited in applicability for general-purpose computing. Furthermore, this work is the first to develop a programming framework and compiler for in-memory bit-line computing. Our software stack can be utilized for leveraging bit-line computing in other memory technologies.

9 Conclusion

This paper proposed a novel general-purpose ReRAM-based In-Memory Processor architecture (IMP), and its programming framework. IMP substantially improves the performance and energy efficiency of general-purpose data-parallel programs. IMP implements a simple but powerful ISA that can leverage the underlying computational efficiency. We propose the programming model and the compilation framework, in which users use TensorFlow to develop a program and maximize the parallelism using the compiler's toolchain. Our experimental results show IMP can achieve a 7.5× speedup over the PARSEC CPU benchmarks and a 763× speedup over the Rodinia GPU benchmarks.

Acknowledgments

We thank members of the M-Bits research group and the anonymous reviewers for their feedback. This work was supported in part by the NSF under the CAREER-1652294 award and the XPS-1628991 award, and by C-FAR, one of the six SRC STARnet centers sponsored by MARCO and DARPA.

References
[1] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Jon Shlens, Benoit Steiner, Ilya Sutskever, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Oriol Vinyals, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. http://download.tensorflow.org/paper/whitepaper2015.pdf
[2] Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G Dreslinski, David Blaauw, and Trevor Mudge. 2013. Scaling towards kilo-core processors with asymmetric high-radix topologies. In High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on. IEEE, 496-507.
[3] Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. 2017. Compute Caches. In 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA 2017, Austin, TX, USA, February 4-8, 2017. IEEE.
[4] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 105-117. https://doi.org/10.1145/2749469.2750386
[5] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15).
[6] Fabien Alibart, Ligang Gao, Brian D Hoskins, and Dmitri B Strukov. 2012. High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology 23, 7 (2012), 075201.
[7] Debjyoti Bhattacharjee, Rajeswari Devadoss, and Anupam Chattopadhyay. 2017. ReVAMP: ReRAM based VLIW Architecture for in-Memory comPuting. Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 (3 2017), 782-787. https://doi.org/10.23919/DATE.2017.7927095
[8] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (2008), 72-81. https://doi.org/10.1145/1454115.1454128
[9] M N Bojnordi and E Ipek. 2016. Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2016), 1-13. https://doi.org/10.1109/HPCA.2016.7446049
[10] Jay B. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, and Peter M. Kogge. 2004. A Low Cost, Multithreaded Processing-in-memory System. In Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture (WMPI '04). ACM, New York, NY, USA, 16-22. https://doi.org/10.1145/1054943.1054946
[11] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 44-54.
[12] Pai-Yu Chen, Deepak Kadetotad, Zihan Xu, Abinash Mohanty, Binbin Lin, Jieping Ye, Sarma Vrudhula, Jae-sun Seo, Yu Cao, and Shimeng Yu. 2015. Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015. IEEE, 854-859.
[13] Ping Chi, Shuangchen Li, and Cong Xu. 2016. PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In IEEE International Symposium on Computer Architecture. IEEE, 27-39. https://doi.org/10.1109/ISCA.2016.13
[14] Marius Cornea, John Harrison, Cristina Iordache, Bob Norin, and Shane Story. 2000. Divide, Square Root, and Remainder Algorithms for the IA-64 Architecture. Open Source for Numerics, Intel Corporation (2000).
[15] John R. Ellis. 1986. Bulldog: A Compiler for VLSI Architectures. MIT Press, Cambridge, MA, USA.
[16] Milos Ercegovac, Jean-Michel Muller, and Arnaud Tisserand. [n. d.]. Simple Seed Architectures for Reciprocal and Square Root Reciprocal. ([n. d.]). http://arith.cs.ucla.edu/publications/Recip-Asil05.pdf
[17] A. Farmahini-Farahani, Jung Ho Ahn, K. Morrow, and Nam Sung Kim. 2015. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on.
[18] Basilio B. Fraguela, Jose Renau, Paul Feautrier, David Padua, and Josep Torrellas. 2003. Programming the FlexRAM Parallel Intelligent Memory System. SIGPLAN Not. 38, 10 (June 2003), 49-60. https://doi.org/10.1145/966049.781505
[19] John Harrison, Ted Kubaska, Shane Story, et al. 1999. The computation of transcendental functions on the IA-64 architecture. In Intel Technology Journal. Citeseer.
[20] Miao Hu, R. Stanley Williams, John Paul Strachan, Zhiyong Li, Emmanuelle M. Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, and Jianhua Joshua Yang. 2016. Dot-product engine for neuromorphic computing. In Proceedings of the 53rd Annual Design Automation Conference - DAC '16. ACM Press, New York, New York, USA, 1-6. https://doi.org/10.1145/2897937.2898010
[21] Supreet Jeloka, Naveen Akesh, Dennis Sylvester, and David Blaauw. 2015. A Configurable TCAM / BCAM / SRAM using 28nm push-rule 6T bit cell. In IEEE Symposium on VLSI Circuits.
[22] Nan Jiang, James Balfour, Daniel U Becker, Brian Towles, William J Dally, George Michelogiannakis, and John Kim. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 86-96.
[23] M. Kang, E. P. Kim, M. S. Keel, and N. R. Shanbhag. 2015. Energy-efficient and high throughput sparse distributed memory architecture. In 2015 IEEE International Symposium on Circuits and Systems (ISCAS).
[24] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. In Proceedings of ISCA, Vol. 43.
[25] Kuk-Hwan Kim, Siddharth Gaba, Dana Wheeler, Jose M Cruz-Albrecht, Tahir Hussain, Narayan Srinivasa, and Wei Lu. 2011. A Functional Hybrid Memristor Crossbar-Array/CMOS System for Data Storage and Neuromorphic Applications. Nano Letters 12, 1 (2011), 389-395. https://doi.org/10.1021/nl203687n
[26] J. B. Kotra, M. Arjomand, D. Guttman, M. T. Kandemir, and C. R. Das. 2016. Re-NUCA: A Practical NUCA Architecture for ReRAM Based Last-Level Caches. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 576-585. https://doi.org/10.1109/IPDPS.2016.79
[27] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici. 2013. A 3.1mW 8b 1.2GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS. In 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers. IEEE, 468-469. https://doi.org/10.1109/ISSCC.2013.6487818
[28] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie. 2016. Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 1-6.
[29] P. Geoffrey Lowney, Stefan M. Freudenberger, Thomas J. Karzes, W. D. Lichtenstein, Robert P. Nix, John S. O'Donnell, and John C. Ruttenberg. 1993. The multiflow trace scheduling compiler. The Journal of Supercomputing 7, 1 (01 May 1993), 51-142. https://doi.org/10.1007/BF01205182
[30] Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, and Susan J Eggers. 2006. Instruction scheduling for a tiled dataflow architecture. In ACM SIGOPS Operating Systems Review, Vol. 40. ACM, 141-150.
[31] Mark Oskin, Frederic T Chong, and Timothy Sherwood. 1998. Active Pages: A Computation Model for Intelligent Memory. ACM SIGARCH Computer Architecture News 26, 3 (1998), 192-203. https://doi.org/10.1145/279358.279387
[32] P. E. Gaillardon, L. Amaru, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, and G. De Micheli. 2016. The Programmable Logic-in-Memory (PLiM) computer. 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2016), 427-432.
[33] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A case for intelligent RAM. Micro, IEEE (1997).
[34] Mirko Prezioso, Farnood Merrikh-Bayat, BD Hoskins, GC Adam, Konstantin K Likharev, and Dmitri B Strukov. 2015. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 7550 (2015), 61-64.
[35] S.H. Pugsley, J. Jestes, Huihui Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on.
[36] Jury Sandrini, Marios Barlas, Maxime Thammasack, Tugba Demirci, Michele De Marchi, Davide Sacchetto, Pierre-Emmanuel Gaillardon, Giovanni De Micheli, and Yusuf Leblebici. 2016. Co-Design of ReRAM Passive Crossbar Arrays Integrated in 180 nm CMOS Technology. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 3 (9 2016), 339-351. https://doi.org/10.1109/JETCAS.2016.2547746
[37] V. Seshadri, K. Hsieh, A. Boroum, Donghyuk Lee, M.A. Kozuch, O. Mutlu, P.B. Gibbons, and T.C. Mowry. 2015. Fast Bulk Bitwise AND and OR in DRAM. Computer Architecture Letters (2015).
[38] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. [n. d.]. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).
[39] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, and Rajeev Balasubramonian. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA) (6 2016), 14-26. https://doi.org/10.1109/ISCA.2016.12
[40] Mathias Soeken, Saeideh Shirinzadeh, Luca Gaetano Amarù, Rolf Drechsler, and Giovanni De Micheli. 2016. An MIG-based Compiler for Programmable Logic-in-Memory Architectures. Proceedings of the 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC) 1 (2016). https://doi.org/10.1145/2897937.2897985
[41] L. Song, X. Qian, H. Li, and Y. Chen. 2017. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 541-552. https://doi.org/10.1109/HPCA.2017.55
[42] Dmitri B Strukov, Gregory S Snider, Duncan R Stewart, and R Stanley Williams. 2008. The missing memristor found. Nature 453, 7191 (2008), 80.
[43] Pascal O Vontobel, Warren Robinett, Philip J Kuekes, Duncan R Stewart, Joseph Straznicky, and R Stanley Williams. 2009. Writing to and reading from a nano-scale crossbar memory based on memristors. Nanotechnology 20, 42 (2009), 425204.
[44] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima, T. Mikawa, T. Ninomiya, R. Miyanaga, Y. Kawashima, K. Tsuji, A. Himeno, T. Okada, R. Azuma, K. Shimakawa, H. Sugaya, T. Takagi, R. Yasuhara, K. Horiba, H. Kumigashira, and M. Oshima. 2008. Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism. In 2008 IEEE International Electron Devices Meeting. IEEE, 1-4. https://doi.org/10.1109/IEDM.2008.4796676
[45] Chris Yakopcic and Tarek M Taha. 2013. Energy efficient perceptron pattern recognition using segmented memristor crossbar arrays. In Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 1-8.
[46] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. 2014. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing (HPDC '14).
[47] Qiuling Zhu, B. Akin, H.E. Sumbul, F. Sadi, J.C. Hoe, L. Pileggi, and F. Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In 3D Systems Integration Conference (3DIC), 2013 IEEE International.