0% found this document useful (0 votes)
19 views14 pages

In-Memory Data Parallel Processor: Daichi Fujiki Scott Mahlke Reetuparna Das

The document presents a novel in-memory data parallel processor architecture that leverages Non-Volatile Memories (NVMs) for improved performance in general-purpose computing. It proposes a programmable framework and a compiler that transforms TensorFlow inputs into optimized code for the processor, achieving significant speedups over traditional CPU and GPU architectures. The architecture is designed to minimize data movement and exploit massive parallelism, demonstrating a potential for substantial efficiency gains in data-parallel applications.

Uploaded by

kechalikamalakar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views14 pages

In-Memory Data Parallel Processor: Daichi Fujiki Scott Mahlke Reetuparna Das

The document presents a novel in-memory data parallel processor architecture that leverages Non-Volatile Memories (NVMs) for improved performance in general-purpose computing. It proposes a programmable framework and a compiler that transforms TensorFlow inputs into optimized code for the processor, achieving significant speedups over traditional CPU and GPU architectures. The architecture is designed to minimize data movement and exploit massive parallelism, demonstrating a potential for substantial efficiency gains in data-parallel applications.

Uploaded by

kechalikamalakar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

In-Memory Data Parallel Processor


Daichi Fujiki Scott Mahlke Reetuparna Das
University of Michigan University of Michigan University of Michigan
[email protected] [email protected] [email protected]

Abstract analog computation capabilities. For example, resistive mem-


Recent developments in Non-Volatile Memories (NVMs) have ories (ReRAMs) store the data in the form of resistance of tita-
opened up a new horizon for in-memory computing. Despite nium oxides, and by injecting voltage into the word line and
the significant performance gain offered by computational sensing the resultant current on the bit-line, the dot-product
NVMs, previous works have relied on manual mapping of of the input voltages and cell conductances is obtained using
specialized kernels to the memory arrays, making it infea- Ohm’s and Kirchhoff’s laws.
sible to execute more general workloads. We combat this Recent works have explored the design space of ReRAM-
problem by proposing a programmable in-memory proces- based accelerators for machine learning algorithms by lever-
sor architecture and data-parallel programming framework. aging this dot-product functionality [13, 39]. These ReRAM-
The efficiency of the proposed in-memory processor comes based accelerators exploit the massive parallelism and re-
from two sources: massive parallelism and reduction in data laxed precision requirements, to provide orders of magnitude
movement. A compact instruction set provides generalized improvement when compared to current CPU/GPU archi-
computation capabilities for the memory array. The pro- tectures and custom ASICs, in-spite of their high read/write
posed programming framework seeks to leverage the under- latency. In this paper, we seek to answer the question, to what
lying parallelism in the hardware by merging the concepts extent is resistive memory useful for more general-purpose
of data-flow and vector processing. To facilitate in-memory computation?
programming, we develop a compilation framework that Despite the significant performance gain offered by com-
takes a TensorFlow input and generates code for our in- putational NVMs, previous works have relied on manual
memory processor. Our results demonstrate 7.5× speedup mapping of convolution kernels to the memory arrays, mak-
over a multi-core CPU server for a set of applications from ing it difficult to configure it for diverse applications. We com-
Parsec and 763× speedup over a server-class GPU for a set bat this problem by proposing a programmable in-memory
of Rodinia benchmarks. processor architecture and programming framework. A gen-
eral purpose in-memory processor has the potential to im-
ACM Reference Format: prove performance of data-parallel application kernels by an
Daichi Fujiki, Scott Mahlke, and Reetuparna Das. 2018. In-Memory order of magnitude or more.
Data Parallel Processor . In Proceedings of 2018 Architectural Support The efficiency of an in-memory processor comes from
for Programming Languages and Operating Systems (ASPLOS’18). two sources. The first is massive data parallelism. NVMs
ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3173162. are composed of several thousands of arrays. Each of these
3173171
arrays are transformed into a single instruction multiple data
(SIMD) processing unit that can compute concurrently. The
1 Introduction second source is a reduction in data movement, by avoiding
shuffling of data between memory and processor cores. Our
Non-Volatile Memories (NVMs) create oppportunities for goal is to design an architecture, establish the programming
advanced in-memory computing. By re-purposing memory semantics and execution models, and develop a compiler, to
structures, certain NVMs have been shown to have in-situ expose the above benefits of ReRAM computing to general
purpose data parallel programs.
Permission to make digital or hard copies of all or part of this work for The in-memory processor architecture consists of memory
personal or classroom use is granted without fee provided that copies arrays and several digital components grouped in tiles, and
are not made or distributed for profit or commercial advantage and that a custom interconnect to facilitate communication between
copies bear this notice and the full citation on the first page. Copyrights
the arrays and instruction supply. Each array acts as a unit
for components of this work owned by others than the author(s) must
be honored. Abstracting with credit is permitted. To copy otherwise, or of storage as well as a vector processing unit. The proposed
republish, to post on servers or to redistribute to lists, requires prior specific architecture extends the ReRAM array to support in-situ
permission and/or a fee. Request permissions from [email protected]. operations beyond dot product (i.e., addition, element-wise
ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA multiplication, and subtraction). We adopt a SIMD execution
© 2018 Copyright held by the owner/author(s). Publication rights licensed model, where every cycle an instruction is multi-casted to
to the Association for Computing Machinery.
a set of arrays in a tile and executed in lock-step. The In-
ACM ISBN ISBN 978-1-4503-4911-6/18/03. . . $15.00
https://doi.org/10.1145/3173162.3173171 struction Set Architecture (ISA) for in-memory computation

1
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

consists of 13 instructions. The key challenge is develop- We extend the ReRAM memory array to support in-
ing a simple yet powerful ISA and programming framework situ operations beyond the dot product and design a
that can allow diverse data-parallel programs to leverage the simple ISA with limited compute capability.
underlying massive computational efficiency. • We develop a compiler that transforms DFGs in Google’s
The proposed programming model seeks to utilize the un- TensorFlow to a set of data-parallel modules and gener-
derling parallelism in the hardware by merging the concepts ates module code in the native memory ISA. The com-
of data-flow and vector processing (or SIMD). Data-flow ex- piler implements several optimizations to exploit un-
plicitly exposes the Instruction Level Parallelism (ILP) in the derlying hardware parallelism and unique features/con-
program, while vector processing exposes the Data Level straints of ReRAM-based computation.
Parallelism (DLP). Google’s TensorFlow [1] is a popular pro- • Although the in-memory compute ISA is simple and
gramming model for machine learning. We observe that limited in functionality, we demonstrate that with a
TensorFlow’s programming semantics is a perfect marriage good programming model and compiler, it is possible
of data-flow and vector-processing that can be applied to to off-load a large fraction of general-purpose compu-
more general applications. Thus, our proposed programming tation to memory. For instance, we are able to execute
framework uses TensorFlow as the input. in memory an average of 87% of the PARSEC applica-
We develop a TensorFlow compiler that generates binary tions studied.
code for our in-memory data-parallel processor. The Tensor- • Our experimental results show that the proposed ar-
Flow (TF) programs are essentially Data-Flow Graphs (DFG) chitecture can provide overall speedup of 7.5× over
where each operator node can have multi-dimensional vec- a state-of-art multicore CPU for the PARSEC applica-
tors, or tensors, as operands. A DFG that operates on one tions evaluated. It also provides a speedup of 763× over
element of a vector is referred to as a module by the compiler. state-of-art GPU for the Rodinia kernel benchmarks
The compiler transforms the input DFG into a collection evaluated. The proposed architecture operates with a
of data-parallel modules with identical machine code. Our thermal design power (TDP) of 415 W, improves the
execution model is coarse-grain SIMD. At runtime, a code energy efficiency of benchmarks by 230× and reduces
module is instantiated many times and processes indepen- the average power by 1.26×.
dent data elements. The programming model and compiler
support restricted communication between modules: reduce, 2 Processor Architecture
scatter and gather. Our compiler explores several interesting
We propose an in-memory data-parallel processor on ReRAM
optimizations such as unrolling of high-dimensional tensors,
substrate. This section discusses the proposed microarchi-
merging of DFG nodes to utilize n-ary ReRAM operations,
tecture, ISA, and implementation of the ISA.
pipelining compute and write-backs, maximizing ILP within
a module using VLIW style scheduling, and minimizing com-
2.1 Micro-architecture
munication between arrays.
For general purpose computation, we need to support The proposed in-memory processor adopts a tiled architec-
a variety of compute operations (e.g., division, exponent, ture as shown in Figure 1. A tile is composed of clusters of
square root). These operations can be directly expressed as memory nodes, few instruction buffers and a router. Each
nodes in TensorFlow’s DFG. Unfortunately, ReRAM arrays cluster consists of a few memory arrays, a small register
cannot support them natively due to their limited analog file, and look-up table (LUT). Each memory array is shown
computation capability. Our compiler performs an instruc- in Figure 1 (b). Internally, a memory array in the proposed
tion lowering step in the code-generation phase to trans- architecture consists of multiple rows of resistive bit-cells,
late higher-level TensorFlow operations to the in-memory a set of digital-analog converters (DACs) feeding both the
compute ISA. We discuss how the compiler can efficiently word-lines and bit-lines, sample and hold circuit (S+H), shift
support complex operations (e.g., division) using techniques and adder (S+A) and analog-digital converters (ADCs). The
such as the Newton-Raphson method which iteratively ap- process of reading and writing to ReRAM memory arrays
plies a set of simple instructions (add/multiply) to an initial remains unchanged. We refer the reader to ReRAM litera-
seed from the look-up table and refines the result. The com- ture for details [39, 42]. The memory arrays are capable of
piler also transforms other non-arithmetic primitives (e.g., both data storage and computation. We explain the com-
square and convolution) to the native memory SIMD ISA. pute capabilities of the memory arrays and the role of digital
In summary, this paper offers the following contributions: components (e.g. register file, S+A, LUT) in Section 2.2.
The tiles are connected by an H-Tree router network. The
• We design a processor architecture that re-purposes H-Tree network is chosen to suit communication patterns
resistive memory to support data-parallel in-memory typical in our programming model (Section 3) and it also
computation. In the proposed architecture, memory provides high-bandwidth communication for external I/O.
arrays store data and act as vector processing units. The clusters inside a tile are connected by a router or a

2
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

/d>/E^
Cluster

DAC

DAC
DAC

DAC
... ...

... ... ...


ReRAM
PU ... ReRAM
PU
Reg.
File DAC
External IO

...

tKZ>/E^
...
DAC
H-tree
...
XB ... ReRAM
PU ... ReRAM
PU
LUT

..
Inst. Buf DAC
... ...
S+A DAC ADC
ADC
... Router DAC

DAC
RRAM XB S+A

S+H
Reg SĂŵƉůĞ and Hold
ADC
Tiled architecture Tile Memory Array /
(a) Processing Unit (b)

Figure 1. In-Memory Processor Architecture. (a) Hierarchical Tiled Structure (b) ReRAM array Structure
 

   
   

               


   
   
      

          


             –     –          

                


Figure 2. In-situ ReRAM array operations.

crossbar topology. A shared bus facilitates communication shuffle) instructions because these are hard to do in-situ in-
inside a cluster. A hierarchical topology inside the tile limits memory. Instead, compiler transforms complex instructions
the network power consumption, while providing sufficient to a set of lut, add and mul instructions as discussed later.
bandwidth for infrequent communication typical in data- The ISA consists of 13 instructions as shown in Table 1. Each
parallel applications. ReRAM arrays executes the instruction locally, hence the
Each memory array can be thought of as a vector process- operand addressing modes reference rows inside the array
ing unit with few SIMD lanes. The processor adopts a SIMD or local registers. The instructions can have a size of up to 34
execution model. Each array is mapped to a specific instruc- bytes. Now we discuss the functionality and implementation
tion buffer. All arrays mapped to the same instruction buffer of individual instructions.
execute the same instruction. Every cycle, one instruction 1) add The add instruction is an n-ary operation that adds
is read out of the each instruction buffer and multi-casted the data in rows specified by <mask>. The <mask> is a 128-bit
to the memory arrays in the tile. The execution model is mask which is set for each row in the array that participates
discussed in detail in Section 4. in addition. Figure 2 (a) shows an add operation. The mask
The processor evaluated in this paper consists of 4,096 is fed to word-line DACs, which is used to apply a Vdd (’11’)
tiles, 8 clusters per tile, and 8 memory arrays per cluster. or Vdd/2 (’10’) to the word-lines. A ’1’ in the mask activates
Each array can store 4KB of data and has 8 SIMD lanes of 32 a row. Each bit-cell in a ReRAM array can be abstractly
bits each. Consequently, the processor has aggregate SIMD thought of as variable resistor. Addition is performed inside
width of two million lanes, aggregate memory capacity of the array by summing up currents generated by conductance
1GB and 494 mm 2 area. The resolution of ADC and DAC is (=resistance−1 ) of each bit-cell. A sample and hold (S + H)
set to 5 and 2 bits. circuit receives the bit-line current and feeds it the ADC
unit which outputs the digital value for the current. The
2.2 Instruction Set Architecture result from each bit-line represents the partial sum for bits
The proposed Instruction Set Architecture (ISA) is simple and stored in that bit-line. A word or data element is stored across
compact. Compared to a standard SIMD ISA, In-memory ISA multiple bit-lines. An external digital shifter and adder (S +
does not support complex (e.g. division) and specialized (e.g. A) combines the partial sums from bit-lines. The final result

3
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

Opcode Format Cycles problem by using an additional set of DACs for feeding bit-
add <mask><dst> 3 lines. As in ISAAC, the operation is pipelined into 3 stages:
dot <mask><reg_mask><dst> 18 XB, ADC and S+A, processing 2 bits per cycle, resulting in
mul <src><src><dst> 18 18 cycles in total for 32 bit data.
sub <mask><mask><dst> 3
shift{l|r} <src><dst><imm> 3
4) sub The sub instruction performs element-wise subtrac-
mask <src><dst><imm> 3 tion over elements stored in the two set of memory rows
mov <src><dst> 3 (minuends and subtrahends) specified by <mask>s and stores
movs <src><dst><mask> 3 the result in <dst>. Subtraction in ReRAM arrays has not
movi <dst><imm> 1 been explored before. We support this operation by draining
movg <gaddr><gaddr> Variable the current via word-line as shown in Figure 2 (d). The out-
lut <src><dst> 4 put voltage for word-line DAC of the subtrahend row is set
reduce_sum <src><gaddr> Variable to ground allowing for current drain. Hence the remaining
Table 1. In-Memory Compute ISA. The instructions use current over the bit-line represents the difference between
operand addresses specified by either <src>, <dst> or minuend and subtrahend. For this operation we reverse the
<gaddr>. The <src> and <dst> is a 8-bit local address (1- voltage across memristor bit-cell. Fortunately, several re-
bit indicates memory/register + 7-bit row number/register ports on fabricated ReRAM demonstrate the symmetric V/I
number). The <gaddr> is a 4 byte global address (12-bit tile properties of memristor with reverse voltage across termi-
# + 6-bit array # + 7-bit row # + reserved bits). The <imm> nals [36, 44].
field is a 16 byte immediate value. 5) lut The lut instruction sends the value stored in <src>
as an address to the lookup table (LUT), and write back
is written back to <dst> memory row or register. Each of the data read from the LUT to <dst>. The multi-purpose
ReRAM crossbar (XB), ADC and S+A takes 1 cycle, resulting LUT is implemented for supporting high-level instructions.
in 3 cycles in total. LUT is utilized for nonlinear functions such as sigmoid, and
initial seeding of division and transcendental functions (Sec-
2) dot The dot instruction is also an n-ary operation which tion 5.1). The LUT has 512 entries of 8-bit numbers to suffice
emulates a dot product over the data in rows specified by the precision requirement of the arithmetic algorithms imple-
<mask>. A dot product is a sum of products. The sum is done mented [16]. LUT is a small SRAM structure which operates
using current summation over the bit-line as explained ear- at much higher frequency than ReRAM arrays and hence
lier. Each row computes a product by streaming in the multi- shared by multiple arrays. Its contents are initialized by the
plicand via the word-line DAC in a serial manner as shown host at runtime. lut takes 4 cycles, adding 1 cycle on top of
in Figure 2 (b). The multiplicands are stored in register file the basic XB, ADC, S+A pipeline.
and the individual registers are specified using <reg_mask>
field. 6) mov, movi, movg, movs The mov family of instructions
Robust current summation over ReRAM bit-lines has been facilitates movement of data between memory rows of an
demonstrated in prior works [20, 43]. We adapt the dot prod- array, registers, and even across arrays via global addressing
uct architecture from ISAAC [39] for our add and dot in- (<gaddr>). The global addresses are handled by the network,
structions. We refer the reader to these works for further hence the latency of gobal moves (movg) is variable. Imme-
implementation details. diate values can be stored to <dst> as well via movi instruc-
tion. These instructions are implemented using traditional
3) mul The mul instruction is 2-ary operation that per- memristor read/write operations. The selective mov (movs)
forms element-wise multiplication over elements stored in instruction selectively moves data to elements in <dst> based
the two <src> memory rows and stores the result in <dst>. on an 8-bit mask. Recall that any <dst> row can store 8 32-bit
To implement this instruction we utilize the row of DACs elements in the prototype architecture.
at the top of the array feeding the bit-lines (Figure 1 (c)).
The multiplicand is streamed in through the DACs serially 7) reduce_sum The reduce_sum instruction sums up the
2-bits at time and the product is accumulated over bit-lines values in the <src> row of different arrays. The reduction is
as shown in Figure 2 (c). The word-line DACs are set to Vdd executed outside the arrays. This instruction utilizes the H-
(’11’). tree network and the adders in the routers to reduce values
Note that element-wise multiplication was not supported across the tiles.
in prior works on memristor-based accelerators, and is a 8) shift / mask The shift instruction shifts each of the
new feature we designed for supporting general purpose vector element in <src> by <imm> bits. The mask instruction
data-parallel computation. Since dot product uses the same logically ANDs each of the vector element in <src> with
multiplicand for all elements stored in a row, it can not be <imm>. These instructions utilize the digital shift and adder
utilized for element-by-element multiplication. We solve this (S+A) outside the arrays.

4
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

Discussion Our goal is two-fold. First, keep the instruc- same resolution as resistance level of ReRAM bit-cells. In
tion set as simple as possible to reduce design complexity our design, 2-bit DACs are required.
and retain area efficiency (hence memory density). Second,
expose all compute primitives which can be done in-situ
inside the memory array without reading the data out. The 3 Programming Model
proposed ISA does not include any instructions for looping, We choose Google’s TensorFlow [1] as the programming
branch or jump instructions. We rely on the compiler to front-end for proposed in-memory processor. By using Ten-
unroll loops wherever necessary. Our SIMD programming sorFlow, programmers write the kernels which will be of-
model ensures small code size, in spite of unrolling. Control floaded to the memory. TensorFlow expresses the kernel as a
flow is facilitated via condition computation and selective Data Flow Graph (DFG). Since TensorFlow is available for va-
moves (Section 3). The compute instructions in the ISA are riety of programming languages (e.g. Python, C++, Java, Go),
restricted to add, sub, dot, mul. Our programming model programmers can easily plug in the TensorFlow kernels in
based on TensorFlow, supports a rich set of compute opera- their code. Also, since TensorFlow supports variety of target
tions. Our compiler transforms them to a combination of ISA hardware systems (e.g. CPU, GPU, distributed system), pro-
instructions (Section 5.1) and hence enables general purpose grammers can easily validate the functionality of the kernel
computation. and scale the system depending on the input size.
TensorFlow (TF) offers a suitable programming paradigm
for data parallel in-memory computing. First, nodes in TF’s
2.3 Precision and Signed Arithmetic
DFGs can operate on multi-dimensional matrices. This fea-
Floating point operations need normalization based on expo- ture embeds the SIMD programming model and facilitates
nent, hence in-memory computation for the floating point easy exposure of Data Level Parallelism (DLP) to the com-
operands encumbers huge complexity. We adopt a fixed point piler. Second, irregular memory accesses are restricted by
representation. We give the flexibility for deciding the po- not allowing subscript notation. This feature benefits both
sition of the decimal point to trade-off between precision programmers and compilers. Programmers do not have to
and range. But the responsibility to prevent bit overflow and convert high-level data processing operations (e.g., vector
underflow is left to the programmers. We developed a testing addition) into low-level procedural representations (e.g., for-
tool that can calculate the dynamic range of the input that loop with memory access). The compiler can fully under-
assures the required precision. Note that under the condi- stand the memory access pattern. Third, the DFG naturally
tion that overflow/underflow does not happen, fixed point exposes Instruction Level Parallelism (ILP). This can be di-
representation gives better accuracy compared to floating rectly used by a compiler for Very Long Instruction Word
point. Section 6 discusses the impact on application output. (VLIW) style scheduling to further utilize underlying paral-
For general purpose computation, it is important to sup- lelism in the hardware without implementing complex out-
port negative values. Prior work [39] uses a biased repre- of-order execution support. Finally, TensorFlow supports
sentation for numbers, and then normalizes the bias via a persistent memory context in nodes of the DFG. This is
subtraction outside the memory arrays. This approach is useful in our merged memory and compute architecture for
perhaps reasonable for CNN dot products, because the over- storing persistent data across kernel invocations.
head of subtraction outside the array for normalizing the Our programming model and compilation framework sup-
bias, is compensated by multi-row addition within the array. port the following TensorFlow primitives (See Table 2 for
In general, data-parallel programs’ additions need not span the list of supported TF nodes.):
multiple rows (often 2 rows are sufficient). In such a sce-
nario, subtraction outside the array needs additional array Input nodes The proposed system supports three kinds
read which offsets the benefit of biased addition inside the of input: Placeholder, Const, and Variable. Placeholder is a
array. non-persistent input and will not be used for future module
We observe that for b-bit bit-cells (i.e. 2b resistance levels), invocations. Const is used to pass constants whose values
current summation followed by shift+adder across bit-lines are known at compile time. Scalar constants are included in
outputs the correct results as long as negative numbers are ISA, and vector constants are stored in either the register file
stored in 2b ’s complement notation. In our prototype de- or an array based on the type of their consumer node in the
sign, arrays have 2-bit bit-cell, hence addition over negative DFG. Variable is the input with persistent memory context,
numbers stored as 4’complement will yield correct results. of which data can be used and updated in the future kernel
Furthermore it can be mathematically proved that 4’s comple- invocations. Variables are initialized at kernel launch time.
ment is exactly equal to 2’s complement in base-4 representa- Operations The framework supports a variety of complex
tion. Thus there is no need for conversion between number operation nodes including transcendental functions. We dis-
formats. The same principle holds true for multiplication as cuss the process of lowering these operation nodes into na-
long as the DAC used for streaming in the multiplicand has tive memory ISA in Section 5.1.

5
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

Input nodes Const Placeholder Variable


Abs Add ArgMin Div Exp FloorDiv Less Mul RealDiv Sigmoid
Arithmetic Operations
Sqrt Square Sub Sum Conv2D ExpandDims MatMul Reshape Tensordot
Control Flow etc. Assign AssignAdd Gather Identity Pack Select Stack NoOp

Table 2. Supported TensorFlow Nodes. ( has restrictions on function/data dimension.)

Intuitively, a DFG generated by TensorFlow can represent


one module. At kernel launch time, the number of module
Input matrix A instances are dynamically created in accordance with the
1 i n
input vector length.
Input matrix B
The proposed execution model allows restricted communi-
cation between instances of modules. Such communication is
only allowed using scatter/gather nodes or reduction nodes
Data Flow in the DFG. We find these communication primitives are
Graphs sufficient to express most TensorFlow kernels.
Each module is composed of one or more Instruction
Blocks (IB) as shown in Figure 3. An IB consists of a list
M1 Mi Mn of instructions which will be executed sequentially. Concep-
tually, an IB is responsible for executing a group of nodes in
the DFG. Multiple IBs in a module may execute in parallel
Modules
to expose ILP. The compiler explores several optimizations
to increase the number of concurrent IBs in a module and
thereby exposes the ILP inside a module.
IB1 IB2 IB1 IB2 IB1 IB2
We view rows in the ReRAM array as a SIMD vector unit
M1 M2 M1 M2 Mi Mi+1 Mi Mi+1 Mn-1 Mn Mn-1Mn ReRAM with multiple lanes or SIMD slots. Each IB is mapped to
IB1 IB1 IB2 IB2 IB1 IB1 IB2 IB2 IB1 IB1 IB2 IB2 arrays a single lane or one slot. To ensure full utilization of all
SIMD lanes in the array, the runtime maps identical IBs from
different instances of the same module to an individual array
Figure 3. Execution Model.
as shown in the last row of Figure 3. This mapping results in
Control Flow Control flow is supported by a select instruc- correct execution because all instances of a module have the
tion. A select instruction takes three operands and generates same set of IBs. Furthermore, IBs of a module are greedily
output as follows: assigned to nearby arrays so that the communication latency
O[i] = Cond[i] ? A[i] : B[i]. between IBs is minimized.
A select instruction is converted into multiple selective move
(movs) instructions. The Condition variable is precomputed
and used to generate the mask for the selective moves. 5 Compiler
The overall compilation flow is shown in Figure 4. Our com-
Reduction, Scatter, Gather A reduction node is supported
piler takes Google’s TensorFlow DFG in the protocol buffer
by the compiler and natively in the micro-architecture. Scat-
format as an input, optimizes it to leverage parallelism that
ter and gather operations are used to implement an indirect
the in-memory architecture offers, and generates executable
reference to the memory address given in the operand. These
code for the in-memory processor ISA. The compiler first an-
operations generate irregular memory accesses and require
alyzes the semantics of input DFG which has vector/matrix
synchronizations to guarantee consistency. Because of the
operands and creates a module with a single IB with required
non-negligible overhead, these operations should be used
control flow. Several optimizations detailed later expand a
rarely. We observe in many cases that these operations can be
module to expose intra-module parallelism by decomposing
eliminated before offloading the kernel by sending gathered
and replicating the instructions in the single IB into multiple
data from CPU.
IBs and merging redundant nodes. This is followed by in-
struction lowering, scheduling of IBs in a module, and code
4 Execution Model
generation. Instruction lowering transforms complex DFG
The proposed architecture processes data in a SIMD execu- nodes into simpler instructions supported by in-memory
tion model at the granularity of module. At runtime, different processor ISA. Instruction lowering is also done by promot-
instances of a module execute the same instructions on dif- ing the specific instructions (e.g. ABS) to general ones (e.g.,
ferent elements of input vectors in a lock-step manner. Our MASK) and expanding the instruction into a set of native
compiler generates a module by unrolling a single dimen- memory ISA instructions.
sion of multi-dimensional input vectors as shown in Figure 3.

6
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

  

 
   
   
  
   
   

 
Figure 4. Compilation Flow.

   Input Nodes Input Nodes


  
  [2,3] [4,5] [2,3] [4,5]
  [3] [4]


 
IB Expansion
[2] [5]
} Unpack
  + + +
[2+4,3+5]=[6,8] [2+4]=[6] [3+5]=[8]
 

}
 Pack
[6,8]
Reduce Reduce
[6+8]=[14] [6+8]=[14]
Figure 5. Node Merging.
Figure 6. IB Expansion.

The compiler tool-chain is developed using Python 3.6 and are typically small (e.g. 3x3 for HotSpot and Sobel filter), we
C++. The compiler front-end uses TensorFlow’s core frame- map the input data to the array and stream in the filter. This
work to parse the TensorFlow Graph. TensorFlow nodes approach reduces buffering for the input data and improves
supported at this time are listed in Table 2. array utilization. Furthermore, the compiler decomposes the
convolution into a series of matrix-vector dot-products done
5.1 Supporting Complex Operations simultaneously on different input matrix slices, thereby re-
The target memory ISA is quite simple and supports limited ducing the convolution time significantly.
number of compute instructions as described in Section 2.2.
Natively, the arrays can execute dot product, addition, multi- 5.2 Compiler Optimizations
plication and subtraction. However, general purpose compu- Node Merging A node merging pass is introduced to fill
tation requires supporting a diverse set of operations ranging the gap between the capabilities of the target in-memory
from division, exponents, transcendental functions, etc. We architecture and the expressibility of the programming lan-
support these complex operations by converting them into
guage. The proposed in-memory ISA can support compute
a set of LUT, addition and multiplication instructions based
operations over n-operands. A node merging pass promotes
on algorithms used in Intel IA-64 processors [14, 19]. a series of 2-operand compute nodes in the DFG of a module,
The compiler uses either Newton-Raphson or Maclaurin- to a single compute node with many operands as shown in
Goldschmidt methods that iteratively apply a set of instruc- Figure 5. The maximum number of operands n is limited
tions to an initial seed from the look-up table and refine by the number of array rows and the resolution of ADCs.
the result. Our implementation chooses the best algorithms
ADCs consume a significant fraction of chip power, and their
based on the precision requirement. We could have used sim-
power consumption is proportional to their resolution. Our
pler algorithms (e.g., SRT division), but we employ iterative compiler can generate code for an arbitrary resolution n
algorithms because (1) bit shift cannot be supported in the and the chip architects can choose a suitable n based on the
array, so for each bit shift operation the values need to be power budget.
read out and written back, (2) supporting bit-wise logical The node merging pass also combines certain combina-
operations (and, or) are challenging because of multi-level re- tions of nodes to reduce intermediate writes to memory
sistive bit-cells, and (3) simple algorithms often require more
arrays. For example, a node which feeds its results to a mul-
space, which is challenging for the data carefully aligned in tiplication node need not write back the results to memory.
the array. This is because multiplicand is directly streamed into the
Finally, the compiler also lowers convolution nodes in array from registers.
the DFG to the native memory ISA. Prior works [39] have
mapped convolution filter weights to the array and per- Instruction Block Scheduler Independent Instruction Blo-
formed dot product computation by streaming in the input cks (IBs) inside a module can be co-scheduled to maximize
features. Because filters used for general-purpose programs ILP as shown in the third row of Figure 3. Our compiler

7
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

adapts the Bottom-Up-Greedy (BUG) algorithm [15] for sched- Benchmark Input data shape # IB insts.
uling IBs. BUG was first used in the Bulldog VLIW com- Blackscholes [4, 10000000] 163

PARSEC
piler [15] and has been adapted in various schedulers for Canneal [2, 600, 4096] 6
VLIW/data-flow architecture, e.g. Multiflow compiler [29] Fluidanimate [3, 17, 229900] 294
and compiler for the tiled data-flow architecture, WaveScalar Streamcluster [2, 128, 1000000] 6
[30]. Our implementation of the BUG algorithm first tra- Backprop [16, 65536] 117

Rodinia
verses the DFG through a bottom-up path, collecting candi- Hotspot [1024, 1024] 26
date assignments of the instructions. Once the traversal path Kmeans [34, 494020] 91
StreamclusterGPU [2, 256, 65536] 6
reaches the input (define) node, it traverses a top-down path
to make a final assignment, minimizing the data transfer Table 3. Evaluated workloads. Numbers in bracket indicates
latency by taking both the operand location and successor size of respective x,y,z dimensions
location into consideration. We modify the original BUG
algorithm to introduce the notion of in-memory computing, the data characteristics, the SIMD slots assigned to a module
where a functional unit is identical to the data location. We may not be fully utilized in every cycle. In fact, expanding
also modified the algorithm to take into account read/write a module could slow down the total execution time when
latency, network resource collision latency, and operation the number of IBs across all module instances exceeds the
latency. aggregate SIMD slots in the memory chip. In such a scenario,
Instruction Block (IB) Expansion Instruction Blocks that multiple iterations may be needed to process all module
use multi-dimensional vectors as operands can be expanded instances, resulting in a performance loss.
into several instruction blocks with lower-dimension vec- Our compiler can generate code for arbitrary upper bounds
tors to further exploit ILP and DLP. For example, consider a on the number of IBs per module, and can flexibly tune the
program that processes 2D matrices of dimension sizes [2, intra-module parallelism with respect to inter-module par-
1024]. The compiler will first convert the program to a mod- allelism. We develop a simple analytical model to compute
ule which will be instantiated 1,024 times and executed in the approximate execution time given the number of IBs
parallel. Each module will have an IB that processes 2D vec- per module and number of module instances. The number
tors. The expansion pass will further decompose the module of module instances is dependent on input data size, and is
into 2 IBs that process 1D scalar value. only known at runtime. Thus, the optimal code is chosen at
The expansion pass traverses the nodes in a module’s DFG runtime based on the analytical model and streamed in to
in a bottom-up/breath-first order and detects the subtrees the memory chip from host.
that process multi-dimensional vectors of the same size. The
subtree regions detected are expanded. To ensure the dimen- 6 Methodology
sions are consistent between the sub-tree regions, pack and Benchmarks We use a subset of benchmarks from PAR-
unpack pseudo operations are inserted between these re- SEC multi-threaded CPU benchmark suite [8] and Rodinia
gions. Pack and unpack operations are later converted to GPU benchmark suite [11] as listed in Table 3. We re-write
mov instructions. A simplified example is shown in Figure 6. the kernels of the benchmarks in TensorFlow code and then
Pipelining A significant fraction of the compute instruc- generate in-memory ISA code using our compiler. We choose
tions goes through two phases: compute and write-back. to port the applications which could be easily transformed
Unfortunately, these two phases are serialized, since an ar- to Structure of Array (SoA) code for the ease of porting to
ray cannot compute and write simultaneously. Our compiler TensorFlow and a data-parallel architecture. We leave the
breaks this bottleneck by pipelining these phases and en- remaining benchmarks to future work. For the benchmarks
suring the destination address for the write-backs are in a which use floating point numbers in the kernel, we assess the
separate array. By using two arrays, one array computes effect of converting it into fixed point numbers. By tuning
while writing back the previous result to the other array. In the decimal point placement, we ensure that the input data
the worst case, this optimization lowers the utilization of is in the dynamic range of fixed point numbers. We ensure
arrays by half. Thus, this optimization is beneficial when the that the quality of result requirement defined by the bench-
number of modules needed for the input data is lower than mark is met. We use the native dataset for each benchmark
the aggregate SIMD capacity of the memory chip. and compare it with the native execution on the CPU and
GPU baseline systems. The size of the input for each kernel
Balancing Inter-Module and Intra-Module Parallelism
invocation ranges from 8MB to 2GB.
Some of the optimizations discussed above attempt to im-
prove performance by exposing parallelism inside a module. Area and Power Model All power/area parameters are
Because of Amdahl’s law, increasing the number of IBs in summarized in Table 4. We use CACTI to model energy and
a module will not result in linear speedup. Depending on area for registers and LUTs. The energy and area model for
ReRAM processing unit, including ReRAM crossbar array,

8
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

Component Params Spec Power Area(mm2 ) Parameter CPU (2-sockets) GPU (1-card) IMP
ADC resolution 5 bits 64 mW 0.0753 SIMD slots 448 3840 2097152
frequency 1.2 GSps Frequency 3.6 GHz 1.58 GHz 20 MHz
number 64 × 2 Area 912.24 mm 2 471 mm2 494 mm2
DAC resolution 2 bits 0.82 mW 0.0026 TDP 290 W 250 W 416 W
number 64 × 256 7MB L2; 70MB L3 3MB L2 1GB
Memory
S+H number 64 × 128 0.16 mW 0.00025 64GB DRAM 12GB DRAM RRAM
ReRAM number 64 19.2 mW 0.0016 Table 5. Comparison of CPU, GPU, and IMP Parameters
Array
S+A number 64 1.4 mW 0.0015
server as CPU baseline and Nvidia Titan XP as the GPU base-
IR size 2KB 1.09 mW 0.0016
OR size 2KB 1.09 mW 0.0016 line. The IMP configuration (shown in Table 4) evaluated has
Register size 3KB 1.63 mW 0.0024 4,096 tiles and 64 128×128 ReRAM arrays in each tile.
XB bus width 16B 1.51 mW 0.0105 Table 5 compares important system parameters of the
size 10 × 10 three configurations analyzed. IMP has significantly higher
LUT number 8 6.8 mW 0.0056 degree of parallelism. IMP enjoys 546× (4681×) more SIMD
Inst. Buf size 8 × 2KB 5.83 mW 0.0129 slots than GPU (CPU). The massive parallelism comes at
Router flit size 16 0.82 mW 0.00434 lower frequency, IMP is 80× (180×) slower than GPU (CPU)
num_port 9 in terms of clock cycle period. IMP is approximately area
S+A number 1 0.05 mW 0.000004 neutral compared to GPU, and about 2× lower area than the
1 Tile Total 101 mW 0.12 2-socket CPU system. The TDP of IMP is significantly higher,
Inter-Tile number 584 0.81 W 2.50 however we will show that IMP has lower average power
Routers consumption and energy consumption (Section 7.3).
Chip total 416 W 494 mm2
7.2 Operation Study
Table 4. In-Memory Processor Parameters
Figure 7 presents the operation throughput of CPU, GPU,
and IMP, measured by profiling microbenchmarks of add,
multiply, divide, sqrt and exponential operations. We com-
sample-and-hold circuits, shift-and-add circuits are adapted pile the microbenchmarks with -O3 option and parallelize it
from the ISAAC [39]. We employ energy and power model using OpenMP for the CPU. We find IMP achieves orders of
in [2] for the on-chip interconnects and assume an activity magnitude improvement over the conventional architectures.
factor of 5% for TDP (given that the network operates at The reason is two fold: massive parallelism and reduction
2 GHz and memory at 20 MHz). The benchmarks show an in data movement. The proposed architecture IMP has 546×
order of magnitude lower utilization of network. ADC/DAC (4681×) more SIMD slots compared to GPU (CPU) as shown
energy and power are scaled for 5-bit and 2-bit precision [27]. in Table 5. Although IMP has lower frequency, it more than
While the state-of-the art ReRAM device supports 4 to 6 re- compensates this disadvantage by avoiding data movement.
sistance levels [6], strong non-uniform analog resistance due CPU and GPU have to pay a significant penalty for reading
to process variation makes it challenging to program ReRAM the data out of off-chip memory and passing it along the
for analog convolution, resulting in convolution errors [12]. on-chip memory hierarchy to compute units.
We conservatively limit the number of cell levels to two and IMP speedup is especially higher for the simple operations.
use multiple cells in a row to represent one data. The largest operation throughput is achieved by addition
Performance Model For determining the IMP performance, (2,460× over CPU and 374× over GPU), which has smallest
we develop a cycle accurate simulator which uses an inte- latency in IMP. On the other hand, division and transcenden-
grated network simulator [22]. Note ReRAM array executes tal functions take many cycles to produce the results. For
instructions in order, instruction latency is deterministic, example, it takes 62 cycles for division and 115 cycles for ex-
network communication is rare, and compiler schedules in- ponential, while addition takes only 3 cycles. Therefore, the
struction statically after accounting for network delay. Thus throughput gain becomes smaller for complex operations.
estimated performance for IMP is highly accurate. While CPU and IMP per-operation throughput reduces for
higher latency operations, GPU throughput increases. This is
7 Results because the GPU performance is bounded by the memory ac-
cess time, and unary operators (exponential and square root)
7.1 Configurations Studied have less amount of data transfer from the GPU memory.
In this section we evaluate the proposed In-Memory Pro- Figure 8 and 9 show the operation latency of addition and
cessor (IMP), and compare it to state-of-art CPU and GPU multiplication for different input size. We compare the ex-
baselines. We use an Intel Xeon E5-2697 v3 multi-socket ecution time of single-threaded CPU, multi-threaded CPU

9
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

ϭϬϬ͕ϬϬϬ
Wh KƉĞŶDW 'Wh /DW Wh KƉĞŶDW 'Wh /DW
Wh 'Wh /DW ϭ͘нϬϵ ϭ͘нϬϵ
ϭϬ͕ϬϬϬ ϭ͘нϬϴ ϭ͘нϬϴ
ϭ͘нϬϳ
ϭ͘нϬϳ

>ĂƚĞŶĐLJ;ŶƐͿ
ϭ͕ϬϬϬ ϭ͘нϬϲ

>ĂƚĞŶĐLJ;ŶƐͿ
'KWͬƐ

ϭ͘нϬϲ ϭ͘нϬϱ
ϭ͘нϬϱ ϭ͘нϬϰ
ϭϬϬ
ϭ͘нϬϰ ϭ͘нϬϯ
ϭϬ ϭ͘нϬϯ ϭ͘нϬϮ
ϭ͘нϬϮ ϭ͘нϬϭ
ϭ͘нϬϭ ϭ͘нϬϬ
ϭ
ĚĚ DƵů ŝǀ ^Ƌƌƚ džƉ ϭ͘нϬϬ ϰ ϲϰ ϭ͕ϬϮϰ ϭϲ͕ϯϴϰ ϮϲϮ͕ϭϰϰ
ϰ ϲϰ ϭ͕ϬϮϰ ϭϲ͕ϯϴϰ ϮϲϮ͕ϭϰϰ ĂƚĂƐŝnjĞ;<Ϳ
Figure 7. Operation throughput (log ĂƚĂƐŝnjĞ;<Ϳ
scale). Figure 8. Addition Latency. Figure 9. Multiplication Latency.

ϭϬϬϬ ϭϬ͕ϬϬϬ
<ĞƌŶĞů ĂƚĂůŽĂĚŝŶŐ EŽ ^ĞƋƵĞŶƚŝĂůнĂƌƌŝĞƌ
Wh 'Wh /DW ϭ͘ϬϬ

EŽƌŵĂůŝnjĞĚdžĞĐƵƚŝŽŶdŝŵĞ
ϭϬϬ ϭ͕ϬϬϬ
Ϭ͘ϴϬ
<ĞƌŶĞů^ƉĞĞĚƵƉ
ϭϬϬ
ϭϬ
Ŷ:ͬKƉ

Ϭ͘ϲϬ
ϭϬ
ϭ Ϭ͘ϰϬ
ϭ
Ϭ͘ϮϬ
ůĂĐŬƐĐŚŽůĞƐ

ĂŶŶĞĂů

,ŽƚƐƉŽƚ
ĂĐŬƉƌŽƉ
&ůƵŝĚĂŶŝŵĂƚĞ

^ƚƌĞĂŵĐůƵƐƚĞƌ

<ŵĞĂŶƐ

^ƚƌĞĂŵĐůƵƐƚĞƌ
'KDE

'KDE
Ϭ͘ϭ
Ϭ͘ϬϬ
Wh /DW /DW Wh /DW /DW Wh /DW /DW Wh /DW /DW
Ϭ͘Ϭϭ DĞŵ DĞŵ DĞŵ DĞŵ
ĚĚ DƵů ŝǀ ^Ƌƌƚ džƉ
ůĂĐŬƐĐŚŽůĞƐ &ůƵŝĚĂŶŝŵĂƚĞ ĂŶŶĞĂů ^ƚƌĞĂŵĐůƵƐƚĞƌ
WhĞŶĐŚŵĂƌŬƐ 'WhĞŶĐŚŵĂƌŬƐ

Figure 10. Operation energy. Figure 11. Kernel speedup. Figure 12. CPU Application performance.

(OpenMP), and GPU. IMP offers the highest operations per- SIMD slots. This series of multiplications of distance calcu-
formance among the three architectures, even for the small- lation increases its critical latency and limits the speedup.
est input size (4KB). As suggested in the operation throughput evaluation on Fig-
Figure 10 shows the energy consumption for each oper- ure 7, IMP achieves higher performance especially when the
ation. Because of the high operation latency and the large kernel has significant DLP and many simple operations. We
energy consumption of ADC, we observe higher energy con- observe in general mul, add, and movl instructions are most
sumption for the complex operations relative to GPU. Ulti- common, while movg, reduce_sum and lut are less frequent.
mately, the instruction mix of the application will determine For example, a blackscholes kernel has 14% add, 21% mul,
the energy efficiency of the IMP architecture. and 58% local move instructions. The rest are mask and lut.
The performance results for the overall PARSEC applica-
tion are presented in Figure 12. For this result, we assume
7.3 Application Study
two scenarios: (1) IMP (memory) assumes IMP is integrated
In this section we study the application performance. First, into the memory hierarchy and the memory region for the
we analyze kernel performance shown in Figure 11. For CPU kernel is allocated in IMP. (2) IMP (accelerator) is a configu-
benchmarks, the figure shows performance for hot kernels ration when IMP is used as an accelerator and requires data
in PARSEC benchmarks. We assume that non-kernel code of copy as GPUs do. While we believe IMP (memory) is the
PARSEC benchmarks are executed in the CPU. Note that this correct configuration, IMP (accelerator) is a near-term easier
data transfer overhead is taken into account in the results of configuration which can be a first step towards integrating
IMP. The GPU benchmarks from Rodinia are relatively small, IMP in host servers.
hence we regard them as application kernels. We observe a On average, IMP (accelerator) yields a 5.55× speedup and
41× speedup for CPU benchmarks and 763× speedups for IMP (memory) provides 7.54× for the Region of Interest (ROI).
GPU benchmarks. We observe that 41× kernel speedup does not translate to
GPU benchmarks obtain higher performance improve- similar application speedup due to Amdahl’s law. Figure 12
ment in IMP because of the opportunity to use dot product also shows the breakdown of the execution time, which is
operations and higher data level parallelism. On the other divided into kernel, data loading, communication on NoC,
hand, the speedup for kmeans is limited to 23×. kmeans and the non-kernel part of the ROI. The non-kernel part
deals with Euclidean distance calculation of 34 dimensional is mainly composed of time for barrier and unparalleled
vectors, and this incurs many element-wise multiplications. parts of the program. It can be seen that 88% of execution
Although kmeans shows significant DLP available in the dis- time can be off-loaded to IMP. We also observe that large
tance calculations, we could not fully utilize the DLP of the fraction of the execution time on ReRAM is used for data
application because of the capacity limitation of the IMP’s loading (4× of the kernel at maximum). Thus, as suggested

10
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

ϭϬϬϬϬ ϮϬϬ Ϯ͘Ϭ Ϯ͘ϳϯ ϳ͘ϱϵ


ĂƐĞůŝŶĞ /DW ĂƐĞůŝŶĞ /DW DĂdž>W Ϯ͘Ϭ

EŽƌŵĂůŝnjĞĚdžĞĐƵƚŝŽŶdŝŵĞ
ǀĞƌĂŐĞWŽǁĞƌ;tͿ
<ĞƌŶĞůŶĞƌŐLJ;:Ϳ

ϭϬϬ ϭϱϬ ϭ͘ϱ DĂdž/>W


DĂdžƌƌĂLJhƚŝů
ϭ ϭϬϬ
ϭ͘Ϭ
Ϭ͘Ϭϭ ϱϬ
Ϭ͘ϱ
Ϭ͘ϬϬϬϭ Ϭ
ůĂĐŬƐĐŚŽůĞƐ

ĂŶŶĞĂů

ĂĐŬƉƌŽƉ

,ŽƚƐƉŽƚ

ůĂĐŬƐĐŚŽůĞƐ

ĂŶŶĞĂů
&ůƵŝĚĂŶŝŵĂƚĞ

^ƚƌĞĂŵĐůƵƐƚĞƌ

<ŵĞĂŶƐ

^ƚƌĞĂŵĐůƵƐƚĞƌ

ĂĐŬƉƌŽƉ

,ŽƚƐƉŽƚ
&ůƵŝĚĂŶŝŵĂƚĞ

^ƚƌĞĂŵĐůƵƐƚĞƌ

<ŵĞĂŶƐ

^ƚƌĞĂŵĐůƵƐƚĞƌ
Ϭ͘Ϭ

WhďĞŶĐŚŵĂƌŬƐ 'WhďĞŶĐŚŵĂƌŬƐ WhďĞŶĐŚŵĂƌŬƐ 'WhďĞŶĐŚŵĂƌŬƐ

Figure 13. Energy consumption. Figure 14. Average power. Figure 15. Compiler optimizations.

Config Blackscholes Fluidanimate Canneal Streamcluster Backprop Hotspot Kmeans Streamcluster


MaxDLP 665 / 1 1015 / 1 7220 / 1 2698 / 1 1028 / 1 1081893 / 1 3623 / 1 5386 / 1
MaxILP 377 / 5 437 / 9 1216 / 1212 159 / 129 184 / 32 3125 / 1024 134 / 38 287 / 257
MaxArrayUtil 665 / 1 437 / 4 1228 / 444 2698 / 1 171 / 27 1024 / 3125 1584 / 3 1169 / 6
Lifetime (years) 8.89 20.1 32.2 22.1 15.7 250 5.88 12.8
Table 6. (1) IB latency (cycles) and # of IBs for different optimization targets. (2) Lifetime

before, in-memory accelerator is better coupled with the module to maximize DLP. This policy is useful when the data
existing memory hierarchy to avoid data loading overhead. size is larger than the SIMD slots IMP offers. However, the
We also find the NoC time is not the bottleneck, because of module does not have an opportunity to exploit ILP in the
the efficient reduction scheme supported by the reduction program. Also, IB expansion is not applied for this policy.
tree network integrated in the NoC. The second optimization target is MaxILP, which fully uti-
Figure 13 shows the total energy consumption of the entire lizes the ILP and lets IB expansion expand all multi-dimensio-
application (thus includes both kernel and non-kernel energy nal data in the module. This will create largest number of IBs
for PARSEC). We find 7.5× and 440× energy efficiency for per module and shortest execution time for single module.
CPU benchmarks and GPU benchmarks, respectively. This However, because of the sequential part of the IB, array uti-
energy reduction is partly due to energy efficiency of IMP for lization becomes lower. This policy can increase the overall
kernel’s instruction mix and partly due to reduced execution execution time when the kernel is invoked multiple times
time. due to insufficient SIMD slots in IMP.
Figure 14 shows the average power consumption of the The third optimization target, MaxArrayUtil, maximizes
benchmarks. The TDP of IMP is high when compared to the array utilization considering the number of SIMD slots
GPU and CPU (Table 5). ADCs are the largest contributer needed by input data. For example, if the incoming data
to peak power. The required resolution for ADCs is a func- consumes 30% of the total SIMD slots in IMP, each module
tion of maximum number of operands supported for n-ary can use 3 IBs to fully utilize all the arrays while avoiding
instructions in our ISA. To contain the TDP, we limit the multiple kernel invocations. The compiler optimizes under
ADC resolution to 5-bits and thereby limiting the number of the constraint of maximum IBs available per module
operands for n-ary instructions (add, dot). While this may Table 6 shows the maximum IB latency and the number
affect the performance of a customized dot-product based of IBs per module. Figure 15 presents the execution time
machine learning accelerator significantly, it is not a serious of different optimization policies normalized to MaxDLP
limitation for general purpose computation. Although IMP’s (baseline). MaxArrayUtil represents the best possible per-
TDP is high due to the ADC power consumption, the aver- formance provided by the compiler optimizations under re-
age power consumption is dependent on the instruction’s source constraints imposed by IMP. Overall it provides an
requirement for ADC resolution. For example, the ADCs con- average speedup of 2.3×.
sume less power for instructions with fewer operands. We Two other optimizations not captured by above graph
find that the average resolution for ADC is 2.07 bit (maximum are node merging and pipelining. On average, the module
resolution is 5-bit). Overall, the average power consumption latency is reduced by 13.8% with node merging and 20.8%
for IMP is estimated to be 70.1 W. The average power con- with pipelining.
sumption measured for the benchmarks in the baseline is
81.3 W.
7.5 Memory Lifetime
7.4 Effect of Compiler Optimizations We evaluate the memory lifetime by calculating the write
We introduce three optimization targets to the compiler and intensity of the benchmarks (last row in Table 6). Based on
evaluate how each optimization affects the results. The first the assumption in [26], we consider the ReRAM cells to wear
optimization target is MaxDLP, which creates one IB per out beyond 1011 writes. The compiler balances the writes to

11
Session 1A: New Architectures ASPLOS’18, March 24–28, 2018, Williamsburg, VA, USA

the arrays by assigning and using ReRAM rows in a round- ReRAM cells. While we demonstrated that IMP architec-
robin manner. Assuming the arrays are continuously used ture/programming framework can work with large real-
for kernel computation (but not while the host is processing), world general purpose data-parallel applications, ReRAM as
the median of expected lifetime is 17.9 years. logic approach [7] has been demonstrated for only sequen-
8 Related Work

To the best of our knowledge, this is the first work that demonstrates the feasibility of general-purpose computing in ReRAM-based memory. Below we discuss some of the closely related works.

ReRAM Computing. Since ReRAMs were introduced in [42], several works have leveraged their dot-product computation functionality for neuromorphic computing [25, 34]. Recently, ISAAC [39] and PRIME [13] use ReRAMs to accelerate several Convolutional Neural Networks (CNNs). ISAAC proposes a full-fledged CNN accelerator with a carefully designed pipelining and precision-handling scheme. PRIME studies a morphable ReRAM-based main memory architecture for CNN acceleration. PipeLayer [41] further supports training and testing of CNNs by introducing an efficient pipelining scheme. Aside from CNN acceleration, ReRAM arrays have been used for accelerating Boltzmann machines [9] and perceptron networks [45]. While it has been shown that analog computation in ReRAM can substantially accelerate machine learning workloads, none of these works target general-purpose computing exploiting the analog computation functionality of ReRAM. Another interesting work, Pinatubo [28], modifies the peripheral sense-amplifier circuitry to accomplish logical operations like AND and OR. While this approach appears promising for building complex arithmetic operations, doing arithmetic on multi-bit ReRAM cells using bitwise operations comes with several challenges. Orthogonal to this work, we extend the set of supported operations at low cost.

ReRAM has also been explored to implement logic using Majority-Inverter Graph (MIG) logic [7, 32, 40]. In this approach, each ReRAM bit-cell acts as a majority gate. Since the resistive bit-cell acts as a logic gate, it cannot store data during computation. Let us refer to this approach as ReRAM bit-cell as logic. A critical difference between this approach and ours is that we leverage in-situ operations, where operations are performed in memory over the bit-lines without reading data out. The ReRAM bit-cell as logic approach is a flavor of near-memory computing, where input data is read out of memory and fed to another memory location that acts as a logic unit, thus requiring data movement.

Furthermore, operations using majority gates can be extremely slow, requiring a huge number of memory accesses to implement even simple functions. For example, a multiply is implemented using ≈56,000 majority-gate operations (each majority-gate operation requires one memory cycle) and 419 ReRAM cells [40]. Our approach implements a multiply in 18 memory cycles without requiring any additional ReRAM cells. While we demonstrated that the IMP architecture and programming framework can work with large, real-world, general-purpose data-parallel applications, the ReRAM bit-cell as logic approach [7] has been demonstrated only for sequential micro-kernels (e.g., hamming, sqrt, square) with no comparison to CPU or GPU systems.

Near-Memory Computing. Past processing-in-memory (PIM) solutions move compute near the memory [4, 5, 10, 17, 18, 24, 31, 33, 35, 38, 46, 47]. The proposed architecture leverages an emerging style of in-memory computing referred to as bit-line computing [3]. Since bit-line computing repurposes memory structures to perform computation in-situ, it is intrinsically more efficient than near-memory computing, which augments logic near memory. More importantly, it unlocks massive parallelism at near-zero silicon cost.

Recent works have leveraged bit-line computing in SRAM [3, 21, 23] and DRAM [37, 38]. These works have demonstrated only a handful of compute operations (bit-wise logical, match, and copy), making them limited in applicability for general-purpose computing. Furthermore, this work is the first to develop a programming framework and compiler for in-memory bit-line computing. Our software stack can be utilized to leverage bit-line computing in other memory technologies.

9 Conclusion

This paper proposed a novel general-purpose ReRAM-based In-Memory Processor architecture (IMP) and its programming framework. IMP substantially improves the performance and energy efficiency of general-purpose data-parallel programs. IMP implements a simple but powerful ISA that can leverage the underlying computational efficiency. We propose a programming model and compilation framework in which users develop a program in TensorFlow and maximize its parallelism using the compiler's toolchain. Our experimental results show that IMP can achieve a 7.5× speedup over PARSEC CPU benchmarks and a 763× speedup over Rodinia GPU benchmarks.

Acknowledgments

We thank the members of the M-Bits research group and the anonymous reviewers for their feedback. This work was supported in part by the NSF under the CAREER-1652294 and XPS-1628991 awards, and by C-FAR, one of the six SRC STARnet centers sponsored by MARCO and DARPA.
References

[1] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Jon Shlens, Benoit Steiner, Ilya Sutskever, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Oriol Vinyals, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. http://download.tensorflow.org/paper/whitepaper2015.pdf
[2] Nilmini Abeyratne, Reetuparna Das, Qingkun Li, Korey Sewell, Bharan Giridhar, Ronald G. Dreslinski, David Blaauw, and Trevor Mudge. 2013. Scaling towards kilo-core processors with asymmetric high-radix topologies. In High Performance Computer Architecture (HPCA), 2013 IEEE 19th International Symposium on. IEEE, 496–507.
[3] Shaizeen Aga, Supreet Jeloka, Arun Subramaniyan, Satish Narayanasamy, David Blaauw, and Reetuparna Das. 2017. Compute Caches. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA 2017), Austin, TX, USA, February 4-8, 2017. IEEE.
[4] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. 2015. A scalable processing-in-memory accelerator for parallel graph processing. In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). 105–117. https://doi.org/10.1145/2749469.2750386
[5] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15).
[6] Fabien Alibart, Ligang Gao, Brian D. Hoskins, and Dmitri B. Strukov. 2012. High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm. Nanotechnology 23, 7 (2012), 075201.
[7] Debjyoti Bhattacharjee, Rajeswari Devadoss, and Anupam Chattopadhyay. 2017. ReVAMP: ReRAM based VLIW Architecture for in-Memory comPuting. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. 782–787. https://doi.org/10.23919/DATE.2017.7927095
[8] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (2008), 72–81. https://doi.org/10.1145/1454115.1454128
[9] M. N. Bojnordi and E. Ipek. 2016. Memristive Boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 1–13. https://doi.org/10.1109/HPCA.2016.7446049
[10] Jay B. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, and Peter M. Kogge. 2004. A Low Cost, Multithreaded Processing-in-memory System. In Proceedings of the 3rd Workshop on Memory Performance Issues: In Conjunction with the 31st International Symposium on Computer Architecture (WMPI '04). ACM, New York, NY, USA, 16–22. https://doi.org/10.1145/1054943.1054946
[11] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 44–54.
[12] Pai-Yu Chen, Deepak Kadetotad, Zihan Xu, Abinash Mohanty, Binbin Lin, Jieping Ye, Sarma Vrudhula, Jae-sun Seo, Yu Cao, and Shimeng Yu. 2015. Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2015. IEEE, 854–859.
[13] Ping Chi, Shuangchen Li, and Cong Xu. 2016. PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In IEEE International Symposium on Computer Architecture. IEEE, 27–39. https://doi.org/10.1109/ISCA.2016.13
[14] Marius Cornea, John Harrison, Cristina Iordache, Bob Norin, and Shane Story. 2000. Divide, Square Root, and Remainder Algorithms for the IA-64 Architecture. Open Source for Numerics, Intel Corporation (2000).
[15] John R. Ellis. 1986. Bulldog: A Compiler for VLSI Architectures. MIT Press, Cambridge, MA, USA.
[16] Milos Ercegovac, Jean-Michel Muller, and Arnaud Tisserand. [n. d.]. Simple Seed Architectures for Reciprocal and Square Root Reciprocal. ([n. d.]). http://arith.cs.ucla.edu/publications/Recip-Asil05.pdf
[17] A. Farmahini-Farahani, Jung Ho Ahn, K. Morrow, and Nam Sung Kim. 2015. NDA: Near-DRAM acceleration architecture leveraging commodity DRAM devices and standard memory modules. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on.
[18] Basilio B. Fraguela, Jose Renau, Paul Feautrier, David Padua, and Josep Torrellas. 2003. Programming the FlexRAM Parallel Intelligent Memory System. SIGPLAN Not. 38, 10 (June 2003), 49–60. https://doi.org/10.1145/966049.781505
[19] John Harrison, Ted Kubaska, Shane Story, et al. 1999. The computation of transcendental functions on the IA-64 architecture. In Intel Technology Journal. Citeseer.
[20] Miao Hu, R. Stanley Williams, John Paul Strachan, Zhiyong Li, Emmanuelle M. Grafals, Noraica Davila, Catherine Graves, Sity Lam, Ning Ge, and Jianhua Joshua Yang. 2016. Dot-product engine for neuromorphic computing. In Proceedings of the 53rd Annual Design Automation Conference (DAC '16). ACM Press, New York, NY, USA, 1–6. https://doi.org/10.1145/2897937.2898010
[21] Supreet Jeloka, Naveen Akesh, Dennis Sylvester, and David Blaauw. 2015. A Configurable TCAM/BCAM/SRAM using 28nm push-rule 6T bit cell. In IEEE Symposium on VLSI Circuits.
[22] Nan Jiang, James Balfour, Daniel U. Becker, Brian Towles, William J. Dally, George Michelogiannakis, and John Kim. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on. IEEE, 86–96.
[23] M. Kang, E. P. Kim, M. S. Keel, and N. R. Shanbhag. 2015. Energy-efficient and high throughput sparse distributed memory architecture. In 2015 IEEE International Symposium on Circuits and Systems (ISCAS).
[24] Duckhwan Kim, Jaeha Kung, Sek Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay. 2016. Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. In Proceedings of ISCA, Vol. 43.
[25] Kuk-Hwan Kim, Siddharth Gaba, Dana Wheeler, Jose M. Cruz-Albrecht, Tahir Hussain, Narayan Srinivasa, and Wei Lu. 2011. A Functional Hybrid Memristor Crossbar-Array/CMOS System for Data Storage and Neuromorphic Applications. Nano Letters 12, 1 (2011), 389–395. https://doi.org/10.1021/nl203687n
[26] J. B. Kotra, M. Arjomand, D. Guttman, M. T. Kandemir, and C. R. Das. 2016. Re-NUCA: A Practical NUCA Architecture for ReRAM Based Last-Level Caches. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 576–585. https://doi.org/10.1109/IPDPS.2016.79
[27] L. Kull, T. Toifl, M. Schmatz, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, T. M. Andersen, and Y. Leblebici. 2013. A 3.1mW 8b 1.2GS/s single-channel asynchronous SAR ADC with alternate comparators for enhanced speed in 32nm digital SOI CMOS. In 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers. IEEE, 468–469. https://doi.org/10.1109/ISSCC.2013.6487818
[28] Shuangchen Li, Cong Xu, Qiaosha Zou, Jishen Zhao, Yu Lu, and Yuan Xie. 2016. Pinatubo: A processing-in-memory architecture for bulk bitwise operations in emerging non-volatile memories. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE. IEEE, 1–6.
[29] P. Geoffrey Lowney, Stefan M. Freudenberger, Thomas J. Karzes, W. D. Lichtenstein, Robert P. Nix, John S. O'Donnell, and John C. Ruttenberg. 1993. The multiflow trace scheduling compiler. The Journal of Supercomputing 7, 1 (May 1993), 51–142. https://doi.org/10.1007/BF01205182
[30] Martha Mercaldi, Steven Swanson, Andrew Petersen, Andrew Putnam, Andrew Schwerin, Mark Oskin, and Susan J. Eggers. 2006. Instruction scheduling for a tiled dataflow architecture. In ACM SIGOPS Operating Systems Review, Vol. 40. ACM, 141–150.
[31] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. 1998. Active Pages: A Computation Model for Intelligent Memory. ACM SIGARCH Computer Architecture News 26, 3 (1998), 192–203. https://doi.org/10.1145/279358.279387
[32] P. E. Gaillardon, L. Amaru, A. Siemon, E. Linn, R. Waser, A. Chattopadhyay, and G. De Micheli. 2016. The Programmable Logic-in-Memory (PLiM) computer. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE). 427–432.
[33] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A case for intelligent RAM. IEEE Micro (1997).
[34] Mirko Prezioso, Farnood Merrikh-Bayat, B. D. Hoskins, G. C. Adam, Konstantin K. Likharev, and Dmitri B. Strukov. 2015. Training and operation of an integrated neuromorphic network based on metal-oxide memristors. Nature 521, 7550 (2015), 61–64.
[35] S. H. Pugsley, J. Jestes, Huihui Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and Feifei Li. 2014. NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on.
[36] Jury Sandrini, Marios Barlas, Maxime Thammasack, Tugba Demirci, Michele De Marchi, Davide Sacchetto, Pierre-Emmanuel Gaillardon, Giovanni De Micheli, and Yusuf Leblebici. 2016. Co-Design of ReRAM Passive Crossbar Arrays Integrated in 180 nm CMOS Technology. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6, 3 (2016), 339–351. https://doi.org/10.1109/JETCAS.2016.2547746
[37] V. Seshadri, K. Hsieh, A. Boroumand, Donghyuk Lee, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry. 2015. Fast Bulk Bitwise AND and OR in DRAM. IEEE Computer Architecture Letters (2015).
[38] Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, Rachata Ausavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. [n. d.]. RowClone: Fast and Energy-efficient in-DRAM Bulk Data Copy and Initialization. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46).
[39] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, and Rajeev Balasubramonian. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26. https://doi.org/10.1109/ISCA.2016.12
[40] Mathias Soeken, Saeideh Shirinzadeh, Luca Gaetano Amarù, Rolf Drechsler, and Giovanni De Micheli. 2016. An MIG-based Compiler for Programmable Logic-in-Memory Architectures. In Proceedings of the 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC). https://doi.org/10.1145/2897937.2897985
[41] L. Song, X. Qian, H. Li, and Y. Chen. 2017. PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). 541–552. https://doi.org/10.1109/HPCA.2017.55
[42] Dmitri B. Strukov, Gregory S. Snider, Duncan R. Stewart, and R. Stanley Williams. 2008. The missing memristor found. Nature 453, 7191 (2008), 80.
[43] Pascal O. Vontobel, Warren Robinett, Philip J. Kuekes, Duncan R. Stewart, Joseph Straznicky, and R. Stanley Williams. 2009. Writing to and reading from a nano-scale crossbar memory based on memristors. Nanotechnology 20, 42 (2009), 425204.
[44] Z. Wei, Y. Kanzawa, K. Arita, Y. Katoh, K. Kawai, S. Muraoka, S. Mitani, S. Fujii, K. Katayama, M. Iijima, T. Mikawa, T. Ninomiya, R. Miyanaga, Y. Kawashima, K. Tsuji, A. Himeno, T. Okada, R. Azuma, K. Shimakawa, H. Sugaya, T. Takagi, R. Yasuhara, K. Horiba, H. Kumigashira, and M. Oshima. 2008. Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism. In 2008 IEEE International Electron Devices Meeting. IEEE, 1–4. https://doi.org/10.1109/IEDM.2008.4796676
[45] Chris Yakopcic and Tarek M. Taha. 2013. Energy efficient perceptron pattern recognition using segmented memristor crossbar arrays. In Neural Networks (IJCNN), The 2013 International Joint Conference on. IEEE, 1–8.
[46] Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. 2014. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing (HPDC '14).
[47] Qiuling Zhu, B. Akin, H. E. Sumbul, F. Sadi, J. C. Hoe, L. Pileggi, and F. Franchetti. 2013. A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing. In 3D Systems Integration Conference (3DIC), 2013 IEEE International.