Cambricon: An Instruction Set Architecture For Neural Networks
Shaoli Liu∗§ , Zidong Du∗§ , Jinhua Tao∗§ , Dong Han∗§ , Tao Luo∗§ , Yuan Xie† , Yunji Chen∗‡ and Tianshi Chen∗‡§
∗ State Key Laboratory of Computer Architecture, ICT, CAS, Beijing, China
Email: {liushaoli, duzidong, taojinhua, handong2014, luotao, cyj, chentianshi}@ict.ac.cn
† Department of Electrical and Computer Engineering, UCSB, Santa Barbara, CA, USA
Email: [email protected]
‡ CAS Center for Excellence in Brain Science and Intelligence Technology
§ Cambricon Ltd.
Yunji Chen ([email protected]) is the corresponding author of this paper.

Abstract—Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPU and GPGPU), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) directly corresponding to high-level functional blocks of an NN (such as layers), or even an NN as a whole. Although straightforward and easy to implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency.

In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques demonstrates that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.

I. INTRODUCTION

Artificial Neural Networks (NNs for short) are a large family of machine learning techniques initially inspired by neuroscience, and have been evolving towards deeper and larger structures over the last decade. Though computationally expensive, NN techniques as exemplified by deep learning [22], [25], [26], [27] have become the state-of-the-art across a broad range of applications (such as pattern recognition [8] and web search [17]); some have even achieved human-level performance on specific tasks such as ImageNet recognition [23] and Atari 2600 video games [33].

Traditionally, NN techniques are executed on general-purpose platforms composed of CPUs and GPGPUs, which are usually not energy-efficient because both types of processors invest excessive hardware resources to flexibly support various workloads [7], [10], [45]. Hardware accelerators customized to NNs have been recently investigated as energy-efficient alternatives [3], [5], [11], [29], [32]. These accelerators often adopt high-level and informative instructions (control signals) that directly specify high-level functional blocks (e.g., layer type: convolutional/pooling/classifier) or even an NN as a whole, instead of low-level computational operations (e.g., dot product), and their decoders can be fully optimized to each instruction.

Although straightforward and easy to implement for a small set of similar NN techniques (thus a small instruction set), the design/verification complexity and the area/power overhead of the instruction decoder for such accelerators easily become unacceptably large when the need to flexibly support a variety of different NN techniques results in a significant expansion of the instruction set. Consequently, the design of such accelerators can only efficiently support a small subset of NN techniques sharing very similar computational patterns and data locality, but is incapable of handling the significant diversity among existing NN techniques. For example, the state-of-the-art NN accelerator DaDianNao [5] can efficiently support Multi-Layer Perceptrons (MLPs) [50], but cannot accommodate Boltzmann Machines (BMs) [39], whose neurons are fully connected to each other. As a result, ISA design remains a fundamental yet unresolved challenge that greatly limits both the flexibility and the efficiency of existing NN accelerators.

In this paper, we study the design of the ISA for NN accelerators, inspired by the success of RISC ISA design principles [37]: (a) First, decomposing complex and informative instructions describing high-level functional blocks of NNs (e.g., layers) into shorter instructions corresponding to low-level computational operations (e.g., dot product) allows an accelerator to have a broader application scope,
Table I. An overview of Cambricon instructions.

Instruction Type | Examples | Operands
Control | jump, conditional branch | register (scalar value), immediate
Data Transfer (Matrix) | matrix load/store/move | register (matrix address/size, scalar value), immediate
Data Transfer (Vector) | vector load/store/move | register (vector address/size, scalar value), immediate
Data Transfer (Scalar) | scalar load/store/move | register (scalar value), immediate
Computational (Matrix) | matrix multiply vector, vector multiply matrix, matrix multiply scalar, outer product, matrix add matrix, matrix subtract matrix | register (matrix/vector address/size, scalar value)
Computational (Vector) | vector elementary arithmetics (add, subtract, multiply, divide), vector transcendental functions (exponential, logarithmic), dot product, random vector generator, maximum/minimum of a vector | register (vector address/size, scalar value)
Computational (Scalar) | scalar elementary arithmetics, scalar transcendental functions | register (scalar value), immediate
Logical (Vector) | vector compare (greater than, equal), vector logical operations (and, or, inverter), vector greater than merge | register (vector address/size, scalar)
Logical (Scalar) | scalar compare, scalar logical operations | register (scalar), immediate
with load/store instructions. Cambricon contains 64 32-bit General-Purpose Registers (GPRs) for scalars, which can be used in register-indirect addressing of the on-chip scratchpad memory, as well as for temporarily keeping scalar data.

Types of Instructions. Cambricon contains four types of instructions: computational, logical, control, and data transfer instructions. Although different instructions may differ in their numbers of valid bits, the instruction length is fixed to be 64 bits for memory alignment and for the design simplicity of the load/store/decoding logic. In this section we only offer a brief introduction to the control and data transfer instructions, because they are similar to their corresponding MIPS instructions, though they have been adapted to fit NN techniques. For computational instructions (including matrix, vector and scalar instructions) and logical instructions, the details will be provided in the next section (Section III).

Control Instructions. Cambricon has two control instructions, jump and conditional branch, as illustrated in Fig. 1. The jump instruction specifies the offset via either an immediate or a GPR value, which will be accumulated to the Program Counter (PC). The conditional branch instruction specifies the predictor (stored in a GPR) in addition to the offset, and the branch target (either PC + {offset} or PC + 1) is determined by a comparison between the predictor and zero.

JUMP: opcode (8 bits) | Reg0/Immed (6/32 bits): Offset | remaining 50/24 bits
CB:   opcode (8 bits) | Reg0 (6 bits): Condition | Reg1/Immed (6/32 bits): Offset | remaining 38/12 bits
Figure 1. Top: Jump instruction. Bottom: Conditional Branch (CB) instruction.

Data Transfer Instructions. Data transfer instructions in Cambricon support variable data sizes in order to flexibly support matrix and vector computational/logical instructions (see Section III for such instructions). Specifically, these instructions can load/store variable-size data blocks (specified by the data-width operand in data transfer instructions) from/to the main memory to/from the on-chip scratchpad memory, or move data between the on-chip scratchpad memory and scalar GPRs. Fig. 2 illustrates the Vector LOAD (VLOAD) instruction, which can load a vector with the size of V_size from the main memory to the vector scratchpad memory, where the source address in main memory is the sum of the base address saved in a GPR and an immediate number. The formats of Vector STORE (VSTORE), Matrix LOAD (MLOAD), and Matrix STORE (MSTORE) instructions are similar to that of VLOAD.

VLOAD: opcode (8 bits) | Reg0 (6): Dest_addr | Reg1 (6): V_size | Reg2 (6): Src_base | Immed (32): Src_offset | remaining 6 bits
Figure 2. Vector Load (VLOAD) instruction.

On-chip Scratchpad Memory. Cambricon does not use any vector register file, but directly keeps data in on-chip scratchpad memory, which is made visible to programmers/compilers. In other words, the role of the on-chip scratchpad memory in Cambricon is similar to that of the vector register file in traditional ISAs, and the sizes of vector operands are no longer limited by fixed-width vector register files. Therefore, vector/matrix sizes are variable in Cambricon instructions, and the only notable restriction is that the vector/matrix operands in the same instruction cannot exceed the capacity of the scratchpad memory. In case they do exceed it, the compiler will decompose long vectors/matrices into short pieces/blocks and generate multiple instructions to process them.
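To make this restriction concrete, the following C sketch shows one way a compiler could tile a long vector addition into pieces that fit the 64KB vector scratchpad; the chunk bound, the operand placement, and the printed mnemonics are illustrative assumptions of this sketch rather than part of the Cambricon specification.

#include <stdio.h>

/* Illustrative only: tile a long 16-bit vector addition so that the two
 * input chunks and the output chunk together fit in the 64KB vector
 * scratchpad. The chunk bound and the textual "instructions" are
 * assumptions made for this sketch. */
enum { SCRATCH_BYTES = 64 * 1024, ELEM_BYTES = 2,
       CHUNK = SCRATCH_BYTES / ELEM_BYTES / 3 };    /* ~10922 elements */

int main(void)
{
    long n = 100000;                                 /* total vector length */
    long a = 0x100000, b = 0x200000, y = 0x300000;   /* main-memory bases   */
    long in0 = 0, in1 = CHUNK, out = 2 * CHUNK;      /* scratchpad offsets  */

    for (long i = 0; i < n; i += CHUNK) {
        long len = (n - i < CHUNK) ? (n - i) : CHUNK;
        printf("VLOAD  %ld, %ld, #%ld\n", in0, len, a + i * ELEM_BYTES);
        printf("VLOAD  %ld, %ld, #%ld\n", in1, len, b + i * ELEM_BYTES);
        printf("VAV    %ld, %ld, %ld, %ld\n", out, len, in0, in1);
        printf("VSTORE %ld, %ld, #%ld\n", out, len, y + i * ELEM_BYTES);
    }
    return 0;
}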
Just like the 32x512b vector registers that have been baked into Intel AVX-512 [18], the capacities of the on-chip memories for both vector and matrix instructions must be fixed in Cambricon. More specifically, Cambricon fixes the memory capacity to be 64KB for vector instructions and 768KB for matrix instructions. Yet, Cambricon does not impose specific restrictions on the bank numbers of the scratchpad memory, leaving significant freedom to microarchitecture-level implementations.

III. COMPUTATIONAL/LOGICAL INSTRUCTIONS

In neural networks, most arithmetic operations (e.g., additions, multiplications and activation functions) can be aggregated as vector operations [10], [45], and the ratio can be as high as 99.992% according to our quantitative observations on a state-of-the-art Convolutional Neural Network (GoogLeNet) [43], the winner of the 2014 ImageNet competition (ILSVRC14). In the meantime, we also discover that 99.791% of the vector operations (such as the dot product operation) in GoogLeNet can be aggregated further as matrix operations (such as vector-matrix multiplication). In a nutshell, NNs can be naturally decomposed into scalar, vector, and matrix operations, and the ISA design must effectively take advantage of the potential data-level parallelism and data locality.

A. Matrix Instructions

[Figure 3: a layer with input neurons x1, x2, x3 and a bias input +1, connected through weights w_ij and biases b1, b2, b3 to output neurons y1, y2, y3.]
Figure 3. Typical operations in NNs.

We conduct a thorough and comprehensive review of existing NN techniques, and design a total of six matrix instructions for Cambricon. Here we take a Multi-Layer Perceptron (MLP) [50], a well-known and representative NN, as an example, and show how it is supported by the matrix instructions. Technically, an MLP usually has multiple layers, each of which computes the values of some neurons (i.e., output neurons) according to some neurons whose values are known (i.e., input neurons). We illustrate the feedforward run of one such layer in Fig. 3. More specifically, the output neuron y_i (i = 1, 2, 3) in Fig. 3 can be computed as y_i = f(Σ_{j=1}^{3} w_ij x_j + b_i), where x_j is the j-th input neuron, w_ij is the weight between the i-th output neuron and the j-th input neuron, b_i is the bias of the i-th output neuron, and f is the activation function. The output neurons can be computed as a vector y = (y1, y2, y3):

y = f(Wx + b),    (1)

where x = (x1, x2, x3) and b = (b1, b2, b3) are the vectors of input neurons and biases, respectively, W = (w_ij) is the weight matrix, and f is the element-wise version of the activation function f (see Section III-B).

A critical step in Eq. 1 is to compute Wx, which is performed by the Matrix-Mult-Vector (MMV) instruction in Cambricon. We illustrate this instruction in Fig. 4, where Reg0 specifies the base scratchpad memory address of the vector output (Vout_addr); Reg1 specifies the size of the vector output (Vout_size); Reg2, Reg3, and Reg4 specify the base address of the matrix input (Min_addr), the base address of the vector input (Vin_addr), and the size of the vector input (Vin_size; note that it is variable), respectively. The MMV instruction can support matrix-vector multiplication at arbitrary scales, as long as all the input and output data can be kept simultaneously in the scratchpad memory. We choose to compute Wx with the dedicated MMV instruction instead of decomposing it into multiple vector dot products, because the latter approach requires additional efforts (e.g., explicit synchronization, concurrent read/write requests to the same address) to reuse the input vector x among different row vectors of the matrix, which is less efficient.

MMV: opcode (8 bits) | Reg0 (6): Vout_addr | Reg1 (6): Vout_size | Reg2 (6): Min_addr | Reg3 (6): Vin_addr | Reg4 (6): Vin_size | remaining 26 bits
Figure 4. Matrix Mult Vector (MMV) instruction.

Unlike the feedforward case, however, the MMV instruction no longer provides efficient support to the backward training process of an NN. More specifically, a critical step of the well-known Back-Propagation (BP) algorithm is to compute the gradient vector [20], which can be formulated as a vector multiplied by a matrix. If we implemented it with the MMV instruction, we would need an additional instruction implementing matrix transpose, which is rather expensive in data movements. To avoid that, Cambricon provides a Vector-Mult-Matrix (VMM) instruction that is directly applicable to the backward training process. The VMM instruction has the same fields as the MMV instruction, except the opcode.

Moreover, in training an NN, the weight matrix W often needs to be incrementally updated with W = W + ηΔW, where η is the learning rate and ΔW is estimated as the outer product of two vectors. Cambricon provides an Outer-Product (OP) instruction (the output is a matrix), a Matrix-Mult-Scalar (MMS) instruction, and a Matrix-Add-Matrix (MAM) instruction to collaboratively perform the weight updating. In addition, Cambricon also provides a Matrix-Subtract-Matrix (MSM) instruction to support the weight updating in Restricted Boltzmann Machines (RBMs) [39].
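As a plain-C reference for what these matrix instructions compute, the sketch below spells out the MMV product y = Wx and the combined OP/MMS/MAM weight update W = W + η ΔW, with ΔW taken as the outer product of two vectors; the float data type and row-major layout are assumptions made for readability, whereas the prototype operates on 16-bit fixed-point data.

#include <stdio.h>
#include <stddef.h>

/* Reference semantics only; float and row-major layout are assumptions.
 * W is m x n, x has n elements, y and a have m elements, b has n elements. */

/* MMV: y = W x */
static void mmv(float *y, const float *W, const float *x, size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; ++j)
            acc += W[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* OP + MMS + MAM combined: W = W + eta * (a outer b) */
static void weight_update(float *W, const float *a, const float *b,
                          float eta, size_t m, size_t n)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            W[i * n + j] += eta * a[i] * b[j];
}

int main(void)
{
    float W[2 * 3] = { 1, 2, 3,
                       4, 5, 6 };
    float x[3] = { 1, 0, -1 }, y[2];
    mmv(y, W, x, 2, 3);                      /* y = (-2, -2)                */
    weight_update(W, y, x, 0.1f, 2, 3);      /* W += 0.1 * (y outer x)      */
    printf("y = (%g, %g), W[0][0] = %g\n", y[0], y[1], W[0]);
    return 0;
}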
B. Vector Instructions

Using Eq. 1 as an example, one can observe that the matrix instructions defined in the prior subsection are still insufficient to perform all the computations. We still need to add up the vector output of Wx and the bias vector b, and then perform an element-wise activation on Wx + b.

While Cambricon directly provides a Vector-Add-Vector (VAV) instruction for vector additions, it requires multiple instructions to support the element-wise activation. Without losing any generality, here we take the widely used sigmoid activation, f(a) = e^a/(1 + e^a), as an example. The element-wise sigmoid activation performed on each element of an input vector (say, a) can be decomposed into 3 consecutive steps, which are supported by 3 instructions, respectively:

1. Computing the exponential e^{a_i} for each element a_i (i = 1, ..., n) of the input vector. [...]

[...] Random vectors obeying other distributions (e.g., the Gaussian distribution) can further be generated using the Ziggurat algorithm [31], with the help of the vector arithmetic instructions and vector compare instructions in Cambricon.

C. Logical Instructions

The state-of-the-art NN techniques leverage a few operations that incorporate comparisons or other logical manipulations. The max-pooling operation is one such operation (see Fig. 5a for an illustration), which seeks the neuron having the largest output among the neurons within a pooling window, and repeats this action for the corresponding pooling windows in different input feature maps (see Fig. 5b). Cambricon supports the max-pooling operation with the Vector-Greater-Than-Merge (VGTM) instruction illustrated in Fig. 6.

[Figure 5: max-pooling over a pooling window (x-axis, y-axis), producing an output feature map and repeated across multiple output feature maps.]
Figure 6. Vector Greater Than Merge (VGTM) instruction.

D. Scalar Instructions

Although we have observed that only 0.008% of the arithmetic operations of GoogLeNet [43] cannot be supported with matrix and vector instructions in Cambricon, there are also scalar operations that are indispensable to NNs, such as elementary arithmetic operations and scalar transcendental functions. We summarize them in Table I; they have been formally defined as Cambricon's scalar instructions.
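For concreteness, a scalar reference of the max-pooling comparison that VGTM (Section III-C) vectorizes over all feature maps might look as follows; the input[x][y][m] layout with the feature-map index innermost mirrors the pooling fragment of Fig. 7 and is an assumption of this sketch.

#include <stdio.h>
#include <stddef.h>

/* Scalar reference of one pooling window: for every feature map m, keep the
 * element-wise maximum over all window positions. VGTM performs the inner
 * comparison for all feature maps in a single instruction. */
static void max_pool_window(float *output, const float *input,
                            size_t win_x, size_t win_y, size_t maps)
{
    for (size_t m = 0; m < maps; ++m)
        output[m] = input[m];                     /* first window position  */

    for (size_t x = 0; x < win_x; ++x)
        for (size_t y = 0; y < win_y; ++y)
            for (size_t m = 0; m < maps; ++m) {
                float v = input[(x * win_y + y) * maps + m];
                if (v > output[m])                /* the VGTM comparison    */
                    output[m] = v;
            }
}

int main(void)
{
    /* 2x2 window over 2 feature maps, stored as input[x][y][m]. */
    float in[2 * 2 * 2] = { 1, 8,  3, 2,
                            7, 4,  5, 6 };
    float out[2];
    max_pool_window(out, in, 2, 2, 2);
    printf("max per feature map: %g %g\n", out[0], out[1]);  /* 7 8 */
    return 0;
}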
E. Code Examples

To illustrate the usage of our proposed instruction set, we implement three simple yet representative components of NNs, an MLP feedforward layer [50], a pooling layer [22], and a Boltzmann Machine (BM) layer [39], using Cambricon instructions. For the sake of brevity, we omit the scalar load/store instructions for all three layers, and only show the program fragment of a single pooling window (with multiple input and output feature maps) for the pooling layer. We illustrate the concrete Cambricon program fragments in Fig. 7, and we observe that the code density of Cambricon is significantly higher than that of x86 and MIPS (see Section V for a comprehensive evaluation).

MLP code:
// $0: input size, $1: output size, $2: matrix size
// $3: input address, $4: weight address
// $5: bias address, $6: output address
// $7-$10: temp variable address
VLOAD $3, $0, #100        // load input vector from address (100)
MLOAD $4, $2, #300        // load weight matrix from address (300)
MMV $7, $1, $4, $3, $0    // Wx
VAV $8, $1, $7, $5        // tmp=Wx+b
VEXP $9, $1, $8           // exp(tmp)
VAS $10, $1, $9, #1       // 1+exp(tmp)
VDV $6, $1, $9, $10       // y=exp(tmp)/(1+exp(tmp))
VSTORE $6, $1, #200       // store output vector to address (200)

Pooling code:
// $0: feature map size, $1: input data size,
// $2: output data size, $3: pooling window size - 1
// $4: x-axis loop num, $5: y-axis loop num
// $6: input addr, $7: output addr
// $8: y-axis stride of input
VLOAD $6, $1, #100        // load input neurons from address (100)
SMOVE $5, $3              // init y
L0: SMOVE $4, $3          // init x
L1: VGTM $7, $0, $6, $7   // ∀ feature map m, output[m]=(input[x][y][m]>output[m])?
                          //   input[x][y][m]:output[m]
SADD $6, $6, $0           // update input address
SADD $4, $4, #-1          // x--
CB #L1, $4                // if(x>0) goto L1
SADD $6, $6, $8           // update input address
SADD $5, $5, #-1          // y--
CB #L0, $5                // if(y>0) goto L0
VSTORE $7, $2, #200       // store output neurons to address (200)

BM code:
// $0: visible vector size, $1: hidden vector size, $2: v-h matrix (W) size
// $3: h-h matrix (L) size, $4: visible vector address, $5: W address
// $6: L address, $7: bias address, $8: hidden vector address
// $9-$17: temp variable address
VLOAD $4, $0, #100        // load visible vector from address (100)
VLOAD $9, $1, #200        // load hidden vector from address (200)
MLOAD $5, $2, #300        // load W matrix from address (300)
MLOAD $6, $3, #400        // load L matrix from address (400)
MMV $10, $1, $5, $4, $0   // Wv
MMV $11, $1, $6, $9, $1   // Lh
VAV $12, $1, $10, $11     // Wv+Lh
VAV $13, $1, $12, $7      // tmp=Wv+Lh+b
VEXP $14, $1, $13         // exp(tmp)
VAS $15, $1, $14, #1      // 1+exp(tmp)
VDV $16, $1, $14, $15     // y=exp(tmp)/(1+exp(tmp))
RV $17, $1                // ∀ i, r[i] = random(0,1)
VGT $8, $1, $17, $16      // ∀ i, h[i] = (r[i]>y[i])?1:0
VSTORE $8, $1, #500       // store hidden vector to address (500)

Figure 7. Cambricon program fragments of MLP, pooling and BM.

IV. A PROTOTYPE ACCELERATOR

[Figure 8: block diagram with Fetch, Decode, Issue Queue, Scalar Func. Unit, L1 Cache, Memory Queue, Reorder Buffer, AGU, Vector Func. Unit (Vector DMAs) with its vector scratchpad memory, Matrix Func. Unit (Matrix DMAs) with its matrix scratchpad memory, IO DMA and IO Interface.]
Figure 8. A prototype accelerator based on Cambricon.

In this section, we present a prototype accelerator of Cambricon. We illustrate the design in Fig. 8, which contains seven major instruction pipeline stages: fetching, decoding, issuing, register reading, execution, writing back, and committing. We use mature techniques such as scratchpad memory and DMA in this accelerator, since we found that these classic techniques are sufficient to reflect the flexibility (Section V-B1), conciseness (Section V-B2) and efficiency (Section V-B3) of the ISA. We did not seek to explore emerging techniques (such as 3D stacking [51] and non-volatile memory [47], [46]) in our prototype design, but left such exploration as future work, because we believe that a promising ISA must be easy to implement and should not be tightly coupled with emerging techniques.

As illustrated in Fig. 8, after the fetching and decoding stages, an instruction is injected into an in-order issue queue. After successfully fetching the operands (scalar data, or address/size of vector/matrix data) from the scalar register file, an instruction will be sent to different units depending on the instruction type. Control instructions and scalar computational/logical instructions will be sent to the scalar functional unit for direct execution. After writing back to the scalar register file, such an instruction can be committed from the reorder buffer¹ as long as it has become the oldest uncommitted yet executed instruction.

Data transfer instructions, vector/matrix computational instructions, and vector logical instructions, which may access the L1 cache or the scratchpad memories, will be sent to the Address Generation Unit (AGU). Such an instruction needs to wait in an in-order memory queue to resolve potential memory dependencies² with earlier instructions in the memory queue. After that, load/store requests of scalar data transfer instructions will be sent to the L1 cache, data transfer/computational/logical instructions for vectors will be sent to the vector functional unit, and data transfer/computational instructions for matrices will be sent to the matrix functional unit. After the execution, such an instruction can be retired from the memory queue, and then be committed from the reorder buffer as long as it has become the oldest uncommitted yet executed instruction.

¹ We need a reorder buffer even though instructions are issued in order, because the execution stages of different instructions may take significantly different numbers of cycles.
² Here we say two instructions are memory dependent if they access an overlapping memory region, and at least one of them needs to write the memory region.
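The dispatch rules just described can be summarized in a small routing sketch; the enum names and the function are illustrative restatements of the text, not part of the actual design.

#include <stdio.h>

/* Sketch of the issue-stage steering in the prototype pipeline: control and
 * scalar computational/logical instructions execute on the scalar functional
 * unit, while everything that may touch the L1 cache or the scratchpads goes
 * through the AGU and the in-order memory queue first. */
typedef enum {
    INST_CONTROL, INST_SCALAR_COMPUTE, INST_SCALAR_LOGICAL,
    INST_SCALAR_TRANSFER, INST_VECTOR_TRANSFER, INST_VECTOR_COMPUTE,
    INST_VECTOR_LOGICAL, INST_MATRIX_TRANSFER, INST_MATRIX_COMPUTE
} inst_class;

static const char *route(inst_class c)
{
    switch (c) {
    case INST_CONTROL:
    case INST_SCALAR_COMPUTE:
    case INST_SCALAR_LOGICAL:
        return "scalar functional unit (direct execution)";
    case INST_SCALAR_TRANSFER:
        return "AGU -> memory queue -> L1 cache";
    case INST_VECTOR_TRANSFER:
    case INST_VECTOR_COMPUTE:
    case INST_VECTOR_LOGICAL:
        return "AGU -> memory queue -> vector functional unit";
    case INST_MATRIX_TRANSFER:
    case INST_MATRIX_COMPUTE:
        return "AGU -> memory queue -> matrix functional unit";
    }
    return "unknown";
}

int main(void)
{
    printf("MMV  -> %s\n", route(INST_MATRIX_COMPUTE));
    printf("VAV  -> %s\n", route(INST_VECTOR_COMPUTE));
    printf("SADD -> %s\n", route(INST_SCALAR_COMPUTE));
    return 0;
}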
The accelerator implements both vector and matrix functional units. The vector unit contains 32 16-bit adders and 32 16-bit multipliers, and is equipped with a 64KB scratchpad memory. The matrix unit contains 1024 multipliers and 1024 adders, which have been divided into 32 separate computational blocks to avoid excessive wire congestion and power consumption on long-distance data movements. Each computational block is equipped with a separate 24KB scratchpad. The 32 computational blocks are connected through an h-tree bus that serves to broadcast input values to each block and to collect output values from each block.

A notable Cambricon feature is that it does not use any vector register file, but keeps data in on-chip scratchpad memories. To efficiently access the scratchpad memories, the vector/matrix functional unit of the prototype accelerator integrates three DMAs, each of which corresponds to one vector/matrix input/output of an instruction. In addition, the scratchpad memory is equipped with an IO DMA. However, each scratchpad memory itself only provides a single port for each bank, but may need to serve up to four concurrent read/write requests. We design a specific structure for the scratchpad memory to tackle this issue (see Fig. 9). Concretely, we decompose the memory into four banks according to the addresses' low-order two bits, and connect them with four read/write ports via a crossbar guaranteeing that no bank will be simultaneously accessed. Thanks to this dedicated hardware support, Cambricon does not need an expensive multi-port vector register file, and can flexibly and efficiently support different data widths using the on-chip scratchpad memory.

[Figure 9: three Matrix DMAs and the IO DMA connect through four ports and a crossbar to banks Bank-00, Bank-01, Bank-10, and Bank-11.]
Figure 9. Structure of the matrix scratchpad memory.
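A minimal address-decoding sketch of the four-bank organization in Fig. 9 is given below; the request array and the conflict check are assumptions used only to illustrate why selecting banks by the low-order two address bits lets four single-ported banks serve up to four concurrent accesses.

#include <stdbool.h>
#include <stdio.h>

/* Each bank has a single port; the bank is selected by the low-order two
 * bits of the address, and a crossbar routes up to four requests (three
 * matrix DMAs plus the IO DMA) to distinct banks in the same cycle. */
enum { NUM_BANKS = 4 };

static unsigned bank_of(unsigned addr) { return addr & (NUM_BANKS - 1); }

/* True if all n requests map to distinct banks, i.e. they can be served
 * concurrently without violating the single-port constraint. */
static bool conflict_free(const unsigned addr[], int n)
{
    bool busy[NUM_BANKS] = { false };
    for (int i = 0; i < n; ++i) {
        unsigned b = bank_of(addr[i]);
        if (busy[b])
            return false;
        busy[b] = true;
    }
    return true;
}

int main(void)
{
    unsigned reqs[4] = { 0x100, 0x101, 0x102, 0x103 };   /* banks 0,1,2,3 */
    printf("conflict-free: %d\n", conflict_free(reqs, 4));
    return 0;
}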
V. EXPERIMENTAL EVALUATION

In this section, we first describe the evaluation methodology, and then present the experimental results.

A. Methodology

Design evaluation. We synthesize the prototype accelerator of Cambricon (Cambricon-ACC, see Section IV) with Synopsys Design Compiler using a TSMC 65nm GP standard VT library, place and route the synthesized design with the Synopsys ICC compiler, simulate and verify it with Synopsys VCS, and estimate the power consumption with Synopsys PrimeTime PX according to the simulated Value Change Dump (VCD) file. We are planning an MPW tape-out of the prototype accelerator, with a small area budget of 60 mm² at a 65nm process and a targeted operating frequency of 1 GHz. Therefore, we adopt moderate functional unit sizes and scratchpad memory capacities in order to fit the area budget. Table II shows the details of the design parameters.

Table II. Parameters of our prototype accelerator.
  issue width: 2
  depth of issue queue: 24
  depth of memory queue: 32
  depth of reorder buffer: 64
  capacity of vector scratchpad memory: 64KB
  capacity of matrix scratchpad memory: 768KB (24KB x 32)
  bank width of scratchpad memory: 512 bits (32 x 16-bit fixed point)
  operators in matrix function unit: 1024 (32x32) multipliers & adders
  operators in vector function unit: 32 multipliers & dividers & adders & transcendental function operators

Baselines. We compare Cambricon-ACC with three baselines. The first two are based on a general-purpose CPU and GPU, and the last one is a state-of-the-art NN hardware accelerator:
• CPU. The CPU baseline is an x86-CPU with 256-bit SIMD support (Intel Xeon E5-2620, 2.10GHz, 64 GB memory). We use the Intel MKL library [19] to implement vector and matrix primitives for the CPU baseline, and GCC v4.7.2 to compile all benchmarks with options "-O2 -lm -march=native" to enable SIMD instructions.
• GPU. The GPU baseline is a modern GPU card (NVIDIA K40M, 12GB GDDR5, 4.29 TFlops peak at a 28nm process); we implement all benchmarks (see below) with the NVIDIA cuBLAS library [35], a state-of-the-art linear algebra library for GPU.
• NN Accelerator. The baseline accelerator is DaDianNao, a state-of-the-art NN accelerator exhibiting remarkable energy-efficiency improvement over a GPU [5]. We re-implement the DaDianNao architecture at a 65nm process, but replace all eDRAMs with SRAMs because we do not have a 65nm eDRAM library. In addition, we re-size DaDianNao such that it has a comparable amount of arithmetic operators and on-chip SRAM capacity as our design, which enables a fair comparison of the two accelerators under our area budget (<60 mm²) mentioned in the previous paragraph. The re-implemented version of DaDianNao has a single central tile and a total of 32 leaf tiles. The central tile has 64KB SRAM, 32 16-bit adders and 32 16-bit multipliers; each leaf tile has 24KB SRAM, 32 16-bit adders and 32 16-bit multipliers. In other words, the total numbers of adders and multipliers, as well as the total SRAM capacity of the re-implemented DaDianNao, are the same as in our prototype accelerator. Although we are constrained to give up eDRAMs in both accelerators, this is still a fair and reasonable experimental setting, because the flexibility of an accelerator is mainly determined by its ISA, not the concrete devices it integrates. In this sense, the flexibility gained from Cambricon will still be there even when we resort to large eDRAMs to remove main memory accesses and improve the performance of both accelerators.

Benchmarks. We take 10 representative NN techniques as our benchmarks; see Table III. Each benchmark is translated manually into assembly to execute on Cambricon-ACC and DaDianNao. We evaluate their cycle-level performance with Synopsys VCS.

B. Experimental Results

We compare Cambricon and Cambricon-ACC with the baselines in terms of metrics such as performance and energy. We also provide the detailed layout characteristics of the prototype accelerator.

1) Flexibility: In view of the apparent flexibility provided by general-purpose ISAs (e.g., x86, MIPS and GPU-ISA), here we restrict our discussion to ISAs of NN accelerators. DaDianNao [5] and DianNao [3] are the two unique NN accelerators that have explicit ISAs (other ones are often hardwired). They share similar ISAs, and our discussion is exemplified by DaDianNao, the one with better performance and multicore scaling. To be specific, the ISA of this accelerator only contains four 512-bit VLIW instructions corresponding to four popular layer types of neural networks (fully-connected classifier layer, convolutional layer, pooling layer, and local response normalization layer), rendering it a rather incomplete ISA for the NN domain. Among the 10 representative benchmark networks listed in Table III, the DaDianNao ISA is only capable of expressing MLP, CNN, and RBM, but fails to implement the remaining 7 benchmarks (RNN, LSTM, AutoEncoder, Sparse AutoEncoder, BM, SOM and HNN). An observation well explaining the failure of DaDianNao on the 7 representative networks is that they cannot be characterized as aggregations of the four types of layers (thus aggregations of DaDianNao instructions). In contrast, Cambricon defines a total of 43 64-bit scalar/control/vector/matrix instructions, and is sufficiently flexible to express all 10 networks.

2) Code Density: Code density is a meaningful ISA metric only when the ISA is flexible enough to cover a broad range of applications in the target domain. Therefore, we only compare the code density of Cambricon with GPU, MIPS, and x86, with the 10 benchmarks implemented in Cambricon, CUDA-C, and C, respectively. We manually write the Cambricon programs; we compile the CUDA-C programs with nvcc, and count the length of the generated ptx files after removing initialization and system-call instructions; we compile the C programs with x86 and MIPS compilers, respectively (with the option -O2), and then count the lengths of the two kinds of assembly. We illustrate in Fig. 10 Cambricon's reduction in code length over the other ISAs. On average, the code length of Cambricon is about 6.41x, 9.86x, and 13.38x shorter than GPU, x86, and MIPS, respectively. The observations are not surprising, because Cambricon aggregates many scalar operations into vector instructions, and further aggregates vector operations into matrix instructions, which significantly reduces the code length.

Specifically, on MLP, Cambricon can improve the code density by 13.62x, 22.62x, and 32.92x against GPU, x86, and MIPS, respectively. The main reason is that there are very few scalar instructions in the Cambricon code of MLP. However, on CNN, Cambricon achieves only a 1.09x, 5.90x, and 8.27x reduction of code length against GPU, x86, and MIPS, respectively. This is because the main body of CNN is a deeply nested loop requiring many individual scalar operations to manipulate the loop variables. Hence, the advantage of aggregating scalar operations into vector operations yields only a small gain in code density.

Moreover, we collect the percentage breakdown of Cambricon instruction types in the 10 benchmarks. On average, 38.0% of the instructions are data transfer instructions, 4.8% are control instructions, 12.6% are matrix instructions, 33.8% are vector instructions, and 10.9% are scalar instructions. This observation clearly shows that vector/matrix instructions play a critical role in NN techniques, thus efficient implementations of these instructions are essential to the performance of a Cambricon-based accelerator.

3) Performance: We compare Cambricon-ACC against x86-CPU and GPU on all 10 benchmarks listed in Table III. Fig. 12 illustrates the speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao. On average, Cambricon-ACC is about 91.72x and 3.09x faster than the x86-CPU and the GPU, respectively. This is not surprising, because Cambricon-ACC integrates dedicated functional units and scratchpad memory optimized for NN techniques.

On the other hand, due to its incomplete and restricted ISA, DaDianNao can only accommodate 3 out of the 10 benchmarks (i.e., MLP, CNN and RBM), thus its flexibility is significantly worse than that of Cambricon-ACC. In the meantime, the better flexibility of Cambricon-ACC does not lead to significant performance loss. We compare Cambricon-ACC against DaDianNao on the three benchmarks that DaDianNao can support, and observe that Cambricon-ACC is only 4.5% slower than DaDianNao on average. The reason for the small performance loss of Cambricon-ACC over DaDianNao is that Cambricon decomposes the complex high-level functional instructions of DaDianNao (e.g., an instruction for a convolutional layer) into shorter, low-level computational instructions (e.g., MMV and dot product), which may bring in additional pipeline bubbles between instructions. With the high code density provided by Cambricon, however, the amount of additional bubbles is moderate, and the corresponding performance loss is therefore negligible.
Table III. Benchmarks (H stands for hidden layer, C stands for convolutional layer, K stands for kernel, P stands for pooling layer, F stands for classifier layer, V stands for visible layer).

Technique | Network Structure | Description
MLP | input(64) - H1(150) - H2(150) - Output(14) | Using a Multi-Layer Perceptron (MLP) to perform anchorperson detection. [2]
CNN | input(1@32x32) - C1(6@28x28, K: 6@5x5) - S1(6@14x14, K: 2x2) - C2(16@10x10, K: 16@5x5) - S2(16@5x5, K: 2x2) - F(120) - F(84) - output(10) | Convolutional neural network (LeNet-5) for hand-written character recognition. [28]
RNN | input(26) - H(93) - output(61) | Recurrent neural network (RNN) on the TIMIT database. [15]
LSTM | input(26) - H(93) - output(61) | Long short-term memory (LSTM) neural network on the TIMIT database. [15]
Autoencoder | input(320) - H1(200) - H2(100) - H3(50) - Output(10) | A neural network pretrained by an auto-encoder on the MNIST data set. [49]
Sparse Autoencoder | input(320) - H1(200) - H2(100) - H3(50) - Output(10) | A neural network pretrained by a sparse auto-encoder on the MNIST data set. [49]
BM | V(500) - H(500) | Boltzmann machine (BM) on the MNIST data set. [39]
RBM | V(500) - H(500) | Restricted Boltzmann machine (RBM) on the MNIST data set. [39]
SOM | input data(64) - neurons(36) | Self-organizing map (SOM) based data mining of seasonal flu. [48]
HNN | vector (5), vector component(100) | Hopfield neural network (HNN) on the hand-written digits data set. [36]
Figure 10. The reduction of code length against GPU, x86-CPU, and MIPS-CPU.
[Figure 11: per-benchmark percentage breakdown of Cambricon instruction types (Data Transfer, Control, Matrix, Vector, Scalar).]
Figure 12. The speedup of Cambricon-ACC against x86-CPU, GPU, and DaDianNao.
4) Energy Consumption: We also compare the energy consumptions of Cambricon-ACC, GPU and DaDianNao, which can be estimated as the products of the power consumptions (in Watts) and the execution times (in seconds). The power consumption of the GPU is reported by NVPROF, and the power consumptions of DaDianNao and Cambricon-ACC are estimated with Synopsys PrimeTime PX according to the simulated Value Change Dump (VCD) file. We do not have an energy comparison against the CPU baseline, because of the lack of hardware support for the estimation of the actual power of the CPU. Yet, it has recently been reported that a SIMD-CPU is an order of magnitude less energy-efficient than a GPU (NVIDIA K20M) on neural network applications [4], which well complements our experiments.

As shown in Fig. 13, the energy consumptions of GPU and DaDianNao are 130.53x and 0.916x that of Cambricon-ACC, respectively, where the energy of DaDianNao is averaged over 3 benchmarks because it can only accommodate 3 out of the 10 benchmarks. Compared with Cambricon-ACC, the power consumption of the GPU is much higher, as the GPU spends excessive hardware resources to flexibly support various workloads. On the other hand, the energy consumption of Cambricon-ACC is only slightly higher than that of DaDianNao, because both accelerators integrate the same sizes of functional units and on-chip storage, and work at the same frequency. The additional energy consumed by Cambricon-ACC mainly comes from the instruction pipeline logic, the memory queue, as well as the vector transcendental functional unit. In contrast, DaDianNao uses a low-precision but lightweight lookup table instead of transcendental functional units.

5) Chip Layout: We show the layout of Cambricon-ACC in Fig. 14, and list the area and power breakdowns in Table IV. The overall area of Cambricon-ACC is 56.24 mm², which is about 1.6% larger than that of DaDianNao (55.34 mm², re-implemented version). The combinational logic (mainly the vector and matrix functional units) consumes 32.15% of the area of Cambricon-ACC, and the on-chip memory (mainly the vector and matrix scratchpad memories) consumes about 15.05% of the area.

The matrix part (including the matrix function unit and the matrix scratchpad memory) accounts for 62.69% of the area of Cambricon-ACC, while the core part (including the instruction pipeline logic, scalar function unit, memory queue, and so on) and the vector part (including the vector function unit and the vector scratchpad memory) only account for 9.00% of the area. The remaining 28.31% of the area is consumed by the channel part, including the wires connecting the core & vector part with the matrix part, and the wires connecting together different blocks of the matrix part.

We also estimate the power consumption of the prototype design with Synopsys PrimePower. The peak power consumption is 1.695 W (under a 100% toggle rate), which is only about one percent of that of the K40M GPU. More specifically, the core & vector part and the matrix part consume 8.20% and 59.26% of the power, respectively. Moreover, data movements in the channel part consume 32.54% of the power, which is several times higher than the power of the core & vector part. It can be expected that the power consumption of the channel part would be much higher if we did not divide the matrix part into multiple blocks.

Table IV. Layout characteristics of Cambricon-ACC (1 GHz), implemented in TSMC 65nm technology.
Component | Area (μm²) | (%) | Power (mW) | (%)
Whole Chip | 56241000 | 100% | 1695.60 | 100%
Core & Vector | 5062500 | 9.00% | 139.04 | 8.20%
Matrix | 35259840 | 62.69% | 1004.81 | 59.26%
Channel | 15918660 | 28.31% | 551.75 | 32.54%
Combinational | 18081482 | 32.15% | 476.97 | 28.13%
Memory | 8461445 | 15.05% | 174.14 | 10.27%
Registers | 5612851 | 9.98% | 300.29 | 17.71%
Clock network | 877360 | 1.56% | 744.20 | 43.89%
Filler Cell | 23207862 | 41.26% | - | -

[Figure 14: chip layout showing the Core & Vector region and the Matrix region.]
Figure 14. The layout of Cambricon-ACC, implemented in TSMC 65nm technology.

VI. POTENTIAL EXTENSION TO BROADER TECHNIQUES

Although Cambricon is designed for existing neural network techniques, it can also support future neural network techniques or even some classic statistical techniques, as long as they can be decomposed into the scalar/vector/matrix instructions in Cambricon. Here we take logistic regression [21] as an example, and illustrate how it can be supported by Cambricon. Technically, logistic regression contains two phases, a training phase and a prediction phase. The training phase employs a gradient descent algorithm similar to the training phase of the MLP technique, which can be supported by Cambricon. In the prediction phase, the output can be computed as y = sigmoid(Σ_{i=0}^{n} θ_i x_i), where x = (x_0, x_1, ..., x_n)^T is the input vector, x_0 always equals 1, and θ = (θ_0, θ_1, ..., θ_n)^T is the vector of model parameters. We can leverage the dot product instruction, the scalar elementary arithmetic instructions, and the scalar exponential instruction of Cambricon to perform the prediction phase of logistic regression. Moreover, given a batch of n different input vectors, the MMV instruction, the vector elementary arithmetic instructions and the vector exponential instruction in Cambricon collaboratively allow the prediction phases of the n inputs to be computed in parallel.
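A scalar C reference of the prediction formula above might look like the following; the double type and the function name are assumptions of the sketch, and the same computation maps onto Cambricon's dot product, scalar arithmetic, and scalar exponential instructions (or their vector counterparts for a batch of inputs).

#include <math.h>
#include <stdio.h>

/* Prediction phase of logistic regression: y = sigmoid(sum_i theta[i]*x[i]),
 * with x[0] fixed to 1 so that theta[0] acts as the intercept term. */
static double predict(const double *theta, const double *x, int n)
{
    double z = 0.0;
    for (int i = 0; i <= n; ++i)     /* n+1 terms: x[0] == 1 plus n features */
        z += theta[i] * x[i];
    return 1.0 / (1.0 + exp(-z));    /* equals e^z / (1 + e^z) */
}

int main(void)
{
    double theta[3] = { -1.0, 2.0, 0.5 };
    double x[3]     = {  1.0, 0.3, 0.8 };   /* x[0] = 1 by convention */
    printf("y = %f\n", predict(theta, x, 2));
    return 0;
}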
[Figure 13: per-benchmark energy reduction (log scale) for GPU/Cambricon-ACC and DaDianNao/Cambricon-ACC.]
Figure 13. The energy reduction of Cambricon-ACC over GPU and DaDianNao.
VII. RELATED WORK

In this section, we summarize prior work on NN techniques and NN accelerator designs.

Neural Networks. Existing NN techniques exhibit significant diversity in their network topologies and learning algorithms. For example, Deep Belief Networks (DBNs) [41] consist of a sequence of layers, each of which is fully connected to its adjacent layers. In contrast, Convolutional Neural Networks (CNNs) [25] use convolutional/pooling windows to specify connections between neurons, thus the connection density is much lower than in DBNs. Interestingly, the connection densities of DBNs and CNNs are both lower than that of Boltzmann Machines (BMs) [39], which fully connect all neurons with each other. Learning algorithms for different NNs may also differ from each other, as exemplified by the remarkable discrepancy among the back-propagation algorithm for training Multi-Layer Perceptrons (MLPs) [50], the Gibbs sampling algorithm for training Restricted Boltzmann Machines (RBMs) [39], and the unsupervised learning algorithm for training Self-Organizing Maps (SOMs) [34].

In a nutshell, while adopting high-level, complex, and informative instructions could be a feasible choice for accelerators supporting a small set of similar NN techniques, the significant diversity and the large number of existing NN techniques make it unfeasible to build a single accelerator that uses a considerable number of high-level instructions to cover a broad range of NNs. Moreover, without a certain degree of generality, even an existing successful accelerator design may easily become inapplicable simply because of the evolution of NN techniques.

NN Accelerators. NN techniques are computationally intensive, and are traditionally executed on general-purpose platforms composed of CPUs and GPGPUs, which are usually not energy-efficient for NN techniques [3], because they invest excessive hardware resources to flexibly support various workloads. Over the past decade, there have been many hardware accelerators customized to NNs, implemented on FPGAs [13], [38], [40], [42] or as ASICs [3], [12], [14], [44]. Farabet et al. proposed an accelerator named NeuFlow with a systolic architecture [12], for the feed-forward paths of CNNs. Maashri et al. implemented another NN accelerator, which arranges several customized accelerators around a switch fabric [30]. Esmaeilzadeh et al. proposed a SIMD-like architecture (NnSP) for Multi-Layer Perceptrons (MLPs) [10]. Chakradhar et al. mapped CNNs to reconfigurable circuits [1]. Chi et al. proposed PRIME [6], a novel processing-in-memory architecture that implements a reconfigurable NN accelerator in ReRAM-based main memory. Hashmi et al. proposed the Aivo framework to characterize their specific cortical network model and learning algorithms, which can generate execution code of their network model for general-purpose CPUs and GPUs rather than hardware accelerators [16]. The above designs were customized for one specific NN technique (e.g., MLP or CNN), so their application scopes are limited. Chen et al. proposed a small-footprint NN accelerator called DianNao, whose instructions directly correspond to different layer types in CNNs [3]. DaDianNao adopts a similar instruction set, but achieves even higher performance and energy efficiency by keeping all network parameters on-chip, which is an innovation on accelerator architecture instead of ISA [5]. Therefore, the application scope of DaDianNao is still limited by its ISA, which is similar to the case of DianNao. Liu et al. designed the PuDianNao accelerator that accommodates seven classic machine learning techniques, whose control module only provides seven different opcodes (each corresponding to a specific machine learning technique) [29]. Therefore, PuDianNao only allows minor changes to the seven machine learning techniques. In summary, the lack of agility in instruction sets prevents previous accelerators from flexibly and efficiently supporting a variety of different NN techniques.

Comparison. Compared to prior work, we decompose the traditional high-level and complex instructions describing high-level functional blocks of NNs (e.g., layers) into shorter instructions corresponding to low-level computational operations (e.g., scalar/vector/matrix operations), which allows a hardware accelerator to have a broader application scope. Furthermore, simple and short instructions may reduce the design and verification complexity of the accelerators.
VIII. CONCLUSION AND FUTURE WORK

In this paper, we propose a novel ISA for neural networks called Cambricon, which allows NN accelerators to flexibly support a broad range of different NN techniques. We compare Cambricon with x86 and MIPS across ten diverse yet representative NNs, and observe that the code density of Cambricon is significantly higher than that of x86 and MIPS. We implement a Cambricon-based prototype accelerator in TSMC 65nm technology; its area is 56.24 mm² and its power consumption is only 1.695 W. Thanks to Cambricon, this prototype accelerator can accommodate all ten benchmark NNs, while the state-of-the-art NN accelerator, DaDianNao, can only support 3 of them. Even when executing the 3 benchmark NNs, our prototype accelerator still achieves comparable performance/energy-efficiency to the state-of-the-art accelerator, with negligible overheads. Our future work includes the final chip tape-out of the prototype accelerator, an attempt to integrate Cambricon into a general-purpose processor, as well as an in-depth study that extends Cambricon to support broader applications.

ACKNOWLEDGMENT

This work is partially supported by the NSF of China (under Grants 61133004, 61303158, 61432016, 61472396, 61473275, 61522211, 61532016, 61521092, 61502446), the 973 Program of China (under Grant 2015CB358800), the Strategic Priority Research Program of the CAS (under Grants XDA06010403, XDB02040009), the International Collaboration Key Program of the CAS (under Grant 171111KYSB20130002), and the 10000 talent program. Xie is supported in part by NSF 1461698, 1500848, and 1533933.

REFERENCES

[1] Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula, and Srihari Cadambi. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. In Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010.
[2] Yun-Fan Chang, P. Lin, Shao-Hua Cheng, Kai-Hsuan Chan, Yi-Chong Zeng, Chia-Wei Liao, Wen-Tsung Chang, Yu-Chiang Wang, and Yu Tsao. Robust anchorperson detection based on audio streams using a hybrid I-vector and DNN system. In Proceedings of the 2014 Annual Summit and Conference of the Asia-Pacific Signal and Information Processing Association, 2014.
[3] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[4] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. A High-Throughput Neural Network Accelerator. IEEE Micro, 2015.
[5] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014.
[6] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), 2016.
[7] A. Coates, B. Huval, T. Wang, D. J. Wu, and A. Y. Ng. Deep learning with COTS HPC systems. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[8] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[9] V. Eijkhout. Introduction to High Performance Scientific Computing. www.lulu.com, 2011.
[10] H. Esmaeilzadeh, P. Saeedi, B. N. Araabi, C. Lucas, and Sied Mehdi Fakhraie. Neural network stream processing core (NnSP) for embedded systems. In Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, 2006.
[11] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Neural Acceleration for General-Purpose Approximate Programs. In Proceedings of the 2012 IEEE/ACM International Symposium on Microarchitecture, 2012.
[12] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun. NeuFlow: A runtime reconfigurable dataflow processor for vision. In Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2011.
[13] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun. CNP: An FPGA-based processor for Convolutional Networks. In Proceedings of the 2009 International Conference on Field Programmable Logic and Applications, 2009.
[14] V. Gokhale, Jonghoon Jin, A. Dundar, B. Martini, and E. Culurciello. A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014.
[15] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM networks. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, 2005.
[16] Atif Hashmi, Andrew Nere, James Jamal Thomas, and Mikko Lipasti. A Case for Neuromorphic ISAs. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.
[17] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013.
[18] INTEL. AVX-512. https://software.intel.com/en-us/blogs/2013/avx-512-instructions.
[19] INTEL. MKL. https://software.intel.com/en-us/intel-mkl.
[20] Fernando J. Pineda. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett., 1987.
[21] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An Introduction to Statistical Learning. 2013.
[22] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852, 2015.
[24] V. Kantabutra. On hardware for computing exponential and trigonometric functions. IEEE Transactions on Computers, 1996.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, 2012.
[26] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An Empirical Evaluation of Deep Architectures on Problems with Many Factors of Variation. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[27] Q. V. Le. Building high-level features using large scale unsupervised learning. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[29] Daofu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, Olivier Temam, Xiaobing Feng, Xuehai Zhou, and Yunji Chen. PuDianNao: A Polyvalent Machine Learning Accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015.
[30] A. A. Maashri, M. DeBole, M. Cotter, N. Chandramoorthy, Yang Xiao, V. Narayanan, and C. Chakrabarti. Accelerating neuromorphic vision algorithms for recognition. In Proceedings of the 49th ACM/EDAC/IEEE Design Automation Conference, 2012.
[31] G. Marsaglia and W. W. Tsang. The ziggurat method for generating random variables. Journal of Statistical Software, 2000.
[32] Paul A. Merolla, John V. Arthur, Rodrigo Alvarez-Icaza, Andrew S. Cassidy, Jun Sawada, Filipp Akopyan, Bryan L. Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, Bernard Brezzo, Ivan Vo, Steven K. Esser, Rathinakumar Appuswamy, Brian Taba, Arnon Amir, Myron D. Flickner, William P. Risk, Rajit Manohar, and Dharmendra S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 2014.
[33] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
[34] M. A. Motter. Control of the NASA Langley 16-foot transonic tunnel with the self-organizing map. In Proceedings of the 1999 American Control Conference, 1999.
[35] NVIDIA. cuBLAS. https://developer.nvidia.com/cublas.
[36] C. S. Oliveira and E. Del Hernandez. Forms of adapting patterns to Hopfield neural networks with larger number of nodes and higher storage capacity. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004.
[40] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf. A Massively Parallel Coprocessor for Convolutional Neural Networks. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009.
[41] R. Sarikaya, G. E. Hinton, and A. Deoras. Application of Deep Belief Networks for Natural Language Understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.
[42] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale Convolutional Networks. In Proceedings of the 2011 International Joint Conference on Neural Networks, 2011.
[43] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. arXiv:1409.4842, 2014.
[44] O. Temam. A defect-tolerant accelerator for emerging high-performance applications. In Proceedings of the 39th Annual International Symposium on Computer Architecture, 2012.
[45] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[46] Yu Wang, Tianqi Tang, Lixue Xia, Boxun Li, Peng Gu, Huazhong Yang, Hai Li, and Yuan Xie. Energy Efficient RRAM Spiking Neural Network for Real Time Classification. In Proceedings of the 25th Edition of the Great Lakes Symposium on VLSI, 2015.
[47] Cong Xu, Dimin Niu, Naveen Muralimanohar, Rajeev Balasubramonian, Tao Zhang, Shimeng Yu, and Yuan Xie. Overcoming the Challenges of Cross-Point Resistive Memory Architectures. In Proceedings of the 21st International Symposium on High Performance Computer Architecture, 2015.
[48] Tao Xu, Jieping Zhou, Jianhua Gong, Wenyi Sun, Liqun Fang, and Yanli Li. Improved SOM based data mining of seasonal flu in mainland China. In Proceedings of the 2012 Eighth International Conference on Natural Computation, 2012.
[49] Xian-Hua Zeng, Si-Wei Luo, and Jiao Wang. Auto-Associative Neural Network System for Recognition. In Proceedings of the 2007 International Conference on Machine Learning and Cybernetics, 2007.
[50] Zhengyou Zhang, M. Lyons, M. Schuster, and S. Akamatsu. Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998.
[51] Jishen Zhao, Guangyu Sun, Gabriel H. Loh, and Yuan Xie. Optimizing GPU energy efficiency with 3D die-stacking graphics memory and reconfigurable memory interface. ACM Transactions on Architecture and Code Optimization, 2013.