

Performance Modeling for CNN Inference Accelerators on FPGA

Yufei Ma, Student Member, IEEE, Yu Cao, Fellow, IEEE, Sarma Vrudhula, Fellow, IEEE, and Jae-sun Seo, Senior Member, IEEE

Abstract—The recently reported successes of convolutional neural networks (CNNs) in many areas have generated wide interest in the development of FPGA-based accelerators. To achieve high performance and energy efficiency, an FPGA-based accelerator must fully utilize the limited computation resources and minimize the data communication and memory access, both of which are impacted and constrained by a variety of design parameters, e.g. the degree and dimension of parallelism, the size of on-chip buffers, the bandwidth of the external memory, and many more. The large design space of the accelerator makes it impractical to search for the optimal design in the implementation phase. To address this problem, a performance model is described to estimate the performance and resource utilization of an FPGA implementation. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase. The proposed performance model is validated across a variety of CNN algorithms by comparing its estimates with on-board test results on two different FPGAs.

Index Terms—Convolutional neural networks, FPGA, analytical modeling.

Y. Ma, Y. Cao and J. Seo are with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA (email: [email protected]; [email protected]; [email protected]).
S. Vrudhula is with the School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, AZ 85287 USA (email: [email protected]).
This work was supported in part by the NSF I/UCRC Center for Embedded Systems through NSF grants 1230401, 1237856, 1701241, 1361926 and 1535669, NSF grants 1652866 and 1715443, Intel Labs, and C-BRIC, one of six centers in JUMP, an SRC program sponsored by DARPA.

I. INTRODUCTION

Many reported successes of convolutional neural networks (CNNs) for computer vision tasks [1] [2] [3] [4] [5] [6] have motivated the development of hardware implementations of CNNs. In particular, there has been increased interest in field-programmable gate arrays (FPGAs) as a platform to accelerate the post-training inference computations of CNNs [7] [8] [9] [10] [11] [12] [13]. To achieve high performance and low energy cost, a CNN accelerator must 1) fully utilize the limited computing resources to maximize the parallelism when executing the large number of operations for different convolution layers with varying dimensions, 2) exploit the data locality by saving only the required data in on-chip buffers to minimize the cost of external memory (e.g. DRAM) accesses, and 3) manage the data storage patterns in buffers to increase the data reuse and reduce the data movements.

Fig. 1. A general CNN hardware accelerator with three levels of hierarchy (external memory with DMA, on-chip buffers, and processing engine arrays), where the loop design variables (CNN size N*, loop tiling T*, loop unrolling P*) determine the key accelerator metrics, e.g. the size and delay of DRAM access, the capacity and access size of the buffers, the number of DSPs, and the computation delay.

With the intervals of computation and off-chip communication overlapped using the dual buffering (or ping-pong buffering) technique, the performance of the CNN accelerator will be limited by either the computation delay or the DRAM transfer delay, and the actual bound will be determined by the values of the associated design parameters, as described by the roofline model in [7] [9]. The computation delay is determined by the number of parallel processing engines (PEs), their utilization, and the operating frequency. The DRAM transfer latency is mainly affected by the external memory bandwidth and the number of DRAM accesses, and the latter is strongly affected by the size of the on-chip buffers. With regard to energy efficiency (i.e. performance per watt), the main components that determine the dynamic power consumption are the computation logic and the memory traffic, the latter requiring efficient data movement and high data reuse. All these considerations show that there are numerous design parameters that determine the performance and energy efficiency of a CNN accelerator, making it impractical to find their optimal values during the implementation phase, as the synthesis of one FPGA design may take several hours. Robust and parametric models become a necessity for efficient design space exploration and selection of the optimal values of the design parameters. The architectural design space must be numerically characterized by design variables that control the accelerator performance and efficiency. For instance, loop optimization techniques [7] [11], such as loop unrolling and tiling, are employed to customize the acceleration strategy of parallel computation and data communication for convolution loops, whose variables in turn affect the resource utilization and memory access.

The starting point of this work is a general system-level model of a CNN accelerator shown in Fig. 1, which includes the external memory, on-chip buffers, and PEs. The hardware architectural parameters, e.g. buffer sizes, are determined by the design variables that control the loop unrolling and tiling. Combining the design constraints and the choices of


the acceleration strategy, a more fine-grained performance model is built to achieve better prediction for a specific design implementation, e.g. the design strategy in [11]. By this means, the proposed performance model makes it possible to identify the performance bottleneck and design limitations in the early development phase by exploring the design space through unrolling and tiling variables.

The main contributions of this work are:
• The design objectives and resource costs are formulated using the design variables of loop unrolling and tiling.
• A high-level performance model is proposed to estimate the accelerator throughput, the on-chip buffer size and the number of external and on-chip memory accesses.
• The design space is efficiently explored through the proposed model instead of the real FPGA compilation to identify the performance bottleneck and obtain the optimal design configurations.
• The performance model is validated for a specific design strategy across a variety of CNN algorithms by comparing with the on-board test results on two different FPGAs.
• Techniques that may further enhance the performance of our current design by improving the efficiency of DRAM transactions and PE utilization are evaluated.

The remainder of this paper is organized as follows. Section II overviews the procedure to map CNN operations onto an FPGA hardware system. A coarse-grained performance model is presented in Section III for rough estimation, and the fine-grained model is discussed in the following sections for a specific design strategy. Section IV estimates the size and latency of DRAM accesses. The latency of convolution and other layers is formulated and estimated in Section V. The on-chip buffer size requirement is analyzed in Section VI, and the size of buffer access is discussed in Section VII. Experiments are performed to explore the design space in Section VIII. Section IX evaluates the techniques that may further improve the current design performance.

II. CNN INFERENCE ACCELERATOR ON FPGA

A. Overview of Convolution Operation

The main operation in CNN algorithms involves accumulating the products of pixel values (e.g. features, activations or neurons) with kernel weights, along different dimensions of the kernel and feature maps. Fig. 2 shows the four nested loops involved in CNNs. Note that the prefix "N" (for "number") used in describing various parameters (e.g. Nix, Niy, Nif, etc.) denotes the sizes of the kernel and feature maps. The loop operations shown in Fig. 2 are written as:
pixel_L(no; x, y) = Σ_{ni=1}^{Nif} Σ_{ky=1}^{Nky} Σ_{kx=1}^{Nkx} pixel_{L−1}(ni; S·x + kx, S·y + ky) × weight(ni, no; kx, ky) + bias(no),   (1)

where S is the sliding stride, x ∈ {1, 2, . . . , Nox}, y ∈ {1, 2, . . . , Noy}, no ∈ {1, 2, . . . , Nof}, L ∈ {1, 2, . . . , #CONVs}, and #CONVs is the number of convolution layers.

Fig. 2. Convolution operation is implemented by four levels of loops to multiply and accumulate input features with kernel weights, where i: input, o: output, k: kernel, f: feature, x: x axis (width), and y: y axis (height), and the parameters of loop dimensions are prefixed with "N" [11].
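For concreteness, Eq. (1) can be restated directly as nested loops. The short Python sketch below is illustrative only; the array layout, zero-based indexing and the absence of zero padding are assumptions made here, not details taken from the paper.

```python
import numpy as np

def conv_layer(pixels_in, weights, bias, S):
    """Direct evaluation of Eq. (1) for one convolution layer (no padding).

    pixels_in: (Nif, Nix, Niy) input feature maps
    weights:   (Nif, Nof, Nkx, Nky) kernel weights
    bias:      (Nof,) bias per output feature map
    S:         sliding stride
    """
    Nif, Nix, Niy = pixels_in.shape
    _, Nof, Nkx, Nky = weights.shape
    Nox = (Nix - Nkx) // S + 1
    Noy = (Niy - Nky) // S + 1
    pixels_out = np.zeros((Nof, Nox, Noy))
    for no in range(Nof):                      # output feature maps
        for x in range(Nox):                   # output x (width)
            for y in range(Noy):               # output y (height)
                acc = bias[no]
                for ni in range(Nif):          # input feature maps
                    for ky in range(Nky):      # kernel y
                        for kx in range(Nkx):  # kernel x
                            acc += (pixels_in[ni, S * x + kx, S * y + ky]
                                    * weights[ni, no, kx, ky])
                pixels_out[no, x, y] = acc
    return pixels_out
```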


B. CNN Hardware Acceleration System

In the general model of a CNN accelerator shown in Fig. 1, due to the large data volume, both the weights and intermediate pixel results are stored in the external memory, e.g. DRAM. The input and weight on-chip buffers temporarily store the input data to be processed by the PEs, and the PE results are saved in the output buffers. After completing the computation, the results are transferred back to DRAM from the output buffers, which will be used as the input to the subsequent layer.

Fig. 3. The convolution acceleration strategy is customized by loop unrolling (P*) for parallel computation and loop tiling (T*) for data buffering. The parallelism is within one input feature map (Pix × Piy) and across multiple kernel maps (Pof). The demanded buffer size can be changed by tuning variables Toy and Tof [11]. In this figure, Pox × Poy × Pof = 2 × 2 × 3, Pif = Pkx = Pky = 1, and 1 ≤ P* ≤ T* ≤ N*.

C. Convolution Loop Optimization

Loop optimization techniques [7] [11], e.g. unrolling and tiling, are employed to customize the computation and communication patterns in a CNN accelerator. Loop unrolling directs the parallel computation along different convolution dimensions, and the variables representing the unrolling degrees are prefixed by "P" (see Fig. 3). These variables determine the number of PEs, which in turn determines the required number of DSPs in the FPGA to implement the PEs, and thus decides the computation delay. The data flow from buffers into PEs is also impacted by the loop unrolling variables, which affect the number of buffer accesses. Loop tiling divides a large CNN layer into multiple small tiles, which can be accommodated by the on-chip buffers to increase data locality. Tiling sizes are represented by variables prefixed with "T" as shown in Fig. 3. The required buffer capacity is determined by the tiling variables, which also affect the DRAM access and thus the latency of DRAM transactions. The relationship between the loop variables and the key specifications of accelerators, e.g. delay, DSP usage, buffer size, and memory access that affects memory power consumption, is shown in Fig. 1.

D. Convolution Acceleration Strategy

To accurately predict the real implementation, a specific accelerator design strategy is needed to characterize the fine-grained performance model with detailed design options. The output stationary acceleration strategy of unrolling and tiling in [11] is adopted in this work as shown in Fig. 3. The loop unrolling or the parallel computations are only performed within one input feature map (Pix = Pox > 1 and Piy = Poy > 1) and across multiple kernel maps (Pof > 1). That is, in Fig. 3, the Pix × Piy blue pixels in an input feature map are operated in parallel with green weights from Pof different kernel maps, resulting in Pox × Poy pixels in each of the Pof output feature maps. Therefore, the total number of PEs (MAC units) is Pox × Poy × Pof. The data required to compute one final output pixel are fully buffered, i.e. Tkx = Nkx, Tky = Nky and Tif = Nif, so that the partial sum can be consumed inside the MAC unit without saving it in the buffer. To ensure that the DRAM accesses are from continuous addresses, the entire row of the feature map is buffered, i.e. Tix = Nix and Tox = Nox. If the on-chip RAM capacity is large enough, either all pixels or all weights of one layer are fully buffered, so that each datum only needs to be fetched from DRAM once to reduce DRAM access. Finally, the required buffer sizes of each layer can be changed by tuning Toy and Tof.

III. COARSE-GRAINED PERFORMANCE MODEL

In this section, a coarse-grained performance model of a general CNN accelerator, independent of a specific acceleration strategy, is presented. Then, more detailed design choices and constraints (e.g. unrolling and tiling variable settings, memory storage pattern, and computation dataflow) are introduced to create a more precise and fine-grained model in the following sections. Table I lists the mainly used abbreviations and units in this paper, which indicate the meaning of the variables discussed afterwards.

TABLE I
LIST OF ABBREVIATIONS AND UNITS

Abbreviation   Description       Abbreviation   Description
Px             Pixel             Rd             Read
Wt             Weight            Wr             Write
Buf            Buffer            InBuf          Input Buffer
WtBuf          Weight Buffer     OutBuf         Output Buffer
BW             Bandwidth         1T             One Tile

Unit           Description       Unit           Description
bit / byte     Data Size         word           RAM Depth
ms             Delay Time        MHz            Frequency

A. Computation Latency

The number of multiplication operations per layer is Nm = Nif × Nkx × Nky × Nof × Nox × Noy. The number of PEs that determines the degree of parallel computations by unrolling is Pm = Pif × Pkx × Pky × Pof × Pox × Poy. A similar reasoning is applied to determine the number of clock cycles for one buffered tile (1T) of convolution. This is denoted by #cycles_1T, and is expressed as follows,

#cycles_1T = ⌈Tif/Pif⌉ ⌈Tkx/Pkx⌉ ⌈Tky/Pky⌉ ⌈Tof/Pof⌉ ⌈Tox/Pox⌉ ⌈Toy/Poy⌉.   (2)

The number of tiles for one convolution layer is

#tiles = ⌈Nif/Tif⌉ ⌈Nkx/Tkx⌉ ⌈Nky/Tky⌉ ⌈Nof/Tof⌉ ⌈Nox/Tox⌉ ⌈Noy/Toy⌉.   (3)

The total number of computation clock cycles of one convolution (CV) layer is

#cycles_1CV = #tiles × #cycles_1T.   (4)
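To make the bookkeeping explicit, a minimal Python sketch of Equations (2)-(4) follows; the dictionary-based description of the N*, T* and P* variables is a convenience assumed here, not notation from the paper.

```python
from math import ceil

def cycles_one_tile(T, P):
    """Eq. (2): clock cycles to compute one buffered tile."""
    return (ceil(T['if'] / P['if']) * ceil(T['kx'] / P['kx']) * ceil(T['ky'] / P['ky'])
            * ceil(T['of'] / P['of']) * ceil(T['ox'] / P['ox']) * ceil(T['oy'] / P['oy']))

def num_tiles(N, T):
    """Eq. (3): number of tiles in one convolution layer."""
    return (ceil(N['if'] / T['if']) * ceil(N['kx'] / T['kx']) * ceil(N['ky'] / T['ky'])
            * ceil(N['of'] / T['of']) * ceil(N['ox'] / T['ox']) * ceil(N['oy'] / T['oy']))

def cycles_one_conv(N, T, P):
    """Eq. (4): total computation cycles of one convolution layer."""
    return num_tiles(N, T) * cycles_one_tile(T, P)
```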
B. On-chip Buffer Size

Determined by the tiling variables, the input buffer (InBuf) size (bit) requirement to store one tile of input pixels is

bit_InBuf = Tix × Tiy × Tif × bit_Px,   (5)

where bit_Px is the bit width of one pixel (Px). Similarly, the size (bit) requirement of the weight buffer (WtBuf) to store one tile of weights is

bit_WtBuf = Tkx · Tky · Tif · Tof · bit_Wt,   (6)

where bit_Wt is the bit width of one weight (Wt). The output buffer (OutBuf) size (bit) requirement to store one tile of output pixels is

bit_OutBuf = Tox × Toy × Tof × bit_Px.   (7)

The theoretical sizes of the input, weight and output buffers are the maximum possible values of bit_InBuf, bit_WtBuf and bit_OutBuf of all the convolution layers, respectively. In an actual implementation, the sizes of the buffers used may be larger than these values due to the inefficient storage pattern and extra garbage data.
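The coarse buffer requirements of Equations (5)-(7) translate directly into a few lines of Python; the tiling dictionary below is a hypothetical convenience, and the 16-bit widths in the usage comment are only an example.

```python
def coarse_buffer_bits(T, bit_px, bit_wt):
    """Eqs. (5)-(7): coarse on-chip buffer requirements (bits) for one tile."""
    bit_in_buf  = T['ix'] * T['iy'] * T['if'] * bit_px            # Eq. (5)
    bit_wt_buf  = T['kx'] * T['ky'] * T['if'] * T['of'] * bit_wt  # Eq. (6)
    bit_out_buf = T['ox'] * T['oy'] * T['of'] * bit_px            # Eq. (7)
    return bit_in_buf, bit_wt_buf, bit_out_buf

# The theoretical buffer sizes are then the per-layer maxima, e.g. with 16-bit data:
# max(coarse_buffer_bits(T_layer, 16, 16)[0] for T_layer in all_layer_tilings)
```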


C. DRAM Access and Latency

In theory, the size of one tile of data read from or written to the external DRAM should be the same as the size of the buffered data. Therefore, the size (bytes) of input pixels (Px) read (Rd) from DRAM for one convolution tile is byte_RdPx = bit_InBuf/8. The size (bytes) of one tile of weights (Wt) read from the DRAM is byte_RdWt = bit_WtBuf/8. The size (bytes) of one tile of output pixels written (Wr) to the DRAM is byte_WrPx = bit_OutBuf/8. The latency (milliseconds or ms) of DRAM transactions of one tile (1T) of data is determined by the size of the DRAM access and the memory bandwidth. This is given by

ms_DRAM_1T = byte_DRAM_1T / (BW_Memory × 10^6),   (8)

where BW_Memory is the external memory bandwidth (GByte/s), and byte_DRAM_1T is the size of the DRAM access of one tile, which can be byte_RdPx, byte_RdWt, or byte_WrPx.
or byte W rP x. size of 3 × 3 and tiled output features of 2 × 2, we can achieve
2.25× reduction of multiplication operations. However, the
D. On-chip Buffer Access addition operations are increased in Winograd, and additional
The size (bits) of on-chip buffer access (bit Buf Access) storage and bandwidth are required by the transform matrices.
is computed by multiplying the number of access clock cycles The operations can be further reduced with larger feature tiles,
(#cycles Access) with the total bit width of the corresponding but the complexity of the transform matrix will significantly
buffers (width Buf ). increase. Since Winograd essentially unrolls the computation
within a kernel window, the varying kernel sizes can affect its
bit Buf Access = #cycles Access × width Buf. (9) computation efficiency.
During computation, it is assumed that data are continuously
read from input and weight buffers and the results are written IV. M ODELING OF DRAM ACCESS
into the output buffers every clock cycle. Then, to estimate the
In this section, more accurate models of the DRAM access
buffer access during computation, #cycles Access equals the
are constructed by including the design constraints and the
number of computation cycles, and width Buf can be the
variables of loop acceleration described in Section II-D.
total bit width of input/weight/output buffers. The size (bits)
of buffer access by DMA that writes into input and weight
buffers and reads from output buffers is the same as the size A. Data Size of Convolution DRAM Access
of external memory access. The data stored in the input or
weight buffers may be read multiple times during computation, The direct memory access (DMA) engine shown in Fig. 1 is
hence the size of data read from buffers may be larger than used to transfer data to and from off-chip DRAM. To achieve
the size of data written into buffers from DRAM. Since each the maximum bandwidth, the data width of both the DMA
result is written into output buffers only once, the size of write (bit DM A) and the DRAM controller (bit DRAM ) are set
and read operations of output buffers are the same. to be 512 bits.
P ox represents the number of pixels that are computed in
parallel in each output feature map. For the feature map transfer,
E. Other Implementation Methods of Convolution
the number of groups of P ox pixels associated with one DMA
Instead of the aforementioned direct implementation of the address is then given by #P oxGroup = bbit DM A/(P ox ×
convolution loop operations, convolution can also be performed bit P x)c, where bit P x is the bit width per pixel. The effective
as matrix multiplication [8] or accelerated in the frequency or actual DMA bandwidth (as a fraction of the maximum) is
domain [14] [15]. Since these methods require significantly then given by
different hardware architecture and dataflow, we only briefly
#P oxGroup × P ox × bit P x
analyze them with our modeling parameters. ef f DM A P x = . (10)
1) Matrix Multiplication: The multiply and accumulate bit DM A
(MAC) operations in convolution can be mapped to matrix For example, if P ox = 7, bit DM A = 512 and bit P x = 16,
multiplication [8], which can utilize the library optimized for then there are #P oxGroup = 4 groups of P ox pixels in one
GPU, e.g. BLAS used by Caffe. The original 4-D kernel weights DMA address, and 4 × 7 × 16 = 448 bits are the effective
are transformed to be a matrix with Nof rows and Nkx·Nky·Nif number of bits out of the DMA bit width of 512 bits, resulting
columns. The 3-D input feature map is transformed into a in ef f DM A P x = 0.875.
matrix with Nkx · Nky · Nif rows and Nox · Noy columns. The intermediate pixel results stored in DRAM are arranged
There are redundant data in the transformed feature matrix row-by-row, map-by-map, and layer-by-layer. One convolution
due to the overlapped sliding of the kernel window. Therefore, tile needs T ix × T iy × T if input pixels. Then, the size (bytes)
this method could lead to either complex dataflow and extra of the input pixels read (Rd) from the DRAM for one tile is
hardware to perform the transform on the fly, or additional T ix × T iy × T if × bit P x
DRAM accesses due to the redundant data. byte RdP x = . (11)
ef f DM A P x × 8
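A small Python sketch of Equations (10) and (11), reproducing the worked example in the text; the 512-bit default is the DMA width stated above, and the function names are ours.

```python
def eff_dma_px(pox, bit_px, bit_dma=512):
    """Eq. (10): fraction of the DMA word carrying useful pixels."""
    pox_groups = bit_dma // (pox * bit_px)
    return pox_groups * pox * bit_px / bit_dma

def byte_rd_px(tix, tiy, tif, bit_px, pox, bit_dma=512):
    """Eq. (11): bytes of input pixels read from DRAM for one tile."""
    return tix * tiy * tif * bit_px / (eff_dma_px(pox, bit_px, bit_dma) * 8)

# Worked example from the text: Pox = 7, 16-bit pixels, 512-bit DMA word.
print(eff_dma_px(7, 16))   # 0.875 (= 448/512)
```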


Note that if eff_DMA_Px < 1, it implies that more bytes are read than necessary, due to the alignment of data storage. Similarly, the size (bytes) of output pixels written (Wr) to DRAM for one convolution tile is

byte_WrPx = (Tox × Toy × Tof × bit_Px) / (eff_DMA_Px × 8).   (12)

For convolution weights, the ratio of the effective DRAM bandwidth to the maximum when reading weights from DRAM is

eff_DMA_Wt = (⌊bit_DMA/bit_Wt⌋ × bit_Wt) / bit_DMA.   (13)

The size (bytes) of input weights read from DRAM for one convolution tile is

byte_RdWt = (Tkx · Tky · Tif · Tof · bit_Wt) / (eff_DMA_Wt × 8).   (14)

B. DRAM Access Delay of One Tile (1T)

The data width of the DRAM controller interface to the FPGA is assumed to be bit_DRAM, running at a frequency of MHz_DRAM. This means the theoretical maximum DRAM bandwidth (BW_DRAM in GB/s) is (bit_DRAM/8) × (MHz_DRAM/10^3), which is normally very difficult to sustain due to non-contiguous DRAM accesses. For example, if bit_DRAM = 512 bits with MHz_DRAM = 266 MHz, then BW_DRAM = (512/8) × (266/10^3) = 17.0 GB/s is the maximum DRAM bandwidth.

In the CNN acceleration system described in [11], the DMA engine is operated at the same clock frequency as the CNN accelerator core (i.e. MHz_Accelerator) with a read/write data width (bit_DMA) of 512 bits. An asynchronous FIFO can be inserted between the DMA and the DRAM controller to synchronize data across the two clock domains. Then, the DMA bandwidth (BW_DMA) is (bit_DMA/8) × (MHz_Accelerator/10^3). By this means, the bandwidth of the external memory is bounded by the effective bandwidth of both the DRAM controller and the DMA as BW_Memory = min(BW_DRAM, BW_DMA), which is used in Equation (8) to calculate the DRAM latency.

The more accurate and specific DRAM access sizes of one tile (byte_DRAM_1T) are discussed in this section, including byte_RdPx, byte_WrPx, and byte_RdWt. Then, we can use Equation (8) to compute their corresponding DRAM access delays (ms_DRAM_1T), e.g. ms_RdPx, ms_WrPx, and ms_RdWt, respectively.

C. DRAM Access of Other Layers

The DRAM access and performance of other layers, e.g. max-pooling, fully-connected (FC) and Eltwise, are also investigated and included in our performance model. Since the analysis process of these layers is similar to that of the convolution layer, for simplicity, their detailed formulas used in the performance model are not presented.

The pixels of max-pooling layers are also transferred to and from the DRAM with loop tiling performed, depending on the adopted design choices [11] [17]. For max-pooling, the calculation of the DRAM transfer sizes of input and output pixels is similar to byte_RdPx in Equation (11) and byte_WrPx in Equation (12), respectively.

The weights of fully-connected (FC) layers are stored in DRAM in the same way as convolution weights, and reuse the same weight buffers. Since the intermediate results of FC layers are small (< 20 KB), they are always kept in the on-chip RAMs.

The Eltwise layer performs element-wise summation of the outputs of two layers. The Eltwise layer is executed after its two preceding layers are finished, so that it can directly read the results of one layer from the output buffers, without accessing DRAM. However, the Eltwise layer still needs to read the outputs of the other layer from DRAM, as the output buffers were already refreshed.

V. MODELING OF LATENCY

A. Computation Delay (ms) of One Convolution Tile

Setting Pif = Pkx = Pky = 1, Tif = Nif, Tkx = Nkx, Tky = Nky, and Tox = Nox as described in Section II-D, Equation (2) can be written as

#cycles_1T = Nif · Nkx · Nky · ⌈Tof/Pof⌉ · ⌈Nox/Pox⌉ · ⌈Toy/Poy⌉.   (15)

Then, the computation delay (ms) of one convolution tile is

ms_Compute = #cycles_1T / (MHz_Accelerator × 10^3),   (16)

where MHz_Accelerator is the clock frequency of the accelerator in MHz. The number of tiles of one convolution layer (#tiles) is ⌈Nof/Tof⌉⌈Noy/Toy⌉ based on Equation (3), with Nif = Tif, Nkx = Tkx, Nky = Tky, and Nox = Tox as described in Section II-D.

B. Overall Delay (ms) of One Convolution Layer

With the dual buffering technique, the DRAM access is overlapped with computation to improve performance [7] [10]. The overall tile-by-tile delay of one convolution layer is illustrated in Fig. 4. Since the dual buffering pipeline is only within one layer with the current design choice, after the start of one layer and before the computation of the first tile, both the input pixels and weights (Wt) of one tile are first read from DRAM. This is shown as "Input + Wt" at the beginning of one layer in Fig. 4. Similarly, after the completion of the last tile's computation, its output pixels are transferred back into DRAM, which is shown as "Output" at the end in Fig. 4. Therefore, for each convolution layer, the delay of transferring the inputs of the first tile and the outputs of the last tile cannot be overlapped with the computation, and this delay is denoted as

ms_Mem = ms_RdPx + ms_RdWt + ms_WrPx.   (17)

If the convolution layer has only one tile, that is Tiy = Niy and Tof = Nof, there is no overlapping of memory transfer and computation, as shown in Fig. 4(a), and the delay of this tile (e.g. t = 1 in Fig. 4(a)) is only determined by the computation delay, as in Algorithm 1 (line 2).
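A brief Python sketch of Equations (15)-(17); the argument names mirror the model variables, and the function boundaries are our own packaging rather than anything prescribed by the paper.

```python
from math import ceil

def ms_compute_one_tile(Nif, Nkx, Nky, Tof, Pof, Nox, Pox, Toy, Poy, mhz_acc):
    """Eqs. (15)-(16): computation delay (ms) of one convolution tile."""
    cycles_1t = Nif * Nkx * Nky * ceil(Tof / Pof) * ceil(Nox / Pox) * ceil(Toy / Poy)
    return cycles_1t / (mhz_acc * 1e3)

def ms_mem(ms_rd_px, ms_rd_wt, ms_wr_px):
    """Eq. (17): non-overlapped DRAM delay at the layer boundaries."""
    return ms_rd_px + ms_rd_wt + ms_wr_px
```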


Fig. 4. The tile-by-tile delay of one convolution layer, where the DRAM access delay is overlapped with the computation delay due to the dual buffering technique. (a) Both inputs and weights fully buffered, (b) only weights fully buffered, (c) only inputs fully buffered, (d) neither inputs nor weights fully buffered. In the figure, "Input", "Wt", "Output" and "Compute" denote the per-tile delays of reading input pixels from DRAM (ms_RdPx), reading weights from DRAM (ms_RdWt), writing output pixels into DRAM (ms_WrPx), and computing one tile of data (ms_Compute), respectively.

If the convolution layer has multiple tiles and all its weights are fully buffered, i.e. Tiy < Niy and Tof = Nof, then the weights only need to be read from DRAM once and can be reused by different tiles, as illustrated in Fig. 4(b). The procedure to estimate the delay of this convolution layer is summarized in Algorithm 1 (line 3 to line 12). The computation of the first tile (e.g. t = 1 in Fig. 4(b)) is overlapped with fetching the input pixels of the next tile, and there is no DMA transfer of the output pixels of the previous layer, thus the delay of this tile is determined by Algorithm 1 (line 6). The computation of the last tile (e.g. t = 3 in Fig. 4(b)) is overlapped with transferring the output pixels of its previous tile, and its delay is calculated by Algorithm 1 (line 8). For the other tiles (e.g. t = 2 in Fig. 4(b)), the communication with DRAM includes both reading input pixels and writing output pixels, and the delay of one tile is expressed by Algorithm 1 (line 10). The overall delay of this convolution layer is the sum of all the tiles as well as the DRAM access delay before the first tile and after the last tile, i.e. ms_Mem.

If the convolution layer has multiple tiles and all its pixels are fully buffered, i.e. Tiy = Niy and Tof < Nof, then the pixels only need to be read from DRAM once and can be reused by different tiles, as illustrated in Fig. 4(c). Similarly, the procedure to estimate the delay of this convolution layer is summarized in Algorithm 1 (line 13 to line 22).

If neither the weights nor the pixels of the convolution layer can be fully buffered, i.e. Tiy < Niy and Tof < Nof, its pipeline schedule is shown in Fig. 4(d) and the associated delay is estimated in Algorithm 1 (line 23 to line 37). In this case, either the pixels or the weights need to be re-fetched multiple times from the DRAM. In our current design, the input pixels are re-fetched and the weights only need to be read once. If the DRAM access requirement of the input pixels is larger than that of the weights, we can also re-fetch weights instead and only read input pixels once by changing the DMA instructions and the associated control logic. Before the computation, the first tile of weights is loaded and reused by the following consecutive #tiles_y = ⌈Niy/Tiy⌉ tiles of pixels to perform convolution. Then, the next tile of weights is loaded and reused by the following #tiles_y tiles of pixels. This process iterates #tiles_f = ⌈Nof/Tof⌉ times to complete the computation with all the #tiles_f tiles of weights. By this means, the pixels are re-fetched #tiles_f times. A normal tile needs to read the input pixels of the next tile from DRAM and write the output pixels of the previous tile into DRAM, where the required weights are already loaded during the previous tile and reused. Therefore, the delay of a normal tile is estimated as in Algorithm 1 (line 34). As the first tile does not have a previous tile, there is no transfer of output pixels back to DRAM, as in Algorithm 1 (line 28). For the last tile, there is no need to read input pixels for the next tile, as in Algorithm 1 (line 30). When #tiles_y tiles of weights are finished (e.g. ty = 3 and tf = 1 in Fig. 4(d)), a new tile of weights is loaded from DRAM, and the DRAM access also includes the transfer of pixels, as in Algorithm 1 (line 32).

Fig. 5. The tile-by-tile delay of one pooling/fully-connected layer, where the DRAM access delay is overlapped with the computation delay due to the dual buffering technique. (a) Max-pooling with inputs fully buffered in one tile, (b) max-pooling with inputs partially buffered in multiple tiles, (c) fully connected.

C. Delay Estimation of Other Layers

With the dual buffering technique employed, the overall tile-by-tile process of one max-pooling layer is illustrated in Fig. 5(a)(b), which is similar to the convolution layer except that pooling does not need weights. If the pooling layer has only one tile, which means the inputs of one pooling layer can be fully buffered, there is no overlapping between memory transfer and computation, as shown in Fig. 5(a). Fig. 5(b) illustrates the dual buffering pipeline of one pooling layer with multiple tiles. Similar to Algorithm 1, we can compute the overall latency of max-pooling layers according to the tile-by-tile execution schedule, with the delay of max-pooling computation and DRAM access calculated similarly to the convolution layer.

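The tile-by-tile schedule described above can be prototyped compactly; the Python sketch below mirrors the per-layer delay estimate of Algorithm 1 and assumes constant per-tile delays C, I, W and O within a layer, which matches the per-layer formulas above but is a simplification on our part.

```python
from math import ceil

def ms_one_conv_layer(C, I, W, O, Niy, Tiy, Nof, Tof, ms_mem):
    """Tile-by-tile delay (ms) of one convolution layer, following Algorithm 1.

    C, I, W, O: per-tile delays ms_Compute, ms_RdPx, ms_RdWt, ms_WrPx.
    """
    tiles_y, tiles_f = ceil(Niy / Tiy), ceil(Nof / Tof)
    n_tiles = tiles_y * tiles_f
    T = []
    if Tiy == Niy and Tof == Nof:             # single tile: no overlap possible
        T.append(C)
    elif Tiy < Niy and Tof == Nof:            # weights fully buffered
        for t in range(1, n_tiles + 1):
            if t == 1:         T.append(max(C, I))
            elif t == n_tiles: T.append(max(C, O))
            else:              T.append(max(C, I + O))
    elif Tiy == Niy and Tof < Nof:            # pixels fully buffered
        for t in range(1, n_tiles + 1):
            if t == 1:         T.append(max(C, W))
            elif t == n_tiles: T.append(max(C, O))
            else:              T.append(max(C, W + O))
    else:                                     # neither fully buffered
        for tf in range(1, tiles_f + 1):
            for ty in range(1, tiles_y + 1):
                t = ty + (tf - 1) * tiles_y
                if ty == 1 and tf == 1:  T.append(max(C, I))
                elif t == n_tiles:       T.append(max(C, O))
                elif ty == tiles_y:      T.append(max(C, I + W + O))
                else:                    T.append(max(C, I + O))
    return sum(T) + ms_mem                    # Algorithm 1, line 39
```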

Fig. 5(c) shows the pipeline schedule of the FC layer, where weights are fetched before the corresponding computation and no outputs are transferred back to DRAM. The storage format of the FC weights in the weight buffer allows us to read Pof weights simultaneously every clock cycle to compute Pof outputs in parallel. Then, the computation cycles of one FC tile equal the depth of the buffered FC weights. The overall delay of FC is bounded and determined by either the computation delay or the DRAM access delay of the weights.

input : C, I, W, O, #tiles, #tiles_y, #tiles_f
output: ms_1CV
1  if Tiy = Niy and Tof = Nof then
2      T[1] = C
3  else if Tiy < Niy and Tof = Nof then
4      for t = 1 to #tiles do
5          if t = 1 then
6              T[t] = max(C, I)
7          else if t = #tiles then
8              T[t] = max(C, O)
9          else
10             T[t] = max(C, I + O)
11         end
12     end
13 else if Tiy = Niy and Tof < Nof then
14     for t = 1 to #tiles do
15         if t = 1 then
16             T[t] = max(C, W)
17         else if t = #tiles then
18             T[t] = max(C, O)
19         else
20             T[t] = max(C, W + O)
21         end
22     end
23 else
24     for tf = 1 to #tiles_f do
25         for ty = 1 to #tiles_y do
26             t = ty + (tf − 1) × #tiles_y;
27             if ty = 1 and tf = 1 then
28                 T[t] = max(C, I)
29             else if t = #tiles then
30                 T[t] = max(C, O)
31             else if ty = #tiles_y then
32                 T[t] = max(C, I + W + O)
33             else
34                 T[t] = max(C, I + O)
35             end
36         end
37     end
38 end
39 ms_1CV = Σ_{t=1}^{#tiles} T[t] + ms_Mem

Algorithm 1: Delay estimation of one convolution layer (ms_1CV), where C = ms_Compute, I = ms_RdPx, W = ms_RdWt, and O = ms_WrPx.

VI. SIZE REQUIREMENT OF ON-CHIP MEMORY

With the specific data storage pattern of the buffers, we can calculate the required on-chip buffer sizes more precisely than the rough estimation in Section III-B.

Fig. 6. The convolution data storage pattern in the input pixel buffers. In this example, Pox = Pix = 4, Poy = Piy = 4, Tix = Tiy = 14, Tif = 3, stride = 2 and padding = 1, giving word_1Row = 3 and #rows_1Map = 4; r(i, y) denotes the y-th row of the i-th input feature map and c(x) the x-th column element in one row.

A. Size and Storage of Input Buffers

Fig. 6 illustrates the proposed storage pattern of convolution input pixels, which benefits the dataflow of Pox × Poy pixels from the buffers into the MAC units [11]. The width of one input buffer is determined by Pox to feed data for the parallel computation of Pox pixels in one feature map row. The number of input buffers is determined by Poy to feed data for the parallel computation of Poy multiple output rows. In Fig. 6, c(x) denotes one input pixel in the x-th column of a certain row, where x ∈ {1, 2, . . . , Tix − 2 × padding} and Tix includes both the east and west zero padding. The east and west zero paddings are not stored in the buffers; instead, they are masked out by control logic before loading into the MAC units. The number of addresses or words occupied by one row is

word_1Row = ⌈(Tix − 2 × padding)/Pox⌉.   (18)

In Fig. 6, r(i, y) is the y-th row of the i-th input feature map, where i ∈ {1, 2, . . . , Tif} and y ∈ {1, 2, . . . , Tiy}. The Tiy rows of one input feature map, including the north and south zero paddings if they exist, are distributed across the Poy input buffers. With stride = 2 as in Fig. 6, two adjacent rows are continuously stored in the same buffer according to the dataflow requirement. Then, the number of rows of one feature map, i.e. r(i, y), in one buffer is

#rows_1Map = ⌈⌈Tiy/stride⌉/Poy⌉ × stride.   (19)

The storage locations of the subsequent input feature maps are aligned with the first feature map to simplify the address generation logic, which causes some overhead due to the noncontinuous storage pattern, as shown by the blank spaces in the buffers in Fig. 6. By this means, the depth or word requirement of one input buffer (InBuf) storing Tif input feature maps for one convolution layer is expressed as

word_InBuf = word_1Row · #rows_1Map · Tif.   (20)
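As a sanity check on Equations (18)-(20), the following Python sketch reproduces the numbers annotated in the example of Fig. 6; the keyword-argument form is ours.

```python
from math import ceil

def word_in_buf(tix, tiy, tif, pox, poy, stride, padding):
    """Eqs. (18)-(20): depth (words) of one input buffer for one layer."""
    word_1row = ceil((tix - 2 * padding) / pox)            # Eq. (18)
    rows_1map = ceil(ceil(tiy / stride) / poy) * stride    # Eq. (19)
    return word_1row * rows_1map * tif                     # Eq. (20)

# Parameters of the example in Fig. 6: word_1Row = 3, #rows_1Map = 4, depth = 36.
print(word_in_buf(tix=14, tiy=14, tif=3, pox=4, poy=4, stride=2, padding=1))
```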


The data width of one input buffer is Pox × bit_Px and the number of input buffers is Poy × Dual with Dual = 2, where Dual represents the doubling of the number of buffers due to the dual buffer structure. Therefore, in every clock cycle, Pox × Poy pixels can be fed into the MAC units. The input buffer size requirement of one convolution layer is

bit_InBuf = Dual × Poy × Pox × bit_Px × word_InBuf.   (21)

The final input buffer size is the maximum bit_InBuf of all the convolution layers. The actual input buffer size in Equation (21) is larger than the rough estimation in Equation (5) due to the mismatch of tile and buffer dimensions caused by the specific storage pattern.

Fig. 7. The convolution data storage pattern in the weight buffer. In this example, Tif = 3, Tof = 6, Tkx = Tky = 2 and Pof = 4; w(i, o) denotes one kernel window of the i-th input channel and o-th output channel, and k(x, y) one kernel weight inside the kernel window.

B. Size and Storage of Weight Buffers

The storage pattern of the weight buffer is illustrated in Fig. 7. The k(x, y) in Fig. 7 denotes one weight inside the Nkx × Nky kernel window, where x ∈ {1, 2, . . . , Tkx} and y ∈ {1, 2, . . . , Tky}. In the chosen design, we always have Tkx = Nkx and Tky = Nky, so that one kernel window is fully buffered. These Tkx × Tky weights, i.e. k(x, y), are stored in continuous addresses as we serially compute one kernel window, e.g. Pkx = Pky = 1. In Fig. 7, w(i, o) denotes one kernel window of the i-th input channel and o-th output channel, which is comprised of Tkx × Tky weights. Weights from different input channels (Tif) are stacked in different addresses as we serially compute each input channel. To compute Pof output channels in parallel, the weights of Pof output channels are stored at the same address of the weight buffer. Therefore, the bit width of the weight buffer is Pof × bit_Wt. The word count or depth of the weight buffer (WtBuf) is

word_WtBuf = Tkx × Tky × Tif × ⌈Tof/Pof⌉.   (22)

With dual buffering, the number of weight buffers is two. The weight buffer size requirement of one convolution layer is

bit_WtBuf = Dual · Pof · bit_Wt · word_WtBuf.   (23)

If Tof/Pof is not an integer, some blank spaces in the weight buffer are wasted, as in Fig. 7. The final weight buffer size is the maximum bit_WtBuf of all the convolution layers.

C. Size and Storage of Output Buffers

After every Nkx × Nky × Nif clock cycles, there are Pox × Poy × Pof outputs from the MAC units. To reduce the bit width of the data bus and the bandwidth requirement of the output buffers, as in Fig. 8, the parallel outputs are serialized into Poy × ⌈Pof/#OutBuf⌉ clock cycles, where #OutBuf is the number of output buffers excluding the dual buffer structure, with #OutBuf ≤ Pof. By this means, the data width of one output buffer is Pox × bit_Px, as shown in Fig. 8, to store the parallel Pox outputs from the same feature map.

The output buffer storage pattern is illustrated in Fig. 8, where c(x) is the x-th column element in one row with x ∈ {1, 2, . . . , Tox}, and r(o, y) is the y-th row in the o-th output feature map with o ∈ {1, 2, . . . , Tof} and y ∈ {1, 2, . . . , Toy}. The outputs of the same feature map are continuously stored in the same buffer in a row-major order. One row (r(o, y)) is comprised of Tox elements (c(x)) continuously stored in ⌈Tox/Pox⌉ addresses, and we set Tox = Nox so that one entire row is processed while maintaining the row-major order. One feature map has Toy rows stored in one buffer, and it occupies Toy × ⌈Tox/Pox⌉ addresses. One output buffer stores ⌈Tof/#OutBuf⌉ feature maps. Then, the number of words or the depth of one output buffer (OutBuf) for one convolution layer is

word_OutBuf = ⌈Tof/#OutBuf⌉ × Toy × ⌈Tox/Pox⌉.   (24)

The output buffer size requirement of one convolution layer is

bit_OutBuf = (Dual × #OutBuf) × (Pox × bit_Px) × word_OutBuf.   (25)

If Tof/#OutBuf is not an integer, the blank spaces in the output buffers, as in Fig. 8, are wasted.

Fig. 8. The convolution data storage pattern in the output pixel buffers. In this example, Pox = 4, Tox = 12, Toy = 4, Tof = 6 and #OutBuf = 4.
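A compact Python sketch of the weight and output buffer sizing of Equations (22)-(25); the parameter names and the Dual = 2 default simply mirror the variables defined above.

```python
from math import ceil

def weight_buffer_bits(tkx, tky, tif, tof, pof, bit_wt, dual=2):
    """Eqs. (22)-(23): weight buffer depth (words) and size (bits)."""
    word_wt_buf = tkx * tky * tif * ceil(tof / pof)                # Eq. (22)
    return dual * pof * bit_wt * word_wt_buf                       # Eq. (23)

def output_buffer_bits(tox, toy, tof, pox, n_out_buf, bit_px, dual=2):
    """Eqs. (24)-(25): output buffer depth (words) and size (bits)."""
    word_out_buf = ceil(tof / n_out_buf) * toy * ceil(tox / pox)   # Eq. (24)
    return dual * n_out_buf * pox * bit_px * word_out_buf          # Eq. (25)
```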


D. Size and Storage of Pooling Buffers

The max-pooling layers share the input and output buffers with the convolution layers. Due to the different dataflow requirement, the max-pooling input storage pattern in the input buffers is different from that of the convolution inputs, but it is the same as the storage pattern of the convolution outputs in Fig. 8. In addition, the output buffer storage pattern of max-pooling layers is also the same as that of the convolution outputs in Fig. 8. The pixels from the same feature map are stored in the same buffer, and different feature maps are distributed across different buffers. Therefore, the input and output buffer depth of one tile of max-pooling is similar to Equation (24). The buffer size requirement of pooling layers is ensured to be smaller than that of the convolution layers by using smaller pooling tiling variables, so that there is no overflow of pooling data.

VII. MODELING OF ON-CHIP BUFFER ACCESS

The energy cost of accessing data in the buffers dominates the on-chip memory energy consumption [18] [19], so it is essential to reduce the size of buffer accesses for an energy-efficient design. To reduce the buffer access size, data should be reused as much as possible either by multiple PEs or by different execution tiles, which will be discussed in this section.

A. Reading Input and Weight Buffers of Convolution

Based on Equation (9), to estimate the buffer access we need to compute #cycles_Access first. In this case, #cycles_Access is the number of MAC computation clock cycles of one tile, which is #cycles_1T in Equation (15). Then, the computation clock cycles of all the convolution layers are

#cycles_C = Σ_{L=1}^{#CONVs} #cycles_1T[L] × #tiles[L],   (26)

where #CONVs is the number of convolution layers and #tiles is the number of tiles. The size (bit) of data read (Rd) from the input buffers (InBuf) for the convolution layers is computed by multiplying the read clock cycles with the total input buffer data width as

bit_RdInBuf = #cycles_C · (Pox · Poy · bit_Px),   (27)

where every Pox × Poy pixels are reused by Pof MAC units and the number of input buffer accesses is reduced by Pof times. Similarly, the size (bit) of data read (Rd) from the weight buffers (WtBuf) for all the convolution layers is

bit_RdWtBuf = #cycles_C × (Pof × bit_Wt),   (28)

where every Pof weights are reused by Pox × Poy MAC units and the number of weight buffer accesses is reduced by Pox × Poy times.

B. Writing Input and Weight Buffers of Convolution

Before computation, the input data are written into the input and weight buffers from the DMA. As discussed in Section V-B, not every tile needs to read both pixels and weights from DRAM, because some pixels or weights of one tile can be reused by the following adjacent tiles. The number of tiles of one convolution layer that write new weights (Wt) to the weight buffer is

#tiles_Wt = ⌈Nof/Tof⌉.   (29)

The number of tiles of one convolution layer that write new input pixels (In) to the input buffers is

#tiles_In = ⌈Noy/Toy⌉ · ⌈Nof/Tof⌉, if Toy < Noy and Tof < Nof; otherwise #tiles_In = ⌈Noy/Toy⌉.   (30)

When neither the weights nor the pixels are fully buffered, i.e. Toy < Noy and Tof < Nof, the same pixels are re-loaded ⌈Nof/Tof⌉ times into the input buffers, as shown in Fig. 4(d). Similar to Equation (21), the size (bit) of one tile (1T) of pixels written into the input buffers is

bit_WrIn_1T = word_InBuf · Poy · Pox · bit_Px.   (31)

The size (bit) of data loaded into the input buffers of all the convolution layers is

bit_WrInBuf = Σ_{L=1}^{#CONVs} bit_WrIn_1T[L] × #tiles_In[L].   (32)

Similarly, the size (bit) of one tile of weights written into the weight buffers is

bit_WrWt_1T = word_WtBuf × Pof × bit_Wt,   (33)

and the size (bit) of data written into the weight buffers of all the convolution layers is

bit_WrWtBuf = Σ_{L=1}^{#CONVs} bit_WrWt_1T[L] × #tiles_Wt[L].   (34)

C. Data Access of Output Buffers of Convolution

The number of clock cycles to write outputs into the output buffers during one tile is the same as word_OutBuf, where one word of data is written into one output buffer in one cycle. Since every tile of one layer has outputs to be saved, the number of clock cycles for writing outputs to the output buffers is word_OutBuf × #tiles. Then, the total cycles to load outputs into the output buffers (OutBuf) are summed up across all the convolution layers as

#cycles_WrOutBuf = Σ_{L=1}^{#CONVs} word_OutBuf[L] × #tiles[L].   (35)

The size (bit) of results written into the output buffers is

bit_WrOutBuf = #cycles_WrOutBuf × #OutBuf × Pox × bit_Px.   (36)

Since each output is written into and read from the output buffers only once, the size (bit) of data read from the output buffers (bit_RdOutBuf) by the DMA equals bit_WrOutBuf.

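To illustrate how these per-layer terms are accumulated, here is a small Python sketch of Equations (26)-(28); the list-of-dicts layer description is a hypothetical convenience for the quantities defined above.

```python
def conv_buffer_read_bits(layers, P, bit_px, bit_wt):
    """Eqs. (26)-(28): read traffic (bits) of the input and weight buffers.

    `layers` is assumed to be a list of dicts carrying the per-layer
    'cycles_1T' and 'tiles' values from Eqs. (15) and (3).
    """
    cycles_c = sum(l['cycles_1T'] * l['tiles'] for l in layers)   # Eq. (26)
    bit_rd_in_buf = cycles_c * (P['ox'] * P['oy'] * bit_px)       # Eq. (27)
    bit_rd_wt_buf = cycles_c * (P['of'] * bit_wt)                 # Eq. (28)
    return bit_rd_in_buf, bit_rd_wt_buf
```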

the number of DRAM accesses, and the accelerator performance. Although we have fixed Tkx = Nkx, Tky = Nky, Tif = Nif, and Tox = Nox, as mentioned in Section II-D, the remaining two tiling variables Toy and Tof still give us a huge design space, as mentioned in [11]. For example, VGG-16 has 13 convolution layers, so there are 13 × 2 = 26 tiling variables, and each variable can have four or more candidate values determined by Noy/Poy or Nof/Pof; the total number of Toy and Tof choices is then roughly 4^26 ≈ 4.5 × 10^15, an enormous solution space that cannot be enumerated. Therefore, we randomly sample 30,000 tiling configurations for different CNN algorithms to explore their impact on the memory access and performance in Fig. 9, Fig. 10, and Fig. 11, where we set the loop unrolling variables as Pox × Poy × Pof = 7 × 7 × 32.

[Figure: DRAM access size versus total buffer size for (a) NiN, (b) VGG-16, (c) GoogLeNet, and (d) ResNet-50, with the red "Our design point" marker highlighting the selected tiling.]
Fig. 9. The tiling variables (Toy and Tof) are swept to explore the relationship between the size of DRAM accesses and the total input/weight/output buffer size requirement, where Pox × Poy × Pof = 7 × 7 × 32 with 16-bit data.

The relationship between the tiling variables and the number of DRAM accesses is investigated in Fig. 9 with 16-bit data. The total convolution DRAM access size is computed by

$$byte\_DRAM = \sum_{L=1}^{\#CONVs} \left( byte\_RdPx \cdot \#tiles\_In + byte\_RdWt \cdot \#tiles\_Wt + byte\_WrPx \cdot \#tiles \right), \tag{37}$$

where the right-hand-side variables are computed by Equations (11), (12), (14), (29), and (30). The DRAM accesses of the other layers are also included in Fig. 9. One circle in Fig. 9 represents one design point of the tiling variables Toy and Tof. Since the buffer size is determined by the layer with the maximum tiling size, multiple different tiling configurations in the other layers can lead to the same buffer size. The buffer size in Fig. 9 includes the input/weight/output buffers, which equals max(bit_InBuf) + max(bit_WtBuf) + max(bit_OutBuf) from Equations (21), (23), and (25). With the increase of the tiling and buffer sizes, the number of DRAM accesses decreases, as shown by the dashed line in Fig. 9. After the buffer size is increased to be large enough, we can achieve the minimum DRAM accesses. The red dot in Fig. 9 is our optimal design choice of Toy and Tof, which balances the buffer size requirement and the number of DRAM accesses.

[Figure: convolution throughput versus total buffer size for (a) NiN, (b) VGG-16, (c) GoogLeNet, and (d) ResNet-50, with the red "Our design point" marker highlighting the selected tiling.]
Fig. 10. The tiling variables (Toy and Tof) are swept to explore the relationship between the convolution throughputs and the total input/weight/output buffer size requirement, where Pox × Poy × Pof = 7 × 7 × 32, MHz_Accelerator = 240, BW_DRAM = 14.4 GB/s.

Fig. 10 shows the relationship between the tiling sizes and the convolution throughputs, where the accelerator operating frequency is 240 MHz and the DRAM bandwidth is 14.4 GB/s. The throughput is computed by #operations/delay, where #operations = 2Nm, including both multiplications and additions, and delay is the sum of ms_1CV over all the convolution layers. If the tiling or buffer size is too small, the number of DRAM accesses and the associated latency increase significantly, which degrades the throughput. If the tiling size is too large, or there is only one tile in one layer, the DRAM access latency cannot be well overlapped with the computation delay, as mentioned in Section V-B, which also results in lower throughput. This trend is shown by the dashed line in Fig. 10. The dashed lines of GoogLeNet and ResNet-50 are not as smooth as those of NiN and VGG-16, mainly because GoogLeNet and ResNet-50 have more layers and thus a much larger design space, which makes it more difficult to cover all the design choices through random sampling. The red dots in Fig. 10 are our design choices of Toy and Tof, the same as in Fig. 9, which achieve the best throughputs.

Fig. 11 shows the relationship between the tiling sizes and the number of on-chip buffer accesses for different CNN algorithms, which includes both read and write operations of the input/weight/output buffers of all the layers in a given CNN algorithm. Based on our acceleration strategy [11], the partial sums are accumulated inside the MAC units, which does not involve buffer accesses. The estimation of the number of on-chip buffer accesses is discussed in Section VII. Our design choices of Toy and Tof, shown by the red dots in Fig. 11, achieve close to the optimal number of buffer accesses while providing the best throughputs and a low level of DRAM accesses.
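To make the tiling sweep in Figs. 9-11 concrete, the following minimal Python sketch evaluates Equations (29), (30), and (37) for one candidate tiling configuration. The per-layer byte counts byte_RdPx, byte_RdWt, and byte_WrPx, as well as the total tile count #tiles, come from Equations (11), (12), and (14), which are not reproduced here; the dictionary-based layer interface is an assumption for illustration only.

```python
import math

def num_tiles_wt(Nof, Tof):
    # Equation (29): tiles that load new weights into the weight buffer.
    return math.ceil(Nof / Tof)

def num_tiles_in(Noy, Toy, Nof, Tof):
    # Equation (30): tiles that load new input pixels into the input buffers.
    if Toy < Noy and Tof < Nof:
        return math.ceil(Noy / Toy) * math.ceil(Nof / Tof)
    return math.ceil(Noy / Toy)

def conv_dram_bytes(conv_layers):
    # Equation (37): total DRAM traffic of all convolution layers for one tiling
    # configuration. Each layer carries its dimensions, its tiling (Toy, Tof),
    # and byte counts assumed to be precomputed from Eqs. (11), (12), and (14).
    total = 0
    for L in conv_layers:
        total += (L["byte_RdPx"] * num_tiles_in(L["Noy"], L["Toy"], L["Nof"], L["Tof"])
                  + L["byte_RdWt"] * num_tiles_wt(L["Nof"], L["Tof"])
                  + L["byte_WrPx"] * L["num_tiles"])
    return total
```

In the random sweep described above, each of the 30,000 samples would draw (Toy, Tof) per layer from its candidate set and record conv_dram_bytes together with the resulting buffer requirement and the modeled delay.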


[Figure: on-chip buffer access size versus total buffer size for (a) NiN, (b) VGG-16, (c) GoogLeNet, and (d) ResNet-50, with the red "Our design point" marker highlighting the selected tiling.]
Fig. 11. The tiling variables (Toy and Tof) are swept to explore the relationship between the size of on-chip buffer accesses and the size requirement of buffers, where Pox × Poy × Pof = 7 × 7 × 32.

B. Design Space Exploration for Performance

As convolution dominates the CNN operations [2] [20] [3] [4], we focus on the design space exploration of the convolution throughputs. The convolution throughput is affected by several factors, namely the accelerator operating frequency, the external memory bandwidth, and the loop unrolling variables. These factors are explored in Fig. 12 using GoogLeNet as an example. With a small number of MAC units and a high DRAM bandwidth (BW_DRAM), as shown in Fig. 12(a), the accelerator throughput is mainly bounded by computation, and thus the throughput increases almost linearly with the frequency when BW_DRAM > 12.8 GB/s. If the DRAM bandwidth is too low, e.g. 3.2 GB/s, the design is more likely to be memory bounded and the throughput stops increasing with the frequency. With more MAC units and higher frequency, the throughputs tend to increase, as shown in Fig. 12, until the design touches the memory roof, which is illustrated in Fig. 13.

[Figure: convolution throughput versus frequency under several DRAM bandwidths for (a) Pox = 7, Poy = 7, Pof = 8, (b) Pox = 7, Poy = 7, Pof = 16, (c) Pox = 7, Poy = 7, Pof = 32, and (d) Pox = 14, Poy = 7, Pof = 32.]
Fig. 12. The convolution throughput is affected by the accelerator operating frequency, DRAM bandwidth, and the number of MAC units. GoogLeNet is shown as an example here.

The memory roof throughput [7] in Fig. 13 is the maximum achievable throughput under a certain external memory bandwidth, and it is defined as

$$DRAM\_roof\,(GOPS) = \frac{\#operations\,(GOP)}{DRAM\_delay\,(s)} = \frac{\#operations\,(GOP)}{\#data\,(GByte)} \cdot BW\_Memory\,(GB/s), \tag{38}$$

where #data is the data size of the DRAM accesses.

Since the computation-to-communication ratio (CTC), i.e. #operations/#data, is a constant under a certain tiling setting, DRAM_roof is directly proportional to BW_Memory. With the same setting of BW_Memory for GoogLeNet and VGG-16, the shapes of the curves in Fig. 13(a) and (b) are similar. Since VGG-16 has a higher CTC, its memory roof throughput is much higher than that of GoogLeNet in Fig. 13. As discussed in Section IV-B, the memory bandwidth (BW_Memory) is bounded by both the DRAM controller (BW_DRAM) and the DMA (BW_DMA). At low frequency, BW_Memory is limited by BW_DMA, and DRAM_roof increases linearly with the frequency, as in Fig. 13. After BW_DMA becomes larger than BW_DRAM, BW_Memory is limited by BW_DRAM instead, and DRAM_roof stops growing with the frequency. The saturated throughputs in Fig. 12 are lower than DRAM_roof in Fig. 13, mainly because there are redundant DRAM transfers and the computation delay is not fully overlapped with the DRAM latency.

[Figure: DRAM_roof versus frequency for (a) GoogLeNet and (b) VGG-16.]
Fig. 13. The external memory roof throughput (DRAM_roof) is the maximum achievable throughput under a certain memory bandwidth.
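To tie Equation (38) to the bandwidth limits discussed above, the sketch below computes DRAM_roof as a function of the accelerator frequency. The 512-bit DMA width is taken from Section IX-A, but the assumption that the DMA bus runs at the accelerator frequency, the function interface, and the example operation/data counts are illustrative simplifications rather than the exact formulation used in the performance model.

```python
def dram_roof_gops(ops_gop, data_gbyte, freq_mhz, bw_dram_gbs, dma_bits=512):
    # Equation (38): DRAM_roof = CTC * BW_Memory, with CTC = #operations/#data.
    ctc = ops_gop / data_gbyte
    # DMA bandwidth grows with the bus frequency (512-bit bus assumed here).
    bw_dma_gbs = dma_bits / 8 * freq_mhz * 1e6 / 1e9
    # BW_Memory is bounded by both the DRAM controller and the DMA (Section IV-B).
    bw_memory = min(bw_dram_gbs, bw_dma_gbs)
    return ctc * bw_memory

# Hypothetical operation/data counts: DRAM_roof grows linearly with frequency
# until the DMA bandwidth exceeds BW_DRAM = 14.4 GB/s, then it saturates.
for f_mhz in (100, 150, 200, 240, 300):
    print(f_mhz, round(dram_roof_gops(3.9, 0.5, f_mhz, 14.4), 1))
```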
C. Performance Model Validation

Fig. 14 shows the comparison of throughput and latency between the performance model and the on-board test results on Arria 10 and Stratix 10 with different numbers of MAC units, where both the pixels and the weights are 16-bit fixed-point data. The differences between the estimation and the on-board results are within 3%, and they are mainly due to the DRAM transfer latency mismatch, minor layers (e.g. average pooling), and some pipeline stages in the real implementation. The compilation of our FPGA design using Quartus Pro 17.1 on a 16-core Intel Xeon CPU E5-2650 v3 normally takes six to eight hours, while the performance model running on a laptop Intel Core i7-7500U CPU using MATLAB takes about 1 to 5 seconds per design.


[Figure: bar charts of (a) overall throughput (GOPS) and (b) overall latency (ms) for NiN, VGG-16, GoogLeNet, ResNet-50, and ResNet-152, comparing 14×7×32 MACs (Arria 10) and 14×7×64 MACs (Stratix 10) with the performance model.]
Fig. 14. The performance model results are compared with on-board test results of Arria 10 and Stratix 10 on overall (a) throughput and (b) latency.

D. Related Works

Several related works have used a performance model to optimize the memory access and computation pattern of their proposed architecture and dataflow. Suda et al. [8] implement convolution as matrix multiplication and use a performance model to optimize the design. However, the execution time in [8] only counts the computation time without considering the DRAM transfer latency. If the design becomes memory-bounded, the model in [8] cannot properly predict the overall latency, which results in the estimation discrepancy of fully-connected layers with high computation parallelism. The proposed systolic array architecture in [10] is also optimized through a performance model. The overall throughput is simply computed as the minimum of the computation throughput and the DRAM transfer throughput, where the overlap efficiency of computation and data transfer is not considered. The fine-grained tile-level data accesses of DRAM and buffers are not explored in [10]. The buffer and DRAM accesses are modeled in [18] to explore different data reuse patterns by changing the tiling strategy and computation order. Only coarse-grained modeling of the convolution memory access is analyzed, without considering the DRAM bandwidth utilization and the detailed data storage patterns in the buffers and DRAM. The proposed Hybrid Data Reuse in [18] is similar to our tiling strategy in that different layers can use different tiling sizes to either reuse weights or pixels to minimize the DRAM access. In our work, the relationship between the overall DRAM access and the total buffer size is also investigated. The power of data movement in different hierarchies, e.g. DRAM, buffer, and PE array, is analytically modeled in [19] to compare the energy efficiency of different dataflows. However, the power is not quantitatively formulated with the design variables in [19], and the performance of the accelerator is not modeled.

IX. FURTHER IMPROVEMENT OPPORTUNITIES

In this section, we use the proposed performance model to evaluate the opportunities that may further enhance the performance of the accelerator by improving the efficiency of DRAM transactions and DSP utilization.

A. Improving DRAM Bandwidth Utilization

To simplify the control logic of the data bus from the DMA to the input buffers, different feature map rows are aligned at different addresses in our current design. By this means, if the number of pixels in one row is smaller than ⌊bit_DMA/bit_Px⌋, the successive row directly starts from the next address instead of continuing in the same address, which wastes DMA data width. For example, with bit_Px = 16, one address can accommodate 512/16 = 32 pixels; if the width of the feature map is Nix = 14, then the actual number of pixels of one row read from DRAM in Equation (11) is Tix = 32, where 32 − 14 = 18 data are redundant. Some CNN models, e.g. GoogLeNet and ResNet, have many convolution layers with small Nix, e.g. 7 or 14, so their throughputs are significantly affected by the inefficient utilization of the DMA data width.

To improve the DRAM bandwidth utilization, one method is to store multiple rows in one DMA address, which involves modifications of the control logic and extra data paths from the DMA to the input buffers. The other method is to keep the data aligned but narrow the bit width of the data bus between the DMA and the input buffers. To attain the same data transfer rate, a higher frequency is needed, and an asynchronous FIFO may be used. In the performance model, we reduce bit_DMA to 256 and 128 and increase the corresponding frequency of the data bus to predict the throughput improvements. In Fig. 15, our current design (DMA 512-bit) serves as the baseline with data aligned, and bit_DMA is set to 256 or 128, which has the same effect as supporting two or four rows in one address with bit_DMA = 512, respectively. Fig. 15 shows that NiN, GoogLeNet, and ResNet can benefit considerably from decreasing the DMA bit width, mainly because they have many layers with small Nix and those layers are memory bounded. On the contrary, VGG-16 cannot benefit from higher DRAM bandwidth utilization, as the design is still computation bounded. Based on this prediction, it is compelling to improve our design for higher DRAM bandwidth utilization.

[Figure: throughput (GOPS) of (a) NiN, (b) GoogLeNet, (c) VGG-16, and (d) ResNet-50 with 14×7×32 and 14×7×64 PEs, comparing DMA 512-bit (our current design) against 256-bit and 128-bit model predictions.]
Fig. 15. The performance model predicts that the throughput will be improved by increasing the DRAM bandwidth utilization, which is achieved by decreasing the DMA bit width to reduce the redundant DRAM accesses.
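To illustrate the row-alignment effect described above, the small sketch below estimates what fraction of a DRAM transfer carries useful pixels for a given feature map width and DMA width; the helper names are assumptions for illustration, not the model's actual code.

```python
import math

def row_read_pixels(Nix, bit_dma=512, bit_px=16):
    # With row-aligned storage, each feature map row occupies a whole number of
    # DMA words, so the pixels actually read per row round up to a multiple of
    # the DMA word capacity (this is the Tix used in Equation (11)).
    px_per_word = bit_dma // bit_px
    return math.ceil(Nix / px_per_word) * px_per_word

def dma_utilization(Nix, bit_dma=512, bit_px=16):
    # Fraction of the transferred pixels that are useful (not alignment padding).
    return Nix / row_read_pixels(Nix, bit_dma, bit_px)

# Example from the text: Nix = 14 with a 512-bit DMA and 16-bit pixels reads 32
# pixels per row, so only 14/32 of the data is useful; narrowing the bus to
# 128 bits raises the utilization to 14/16.
print(dma_utilization(14, 512), dma_utilization(14, 128))
```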
B. Merging the First Layers

In GoogLeNet and ResNet, there are multiple parallel branches of layers, and the first layer of each branch reads input
pixels from the same precedent layer. If these convolution layers also have the same kernel size and stride, they can be merged into one layer along the output feature map dimension (Nof). By this means, the input pixels can be shared by the first layers of the different branches and only need to be read from DRAM once, as proposed in [13]. We change the corresponding settings of our performance model, e.g. byte_RdPx in Equation (11), to estimate the effect of eliminating the repeated DRAM accesses of the precedent layer, as shown in Fig. 16. Since GoogLeNet and ResNet are already memory-bounded in our current design, reducing the DRAM access can considerably improve the throughputs. The required modifications of our current design to merge the first layers involve changing the control logic and the descriptors of the DMA transactions, and there is no significant overhead of additional hardware resources.

[Figure: throughput (GOPS) of (a) GoogLeNet and (b) ResNet-50 with 14×7×32 and 14×7×64 PEs, comparing "Normal" (our current design) with the "First Layers Merged" model prediction.]
Fig. 16. The performance model predicts that the throughput will be improved by merging the first layers of different parallel branches, which read from the same precedent layer, to eliminate the repeated DRAM access, where "Normal" denotes our current design as baseline.
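As a rough sketch of how this experiment adjusts the model (the paper only states that byte_RdPx in Equation (11) is modified; the function below and its arguments are illustrative assumptions), merging the first layers of the branches amounts to charging the shared precedent-layer read once instead of once per branch:

```python
def branch_input_dram_bytes(byte_rdpx_precedent, num_branches, merged):
    # Unmerged: every branch's first layer fetches the same precedent-layer
    # pixels from DRAM, so the input traffic is paid once per branch.
    # Merged along Nof: the shared pixels are fetched only once; the weight and
    # output traffic of the branches is unaffected by the merge.
    return byte_rdpx_precedent * (1 if merged else num_branches)
```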

C. Improving PE Efficiency

Due to the highly varying dimensions of different convolution layers in a given CNN model, it is a challenging task to efficiently distribute the workloads across the PEs, or we need to make the loop dimensions (N∗) divisible by their corresponding unrolling variables (P∗). In [21] [22], adaptive parallelism schemes are proposed to dynamically adjust the mapping of operations onto different PEs, or the unrolling variables can be changed for each layer to maximize the PE utilization. This requires the ability to dynamically redirect the data flow from the buffers to the PEs, which may need complex control logic, incur the penalty of additional resources, and aggravate the burden on timing closure.

Instead of using uniform PE mapping and unrolling variables as in the current design, we adjust the unrolling variables (Pox · Poy · Pof) for different layers to achieve better PE utilization in the performance model, as shown by "Adjustable" in Fig. 17. We also force the PE utilization to be 100% by removing the ceiling functions in Equation (2), which is denoted by "Ideal" in Fig. 17. However, the throughput improvements from the adjustable unrolling strategy are very limited (< 10%) for our design, mainly because 1) the Nox·Noy·Nof dimensions of most layers already provide enough parallelism for our uniform unrolling strategy, and 2) most of our layers are memory-bounded, so the reduction of the computation latency has little effect on the throughput. Considering the large amount of design effort necessary for adjustable PE mapping and the low expected improvements, we surmise that adopting this technique is not a primary task in our future work.

[Figure: throughput (GOPS) of (a) NiN, (b) GoogLeNet, (c) VGG-16, and (d) ResNet-50 with 14×7×32 and 14×7×64 PEs under the Uniform, Adjustable, and Ideal schemes.]
Fig. 17. Uniform: our current design as baseline with uniform PE mapping; Adjustable: dynamically adjust the unrolling variables for different layers to improve PE utilization; Ideal: force PE utilization to be 100%.
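To make the utilization effect that "Ideal" removes explicit, the sketch below compares the useful MAC operations of a layer with the operations issued after each output-loop dimension is rounded up to a multiple of its unrolling variable; the function name and interface are illustrative, not taken from the paper.

```python
import math

def pe_utilization(Nox, Noy, Nof, Pox, Poy, Pof):
    # Useful output positions versus positions issued once the ceiling functions
    # round each output-loop dimension up to a multiple of its unroll factor.
    issued = (math.ceil(Nox / Pox) * Pox
              * math.ceil(Noy / Poy) * Poy
              * math.ceil(Nof / Pof) * Pof)
    return (Nox * Noy * Nof) / issued

# A 7x7 output map with 1024 output channels maps perfectly onto
# Pox x Poy x Pof = 7 x 7 x 32, while 48 output channels leave PEs idle.
print(pe_utilization(7, 7, 1024, 7, 7, 32))  # 1.0
print(pe_utilization(7, 7, 48, 7, 7, 32))    # 0.75
```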
X. CONCLUSIONS

In this work, a high-level performance model is proposed to estimate the key specifications, e.g. throughput, of FPGA accelerators for CNN inference, which enables design space exploration to identify the performance bottleneck in the early development phase. The design strategy and resource costs are formulated using the design variables of loop unrolling and tiling. The proposed performance model is validated for a specific acceleration strategy across a variety of CNN algorithms by comparing with on-board test results on two different FPGAs.

REFERENCES

[1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Conf. on Neural Information Processing Systems (NIPS), 2012.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun. 2016.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Jun. 2015.
[6] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proc. of Conf. on Artificial Intelligence (AAAI), Feb. 2017.
[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in ACM/SIGDA Int. Sym. on Field-Programmable Gate Arrays (FPGA), 2015.
[8] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in ACM/SIGDA Int. Sym. on Field-Programmable Gate Arrays (FPGA), 2016.
[9] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in Proc. of Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2016.
[10] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proc. of Design Automation Conference (DAC), 2017.

[11] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 2018.
[12] H. Zeng, R. Chen, C. Zhang, and V. K. Prasanna, "A framework for generating high throughput CNN implementations on FPGAs," in Proc. of ACM/SIGDA Int. Sym. on Field-Programmable Gate Arrays (FPGA), Feb. 2018.
[13] X. Lin, S. Yin, F. Tu, L. Liu, X. Li, and S. Wei, "LCP: A layer clusters paralleling mapping method for accelerating Inception and Residual networks on FPGA," in Proc. of Design Automation Conference (DAC), Jun. 2018.
[14] K. Pavel and S. David, "Algorithms for efficient computation of convolution," in IntechOpen, DOI: 10.5772/51942, Design and Architectures for Digital Signal Processing, Jan. 2013.
[15] J. Yu, K. Guo, Y. Hu, X. Ning, J. Qiu, H. Mao, S. Yao, T. Tang, B. Li, Y. Wang, and H. Yang, "Real-time object detection towards high power efficiency," in Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar. 2018.
[16] C. Zhang and V. K. Prasanna, "Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system," in Proc. of ACM/SIGDA Int. Sym. on Field-Programmable Gate Arrays (FPGA), 2017.
[17] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, "An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks," in Int. Conf. on Field Programmable Logic and Applications (FPL), Sep. 2017.
[18] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, "Deep convolutional neural network architecture with reconfigurable computation patterns," IEEE Trans. VLSI Syst., vol. 25, no. 8, pp. 2220–2233, 2017.
[19] Y. Chen, J. S. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ACM/IEEE Int. Sym. on Computer Architecture (ISCA), Jun. 2016.
[20] M. Lin, Q. Chen, and S. Yan, "Network In Network," CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400
[21] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, "C-brain: a deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization," in Proc. of Design Automation Conference (DAC), Jun. 2016.
[22] M. Putic, S. Venkataramani, S. Eldridge, A. Buyuktosunoglu, P. Bose, and M. Stan, "Dyhard-DNN: even more DNN acceleration with dynamic hardware reconfiguration," in Proc. of Design Automation Conference (DAC), Jun. 2018.

Yufei Ma (S'16) received the B.S. degree in information engineering from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011, and the M.S.E. degree in electrical engineering from the University of Pennsylvania, Philadelphia, PA, USA, in 2013. He is currently pursuing the Ph.D. degree with Arizona State University, Tempe, AZ, USA.
His current research interests include the high-performance hardware acceleration of deep learning algorithms on digital application-specific integrated circuits and field-programmable gate arrays.

Yu Cao (S'99-M'02-SM'09-F'17) received the B.S. degree in physics from Peking University in 1996. He received the M.A. degree in biophysics and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1999 and 2002, respectively. He worked as a summer intern at Hewlett-Packard Labs, Palo Alto, CA, in 2000, and at IBM Microelectronics Division, East Fishkill, NY, in 2001. After working as a post-doctoral researcher at the Berkeley Wireless Research Center (BWRC), he is now a Professor of Electrical Engineering at Arizona State University, Tempe, Arizona. He has published numerous articles and two books on nano-CMOS modeling and physical design. His research interests include physical modeling of nanoscale technologies, design solutions for variability and reliability, reliable integration of post-silicon technologies, and hardware design for on-chip learning.
Dr. Cao was a recipient of the 2012 Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI; the 2010, 2012, 2013, 2015, and 2016 Top 5% Teaching Award, Schools of Engineering, Arizona State University; the 2009 ACM SIGDA Outstanding New Faculty Award; the 2009 Promotion and Tenure Faculty Exemplar, Arizona State University; the 2009 Distinguished Lecturer of the IEEE Circuits and Systems Society; the 2008 Chunhui Award for outstanding overseas Chinese scholars; the 2007 Best Paper Award at the International Symposium on Low Power Electronics and Design; the 2006 NSF CAREER Award; the 2006 and 2007 IBM Faculty Award; the 2004 Best Paper Award at the International Symposium on Quality Electronic Design; and the 2000 Beatrice Winner Award at the International Solid-State Circuits Conference. He has served as an Associate Editor of the IEEE Transactions on CAD and on the technical program committees of many conferences.

Sarma Vrudhula (M'85-SM'02-F'16) is a Professor of Computer Science and Engineering with Arizona State University, and the Director of the NSF I/UCRC Center for Embedded Systems. His work spans several areas in design automation and computer-aided design for digital integrated circuits and systems, focusing on low power circuit design and energy management of circuits and systems. Specific topics include: energy optimization of battery powered computing systems, including smartphones, wireless sensor networks, and IoT systems that rely on energy harvesting; system level dynamic power and thermal management of multicore processors and systems-on-chip (SoC); statistical methods for the analysis of process variations; statistical optimization of performance, power, and leakage; and new circuit architectures of threshold logic circuits for the design of ASICs and FPGAs. More recently he has been investigating non-conventional methods for implementing logic, including technology mapping with threshold logic circuits, the implementation of threshold logic using resistive memory devices, and the design and optimization of non-volatile logic. Prior to ASU, he was a Professor in the ECE department at the University of Arizona, Tucson, AZ, and was on the faculty of the EE-Systems department at the University of Southern California. He was also the Founding Director of the NSF Center for Low Power Electronics at the University of Arizona. He received the B.Math. degree from the University of Waterloo, Waterloo, ON, Canada, and the M.S.E.E. and Ph.D. degrees in electrical and computer engineering from the University of Southern California, Los Angeles, USA.

Jae-sun Seo (S'04-M'10-SM'17) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 2001, and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2006 and 2010, respectively.
From 2010 to 2013, he was with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where he worked on cognitive computing chips under the DARPA SyNAPSE Project and energy-efficient integrated circuits for high-performance processors. In 2014, he joined the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA, as an Assistant Professor. In 2015, he was with the Intel Circuits Research Lab as a Visiting Faculty. His current research interests include efficient hardware design of machine learning and neuromorphic algorithms and integrated power management.
Dr. Seo was a recipient of the Samsung Scholarship during 2004–2009, the IBM Outstanding Technical Achievement Award in 2012, and the NSF CAREER Award in 2017.