Performance Modeling For CNN Inference Accelerators On FPGA
Abstract—The recently reported successes of convolutional neural networks (CNNs) in many areas have generated wide interest in the development of FPGA-based accelerators. To achieve high performance and energy efficiency, an FPGA-based accelerator must fully utilize the limited computation resources and minimize the data communication and memory access, both of which are impacted and constrained by a variety of design parameters, e.g. the degree and dimension of parallelism, the size of on-chip buffers, the bandwidth of the external memory, and many more. The large design space of the accelerator makes it impractical to search for the optimal design in the implementation phase. To address this problem, a performance model is described to estimate the performance and resource utilization of an FPGA implementation. By this means, the performance bottleneck and design bound can be identified and the optimal design option can be explored early in the design phase. The proposed performance model is validated using a variety of CNN algorithms, comparing the estimated results with on-board test results on two different FPGAs.

Index Terms—Convolutional neural networks, FPGA, analytical modeling.

I. INTRODUCTION

…limited by either the computation delay or the DRAM transfer delay, and the actual bound will be determined by the values of the associated design parameters, as described by the roofline model in [7] [9]. The computation delay is determined by the number of parallel processing engines (PEs), their utilization, and the operating frequency. The DRAM transfer latency is mainly affected by the external memory bandwidth and the number of DRAM accesses, and the latter is strongly affected by the size of the on-chip buffers. With regard to the energy efficiency (i.e. performance per watt), the main components that determine the dynamic power consumption are the computation logic and the memory traffic, the latter requiring efficient data movement and high data reuse. All these considerations show that there are numerous design parameters that determine the performance and energy efficiency of a CNN accelerator, making it impractical to find their optimal values during the implementation phase, as the synthesis of one FPGA design may take several hours. Robust and parametric models become a necessity for efficient design space exploration and selection of the optimal values of the design parameters. The architectural design space must be numerically characterized by design
…
Fig. 2. Convolution operation is implemented by four levels of loops to multiply and accumulate input features with kernel weights, where i: input, o: output, k: kernel, f: feature, x: x axis (width), and y: y axis (height), and the parameters of loop dimensions are prefixed with "N" [11].

the proposed performance model makes it possible to identify the performance bottleneck and design limitations in the early development phase by exploring the design space through unrolling and tiling variables.

…

The main contributions of this work are:
• The design objectives and resource costs are formulated using the design variables of loop unrolling and tiling.
• A high-level performance model is proposed to estimate the accelerator throughput, the on-chip buffer size and the number of external and on-chip memory accesses.
• The design space is efficiently explored through the proposed model instead of the real FPGA compilation, to identify the performance bottleneck and obtain the optimal design configurations.
• The performance model is validated for a specific design strategy across a variety of CNN algorithms, comparing with the on-board test results on two different FPGAs.
• The techniques that may further enhance the performance of our current design, by improving the efficiency of DRAM transactions and PE utilization, are evaluated.

The remainder of this paper is organized as follows. Section II overviews the procedure to map CNN operations onto an FPGA hardware system. A coarse-grained performance model is presented in Section III for rough estimation, and the fine-grained model is discussed in the following sections for a specific design strategy. Section IV estimates the size and latency of DRAM accesses. The latencies of convolution and other layers are formulated and estimated in Section V. The on-chip buffer size requirement is analyzed in Section VI, and the size of buffer access is discussed in Section VII. Experiments are performed to explore the design space in Section VIII. Section IX evaluates the techniques that may further improve the current design performance.

II. CNN INFERENCE ACCELERATOR ON FPGA

A. Overview of Convolution Operation

The main operation in CNN algorithms involves accumulating the products of pixel values (e.g. features, activations or neurons) with kernel weights, along different dimensions of the kernel and feature maps. Fig. 2 shows the four nested loops involved in CNNs. Note that the prefix "N" (for "number") used in describing the various parameters (e.g. Nix, Niy, Nif, etc.) denotes the sizes of the kernel and feature maps. The loop operations shown in Fig. 2 are written as:

pixel_L(no; x, y) = Σ_{ni=1}^{Nif} Σ_{ky=1}^{Nky} Σ_{kx=1}^{Nkx} pixel_{L−1}(ni; S×x+kx, S×y+ky) × weight(ni, no; kx, ky) + bias(no),    (1)

where S is the sliding stride, x ∈ {1, 2, . . . , Nox}, y ∈ {1, 2, . . . , Noy}, no ∈ {1, 2, . . . , Nof}, L ∈ {1, 2, . . . , #CONVs}, and #CONVs is the number of convolution layers.

Fig. 3. The convolution acceleration strategy is customized by loop unrolling (P*) for parallel computation and loop tiling (T*) for data buffering. The parallelism is within one input feature map (Pix × Piy) and across multiple kernel maps (Pof). The demanded buffer size can be changed by tuning the variables Toy and Tof [11]. (In the illustrated example, Pox × Poy × Pof = 2×2×3 with Pif = Pkx = Pky = 1, the buffered data are marked by T*, Tix = Nix, Tif = Nif, Tox = Nox, Tkx = Nkx, Tky = Nky, and 1 ≤ P* ≤ T* ≤ N*.)

B. CNN Hardware Acceleration System

In the general model of a CNN accelerator shown in Fig. 1, due to the large data volume, both the weights and the intermediate pixel results are stored in the external memory, e.g. DRAM. The input and weight on-chip buffers temporarily store the input data to be processed by the PEs, and the PE results are saved in the output buffers. After completing the computation, the results are transferred back to DRAM from the output buffers, to be used as the input to the subsequent layer.

C. Convolution Loop Optimization

Loop optimization techniques [7] [11], e.g. unrolling and tiling, are employed to customize the computation and communication patterns in a CNN accelerator. Loop unrolling directs the parallel computation along different convolution dimensions, and the variables representing the unrolling degrees are prefixed by "P" (see Fig. 3). These variables determine the number of PEs, which in turn determines the required number of DSPs in the FPGA to implement the PEs, and thus decides the computation delay. The data flow from the buffers into the PEs is also impacted by the loop unrolling variables, which affect the number
of buffer access. Loop tiling divides a large CNN layer into multiple small tiles, which can be accommodated by the on-chip buffers to increase data locality. Tiling sizes are represented by variables prefixed with "T", as shown in Fig. 3. The required buffer capacity is determined by the tiling variables, which also affect the DRAM access and thus the latency of DRAM transactions. The relationship between the loop variables and the key specifications of accelerators, e.g. delay, DSP usage, buffer size, and memory access that affects the memory power consumption, is shown in Fig. 1.

TABLE I
LIST OF ABBREVIATIONS AND UNITS

Abbreviation  Description      Abbreviation  Description
Px            Pixel            Rd            Read
Wt            Weight           Wr            Write
Buf           Buffer           InBuf         Input Buffer
WtBuf         Weight Buffer    OutBuf        Output Buffer
BW            Bandwidth        1T            One Tile

Unit          Description      Unit          Description
bit / byte    Data Size        word          RAM Depth
ms            Delay Time       MHz           Frequency
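To make the four loop levels of Equation (1) concrete before any unrolling or tiling is applied, the following minimal NumPy sketch walks the same nested loops directly; the zero-based indexing, the absence of padding, and the array layouts are illustrative assumptions rather than the accelerator's actual dataflow.

import numpy as np

def conv_layer(pixels_in, weights, bias, S=1):
    """Direct four-level convolution loops of Equation (1), no unrolling/tiling.

    pixels_in : array (Nif, Nix, Niy), input feature maps of layer L-1
    weights   : array (Nof, Nif, Nkx, Nky), kernel weights
    bias      : array (Nof,), one bias per output feature map
    S         : sliding stride
    """
    Nof, Nif, Nkx, Nky = weights.shape
    _, Nix, Niy = pixels_in.shape
    Nox = (Nix - Nkx) // S + 1
    Noy = (Niy - Nky) // S + 1
    out = np.zeros((Nof, Nox, Noy))
    for no in range(Nof):                      # loop across output feature maps
        for y in range(Noy):                   # loops within one output feature map
            for x in range(Nox):
                acc = bias[no]
                for ni in range(Nif):          # loop across input feature maps
                    for ky in range(Nky):      # loops within one kernel window
                        for kx in range(Nkx):
                            acc += (pixels_in[ni, S * x + kx, S * y + ky]
                                    * weights[no, ni, kx, ky])
                out[no, x, y] = acc
    return out

Loop unrolling (the P* variables) corresponds to executing groups of these loop iterations on parallel MAC units, while loop tiling (the T* variables) bounds how much of each array must reside in the on-chip buffers at any time.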
…the size (bytes) of input pixels read (Rd) from DRAM for one convolution tile is byte_RdPx = bit_InBuf/8. The size (bytes) of one tile of weights (Wt) read from the DRAM is byte_RdWt = bit_WtBuf/8. The size (bytes) of one tile of output pixels written (Wr) to the DRAM is byte_WrPx = bit_OutBuf/8. The latency (milliseconds or ms) of DRAM transactions of one tile (1T) of data is determined by the size of DRAM access and the memory bandwidth. This is given by

ms_DRAM_1T = byte_DRAM_1T / (BW_Memory × 10^6),    (8)

where BW_Memory is the external memory bandwidth (GByte/s), and byte_DRAM_1T is the size of DRAM access of one tile, which can be byte_RdPx, byte_RdWt, or byte_WrPx.

D. On-chip Buffer Access

The size (bits) of on-chip buffer access (bit_BufAccess) is computed by multiplying the number of access clock cycles (#cycles_Access) with the total bit width of the corresponding buffers (width_Buf):

bit_BufAccess = #cycles_Access × width_Buf.    (9)

During computation, it is assumed that data are continuously read from the input and weight buffers and the results are written into the output buffers every clock cycle. Then, to estimate the buffer access during computation, #cycles_Access equals the number of computation cycles, and width_Buf can be the total bit width of the input/weight/output buffers. The size (bits) of buffer access by the DMA, which writes into the input and weight buffers and reads from the output buffers, is the same as the size of the external memory access. The data stored in the input or weight buffers may be read multiple times during computation, hence the size of data read from the buffers may be larger than the size of data written into the buffers from DRAM. Since each result is written into the output buffers only once, the sizes of the write and read operations of the output buffers are the same.

E. Other Implementation Methods of Convolution

Instead of the aforementioned direct implementation of the convolution loop operations, convolution can also be performed as matrix multiplication [8] or accelerated in the frequency domain [14] [15]. Since these methods require significantly different hardware architectures and dataflows, we only briefly analyze them with our modeling parameters.

1) Matrix Multiplication: The multiply and accumulate (MAC) operations in convolution can be mapped to matrix multiplication [8], which can utilize libraries optimized for GPU, e.g. BLAS used by Caffe. The original 4-D kernel weights are transformed into a matrix with Nof rows and Nkx·Nky·Nif columns. The 3-D input feature map is transformed into a matrix with Nkx·Nky·Nif rows and Nox·Noy columns. There are redundant data in the transformed feature matrix due to the overlapped sliding of the kernel window. Therefore, this method could lead to either complex dataflow and extra hardware to perform the transform on the fly, or additional DRAM accesses due to the redundant data.

2) Fast Fourier Transform: FFT [16] can reduce the number of multiplications from Θ(Nox · Noy · Nkx · Nky) to Θ(Nox · Noy · log2(Nox)) and even further to Θ(Nox · Noy · log2(Nkx)) with Overlap-and-Add [16]. The original kernel weights and input features are transformed into the frequency domain to do multiplications, and then the inverse FFT is applied to recover the results. Therefore, extra hardware is required to implement the transform. In addition, the computation reduction is decreased with smaller kernel sizes.

3) Winograd Transform: The Winograd transform [15] is another approach to reduce the convolution operations. The number of multiplications of the original convolution in one tile is Θ((Tkx · Tox)(Tky · Toy)), while in Winograd it is only Θ((Tkx + Tox − 1)(Tky + Toy − 1)). For example, with a kernel size of 3 × 3 and tiled output features of 2 × 2, we can achieve a 2.25× reduction of multiplication operations. However, the addition operations are increased in Winograd, and additional storage and bandwidth are required by the transform matrices. The operations can be further reduced with larger feature tiles, but the complexity of the transform matrix will significantly increase. Since Winograd essentially unrolls the computation within a kernel window, varying kernel sizes can affect its computation efficiency.

IV. MODELING OF DRAM ACCESS

In this section, more accurate models of the DRAM access are constructed by including the design constraints and the variables of loop acceleration described in Section II-D.

A. Data Size of Convolution DRAM Access

The direct memory access (DMA) engine shown in Fig. 1 is used to transfer data to and from the off-chip DRAM. To achieve the maximum bandwidth, the data widths of both the DMA (bit_DMA) and the DRAM controller (bit_DRAM) are set to 512 bits.

Pox represents the number of pixels that are computed in parallel in each output feature map. For the feature map transfer, the number of groups of Pox pixels associated with one DMA address is then given by #PoxGroup = ⌊bit_DMA/(Pox × bit_Px)⌋, where bit_Px is the bit width per pixel. The effective or actual DMA bandwidth (as a fraction of the maximum) is then given by

eff_DMA_Px = (#PoxGroup × Pox × bit_Px) / bit_DMA.    (10)

For example, if Pox = 7, bit_DMA = 512 and bit_Px = 16, then there are #PoxGroup = 4 groups of Pox pixels in one DMA address, and 4 × 7 × 16 = 448 bits are the effective number of bits out of the DMA bit width of 512 bits, resulting in eff_DMA_Px = 0.875.

The intermediate pixel results stored in DRAM are arranged row-by-row, map-by-map, and layer-by-layer. One convolution tile needs Tix × Tiy × Tif input pixels. Then, the size (bytes) of the input pixels read (Rd) from the DRAM for one tile is

byte_RdPx = (Tix × Tiy × Tif × bit_Px) / (eff_DMA_Px × 8).    (11)
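The two equations above translate directly into a few lines of code. The sketch below mirrors Equations (10) and (11) and reproduces the 0.875 example from the text; the tile dimensions used in the demo call are only illustrative values.

from math import floor

def dram_read_bytes_pixels(Tix, Tiy, Tif, Pox, bit_Px=16, bit_DMA=512):
    """Effective DMA width and DRAM read size for one tile of input pixels,
    following Equations (10) and (11); names mirror the text."""
    PoxGroup = floor(bit_DMA / (Pox * bit_Px))               # groups of Pox pixels per DMA word
    eff_DMA_Px = PoxGroup * Pox * bit_Px / bit_DMA           # Equation (10)
    byte_RdPx = Tix * Tiy * Tif * bit_Px / (eff_DMA_Px * 8)  # Equation (11)
    return eff_DMA_Px, byte_RdPx

# Worked example from the text: Pox = 7, bit_DMA = 512, bit_Px = 16
eff, _ = dram_read_bytes_pixels(Tix=14, Tiy=14, Tif=3, Pox=7)
print(eff)   # 0.875 = 448/512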
Note that if eff_DMA_Px < 1, it implies more bytes are read than necessary, due to the alignment of data storage. Similarly, the size (bytes) of output pixels written (Wr) to DRAM for one convolution tile is

byte_WrPx = (Tox × Toy × Tof × bit_Px) / (eff_DMA_Px × 8).    (12)

For convolution weights, the ratio of the effective DRAM bandwidth to the maximum when reading weights from DRAM is

eff_DMA_Wt = (⌊bit_DMA/bit_Wt⌋ × bit_Wt) / bit_DMA.    (13)

The size (bytes) of input weights read from DRAM for one convolution tile is

byte_RdWt = (Tkx · Tky · Tif · Tof · bit_Wt) / (eff_DMA_Wt × 8).    (14)

B. DRAM Access Delay of One Tile (1T)

The data width of the DRAM controller interface to the FPGA is assumed to be bit_DRAM, running at a frequency of MHz_DRAM. This means the theoretical maximum DRAM bandwidth (BW_DRAM in GB/s) is (bit_DRAM/8) × (MHz_DRAM/10^3), which is normally very difficult to sustain due to the non-contiguous DRAM access. For example, if bit_DRAM = 512 bits with MHz_DRAM = 266 MHz, then BW_DRAM = (512/8) × (266/10^3) = 17.0 GB/s as the maximum DRAM bandwidth.

In the CNN acceleration system described in [11], the DMA engine is operated at the same clock frequency as the CNN accelerator core (i.e. MHz_Accelerator) with a read/write data width (bit_DMA) of 512 bits. An asynchronous FIFO can be inserted between the DMA and the DRAM controller to synchronize data across the two clock domains. Then, the DMA bandwidth (BW_DMA) is (bit_DMA/8) × (MHz_Accelerator/10^3). By this means, the bandwidth of the external memory is bounded by the effective bandwidth of both the DRAM controller and the DMA as BW_Memory = min(BW_DRAM, BW_DMA), which is used in Equation (8) to calculate the DRAM latency.

The more accurate and specific DRAM access sizes of one tile (byte_DRAM_1T) are discussed in this section, including byte_RdPx, byte_WrPx, and byte_RdWt. Then, we can use Equation (8) to compute their corresponding DRAM access delays (ms_DRAM_1T), e.g. ms_RdPx, ms_WrPx, and ms_RdWt, respectively.
ms RdW t, respectively. which is shown as “Output” at the end in Fig. 4. Therefore,
for each convolution layer, the delay of transferring inputs of
C. DRAM Access of Other Layers the first tile and outputs of the last tile cannot be overlapped
The DRAM access and performance of other layers, e.g. max- with the computation, and this delay is denoted as
pooling, fully-connected (FC) and Eltwise, are also investigated ms M em = ms RdP x + ms RdW t + ms W rP x. (17)
and included in our performance model. Since the analysis
process of theses layers are similar to the convolution layer, If the convolution layer has only one tile that is T iy = Niy
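Equations (15) and (16), together with the tile count, form the computation-delay part of the model. A small sketch is given below; the ceilings are an assumption for loop bounds that do not divide evenly, where the text writes plain ratios.

from math import ceil

def tile_compute_delay_ms(Nif, Nkx, Nky, Nox, Tof, Toy,
                          Pof, Pox, Poy, MHz_Accelerator=240):
    """Computation cycles and delay of one convolution tile,
    Equations (15) and (16)."""
    cycles_1T = (Nif * Nkx * Nky
                 * ceil(Tof / Pof) * ceil(Nox / Pox) * ceil(Toy / Poy))
    ms_Compute = cycles_1T / (MHz_Accelerator * 1e3)     # Equation (16)
    return cycles_1T, ms_Compute

def num_tiles(Nof, Noy, Tof, Toy):
    """Number of tiles of one convolution layer."""
    return ceil(Nof / Tof) * ceil(Noy / Toy)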
B. Overall Delay (ms) of One Convolution Layer

With the dual buffering technique, the DRAM access is overlapped with computation to improve performance [7] [10]. The overall tile-by-tile delay of one convolution layer is illustrated in Fig. 4. Since the dual buffering pipeline is only within one layer with the current design choice, after the start of one layer and before the computation of the first tile, both the input pixels and weights (Wt) of one tile are first read from DRAM. This is shown as "Input+Wt" at the beginning of one layer in Fig. 4. Similarly, after the completion of the last tile's computation, its output pixels are transferred back into DRAM, which is shown as "Output" at the end in Fig. 4. Therefore, for each convolution layer, the delay of transferring the inputs of the first tile and the outputs of the last tile cannot be overlapped with the computation, and this delay is denoted as

ms_Mem = ms_RdPx + ms_RdWt + ms_WrPx.    (17)

If the convolution layer has only one tile, that is Tiy = Niy and Tof = Nof, there is no overlapping of memory transfer and computation as shown in Fig. 4(a), and the delay of this tile (e.g. t = 1 in Fig. 4(a)) is only determined by the computation delay, as in Algorithm 1 (line 2).

If the convolution layer has multiple tiles and all its weights are fully buffered, i.e. Tiy < Niy and Tof = Nof, then the
weights only need to be read from DRAM once and can be reused by different tiles, as illustrated in Fig. 4(b). The procedure to estimate the delay of this convolution layer is summarized in Algorithm 1 (line 3 to line 12). The computation of the first tile (e.g. t = 1 in Fig. 4(b)) is overlapped with fetching the input pixels of the next tile, and there is no DMA transfer of output pixels of the previous layer, thus the delay of this tile is determined by Algorithm 1 (line 6). The computation of the last tile (e.g. t = 3 in Fig. 4(b)) is overlapped with transferring the output pixels of its previous tile, and its delay is calculated by Algorithm 1 (line 8). For the other tiles (e.g. t = 2 in Fig. 4(b)), the communication with DRAM includes both reading input pixels and writing output pixels, and the delay of one tile is expressed by Algorithm 1 (line 10). The overall delay of this convolution layer is the sum of all the tiles as well as the DRAM access delay before the first tile and after the last tile, i.e. ms_Mem.

If the convolution layer has multiple tiles and all its input pixels are fully buffered, i.e. Tiy = Niy and Tof < Nof, then the pixels only need to be read from DRAM once and can be reused by different tiles, as illustrated in Fig. 4(c). Similarly, the procedure to estimate the delay of this convolution layer is summarized in Algorithm 1 (line 13 to line 22).

If neither the weights nor the pixels of the convolution layer can be fully buffered, i.e. Tiy < Niy and Tof < Nof, its pipeline schedule is shown in Fig. 4(d) and the associated delay is estimated in Algorithm 1 (line 23 to line 37). In this case, either the pixels or the weights need to be re-fetched multiple times from the DRAM. In our current design, the input pixels are re-fetched and the weights only need to be read once. If the DRAM access requirement of input pixels is more than that of weights, we can also re-fetch weights instead and only read input pixels once, by changing the DMA instructions and the associated control logic. Before the computation, the first tile of weights is loaded and reused by the following consecutive #tiles_y = ⌈Niy/Tiy⌉ tiles of pixels to perform convolution. Then, the next tile of weights is loaded and reused by the following #tiles_y tiles of pixels. This process iterates #tiles_f = ⌈Nof/Tof⌉ times to complete the computation with all the #tiles_f tiles of weights. By this means, the pixels are re-fetched #tiles_f times. A normal tile needs to read the input pixels of the next tile from DRAM and write the output pixels of the previous tile into DRAM, while the required weights are already loaded during the previous tile and reused. Therefore, the delay of a normal tile is estimated as in Algorithm 1 (line 34). As the first tile does not have a previous tile, there is no transfer of output pixels back to DRAM, as in Algorithm 1 (line 28). For the last tile, there is no need to read input pixels for the next tile, as in Algorithm 1 (line 30). When #tiles_y tiles of weights are finished (e.g. ty = 3 and tf = 1 in Fig. 4(d)), a new tile of weights is loaded from DRAM, and the DRAM access also includes the transfer of pixels, as in Algorithm 1 (line 32).

Fig. 4. The tile-by-tile delay of one convolution layer; the DRAM access delay is overlapped with the computation delay due to the dual buffering technique. (a) Both inputs and weights fully buffered, (b) only weights fully buffered, (c) only inputs fully buffered, (d) neither inputs nor weights fully buffered. Legend: Input = delay of reading one tile of input pixels from DRAM (ms_RdPx); Wt = delay of reading one tile of weights from DRAM (ms_RdWt); Output = delay of writing one tile of output pixels into DRAM (ms_WrPx); Compute = delay of computing one tile of data (ms_Compute); #tiles = ⌈Niy/Tiy⌉⌈Nof/Tof⌉, #tiles_y = ⌈Niy/Tiy⌉, #tiles_f = ⌈Nof/Tof⌉.

Fig. 5. The tile-by-tile delay of one pooling/fully-connected layer; the DRAM access delay is overlapped with the computation delay due to the dual buffering technique. (a) Max-pooling with inputs fully buffered in one tile, (b) max-pooling with inputs partially buffered in multiple tiles, (c) fully connected.

C. Delay Estimation of Other Layers

With the dual buffering technique employed, the overall tile-by-tile process of one max-pooling layer is illustrated in Fig. 5(a)(b), which is similar to the convolution layer except that pooling does not need weights. If the pooling layer has only one tile, which means the inputs of one pooling layer can be fully buffered, there is no overlapping between memory transfer and computation, as shown in Fig. 5(a). Fig. 5(b) illustrates the dual buffering pipeline of one pooling layer with multiple tiles. Similar to Algorithm 1, we can compute the overall latency of max-pooling layers according to the tile-by-tile execution schedule, with the delay of max-pooling computation and DRAM access calculated similarly to the convolution layer.

Fig. 5(c) shows the pipeline schedule of the FC layer, where weights are fetched before the corresponding computation and
no outputs are transferred back to DRAM. The storage format of the FC weights in the weight buffer allows us to read Pof weights simultaneously every clock cycle to compute Pof outputs in parallel. Then, the computation cycles of one FC tile equal the depth of the buffered FC weights. The overall delay of FC is bounded and determined by the computation delay or the DRAM access delay of the weights.

  input : C, I, W, O, #tiles, #tiles_y, #tiles_f
  output: ms_1CV
 1  if Tiy = Niy and Tof = Nof then
 2      T[1] = C
 3  else if Tiy < Niy and Tof = Nof then
 4      for t = 1 to #tiles do
 5          if t = 1 then
 6              T[t] = max(C, I)
 7          else if t = #tiles then
 8              T[t] = max(C, O)
 9          else
10              T[t] = max(C, I + O)
11          end
12      end
13  else if Tiy = Niy and Tof < Nof then
14      for t = 1 to #tiles do
15          if t = 1 then
16              T[t] = max(C, W)
17          else if t = #tiles then
18              T[t] = max(C, O)
19          else
20              T[t] = max(C, W + O)
21          end
22      end
23  else
24      for tf = 1 to #tiles_f do
25          for ty = 1 to #tiles_y do
26              t = ty + (tf − 1) × #tiles_y
27              if ty = 1 and tf = 1 then
28                  T[t] = max(C, I)
29              else if t = #tiles then
30                  T[t] = max(C, O)
31              else if ty = #tiles_y then
32                  T[t] = max(C, I + W + O)
33              else
34                  T[t] = max(C, I + O)
35              end
36          end
37      end
38  end
39  ms_1CV = Σ_{t=1}^{#tiles} T[t] + ms_Mem

Algorithm 1: Delay estimation of one convolution layer (ms_1CV), where C = ms_Compute, I = ms_RdPx, W = ms_RdWt, and O = ms_WrPx.
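A direct Python rendering of Algorithm 1 is given below for reference; it assumes, as Algorithm 1 does, that the per-tile delays C, I, W and O are constant across the tiles of one layer.

def conv_layer_delay_ms(C, I, W, O, Tiy, Niy, Tof, Nof,
                        tiles, tiles_y, tiles_f, ms_Mem):
    """Overall delay of one convolution layer, following Algorithm 1,
    where C = ms_Compute, I = ms_RdPx, W = ms_RdWt, O = ms_WrPx."""
    T = []
    if Tiy == Niy and Tof == Nof:                  # single tile, no overlap
        T.append(C)
    elif Tiy < Niy and Tof == Nof:                 # weights fully buffered
        for t in range(1, tiles + 1):
            if t == 1:
                T.append(max(C, I))
            elif t == tiles:
                T.append(max(C, O))
            else:
                T.append(max(C, I + O))
    elif Tiy == Niy and Tof < Nof:                 # pixels fully buffered
        for t in range(1, tiles + 1):
            if t == 1:
                T.append(max(C, W))
            elif t == tiles:
                T.append(max(C, O))
            else:
                T.append(max(C, W + O))
    else:                                          # neither fully buffered
        for tf in range(1, tiles_f + 1):
            for ty in range(1, tiles_y + 1):
                t = ty + (tf - 1) * tiles_y
                if ty == 1 and tf == 1:
                    T.append(max(C, I))
                elif t == tiles:
                    T.append(max(C, O))
                elif ty == tiles_y:
                    T.append(max(C, I + W + O))
                else:
                    T.append(max(C, I + O))
    return sum(T) + ms_Mem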
VI. SIZE REQUIREMENT OF ON-CHIP MEMORY

With the specific data storage pattern of the buffers, we can calculate the required on-chip buffer sizes more precisely than the rough estimation in Section III-B.

Fig. 6. The convolution data storage pattern in the input pixel buffers. (In the illustrated example, Pox = Pix = 4, Poy = Piy = 4, Tix = Tiy = 14, Tif = 3, stride = 2, padding = 1, word_1Row = 3, and #rows_1Map = 4.)

A. Size and Storage of Input Buffers

Fig. 6 illustrates the proposed storage pattern of the convolution input pixels, which benefits the dataflow of Pox × Poy pixels from the buffers into the MAC units [11]. The width of one input buffer is determined by Pox, to feed data for the parallel computation of Pox pixels in one feature map row. The number of input buffers is determined by Poy, to feed data for the parallel computation of Poy multiple output rows. In Fig. 6, c(x) denotes one input pixel in the x-th column of a certain row, where x ∈ {1, 2, . . . , Tix − 2 × padding} and Tix includes both the east and west zero padding. The east and west zero paddings are not stored in the buffers; instead they are masked out by control logic before loading into the MAC units. The number of addresses or words occupied by one row is

word_1Row = ⌈(Tix − 2 × padding)/Pox⌉.    (18)

In Fig. 6, r(i, y) is the y-th row of the i-th input feature map, where i ∈ {1, 2, . . . , Tif} and y ∈ {1, 2, . . . , Tiy}. The Tiy rows of one input feature map, including the north and south zero paddings if they exist, are distributed across the Poy input buffers. With stride = 2 as in Fig. 6, two adjacent rows are continuously stored in the same buffer according to the dataflow requirement. Then, the number of rows of one feature map, i.e. r(i, y), in one buffer is

#rows_1Map = ⌈⌈Tiy/stride⌉/Poy⌉ × stride.    (19)

The storage locations of the subsequent input feature maps are aligned with the first feature map to simplify the address generation logic, which causes some overhead due to the noncontinuous storage pattern, as shown by the blank spaces in the buffers in Fig. 6. By this means, the depth or word requirement of one input buffer (InBuf) storing Tif input feature maps for one convolution layer is expressed as

word_InBuf = word_1Row · #rows_1Map · Tif.    (20)

The data width of one input buffer is Pox × bit_Px and the number of input buffers is Poy × Dual with Dual = 2, where Dual represents the doubling of the number of buffers due to the dual buffer structure. Therefore, in every clock cycle,
Pox × Poy pixels can be fed into the MAC units. The input buffer size requirement of one convolution layer is

bit_InBuf = Dual × Poy × Pox × bit_Px × word_InBuf.    (21)

The final input buffer size is the maximum bit_InBuf of all the convolution layers. The actual input buffer size in Equation (21) is larger than the rough estimation in Equation (5) due to the mismatch of tile and buffer dimensions caused by the specific storage pattern.
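Equations (18)–(21) are easy to check numerically against the example of Fig. 6. The sketch below reproduces word_1Row = 3 and #rows_1Map = 4 for that configuration; the 16-bit pixel width and Dual = 2 defaults follow the design described in the text.

from math import ceil

def input_buffer_bits(Tix, Tiy, Tif, Pox, Poy, stride, padding,
                      bit_Px=16, Dual=2):
    """Input buffer depth and size for one convolution layer,
    Equations (18)-(21), following the storage pattern of Fig. 6."""
    word_1Row = ceil((Tix - 2 * padding) / Pox)             # Equation (18)
    rows_1Map = ceil(ceil(Tiy / stride) / Poy) * stride     # Equation (19)
    word_InBuf = word_1Row * rows_1Map * Tif                # Equation (20)
    bit_InBuf = Dual * Poy * Pox * bit_Px * word_InBuf      # Equation (21)
    return word_InBuf, bit_InBuf

# Fig. 6 example: Tix = Tiy = 14, Tif = 3, Pox = Poy = 4, stride = 2, padding = 1
print(input_buffer_bits(14, 14, 3, 4, 4, 2, 1))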
Fig. 7. The convolution data storage pattern in the weight buffer. (In the illustrated example, Pof = 4 and Tif = 3.)

B. Size and Storage of Weight Buffers

The storage pattern of the weight buffer is illustrated in Fig. 7. The k(x, y) in Fig. 7 denotes one weight inside the Nkx × Nky kernel window, where x ∈ {1, 2, . . . , Tkx} and y ∈ {1, 2, . . . , Tky}. In the chosen design, we always have Tkx = Nkx and Tky = Nky, so that one kernel window is fully buffered. These Tkx × Tky weights, i.e. k(x, y), are stored in continuous addresses as we serially compute one kernel window, e.g. Pkx = Pky = 1. In Fig. 7, w(i, o) denotes one kernel window of the i-th input channel and o-th output channel, which is comprised of Tkx × Tky weights. Weights from different input channels (Tif) are stacked in different addresses as we serially compute each input channel. To compute Pof output channels in parallel, the weights of Pof output channels are stored at the same address of the weight buffer. Therefore, the bit width of the weight buffer is Pof × bit_Wt. The words or depth of the weight buffer (WtBuf) is

word_WtBuf = Tkx × Tky × Tif × ⌈Tof/Pof⌉.    (22)

With dual buffering, the number of weight buffers is two. The weight buffer size requirement of one convolution layer is

bit_WtBuf = Dual · Pof · bit_Wt · word_WtBuf.    (23)

If Tof/Pof is not an integer, some blank spaces in the weight buffer are wasted, as in Fig. 7. The final weight buffer size is the maximum bit_WtBuf of all the convolution layers.
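The corresponding weight-buffer model of Equations (22) and (23) is equally compact; as in the text, the final buffer size would be taken as the maximum over all convolution layers.

from math import ceil

def weight_buffer_bits(Tkx, Tky, Tif, Tof, Pof, bit_Wt=16, Dual=2):
    """Weight buffer depth and size for one convolution layer,
    Equations (22) and (23)."""
    word_WtBuf = Tkx * Tky * Tif * ceil(Tof / Pof)      # Equation (22)
    bit_WtBuf = Dual * Pof * bit_Wt * word_WtBuf        # Equation (23)
    return word_WtBuf, bit_WtBuf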
C. Size and Storage of Output Buffers

After every Nkx × Nky × Nif clock cycles, there are Pox × Poy × Pof outputs from the MAC units. To reduce the bit width of the data bus and the bandwidth requirement of the output buffers, as in Fig. 8, the parallel outputs are serialized into Poy × ⌈Pof/#OutBuf⌉ clock cycles, where #OutBuf is the number of output buffers excluding the dual buffer structure, with #OutBuf ≤ Pof. By this means, the data width of one output buffer is Pox × bit_Px, as shown in Fig. 8, to store the parallel Pox outputs from the same feature map.

The output buffer storage pattern is illustrated in Fig. 8, where c(x) is the x-th column element in one row with x ∈ {1, 2, . . . , Tox}, and r(o, y) is the y-th row in the o-th output feature map with o ∈ {1, 2, . . . , Tof} and y ∈ {1, 2, . . . , Toy}. The outputs of the same feature map are continuously stored in the same buffer in a row-major order. One row (r(o, y)) is comprised of Tox elements (c(x)) continuously stored in ⌈Tox/Pox⌉ addresses, and we set Tox = Nox so that one entire row is processed while maintaining the row-major order. One feature map has Toy rows stored in one buffer and it occupies Toy × ⌈Tox/Pox⌉ addresses. One output buffer stores ⌈Tof/#OutBuf⌉ feature maps, so its depth is

word_OutBuf = ⌈Tof/#OutBuf⌉ × Toy × ⌈Tox/Pox⌉.    (24)

The output buffer size requirement of one convolution layer is

bit_OutBuf = (Dual × #OutBuf) × (Pox × bit_Px) × word_OutBuf.    (25)

If Tof/#OutBuf is not an integer, the blank spaces in the output buffers, as in Fig. 8, are wasted.

Fig. 8. The convolution data storage pattern in the output pixel buffers. (In the illustrated example, Pox = 4, Tox = 12, Toy = 4, Tof = 6, and #OutBuf = 4.)

D. Size and Storage of Pooling Buffers

The max-pooling layers share the input and output buffers with the convolution layers. Due to the different dataflow requirement, the max-pooling input storage pattern in the input buffers is different from that of the convolution inputs, but it is the same as the output storage pattern of the convolution outputs in Fig. 8. In addition, the output buffer storage pattern of max-pooling layers is also the same as that of the convolution outputs in Fig. 8. The pixels from the same feature map are stored in the same buffer, and different feature maps are distributed across different buffers. Therefore, the input and output buffer depth of one tile of max-pooling is similar to Equation (24). The buffer size
requirement of pooling layers is ensured to be smaller than that of the convolution layers by using smaller pooling tiling variables, so that there is no overflow of pooling data.

VII. MODELING OF ON-CHIP BUFFER ACCESS

The energy cost of accessing data in the buffers dominates the on-chip memory energy consumption [18] [19], so it is essential to reduce the size of buffer accesses for an energy-efficient design. To reduce the buffer access size, data should be reused as much as possible, either by multiple PEs or by different execution tiles, which will be discussed in this section.

A. Reading Input and Weight Buffers of Convolution

Based on Equation (9) to estimate the buffer access, we need to compute #cycles_Access first. In this case, #cycles_Access is the number of MAC computation clock cycles of one tile, which is #cycles_1T in Equation (15). Then, the computation clock cycles of all the convolution layers are

#cycles_C = Σ_{L=1}^{#CONVs} #cycles_1T[L] × #tiles[L],    (26)

where #CONVs is the number of convolution layers and #tiles is the number of tiles. The size (bits) of data read (Rd) from the input buffers (InBuf) for the convolution layers is computed by multiplying the read clock cycles with the total input buffer data width as

bit_RdInBuf = #cycles_C · (Pox · Poy · bit_Px),    (27)

where every Pox × Poy pixels are reused by Pof MAC units and the number of input buffer accesses is reduced by Pof times. Similarly, the size (bits) of data read (Rd) from the weight buffers (WtBuf) for all the convolution layers is

bit_RdWtBuf = #cycles_C × (Pof × bit_Wt),    (28)

where every Pof weights are reused by Pox × Poy MAC units and the number of weight buffer accesses is reduced by Pox × Poy times.
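The read-access model of Equations (26)–(28) can be sketched as follows, where the per-layer cycle counts are assumed to have been computed beforehand with the tile model of Section V-A.

def conv_buffer_read_bits(cycles_per_layer, Pox, Poy, Pof,
                          bit_Px=16, bit_Wt=16):
    """On-chip buffer read traffic of the convolution layers,
    Equations (26)-(28); cycles_per_layer[L] = #cycles_1T[L] * #tiles[L]."""
    cycles_C = sum(cycles_per_layer)                     # Equation (26)
    bit_RdInBuf = cycles_C * (Pox * Poy * bit_Px)        # Equation (27)
    bit_RdWtBuf = cycles_C * (Pof * bit_Wt)              # Equation (28)
    return bit_RdInBuf, bit_RdWtBuf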
B. Writing Input and Weight Buffers of Convolution

Before computation, the input data are written into the input and weight buffers from the DMA. As discussed in Section V-B, not every tile needs to read both pixels and weights from DRAM, because some pixels or weights of one tile can be reused by the following adjacent tiles. The number of tiles of one convolution layer that write new weights (Wt) to the weight buffer is

#tiles_Wt = ⌈Nof/Tof⌉.    (29)

The number of tiles of one convolution layer that write new input pixels (In) to the input buffers is

#tiles_In = ⌈Noy/Toy⌉⌈Nof/Tof⌉, if Toy < Noy and Tof < Nof;
#tiles_In = ⌈Noy/Toy⌉, otherwise.    (30)

When neither weights nor pixels are fully buffered, i.e. Toy < Noy and Tof < Nof, the same pixels are re-loaded ⌈Nof/Tof⌉ times into the input buffers, as shown in Fig. 4(d). Similar to Equation (21), the size (bits) of one tile (1T) of pixels written into the input buffers is

bit_WrIn_1T = word_InBuf · Poy · Pox · bit_Px.    (31)

The size (bits) of data loaded into the input buffers of all the convolution layers is

bit_WrInBuf = Σ_{L=1}^{#CONVs} bit_WrIn_1T[L] × #tiles_In[L].    (32)

Similarly, the size (bits) of one tile of weights written into the weight buffers is

bit_WrWt_1T = word_WtBuf × Pof × bit_Wt,    (33)

and the size (bits) of data written into the weight buffers of all the convolution layers is

bit_WrWtBuf = Σ_{L=1}^{#CONVs} bit_WrWt_1T[L] × #tiles_Wt[L].    (34)

C. Data Access of Output Buffers of Convolution

The number of clock cycles to write outputs into the output buffers during one tile is the same as word_OutBuf, where one word of data is written into one output buffer in one cycle. Since every tile of one layer has outputs to be saved, the number of clock cycles of writing outputs to the output buffers is word_OutBuf × #tiles. Then, the total cycles to load outputs into the output buffers (OutBuf), summed across all the convolution layers, are

#cycles_WrOutBuf = Σ_{L=1}^{#CONVs} word_OutBuf[L] × #tiles[L].    (35)

The size (bits) of results written into the output buffers is

bit_WrOutBuf = #cycles_WrOutBuf × #OutBuf × Pox × bit_Px.    (36)

Since each output is written into and read from the output buffers only once, the size (bits) of data read from the output buffers (bit_RdOutBuf) by the DMA equals bit_WrOutBuf.

VIII. EXPERIMENTS AND ANALYSIS

In this section, the proposed performance model is used to explore the design space by tuning the key design variables, e.g. unrolling and tiling sizes, DRAM bandwidth and accelerator frequency, to identify the performance bottleneck and obtain the optimal design configurations.

A. Design Space Exploration of Tiling Variables

The loop tiling strategy determines how much data of each layer is buffered, which affects the buffer capacity requirement,
the number of DRAM accesses, and the accelerator performance. Although we have fixed Tkx = Nkx, Tky = Nky, Tif = Nif and Tox = Nox as mentioned in Section II-D, the remaining two tiling variables Toy and Tof still give us a huge design space, as mentioned in [11]. For example, VGG-16 has 13 convolution layers, so there are 13 × 2 = 26 tiling variables, and each variable can have 4 or more candidate values determined by Noy/Poy or Nof/Pof; the total number of Toy and Tof choices is then roughly 4^26 = 4.5 × 10^15, which results in an enormous solution space that cannot be enumerated. Therefore, we randomly sample 30,000 tiling configurations for different CNN algorithms to explore their impact on the memory access and performance, as in Fig. 9, Fig. 10 and Fig. 11, where we set the loop unrolling variables as Pox × Poy × Pof = 7 × 7 × 32.

…minimum DRAM accesses. The red dot in Fig. 9 is our optimal design choice of Toy and Tof that balances the buffer size requirement and the number of DRAM accesses.

Fig. 10. The tiling variables (Toy and Tof) are swept to explore the relationship between the convolution throughputs and the total input/weight/output buffer size requirement for (a) NiN, (b) VGG-16, (c) GoogLeNet, and (d) ResNet-50, where Pox × Poy × Pof = 7 × 7 × 32, MHz_Accelerator = 240, and BW_DRAM = 14.4 GB/s.
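The random sampling described above can be sketched as follows; treating the candidate tile sizes as multiples of Poy and Pof is an assumption consistent with the Noy/Poy and Nof/Pof candidate counts mentioned in the text, and each sampled configuration would then be evaluated with the latency and buffer-size models of Sections V–VII.

import random
from math import ceil

def sample_tilings(layers, Poy, Pof, n_samples=30000, seed=0):
    """Randomly sample per-layer (Toy, Tof) tiling configurations,
    as in the exploration of Section VIII-A; 'layers' is a list of dicts
    holding the Noy and Nof dimensions of each convolution layer."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        config = []
        for layer in layers:
            toy = rng.randrange(Poy, layer["Noy"] + Poy, Poy)   # multiple of Poy
            tof = rng.randrange(Pof, layer["Nof"] + Pof, Pof)   # multiple of Pof
            config.append((min(toy, layer["Noy"]), min(tof, layer["Nof"])))
        samples.append(config)
    return samples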
Fig. 11. The tiling variables (Toy and Tof) are swept to explore the relationship between the size of on-chip buffer accesses and the size requirement of buffers, where Pox × Poy × Pof = 7 × 7 × 32.

B. Design Space Exploration for Performance

As convolution dominates the CNN operations [2] [20] [3] [4], we focus on the design space exploration of the convolution throughput. The convolution throughput is affected by several factors, namely the accelerator operating frequency, the external memory bandwidth and the loop unrolling variables. These are explored in Fig. 12 using GoogLeNet as an example. With a small number of MAC units and a high DRAM bandwidth (BW_DRAM), as shown in Fig. 12(a), the accelerator throughput is mainly bounded by computation, and thus the throughput increases almost linearly with the frequency when BW_DRAM > 12.8 GB/s. If the DRAM bandwidth is too low, e.g. 3.2 GB/s, the design is more likely to be memory bounded and the throughput stops increasing with the frequency. With more MAC units and higher frequency, the throughputs tend to increase, as shown in Fig. 12, until the design touches the memory roof, which is illustrated in Fig. 13.

Fig. 12. The convolution throughput is affected by the accelerator operating frequency, the DRAM bandwidth, and the number of MAC units, e.g. (c) Pox = 7, Poy = 7, Pof = 32 and (d) Pox = 14, Poy = 7, Pof = 32. GoogLeNet is shown as an example here.

The memory roof throughput [7] in Fig. 13 is the maximum achievable throughput under a certain external memory bandwidth, and it is defined as

DRAM_roof (GOPS) = #operations (GOP) / DRAM_delay (s) = (#operations (GOP) / #data (GByte)) × BW_Memory (GB/s),    (38)

where #data is the data size of the DRAM accesses.
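Equation (38) is a one-liner; an example call with purely illustrative numbers is included.

def dram_roof_gops(operations_gop, data_gbyte, bw_memory_gbs):
    """Memory roof throughput of Equation (38): the computation-to-
    communication ratio (#operations/#data) times the memory bandwidth."""
    return operations_gop / data_gbyte * bw_memory_gbs   # GOPS

# Illustrative numbers only: 30.7 GOP of convolution work, 1.5 GB of DRAM
# traffic, and 14.4 GB/s of usable memory bandwidth.
print(dram_roof_gops(30.7, 1.5, 14.4))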
Since the computation-to-communication ratio (CTC), i.e. #operations/#data, is a constant under a certain tiling setting, DRAM_roof is directly proportional to BW_Memory. With the same setting of BW_Memory for GoogLeNet and VGG-16, the shapes of the curves in Fig. 13(a) and (b) are similar. Since VGG-16 has a higher CTC, its memory roof throughput is much higher than that of GoogLeNet in Fig. 13. As discussed in Section IV-B, the memory bandwidth (BW_Memory) is bounded by both the DRAM controller (BW_DRAM) and the DMA (BW_DMA). At low frequency, BW_Memory is limited by BW_DMA, and DRAM_roof increases linearly with the frequency, as in Fig. 13. After BW_DMA becomes larger than BW_DRAM, BW_Memory is limited by BW_DRAM instead, and DRAM_roof stops growing with the frequency. The saturated throughputs in Fig. 12 are lower than DRAM_roof in Fig. 13, mainly because there are redundant DRAM transfers and the computation delay is not fully overlapped with the DRAM latency.

Fig. 13. The external memory roof throughput (DRAM_roof) is the maximum achievable throughput under a certain memory bandwidth, shown for (a) GoogLeNet and (b) VGG-16.

C. Performance Model Validation

Fig. 14 shows the comparison of throughput and latency between the performance model and the on-board test results on Arria 10 and Stratix 10 with different numbers of MAC units, where both pixels and weights are 16-bit fixed-point data. The differences between the estimation and the on-board results are within 3%, and are mainly due to the DRAM transfer latency mismatch, minor layers (e.g. average pooling), and some pipeline stages in the real implementation. The compilation of our FPGA design using Quartus Pro 17.1 on a 16-core Intel Xeon CPU E5-2650 v3 normally takes six to eight hours, while the performance model running on a laptop Intel Core i7-7500U CPU using MATLAB takes about 1 to 5 seconds per design.
Fig. 14. The performance model results are compared with the on-board test results of Arria 10 and Stratix 10 on the overall (a) throughput and (b) latency, for NiN, VGG-16, GoogLeNet, ResNet-50 and ResNet-152 with 14×7×32 MACs (Arria 10) and 14×7×64 MACs (Stratix 10).

D. Related Works

Several related works have used performance models to optimize the memory access and computation pattern of their proposed architecture and dataflow. Suda et al. [8] implement convolution as matrix multiplication and use a performance model to optimize the design. However, the execution time in [8] only counts the computation time without considering the DRAM transfer latency. If the design becomes memory-bounded, the model in [8] cannot properly predict the overall latency, which results in the estimation discrepancy of fully-connected layers with high computation parallelism. The proposed systolic array architecture in [10] is also optimized through a performance model. The overall throughput is simply computed as the minimum of the computation throughput and the DRAM transfer throughput, where the overlap efficiency of computation and data transfer is not considered. The fine-grained tile-level data accesses of DRAM and buffers are not explored in [10]. The buffer and DRAM accesses are modeled in [18] to explore different data reuse patterns by changing the tiling strategy and computation order. Only coarse-grained modeling of the convolution memory access is analyzed, without the data storage patterns in buffers and DRAM. The proposed Hybrid Data Reuse in [18] is similar to our tiling strategy, in that different layers can use different tiling sizes to either reuse weights or pixels to minimize the DRAM access. In our work, the relationship between the overall DRAM access and …

A. Improving DRAM Bandwidth Utilization

To simplify the control logic of the data bus from the DMA to the input buffers, different feature map rows are aligned in different addresses in our current design. By this means, if the number of pixels in one row is smaller than ⌊bit_DMA/bit_Px⌋, the successive row directly starts from the next address instead of continuously using the same address, resulting in the waste of DMA data width. For example, with bit_Px = 16, one address can hold 512/16 = 32 pixels; if the row width of one feature map is Nix = 14, then the actual number of pixels of one row read from DRAM in Equation (11) is Tix = 32, where 32 − 14 = 18 data are redundant. Some CNN models, e.g. GoogLeNet and ResNet, have a lot of convolution layers with small Nix, e.g. 7 or 14, so their throughputs are significantly affected by the inefficient utilization of the DMA data width.

To improve the DRAM bandwidth utilization, one method is to store multiple rows in one DMA address, which involves modifications of the control logic and extra data paths from the DMA to the input buffers. The other method is to keep the data aligned, but narrow the bit width of the data bus between the DMA and the input buffers. To attain the same data transfer rate, a higher frequency is needed, and an asynchronous FIFO may be used. In the performance model, we reduce bit_DMA to 256 and 128 and increase the corresponding frequency of the data bus to predict the throughput improvements. In Fig. 15, our current design (DMA 512-bit) serves as the baseline with data aligned, and bit_DMA is set to 256 or 128, which has the same effect as supporting two or four rows in one address with bit_DMA = 512, respectively. Fig. 15 shows that NiN, GoogLeNet and ResNet can benefit a lot from decreasing the DMA bit width, mainly because they have many layers with small Nix, and the layers with small Nix are memory bounded. On the contrary, VGG-16 cannot benefit from the higher DRAM bandwidth utilization, as the design is still computation bounded. Based on the prediction, it is compelling to improve our design for higher DRAM bandwidth utilization.

[Fig. 15: predicted throughput (GOPS) of the DMA 512-bit baseline (our current design) versus reduced DMA bit widths, shown for (a) NiN, (b) GoogLeNet and two further networks, with 14 × 7 × 32 and 14 × 7 × 64 PEs.]
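The row-alignment overhead discussed above can be quantified with a short helper; the function below counts the bytes fetched versus the bytes actually needed for one feature-map row under the row-aligned addressing of the current design (a simplification that ignores zero padding).

from math import ceil

def row_read_bytes(Nix, bit_Px=16, bit_DMA=512):
    """Bytes read from DRAM for one feature-map row when each row is
    aligned to a DMA address (Section IX-A), versus the useful bytes."""
    pixels_per_word = bit_DMA // bit_Px
    words = ceil(Nix / pixels_per_word)
    return words * bit_DMA // 8, Nix * bit_Px // 8   # (fetched, useful) bytes

for width in (512, 256, 128):
    print(width, row_read_bytes(Nix=14, bit_DMA=width))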
[Figure panels (c) VGG-16 and (d) ResNet-50: throughput (GOPS) for 14 × 7 × 32 PEs and 14 × 7 × 64 PEs; legend: Uniform (our current design), Adjustable (prediction), Ideal (prediction).]

In GoogLeNet and ResNet, the first convolution layers of different parallel branches read input pixels from the same precedent layer. If these convolution layers also have the same kernel size and stride, they can be merged into one layer along the output feature map dimension (Nof). By this means, the input pixels can be shared by the first layers of the parallel branches and only need to be fetched from DRAM once. This merging can be evaluated with the parameters of our performance model, e.g. byte_RdPx in Equation (11), to estimate the effect of eliminating the repeated DRAM accesses of the precedent layer, as shown in Fig. 16. Since GoogLeNet and ResNet are already memory-bounded in our current design, reducing the DRAM access can considerably improve the overall throughput.
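As a rough illustration of why the merging helps, the sketch below compares the DRAM pixel traffic when every branch's first layer re-reads the precedent layer's output with the traffic after merging along Nof, where the shared pixels are fetched once. The byte counting is a stand-in for the byte_RdPx term, and the shapes and byte width are made-up assumptions.

```python
def input_pixel_bytes(nix, niy, nif, bytes_per_pixel=2):
    """Bytes of input pixels one layer would read from DRAM (tiling overlap ignored)."""
    return nix * niy * nif * bytes_per_pixel

# An inception-style module: four parallel branches whose first layers share one input.
branches = 4
per_branch = input_pixel_bytes(nix=28, niy=28, nif=256)

unmerged = branches * per_branch   # each branch fetches the same pixels again
merged = per_branch                # branches merged along Nof: pixels fetched once
print(unmerged, merged)            # 1,605,632 vs 401,408 bytes -> 4x fewer pixel reads
```

Weight reads are unchanged by the merge, so the benefit appears only where the pixel traffic of these first layers actually bounds the latency, which is why the model predicts gains for the memory-bounded GoogLeNet and ResNet rather than for computation-bounded networks.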
[Fig. 16: throughput (GOPS) of (a) GoogLeNet and (b) ResNet-50 for 14 × 7 × 32 PEs and 14 × 7 × 64 PEs; legend: our current design vs. first layers merged (model prediction).]

Fig. 16. Performance model predicts that the throughput will be improved by merging the first layers of different parallel branches, which read from the same precedent layer, to eliminate the repeated DRAM access; our current design serves as the baseline.

CONCLUSION

In this work, a high-level performance model is proposed to estimate the key specifications, e.g. throughput, of FPGA accelerators for CNN inference, which enables design space exploration to identify performance bottlenecks in the early development phase. The design strategy and resource costs are formulated using the design variables of loop unrolling and tiling. The proposed performance model is validated for a specific acceleration strategy across a variety of CNN algorithms by comparing with on-board test results on two different FPGAs.
[11] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, “Optimizing the convolution operation to accelerate deep neural networks on FPGA,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 2018.
[12] H. Zeng, R. Chen, C. Zhang, and V. K. Prasanna, “A framework for generating high throughput CNN implementations on FPGAs,” in Proc. of ACM/SIGDA Int. Sym. on Field-Programmable Gate Arrays (FPGA), Feb., 2018.
[13] X. Lin, S. Yin, F. Tu, L. Liu, X. Li, and S. Wei, “LCP: A layer clusters paralleling mapping method for accelerating Inception and Residual networks on FPGA,” in Proc. of Design Automation Conference (DAC), Jun., 2018.
[14] P. Karas and D. Svoboda, “Algorithms for efficient computation of convolution,” in Design and Architectures for Digital Signal Processing, IntechOpen, Jan., 2013, DOI: 10.5772/51942.
[15] J. Yu, K. Guo, Y. Hu, X. Ning, J. Qiu, H. Mao, S. Yao, T. Tang, B. Li, Y. Wang, and H. Yang, “Real-time object detection towards high power efficiency,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar., 2018.
[16] C. Zhang and V. K. Prasanna, “Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system,” in Proc. of ACM/SIGDA Int. Sym. on Field-Programmable Gate Arrays (FPGA), 2017.
[17] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, “An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks,” in Int. Conf. on Field Programmable Logic and Applications (FPL), Sep., 2017.
[18] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, “Deep convolutional neural network architecture with reconfigurable computation patterns,” IEEE Trans. VLSI Syst., vol. 25, no. 8, pp. 2220–2233, 2017.
[19] Y. Chen, J. S. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in ACM/IEEE Int. Sym. on Computer Architecture (ISCA), Jun., 2016.
[20] M. Lin, Q. Chen, and S. Yan, “Network In Network,” CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400
[21] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-brain: a deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization,” in Proc. of Design Automation Conference (DAC), Jun., 2016.
[22] M. Putic, S. Venkataramani, S. Eldridge, A. Buyuktosunoglu, P. Bose, and M. Stan, “Dyhard-DNN: even more DNN acceleration with dynamic hardware reconfiguration,” in Proc. of Design Automation Conference (DAC), Jun., 2018.

Yu Cao (S’99-M’02-SM’09-F’17) received the B.S. degree in physics from Peking University in 1996. He received the M.A. degree in biophysics and the Ph.D. degree in electrical engineering from the University of California, Berkeley, in 1999 and 2002, respectively. He worked as a summer intern at Hewlett-Packard Labs, Palo Alto, CA, in 2000, and at IBM Microelectronics Division, East Fishkill, NY, in 2001. After working as a post-doctoral researcher at the Berkeley Wireless Research Center (BWRC), he is now a Professor of Electrical Engineering at Arizona State University, Tempe, Arizona. He has published numerous articles and two books on nano-CMOS modeling and physical design. His research interests include physical modeling of nanoscale technologies, design solutions for variability and reliability, reliable integration of post-silicon technologies, and hardware design for on-chip learning.

Dr. Cao was a recipient of the 2012 Best Paper Award at IEEE Computer Society Annual Symposium on VLSI, the 2010, 2012, 2013, 2015 and 2016 Top 5% Teaching Award, Schools of Engineering, Arizona State University, 2009 ACM SIGDA Outstanding New Faculty Award, 2009 Promotion and Tenure Faculty Exemplar, Arizona State University, 2009 Distinguished Lecturer of IEEE Circuits and Systems Society, 2008 Chunhui Award for outstanding overseas Chinese scholars, the 2007 Best Paper Award at International Symposium on Low Power Electronics and Design, the 2006 NSF CAREER Award, the 2006 and 2007 IBM Faculty Award, the 2004 Best Paper Award at International Symposium on Quality Electronic Design, and the 2000 Beatrice Winner Award at International Solid-State Circuits Conference. He has served as Associate Editor of the IEEE Transactions on CAD, and on the technical program committee of many conferences.

Sarma Vrudhula (M’85-SM’02-F’16) is a Professor of Computer Science and Engineering with Arizona State University, and the Director of the NSF I/UCRC Center for Embedded Systems. His work spans several areas in design automation and computer-aided design for digital integrated circuits and systems, focusing on low power circuit design and energy management of circuits and systems. Specific topics include: energy optimization of battery-powered computing systems, including smartphones, wireless sensor networks and IoT systems that rely on energy harvesting; system level dynamic power and thermal management of multicore
processors and system-on-chip (SoC); statistical methods for the analysis of
process variations; statistical optimization of performance, power and leakage;
new circuit architectures of threshold logic circuits for the design of ASICs
and FPGAs. More recently, he has been investigating non-conventional methods
for implementing logic, including technology mapping with threshold logic
circuits; the implementation of threshold logic using resistive memory devices,
and the design and optimization of non-volatile logic. Prior to ASU, he was
a Professor in the ECE department at the University of Arizona, Tucson, AZ,
and was on the faculty of the EE-Systems department at the University of
Southern California. He was also the Founding Director of the NSF Center
for Low Power Electronics at the University of Arizona. He received the
B.Math. degree from the University of Waterloo, Waterloo, ON, Canada, and
the M.S.E.E. and Ph.D. degrees in electrical and computer engineering from
the University of Southern California, Los Angeles, USA.