

Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

Yufei Ma, Student Member, IEEE, Yu Cao, Fellow, IEEE, Sarma Vrudhula, Fellow, IEEE, and Jae-sun Seo, Senior Member, IEEE

Abstract— As convolution contributes most operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs, including NiN, VGG-16, and ResNet-50/ResNet-152, for inference. For the VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.

Index Terms— Accelerator architectures, convolutional neural networks (CNNs), field-programmable gate array (FPGA), neural network hardware.

I. INTRODUCTION

Field-programmable gate arrays (FPGAs) are fast becoming the platform of choice for accelerating the inference phase of deep convolutional neural networks (CNNs). In addition to their conventional advantages of reconfigurability and shorter design time over application-specific integrated circuits (ASICs) [20], [21] to catch up with the rapid evolution of CNNs, FPGAs can realize low-latency inference with competitive energy efficiency (∼10–50 GOP/s/W) when compared to software implementations on multicore processors with GPUs [10], [12], [13], [17]. This is due to the fact that modern FPGAs allow customization of the architecture and can exploit the availability of hundreds to thousands of on-chip DSP blocks. However, significant challenges remain in mapping CNNs onto FPGAs. The state-of-the-art CNNs require a large number (>1 billion) of computationally intensive operations (e.g., matrix multiplications on large matrices), involving a very large number of weights (>50 million) [4], [5]. Deep CNN algorithms have tens to hundreds of layers, with significant differences between layers in terms of sizes and configurations. The limited computational resources and storage capacity on an FPGA make the task of optimal mapping of CNNs (e.g., minimizing latency subject to energy constraints or vice versa) a complex and multidimensional optimization problem. The high cost of off-chip communication is another major impediment to achieving higher performance and lower energy. In fact, the energy cost associated with the large amount of data movements and memory accesses often exceeds the energy consumption of the computations [8], [20]. For these reasons, energy-efficient hardware acceleration of CNNs on an FPGA requires simultaneous maximization of resource utilization and data reuse, and minimization of data communication.

More than 90% of the operations in a CNN involve convolutions [2]–[4]. Therefore, it stands to reason that acceleration schemes should focus on the management of parallel computations and the organization of data storage and access across multiple levels of memories, e.g., off-chip dynamic random access memory (DRAM), on-chip memory, and local registers. In CNNs, convolutions are performed by four levels of loops that slide along both kernel and feature maps as shown in Fig. 1. This gives rise to a large design space consisting of various choices for implementing parallelism, sequencing of computations, and partitioning the large data set into smaller chunks to fit into on-chip memory. These problems can be handled by the existing loop optimization techniques [6], [9], such as loop unrolling, tiling, and interchange. Although some CNN accelerators have adopted these techniques [9], [11], [13], [19], the impact of these techniques on design efficiency and performance has not been systematically and sufficiently studied. Without fully studying the loop operations of convolutions, it is difficult to efficiently customize the dataflow and architecture for high-throughput CNN implementations. This paper aims to address these shortcomings.

Manuscript received October 27, 2017; revised February 3, 2018; accepted March 6, 2018. This work was supported in part by the NSF I/UCRC Center for Embedded Systems through NSF under Grant 1230401, Grant 1237856, Grant 1701241, Grant 1361926, Grant 1535669, Grant 1652866, and Grant 1715443; in part by Intel Labs; and in part by the Samsung Advanced Institute of Technology. (Corresponding author: Yufei Ma.) Y. Ma, Y. Cao, and J.-s. Seo are with the School of Electrical, Computer, and Energy Engineering, Arizona State University, Tempe, AZ 85287 USA (e-mail: [email protected]; [email protected]; [email protected]). S. Vrudhula is with the School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, AZ 85287 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2018.2815603


Fig. 1. Four levels of convolution loops, where L denotes the index of convolution layer and S denotes the sliding stride [15].
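The pseudocode of Fig. 1 is not reproduced in this text-only version; as a stand-in, the following Python sketch shows the four convolution loops of one layer in their naive, fully sequential form, which is the reference that the unrolling, tiling, and interchange discussed below transform. Names follow Table I (with S as the sliding stride); this is a software reference model, not the accelerator dataflow.

```python
# Naive four-loop convolution of one layer, following the loop structure of Fig. 1.
def conv_layer(pixels, weights, biases, Nif, Nof, Nox, Noy, Nkx, Nky, S):
    # pixels:  input feature maps, indexed as pixels[if][iy][ix] (zero padding already applied)
    # weights: kernels, indexed as weights[of][if][ky][kx]
    # biases:  one bias per output feature map
    out = [[[0.0 for _ in range(Nox)] for _ in range(Noy)] for _ in range(Nof)]
    for of in range(Nof):                      # Loop-4: across output feature maps (kernels)
        for oy in range(Noy):                  # Loop-3: scan within one output feature map
            for ox in range(Nox):
                acc = biases[of]
                for i_f in range(Nif):         # Loop-2: across input feature maps
                    for ky in range(Nky):      # Loop-1: within one kernel window
                        for kx in range(Nkx):
                            acc += (weights[of][i_f][ky][kx]
                                    * pixels[i_f][oy * S + ky][ox * S + kx])
                out[of][oy][ox] = acc          # one final output pixel per (of, oy, ox)
    return out
```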

Specifically, the main contributions of this paper include the following.

1) We provide an in-depth analysis of the three loop optimization techniques for convolution operations and use the corresponding design variables to numerically characterize the acceleration scheme.
2) The design objectives of CNN accelerators (e.g., latency, memory) are quantitatively estimated based on the configurations of the design variables.
3) An efficient convolution acceleration strategy and dataflow is proposed, aimed at minimizing data communication and memory access.
4) A data router is designed to handle different settings for convolution sliding operations, e.g., strides and zero paddings, especially for highly irregular CNNs.
5) A corresponding hardware architecture is designed that fully utilizes the computing resources for high performance and efficiency, and that is uniform and reusable for all the layers.
6) The proposed acceleration scheme and architecture are validated by implementing large-scale deep CNN algorithms, NiN [3], VGG-16 [4], and ResNet-50/ResNet-152 [5] for image recognition [1], on two Intel FPGAs. The proposed accelerators achieve end-to-end inference throughputs of 715 GOPS on Arria 10 and 348 GOPS on Stratix V, respectively, using a batch size of 1.

The rest of this paper is organized as follows. Section II identifies the key design variables that are used to numerically characterize the loop optimization techniques. Section III contains a quantitative analysis of the hardware accelerator objectives. Section IV describes the acceleration schemes used in some of the recent state-of-the-art CNN accelerators. Section V presents the optimized acceleration scheme with specific design variables. A corresponding dataflow and architecture is proposed in Section VI. Section VII analyzes the experimental results and compares with prior works. Conclusions are presented in Section VIII.

II. ACCELERATION OF CONVOLUTION LOOPS

A. General CNN Acceleration System

Recently reported CNN algorithms involve a large amount of data and weights, for which the on-chip memory is insufficient to store all the data, requiring gigabytes of external memory. Therefore, a typical CNN accelerator consists of three levels of storage hierarchy: 1) external memory; 2) on-chip buffers; and 3) registers associated with the processing engines (PEs), as shown in Fig. 2. The basic flow is to fetch data from external memory to on-chip buffers, and then feed them into registers and PEs. After the PE computation completes, results are transferred back to on-chip buffers and, if necessary, to the external memory, where they will be used as input to the subsequent layer.

Fig. 2. Three levels of general hardware CNN accelerator hierarchy.

B. Convolution Loops

Convolution is the main operation in CNN algorithms, which involves 3-D multiply-and-accumulate (MAC) operations of input feature maps and kernel weights. Convolution is implemented by four levels of loops as shown in the pseudocode in Fig. 1 and illustrated in Fig. 3. To efficiently map and perform the convolution loops, three loop optimization techniques [6], [9], namely, loop unrolling, loop tiling, and loop interchange, are employed to customize the computation and communication patterns of the accelerator with three levels of memory hierarchy.

C. Loop Optimization and Design Variables

As shown in Fig. 3, multiple dimensions are used to describe the sizes of the feature and kernel maps of each convolution layer for a given CNN. The hardware design variables of loop unrolling and loop tiling determine the acceleration factor and hardware footprint. All dimensions and variables used in this paper are listed in Table I.

The width and height of one kernel (or filter) window are described by (Nkx, Nky). (Nix, Niy) and (Nox, Noy) define the width and height of one input and output feature map (or channel), respectively. Nif and Nof denote the number of input and output feature maps, respectively. The loop unrolling design variables are (Pkx, Pky), Pif, (Pox, Poy), and Pof, which denote the number of parallel computations. The loop tiling design variables are (Tkx, Tky), Tif, (Tox, Toy), and Tof, which represent the portion of data of the four loops stored in on-chip buffers.
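Table I itself is not reproduced in this text-only version. As a compact stand-in, the sketch below (a hypothetical helper of ours, not from the paper) gathers the N∗/T∗/P∗ variables of one layer, checks the constraint 1 ≤ P∗ ≤ T∗ ≤ N∗, and evaluates relations (1)–(3) that are stated just below.

```python
from dataclasses import dataclass

@dataclass
class ConvLayerConfig:
    """One convolution layer's dimensions (N*), tile sizes (T*) and unroll factors (P*).
    Field names mirror the paper's notation (Table I)."""
    Nkx: int; Nky: int; Nif: int; Nof: int; Nox: int; Noy: int; S: int   # loop dimensions
    Tkx: int; Tky: int; Tif: int; Tof: int; Tox: int; Toy: int           # tile sizes
    Pkx: int; Pky: int; Pif: int; Pof: int; Pox: int; Poy: int           # unroll factors

    def derived(self):
        # Relations (1)-(3): input sizes follow from output sizes, stride and kernel size
        # (zero padding is assumed to be already included in Nix/Niy/Tix/Tiy).
        Nix = (self.Nox - 1) * self.S + self.Nkx
        Niy = (self.Noy - 1) * self.S + self.Nky
        Tix = (self.Tox - 1) * self.S + self.Nkx
        Tiy = (self.Toy - 1) * self.S + self.Nky
        Pix, Piy = self.Pox, self.Poy
        return Nix, Niy, Tix, Tiy, Pix, Piy

    def check(self):
        # Constraint 1 <= P* <= T* <= N* for every loop dimension.
        triples = [(self.Pkx, self.Tkx, self.Nkx), (self.Pky, self.Tky, self.Nky),
                   (self.Pif, self.Tif, self.Nif), (self.Pof, self.Tof, self.Nof),
                   (self.Pox, self.Tox, self.Nox), (self.Poy, self.Toy, self.Noy)]
        assert all(1 <= p <= t <= n for p, t, n in triples)
```

For instance, the first convolution layer of ResNet-50 has Nkx = Nky = 7, S = 2, Nif = 3, Nof = 64, and Nox = Noy = 112.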

TABLE I: CONVOLUTION LOOP DIMENSIONS AND HARDWARE DESIGN VARIABLES

The constraints on these dimensions and variables are given by 1 ≤ P∗ ≤ T∗ ≤ N∗, where N∗, T∗, and P∗ denote any dimension or variable that has a prefix of capital N, T, and P, respectively. For instance, 1 ≤ Pkx ≤ Tkx ≤ Nkx. By default, P∗, T∗, and N∗ apply to all convolution layers.

The relationship between input and output variables is constrained by (1)–(3), where S is the stride of the sliding window and the zero padding size is included in Nix, Niy, Tix, and Tiy:

Nix = (Nox − 1)S + Nkx
Niy = (Noy − 1)S + Nky   (1)
Tix = (Tox − 1)S + Nkx
Tiy = (Toy − 1)S + Nky   (2)
Pix = Pox
Piy = Poy.   (3)

Fig. 3. Four levels of convolution loops and their dimensions.

1) Loop Unrolling: As illustrated in Figs. 4–7, unrolling different convolution loops leads to different parallelization of computations, which affects the optimal PE architecture with respect to data reuse opportunities and memory access patterns.

a) Loop-1 unrolling (Fig. 4): In this case, the inner product of Pkx × Pky pixels (or activations) and weights from different (x, y) locations in the same feature and kernel map is computed every cycle. This inner product requires an adder tree with a fan-in of Pkx × Pky to sum the Pkx × Pky parallel multiplication results, and an accumulator to add the adder tree output with the previous partial sum.

Fig. 4. Unroll Loop-1 and its corresponding computing architecture.

b) Loop-2 unrolling (Fig. 5): In every cycle, Pif pixels/weights from Pif different feature/kernel maps at the same (x, y) location are required to compute the inner product. The inner-product operation results in the same computing structure as in unrolling Loop-1, but with a different adder tree fan-in of Pif.

Fig. 5. Unroll Loop-2 and its corresponding computing architecture.

c) Loop-3 unrolling (Fig. 6): In every cycle, Pix × Piy pixels from different (x, y) locations in the same feature map are multiplied with the identical weight. Hence, this weight can be reused Pix × Piy times. Since the Pix × Piy parallel multiplications contribute to Pix × Piy independent output pixels, Pix × Piy accumulators are used to serially accumulate the multiplier outputs and no adder tree is needed.

Fig. 6. Unroll Loop-3 and its corresponding computing architecture.

d) Loop-4 unrolling (Fig. 7): In every cycle, one pixel is multiplied by Pof weights at the same (x, y) location but from Pof different kernel maps, and this pixel is reused Pof times. The computing structure is identical to unrolling Loop-3, using Pof multipliers and accumulators without an adder tree.

Fig. 7. Unroll Loop-4 and its corresponding computing architecture.

The unrolling variable values of the four convolution loops collectively determine the total number of parallel MAC operations as well as the number of required multipliers (Pm):

Pm = Pkx × Pky × Pif × Pix × Piy × Pof.   (4)

2) Loop Tiling: The on-chip memory of FPGAs is not always large enough to store the entire data of deep CNN algorithms. Therefore, it is reasonable to use denser external DRAMs to store the weights and the intermediate pixel results of all layers.

Loop tiling is used to divide the entire data into multiple blocks, which can be accommodated in the on-chip buffers, as illustrated in Fig. 8. With proper assignments of the loop tiling sizes, the locality of data can be increased to reduce the number of DRAM accesses, which incur long latency and high power consumption. The loop tiling sets the lower bound on the required on-chip buffer size. The required size of the input pixel buffer is Tix × Tiy × Tif × (pixel_datawidth). The size of the weight buffer is Tkx × Tky × Tif × Tof × (weight_datawidth). The size of the output pixel buffer is Tox × Toy × Tof × (pixel_datawidth).

Fig. 8. Loop tiling determines the size of data stored in on-chip buffers.

3) Loop Interchange: Loop interchange determines the order of the sequential computation of the four convolution loops. There are two kinds of loop interchange, namely, intratile and intertile loop orders. The intratile loop order determines the pattern of data movements from on-chip buffers to PEs. The intertile loop order determines the data movement from external memory to on-chip buffers.

III. ANALYSIS ON DESIGN OBJECTIVES OF CNN ACCELERATOR

In this section, we provide a quantitative analysis of the impact of the loop design variables (P∗ and T∗) on the following design objectives that our CNN accelerator aims to minimize.

A. Computing Latency

The number of multiplication operations per layer (Nm) is

Nm = Nif × Nkx × Nky × Nof × Nox × Noy.   (5)

Ideally, the number of computing cycles per layer should be Nm/Pm, where Pm is the number of multipliers. However, for different loop unrolling and tiling sizes, the multipliers cannot necessarily be fully utilized for every convolution dimension. The number of actual computing cycles per layer is

#_cycles = #intertile_cycles × #intratile_cycles   (6)

where

#intertile_cycles = ⌈Nif/Tif⌉ × ⌈Nkx/Tkx⌉ × ⌈Nky/Tky⌉ × ⌈Nof/Tof⌉ × ⌈Nox/Tox⌉ × ⌈Noy/Toy⌉   (7)

#intratile_cycles = ⌈Tif/Pif⌉ × ⌈Tkx/Pkx⌉ × ⌈Tky/Pky⌉ × ⌈Tof/Pof⌉ × ⌈Tox/Pox⌉ × ⌈Toy/Poy⌉.   (8)

Here, we assume that the multipliers receive input data continuously without idle cycles. If the ratio of N∗ to T∗ or of T∗ to P∗ is not an integer, the multipliers or the external memory transactions are not fully utilized. In addition to the computing latency, the memory transfer delay must also be considered for the overall system latency.
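Equations (4)–(8) translate directly into a short cycle-count model; the sketch below is a minimal Python transcription (the helper names are ours, not the paper's), with Pix = Pox and Piy = Poy per (3).

```python
from math import ceil

def multipliers(P):
    # Eq. (4): Pm = Pkx * Pky * Pif * Pix * Piy * Pof.
    return P["Pkx"] * P["Pky"] * P["Pif"] * P["Pox"] * P["Poy"] * P["Pof"]

def compute_cycles(N, T, P):
    # Eq. (5): total multiplications of the layer.
    Nm = N["Nif"] * N["Nkx"] * N["Nky"] * N["Nof"] * N["Nox"] * N["Noy"]
    # Eqs. (7)-(8): the ceilings model under-utilization when T* does not divide N*
    # or P* does not divide T*.
    inter, intra = 1, 1
    for d in ("if", "kx", "ky", "of", "ox", "oy"):
        inter *= ceil(N["N" + d] / T["T" + d])
        intra *= ceil(T["T" + d] / P["P" + d])
    cycles = inter * intra                    # Eq. (6)
    ideal = ceil(Nm / multipliers(P))         # lower bound if the multipliers were always busy
    return Nm, ideal, cycles
```

When T∗ are chosen as common factors of N∗ and P∗ as common factors of T∗ (as Section V-A later recommends), every ceiling is an exact division and `cycles` equals the ideal count Nm/Pm.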

B. Partial Sum Storage

A partial sum (psum) is the intermediate result of the inner-product operation that needs to be accumulated over several cycles to obtain one final output datum. Therefore, partial sums need to be stored in memory for the next few cycles and sometimes have to be moved between PEs. An efficient acceleration strategy has to minimize the number of partial sums and process them locally as soon as possible to reduce data movements.

The flowchart to calculate the number of partial sums stored in memory (#psum) is shown in Fig. 9. To obtain one final output pixel, we need to finish Loop-1 and Loop-2. Therefore, if both Loop-1 and Loop-2 are fully unrolled, the final output pixel can be obtained right after the inner-product operations with minimal #psum. If the loop tile size can cover all pixels and weights in Loop-1 (Tkx = Nkx and Tky = Nky) and Loop-2 (Tif = Nif), then the partial sums can be consumed within this tile as described in (9.2)–(9.5) inside Fig. 9. In this case, the number of partial sums, determined by P∗ or T∗, is small and can be stored in local registers [(9.2) inside Fig. 9] or in on-chip buffers [(9.3) inside Fig. 9]. If the loop tile cannot include all data for Loop-1 and Loop-2, partial sums from one tile need to be stored in on-chip or off-chip memory until they are consumed by another tile, as in (9.6)–(9.9) inside Fig. 9. In this case, the partial sums need to be stored in on-chip buffers [(9.6) inside Fig. 9] or even in external memory [(9.7) inside Fig. 9]. The loop computing order also affects the number of partial sums: the earlier Loop-1 and Loop-2 are computed, the fewer partial sums there are. The requirement to store partial sums in different levels of the memory hierarchy significantly worsens the data movements and the associated energy cost [8], since partial sums involve both read and write memory operations and typically require higher precision than pixels and weights.

Fig. 9. Design space exploration of the total number of partial sums that need to be stored in memory [15].

C. Data Reuse

Reusing pixels and weights reduces the number of read operations of on-chip buffers. There are mainly two types of data reuse: spatial reuse and temporal reuse. Spatial reuse means that, after reading data from on-chip buffers, a single pixel or weight is used by multiple parallel multipliers within one clock cycle. On the other hand, temporal reuse means that a single pixel or weight is used for multiple consecutive cycles.

Having Pm parallel multiplications per cycle requires Pm pixels and Pm weights to be fed into the multipliers. The number of distinct weights required per cycle is

Pwt = Pof × Pif × Pkx × Pky.   (9)

If Loop-1 is not unrolled (Pkx = 1, Pky = 1), the number of distinct pixels required per cycle (Ppx) is

Ppx = Pif × Pix × Piy.   (10)

Otherwise, Ppx is

Ppx = Pif × ((Pix − 1)S + Pkx) × ((Piy − 1)S + Pky).   (11)

Note that "distinct" only means that the pixels/weights are from different feature/kernel map locations; their values may be the same. The number of times a weight is spatially reused in one cycle is

Reuse_wt = Pm/Pwt = Pix × Piy   (12)

where the spatial reuse of weights is realized by unrolling Loop-3 (Pix > 1 or Piy > 1). The number of times a pixel is spatially reused in one cycle (Reuse_px) is

Reuse_px = Pm/Ppx.   (13)

If Loop-1 is not unrolled, Reuse_px is

Reuse_px = Pof   (14)

otherwise, Reuse_px is

Reuse_px = (Pof × Pkx × Pky × Pix × Piy) / (((Pix − 1)S + Pkx) × ((Piy − 1)S + Pky)).   (15)

The spatial reuse of pixels is realized either by unrolling Loop-4 (Pof > 1) or by unrolling both Loop-1 and Loop-3 together. Only unrolling Loop-1 (Pix = 1, Piy = 1) or only unrolling Loop-3 (Pkx = 1, Pky = 1) hampers reusing pixels, and Reuse_px = Pof.

If intratile Loop-3 is computed first, the weights can be reused for Tox × Toy/(Pox × Poy) consecutive cycles. If intratile Loop-4 is computed first, the pixels can be reused for Tof/Pof consecutive cycles.
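The spatial-reuse expressions (9)–(15) reduce to a few lines of Python; the sketch below (our helper names) assumes Pix = Pox and Piy = Poy as in (3).

```python
def distinct_operands_per_cycle(P, S):
    # Eq. (9): distinct weights needed per cycle.
    Pwt = P["Pof"] * P["Pif"] * P["Pkx"] * P["Pky"]
    # Eqs. (10)-(11): distinct pixels per cycle; sliding-window overlap appears
    # only when Loop-1 is unrolled (Pkx or Pky > 1).
    if P["Pkx"] == 1 and P["Pky"] == 1:
        Ppx = P["Pif"] * P["Pox"] * P["Poy"]
    else:
        Ppx = (P["Pif"] * ((P["Pox"] - 1) * S + P["Pkx"])
                        * ((P["Poy"] - 1) * S + P["Pky"]))
    return Pwt, Ppx

def spatial_reuse(P, S):
    Pm = P["Pkx"] * P["Pky"] * P["Pif"] * P["Pox"] * P["Poy"] * P["Pof"]   # Eq. (4)
    Pwt, Ppx = distinct_operands_per_cycle(P, S)
    reuse_wt = Pm // Pwt      # Eq. (12): equals Pox * Poy
    reuse_px = Pm / Ppx       # Eqs. (13)-(15): equals Pof when Loop-1 is not unrolled
    return reuse_wt, reuse_px
```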

D. Access of On-Chip Buffer

With data reuse, the number of on-chip buffer accesses can be significantly reduced. Without any data reuse, the total number of read operations from on-chip buffers for both pixels and weights is Nm, as every multiplication needs one pixel and one weight. With data reuse, the total number of read operations from on-chip buffers for weights becomes

#read_wt = Nm/Reuse_wt   (16)

and the total number of read operations for pixels is

#read_px = Nm/Reuse_px.   (17)

If the final output pixels cannot be obtained within one tile, their partial sums are stored in buffers. The number of write and read operations to/from buffers for partial sums per cycle is 2 × Pof × Pox × Poy, where all partial sums generated by Loop-1 (Pkx, Pky) and Loop-2 (Pif) are already summed together right after the multiplications. The total number of writes/reads to/from buffers for partial sums is

#wr_rd_psum = #_cycles × (2 × Pof × Pox × Poy).   (18)

The number of times output pixels are written to on-chip buffers (i.e., #write_px) is identical to the total number of output pixels in the given CNN model. Finally, the total number of on-chip buffer accesses is

#buffer_access = #read_px + #read_wt + #wr_rd_psum + #write_px.   (19)

E. Access of External Memory

In our analysis, both the weights and the intermediate pixel results are assumed to be stored in external memory (DRAM), which is a necessity when mapping large-scale CNNs onto moderate FPGAs. The costs of DRAM accesses are higher latency and energy than on-chip block RAM (BRAM) accesses [8], [20], and therefore it is important to reduce the number of external memory accesses to improve the overall performance and energy efficiency. The minimum number of DRAM accesses is achieved by having sufficiently large on-chip buffers and proper loop computing orders, such that every pixel and weight needs to be transferred from DRAM only once. Otherwise, the same pixel or weight has to be read multiple times from DRAM to be consumed by multiple tiles.

The flowchart to estimate the number of DRAM accesses is shown in Fig. 10, where #DRAM_px and #DRAM_wt denote the number of DRAM accesses of one input pixel and one weight, respectively. After being fetched from DRAM, all data should be exhaustively utilized before being evicted from the buffer. Therefore, if the tile size or the on-chip buffer can fully cover either all input pixels or all weights of one layer, the minimum DRAM access can be achieved, as in (10.8) inside Fig. 10. By computing Loop-3 first, weights stored in the buffer are reused and #DRAM_wt is reduced, as in (10.1) and (10.5) inside Fig. 10. Similarly, by computing Loop-4 first, pixels can be reused to reduce #DRAM_px, as in (10.3) and (10.6) inside Fig. 10. However, computing Loop-3 or Loop-4 first may postpone the computation of Loop-1 or Loop-2, which would lead to a large number of partial sums.

Fig. 10. Design space exploration of the number of external memory accesses.

IV. LOOP OPTIMIZATION IN RELATED WORKS

In this section, the acceleration schemes of the state-of-the-art hardware CNN accelerators are compared. The loop unrolling strategies of current designs can be categorized into four types:

1) [Type-(A)] unroll Loop-1, Loop-2, Loop-4 [11], [13], [17], [19];
2) [Type-(B)] unroll Loop-2, Loop-4 [9], [14];
3) [Type-(C)] unroll Loop-1, Loop-3 [7], [8], [21];
4) [Type-(D)] unroll Loop-3, Loop-4 [15], [16], [18].

By unrolling Loop-1, Loop-2, and Loop-4 in type-(A), parallelism is employed in kernel maps and in input and output feature maps. However, the kernel size (Nkx × Nky) is normally very small (≤11 × 11), so it cannot provide sufficient parallelism, and other loops need to be further unrolled. A more challenging problem is that the kernel size may vary considerably across different convolution layers in a given CNN model (e.g., AlexNet [2], ResNet [5]), which may cause workload imbalance and inefficient utilization of the PEs [21]. To address this, PEs need to be configured differently for layers with different kernel sizes [10], which increases control complexity.

In type-(C), every row in the kernel window is fully unrolled (Pkx = Nkx) and Loop-3 is also partially unrolled. By this means, pixels can be reused through the overlapping caused by Loop-1 and Loop-3 as in (15), and weight reuse can also be realized by unrolling Loop-3 as in (12). However, Loop-4 is not unrolled and further pixel reuse cannot be achieved. The PE efficiency issue caused by unrolling Loop-1 also affects type-(C) [21].

In type-(A) and type-(B), Loop-3 is not unrolled, which implies that weights cannot be reused. Type-(B) only unrolls Loop-2 and Loop-4, but Nif × Nof of the first convolution layer is usually small (≤3 × 96) and cannot provide sufficient parallelism, which results in low utilization and throughput. If the first layer is computation bounded or the DRAM delay is not overlapped with the computation, the throughput degradation will affect the overall performance, especially for shallow CNNs, e.g., AlexNet and NiN.

In type-(D), both Loop-3 and Loop-4 are unrolled so that both pixels and weights can be reused. In addition, Nox × Noy × Nof (≥7 × 7 × 64) is very large across all the convolution layers in AlexNet, VGG, and ResNet, so that a high level of parallelism can be achieved even on the largest FPGAs available with ∼3600 DSP slices. By this means, a uniform configuration and structure of PEs can be applied for all the convolution layers.
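As a concrete illustration of the four categories, the short script below evaluates the spatial-reuse expressions (9)–(15) for one representative unrolling configuration per type. The specific P∗ values are ours and only illustrative, but the output shows the qualitative claim above: only type-(D) obtains both weight reuse (Loop-3) and pixel reuse (Loop-4).

```python
# Illustrative unrolling configurations, one per category of Section IV (values are ours).
EXAMPLES = {
    "Type-A (Loop-1,2,4)": dict(Pkx=3, Pky=3, Pif=4, Pox=1, Poy=1, Pof=16),
    "Type-B (Loop-2,4)":   dict(Pkx=1, Pky=1, Pif=8, Pox=1, Poy=1, Pof=16),
    "Type-C (Loop-1,3)":   dict(Pkx=3, Pky=1, Pif=1, Pox=16, Poy=1, Pof=1),
    "Type-D (Loop-3,4)":   dict(Pkx=1, Pky=1, Pif=1, Pox=7,  Poy=7, Pof=16),
}

def reuse(P, S=1):
    Pm = P["Pkx"] * P["Pky"] * P["Pif"] * P["Pox"] * P["Poy"] * P["Pof"]   # Eq. (4)
    Pwt = P["Pof"] * P["Pif"] * P["Pkx"] * P["Pky"]                        # Eq. (9)
    if P["Pkx"] == 1 and P["Pky"] == 1:                                    # Eq. (10)
        Ppx = P["Pif"] * P["Pox"] * P["Poy"]
    else:                                                                  # Eq. (11)
        Ppx = P["Pif"] * ((P["Pox"] - 1) * S + P["Pkx"]) * ((P["Poy"] - 1) * S + P["Pky"])
    return Pm / Pwt, Pm / Ppx                                              # Eqs. (12)-(15)

for name, cfg in EXAMPLES.items():
    wt, px = reuse(cfg)
    print(f"{name:22s} Reuse_wt = {wt:5.2f}  Reuse_px = {px:5.2f}")
```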

Loop tiling has been used in prior hardware CNN accelerators to fit large-scale CNN models into limited on-chip buffers. However, only a few prior works [13], [18] have shown their tiling configurations that determine the on-chip buffer size, and the tradeoff between the loop tiling size and the number of external memory accesses is not explored. The impact of loop interchange has not been rigorously studied in prior works, but it can greatly impact the number of partial sums as well as the resulting data movements and memory accesses.

V. PROPOSED ACCELERATION SCHEME

The optimization process of our proposed acceleration scheme is presented in this section, which includes the appropriate selection of the convolution loop design variables.

A. Minimizing Computing Latency

We set the variables P∗ to be common factors of T∗ for all the convolution layers to fully utilize the PEs, and T∗ to be common factors of N∗ to make full use of the external memory transactions. For CNN models with only small common factors, it is recommended to set ⌈N∗/T∗⌉ − N∗/T∗ and ⌈T∗/P∗⌉ − T∗/P∗ as small as possible to minimize the inefficiency caused by the difference in sizes of CNN models.

B. Minimizing Partial Sum Storage

To reduce the number and movements of partial sums, both Loop-1 and Loop-2 should be computed as early as possible or unrolled as much as possible. To avoid the drawback of unrolling Loop-1 as discussed in Section IV and to maximize the data reuse as discussed in Section III-C, we decide to unroll Loop-3 (Pox > 1 or Poy > 1) and Loop-4 (Pof > 1). By this means, we cannot attain the minimum partial sum storage, as in (9.1) inside Fig. 9.

Constrained by 1 ≤ P∗ ≤ T∗ ≤ N∗, the second smallest partial sum storage is achieved by (9.2) among (9.2)–(9.9) inside Fig. 9. To satisfy the condition for (9.2), we serially compute Loop-1 and Loop-2 first and ensure the required data of Loop-1 and Loop-2 are buffered, i.e., Tkx = Nkx, Tky = Nky, and Tif = Nif. Therefore, we only need to store Pof × Pox × Poy partial sums, which can be retained in local registers with minimum data movements.

C. Minimizing Access of On-Chip Buffer

The number of on-chip buffer accesses is minimized by unrolling Loop-3 to reuse weights as shown in (12) and by unrolling Loop-4 to reuse pixels as shown in (14). As our partial sums are kept in local registers, they do not add overhead to the buffer access and storage.

D. Minimizing Access of External Memory

As we first compute Loop-1 and Loop-2 to reduce partial sums, we cannot achieve the minimum number of DRAM accesses described in (10.1) and (10.3) inside Fig. 10, where neither the pixels nor the weights are fully buffered for one convolution layer. Therefore, we can only attain the minimum DRAM access by assigning a sufficient buffer size for either all pixels or all weights of each layer, as in (10.8) inside Fig. 10. Then, the optimization of minimizing the on-chip buffer size while having minimum DRAM access is formulated as

min bits_BUF_px_wt
s.t. #Tile_px_L = 1 or #Tile_wt_L = 1, ∀L ∈ [1, #CONVs]   (20)

where #Tile_px_L and #Tile_wt_L denote the number of tiling blocks for the input pixels and weights of layer L, respectively, and #CONVs is the number of convolution layers. bits_BUF_px_wt is the sum of the pixel buffer size (bits_BUF_px) and the weight buffer size (bits_BUF_wt), which is given by

bits_BUF_px_wt = bits_BUF_px + bits_BUF_wt.   (21)

Both pixel and weight buffers need to be large enough to cover the data in one tiling block for all the convolution layers. This is expressed as

bits_BUF_px = MAX(words_px_L) × pixel_datawidth, with L ∈ [1, #CONVs]   (22)
bits_BUF_wt = MAX(words_wt_L) × weight_datawidth, with L ∈ [1, #CONVs]   (23)

where words_px_L and words_wt_L denote the number of pixels and weights of one tiling block in layer L, respectively. These are expressed in terms of the loop tiling variables as follows:

words_px_L = Tix_L × Tiy_L × Tif_L + Tox_L × Toy_L × Tof_L   (24)
words_wt_L = Tof_L × Tif_L × Tkx_L × Tky_L   (25)

where words_px_L comprises both input and output pixels. The number of tiles in (20) is also determined by the T∗ variables:

#Tile_px_L = ⌈Nif_L/Tif_L⌉ × ⌈Nox_L/Tox_L⌉ × ⌈Noy_L/Toy_L⌉   (26)
#Tile_wt_L = ⌈Nkx_L/Tkx_L⌉ × ⌈Nky_L/Tky_L⌉ × ⌈Nif_L/Tif_L⌉ × ⌈Nof_L/Tof_L⌉.   (27)

By solving (20), we can find an optimal configuration of the T∗ variables that results in the minimum DRAM access and on-chip buffer size. However, since we have already set Tkx = Nkx, Tky = Nky, and Tif = Nif as in Section V-B, we can only achieve a suboptimal solution by tuning Tox, Toy, and Tof, resulting in a larger buffer size requirement. If the available on-chip memory is sufficient, we set Tox = Nox so that an entire row can be buffered to benefit the direct memory access (DMA) transactions with continuous data.
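Before discussing how (20) is actually solved, here is a direct Python transcription of (20)–(27): for a candidate assignment of the tiling variables, it computes the per-layer words and tile counts, checks the feasibility condition of (20), and returns the resulting buffer sizes. The helper names are ours; Tkx = Nkx, Tky = Nky, Tif = Nif as chosen above.

```python
from math import ceil

def words_px(L):
    # Eq. (24): input-pixel words of one tile plus the output-pixel words it produces,
    # with Tix/Tiy from Eq. (2).
    Tix = (L["Tox"] - 1) * L["S"] + L["Nkx"]
    Tiy = (L["Toy"] - 1) * L["S"] + L["Nky"]
    return Tix * Tiy * L["Tif"] + L["Tox"] * L["Toy"] * L["Tof"]

def words_wt(L):
    # Eq. (25): weight words of one tile.
    return L["Tof"] * L["Tif"] * L["Tkx"] * L["Tky"]

def num_tiles(L):
    # Eqs. (26)-(27): how many tiles the layer's pixels and weights are split into.
    n_px = ceil(L["Nif"] / L["Tif"]) * ceil(L["Nox"] / L["Tox"]) * ceil(L["Noy"] / L["Toy"])
    n_wt = (ceil(L["Nkx"] / L["Tkx"]) * ceil(L["Nky"] / L["Tky"])
            * ceil(L["Nif"] / L["Tif"]) * ceil(L["Nof"] / L["Tof"]))
    return n_px, n_wt

def buffer_bits(layers, pixel_bits=16, weight_bits=16):
    # Eq. (20): every layer must fully buffer either its pixels or its weights.
    for L in layers:
        n_px, n_wt = num_tiles(L)
        assert n_px == 1 or n_wt == 1, "minimum-DRAM-access condition violated"
    # Eqs. (21)-(23): each buffer must cover the largest tile over all layers.
    bits_px = max(words_px(L) for L in layers) * pixel_bits
    bits_wt = max(words_wt(L) for L in layers) * weight_bits
    return bits_px + bits_wt, bits_px, bits_wt
```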

Finally, we have to solve (20) by searching Toy and Tof, because it has a nonlinear objective function and constraints with integer variables. Since Toy and Tof in VGG-16 consist of 2 × #CONVs = 26 variables, and each variable can have about four candidate values constrained by T∗/P∗ = integer and N∗/T∗ = integer, the total number of Toy and Tof configurations is about 4^26 ≈ 4.5 × 10^15, which is an enormous solution space. In ResNet-50/ResNet-152, #CONVs increases to 53 and 155, respectively, which makes the solution space even larger, about 4^106 ≈ 6.6 × 10^63 and 4^310 ≈ 4.4 × 10^186, respectively. Therefore, it is impossible to enumerate all the candidate solutions.

In this paper, we propose to empirically find a satisfactory solution for a given on-chip memory capacity that takes advantage of a property of CNNs. CNNs normally have large pixel data volumes and small weight sizes in the first few layers. As we proceed into deeper layers, the pixel sizes become smaller with extracted features, and the weight sizes become larger with more channels. This trend is illustrated in Fig. 11, where the bars denote the data sizes in each convolution layer. To benefit from this data distribution property across layers, we only need to make the pixel buffers fully cover the last few layers and the weight buffers fully cover the first few layers. Then, the middle layers with both relatively large pixel and weight sizes become the constraints on the buffer sizes, and we only need to take care of these bounding layers, which significantly shrinks the solution space. The dashed lines in Fig. 11 are the minimal buffer sizes we found while guaranteeing minimum DRAM accesses, and the bounding layers are pointed out by arrows. If this buffer size still cannot fit into the FPGA on-chip memory, then we need to either change the tiling strategy or decrease the buffer sizes at the cost of more DRAM accesses, as discussed in [15].

Fig. 11. To guarantee minimum DRAM accesses, either all pixels (blue bars) are covered by the pixel buffers (blue dashed lines) or all weights are covered by the weight buffers in one layer. Then, we try to lower the total buffer sizes/lines. (a) Pixel and weight distribution of the convolution layers in VGG-16. (b) Pixel and weight distribution of the convolution layers in ResNet-50.

E. Optimized Loop Design Variables

According to the aforementioned optimization process, we propose a convolution acceleration scheme for a high-performance and low-communication CNN accelerator, which is visualized in Fig. 12.

Fig. 12. Optimized loop unrolling and tiling strategy. The parallelism is within one feature map (Pox × Poy) and across multiple kernels (Pof). The tiling variables Tiy, Toy, and Tof can be tuned to decide the buffer sizes.

1) Loop Unrolling: For all the convolution layers, Loop-1 and Loop-2 are not unrolled, which means Pkx = 1, Pky = 1, and Pif = 1. According to (7) and (8), Pox, Poy, and Pof are set to be common factors of the feature map sizes (Nox, Noy) and output channels (Nof), respectively, to fully utilize the multipliers. The configurations of Pox, Poy, and Pof for different CNNs on different FPGAs are listed in Table II; they are largely constrained by the available computing resources. By setting P∗ to be constant across all the convolution layers, a uniform structure and mapping of PEs can be realized to reduce the architecture complexity.

TABLE II: OUR IMPLEMENTATION OF DIFFERENT CNNS ON DIFFERENT FPGAS

2) Loop Tiling: For loop tiling, we set Tkx = Nkx, Tky = Nky, and Tif = Nif as described in Section V-B and shown in Fig. 12, so that the data used in Loop-1 and Loop-2 are all buffered, and Tox = Nox to benefit DMA transfers. Details of Toy and Tof are described in Section V-D.

3) Loop Interchange: For loop interchange, we first serially compute Loop-1 and then Loop-2 as described in Section V-B. Finally, we compute Loop-3 and Loop-4, where the exact computation order of these two loops does not have a pronounced impact on the cost, based on our P∗ and T∗ choices.
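The empirical procedure of Section V-D can be sketched as follows: keeping Tkx = Nkx, Tky = Nky, Tif = Nif, and Tox = Nox, each layer either buffers all of its pixels (so #Tile_px = 1) or all of its weights (so #Tile_wt = 1), whichever is cheaper for that layer, and the buffer sizes are then set by the bounding (worst-case) layers. This is a simplified sketch of the idea under those assumptions, not the exact search used for the reported designs.

```python
def size_buffers(layers, pixel_bits=16, weight_bits=16):
    """layers: per-layer dicts with Nkx, Nky, Nif, Nof, Nox, Noy, Nix, Niy.
    Returns (pixel_buffer_bits, weight_buffer_bits) such that every layer fully
    buffers either all its pixels or all its weights (minimum DRAM traffic)."""
    need_px, need_wt = 0, 0
    for L in layers:
        # Words if the whole layer's pixels (inputs + outputs) are kept on chip.
        all_px = L["Nix"] * L["Niy"] * L["Nif"] + L["Nox"] * L["Noy"] * L["Nof"]
        # Words if the whole layer's weights are kept on chip.
        all_wt = L["Nkx"] * L["Nky"] * L["Nif"] * L["Nof"]
        # Early layers tend to be pixel-heavy and late layers weight-heavy; pick the
        # cheaper side per layer, then the maxima define the bounding layers.
        if all_px * pixel_bits <= all_wt * weight_bits:
            need_px = max(need_px, all_px * pixel_bits)
        else:
            need_wt = max(need_wt, all_wt * weight_bits)
    # Note: one tile of the non-fully-buffered operand still needs some space;
    # that smaller term is omitted here for brevity.
    return need_px, need_wt
```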

VI. PROPOSED CNN ACCELERATOR

To implement the optimized convolution acceleration scheme of Section V-E, a data router is proposed with high flexibility for different convolution sliding settings, e.g., strides and zero paddings, using variant data buses. A corresponding hardware PE architecture is also designed that minimizes on/off-chip memory accesses and data movements.

A. Data Bus From Buffer to PE (BUF2PE)

In [15] and [16], a register array architecture is designed to rearrange and direct the pixel stream from buffers into PEs. This method takes advantage of the convolution stride being 1 in VGG-16, so that pixels can be reused by the adjacent register array in the next computing cycles. However, if the stride is 2 or more, which frequently occurs in CNN algorithms [2], [3], [5], pixels need to wait for Nkx × (stride − 1) cycles to be reused by the neighboring register array. This makes the control logic and wire routing among the registers much more complicated. Therefore, we propose a BUF2PE data bus, shown in Fig. 13, to implement the dataflow using FIFOs to temporarily store pixels to be reused by the adjacent register array. This method is similar to the line buffer design in [22], where FIFOs are used to align pixels from multiple feature rows to a kernel window so that parallelism can be employed within a kernel window, i.e., unrolling Loop-1, whereas this paper unrolls Loop-3 to compute in parallel within one feature map. By this means, the wire routing within and across register arrays is simplified, and the data router can follow the same pattern for convolution with different strides and zero paddings to improve the accelerator flexibility.

The detailed design of the BUF2PE data bus is illustrated in Fig. 13. Pixels from input buffers are loaded into the corresponding registers, as shown by the blue dashed box to the blue solid box. Then, the pixels are sent to the PEs or MAC units and are also sent to FIFOs during cycles 0 to 5, waiting to be reused by the adjacent register array. Register arrays except the rightmost one start reading input pixels from the FIFOs at cycle 3, as shown by the purple pixels in Fig. 13. Meanwhile, new pixels are fed into the rightmost register array from the buffers. In this paper, the offset caused by west zero padding is handled by shifting the connection between buffers and register arrays, whereas [15] has to change the storage pattern within one address of the input buffer by a padding offset, which increases the complexity of transferring data from DRAM to buffers.

Fig. 13. BUF2PE data bus directs the convolution pixel dataflow from input buffers to PEs (i.e., MAC units), where Pox = 3 and Poy = 3.

The coarse-grained dataflow is shown in Fig. 14 at the feature map row level for stride = 1 and stride = 2. The dataflow in Fig. 14(a) is the same as in Fig. 13, where more clock cycles of operation are shown after cycle 8. In Fig. 14(b), the dataflow with stride = 2 and zero padding = 3 is shown, which follows the same pattern as the case with stride = 1. The buffer storage pattern is adjusted according to the different stride and padding settings. Three rows of zeros are added to the buffer due to the north zero padding of 3. With stride = 2, every two rows of pixels are continuously distributed across the Poy buffer banks. These adjustments are handled by the buffer write enable and address signals during the reception of pixels from DRAM. Since the data movement within a register array or a feature map row differs for different settings of stride and zero padding, various BUF2PE data buses are needed for each dataflow, and the set of data buses is called the data router.
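The buffer storage pattern just described can be modeled in a few lines; the sketch below is one plausible reading (our assumption, not taken verbatim from Fig. 14): `pad_north` zero rows are prepended, and then groups of `stride` consecutive rows are written to the same bank, cycling through the Poy banks.

```python
def rows_to_banks(num_rows, pad_north, stride, Poy):
    """Map padded input feature-map rows to Poy input-buffer banks.
    Rows with index < pad_north are zero-padding rows; with stride = 2 every two
    consecutive rows land in the same bank (our interpretation of the text above)."""
    mapping = {}
    for padded_row in range(pad_north + num_rows):
        bank = (padded_row // stride) % Poy
        is_zero = padded_row < pad_north
        mapping.setdefault(bank, []).append("zero" if is_zero else padded_row - pad_north)
    return mapping

# Example: a ResNet conv1-like setting with stride = 2, north zero padding = 3,
# and Poy = 3 buffer banks as in Fig. 13.
print(rows_to_banks(num_rows=8, pad_north=3, stride=2, Poy=3))
```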

Fig. 14. Coarse-grained designs of BUF2PE data buses for (a) stride = 1 and zero padding = 1 and (b) stride = 2 and zero padding = 3.

If these settings are identical, one BUF2PE bus can handle different kernel sizes (Nkx × Nky) without the penalty of idle cycles, as we serially compute Loop-1. Therefore, the BUF2PE bus in Fig. 14(b) can be applied to conv1 in ResNet with stride = 2 and zero padding = 3. For other sliding settings in ResNet, e.g., stride = 2 and zero padding = 0, the corresponding variants of BUF2PE buses are designed to direct the dataflow. The global control logic controls the switch among the different BUF2PE buses inside the data router. After Nkx × Nky cycles, we complete one kernel window sliding (Loop-1) and move to the next input feature map with the same dataflow until the last one, as shown in Fig. 14. After Nkx × Nky × Nif cycles, both Loop-1 and Loop-2 are completed and we obtain Pox × Poy × Pof final output pixels.

In summary, the proposed dataflow is scalable to any Nkx × Nky by changing the control logic, and it can handle various sliding settings using variant BUF2PE data buses inside the data router, where the MAC units are reused and kept busy.

B. Convolution PE Architecture

The PE architecture of the convolution layers, shown in Fig. 15, is designed according to the proposed acceleration strategy and dataflow. It comprises Pox × Poy × Pof PEs, and every PE in our architecture is an independent MAC unit consisting of one multiplier followed by an accumulator. As Loop-1 and Loop-2 are not unrolled, no adder tree is needed to sum the multiplier outputs. The partial sum is consumed inside each MAC unit until the final results are obtained, such that the data movements of partial sums are minimized. Pixels read from the input pixel buffers are shared by Pof MAC units, and sliding-overlapped pixels are also reused by the data router. Weights read from the weight buffers are shared by Pox × Poy MAC units. The proposed architecture is implemented with parameterized Verilog code and is highly scalable to different CNN models on FPGAs or even ASICs by modifying design variables such as Pox, Poy, and Pof.

Fig. 15. Convolution acceleration architecture with Pox × Poy × Pof MAC units.

After the completion of Loop-1 and Loop-2, the partial sums need to be added with biases as in Fig. 1 to obtain the final output pixels. Therefore, every Nkx × Nky × Nif cycles, the MAC units output the partial sums into the adders to be added with biases. Since Poy < Nkx × Nky × Nif for all the layers, we serialize the Pox × Poy × Pof MAC outputs into Poy cycles. Then, we only need Pox × Pof adders to add the biases in parallel. The data width of one output buffer can also be reduced to Pox, and we store the pixels of one output feature map in one buffer bank, which would require Pof output buffers in total. If Pof is large, e.g., Pof = 64, it would require many output buffers with shallow depth, resulting in low utilization of on-chip BRAMs (e.g., M20K memory blocks). In addition, batch normalization (Bnorm) layers in ResNet would still need Pox × Pof adders and multipliers, which are expensive. We further serialize the Pox × Pof parallel outputs to Pox × #OUTBUF using multiplexers, with neighboring output feature maps stacked in one output buffer, as illustrated in Fig. 15. In ResNet, we set #OUTBUF = 16 to ensure Poy × Pof/(#OUTBUF) < Nkx × Nky × Nif, i.e., the number of serial output cycles is smaller than the MAC unit output interval cycles. By this means, the parallelism of the adders and multipliers for bias and Bnorm is significantly reduced, as well as the output buffer bandwidth and the number of M20K BRAMs used.
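The serialization condition stated above, Poy × Pof/#OUTBUF < Nkx × Nky × Nif, is easy to check per layer. The small helper below (our code, illustrative layer shapes) verifies a candidate #OUTBUF; it only checks feasibility and does not claim to reproduce the exact selection procedure behind #OUTBUF = 16.

```python
def outbuf_ok(Poy, Pof, outbuf, layers):
    """Check Poy*Pof/#OUTBUF < Nkx*Nky*Nif for every convolution layer, i.e. the
    serialized output writes finish within the interval between MAC output bursts."""
    return all(Poy * Pof / outbuf < L["Nkx"] * L["Nky"] * L["Nif"] for L in layers)

# Illustrative ResNet-style layer shapes; the 1x1 layers with Nif = 64 are the
# tightest case here.
layers = [dict(Nkx=7, Nky=7, Nif=3), dict(Nkx=1, Nky=1, Nif=64), dict(Nkx=3, Nky=3, Nif=64)]
print(outbuf_ok(Poy=7, Pof=64, outbuf=16, layers=layers))   # True: 7*64/16 = 28 < 64
```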

C. Pooling Layers

Pooling is commonly used to reduce the feature map dimensions by replacing the pixels within a kernel window (e.g., 2 × 2, 3 × 3) by their maximum or average value. The output pixels from the previous convolution layers are stored row-by-row in the output pixel buffers. As pooling operations only need pixels, after one tile of convolution is finished, we directly compute pooling with pixels read from the output pixel buffers to eliminate accesses of external memory. The unrolling factors of all the pooling layers are the same. Since the width of the output pixel buffer is Pox, we can enable Pox × #OUTBUF parallel pooling operations, which is large enough considering that pooling layers involve far fewer operations than convolution layers. Register arrays are used to reshape the pooling input pixels and ensure continuous feeding of pixels into the pooling PEs without idle cycles. The PEs are either comparators for max pooling or accumulators followed by constant-coefficient multipliers for average pooling. The outputs of pooling are written back to the output pixel buffers and then transferred to the external memory.

D. Fully Connected Layers

The inner-product or fully connected (FC) layer is a special form of the convolution layer with Nkx = Nky = Nox = Noy = 1, i.e., there are no Loop-1 and Loop-3. Therefore, we only unroll Loop-4 and reuse the same MAC unit array used in the convolution layers for all the FC layers. In contrast to convolution layers, FC layers have a large amount of weights but a small amount of operations, which makes the throughput of FC layers primarily bounded by the off-chip communication speed. Due to this, dual weight buffers can be used to overlap the inner-product computation with off-chip communication to increase FC throughput. However, in recent CNN models, e.g., ResNet, the size of the FC weights (= 2M) has been significantly reduced compared to that of VGG (= 123.6M), and FC layers are completely removed in NiN [3]. Considering this trend that CNNs are decreasing their reliance on FC layers, and the relatively smaller number of FC operations, the dual buffer technique is not used in this paper. We reuse the convolution weight buffers for the FC weights and start the FC computations after the weights are read from DRAM. Thus, in the VGG implementation, the FC layers have a significant contribution to the overall system latency. FC layer output pixels are directly stored in on-chip buffers as their size is small (<20 kB).

VII. EXPERIMENTAL RESULTS

A. System Setup

The proposed hardware CNN inference accelerator is demonstrated by implementing the NiN [3], VGG-16 [4], and ResNet-50/ResNet-152 [5] CNN models on two Intel FPGAs. The two Intel FPGAs, Stratix V GXA7 and Arria 10 GX 1150, consist of 234.7K/427.2K adaptive logic modules (ALMs), 256/1,518 DSP blocks, and 2,560/2,713 M20K BRAM blocks, respectively. The underlying FPGA boards for Stratix V and Arria 10 are the Terasic DE5-Net and Nallatech 385A, respectively, and both are equipped with two banks of 4GB DDR3 DRAM.

The overall CNN acceleration system on the FPGA chip, shown in Fig. 16, is coded in parameterized Verilog scripts and configured by the proposed CNN compiler in [16] for different CNN and FPGA pairs. If a layer does not exist in the CNN model, the corresponding computing module is not synthesized, and the dataflow just bypasses this module. With two DRAM banks, both kernel and feature maps are separated into these two banks to enable full off-chip communication. Two modular scatter-gather DMA engines provided by Intel are used to simultaneously read and write from/to these two DRAM banks. Data scatter and gather [16] are used to distribute the data stream from the DMA into multiple input buffers and to collect data from multiple output buffers into one DMA stream, respectively. After the input images and weights are loaded into the DRAMs, the CNN inference acceleration process starts. When the computation of one loop tile completes, the output pixels are transferred to DRAM, and then the weights and pixels for the next loop tile are loaded from DRAM to the on-chip buffers. The controller governs the iterations of the four convolution loops and the layer-by-layer sequential computation. The buffer read and write addresses are also generated by the controller.

Fig. 16. Overall FPGA-based CNN hardware acceleration system [16].

A fixed-point data representation is used, and both pixels and weights are 16-bit. The decimal points are dynamically adjusted according to the ranges of the pixel values in different layers to fully utilize the available data width [13]. By this means, the top-1 and top-5 ImageNet classification accuracy degradation is within 2% compared with the software floating-point implementation [10]–[14].
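The dynamic fixed-point scheme described above (16-bit words whose binary point is chosen per layer from the value range) can be sketched generically as follows; this is a software model of the idea, not the exact quantizer behind the reported accuracy numbers.

```python
import math

def choose_frac_bits(values, total_bits=16):
    """Pick the number of fractional bits so that the largest magnitude in this
    layer still fits in a signed `total_bits` fixed-point word."""
    max_abs = max(abs(v) for v in values)
    int_bits = max(1, math.ceil(math.log2(max_abs + 1e-12)) + 1)   # sign + integer part
    return total_bits - int_bits

def quantize(values, frac_bits, total_bits=16):
    """Round to the nearest representable fixed-point value and saturate."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return [max(lo, min(hi, round(v * scale))) / scale for v in values]

# Per-layer usage: the binary point follows the range of that layer's pixels/weights.
layer_acts = [0.03, -1.7, 2.5, 0.0004]
fb = choose_frac_bits(layer_acts)
print(fb, quantize(layer_acts, fb))
```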

B. Analysis of Experimental Results

The performance and specifications of our proposed CNN accelerators are summarized in Table II. In Stratix V and Arria 10, one DSP block can be configured as either two independent 18-bit × 18-bit multipliers or one multiplier followed by an accumulator, i.e., one MAC. Since one multiplier consumes much more logic than one adder, we use the DSP as two independent multipliers and implement the accumulator inside the MAC unit with ALMs. Since Arria 10 has 1.8× more ALMs and 5.9× more DSP blocks than the Stratix V device we use, larger loop unrolling variables (Pox × Poy × Pof) can be achieved on Arria 10 to obtain a >2× throughput enhancement over Stratix V.

Compared with [15], the unrolling variables, i.e., Pox × Poy × Pof, of VGG-16 are set to 7 × 7 × 64 on Arria 10 instead of 14 × 14 × 16, where the number of MAC units (= 3136) is the same and both sets of P∗ variables are common factors of the feature/kernel map sizes, resulting in the same computation cycles. The data router in Fig. 13 and the data buses after the MAC units in Fig. 15 are only related to Pox and Poy, whereas the data buses related to Pof, from the weight buffers to the MAC units in Fig. 15, are relatively simple. To reduce the data bus width and required logic, we choose a smaller Pox × Poy of 7 × 7 in this work with a larger Pof of 64. Since the greatest common factors of the feature/kernel maps, e.g., Nox × Noy × Nof, of all convolution (Conv.) layers in ResNets are 7 × 7 × 64, we still set Pox × Poy × Pof to 7 × 7 × 64. Since ResNets have a more complex structure and more types of layers, e.g., Eltwise and Bnorm, they consume more logic elements than NiN and VGG-16 on Arria 10 and cannot achieve the same parallel degree as NiN and VGG-16 on Stratix V. Since the two FPGAs have close capacities of on-chip BRAMs, the loop tiling variables (T∗) of the same CNN are set to be the same for both FPGAs, which leads to similar BRAM consumption.

The breakdown of the processing time per image of each CNN is shown in Fig. 17 with batch size = 1. The MAC computation time of the convolution layers, i.e., "Conv MAC," dominates the total latency by over 50%. "Conv DRAM" includes the DRAM transaction delay of the convolution weights and the input–output pixels. The FC latency includes the inner-product computation delay and the DRAM transfer delay of the FC weights. "Others" includes the delay of average pooling, element-wise layers, and pipeline stages.

Fig. 17. Latency breakdown per image of NiN, VGG-16, and ResNet-50/152 on the Stratix V and Arria 10 FPGA platforms [16].

The logic utilization in ALMs of each module is shown in Fig. 18. Most multipliers in the MAC units are implemented by DSPs, and logic elements are mainly used to implement the accumulators in the MAC units. With the same parallel computation degree, the MAC units of the four CNNs use about the same amount of ALMs. As VGG-16 is highly uniform with only one convolution sliding setting, i.e., stride = 1 and padding = 1, only one BUF2PE bus is needed, which leads to less logic and BRAM consumption in the data router compared to NiN and the ResNets. The convolution and FC layers share the MAC units but have their own control logic to govern the sequential operations. Eltwise layers use adders to element-wise add pixels from two branches of layers. "Others" includes the system interconnections, global control logic, bias adders, and configuration registers.

Fig. 18. Logic utilization breakdown of NiN, VGG-16, and ResNet-50/152.

C. Comparison With Prior Works

TABLE III: PREVIOUS CNN FPGA IMPLEMENTATIONS

The reported results from recent CNN FPGA accelerators are listed in Table III. Rahman et al. [18] only implement the convolution layers of AlexNet and use a similar strategy to ours of unrolling Loop-3 and Loop-4, which can also achieve high DSP utilization.

us to unroll Loop-3 and Loop-4, which can also achieve high [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
DSP utilization. In [10] and [17], the layer-by-layer computa- with deep convolutional neural networks,” in Proc. Adv. Neural Inf.
Process. Syst. (NIPS), 2012, pp. 1097–1105.
tion is pipelined using different part of one or multiple FPGAs [3] M. Lin, Q. Chen, and S. Yan. (Mar. 2014). “Network in net-
resources to improve hardware utilization and thus throughput. work.” [Online]. Available: https://arxiv.org/abs/1312.4400
However, with the highly increasing number of convolution [4] K. Simonyan and A. Zisserman, “Very deep convolutional net-
works for large-scale image recognition,” in Proc. Int. Conf. Learn.
layers [5], it becomes very difficult to map different layers Represent. (ICLR), 2015.
onto different resources and balance the computation among [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning
for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern
all the pipeline stages. In addition, pipelining can increase the Recognit. (CVPR), Jun. 2016, pp. 770–778.
throughput but not necessarily the latency. Batch computing [6] D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations
with multiple input images is applied in [8], [10], [12], [17], for high-performance computing,” ACM Comput. Surv., vol. 26, no. 4,
pp. 345–420, Dec. 1994.
and [23]. The biggest advantage of this technique is to [7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-
share the weights transferred from off-chip DRAM among efficient reconfigurable accelerator for deep convolutional neural net-
multiple images and thus increase the throughput at the cost works,” IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 127–138,
Jan. 2017.
of increased latency per image and external memory storage [8] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for
of multiple images. Benefit from batch computing and using energy-efficient dataflow for convolutional neural networks,” in Proc.
Benefiting from batch computing and from the use of 2144 DSP slices, which enable a high degree of parallelism, Li et al. [17] also achieve a high throughput of 565.94 GOPS for AlexNet. In [12], an OpenCL-based CNN accelerator is implemented on an Arria 10 FPGA, where the Intel FPGA SDK for OpenCL provides a pregenerated platform that ensures timing closure at a higher frequency than our RTL design. The Winograd transform is applied to the convolution layers, which reduces the multiplication operations by 2× or, equivalently, improves the throughput by 2× with the same number of DSPs. A 16-bit floating-point data format with a shared exponent is used, which allows the fixed-point 18-bit × 18-bit multipliers to be used directly for the floating-point operations. Wei et al. [24] proposed an OpenCL-based systolic array architecture to implement convolution on Arria 10, which reduces the fan-out of the global PE interconnect to achieve high frequency and resource utilization. The VGG-16 throughput of [24] is higher than ours mainly due to: 1) a higher operating frequency; 2) a lower precision of the weights; and 3) a dual-buffer scheme that hides the DRAM latency. Guan et al. [23] proposed an RTL–HLS hybrid framework that automatically generates the FPGA hardware and implements convolution and FC layers as matrix multiplication. Although the Stratix-V GSMD5 (with 1590 DSP blocks) used in [23] has 6.2× more DSP blocks than our Stratix-V GXA7, our accelerator on Stratix V realizes 1.2× higher throughput for ResNet-152 through higher hardware (DSP and logic) utilization, which is achieved by the proposed loop optimization technique and by exploiting logic elements, in addition to DSPs, to implement multipliers.
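As background for the matrix-multiplication formulation used in [23], the generic im2col lowering is sketched below. This is a textbook construction shown only for illustration (unit stride, no padding), not the RTL–HLS implementation of [23]: the weights form an Nof × (Nif·K·K) matrix, the rearranged input forms an (Nif·K·K) × (Noy·Nox) matrix, their product is the convolution output, and an FC layer is already a matrix–vector product that maps to the same engine.

```c
/* Generic im2col + GEMM lowering of a convolution layer (illustrative only;
 * stride 1, no padding). An FC layer is the special case K = Nox = Noy = 1. */
void im2col(const float *in, float *col,
            int Nif, int Niy, int Nix, int K, int Noy, int Nox)
{
    /* col is (Nif*K*K) x (Noy*Nox), row-major */
    for (int c = 0; c < Nif; c++)
      for (int ky = 0; ky < K; ky++)
        for (int kx = 0; kx < K; kx++)
          for (int oy = 0; oy < Noy; oy++)
            for (int ox = 0; ox < Nox; ox++)
              col[(((c * K + ky) * K + kx) * Noy + oy) * Nox + ox] =
                  in[(c * Niy + (oy + ky)) * Nix + (ox + kx)];
}

void gemm(const float *A, const float *B, float *C, int M, int Kdim, int N)
{
    /* C (M x N) = A (M x Kdim) * B (Kdim x N), all row-major */
    for (int m = 0; m < M; m++)
      for (int n = 0; n < N; n++) {
          float acc = 0.0f;
          for (int k = 0; k < Kdim; k++)
              acc += A[m * Kdim + k] * B[k * N + n];
          C[m * N + n] = acc;
      }
}

/* Convolution as GEMM: weights (Nof x Nif*K*K) times col (Nif*K*K x Noy*Nox)
 * gives the output (Nof x Noy*Nox).                                        */
void conv_as_gemm(const float *in, const float *wt, float *col, float *out,
                  int Nif, int Niy, int Nix, int Nof, int K)
{
    int Noy = Niy - K + 1, Nox = Nix - K + 1;
    im2col(in, col, Nif, Niy, Nix, K, Noy, Nox);
    gemm(wt, col, out, Nof, Nif * K * K, Noy * Nox);
}
```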
VIII. CONCLUSION

In this paper, we present an in-depth analysis of the convolution loop acceleration strategy by numerically characterizing the loop optimization techniques. The relationship between the accelerator design objectives and the design variables is quantitatively investigated. A corresponding new dataflow and architecture are proposed to minimize the data communication and to enhance the throughput. Our CNN accelerator implements the end-to-end NiN, VGG-16, and ResNet-50/ResNet-152 CNN models on Stratix V and Arria 10 FPGAs, achieving overall throughputs of 348 GOPS and 715 GOPS, respectively.
REFERENCES

[1] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2012, pp. 1097–1105.
[3] M. Lin, Q. Chen, and S. Yan. (Mar. 2014). "Network in network." [Online]. Available: https://arxiv.org/abs/1312.4400
[4] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[6] D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler transformations for high-performance computing," ACM Comput. Surv., vol. 26, no. 4, pp. 345–420, Dec. 1994.
[7] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 51, no. 1, pp. 127–138, Jan. 2017.
[8] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 367–379.
[9] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2015, pp. 161–170.
[10] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, "Energy-efficient CNN implementation on a deeply pipelined FPGA cluster," in Proc. ACM Int. Symp. Low Power Electron. Design (ISLPED), Aug. 2016, pp. 326–331.
[11] N. Suda et al., "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2016, pp. 16–25.
[12] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL deep learning accelerator on Arria 10," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2017, pp. 55–64.
[13] K. Guo et al., "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 37, no. 1, pp. 35–47, Jan. 2018.
[14] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of convolutional neural networks onto FPGA," in Proc. IEEE Int. Conf. Field-Program. Logic Appl. (FPL), Aug./Sep. 2016, pp. 1–8.
[15] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), Feb. 2017, pp. 45–54.
[16] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "An automatic RTL compiler for high-throughput FPGA implementation of diverse deep convolutional neural networks," in Proc. IEEE Int. Conf. Field-Program. Logic Appl. (FPL), Sep. 2017, pp. 1–8.
[17] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, "A high performance FPGA-based accelerator for large-scale convolutional neural networks," in Proc. IEEE Int. Conf. Field-Program. Logic Appl. (FPL), Aug. 2016, pp. 1–9.
[18] A. Rahman, J. Lee, and K. Choi, "Efficient FPGA acceleration of convolutional neural networks using logical-3D compute array," in Proc. IEEE Design, Autom. Test Eur. Conf. (DATE), Mar. 2016, pp. 1393–1398.
[19] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, "Design space exploration of FPGA-based deep convolutional neural networks," in Proc. IEEE Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2016, pp. 575–580.
[20] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," in Proc. ACM/IEEE Int. Symp. Comput. Archit. (ISCA), Jun. 2016, pp. 243–254.
[21] L. Du et al., "A reconfigurable streaming deep convolutional neural network accelerator for Internet of Things," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 1, pp. 198–208, Jan. 2018.
[22] B. Bosi, G. Bois, and Y. Savaria, "Reconfigurable pipelined 2-D convolvers for fast digital signal processing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 299–308, Sep. 1999.
[23] Y. Guan et al., "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates," in Proc. IEEE Int. Symp. Field-Program. Custom Comput. Mach. (FCCM), Apr./May 2017, pp. 152–159.
[24] X. Wei et al., "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proc. ACM 54th Annu. Design Autom. Conf. (DAC), Jun. 2017, pp. 1–6.
Yufei Ma (S'16) received the B.S. degree in information engineering from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2011 and the M.S.E. degree in electrical engineering from the University of Pennsylvania, Philadelphia, PA, USA, in 2013. He is currently working toward the Ph.D. degree at Arizona State University, Tempe, AZ, USA.
His current research interests include the high-performance hardware acceleration of deep learning algorithms on digital application-specific integrated circuits and field-programmable gate arrays.

Yu Cao (S'99–M'02–SM'09–F'17) received the B.S. degree in physics from Peking University, Beijing, China, in 1996 and the M.A. degree in biophysics and the Ph.D. degree in electrical engineering from the University of California, Berkeley, CA, USA, in 1999 and 2002, respectively.
He was a Summer Intern at Hewlett-Packard Labs, Palo Alto, CA, USA, in 2000, and at the IBM Microelectronics Division, East Fishkill, NY, USA, in 2001. He was a Postdoctoral Researcher at the Berkeley Wireless Research Center, University of California. He is currently a Professor of Electrical Engineering at Arizona State University, Tempe, AZ, USA. He has authored or coauthored numerous articles and two books on Nano-CMOS Modeling and Physical Design. His current research interests include the physical modeling of nanoscale technologies, design solutions for variability and reliability, reliable integration of postsilicon technologies, and hardware designs for on-chip learning.
Dr. Cao was a recipient of the 2012 Best Paper Award at the IEEE Computer Society Annual Symposium on VLSI; the 2010, 2012, 2013, 2015, and 2016 Top 5% Teaching Awards of the Schools of Engineering, Arizona State University; the 2009 ACM SIGDA Outstanding New Faculty Award; the 2009 Promotion and Tenure Faculty Exemplar, Arizona State University; the 2009 Distinguished Lecturer of the IEEE Circuits and Systems Society; the 2008 Chunhui Award for outstanding overseas Chinese scholars; the 2007 Best Paper Award at the International Symposium on Low Power Electronics and Design; the 2006 NSF CAREER Award; the 2006 and 2007 IBM Faculty Awards; the 2004 Best Paper Award at the International Symposium on Quality Electronic Design; and the 2000 Beatrice Winner Award at the International Solid-State Circuits Conference. He was an Associate Editor of the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. He has served on the technical program committees of many conferences.

Sarma Vrudhula (M'85–SM'02–F'16) received the B.Math. degree from the University of Waterloo, Waterloo, ON, Canada, and the M.S.E.E. and Ph.D. degrees in electrical and computer engineering from the University of Southern California, Los Angeles, CA, USA.
He was a Professor at the ECE Department, University of Arizona, Tucson, AZ, USA, and was on the faculty of the EE-Systems Department at the University of Southern California. He was also the Founding Director of the NSF Center for Low Power Electronics at the University of Arizona. He is currently a Professor of Computer Science and Engineering with Arizona State University, Tempe, AZ, USA, and the Director of the NSF I/UCRC Center for Embedded Systems. His current research interests include design automation and computer-aided design for digital integrated circuits and systems; low-power circuit design; energy management of circuits and systems; energy optimization of battery-powered computing systems, including smartphones, wireless sensor networks, and Internet of Things systems that rely on energy harvesting; system-level dynamic power and thermal management of multicore processors and systems-on-chip; statistical methods for the analysis of process variations; statistical optimization of performance, power, and leakage; new circuit architectures of threshold logic circuits for the design of application-specific integrated circuits and field-programmable gate arrays; nonconventional methods for implementing logic, including technology mapping with threshold logic circuits; the implementation of threshold logic using resistive memory devices; and the design and optimization of nonvolatile logic.

Jae-sun Seo (S'04–M'10–SM'17) received the B.S. degree in electrical engineering from Seoul National University, Seoul, South Korea, in 2001 and the M.S. and Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2006 and 2010, respectively.
From 2010 to 2013, he was with the IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, where he focused on cognitive computing chips under the DARPA SyNAPSE Project and on energy-efficient integrated circuits for high-performance processors. In 2014, he joined the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA, as an Assistant Professor. In 2015, he was with the Intel Circuits Research Laboratory as a Visiting Faculty. His current research interests include efficient hardware design of machine learning and neuromorphic algorithms and integrated power management.
Dr. Seo was a recipient of the Samsung Scholarship from 2004 to 2009, the IBM Outstanding Technical Achievement Award in 2012, and the NSF CAREER Award in 2017.
