Article
A Uniform Architecture Design for Accelerating 2D
and 3D CNNs on FPGAs
Zhiqiang Liu 1, *, Paul Chow 2 , Jinwei Xu 1 , Jingfei Jiang 1 , Yong Dou 1 and Jie Zhou 1
1 National Laboratory for Parallel and Distributed Processing, National University of Defense Technology,
Changsha 410073, China; [email protected] (J.X.); [email protected] (J.J.);
[email protected] (Y.D.); [email protected] (J.Z.)
2 Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada;
[email protected]
* Correspondence: [email protected]
Received: 3 December 2018; Accepted: 3 January 2019; Published: 7 January 2019
Abstract: Three-dimensional convolutional neural networks (3D CNNs) have gained popularity in
many complicated computer vision applications. Many customized accelerators based on FPGAs are
proposed for 2D CNNs, while very few are for 3D CNNs. 3D CNNs are far more computationally
intensive, and the design space for 3D CNN acceleration is further expanded because one more
dimension is introduced, making it a big challenge to accelerate 3D CNNs on FPGAs. Motivated by
the finding that the computation patterns of 2D and 3D CNNs are very similar, we propose a uniform
architecture design for accelerating both 2D and 3D CNNs in this paper. The uniform architecture
is based on the idea of mapping convolutions to matrix multiplications. A customized mapping
module is developed to generate the feature matrix tilings with no need to store the entire enlarged
feature matrix on-chip or off-chip, a splitting strategy is adopted to reconstruct a convolutional layer
to adapt to the on-chip memory capacity, and a 2D multiply-and-accumulate (MAC) array is adopted
to compute matrix multiplications efficiently. For demonstration, we implement an accelerator
prototype with a high-level synthesis (HLS) methodology on a Xilinx VC709 board and test the
accelerator on three typical CNN models: AlexNet, VGG16, and C3D. Experimental results show
that the accelerator achieves state-of-the-art throughput performance on both 2D and 3D CNNs, with
much better energy efficiency than the CPU and GPU.
Keywords: 2D CNN; 3D CNN; accelerator; uniform architecture; FPGA; HLS; matrix multiplication;
2D MAC array
1. Introduction
In recent years, convolutional neural networks (CNNs) have gained great success in various
computer vision applications such as image classification [1], object detection [2], and face
recognition [3]. CNNs have been primarily applied on 2D images to automatically extract spatial
features and have significantly enhanced the image classification accuracy. To effectively incorporate
the motion information in video analysis, 3D CNNs with spatiotemporal convolutional kernels are
proposed. Owing to the ability to capture both spatial and temporal features, 3D CNNs have been
proved to be very effective in many video-based applications including object recognition [4], hand
gesture recognition [5], and human action recognition [6].
CNNs require vast amounts of memory as there are millions of parameters in a typical CNN
model. Meanwhile, CNNs are computationally intensive with over billions of operations for the
inference of one input. For example, VGG16 [7], a real-life 2D CNN model for image classification with
16 layers, takes around 31 GOPs for the inference of one image. C3D [6], a real-life 3D CNN model for
human action recognition with only 11 layers, takes more than 77 GOPs for the inference of a video
volume. As a result, CNN applications are mainly run on clusters of server CPUs and GPUs. What is
more, with the availability of compatible deep learning frameworks, including Caffe [8], Theano [9],
and TensorFlow [10], training and testing CNN models become much easier and more efficient on
these platforms. While CPU and GPU clusters are the dominant platforms in CNN applications,
customized accelerators with better energy efficiency and less power dissipation are still required. For
example, in the case of embedded systems with limited power, such as self-driving cars and robots,
higher energy efficiency is critical to broaden the use of CNNs.
Owing to the advantages of high performance, energy efficiency, and flexibility, FPGAs have
attracted attention to be explored as CNN acceleration platforms. Moreover, high-level synthesis
(HLS) tools from FPGA vendors, such as Xilinx Vivado HLS and Intel FPGA SDK for OpenCL, reduce
the programming difficulty and shorten the development time significantly, making FPGA-based
solutions more popular. As reported in the recent surveys [11,12], many FPGA-based CNN accelerators
have been proposed for 2D CNNs and many tool-flows for mapping 2D CNNs on FPGAs have been
released. However, there are very few studies on accelerating 3D CNNs on FPGAs. 3D CNNs
are far more computationally intensive than 2D CNNs, and generate far more intermediate results
during execution as the input is a video volume instead of a single image, causing greater memory
capacity and bandwidth demands. In addition, the design space for 3D CNN acceleration is further
expanded since the temporal dimension is introduced, making it even more difficult to determine the
optimal solution. Therefore, current accelerator designs for 2D CNNs are not fit for accelerating
3D CNNs directly. For example, the designs in [13,14] adopt customized computation engines to
compute 2D convolutions. As there is one more dimension in 3D convolutions, new computation
engines are required with this approach. The design in [15] computes 2D convolutions with the Fast
Fourier Transform (FFT) algorithm. This approach has been proved effective only for large convolutional
kernels such as 11 × 11 or 7 × 7 and is less efficient for small convolutional kernels such as 3 × 3 or
1 × 1. Some designs accelerate 2D convolutions by reducing the computational requirements with the
Winograd algorithm [16], and the design in [17] even extends the Winograd algorithm to adapt to 3D
convolutions. However, the Winograd algorithm is very sensitive to the size of the convolutional kernels:
kernels of different sizes require different transformation matrices. Hence, the Winograd algorithm
is well suited to CNN models with uniform-sized convolutional kernels such as VGG, but not to CNN
models with multi-sized convolutional kernels such as AlexNet.
Another approach is mapping convolutions to matrix multiplication operations, which is typically
adopted in CPU and GPU implementations. Refs. [18,19] adopt this approach in their accelerator
designs for 2D CNNs and implement accelerators on FPGAs using the OpenCL framework. A main
concern with this approach is that it introduces a high degree of data replication in the input features,
which can lead to either inefficient storage or complex memory access patterns. In particular, the
weight matrix and the feature matrix are both enlarged by a factor of the kernel temporal depth in 3D
convolutions, which further increases the memory requirement.
We analytically find that the computation patterns in 2D and 3D CNNs are very similar. Motivated
by this finding, we attempt to design a uniform accelerator architecture for both 2D and 3D CNNs.
In the case of FPGA-based clouds, a uniform architecture allows switching of acceleration services
without reprogramming the FPGAs. For ASICs, which are not programmable, a uniform architecture
expands the applicability of the ASIC. The uniform architecture design is based on the idea of mapping
convolutions to matrix multiplication operations. The first challenge comes from the data replications
when mapping the input features to the feature matrix. Storing the entire feature matrix off-chip
multiplies the memory access overhead, while storing it on-chip costs a large amount of memory
space. We propose an efficient matrix mapping module
that avoids data replications by reusing the overlapped data during the sliding of convolutional
windows. In addition, the mapping module generates only a tiling (several columns) of the feature
matrix on-the-fly instead of generating the entire one before matrix multiplication, which saves on-chip
memory consumption. The second challenge is that the weight matrix and feature matrix are enlarged
by a factor of the kernel temporal depth in 3D convolutions compared to 2D CNNs. Accordingly, it
lifts the memory consumption by a factor of the kernel temporal depth when storing the weight matrix
and feature matrix on-chip. To guarantee the uniform architecture can be applied to large CNN models
and be deployed on platforms with limited on-chip memory capacity, we adopt an effective splitting
strategy. A convolutional layer with a large number of input channels will be split into multiple
convolutional layers with a smaller number of input channels. The third challenge is how to compute
matrix multiplications efficiently on FPGAs. Different from the OpenCL-based computation framework
in [18,19], we adopt a 2D MAC array for matrix multiplications. The 2D MAC array is scalable and the
size is mainly determined according to the hardware resources, memory bandwidth and the size of
feature maps.
To summarize, our key contributions are as follows:
• We propose a uniform accelerator architecture design supporting both 2D and 3D CNNs, based
on the idea of mapping convolutions to matrix multiplication operations. Special efforts are made
on memory optimizations and computations to enhance throughput performance;
• We analytically model the resource utilization and throughput performance of our architecture,
which helps to configure an accelerator on a specific platform within certain constraints including
hardware performance, memory bandwidth and clock frequency;
• We demonstrate the architecture design by implementing an accelerator on the Xilinx VC709
board with the High-level synthesis (HLS) methodology. Three typical CNN models including
AlexNet, VGG16, and C3D, are tested on the accelerator. Experimental results show that the
accelerator achieves over 850 GOP/s for convolutional layers and nearly 700 GOP/s overall on
VGG16 and C3D, with much better energy efficiency than the CPU and GPU.
The rest of the paper is organized as follows: Section 2 briefly introduces the basic background
of CNNs and the design directions of the accelerator architecture; Section 3 presents the architecture
design and the main components; Section 4 provides the implementation and optimization details;
Section 5 presents the accelerator modeling; Section 6 reports the experimental results; and finally,
Section 7 concludes the paper.
∑_{ch=0}^{c−1} ∑_{rr=0}^{k−1} ∑_{cc=0}^{k−1} W[mm][ch][rr][cc] ∗ X[ch][hh+rr][ww+cc]   (1)
[Figure: illustration of convolutions — (a) 2D convolution on an image; (b) 2D convolution on multiple channels; (c) 3D convolution on a volume; (d) 3D convolution on multiple channels.]
∑_{ch=0}^{c−1} ∑_{dd=0}^{d−1} ∑_{rr=0}^{k−1} ∑_{cc=0}^{k−1} W[mm][ch][dd][rr][cc] ∗ X[ch][ll+dd][hh+rr][ww+cc]   (2)
Compared to the convolutional layers in 2D CNNs, there is an accumulation along the temporal
dimension, as shown in Equation (2). By switching the order of the two outer accumulations, we
get Equation (3). We can find that the inner three accumulations in Equation (3) are very similar to
Equation (1) since the loop variable along the temporal dimension dd is fixed.
∑_{dd=0}^{d−1} ∑_{ch=0}^{c−1} ∑_{rr=0}^{k−1} ∑_{cc=0}^{k−1} W[mm][ch][dd][rr][cc] ∗ X[ch][ll+dd][hh+rr][ww+cc]   (3)
We can further combine the outer two loops and hence get Equation (4), which is almost the same
as Equation (1) except that the number of input channels is enlarged by a factor of d. That is to say,
3D convolutions can be computed in the same way as 2D convolutions.
∑_{ch=0}^{d∗c−1} ∑_{rr=0}^{k−1} ∑_{cc=0}^{k−1} W[mm][ch%c][ch/c][rr][cc] ∗ X[ch%c][ll+ch/c][hh+rr][ww+cc]   (4)
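To make the loop merging of Equation (4) concrete, here is a plain-C sketch of our own (the array layouts and argument names are illustrative assumptions, not code from the accelerator) that computes one output pixel of a 3D convolution by treating the d temporal offsets of each of the c channels as d × c ordinary input channels:

    /* One output pixel of a 3D convolution, written as a 2D-style convolution
     * over d*c "virtual" input channels (cf. Equation (4)).
     * W: weights laid out as [m][c][d][k][k]; X: input laid out as [c][L][H][W_in].
     * All shapes are illustrative assumptions. */
    float conv3d_pixel(const float *W, const float *X,
                       int m_idx, int l, int h, int w,      /* output position   */
                       int c, int d, int k,                 /* kernel dimensions */
                       int L, int H, int W_in)              /* input dimensions  */
    {
        float acc = 0.0f;
        for (int ch = 0; ch < d * c; ch++) {          /* merged channel loop     */
            int real_ch = ch % c;                     /* original input channel  */
            int dd      = ch / c;                     /* temporal kernel offset  */
            for (int rr = 0; rr < k; rr++)
                for (int cc = 0; cc < k; cc++) {
                    float wgt = W[(((m_idx * c + real_ch) * d + dd) * k + rr) * k + cc];
                    float in  = X[((real_ch * L + (l + dd)) * H + (h + rr)) * W_in + (w + cc)];
                    acc += wgt * in;
                }
        }
        return acc;
    }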
[Figure: mapping a convolution to matrix multiplication — the kernels (64 kernels of size 3 × 3 × 3 in this example) are flattened into the weight matrix, the input features are flattened and rearranged into the feature matrix, and their product forms the output matrix (64 output channels of size 32 × 32).]
Similarly, 3D convolutions can also be mapped as matrix multiplications with the same method.
Compared to 2D convolutions, the convolution window slides across the input features not only along
the row and column directions, but also the temporal direction. Consequently, each pixel is covered by
the convolution window d × k × k times and hence replicated d × k × k times. The number of pixels in
the input feature is enlarged by a factor of d × k × k.
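To illustrate the replication cost just described, the following sketch (ours; it assumes unit stride and no padding) builds the full feature matrix for a 3D convolution the way a typical CPU or GPU im2col routine would, copying every input pixel d × k × k times:

    /* Naive im2col for 3D convolution: every input pixel is replicated
     * d*k*k times in the feature matrix (assumes stride 1, no padding).
     * X: input of shape [c][L][H][W_in].
     * The feature matrix has (c*d*k*k) rows and (out_l*out_h*out_w) columns. */
    void im2col_3d(const float *X, float *feat,
                   int c, int d, int k, int L, int H, int W_in)
    {
        int out_l = L - d + 1, out_h = H - k + 1, out_w = W_in - k + 1;
        int cols  = out_l * out_h * out_w;
        for (int ch = 0; ch < c; ch++)
          for (int dd = 0; dd < d; dd++)
            for (int rr = 0; rr < k; rr++)
              for (int cc = 0; cc < k; cc++) {
                int row = ((ch * d + dd) * k + rr) * k + cc;   /* feature-matrix row */
                for (int l = 0; l < out_l; l++)
                  for (int h = 0; h < out_h; h++)
                    for (int w = 0; w < out_w; w++) {
                      int col = (l * out_h + h) * out_w + w;   /* feature-matrix column */
                      feat[row * cols + col] =
                          X[((ch * L + (l + dd)) * H + (h + rr)) * W_in + (w + cc)];
                    }
              }
    }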
Here we can find that the approach of mapping convolutions as matrix multiplication operations
introduces a high degree of data replication. In CPU and GPU implementations, the entire feature
matrix is generated before computing the matrix multiplication. This is not a big problem for CPUs
and GPUs, as they have abundant memory space and bandwidth. However, for FPGAs with limited
on-chip memory capacity and off-chip memory bandwidth, storing the entire feature matrix can be
a critical limitation. We will show how we optimize this approach in FPGA implementations with a
customized matrix mapping module in the next section.
[Figure: working process of the matrix mapping module — (a)–(d) the convolution window of size k slides with the given stride across the rows of input channels c0, c1, ..., and the overlapped rows are reused to generate the columns of the feature matrix.]
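A minimal software sketch of the tiling idea (our simplification of the mapping module, shown for a 2D layer with unit stride and no padding, and not the exact hardware implementation): only the single feature-matrix column needed for the current output position is generated, so the enlarged matrix is never stored.

    /* Generate one feature-matrix column (length c*k*k) on the fly for the
     * output position (h, w) of a 2D convolution; X has shape [c][H][W_in].
     * Assumes stride 1 and no padding. Nothing else of the feature matrix
     * is ever materialized. */
    void map_one_column(const float *X, float *column,
                        int c, int k, int H, int W_in, int h, int w)
    {
        int idx = 0;
        for (int ch = 0; ch < c; ch++)
            for (int rr = 0; rr < k; rr++)
                for (int cc = 0; cc < k; cc++)
                    column[idx++] = X[(ch * H + (h + rr)) * W_in + (w + cc)];
    }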
C[i][j] = ∑_{t=0}^{N−1} A[i][t] × B[t][j]   (0 ≤ i < M; 0 ≤ j < L)   (5)
We adopt a 2D MAC array to compute matrix multiplication in the most straightforward way.
A MAC unit is composed of a multiplier to calculate the products, an adder to accumulate the products,
and a register to keep the partial sum. The MAC unit located at (i, j) receives operands from the
i-th row of matrix A and j-th column of matrix B and generates the pixel C [i ][ j]. In the case of CNN
accelerations, A is the weight matrix, B is the feature matrix, and C is the output matrix. Suppose
there are mr rows and mc columns in the MAC array, then a total of mr × mc pixels can be generated at
once. We adopt a simple matrix partitioning strategy to compute the whole output matrix with the 2D
MAC array. As shown in Figure 4, the weight matrix is partitioned into ⌈m/mr⌉ matrix blocks along
the row dimension and the feature matrix is partitioned into ⌈(h × w)/mc⌉ matrix blocks along the
column dimension.
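The same partitioning can be written down in software. The sketch below (our illustration; it assumes the matrix dimensions are padded to multiples of mr and mc) computes the output matrix one mr × mc block at a time, which is the granularity at which the 2D MAC array produces results:

    /* Blocked matrix multiply: each (bi, bj) iteration produces the mr x mc
     * output block that the 2D MAC array would generate in one pass.
     * A is M x N (weight matrix), B is N x P (feature matrix), C is M x P.
     * For simplicity, M is assumed to be a multiple of mr and P of mc. */
    void blocked_matmul(const float *A, const float *B, float *C,
                        int M, int N, int P, int mr, int mc)
    {
        for (int bi = 0; bi < M / mr; bi++)
            for (int bj = 0; bj < P / mc; bj++)
                for (int i = 0; i < mr; i++)
                    for (int j = 0; j < mc; j++) {
                        float acc = 0.0f;                    /* partial-sum register */
                        for (int t = 0; t < N; t++)
                            acc += A[(bi * mr + i) * N + t] * B[t * P + (bj * mc + j)];
                        C[(bi * mr + i) * P + (bj * mc + j)] = acc;
                    }
    }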
The 2D MAC array exploits the parallelism of convolutions in two aspects. The output channel
loop is unrolled by a factor of mr and hence mr channels of the output feature can be calculated
simultaneously. The column loop of each channel is unrolled by a factor of mc and thus mc pixels
in the same row of a channel can be computed in parallel. The 2D MAC array is scalable and the
size is determined according to the hardware resources, memory bandwidth, feature size and the
number of input channels. Hardware resources, especially the DSP slices on an FPGA chip, determine
the maximum number of MAC units that can be assigned to the 2D MAC array. The width of the 2D
MAC array is mainly restricted by the memory bandwidth. We can find an optimal value for the width
so that the 2D MAC array is well matched with the memory bandwidth. The 2D MAC array will be
under-utilized when the real width is greater than the optimal value, and the memory bandwidth is
not fully exploited when the real width is less than the optimal value. Also, the width of the 2D MAC
array is closely related to the feature size of a CNN model. For example, if the feature size is 56 × 56,
it is better to deploy 28 columns of MAC units instead of 32, which achieves the same throughput
performance with fewer MAC units. The height of the 2D MAC array is closely related to the number
of output channels in a CNN model. A common divisor of all the output channel numbers in all
convolutional layers is most preferred. A power of two may also be a good choice for the height of the
2D MAC array.
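As a worked example of this trade-off (our own arithmetic for the 56 × 56 case mentioned above): with mc = 32, each 56-pixel output row is computed in ⌈56/32⌉ = 2 passes, so the array provides 64 column slots for only 56 useful pixels (about 87.5% utilization); with mc = 28 the row still takes 2 passes but every slot is useful (100% utilization), so 28 columns of MAC units deliver the same per-row latency with four fewer columns of hardware.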
Figure 4. Partitioning weight and feature matrices into blocks to adapt to the 2D multiply-and-accumulate (MAC) array.
[Figure: block diagram showing the CPU and instruction bus, controller, DDR3/4 memory and data buses, weight buffer and kernel FIFO, mapping module and feature FIFO, the 2D MAC array, the ACC/BN/NL units, and the output buffer.]
Figure 5. Our proposed uniform accelerator architecture for 2D and 3D convolutional neural networks (CNNs).
During the process of matrix multiplication, mr rows of kernel data and mc columns of feature
data are required at every cycle. To offer enough data ports to the 2D MAC array, we assign mr Block
RAMs in the weight buffer and mc + 2 × pad Block RAMs in the feature buffer. The additional 2 × pad
Block RAMs are for the padding data when the convolution window slides to the edges. As the 2D
MAC array calculates mr × mc pixels of the output feature simultaneously, we assign mc Block RAMs
in the output buffer. Once the mr × mc results are generated, they will be cached in the output buffer
in mr cycles, which is much shorter than the matrix multiplication latency. Meanwhile, we can adopt
the burst access pattern when storing the output feature back to the off-chip memory, which improves
the memory bandwidth utilization.
To save on-chip memory consumption, the weight buffer stores only mr groups of kernels on-chip,
the feature buffer stores k + stride rows for each channel of the input feature, and the output buffer
stores only mr rows (one row for each channel) of the output feature. To achieve pipelining between
memory access and computation, the input buffer pre-caches stride rows for each channel of the input
feature during the matrix multiplication and the ping-pong strategy is used on the output buffer.
Therefore, the memory access time is overlapped with the computation time to the greatest extent.
The 128-bit instruction format, from the most significant bits to the least, is: c [127:112], m [111:96], Ix [95:80], Ox [79:64], tm_max [63:56], tc_max [55:48], k [47:40], pad [39:32], stride [31:24], bn_opt [23:16], nl_opt [15:8], opcode [7:0].
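As a hedged illustration of how a host driver could unpack such a 128-bit instruction word (the field order follows the layout above, but the struct, the helper name, and the split into two 64-bit halves are our own assumptions, not the controller's actual interface):

    #include <stdint.h>

    /* Decoded form of the 128-bit instruction shown above.
     * Field widths follow the bit layout: c/m/Ix/Ox are 16 bits,
     * the remaining fields are 8 bits each. */
    typedef struct {
        uint16_t c, m, Ix, Ox;
        uint8_t  tm_max, tc_max, k, pad, stride, bn_opt, nl_opt, opcode;
    } conv_inst_t;

    /* 'hi' holds bits 127:64 and 'lo' holds bits 63:0 of the instruction. */
    static conv_inst_t decode_inst(uint64_t hi, uint64_t lo)
    {
        conv_inst_t inst;
        inst.c      = (uint16_t)(hi >> 48);   /* bits 127:112 */
        inst.m      = (uint16_t)(hi >> 32);   /* bits 111:96  */
        inst.Ix     = (uint16_t)(hi >> 16);   /* bits  95:80  */
        inst.Ox     = (uint16_t)(hi >>  0);   /* bits  79:64  */
        inst.tm_max = (uint8_t)(lo >> 56);    /* bits  63:56  */
        inst.tc_max = (uint8_t)(lo >> 48);    /* bits  55:48  */
        inst.k      = (uint8_t)(lo >> 40);    /* bits  47:40  */
        inst.pad    = (uint8_t)(lo >> 32);    /* bits  39:32  */
        inst.stride = (uint8_t)(lo >> 24);    /* bits  31:24  */
        inst.bn_opt = (uint8_t)(lo >> 16);    /* bits  23:16  */
        inst.nl_opt = (uint8_t)(lo >>  8);    /* bits  15:8   */
        inst.opcode = (uint8_t)(lo >>  0);    /* bits   7:0   */
        return inst;
    }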
4. Accelerator Implementation
As a case study, we implement an accelerator prototype based on the uniform architecture with
the HLS methodology. The pseudo-code in Figure 6 (left) demonstrates the working process of a
convolutional layer. The weight buffer, feature buffer, and output buffer are declared respectively with
specified data types. We adopt fixed-point arithmetic logic units in our implementation. As shown
in Figure 6 (right), each kernel weight is represented by eight bits including one sign bit and seven fraction
bits, each pixel in input and output features is represented by 16 bits including one sign bit, seven
integer bits, and eight fraction bits, and each intermediate result is represented by 32 bits including
one sign bit, 16 integer bits, and 15 fraction bits. The intermediate results are represented by 32 bits
to preserve precision during accumulations and will be truncated to 16 bits before writing back to
memory. The weight buffer is completely partitioned in the row dimension with the array_partition
pragma. The feature buffer and output buffer are completely partitioned in the column dimension.
The core functions include the load-weight function (line 10), the load-feature function (line 14),
the matrix-mapping function (line 17), the matrix-multiply function (line 18), and the store-feature
function (line 20). As the function name reflects, the load-weight function loads weights from the
off-chip memory to the weight buffer; the load-feature function loads the input feature from the
off-chip memory to the input buffer; the store-feature function stores the output feature from the
output buffer back to the off-chip memory; the matrix-mapping function corresponds to the matrix
mapping module; and the matrix-multiply function corresponds to the 2D MAC array.
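For reference, the following plain-C sketch (our approximation of the ap_fixed<8,1>, ap_fixed<16,8>, and ap_fixed<32,17> types used in Figure 6; saturation handling is omitted) mimics the quantization scheme with integer arithmetic:

    #include <stdint.h>

    typedef int8_t  ker_t;   /* 1 sign + 7 fraction bits (scale 2^-7)               */
    typedef int16_t img_t;   /* 1 sign + 7 integer + 8 fraction bits (scale 2^-8)   */
    typedef int32_t mid_t;   /* 1 sign + 16 integer + 15 fraction bits (scale 2^-15) */

    /* One multiply-accumulate in the accelerator's number formats:
     * (2^-7) * (2^-8) = 2^-15, so the product already has the
     * intermediate scale and can be accumulated directly. */
    static mid_t mac_fixed(mid_t acc, ker_t w, img_t x)
    {
        return acc + (mid_t)w * (mid_t)x;
    }

    /* Truncate a 32-bit intermediate (scale 2^-15) back to the 16-bit
     * feature format (scale 2^-8) before writing it to memory. */
    static img_t truncate_to_img(mid_t acc)
    {
        return (img_t)(acc >> 7);    /* drop 7 of the 15 fraction bits */
    }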
(left)
 1  KerType kbuf[mr][kdepth];
 2  #pragma HLS array_partition
 3  ImgType ibuf[idepth][mc+2*pad];
 4  #pragma HLS array_partition
 5  MidType obuf[odepth][mc];
 6  #pragma HLS array_partition
 7
 8  for (tm = 0; tm < tm_max; tm++){
 9    #pragma HLS dataflow
10    load_weight();
11    for (tl = 0; tl < l; tl++){
12      for (th = 0; th < h; th++){
13        #pragma HLS dataflow
14        load_feature();
15        for (tc = 0; tc < tc_max; tc++){
16          #pragma HLS dataflow
17          matrix_mapping();
18          matrix_multiply();
19        }
20        store_feature();
21      }
22    }
23  }

(right)
 1  typedef ap_fixed<8,1>   KerType;
 2  typedef ap_fixed<16,8>  ImgType;
 3  typedef ap_fixed<32,17> MidType;
 4
 5  MidType C[mr][mc];
 6  #pragma HLS array_partition
 7
 8  for (t = 0; t < c*k*k; t++){
 9    #pragma HLS pipeline
10    for (i = 0; i < mr; i++){
11      #pragma HLS unroll
12      for (j = 0; j < mc; j++){
13        #pragma HLS unroll
14        last = (t==0) ? 0 : C[i][j];
15        temp = A[i][t] * B[t][j];
16        C[i][j] = last + temp;
17      }
18    }
19  }

Figure 6. High-level synthesis (HLS) pseudocode demonstrating the working process of a convolutional layer (left) and how to compute the matrix multiplication with the 2D MAC array (right).
4.1. Computation Optimization with HLS Pragmas
The dataflow optimization is adopted to improve the throughput performance. The dataflow
pragma enables task-level pipelining, allowing functions and loops to overlap in their operation
and increasing the concurrency of the RTL implementation. As shown in Figure 6 (left), the dataflow
pragma is specified within the tc-loop, th-loop and tm-loop respectively. Figure 7a illustrates the
working process of the tc-loop without dataflow pipelining, in which the matrix-mapping function
and the matrix-multiply function execute one after the other; with the dataflow pragma the two
functions overlap. The dataflow optimization also works on the th-loop: the load-feature function
and the store-feature function are fully overlapped by the tc-loop, and the th-loop is pipelined with
a pipeline interval equal to the maximum latency of the three parts. Figure 7d illustrates the tm-loop
with dataflow pipelining. Notice that the matrix-multiply function depends on the weight matrix,
and hence the th-loop has to wait until the load-weight function is done. The latency of the th-loop
is typically much longer than the latency of the load-weight function. For example, in the second
convolutional layer of the C3D model, the th-loop takes 37,355 cycles while the load-weight function
takes only 578 cycles. In this case, a ping-pong strategy on the weight buffer could reduce the execution
time by at most 1.5%. Therefore, the ping-pong strategy is not adopted on the weight buffer, to save
on-chip memory consumption; that is why the load-weight function is not fully overlapped with
the th-loop.

[Figure 7: working process of the tc-loop, th-loop and tm-loop with and without dataflow pipelining, showing how the load-weight, load-feature, matrix-mapping, matrix-multiply and store-feature functions overlap.]
The HLS pragmas unroll and pipeline are used inside these functions to reduce latency and enhance
throughput performance. Figure 6 (right) shows the HLS pseudocode of the matrix-multiply function.
The unroll pragma enables some or all loop iterations to occur in parallel by creating multiple copies
of the loop body in the RTL design. In the matrix-multiply function, mr × mc multipliers and adders
are created shaping the 2D MAC array and hence mr × mc multiply-accumulations are computed
concurrently. The pipeline pragma helps to reduce the initiation interval for a loop by allowing the
concurrent execution of operations. The initiation interval in the matrix-multiply function is one
cycle after optimization. To summarize, the total execution latency is greatly reduced and the system
throughput is enhanced significantly owing to the HLS pragmas.
As a result, memory access is no longer the bottleneck. In our implementation, the load-feature and store-feature functions are
fully overlapped by the matrix-multiply function. That is to say, the 2D MAC array is only idle
during setting up and flushing the pipeline, and will be fully utilized once the pipeline is ready in
convolutional layers.
5. Accelerator Modeling
The weight buffer stores the kernel weights in mr Block RAMs. The depth of each Block RAM, kdepth,
is given by Equation (6) if the on-chip memory is abundant, where layer_num indicates the total number
of layers in a CNN model and c_i and k_i are the number of input channels and the kernel size of the
i-th convolutional layer:

kdepth = max(c_i × k_i × k_i)   (i = 0, 1, 2, ..., layer_num − 1)   (6)

The feature buffer stores input features in mc + 2 × pad Block RAMs. The additional 2 × pad
Block RAMs are introduced due to the padding required at the edges. Each pixel in the input features
is represented as two bytes and hence the width of each Block RAM is 16 bits. Assuming the depth of
each Block RAM is idepth, the total on-chip memory consumed by the input buffer can be calculated
as (mc + 2 × pad) × idepth × 2 bytes. The depth of each Block RAM is given by Equation (7) if the
on-chip memory is abundant.
In the case when the on-chip memory is limited, kdepth and idepth are specified by users under
the memory constraint. For some convolutional layers with a large number of input channels, kdepth
may be less than the width of the weight matrix or idepth may be less than the height of the feature
matrix. The splitting strategy will split the convolutional layer into multiple convolutional layers with
a smaller number of input channels to fit the weight buffer and feature buffer.
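A minimal software analogue of the splitting strategy (our own sketch, shown for one output pixel of a 2D layer with unit stride and no padding; c_max stands for the largest channel count that still fits the on-chip buffers and is an assumed parameter):

    /* The c input channels are processed in slices of at most c_max channels,
     * as if the layer were ceil(c/c_max) thinner layers whose partial sums
     * are accumulated. W holds one kernel group of shape [c][k][k];
     * X has shape [c][H][W_in]. */
    float conv_pixel_split(const float *X, const float *W,
                           int c, int c_max, int k, int H, int W_in, int h, int w)
    {
        float out = 0.0f;
        for (int ch0 = 0; ch0 < c; ch0 += c_max) {       /* one "thinner layer" per slice */
            int len = (c - ch0 < c_max) ? (c - ch0) : c_max;
            float partial = 0.0f;
            for (int ch = ch0; ch < ch0 + len; ch++)
                for (int rr = 0; rr < k; rr++)
                    for (int cc = 0; cc < k; cc++)
                        partial += W[(ch * k + rr) * k + cc] *
                                   X[(ch * H + (h + rr)) * W_in + (w + cc)];
            out += partial;                              /* accumulate partial results */
        }
        return out;
    }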
The output buffer stores output features in mc Block RAMs. The ping-pong strategy is adopted
for the output buffer. Each pixel in the output features is represented as two bytes and hence the width
of each Block RAM is 16 bits. Assuming the depth of each Block RAM is odepth, the total on-chip memory
consumed by the output buffer can be calculated by mc × odepth × 4 bytes. The depth of each Block
RAM is given by Equation (8) if the on-chip memory is abundant.
In the case when the on-chip memory is limited or the feature width is too large, odepth can be
specified by users under the memory constraint. The 2D MAC array may be under-utilized in some
convolutional layers with a large feature width; the real unrolling factor of the output channel loop is
then less than mr and is given by Equation (9).
The total number of operations in a convolutional layer is:
m × l × h × w × c × d × k × k × 2   (10)
Owing to the dataflow optimization, the functions are working in a pipelined way and some
functions are even completely overlapped by others, as illustrated in Figure 7. The total execution
cycles for a convolutional layer can be calculated by Equation (11), where ccl_ldw, ccl_ldf, and ccl_stf
indicate the execution cycles of the load-weight, load-feature, and store-feature functions respectively,
and II_th indicates the pipeline interval of the th-loop in Figure 6:

⌈m/mr⌉ × (ccl_ldw + ccl_ldf + l × h × II_th) + ccl_stf   (11)
The matrix-mapping function takes c × k × k cycles to generate a feature matrix block and the
matrix-multiply function also takes c × k × k cycles to complete a matrix multiplication; together they
determine the number of cycles ccl_tc taken by the tc-loop in Figure 6.
The execution times of the other functions are closely related to the real memory access bandwidth.
We derive simplified models for the memory-related functions under sufficient memory access bandwidth:
the load-weight function takes c × k × k cycles, the load-feature function takes c × stride × ⌈w/mc⌉
cycles, and the store-feature function takes mr × ⌈w/mc⌉ cycles. The pipeline interval of the th-loop is
the maximum of ccl_ldf, ccl_stf, and ccl_tc.
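The model above can be folded into a small estimator. The sketch below is our own reading of Equations (10) and (11) together with the simplified cycle counts stated in this section; the tc-loop cycle count is approximated as c × d × k × k, fully-connected layers are ignored, and sufficient memory bandwidth is assumed:

    #include <math.h>

    /* Estimated throughput of one convolutional layer following Equations (10)
     * and (11).  For a 2D layer set d = 1; for a 3D layer the merged channel
     * count d*c is used, per Equation (4).  ccl_tc is approximated as cd*k*k,
     * which is our own simplification. */
    double layer_gops(int m, int l, int h, int w, int c, int d, int k,
                      int mr, int mc, int stride, double freq_mhz)
    {
        int    cd      = c * d;                                      /* merged channels   */
        double ops     = 2.0 * m * l * h * w * cd * k * k;           /* Equation (10)     */
        double ccl_tc  = (double)cd * k * k;                         /* tc-loop (approx.) */
        double ccl_ldw = (double)cd * k * k;                         /* load-weight       */
        double ccl_ldf = (double)cd * stride * ceil((double)w / mc); /* load-feature      */
        double ccl_stf = (double)mr * ceil((double)w / mc);          /* store-feature     */
        double ii_th   = fmax(ccl_tc, fmax(ccl_ldf, ccl_stf));       /* th-loop interval  */
        double cycles  = ceil((double)m / mr) * (ccl_ldw + ccl_ldf + l * h * ii_th)
                         + ccl_stf;                                  /* Equation (11)     */
        return ops / cycles * freq_mhz / 1000.0;                     /* GOP/s             */
    }

For instance, at the reported 120 MHz clock the peak of this model is 2 × mr × mc × 0.12 GOP/s, so the 860.2 GOP/s peak quoted in Section 6 would correspond to roughly mr × mc ≈ 3584 MAC units; the exact mr and mc are not stated in this section, so they should be treated as inputs.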
6. Evaluation
In this section, we first introduce the experimental setup and then detailed experimental results
are provided.
The CPU and GPU versions run with floating-point arithmetic. The dataset for testing AlexNet and VGG16 is ImageNet [20] and the
dataset for testing C3D is UCF101 [21]. We use a batch size of 16 for both versions during testing.
Figure 8 presents the evaluation results of each layer in the three CNN models. According to
the performance model, the peak throughput of our accelerator at 120 MHz is 860.2 GOP/s. The
conv2a layer achieves the highest throughput of 811.5 GOP/s in AlexNet, the conv1b layer achieves the
highest throughput of 856.1 GOP/s in VGG16, and the conv2a layer achieves the highest throughput
of 851.2 GOP/s in C3D. The first convolutional layer in all three CNN models has only three input
channels, which makes the 2D MAC array under-utilized. Since C3D has a kernel temporal depth of
three, the real number of input channels in its first convolutional layer is nine. That is why the conv1a layer
achieves a poor throughput in AlexNet and VGG16 while achieving a much higher throughput in C3D.
We can also find from Figure 8 that the throughput performance of each layer decreases with the
layer depth. The reason is that the feature size decreases owing to the pooling layers and hence the
th-loop iterates fewer times in later convolutional layers. As shown in Equation (11), ccl_ldw and ccl_ldf
account for a larger percentage of the total execution cycles when h decreases. In fully-connected
layers, the 2D MAC array is under-utilized because it is restricted by the memory bandwidth. Thus, the throughput
of fully-connected layers is far lower than that of convolutional layers, around 40 GOP/s. That is why
the average throughput of the convolutional layers is more than 407.2 GOP/s in AlexNet while the
overall throughput is only 231.6 GOP/s. By comparison, VGG16 and C3D have more convolutional
layers and hence the fully-connected layers have less effect on the overall throughput. The accelerator
achieves an overall throughput of 691.6 GOP/s on VGG16 and 667.7 GOP/s on C3D.
Table 4 lists the comparisons between the CPU, the GPU and our accelerator. Compared to the
CPU, our accelerator achieves a 7.5-fold improvement on VGG16 and 8.2-fold improvement on C3D in
terms of throughput and latency. The thermal design power (TDP) values of the CPU and the GPU
are 84 and 250 W respectively. According to the power report by the Vivado Design Suite, the total
on-chip power of our accelerator is only 15.8 W. We can see from Table 4 that our accelerator achieves
the best power efficiency among all the platforms. Compared to the CPU, our accelerator achieves a
39.8-fold improvement on VGG16 and 43.9-fold improvement on C3D in terms of power efficiency.
The Titan X Pascal GPU has a clear lead in terms of throughput performance: 52.4-fold
over the CPU and 6.9-fold over our accelerator on VGG16. By contrast, our accelerator achieves better power
efficiency than the GPU: 2.3-fold on VGG16 and 2.0-fold on C3D.
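As a rough cross-check of these ratios using only the figures above (our own arithmetic): our accelerator delivers about 691.6 GOP/s / 15.8 W ≈ 43.8 GOP/s/W on VGG16, the CPU about (691.6/7.5) GOP/s / 84 W ≈ 1.1 GOP/s/W, and the GPU about (6.9 × 691.6) GOP/s / 250 W ≈ 19.1 GOP/s/W, which reproduces the reported 39.8-fold and 2.3-fold efficiency gaps.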
Table 4. Evaluation results on the central processing unit (CPU), graphics processing unit (GPU), and
our accelerators.
Our accelerator achieves a competitive performance density among all the listed implementations. In terms of throughput, our accelerator achieves
state-of-the-art performance on both VGG and C3D. The implementation in [19] achieves the best
throughput performance, with a much higher clock frequency than the other implementations.
As shown in [17], the Winograd algorithm reduces the computation complexity significantly in
CNNs. However, there are still some limitations in terms of flexibility. The Winograd algorithm can
only be applied when the convolution stride is 1 and the transform matrices vary with the size of
convolution kernels. By comparison, our architecture with the matrix-multiply approach is more generic
and can adapt to varied strides and convolution kernels. On the other hand, we have to acknowledge
that the Winograd algorithm is a perfect fit for accelerating VGG and C3D, as they have a uniform
kernel size of 3 × 3 and a fixed stride of 1 in all convolutional layers. However, it is not suitable
for accelerating AlexNet, which has multiple kernel sizes (11 × 11, 5 × 5, and 3 × 3) and multiple
strides (4 and 1). It is not a problem for our architecture as we have demonstrated in the evaluation.
The architecture in [17] adopts a template-based approach that supports accelerating both 2D and 3D
CNNs. Different configurations (e.g., unrolling factors, Winograd algorithms) have to be customized
for different CNN models. That is why they implement two accelerators for accelerating VGG16 and
C3D respectively. By comparison, our accelerator can accelerate different CNN models with the same
configuration. AlexNet, VGG16, and C3D are run on the same accelerator in our evaluation, and
achieve very close throughput performance on convolutional layers.
                                     Ref. [18]          Ref. [22]        Ref. [19]        Ref. [23]      Ref. [17]          Ours
FPGA                                 Altera Stratix-V   Arria10 GX1150   Arria10 GX1150   Xilinx ZC706   Xilinx XC7VX690T   Xilinx XC7VX690T
CNN Model                            VGG                VGG              VGG              C3D            C3D                VGG / C3D
Precision                            fixed              fixed            fixed/float      fixed          fixed              fixed
Clock (MHz)                          120                150              385              172            150                120
DSPs                                 727 (37%)          3036 (100%)      2756 (91%)       810 (90%)      1536 (42%)         3595 (99%)
Throughput (GOP/s)                   118                645              1790             142            431                691.6 / 667.7
Performance Density (OP/DSP/cycle)   1.35               1.42             1.69             1.02           1.87               1.60 / 1.55
7. Conclusions
This paper summarizes our recent work on hardware accelerator design for CNNs. The proposed
uniform architecture ensures fast development of 2D and 3D CNN accelerators with state-of-the-art
performance in throughput, latency, and energy efficiency. Despite the loss in performance density
compared with customized accelerators using the Winograd algorithm, our architecture is more
generic, which supports accelerating different 2D and 3D CNN models without reconfiguring the
FPGA. This means that the architecture is also suitable for ASICs. Future work includes further
demonstrations on other CNN-based applications and ASIC implementations for computer vision
applications based on the FPGA prototype.
Author Contributions: Conceptualization, Z.L., J.J. and P.C.; methodology, Z.L., J.J. and Y.D.; software, Z.L., J.X.
and J.Z.; validation, Z.L., J.X. and J.Z.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L.
and P.C.; supervision, P.C. and Y.D.
Funding: This research was funded by the National Key Research and Development Program of China (Grant No.
2018YFB1003405), the National Science and Technology Major Project of the Ministry of Science and Technology of
China (Grant No. 2018ZX01028-101), and the National Natural Science Foundation of China (Grant No. 61802419,
61732018, 61303070, 61602496 and 61502507). This research was also supported by the Dusan and Anne Miklas and
Xilinx.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
2D Two Dimensional
3D Three Dimensional
CNN Convolutional Neural Network
FPGA Field Programmable Gate Array
MAC Multiply and Accumulate
HLS High Level Synthesis
CPU Central Processing Unit
GPU Graphics Processing Unit
ASIC Application Specific Integrated Circuit
GOPs Giga-Operations
GOP/s Giga-Operations Per Second
FFT Fast Fourier Transform
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural
networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA,
3–6 December 2012; pp. 1097–1105.
2. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal
networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada,
11–12 December 2015; pp. 91–99.
3. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA,
7–12 June 2015; pp. 815–823.
4. Maturana, D.; Scherer, S. VoxNet: A 3D Convolutional Neural Network for real-time object recognition.
In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Hamburg,
Germany, 28 September–2 October 2015; pp. 922–928.
5. Molchanov, P.; Gupta, S.; Kim, K.; Kautz, J. Hand gesture recognition with 3D convolutional neural networks.
In Proceedings of the Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June
2015; pp. 1–7.
6. Du, T.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D
Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision,
Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv
2014, arXiv:1409.1556.
8. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe:
Convolutional Architecture for Fast Feature Embedding. arXiv 2014, arXiv:1408.5093.
9. Team, T.D.; Alrfou, R.; Alain, G.; Almahairi, A.; Angermueller, C.; Bahdanau, D.; Ballas, N.; Bastien, F.;
Bayer, J.; Belikov, A. Theano: A Python framework for fast computation of mathematical expressions. arXiv
2017, arXiv:1605.02688.
10. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.;
Devin, M. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016,
arXiv:1603.04467.
11. Abdelouahab, K.; Pelcat, M.; Serot, J.; Berry, F. Accelerating CNN inference on FPGAs: A Survey. arXiv 2018,
arXiv:1806.01683.
12. Venieris, S.I.; Kouris, A.; Bouganis, C.S. Toolflows for Mapping Convolutional Neural Networks on FPGAs:
A Survey and Future Directions. ACM Comput. Surv. 2018, 51, 56. [CrossRef]
13. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper
with embedded fpga platform for convolutional neural network. In Proceedings of the ACM International
Symposium on FPGA, Monterey, CA, USA, 21–23 February 2016.
14. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing fpga-based accelerator design for deep
convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
15. Zhang, C.; Prasanna, V. Frequency Domain Acceleration of Convolutional Neural Networks on
CPU-FPGA Shared Memory System. In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 35–44.
16. Aydonat, U.; O’Connell, S.; Capalija, D.; Ling, A.C.; Chiu, G.R. An OpenCLTM Deep Learning Accelerator
on Arria 10. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 55–64.
17. Shen, J.; Huang, Y.; Wang, Z.; Qiao, Y.; Wen, M.; Zhang, C. Towards a Uniform Template-based Architecture
for Accelerating 2D and 3D CNNs on FPGA. In Proceedings of the ACM/SIGDA International Symposium,
Monterey, CA, USA, 25–26 February 2018; pp. 97–106.
18. Suda, N.; Chandra, V.; Dasika, G.; Mohanty, A.; Ma, Y.; Vrudhula, S.; Seo, J.S.; Cao, Y. Throughput-Optimized
OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the
2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA,
21–23 February 2016; pp. 16–25.
19. Zhang, J.; Li, J. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural
Network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 25–34.
20. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.;
Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
[CrossRef]
21. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes From Videos in the
Wild. arXiv 2012, arXiv:1212.0402.
22. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Optimizing Loop Operation and Dataflow in FPGA Acceleration of
Deep Convolutional Neural Networks. In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54.
23. Fan, H.; Niu, X.; Liu, Q.; Luk, W. F-C3D: FPGA-based 3-dimensional convolutional neural network.
In Proceedings of the International Conference on Field Programmable Logic and Applications, Ghent,
Belgium, 4–8 September 2017; pp. 1–4.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).