
2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines

Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs

Liqiang Lu∗1,3, Yun Liang†1, Qingcheng Xiao1, Shengen Yan2,3


1 Center for Energy-efficient Computing and Applications, Peking University, Beijing, China
2 Department of Information Engineering, The Chinese University of Hong Kong
3 SenseTime Group Limited
Email: {liqianglu, ericlyun, walkershaw}@pku.edu.cn, [email protected]

Abstract—In recent years, Convolutional Neural Networks (CNNs) have become widely adopted for computer vision tasks. FPGAs have been explored as a promising hardware accelerator for CNNs due to their high performance, energy efficiency, and reconfigurability. However, prior FPGA solutions based on the conventional convolution algorithm are often bounded by the computational capability of FPGAs (e.g., the number of DSPs). In this paper, we demonstrate that the fast Winograd algorithm can dramatically reduce the arithmetic complexity and improve the performance of CNNs on FPGAs. We first propose a novel architecture for implementing the Winograd algorithm on FPGAs. Our design employs a line-buffer structure to effectively reuse the feature map data among different tiles. We also effectively pipeline the Winograd PE engine and initiate multiple PEs through parallelization. Meanwhile, there exists a complex design space to explore. We propose an analytical model to predict the resource usage and reason about the performance. Then, we use the model to guide a fast design space exploration. Experiments using state-of-the-art CNNs demonstrate the best performance and energy efficiency on FPGAs. We achieve an average of 1006.4 GOP/s for the convolutional layers and 854.6 GOP/s for the overall AlexNet, and an average of 3044.7 GOP/s for the convolutional layers and 2940.7 GOP/s for the overall VGG16, on the Xilinx ZCU102 platform.

I. INTRODUCTION

Deep convolutional neural networks (CNNs) have achieved remarkable performance for various computer vision tasks including image classification, object detection, and semantic segmentation [1, 2]. The significant accuracy improvement of CNNs comes at the cost of huge computational complexity, as it requires a comprehensive assessment of all the regions across the feature maps [3, 4]. To cope with such overwhelming computation pressure, hardware accelerators such as GPUs, FPGAs, and ASICs have been employed to accelerate CNNs [5–17]. Among these accelerators, FPGAs have emerged as a promising solution due to their high performance, energy efficiency, and reprogrammability. More importantly, High Level Synthesis (HLS) using C or C++ has greatly lowered the programming hurdle of FPGAs and improved productivity [18–20].

A CNN typically involves multiple layers, where the output feature maps of one layer are the input feature maps of the following layer. Prior studies have shown that the computation of state-of-the-art CNNs is dominated by the convolutional layers [6, 7]. Using the conventional convolution algorithm, each element in the output feature map is computed individually using multiple multiply-accumulate operations. While prior FPGA solutions for CNNs using this algorithm have demonstrated preliminary success [5–9, 11], greater efficiency is possible when the algorithm itself can be made more efficient. In this paper, we show how convolution using the Winograd algorithm [21] can dramatically reduce the arithmetic complexity and improve the performance of CNNs on FPGAs. Using the Winograd algorithm, a tile of elements in the output feature map is generated together by exploiting the structural similarity among them. This helps to cut down the arithmetic complexity by reducing the required number of multiplications. It has been demonstrated that the fast Winograd algorithm can be used to derive efficient algorithms for CNNs with small filters [16].

More importantly, the current trend of CNNs is towards deeper topologies with small filters. For example, all convolutional layers of AlexNet employ 3 × 3 and 5 × 5 filters except the first layer [3]; VGG16 only uses 3 × 3 filters [22]. This opens up the opportunity of using the Winograd algorithm for efficient implementation of CNNs. However, although using the Winograd algorithm on FPGAs is appealing, several problems remain. First, it is crucial that the design not only minimizes the memory bandwidth requirement but also matches the memory throughput with the computation engines. Second, there exists a large design space when mapping the Winograd algorithm onto FPGAs. It is very difficult to reason about which designs will improve or harm the performance.

In this paper, we design a line-buffer structure to cache the feature maps for the Winograd algorithm. This allows different tiles to reuse data as the convolution operations progress. The computation of the Winograd algorithm involves a mixed matrix transformation of general purpose matrix multiplication (GEMM) and element-wise multiplication (EWMM).

∗ Work done while the author interned at SenseTime.
† Corresponding Author.

Figure 1: Comparison of conventional and Winograd convolution algorithms. We assume the stride S is 1 for the Winograd algorithm.

Then, we design an efficient Winograd PE and initiate multiple PEs through parallelization. Finally, we develop analytical models to estimate the resource usage and predict the performance. We use the models to explore the design space and identify the optimal design parameters.
This paper makes the following contributions.
• We propose an architecture for efficient implementation of CNNs using the Winograd algorithm on FPGAs. The architecture employs a line-buffer structure, general purpose and element-wise matrix multiplication for the Winograd PE, and PE parallelization.
• We develop analytical resource and performance models and use them to explore the design space and identify the optimal parameters.
• We perform rigorous validation of our techniques using state-of-the-art CNNs including AlexNet and VGG16.
Experiments using the state-of-the-art CNNs demonstrate the best performance and energy efficiency of CNNs on FPGAs. We achieve an average of 1006.4 GOP/s for the convolutional layers and 854.6 GOP/s for the overall AlexNet, and an average of 3044.7 GOP/s for the convolutional layers and 2940.7 GOP/s for the overall VGG16 on the Xilinx ZCU102 platform. This corresponds to an energy efficiency of 36.2 GOP/s/W for AlexNet and 124.6 GOP/s/W for VGG16.

II. BACKGROUND

A. CNN Basics

In general, a CNN is composed of a series of layers, and each layer in turn is composed of input feature maps, filters and output feature maps. Among these layers, the convolutional layers account for the major computation. CNNs are trained off-line, and FPGAs are mainly used for accelerating the inference phase [5, 7, 8, 23]. Figure 1(a) presents a typical convolutional layer and its implementation using the conventional algorithm. With the conventional convolution algorithm, each element in the output feature map is computed individually by multiplying and accumulating the corresponding input feature data with the filters.

B. Winograd Algorithm

The trend of CNNs is moving towards deeper topologies with small filters. The conventional convolution algorithm is general, but less efficient. As an alternative, convolution can be implemented more efficiently using the Winograd minimal filtering algorithm [21].

Let us denote the result of computing m outputs with an r-tap FIR filter as F(m, r). The conventional algorithm for F(2, 3) requires 2 × 3 = 6 multiplications. The Winograd algorithm computes F(2, 3) in the following way. With

In = [z0 z1 z2 z3]^T,   F = [x0 x1 x2]^T,   Out = [y0 y1]^T,

[ z0 z1 z2 ]   [ x0 ]     [ m1 + m2 + m3 ]   [ y0 ]
[ z1 z2 z3 ] × [ x1 ]  =  [ m2 − m3 − m4 ] = [ y1 ]        (1)
               [ x2 ]

where m1, m2, m3, m4 are:

m1 = (z0 − z2) x0          m2 = (z1 + z2) (x0 + x1 + x2) / 2
m4 = (z1 − z3) x2          m3 = (z2 − z1) (x0 − x1 + x2) / 2        (2)

Only 4 multiplications are necessary for computing m1–m4. The one-dimensional convolution using the Winograd algorithm can be formulated using the transformation matrices A, B and G as follows,

Out = A^T [(G F) ⊙ (B^T In)]        (3)

        [ 1  0 −1  0 ]         [  1    0    0  ]
B^T  =  [ 0  1  1  0 ]    G =  [ 1/2  1/2  1/2 ]    A^T = [ 1  1  1  0 ]
        [ 0 −1  1  0 ]         [ 1/2 −1/2  1/2 ]          [ 0  1 −1 −1 ]
        [ 0  1  0 −1 ]         [  0    0    1  ]

where ⊙ denotes element-wise multiplication (EWMM).

In this paper, we use the two-dimensional Winograd algorithm F(m × m, r × r), where the output tile size is m × m, the filter size is r × r and the input tile size is n × n (n = m + r − 1). The output tile is derived as follows,

Out = A^T [U ⊙ V] A,   where  U = G F G^T  and  V = B^T In B        (4)
By defining the transformation matrices A, B, and G, we can formulate the 2-D Winograd algorithm as a mixed general purpose and element-wise matrix multiplication, as shown in Figure 1(b). The transformation matrices are generated offline once n and r are determined. In our implementation, the multiplications with the constants in the transformation matrices are converted to shift operations (like × 1/2), which is more efficient and uses only LUTs and flip-flops on FPGAs.

As shown in Figure 1(b), each time the Winograd algorithm is called, it generates a tile of size m × m together. The number of multiplications is determined by the ⊙ in Equation 4. To compute an m × m tile in the output feature map, the Winograd algorithm requires n² multiplications, while the conventional algorithm requires m²r² multiplications. For example, for a 4 × 4 output tile generated by convolving a 6 × 6 input tile with a 3 × 3 filter, conventional convolution needs 4² × 3² = 144 multiplications, while the Winograd algorithm only needs 6 × 6 = 36 multiplications. However, the Winograd algorithm requires more additions than the conventional algorithm, as it needs to add the intermediate results together.
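To make the 2-D formulation concrete, the following self-contained C sketch computes one F(2 × 2, 3 × 3) output tile using the transformation matrices of Equation 3 — the smallest 2-D case, chosen for brevity; the function and variable names are ours, not the paper's. All n² = 16 genuine multiplications occur in the element-wise product U ⊙ V, versus m²r² = 36 for direct convolution of the same tile:

```c
#include <stdio.h>

/* F(2x2, 3x3): n = 4, m = 2, r = 3, using the matrices of Equation (3). */
static const float BT[4][4] = {{1,0,-1,0},{0,1,1,0},{0,-1,1,0},{0,1,0,-1}};
static const float G [4][3] = {{1,0,0},{0.5f,0.5f,0.5f},{0.5f,-0.5f,0.5f},{0,0,1}};
static const float AT[2][4] = {{1,1,1,0},{0,1,-1,-1}};

/* Out = AT * (U (.) V) * A  with  U = G*f*G^T,  V = BT*in*B   (Equation 4). */
void winograd_2x2_3x3(const float in[4][4], const float f[3][3], float out[2][2]) {
    float gf[4][3], U[4][4], t[4][4], V[4][4], M[4][4], am[2][4];
    for (int i = 0; i < 4; i++)                 /* U = G * f * G^T */
        for (int j = 0; j < 3; j++) {
            gf[i][j] = 0;
            for (int k = 0; k < 3; k++) gf[i][j] += G[i][k] * f[k][j];
        }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            U[i][j] = 0;
            for (int k = 0; k < 3; k++) U[i][j] += gf[i][k] * G[j][k];
        }
    for (int i = 0; i < 4; i++)                 /* V = BT * in * B */
        for (int j = 0; j < 4; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 4; k++) t[i][j] += BT[i][k] * in[k][j];
        }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            V[i][j] = 0;
            for (int k = 0; k < 4; k++) V[i][j] += t[i][k] * BT[j][k];
        }
    for (int i = 0; i < 4; i++)                 /* EWMM: the only 16 real multiplications */
        for (int j = 0; j < 4; j++) M[i][j] = U[i][j] * V[i][j];
    for (int i = 0; i < 2; i++)                 /* Out = AT * M * A */
        for (int j = 0; j < 4; j++) {
            am[i][j] = 0;
            for (int k = 0; k < 4; k++) am[i][j] += AT[i][k] * M[k][j];
        }
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            out[i][j] = 0;
            for (int k = 0; k < 4; k++) out[i][j] += am[i][k] * AT[j][k];
        }
}

int main(void) {                                /* sanity check against direct convolution */
    float in[4][4], f[3][3], out[2][2];
    for (int i = 0; i < 4; i++) for (int j = 0; j < 4; j++) in[i][j] = (float)(i * 4 + j + 1);
    for (int i = 0; i < 3; i++) for (int j = 0; j < 3; j++) f[i][j]  = (float)(i - j);
    winograd_2x2_3x3(in, f, out);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            float ref = 0;
            for (int p = 0; p < 3; p++)
                for (int q = 0; q < 3; q++) ref += in[i + p][j + q] * f[p][q];
            printf("winograd %6.1f   direct %6.1f\n", out[i][j], ref);
        }
    return 0;
}
```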
III. ARCHITECTURE DESIGN

In this paper, we propose an FPGA accelerator design for CNNs based on the two-dimensional Winograd algorithm. Unlike the conventional convolution algorithm, where each element in the output feature map is computed individually, the Winograd algorithm can generate a tile of output feature maps together by exploiting the structural similarity among the elements in the same tile of the input feature map. More precisely, given an n × n input tile and an r × r filter, we employ the Winograd algorithm to generate an m × m (n = m + r − 1) tile of the output feature map. To derive the next m × m output tile, we just need to slide the input tile by m and perform the same Winograd computation, as shown in Figure 1(b).

Several challenges arise when designing and implementing the Winograd-algorithm-based CNN accelerator on FPGAs. First, the convolution layers have a high memory bandwidth demand. We observe that neighboring tiles share input feature map data both horizontally and vertically. We leverage this observation to design line buffers that maximize data reuse (Section III-B). Second, different from the conventional convolution algorithm, the Winograd algorithm generates a tile of output feature maps at a time. This requires that all the elements in the input tiles and filters are ready at the same time before the Winograd transformation starts. We design an efficient PE engine for the Winograd algorithm (Section III-C) and instantiate multiple PEs through parallelization (Section III-D). Third, the different implementation parameters (tile size, parallelization degree) form a large design space with multi-dimensional resource and bandwidth constraints. We propose an analytical model for performance prediction and leverage it to explore the space efficiently (Section III-E).

Figure 2: Architecture overview

A. Architecture Overview

Figure 2 presents the architecture overview of a convolutional layer based on the Winograd algorithm on FPGAs. We identify data reuse opportunities in the feature maps of neighboring tiles. To this end, we naturally implement line buffers. There are multiple channels of input feature maps (M), as shown in Figure 1. Each line of the line buffers stores the same rows across all the channels. Winograd PEs fetch data from the line buffers. Concretely, given an n × n input tile, a Winograd PE generates an m × m output tile. We initiate an array of PEs by parallelizing the processing of the multiple channels. Finally, we use double buffers to overlap the data transfer and computation. All the input data (e.g., input feature maps, filters) are stored in the external memory initially. The input and output feature maps are transferred to the FPGA via a FIFO. However, the size of the filters increases significantly as the network goes deeper, so it is impractical to load all the filters into on-chip memory. In our design, we split the input and output channels into several groups. Each group only contains a portion of the filters. We load the filters group by group when they are needed. In the following, we assume there is only one group for easy illustration.

B. Line Buffer Design

There exist data reuse opportunities both horizontally and vertically. Clearly, two neighboring tiles share (r − 1) × n elements of each input feature map, as shown in Figure 1(b). To exploit the data reuse opportunities, we store a few lines in the on-chip memory. Each input line buffer contains M × W elements, where M is the number of input channels and W is the width of the input feature maps, as shown in Figure 2. Each output line buffer contains N × C elements, where N is the number of output channels and C is the width of the output feature maps, as shown in Figure 1(b).

Figure 3: Winograd PE design
However, different layers may have different feature map widths and channel counts. In practice, we set W to the maximal width of all the feature maps.

To reuse the data, we store n + m input lines in on-chip memory in total and rotate the lines as a circular buffer. More precisely, the Winograd engines initially read the first n lines from the line buffer directly; meanwhile, the next m lines of the line buffer load data from external memory. The computation on the n lines and the transfer of the m lines are done in parallel by employing the double-buffer design. Note that the stride between two neighboring tiles in the Winograd algorithm is m. Therefore, the Winograd PE engines then advance by m lines and process the following n lines from the line buffer, while the m lines just skipped are overwritten by newly loaded data from the external memory. During this process, if it reaches the bottom of the line buffer, it rotates back to the beginning of the line buffer.
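A minimal software model of this rotation scheme is sketched below (our own illustration with hypothetical names; only one channel is shown, whereas each on-chip line actually holds all M channels, and the real design double-buffers the loads):

```c
#include <string.h>

#define W_MAX  224                    /* maximal feature-map width (assumption)  */
#define N_TILE 6                      /* n: input tile height                    */
#define M_TILE 4                      /* m: output tile height = tile stride     */
#define LINES  (N_TILE + M_TILE)      /* n + m lines kept on chip                */

static float line_buf[LINES][W_MAX];  /* one channel shown for clarity            */

static void load_row(float *dst, const float *ext_row) {
    memcpy(dst, ext_row, W_MAX * sizeof(float));   /* models one burst from external memory */
}

void process_feature_map(const float *ext_mem, int height) {
    /* Preload rows 0 .. n-1 (omitted).  Row r always lives in slot r % LINES,
     * which is safe because at most n + m consecutive rows are alive at once. */
    for (int row = 0; row + N_TILE <= height; row += M_TILE) {
        /* The PEs consume rows row .. row+n-1 from slots (row+i) % LINES, while */
        /* the next m rows are prefetched into the slots freed by the skipped    */
        /* rows; the circular index wraps back to the top of the buffer.         */
        for (int k = 0; k < M_TILE; k++) {
            int next = row + N_TILE + k;
            if (next < height)
                load_row(line_buf[next % LINES], ext_mem + (size_t)next * W_MAX);
        }
        /* compute_output_strip(row);  -- Winograd transform/EWMM pipeline runs here */
    }
}
```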

C. Winograd PE Design
Figure 3 gives the dataflow of our Winograd PEs. We divide the Winograd algorithm in Figure 1(b) into 4 stages so that different tiles can be effectively overlapped through pipelining. The transformation matrices (A, B, G) are computed offline once the Winograd tile size is determined. In stage 1, the input tiles and filters are transformed. Note that the filter transformation could be done offline; the reason we choose to transform it online is to save on-chip BRAM resources. Moreover, this does not cause extra delay, as the transformations of the input and the filter can be done in parallel since they are independent. In stage 2, we use an array of DSPs to perform the EWMM computation. In stage 3, we perform the additional (output) transformation. Finally, in stage 4, we accumulate the output tiles from the different input channels.

The Winograd algorithm in Figure 1(b) is implemented using local buffers to store the transformation matrices. In our implementation, we completely partition the transformation and intermediate matrices into registers. This helps to improve the memory bandwidth as it alleviates memory bank conflicts. Note that when we multiply the constants in the transformation matrix with the input and filters, we do not use the DSPs. Instead, we implement the multiplications by constants using shift operations, which are implemented as Look-Up Table (LUT) arrays on FPGAs.

D. PE Parallelization

To initiate an array of PEs, we can parallelize over the rows and columns of the input feature maps, and over the input and output channels. This corresponds to parallelizing/unrolling the four loops (row, col, ti, to) surrounding the Winograd engine in Figure 1(b). We choose not to parallelize the row_loop, as it would significantly increase the size of the line buffers. Different parallelization strategies for the other three loops can lead to different data sharing and throughput [7]. Similar to [7], we choose to parallelize the ti_loop and to_loop, as parallelization of the col_loop can lead to serious memory bank conflicts. We define the unroll factors of ti_loop and to_loop as Pm and Pn, respectively. Therefore, there are a total of Pm × Pn Winograd PEs in parallel. We implement the parallelization through loop unrolling.

Together with loop unrolling, we also partition the input, output and filter buffers to sustain sufficient on-chip memory bandwidth. The filter buffers have four dimensions (row, column, input channels and output channels), and we partition each dimension. The input and output buffers are two-dimensional, and we partition each dimension. Table I gives the partition factors for the various buffers.

Table I: Memory partition factors.

Buffer    Column   Row   Input channels   Output channels
filter    r        r     Pm               Pn
input     n        -     Pm               -
output    m        -     -                Pn

E. Design Space Exploration

Our Winograd implementation involves a few design parameters: the input tile size (n) and the parallelization degrees (Pm and Pn). Given an input tile size n, since the filter size is fixed for a neural network layer (e.g., 3 × 3, 5 × 5), the output tile size m is determined (m = n − r + 1). These design parameters affect both the performance and the accuracy. Here, we develop an analytical model that can predict the performance of the Winograd algorithm on FPGAs. Then, we rely on it to explore the design space.

As mentioned in Section II-B, the multiplication saving increases as the input tile size n increases. However, the range of the constants in the transformation matrices also increases with n, which may cause precision loss. In this work, we use 16-bit fixed-point to represent both data and filters. We set the precision to 2^−10 for the filters to maintain high accuracy, as in prior work [23]. Under this precision constraint, we set the maximum value of n to 8, as beyond this we cannot precisely represent the constants in the transformation matrix.

In the following, we model the resource consumption and predict the performance for different input tile sizes n and parallelization degrees Pm and Pn. As mentioned in Section II-B, only the EWMM operation consumes DSPs. Therefore, the number of DSPs only depends on the size of the input tile and the parallelization degree:

DSP = n² × Pn × Pm        (5)
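The DSP count in Equation 5 corresponds to the fully unrolled element-wise stage of the PE array. The following C sketch (our own illustration, not the paper's code; NT, PM and PN are example values, and the unroll pragmas are Vivado HLS-style hints) makes the mapping explicit — every iteration of the unrolled nest is one multiplier:

```c
#define NT 6   /* n: transformed tile size (illustrative)            */
#define PM 4   /* unroll factor of the input-channel loop (ti_loop)  */
#define PN 4   /* unroll factor of the output-channel loop (to_loop) */

/* Stage 2 of the PE array: element-wise products across PM x PN PEs.
 * Fully unrolled, every iteration maps to one multiplier, giving the
 * n * n * Pm * Pn DSPs counted in Equation (5).                      */
void ewmm_stage(const short U[PN][PM][NT][NT],   /* transformed filters     */
                const short V[PM][NT][NT],       /* transformed input tiles */
                int prod[PN][PM][NT][NT])        /* products, consumed by stages 3-4 */
{
    for (int to = 0; to < PN; to++) {
#pragma HLS UNROLL
        for (int ti = 0; ti < PM; ti++) {
#pragma HLS UNROLL
            for (int i = 0; i < NT; i++) {
#pragma HLS UNROLL
                for (int j = 0; j < NT; j++) {
#pragma HLS UNROLL
                    prod[to][ti][i][j] = U[to][ti][i][j] * V[ti][i][j];
                }
            }
        }
    }
}
```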

LUT usage is difficult to predict. Here, we approximate its consumption using a linear regression model,

LUT = α_nr × Pm × Pn        (6)

where α_nr is the LUT consumption of a single Winograd PE with input tile size n and filter size r. α_nr is pre-trained on the different platforms.

The number of BRAM banks is computed by adding the banks for the filter, input and output buffers based on the memory partition factors in Table I:

Banks = r² × Pm × Pn + (n + m) × n × Pm + 2 × m² × Pn        (7)

We also model the memory bandwidth between the on-chip and off-chip memory. To efficiently utilize the resources, the data transfer speed must be greater than or equal to the computation speed. The time to process n rows of input data in the line buffer is

Tcompute = (⌈W/m⌉ × ⌈M/Pm⌉ × ⌈N/Pn⌉ × II + Pdepth) × (1/Freq)        (8)

where Freq is the operating frequency of the FPGA and II denotes the initiation interval of the pipeline. In our implementation, the loops in Figure 1(b) are perfectly pipelined, so II = 1. Pdepth is the pipeline depth, which can be ignored when the loop trip count is large enough.

The computation is performed in parallel with the transfer of m rows of input and output data:

Ttransfer = (m × W × max(N, M) × 16) / Bandwidth        (9)

We require that Ttransfer ≤ Tcompute. Therefore, the bandwidth requirement is

Bandwidth ≥ m² × (Pm × Pn / min(N, M)) × 16 × Freq        (10)

We define Tinit as the time to load the filters and the first n rows of the input image into on-chip memory,

Tinit = (M × N × r × r + n × W × M) / (Bandwidth / 16)        (11)

The total operations and processing time of the convolution are

OPs = H × W × M × N × r² × 2        (12)

Ttotal = ⌈H/m⌉ × Tcompute + Tinit        (13)

We define the effective performance of the convolution based on the Winograd algorithm as

Perf_eff = OPs / Ttotal        (14)

Now, given a convolutional layer represented by {H, W, M, R, C, N, r}, our goal is to find the optimal solution {n, Pm, Pn} that maximizes the performance (Equation 14) under the resource and bandwidth constraints. To solve this problem, we rely on our performance models to explore the design space and identify the optimal solution.
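Since n ≤ 8 and the parallelization degrees are bounded by the resource constraints, the space is small enough for exhaustive search. The C sketch below restates Equations 5–14 and enumerates {n, Pm, Pn}; the layer shape, resource limits and the per-PE LUT coefficient are illustrative placeholders, not values from the paper:

```c
#include <stdio.h>
#include <math.h>

/* Layer shape and platform limits.  The limits and the per-PE LUT coefficient
 * alpha are illustrative placeholders, not values from the paper.             */
typedef struct { int H, W, M, N, r; } Layer;
typedef struct { int dsp, bram_banks; double luts, bw_bytes, freq; } Fpga;

static double cdiv(double a, double b) { return ceil(a / b); }

/* Effective performance following Equations (5)-(14); returns 0 if infeasible. */
static double perf_eff(Layer L, Fpga F, int n, int Pm, int Pn, double alpha) {
    int m = n - L.r + 1;
    int    dsp   = n * n * Pm * Pn;                                          /* Eq. (5)  */
    double luts  = alpha * Pm * Pn;                                          /* Eq. (6)  */
    int    banks = L.r * L.r * Pm * Pn + (n + m) * n * Pm + 2 * m * m * Pn;  /* Eq. (7)  */
    double bw_rq = (double)m * m * Pm * Pn / (L.M < L.N ? L.M : L.N) * 2.0 * F.freq; /* Eq. (10), 16 bit = 2 B */
    if (dsp > F.dsp || luts > F.luts || banks > F.bram_banks || bw_rq > F.bw_bytes)
        return 0.0;
    double t_comp = cdiv(L.W, m) * cdiv(L.M, Pm) * cdiv(L.N, Pn) / F.freq;   /* Eq. (8), II = 1 */
    double t_init = ((double)L.M * L.N * L.r * L.r + (double)n * L.W * L.M) * 2.0 / F.bw_bytes; /* Eq. (11) */
    double ops    = 2.0 * L.H * L.W * L.M * L.N * L.r * L.r;                 /* Eq. (12) */
    double t_tot  = cdiv(L.H, m) * t_comp + t_init;                          /* Eq. (13) */
    return ops / t_tot;                                                      /* Eq. (14) */
}

int main(void) {
    Layer L = {224, 224, 64, 64, 3};                   /* example layer               */
    Fpga  F = {2520, 1824, 600e3, 19.2e9, 200e6};      /* assumed platform limits     */
    double best = 0; int bn = 0, bm = 0, bp = 0;
    for (int n = L.r + 1; n <= 8; n++)                 /* n capped at 8 (precision)   */
        for (int Pm = 1; Pm <= 64; Pm++)
            for (int Pn = 1; Pn <= 64; Pn++) {
                double p = perf_eff(L, F, n, Pm, Pn, 2000.0);
                if (p > best) { best = p; bn = n; bm = Pm; bp = Pn; }
            }
    printf("best n=%d Pm=%d Pn=%d  Perf_eff=%.1f GOP/s\n", bn, bm, bp, best / 1e9);
    return 0;
}
```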
Figure 4: FC layer implementation

F. Implementation of Other Layers

In addition to the convolution layers, there are other layers in CNNs, such as Fully Connected (FC) layers, Pooling and Rectified Linear Unit (ReLU) layers. Here, we describe how to implement these layers.

FC layers connect all the neurons in the previous layer to every single neuron through the weight matrix, as shown in Figure 4(a). The computation is a matrix-vector product. The operations in FC layers can be treated as EWMM by filling the input neurons and their corresponding weights into a matrix. To reuse the Winograd PE, FC layers only need to bypass the transformation stages (stages 1 and 3 in Figure 3). The weights in FC layers are significantly larger than the input neurons. Therefore, similar to [8], we load the entire set of input neurons of the FC layer into on-chip memory but stream the weights through a FIFO interface. In addition, the FC computation contains no data reuse opportunities. To improve the memory bandwidth utilization, an effective approach is to increase the batch size Nbatch (the number of input images). Specifically, we assemble a batch of images from the previous layer, and these images are processed together.

Max Pooling layers are widely used in CNNs; they output the maximum values in subregions of the input feature maps. ReLU layers set any input value less than zero to zero. ReLU and Pooling are implemented by introducing comparison operators to the output buffers.
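A behavioural sketch of the comparison-based ReLU and 2 × 2 max pooling applied to an output buffer (our own illustration; the buffer width is an assumed example):

```c
#define FM_W 224                         /* assumed feature-map width */

static inline short relu(short v) { return v > 0 ? v : 0; }

/* ReLU followed by 2x2 max pooling, realised purely with comparisons
 * on the output buffer (a behavioural sketch, not the paper's code). */
void relu_maxpool_2x2(const short in[][FM_W], short out[][FM_W / 2], int rows, int cols) {
    for (int i = 0; i < rows; i += 2)
        for (int j = 0; j < cols; j += 2) {
            short a = relu(in[i][j]),     b = relu(in[i][j + 1]);
            short c = relu(in[i + 1][j]), d = relu(in[i + 1][j + 1]);
            short ab = a > b ? a : b;
            short cd = c > d ? c : d;
            out[i / 2][j / 2] = ab > cd ? ab : cd;
        }
}
```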
Figure 6: Resource utilization and performance results for 3 × 3 filter

Figure 7: Resource utilization and performance results for 5 × 5 filter

Figure 5: Automatic tool flow

IV. AUTOMATIC TOOL FLOW

We design an automatic tool flow to automate the mapping of CNNs onto FPGAs, as shown in Figure 5. The flow consists of four steps. In step 1, the CNN architecture and the FPGA configuration are fed into the design space exploration engine (DSEE). We use a Caffe prototxt to describe the structure of the CNN [24]. The FPGA configuration parameters include the memory bandwidth, the number of DSPs, the logic cells and the on-chip memory capacity. The output of the DSEE is the optimal solution {n, Pm, Pn} described in Section III-E. In step 2, based on the optimal solution, a Code Generation Engine (CGE) generates the Winograd convolution functions automatically. The functions describe the whole accelerator architecture, including line buffers, buffer management, and Winograd PEs. The generated implementation is HLS-compatible C code. Pragmas such as memory partition factors, the loop unroll factors Pm and Pn, and FIFO interfaces are inserted into the functions. In step 3, we use the Xilinx HLS tool to synthesize the code into register-transfer level. Finally, we use the Xilinx SDSoC (software-defined system-on-chip) tool-chain to generate the bitstream.
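The paper does not list the generated functions; purely as an illustration of the kind of HLS-compatible C structure and pragmas described above (all names, sizes, and the empty PE stub below are ours), such generated code might be organized as follows:

```c
/* Hypothetical skeleton of the code a CGE could emit: line buffers partitioned
 * per Table I, the row/col/ti/to loop nest of Figure 1(b), and a pipelined body. */
#define W_MAX 224
#define NT    6            /* n: input tile size  */
#define MT    4            /* m: output tile size */
#define R     3
#define PM    4
#define PN    16

typedef short data_t;      /* 16-bit fixed point  */

static data_t in_lines [NT + MT][PM][W_MAX];   /* input line buffer (one channel group)  */
static data_t out_lines[MT][PN][W_MAX];        /* output line buffer                     */
static data_t filt[R][R][PM][PN];              /* one filter group                       */

static void winograd_pe_array(int row, int col) {
    (void)row; (void)col;
    /* stage 1: input/filter transform, stage 2: EWMM on DSPs,
       stage 3: output transform, stage 4: accumulation into out_lines */
}

void conv_layer(int H, int W, int M, int N) {
#pragma HLS ARRAY_PARTITION variable=in_lines  complete dim=2
#pragma HLS ARRAY_PARTITION variable=out_lines complete dim=2
#pragma HLS ARRAY_PARTITION variable=filt      complete dim=3
#pragma HLS ARRAY_PARTITION variable=filt      complete dim=4
    for (int row = 0; row + NT <= H; row += MT)          /* row_loop (not unrolled) */
        for (int col = 0; col + NT <= W; col += MT)      /* col_loop                */
            for (int ti = 0; ti < M; ti += PM)           /* ti_loop, PM PEs wide    */
                for (int to = 0; to < N; to += PN) {     /* to_loop, PN PEs wide    */
#pragma HLS PIPELINE II=1
                    winograd_pe_array(row, col);         /* PM x PN PEs unrolled inside */
                }
}
```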
V. EXPERIMENT EVALUATION

A. Experiment Setup

We evaluate our techniques on two FPGA platforms: Xilinx ZC706 and ZCU102. The Xilinx ZC706 platform consists of a Kintex-7 FPGA and dual ARM Cortex-A9 processors; the external memory is 1 GB DDR3. Our FPGA implementation operates at a 166 MHz frequency on this platform. The Xilinx ZCU102 consists of an UltraScale FPGA, quad ARM Cortex-A53 processors, and 500 MB DDR3. Our FPGA implementation operates at a 200 MHz frequency on this platform. To measure the runtime power, we plugged a power meter into the FPGA platform.

In the following, we first present the model and resource analysis results for a typical convolution layer (Section V-B). Then, we perform case studies using state-of-the-art CNNs including AlexNet and VGG16 (Section V-C). It should be noted that the performance we report in the following is the effective performance. It is computed by dividing the total operations by the total processing time (Equation 14). For the conventional algorithm, the effective performance is always bounded by MaxF, the maximum computational capability of the FPGA platform: MaxF = DSP × Freq × 2, where the factor of 2 accounts for the multiply and add operations. However, for the Winograd algorithm, the effective performance can exceed MaxF, as the Winograd algorithm increases the effective DSP efficiency by reducing the number of multiplications required by convolution.
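As a quick illustration using the ZCU102 figures reported later in Table IV (2520 DSPs at 200 MHz), MaxF = 2520 × 200 MHz × 2 ≈ 1008 GOP/s; the 3044.7 GOP/s measured for the VGG16 convolutional layers is therefore roughly 3× beyond what the conventional algorithm could reach with the same DSP budget.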
B. Model and Resource Analysis

In this subsection, we evaluate our analytical models and analyze the resource usage of the Winograd algorithm using a single convolutional layer. We use a typical input feature map size of 224 (H) × 224 (W) and try two different filter sizes: 3 × 3 and 5 × 5. Figure 6 and Figure 7 compare the predicted and actual performance for different input tile sizes and parallelization degrees, and give the corresponding resource utilization. The experiments are performed on the Xilinx ZC706. We can see that our performance prediction is very accurate. On average, the prediction error is 15.4% and 13.7% for the 3 × 3 and 5 × 5 filters, respectively. The sources of the inaccuracy may be the discrepancy between actual and peak bandwidth, and DDR access latency.

Thanks to the Winograd algorithm, DSPs are no longer the limiting resource in most cases, as shown by Figures 6 and 7. Instead, BRAMs and memory bandwidth can be the limiting resources. The BRAM consumption comes from a few aspects. First, unlike the conventional convolution, Winograd convolution requires more buffers because of the line buffer structure. Second, parallelizing Winograd PEs requires memory partitioning to sustain the on-chip memory bandwidth. Finally, when the computation efficiency improves, the off-chip bandwidth can become the bottleneck. Overall, the Winograd algorithm saves DSPs and improves the overall resource utilization.

C. Case Study

Here, we evaluate our Winograd implementation using AlexNet and VGGNet. Table II gives the parameters for each network in our implementation.

Table II: Design parameters

ZC706 (ZCU102)     n      Pn     Pm      Nbatch
AlexNet (3×3)      6 (6)  2 (4)  8 (8)   32 (128)
VGG16 (3×3)        7 (6)  4 (4)  4 (16)  32 (128)

1) AlexNet: AlexNet consists of five convolution layers and three FC layers [3]. The input image is 224 × 224. All the convolution layers use small filters (5 × 5 and 3 × 3) except the first convolution layer (11 × 11). For the first layer, we choose to use the conventional convolution algorithm. For the remaining layers, we use a uniform 3 × 3 filter for the Winograd algorithm. For the 5 × 5 filter, we implement it using four 3 × 3 filters with zero padding. Table III gives the results. [7] only gives the convolution implementation without FC layers, and [5] only gives the overall CNN performance without detailed results for each convolutional layer. Compared to prior work [7], we improve the average convolution performance from 61.6 GOP/s to 1006.4 GOP/s¹. For the overall CNN, we improve the performance from 72.4 GOP/s to 854.6 GOP/s compared to [5].
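The 5 × 5 decomposition used above can be checked directly: zero-pad the 5 × 5 filter to 6 × 6, split it into a 2 × 2 grid of 3 × 3 sub-filters, and sum the four partial convolutions taken at offsets of 3 in each direction. A small C illustration (our own code, not the paper's):

```c
/* One output pixel of a direct 5x5 convolution at (i, j); W is the row stride. */
float conv5x5(const float *x, int W, int i, int j, const float w[5][5]) {
    float s = 0;
    for (int p = 0; p < 5; p++)
        for (int q = 0; q < 5; q++) s += x[(i + p) * W + (j + q)] * w[p][q];
    return s;
}

/* Same pixel computed as four 3x3 convolutions of the zero-padded 6x6 filter,
 * taken at offsets of 3 in each direction.                                    */
float conv5x5_as_four_3x3(const float *x, int W, int i, int j, const float w[5][5]) {
    float wp[6][6] = {{0}};                        /* 5x5 filter zero-padded to 6x6 */
    for (int p = 0; p < 5; p++)
        for (int q = 0; q < 5; q++) wp[p][q] = w[p][q];
    float s = 0;
    for (int a = 0; a < 2; a++)                    /* 2x2 grid of 3x3 sub-filters */
        for (int b = 0; b < 2; b++)
            for (int p = 0; p < 3; p++)
                for (int q = 0; q < 3; q++)
                    s += x[(i + 3*a + p) * W + (j + 3*b + q)] * wp[3*a + p][3*b + q];
    return s;                                      /* equals conv5x5(x, W, i, j, w) */
}
```

Each of the four 3 × 3 sub-convolutions can then be fed through the same F(m × m, 3 × 3) Winograd pipeline.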
Table III: Performance comparison for AlexNet

                                 [7]            [5]            Our Impl       Our Impl
Precision                        32-bit float   16-bit fixed   16-bit fixed   16-bit fixed
Device                           VX485T         GSD8           ZC706          ZCU102
Freq (MHz)                       100            120            167            200
Logic cells (K)                  485.7          695            350            600
DSP²                             2800           1963           900            2520
BRAM (Kb)                        2060×18        2567×20        1090×18        1824×18
conv1 (GOP/s)                    27.5           -              83.1           409.6
conv2 (GOP/s)                    83.8           -              501.7          1355.6
conv3 (GOP/s)                    78.8           -              610.2          1535.7
conv4 (GOP/s)                    77.9           -              401.2          1361.7
conv5 (GOP/s)                    77.6           -              355.6          1285.7
conv average (GOP/s)             61.6           -              271.8          1006.4
CNN average (GOP/s)              -              72.4           201.4          854.6
Power (W)                        18.6           19.1           9.4            23.6
DSP efficiency (GOP/s/DSP)       0.022          0.037          0.224          0.339
Logic cell eff. (GOP/s/cell/K)   0.127          0.104          0.575          1.424
Energy efficiency (GOP/s/W)      3.31           3.79           21.4           36.2

¹ In [7], the FC layers are not implemented, so the efficiency values for [7] are calculated based on the average performance of convolution.
² On the Xilinx ZC706 (Kintex-7) platform, a single DSP (DSP48E1) slice can be implemented as one 18 × 25 fixed-point multiplier. On the Altera GSD8 (Stratix-V) platform, a single DSP slice can be implemented as two 18 × 18 fixed-point multipliers.

To make a fair comparison across different platforms, we also present the total resource efficiency and energy efficiency on each platform. In Table III, we can see that our implementation achieves better resource efficiency, which comes from the reduction of arithmetic complexity and the novel architecture. Our implementation also improves the energy efficiency from 3.79 GOP/s/W to 36.2 GOP/s/W.

2) VGGNet: In VGG16 [22], all convolutional layers use 3 × 3 filters, which fit the Winograd algorithm well. VGG16 consists of 5 convolution groups with different input sizes (224, 112, 56, 28, 14). Table IV compares our techniques with prior works. For the convolutional layers, we improve the average performance from 136.5–488 GOP/s to 3044.7 GOP/s compared to [5, 8, 23]. For the overall CNN, we improve the performance from 117.8–354 GOP/s to 2940.7 GOP/s.

Similar to the AlexNet experiments, we also measure the resource efficiency and energy efficiency, and similar findings hold for VGG16. We notice that we achieve higher performance for VGG16 than for AlexNet. This is because VGG16 uses a uniform convolution structure, while AlexNet uses two different convolution structures. We also find that the performance of the convolutional layers decreases as the network goes deeper. This is because the initial time (Tinit) accounts for a larger fraction of the total time (Ttotal), and the initial time only involves data transfer without actual computation.

Table IV: Performance comparison for VGG16

                                 [23]           [5]            [8]            Our Impl
Precision                        16-bit fixed   16-bit fixed   16-bit fixed   16-bit fixed
Device                           ZC706          GSD8           XC7VX690T      ZCU102
Freq (MHz)                       150            120            150            200
Logic cells (K)                  350            695            693            600
DSP                              900            1963           3600           2520
BRAM (Kb)                        1090×18        2567×20        2940×18        1824×18
conv1 (GOP/s)                    123.8          -              320            2734.7
conv2 (GOP/s)                    235.3          -              635            3212.4
conv3 (GOP/s)                    235.3          -              600            3111.1
conv4 (GOP/s)                    254.8          -              585            3069.3
conv5 (GOP/s)                    70.2           -              400            2431.4
conv average (GOP/s)             187.8          136.5          488            3044.7
CNN average (GOP/s)              137.0          117.8          354            2940.7
Power (W)                        9.6            -              25             23.6
DSP efficiency (GOP/s/DSP)       0.152          0.06           0.10           1.16
Logic cell eff. (GOP/s/cell)     0.391          0.196          0.511          4.901
Energy efficiency (GOP/s/W)      14.3           -              14.2           124.6

D. Comparison with GPU

In this subsection, we conduct a comparison between GPU and FPGA platforms. For GPUs, we measure the performance of VGG16 using the Caffe framework [24] on an NVIDIA TitanX platform. To make a fair comparison, we test the performance of the TitanX with the latest cuDNN 5.1 [25], as the Winograd algorithm is also included in cuDNN 5.1. Power on the GPU is obtained using NVIDIA profiling tools. Table V shows the comparison results.
can be implemented as one 18 × 25 fixed-point multiplier. In Altera GSD8
(Stratix-V) Platform, a single DSP slice can be implemented as two 18×18 Power on GPU is obtained using NVIDIA profiling tools.
fixed-point multipliers Table V shows the comparison results. As shown, TitanX

Table V: Comparison with GPU platforms

                              TitanX¹        TitanX²        ZC706          ZCU102
Technology                    28 nm          28 nm          28 nm          16 nm
Precision                     32-bit float   32-bit float   16-bit fixed   16-bit fixed
CNN average (TOP/s)           4.98           5.60           0.67           2.94
Power (W)                     130            134            9.4            23.6
Energy efficiency (GOP/s/W)   38.3           41.8           72.3           124.6

¹ We use the default implementation in cuDNN 5.1; selected layers call the Winograd algorithm.
² We force every layer to use the Winograd algorithm.

As shown, the TitanX gives better performance, but our implementation on the Xilinx ZCU102 FPGA achieves much better (2.98×) energy efficiency.

VI. RELATED WORK

Recently, FPGAs have been gaining popularity as accelerators for deep learning tasks due to their high performance, low power and reconfigurability. Most FPGA accelerators focus on the implementation of convolutional layers using the conventional algorithms [5, 7, 10, 11, 14, 23]. Zhang et al. [7] propose a design space exploration technique to optimize the throughput from the computation-resource and bandwidth aspects. Qiu et al. [23] propose a dynamic-precision data quantization to increase DSP efficiency. Several other studies target a uniform implementation for convolutional layers and FC layers [5, 8, 15]. In [5], 3D convolution operations are flattened into 2D general purpose matrix multiplication, which is widely adopted on GPU platforms; on FPGAs, however, it can result in massive memory usage. Zhang et al. [8] present a uniform representation for convolutional layers and FC layers, which can share the same computing resources. Song et al. [15] propose a general purpose accelerator using a kernel-partition method.

A few studies also focus on reducing the arithmetic complexity of convolution [10, 16, 26, 27] using non-conventional algorithms. Zhang et al. [10] reduce the computation using a low-rank approximation, which is based on minimizing the reconstruction error of the nonlinear response. Lavin [16] evaluates the Fast Fourier Transform (FFT) and Winograd algorithms on GPU platforms, but FFT shows less efficiency for convolutions with small filters. [26] implements FFT on an FPGA platform for CNNs, but it shows little reduction of computation complexity with small filters like 3 × 3. Aydonat et al. [27] apply the Winograd algorithm on an Arria 10 FPGA platform, but they only use 1-D Winograd to reduce the arithmetic complexity. In our work, we evaluate the 2-D Winograd algorithm on FPGA platforms, use a line-buffer structure to enable data reuse, and use performance models to guide the design space exploration.

VII. CONCLUSION

FPGAs have been widely used to accelerate CNN-based applications. However, prior implementations based on the conventional convolution algorithms are mainly limited by the computational capability of FPGAs. In this work, we propose a CNN architecture on FPGAs based on the Winograd algorithm, which can effectively reduce the arithmetic complexity. We also develop analytical models to estimate the resource usage and performance. Our implementations of AlexNet and VGG16 achieve an overall performance of 854.6 GOP/s and 2940.7 GOP/s, respectively, on the ZCU102 FPGA platform, which outperforms all previous work.

VIII. ACKNOWLEDGEMENT

We thank Qian Li for her help in the GPU experiment.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in ICCV, 2015.
[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, 2014.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
[5] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, "Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks," in FPGA, 2016.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in CVPR, 2015.
[7] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015.
[8] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, "Caffeine: towards uniformed representation and acceleration for deep convolutional neural networks," in ICCAD, 2016.
[9] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in FPL, 2009.
[10] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," in CVPR, 2015.
[11] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, "A massively parallel coprocessor for convolutional neural networks," in ASAP, 2009.
[12] Y. H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ISCA, 2016.
[13] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: an instruction set architecture for neural networks," in ISCA, 2016.
[14] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, "A dynamically configurable coprocessor for convolutional neural networks," in ACM SIGARCH Computer Architecture News, 2010.
[15] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, "C-Brain: a deep learning accelerator that tames the diversity of CNNs through adaptive data-level parallelization," in DAC, 2016.
[16] A. Lavin, "Fast algorithms for convolutional neural networks," arXiv preprint arXiv:1509.09308, 2015.
[17] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning," ACM SIGPLAN Notices, 2014.
[18] Y. Liang, K. Rupnow, Y. Li, D. Min, M. N. Do, and D. Chen, "High-level synthesis: productivity, performance, and software constraints," ECE, 2012.
[19] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for FPGAs: From prototyping to deployment," TCAD, 2011.
[20] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Czajkowski, "LegUp: high-level synthesis for FPGA-based processor/accelerator systems," in FPGA, 2011.
[21] S. Winograd, "Arithmetic complexity of computations," 1980.
[22] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[23] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., "Going deeper with embedded FPGA platform for convolutional neural network," in FPGA, 2016.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in MM, 2014.
[25] "NVIDIA cuDNN," https://developer.nvidia.com/cudnn.
[26] C. Zhang and V. Prasanna, "Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system," in FPGA, 2017.
[27] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, "An OpenCL deep learning accelerator on Arria 10," in FPGA, 2017.