0% found this document useful (0 votes)
11 views

(P1) High - Performance - CNN - Accelerators - Based - On - Hardware - and - Algorithm - Co-Optimization

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views

(P1) High - Performance - CNN - Accelerators - Based - On - Hardware - and - Algorithm - Co-Optimization

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

250 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO.

1, JANUARY 2021

High Performance CNN Accelerators Based on


Hardware and Algorithm Co-Optimization
Tian Yuan, Weiqiang Liu , Senior Member, IEEE, Jie Han , Senior Member, IEEE,
and Fabrizio Lombardi , Fellow, IEEE

Abstract— Convolutional neural networks (CNNs) have been recognition [3]. For better accuracy, CNNs require inten-
widely used in image classification and recognition due to their sive and extensive computation. For real-time processing,
effectiveness; however, CNNs use a large volume of weight data CNNs are usually accelerated by parallel processors such
that is difficult to store in on-chip memory of embedded designs.
Pruning can compress the CNN model at a small accuracy loss; as graphic processing units (GPUs) [4]. Although GPUs
however, a pruned CNN model operates slower when imple- accelerate computation, a substantial increase of power limits
mented on a parallel architecture. In this paper, a hardware- its application to embedded systems. For low power and
oriented CNN compression strategy is proposed; a deep neural high performance digital systems, field programmable gate
network (DNN) model is divided into “no-pruning layers (N P- arrays (FPGA) [5]–[7] and application specific integrated cir-
layers)” and “pruning layers ( P-layers)”. A N P-layer has a
regular weights distribution for parallel computing and high cuit (ASIC) [8]–[10] have been used for CNN accelerators in
performance. A P-layer is irregular due to pruning, but it recent years.
generates a high compression ratio. Uniform and incremental However, the on-chip memory resources in current FPGAs
quantization schemes are used to achieve a tradeoff between com- are not sufficient to completely store a large-scale CNN model.
pression ratio and processing efficiency at a small loss in accuracy. Therefore, off-chip memory is generally used in an FPGA
A distributed convolutional architecture with several parallel
finite impulse response (FIR) filters is further proposed for implementation of CNNs; this causes a limitation in terms of
the regular model in the N P-layers. A shift-accumulator based bandwidth and speed. Therefore, model compression methods
processing element with an activation-driven data flow (ADF) is have been studied quite extensively. Among them, network
proposed for the irregular sparse model in the P-layers. Based pruning [11] is one of the most widely applied compression
on the proposed compression strategy and hardware architec- methods [12]–[14]. As a cost of improving compression ratio,
ture, a hardware/algorithm co-optimization (HACO) approach is
proposed for implementing a N P − P hybrid compressed CNN the irregularity caused by pruning affects the performance
model on FPGAs. For a hardware accelerator on a single FPGA of parallel computing. A compressed sparse model not only
chip without the use of off-chip memory, a 27.5× compression requires decoding but it also causes an imbalanced weights
ratio is achieved with 0.44% top-5 accuracy loss for VGG-16. load and a difficulty in activations reading. In [15], it has
The implementation of the compressed VGG-16 model on a Xilinx been shown that processing a sparse layer takes dozens of
VCU118 evaluation board processes 83.0 frames per second (FPS)
for image applications, this is 1.8× superior than the state-of- milliseconds and requires a large memory utilization.
the-art design found in the technical literature. To achieve a better tradeoff between model size and per-
formance of large CNNs, hardware-oriented compression and
Index Terms— Convolutional neural network (CNN), field pro-
grammable gate array (FPGA), network compression, hardware hybrid quantization strategies are proposed in this paper by
acceleration. requiring a smaller memory. By considering the processing
feature of a CNN, the size of feature maps is reduced, but the
model size expands as the layers deepen. The reduced feature
I. I NTRODUCTION maps require less computation, and the expanded models
require a larger memory. As per the above characteristic,
C ONVOLUTIONAL neural networks (CNNs) have been
extensively used for image classification [1], [2] and all layers are divided into two categories: “no-pruning layers
(N P-layers)” and “pruning layers (P-layers)”. With a regular
Manuscript received April 9, 2020; revised September 7, 2020; accepted weights distribution, a N P-layer utilizes parallel computing
October 7, 2020. Date of publication October 21, 2020; date of current version for high performance. A P-layer is irregular due to the pruning
December 21, 2020. This work was supported in part by the National Natural
Science Foundation of China under Grant 62022041 and Grant 61871216, but it has a high compression ratio.
and in part by the Six Talent Peaks Project in Jiangsu Province under Grant Leveraging the proposed compression strategy, the VGG-16
2018XYDXX-009. This article was recommended by Associate Editor P. Li. [1], one of the most useful CNN models for image classifi-
(Corresponding author: Weiqiang Liu.)
Tian Yuan and Weiqiang Liu are with the College of Electronics and cation, is implemented on a Xilinx VCU118 evaluation board
Information Engineering, Nanjing University of Aeronautics and Astronautics, without off-chip memory such as DRAM to store weight data.
Nanjing 211106, China (e-mail: [email protected]). The proposed CNN accelerators achieve high performance
Jie Han is with the Department of Electrical and Computer Engineering,
University of Alberta, Edmonton, AB T6G 1H9, Canada. because only on-chip memory in an FPGA is used. Based on
Fabrizio Lombardi is with the Department of Electrical and Computer the hardware-oriented compression-based architecture, a hard-
Engineering, Northeastern University, Boston, MA 02115 USA. ware/algorithm co-optimization scheme (HACO) is proposed
Color versions of one or more of the figures in this article are available
online at https://ieeexplore.ieee.org. for implementation of the CNNs. To the best of the author’s
Digital Object Identifier 10.1109/TCSI.2020.3030663 knowledge, this is the first work that implements VGG-16 on
1549-8328 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
YUAN et al.: HIGH PERFORMANCE CNN ACCELERATORS BASED ON HARDWARE AND ALGORITHM CO-OPTIMIZATION 251

a single FPGA chip without the use of off-chip memory. The and max pooling are the two typical pooling operations that
main contributions of this work are summarized as follows: are commonly used in CNNs. Average pooling computes the
• A hardware-oriented compression method and a average value of the local field while max pooling selects the
uniformly-incremental hybrid quantization strategy are largest value of the local field as the output. Max pooling is
proposed. used in this work due to its high efficiency.
• The proposed compression strategy has been applied in FC layers are the last few layers in a CNN. All input neurons
the VGG-16 model and achieves a 27.5× compression are fully connected to every neuron in the next layer through
ratio with a 2.04% top-1 accuracy loss and a 0.44% top-5 weights. Therefore, FC layers have many weights to be stored.
accuracy loss compared to the single-precision floating- O denotes the number of multiply-accumulate (MAC) oper-
point VGG-16 model using the ISVRC2012 test data set. ations required in each layer (including both O F C and OConv ):
• A distributed convolutional architecture is proposed for
O F C = Uin × Uout
FPGAs with a fast pipeline data path of CNNs.
• A shift-accumulator based processing element and an
OConv = (W × H × N) × (K × K × M) (2)
efficient activation-driven data flow are proposed for a where Uin and Uout denote the number of input and output
sparse model. neurons, respectively.
• A hardware/algorithm co-optimization approach is pro-
posed for high performance implementations on FPGAs.
B. Related Works
• The proposed VGG-16 network is implemented on the
Xilinx VCU118 evaluation platform achieving 30.3∼83.0 A tiling technique [5], [16] has widely been used to
frames per second (FPS) with a compression ratio of address insufficient memory in embedded architectures. For
34.5∼27.5×, so attaining the highest performance com- image compression, the adaptive joint photographic experts
pared with the state-of-the-art designs. group (JPEG) method [17] has been proposed to dynamically
This paper is organized as follows. Section II provides the adjust the compression ratio for the desired quality. For speed-
background of CNNs. The proposed compression strategy is ing up convolution, a fast finite impulse response (FIR) algo-
presented in Section III. Section IV proposes a distributed rithm (FFA) [6] has been proposed. The Winograd algorithm
FIR based hardware architecture for the N P-layers and a has been studied for sparse networks [18]. To achieve a high
shift-accumulator based processing element for the P-layers. throughput, [7] has presented a design method to fully exploit
Section V evaluates the computation time and the hard- the limited resources in FPGAs. Approximate computing has
ware requirement. Section VI proposes a hardware/algorithm widely been studied in recent years; its objectives are to
co-optimization method. Section VII provides the experimen- achieve low energy and high performance at an acceptable
tal results and analysis. Comparison with the state-of-the-art accuracy loss [19]. Neural networks require a significant
designs is also provided in this section. Section VIII concludes large-scale computation and have high error resilience, so suit-
this paper. able for approximate computing. For example, approximate
multipliers [20] can be used in CNNs with a very small loss
of accuracy [21]; however, this method has only been applied
II. BACKGROUND
to small-scale neural networks.
A. CNN Basics Deep compression has been proposed in [12] to reduce the
CNNs extract the features of images and process these model size of CNNs. Using network pruning [11], weight
feature maps to classify images by finding and using the quantization and Huffman coding, a high compression ratio
weights in each layer. In a deep learning algorithm, the CNN has been achieved. A so-called dynamic network surgery
weights are found through training. A typical CNN has three has been used to accelerate training and avoid unacceptable
types of layers: convolutional (Conv) layers, pooling layers pruning [13]. Binary neural networks (BNNs) have been
and fully connected (FC) layers. proposed to train deep neural networks (DNNs) with weights
Convolutional layers are used to extract image features. and activations constrained to +1 or −1 [22]. At a reduced
A convolutional layer receives a feature map X W,H,M and complexity and a small number of weights, BNNs achieve
generates a feature map of YW,H,N by the filter H K ,K ,M,N , a high performance for embedded systems [23], [24]. Acti-
which is calculated by: vations for very low bit-width has also been proposed, thus
saving memory and accelerating training.

M 
K 
K
Incremental Network Quantization (INQ) [25] has been pro-
Y (i, j, n) = H ( p, q, m, n)X (i ×s + p, j ×s +q, m)
m=1 q=1 p=1
posed for efficient CNN models with low precision weights.
(1) INQ divides weights into several groups, and incrementally
quantifies each set of data. At each time of quantization
where W and H indicate the width and the height of the process, a network is retrained and weights that have not
feature map; K indicates the size of convolution kernel; M been quantified are updated to compensate for the accuracy
and N indicate the input channels and output channels; s is loss. Using the INQ algorithm, weights in a network can be
the stride of filter. quantified as ±2n with only a small accuracy loss, where n
Pooling layers compute a local field of feature map to is an integer. The MAC is the main arithmetic unit in CNNs.
output a pixel, so reducing the size of feature maps. Average With the quantized data form, multiplication can be replaced

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
252 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 1, JANUARY 2021

Fig. 1. Overview of the hardware-oriented compression strategy.

by shift operations in a MAC; therefore, INQ has a high speed layers and FC layers. The overview of the proposed strategy
performance on hardware. is shown in Fig. 1.
The significant difference between INQ and the proposed
compression strategy is that the proposed strategy targets A. Hybrid Quantization Strategy
sparse neural network models, whereas the original INQ
For P-layers, INQ is applied to achieve a higher compres-
causes an unacceptable accuracy loss for sparse models. Due to
the use of non-pruning layers (NP-layers) defined in this work, sion ratio. For a layer l, the weights are stored in an array Wl .
The network sparsity is determined by Tl as a binary array
the proposed method compensates the accuracy loss from the
with the same size of Wl . Tl is calculated by the following
pruning-layers (P-layers) in sparse networks. Furthermore, the
equation:
implementation of NP-layers can increase parallel processing 
performance due to its regular structure. 0 Wl (i ) = 0
Tl (i ) = (3)
1 Wl (i ) = 0
III. T HE P ROPOSED C OMPRESSION S TRATEGY A 0 and 1 in Tl indicates that the corresponding weight is zero
or non-zero, respectively. At each time of quantization, 1s in
We propose a hardware-oriented compression strategy for
the array Tl are randomly divided into two groups: A1 and
large-scale CNNs to store weights on a chip. The front layers
A2 . The data in A2 will be modified to zero which indicates
incur a significant level of computation but with a smaller
that the corresponding weight will be quantified. The weight
model size. Also, the front layers are less error-resilient: [12]
quantization in the P-layers is represented as:
has applied pruning (albeit at a smaller rate) in the front 
layers of VGG and AlexNet. [26] shows that by pruning the +2log2 |Wl (i)| Wl (i ) > 0
Wl (i ) = (4)
filters in the front layers, the accuracy significantly decreases. −2log2 |Wl (i)| Wl (i ) ≤ 0
In addition, an irregular sparse model affects the performance
of parallel computation. Hence, pruning the front layers is not By using quantization, the original multiplication in a MAC is
efficient. The principle of the proposed strategy is to apply replaced by a shift operation. Therefore, only the sign bit and
different compression strategies to different layers, so making exponent need to be stored for the direction and the number
the network to be front-regular but back-irregular. of bits to shift, respectively. After quantization, the model is
All layers are divided into two types: N P-layers and retrained to compensate for the error. Weights are updated as
P-layers. In the beginning, all layers are pruned. Then per the following equation:
INQ is used for quantifying the P-layers. During the incre- ∂E
mental quantization, the error introduced by pruning and Wl (i ) ← Wl (i ) − η Tl (i ) (5)
∂ Wl (i )
quantization in the P-layers is compensated by the weight
where η is the learning rate that depends on the learning
update of N P-layers, in which the N P-layers are no longer
policy; E is the objective function. Tl (i ) denotes the mask
sparse. Therefore, the proposed compressed model has more
of weights, which determines whether the weight must be
error resilience, hence making the model easier to converge.
updated. For zeros entries in Tl , the corresponding weight
When the quantization in the P-layers is completed, the
will not be updated because it is already zero or has been
N P-layers are quantified as fixed-point numbers so ready for
quantified. For the N P-layer l, the weights are quantified
computation.
after the quantization of the P-layers. The following uniform
In general, the N P-layers have a regular structure for
quantization method is applied:
higher parallel performance, while the P-layers are highly
compressed for higher compression ratio. The N P-layers are Wl (i )
Wl (i ) = r ound( )Q (Q = 2−n ) (6)
generally Conv layers, but the P-layers can include Conv Q

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
YUAN et al.: HIGH PERFORMANCE CNN ACCELERATORS BASED ON HARDWARE AND ALGORITHM CO-OPTIMIZATION 253

where Q indicates the quantization factor and n is the fraction TABLE I


bit. For the case that most of weights are less than 1, the VGG-16 C OMPRESSION R ESULT
following quantization method is used:


⎪ 1− Q Wl (i ) ≥ 1

Wl (i ) = −1 Wl (i ) < −1 (7)

⎪ Wl (i )
⎩r ound( )Q Wl (i ) ∈ [−1, 1)
Q
in Table I. Compared to the single-precision floating-point
VGG-16 model, we achieve a 27.5× compression ratio with
Algorithm 1 Hybrid Sparse Network Quantization Strategy 2.04% top-1 accuracy loss and 0.44% top-5 accuracy loss.
The standard VGG-16 model used in this work is down-
Input: P-layers and N P-layers
loaded from Caffe Model Zoo1 (as widely used). We tested this
Output: Quantized network model Q M
model on the ImageNet dataset 2012 through the framework
1: for all layer s ∈ P do
Caffe; and obtained the same accuracy as in [29].
2: Compute Tl through Eq. (3)
3: end for
IV. T HE P ROPOSED H ARDWARE A RCHITECTURE AND
4: Set grouping ratio r
I MPROVED DATA F LOW
5: for all layer s ∈ P do
6: Randomly divide index of 1 from Tl into two parts: A 1 , Most state-of-the-art CNN designs on FPGAs use off-chip
A2 by grouping ratio r memory to store weights and feature maps. A key advantage
7: Set Tl (i ) (i ∈ A 2 ) to 0 of the proposed compression strategy is to implement VGG-16
8: Quantify Wl (i ) (i ∈ A 2 ) through Eq. (4) on the latest Xilinx UltraScale FPGA using on-chip memory
9: end for only.
10: Retrain and update weights by Eq. (5) Based on the proposed compression strategy, a new hard-
11: Goto 4 until all weights in P-layers have been quantified ware architecture for FPGA implementation is developed with
12: for all layer s ∈ N P do a fast pipeline dataflow of the CNNs without off-chip memory.
13: Quantify Wl through Eq. (6) or Eq. (7) As mentioned in Section III, the N P-layers are in front of the
14: end for P-layers. Therefore, the compressed model can be processed
15: return Q M = P ∪ N P in a pipelined manner.
Two hardware architectures are proposed to process these
In the proposed compression strategy, weights in the N P- two types of layers. Due to the regular convolution model in
layers are given by fixed-point numbers, so processed without N P-layers, an FIR based convolution processing element and
decoding and faster compared to the compressed weights in an improved data flow are proposed for the N P-layers. For the
the P-layers. The weights in the P-layers are discrete in irregular sparse model in P-layers, a parallel shift-accumulator
±2n , so requiring less memory due to the lower bit width based processing element is proposed to reduce the redundant
and smaller quantity. Moreover, the multiplication is replaced computation in the parallel processing of the sparse models.
by shift operations in the P-layers, requiring less resources.
Overall, the compressed CNN models have a high performance A. Conv Processing Element
on hardware implementation. The hybrid quantization strategy A 1-D convolution is processed by a FIR filter, and multiple
is detailed in Algorithm 1. parallel FIR filters (the number of FIR filters is denoted as
F) compute the 2-D convolution. The FIR filter is efficient
B. Compression Experiments and Results in convolution processing due to its stringent requirement on
bandwidth. Considering a 3 × 3 convolution, three inputs are
We implemented the proposed compression strategy on the required for an output in the FIR filters, while nine inputs are
Caffe [27] platform to compress the VGG-16 [1] model; required in traditional methods.
this has been trained and tested with ILSVRC2012 data set Reference [6] has proposed a parallel fast FIR algo-
[28]. By using the proposed strategy, 96%, 96% and 77% of rithm (FFA) for CNNs, which can save multiplications in
the weights are pruned in three FC layers. For quantization, convolution. Compared with [6], the proposed FIR filter design
weights bit-widths are set to 8-bit and 4-bit for N P-layers uses cut-set retiming for the convolution kernel; this approach
and P-layers. Moreover, 5 bits are set to store the index for requires less resources and can be used to process larger
a weight in the pruned layers, which indicates the number convolution kernels very efficiently. So, it is possible to highly
of zeros between two adjacent no-zero weights. Most of the parallelize processing in higher levels with less resources.
weights in the Conv layer are less than 1; therefore, we employ As shown in Fig. 2, in our design, a 3-tap FIR filter is retimed
Eq. (7) for the N P-layers. Overall, weights are incrementally to improve the frequency, and 3 parallel retimed FIR filters
quantified from 50%, to 75%, to 87.25%, and to 100%. constitute a 3 × 3 convolution processing element (Conv-PE)
In addition, all activations bit-widths are set to 16-bits. In these to compute the 2-D convolution.
experiments, a 20.1MB compressed model is utilized with
a 33.94% top-1 error and a 12.44% top-5 error, as shown 1 https://github.com/BVLC/caffe/wiki/Model-Zoo

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
254 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 1, JANUARY 2021

Fig. 2. Convolution processing element (Conv-PE): (a) a cut-set retimed


3-tap FIR; (b) a cut-set retimed 3 × 3 convolution processing element. Fig. 4. Feature map integration for 3 × 3 kernel and the data flow in Conv
layers.

takes W + C f ir cycles, where C f ir denotes the latency


of a FIR. To acquire a W × H matrix, the process costs
(W + C f ir ) × H cycles. To eliminate the redundancy caused
by the FIR latency C f ir , neighbor rows of the feature map
are connected by filling K zeros. By applying the so-called
integrated rows of the feature map, the W × H matrix becomes
three (W + K − 2) × H arrays that requires (W + K − 2) ×
H + C f ir cycles, where K is the width of the Conv kernel.
As shown in Fig. 4, Row 1 to Row H are integrated as the
second input array of the Conv-PE. In particular, the first
and last arrays must be padded with zeros before processing.
Fig. 3. Complex convolution processing unit (CCPU).
When W is small and C f ir is large, such as in the last
several layers of VGG-16, performance can be significantly
A complex convolution processing unit (CCPU) consists improved.
of several Conv-PEs; the number of Conv-PEs in a CCPU It is worth mentioning that for convolution kernel of differ-
is denoted by Mnp . As Fig. 3 shows, a CCPU computes ent sizes, zero padding is also different. The processing speed
Mnp channels of a feature map in parallel. The Mnp data has been increased with the improved data flow by slightly
output from the Conv-PEs is also accumulated. To reduce the requiring more memory.
delay, the Mnp data are divided into Mnp /a groups, where a
denotes the number of data that is added in a group. There-
fore, the entire addition is divided into loga (Mnp ) levels. C. F×F Ping-Pong Buffer for Conv-PEs
Additionally, when Mnp is less than the input channel M,
the accumulated data will be temporarily stored in the CCPU F parallel FIR filters require an F times sampling rate of
buffer; it will be sent to the accumulator in the next iterations. the non-parallel version. As the on-chip storage is limited,
After M/Mnp iterations, the CCPU outputs a channel of the an F × F ping-pong buffer (FPPB) is proposed.
feature map. A F parallel FIR requires F lines of input data in the
Pooling layers may occasionally follow the Conv layers. proposed hardware architecture. The data in the feature map is
When pooling is required, the data output from ReLU is stored stored in rows. Then the F data in different addresses must be
in a pooling buffer. Several rows of data need to be stored prior read to match the processing speed for parallel performance.
to computing. As shown in Figs. 5 and 6, we first integrate the F adjacent
data in the feature map. Then the integrated data (A, B, C…)
are sequentially sent to the FPPB, which consists of two F × F
B. Improved Data Flow in Conv Layers blocks, namely, BLOCK0 and BLOCK1. These two blocks
For a better performance of 2-D convolution by FIR filters, are controlled to alternatively receive and output data. Once a
an improved data flow in Conv layers is proposed; it integrates block is full, it outputs data in columns while the other block
a 2-D feature map matrix into an array. To integrate each row receives the data.
of the feature map, zeros are filled between two neighbor rows, This scheme can transform spatial relations of data in a
which are of no use after convolution. These values are reset feature map to match the parallel FIR processing, while requir-
to zero in the ReLU module. ing less memory compared with the conventional caching
A Conv-PE receives three parallel arrays and output one method [6]. Through the FPPB, the bandwidth requirement
array. To acquire an array with a length of W , the process is also relaxed with a small hardware utilization.

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
YUAN et al.: HIGH PERFORMANCE CNN ACCELERATORS BASED ON HARDWARE AND ALGORITHM CO-OPTIMIZATION 255

Fig. 7. The distributed convolution architecture.


Fig. 5. F×F ping-pong buffer for parallel FIRs.

Fig. 6. The timing diagram of the proposed FPPB.

D. Distributed Conv Architecture Fig. 8. Feature map storage format in FMRs.

With no off-chip memory, a distributed Conv architecture is buffer. The map buffer receives Nnp channels of data and
proposed for high speed processing. Each Conv-PE in the same output Mnp channels of data. If the input feature map data
CCPU has a RAM to transmit the feature map. In addition, of the current layer is of no use, then they are replaced by the
different CCPUs receive weights from different RAMs; so, data from the map buffer for the next layer computation.
the speed of data transmission matches the processing speed, In the proposed distributed architecture, the feature map
hence relieving the restriction of bandwidth. storage format is determined by Mnp . As shown in Fig. 8,
As shown in Fig. 7, there are Mnp feature map the first Mnp feature maps are set as a group, stored in the
RAMs (FMRs) to store the feature map, and each one cor- bottom of FMRs. The next groups are subsequently stored
responds to a FPPB. The Mnp data in the feature map is sent in a stack fashion. Due to the different size of feature maps
to each Conv-PE after the conversion of FPPB, so the Mnp between layers, the indexes of the next groups are hard to be
input channels are processed at the same time. determined. To ensure the architecture work correctly, Mnp
In the Conv architecture, each CCPU processes a 3-D should be a multiple of Nnp . Dual port RAMs are used in this
feature map to output a 2-D feature map, as briefly discussed implementation and there is no read and write conflict.
in Section IV-A. Nnp denotes the number of CCPUs that
process the convolution in parallel. Similarly, Nnp weight
RAMs (WRs) store the weights in a distributed fashion, such E. Shift-Accumulator Based Processing Architecture for
P-Layers
that each WR corresponds to a CCPU. All CCPUs receive
the same data of the input feature map. Due to the different Based on the proposed compression strategy, the P-layers
convolution kernels, this implies that Nnp channels are output make a highly compressed sparse model with ±2n weight
at the same time. types. Multiplications are replaced by shift operations to
The entire convolution architecture processes Mnp channels reduce resources. To accelerate the processing of the P-layers,
of input feature map in parallel, and outputs Nnp channels of there are two levels in this parallel computing scheme:
the output feature map. • Unrolling the multiple shift-accumulations.
Nnp is generally smaller than the number of output channels • Unrolling the different weight kernels (Conv kernel
N; therefore, a buffer is required for temporarily storing an in Conv layers or all weights connected to a out-
intermediate feature map. Using the data integration method put neuron in FC layers). Due to the irregular sparse
of Section IV-C, the serial data that is output by a CCPU, model, computations of each weight kernels are different,
is converted into F parallel streams and stored in the map so decreasing the utilization of the hardware units.

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
256 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 1, JANUARY 2021

Fig. 10. The transformed Conv-PE for general CNN computation.

The architecture of SAC-PU has high fanouts when N p is


large. To reduce the influence of a large net delay, the output
registers of the activations and decoder can be copied for
sharing the loads.

F. Generalization of the Proposed Architecture


Fig. 9. The architecture of SAC-PU and activations-driven data flow (ADF). The proposed F × F Conv-PE also computes smaller Conv
kernels, as determined by the input weights. For example, 4
input non-zero weights and 5 input zeros correspond to a 2×2
Fig. 9 shows the M p × N p shift-accumulator based process- Conv kernel. The 3 × 3 Conv kernel is widely used in CNNs,
ing unit (SAC-PU). A shift-accumulator consists of M p so with good efficiency when the size of the Conv-PE is 3 ×3.
shifters while a SAC-PU consists of N p shift-accumulators. Some CNNs use larger Conv kernels in some layers, such
M p shift-accumulations are executed in a shift-accumulator; as 7 ×7 in the first layer of ResNet [2], but this is well beyond
the N p shift-accumulators process different sparse weight the capability of a Conv-PE; however, it is not always effective
kernels. to increase the size of Conv-PE due to rare use of large Conv
To fully exploit the sparsity of the model, computations in kernels; therefore, a general Conv-PE is proposed for CNNs
the shift-accumulator are selected by the non-zero weights. at a high efficiency.
Thus, activations must be read, and they require specific Consider a 3 × 3 Conv-PE and three 3-tap FIR filters
indexes that depend on weights. As the sparse weight ker- connected in series, as shown in Fig. 10, MUXs are inserted
nels must be considered together, different indexes cause in each connection to select the functions of the Conv-PE.
issues when reading activations. So, an activations-driven data • Parallel Mode: When Conv kernels are smaller than or
flow (ADF) is proposed for the P-layers to reduce redundant equal to 3 × 3, the FIR filters receive multiple data from
computations and improve the read speed of activations. When the feature map to compute at the same time multiple
an activation is not needed in a sparse weight kernel, it may rows. A Conv-PE computes 2-D convolution. However,
still be needed in other weight kernels; therefore, an activation when the size of Conv kernel is 1 × 1, this Conv-PE is
is processed at every shift-accumulator to assess whether only 11.1% as powerful as the 3×3 scheme; therefore, the
needed. Activations are processed in an active mode, so this output bit-width must be expanded 3 times for 3 output
is referred to as ADF. data when Conv kernel is 1 × 1.
The efficiency of ADF is determined by N p and the sparsity • Serial Mode: When the Conv kernels are larger than
of P-layers. Eq. (8) is used to evaluate the efficiency, where 3 × 3, each FIR filter receives the data from the previous
Rkp denotes the pruning rate of each weight kernel. When filter to configure as a larger FIR filter. Each Conv-PE
E ad f is larger than 1, ADF is faster than traditional methods. loads one row of weights for 1-D convolution. If the
A large N p value implies that the activation is likely to be kernel size is smaller than 9 × 9, zeros will be filled
accepted by other weight kernels, so increasing the efficiency for the FIR filter. Thus, there may be some inefficiency
of ADF. However, N p cannot take a very large value because when the Conv-PE operates in serial mode; however as
it would require a substantial increase hardware; therefore, M p large Conv kernels are a small part in most networks, the
can be reduced for a larger N p . performance will not be significantly affected. Multiple
executions of the 1-D convolution are required for a 2-D
Acti vati ons × N p (1 − Rkp ) convolution. For example, if the size of the kernel is
E ad f = (8) 7 × 7, the feature maps should be input 7 times for a 2-D
Acti vati ons
convolution. However, in some cases, the multiple 1-D
The pruning rate of each weight kernel’s is difficult to find convolutions can be unrolled; for example, if Mnp = 32
analytically; therefore, the average of the pruning rate in a and M = 3, each seven Conv-PEs can load 49 weights
layer can be used to evaluate the efficiency, that can be used for the 7 × 7 convolution. The 7 outputs from Conv-PEs
to determine N p . are added in the accumulator. These seven Conv-PEs

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
YUAN et al.: HIGH PERFORMANCE CNN ACCELERATORS BASED ON HARDWARE AND ALGORITHM CO-OPTIMIZATION 257

compute a 2-D convolution; also, the 3 input channels Computation hardware is determined by Mnp and Nnp ,
should be expanded to 21 for parallel computing. which mostly originate from the Conv-PEs and the accumula-
The proposed general Conv-PE can switch modes between tor. The number of multipliers and adders are given as follows:
1-D and 2-D convolution with a processing capability for a
Mulnp = 9Mnp × Nnp
Conv kernel smaller than 9 × 9. The serial FIR filters cause a
log2 (Mnp )  
significant delay, that must be retimed for high performance. 
Addnp = (8Mnp + Mnp /2i ) × Nnp (13)
i=1
V. E VALUATION
In Eq. (13), all adders are converted into a two-input form for
Under the N P-P hybrid model, the performance of each evaluation.
module is analyzed. In this section, computation time and
hardware utilization are quantitatively evaluated for the pro-
posed optimization algorithm of Section VI. B. P-Layers Computation Time Evaluation
The computation time in the P-layers can be divided into
three parts: weights decoding, activation reading and SAC
A. Time Analysis of N P-Layer computing. The compressed weights are decoded into weights
For processing the Conv layers using the proposed CCPUs, and indexes, stored in caches. Then the activations are read
a 2-D convolution takes (W + 1) × H + C f ir clock cycles, by ADF. After all data is cached, the SAC-PU processes such
so all data of the feature map is read once. Consider the input data. Each part operates based on the previous parts. Therefore,
and output channels, then the clock cycles of all N P-layers the time consumption in each part must be accumulated.
can be computed as follows: For weights decoding, the required number of clock cycles
depends on the number of weights after pruning:

NP  
M N
Cnp = (W + 1) × H + C f ir × × (9) 
P
Mnp Nnp Cw = Q l × (1 − Rlp ) (14)
If Conv-PEs work in a serial mode, it will take an additional where Rlp denotes the pruning rate of a layer.
time given by a number of clock cycles with a factor of The P-layers may include Conv layers and FC layers,
K H compared to the parallel mode. Consider the unrolling so two cases of clock cycle utilization must be differentiated:
of convolution, then Eq. (9) is computed as follows:
P
Conv 
  N

NP
M×K N Ca = max (L) × W × H ×
Cnp = (W + 1) × H + C f ir × × Np
Mnp Nnp 
PFC 
Uout
(10) + max (L) × (15)
Np
As weights are loaded in the buffer prior to the next
where L is the length of the activations (equal to the last index
convolution, no additional time overhead is encountered. Due
value of the weight kernels).
to the map buffer, the feature map can be loaded in FMRs
Similarly, the number of clock cycles required for SAC
when the map data in the last iteration is of no use. Therefore,
computing is given by:
the total time is given by:
P  
Conv
max (Q k (1 − Rkp )) N
Tnp = Cnp (Mnp , Nnp ) × tnp (11) Cs = ×W ×H ×
Mp Np
For storage, the largest feature map and the number of  max (Q k (1 − Rkp ))
PFC  
Uout
weights determine the size of the used memory. As per reuse, + × (16)
Mp Np
Nnp affects the size of the map buffer. Additionally, each
CCPU requires a buffer to store at least a feature map for where Q k is the weight number and Rkp is the pruning rate
accumulation. Therefore, the entire memory is given by: of a weight kernel, different from Eq. (14).
As per the above equations, the total time execution of the
2N − Nnp
M Snp = × max(W × H × M) × ba P-layers is given as follows:
N NP

NP T p = Cw (Rlp )tw +Ca (L, N p )ta +Cs (Rkp , M p , N p )ts (17)
+ Nnp max(W × H ) × ba + Q l × bw (12)
NP
A pipelined design can be achieved between reading activa-
where ba and bw denote the activation bit-width and weight tions and the SAC computation for the processing time to be
bit-width; Q l denotes the weight number of a layer. In general, overlapped. Therefore, the time execution is given by:
the largest feature map can be found in the first several layers; 
thus the layer dividing method does not affect the size of the Cw (Rlp )tw +Ca (L, N p )ta Ca ta ≥ Cs ts
Tp = (18)
memory in the N P-layers. Cw (Rlp )tw +Cs (Rkp , M p , N p )ts Ca ta < Cs ts

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
258 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 1, JANUARY 2021

Fig. 11. Overview of hardware/algorithm co-optimization (HACO).

Storage in the P-layers is function of the largest feature


map, the number of weights and the size of the weight cache.
The memory size is given by:

P
M S p = 2 max(W × H × M) × ba + Q l × bw
P
Np

+ N p × max (Q k ) × (bw + bi ) (19)
where bi denotes the index bit-width.
Computational resources are determined by the number of
shifters and accumulators, which is the same as the N P-layers:
Fig. 12. Time consumption of the pipelined hybrid system.
Sht p = M p × N p
log2 (M p )   Tnp and T p are affected by multiple factors; among them, the

Add p = M p /2 i
× Np (20) frequency is the most difficult to be evaluated quantitatively.
i=1
However, it can be assessed after synthesis. By utilizing more
hardware, frequency could decrease, so a loss in performance,
In Eq. (20), all adders are also converted into a two-input form which is often unavoidable. Therefore, only parameters such
for evaluation. as Mnp , Nnp , L, Rlp , Rkp , M p , N p and layer divisions need to
be considered.
VI. H ARDWARE /A LGORITHM C O -O PTIMIZATION
In this section, HACO is proposed to find the best design B. Goal Simplification
by dividing the layers into N P-layers and P-layers and For efficiency of the hardware, the considered parameters
by proper sizing the CCPUs and SAC-PU, hence reducing are analyzed under a few sampling assumptions. As mentioned
the computation time while retaining a high efficiency. The in Section IV-D, Mnp is a multiple of Nnp . Therefore, Tnp is
overview of HACO is shown in Fig. 11. denoted as Tnp (k, Nnp ), where k is a positive integer. However,
to fully reuse the memory, Nnp must be increased as much as
A. Optimization Objective possible; due to hardware limitation, it can be achieved when
For pipeline processing, N P-layers and P-layers are k is equal to 1.
processed by CCPUs and SAC-PU, respectively. As Fig. 12 L and Rkp are related to each weight kernel (that must be
shows, the N P-layers are processed by the CCPUs first. After considered when implementing pruning). To ensure correct-
the computation of the N P-layers has been completed, the ness, L is given by the largest amount such that all activations
feature maps as output of the CCPUs are sent to the SAC-PU. are read once per shift-accumulator. Therefore, there is a
Meanwhile, the CCPUs execute the next computation. Thus, difference in values between the actual Tnp and T p .
the entire computation time is determined by the slower step Rkp is difficult to determined prior to training due to the
in the entire process. irregular pruning. Provided that SAC computing is faster than
Consider the limited resources on FPGA, then this problem the process of activation reading, the proposed architecture
can be formulated as: operates correctly. In the case that the same frequencies are
used for them, M p should be less than or equal to the number
mi n Tnp (Mnp , Nnp ) of activations read out in a clock cycle. Due to the above
s.t. Tnp (Mnp , Nnp ) ≥ T P (L, Rlp , Rkp , M p , N p ) simplification, the process is now given by:
Resour ces U tili zati on(Mnp , Nnp , M p , N p ) ≤ Vr mi n Tnp (Nnp )
Accur acy(L, Rlp , Rkp ) ≥ Ar (21) s.t. Tnp (Nnp ) − T p (Rlp , M p , N p ) ≥ Margi n
where Ar and Vr are constant, that are determined by user Resour ces U tili zati on(k, Nnp , M p , N p ) ≤ Vr
requirements. Accur acy(Rlp ) ≥ Ar (22)

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
YUAN et al.: HIGH PERFORMANCE CNN ACCELERATORS BASED ON HARDWARE AND ALGORITHM CO-OPTIMIZATION 259

The actual accuracy is hard to determine unless training and determine Nnp , it can be evaluated by:
an implementation are pursued. However, the least value of the NP
[(W +1) × H +C] × NMnp × NNnp
accuracy can be acquired before retraining. For example, the E Conv =     (24)
requirement of accuracy is given by Ar . A network (all layers NP
[(W +1) × H +C] × NMnp × NNnp
are sparse), whose accuracy is A and A ≥ Ar , can be selected
as a baseline sparse network. Assume the actual accuracy is
A∗ , and A∗ is always larger than A because some layers are Algorithm 2 Hardware/Algorithm Co-Optimization Algorithm
N P-layers. Therefore, it can be ensured that A∗ is larger than
Input: spliter pointer, R A, prior Rlp
Ar ( A∗ ≥ A ≥ Ar ).
Output: Nnp , M p , N p and new Rlp
1: Allocate available hardware to CCPUs and SAC-PU by
R A.
C. Hardware/Algorithm Co-Optimization
2: for all M p , N p and Nnp do
The utilization of CCPUs and the number of N P-layers are 3: Find the largest value under the restriction of each
two important parameters that determine the performance, but available hardware.
it is difficult to consider them together. Therefore, the proposed 4: end for
approach considers them separately. 5: for all layers do
A spliter pointer is used to determine whether the layers 6: Compute the required memory and find the minimal value
belong to either the N P-layers or the P-layers. At the begin- (the furthest spliter pointer can move).
ning, the pointer must be initialized. As the Conv layers incur 7: end for
in significantly more computation than the FC layers, and the 8: Compute TN P , T P .
number of weights is large in the FC layers, the spliter initially 9: if TN P − T P ≥ Margi n then
indicates that N P-layers are Conv layers and FC layers are 10: while TN P − T P ≥ Margi n and the pointer is under the
P-layers. valid range do
Then available hardware is allocated for CCPUs by a R A. 11: Move the spliter pointer to the front layer.
R A is defined as follows: 12: Compute TN P , T P .
13: end while
Resour cesnp
RA = (23) 14: Retrain the model through the spliter, acquiring new Rlp .
Resour ces p 15: Compute TN P , T P .
16: if TN P − T P < Margi n then
In FPGA, the Resour ces can be evaluated by the number of
17: repeat
DSPs and LUTs, and the number of DSPs should be converted
18: Reduce Nnp and enlarge M p , N p .
into the equivalent number of LUTs.
19: until Minimal TN P − T P
A higher R A means that more N P-layers can be processed
20: else
with the restrictions of Tnp > T p , which is faster due to higher
21: repeat
efficiency in N P-layers. A smaller R A means more P-layers,
22: Enlarge Nnp and reduce M p , N p .
which have a smaller model size.
23: until Minimal TN P − T P
After initialization, the spliter pointer moves to the front
24: end if
layers with a fixed R A, so reducing Tnp . Moreover, T p is
25: else
calculated by using the prior value of Rlp which is evaluated
26: while TN P − T P < Margi n do
by the pruning rate of a full sparse model. When the pointer
27: Reduce Nnp and enlarge M p , N p .
is fixed, the model must be retrained. Due to the N P-layers,
28: Compute TN P , T P .
the final Rlp is larger than the previous value. Then, R A is
29: end while
finely tuned for the final result.
30: end if
During the pointer adjustment process, the activation mem-
31: return Nnp , M p , N p and new Rlp
ory increases, but the weight memory decreases. If the total
memory increases, the pointer returns to the back layers,
finding the least size for the memory. Therefore, the farthest
VII. E XPERIMENT AND R ESULTS
distance the spliter pointer can be adjusted, is rather limited,
so R A cannot be too small. Similarly, if Tnp −T p < Margi n at A. Experiment Set
the beginning, the pointer is not adjusted but more hardware is The proposed processing architecture is scalable with dif-
allocated to SAC-PE for satisfying this condition. This process ferent numbers of Nnp and Mnp , so affecting performance and
is described in Algorithm 2. the utilization of the DSP units. Memory depends on the size
A Larger Nnp may improve performance. However, the of the network model and the input image; so, the number of
efficiency may decline due to the ceils in Eq. (9), which means on-chip memory resources of an FPGA determines whether the
a waste in computation. The best scenario is that there are no network can be implemented on the FPGA without off-chip
remainders of NMnp and NNnp in every N P-layers. Computation DRAM.
is different between layers due to the different values for W HACO is implemented on VGG-16 with the Xilinx VCU118
and H . If the efficiency needs to be taken into account to platform. Performance varies as function of R A.The largest

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
260 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 1, JANUARY 2021

TABLE II
C OMPARISON W ITH O THER VGG-16 D ESIGNS W ITH THE S AME L EVEL A CCURACY (66.06% T OP -1)

value of R A has the best performance and the largest model activations results in less efficiency in the P-layers. Therefore,
size (i.e., 21.1MB as mentioned in Section III-B). The least only the parallel weight decoding is used in our experiment.
value of R A has a slower speed but also the smallest model
size. Initially, the N P-layers include Conv layers only, and
P-layers are FC layers. With the decrease of R A, the spliter B. Performance Analysis and Comparison
pointer moves to the front layers. When the pointer moves In HACO, the largest and least values of R A are utilized as
to layer Conv 3-3, the needed memory increases due to the shown in Table IV. For the largest value of R A, Tnp is larger
large feature map. Therefore, the layer Conv 3-3 is the farthest than T p initially. Therefore, the spliter pointer does not move,
distance that the pointer can move in VGG-16. so staying at the end of the Conv layers. In this case, Nnp = 33
For the largest model, VCU118 requires significant on-chip and N p = 51 are obtained by HACO. For the least value of
memory. URAMs and BRAMs are utilized for VCU118. Due R A, N p = 256 and Nnp = 16 are obtained, and finally the
to the large model size, the memory for the weights is given by pointer moves to the layer Conv 4-3. To achieve the highest
URAMs. To fully use the RAMs resources, 8 9-bit weights are efficiency, Nnp and N p are mapped to 2n in our experiments.
utilized. The remaining memories are allocated for activations. For the P-layers, the caches are synthesized as BRAMs and
3 parallel activations are utilized for a bit-width of 48-bit. distributed RAMs. The results are shown in Table IV.
Therefore, every 3 FMRs with a bit-width of 144-bit are syn- The FMRs and CCPUs use the same frequency (200 MHz
thesized as two groups of URAMs or BRAMs. This allocation for Nnp = 16; 150 MHz for Nnp = 32). Weights are loaded
is adjusted by balancing URAMs and BRAMs. Computation in the buffer during convolution. As weight loading is faster
in VCU118 is performed by mostly DSPs and LUTs. For than convolution computation and requires a large area and
high performance in multiplication, DSPs are allocated for longer delay in the design, the weight RAM uses a slower
N P-layers, and LUTs are allocated for P-layers for shift frequency (100 MHz in this work). All frequencies are shown
operation. The available hardware are given by 60% of the in Table V.
VCU118. As DSPs cannot be used in P-layers, a decrease Table II shows the hardware resources and performance
of R A is possible. For the highest efficiency in convolution, of previous VGG-16 FPGA designs and the proposed design
Nnp , M p and N p should be given by 2n , where n is a positive with the largest and least values of R A. The proposed design
integer. N p is limited to a range of 32∼512 due to the ADF achieves the highest FPS of 83.0 at 150MHz; this is 1.8×
efficiency and the largest number of output channels. In the faster than [7]. 4,096 DSPs are used due to a large number
P-layers, activations and weights are required to be cached of CCPUs. All other three state-of-the-art designs use off-chip
before computing but accounting for large memory. When the DDR3 DRAM so limiting the processing speed due to band-
available BRAMs are insufficient, distributed RAMs can be width; the proposed design uses only on-chip BRAMs and
used as alternatives. URAMs for storing the feature map and weights.
To improve the speed of weight decoding and activation R A is reduced at a smaller model size. As shown is
reading, data must be integrated in RAMs for parallel com- Table II, the design with the least R A has a smaller memory
puting, as the improvement in frequency is not effective. (URAM is 8 times larger than a BRAM, so the least R A
If activations are needed to be integrated, weights should be saves 728 BRAMs). Using HACO, redundancy in N P-layers
decoded in the same form. However, the weights in a sparse is removed to attain a higher efficiency; therefore, the number
model are not serial; this causes the number of useful weights of DSPs is reduced to 1,024. Compared with [6], the least R A
to be difficult to calculate. Therefore, parallel decoding cannot achieves 90% performance with only a 46% usage of DSPs.
be achieved when multiple activations are integrated. Since Compared with [7], the least R A achieves 67% performance
the feature map shrinks during pooling, the integration of with a 36% usage of DSPs and, a 20% usage of FFs.

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
YUAN et al.: HIGH PERFORMANCE CNN ACCELERATORS BASED ON HARDWARE AND ALGORITHM CO-OPTIMIZATION 261

TABLE III
C OMPARISON B ETWEEN P ROPOSED G ENERAL CCPU S AND O THER D ESIGNS

TABLE IV TABLE V
R ESULTS ON VGG-16 BY HACO W ITH L ARGEST AND L EAST R A F REQUENCIES OF N P -L AYERS AND P -L AYERS

mentioned in Section V, computation in all the Conv layers


only requires 4.0 ms for ResNet-34, 9.1 ms for ResNet-50
and 20.3ms for ResNet-152 when Nnp = 32 (largest R A).
The proposed general CCPUs are compared to other state-of-
the-art FPGA designs in Table III.

VIII. C ONCLUSION
A hardware-oriented compression strategy has been ini-
tially proposed in this paper. This strategy achieves high
performance and 27.5× compression ratio for VGG-16. It has
Although there are some designs of embedded BNNs which been shown that the proposed strategy incurs in a very small
could achieve even higher FPS, the accuracy loss is higher than accuracy loss compared to the single-precision floating-point
the designs in Table II. For example, [24] can only achieve implementation as tested on the ILSVRC2012 data set through
55.8% top-1 accuracy for VGG-16. Therefore, these designs the Caffe framework.
are not considered; only the designs with the same level of As a case study, the architecture of the proposed compressed
accuracy are compared in Table II. VGG-16 has been designed with no off-chip memory on
CCPUs are also applied to ResNet. In the first layer of the Xilinx FPGA VCU118 platform. It has been shown that
ResNet, the size of the Conv kernels is 7 × 7, therefore, the the design achieved using the proposed tool HACO achieves
CCPUs will operate in a serial mode. Seven copies of the input the highest performance of 83.0 FPS for the same level of
image are loaded into the FMRs, and each image is processed accuracy in image processing. The proposed general Conv-PE
by 1-D convolution. Then, the output channels is added to structure has a high efficiency in processing large Conv
obtain an output feature map. For ResNet, an extra RAM is kernels; therefore, the entire architecture can process a wide
utilized to store the feature map of the shortcut connection. range of CNN models with small additional hardware. The
The FMRs must store up to 224 × 224 × 32 data for input proposed hardware design can be applied to several real-time
feature maps; map buffer and shortcut buffer must store up to resource constrained image processing applications.
112 × 112 × 64 data, separately. The required total memory The proposed compression method is applicable to other
is 67% of that used by VGG-16. As per the time evaluation CNN models; new quantization and pruning methods can be

Authorized licensed use limited to: Presidency University. Downloaded on October 06,2023 at 06:26:59 UTC from IEEE Xplore. Restrictions apply.
262 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 68, NO. 1, JANUARY 2021

REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[3] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[4] S. Chetlur et al., "CuDNN: Efficient primitives for deep learning," 2014, arXiv:1410.0759. [Online]. Available: http://arxiv.org/abs/1410.0759
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[6] J. Wang, J. Lin, and Z. Wang, "Efficient hardware architectures for deep convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 1941–1953, Jun. 2018.
[7] S. Yin et al., "A high throughput acceleration for hybrid neural networks with efficient resource management on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 4, pp. 678–691, Apr. 2019.
[8] D. Han, J. Lee, J. Lee, and H.-J. Yoo, "A low-power deep neural network online learning processor for real-time object tracking application," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 5, pp. 1794–1804, May 2019.
[9] Y. Wang, J. Lin, and Z. Wang, "FPAP: A folded architecture for energy-quality scalable convolutional neural networks," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 288–301, Jan. 2019.
[10] Y.-J. Lin and T. S. Chang, "Data and hardware efficient design for convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 5, pp. 1642–1651, May 2018.
[11] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164–171.
[12] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: http://arxiv.org/abs/1510.00149
[13] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379–1387.
[14] S. Ye et al., "Progressive DNN compression: A key to achieve ultra-high weight pruning and quantization rates using ADMM," 2019, arXiv:1903.09769. [Online]. Available: http://arxiv.org/abs/1903.09769
[15] L. Lu and Y. Liang, "SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs," in Proc. 55th ACM/ESDA/IEEE Des. Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[16] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," SIGARCH Comput. Archit. News, vol. 42, no. 1, pp. 269–284, Feb. 2014, doi: 10.1145/2654822.2541967.
[17] J. H. Ko, D. Kim, T. Na, and S. Mukhopadhyay, "Design and analysis of a neural network inference engine based on adaptive weight compression," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 1, pp. 109–121, Jan. 2019.
[18] S. Li, J. Park, and P. Tak Peter Tang, "Enabling sparse Winograd convolution by native pruning," 2017, arXiv:1702.08597. [Online]. Available: http://arxiv.org/abs/1702.08597
[19] W. Liu, F. Lombardi, and M. Schulte, "A retrospective and prospective view of approximate computing [Point of View]," Proc. IEEE, vol. 108, no. 3, pp. 394–399, Mar. 2020.
[20] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, "Design of approximate Radix-4 Booth multipliers for error-tolerant computing," IEEE Trans. Comput., vol. 66, no. 8, pp. 1435–1441, Aug. 2017.
[21] Z. Liu, K. Jia, W. Liu, Q. Wei, F. Qiao, and H. Yang, "INA: Incremental network approximation algorithm for limited precision deep neural networks," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2019, pp. 1–7.
[22] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or –1," 2016, arXiv:1602.02830. [Online]. Available: http://arxiv.org/abs/1602.02830
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 525–542.
[24] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), 2018, pp. 163–1636.
[25] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," 2017, arXiv:1702.03044. [Online]. Available: http://arxiv.org/abs/1702.03044
[26] M. A. Hanif, R. Hafiz, and M. Shafique, "Error resilience analysis for systematically employing approximate computing in convolutional neural networks," in Proc. Des., Autom. Test Eur. Conf. Exhib. (DATE), 2018, pp. 913–916.
[27] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[28] J. Deng, W. Dong, R. Socher, L. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[29] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2016, pp. 26–35.
[30] Y. Guan et al., "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates," in Proc. IEEE 25th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2017, pp. 152–159.
[31] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, Jul. 2018.

Tian Yuan received the B.S. degree in information engineering from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, in 2018, where he is currently pursuing the M.S. degree in circuits and systems. His research interests include hardware architecture design for deep learning, neural network compression and quantization, and approximate computing.

Weiqiang Liu (Senior Member, IEEE) received the B.S. degree in information engineering from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, and the Ph.D. degree in electronic engineering from Queen's University Belfast (QUB), Belfast, U.K., in 2006 and 2012, respectively. In December 2013, he joined the College of Electronics and Information Engineering, NUAA, where he is currently a Professor and the Vice Dean of the College of Electronics and Information Engineering. His research interests include emerging technologies in computing systems, computer arithmetic, hardware security, and VLSI design for digital signal processing and cryptography. He has published one research book by Artech House and over 100 leading journal and conference papers. He is a Senior Member of the Chinese Institute of Electronics. One of his papers was selected as the Feature Paper of IEEE TC in the December 2017 issue. He received the prestigious Outstanding Young Scholar Award from the National Natural Science Foundation of China (NSFC) in 2020. He is the Program Co-Chair of the IEEE Symposium on Computer Arithmetic (ARITH) and a program member for a number of international conferences. He is a member of both the Circuits & Systems for Communications (CASCOM) Technical Committee and the VLSI Systems and Applications (VSA) Technical Committee, IEEE Circuits and Systems Society. He has served as a Guest Editor for the Proceedings of the IEEE, as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, the IEEE TRANSACTIONS ON COMPUTERS, the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING, and the IEEE OPEN JOURNAL OF THE COMPUTER SOCIETY, and as a Steering Committee Member of the IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS.

Jie Han (Senior Member, IEEE) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1999, and the Ph.D. degree from the Delft University of Technology, The Netherlands, in 2004. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. His research interests include approximate computing, stochastic computing, reliability and fault tolerance, nanoelectronic circuits and systems, as well as novel computational models for nanoscale and biological applications. He was a recipient of the Best Paper Award at the International Symposium on Nanoscale Architectures (NanoArch) 2015 and Best Paper Nominations at the 25th Great Lakes Symposium on VLSI (GLSVLSI) 2015, NanoArch 2016, and the 19th International Symposium on Quality Electronic Design (ISQED) 2018. He was nominated for the 2006 Christiaan Huygens Prize of Science by the Royal Dutch Academy of Science. His work was recognized by Science for developing a theory of fault-tolerant nanocircuits (2005). He has served as the General Chair for GLSVLSI 2017 and the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT) 2013, and as a Technical Program Committee Chair for GLSVLSI 2016, DFT 2012, and the Symposium on Stochastic & Approximate Computing for Signal Processing and Machine Learning, 2017. He is currently an Associate Editor of the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING (TETC), the IEEE TRANSACTIONS ON NANOTECHNOLOGY, the IEEE Circuits and Systems Magazine, the IEEE OPEN JOURNAL OF THE COMPUTER SOCIETY, and Microelectronics Reliability (Elsevier).

Fabrizio Lombardi (Fellow, IEEE) received the B.S. degree (Hons.) in electronic engineering from the University of Essex, U.K., in 1977, the master's degree in microwaves and modern optics and the Diploma degree in microwave engineering from the Microwave Research Unit, University College London, in 1978, and the Ph.D. degree from the University of London in 1982. He currently holds the International Test Conference (ITC) Endowed Chair Professorship at Northeastern University, Boston, MA, USA. His research interests are bio-inspired and nano manufacturing/computing, VLSI design, testing, and fault/defect tolerance of digital systems. He has extensively published in these areas and coauthored/edited seven books. He is currently the Vice President for Publications of the IEEE Computer Society and a member of the executive committee of the IEEE Nanotechnology Council. He was the Editor-in-Chief of the IEEE TRANSACTIONS ON COMPUTERS from 2007 to 2010, the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING from 2013 to 2017, and the IEEE TRANSACTIONS ON NANOTECHNOLOGY from 2014 to 2019.
