High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization
Abstract— Convolutional neural networks (CNNs) have been widely used in image classification and recognition due to their effectiveness; however, CNNs use a large volume of weight data that is difficult to store in the on-chip memory of embedded designs. Pruning can compress a CNN model at a small accuracy loss; however, a pruned CNN model operates slower when implemented on a parallel architecture. In this paper, a hardware-oriented CNN compression strategy is proposed: a deep neural network (DNN) model is divided into "no-pruning layers (NP-layers)" and "pruning layers (P-layers)". An NP-layer has a regular weight distribution for parallel computing and high performance. A P-layer is irregular due to pruning, but it generates a high compression ratio. Uniform and incremental quantization schemes are used to achieve a tradeoff between compression ratio and processing efficiency at a small loss in accuracy. A distributed convolutional architecture with several parallel finite impulse response (FIR) filters is further proposed for the regular model in the NP-layers. A shift-accumulator based processing element with an activation-driven data flow (ADF) is proposed for the irregular sparse model in the P-layers. Based on the proposed compression strategy and hardware architecture, a hardware/algorithm co-optimization (HACO) approach is proposed for implementing an NP-P hybrid compressed CNN model on FPGAs. For a hardware accelerator on a single FPGA chip without the use of off-chip memory, a 27.5× compression ratio is achieved with a 0.44% top-5 accuracy loss for VGG-16. The implementation of the compressed VGG-16 model on a Xilinx VCU118 evaluation board processes 83.0 frames per second (FPS) for image applications; this is 1.8× faster than the state-of-the-art design found in the technical literature.

Index Terms— Convolutional neural network (CNN), field programmable gate array (FPGA), network compression, hardware acceleration.

Manuscript received April 9, 2020; revised September 7, 2020; accepted October 7, 2020. Date of publication October 21, 2020; date of current version December 21, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 62022041 and Grant 61871216, and in part by the Six Talent Peaks Project in Jiangsu Province under Grant 2018XYDXX-009. This article was recommended by Associate Editor P. Li. (Corresponding author: Weiqiang Liu.) Tian Yuan and Weiqiang Liu are with the College of Electronics and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China (e-mail: [email protected]). Jie Han is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada. Fabrizio Lombardi is with the Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115 USA.

I. INTRODUCTION

CONVOLUTIONAL neural networks (CNNs) have been extensively used for image classification [1], [2] and recognition [3]. For better accuracy, CNNs require intensive and extensive computation. For real-time processing, CNNs are usually accelerated by parallel processors such as graphic processing units (GPUs) [4]. Although GPUs accelerate computation, a substantial increase in power limits their application to embedded systems. For low-power and high-performance digital systems, field programmable gate arrays (FPGAs) [5]–[7] and application specific integrated circuits (ASICs) [8]–[10] have been used for CNN accelerators in recent years.

However, the on-chip memory resources in current FPGAs are not sufficient to completely store a large-scale CNN model. Therefore, off-chip memory is generally used in an FPGA implementation of CNNs; this causes a limitation in terms of bandwidth and speed. Consequently, model compression methods have been studied quite extensively. Among them, network pruning [11] is one of the most widely applied compression methods [12]–[14]. As a cost of improving the compression ratio, the irregularity caused by pruning affects the performance of parallel computing. A compressed sparse model not only requires decoding, but also causes an imbalanced weight load and a difficulty in reading activations. In [15], it has been shown that processing a sparse layer takes dozens of milliseconds and requires a large memory utilization.

To achieve a better tradeoff between the model size and the performance of large CNNs, hardware-oriented compression and hybrid quantization strategies that require a smaller memory are proposed in this paper. Considering the processing features of a CNN, the size of the feature maps is reduced, but the model size expands as the layers deepen. The reduced feature maps require less computation, and the expanded models require a larger memory. As per the above characteristic, all layers are divided into two categories: "no-pruning layers (NP-layers)" and "pruning layers (P-layers)". With a regular weight distribution, an NP-layer utilizes parallel computing for high performance. A P-layer is irregular due to the pruning, but it has a high compression ratio.

Leveraging the proposed compression strategy, VGG-16 [1], one of the most widely used CNN models for image classification, is implemented on a Xilinx VCU118 evaluation board without off-chip memory such as DRAM to store the weight data. The proposed CNN accelerators achieve high performance because only the on-chip memory of an FPGA is used. Based on the hardware-oriented compression-based architecture, a hardware/algorithm co-optimization (HACO) scheme is proposed for the implementation of the CNNs. To the best of the authors' knowledge, this is the first work that implements VGG-16 on
a single FPGA chip without the use of off-chip memory. The main contributions of this work are summarized as follows:
• A hardware-oriented compression method and a uniform-incremental hybrid quantization strategy are proposed.
• The proposed compression strategy has been applied to the VGG-16 model and achieves a 27.5× compression ratio with a 2.04% top-1 accuracy loss and a 0.44% top-5 accuracy loss compared to the single-precision floating-point VGG-16 model on the ILSVRC2012 test data set.
• A distributed convolutional architecture with a fast pipelined data path is proposed for CNNs on FPGAs.
• A shift-accumulator based processing element and an efficient activation-driven data flow are proposed for the sparse model.
• A hardware/algorithm co-optimization approach is proposed for high-performance implementations on FPGAs.
• The proposed VGG-16 network is implemented on the Xilinx VCU118 evaluation platform, achieving 30.3∼83.0 frames per second (FPS) with a compression ratio of 34.5∼27.5×, so attaining the highest performance compared with state-of-the-art designs.

This paper is organized as follows. Section II provides the background of CNNs. The proposed compression strategy is presented in Section III. Section IV proposes a distributed FIR based hardware architecture for the NP-layers and a shift-accumulator based processing element for the P-layers. Section V evaluates the computation time and the hardware requirements. Section VI proposes a hardware/algorithm co-optimization method. Section VII provides the experimental results and analysis; a comparison with state-of-the-art designs is also provided in this section. Section VIII concludes this paper.

II. BACKGROUND

A. CNN Basics

CNNs extract the features of images and process these feature maps to classify images by finding and using the weights in each layer. In a deep learning algorithm, the CNN weights are found through training. A typical CNN has three types of layers: convolutional (Conv) layers, pooling layers and fully connected (FC) layers.

Convolutional layers are used to extract image features. A convolutional layer receives a feature map X_{W,H,M} and generates a feature map Y_{W,H,N} by the filter H_{K,K,M,N}, which is calculated by:

Y(i, j, n) = \sum_{m=1}^{M} \sum_{q=1}^{K} \sum_{p=1}^{K} H(p, q, m, n) X(i \times s + p, j \times s + q, m)    (1)

where W and H indicate the width and the height of the feature map; K indicates the size of the convolution kernel; M and N indicate the numbers of input and output channels; s is the stride of the filter.

Pooling layers compute a local field of the feature map to output a pixel, so reducing the size of the feature maps. Average and max pooling are the two typical pooling operations that are commonly used in CNNs. Average pooling computes the average value of the local field, while max pooling selects the largest value of the local field as the output. Max pooling is used in this work due to its high efficiency.

FC layers are the last few layers in a CNN. All input neurons are fully connected to every neuron in the next layer through weights; therefore, FC layers have many weights to be stored.

O denotes the number of multiply-accumulate (MAC) operations required in each layer (including both O_{FC} and O_{Conv}):

O_{FC} = U_{in} \times U_{out}
O_{Conv} = (W \times H \times N) \times (K \times K \times M)    (2)

where U_{in} and U_{out} denote the numbers of input and output neurons, respectively.

B. Related Works

A tiling technique [5], [16] has been widely used to address insufficient memory in embedded architectures. For image compression, the adaptive joint photographic experts group (JPEG) method [17] has been proposed to dynamically adjust the compression ratio for the desired quality. For speeding up convolution, a fast finite impulse response (FIR) algorithm (FFA) [6] has been proposed. The Winograd algorithm has been studied for sparse networks [18]. To achieve a high throughput, [7] has presented a design method to fully exploit the limited resources in FPGAs. Approximate computing has been widely studied in recent years; its objective is to achieve low energy and high performance at an acceptable accuracy loss [19]. Neural networks require significant large-scale computation and have high error resilience, so they are suitable for approximate computing. For example, approximate multipliers [20] can be used in CNNs with a very small loss of accuracy [21]; however, this method has only been applied to small-scale neural networks.

Deep compression has been proposed in [12] to reduce the model size of CNNs. Using network pruning [11], weight quantization and Huffman coding, a high compression ratio has been achieved. A so-called dynamic network surgery has been used to accelerate training and avoid unacceptable pruning [13]. Binary neural networks (BNNs) have been proposed to train deep neural networks (DNNs) with weights and activations constrained to +1 or −1 [22]. At a reduced complexity and with a small number of weights, BNNs achieve high performance for embedded systems [23], [24]. Quantizing activations to very low bit-widths has also been proposed, thus saving memory and accelerating training.

Incremental network quantization (INQ) [25] has been proposed for efficient CNN models with low-precision weights. INQ divides the weights into several groups and incrementally quantizes each group. At each quantization step, the network is retrained, and the weights that have not yet been quantized are updated to compensate for the accuracy loss. Using the INQ algorithm, the weights in a network can be quantized as ±2^n with only a small accuracy loss, where n is an integer. The MAC is the main arithmetic unit in CNNs. With the quantized data form, multiplication can be replaced
by shift operations in a MAC; therefore, INQ achieves high speed on hardware.

The significant difference between INQ and the proposed compression strategy is that the proposed strategy targets sparse neural network models, whereas the original INQ causes an unacceptable accuracy loss for sparse models. Due to the use of the no-pruning layers (NP-layers) defined in this work, the proposed method compensates the accuracy loss from the pruning layers (P-layers) in sparse networks. Furthermore, the implementation of the NP-layers can increase the parallel processing performance due to their regular structure.

III. THE PROPOSED COMPRESSION STRATEGY

We propose a hardware-oriented compression strategy for large-scale CNNs to store the weights on chip. The front layers incur a significant level of computation but have a smaller model size. Also, the front layers are less error-resilient: [12] has applied pruning (albeit at a smaller rate) in the front layers of VGG and AlexNet, and [26] shows that pruning the filters in the front layers significantly decreases the accuracy. In addition, an irregular sparse model affects the performance of parallel computation. Hence, pruning the front layers is not efficient. The principle of the proposed strategy is to apply different compression strategies to different layers, so making the network front-regular but back-irregular.

All layers are divided into two types: NP-layers and P-layers. In the beginning, all layers are pruned. Then INQ is used to quantize the P-layers. During the incremental quantization, the error introduced by pruning and quantization in the P-layers is compensated by the weight updates of the NP-layers, in which the NP-layers are no longer sparse. Therefore, the proposed compressed model has more error resilience, which makes the model easier to converge. When the quantization of the P-layers is completed, the NP-layers are quantized as fixed-point numbers, so they are ready for computation.

In general, the NP-layers have a regular structure for higher parallel performance, while the P-layers are highly compressed for a higher compression ratio. The NP-layers are generally Conv layers, but the P-layers can include both Conv layers and FC layers. An overview of the proposed strategy is shown in Fig. 1.

A. Hybrid Quantization Strategy

For the P-layers, INQ is applied to achieve a higher compression ratio. For a layer l, the weights are stored in an array W_l. The network sparsity is described by T_l, a binary array with the same size as W_l. T_l is calculated by the following equation:

T_l(i) = \begin{cases} 0, & W_l(i) = 0 \\ 1, & W_l(i) \neq 0 \end{cases}    (3)

A 0 or a 1 in T_l indicates that the corresponding weight is zero or non-zero, respectively. At each quantization step, the 1s in the array T_l are randomly divided into two groups: A_1 and A_2. The entries in A_2 are set to zero, which indicates that the corresponding weights will be quantized. The weight quantization in the P-layers is represented as:

W_l(i) = \begin{cases} +2^{\lfloor \log_2 |W_l(i)| \rfloor}, & W_l(i) > 0 \\ -2^{\lfloor \log_2 |W_l(i)| \rfloor}, & W_l(i) \leq 0 \end{cases}    (4)

With this quantization, the original multiplication in a MAC is replaced by a shift operation; therefore, only the sign bit and the exponent need to be stored, giving the direction and the number of bits to shift, respectively. After quantization, the model is retrained to compensate for the error. The weights are updated as per the following equation:

W_l(i) \leftarrow W_l(i) - \eta \frac{\partial E}{\partial W_l(i)} T_l(i)    (5)

where η is the learning rate, which depends on the learning policy, and E is the objective function. T_l(i) denotes the mask of the weights, which determines whether a weight must be updated. For zero entries in T_l, the corresponding weight is not updated because it is already zero or has been quantized. For an NP-layer l, the weights are quantized after the quantization of the P-layers. The following uniform quantization method is applied:

W_l(i) = \mathrm{round}\left(\frac{W_l(i)}{Q}\right) Q, \quad Q = 2^{-n}    (6)
With no off-chip memory, a distributed Conv architecture is proposed for high-speed processing. Each Conv-PE in the same CCPU has a RAM to transmit the feature map. In addition, different CCPUs receive weights from different RAMs, so the speed of data transmission matches the processing speed, hence relieving the bandwidth restriction.

As shown in Fig. 7, there are M_np feature map RAMs (FMRs) to store the feature map, and each one corresponds to an FPPB. The M_np data of the feature map are sent to each Conv-PE after the conversion of the FPPB, so the M_np input channels are processed at the same time.

In the Conv architecture, each CCPU processes a 3-D feature map to output a 2-D feature map, as briefly discussed in Section IV-A. N_np denotes the number of CCPUs that process the convolution in parallel. Similarly, N_np weight RAMs (WRs) store the weights in a distributed fashion, such that each WR corresponds to a CCPU. All CCPUs receive the same input feature map data; due to the different convolution kernels, this implies that N_np channels are output at the same time.

The entire convolution architecture thus processes M_np channels of the input feature map in parallel and outputs N_np channels of the output feature map.

N_np is generally smaller than the number of output channels N; therefore, a buffer is required for temporarily storing the intermediate feature map. Using the data integration method of Section IV-C, the serial data output by a CCPU is converted into F parallel streams and stored in the map buffer. The map buffer receives N_np channels of data and outputs M_np channels of data. When the input feature map data of the current layer are no longer needed, they are replaced by the data from the map buffer for the computation of the next layer.

In the proposed distributed architecture, the feature map storage format is determined by M_np. As shown in Fig. 8, the first M_np feature maps are set as a group and stored at the bottom of the FMRs; the next groups are subsequently stored in a stack fashion. Due to the different sizes of the feature maps between layers, the indexes of the next groups are hard to determine. To ensure that the architecture works correctly, M_np should be a multiple of N_np. Dual-port RAMs are used in this implementation, so there is no read/write conflict.

E. Shift-Accumulator Based Processing Architecture for P-Layers

Based on the proposed compression strategy, the P-layers form a highly compressed sparse model with ±2^n weight values. Multiplications are replaced by shift operations to reduce resources. To accelerate the processing of the P-layers, there are two levels in this parallel computing scheme (a functional sketch of the shift-accumulate operation follows the list):
• Unrolling the multiple shift-accumulations.
• Unrolling the different weight kernels (a Conv kernel in the Conv layers, or all weights connected to an output neuron in the FC layers). Due to the irregular sparse model, the computations of the weight kernels differ, which decreases the utilization of the hardware units.
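The sketch below illustrates, at a functional level, how one shift-accumulate (SAC) unit processes a single weight kernel of a P-layer. It is a behavioral model rather than the RTL: the assumption that each compressed weight is decoded into a (sign, exponent, index) triple follows the stored sign bit, exponent and weight indexes described in the text, but the exact packing is not taken from the paper. Unrolling over several weight kernels, as listed above, would simply instantiate several such loops in parallel.

```python
def sac_kernel(activations, decoded_weights):
    """Shift-accumulate over one sparse +/-2^n weight kernel.

    activations:     list of integer (fixed-point) activation values.
    decoded_weights: list of (sign, exp, idx) triples, one per non-zero
                     weight kept after pruning; `idx` selects the activation
                     and `exp` may be negative for weights smaller than one.
    Returns the partial sum; every multiplication is a shift plus an add.
    """
    acc = 0
    for sign, exp, idx in decoded_weights:
        a = activations[idx]
        shifted = a << exp if exp >= 0 else a >> (-exp)   # a * 2^exp
        acc += shifted if sign >= 0 else -shifted
    return acc

# Example: a weight of -2^-3 applied to activation a contributes -(a >> 3).
print(sac_kernel([64, 80, 96], [(+1, 1, 0), (-1, -3, 2)]))  # 128 - 12 = 116
```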
compute a 2-D convolution; also, the 3 input channels should be expanded to 21 for parallel computing.

The proposed general Conv-PE can switch modes between 1-D and 2-D convolution, with the capability of processing any Conv kernel smaller than 9 × 9. The serial FIR filters cause a significant delay, so they must be retimed for high performance.

V. EVALUATION

Under the NP-P hybrid model, the performance of each module is analyzed. In this section, the computation time and the hardware utilization are quantitatively evaluated for the proposed optimization algorithm of Section VI.

A. Time Analysis of NP-Layers

For processing the Conv layers using the proposed CCPUs, a 2-D convolution takes (W + 1) × H + C_fir clock cycles, so all data of the feature map are read once. Considering the input and output channels, the clock cycles of all NP-layers can be computed as follows:

C_{np} = \sum_{NP} \left[ (W+1) \times H + C_{fir} \right] \times \left\lceil \frac{M}{M_{np}} \right\rceil \times \left\lceil \frac{N}{N_{np}} \right\rceil    (9)

If the Conv-PEs work in a serial mode, an additional number of clock cycles, by a factor of K, is taken compared to the parallel mode. Considering the unrolling of the convolution, Eq. (9) becomes:

C_{np} = \sum_{NP} \left[ (W+1) \times H + C_{fir} \right] \times \left\lceil \frac{M \times K}{M_{np}} \right\rceil \times \left\lceil \frac{N}{N_{np}} \right\rceil    (10)

As the weights are loaded into the buffer prior to the next convolution, no additional time overhead is encountered. Due to the map buffer, the feature map can be loaded into the FMRs when the map data of the last iteration are no longer needed. Therefore, the total time is given by:

T_{np} = C_{np}(M_{np}, N_{np}) \times t_{np}    (11)

For storage, the largest feature map and the number of weights determine the size of the used memory. As per the reuse, N_np affects the size of the map buffer. Additionally, each CCPU requires a buffer to store at least one feature map for accumulation. Therefore, the entire memory is given by:

MS_{np} = \left\lceil \frac{2N - N_{np}}{N} \right\rceil \times \max_{NP}(W \times H \times M) \times b_a + N_{np} \times \max_{NP}(W \times H) \times b_a + \sum_{NP} Q_l \times b_w    (12)

where b_a and b_w denote the activation bit-width and the weight bit-width, and Q_l denotes the number of weights of a layer. In general, the largest feature map is found in the first several layers; thus, the layer dividing method does not affect the size of the memory in the NP-layers.

The computation hardware is determined by M_np and N_np; it mostly originates from the Conv-PEs and the accumulator. The numbers of multipliers and adders are given as follows:

Mul_{np} = 9 M_{np} \times N_{np}
Add_{np} = \left( 8 M_{np} + \sum_{i=1}^{\log_2(M_{np})} \frac{M_{np}}{2^i} \right) \times N_{np}    (13)

In Eq. (13), all adders are converted into a two-input form for evaluation.

B. P-Layers Computation Time Evaluation

The computation time in the P-layers can be divided into three parts: weight decoding, activation reading and SAC computing. The compressed weights are decoded into weights and indexes and stored in caches; then the activations are read by the ADF. After all the data are cached, the SAC-PU processes them. Each part operates on the results of the previous parts; therefore, the time consumed by each part must be accumulated.

For weight decoding, the required number of clock cycles depends on the number of weights after pruning:

C_w = \sum_{P} Q_l \times (1 - R_{lp})    (14)

where R_{lp} denotes the pruning rate of a layer.

The P-layers may include Conv layers and FC layers, so two cases of clock cycle utilization must be differentiated:

C_a = \sum_{P_{Conv}} \max(L) \times W \times H \times \left\lceil \frac{N}{N_p} \right\rceil + \sum_{P_{FC}} \max(L) \times \left\lceil \frac{U_{out}}{N_p} \right\rceil    (15)

where L is the length of the activations (equal to the last index value of the weight kernels).

Similarly, the number of clock cycles required for SAC computing is given by:

C_s = \sum_{P_{Conv}} \left\lceil \frac{\max(Q_k (1 - R_{kp}))}{M_p} \right\rceil \times W \times H \times \left\lceil \frac{N}{N_p} \right\rceil + \sum_{P_{FC}} \left\lceil \frac{\max(Q_k (1 - R_{kp}))}{M_p} \right\rceil \times \left\lceil \frac{U_{out}}{N_p} \right\rceil    (16)

where Q_k is the weight number and R_{kp} is the pruning rate of a weight kernel, different from Eq. (14).

As per the above equations, the total execution time of the P-layers is given as follows:

T_p = C_w(R_{lp}) t_w + C_a(L, N_p) t_a + C_s(R_{kp}, M_p, N_p) t_s    (17)

A pipelined design can be achieved between the activation reading and the SAC computation, so that their processing times overlap. Therefore, the execution time is given by:

T_p = \begin{cases} C_w(R_{lp}) t_w + C_a(L, N_p) t_a, & C_a t_a \geq C_s t_s \\ C_w(R_{lp}) t_w + C_s(R_{kp}, M_p, N_p) t_s, & C_a t_a < C_s t_s \end{cases}    (18)
The actual accuracy is hard to determine unless training and an implementation are pursued. However, a lower bound on the accuracy can be acquired before retraining. For example, let the accuracy requirement be A_r. A network in which all layers are sparse, whose accuracy A satisfies A ≥ A_r, can be selected as the baseline sparse network. Let the actual accuracy be A*; A* is always larger than A because some layers are NP-layers. Therefore, it can be ensured that A* is larger than A_r (A* ≥ A ≥ A_r).

C. Hardware/Algorithm Co-Optimization

The utilization of the CCPUs and the number of NP-layers are two important parameters that determine the performance, but it is difficult to consider them together. Therefore, the proposed approach considers them separately.

A spliter pointer is used to determine whether a layer belongs to the NP-layers or to the P-layers. At the beginning, the pointer must be initialized. As the Conv layers incur significantly more computation than the FC layers, and the number of weights is large in the FC layers, the spliter initially indicates that the NP-layers are the Conv layers and the P-layers are the FC layers.

Then the available hardware is allocated to the CCPUs according to a resource allocation ratio RA, defined as follows:

RA = \frac{Resources_{np}}{Resources_{p}}    (23)

In an FPGA, the resources can be evaluated by the numbers of DSPs and LUTs, where the number of DSPs should be converted into an equivalent number of LUTs.

A higher RA means that more NP-layers can be processed under the restriction T_np > T_p, which is faster due to the higher efficiency of the NP-layers. A smaller RA means more P-layers, which result in a smaller model size.

After initialization, the spliter pointer moves toward the front layers with a fixed RA, so reducing T_np. Moreover, T_p is calculated by using the prior value of R_lp, which is evaluated from the pruning rate of a fully sparse model. When the pointer is fixed, the model must be retrained. Due to the NP-layers, the final R_lp is larger than the previous value. Then, RA is finely tuned for the final result.

During the pointer adjustment process, the activation memory increases, but the weight memory decreases. If the total memory increases, the pointer returns toward the back layers, finding the smallest memory size. Therefore, the farthest distance the spliter pointer can be adjusted is rather limited, so RA cannot be too small. Similarly, if T_np − T_p < Margin at the beginning, the pointer is not adjusted, but more hardware is allocated to the SAC-PU to satisfy this condition. This process is described in Algorithm 2.

A larger N_np may improve the performance. However, the efficiency may decline due to the ceilings in Eq. (9), which means wasted computation. The best scenario is that there is no remainder of M/M_np and N/N_np in any NP-layer. The computation differs between layers due to the different values of W and H. If the efficiency needs to be taken into account to determine N_np, it can be evaluated by:

E_{Conv} = \frac{\sum_{NP} \left[ (W+1) \times H + C_{fir} \right] \times \frac{M}{M_{np}} \times \frac{N}{N_{np}}}{\sum_{NP} \left[ (W+1) \times H + C_{fir} \right] \times \left\lceil \frac{M}{M_{np}} \right\rceil \times \left\lceil \frac{N}{N_{np}} \right\rceil}    (24)

Algorithm 2: Hardware/Algorithm Co-Optimization (HACO)
Input: spliter pointer, RA, prior R_lp
Output: N_np, M_p, N_p and new R_lp
1: Allocate the available hardware to the CCPUs and the SAC-PU according to RA.
2: for all M_p, N_p and N_np do
3:   Find the largest value under the restriction of the available hardware.
4: end for
5: for all layers do
6:   Compute the required memory and find the minimal value (the furthest the spliter pointer can move).
7: end for
8: Compute T_NP, T_P.
9: if T_NP − T_P ≥ Margin then
10:   while T_NP − T_P ≥ Margin and the pointer is within the valid range do
11:     Move the spliter pointer to the front layer.
12:     Compute T_NP, T_P.
13:   end while
14:   Retrain the model through the spliter, acquiring the new R_lp.
15:   Compute T_NP, T_P.
16:   if T_NP − T_P < Margin then
17:     repeat
18:       Reduce N_np and enlarge M_p, N_p.
19:     until T_NP − T_P is minimal
20:   else
21:     repeat
22:       Enlarge N_np and reduce M_p, N_p.
23:     until T_NP − T_P is minimal
24:   end if
25: else
26:   while T_NP − T_P < Margin do
27:     Reduce N_np and enlarge M_p, N_p.
28:     Compute T_NP, T_P.
29:   end while
30: end if
31: return N_np, M_p, N_p and new R_lp

VII. EXPERIMENT AND RESULTS

A. Experiment Setup

The proposed processing architecture is scalable with different values of N_np and M_np, which affect the performance and the utilization of the DSP units. The memory depends on the size of the network model and of the input image; hence, the number of on-chip memory resources of an FPGA determines whether the network can be implemented on the FPGA without off-chip DRAM.

HACO is implemented on VGG-16 with the Xilinx VCU118 platform. The performance varies as a function of RA. The largest
value of RA gives the best performance and the largest model size (i.e., 21.1 MB, as mentioned in Section III-B). The smallest value of RA gives a lower speed but also the smallest model size. Initially, the NP-layers include only the Conv layers, and the P-layers are the FC layers. With the decrease of RA, the spliter pointer moves toward the front layers. When the pointer reaches layer Conv 3-3, the needed memory increases due to the large feature map; therefore, layer Conv 3-3 is the farthest point the pointer can move to in VGG-16.

For the largest model, the VCU118 requires significant on-chip memory; both URAMs and BRAMs are utilized. Due to the large model size, the memory for the weights is provided by URAMs. To fully use the RAM resources, 8 9-bit weights are utilized. The remaining memories are allocated for the activations: 3 parallel activations are utilized for a bit-width of 48 bits; therefore, every 3 FMRs, with a bit-width of 144 bits, are synthesized as two groups of URAMs or BRAMs. This allocation is adjusted by balancing URAMs and BRAMs. Computation in the VCU118 is performed mostly by DSPs and LUTs. For high performance in multiplication, DSPs are allocated to the NP-layers, and LUTs are allocated to the P-layers for the shift operations. The available hardware is given by 60% of the VCU118 resources. As DSPs cannot be used in the P-layers, a decrease of RA is possible. For the highest efficiency in convolution, N_np, M_p and N_p should be given by 2^n, where n is a positive integer. N_p is limited to a range of 32∼512 due to the ADF efficiency and the largest number of output channels. In the P-layers, the activations and weights are required to be cached before computing, which accounts for a large memory. When the available BRAMs are insufficient, distributed RAMs can be used as alternatives.

To improve the speed of weight decoding and activation reading, the data must be integrated in the RAMs for parallel computing, as an improvement in frequency alone is not effective. If the activations are to be integrated, the weights should be decoded in the same form. However, the weights in a sparse model are not serial; this makes the number of useful weights difficult to calculate. Therefore, parallel decoding cannot be achieved when multiple activations are integrated. Since the feature map shrinks during pooling, the integration of activations results in less efficiency in the P-layers. Therefore, only parallel weight decoding is used in our experiments.

B. Performance Analysis and Comparison

In HACO, the largest and smallest values of RA are utilized, as shown in Table IV. For the largest value of RA, T_np is initially larger than T_p; therefore, the spliter pointer does not move, staying at the end of the Conv layers. In this case, N_np = 33 and N_p = 51 are obtained by HACO. For the smallest value of RA, N_p = 256 and N_np = 16 are obtained, and the pointer finally moves to layer Conv 4-3. To achieve the highest efficiency, N_np and N_p are mapped to 2^n in our experiments. For the P-layers, the caches are synthesized as BRAMs and distributed RAMs. The results are shown in Table IV.

The FMRs and the CCPUs use the same frequency (200 MHz for N_np = 16; 150 MHz for N_np = 32). Weights are loaded into the buffer during convolution. As weight loading is faster than the convolution computation but would require a large area and a longer delay in the design, the weight RAMs use a slower frequency (100 MHz in this work). All frequencies are shown in Table V.

TABLE II: Comparison with other VGG-16 designs at the same level of accuracy (66.06% top-1).

Table II shows the hardware resources and performance of previous VGG-16 FPGA designs and of the proposed design with the largest and smallest values of RA. The proposed design achieves the highest FPS of 83.0 at 150 MHz; this is 1.8× faster than [7]. 4,096 DSPs are used due to the large number of CCPUs. The other three state-of-the-art designs all use off-chip DDR3 DRAM, which limits the processing speed due to bandwidth; the proposed design uses only on-chip BRAMs and URAMs for storing the feature maps and weights.

RA is reduced for a smaller model size. As shown in Table II, the design with the smallest RA has a smaller memory (a URAM is 8 times larger than a BRAM, so the smallest RA saves 728 BRAMs). Using HACO, the redundancy in the NP-layers is removed to attain a higher efficiency; therefore, the number of DSPs is reduced to 1,024. Compared with [6], the smallest-RA design achieves 90% of the performance with only a 46% usage of DSPs. Compared with [7], it achieves 67% of the performance with a 36% usage of DSPs and a 20% usage of FFs.
TABLE III: Comparison between the proposed general CCPUs and other designs.

TABLE IV: Results on VGG-16 by HACO with the largest and smallest RA.

TABLE V: Frequencies of the NP-layers and P-layers.

Although there are some designs of embedded BNNs that can achieve an even higher FPS, their accuracy loss is larger than that of the designs in Table II. For example, [24] can only achieve a 55.8% top-1 accuracy for VGG-16. Therefore, these designs are not considered; only designs with the same level of accuracy are compared in Table II.

The CCPUs are also applied to ResNet. In the first layer of ResNet, the size of the Conv kernels is 7 × 7; therefore, the CCPUs operate in a serial mode. Seven copies of the input image are loaded into the FMRs, and each image is processed by a 1-D convolution; then, the output channels are added to obtain an output feature map. For ResNet, an extra RAM is utilized to store the feature map of the shortcut connection. The FMRs must store up to 224 × 224 × 32 data for the input feature maps; the map buffer and the shortcut buffer must each store up to 112 × 112 × 64 data. The required total memory is 67% of that used by VGG-16. As per the time evaluation

VIII. CONCLUSION

A hardware-oriented compression strategy has been initially proposed in this paper. This strategy achieves high performance and a 27.5× compression ratio for VGG-16. It has been shown that the proposed strategy incurs a very small accuracy loss compared to the single-precision floating-point implementation, as tested on the ILSVRC2012 data set through the Caffe framework.

As a case study, the architecture of the proposed compressed VGG-16 has been designed with no off-chip memory on the Xilinx FPGA VCU118 platform. It has been shown that the design obtained using the proposed tool HACO achieves the highest performance of 83.0 FPS for the same level of accuracy in image processing. The proposed general Conv-PE structure has a high efficiency in processing large Conv kernels; therefore, the entire architecture can process a wide range of CNN models with little additional hardware. The proposed hardware design can be applied to several real-time, resource-constrained image processing applications.

The proposed compression method is applicable to other CNN models; new quantization and pruning methods can be
also used in the proposed HACO framework to obtain designs with even higher performance levels.

REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014, arXiv:1409.1556. [Online]. Available: http://arxiv.org/abs/1409.1556
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[3] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[4] S. Chetlur et al., "cuDNN: Efficient primitives for deep learning," 2014, arXiv:1410.0759. [Online]. Available: http://arxiv.org/abs/1410.0759
[5] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2017.
[6] J. Wang, J. Lin, and Z. Wang, "Efficient hardware architectures for deep convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 1941–1953, Jun. 2018.
[7] S. Yin et al., "A high throughput acceleration for hybrid neural networks with efficient resource management on FPGA," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 4, pp. 678–691, Apr. 2019.
[8] D. Han, J. Lee, J. Lee, and H.-J. Yoo, "A low-power deep neural network online learning processor for real-time object tracking application," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 5, pp. 1794–1804, May 2019.
[9] Y. Wang, J. Lin, and Z. Wang, "FPAP: A folded architecture for energy-quality scalable convolutional neural networks," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 66, no. 1, pp. 288–301, Jan. 2019.
[10] Y.-J. Lin and T. S. Chang, "Data and hardware efficient design for convolutional neural network," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 5, pp. 1642–1651, May 2018.
[11] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Proc. Adv. Neural Inf. Process. Syst., 1993, pp. 164–171.
[12] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," 2015, arXiv:1510.00149. [Online]. Available: http://arxiv.org/abs/1510.00149
[13] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1379–1387.
[14] S. Ye et al., "Progressive DNN compression: A key to achieve ultra-high weight pruning and quantization rates using ADMM," 2019, arXiv:1903.09769. [Online]. Available: http://arxiv.org/abs/1903.09769
[15] L. Lu and Y. Liang, "SpWA: An efficient sparse Winograd convolutional neural networks accelerator on FPGAs," in Proc. 55th ACM/ESDA/IEEE Des. Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[16] T. Chen et al., "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," SIGARCH Comput. Archit. News, vol. 42, no. 1, pp. 269–284, Feb. 2014, doi: 10.1145/2654822.2541967.
[17] J. H. Ko, D. Kim, T. Na, and S. Mukhopadhyay, "Design and analysis of a neural network inference engine based on adaptive weight compression," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 38, no. 1, pp. 109–121, Jan. 2019.
[18] S. Li, J. Park, and P. Tak Peter Tang, "Enabling sparse Winograd convolution by native pruning," 2017, arXiv:1702.08597. [Online]. Available: http://arxiv.org/abs/1702.08597
[19] W. Liu, F. Lombardi, and M. Schulte, "A retrospective and prospective view of approximate computing [Point of View]," Proc. IEEE, vol. 108, no. 3, pp. 394–399, Mar. 2020.
[20] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han, and F. Lombardi, "Design of approximate radix-4 Booth multipliers for error-tolerant computing," IEEE Trans. Comput., vol. 66, no. 8, pp. 1435–1441, Aug. 2017.
[21] Z. Liu, K. Jia, W. Liu, Q. Wei, F. Qiao, and H. Yang, "INA: Incremental network approximation algorithm for limited precision deep neural networks," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2019, pp. 1–7.
[22] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or −1," 2016, arXiv:1602.02830. [Online]. Available: http://arxiv.org/abs/1602.02830
[23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 525–542.
[24] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design flow of accelerating hybrid extremely low bit-width neural network in embedded FPGA," in Proc. 28th Int. Conf. Field Program. Log. Appl. (FPL), 2018, pp. 163–1636.
[25] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," 2017, arXiv:1702.03044. [Online]. Available: http://arxiv.org/abs/1702.03044
[26] M. A. Hanif, R. Hafiz, and M. Shafique, "Error resilience analysis for systematically employing approximate computing in convolutional neural networks," in Proc. Des., Autom. Test Eur. Conf. Exhib. (DATE), 2018, pp. 913–916.
[27] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.
[28] J. Deng, W. Dong, R. Socher, L. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[29] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2016, pp. 26–35.
[30] Y. Guan et al., "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates," in Proc. IEEE 25th Annu. Int. Symp. Field-Programmable Custom Comput. Mach. (FCCM), Apr. 2017, pp. 152–159.
[31] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 26, no. 7, pp. 1354–1367, Jul. 2018.

Tian Yuan received the B.S. degree in information engineering from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, in 2018, where he is currently pursuing the M.S. degree in circuits and systems. His research interests include hardware architecture design for deep learning, neural network compression and quantization, and approximate computing.

Weiqiang Liu (Senior Member, IEEE) received the B.S. degree in information engineering from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, and the Ph.D. degree in electronic engineering from Queen's University Belfast (QUB), Belfast, U.K., in 2006 and 2012, respectively. In December 2013, he joined the College of Electronics and Information Engineering, NUAA, where he is currently a Professor and the Vice Dean of the College of Electronics and Information Engineering. His research interests include emerging technologies in computing systems, computer arithmetic, hardware security, and VLSI design for digital signal processing and cryptography. He has published one research book with Artech House and over 100 leading journal and conference papers. He is a Senior Member of the Chinese Institute of Electronics. One of his papers was selected as the Feature Paper of the IEEE Transactions on Computers in the December 2017 issue. He received the prestigious Outstanding Young Scholar Award from the National Natural Science Foundation of China (NSFC) in 2020. He is the Program Co-Chair of the IEEE Symposium on Computer Arithmetic (ARITH) and a program member for a number of international conferences. He is a member of both the Circuits and Systems for Communications (CASCOM) Technical Committee and the VLSI Systems and Applications (VSA) Technical Committee of the IEEE Circuits and Systems Society. He has served as a Guest Editor for the Proceedings of the IEEE and as an Associate Editor for the IEEE Transactions on Circuits and Systems I: Regular Papers, the IEEE Transactions on Computers, the IEEE Transactions on Emerging Topics in Computing, and the IEEE Open Journal of the Computer Society, and as a Steering Committee Member of the IEEE Transactions on Multi-Scale Computing Systems.
Jie Han (Senior Member, IEEE) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1999, and the Ph.D. degree from the Delft University of Technology, The Netherlands, in 2004. He is currently a Professor with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada. His research interests include approximate computing, stochastic computing, reliability and fault tolerance, nanoelectronic circuits and systems, as well as novel computational models for nanoscale and biological applications. He was a recipient of the Best Paper Award at the International Symposium on Nanoscale Architectures (NanoArch) 2015 and of Best Paper Nominations at the 25th Great Lakes Symposium on VLSI (GLSVLSI) 2015, NanoArch 2016, and the 19th International Symposium on Quality Electronic Design (ISQED) 2018. He was nominated for the 2006 Christiaan Huygens Prize of Science by the Royal Dutch Academy of Science. His work was recognized by Science for developing a theory of fault-tolerant nanocircuits (2005). He has served as the General Chair for GLSVLSI 2017 and the IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT) 2013, and as a Technical Program Committee Chair for GLSVLSI 2016, DFT 2012, and the Symposium on Stochastic & Approximate Computing for Signal Processing and Machine Learning, 2017. He is currently an Associate Editor of the IEEE Transactions on Emerging Topics in Computing (TETC), the IEEE Transactions on Nanotechnology, the IEEE Circuits and Systems Magazine, the IEEE Open Journal of the Computer Society, and Microelectronics Reliability (Elsevier journal).

Fabrizio Lombardi (Fellow, IEEE) received the B.S. degree (Hons.) in electronic engineering from the University of Essex, U.K., in 1977, the master's degree in microwaves and modern optics and the Diploma degree in microwave engineering from the Microwave Research Unit, University College London, in 1978, and the Ph.D. degree from the University of London in 1982. He currently holds the International Test Conference (ITC) Endowed Chair Professorship at Northeastern University, Boston, MA, USA. His research interests are bio-inspired and nano manufacturing/computing, VLSI design, testing, and fault/defect tolerance of digital systems. He has extensively published in these areas and has coauthored/edited seven books. He is currently the Vice President for Publications of the IEEE Computer Society and a member of the executive committee of the IEEE Nanotechnology Council. He was the Editor-in-Chief of the IEEE Transactions on Computers from 2007 to 2010, the IEEE Transactions on Emerging Topics in Computing from 2013 to 2017, and the IEEE Transactions on Nanotechnology from 2014 to 2019.