A Convolutional Neural Network Accelerator Architecture
Abstract— Convolutional neural networks (CNNs) have been widely deployed in deep learning applications, especially on power-hungry GP-GPUs. Recent efforts in designing CNN accelerators are considered a promising alternative to achieve higher energy efficiency. Unfortunately, with the growing complexity of CNNs, the computational and storage resources demanded by accelerators keep increasing, hindering their wider application in mobile devices. On the other hand, many quantization algorithms have been proposed for efficient CNN training, which yield many small or zero weights. This is a unique opportunity for accelerator designers to employ much fewer bits, e.g., 4 bits, in both the arithmetic core and storage, thereby saving significant design cost. However, such a single-precision strategy inevitably compromises accuracy, as some key operations may demand a higher precision. Thus, this paper proposes a low-power CNN accelerator architecture that can simultaneously conduct computations with mixed precisions and assign the appropriate arithmetic cores to operations with different precision demands. The proposed architecture achieves significant area and energy savings without accuracy compromise. Experimental results show that the proposed architecture implemented on FPGA can reduce almost half of the weight storage and MAC area, and lower the dynamic power by 12.1% when compared with a state-of-the-art CNN accelerator design.

I. INTRODUCTION

Convolutional neural networks (CNNs) have been widely deployed in various deep learning domains, especially mobile applications [1]-[3]. Such mobile devices are commonly constrained by the available computational and storage resources and hence demand more efficient hardware implementations to achieve smaller storage, faster inference, and lower power. Unlike power-hungry GP-GPUs [4], FPGA [5]-[9] and ASIC [10]-[14] implementations are considered a more promising alternative for low-power CNN accelerators. However, as CNN algorithm complexity continues to grow and results in significant arithmetic and storage consumption, there has been substantial progress in designing lightweight CNNs to achieve a better speed and accuracy tradeoff [15]-[17]. Among such efforts, quantization and sparsification have become an active research area [18]-[20], bringing down the resources demanded for multiplications in CNNs. Through a careful training procedure, many weights can be zeroed out or quantized to fewer bits with limited accuracy loss [21]-[22]. It is then an appealing option to design the processing element (PE), i.e., the basic arithmetic unit of a CNN accelerator, only with the demanded bit widths to reduce design overhead.

Prior work [11][14][23] has studied precision reduction for fixed-point computations, achieving 4× storage saving with only 0.4-0.6% accuracy compromise. In addition, ternary and binary weight networks have been investigated for more aggressive energy reduction [6]. While such a quantized network can be effectively implemented in hardware with a single precision, it is constrained to a particular scenario and hence not applicable to varying demands [11]-[19]. To support various scenarios, many researchers investigate general-purpose accelerator architectures that can be used to deploy different CNN algorithms [11]-[14]. Reference [13] proposes to use two 8-bit multipliers in a PE to switch between 8- and 16-bit precisions. ENVISION [12] proposes a Booth-multiplier-based dynamic voltage accuracy frequency scaling technology that can be configured to 4, 8 or 16 bits. Furthermore, UNPU [14] uses serial multipliers to implement a lookup-table-based PE that enables precisions from 1 to 16 bits. In short, at the cost of additional area for PEs, control logics and storage, the configurability can provide different precisions to different neural networks, or to different layers within one neural network, thereby improving the overall energy saving [12]-[14].

Fig. 1. Distributions of effective weight precisions for different convolution layers in (a) AlexNet and (b) VGG16.

However, the precision control of all the prior work is at most at layer-wise granularity. In other words, within the same layer, all the operations use the same precision and bit-width. It has been noted that the effective precision of weights within a layer may actually vary significantly. Here, effective precision refers to the minimum number of bits needed to represent a weight without accuracy loss. For example, even if we use 16 bits to store an integer weight of 3, its effective precision is only 2 bits. Fig. 1 presents the effective weight precision distributions of different layers for two representative neural networks, AlexNet and VGG16, from [20]. While the highest effective precision is up to 16 bits, the majority of weights within the same layer only require 5 bits or even fewer. This is partially due to the robust nature of neural networks, which tend to converge to more distributed and sparse weights during the quantization-driven training procedure. The architectures in the prior work have to use the highest effective precision, i.e., the longest bit-width, for each layer to ensure accuracy. However, as Fig. 1 indicates, many bits, especially the higher-order bits, can actually be skipped to save storage or computation resources without any accuracy loss.
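As a concrete illustration of this counting (a minimal sketch, not the authors' tooling; the magnitude-only definition and the toy weights below are assumptions for illustration), the per-layer distribution of effective precisions can be estimated as follows:

from collections import Counter

def effective_precision(weight: int) -> int:
    # Minimum number of bits for the weight's magnitude,
    # e.g. a stored integer value of 3 only needs 2 bits.
    return max(1, abs(int(weight)).bit_length())

def precision_histogram(weights):
    # Fraction of weights at each effective bit-width, i.e. the kind of
    # per-layer distribution plotted in Fig. 1.
    counts = Counter(effective_precision(w) for w in weights)
    total = len(weights)
    return {bits: counts[bits] / total for bits in sorted(counts)}

# Toy layer: mostly small weights, a few large outliers.
layer = [0, 1, -2, 3, 3, -5, 7, 12, -30, 2047]
print(precision_histogram(layer))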
The ideal approach is to have fine-granular mixed precision support within a CNN layer and per-PE control of all the precisions. Then, the accelerator can select the desired bit-width for each operation to maximize the energy saving. However, such an attempt is impractical, as it demands: (1) significant resources to implement mixed precisions for all the possible bit-widths; and (2) overly complex logics to control the selected precision for each operation and each PE. The question is then: can we compromise on the number of supported precisions to achieve a feasible solution?

This is possible, but it is a non-trivial task that needs to address the following challenges:
• Since it is impractical to separately control the precision of each PE, it is highly desirable to have a universal accelerator architecture that can support within-layer and layer-wise mixed precisions for different layers and different neural networks.
• It is not clear how to select the desired precision for a given weight distribution, nor how to implement the mixed-precision PE logics needed to support a universal architecture.
• Additional control logics are needed to fetch, process and merge the results with mixed precisions, and their cost must be covered by the saving from the mixed precision support.

Thus, in this paper, we propose a general-purpose CNN accelerator architecture with fine-granular mixed precision support to address the aforementioned challenges. In particular, the contributions of this work can be summarized as below:
• Universal accelerator architecture: We separate the CNN computations in the proposed architecture into two groups, full and low precision. A row of full precision PEs (FPPEs) is added to a low precision PE (LPPE) array as the core of the accelerator. Such an architecture is applicable to different CNNs and supports simultaneous computations with mixed precisions.
• Mixed precision logic and control: A unique arbiter logic is designed to separate the queued operations into low and full precision. The full precision data is fed to an FPPE with an embedded decoder that controls the mapping and data flow.
• Design optimization and tradeoff: We provide a systematic analysis method to decide the most appropriate bit-width threshold and the design constraints for low and full precisions, so as to maximize the area saving without timing penalty.

The proposed architecture is implemented on FPGA and compared with a state-of-the-art architecture [10][23]-[27]. The experimental results show that the proposed implementation can save almost half of the weight storage and MAC area when compared with the single precision baseline accelerator. The proposal also lowers the dynamic power of the entire system by 12.1%.

II. BACKGROUND

Recently it has become a popular research topic to design a general-purpose accelerator architecture that can support different neural networks for different applications [8]-[14]. A commonly-used array-based architecture is illustrated in Fig. 2 with its top-level organization and memory hierarchy [10]. In such an architecture, weights and feature maps are repeatedly fetched from the global buffer (GLB) into the PE array for the MAC operations until all the computations of one CNN layer are completed. Clearly, most of the energy consumption comes from this repeated data movement and computation. Thus, a longer bit-width (higher precision) inevitably causes higher energy and area overhead in both transmission and computation [28]-[29].

III. PROPOSED ACCELERATOR ARCHITECTURE

A. Architecture Overview and Its Data Flow Control

In the proposed architecture with fine-granular mixed precision configurability, we deploy two types of PEs to separately process the full and low precision operations. The proposed architecture and its data flow control are illustrated in Fig. 3, which includes an array of LPPEs, a row of FPPEs, a row of adders, the GLB, a full precision buffer (FPB), and additional control logics:
• FPPE contains a full precision multiplier and storage (i.e., 16 bits in this paper) and is shared by the entire array.
• LPPE is a PE array that uses fewer bits for MAC and storage (i.e., 8 bits in this paper) than the full precision PEs.
• Adder is a row of full precision adders that collect and add the outputs of the LPPEs and FPPEs.
• GLB is exactly the same as the one in Fig. 2 and temporarily stores data for the CNN.
• FPB has the same functionality as the GLB but stores full precision data and is used by the FPPEs.
• Arbiter logic determines whether the fetched data should be sent to the FPPE or the LPPE.
In this work, the proposed architecture is designed to handle two different precisions through the LPPE and FPPE to achieve finer-granular within-layer and layer-wise mixed precision support. However, similar techniques as in [12]-[14] can always be employed to support more precisions in each PE, as the proposed techniques are general.
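As a behavioral illustration of this data flow (a sketch only, not the RTL of the design; the function names, the job-queue abstraction and the 8-bit fit test in the example are assumptions), one computation round can be modeled as the arbiter splitting the fetched weights into a low and a full precision stream, each PE type processing its own stream, and the adder row merging the partial sums:

def run_round(weights, ifmaps, needs_full_precision):
    # Behavioral model of one computation round on the architecture of Fig. 3.
    # `needs_full_precision(w)` stands for the arbiter test of Section III-C.
    lppe_jobs, fppe_jobs = [], []
    for idx, w in enumerate(weights):
        (fppe_jobs if needs_full_precision(w) else lppe_jobs).append((idx, w))

    psums = [0] * len(weights)
    for idx, w in lppe_jobs:    # low precision MACs in the LPPE array
        psums[idx] += w * ifmaps[idx]
    for idx, w in fppe_jobs:    # full precision MACs in the FPPE row (fed from the FPB)
        psums[idx] += w * ifmaps[idx]
    # The full precision adder row then merges FPPE outputs into the matching
    # LPPE columns; the merge is implicit in the shared psums list here.
    return psums

# Example: treat anything that does not fit in 8 signed bits as full precision.
print(run_round([3, -200, 7, 1024], [1, 2, 3, 4],
                lambda w: not (-128 <= w <= 127)))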
B. FPPE Design

In particular, the FPPE needs to: (1) locate the corresponding feature map for the weight fetched from the arbiter logic; (2) improve the reuse of the feature map by finding out the weights sharing the same map; and (3) deliver the corresponding partial sum that accumulates the output from the multiplier.

Fig. 4. Architecture overview of the proposed FPPE.

To find out the mapping from a weight to its feature map and then to its partial sum, a naive way is to keep track of the correspondence from one to the other, which is challenging in both computational and storage complexity. Thus, we design a decoder as the very first stage of the FPPE that decodes the aforementioned mapping from the weight index. The weight index is decoded into four parameters, Pdes_col, Pin_ifmap, Pin_fp and Pin_lppe, as described in Table I. After decoding, the Ifmap-receiver collects the feature map from the data bus using Pin_fp and Pin_lppe, and the feature map is sent to a MAC along with the corresponding weight. The Ifmaps reader then uses Pin_ifmap to decide where to reuse the feature map. A lookup table, as shown in Fig. 4, is employed to associate the feature map with the Psum and to determine the memory address that stores the MAC output according to the parameter Pdes_col. Finally, the calculated output is fed to the corresponding column of the LPPE array using Pin_lppe, where it is added to the LPPE output. This procedure is repeated until the entire convolution operation is completed.

Table I. Parameters decoded from the weight index and their definitions.

Parameter    Definition
Pdes_col     10-bit register for the destination column of the FPPE output
Pin_ifmap    2-bit register for the index of the feature map (Ifmaps)
Pin_fp       4-bit register for the index of the weight in the filter plane
Pin_lppe     10-bit register for the index of the weight in the LPPE
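To sketch what this decoding amounts to, the four parameters can be viewed as packed bit fields of the weight index. The 26-bit field order below is a hypothetical packing chosen for illustration from the register widths in Table I, not the actual encoding of the design:

from typing import NamedTuple

class DecodedIndex(NamedTuple):
    p_des_col: int   # 10 bits: destination column of the FPPE output
    p_in_ifmap: int  #  2 bits: which feature map (Ifmap) to use or reuse
    p_in_fp: int     #  4 bits: weight position within the filter plane
    p_in_lppe: int   # 10 bits: weight position within the LPPE array

def decode_weight_index(index: int) -> DecodedIndex:
    # Hypothetical 26-bit packing: [p_des_col | p_in_ifmap | p_in_fp | p_in_lppe].
    p_in_lppe = index & 0x3FF
    p_in_fp = (index >> 10) & 0xF
    p_in_ifmap = (index >> 14) & 0x3
    p_des_col = (index >> 16) & 0x3FF
    return DecodedIndex(p_des_col, p_in_ifmap, p_in_fp, p_in_lppe)

print(decode_weight_index(0b1100110011_01_0101_0000000111))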
C. Arbiter Logic for Weight Selection

The functionality of the arbiter logic module in Fig. 3 is to direct the weights that need to be represented with full precision to the FPB and the remaining weights to the GLB. Assume that a 16-bit signed fixed-point weight is fetched as $W = \{W_{15}, W_{14}, \ldots, W_1, W_0\}$. For a given bit-width threshold $W_i$, the module detects a low precision weight by checking whether the bits from $W_{14}$ down to $W_{i-1}$ are all equal to the sign bit. For example, with $W_i = 8$, the module checks whether $\{W_{14}, \ldots, W_8, W_7\}$ all equal the sign bit $W_{15}$. If so, this weight can be effectively represented by only the 8 bits $\{W_7, W_6, \ldots, W_1, W_0\}$ and is sent to the GLB for the corresponding operation. Otherwise, the weight is processed as full precision and sent to the FPB.
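The check itself is a sign-extension test on the two's-complement representation. A minimal Python sketch of the arbiter decision (the function name and return values are illustrative; the 16-bit word and $W_i = 8$ follow the example above):

def route_weight(w16: int, wi: int = 8, full: int = 16) -> str:
    # Return 'GLB' if the 16-bit signed weight fits in `wi` bits, i.e. bits
    # W14 down to W(wi-1) all equal the sign bit W15; otherwise return 'FPB'.
    bits = w16 & ((1 << full) - 1)            # two's-complement view
    sign = (bits >> (full - 1)) & 1
    upper = bits >> (wi - 1)                  # bits W15 .. W(wi-1)
    all_sign = (1 << (full - wi + 1)) - 1 if sign else 0
    return 'GLB' if upper == all_sign else 'FPB'

print(route_weight(3))        # small weight, 8 bits suffice -> 'GLB'
print(route_weight(-3))       # sign-extends cleanly         -> 'GLB'
print(route_weight(0x7FFF))   # needs the full 16 bits       -> 'FPB'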
IV. DESIGN OPTIMIZATION AND TRADEOFF

With the proposed architecture, we are able to map any CNN onto it and conduct the computation with mixed precisions. However, there remain two key questions to be answered: (1) For a given bit-width threshold, operands with a higher effective precision will be sent to the FPPEs. Thus, a bit-width threshold that is too low requires more FPPEs to be implemented, incurring additional design cost. (2) On the other hand, too few FPPEs may become the timing bottleneck of the full precision computation, while too many FPPEs cause low utilization and area waste. In this section, we provide a systematic method to optimize these two design options.

A. Bit-Width Threshold Selection

Without loss of generality, assuming the unit storage for 1 bit is normalized to 1, the storage area consumption for the single precision architecture in Fig. 2 can be roughly estimated by:

$S_{PE} = N \times W \times k$  (1)

$S_{GLB} = M \times W$  (2)

where $S_{PE}$ and $S_{GLB}$ denote the storage used in the PE array and the GLB; $W$ is the full precision bit-width; $k$ is the number of PEs in the array; $N$ and $M$ are the numbers of weights stored in a PE and in the GLB for one computation loop. Now, for the proposed architecture including the additional FPPEs, we can assume that $W_i$ is the selected bit-width threshold separating full and low precisions, and that the LPPE array has the same size as the array in Fig. 2. Then the storage consumption for the proposed architecture in Fig. 3 is:

$S'_{LPPE} = N \times k \times W_i$  (3)

$S'_{FPPE} = N \times k \times (1 - p(W_i)) \times W$  (4)

$S'_{GLB} = M \times W_i$  (5)

$S'_{FPB} = M \times (1 - p(W_i)) \times W$  (6)

where $S'_{LPPE}$, $S'_{FPPE}$, $S'_{GLB}$ and $S'_{FPB}$ denote the storage used for the LPPE, FPPE, GLB and FPB, respectively; $p(W_i)$ denotes the probability that a weight has an effective precision of $W_i$ bits or fewer, which can be characterized from the weights of the trained network [20].
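To make the model concrete, a tiny evaluation of EQs (1)-(6) with assumed numbers (the array size, buffer depth and $p(W_i)$ below are illustrative, not design figures from this paper):

def storage_model(N, M, k, W, Wi, p_wi):
    # EQs (1)-(6): storage (in bits) for the baseline and the proposed design.
    s_pe, s_glb = N * W * k, M * W                 # (1), (2)
    s_lppe = N * k * Wi                            # (3)
    s_fppe = N * k * (1.0 - p_wi) * W              # (4)
    s_glb2 = M * Wi                                # (5)
    s_fpb = M * (1.0 - p_wi) * W                   # (6)
    return (s_pe, s_glb), (s_lppe + s_fppe, s_glb2 + s_fpb)

# Illustrative only: 168 PEs, 64 weights per PE, 16384 weights in the GLB,
# W = 16, Wi = 8, and 95% of the weights fitting in the low precision.
base, prop = storage_model(N=64, M=16384, k=168, W=16, Wi=8, p_wi=0.95)
print(base, prop)   # proposed storage is about 55% of the baseline (~45% saving)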
An example for AlexNet and VGG16 is shown in Fig. 5, where the marked curves give the percentage of weights (Y-axis) that can be effectively represented with a particular bit-width (X-axis). It is clear that most layers only need a small bit-width for almost all the weights. The relative storage saving for the LPPE array can then be approximately calculated from EQs (1)-(6) and simplified to:

$\alpha \approx \frac{S_{PE} - S'_{LPPE} - S'_{FPPE}}{S_{PE}} = p(W_i) - \frac{W_i}{W}$  (7)

Similarly, the relative storage saving for the GLB is:

$\beta \approx \frac{S_{GLB} - S'_{GLB} - S'_{FPB}}{S_{GLB}} = p(W_i) - \frac{W_i}{W} = \alpha$  (8)

Thus, we can define an auxiliary function $g(W_i) = W_i / W$. To achieve storage saving across different layers, we need to choose a $W_i$ that yields $p(W_i) \ge g(W_i)$. In other words, as in Fig. 5, $W_i$ needs to keep all the marked curves above the red line. With such a strategy, we can define the appropriate bit-width threshold $W_i$ for the underlying CNN.
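Under the assumptions of EQs (1)-(8), choosing the threshold reduces to comparing the cumulative distribution $p(W_i)$ against $g(W_i) = W_i/W$. The sketch below applies one possible selection policy consistent with this criterion, maximizing the worst-case saving $\alpha$ over a set of layers; the helper definitions and toy weights are illustrative assumptions, and the paper itself reads the threshold off Fig. 5 rather than computing it this way:

def effective_precision(w: int) -> int:
    # Minimum bits for the weight's magnitude (e.g. 3 -> 2 bits).
    return max(1, abs(int(w)).bit_length())

def p_of(wi: int, weights) -> float:
    # p(Wi): fraction of weights whose effective precision is <= Wi bits.
    return sum(effective_precision(w) <= wi for w in weights) / len(weights)

def pick_threshold(layers, full_width: int = 16) -> int:
    # One possible policy: pick the Wi that maximizes the worst-case relative
    # saving alpha = p(Wi) - Wi/W over all layers, subject to p(Wi) >= g(Wi).
    best_wi, best_alpha = full_width, 0.0
    for wi in range(1, full_width + 1):
        alpha = min(p_of(wi, layer) - wi / full_width for layer in layers)
        if alpha >= best_alpha:
            best_alpha, best_wi = alpha, wi
    return best_wi

# Toy weight distributions for two layers.
layers = [[0, 1, -2, 3, 5, -7, 100], [2, -3, 4, 6, -9, 12, 2000]]
print(pick_threshold(layers))   # -> 4 with this toy data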
B. Optimizing FPPE for Design Tradeoff

With the given bit-width threshold, we still need to decide the number of FPPEs in the array to provide matched timing between the LPPE and FPPE computations. Similar to the previous section, by assuming one MAC per PE, a PE in the architecture of Fig. 2 approximately requires $T = N$ cycles to process its $N$ weights, where $T$ is the overall computation time. Then, for the proposed architecture with FPPEs, the computation time of the FPPE row is:

$T' = k \times N \times (1 - p(W_i)) / n$  (9)

where $n$ is the number of FPPEs in the array. To achieve $T \ge T'$, $n$ needs to meet the following constraint:

$n \ge N \times (1 - p(W_i))$  (10)

Moreover, the area of the computation logics for the PE array in Fig. 2 can be approximated by $L = L_W \times k$, where $L_W$ is the number of LUTs (or area) for the MAC.
Now, for the proposed architecture in Fig. 3, the area of its computation logics is:

$L' = L'_{LPPE} + L'_{FPPE} = k \times L_{W_i} + n \times L_W$  (11)

where $L'_{LPPE}$ and $L'_{FPPE}$ are the computation logic areas for the LPPE and FPPE, respectively; $L_{W_i}$ is the number of LUTs for a low precision MAC. To meet the design goal of $L \ge L'$, we have the following constraint for $n$:

$n \le k \times (L_W - L_{W_i}) / L_W$  (12)

According to EQs (9)-(12), we have the following design constraint for the number of FPPEs $n$:

$k \times (L_W - L_{W_i}) / L_W \ge n \ge N \times (1 - p(W_i))$  (13)

As long as $n$ meets the design constraint above, the FPPE row can provide energy saving without incurring additional timing penalty. Without loss of generality, we select $n = 14$ in our implementation so that the FPPE row has the same width as the LPPE array.
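A small numerical sketch of the feasibility range in EQ (13); the array size, LUT counts and $p(W_i)$ used below are illustrative assumptions, not figures reported in this paper:

import math

def fppe_count_range(k, N, p_wi, lut_full, lut_low):
    # Feasible number of FPPEs n per EQ (13):
    # N*(1 - p(Wi)) <= n <= k*(L_W - L_Wi)/L_W.
    n_min = math.ceil(N * (1.0 - p_wi))                       # timing, EQ (10)
    n_max = math.floor(k * (lut_full - lut_low) / lut_full)   # area, EQ (12)
    return n_min, n_max

# Illustrative numbers only: a 14x12 LPPE array, 64 weights per PE per loop,
# 95% of weights fitting in the low precision, and a 16-bit MAC costing
# roughly twice the LUTs of an 8-bit MAC.
n_min, n_max = fppe_count_range(k=14 * 12, N=64, p_wi=0.95,
                                lut_full=280, lut_low=140)
print(n_min, n_max)   # any n in [n_min, n_max], e.g. n = 14, is feasible here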
V. EXPERIMENTAL RESULTS

… 28% compared with the baseline, thereby improving both the overall performance and energy consumption.

Table III. Performance and design overhead comparison between the baseline and the proposed architectures.

                          Storage (KB)            # LUT          Norm.
                      Weight  Ifmap   Psum     MAC    System     power
Baseline              116.8   69.3    137.7    47k    204k       1x
AlexNet   Proposed    62.0    69.5    142.9    25k    192k       0.88x
          Saving      46.9%   -0.3%   -3.8%    45.9%  6.2%       12.1%
VGG16     Proposed    58.9    69.5    142.5    23k    183k       0.88x
          Saving      49.6%   -0.3%   -3.5%    49.8%  10.5%      12.1%
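As a quick sanity check of how the Saving rows are derived, each entry is the relative reduction of the proposed design against the baseline, e.g., the 46.9% weight storage saving for AlexNet is (116.8 - 62.0)/116.8:

def saving(baseline: float, proposed: float) -> float:
    # Relative reduction versus the baseline, in percent.
    return 100.0 * (baseline - proposed) / baseline

print(round(saving(116.8, 62.0), 1))   # AlexNet weight storage -> 46.9
print(round(saving(116.8, 58.9), 1))   # VGG16 weight storage   -> 49.6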
Fig. 6 further compares the fine-granular mixed precision configurability across the different layers of the two CNNs. From the breakdown of computation precisions, it is found that the low precision is sufficient for the majority of the computations (over 95%), while a few key operations still demand full precision.
REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Proc. CVPR, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. CVPR, 2016.
[3] M. Bojarski, et al., "End to end learning for self-driving cars," Proc. CVPR, 2016.
[4] Q. Cao, N. Balasubramanian, and A. Balasubramanian, "MobiRNN: Efficient recurrent neural network execution on mobile GPU," Proc. ACM, 2017.
[5] P. Guo, et al., "FBNA: A fully binarized neural network accelerator," Proc. FPL, 2018.
[6] D. J. M. Moss, "High performance binary neural networks on the Xeon+FPGA platform," Proc. FPL, 2017.
[7] R. Cai, et al., "VIBNN: Hardware acceleration of Bayesian neural networks," Proc. ACM, 2018.
[8] Y. Ma, T. Zheng, Y. Cao, S. Vrudhula, and J. Seo, "Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs," Proc. ICCAD, 2018.
[9] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, "Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs," Proc. ACM, 2019.
[10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[11] S. Yin, et al., "An ultra-high energy-efficient reconfigurable processor for deep neural networks with binary/ternary weights in 28nm CMOS," Proc. Symp. VLSI Circuits, 2018.
[12] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI," Proc. ISSCC, 2017.
[13] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, "DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks," Proc. ISSCC, pp. 240-241, 2017.
[14] J. Lee, et al., "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2019.
[15] A. G. Howard, et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," Proc. CVPR, 2017.
[16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," Proc. CVPR, 2016.
[17] F. N. Iandola, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," Proc. CVPR, 2016.
[18] Q. He, et al., "Effective quantization methods for recurrent neural networks," Proc. CVPR, 2016.
[19] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," Proc. ICLR, 2017.
[20] L. Lai, N. Suda, and V. Chandra, "Deep convolutional neural network inference with floating-point weights and fixed-point activations," Proc. CVPR, 2017.
[21] S. Hashemi, et al., "Understanding the impact of precision quantization on the accuracy and energy of neural networks," Proc. DATE, 2017.
[22] J. H. Ko, et al., "Adaptive weight compression for memory-efficient neural networks," Proc. DATE, 2017.
[23] I. Chakraborty, et al., "Efficient hybrid network architectures for extremely quantized neural networks enabling intelligence at the edge," Proc. ACM, 2019.
[24] X. Zhang, et al., "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs," Proc. ACM, 2018.
[25] A. Aimar, et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, 2018.
[26] B. Fleischer, et al., "A scalable multi-TeraOPS deep learning processor core for AI training and inference," Proc. Symp. VLSI Circuits, 2018.
[27] S. Yin, et al., "A 1.06-to-5.09 TOPS/W reconfigurable hybrid neural network processor," Proc. Symp. VLSI Circuits, 2017.
[28] R. Hameed, et al., "Understanding sources of inefficiency in general-purpose chips," Proc. ISCA, 2010.
[29] M. Horowitz, "Computing's energy problem (and what we can do about it)," Proc. ISSCC, 2014.