A Convolutional Neural Network Accelerator Architecture
Abstract— Convolutional neural networks (CNNs) have been widely deployed in deep learning applications, especially on power-hungry GP-GPUs. Recent efforts in designing CNN accelerators are considered a promising alternative to achieve higher energy efficiency. Unfortunately, with the growing complexity of CNNs, the computational and storage resources demanded by accelerators keep increasing, hindering their wider application in mobile devices. On the other hand, many quantization algorithms have been proposed for efficient CNN training, which yield many small or zero weights. This is a unique opportunity for accelerator designers to employ much fewer bits, e.g., 4 bits, in both the arithmetic core and storage, thereby saving significant design cost. However, such a single-precision strategy inevitably compromises accuracy, as some key operations may demand a higher precision. Thus, this paper proposes a low-power CNN accelerator architecture that can simultaneously conduct computations with mixed precisions and assign the appropriate arithmetic cores to operations with different precision demands. The proposed architecture achieves significant area and energy savings without accuracy compromise. Experimental results show that the proposed architecture implemented on FPGA can reduce almost half of the weight storage and MAC area, and lower the dynamic power by 12.1% when compared with a state-of-the-art CNN accelerator design.

I. INTRODUCTION

Convolutional neural networks (CNNs) have been widely deployed in various deep learning domains, especially mobile applications [1]-[3]. Such mobile devices are commonly constrained by the available computational and storage resources and hence demand more efficient hardware implementations to achieve smaller storage, faster inference, and lower power. Unlike power-hungry GP-GPUs [4], FPGA [5]-[9] and ASIC [10]-[14] implementations are considered a more promising alternative for low-power CNN accelerators. However, as CNN algorithm complexity continues to grow and results in significant arithmetic and storage consumption, there has been substantial progress in designing lightweight CNNs to achieve a better speed and accuracy tradeoff [15]-[17]. Among such efforts, quantization and sparsification have become an active research area [18]-[20], bringing down the resources demanded for multiplications in CNNs. Through a careful training procedure, many weights can be zeroed out or quantized to fewer bits with limited accuracy loss [21]-[22]. It is then an appealing option to design the processing element (PE), i.e., the basic arithmetic unit of a CNN accelerator, only with the demanded bit widths to reduce design overhead.

Prior work [11][14][23] has studied precision reduction for fixed-point computations, achieving 4× storage saving with only 0.4-0.6% accuracy compromise. In addition, ternary and binary weight networks have been investigated for more aggressive energy reduction [6]. While such a quantized network can be effectively implemented in hardware with a single precision, it is constrained to a particular scenario and hence not applicable to varying demands [11]-[19]. To support various scenarios, many researchers investigate general-purpose accelerator architectures that can be used to deploy different CNN algorithms [11]-[14]. Reference [13] proposes to use two 8-bit multipliers in a PE to switch between 8- and 16-bit precisions. ENVISION [12] proposes a Booth-multiplier-based dynamic voltage accuracy frequency scaling technology that can be configured to 4, 8 or 16 bits. Furthermore, UNPU [14] uses serial multipliers to implement a lookup-table-based PE that enables precisions from 1 to 16 bits. In short, at the cost of additional area for PEs, control logics and storage, the configurability can provide different precisions to different neural networks, or to different layers within one neural network, thereby improving the overall energy saving [12]-[14].

Fig. 1. Distributions of effective weight precisions for different convolution layers in (a) AlexNet and (b) VGG16.

However, the precision control of all the prior work is at most at layer-wise granularity. In other words, within the same layer, all the operations use the same precision and bit-width. It has been noted that the effective precision of weights within a layer may actually vary significantly. Here, effective precision refers to the minimum number of bits needed to represent a weight without accuracy loss. For example, even if we use 16 bits to store an integer weight of 3, its effective precision is only 2 bits. Fig. 1 presents the effective weight precision distributions of different layers for two representative neural networks, AlexNet and VGG16, from [20]. While the highest effective precision is up to 16 bits, the majority of weights within the same layer only require 5 bits or even fewer. This is partially due to the robust nature of neural networks, which tend to converge to more distributed and sparse weights during the quantization-driven training procedure. The architectures in the prior work have to use the highest effective precision, i.e., the longest bit-width, for each layer to ensure accuracy. However, as Fig. 1 indicates, many bits, especially the higher-order bits, can actually be skipped to save storage or computation resources without any accuracy loss.
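As a concrete illustration of this counting (a minimal sketch, not the authors' tooling; the magnitude-only definition and the toy weights below are assumptions for illustration), the per-layer distribution of effective precisions can be estimated as follows:

from collections import Counter

def effective_precision(weight: int) -> int:
    # Minimum number of bits for the weight's magnitude,
    # e.g. a stored integer value of 3 only needs 2 bits.
    return max(1, abs(int(weight)).bit_length())

def precision_histogram(weights):
    # Fraction of weights at each effective bit-width, i.e. the kind of
    # per-layer distribution plotted in Fig. 1.
    counts = Counter(effective_precision(w) for w in weights)
    total = len(weights)
    return {bits: counts[bits] / total for bits in sorted(counts)}

# Toy layer: mostly small weights, a few large outliers.
layer = [0, 1, -2, 3, 3, -5, 7, 12, -30, 2047]
print(precision_histogram(layer))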
The ideal approach is to have fine-granular mixed precision support within a CNN layer and per-PE control of all the precisions. Then, the accelerator can select the desired bit-width for each operation to maximize the energy saving. However, such an attempt is impractical, as it demands: (1) significant resources to implement mixed precisions for all the possible bit-widths; and (2) overly complex logics to control the selected precision for each operation and each PE. The question is then: can we compromise on the number of supported precisions to achieve a feasible solution?

This is possible, but it is a non-trivial task that needs to address the following challenges:
• Since it is impractical to separately control the precision of each PE, it is highly desirable to have a universal accelerator architecture that can support within-layer and layer-wise mixed precisions for different layers and different neural networks.
• It is not clear how to select the desired precision for a given weight distribution, nor how to implement the mixed-precision PE logics needed to support a universal architecture.
• Additional control logics are needed to fetch, process and merge the results with mixed precisions, and their cost must be covered by the saving from the mixed precision support.

Thus, in this paper, we propose a general-purpose CNN accelerator architecture with fine-granular mixed precision support to address the aforementioned challenges. In particular, the contributions of this work can be summarized as below:
• Universal accelerator architecture: We separate the CNN computations in the proposed architecture into two groups, full and low precision. A row of full precision PEs (FPPEs) is added to a low precision PE (LPPE) array as the core of the accelerator. Such an architecture is applicable to different CNNs and supports simultaneous computations with mixed precisions.
• Mixed precision logic and control: A unique arbiter logic is designed to separate the queued operations into low and full precision. The full precision data is fed to an FPPE with an embedded decoder that controls the mapping and data flow.
• Design optimization and tradeoff: We provide a systematic analysis method to decide the most appropriate bit-width threshold and the design constraints for low and full precisions, so as to maximize the area saving without timing penalty.

The proposed architecture is implemented on FPGA and compared with a state-of-the-art architecture [10][23]-[27]. The experimental results show that the proposed implementation can save almost half of the weight storage and MAC area when compared with the single precision baseline accelerator. The proposal also lowers the dynamic power of the entire system by 12.1%.

II. BACKGROUND

Recently it has become a popular research topic to design a general-purpose accelerator architecture that can support different neural networks for different applications [8]-[14]. A commonly-used array-based architecture is illustrated in Fig. 2 with its top-level organization and memory hierarchy [10]. In such an architecture, weights and feature maps are repeatedly fetched from the global buffer (GLB) into the PE array for the MAC operations until all the computations of one CNN layer are completed. Clearly, most of the energy consumption comes from this repeated data movement and computation. Thus, a longer bit-width (higher precision) inevitably causes higher energy and area overhead in both transmission and computation [28]-[29].

III. PROPOSED ACCELERATOR ARCHITECTURE

A. Architecture Overview and Its Data Flow Control

In the proposed architecture with fine-granular mixed precision configurability, we deploy two types of PEs to separately process the full and low precision operations. The proposed architecture and its data flow control are illustrated in Fig. 3, which includes an array of LPPEs, a row of FPPEs, a row of adders, the GLB, a full precision buffer (FPB), and additional control logics:
• FPPE contains a full precision multiplier and storage (i.e., 16 bits in this paper) and is shared by the entire array.
• LPPE is a PE array that uses fewer bits for MAC and storage (i.e., 8 bits in this paper) than the full precision PEs.
• Adder is a row of full precision adders that collect and add the outputs of the LPPEs and FPPEs.
• GLB is exactly the same as the one in Fig. 2 and temporarily stores data for the CNN.
• FPB has the same functionality as the GLB but stores full precision data and is used by the FPPEs.
• Arbiter logic determines whether the fetched data should be sent to the FPPE or the LPPE.
In this work, the proposed architecture is designed to handle two different precisions through the LPPE and FPPE to achieve finer-granular within-layer and layer-wise mixed precision support. However, similar techniques as in [12]-[14] can always be employed to support more precisions in each PE, as the proposed techniques are general.
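As a behavioral illustration of this data flow (a sketch only, not the RTL of the design; the function names, the job-queue abstraction and the 8-bit fit test in the example are assumptions), one computation round can be modeled as the arbiter splitting the fetched weights into a low and a full precision stream, each PE type processing its own stream, and the adder row merging the partial sums:

def run_round(weights, ifmaps, needs_full_precision):
    # Behavioral model of one computation round on the architecture of Fig. 3.
    # `needs_full_precision(w)` stands for the arbiter test of Section III-C.
    lppe_jobs, fppe_jobs = [], []
    for idx, w in enumerate(weights):
        (fppe_jobs if needs_full_precision(w) else lppe_jobs).append((idx, w))

    psums = [0] * len(weights)
    for idx, w in lppe_jobs:    # low precision MACs in the LPPE array
        psums[idx] += w * ifmaps[idx]
    for idx, w in fppe_jobs:    # full precision MACs in the FPPE row (fed from the FPB)
        psums[idx] += w * ifmaps[idx]
    # The full precision adder row then merges FPPE outputs into the matching
    # LPPE columns; the merge is implicit in the shared psums list here.
    return psums

# Example: treat anything that does not fit in 8 signed bits as full precision.
print(run_round([3, -200, 7, 1024], [1, 2, 3, 4],
                lambda w: not (-128 <= w <= 127)))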
B. FPPE Design

In particular, the FPPE needs to: (1) locate the corresponding feature map for the weight fetched from the arbiter logic; (2) improve the reuse of the feature map by finding out the weights sharing the same map; and (3) deliver the corresponding partial sum that accumulates the output from the multiplier.

Fig. 4. Architecture overview of the proposed FPPE.

To find out the mapping from a weight to its feature map and then to its partial sum, a naive way is to keep track of the correspondence from one to the other, which is challenging in both computational and storage complexity. Thus, we design a decoder as the very first stage of the FPPE that decodes the aforementioned mapping from the weight index. The weight index is decoded into four parameters, Pdes_col, Pin_ifmap, Pin_fp and Pin_lppe, as described in Table I. After decoding, the Ifmap-receiver collects the feature map from the data bus using Pin_fp and Pin_lppe, and the feature map is sent to a MAC along with the corresponding weight. The Ifmaps reader then uses Pin_ifmap to decide where to reuse the feature map. A lookup table, as shown in Fig. 4, is employed to associate the feature map with the Psum and to determine the memory address that stores the MAC output according to the parameter Pdes_col. Finally, the calculated output is fed to the corresponding column of the LPPE array using Pin_lppe, where it is added to the LPPE output. This procedure is repeated until the entire convolution operation is completed.

Table I. Parameters decoded from the weight index and their definitions.

Parameter    Definition
Pdes_col     10-bit register for the destination column of the FPPE output
Pin_ifmap    2-bit register for the index of the feature map (Ifmaps)
Pin_fp       4-bit register for the index of the weight in the filter plane
Pin_lppe     10-bit register for the index of the weight in the LPPE
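To sketch what this decoding amounts to, the four parameters can be viewed as packed bit fields of the weight index. The 26-bit field order below is a hypothetical packing chosen for illustration from the register widths in Table I, not the actual encoding of the design:

from typing import NamedTuple

class DecodedIndex(NamedTuple):
    p_des_col: int   # 10 bits: destination column of the FPPE output
    p_in_ifmap: int  #  2 bits: which feature map (Ifmap) to use or reuse
    p_in_fp: int     #  4 bits: weight position within the filter plane
    p_in_lppe: int   # 10 bits: weight position within the LPPE array

def decode_weight_index(index: int) -> DecodedIndex:
    # Hypothetical 26-bit packing: [p_des_col | p_in_ifmap | p_in_fp | p_in_lppe].
    p_in_lppe = index & 0x3FF
    p_in_fp = (index >> 10) & 0xF
    p_in_ifmap = (index >> 14) & 0x3
    p_des_col = (index >> 16) & 0x3FF
    return DecodedIndex(p_des_col, p_in_ifmap, p_in_fp, p_in_lppe)

print(decode_weight_index(0b1100110011_01_0101_0000000111))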
C. Arbiter Logic for Weight Selection

The functionality of the arbiter logic module in Fig. 3 is to direct the weights that need to be represented with full precision to the FPB and the remaining weights to the GLB. Assume that a 16-bit signed fixed-point weight is fetched as $W = \{W_{15}, W_{14}, \ldots, W_1, W_0\}$. For a given bit-width threshold $W_i$, the module detects a low precision weight by checking whether the bits from $W_{14}$ down to $W_{i-1}$ are all equal to the sign bit. For example, with $W_i = 8$, the module checks whether $\{W_{14}, \ldots, W_8, W_7\}$ all equal the sign bit $W_{15}$. If so, this weight can be effectively represented by only the 8 bits $\{W_7, W_6, \ldots, W_1, W_0\}$ and is sent to the GLB for the corresponding operation. Otherwise, the weight is processed as full precision and sent to the FPB.
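The check itself is a sign-extension test on the two's-complement representation. A minimal Python sketch of the arbiter decision (the function name and return values are illustrative; the 16-bit word and $W_i = 8$ follow the example above):

def route_weight(w16: int, wi: int = 8, full: int = 16) -> str:
    # Return 'GLB' if the 16-bit signed weight fits in `wi` bits, i.e. bits
    # W14 down to W(wi-1) all equal the sign bit W15; otherwise return 'FPB'.
    bits = w16 & ((1 << full) - 1)            # two's-complement view
    sign = (bits >> (full - 1)) & 1
    upper = bits >> (wi - 1)                  # bits W15 .. W(wi-1)
    all_sign = (1 << (full - wi + 1)) - 1 if sign else 0
    return 'GLB' if upper == all_sign else 'FPB'

print(route_weight(3))        # small weight, 8 bits suffice -> 'GLB'
print(route_weight(-3))       # sign-extends cleanly         -> 'GLB'
print(route_weight(0x7FFF))   # needs the full 16 bits       -> 'FPB'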
IV. DESIGN OPTIMIZATION AND TRADEOFF

With the proposed architecture, we are able to map any CNN onto it and conduct the computation with mixed precisions. However, there remain two key questions to be answered: (1) For a given bit-width threshold, operands with a higher effective precision will be sent to the FPPEs. Thus, a bit-width threshold that is too low requires more FPPEs to be implemented, incurring additional design cost. (2) On the other hand, too few FPPEs may become the timing bottleneck of the full precision computation, while too many FPPEs cause low utilization and area waste. In this section, we provide a systematic method to optimize these two design options.

A. Bit-Width Threshold Selection

Without loss of generality, assuming the unit storage for 1 bit is normalized to 1, the storage area consumption for the single precision architecture in Fig. 2 can be roughly estimated by:

$S_{PE} = N \times W \times k$  (1)

$S_{GLB} = M \times W$  (2)

where $S_{PE}$ and $S_{GLB}$ denote the storage used in the PE array and the GLB; $W$ is the full precision bit-width; $k$ is the number of PEs in the array; $N$ and $M$ are the numbers of weights stored in a PE and in the GLB for one computation loop. Now, for the proposed architecture including the additional FPPEs, we can assume that $W_i$ is the selected bit-width threshold separating full and low precisions, and that the LPPE array has the same size as the array in Fig. 2. Then the storage consumption for the proposed architecture in Fig. 3 is:

$S'_{LPPE} = N \times k \times W_i$  (3)

$S'_{FPPE} = N \times k \times (1 - p(W_i)) \times W$  (4)

$S'_{GLB} = M \times W_i$  (5)

$S'_{FPB} = M \times (1 - p(W_i)) \times W$  (6)

where $S'_{LPPE}$, $S'_{FPPE}$, $S'_{GLB}$ and $S'_{FPB}$ denote the storage used for the LPPE, FPPE, GLB and FPB, respectively; $p(W_i)$ denotes the probability that a weight has an effective precision of $W_i$ bits or fewer, which can be characterized from the weights of the trained network [20].
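To make the model concrete, a tiny evaluation of EQs (1)-(6) with assumed numbers (the array size, buffer depth and $p(W_i)$ below are illustrative, not design figures from this paper):

def storage_model(N, M, k, W, Wi, p_wi):
    # EQs (1)-(6): storage (in bits) for the baseline and the proposed design.
    s_pe, s_glb = N * W * k, M * W                 # (1), (2)
    s_lppe = N * k * Wi                            # (3)
    s_fppe = N * k * (1.0 - p_wi) * W              # (4)
    s_glb2 = M * Wi                                # (5)
    s_fpb = M * (1.0 - p_wi) * W                   # (6)
    return (s_pe, s_glb), (s_lppe + s_fppe, s_glb2 + s_fpb)

# Illustrative only: 168 PEs, 64 weights per PE, 16384 weights in the GLB,
# W = 16, Wi = 8, and 95% of the weights fitting in the low precision.
base, prop = storage_model(N=64, M=16384, k=168, W=16, Wi=8, p_wi=0.95)
print(base, prop)   # proposed storage is about 55% of the baseline (~45% saving)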
An example for AlexNet and VGG16 is shown in Fig. 5, where the marked curves give the percentage of weights (Y-axis) that can be effectively represented with a particular bit-width (X-axis). It is clear that most layers only need a small bit-width for almost all the weights. The relative storage saving for the LPPE array can then be approximately calculated from EQs (1)-(6) and simplified to:

$\alpha \approx \frac{S_{PE} - S'_{LPPE} - S'_{FPPE}}{S_{PE}} = p(W_i) - \frac{W_i}{W}$  (7)

Similarly, the relative storage saving for the GLB is:

$\beta \approx \frac{S_{GLB} - S'_{GLB} - S'_{FPB}}{S_{GLB}} = p(W_i) - \frac{W_i}{W} = \alpha$  (8)

Thus, we can define an auxiliary function $g(W_i) = W_i / W$. To achieve storage saving across different layers, we need to choose a $W_i$ that yields $p(W_i) \ge g(W_i)$. In other words, as in Fig. 5, $W_i$ needs to keep all the marked curves above the red line. With such a strategy, we can define the appropriate bit-width threshold $W_i$ for the underlying CNN.
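Under the assumptions of EQs (1)-(8), choosing the threshold reduces to comparing the cumulative distribution $p(W_i)$ against $g(W_i) = W_i/W$. The sketch below applies one possible selection policy consistent with this criterion, maximizing the worst-case saving $\alpha$ over a set of layers; the helper definitions and toy weights are illustrative assumptions, and the paper itself reads the threshold off Fig. 5 rather than computing it this way:

def effective_precision(w: int) -> int:
    # Minimum bits for the weight's magnitude (e.g. 3 -> 2 bits).
    return max(1, abs(int(w)).bit_length())

def p_of(wi: int, weights) -> float:
    # p(Wi): fraction of weights whose effective precision is <= Wi bits.
    return sum(effective_precision(w) <= wi for w in weights) / len(weights)

def pick_threshold(layers, full_width: int = 16) -> int:
    # One possible policy: pick the Wi that maximizes the worst-case relative
    # saving alpha = p(Wi) - Wi/W over all layers, subject to p(Wi) >= g(Wi).
    best_wi, best_alpha = full_width, 0.0
    for wi in range(1, full_width + 1):
        alpha = min(p_of(wi, layer) - wi / full_width for layer in layers)
        if alpha >= best_alpha:
            best_alpha, best_wi = alpha, wi
    return best_wi

# Toy weight distributions for two layers.
layers = [[0, 1, -2, 3, 5, -7, 100], [2, -3, 4, 6, -9, 12, 2000]]
print(pick_threshold(layers))   # -> 4 with this toy data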
B. Optimizing FPPE for Design Tradeoff

With the given bit-width threshold, we still need to decide the number of FPPEs in the array to provide matched timing between the LPPE and FPPE computations. Similar to the previous section, by assuming one MAC per PE, a PE in the architecture of Fig. 2 approximately requires $T = N$ cycles to process its $N$ weights, where $T$ is the overall computation time. Then, for the proposed architecture with FPPEs, the computation time of the FPPE row is:

$T' = k \times N \times (1 - p(W_i)) / n$  (9)

where $n$ is the number of FPPEs in the array. To achieve $T \ge T'$, $n$ needs to meet the following constraint:

$n \ge N \times (1 - p(W_i))$  (10)

Moreover, the area of the computation logics for the PE array in Fig. 2 can be approximated by $L = L_W \times k$, where $L_W$ is the number of LUTs (or area) for the MAC.
Now, for the proposed architecture in Fig. 3, the area of its computation logics is:

$L' = L'_{LPPE} + L'_{FPPE} = k \times L_{W_i} + n \times L_W$  (11)

where $L'_{LPPE}$ and $L'_{FPPE}$ are the computation logic areas for the LPPE and FPPE, respectively; $L_{W_i}$ is the number of LUTs for a low precision MAC. To meet the design goal of $L \ge L'$, we have the following constraint for $n$:

$n \le k \times (L_W - L_{W_i}) / L_W$  (12)

According to EQs (9)-(12), we have the following design constraint for the number of FPPEs $n$:

$k \times (L_W - L_{W_i}) / L_W \ge n \ge N \times (1 - p(W_i))$  (13)

As long as $n$ meets the design constraint above, the FPPE row can provide energy saving without incurring additional timing penalty. Without loss of generality, we select $n = 14$ in our implementation so that the FPPE row has the same width as the LPPE array.
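A small numerical sketch of the feasibility range in EQ (13); the array size, LUT counts and $p(W_i)$ used below are illustrative assumptions, not figures reported in this paper:

import math

def fppe_count_range(k, N, p_wi, lut_full, lut_low):
    # Feasible number of FPPEs n per EQ (13):
    # N*(1 - p(Wi)) <= n <= k*(L_W - L_Wi)/L_W.
    n_min = math.ceil(N * (1.0 - p_wi))                       # timing, EQ (10)
    n_max = math.floor(k * (lut_full - lut_low) / lut_full)   # area, EQ (12)
    return n_min, n_max

# Illustrative numbers only: a 14x12 LPPE array, 64 weights per PE per loop,
# 95% of weights fitting in the low precision, and a 16-bit MAC costing
# roughly twice the LUTs of an 8-bit MAC.
n_min, n_max = fppe_count_range(k=14 * 12, N=64, p_wi=0.95,
                                lut_full=280, lut_low=140)
print(n_min, n_max)   # any n in [n_min, n_max], e.g. n = 14, is feasible here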
V. EXPERIMENTAL RESULTS

… 28% compared with the baseline, thereby improving both the overall performance and energy consumption.

Table III. Performance and design overhead comparison between the baseline and the proposed architectures.

                          Storage (KB)            # LUT          Norm.
                      Weight  Ifmap   Psum     MAC    System     power
Baseline              116.8   69.3    137.7    47k    204k       1x
AlexNet   Proposed    62.0    69.5    142.9    25k    192k       0.88x
          Saving      46.9%   -0.3%   -3.8%    45.9%  6.2%       12.1%
VGG16     Proposed    58.9    69.5    142.5    23k    183k       0.88x
          Saving      49.6%   -0.3%   -3.5%    49.8%  10.5%      12.1%
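As a quick sanity check of how the Saving rows are derived, each entry is the relative reduction of the proposed design against the baseline, e.g., the 46.9% weight storage saving for AlexNet is (116.8 - 62.0)/116.8:

def saving(baseline: float, proposed: float) -> float:
    # Relative reduction versus the baseline, in percent.
    return 100.0 * (baseline - proposed) / baseline

print(round(saving(116.8, 62.0), 1))   # AlexNet weight storage -> 46.9
print(round(saving(116.8, 58.9), 1))   # VGG16 weight storage   -> 49.6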
Fig. 6 further compares the fine-granular mixed precision configurability across the different layers of the two CNNs. From the breakdown of computation precisions, it is found that the low precision is sufficient for the majority of the computations (over 95%), while a few key operations still demand full precision.
REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Proc. CVPR, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. CVPR, 2016.
[3] M. Bojarski, et al., "End to end learning for self-driving cars," Proc. CVPR, 2016.
[4] Q. Cao, N. Balasubramanian, and A. Balasubramanian, "MobiRNN: Efficient recurrent neural network execution on mobile GPU," Proc. ACM, 2017.
[5] P. Guo, et al., "FBNA: A fully binarized neural network accelerator," Proc. FPL, 2018.
[6] D. J. M. Moss, "High performance binary neural networks on the Xeon+FPGA platform," Proc. FPL, 2017.
[7] R. Cai, et al., "VIBNN: Hardware acceleration of Bayesian neural networks," Proc. ACM, 2018.
[8] Y. Ma, T. Zheng, Y. Cao, S. Vrudhula, and J. Seo, "Algorithm-hardware co-design of single shot detector for fast object detection on FPGAs," Proc. ICCAD, 2018.
[9] Y. Chen, J. He, X. Zhang, C. Hao, and D. Chen, "Cloud-DNN: An open framework for mapping DNN models to cloud FPGAs," Proc. ACM, 2019.
[10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.
[11] S. Yin, et al., "An ultra-high energy-efficient reconfigurable processor for deep neural networks with binary/ternary weights in 28nm CMOS," Proc. Symp. VLSI Circuits, 2018.
[12] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI," Proc. ISSCC, 2017.
[13] D. Shin, J. Lee, J. Lee, and H.-J. Yoo, "DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks," Proc. ISSCC, pp. 240-241, 2017.
[14] J. Lee, et al., "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2019.
[15] A. G. Howard, et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," Proc. CVPR, 2017.
[16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," Proc. CVPR, 2016.
[17] F. N. Iandola, et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," Proc. CVPR, 2016.
[18] Q. He, et al., "Effective quantization methods for recurrent neural networks," Proc. CVPR, 2016.
[19] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," Proc. ICLR, 2017.
[20] L. Lai, N. Suda, and V. Chandra, "Deep convolutional neural network inference with floating-point weights and fixed-point activations," Proc. CVPR, 2017.
[21] S. Hashemi, et al., "Understanding the impact of precision quantization on the accuracy and energy of neural networks," Proc. DATE, 2017.
[22] J. H. Ko, et al., "Adaptive weight compression for memory-efficient neural networks," Proc. DATE, 2017.
[23] I. Chakraborty, et al., "Efficient hybrid network architectures for extremely quantized neural networks enabling intelligence at the edge," Proc. ACM, 2019.
[24] X. Zhang, et al., "DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs," Proc. ACM, 2018.
[25] A. Aimar, et al., "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, 2018.
[26] B. Fleischer, et al., "A scalable multi-TeraOPS deep learning processor core for AI training and inference," Proc. Symp. VLSI Circuits, 2018.
[27] S. Yin, et al., "A 1.06-to-5.09 TOPS/W reconfigurable hybrid neural network processor," Proc. Symp. VLSI Circuits, 2017.
[28] R. Hameed, et al., "Understanding sources of inefficiency in general-purpose chips," Proc. ISCA, 2010.
[29] M. Horowitz, "Computing's energy problem (and what we can do about it)," Proc. ISSCC, 2014.