Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework

∗ Equal contribution.

Abstract—Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes to different rows of the weight matrix. It is motivated by (1) the distributions of the weights in different rows are not the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. Then, to fully explore the resources, we propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase, accuracy due to better matching with weight distributions. For the FPGA implementations, we develop a parameterized architecture with heterogeneous Generalized Matrix Multiplication (GEMM) cores—one using LUTs for computations with SP2 quantized weights and the other utilizing DSPs for fixed-point quantized weights. Given the partition ratio among the two schemes based on resource characterization, the MSQ quantization training algorithm derives an optimally quantized model for the FPGA implementation. We evaluate our FPGA-centric quantization framework across multiple application domains. With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1×−4.1× compared to solely exploiting DSPs for all multiplication operations. In addition, the CNN implementations with the proposed MSQ scheme can achieve higher accuracy and comparable hardware utilization efficiency compared to the state-of-the-art designs.

Index Terms—deep neural network, quantization, FPGA, inference

I. INTRODUCTION

Deep learning or Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains [1]–[7]. However, the state-of-the-art DNNs may require up to GBs (gigabytes) for model size and $10^2$ GFLOPs (giga floating-point operations) for inference computation, making it a challenging task to perform on-device inference.

To efficiently execute the diverse DNN inference models for broader applications, the resource-constrained edge-computing platforms require two crucial supports. The first one is the specialized hardware acceleration for DNN inference. Extensive research efforts have been dedicated to the efficient implementations of DNN inference models on various edge-computing platforms, such as ASICs [8]–[14], FPGAs [15]–[18], and embedded CPUs/GPUs [19]–[23].

The second is the DNN model compression technique, which not only seeks more efficient hardware implementation based on given models, but also explores the opportunity of algorithm and hardware co-design to achieve better trade-offs among accuracy, hardware cost, and performance. There are two essential techniques for model compression: DNN weight pruning [24]–[30] and weight quantization [31]–[47].

This paper focuses on DNN weight quantization, which becomes imperative to DNN hardware acceleration, especially on the FPGA and ASIC platforms. By representing weights with fewer bits, weight quantization can directly simplify the implementations and accelerate the inference execution speed in a hardware-friendly manner. Also, it is supported in GPUs (e.g., PyTorch [22] for NVIDIA GPUs) and mobile devices (e.g., TensorFlow-Lite [23]). In addition, weight quantization yields far less training overhead than weight pruning, let alone the training-heavy network architecture search (NAS)-based model compression techniques. Specifically, in state-of-the-art DNN quantization methods (including our work), the retraining process usually takes 1/3 ∼ 1/2 of the epochs of the pre-training process, which is an acceptable training overhead in exchange for significant inference speedup.

Weight quantization can be considered as a mapping from 32-bit floating-point weights into m-bit weight representations. There are different types of quantization schemes including binary [31]–[34], ternary [35]–[37], low-bit-width fixed-point [38]–[43], and power-of-2 [44]–[47]. In general, binary and ternary quantization schemes result in significant accuracy loss, for example, > 5% under binary and 2%−3% under ternary quantization. The fixed-point quantization can represent the DNN weights using low bit-width, e.g., 4-bit, with negligible accuracy loss. To further simplify hardware implementations, the power-of-2 quantization scheme was proposed to replace the
ods/algorithms by DoReFa-Net [38], PACT [39], DSQ [40], QIL [41], μL2Q [42], and LSQ [43].

With the m-bit fixed-point scheme, quantized weight values are defined as the scaling factor α times quantization levels:

$$Q_{FP}(m, \alpha) = \pm\alpha \times \Big\{0, \tfrac{1}{2^{m-1}-1}, \tfrac{2}{2^{m-1}-1}, \ldots, 1\Big\}. \quad (1)$$

And the mapping from a 32-bit floating-point weight w into the quantized weight ŵ by the m-bit fixed-point representation (in sign-magnitude) is given by the following quantizer:

$$\hat{w} = \lfloor w \rceil_{Q_{FP}(m,\alpha)} = \alpha \cdot h^{-1}\!\Big(\tfrac{1}{2^{m}-1}\,\mathrm{round}\big((2^{m}-1)\cdot h(\lfloor w, \alpha\rceil)\big)\Big), \quad (2)$$

where $\lfloor \cdot \rceil_{Q_{FP}(m,\alpha)}$ denotes the quantizer function that projects onto $Q_{FP}(m,\alpha)$; the function h(·) transforms a value within [−1, +1] into the range of [0, 1], for example we can use h(·) = tanh(·)/2 + 0.5; and $\lfloor w, \alpha\rceil$ clips w according to

$$\lfloor w, \alpha\rceil = \begin{cases} -1, & w < -\alpha \\ w/\alpha, & -\alpha \le w \le \alpha \\ 1, & w > \alpha. \end{cases} \quad (3)$$
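For illustration, the fixed-point quantizer of Eqs. (1)–(3) can be sketched in a few lines of PyTorch. This is an expository sketch only, not the training code used in our experiments; the helper names and the choice of a linear h are ours (the paper mentions tanh(·)/2 + 0.5 as one admissible h).

    import torch

    def h(x):
        # Map [-1, 1] into [0, 1]; the linear choice x/2 + 0.5 is used here for simplicity.
        return x / 2 + 0.5

    def h_inv(y):
        # Inverse of the linear h, mapping [0, 1] back to [-1, 1].
        return 2 * y - 1

    def clip(w, alpha):
        # Eq. (3): scale by alpha and clip to [-1, 1].
        return torch.clamp(w / alpha, -1.0, 1.0)

    def fixed_point_quantize(w, m=4, alpha=None):
        # Eq. (2): project weights onto the m-bit fixed-point grid Q_FP(m, alpha).
        if alpha is None:
            alpha = w.abs().max()      # one simple scaling-factor choice, for illustration
        levels = 2 ** m - 1
        y_q = torch.round(levels * h(clip(w, alpha))) / levels
        return alpha * h_inv(y_q)

Calling fixed_point_quantize(torch.randn(64, 64), m=4) then restricts every weight to the uniform grid of Eq. (1) (up to the choice of h and scaling factor).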
2) Non-Uniform Interval Quantization Schemes: On the other hand, power-of-2 quantization is a non-uniform interval quantization scheme, with representative methods including [44]–[47]. Power-of-2 quantization replaces multiplications by bit-shifting operations, and this number system also possesses higher precision around the mean, which fits the Gaussian distribution of DNN weights better [48], [49]. With an m-bit weight representation (in sign-magnitude), the quantized weight values by the power-of-2 scheme are defined as

$$Q_{P2}(m, \alpha) = \pm\alpha \times \Big\{0, \tfrac{1}{2^{2^{m-1}-2}}, \tfrac{1}{2^{2^{m-1}-3}}, \ldots, 1\Big\}. \quad (4)$$

And the power-of-2 quantizer is then given by

$$\hat{w} = \lfloor w \rceil_{Q_{P2}(m,\alpha)} = \begin{cases} \alpha \cdot h^{-1}\!\big(2^{\mathrm{round}(\log_2 h(\lfloor w, \alpha\rceil))}\big), & h(\lfloor w, \alpha\rceil) > 2^{-2^{m-1}+1} \\ 0, & h(\lfloor w, \alpha\rceil) \le 2^{-2^{m-1}+1}. \end{cases} \quad (5)$$

With weights quantized into the power-of-2 scheme, multiplications between a weight, i.e., $2^b$ ($b \in \mathbb{Z}$), and an activation a can be implemented by bit shifting as follows:

$$2^b \times a = \begin{cases} a \ll b, & b > 0 \\ a, & b = 0 \\ a \gg (-b), & b < 0. \end{cases} \quad (6)$$
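To make Eqs. (5) and (6) concrete, the sketch below gives a simplified power-of-2 projection (it drops the h(·) transform for readability) together with the shift-based multiplication; both are illustrative approximations under our own simplifications, not the exact formulation above.

    import torch

    def power_of_2_quantize(w, m=4, alpha=None):
        # Simplified Eq. (5): snap |w|/alpha to the nearest power of 2, or to 0 below the
        # smallest representable level; sign and scaling factor are restored afterwards.
        if alpha is None:
            alpha = w.abs().max()
        x = torch.clamp(w.abs() / alpha, 0.0, 1.0)
        thresh = 2.0 ** (-(2 ** (m - 1)) + 1)
        x_q = torch.where(x > thresh, 2.0 ** torch.round(torch.log2(x)), torch.zeros_like(x))
        return alpha * torch.sign(w) * x_q

    def mul_by_power_of_2(a: int, b: int) -> int:
        # Eq. (6): compute (2**b) * a with a single shift instead of a multiplier.
        if b > 0:
            return a << b
        if b == 0:
            return a
        return a >> (-b)

For example, with a weight magnitude of 2**(-3) = 0.125 and an integer activation a = 96 (after scaling), mul_by_power_of_2(96, -3) yields 96 >> 3 = 12.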
Although the power-of-2 quantization scheme can simplify hardware implementation by eliminating multiplications, its precision cannot be increased effectively with increasing m, because increasing m will merely increase the resolution around the mean, while the tails are still in low precision. This can also be observed from Eq. (5): when w is a large value, increasing m does not have an effect on ŵ. In practice, 3 ∼ 7 bits are usually used for power-of-2 quantization, and more bits could not further promote the accuracy of the quantized models. As mentioned in §II-A1, 4-bit fixed-point results in negligible accuracy degradation, but 4-bit power-of-2 quantization will result in an accuracy loss of 1% − 2%.

B. Quantization Algorithms

Quantization performs a projection from the continuous domain to a discrete number system, which makes the gradients of the loss function unavailable for backpropagation during training. Two approaches can be applied to solving this unavailable-gradient issue. One is employing a Straight Through Estimator (STE) [50], [51] to set the gradient to the constant value of 1 as

$$\text{Forward:}\ \ y = \mathrm{round}(x), \qquad \text{Backward:}\ \ \frac{\partial y}{\partial x} = \mathbf{1}_{x \in \mathbb{R}}, \quad (7)$$

which is effective in quantization training. The other approach employs the Alternating Direction Method of Multipliers (ADMM) to iteratively solve for the parameters with a target quantization scheme as the optimization constraint [47], eliminating the need to backpropagate through the quantizer. In this work, we use a combination of ADMM and STE, as shown in Algorithm 1, which in general follows the ADMM algorithm for weight quantization and where the STE is only applied for activation quantization.

Algorithm 1: DNN Quantization with ADMM and STE
    input : 32-bit floating-point DNN model M, with weights W to be quantized.
            Quantization scheme: S ∈ {Fixed-point, Power-of-2, Sum-of-power-of-2}
    target: Quantized model M̂
    // Initialization:
    U^0 = 0; Z^0 = W;
    foreach Epoch do
        // Update Z, U:
        Z^t ← proj_S(W + U^{t−1});
        U^t ← W − Z^t + U^{t−1};
        foreach Batch do
            // STE for activation quantization:
            input ← proj_S(input);
            loss ← M(input);
            loss ← loss + 1/2 ‖W − Z^t + U^t‖^2;
            Backpropagate loss and update W;
    Return M̂ ← M{proj_S(W)}.
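A minimal PyTorch rendering of Algorithm 1 for a single quantized weight tensor is sketched below. It is a sketch under our own simplifications (a one-tensor model such as nn.Linear, a penalty coefficient rho that Algorithm 1 leaves implicit, and names RoundSTE, quantize_act_ste, admm_step chosen for exposition), not the released training code.

    import torch
    import torch.nn.functional as F

    class RoundSTE(torch.autograd.Function):
        # Eq. (7): round in the forward pass, pass the gradient straight through backward.
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)
        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    def quantize_act_ste(x, m=4):
        # STE-based activation quantization on an assumed [0, 1] range (uniform m-bit grid).
        levels = 2 ** m - 1
        return RoundSTE.apply(x.clamp(0.0, 1.0) * levels) / levels

    def admm_step(model, loader, optimizer, proj, state, rho=1e-3):
        # One outer iteration of Algorithm 1 (run once per epoch) for model.weight.
        W = model.weight
        U = state.get("U", torch.zeros_like(W))
        Z = proj(W.detach() + U)                 # Z^t <- proj_S(W + U^{t-1})
        U = W.detach() - Z + U                   # U^t <- W - Z^t + U^{t-1}
        state["U"] = U
        for x, y in loader:                      # inner loop over mini-batches
            loss = F.cross_entropy(model(quantize_act_ste(x)), y)
            loss = loss + 0.5 * rho * (W - Z + U).pow(2).sum()   # ADMM penalty term
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Here proj can be any of the projections above (e.g., fixed_point_quantize); only the activations are quantized with the STE, while the weights are pulled toward the quantized set through the ADMM penalty, matching the division of labor stated in the text.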
III. SUM-OF-POWER-OF-2 (SP2) QUANTIZATION SCHEME

In this section, we propose a new hardware-friendly sum-of-power-of-2 (SP2) quantization scheme, which enjoys multiplication-free operations for the inference computation, as the binary, ternary, and power-of-2 schemes do, while achieving negligible inference accuracy degradation.
TABLE I. Analysis on the operations for weight-activation multiplication by two quantization schemes of the weights.
TABLE II. Results from different quantization schemes for the ResNet-18 and MobileNet-v2 DNN models on CIFAR10, CIFAR100, and ImageNet datasets.

shifted operands. Since $b_1$ and $b_2$ are encoded by $m_1$- and $m_2$-bit unsigned integers, respectively, Operations (1) and (2) can shift by at most $2^{m_1}-2$ and $2^{m_2}-2$ bits, respectively. The shifted activation operands will be $n + 2^{m_1}-2$ and $n + 2^{m_2}-2$ bits, respectively. Therefore, one $(n + 2^{m_1}-2)$-bit addition is needed. In summary, with SP2 weight quantization, the weight-activation multiplication can be implemented with two shift operations and one addition operation.
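As an illustration of this two-shift, one-add datapath, the sketch below multiplies an integer activation by an SP2 weight whose magnitude is assumed to be of the form 2^{b1} + 2^{b2}; the exact exponent encoding, sign handling, and scaling factor of the hardware are not reproduced here.

    def sp2_mac(a: int, b1: int, b2: int) -> int:
        # Product of activation a with an SP2 weight magnitude (2**b1 + 2**b2), realized
        # with two shifts and one addition; the scaling factor alpha and the sign are
        # assumed to be handled separately (e.g., folded into the accumulator).
        return (a << b1) + (a << b2)

    # Example: weight magnitude 2**3 + 2**1 = 10, activation a = 7:
    # (7 << 3) + (7 << 1) = 56 + 14 = 70 = 10 * 7.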
For ImageNet, both the fixed-point (Fixed) and sum-of-power-of-2 (SP2) schemes have negligible accuracy loss, ≤ 0.41% for ResNet-18 and ≤ 0.62% for MobileNet-v2 across the three datasets. These two schemes achieve comparable accuracy of quantized models. In summary, the 4-bit-width Fixed and SP2 quantization schemes are essentially equivalent in terms of the accuracy of the quantized models, and their accuracy losses are negligible.
fixed-point quantization should be used. Thus, the mixed scheme is necessary at algorithm level—it can achieve similar or even potentially higher accuracy than existing schemes. Second, our approach also leads to a better utilization of the heterogeneous resources available in FPGAs—weights based on the two schemes can be managed by LUT and DSP resources, respectively. Specifically, the operations involving SP2 quantized weights should be implemented by LUTs, while those with fixed-point quantized weights can leverage the DSPs, the more limited resource on FPGAs for DNN hardware accelerators. Overall, our MSQ achieves a sweet design spot with both high accuracy and high processing throughput, thanks to the high and optimized utilization of both LUTs and DSPs.

B. Algorithm

In MSQ, each row in a weight matrix should employ either the SP2 or the fixed-point scheme. To determine the scheme for each row, the weight variances of all the rows are calculated. We define a threshold θ for the variances, such that for the rows with variances smaller than the threshold, the SP2 quantization is employed; otherwise, the fixed-point scheme is applied. By setting the proper threshold θ, the desired partition ratio of SP2 to fixed-point can be achieved with improved FPGA resource utilization. Algorithm 2 provides the details.

Algorithm 2: FPGA-Centric Mixed Scheme Quantization (MSQ)
    input : 32-bit floating-point DNN model M, with weights W to be quantized.
    target: Quantized model M̂
    // Initialization:
    U^0 = 0; Z^0 = W;
    Partition rate PR_SP2 from FPGA resource characterization;
    S_f = Fixed-point; S_p = SP2;
    foreach Epoch do
        Calculate variance v_r^(l) for each r-th row of the layer-l weight matrix W^(l);
        Sort v_{1:R}^(l) to obtain the threshold θ^(l) such that a fraction PR_SP2 of the rows have variances less than θ^(l);
        if v_r^(l) < θ^(l) then S ← S_p; else S ← S_f;
        // Update Z, U:
        Z^t ← proj_S(W + U^{t−1});
        U^t ← W − Z^t + U^{t−1};
        foreach Batch do
            input ← proj_S(input);
            loss ← M(input);
            loss ← loss + 1/2 ‖W − Z^t + U^t‖^2;
            Backpropagate loss and update W;
    Return M̂ ← M{proj_S(W)}.
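The row-wise scheme assignment in Algorithm 2 can be expressed compactly. The sketch below is our illustration rather than the released training code: it computes per-row variances and a quantile threshold for a given partition rate, then applies the corresponding projection to each group of rows.

    import torch

    def assign_row_schemes(W, pr_sp2=2/3):
        # Per-row variances of a [rows, cols] weight matrix W.
        row_var = W.var(dim=1)
        # Threshold theta such that a fraction pr_sp2 of rows (the low-variance ones) use SP2.
        theta = torch.quantile(row_var, pr_sp2)
        use_sp2 = row_var < theta
        return use_sp2, theta

    def project_mixed(W, use_sp2, proj_sp2, proj_fixed):
        # proj_S in Algorithm 2: rows flagged for SP2 use the SP2 projection,
        # the remaining rows use the fixed-point projection.
        Z = torch.empty_like(W)
        Z[use_sp2] = proj_sp2(W[use_sp2])
        Z[~use_sp2] = proj_fixed(W[~use_sp2])
        return Z

Setting pr_sp2 = 2/3, for example, corresponds to the 2 : 1 SP2/fixed-point partition ratio evaluated later in this section.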
The optimal ratio of SP2 to fixed-point is determined by the available resources on FPGA devices and the resource utilization required to support the design. Generally, the utilization factor of DSPs should be maintained at 100% to take full advantage of the DSP resources for the fixed-point multiplications. When only fixed-point quantization is applied, the LUT utilization is low even though the DSP utilization reaches the maximum. Incorporating the SP2 quantization can increase the LUT utilization, and therefore enhance the throughput. The exploration of the optimal ratio of SP2 to fixed-point among the weight matrix rows is elaborated in §VI.

C. Accuracy Results

1) Experiment Setup: We evaluate our MSQ in three application domains, i.e., image classification with convolutional neural networks (CNNs); object detection and recognition with YOLO-v3; and machine translation, speech recognition, and sentiment classification with recurrent neural networks (RNNs). We use no extra data augmentations in our quantization, other than those already employed for training the 32-bit floating-point baseline models. Our quantization training algorithm uses step or cosine learning rate decay and ℓ2 regularization, following the training algorithms of the baseline models. Our quantization algorithms are implemented with the PyTorch framework on NVIDIA TITAN RTX GPUs and GeForce RTX 2080Ti GPUs.

For image classification, we evaluate the deep residual net (ResNet-18) [52], which is a widely used model for computer vision tasks, as well as the lightweight MobileNet-v2 model [53]. We test on the CIFAR10 [54], CIFAR100 [54], and ImageNet ILSVRC-2012 [55] datasets. DNN models for the CIFAR10 and CIFAR100 datasets are trained from scratch and quantized for 150 epochs. For the ImageNet dataset, pre-trained models in 32-bit floating-point are used and quantized for 90 epochs. The initial learning rate is 8e−3 for CIFAR10, 4e−3 for CIFAR100, and 5e−4 for ImageNet.

For object detection, we explore the implementation of a fully convolutional neural network (FCNN) called YOLO-v3 [56] on the MS COCO 2014 [57] dataset. The learning rate starts from 1e−2 and decays to 5e−4 with cosine annealing. We evaluate the mean Average Precision (mAP) at an IoU threshold value of 0.5 (mAP@0.5), as well as the average mAP over the IoU threshold range from 0.5 to 0.95 (mAP@(0.5 : 0.95)).

For RNNs, we evaluate three networks. The first one is an LSTM network with 256 hidden neurons in two layers [58] on the Penn Tree Bank (PTB) [59] dataset for the machine translation application, with perplexity (PPL) as the evaluation metric (lower PPL is better). The second is a network based on GRU with 1024 hidden neurons in two layers [60] on the TIMIT acoustic-phonetic continuous speech corpus [61] dataset for the speech recognition application. The evaluation metric is Phoneme Error Rate (PER), and lower PER is better. Finally, we use another LSTM network with three hidden layers, each having 512 neurons, on the IMDB [62] dataset for sentiment classification. Our learning rate is 1e−3 for all the RNNs.

2) Result Analysis: Tables II, III, and IV summarize the quantization results for image classification. Table II compares different quantization schemes including power-of-2 (P2), fixed-point (Fixed), sum-of-power-of-2 (SP2), and our mixed scheme quantization (MSQ). Two partitioning ratios are tested for MSQ, the first one being PR_SP2:Fixed = 1 : 1, and the second one being PR_SP2:Fixed = 2 : 1 that is the optimal
TABLE III. Comparisons with existing works with ResNet-18 model on ImageNet dataset.

    Methods         Bit-width (W/A)   Top-1 (%)   Top-5 (%)
    Baseline (FP)   32/32             69.76       89.08
    Dorefa [38]     4/4               68.10       88.10
    PACT [39]       4/4               69.20       89.00
    DSQ [40]        4/4               69.56       N/A
    QIL [41]        4/4               70.10       N/A
    μL2Q [42]       4/32              65.92       86.72
    LQ-NETS [44]    4/4               69.30       88.80
    MSQ             4/4               70.27       89.42

TABLE V. YOLO-v3 on COCO 2014 dataset with 4-bit quantization (8× compression rate).

    Image Size   Scheme          mAP@(0.5:0.95)   mAP@0.5
    320          Baseline (FP)   37.7             56.8
    320          MSQ             35.8             53.9
    640          Baseline (FP)   45.6             64.7
    640          MSQ             44.1             64.8

TABLE VI. RNN on machine translation, speech recognition, and sentiment classification.
For the IMDB dataset, EQM loses nearly 1% accuracy while MSQ only loses 0.06% accuracy. Note that we have not found any DNN quantization works investigating the TIMIT dataset, so we could not compare with existing works on TIMIT.

V. FPGA IMPLEMENTATION: DESIGN AND OPTIMIZATION

Besides obtaining the accuracy advantage, the proposed MSQ, assembling the fixed-point and SP2 quantization schemes, significantly promotes the efficiency of the FPGA deployment. Specifically, the newly joined SP2 quantization provides two apparent advantages in the hardware aspect: (i) the multiplication arithmetic involving the SP2 quantized weights can be implemented with a simple logic shifter and adder, instead of the conventional multiplier; and (ii) since the FPGA underlying components include DSPs and LUTs, the remaining LUTs can be leveraged for computations with SP2 weights while the DSPs are simultaneously fully utilized for conventional multiplications. Therefore, with the proposed MSQ as an ensemble of fixed-point and SP2, the same device can possibly deliver higher performance than existing designs, in which the throughput is theoretically bounded by the DSP count.

This section addresses the hardware design challenges with mixed number systems. Please note that the hardware benefit from SP2 is orthogonal to prior research efforts (e.g., dataflow [65] and locality [66] optimization), and therefore can be employed by any existing DNN accelerator.

A. FPGA Resource Characterization

Fig. 2. Resource ratio of different FPGA devices. For each device, the LUT, FF, and BRAM numbers are all normalized with respect to the DSP number.

FPGA devices provide different types of resources, i.e., DSP, LUT, BRAM, and FF, for computation and storage, and the resource amount ratios vary across FPGA devices. Figure 2 presents the resource ratios of Zynq series devices (each device name starts with "XC", which is omitted for simplicity), with each bar normalized by the DSP count on the corresponding device. The ratio of LUTs to DSPs attracts our attention, since this number directly decides the building block for multiplications with fixed-point and SP2 quantized weights, respectively. Apparently, the ratio of LUT/DSP in the XC7Z045/XC7Z020 devices is larger than that in the XCZU4CG/XCZU5CG devices. This also occurs in FPGA devices of other types. Specifically, since the multiplications with fixed-point and SP2 weights consume the DSPs and LUTs, respectively, the LUT/DSP ratio decides the parallel PE counts for these two operation types. For different devices, we select different proper ratios of PE counts for fixed-point and SP2 according to the available resource amount. Importantly, the PE ratio is used as the desired SP2/fixed-point ratio and sent to Algorithm 2 to obtain the properly quantized models with the novel MSQ scheme.
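The resource characterization step can be summarized by a small calculation. The sketch below is a back-of-the-envelope model with made-up per-PE costs (luts_per_sp2_pe is a hypothetical figure, not a synthesis result), showing how a LUT/DSP ratio translates into a target SP2/fixed-point partition rate handed to Algorithm 2.

    def sp2_fixed_pe_ratio(num_luts, num_dsps, luts_per_sp2_pe=60, dsps_per_fixed_pe=1):
        # Hypothetical cost model: each fixed-point PE consumes one DSP, each SP2 PE
        # (shift + add) consumes some number of LUTs. The achievable PE counts then
        # give the desired SP2 partition rate PR_SP2 for Algorithm 2.
        fixed_pes = num_dsps // dsps_per_fixed_pe
        sp2_pes = num_luts // luts_per_sp2_pe
        pr_sp2 = sp2_pes / (sp2_pes + fixed_pes)
        return fixed_pes, sp2_pes, pr_sp2

    # Example with approximate counts for an XC7Z020-class device (illustrative only):
    # fixed, sp2, pr = sp2_fixed_pe_ratio(num_luts=53_200, num_dsps=220)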
B. Architecture with Heterogeneous GEMM Engines

This section provides a design based on the versatile tensor accelerator (VTA) [67]. The hardware framework contains four modules as shown in Figure 3(a), where the Instruction module loads the instructions and provides control signals to the other modules. The Load and Store modules control the input/output activation and weight data communication between on-chip buffers and DRAM. The Compute module executes the workloads, with the RegFile as the scratchpad memory for partial sum accumulation and the TensorALU computing the element-wise operations (e.g., activation). The major computation components are the general-purpose matrix multiplication (GEMM) cores. Different from VTA, there are two heterogeneous GEMM cores, GEMMfixed for conventional multiplications and GEMMsp2 for SP2 operations. Beyond the conventional GEMM acceleration framework, our GEMMfixed can be naturally combined with advanced GEMM acceleration frameworks with architectural optimizations on the fixed-point operations (which use the DSP resources on FPGA). An example is Bit-Fusion [11], which is orthogonal to and can be combined with our MSQ. Firstly, the fixed-point operations executed on DSPs in our MSQ framework can be accelerated by Bit-Fusion. Secondly, MSQ assigns a large portion (beyond 50%) of the computations in each layer to SP2 and leverages LUTs for computation, which are previously not fully exploited by fixed-point acceleration techniques like Bit-Fusion. A doubling of performance can be anticipated as fixed-point and SP2 are computed in parallel on the FPGA.

The detailed workflow of the two GEMM cores is illustrated in Figure 3(b). A tiled block of input activation data with a size of Bat × Blkin is read from the input buffer to the register array, where Bat is the batch size and Blkin is the input channel count of the tile that will be computed in parallel. Note that the input activation will be broadcast to both GEMM cores. As Figure 3(c) displays, the GEMMfixed core is composed of multipliers implemented with DSPs on the FPGA, while the GEMMsp2 core uses LUTs to realize shift and addition for the novel SP2-based computations. Meanwhile, two weight buffers provide the weight values in fixed-point and SP2 formats, respectively. The partial results will be accumulated and stored in individual register files, and the final results are written to individual output buffers. Because the filters are allocated to the heterogeneous GEMM cores depending on their weight representation format, two filter index buffers are set to instruct the Store unit to write the output data to the
proper global addresses. Figure 3(c) gives a detailed structure to handle the fixed-point and SP2 operations in the two GEMM cores.

Fig. 3. Hardware architecture of convolution for the MSQ number system. (a) Overall framework with the GEMMfixed core for fixed-point operations and the GEMMsp2 core for SP2 operations; (b) Dataflow in the heterogeneous GEMM cores; and (c) Computations in the heterogeneous GEMM cores.

TABLE VII. Hardware implementation parameters with different devices and settings. Bat, Blkin, and Blkout,fixed are set such that the DSP utilization can reach the maximum. Blkout,sp2 is increased until the LUT utilization is high enough and optimized.

Two design parameters, Blkout,fixed and Blkout,sp2, indicate the parallel PE count in each GEMM core and the size of the corresponding register arrays, as illustrated in Figure 3(b).
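A software analogue of the dataflow in Figure 3(b) is sketched below: each Bat × Blkin activation tile is broadcast to two "cores", one multiplying fixed-point weight rows and one applying shift-and-add SP2 rows, with results accumulated into separate output blocks. This is an illustration of the work partitioning only (names, shapes, and integer formats are our assumptions), not a model of the RTL.

    import torch

    def heterogeneous_gemm(act_tile, W_fixed, W_sp2_exp, sp2_sign):
        # act_tile  : [Bat, Blk_in] integer activations, broadcast to both cores.
        # W_fixed   : [Blk_out_fixed, Blk_in] integer fixed-point weights (DSP core).
        # W_sp2_exp : [Blk_out_sp2, Blk_in, 2] exponent pairs (b1, b2) of SP2 weights (LUT core).
        # sp2_sign  : [Blk_out_sp2, Blk_in] signs (+1/-1) of the SP2 weights.
        a = act_tile.unsqueeze(1)                                      # broadcast over output rows
        out_fixed = (a * W_fixed).sum(dim=-1)                          # multiply-accumulate (DSPs)
        shifted = (a << W_sp2_exp[..., 0]) + (a << W_sp2_exp[..., 1])  # two shifts + one add (LUTs)
        out_sp2 = (sp2_sign * shifted).sum(dim=-1)                     # accumulate along Blk_in
        return out_fixed, out_sp2                                      # separate output buffers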
Fig. 4. FPGA resource utilization with different devices and settings. In the three designs for each of the two devices, the DSP utilization is maintained at 100% and the LUT utilization is raised to 70%−80% with FF and BRAM resources.

2) Real-world Performance and Comparison: To present the performance with real-world applications, we employed different CNN and RNN models with the proper SP2/fixed-point ratios on the two devices. The networks ResNet-18 and MobileNet-v2 are implemented based on the ImageNet dataset. The performance results of each network under various hardware configurations are displayed in Table VIII. For some layers in CNNs, like the first convolutional layer, the peak throughput cannot be reached since the number of input channels is less than Blkin, so the data cannot fill all of the PEs. Generally, for CNN models, the overall PE utilization reaches 52.4% to 70.1%, and the heterogeneous GEMMfixed and GEMMsp2 cores improve the throughput by 2.1×−2.5× with the optimal design compared to utilizing the GEMMfixed core only. Compared with the design with only 4-bit fixed-point (fixed4/SP2 = 1 : 0) quantization, the optimal design with the ratio of fixed4/SP2 = 1 : 1.5 on XC7Z020 decreases the latency per image from 100.7 ms to 47.1 ms (2.13×) for ResNet-18, and the optimal design with the ratio of fixed4/SP2 = 1 : 2 on XC7Z045 decreases the latency from 25.1 ms to 10.1 ms (2.49×) for ResNet-18. The latency improvement is more significant when compared with the 8-bit fixed-point design, as the optimal design on XC7Z020 achieves a latency decrease from 181.3 ms to 47.1 ms (3.83×), and the optimal design on XC7Z045 achieves a latency decrease from 45.2 ms to 10.1 ms (4.48×). As for RNN models, the PE utilization is 42.9%−59.2%, and the performance is increased by 2.4×−4.1×.
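These gains are consistent with a simple parallel-execution view: when a layer's multiplications are split between the DSP-based and LUT-based cores, the layer latency is governed by the slower partition. The sketch below is a back-of-the-envelope model under our own simplifying assumptions (perfect overlap, one MAC per PE per cycle, no memory stalls, an assumed clock frequency), not the timing methodology behind the reported measurements.

    def layer_latency(total_macs, pr_sp2, fixed_pes, sp2_pes, freq_hz=100e6):
        # Split the layer's multiply-accumulates according to the SP2 partition rate;
        # with both cores running in parallel, latency is set by the slower partition.
        sp2_macs = total_macs * pr_sp2
        fixed_macs = total_macs * (1 - pr_sp2)
        cycles = max(fixed_macs / fixed_pes, sp2_macs / sp2_pes)
        return cycles / freq_hz

    # When pr_sp2 is chosen so that both cores finish together, the DSP-only latency
    # shrinks by roughly 1 + sp2_pes / fixed_pes, which is consistent with the
    # 2.1x - 4.1x range reported above.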
The optimal MSQ implementations of CNNs based on ImageNet and previous designs are compared in Table IX, from which it can be observed that our ResNet-18 implementations achieve the highest accuracy and enjoy comparable hardware utilization efficiency, represented by GOPS/DSP and GOPS/kLUT, with the designs in [68], [69]. The work [70] acquires higher utilization efficiency but much lower accuracy. MobileNet-v2 has the most complicated structure among all these networks, making it difficult to deploy on hardware platforms, but our designs can still achieve high performance, especially in terms of frame rate. We do not find implementations with ResNet-18 and MobileNet-v2 in other work, so we compare with other CNNs.

Our proposed solution is beneficial over low-precision GPU for the following two reasons: (1) the current low-precision GPU (TensorRT) solution relies on 8-bit, while we can go to 4-bit, further assisted by SP2; (2) the FPGA solution is dataflow-based and energy-efficient in general [71]. Comparing with a state-of-the-art energy-efficient GPU (NVIDIA Jetson AGX, power consumption 10-15 W) with TensorRT support, using ResNet-18 as an example and measured under the same accuracy, our FPGA solution (XC7Z045) achieves slightly higher performance (99 FPS vs. 78 FPS) but more than 3× higher energy efficiency, as the FPGA only consumes around 4 W of power.

VII. RELATED WORK

This section introduces the DNN weight quantization methods/algorithms for fixed-point and P2 quantization schemes, and discusses DNN weight quantization on FPGA platforms.

A. DNN Quantization Methods

Zhou et al. [38] first explored the potential of fixed-point quantization by introducing a hyperbolic tangent transformation to weights and activations, with scaling factors to minimize the quantization error. Choi et al. [39] improved this method by adding a parameterized clipping threshold to activations. As alternatives for solving the non-differentiability problem, DSQ [40] developed an evolving training method to gradually approximate STE. QIL [41] parameterized the quantization interval and trained it with the task loss, avoiding access to the original training data. μL2Q [42] introduced a data distribution loss during training to minimize the quantization error. LQ-Nets [44] and LSQ [43] proposed differentiable methods to learn the quantizer for each layer jointly with the parameters. Miyashita et al. [45] replaced the fixed-point quantizer with a logarithmic representation to exploit bit-shift operations to accelerate inference. INQ [46] splits weights into groups and iteratively quantizes the model to low bit-width. Leng et al. [47] employed the ADMM training technique to increase the accuracy of extremely low bit-width DNNs.

In addition to these quantization methods for inference acceleration, Zhu et al. [72] proposed a low-bit training framework for training acceleration. They used direction-sensitive gradient clipping and deviation-counteractive learning rate scaling to ensure a unified 8-bit (INT8) training with minor accuracy degradation.

B. Weight Quantization in FPGA Implementations

Weight quantization has been widely applied to DNN implementations on FPGAs [73]. Some works study fixed-point quantization. The work [68] utilizes a greedy solution to determine the radix position of each layer for quantization. [70] investigates a hybrid quantization scheme that allows different bit-widths for weights, providing more flexibility. For Binarized Neural Networks (BNNs), multiplications can be executed with XNOR gates [74]–[76]. A fully binarized neural
TABLE VIII. Performance of various DNN applications on hardware under different settings.

TABLE IX. Comparisons of CNNs on ImageNet with previous implementations.
network accelerator is implemented in [76] through utilizing odd-even padding to replace the zero padding values. Another scheme, called logarithmic quantization using powers of 2, is explored in [77]. In addition, weight quantization could be employed with a two-stage arithmetic unit for low bit-width CNNs [78], a fast matrix and Winograd algorithm [79], a novel CNN architecture for software-hardware co-design [69], a design flow of DNN implementations for more flexible quantization schemes [80], and an OpenCL-based framework Deep Learning Accelerator (DLA) to accommodate designs with different bit-widths [81]. In addition, dynamic quantization with bit fusion in [11] improves the bit-level flexibility by matching various bit-widths for different DNN layers.

VIII. CONCLUSION

This paper investigates efficient DNN inference engines on FPGA devices through DNN quantization, and proposes the first solution that applies different quantization schemes to different rows of the weight matrix. We propose a hardware-friendly quantization scheme named SP2 suitable for Gaussian-like weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. To fully explore the FPGA resources, we propose an intra-layer, multi-scheme quantization framework with an ensemble of the SP2 and fixed-point schemes. We evaluate our FPGA-centric quantization framework across multiple application domains with various DNNs such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1×−4.1× compared to solely exploiting DSPs for all multiplication operations.

ACKNOWLEDGMENT

This work is partly supported by the National Science Foundation CCF-1901378, CCF-1919117, CCF-1919289, CNS-1909172 and DARPA-HR00112090055.

REFERENCES

[1] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Thirty-first AAAI conference on artificial intelligence (AAAI), 2017.
[2] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 2117–2125.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, "The microsoft 2017 conversational speech recognition system," in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 5934–5938.
[5] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th international conference on Machine learning (ICML), 2008, pp. 160–167.
[6] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical image analysis, vol. 42, pp. 60–88, 2017.
[7] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O'Donoghue, D. Visentin, P. A. Keane, and O. Ronneberger, "Clinically applicable deep learning for diagnosis and referral in retinal disease," Nature medicine, vol. 24, no. 9, pp. 1342–1350, 2018.
[8] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "Lergan: A zero-free, low data movement and pim-based gan architecture," in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 669–681.
[9] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan, "Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing," Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), vol. 51, no. 2, pp. 405–418, 2017.
[10] R. Cai, A. Ren, N. Liu, C. Ding, L. Wang, X. Qian, M. Pedram, and Y. Wang, "Vibnn: Hardware acceleration of bayesian neural networks," in Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2018, pp. 476–488.
[11] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," in Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA). IEEE Press, 2018, pp. 764–775.
[12] C. Deng, F. Sun, X. Qian, J. Lin, Z. Wang, and B. Yuan, "Tie: energy-efficient tensor train-based inference engine for deep neural network," in Proceedings of the 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 264–278.
[13] R. Cai, A. Ren, O. Chen, N. Liu, C. Ding, X. Qian, J. Han, W. Luo, N. Yoshikawa, and Y. Wang, "A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology," in Proceedings of the 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 567–578.
[14] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, "Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers," in Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019, pp. 925–938.
[15] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing fpga-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2015, pp. 161–170.
[16] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, "Accelerating binarized convolutional neural networks with software-programmable fpgas," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2017, pp. 15–24.
[17] Z. Chen, A. Howe, H. T. Blair, and J. Cong, "Fpga-based lstm acceleration for real-time eeg signal processing," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2018, pp. 288–288.
[18] R. Shi, Y. Ding, X. Wei, H. Liu, H. So, and C. Ding, "Ftdl: An fpga-tailored architecture for deep learning systems," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2020, pp. 320–320.
[19] W. Niu, X. Ma, S. Lin, S. Wang, X. Qian, X. Lin, Y. Wang, and B. Ren, "Patdnn: Achieving real-time dnn execution on mobile devices with pattern-based weight pruning," in Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020, pp. 907–922.
[20] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "Tvm: An automated end-to-end optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
[21] https://github.com/alibaba/MNN.
[22] A. Paszke, S. Gross, S. Chintala, and G. Chanan, "Pytorch," 2017.
[23] https://www.tensorflow.org/mobile/tflite/.
[24] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," International Conference on Learning Representations (ICLR), 2019.
[25] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in neural information processing systems (NeurIPS), 2016, pp. 2074–2082.
[26] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient dnns," in Advances in neural information processing systems (NeurIPS), 2016, pp. 1379–1387.
[27] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, "Discrimination-aware channel pruning for deep neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 875–886.
[28] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, "Nisp: Pruning networks using neuron importance score propagation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.
[29] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.
[30] X. Dong and Y. Yang, "Network pruning via transformable architecture search," in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 759–770.
[31] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in neural information processing systems (NeurIPS), 2015, pp. 3123–3131.
[32] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[33] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in European conference on computer vision (ECCV). Springer, 2016, pp. 525–542.
[34] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 345–353.
[35] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711, 2016.
[36] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in International Conference on Learning Representations (ICLR), 2017.
[37] Z. He and D. Fan, "Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11438–11446.
[38] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[39] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "Pact: Parameterized clipping activation for quantized neural networks," arXiv preprint arXiv:1805.06085, 2018.
[40] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, "Differentiable soft quantization: Bridging full-precision and low-bit neural networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4852–4861.
[41] S. Jung, C. Son, S. Lee, J. Son, J.-J. Han, Y. Kwak, S. J. Hwang, and C. Choi, "Learning to quantize deep networks by optimizing quantization intervals with task loss," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4350–4359.
[42] G. Cheng, L. Ye, L. Tao, Z. Xiaofan, H. Cong, C. Deming, and C. Yao, "μl2q: An ultra-low loss quantization method for dnn," The 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
[43] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," International Conference on Learning Representations (ICLR), 2019.
[44] D. Zhang, J. Yang, D. Ye, and G. Hua, "Lq-nets: Learned quantization for highly accurate and compact deep neural networks," in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 365–382.
[45] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," arXiv preprint arXiv:1603.01025, 2016.
[46] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless cnns with low-precision weights," in International Conference on Learning Representations (ICLR), 2017.
[47] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, "Extremely low bit neural network: Squeeze the last bit out with admm," in Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
[48] C. Baskin, E. Schwartz, E. Zheltonozhskii, N. Liss, R. Giryes, A. M. Bronstein, and A. Mendelson, "Uniq: Uniform noise injection for non-uniform quantization of neural networks," arXiv preprint arXiv:1804.10969, 2018.
[49] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," in Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), 2015, pp. 1613–1622.
[50] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[51] P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin, "Understanding straight-through estimator in training activation quantized neural nets," in International Conference on Learning Representations (ICLR), 2018.
[52] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
[53] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2018, pp. 4510–4520.
[54] A. Krizhevsky, "Learning multiple layers of features from tiny images," 2009.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.
[56] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," CoRR, vol. abs/1804.02767, 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[57] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: common objects in context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[58] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[59] M. Marcus, B. Santorini, and M. A. Marcinkiewicz, "Building a large annotated corpus of english: The penn treebank," 1993.
[60] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[61] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1," NASA STI/Recon technical report n, vol. 93, 1993.
[62] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1. Association for Computational Linguistics, 2011, pp. 142–150.
[63] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou, "Effective quantization methods for recurrent neural networks," arXiv preprint arXiv:1611.10176, 2016.
[64] P. Zhang, Y. Zhong, and X. Li, "Slimyolov3: Narrower, faster and better for real-time UAV applications," CoRR, vol. abs/1907.11093, 2019. [Online]. Available: http://arxiv.org/abs/1907.11093
[65] Q. Sun, T. Chen, J. Miao, and B. Yu, "Power-driven dnn dataflow optimization on fpga," in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, pp. 1–7.
[66] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, "Fp-dnn: An automated framework for mapping deep neural networks onto fpgas with rtl-hls hybrid templates," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017, pp. 152–159.
[67] T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, "A hardware–software blueprint for flexible deep learning specialization," IEEE Micro, vol. 39, no. 5, pp. 8–16, 2019.
[68] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-eye: A complete design flow for mapping cnn onto embedded fpga," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 1, pp. 35–47, 2017.
[69] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. Vissers, J. Wawrzynek, and K. Keutzer, "Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded fpgas," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2019, pp. 23–32.
[70] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design flow of accelerating hybrid extremely low bit-width neural network in embedded fpga," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 163–1636.
[71] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding performance differences of fpgas and gpus: (abstract only)," ser. FPGA '18. New York, NY, USA: Association for Computing Machinery, 2018, p. 288. [Online]. Available: https://doi.org/10.1145/3174243.3174970
[72] F. Zhu, R. Gong, F. Yu, X. Liu, Y. Wang, Z. Li, X. Yang, and J. Yan, "Towards unified int8 training for convolutional neural network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1969–1979.
[73] K. Guo, W. Li, K. Zhong, Z. Zhu, S. Zeng, S. Han, Y. Xie, P. Debacker, M. Verhelst, and Y. Wang, "Neural network accelerator comparison," https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/.
[74] H. Nakahara, H. Yonekawa, T. Sasao, H. Iwamoto, and M. Motomura, "A memory-based realization of a binarized deep convolutional neural network," in 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 2016, pp. 277–280.
[75] H. Nakahara, T. Fujii, and S. Sato, "A fully connected layer elimination for a binarized convolutional neural network on an fpga," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–4.
[76] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, "Fbna: A fully binarized neural network accelerator," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 51–513.
[77] C. Luo, W. Cao, L. Wang, and P. H. Leong, "Rna: An accurate residual network accelerator for quantized and reconstructed deep neural networks," IEICE Transactions on Information and Systems, vol. 102, no. 5, pp. 1037–1045, 2019.
[78] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang, "Accelerating low bit-width convolutional neural networks with embedded fpga," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–4.
[79] D. Wu, J. Chen, W. Cao, and L. Wang, "A novel low-communication energy-efficient reconfigurable cnn acceleration architecture," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 64–643.
[80] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, "Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
[81] P. Colangelo, N. Nasiri, E. Nurvitadhi, A. Mishra, M. Margala, and K. Nealis, "Exploration of low numeric precision deep learning inference using intel® fpgas," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 73–80.