Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework

∗ Equal contribution.

Abstract—Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes to different rows of the weight matrix. It is motivated by (1) the distributions of the weights in different rows are not the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. Then, to fully explore the resources, we propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase, accuracy due to better matching with weight distributions. For the FPGA implementations, we develop a parameterized architecture with heterogeneous Generalized Matrix Multiplication (GEMM) cores—one using LUTs for computations with SP2 quantized weights and the other utilizing DSPs for fixed-point quantized weights. Given the partition ratio among the two schemes based on resource characterization, the MSQ quantization training algorithm derives an optimally quantized model for the FPGA implementation. We evaluate our FPGA-centric quantization framework across multiple application domains. With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1×−4.1× compared to solely exploiting DSPs for all multiplication operations. In addition, the CNN implementations with the proposed MSQ scheme can achieve higher accuracy and comparable hardware utilization efficiency compared to the state-of-the-art designs.

Index Terms—deep neural network, quantization, FPGA, inference

I. INTRODUCTION

Deep learning or Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains [1]–[7]. However, the state-of-the-art DNNs may require up to GBs (gigabytes) for model size and $10^2$ GFLOPs (giga floating-point operations) for inference computation, making it a challenging task to perform on-device inference.

To efficiently execute the diverse DNN inference models for broader applications, the resource-constrained edge-computing platforms require two crucial supports. The first one is the specialized hardware acceleration for DNN inference. Extensive research efforts have been dedicated to the efficient implementations of DNN inference models on various edge-computing platforms, such as ASICs [8]–[14], FPGAs [15]–[18], and embedded CPUs/GPUs [19]–[23].

The second is the DNN model compression technique, which not only seeks more efficient hardware implementation based on given models, but also explores the opportunity of algorithm and hardware co-design to achieve better trade-offs among accuracy, hardware cost, and performance. There are two essential techniques for model compression: DNN weight pruning [24]–[30] and weight quantization [31]–[47].

This paper focuses on DNN weight quantization, which becomes imperative to DNN hardware acceleration, especially on the FPGA and ASIC platforms. By representing weights with fewer bits, weight quantization can directly simplify the implementations and accelerate the inference execution speed in a hardware-friendly manner. Also, it is supported in GPUs (e.g., PyTorch [22] for NVIDIA GPUs) and mobile devices (e.g., TensorFlow-Lite [23]). In addition, weight quantization yields far less training overhead than weight pruning, let alone the training-heavy network architecture search (NAS)-based model compression techniques. Specifically, in state-of-the-art DNN quantization methods (including our work), the retraining process usually takes 1/3 ∼ 1/2 of the epochs of the pre-training process, which is an acceptable training overhead in exchange for significant inference speedup.

Weight quantization can be considered as a mapping from 32-bit floating-point weights into m-bit weight representations. There are different types of quantization schemes including binary [31]–[34], ternary [35]–[37], low-bit-width fixed-point [38]–[43], and power-of-2 [44]–[47]. In general, binary and ternary quantization schemes result in significant accuracy loss, for example, > 5% under binary and 2%−3% under ternary quantization. The fixed-point quantization can represent the DNN weights using low bit-width, e.g., 4-bit, with negligible accuracy loss. To further simplify hardware implementations, the power-of-2 quantization scheme was proposed to replace the
ods/algorithms by DoReFa-Net [38], PACT [39], DSQ [40], QIL [41], μL2Q [42], and LSQ [43].

With the m-bit fixed-point scheme, quantized weight values are defined as the scaling factor α times quantization levels:

$$Q_{FP}(m, \alpha) = \pm\alpha \times \Big\{0, \tfrac{1}{2^{m-1}-1}, \tfrac{2}{2^{m-1}-1}, \ldots, 1\Big\}. \quad (1)$$

And the mapping from a 32-bit floating-point weight w into the quantized weight ŵ by the m-bit fixed-point representation (in sign-magnitude) is given by the following quantizer:

$$\hat{w} = \lfloor w \rceil_{Q_{FP}(m,\alpha)} = \alpha \cdot h^{-1}\!\Big(\tfrac{1}{2^{m}-1}\,\mathrm{round}\big((2^{m}-1)\cdot h(\lfloor w, \alpha\rceil)\big)\Big), \quad (2)$$

where $\lfloor \cdot \rceil_{Q_{FP}(m,\alpha)}$ denotes the quantizer function that projects onto $Q_{FP}(m,\alpha)$; the function h(·) transforms a value within [−1, +1] into the range of [0, 1], for example we can use h(·) = tanh(·)/2 + 0.5; and $\lfloor w, \alpha\rceil$ clips w according to

$$\lfloor w, \alpha\rceil = \begin{cases} -1, & w < -\alpha \\ w/\alpha, & -\alpha \le w \le \alpha \\ 1, & w > \alpha. \end{cases} \quad (3)$$
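For illustration, the fixed-point quantizer of Eqs. (1)–(3) can be sketched in a few lines of PyTorch. This is an expository sketch only, not the training code used in our experiments; the helper names and the choice of a linear h are ours (the paper mentions tanh(·)/2 + 0.5 as one admissible h).

    import torch

    def h(x):
        # Map [-1, 1] into [0, 1]; the linear choice x/2 + 0.5 is used here for simplicity.
        return x / 2 + 0.5

    def h_inv(y):
        # Inverse of the linear h, mapping [0, 1] back to [-1, 1].
        return 2 * y - 1

    def clip(w, alpha):
        # Eq. (3): scale by alpha and clip to [-1, 1].
        return torch.clamp(w / alpha, -1.0, 1.0)

    def fixed_point_quantize(w, m=4, alpha=None):
        # Eq. (2): project weights onto the m-bit fixed-point grid Q_FP(m, alpha).
        if alpha is None:
            alpha = w.abs().max()      # one simple scaling-factor choice, for illustration
        levels = 2 ** m - 1
        y_q = torch.round(levels * h(clip(w, alpha))) / levels
        return alpha * h_inv(y_q)

Calling fixed_point_quantize(torch.randn(64, 64), m=4) then restricts every weight to the uniform grid of Eq. (1) (up to the choice of h and scaling factor).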
2) Non-Uniform Interval Quantization Schemes: On the other hand, power-of-2 quantization is a non-uniform interval quantization scheme, with representative methods including [44]–[47]. Power-of-2 quantization replaces multiplications by bit-shifting operations, and this number system also possesses higher precision around the mean, which fits the Gaussian distribution of DNN weights better [48], [49]. With an m-bit weight representation (in sign-magnitude), the quantized weight values by the power-of-2 scheme are defined as

$$Q_{P2}(m, \alpha) = \pm\alpha \times \Big\{0, \tfrac{1}{2^{2^{m-1}-2}}, \tfrac{1}{2^{2^{m-1}-3}}, \ldots, 1\Big\}. \quad (4)$$

And the power-of-2 quantizer is then given by

$$\hat{w} = \lfloor w \rceil_{Q_{P2}(m,\alpha)} = \begin{cases} \alpha \cdot h^{-1}\!\big(2^{\mathrm{round}(\log_2 h(\lfloor w, \alpha\rceil))}\big), & h(\lfloor w, \alpha\rceil) > 2^{-2^{m-1}+1} \\ 0, & h(\lfloor w, \alpha\rceil) \le 2^{-2^{m-1}+1}. \end{cases} \quad (5)$$

With weights quantized into the power-of-2 scheme, multiplications between a weight, i.e., $2^b$ ($b \in \mathbb{Z}$), and an activation a can be implemented by bit shifting as follows:

$$2^b \times a = \begin{cases} a \ll b, & b > 0 \\ a, & b = 0 \\ a \gg (-b), & b < 0. \end{cases} \quad (6)$$
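To make Eqs. (5) and (6) concrete, the sketch below gives a simplified power-of-2 projection (it drops the h(·) transform for readability) together with the shift-based multiplication; both are illustrative approximations under our own simplifications, not the exact formulation above.

    import torch

    def power_of_2_quantize(w, m=4, alpha=None):
        # Simplified Eq. (5): snap |w|/alpha to the nearest power of 2, or to 0 below the
        # smallest representable level; sign and scaling factor are restored afterwards.
        if alpha is None:
            alpha = w.abs().max()
        x = torch.clamp(w.abs() / alpha, 0.0, 1.0)
        thresh = 2.0 ** (-(2 ** (m - 1)) + 1)
        x_q = torch.where(x > thresh, 2.0 ** torch.round(torch.log2(x)), torch.zeros_like(x))
        return alpha * torch.sign(w) * x_q

    def mul_by_power_of_2(a: int, b: int) -> int:
        # Eq. (6): compute (2**b) * a with a single shift instead of a multiplier.
        if b > 0:
            return a << b
        if b == 0:
            return a
        return a >> (-b)

For example, with a weight magnitude of 2**(-3) = 0.125 and an integer activation a = 96 (after scaling), mul_by_power_of_2(96, -3) yields 96 >> 3 = 12.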
Although the power-of-2 quantization scheme can simplify hardware implementation by eliminating multiplications, its precision cannot be increased effectively with increasing m, because increasing m will merely increase the resolution around the mean, while the tails are still in low precision. This can also be observed from Eq. (5): when w is a large value, increasing m does not have an effect on ŵ. In practice, 3 ∼ 7 bits are usually used for power-of-2 quantization, and more bits could not further promote the accuracy of the quantized models. As mentioned in §II-A1, 4-bit fixed-point results in negligible accuracy degradation, but 4-bit power-of-2 quantization will result in an accuracy loss of 1% − 2%.

B. Quantization Algorithms

Quantization performs a projection from the continuous domain to a discrete number system, which makes the gradients of the loss function unavailable for backpropagation during training. Two approaches can be applied to solving this unavailable-gradient issue. One is employing a Straight Through Estimator (STE) [50], [51] to set the gradient to the constant value of 1 as

$$\text{Forward:}\ \ y = \mathrm{round}(x), \qquad \text{Backward:}\ \ \frac{\partial y}{\partial x} = \mathbf{1}_{x \in \mathbb{R}}, \quad (7)$$

which is effective in quantization training. The other approach employs the Alternating Direction Method of Multipliers (ADMM) to iteratively solve for the parameters with a target quantization scheme as the optimization constraint [47], eliminating the need to backpropagate through the quantizer. In this work, we use a combination of ADMM and STE, as shown in Algorithm 1, which in general follows the ADMM algorithm for weight quantization and where the STE is only applied for activation quantization.

Algorithm 1: DNN Quantization with ADMM and STE
    input : 32-bit floating-point DNN model M, with weights W to be quantized.
            Quantization scheme: S ∈ {Fixed-point, Power-of-2, Sum-of-power-of-2}
    target: Quantized model M̂
    // Initialization:
    U^0 = 0; Z^0 = W;
    foreach Epoch do
        // Update Z, U:
        Z^t ← proj_S(W + U^{t−1});
        U^t ← W − Z^t + U^{t−1};
        foreach Batch do
            // STE for activation quantization:
            input ← proj_S(input);
            loss ← M(input);
            loss ← loss + 1/2 ‖W − Z^t + U^t‖^2;
            Backpropagate loss and update W;
    Return M̂ ← M{proj_S(W)}.
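A minimal PyTorch rendering of Algorithm 1 for a single quantized weight tensor is sketched below. It is a sketch under our own simplifications (a one-tensor model such as nn.Linear, a penalty coefficient rho that Algorithm 1 leaves implicit, and names RoundSTE, quantize_act_ste, admm_step chosen for exposition), not the released training code.

    import torch
    import torch.nn.functional as F

    class RoundSTE(torch.autograd.Function):
        # Eq. (7): round in the forward pass, pass the gradient straight through backward.
        @staticmethod
        def forward(ctx, x):
            return torch.round(x)
        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    def quantize_act_ste(x, m=4):
        # STE-based activation quantization on an assumed [0, 1] range (uniform m-bit grid).
        levels = 2 ** m - 1
        return RoundSTE.apply(x.clamp(0.0, 1.0) * levels) / levels

    def admm_step(model, loader, optimizer, proj, state, rho=1e-3):
        # One outer iteration of Algorithm 1 (run once per epoch) for model.weight.
        W = model.weight
        U = state.get("U", torch.zeros_like(W))
        Z = proj(W.detach() + U)                 # Z^t <- proj_S(W + U^{t-1})
        U = W.detach() - Z + U                   # U^t <- W - Z^t + U^{t-1}
        state["U"] = U
        for x, y in loader:                      # inner loop over mini-batches
            loss = F.cross_entropy(model(quantize_act_ste(x)), y)
            loss = loss + 0.5 * rho * (W - Z + U).pow(2).sum()   # ADMM penalty term
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Here proj can be any of the projections above (e.g., fixed_point_quantize); only the activations are quantized with the STE, while the weights are pulled toward the quantized set through the ADMM penalty, matching the division of labor stated in the text.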
III. SUM-OF-POWER-OF-2 (SP2) QUANTIZATION SCHEME

In this section, we propose a new hardware-friendly sum-of-power-of-2 (SP2) quantization scheme, which enjoys multiplication-free operations for the inference computation, as the binary, ternary, and power-of-2 schemes do, while achieving negligible inference accuracy degradation.
TABLE I. Analysis on the operations for weight-activation multiplication by two quantization schemes of the weights.
TABLE II. Results from different quantization schemes for the ResNet-18 and MobileNet-v2 DNN models on CIFAR10, CIFAR100, and ImageNet datasets.

shifted operands. Since $b_1$ and $b_2$ are encoded by $m_1$- and $m_2$-bit unsigned integers, respectively, Operations (1) and (2) can shift by at most $2^{m_1}-2$ and $2^{m_2}-2$ bits, respectively. The shifted activation operands will be $n + 2^{m_1}-2$ and $n + 2^{m_2}-2$ bits, respectively. Therefore, one $(n + 2^{m_1}-2)$-bit addition is needed. In summary, with SP2 weight quantization, the weight-activation multiplication can be implemented with two shift operations and one addition operation.
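As an illustration of this two-shift, one-add datapath, the sketch below multiplies an integer activation by an SP2 weight whose magnitude is assumed to be of the form 2^{b1} + 2^{b2}; the exact exponent encoding, sign handling, and scaling factor of the hardware are not reproduced here.

    def sp2_mac(a: int, b1: int, b2: int) -> int:
        # Product of activation a with an SP2 weight magnitude (2**b1 + 2**b2), realized
        # with two shifts and one addition; the scaling factor alpha and the sign are
        # assumed to be handled separately (e.g., folded into the accumulator).
        return (a << b1) + (a << b2)

    # Example: weight magnitude 2**3 + 2**1 = 10, activation a = 7:
    # (7 << 3) + (7 << 1) = 56 + 14 = 70 = 10 * 7.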
For ImageNet, both the fixed-point (Fixed) and sum-of-power-of-2 (SP2) schemes have negligible accuracy loss, ≤ 0.41% for ResNet-18 and ≤ 0.62% for MobileNet-v2 across the three datasets. These two schemes achieve comparable accuracy of quantized models. In summary, the 4-bit-width Fixed and SP2 quantization schemes are essentially equivalent in terms of the accuracy of the quantized models, and their accuracy losses are negligible.
fixed-point quantization should be used. Thus, the mixed scheme is necessary at algorithm level—it can achieve similar or even potentially higher accuracy than existing schemes. Second, our approach also leads to a better utilization of the heterogeneous resources available in FPGAs—weights based on the two schemes can be managed by LUT and DSP resources, respectively. Specifically, the operations involving SP2 quantized weights should be implemented by LUTs, while those with fixed-point quantized weights can leverage the DSPs, the more limited resource on FPGAs for DNN hardware accelerators. Overall, our MSQ achieves a sweet design spot with both high accuracy and high processing throughput, thanks to the high and optimized utilization of both LUTs and DSPs.

B. Algorithm

In MSQ, each row in a weight matrix should employ either the SP2 or the fixed-point scheme. To determine the scheme for each row, the weight variances of all the rows are calculated. We define a threshold θ for the variances, such that for the rows with variances smaller than the threshold, the SP2 quantization is employed; otherwise, the fixed-point scheme is applied. By setting the proper threshold θ, the desired partition ratio of SP2 to fixed-point can be achieved with improved FPGA resource utilization. Algorithm 2 provides the details.

Algorithm 2: FPGA-Centric Mixed Scheme Quantization (MSQ)
    input : 32-bit floating-point DNN model M, with weights W to be quantized.
    target: Quantized model M̂
    // Initialization:
    U^0 = 0; Z^0 = W;
    Partition rate PR_SP2 from FPGA resource characterization;
    S_f = Fixed-point; S_p = SP2;
    foreach Epoch do
        Calculate variance v_r^(l) for each r-th row of the layer-l weight matrix W^(l);
        Sort v_{1:R}^(l) to obtain the threshold θ^(l) such that a fraction PR_SP2 of the rows have variances less than θ^(l);
        if v_r^(l) < θ^(l) then S ← S_p; else S ← S_f;
        // Update Z, U:
        Z^t ← proj_S(W + U^{t−1});
        U^t ← W − Z^t + U^{t−1};
        foreach Batch do
            input ← proj_S(input);
            loss ← M(input);
            loss ← loss + 1/2 ‖W − Z^t + U^t‖^2;
            Backpropagate loss and update W;
    Return M̂ ← M{proj_S(W)}.
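The row-wise scheme assignment in Algorithm 2 can be expressed compactly. The sketch below is our illustration rather than the released training code: it computes per-row variances and a quantile threshold for a given partition rate, then applies the corresponding projection to each group of rows.

    import torch

    def assign_row_schemes(W, pr_sp2=2/3):
        # Per-row variances of a [rows, cols] weight matrix W.
        row_var = W.var(dim=1)
        # Threshold theta such that a fraction pr_sp2 of rows (the low-variance ones) use SP2.
        theta = torch.quantile(row_var, pr_sp2)
        use_sp2 = row_var < theta
        return use_sp2, theta

    def project_mixed(W, use_sp2, proj_sp2, proj_fixed):
        # proj_S in Algorithm 2: rows flagged for SP2 use the SP2 projection,
        # the remaining rows use the fixed-point projection.
        Z = torch.empty_like(W)
        Z[use_sp2] = proj_sp2(W[use_sp2])
        Z[~use_sp2] = proj_fixed(W[~use_sp2])
        return Z

Setting pr_sp2 = 2/3, for example, corresponds to the 2 : 1 SP2/fixed-point partition ratio evaluated later in this section.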
The optimal ratio of SP2 to fixed-point is determined by the available resources on FPGA devices and the resource utilization required to support the design. Generally, the utilization factor of DSPs should be maintained at 100% to take full advantage of the DSP resources for the fixed-point multiplications. When only fixed-point quantization is applied, the LUT utilization is low even though the DSP utilization reaches the maximum. Incorporating the SP2 quantization can increase the LUT utilization, and therefore enhance the throughput. The exploration of the optimal ratio of SP2 to fixed-point among the weight matrix rows is elaborated in §VI.

C. Accuracy Results

1) Experiment Setup: We evaluate our MSQ in three application domains, i.e., image classification with convolutional neural networks (CNNs); object detection and recognition with YOLO-v3; and machine translation, speech recognition, and sentiment classification with recurrent neural networks (RNNs). We use no extra data augmentations in our quantization, other than those already employed for training the 32-bit floating-point baseline models. Our quantization training algorithm uses step or cosine learning rate decay and ℓ2 regularization, following the training algorithms of the baseline models. Our quantization algorithms are implemented with the PyTorch framework on NVIDIA TITAN RTX GPUs and GeForce RTX 2080Ti GPUs.

For image classification, we evaluate the deep residual net (ResNet-18) [52], which is a widely used model for computer vision tasks, as well as the lightweight MobileNet-v2 model [53]. We test on the CIFAR10 [54], CIFAR100 [54], and ImageNet ILSVRC-2012 [55] datasets. DNN models for the CIFAR10 and CIFAR100 datasets are trained from scratch and quantized for 150 epochs. For the ImageNet dataset, pre-trained models in 32-bit floating-point are used and quantized for 90 epochs. The initial learning rate is 8e−3 for CIFAR10, 4e−3 for CIFAR100, and 5e−4 for ImageNet.

For object detection, we explore the implementation of a fully convolutional neural network (FCNN) called YOLO-v3 [56] on the MS COCO 2014 [57] dataset. The learning rate starts from 1e−2 and decays to 5e−4 with cosine annealing. We evaluate the mean Average Precision (mAP) at an IoU threshold value of 0.5 (mAP@0.5), as well as the average mAP over the IoU threshold range from 0.5 to 0.95 (mAP@(0.5 : 0.95)).

For RNNs, we evaluate three networks. The first one is an LSTM network with 256 hidden neurons in two layers [58] on the Penn Tree Bank (PTB) [59] dataset for the machine translation application, with perplexity (PPL) as the evaluation metric (lower PPL is better). The second is a network based on GRU with 1024 hidden neurons in two layers [60] on the TIMIT acoustic-phonetic continuous speech corpus [61] dataset for the speech recognition application. The evaluation metric is Phoneme Error Rate (PER), and lower PER is better. Finally, we use another LSTM network with three hidden layers, each having 512 neurons, on the IMDB [62] dataset for sentiment classification. Our learning rate is 1e−3 for all the RNNs.

2) Result Analysis: Tables II, III, and IV summarize the quantization results for image classification. Table II compares different quantization schemes including power-of-2 (P2), fixed-point (Fixed), sum-of-power-of-2 (SP2), and our mixed scheme quantization (MSQ). Two partitioning ratios are tested for MSQ, the first one being PR_SP2:Fixed = 1 : 1, and the second one being PR_SP2:Fixed = 2 : 1 that is the optimal
TABLE III. Comparisons with existing works with ResNet-18 model on ImageNet dataset.

    Methods         Bit-width (W/A)   Top-1 (%)   Top-5 (%)
    Baseline (FP)   32/32             69.76       89.08
    Dorefa [38]     4/4               68.10       88.10
    PACT [39]       4/4               69.20       89.00
    DSQ [40]        4/4               69.56       N/A
    QIL [41]        4/4               70.10       N/A
    μL2Q [42]       4/32              65.92       86.72
    LQ-NETS [44]    4/4               69.30       88.80
    MSQ             4/4               70.27       89.42

TABLE V. YOLO-v3 on COCO 2014 dataset with 4-bit quantization (8× compression rate).

    Image Size   Scheme          mAP@(0.5:0.95)   mAP@0.5
    320          Baseline (FP)   37.7             56.8
    320          MSQ             35.8             53.9
    640          Baseline (FP)   45.6             64.7
    640          MSQ             44.1             64.8

TABLE VI. RNN on machine translation, speech recognition, and sentiment classification.
For the IMDB dataset, EQM loses nearly 1% accuracy while MSQ only loses 0.06% accuracy. Note that we have not found any DNN quantization works investigating the TIMIT dataset, so we could not compare with existing works on TIMIT.

V. FPGA IMPLEMENTATION: DESIGN AND OPTIMIZATION

Besides obtaining the accuracy advantage, the proposed MSQ, assembling the fixed-point and SP2 quantization schemes, significantly promotes the efficiency of the FPGA deployment. Specifically, the newly joined SP2 quantization provides two apparent advantages in the hardware aspect: (i) the multiplication arithmetic involving the SP2 quantized weights can be implemented with a simple logic shifter and adder, instead of the conventional multiplier; and (ii) since the FPGA underlying components include DSPs and LUTs, the remaining LUTs can be leveraged for computations with SP2 weights while the DSPs are simultaneously fully utilized for conventional multiplications. Therefore, with the proposed MSQ as an ensemble of fixed-point and SP2, the same device can possibly deliver higher performance than existing designs, in which the throughput is theoretically bounded by the DSP count.

This section addresses the hardware design challenges with mixed number systems. Please note that the hardware benefit from SP2 is orthogonal to prior research efforts (e.g., dataflow [65] and locality [66] optimization), and therefore can be employed by any existing DNN accelerator.

A. FPGA Resource Characterization

Fig. 2. Resource ratio of different FPGA devices. For each device, the LUT, FF, and BRAM numbers are all normalized with respect to the DSP number.

FPGA devices provide different types of resources, i.e., DSP, LUT, BRAM, and FF, for computation and storage, and the resource amount ratios vary across FPGA devices. Figure 2 presents the resource ratios of Zynq series devices (each device name starts with "XC", which is omitted for simplicity), with each bar normalized by the DSP count on the corresponding device. The ratio of LUTs to DSPs attracts our attention, since this number directly decides the building block for multiplications with fixed-point and SP2 quantized weights, respectively. Apparently, the ratio of LUT/DSP in the XC7Z045/XC7Z020 devices is larger than that in the XCZU4CG/XCZU5CG devices. This also occurs in FPGA devices of other types. Specifically, since the multiplications with fixed-point and SP2 weights consume the DSPs and LUTs, respectively, the LUT/DSP ratio decides the parallel PE counts for these two operation types. For different devices, we select different proper ratios of PE counts for fixed-point and SP2 according to the available resource amount. Importantly, the PE ratio is used as the desired SP2/fixed-point ratio and sent to Algorithm 2 to obtain the properly quantized models with the novel MSQ scheme.
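The resource characterization step can be summarized by a small calculation. The sketch below is a back-of-the-envelope model with made-up per-PE costs (luts_per_sp2_pe is a hypothetical figure, not a synthesis result), showing how a LUT/DSP ratio translates into a target SP2/fixed-point partition rate handed to Algorithm 2.

    def sp2_fixed_pe_ratio(num_luts, num_dsps, luts_per_sp2_pe=60, dsps_per_fixed_pe=1):
        # Hypothetical cost model: each fixed-point PE consumes one DSP, each SP2 PE
        # (shift + add) consumes some number of LUTs. The achievable PE counts then
        # give the desired SP2 partition rate PR_SP2 for Algorithm 2.
        fixed_pes = num_dsps // dsps_per_fixed_pe
        sp2_pes = num_luts // luts_per_sp2_pe
        pr_sp2 = sp2_pes / (sp2_pes + fixed_pes)
        return fixed_pes, sp2_pes, pr_sp2

    # Example with approximate counts for an XC7Z020-class device (illustrative only):
    # fixed, sp2, pr = sp2_fixed_pe_ratio(num_luts=53_200, num_dsps=220)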
B. Architecture with Heterogeneous GEMM Engines

This section provides a design based on the versatile tensor accelerator (VTA) [67]. The hardware framework contains four modules as shown in Figure 3(a), where the Instruction module loads the instructions and provides control signals to the other modules. The Load and Store modules control the input/output activation and weight data communication between on-chip buffers and DRAM. The Compute module executes the workloads, with the RegFile as the scratchpad memory for partial sum accumulation and the TensorALU computing the element-wise operations (e.g., activation). The major computation components are the general-purpose matrix multiplication (GEMM) cores. Different from VTA, there are two heterogeneous GEMM cores, GEMMfixed for conventional multiplications and GEMMsp2 for SP2 operations. Beyond the conventional GEMM acceleration framework, our GEMMfixed can be naturally combined with advanced GEMM acceleration frameworks with architectural optimizations on the fixed-point operations (which use the DSP resources on FPGA). An example is Bit-Fusion [11], which is orthogonal to and can be combined with our MSQ. Firstly, the fixed-point operations executed on DSPs in our MSQ framework can be accelerated by Bit-Fusion. Secondly, MSQ assigns a large portion (beyond 50%) of the computations in each layer to SP2 and leverages LUTs for computation, which are previously not fully exploited by fixed-point acceleration techniques like Bit-Fusion. A doubling of performance can be anticipated as fixed-point and SP2 are computed in parallel on the FPGA.

The detailed workflow of the two GEMM cores is illustrated in Figure 3(b). A tiled block of input activation data with a size of Bat × Blkin is read from the input buffer to the register array, where Bat is the batch size and Blkin is the input channel count of the tile that will be computed in parallel. Note that the input activation will be broadcast to both GEMM cores. As Figure 3(c) displays, the GEMMfixed core is composed of multipliers implemented with DSPs on the FPGA, while the GEMMsp2 core uses LUTs to realize shift and addition for the novel SP2-based computations. Meanwhile, two weight buffers provide the weight values in fixed-point and SP2 formats, respectively. The partial results will be accumulated and stored in individual register files, and the final results are written to individual output buffers. Because the filters are allocated to the heterogeneous GEMM cores depending on their weight representation format, two filter index buffers are set to instruct the Store unit to write the output data to the
proper global addresses. Figure 3(c) gives a detailed structure to handle the fixed-point and SP2 operations in the two GEMM cores.

Fig. 3. Hardware architecture of convolution for the MSQ number system. (a) Overall framework with the GEMMfixed core for fixed-point operations and the GEMMsp2 core for SP2 operations; (b) Dataflow in the heterogeneous GEMM cores; and (c) Computations in the heterogeneous GEMM cores.

TABLE VII. Hardware implementation parameters with different devices and settings. Bat, Blkin, and Blkout,fixed are set such that the DSP utilization can reach the maximum. Blkout,sp2 is increased until the LUT utilization is high enough and optimized.

Two design parameters, Blkout,fixed and Blkout,sp2, indicate the parallel PE count in each GEMM core and the size of the corresponding register arrays, as illustrated in Figure 3(b).
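A software analogue of the dataflow in Figure 3(b) is sketched below: each Bat × Blkin activation tile is broadcast to two "cores", one multiplying fixed-point weight rows and one applying shift-and-add SP2 rows, with results accumulated into separate output blocks. This is an illustration of the work partitioning only (names, shapes, and integer formats are our assumptions), not a model of the RTL.

    import torch

    def heterogeneous_gemm(act_tile, W_fixed, W_sp2_exp, sp2_sign):
        # act_tile  : [Bat, Blk_in] integer activations, broadcast to both cores.
        # W_fixed   : [Blk_out_fixed, Blk_in] integer fixed-point weights (DSP core).
        # W_sp2_exp : [Blk_out_sp2, Blk_in, 2] exponent pairs (b1, b2) of SP2 weights (LUT core).
        # sp2_sign  : [Blk_out_sp2, Blk_in] signs (+1/-1) of the SP2 weights.
        a = act_tile.unsqueeze(1)                                      # broadcast over output rows
        out_fixed = (a * W_fixed).sum(dim=-1)                          # multiply-accumulate (DSPs)
        shifted = (a << W_sp2_exp[..., 0]) + (a << W_sp2_exp[..., 1])  # two shifts + one add (LUTs)
        out_sp2 = (sp2_sign * shifted).sum(dim=-1)                     # accumulate along Blk_in
        return out_fixed, out_sp2                                      # separate output buffers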
Fig. 4. FPGA resource utilization with different devices and settings. In the three designs for each of the two devices, the DSP utilization is maintained at 100% and the LUT utilization is raised to 70%−80% with FF and BRAM resources.

2) Real-world Performance and Comparison: To present the performance with real-world applications, we employed different CNN and RNN models with the proper SP2/fixed-point ratios on the two devices. The networks ResNet-18 and MobileNet-v2 are implemented based on the ImageNet dataset. The performance results of each network under various hardware configurations are displayed in Table VIII. For some layers in CNNs, like the first convolutional layer, the peak throughput cannot be reached since the number of input channels is less than Blkin, so the data cannot fill all of the PEs. Generally, for CNN models, the overall PE utilization reaches 52.4% to 70.1%, and the heterogeneous GEMMfixed and GEMMsp2 cores improve the throughput by 2.1×−2.5× with the optimal design compared to utilizing the GEMMfixed core only. Compared with the design with only 4-bit fixed-point (fixed4/SP2 = 1 : 0) quantization, the optimal design with the ratio of fixed4/SP2 = 1 : 1.5 on XC7Z020 decreases the latency per image from 100.7 ms to 47.1 ms (2.13×) for ResNet-18, and the optimal design with the ratio of fixed4/SP2 = 1 : 2 on XC7Z045 decreases the latency from 25.1 ms to 10.1 ms (2.49×) for ResNet-18. The latency improvement is more significant when compared with the 8-bit fixed-point design, as the optimal design on XC7Z020 achieves a latency decrease from 181.3 ms to 47.1 ms (3.83×), and the optimal design on XC7Z045 achieves a latency decrease from 45.2 ms to 10.1 ms (4.48×). As for RNN models, the PE utilization is 42.9%−59.2%, and the performance is increased by 2.4×−4.1×.
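These gains are consistent with a simple parallel-execution view: when a layer's multiplications are split between the DSP-based and LUT-based cores, the layer latency is governed by the slower partition. The sketch below is a back-of-the-envelope model under our own simplifying assumptions (perfect overlap, one MAC per PE per cycle, no memory stalls, an assumed clock frequency), not the timing methodology behind the reported measurements.

    def layer_latency(total_macs, pr_sp2, fixed_pes, sp2_pes, freq_hz=100e6):
        # Split the layer's multiply-accumulates according to the SP2 partition rate;
        # with both cores running in parallel, latency is set by the slower partition.
        sp2_macs = total_macs * pr_sp2
        fixed_macs = total_macs * (1 - pr_sp2)
        cycles = max(fixed_macs / fixed_pes, sp2_macs / sp2_pes)
        return cycles / freq_hz

    # When pr_sp2 is chosen so that both cores finish together, the DSP-only latency
    # shrinks by roughly 1 + sp2_pes / fixed_pes, which is consistent with the
    # 2.1x - 4.1x range reported above.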
The optimal MSQ implementations of CNNs based on ImageNet and previous designs are compared in Table IX, from which it can be observed that our ResNet-18 implementations achieve the highest accuracy and enjoy comparable hardware utilization efficiency, represented by GOPS/DSP and GOPS/kLUT, with the designs in [68], [69]. The work [70] acquires higher utilization efficiency but much lower accuracy. MobileNet-v2 has the most complicated structure among all these networks, making it difficult to deploy on hardware platforms, but our designs can still achieve high performance, especially in terms of frame rate. We do not find implementations with ResNet-18 and MobileNet-v2 in other work, so we compare with other CNNs.

Our proposed solution is beneficial over low-precision GPU for the following two reasons: (1) the current low-precision GPU (TensorRT) solution relies on 8-bit, while we can go to 4-bit, further assisted by SP2; (2) the FPGA solution is dataflow-based and energy-efficient in general [71]. Comparing with a state-of-the-art energy-efficient GPU (NVIDIA Jetson AGX, power consumption 10-15 W) with TensorRT support, using ResNet-18 as an example and measured under the same accuracy, our FPGA solution (XC7Z045) achieves slightly higher performance (99 FPS vs. 78 FPS) but more than 3× higher energy efficiency, as the FPGA only consumes around 4 W of power.

VII. RELATED WORK

This section introduces the DNN weight quantization methods/algorithms for fixed-point and P2 quantization schemes, and discusses DNN weight quantization on FPGA platforms.

A. DNN Quantization Methods

Zhou et al. [38] first explored the potential of fixed-point quantization by introducing a hyperbolic tangent transformation to weights and activations, with scaling factors to minimize the quantization error. Choi et al. [39] improved this method by adding a parameterized clipping threshold to activations. As alternatives for solving the non-differentiability problem, DSQ [40] developed an evolving training method to gradually approximate STE. QIL [41] parameterized the quantization interval and trained it with the task loss, avoiding access to the original training data. μL2Q [42] introduced a data distribution loss during training to minimize the quantization error. LQ-Nets [44] and LSQ [43] proposed differentiable methods to learn the quantizer for each layer jointly with the parameters. Miyashita et al. [45] replaced the fixed-point quantizer with a logarithmic representation to exploit bit-shift operations to accelerate inference. INQ [46] splits weights into groups and iteratively quantizes the model to low bit-width. Leng et al. [47] employed the ADMM training technique to increase the accuracy of extremely low bit-width DNNs.

In addition to these quantization methods for inference acceleration, Zhu et al. [72] proposed a low-bit training framework for training acceleration. They used direction-sensitive gradient clipping and deviation-counteractive learning rate scaling to ensure a unified 8-bit (INT8) training with minor accuracy degradation.

B. Weight Quantization in FPGA Implementations

Weight quantization has been widely applied to DNN implementations on FPGAs [73]. Some works study fixed-point quantization. The work [68] utilizes a greedy solution to determine the radix position of each layer for quantization. [70] investigates a hybrid quantization scheme that allows different bit-widths for weights, providing more flexibility. For Binarized Neural Networks (BNNs), multiplications can be executed with XNOR gates [74]–[76]. A fully binarized neural
TABLE VIII. Performance of various DNN applications on hardware under different settings.

TABLE IX. Comparisons of CNNs on ImageNet with previous implementations.
network accelerator is implemented in [76] through utilizing odd-even padding to replace the zero padding values. Another scheme, called logarithmic quantization using powers of 2, is explored in [77]. In addition, weight quantization could be employed with a two-stage arithmetic unit for low bit-width CNNs [78], a fast matrix and Winograd algorithm [79], a novel CNN architecture for software-hardware co-design [69], a design flow of DNN implementations for more flexible quantization schemes [80], and an OpenCL-based framework Deep Learning Accelerator (DLA) to accommodate designs with different bit-widths [81]. In addition, dynamic quantization with bit fusion in [11] improves the bit-level flexibility by matching various bit-widths for different DNN layers.

VIII. CONCLUSION

This paper investigates efficient DNN inference engines on FPGA devices through DNN quantization, and proposes the first solution that applies different quantization schemes to different rows of the weight matrix. We propose a hardware-friendly quantization scheme named SP2 suitable for Gaussian-like weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. To fully explore the FPGA resources, we propose an intra-layer, multi-scheme quantization framework with an ensemble of the SP2 and fixed-point schemes. We evaluate our FPGA-centric quantization framework across multiple application domains with various DNNs such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs). With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1×−4.1× compared to solely exploiting DSPs for all multiplication operations.

ACKNOWLEDGMENT

This work is partly supported by the National Science Foundation CCF-1901378, CCF-1919117, CCF-1919289, CNS-1909172 and DARPA-HR00112090055.

REFERENCES

[1] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Thirty-first AAAI conference on artificial intelligence (AAAI), 2017.
[2] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 2117–2125.
[3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
[4] W. Xiong, L. Wu, F. Alleva, J. Droppo, X. Huang, and A. Stolcke, "The microsoft 2017 conversational speech recognition system," in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 5934–5938.
[5] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th international conference on Machine learning (ICML), 2008, pp. 160–167.
[6] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, "A survey on deep learning in medical image analysis," Medical image analysis, vol. 42, pp. 60–88, 2017.
[7] J. De Fauw, J. R. Ledsam, B. Romera-Paredes, S. Nikolov, N. Tomasev, S. Blackwell, H. Askham, X. Glorot, B. O'Donoghue, D. Visentin, P. A. Keane, and O. Ronneberger, "Clinically applicable deep learning for diagnosis and referral in retinal disease," Nature medicine, vol. 24, no. 9, pp. 1342–1350, 2018.
[8] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "Lergan: A zero-free, low data movement and pim-based gan architecture," in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 669–681.
[9] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan, "Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing," Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), vol. 51, no. 2, pp. 405–418, 2017.
[10] R. Cai, A. Ren, N. Liu, C. Ding, L. Wang, X. Qian, M. Pedram, and Y. Wang, "Vibnn: Hardware acceleration of bayesian neural networks," in Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, 2018, pp. 476–488.
[11] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks," in Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA). IEEE Press, 2018, pp. 764–775.
[12] C. Deng, F. Sun, X. Qian, J. Lin, Z. Wang, and B. Yuan, "Tie: energy-efficient tensor train-based inference engine for deep neural network," in Proceedings of the 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 264–278.
[13] R. Cai, A. Ren, O. Chen, N. Liu, C. Ding, X. Qian, J. Han, W. Luo, N. Yoshikawa, and Y. Wang, "A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology," in Proceedings of the 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 567–578.
[14] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, "Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers," in Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019, pp. 925–938.
[15] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing fpga-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2015, pp. 161–170.
[16] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, "Accelerating binarized convolutional neural networks with software-programmable fpgas," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2017, pp. 15–24.
[17] Z. Chen, A. Howe, H. T. Blair, and J. Cong, "Fpga-based lstm acceleration for real-time eeg signal processing," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2018, pp. 288–288.
[18] R. Shi, Y. Ding, X. Wei, H. Liu, H. So, and C. Ding, "Ftdl: An fpga-tailored architecture for deep learning systems," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2020, pp. 320–320.
[19] W. Niu, X. Ma, S. Lin, S. Wang, X. Qian, X. Lin, Y. Wang, and B. Ren, "Patdnn: Achieving real-time dnn execution on mobile devices with pattern-based weight pruning," in Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020, pp. 907–922.
[20] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "Tvm: An automated end-to-end optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
[21] https://github.com/alibaba/MNN.
[22] A. Paszke, S. Gross, S. Chintala, and G. Chanan, "Pytorch," 2017.
[23] https://www.tensorflow.org/mobile/tflite/.
[24] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, "Rethinking the value of network pruning," International Conference on Learning Representations (ICLR), 2019.
[25] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in neural information processing systems (NeurIPS), 2016, pp. 2074–2082.
[26] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient dnns," in Advances in neural information processing systems (NeurIPS), 2016, pp. 1379–1387.
[27] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, "Discrimination-aware channel pruning for deep neural networks," in Advances in Neural Information Processing Systems (NeurIPS), 2018, pp. 875–886.
[28] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, "Nisp: Pruning networks using neuron importance score propagation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.
[29] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, "Filter pruning via geometric median for deep convolutional neural networks acceleration," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.
[30] X. Dong and Y. Yang, "Network pruning via transformable architecture search," in Advances in Neural Information Processing Systems (NeurIPS), 2019, pp. 759–770.
[31] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in Advances in neural information processing systems (NeurIPS), 2015, pp. 3123–3131.
[32] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[33] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "Xnor-net: Imagenet classification using binary convolutional neural networks," in European conference on computer vision (ECCV). Springer, 2016, pp. 525–542.
[34] X. Lin, C. Zhao, and W. Pan, "Towards accurate binary convolutional neural network," in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 345–353.
[35] F. Li, B. Zhang, and B. Liu, "Ternary weight networks," arXiv preprint arXiv:1605.04711, 2016.
[36] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in International Conference on Learning Representations (ICLR), 2017.
[37] Z. He and D. Fan, "Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11438–11446.
[38] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[39] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, "Pact: Parameterized clipping activation for quantized neural networks," arXiv preprint arXiv:1805.06085, 2018.
[40] R. Gong, X. Liu, S. Jiang, T. Li, P. Hu, J. Lin, F. Yu, and J. Yan, "Differentiable soft quantization: Bridging full-precision and low-bit neural networks," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4852–4861.
[41] S. Jung, C. Son, S. Lee, J. Son, J.-J. Han, Y. Kwak, S. J. Hwang, and C. Choi, "Learning to quantize deep networks by optimizing quantization intervals with task loss," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4350–4359.
[42] G. Cheng, L. Ye, L. Tao, Z. Xiaofan, H. Cong, C. Deming, and C. Yao, "μl2q: An ultra-low loss quantization method for dnn," The 2019 International Joint Conference on Neural Networks (IJCNN), 2019.
[43] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, "Learned step size quantization," International Conference on Learning Representations (ICLR), 2019.
[44] D. Zhang, J. Yang, D. Ye, and G. Hua, "Lq-nets: Learned quantization for highly accurate and compact deep neural networks," in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 365–382.
[45] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional neural networks using logarithmic data representation," arXiv preprint arXiv:1603.01025, 2016.
[46] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless cnns with low-precision weights," in International Conference on Learning Representations (ICLR), 2017.
[47] C. Leng, Z. Dou, H. Li, S. Zhu, and R. Jin, "Extremely low bit neural network: Squeeze the last bit out with admm," in Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
[48] C. Baskin, E. Schwartz, E. Zheltonozhskii, N. Liss, R. Giryes, A. M. Bronstein, and A. Mendelson, "Uniq: Uniform noise injection for non-uniform quantization of neural networks," arXiv preprint arXiv:1804.10969, 2018.
[49] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," in Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), 2015, pp. 1613–1622.
[50] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[51] P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin, "Understanding straight-through estimator in training activation quantized neural nets," in International Conference on Learning Representations (ICLR), 2018.
[52] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
[53] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2018, pp. 4510–4520.
[54] A. Krizhevsky, "Learning multiple layers of features from tiny images," 2009.
[55] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.
[56] J. Redmon and A. Farhadi, "Yolov3: An incremental improvement," CoRR, vol. abs/1804.02767, 2018. [Online]. Available: http://arxiv.org/abs/1804.02767
[57] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: common objects in context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[58] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[59] M. Marcus, B. Santorini, and M. A. Marcinkiewicz, "Building a large annotated corpus of english: The penn treebank," 1993.
[60] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[61] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, "Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1," NASA STI/Recon technical report n, vol. 93, 1993.
[62] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," in Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1. Association for Computational Linguistics, 2011, pp. 142–150.
[63] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou, "Effective quantization methods for recurrent neural networks," arXiv preprint arXiv:1611.10176, 2016.
[64] P. Zhang, Y. Zhong, and X. Li, "Slimyolov3: Narrower, faster and better for real-time UAV applications," CoRR, vol. abs/1907.11093, 2019. [Online]. Available: http://arxiv.org/abs/1907.11093
[65] Q. Sun, T. Chen, J. Miao, and B. Yu, "Power-driven dnn dataflow optimization on fpga," in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, pp. 1–7.
[66] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, "Fp-dnn: An automated framework for mapping deep neural networks onto fpgas with rtl-hls hybrid templates," in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017, pp. 152–159.
[67] T. Moreau, T. Chen, L. Vega, J. Roesch, E. Yan, L. Zheng, J. Fromm, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, "A hardware–software blueprint for flexible deep learning specialization," IEEE Micro, vol. 39, no. 5, pp. 8–16, 2019.
[68] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-eye: A complete design flow for mapping cnn onto embedded fpga," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 37, no. 1, pp. 35–47, 2017.
[69] Y. Yang, Q. Huang, B. Wu, T. Zhang, L. Ma, G. Gambardella, M. Blott, L. Lavagno, K. Vissers, J. Wawrzynek, and K. Keutzer, "Synetgy: Algorithm-hardware co-design for convnet accelerators on embedded fpgas," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA). ACM, 2019, pp. 23–32.
[70] J. Wang, Q. Lou, X. Zhang, C. Zhu, Y. Lin, and D. Chen, "Design flow of accelerating hybrid extremely low bit-width neural network in embedded fpga," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 163–1636.
[71] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding performance differences of fpgas and gpus: (abstract only)," ser. FPGA '18. New York, NY, USA: Association for Computing Machinery, 2018, p. 288. [Online]. Available: https://doi.org/10.1145/3174243.3174970
[72] F. Zhu, R. Gong, F. Yu, X. Liu, Y. Wang, Z. Li, X. Yang, and J. Yan, "Towards unified int8 training for convolutional neural network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1969–1979.
[73] K. Guo, W. Li, K. Zhong, Z. Zhu, S. Zeng, S. Han, Y. Xie, P. Debacker, M. Verhelst, and Y. Wang, "Neural network accelerator comparison," https://nicsefc.ee.tsinghua.edu.cn/projects/neural-network-accelerator/.
[74] H. Nakahara, H. Yonekawa, T. Sasao, H. Iwamoto, and M. Motomura, "A memory-based realization of a binarized deep convolutional neural network," in 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 2016, pp. 277–280.
[75] H. Nakahara, T. Fujii, and S. Sato, "A fully connected layer elimination for a binarized convolutional neural network on an fpga," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–4.
[76] P. Guo, H. Ma, R. Chen, P. Li, S. Xie, and D. Wang, "Fbna: A fully binarized neural network accelerator," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 51–513.
[77] C. Luo, W. Cao, L. Wang, and P. H. Leong, "Rna: An accurate residual network accelerator for quantized and reconstructed deep neural networks," IEICE Transactions on Information and Systems, vol. 102, no. 5, pp. 1037–1045, 2019.
[78] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang, "Accelerating low bit-width convolutional neural networks with embedded fpga," in 2017 27th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–4.
[79] D. Wu, J. Chen, W. Cao, and L. Wang, "A novel low-communication energy-efficient reconfigurable cnn acceleration architecture," in 2018 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2018, pp. 64–643.
[80] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, "Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2018, pp. 1–8.
[81] P. Colangelo, N. Nasiri, E. Nurvitadhi, A. Mishra, M. Margala, and K. Nealis, "Exploration of low numeric precision deep learning inference using intel® fpgas," in 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018, pp. 73–80.