Resiliency of Deep Neural Networks Under Quantizations
ABSTRACT
The complexity of deep neural network algorithms for hardware implementation
can be much lowered by optimizing the word-length of weights and signals. Direct
quantization of floating-point weights, however, does not show good performance
when the number of bits assigned is small. Retraining of quantized networks has
been developed to relieve this problem. In this work, the effects of quantization
are analyzed for a feedforward deep neural network (FFDNN) and a convolutional
neural network (CNN) as their network complexity is changed. The complexity of the FFDNN is controlled by varying the unit size in each hidden layer and the number of layers, while that of the CNN is controlled by modifying the feature map
configuration. We find that some performance gap exists between the floating-
point and the retrain-based ternary (+1, 0, -1) weight neural networks when the
size is not large enough, but the discrepancy almost vanishes in fully complex net-
works whose capability is limited by the training data, rather than by the number
of connections. This research shows that highly complex DNNs have the capa-
bility of absorbing the effects of severe weight quantization through retraining,
but connection limited networks are less resilient. This paper also presents the
effective compression ratio to guide the trade-off between the network size and
the precision when the hardware resource is limited.
1 INTRODUCTION
Deep neural networks (DNNs) are beginning to find many real-time applications, such as speech recognition, autonomous driving, gesture recognition, and robotic control (Sak et al., 2015; Chen et al., 2015; Jalab et al., 2015; Corradini et al., 2015). Although most deep neural networks are implemented using GPUs (Graphics Processing Units) these days, their implementation in hardware can give many benefits in terms of power consumption and system size (Ovtcharov et al., 2015). FPGA-based implementation examples of CNNs show more than a 10-times advantage in power consumption (Ovtcharov et al., 2015).
Neural network algorithms employ many multiply and add (MAC) operations that mimic the operations of biological neurons. This suggests that reconfigurable hardware arrays that contain fairly homogeneous hardware blocks, such as MAC units, can give a very efficient solution to real-time neural network system design. Early studies on word-length determination of neural networks reported that a precision of at least 8 bits is needed (Holt & Baker, 1991). Our recent works show that the precision required for implementing FFDNNs, CNNs, or RNNs need not be very high, especially when the quantized networks are trained again to learn the effects of lowered precision. In the fixed-point optimization examples shown in Hwang & Sung (2014); Anwar et al. (2015); Shin et al. (2015), neural networks with ternary weights showed quite good performance that was close to that of floating-point arithmetic.
In this work, we investigate whether retraining can recover the performance of FFDNNs and CNNs under quantization with only ternary (+1, 0, -1) levels or 3 bits (+3, +2, +1, 0, -1, -2, -3) for weight
representation. Note that bias values are not quantized. For this study, the network complexity is changed to analyze its effect on the performance gap between floating-point and retrained low-precision fixed-point deep neural networks.
We conduct our experiments with a feed-forward deep neural network (FFDNN) for phoneme recog-
nition and a convolutional neural network (CNN) for image classification. To control the network
size, not only the number of units in each layer but also the number of hidden layers are varied in the
FFDNN. For the CNN, the number of feature maps for each layer and the number of layers are both
changed. The FFDNN uses the TIMIT corpus and the CNN employs the CIFAR-10 dataset. We
also propose a metric called the effective compression ratio (ECR) for comparing extremely quantized bigger networks with moderately quantized or floating-point networks of smaller size. This analysis intends to provide insight into the knowledge representation capability of highly quantized networks, and also provides a guideline for network size and word-length determination for efficient hardware implementation of DNNs.
2 RELATED WORK
Fixed-point implementation of signal processing algorithms has long been of interest for VLSI-based design of multimedia and communication systems. Some early works used statistical modeling of quantization noise for application to linear digital filters. The simulation-based word-length optimization method utilized simulation tools to evaluate the fixed-point performance of a system, by which non-linear algorithms can be optimized (Sung & Kum, 1995). Ternary (+1, 0, -1) coefficient-based digital filters were used to eliminate multiplications at the cost of higher quantization noise. The implementation of adaptive filters with ternary weights was developed, but it demanded oversampling to remove the quantization effects (Hussain et al., 2007).
Fixed-point neural network design has also been studied with the same purpose of reducing the hardware implementation cost (Moerland & Fiesler, 1997). In Holt & Baker (1991), back propagation simulation with 16-bit integer arithmetic was conducted for several problems, such as NetTalk, Parity, Protein, and so on. This work conducted the experiments while changing the number of hidden units, which was, however, relatively small. The integer simulations showed quite good results for NetTalk and Parity, but not for the Protein benchmark. With direct quantization of trained weights, this work also confirmed satisfactory operation of neural networks with 8-bit precision. An implementation with ternary weights was reported for neural network design with optical fiber networks (Fiesler et al., 1990). In this ternary network design, the authors employed retraining after direct quantization to improve the performance of a shallow network.
Recently, fixed-point design of DNNs has been revisited, and FFDNNs and CNNs with ternary weights show quite good performance that is very close to the floating-point results. The ternary weight based FFDNN and CNN are used for VLSI and FPGA based implementations, by which the algorithms can operate with only on-chip memory, consuming very low power (Kim et al., 2014). Binary weight based deep neural network design has also been studied (Courbariaux et al., 2015). Pruned floating-point weights are also utilized for efficient GPU based implementations, where small-valued weights are forced to zero to reduce the number of arithmetic operations and the memory space for weight storage (Yu et al., 2012b; Han et al., 2015). Network restructuring using the singular value decomposition technique has also been studied (Xue et al., 2013; Rigamonti et al., 2013).
A feedforward deep neural network with multiple hidden layers is depicted in Figure 1. Each layer
k has a signal vector yk , which is propagated to the next layer by multiplying the weight matrix
Wk+1 , adding biases bk+1 , and applying the activation function φk+1 (·) as follows:
y_{k+1} = φ_{k+1}(W_{k+1} y_k + b_{k+1}).    (1)
Figure 1: Structure of the FFDNN, with the input layer, four hidden layers h1-h4, and the output layer.

Figure 2: Structure of the CNN, with the input, convolution and pooling layers C1-S1, C2-S2, and C3-S3, the fully connected layer F1, and the output; the weight groups are In-C1, S1-C2, S2-C3, S3-F1, and F1-Out.
One of the most popular activation functions is the rectified linear unit defined as
Relu(x) = max(0, x). (2)
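As a concrete illustration of Eqs. (1) and (2), the short NumPy sketch below propagates a signal vector through one hidden layer. The layer sizes and random weights are illustrative placeholders, not the trained parameters used in the experiments.

```python
import numpy as np

def relu(x):
    # Eq. (2): rectified linear unit
    return np.maximum(0.0, x)

def forward_layer(y_k, W_k1, b_k1, activation=relu):
    # Eq. (1): y_{k+1} = phi_{k+1}(W_{k+1} y_k + b_{k+1})
    return activation(W_k1 @ y_k + b_k1)

# Illustrative sizes (not the trained network): 1353 inputs, 512 hidden units.
rng = np.random.default_rng(0)
y0 = rng.standard_normal(1353)
W1 = 0.01 * rng.standard_normal((512, 1353))
b1 = np.zeros(512)
y1 = forward_layer(y0, W1, b1)
print(y1.shape)  # (512,)
```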
In this work, an FFDNN for phoneme recognition is used. The reference DNN has four hidden layers. Each of the hidden layers has Nh units; the value of Nh is changed to control the complexity of the network. We conduct experiments with Nh sizes of 32, 64, 128, 256, 512, and 1024. The number of hidden layers is also reduced. The input layer of the network has 1,353 units to accept 11 frames of a Fourier-transform-based filter-bank with 40 coefficients (+energy) distributed on a mel-scale, together with their first and second temporal derivatives. The output layer consists of 61 softmax units which correspond to 61 target phoneme labels. Phoneme recognition experiments were performed on the TIMIT corpus. The standard 462-speaker set with all SA records removed was used for training, and a separate development set of 50 speakers was used for early stopping. Results are reported for the 24-speaker core test set. The network was trained using a backpropagation algorithm with a mini-batch size of 128. The initial learning rate was 10^-5 and it was decreased to 10^-7 during training. Momentum was 0.9 and RMSProp was adopted for weight updates (Tieleman & Hinton, 2012). The dropout technique was employed with a dropout rate of 0.2 in each layer.
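The exact optimizer implementation beyond the listed hyperparameters is not specified here; the following is a minimal sketch of one common way to combine RMSProp scaling (Tieleman & Hinton, 2012) with classical momentum, using the stated learning rate and momentum. The decay constant and epsilon are assumed values.

```python
import numpy as np

def rmsprop_momentum_step(w, grad, cache, velocity,
                          lr=1e-5, momentum=0.9, decay=0.9, eps=1e-8):
    """One weight update combining RMSProp scaling with classical momentum.

    `decay` and `eps` are assumed values; only the learning-rate range
    (1e-5 down to 1e-7) and the momentum of 0.9 are stated in the text.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2   # running mean of squared gradients
    scaled_grad = grad / (np.sqrt(cache) + eps)         # RMSProp-normalized gradient
    velocity = momentum * velocity - lr * scaled_grad   # momentum accumulation
    return w + velocity, cache, velocity
```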
The CNN used is for the CIFAR-10 dataset. It contains a training set of 50,000 and a test set of 10,000 32×32 RGB color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. We divided the training set into 40,000 images for training and 10,000 images for validation. This CNN has 3 convolution and pooling layers and a fully connected hidden layer with 64 units, and the output has 10 softmax units as shown in Figure 2. We control the number of feature maps in each convolution layer. The reference size has 32-32-64 feature maps with a 5-by-5 kernel size as used in Krizhevskey (2014). We did not perform any preprocessing or data augmentation such as ZCA whitening or global contrast normalization. To examine the effects of network size variation, the number of feature maps is reduced or increased. The configurations of the feature maps used for the experiments are 8-8-16, 16-16-32, 32-32-64, 64-64-128, 96-96-192, and 128-128-256. The number of feature map layers is also changed, resulting in 32-32-64, 32-64,
and 64 map configurations. Note that the fully connected layer in the CNN is not changed. The network was trained using a backpropagation algorithm with a mini-batch size of 128. The initial learning rate was 0.001 and it was decreased to 10^-8 during the training procedure. Momentum was 0.8 and RMSProp was applied for weight updates.
Reducing the word-length of weights brings several advantages for hardware-based implementation of neural networks. First, it lowers the arithmetic precision, and thereby reduces the number of gates needed for multipliers. Second, the size of the memory for storing weights is minimized, which is a big advantage when keeping them on-chip, instead of in external DRAM or NAND flash memory. Note that FFDNNs and recurrent neural networks demand a very large number of weights. Third, the reduced arithmetic precision or minimization of off-chip memory accesses leads to low power consumption. However, we need to consider the quantization effects that degrade the system performance.
Direct quantization converts a floating-point value to the closest integer number, which is conven-
tionally used in signal processing system design. However, direct quantization usually demands
more than 8 bits, and does not show good performance when the number of bits is small. In fixed-
point deep neural network design, retraining of quantized weights shows quite good performance.
The fixed-point DNN algorithm design consists of three steps: floating-point training, direct quan-
tization, and retraining of weights. The floating-point training procedure can be any of the state of
the art techniques, which may include unsupervised learning and dropout. Note that fixed-point op-
timization needs to be based on the best performing floating-point weights. Thus, the floating-point
weight optimization may need to be conducted several times with different initializations, and this
step consumes most of the time. After the floating-point training, direct quantization follows.
For direct quantization, a uniform quantization function is employed and the function Q(·) is defined as follows:
Q(w) = sgn(w) · ∆ · min(⌊|w|/∆ + 0.5⌋, (M − 1)/2),    (3)
where sgn(·) is a sign function, ∆ is a quantization step size, and M represents the number of
quantization levels. Note that M needs to be an odd number since the weight values can be posi-
tive or negative. When M is 7, the weights are represented by -3·∆, -2·∆, -1·∆, 0, +1·∆, +2·∆,
+3·∆, which can be represented in 3 bits.
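The quantizer of Eq. (3) can be transcribed directly into NumPy; the sketch below is illustrative, with `delta` and `M` standing for the step size ∆ and the number of quantization levels (M = 3 for ternary and M = 7 for 3-bit weights).

```python
import numpy as np

def quantize(w, delta, M):
    """Uniform symmetric quantizer of Eq. (3).

    w     : array of floating-point weights
    delta : quantization step size
    M     : odd number of quantization levels (3 for ternary, 7 for 3-bit)
    """
    max_level = (M - 1) // 2
    # round-to-nearest on the magnitude, then clip to the largest level
    levels = np.minimum(np.floor(np.abs(w) / delta + 0.5), max_level)
    return np.sign(w) * delta * levels

# Example: ternary quantization with step size 0.5
w = np.array([-1.3, -0.2, 0.1, 0.4, 0.9])
print(quantize(w, delta=0.5, M=3))   # all values map onto {-0.5, 0.0, +0.5}
```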
The quantization step size ∆ is determined to minimize the L2 error, E, defined as follows:

E = (1/2) Σ_{i=1}^{N} (Q(w_i) − w_i)²,    (4)
where N is the number of weights in each weight group, and w_i is the i-th weight value represented in floating-point. This process needs some iterations, but does not take much time.
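The iteration used to find ∆ is not detailed in the text; one plausible sketch alternates between assigning integer levels with the current ∆ and refitting ∆ in closed form by least squares against those levels. Both the alternating scheme and the initial guess are assumptions.

```python
import numpy as np

def find_step_size(w, M, n_iter=20):
    """Search a step size `delta` that (locally) minimizes Eq. (4).

    Alternates between (a) assigning each weight to a signed integer level with
    the current delta and (b) refitting delta by least squares against those
    levels.  The alternating scheme and the initialization are assumptions; the
    text only states that the search needs some iterations.
    """
    max_level = (M - 1) // 2
    delta = 2.0 * np.std(w) / max_level          # assumed initial guess
    for _ in range(n_iter):
        levels = np.sign(w) * np.minimum(np.floor(np.abs(w) / delta + 0.5), max_level)
        if np.sum(levels ** 2) == 0:
            break
        # least-squares delta for fixed integer levels: argmin_d sum (d*q_i - w_i)^2
        delta = np.dot(levels, w) / np.sum(levels ** 2)
    return delta

# Example: fit a ternary step size to Gaussian weights
rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal(10000)
print(find_step_size(w, M=3))
```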
For network retraining, we maintain both floating-point and quantized weights because the amount
of weight updates in each training step is much smaller than the quantization step size ∆. The
forward and backward propagation is conducted using quantized weights, but the weight update is
applied to the floating-point weights and newly quantized values are generated at each iteration.
This retraining procedure usually converges quickly and does not take much time when compared
to the floating-point training.
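A minimal sketch of this retraining step is given below: the forward and backward passes use the quantized weights, while the update is accumulated in a separate floating-point copy that is re-quantized at every iteration. The `compute_gradient` callable is a placeholder for the usual backpropagation pass, and the learning rate is illustrative.

```python
import numpy as np

def retrain_step(w_float, delta, M, compute_gradient, lr=1e-5):
    """One retraining iteration with dual (floating-point / quantized) weights.

    `compute_gradient` is a placeholder for backpropagation; it must evaluate
    the gradient at the *quantized* weights.
    """
    max_level = (M - 1) // 2
    # forward/backward propagation uses the quantized weights
    w_quant = np.sign(w_float) * delta * np.minimum(
        np.floor(np.abs(w_float) / delta + 0.5), max_level)
    grad = compute_gradient(w_quant)
    # ...but the update is applied to the high-precision copy, because each
    # step is much smaller than the quantization step size delta
    w_float = w_float - lr * grad
    return w_float, w_quant
```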
The performance of the FFDNN and the CNN with directly quantized weights is analyzed while
varying the number of units in each layer or the number of feature maps, respectively. In this
analysis, the quantization is performed on each weight group, which is illustrated in Figure 1 and
Figure 2, to assess the sensitivity to word-length reduction. In this subsection, we try to analyze the
effects of direct quantization.
The quantized weight can be represented as follows:

w_i^q = w_i + w_i^d,    (5)

where w_i^d is the distortion of each weight due to quantization. In the direct quantization, we can assume that the distortions w_i^d are independent of each other.
Figure 3: Computation model for a unit in the hidden layer j ((a): floating-point, (b): distortion).
Figure 4: Sensitivity analysis of direct quantization ((a): FFDNN, (b): CNN). In figure (b), the x-axis label '8-16' indicates the feature map configuration '8-8-16'.
Consider the computation procedure for a unit in a hidden layer: the signals from the previous layer are multiplied by the weights and summed up, as illustrated in Figure 3a. We can also assemble a model for the distortion, which is shown in Figure 3b. In the distortion model, since the w_i^d are independent of each other, we can assume that the effect of the summed distortion is reduced according to random process theory. This analysis means that the quantization effects are reduced when the number of units in the preceding layer increases, but slowly.
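The random-process step above can be checked with a small numerical sketch: when the distortions w_i^d are independent and zero-mean, the standard deviation of their summed effect grows only like the square root of the fan-in, far below the worst-case linear bound. The step size and fan-in values below are illustrative only, not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.05                                     # illustrative quantization step size
for fan_in in (32, 128, 512, 2048):
    # independent, zero-mean distortions, one per incoming weight, over many trials
    w_d = rng.uniform(-delta / 2, delta / 2, size=(10000, fan_in))
    x = np.ones(fan_in)                          # unit inputs isolate the distortion sum
    summed = w_d @ x                             # summed distortion per trial
    # empirical std grows ~sqrt(fan_in); the worst-case bound grows linearly
    print(fan_in, round(summed.std(), 3), fan_in * delta / 2)
```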
Figure 4a illustrates the performance of the FFDNN with floating-point arithmetic, 2-bit direct quantization of all the weights, and 2-bit direct quantization only on the weight groups 'In-h1', 'h1-h2', and 'h4-out'. Considering the quantization performance of the 'In-h1' layer, the phone error rate is higher than the floating-point result by an almost constant amount, about 10%. Note that the number of inputs to the 'In-h1' layer is fixed at 1,353 regardless of the hidden unit size. Thus, the amount of distortion delivered to each unit of hidden layer 1 can be considered unchanged. Figure 4a also shows the quantization performance on the 'h1-h2' and 'h4-out' layers, which shows the trend of a
Figure 5: Performance of direct quantization with multiple precision ((a): FFDNN, (b): CNN).
reduced gap to the floating-point performance as the network size increases. This can be explained by the summation of an increased number of independent distortions as the network size grows. The performance with 2-bit quantization of all weights also shows a similar trend of a reduced gap to the floating-point performance. But, apparently, the performance of 2-bit directly quantized networks is not satisfactory.
In Figure 4b, a similar analysis is conducted for the CNN with direct quantization when the number of feature maps increases or decreases. In the CNN, the number of inputs to each output unit is determined by the number of input feature maps and the kernel size. For example, at the first layer C1, the number of input signals for computing one output is only 75 (= 3×25) regardless of the network size, since the number of input maps is always 3 and the kernel size is 25. However, at the second layer C2, the number of input feature maps increases as the network size grows. When the feature map configuration 32-32-64 is considered, the number of inputs for the C2 layer grows to 800 (= 32×25). Thus, we can expect a reduced distortion as the number of feature maps increases.
Figure 5a shows the performance of direct quantization with 2-, 4-, 6-, and 8-bit precision as the network complexity varies. In the FFDNN, 6-bit direct quantization seems sufficient when the network size is larger than 128. However, small FFDNNs demand 8 bits for near floating-point performance. The CNN in Figure 5b shows a similar trend. Direct quantization requires about 6 bits when the feature map configuration is 16-16-32 or larger.
Retraining is conducted on the directly quantized networks using the same data as for the floating-point training. The fixed-point performance of the FFDNN is shown in Figure 6a when the number of hidden units in each layer varies. The performances of direct 2-bit (ternary), direct 3-bit (7-level), retrain-based 2-bit, and retrain-based 3-bit quantization are compared with the floating-point simulation. We can find that the performance gap between the floating-point and the retrain-based fixed-point networks closes very fast as the network size grows. Although the gap between the direct-quantization and the floating-point networks also closes, the rate of convergence is significantly different. In this figure, the performance of the floating-point network almost saturates when the network size is about 1024. Note that the TIMIT corpus used for training contains only 3 hours of data. Thus, the network with 1024 hidden units can be considered to be in the 'training-data limited region'. The gap between the floating-point and fixed-point networks almost vanishes when the network is in this region. However, when the network size is limited, such as 32, 64, 128, or 256, there is some performance gap between the floating-point and highly quantized networks even if retraining of the quantized networks is performed.
Similar experiments are conducted for the CNN with varying feature map sizes, and the results are shown in Figure 6b. The configurations of the feature maps used for the experiments are 8-8-16,
Figure 6: Comparison of retrain-based and direct quantization for the DNN (a) and the CNN (b). All the weights are quantized with ternary and 7-level weights. In figure (b), the x-axis label '8-16' indicates the feature map configuration '8-8-16'.
16-16-32, 32-32-64, 64-64-128, 96-96-192, and 128-128-256. The size of the fully connected layer is not changed. In this figure, the floating-point and the retrain-based fixed-point performances also converge very fast as the number of feature maps increases. The floating-point performance saturates when the feature map size is 128-128-256, and the gap is less than 1% when comparing the floating-point and the retrain-based 2-bit networks. However, there is again some performance gap when the number of feature maps is reduced. This suggests that fairly high-performance feature extraction can be designed even with very low-precision weights if the number of feature maps can be increased.
It is well known that increasing the depth usually has positive effects on the performance of a DNN (Yu et al., 2012a). The network complexity of a DNN is changed by increasing or reducing the number of hidden layers or feature map levels. The fixed-point and floating-point performances obtained when varying the number of hidden layers in the FFDNN are summarized in Table 1. The number of units in each hidden layer is 512. This table shows that both the floating-point and the fixed-point performances of the FFDNN improve when adding hidden layers from 0 to 4. The performance gap between the floating-point and the fixed-point networks shrinks as the number of hidden layers increases.
Table 1: Framewise phoneme error rate on TIMIT with respect to the depth in DNN

Number of layers (Floating-point result) | Quantization levels | Direct  | Retraining | Difference
1 (34.67%)                               | 3-level             | 69.88%  | 38.58%     | 3.91%
                                         | 7-level             | 56.81%  | 36.57%     | 1.90%
2 (31.51%)                               | 3-level             | 47.74%  | 33.89%     | 2.38%
                                         | 7-level             | 36.99%  | 33.04%     | 1.53%
3 (30.81%)                               | 3-level             | 49.27%  | 33.05%     | 2.24%
                                         | 7-level             | 36.58%  | 31.72%     | 0.91%
4 (30.31%)                               | 3-level             | 48.13%  | 31.86%     | 1.55%
                                         | 7-level             | 34.77%  | 31.49%     | 1.18%
The network complexity of the CNN is also varied by reducing the number of feature map layers, as shown in Table 2. As expected, the performance of both the floating-point and the retrain-based low-precision networks degrades as the number of layers is reduced. The performance gap between them is very small with 7-level quantization for all feature map depths.
These results for the FFDNN and the CNN with a varied number of layers also show that the effects of quantization can be much reduced by retraining when the network contains some redundant complexity.
Table 2: Misclassification rate on CIFAR-10 with respect to the depth in CNN

Layer (Floating-point result) | Quantization levels | Direct  | Retraining | Difference
64 (34.19%)                   | 3-level             | 72.95%  | 35.37%     | 1.18%
                              | 7-level             | 46.60%  | 34.15%     | -0.04%
32-64 (29.29%)                | 3-level             | 55.30%  | 29.51%     | 0.22%
                              | 7-level             | 39.80%  | 29.32%     | 0.03%
32-32-64 (26.87%)             | 3-level             | 79.88%  | 27.94%     | 1.07%
                              | 7-level             | 47.91%  | 26.95%     | 0.08%
Figure 7: Framewise phone error rate of phoneme recognition DNNs with respect to the total number
of bits for weights with (a) direct quantization and (b) after retraining.
The optimal combination of the bit-width and the layer size can be found when the total number of bits or the accuracy is given, as shown in Figure 7. The figure shows the framewise phoneme error rate on TIMIT with respect to the total number of bits for weights, while varying the layer size of DNNs with the number of quantization bits ranging from 2 to 8. The network has 4 hidden layers of uniform size. With direct quantization, the optimal hardware design can be achieved with about 5 bits. On the other hand, weight representation with only 2 bits shows the best performance after retraining.
Figure 8: Framewise phone error rate with respect to the total number of parameters for the floating-point networks and for 2-bit and 3-bit direct and retrain-based quantization.
Figure 9: Effective compression ratio (ECR) with respect to the layer size and the number of bits per weight for (a) direct quantization and (b) retrain-based quantization.
The remaining question is how much memory space can be saved by quantization while maintaining
the accuracy. To examine this, we introduce a metric called effective compression ratio (ECR),
which is defined as follows:
ECR = Effective uncompressed size / Compressed size.    (6)
The compressed size is the total memory bits required for storing all weights with quantization. The
effective uncompressed size is the total memory size with 32-bit floating point representation when
the network achieves the same accuracy as that of the quantized network.
Figure 8 describes how to obtain the effective number of parameters for uncompressed networks. Specifically, by varying the size, we find the total number of parameters of the floating-point network that shows the same accuracy as the quantized one. After that, the effective uncompressed size can be computed by multiplying the effective number of parameters by 32 bits.
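The ECR bookkeeping of Eq. (6) can be sketched as follows: look up the smallest floating-point network whose accuracy matches that of the quantized one, multiply its parameter count by 32 bits, and divide by the quantized storage size. The accuracy table and network sizes below are placeholders (with higher-is-better accuracy), not the measured results.

```python
def effective_compression_ratio(n_params_quant, bits_per_weight,
                                accuracy_quant, float_accuracy_by_params):
    """Eq. (6): ECR = effective uncompressed size / compressed size.

    float_accuracy_by_params : list of (n_params, accuracy) pairs for
        floating-point networks of increasing size (placeholder data below).
    """
    # effective number of parameters: smallest floating-point network whose
    # accuracy is at least that of the quantized network
    candidates = [n for n, acc in float_accuracy_by_params if acc >= accuracy_quant]
    effective_params = min(candidates) if candidates else max(n for n, _ in float_accuracy_by_params)
    effective_uncompressed_bits = 32 * effective_params      # 32-bit floating point
    compressed_bits = bits_per_weight * n_params_quant
    return effective_uncompressed_bits / compressed_bits

# Placeholder example: a 2-bit network with 4.2M weights matching the accuracy
# of a 1.1M-parameter floating-point network.
table = [(0.3e6, 0.62), (1.1e6, 0.66), (4.2e6, 0.69)]
print(effective_compression_ratio(4.2e6, 2, 0.66, table))   # ~4.19
```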
Once we get the corresponding effective uncompressed size for a specific network size and number of quantization bits, the ECR can be computed by (6). The ECRs for direct and retrain-based quantization with various network sizes and quantization bits are shown in Figure 9. For direct quantization, 5-bit quantization shows the best ECR except for the layer size of 1024. On the other hand, even 2-bit quantization performs better than the others after retraining. That is, after retraining, a bigger network with extreme ternary (2-bit) quantization is more efficient in terms of
the memory usage for weights than any smaller network with more quantization bits when they are compared at the same accuracy.
6 DISCUSSION
In this study, we control the network size by changing the number of units in the hidden layers, the number of feature maps, or the number of layers. In all cases, reduced complexity lowers the resiliency to quantization. We are now conducting similar experiments on recurrent neural networks, which are known to be more sensitive to quantization (Shin et al., 2015). This work seems to be directly related to several network optimization methods, such as pruning, fault tolerance, and decomposition (Yu et al., 2012b; Han et al., 2015; Xue et al., 2013; Rigamonti et al., 2013). In pruning, retraining of weights is conducted after zeroing small-valued weights. The effects of pruning, fault tolerance, and network decomposition would likewise depend on the redundant representation capability of DNNs.
This study can be applied to hardware-efficient DNN design. For design with limited hardware resources, when the size of the reference DNN is relatively small, it is advisable to employ very low-precision arithmetic and, instead, increase the network complexity as much as the hardware capacity allows. However, when the DNNs are in the performance saturation region, this strategy does not always gain much, because growing the 'already-big' network brings almost no performance advantage. This can be observed in Figure 7b and Figure 9b, where 6-bit quantization performed best at the largest layer size (1,024).
7 CONCLUSION
We analyze the performance of fixed-point deep neural networks, an FFDNN for phoneme recognition and a CNN for image classification, while not only changing the arithmetic precision but also varying their network complexity. The low-precision networks for this analysis are obtained with the retrain-based quantization method, and the network complexity is controlled by changing the configurations of the hidden layers or feature maps. The performance gap between the floating-point and the fixed-point neural networks with ternary weights (+1, 0, -1) almost vanishes when the DNNs are in the performance saturation region for the given training data. However, when the complexity of the DNNs is reduced, by lowering the number of units, feature maps, or hidden layers, the performance gap between them increases. In other words, a large network that may contain redundant representation capability for the given training data is not hurt by the lowered precision, but a very compact network is.
ACKNOWLEDGMENTS
This work was supported in part by the Brain Korea 21 Plus Project and the National Re-
search Foundation of Korea (NRF) grants funded by the Korea government (MSIP) (No.
2015R1A2A1A10056051).
REFERENCES
Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convo-
lutional neural networks for object recognition. In Acoustics, Speech and Signal Processing
(ICASSP), 2015 IEEE International Conference on, pp. 1131–1135. IEEE, 2015.
Chen, Chenyi, Seff, Ari, Kornhauser, Alain, and Xiao, Jianxiong. Deepdriving: Learning affordance
for direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.
Corradini, Maria Letizia, Giantomassi, Andrea, Ippoliti, Gianluca, Longhi, Sauro, and Orlando,
Giuseppe. Robust control of robot arms via quasi sliding modes and neural networks. In Advances
and Applications in Sliding Mode Control systems, pp. 79–105. Springer, 2015.
Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Binaryconnect: Training deep neu-
ral networks with binary weights during propagations. arXiv preprint arXiv:1511.00363, 2015.
Fiesler, Emile, Choudry, Amar, and Caulfield, H John. Weight discretization paradigm for optical
neural networks. In The Hague’90, 12-16 April, pp. 164–173. International Society for Optics
and Photonics, 1990.
Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. 2015.
Holt, Jordan L and Baker, Thomas E. Back propagation simulations using limited precision calcula-
tions. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume 2,
pp. 121–126. IEEE, 1991.
Hussain, B Zahir M et al. Short word-length lms filtering. In Signal Processing and Its Applications,
2007. ISSPA 2007. 9th International Symposium on, pp. 1–4. IEEE, 2007.
Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using
weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6.
IEEE, 2014.
Jalab, Hamid A, Omer, Herman, et al. Human computer interface using hand gesture recognition
based on neural network. In Information Technology: Towards New Smart World (NSITNSW),
2015 5th National Symposium on, pp. 1–6. IEEE, 2015.
Kim, Jonghong, Hwang, Kyuyeon, and Sung, Wonyong. X1000 real-time phoneme recognition
VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing
(ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.
Krizhevskey, A. CUDA-convnet, 2014.
Moerland, Perry and Fiesler, Emile. Neural network adaptations to hardware implementations.
Technical report, IDIAP, 1997.
Ovtcharov, Kalin, Ruwase, Olatunji, Kim, Joo-Young, Fowers, Jeremy, Strauss, Karin, and Chung,
Eric S. Accelerating deep convolutional neural networks using specialized hardware. Microsoft
Research Whitepaper, 2, 2015.
Rigamonti, Roberto, Sironi, Amos, Lepetit, Vincent, and Fua, Pascal. Learning separable filters. In
Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2754–2761.
IEEE, 2013.
Sak, Haşim, Senior, Andrew, Rao, Kanishka, and Beaufays, Françoise. Fast and accurate recurrent
neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947, 2015.
Shin, Sungho, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point performance analysis of recurrent
neural networks. arXiv preprint arXiv:1512.01322, 2015.
Sung, Wonyong and Kum, Ki-II. Simulation-based word-length optimization method for fixed-point
digital signal processing systems. Signal Processing, IEEE Transactions on, 43(12):3087–3090,
1995.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Xue, Jian, Li, Jinyu, and Gong, Yifan. Restructuring of deep neural network acoustic models with
singular value decomposition. In INTERSPEECH, pp. 2365–2369, 2013.
Yu, Dong, Deng, Alex Acero, Dahl, George, Seide, Frank, and Li, Gang. More data + deeper
model = better accuracy. In keynote at International Workshop on Statistical Machine Learning
for Speech Processing, 2012a.
Yu, Dong, Seide, Frank, Li, Gang, and Deng, Li. Exploiting sparseness in deep neural networks for
large vocabulary speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012
IEEE International Conference on, pp. 4409–4412. IEEE, 2012b.