Resiliency of Deep Neural Networks Under Quantizations
ABSTRACT
The complexity of deep neural network algorithms for hardware implementation
can be much lowered by optimizing the word-length of weights and signals. Direct
quantization of floating-point weights, however, does not show good performance
when the number of bits assigned is small. Retraining of quantized networks has
been developed to relieve this problem. In this work, the effects of quantization
are analyzed for a feedforward deep neural network (FFDNN) and a convolutional
neural network (CNN) as their network complexity is changed. The complexity of the FFDNN is controlled by varying the unit size in each hidden layer and the number of layers, while that of the CNN is controlled by modifying the feature map
configuration. We find that some performance gap exists between the floating-
point and the retrain-based ternary (+1, 0, -1) weight neural networks when the
size is not large enough, but the discrepancy almost vanishes in fully complex net-
works whose capability is limited by the training data, rather than by the number
of connections. This research shows that highly complex DNNs have the capa-
bility of absorbing the effects of severe weight quantization through retraining,
but connection limited networks are less resilient. This paper also presents the
effective compression ratio to guide the trade-off between the network size and
the precision when the hardware resource is limited.
1 INTRODUCTION
Deep neural networks (DNNs) are beginning to find many real-time applications, such as speech recognition, autonomous driving, gesture recognition, and robotic control (Sak et al., 2015; Chen et al., 2015; Jalab et al., 2015; Corradini et al., 2015). Although most deep neural networks are implemented using GPUs (Graphics Processing Units) these days, their implementation in hardware can give many benefits in terms of power consumption and system size (Ovtcharov et al., 2015). FPGA-based implementation examples of CNNs show more than a 10-times advantage in power consumption (Ovtcharov et al., 2015).
Neural network algorithms employ many multiply and add (MAC) operations that mimic the operations of biological neurons. This suggests that reconfigurable hardware arrays that contain fairly homogeneous hardware blocks, such as MAC units, can give a very efficient solution to real-time neural network system design. Early studies on word-length determination of neural networks reported that a precision of at least 8 bits is needed (Holt & Baker, 1991). Our recent works show that the precision required for implementing FFDNNs, CNNs, or RNNs need not be very high, especially when the quantized networks are trained again to learn the effects of lowered precision. In the fixed-point optimization examples shown in Hwang & Sung (2014); Anwar et al. (2015); Shin et al. (2015), neural networks with ternary weights showed quite good performance that was close to that of floating-point arithmetic.
In this work, we investigate whether retraining can recover the performance of FFDNNs and CNNs under quantization with only ternary (+1, 0, -1) levels or 3 bits (+3, +2, +1, 0, -1, -2, -3) for weight
representation. Note that bias values are not quantized. For this study, the network complexity is changed to analyze its effect on the performance gap between floating-point and retrained low-precision fixed-point deep neural networks.
We conduct our experiments with a feed-forward deep neural network (FFDNN) for phoneme recog-
nition and a convolutional neural network (CNN) for image classification. To control the network
size, not only the number of units in each layer but also the number of hidden layers are varied in the
FFDNN. For the CNN, the number of feature maps for each layer and the number of layers are both
changed. The FFDNN uses the TIMIT corpus and the CNN employs the CIFAR-10 dataset. We
also propose a metric called the effective compression ratio (ECR) for comparing extremely quantized bigger networks with moderately quantized or floating-point networks of smaller size. This analysis intends to provide insight into the knowledge representation capability of highly quantized networks, and also provides a guideline for network size and word-length determination for efficient hardware implementation of DNNs.
2 RELATED WORK
Fixed-point implementation of signal processing algorithms has long been of interest for VLSI-based design of multimedia and communication systems. Some early works used statistical modeling of quantization noise for application to linear digital filters. The simulation-based word-length optimization method utilized simulation tools to evaluate the fixed-point performance of a system, by which non-linear algorithms can be optimized (Sung & Kum, 1995). Ternary (+1, 0, -1) coefficient-based digital filters were used to eliminate multiplications at the cost of higher quantization noise. The implementation of adaptive filters with ternary weights was developed, but it demanded oversampling to remove the quantization effects (Hussain et al., 2007).
Fixed-point neural network design has also been studied with the same purpose of reducing the hardware implementation cost (Moerland & Fiesler, 1997). In Holt & Baker (1991), back propagation simulation with 16-bit integer arithmetic was conducted for several problems, such as NetTalk, Parity, Protein, and so on. This work conducted the experiments while changing the number of hidden units, which was, however, relatively small. The integer simulations showed quite good results for NetTalk and Parity, but not for the Protein benchmark. With direct quantization of trained weights, this work also confirmed satisfactory operation of neural networks with 8-bit precision. An implementation with ternary weights was reported for neural network design with optical fiber networks (Fiesler et al., 1990). In this ternary network design, the authors employed retraining after direct quantization to improve the performance of a shallow network.
Recently, fixed-point design of DNNs has been revisited, and FFDNNs and CNNs with ternary weights show quite good performance that is very close to the floating-point results. The ternary weight based FFDNN and CNN are used for VLSI and FPGA based implementations, by which the algorithms can operate with only on-chip memory, consuming very low power (Kim et al., 2014). Binary weight based deep neural network design has also been studied (Courbariaux et al., 2015). Pruned floating-point weights are also utilized for efficient GPU based implementations, where small-valued weights are forced to zero to reduce the number of arithmetic operations and the memory space for weight storage (Yu et al., 2012b; Han et al., 2015). Network restructuring using the singular value decomposition technique has also been studied (Xue et al., 2013; Rigamonti et al., 2013).
A feedforward deep neural network with multiple hidden layers is depicted in Figure 1. Each layer
k has a signal vector yk , which is propagated to the next layer by multiplying the weight matrix
Wk+1 , adding biases bk+1 , and applying the activation function φk+1 (·) as follows:
y_{k+1} = φ_{k+1}(W_{k+1} y_k + b_{k+1}).    (1)
Figure 1: Structure of the FFDNN, with the input layer, four hidden layers h1-h4, and the output layer.

Figure 2: Structure of the CNN, with the input, convolution and pooling layers C1-S1, C2-S2, and C3-S3, the fully connected layer F1, and the output; the weight groups are In-C1, S1-C2, S2-C3, S3-F1, and F1-Out.
One of the most popular activation functions is the rectified linear unit defined as
Relu(x) = max(0, x). (2)
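As a concrete illustration of Eqs. (1) and (2), the short NumPy sketch below propagates a signal vector through one hidden layer. The layer sizes and random weights are illustrative placeholders, not the trained parameters used in the experiments.

```python
import numpy as np

def relu(x):
    # Eq. (2): rectified linear unit
    return np.maximum(0.0, x)

def forward_layer(y_k, W_k1, b_k1, activation=relu):
    # Eq. (1): y_{k+1} = phi_{k+1}(W_{k+1} y_k + b_{k+1})
    return activation(W_k1 @ y_k + b_k1)

# Illustrative sizes (not the trained network): 1353 inputs, 512 hidden units.
rng = np.random.default_rng(0)
y0 = rng.standard_normal(1353)
W1 = 0.01 * rng.standard_normal((512, 1353))
b1 = np.zeros(512)
y1 = forward_layer(y0, W1, b1)
print(y1.shape)  # (512,)
```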
In this work, an FFDNN for phoneme recognition is used. The reference DNN has four hidden layers. Each of the hidden layers has Nh units; the value of Nh is changed to control the complexity of the network. We conduct experiments with Nh sizes of 32, 64, 128, 256, 512, and 1024. The number of hidden layers is also reduced. The input layer of the network has 1,353 units to accept 11 frames of a Fourier-transform-based filter-bank with 40 coefficients (+energy) distributed on a mel-scale, together with their first and second temporal derivatives. The output layer consists of 61 softmax units which correspond to 61 target phoneme labels. Phoneme recognition experiments were performed on the TIMIT corpus. The standard 462-speaker set with all SA records removed was used for training, and a separate development set of 50 speakers was used for early stopping. Results are reported for the 24-speaker core test set. The network was trained using a backpropagation algorithm with a mini-batch size of 128. The initial learning rate was 10^-5 and it was decreased to 10^-7 during training. Momentum was 0.9 and RMSProp was adopted for weight updates (Tieleman & Hinton, 2012). The dropout technique was employed with a dropout rate of 0.2 in each layer.
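The exact optimizer implementation beyond the listed hyperparameters is not specified here; the following is a minimal sketch of one common way to combine RMSProp scaling (Tieleman & Hinton, 2012) with classical momentum, using the stated learning rate and momentum. The decay constant and epsilon are assumed values.

```python
import numpy as np

def rmsprop_momentum_step(w, grad, cache, velocity,
                          lr=1e-5, momentum=0.9, decay=0.9, eps=1e-8):
    """One weight update combining RMSProp scaling with classical momentum.

    `decay` and `eps` are assumed values; only the learning-rate range
    (1e-5 down to 1e-7) and the momentum of 0.9 are stated in the text.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2   # running mean of squared gradients
    scaled_grad = grad / (np.sqrt(cache) + eps)         # RMSProp-normalized gradient
    velocity = momentum * velocity - lr * scaled_grad   # momentum accumulation
    return w + velocity, cache, velocity
```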
The CNN used is for the CIFAR-10 dataset. It contains a training set of 50,000 and a test set of 10,000 32×32 RGB color images representing airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. We divided the training set into 40,000 images for training and 10,000 images for validation. This CNN has 3 convolution and pooling layers and a fully connected hidden layer with 64 units, and the output has 10 softmax units as shown in Figure 2. We control the number of feature maps in each convolution layer. The reference size has 32-32-64 feature maps with a 5-by-5 kernel size as used in Krizhevskey (2014). We did not perform any preprocessing or data augmentation such as ZCA whitening or global contrast normalization. To examine the effects of network size variation, the number of feature maps is reduced or increased. The configurations of the feature maps used for the experiments are 8-8-16, 16-16-32, 32-32-64, 64-64-128, 96-96-192, and 128-128-256. The number of feature map layers is also changed, resulting in 32-32-64, 32-64,
and 64 map configurations. Note that the fully connected layer in the CNN is not changed. The network was trained using a backpropagation algorithm with a mini-batch size of 128. The initial learning rate was 0.001 and it was decreased to 10^-8 during the training procedure. Momentum was 0.8 and RMSProp was applied for weight updates.
Reducing the word-length of weights brings several advantages for hardware-based implementation of neural networks. First, it lowers the arithmetic precision, and thereby reduces the number of gates needed for multipliers. Second, the size of the memory for storing weights is minimized, which is a big advantage when keeping them on-chip, instead of in external DRAM or NAND flash memory. Note that FFDNNs and recurrent neural networks demand a very large number of weights. Third, the reduced arithmetic precision or minimization of off-chip memory accesses leads to low power consumption. However, we need to consider the quantization effects that degrade the system performance.
Direct quantization converts a floating-point value to the closest integer number, which is conven-
tionally used in signal processing system design. However, direct quantization usually demands
more than 8 bits, and does not show good performance when the number of bits is small. In fixed-
point deep neural network design, retraining of quantized weights shows quite good performance.
The fixed-point DNN algorithm design consists of three steps: floating-point training, direct quan-
tization, and retraining of weights. The floating-point training procedure can be any of the state of
the art techniques, which may include unsupervised learning and dropout. Note that fixed-point op-
timization needs to be based on the best performing floating-point weights. Thus, the floating-point
weight optimization may need to be conducted several times with different initializations, and this
step consumes most of the time. After the floating-point training, direct quantization follows.
For direct quantization, a uniform quantization function is employed and the function Q(·) is defined as follows:
Q(w) = sgn(w) · ∆ · min(⌊|w|/∆ + 0.5⌋, (M − 1)/2),    (3)
where sgn(·) is a sign function, ∆ is a quantization step size, and M represents the number of
quantization levels. Note that M needs to be an odd number since the weight values can be posi-
tive or negative. When M is 7, the weights are represented by -3·∆, -2·∆, -1·∆, 0, +1·∆, +2·∆,
+3·∆, which can be represented in 3 bits.
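The quantizer of Eq. (3) can be transcribed directly into NumPy; the sketch below is illustrative, with `delta` and `M` standing for the step size ∆ and the number of quantization levels (M = 3 for ternary and M = 7 for 3-bit weights).

```python
import numpy as np

def quantize(w, delta, M):
    """Uniform symmetric quantizer of Eq. (3).

    w     : array of floating-point weights
    delta : quantization step size
    M     : odd number of quantization levels (3 for ternary, 7 for 3-bit)
    """
    max_level = (M - 1) // 2
    # round-to-nearest on the magnitude, then clip to the largest level
    levels = np.minimum(np.floor(np.abs(w) / delta + 0.5), max_level)
    return np.sign(w) * delta * levels

# Example: ternary quantization with step size 0.5
w = np.array([-1.3, -0.2, 0.1, 0.4, 0.9])
print(quantize(w, delta=0.5, M=3))   # all values map onto {-0.5, 0.0, +0.5}
```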
The quantization step size ∆ is determined to minimize the L2 error, E, defined as follows:

E = (1/2) Σ_{i=1}^{N} (Q(w_i) − w_i)²,    (4)
where N is the number of weights in each weight group, and w_i is the i-th weight value represented in floating-point. This process needs some iterations, but does not take much time.
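The iteration used to find ∆ is not detailed in the text; one plausible sketch alternates between assigning integer levels with the current ∆ and refitting ∆ in closed form by least squares against those levels. Both the alternating scheme and the initial guess are assumptions.

```python
import numpy as np

def find_step_size(w, M, n_iter=20):
    """Search a step size `delta` that (locally) minimizes Eq. (4).

    Alternates between (a) assigning each weight to a signed integer level with
    the current delta and (b) refitting delta by least squares against those
    levels.  The alternating scheme and the initialization are assumptions; the
    text only states that the search needs some iterations.
    """
    max_level = (M - 1) // 2
    delta = 2.0 * np.std(w) / max_level          # assumed initial guess
    for _ in range(n_iter):
        levels = np.sign(w) * np.minimum(np.floor(np.abs(w) / delta + 0.5), max_level)
        if np.sum(levels ** 2) == 0:
            break
        # least-squares delta for fixed integer levels: argmin_d sum (d*q_i - w_i)^2
        delta = np.dot(levels, w) / np.sum(levels ** 2)
    return delta

# Example: fit a ternary step size to Gaussian weights
rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal(10000)
print(find_step_size(w, M=3))
```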
For network retraining, we maintain both floating-point and quantized weights because the amount
of weight updates in each training step is much smaller than the quantization step size ∆. The
forward and backward propagation is conducted using quantized weights, but the weight update is
applied to the floating-point weights and newly quantized values are generated at each iteration.
This retraining procedure usually converges quickly and does not take much time when compared
to the floating-point training.
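A minimal sketch of this retraining step is given below: the forward and backward passes use the quantized weights, while the update is accumulated in a separate floating-point copy that is re-quantized at every iteration. The `compute_gradient` callable is a placeholder for the usual backpropagation pass, and the learning rate is illustrative.

```python
import numpy as np

def retrain_step(w_float, delta, M, compute_gradient, lr=1e-5):
    """One retraining iteration with dual (floating-point / quantized) weights.

    `compute_gradient` is a placeholder for backpropagation; it must evaluate
    the gradient at the *quantized* weights.
    """
    max_level = (M - 1) // 2
    # forward/backward propagation uses the quantized weights
    w_quant = np.sign(w_float) * delta * np.minimum(
        np.floor(np.abs(w_float) / delta + 0.5), max_level)
    grad = compute_gradient(w_quant)
    # ...but the update is applied to the high-precision copy, because each
    # step is much smaller than the quantization step size delta
    w_float = w_float - lr * grad
    return w_float, w_quant
```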
The performance of the FFDNN and the CNN with directly quantized weights is analyzed while
varying the number of units in each layer or the number of feature maps, respectively. In this
analysis, the quantization is performed on each weight group, which is illustrated in Figure 1 and
Figure 2, to assess the sensitivity to word-length reduction. In this subsection, we try to analyze the
effects of direct quantization.
The quantized weight can be represented as follows:

w_i^q = w_i + w_i^d,    (5)

where w_i^d is the distortion of each weight due to quantization. In the direct quantization, we can assume that the distortions w_i^d are independent of each other.
Figure 3: Computation model for a unit in the hidden layer j ((a): floating-point, (b): distortion).
Figure 4: Sensitivity analysis of direct quantization ((a): FFDNN, (b): CNN). In figure (b), the x-axis label '8-16' indicates the feature map configuration '8-8-16'.
Consider the computation procedure for a unit in a hidden layer: the signals from the previous layer are multiplied by the weights and summed up, as illustrated in Figure 3a. We can also assemble a model for the distortion, which is shown in Figure 3b. In the distortion model, since the w_i^d are independent of each other, we can assume that the effect of the summed distortion is reduced according to random process theory. This analysis means that the quantization effects are reduced when the number of units in the preceding layer increases, but slowly.
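The random-process step above can be checked with a small numerical sketch: when the distortions w_i^d are independent and zero-mean, the standard deviation of their summed effect grows only like the square root of the fan-in, far below the worst-case linear bound. The step size and fan-in values below are illustrative only, not taken from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.05                                     # illustrative quantization step size
for fan_in in (32, 128, 512, 2048):
    # independent, zero-mean distortions, one per incoming weight, over many trials
    w_d = rng.uniform(-delta / 2, delta / 2, size=(10000, fan_in))
    x = np.ones(fan_in)                          # unit inputs isolate the distortion sum
    summed = w_d @ x                             # summed distortion per trial
    # empirical std grows ~sqrt(fan_in); the worst-case bound grows linearly
    print(fan_in, round(summed.std(), 3), fan_in * delta / 2)
```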
Figure 4a illustrates the performance of the FFDNN with floating-point arithmetic, 2-bit direct quantization of all the weights, and 2-bit direct quantization only on the weight groups 'In-h1', 'h1-h2', and 'h4-out'. Considering the quantization performance of the 'In-h1' layer, the phone error rate is higher than the floating-point result by an almost constant amount, about 10%. Note that the number of inputs to the 'In-h1' layer is fixed at 1,353 regardless of the hidden unit size. Thus, the amount of distortion delivered to each unit of hidden layer 1 can be considered unchanged. Figure 4a also shows the quantization performance on the 'h1-h2' and 'h4-out' layers, which shows the trend of a
Figure 5: Performance of direct quantization with multiple precision ((a): FFDNN, (b): CNN).
reduced gap to the floating-point performance as the network size increases. This can be explained by the summation of an increased number of independent distortions as the network size grows. The performance with 2-bit quantization of all weights also shows a similar trend of a reduced gap to the floating-point performance. But, apparently, the performance of 2-bit directly quantized networks is not satisfactory.
In Figure 4b, a similar analysis is conducted for the CNN with direct quantization when the number of feature maps increases or decreases. In the CNN, the number of inputs to each output unit is determined by the number of input feature maps and the kernel size. For example, at the first layer C1, the number of input signals for computing one output is only 75 (= 3×25) regardless of the network size, since the number of input maps is always 3 and the kernel size is 25. However, at the second layer C2, the number of input feature maps increases as the network size grows. When the feature map configuration 32-32-64 is considered, the number of inputs for the C2 layer grows to 800 (= 32×25). Thus, we can expect a reduced distortion as the number of feature maps increases.
Figure 5a shows the performance of direct quantization with 2-, 4-, 6-, and 8-bit precision as the network complexity varies. In the FFDNN, 6-bit direct quantization seems sufficient when the network size is larger than 128. However, small FFDNNs demand 8 bits for near floating-point performance. The CNN in Figure 5b shows a similar trend. Direct quantization requires about 6 bits when the feature map configuration is 16-16-32 or larger.
Retraining is conducted on the directly quantized networks using the same data as for the floating-point training. The fixed-point performance of the FFDNN is shown in Figure 6a when the number of hidden units in each layer varies. The performances of direct 2-bit (ternary), direct 3-bit (7-level), retrain-based 2-bit, and retrain-based 3-bit quantization are compared with the floating-point simulation. We can find that the performance gap between the floating-point and the retrain-based fixed-point networks closes very fast as the network size grows. Although the gap between the direct-quantization and the floating-point networks also closes, the rate of convergence is significantly different. In this figure, the performance of the floating-point network almost saturates when the network size is about 1024. Note that the TIMIT corpus used for training contains only 3 hours of data. Thus, the network with 1024 hidden units can be considered to be in the 'training-data limited region'. The gap between the floating-point and fixed-point networks almost vanishes when the network is in this region. However, when the network size is limited, such as 32, 64, 128, or 256, there is some performance gap between the floating-point and highly quantized networks even if retraining of the quantized networks is performed.
Similar experiments are conducted for the CNN with varying feature map sizes, and the results are shown in Figure 6b. The configurations of the feature maps used for the experiments are 8-8-16,
Figure 6: Comparison of retrain-based and direct quantization for the DNN (a) and the CNN (b). All the weights are quantized with ternary and 7-level weights. In figure (b), the x-axis label '8-16' indicates the feature map configuration '8-8-16'.
16-16-32, 32-32-64, 64-64-128, 96-96-192, and 128-128-256. The size of the fully connected layer is not changed. In this figure, the floating-point and the retrain-based fixed-point performances also converge very fast as the number of feature maps increases. The floating-point performance saturates when the feature map size is 128-128-256, and the gap is less than 1% when comparing the floating-point and the retrain-based 2-bit networks. However, there is again some performance gap when the number of feature maps is reduced. This suggests that fairly high-performance feature extraction can be designed even with very low-precision weights if the number of feature maps can be increased.
It is well known that increasing the depth usually has positive effects on the performance of a DNN (Yu et al., 2012a). The network complexity of a DNN is changed by increasing or reducing the number of hidden layers or feature map levels. The fixed-point and floating-point performances obtained when varying the number of hidden layers in the FFDNN are summarized in Table 1. The number of units in each hidden layer is 512. This table shows that both the floating-point and the fixed-point performances of the FFDNN improve when adding hidden layers from 0 to 4. The performance gap between the floating-point and the fixed-point networks shrinks as the number of hidden layers increases.
Table 1: Framewise phoneme error rate on TIMIT with respect to the depth in DNN

Number of layers (Floating-point result) | Quantization levels | Direct  | Retraining | Difference
1 (34.67%)                               | 3-level             | 69.88%  | 38.58%     | 3.91%
                                         | 7-level             | 56.81%  | 36.57%     | 1.90%
2 (31.51%)                               | 3-level             | 47.74%  | 33.89%     | 2.38%
                                         | 7-level             | 36.99%  | 33.04%     | 1.53%
3 (30.81%)                               | 3-level             | 49.27%  | 33.05%     | 2.24%
                                         | 7-level             | 36.58%  | 31.72%     | 0.91%
4 (30.31%)                               | 3-level             | 48.13%  | 31.86%     | 1.55%
                                         | 7-level             | 34.77%  | 31.49%     | 1.18%
The network complexity of the CNN is also varied by reducing the number of feature map layers, as shown in Table 2. As expected, the performance of both the floating-point and the retrain-based low-precision networks degrades as the number of layers is reduced. The performance gap between them is very small with 7-level quantization for all feature map depths.
These results for the FFDNN and the CNN with a varied number of layers also show that the effects of quantization can be much reduced by retraining when the network contains some redundant complexity.
Table 2: Misclassification rate on CIFAR-10 with respect to the depth in CNN

Layer (Floating-point result) | Quantization levels | Direct  | Retraining | Difference
64 (34.19%)                   | 3-level             | 72.95%  | 35.37%     | 1.18%
                              | 7-level             | 46.60%  | 34.15%     | -0.04%
32-64 (29.29%)                | 3-level             | 55.30%  | 29.51%     | 0.22%
                              | 7-level             | 39.80%  | 29.32%     | 0.03%
32-32-64 (26.87%)             | 3-level             | 79.88%  | 27.94%     | 1.07%
                              | 7-level             | 47.91%  | 26.95%     | 0.08%
Figure 7: Framewise phone error rate of phoneme recognition DNNs with respect to the total number
of bits for weights with (a) direct quantization and (b) after retraining.
The optimal combination of the bit-width and the layer size can be found when the total number of bits or the accuracy is given, as shown in Figure 7. The figure shows the framewise phoneme error rate on TIMIT with respect to the total number of bits for weights, while varying the layer size of DNNs with the number of quantization bits ranging from 2 to 8. The network has 4 hidden layers of uniform size. With direct quantization, the optimal hardware design can be achieved with about 5 bits. On the other hand, weight representation with only 2 bits shows the best performance after retraining.
Figure 8: Framewise phone error rate with respect to the total number of parameters for the floating-point networks and for 2-bit and 3-bit direct and retrain-based quantization.
Figure 9: Effective compression ratio (ECR) with respect to the layer size and the number of bits per weight for (a) direct quantization and (b) retrain-based quantization.
The remaining question is how much memory space can be saved by quantization while maintaining
the accuracy. To examine this, we introduce a metric called effective compression ratio (ECR),
which is defined as follows:
ECR = Effective uncompressed size / Compressed size.    (6)
The compressed size is the total memory bits required for storing all weights with quantization. The
effective uncompressed size is the total memory size with 32-bit floating point representation when
the network achieves the same accuracy as that of the quantized network.
Figure 8 describes how to obtain the effective number of parameters for uncompressed networks. Specifically, by varying the size, we find the total number of parameters of the floating-point network that shows the same accuracy as the quantized one. After that, the effective uncompressed size can be computed by multiplying the effective number of parameters by 32 bits.
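The ECR bookkeeping of Eq. (6) can be sketched as follows: look up the smallest floating-point network whose accuracy matches that of the quantized one, multiply its parameter count by 32 bits, and divide by the quantized storage size. The accuracy table and network sizes below are placeholders (with higher-is-better accuracy), not the measured results.

```python
def effective_compression_ratio(n_params_quant, bits_per_weight,
                                accuracy_quant, float_accuracy_by_params):
    """Eq. (6): ECR = effective uncompressed size / compressed size.

    float_accuracy_by_params : list of (n_params, accuracy) pairs for
        floating-point networks of increasing size (placeholder data below).
    """
    # effective number of parameters: smallest floating-point network whose
    # accuracy is at least that of the quantized network
    candidates = [n for n, acc in float_accuracy_by_params if acc >= accuracy_quant]
    effective_params = min(candidates) if candidates else max(n for n, _ in float_accuracy_by_params)
    effective_uncompressed_bits = 32 * effective_params      # 32-bit floating point
    compressed_bits = bits_per_weight * n_params_quant
    return effective_uncompressed_bits / compressed_bits

# Placeholder example: a 2-bit network with 4.2M weights matching the accuracy
# of a 1.1M-parameter floating-point network.
table = [(0.3e6, 0.62), (1.1e6, 0.66), (4.2e6, 0.69)]
print(effective_compression_ratio(4.2e6, 2, 0.66, table))   # ~4.19
```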
Once we get the corresponding effective uncompressed size for a specific network size and number of quantization bits, the ECR can be computed by (6). The ECRs for direct and retrain-based quantization with various network sizes and quantization bits are shown in Figure 9. For direct quantization, 5-bit quantization shows the best ECR except for the layer size of 1024. On the other hand, even 2-bit quantization performs better than the others after retraining. That is, after retraining, a bigger network with extreme ternary (2-bit) quantization is more efficient in terms of
the memory usage for weights than any smaller network with more quantization bits when they are compared at the same accuracy.
6 DISCUSSION
In this study, we control the network size by changing the number of units in the hidden layers, the number of feature maps, or the number of layers. In all cases, reduced complexity lowers the resiliency to quantization. We are now conducting similar experiments on recurrent neural networks, which are known to be more sensitive to quantization (Shin et al., 2015). This work seems to be directly related to several network optimization methods, such as pruning, fault tolerance, and decomposition (Yu et al., 2012b; Han et al., 2015; Xue et al., 2013; Rigamonti et al., 2013). In pruning, retraining of weights is conducted after zeroing small-valued weights. The effects of pruning, fault tolerance, and network decomposition would likewise depend on the redundant representation capability of DNNs.
This study can be applied to hardware-efficient DNN design. For design with limited hardware resources, when the size of the reference DNN is relatively small, it is advisable to employ very low-precision arithmetic and, instead, increase the network complexity as much as the hardware capacity allows. However, when the DNNs are in the performance saturation region, this strategy does not always gain much, because growing the 'already-big' network brings almost no performance advantage. This can be observed in Figure 7b and Figure 9b, where 6-bit quantization performed best at the largest layer size (1,024).
7 CONCLUSION
We analyze the performance of fixed-point deep neural networks, an FFDNN for phoneme recognition and a CNN for image classification, while not only changing the arithmetic precision but also varying their network complexity. The low-precision networks for this analysis are obtained with the retrain-based quantization method, and the network complexity is controlled by changing the configurations of the hidden layers or feature maps. The performance gap between the floating-point and the fixed-point neural networks with ternary weights (+1, 0, -1) almost vanishes when the DNNs are in the performance saturation region for the given training data. However, when the complexity of the DNNs is reduced, by lowering the number of units, feature maps, or hidden layers, the performance gap between them increases. In other words, a large network that may contain redundant representation capability for the given training data is not hurt by the lowered precision, but a very compact network is.
ACKNOWLEDGMENTS
This work was supported in part by the Brain Korea 21 Plus Project and the National Re-
search Foundation of Korea (NRF) grants funded by the Korea government (MSIP) (No.
2015R1A2A1A10056051).
REFERENCES
Anwar, Sajid, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point optimization of deep convo-
lutional neural networks for object recognition. In Acoustics, Speech and Signal Processing
(ICASSP), 2015 IEEE International Conference on, pp. 1131–1135. IEEE, 2015.
Chen, Chenyi, Seff, Ari, Kornhauser, Alain, and Xiao, Jianxiong. Deepdriving: Learning affordance
for direct perception in autonomous driving. arXiv preprint arXiv:1505.00256, 2015.
Corradini, Maria Letizia, Giantomassi, Andrea, Ippoliti, Gianluca, Longhi, Sauro, and Orlando,
Giuseppe. Robust control of robot arms via quasi sliding modes and neural networks. In Advances
and Applications in Sliding Mode Control systems, pp. 79–105. Springer, 2015.
Courbariaux, Matthieu, Bengio, Yoshua, and David, Jean-Pierre. Binaryconnect: Training deep neu-
ral networks with binary weights during propagations. arXiv preprint arXiv:1511.00363, 2015.
Fiesler, Emile, Choudry, Amar, and Caulfield, H John. Weight discretization paradigm for optical
neural networks. In The Hague’90, 12-16 April, pp. 164–173. International Society for Optics
and Photonics, 1990.
Han, Song, Mao, Huizi, and Dally, William J. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. 2015.
Holt, Jordan L and Baker, Thomas E. Back propagation simulations using limited precision calcula-
tions. In Neural Networks, 1991., IJCNN-91-Seattle International Joint Conference on, volume 2,
pp. 121–126. IEEE, 1991.
Hussain, B Zahir M et al. Short word-length lms filtering. In Signal Processing and Its Applications,
2007. ISSPA 2007. 9th International Symposium on, pp. 1–4. IEEE, 2007.
Hwang, Kyuyeon and Sung, Wonyong. Fixed-point feedforward deep neural network design using
weights +1, 0, and -1. In Signal Processing Systems (SiPS), 2014 IEEE Workshop on, pp. 1–6.
IEEE, 2014.
Jalab, Hamid A, Omer, Herman, et al. Human computer interface using hand gesture recognition
based on neural network. In Information Technology: Towards New Smart World (NSITNSW),
2015 5th National Symposium on, pp. 1–6. IEEE, 2015.
Kim, Jonghong, Hwang, Kyuyeon, and Sung, Wonyong. X1000 real-time phoneme recognition
VLSI using feed-forward deep neural networks. In Acoustics, Speech and Signal Processing
(ICASSP), 2014 IEEE International Conference on, pp. 7510–7514. IEEE, 2014.
Krizhevskey, A. CUDA-convnet, 2014.
Moerland, Perry and Fiesler, Emile. Neural network adaptations to hardware implementations.
Technical report, IDIAP, 1997.
Ovtcharov, Kalin, Ruwase, Olatunji, Kim, Joo-Young, Fowers, Jeremy, Strauss, Karin, and Chung,
Eric S. Accelerating deep convolutional neural networks using specialized hardware. Microsoft
Research Whitepaper, 2, 2015.
Rigamonti, Roberto, Sironi, Amos, Lepetit, Vincent, and Fua, Pascal. Learning separable filters. In
Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 2754–2761.
IEEE, 2013.
Sak, Haşim, Senior, Andrew, Rao, Kanishka, and Beaufays, Françoise. Fast and accurate recurrent
neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947, 2015.
Shin, Sungho, Hwang, Kyuyeon, and Sung, Wonyong. Fixed point performance analysis of recurrent
neural networks. arXiv preprint arXiv:1512.01322, 2015.
Sung, Wonyong and Kum, Ki-II. Simulation-based word-length optimization method for fixed-point
digital signal processing systems. Signal Processing, IEEE Transactions on, 43(12):3087–3090,
1995.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
Xue, Jian, Li, Jinyu, and Gong, Yifan. Restructuring of deep neural network acoustic models with
singular value decomposition. In INTERSPEECH, pp. 2365–2369, 2013.
Yu, Dong, Deng, Alex Acero, Dahl, George, Seide, Frank, and Li, Gang. More data + deeper
model = better accuracy. In keynote at International Workshop on Statistical Machine Learning
for Speech Processing, 2012a.
Yu, Dong, Seide, Frank, Li, Gang, and Deng, Li. Exploiting sparseness in deep neural networks for
large vocabulary speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2012
IEEE International Conference on, pp. 4409–4412. IEEE, 2012b.