0% found this document useful (0 votes)

10 views9 pages

Do Deep Nets Really Need To Be Deep

This paper investigates whether shallow neural networks can achieve performance comparable to deep neural networks on tasks like speech recognition and image classification. The authors provide empirical evidence that shallow networks can learn complex functions previously attributed to deep networks, often using the same number of parameters. Through model compression techniques, they demonstrate that shallow networks can mimic deep networks effectively, challenging the necessity of deep architectures for certain tasks.

Uploaded by

mostafa.shamsi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views9 pages

Do Deep Nets Really Need To Be Deep

Uploaded by

mostafa.shamsi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 9

Do Deep Nets Really Need to be Deep?

Lei Jimmy Ba Rich Caruana

University of Toronto Microsoft Research
[email protected] [email protected]

Abstract
Currently, deep neural networks are the state of the art on problems such as speech
recognition and computer vision. In this paper we empirically demonstrate that
shallow feed-forward nets can learn the complex functions previously learned by
deep nets and achieve accuracies previously only achievable with deep models.
Moreover, in some cases the shallow nets can learn these deep functions using the
same number of parameters as the original deep models. On the TIMIT phoneme
recognition and CIFAR-10 image recognition tasks, shallow nets can be trained
that perform similarly to complex, well-engineered, deeper convolutional models.

1 Introduction
You are given a training set with 1M labeled points. When you train a shallow neural net with one
fully connected feed-forward hidden layer on this data you obtain 86% accuracy on test data. When
you train a deeper neural net as in [1] consisting of a convolutional layer, pooling layer, and three
fully connected feed-forward layers on the same data you obtain 91% accuracy on the same test set.
What is the source of this improvement? Is the 5% increase in accuracy of the deep net over the
shallow net because: a) the deep net has more parameters; b) the deep net can learn more complex
functions given the same number of parameters; c) the deep net has better inductive bias and thus
learns more interesting/useful functions (e.g., because the deep net is deeper it learns hierarchical
representations [5]); d) nets without convolution can’t easily learn what nets with convolution can
learn; e) current learning algorithms and regularization methods work better with deep architectures
than shallow architectures[8]; f) all or some of the above; g) none of the above?
There have been attempts to answer this question. It has been shown that deep nets coupled with
unsupervised layer-by-layer pre-training [10] [19] work well. In [8], the authors show that depth
combined with pre-training provides a good prior for model weights, thus improving generalization.
There is well-known early theoretical work on the representational capacity of neural nets. For
example, it was proved that a network with a large enough single hidden layer of sigmoid units can
approximate any decision boundary [4]. Empirical work, however, shows that it is difficult to train
shallow nets to be as accurate as deep nets. For vision tasks, a recent study on deep convolutional
nets suggests that deeper models are preferred under a parameter budget [7]. In [5], the authors
trained shallow nets on SIFT features to classify a large-scale ImageNet dataset and found that it
was difficult to train large, high-accuracy, shallow nets. And in [17], the authors show that deeper
models are more accurate than shallow models in speech acoustic modeling.
In this paper we provide empirical evidence that shallow nets are capable of learning the same
function as deep nets, and in some cases with the same number of parameters as the deep nets. We
do this by first training a state-of-the-art deep model, and then training a shallow model to mimic the
deep model. The mimic model is trained using the model compression method described in the next
section. Remarkably, with model compression we are able to train shallow nets to be as accurate
as some deep models, even though we are not able to train these shallow nets to be as accurate as
the deep nets when the shallow nets are trained directly on the original labeled training data. If a
shallow net with the same number of parameters as a deep net can learn to mimic a deep net with
high fidelity, then it is clear that the function learned by that deep net does not really have to be deep.

1
2 Training Shallow Nets to Mimic Deep Nets
2.1 Model Compression

The main idea behind model compression [3] is to train a compact model to approximate the func-
tion learned by a larger, more complex model. For example, in [3], a single neural net of modest
size could be trained to mimic a much larger ensemble of models—although the small neural nets
contained 1000 times fewer parameters, often they were just as accurate as the ensembles they were
trained to mimic. Model compression works by passing unlabeled data through the large, accurate
model to collect the scores produced by that model. This synthetically labeled data is then used to
train the smaller mimic model. The mimic model is not trained on the original labels—it is trained
to learn the function that was learned by the larger model. If the compressed model learns to mimic
the large model perfectly it makes exactly the same predictions and mistakes as the complex model.
Surprisingly, often it is not (yet) possible to train a small neural net on the original training data to be
as accurate as the complex model, nor as accurate as the mimic model. Compression demonstrates
that a small neural net could, in principle, learn the more accurate function, but current learning
algorithms are unable to train a model with that accuracy from the original training data; instead, we
must train the complex intermediate model first and then train the neural net to mimic it. Clearly,
when it is possible to mimic the function learned by a complex model with a small net, the function
learned by the complex model wasn’t truly too complex to be learned by a small net. This suggests
to us that the complexity of a learned model, and the size and architecture of the representation best
used to learn that model, are different things.

2.2 Mimic Learning via Regressing Logits with L2 Loss

On both TIMIT and CIFAR-10 we use model compression to train shallow mimic nets using data
labeled by either a deep net, or an ensemble of deep nets, trained on the original TIMIT or CIFAR-10
training data. The deep models are trained in the usual way using softmax output and cross-entropy
cost function. The shallow mimic models, however, instead of being trained with cross-entropy on
the 183 p values where pk = ezk / j ezj output by the softmax layer from the deep model, are
P
trained directly on the 183 log probability values z, also called logits, before the softmax activation.
Training on logits, which are logarithms of predicted probabilities, makes learning easier for the
student model by placing equal emphasis on the relationships learned by the teacher model across
all of the targets. For example, if the teacher predicts three targets with probability [2 × 10−9 , 4 ×
10−5 , 0.9999] and those probabilities are used as prediction targets and cross entropy is minimized,
the student will focus on the third target and tend to ignore the first and second targets. A student,
however, trained on the logits for these targets, [10, 20, 30], will better learn to mimic the detailed
behaviour of the teacher model. Moreover, consider a second training case where the teacher predicts
logits [−10, 0, 10]. After softmax, these logits yield the same predicted probabilities as [10, 20, 30],
yet clearly the teacher models the two cases very differently. By training the student model directly
on the logits, the student is better able to learn the internal model learned by the teacher, without
suffering from the information loss that occurs from passing through logits to probability space.
We formulate the SNN-MIMIC learning objective function as a regression problem given training
data {(x(1) , z (1) ),...,(x(T ) , z (T ) ) }:

1 X
L(W, β) = ||g(x(t) ; W, β) − z (t) ||22 , (1)
2T t

where W is the weight matrix between input features x and hidden layer, β is the weights from
hidden to output units, g(x(t) ; W, β) = βf (W x(t) ) is the model prediction on the tth training data
point and f (·) is the non-linear activation of the hidden units. The parameters W and β are updated
using standard error back-propagation algorithm and stochastic gradient descent with momentum.
We have also experimented with other mimic loss functions, such as minimizing the KL divergence
KL(pteacher kpstudent ) cost function and L2 loss on probabilities. Regression on logits outperforms all
the other loss functions and is one of the key techniques for obtaining the results in the rest of this

2
paper. We found that normalizing the logits from the teacher model by subtracting the mean and
dividing the standard deviation of each target across the training set can improve L2 loss slightly
during training, but normalization is not crucial for obtaining good student mimic models.

2.3 Speeding-up Mimic Learning by Introducing a Linear Layer

To match the number of parameters in a deep net, a shallow net has to have more non-linear hidden
units in a single layer to produce a large weight matrix W . When training a large shallow neural
network with many hidden units, we find it is very slow to learn the large number of parameters in the
weight matrix between input and hidden layers of size O(HD), where D is input feature dimension
and H is the number of hidden units. Because there are many highly correlated parameters in this
large weight matrix, gradient descent converges slowly. We also notice that during learning, shallow
nets spend most of the computation in the costly matrix multiplication of the input data vectors and
large weight matrix. The shallow nets eventually learn accurate mimic functions, but training to
convergence is very slow (multiple weeks) even with a GPU.
We found that introducing a bottleneck linear layer with k linear hidden units between the input
and the non-linear hidden layer sped up learning dramatically: we can factorize the weight matrix
W ∈ RH×D into the product of two low-rank matrices, U ∈ RH×k and V ∈ Rk×D , where
k << D, H. The new cost function can be written as:
1 X
L(U, V, β) = ||βf (U V x(t) ) − z (t) ||22 (2)
2T t

The weights U and V can be learnt by back-propagating through the linear layer. This re-
parameterization of weight matrix W not only increases the convergence rate of the shallow mimic
nets, but also reduces memory space from O(HD) to O(k(H + D)).
Factorizing weight matrices has been previously explored in [16] and [20]. While these prior works
focus on using matrix factorization in the last output layer, our method is applied between the input
and hidden layer to improve the convergence speed during training.
The reduced memory usage enables us to train large shallow models that were previously infeasible
due to excessive memory usage. Note that the linear bottleneck can only reduce the representational
power of the network, and it can always be absorbed into a single weight matrix W .

3 TIMIT Phoneme Recognition

The TIMIT speech corpus has 462 speakers in the training set, a separate development set for cross-
validation that includes 50 speakers, and a final test set with 24 speakers. The raw waveform au-
dio data were pre-processed using 25ms Hamming window shifting by 10ms to extract Fourier-
transform-based filter-banks with 40 coefficients (plus energy) distributed on a mel-scale, together
with their first and second temporal derivatives. We included +/- 7 nearby frames to formulate the
final 1845 dimension input vector. The data input features were normalized by subtracting the mean
and dividing by the standard deviation on each dimension. All 61 phoneme labels are represented
in tri-state, i.e., three states for each of the 61 phonemes, yielding target label vectors with 183
dimensions for training. At decoding time these are mapped to 39 classes as in [13] for scoring.

3.1 Deep Learning on TIMIT

Deep learning was first successfully applied to speech recognition in [14]. Following their frame-
work, we train two deep models on TIMIT, DNN and CNN. DNN is a deep neural net consisting
of three fully connected feedforward hidden layers consisting of 2000 rectified linear units (ReLU)
[15] per layer. CNN is a deep neural net consisting of a convolutional layer and max-pooling layer
followed by three hidden layers containing 2000 ReLU units [2]. The CNN was trained using the
same convolutional architecture as in [6]. We also formed an ensemble of nine CNN models, ECNN.
The accuracy of DNN, CNN, and ECNN on the final test set are shown in Table 1. The error rate
of the convolutional deep net (CNN) is about 2.1% better than the deep net (DNN). The table also
shows the accuracy of shallow neural nets with 8000, 50,000, and 400,000 hidden units (SNN-8k,

3
SNN-50k, and SNN-400k) trained on the original training data. Despite having up to 10X as many
parameters as DNN, CNN, and ECNN, the shallow models are 1.4% to 2% less accurate than the
DNN, 3.5% to 4.1% less accurate than the CNN, and 4.5% to 5.1% less accurate than the ECNN.

3.2 Learning to Mimic an Ensemble of Deep Convolutional TIMIT Models

The most accurate single model that we trained on TIMIT is the deep convolutional architecture in
[6]. Because we have no unlabeled data from the TIMIT distribution, we use the same 1.1M points
in the train set as unlabeled data for compression by throwing away the labels.1 Re-using the 1.1M
train set reduces the accuracy of the student mimic models, increasing the gap between the teacher
and mimic models on test data: model compression works best when the unlabeled set is very large,
and when the unlabeled samples do not fall on train points where the teacher model is likely to have
overfit. To reduce the impact of the gap caused by performing compression with the original train
set, we train the student model to mimic a more accurate ensemble of deep convolutional models.
We are able to train a more accurate model on TIMIT by forming an ensemble of nine deep, con-
volutional neural nets, each trained with somewhat different train sets, and with architectures of
different kernel sizes in the convolutional layers. We used this very accurate model, ECNN, as the
teacher model to label the data used to train the shallow mimic nets. As described in Section 2.2
the logits (log probability of the predicted values) from each CNN in the ECNN model are averaged
and the average logits are used as final regression targets to train the mimic SNNs.
We trained shallow mimic nets with 8k (SNN-MIMIC-8k) and 400k (SNN-MIMIC-400k) hidden
units on the re-labeled 1.1M training points. As described in Section 2.3, to speed up learning both
mimic models have 250 linear units between the input and non-linear hidden layer—preliminary
experiments suggest that for TIMIT there is little benefit from using more than 250 linear units.

3.3 Compression Results For TIMIT

Architecture # Param. # Hidden units PER

8k + dropout
SNN-8k ∼12M ∼8k 23.1%
trained on original data
50k + dropout
SNN-50k ∼100M ∼50k 23.0%
trained on original data
250L-400k + dropout
SNN-400k ∼180M ∼400k 23.6%
trained on original data
2k-2k-2k + dropout
DNN ∼12M ∼6k 21.9%
trained on original data
c-p-2k-2k-2k + dropout
CNN ∼13M ∼10k 19.5%
trained on original data
ECNN ensemble of 9 CNNs ∼125M ∼90k 18.5%
250L-8k
SNN-MIMIC-8k ∼12M ∼8k 21.6%
no convolution or pooling layers
250L-400k
SNN-MIMIC-400k ∼180M ∼400k 20.0%
no convolution or pooling layers

Table 1: Comparison of shallow and deep models: phone error rate (PER) on TIMIT core test set.

The bottom of Table 1 shows the accuracy of shallow mimic nets with 8000 ReLUs and 400,000
ReLUs (SNN-MIMIC-8k and -400k) trained with model compression to mimic the ECNN. Surpris-
ingly, shallow nets are able to perform as well as their deep counterparts when trained with model
compression to mimic a more accurate model. A neural net with one hidden layer (SNN-MIMIC-
8k) can be trained to perform as well as a DNN with a similar number of parameters. Furthermore,
if we increase the number of hidden units in the shallow net from 8k to 400k (the largest we could
1
That SNNs can be trained to be as accurate as DNNs using only the original training data highlights that it
should be possible to train accurate SNNs on the original training data given better learning algorithms.

4
train), we see that a neural net with one hidden layer (SNN-MIMIC-400k) can be trained to perform
comparably to a CNN, even though the SNN-MIMIC-400k net has no convolutional or pooling lay-
ers. This is interesting because it suggests that a large single hidden layer without a topology custom
designed for the problem is able to reach the performance of a deep convolutional neural net that
was carefully engineered with prior structure and weight-sharing without any increase in the number
of training examples, even though the same architecture trained on the original data could not.

83 82
ShallowNet
ShallowNet
DeepNet
DeepNet
82 81 ShallowMimicNet
ShallowMimicNet
Convolutional Net

Accuracy on TIMIT Test Set

Accuracy on TIMIT Dev Set

Convolutional Net
Ensemble of CNNs
81 Ensemble of CNNs 80

80 79

79 78

78 77

77 76

76 75
1 10 100 1 10 100
Number of Parameters (millions) Number of Parameters (millions)

Figure 1: Accuracy of SNNs, DNNs, and Mimic SNNs vs. # of parameters on TIMIT Dev (left) and
Test (right) sets. Accuracy of the CNN and target ECNN are shown as horizontal lines for reference.

Figure 1 shows the accuracy of shallow nets and deep nets trained on the original TIMIT 1.1M data,
and shallow mimic nets trained on the ECNN targets, as a function of the number of parameters in
the models. The accuracy of the CNN and the teacher ECNN are shown as horizontal lines at the top
of the figures. When the number of parameters is small (about 1 million), the SNN, DNN, and SNN-
MIMIC models all have similar accuracy. As the size of the hidden layers increases and the number
of parameters increases, the accuracy of a shallow model trained on the original data begins to lag
behind. The accuracy of the shallow mimic model, however, matches the accuracy of the DNN until
about 4 million parameters, when the DNN begins to fall behind the mimic. The DNN asymptotes
at around 10M parameters, while the shallow mimic continues to increase in accuracy. Eventually
the mimic asymptotes at around 100M parameters to an accuracy comparable to that of the CNN.
The shallow mimic never achieves the accuracy of the ECNN it is trying to mimic (because there
is not enough unlabeled data), but it is able to match or exceed the accuracy of deep nets (DNNs)
having the same number of parameters trained on the original data.

4 Object Recognition: CIFAR-10

To verify that the results on TIMIT generalize to other learning problems and task domains, we ran
similar experiments on the CIFAR-10 Object Recognition Task[12]. CIFAR-10 consists of a set
of natural images from 10 different object classes: airplane, automobile, bird, cat, deer, dog, frog,
horse, ship, truck. The dataset is a labeled subset of the 80 million tiny images dataset[18] and is
divided into 50,000 train and 10,000 test images. Each image is 32x32 pixels in 3 color channels,
yielding input vectors with 3072 dimensions. We prepared the data by subtracting the mean and
dividing the standard deviation of each image vector to perform global contrast normalization. We
then applied ZCA whitening to the normalized images. This pre-processing is the same used in [9].

4.1 Learning to Mimic an Ensemble of Deep Convolutional CIFAR-10 Models

We follow the same approach as with TIMIT: An ensemble of deep CNN models is used to label
CIFAR-10 images for model compression. The logit predictions from this teacher model are used
as regression targets to train a mimic shallow neural net (SNN). CIFAR-10 images have a higher
dimension than TIMIT (3072 vs. 1845), but the size of the CIFAR-10 training set is only 50,000
compared to 1.1 million examples for TIMIT. Fortunately, unlike TIMIT, in CIFAR-10 we have
access to unlabeled data from a similar distribution by using the superset of CIFAR-10: the 80
million tiny images dataset. We add the first one million images from the 80 million set to the
original 50,000 CIFAR-10 training images to create a 1.05M mimic training (transfer) set.

5
Architecture # Param. # Hidden units Err.

DNN 2000-2000 + dropout ∼10M 4k 57.8%

128c-p-1200L-30k
SNN-30k ∼70M ∼190k 21.8%
+ dropout input&hidden
single-layer 4000c-p
∼125M ∼3.7B 18.4%
feature extraction followed by SVM
CNN[11] 64c-p-64c-p-64c-p-16lc
∼10k ∼110k 15.6%
(no augmentation) + dropout on lc
CNN[21] 64c-p-64c-p-128c-p-fc
(no augmentation) + dropout on fc ∼56k ∼120k 15.13%
and stochastic pooling
teacher CNN 128c-p-128c-p-128c-p-1kfc
(no augmentation) + dropout on fc ∼35k ∼210k 12.0%
and stochastic pooling
ECNN
ensemble of 4 CNNs ∼140k ∼840k 11.0%
(no augmentation)
SNN-CNN-MIMIC-30k 64c-p-1200L-30k
∼54M ∼110k 15.4%
trained on a single CNN with no regularization
SNN-CNN-MIMIC-30k 128c-p-1200L-30k
∼70M ∼190k 15.1%
trained on a single CNN with no regularization
SNN-ECNN-MIMIC-30k 128c-p-1200L-30k
∼70M ∼190k 14.2%
trained on ensemble with no regularization

Table 2: Comparison of shallow and deep models: classification error rate on CIFAR-10. Key: c,
convolution layer; p, pooling layer; lc, locally connected layer; fc, fully connected layer

CIFAR-10 images are raw pixels for objects viewed from many different angles and positions,
whereas TIMIT features are human-designed filter-bank features. In preliminary experiments we
observed that non-convolutional nets do not perform well on CIFAR-10, no matter what their depth.
Instead of raw pixels, the authors in [5] trained their shallow models on the SIFT features. Similarly,
[7] used a base convolution and pooling layer to study different deep architectures. We follow the
approach in [7] to allow our shallow models to benefit from convolution while keeping the models
as shallow as possible, and introduce a single layer of convolution and pooling in our shallow mimic
models to act as a feature extractor to create invariance to small translations in the pixel domain. The
SNN-MIMIC models for CIFAR-10 thus consist of a convolution and max pooling layer followed
by fully connected 1200 linear units and 30k non-linear units. As before, the linear units are there
only to speed learning; they do not increase the model’s representational power and can be absorbed
into the weights in the non-linear layer after learning.
Results on CIFAR-10 are consistent with those from TIMIT. Table 2 shows results for the shallow
mimic models, and for much deeper convolutional nets. The shallow mimic net trained to mimic the
teacher CNN (SNN-CNN-MIMIC-30k) achieves accuracy comparable to CNNs with multiple con-
volutional and pooling layers. And by training the shallow model to mimic the ensemble of CNNs
(SNN-ECNN-MIMIC-30k), accuracy is improved an additional 0.9%. The mimic models are able
to achieve accuracies previously unseen on CIFAR-10 with models with so few layers. Although the
deep convolutional nets have more hidden units than the shallow mimic models, because of weight
sharing, the deeper nets with multiple convolution layers have fewer parameters than the shallow
fully connected mimic models. Still, it is surprising to see how accurate the shallow mimic mod-
els are, and that their performance continues to improve as the performance of the teacher model
improves (see further discussion of this in Section 5.2).

5 Discussion
5.1 Why Mimic Models Can Be More Accurate than Training on Original Labels
It may be surprising that models trained on targets predicted by other models can be more accurate
than models trained on the original labels. There are a variety of reasons why this can happen:

6
• If some labels have errors, the teacher model may eliminate some of these errors (i.e.,
censor the data), thus making learning easier for the student.
• Similarly, if there are complex regions in p(y|X) that are difficult to learn given the features
and sample density, the teacher may provide simpler, soft labels to the student. Complexity
can be washed away by filtering targets through the teacher model.
• Learning from the original hard 0/1 labels can be more difficult than learning from a
teacher’s conditional probabilities: on TIMIT only one of 183 outputs is non-zero on each
training case, but the mimic model sees non-zero targets for most outputs on most train-
ing cases, and the teacher can spread uncertainty over multiple outputs for difficult cases.
The uncertainty from the teacher model is more informative to the student model than the
original 0/1 labels. This benefit is further enhanced by training on logits.
• The original targets may depend in part on features not available as inputs for learning, but
the student model sees targets that depend only on the input features; the targets from the
teacher model are a function only of the available inputs; the dependence on unavailable
features has been eliminated by filtering targets through the teacher model.

The mechanisms above can be seen as forms

of regularization that help prevent overfitting in 77.5
the student model. Typically, shallow models SNN-8k
SNN-8k + dropout
trained on the original targets are more prone 77 SNN-Mimic-8k

to overfitting than deep models—they begin to Phone Recognition Accuracy 76.5

overfit before learning the accurate functions
76
learned by deeper models even with dropout
(see Figure 2). If we had more effective regu- 75.5
larization methods for shallow models, some of
75
the performance gap between shallow and deep
models might disappear. Model compression 74.5
appears to be a form of regularization that is ef-
74
fective at reducing this gap. 0 2 4 6 8 10 12 14
Number of Epochs

5.2 The Capacity and Representational Figure 2: Shallow mimic tends not to overfit.
Power of Shallow Models

Figure 3 shows results of an experiment with

TIMIT where we trained shallow mimic mod- 83
els of two sizes (SNN-MIMIC-8k and SNN- Mimic with 8k Non-Linear Units
Mimic with 160k Non-Linear Units
MIMIC-160k) on teacher models of different y=x (no student-teacher gap)
Accuracy of Mimic Model on Dev Set

82
accuracies. The two shallow mimic models are
trained on the same number of data points. The
only difference between them is the size of the 81
hidden layer. The x-axis shows the accuracy of
the teacher model, and the y-axis is the accu-
80
racy of the mimic models. Lines parallel to the
diagonal suggest that increases in the accuracy
of the teacher models yield similar increases in 79

the accuracy of the mimic models. Although

the data does not fall perfectly on a diagonal,
78
there is strong evidence that the accuracy of the 78 79 80 81 82 83

mimic models continues to increase as the ac- Accuracy of Teacher Model on Dev Set

curacy of the teacher model improves, suggest- Figure 3: Accuracy of student models continues to
ing that the mimic models are not (yet) running improve as accuracy of teacher models improves.
out of capacity. When training on the same tar-
gets, SNN-MIMIC-8k always perform worse than SNN-MIMIC-160K that has 10 times more pa-
rameters. Although there is a consistent performance gap between the two models due to the differ-
ence in size, the smaller shallow model was eventually able to achieve a performance comparable to
the larger shallow net by learning from a better teacher, and the accuracy of both models continues
to increase as teacher accuracy increases. This suggests that shallow models with a number of pa-
rameters comparable to deep models probably are capable of learning even more accurate functions

7
if a more accurate teacher and/or more unlabeled data become available. Similarly, on CIFAR-10
we saw that increasing the accuracy of the teacher model by forming an ensemble of deep CNNs
yielded commensurate increase in the accuracy of the student model. We see little evidence that shal-
low models have limited capacity or representational power. Instead, the main limitation appears to
be the learning and regularization procedures used to train the shallow models.

5.3 Parallel Distributed Processing vs. Deep Sequential Processing

Our results show that shallow nets can be competitive with deep models on speech and vision tasks.
In our experiments the deep models usually required 8–12 hours to train on Nvidia GTX 580 GPUs
to reach the state-of-the-art performance on TIMIT and CIFAR-10 datasets. Interestingly, although
some of the shallow mimic models have more parameters than the deep models, the shallow models
train much faster and reach similar accuracies in only 1–2 hours.
Also, given parallel computational resources, at run-time shallow models can finish computation in
2 or 3 cycles for a given input, whereas a deep architecture has to make sequential inference through
each of its layers, expending a number of cycles proportional to the depth of the model. This benefit
can be important in on-line inference settings where data parallelization is not as easy to achieve
as it is in the batch inference setting. For real-time applications such as surveillance or real-time
speech translation, a model that responds in fewer cycles can be beneficial.

6 Future Work

The tiny images dataset contains 80 millions images. We are currently investigating whether, if by
labeling these 80M images with a teacher, it is possible to train shallow models with no convolutional
or pooling layers to mimic deep convolutional models.
This paper focused on training the shallowest-possible models to mimic deep models in order to
better understand the importance of model depth in learning. As suggested in Section 5.3, there
are practical applications of this work as well: student models of small-to-medium size and depth
can be trained to mimic very large, high-accuracy deep models, and ensembles of deep models,
thus yielding better accuracy with reduced runtime cost than is currently achievable without model
compression. This approach allows one to adjust flexibly the trade-off between accuracy and com-
putational cost.
In this paper we are able to demonstrate empirically that shallow models can, at least in principle,
learn more accurate functions without a large increase in the number of parameters. The algorithm
we use to do this—training the shallow model to mimic a more accurate deep model, however, is
awkward. It depends on the availability of either a large unlabeled dataset (to reduce the gap between
teacher and mimic model) or a teacher model of very high accuracy, or both. Developing algorithms
to train shallow models of high accuracy directly from the original data without going through the
intermediate teacher model would, if possible, be a significant contribution.

7 Conclusions

We demonstrate empirically that shallow neural nets can be trained to achieve performances pre-
viously achievable only by deep models on the TIMIT phoneme recognition and CIFAR-10 image
recognition tasks. Single-layer fully connected feedforward nets trained to mimic deep models can
perform similarly to well-engineered complex deep convolutional architectures. The results suggest
that the strength of deep learning may arise in part from a good match between deep architectures
and current training procedures, and that it may be possible to devise better learning algorithms to
train more accurate shallow feed-forward nets. For a given number of parameters, depth may make
learning easier, but may not always be essential.

Acknowledgements We thank Li Deng for generous help with TIMIT, Li Deng and Ossama Abdel-
Hamid for the code for their deep convolutional TIMIT model, Chris Burges, Li Deng, Ran Gilad-
Bachrach, Tapas Kanungo and John Platt for discussion that significantly improved this work, David
Johnson for help with the speech model, and Mike Aultman for help with the GPU cluster.

8
References
[1] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, and Gerald Penn. Applying convolutional
neural networks concepts to hybrid nn-hmm model for speech recognition. In Acoustics, Speech and
Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4277–4280. IEEE, 2012.
[2] Ossama Abdel-Hamid, Li Deng, and Dong Yu. Exploring convolutional neural network structures and
optimization techniques for speech recognition. Interspeech 2013, 2013.
[3] Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings
of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages
535–541. ACM, 2006.
[4] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control,
Signals and Systems, 2(4):303–314, 1989.
[5] Yann N Dauphin and Yoshua Bengio. Big neural networks waste capacity. arXiv preprint
arXiv:1301.3583, 2013.
[6] Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael Seltzer, Geoff Zweig,
Xiaodong He, Jason Williams, et al. Recent advances in deep learning for speech research at Microsoft.
ICASSP 2013, 2013.
[7] David Eigen, Jason Rolfe, Rob Fergus, and Yann LeCun. Understanding deep architectures using a
recursive convolutional network. arXiv preprint arXiv:1312.1847, 2013.
[8] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy
Bengio. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning
Research, 11:625–660, 2010.
[9] Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout net-
works. In Proceedings of The 30th International Conference on Machine Learning, pages 1319–1327,
2013.
[10] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science,
313(5786):504–507, 2006.
[11] G.E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R.R. Salakhutdinov. Improving neural net-
works by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[12] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer
Science Department, University of Toronto, Tech. Rep, 2009.
[13] K.F. Lee and H.W. Hon. Speaker-independent phone recognition using hidden markov models. Acoustics,
Speech and Signal Processing, IEEE Transactions on, 37(11):1641–1648, 1989.
[14] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep belief
networks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14–22, 2012.
[15] V. Nair and G.E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proc. 27th
International Conference on Machine Learning, pages 807–814. Omnipress Madison, WI, 2010.
[16] Tara N Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank
matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics,
Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659.
IEEE, 2013.
[17] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep
neural networks. In Interspeech, pages 437–440, 2011.
[18] Antonio Torralba, Robert Fergus, and William T Freeman. 80 million tiny images: A large data set for
nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transac-
tions on, 30(11):1958–1970, 2008.
[19] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol. Stacked denoising autoencoders:
Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine
Learning Research, 11:3371–3408, 2010.
[20] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular
value decomposition. Proc. Interspeech, Lyon, France, 2013.
[21] Matthew D. Zeiler and Rob Fergus. Stochastic pooling for regularization of deep convolutional neural
networks. arXiv preprint arXiv:1301.3557, 2013.

RADL TQKhoat
No ratings yet
RADL TQKhoat
50 pages
CS60010 - Deep - NN - PPTX (1) 2
No ratings yet
CS60010 - Deep - NN - PPTX (1) 2
50 pages
4 Classification 2
No ratings yet
4 Classification 2
55 pages
ANN Unit-2
No ratings yet
ANN Unit-2
48 pages
Chapter 06 - in Class
No ratings yet
Chapter 06 - in Class
37 pages
Lit Deep Learning
No ratings yet
Lit Deep Learning
19 pages
Deep - Learning
No ratings yet
Deep - Learning
49 pages
5th Unit DL Final Class Notes
No ratings yet
5th Unit DL Final Class Notes
77 pages
DEU CSC5045 Intelligent System Applications Using Fuzzy - 7+Deep+Learning
No ratings yet
DEU CSC5045 Intelligent System Applications Using Fuzzy - 7+Deep+Learning
108 pages
Tom M CMU ANN Lecture Notes
No ratings yet
Tom M CMU ANN Lecture Notes
68 pages
Bio Optimization of Deep Learning Network Architectures 22fguqp5
No ratings yet
Bio Optimization of Deep Learning Network Architectures 22fguqp5
11 pages
Deep Unsupervised Learning
No ratings yet
Deep Unsupervised Learning
90 pages
20 StatMechDeep
No ratings yet
20 StatMechDeep
30 pages
Lecture Slides 1 - Introduction, PLA, and Logistic Regression - 2021
No ratings yet
Lecture Slides 1 - Introduction, PLA, and Logistic Regression - 2021
48 pages
Conmatphys 031119 050745
No ratings yet
Conmatphys 031119 050745
28 pages
Statistics Mechanic of Deep Learning
No ratings yet
Statistics Mechanic of Deep Learning
28 pages
Module 2 DL Snotes P1
No ratings yet
Module 2 DL Snotes P1
16 pages
21D070048_SRE_Report
No ratings yet
21D070048_SRE_Report
4 pages
Machine Learning and Pattern Recognition Week 8 - Neural - Net - Fitting
No ratings yet
Machine Learning and Pattern Recognition Week 8 - Neural - Net - Fitting
3 pages
Deep Learning Tutorial: Reference: Hung-Yi Lee
100% (1)
Deep Learning Tutorial: Reference: Hung-Yi Lee
179 pages
Deep Neural Networks Are Lazy: On The Inductive Bias of Deep Learning
No ratings yet
Deep Neural Networks Are Lazy: On The Inductive Bias of Deep Learning
78 pages
Statistical Mechanics of Deep Learning
No ratings yet
Statistical Mechanics of Deep Learning
30 pages
Jntuk r20 Unit V Deep Learning Techniqueswwwjntumaterials
No ratings yet
Jntuk r20 Unit V Deep Learning Techniqueswwwjntumaterials
32 pages
Unit 2.1
No ratings yet
Unit 2.1
37 pages
ML LittelBook
No ratings yet
ML LittelBook
161 pages
Deep Learning Notes
100% (1)
Deep Learning Notes
71 pages
A Little Book of Deep Learning - Francois Fleuret
No ratings yet
A Little Book of Deep Learning - Francois Fleuret
149 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
140 pages
S. Radhakrshinan
No ratings yet
S. Radhakrshinan
37 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
168 pages
LBDL
No ratings yet
LBDL
142 pages
Deep Learning Hand Book 2024
No ratings yet
Deep Learning Hand Book 2024
185 pages
Deep Learning
No ratings yet
Deep Learning
299 pages
Deep Learning 1
No ratings yet
Deep Learning 1
48 pages
Cheatsheets For Deep Learning 1650192034
No ratings yet
Cheatsheets For Deep Learning 1650192034
95 pages
Unit 3
No ratings yet
Unit 3
16 pages
Deep Learning Unit 2
No ratings yet
Deep Learning Unit 2
25 pages
2.game AI 1
No ratings yet
2.game AI 1
268 pages
Map of The GD&T World
No ratings yet
Map of The GD&T World
2 pages
Deep Learning Model
No ratings yet
Deep Learning Model
144 pages
WATERAX Genuine Parts Catalog PDF
100% (1)
WATERAX Genuine Parts Catalog PDF
77 pages
Unit - V
No ratings yet
Unit - V
44 pages
100 Geometry Problems: Contributors: Djmathman, Abishek99, Captainflint
No ratings yet
100 Geometry Problems: Contributors: Djmathman, Abishek99, Captainflint
8 pages
Unit 1
No ratings yet
Unit 1
20 pages
Unit 4 Short Notes Deep Feedforward Networks Gradient Learning
No ratings yet
Unit 4 Short Notes Deep Feedforward Networks Gradient Learning
27 pages
The Deep Learning Revolution: Introductory Overview Lecture
No ratings yet
The Deep Learning Revolution: Introductory Overview Lecture
35 pages
Welding Metallurgy of Carbon Steel PDF
100% (1)
Welding Metallurgy of Carbon Steel PDF
17 pages
The Little Book of Deep Learning
100% (1)
The Little Book of Deep Learning
140 pages
Unit II
No ratings yet
Unit II
27 pages
City School Itep Test
100% (4)
City School Itep Test
4 pages
Vectors and Projectiles
No ratings yet
Vectors and Projectiles
8 pages
Deep Neural Network
No ratings yet
Deep Neural Network
12 pages
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
No ratings yet
A Selective Overview of Deep Learning: Jianqing Fan Cong Ma Yiqiao Zhong April 16, 2019
37 pages
NLP-NeuralNetworks Reading Notes
No ratings yet
NLP-NeuralNetworks Reading Notes
13 pages
The Little Book of Deep Learning
No ratings yet
The Little Book of Deep Learning
163 pages
Unit 1 Handwritten Notes
No ratings yet
Unit 1 Handwritten Notes
13 pages
(Ebook) Understanding Deep Learning by Simon J. D. Prince ISBN 9780262048644, 0262048647 all chapters available
No ratings yet
(Ebook) Understanding Deep Learning by Simon J. D. Prince ISBN 9780262048644, 0262048647 all chapters available
95 pages
Xavier Initialization PDF
No ratings yet
Xavier Initialization PDF
8 pages
ECE/CS 559 - Neural Networks Lecture Notes #3 Some Example Neural Networks
No ratings yet
ECE/CS 559 - Neural Networks Lecture Notes #3 Some Example Neural Networks
7 pages
Data Mining: Practical Machine Learning Tools and Techniques
No ratings yet
Data Mining: Practical Machine Learning Tools and Techniques
123 pages
Understanding Deep Learning Simon J. D. Prince Instant Download
100% (2)
Understanding Deep Learning Simon J. D. Prince Instant Download
61 pages
11th Physics Book Back Questions With Answers in English
No ratings yet
11th Physics Book Back Questions With Answers in English
29 pages
Rectangular Pyramid Easy 1
No ratings yet
Rectangular Pyramid Easy 1
2 pages
Lecture 1a - Introduction
No ratings yet
Lecture 1a - Introduction
38 pages
Christianity As A Double-Edged Sword in Colonial Africa
No ratings yet
Christianity As A Double-Edged Sword in Colonial Africa
12 pages
Industrial Training (Presentation Slide)
No ratings yet
Industrial Training (Presentation Slide)
20 pages
Cifras Internacionais
No ratings yet
Cifras Internacionais
17 pages
The Geometry of Futon Comfort
No ratings yet
The Geometry of Futon Comfort
5 pages
Nips10 Workshop Tutorial Final PDF
No ratings yet
Nips10 Workshop Tutorial Final PDF
73 pages
A Group Theoretic Perspective On Unsupervised Deep Learning
No ratings yet
A Group Theoretic Perspective On Unsupervised Deep Learning
2 pages
1.4. Exact ODEs. Integrating Factors
No ratings yet
1.4. Exact ODEs. Integrating Factors
9 pages
Bead Weaving
No ratings yet
Bead Weaving
17 pages
WK2 Cloud Computing Presentation PDF
No ratings yet
WK2 Cloud Computing Presentation PDF
16 pages
Pathogenic Cellular and Molecular Mediators in Lupus Nephritis
No ratings yet
Pathogenic Cellular and Molecular Mediators in Lupus Nephritis
18 pages
List Turnkey Approved Material
No ratings yet
List Turnkey Approved Material
24 pages
PT Mapeh-6 Q3
No ratings yet
PT Mapeh-6 Q3
11 pages
NIH Public Access: The Risks Associated With Alcohol Use and Alcoholism
No ratings yet
NIH Public Access: The Risks Associated With Alcohol Use and Alcoholism
15 pages
Lampiran SPTJM PSP
No ratings yet
Lampiran SPTJM PSP
7 pages
MTH 601 MCQ
0% (1)
MTH 601 MCQ
4 pages
XPS FOAM - SquareEdge
No ratings yet
XPS FOAM - SquareEdge
4 pages
First Solar Filing To AZCC
No ratings yet
First Solar Filing To AZCC
5 pages
s1 Result Analysis
No ratings yet
s1 Result Analysis
4 pages
C - Diagnostic Testers 43 - Diagnostic Test - Ignition Coil Test (Only Bear Tester) All Engines
No ratings yet
C - Diagnostic Testers 43 - Diagnostic Test - Ignition Coil Test (Only Bear Tester) All Engines
1 page
Mod. AV-1/EV: Tabletop Power Supply Unit For Electric Measurements and Machines
No ratings yet
Mod. AV-1/EV: Tabletop Power Supply Unit For Electric Measurements and Machines
1 page
Succeed I Can Worksheet
No ratings yet
Succeed I Can Worksheet
8 pages
Navsure N400i
No ratings yet
Navsure N400i
76 pages