DeepLearningBook_RefsByLastFirstNames
Foundations and Trends® in Signal Processing
Vol. 7, Nos. 3–4 (2013) 197–387
© 2014 L. Deng and D. Yu
DOI: 10.1561/2000000039
Li Deng, Microsoft Research, One Microsoft Way, Redmond, WA 98052; USA, [email protected]
Dong Yu, Microsoft Research, One Microsoft Way, Redmond, WA 98052; USA, [email protected]
Contents
1 Introduction
1.1 Definitions and background
1.2 Organization of this monograph
12 Conclusion
References
Abstract
L. Deng and D. Yu. Deep Learning: Methods and Applications. Foundations and
Trends® in Signal Processing, vol. 7, nos. 3–4, pp. 197–387, 2013.
DOI: 10.1561/2000000039.
1
Introduction
1.1 Definitions and background
The authors have been actively involved in deep learning research and
in organizing or providing several of the above events, tutorials, and
editorials. In particular, they gave tutorials and invited lectures on
this topic at various places. Part of this monograph is based on their
tutorials and lecture material.
Before describing the details of deep learning, we provide the necessary
definitions. Deep learning has various closely related definitions or
high-level descriptions:
• http://deeplearning.net/reading-list/
• http://ufldl.stanford.edu/wiki/index.php/UFLDL_Recommended_Readings
• http://www.cs.toronto.edu/~hinton/
• http://deeplearning.net/tutorial/
• http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
206 Some Historical Context of Deep Learning
Figure 2.1: Gartner hype cycle graph representing five phases of a technology
(http://en.wikipedia.org/wiki/Hype_cycle).
Figure 2.2: Applying the Gartner hype cycle graph to analyzing the history of artificial
neural network technology (we thank our colleague John Platt for bringing this
type of “hype cycle” graph to our attention in 2012 as a concise way to analyze
the neural network history).
Figure 2.3: The famous NIST plot showing the historical speech recognition error
rates achieved by the GMM-HMM approach for a number of increasingly difficult
speech recognition tasks. Data source:
http://itl.nist.gov/iad/mig/publications/ASRhistory/index.html
Figure 2.4: Extracting WERs of one task from Figure 2.3 and adding the signifi-
cantly lower WER (marked by the star) achieved by the DNN technology.
3.1 A three-way categorization
latter case, the use of Bayes rule can turn this type of generative
network into a discriminative one for learning.
2. Deep networks for supervised learning, which are intended
to directly provide discriminative power for pattern classifica-
tion purposes, often by characterizing the posterior distributions
of classes conditioned on the visible data. Target label data are
always available in direct or indirect forms for such supervised
learning. They are also called discriminative deep networks.
3. Hybrid deep networks, where the goal is discrimination, which
is assisted, often in a significant way, by the outcomes of generative
or unsupervised deep networks. This can be accomplished by
better optimization and/or regularization of the deep networks
in category (2). The goal can also be accomplished when discriminative
criteria for supervised learning are used to estimate the
parameters in any of the deep generative or unsupervised deep
networks in category (1) above.
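The remark under category (1), that Bayes rule can turn a generative network into a discriminative one, can be illustrated with a minimal sketch. It assumes, purely for illustration, one diagonal-Gaussian generative model p(x | c) per class. The function names and the toy parameters are hypothetical and are not taken from the monograph.

```python
import numpy as np

def log_gaussian(x, mean, var):
    """Log-density of a diagonal Gaussian, evaluated for each row of x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def class_posteriors(x, means, variances, priors):
    """Bayes rule: p(c | x) is proportional to p(x | c) p(c)."""
    log_joint = np.stack(
        [log_gaussian(x, m, v) + np.log(p)
         for m, v, p in zip(means, variances, priors)],
        axis=1,
    )
    # Subtract the row-wise maximum before exponentiating, for numerical stability.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)

# Toy example (assumed values): two classes with unit-variance Gaussians.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
variances = [np.ones(2), np.ones(2)]
priors = [0.5, 0.5]
x = np.array([[0.1, -0.2], [2.9, 3.3], [1.5, 1.5]])
post = class_posteriors(x, means, variances, priors)
```

Each row of `post` is a posterior distribution over the classes, so the generative class models are used here as a classifier even though they were specified generatively.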
Note the use of “hybrid” in (3) above is different from that used
sometimes in the literature, which refers to the hybrid systems for
speech recognition feeding the output probabilities of a neural network
into an HMM [17, 25, 42, 261].
By the commonly adopted machine learning tradition (e.g.,
Chapter 28 in [264] and Reference [95]), it may be natural to just
classify deep learning techniques into deep discriminative models (e.g.,
deep neural networks or DNNs, recurrent neural networks or RNNs,
convolutional neural networks or CNNs, etc.) and generative/unsupervised
models (e.g., restricted Boltzmann machines or RBMs, deep belief
networks or DBNs, deep Boltzmann machines or DBMs, regularized
autoencoders, etc.). This two-way classification scheme, however,
misses a key insight gained in deep learning research about how
generative or unsupervised-learning models can greatly improve the training
of DNNs and other deep discriminative or supervised-learning models
via better regularization or optimization. Also, deep networks for
unsupervised learning need not be probabilistic or allow
meaningful sampling from the model (e.g., traditional autoencoders,
sparse coding networks, etc.). We note here that more recent
216 Three Classes of Deep Learning Networks
models, and the “product” nodes build up the feature hierarchy. Prop-
erties of “completeness” and “consistency” constrain the SPN in a desir-
able way. The learning of SPNs is carried out using the EM algorithm
together with back-propagation. The learning procedure starts with a
dense SPN. It then finds an SPN structure by learning its weights,
where zero weights indicate removed connections. The main difficulty
in learning SPNs is that the learning signal (i.e., the gradient) quickly
dilutes when it propagates to deep layers. Empirical solutions have been
found to mitigate this difficulty as reported in [289]. It was pointed
out in that early paper that despite the many desirable generative
properties in the SPN, it is difficult to fine tune the parameters using
the discriminative information, limiting its effectiveness in classifica-
tion tasks. However, this difficulty has been overcome in the subse-
quent work reported in [125], where an efficient BP-style discriminative
training algorithm for SPN was presented. Importantly, the standard
gradient descent, based on the derivative of the conditional likelihood,
suffers from the same gradient diffusion problem well known in the
regular DNNs. The trick to alleviate this problem in learning SPNs
is to replace the marginal inference with the most probable state of
the hidden variables and to propagate gradients through this “hard”
alignment only. Excellent results on small-scale image recognition tasks
were reported by Gens and Domingos [125].
Recurrent neural networks (RNNs) can be considered as another
class of deep networks for unsupervised (as well as supervised) learning,
where the depth can be as large as the length of the input data sequence.
In the unsupervised learning mode, the RNN is used to predict the future
data sequence from the previous data samples, and no additional
class information is used for learning. The RNN is very powerful
for modeling sequence data (e.g., speech or text), but until recently
it had not been widely used, partly because it is difficult to train
to capture long-term dependencies: training gives rise to the vanishing or
exploding gradient problems, which have been known since the early 1990s [29, 167].
These problems can now be dealt with more easily [24, 48, 85, 280].
Recent advances in Hessian-free optimization [238] have also partially
overcome this difficulty using approximated second-order information
or stochastic curvature estimates. In the more recent work [239], RNNs
3.2. Deep networks for unsupervised or generative learning 221
modeling tool, the deep architecture of speech has more recently been
successfully applied to solve the very difficult problem of single-channel,
multi-talker speech recognition, where the mixed speech is the visible
variable while the un-mixed speech becomes represented in a new hid-
den layer in the deep generative architecture [301, 391]. Deep generative
graphical models are indeed a powerful tool in many applications due
to their capability of embedding domain knowledge. However, they are
often used with inappropriate approximations in inference, learning,
prediction, and topology design, all arising from inherent intractability
in these tasks for most real-world applications. This problem has been
addressed in the recent work of Stoyanov et al. [352], which provides
an interesting direction for making deep generative graphical models
potentially more useful in practice in the future. An even more drastic
way to deal with this intractability was proposed recently by Bengio
et al. [30], where the need to marginalize latent variables is avoided
altogether.
The standard statistical methods used for large-scale speech recog-
nition and understanding combine (shallow) hidden Markov models
for speech acoustics with higher layers of structure representing dif-
ferent levels of natural language hierarchy. This combined hierarchical
model can be suitably regarded as a deep generative architecture, whose
motivation and some technical detail may be found in Section 7 of the
recent monograph [200] on “Hierarchical HMM” or HHMM. Related
models with greater technical depth and mathematical treatment can
be found in [116] for HHMM and [271] for Layered HMM. These early
deep models were formulated as directed graphical models, missing the
key aspect of “distributed representation” embodied in the more recent
deep generative networks of the DBN and DBM discussed earlier in this
chapter. Filling in this missing aspect would help improve these gener-
ative models.
Finally, dynamic or temporally recursive generative models based
on neural network architectures can be found in [361] for human motion
modeling, and in [344, 339] for natural language and natural scene pars-
ing. The latter model is particularly interesting because the learning
algorithms are capable of automatically determining the optimal model
structure. This contrasts with other deep architectures such as DBN
3.3. Deep networks for supervised learning 223
where only the parameters are learned while the architectures need to
be pre-defined. Specifically, as reported in [344], the recursive struc-
ture commonly found in natural scene images and in natural language
sentences can be discovered using a max-margin structure prediction
architecture. It is shown that the units contained in the images or sen-
tences are identified, and the way in which these units interact with
each other to form the whole is also identified.
The term “hybrid” for this third category refers to the deep architecture
that either comprises or makes use of both generative and discrimina-
tive model components. In the existing hybrid architectures published
in the literature, the generative component is mostly exploited to help
with discrimination, which is the final goal of the hybrid architecture.
How and why generative modeling can help with discrimination can be
examined from two viewpoints [114]:
This section and the next two will each select one prominent example
deep network for each of the three categories outlined in Section 3.
Here we begin with the category of the deep models designed mainly
for unsupervised learning.
4.1 Introduction
4.2 Use of deep autoencoders to extract speech features
Figure 4.1: The architecture of the deep autoencoder used in [100] for extracting
binary speech codes from high-resolution spectrograms. [after [100], © Elsevier].
probabilities of its hidden units are treated as the data for training
another Bernoulli-Bernoulli RBM. These two RBMs can then be composed
to form a deep belief net (DBN) in which it is easy to infer the
states of the second layer of binary hidden units from the input in a
single forward pass. The DBN used in this work is illustrated on the left
side of Figure 4.1, where the two RBMs are shown in separate boxes.
(See more detailed discussions on the RBM and DBN in Section 5).
The deep autoencoder with three hidden layers is formed by
“unrolling” the DBN using its weight matrices. The lower layers of
this deep autoencoder use the matrices to encode the input and the
upper layers use the matrices in reverse order to decode the input.
This deep autoencoder is then fine-tuned using error back-propagation
to minimize the reconstruction error, as shown on the right side of Fig-
ure 4.1. After learning is complete, any variable-length spectrogram
Figure 4.2: Top to bottom: The original spectrogram; reconstructions using input
window sizes of N = 1, 3, 9, and 13 while forcing the coding units to take values of
zero or one (i.e., a binary code). [after [100], © Elsevier].
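The “unrolling” of two stacked RBMs into a deep autoencoder with three hidden layers, described above in connection with Figure 4.1, can be sketched as follows. This is a minimal illustration, not the implementation used in [100]: the layer sizes, the random stand-in weights, and the linear output layer for real-valued spectrogram inputs are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unroll(rbms):
    """Build encoder and decoder layers from stacked RBMs.

    rbms: list of (W, hidden_bias, visible_bias), bottom RBM first,
          with W of shape (n_visible, n_hidden).
    The decoder reuses the transposed matrices in reverse order; copies are
    taken so that later fine-tuning could untie encoder and decoder weights.
    """
    encoder = [(W.copy(), hb.copy()) for W, hb, vb in rbms]
    decoder = [(W.T.copy(), vb.copy()) for W, hb, vb in reversed(rbms)]
    return encoder, decoder

def autoencode(x, encoder, decoder):
    """Return the code-layer activations and the reconstruction of x."""
    h = x
    for W, b in encoder:
        h = sigmoid(h @ W + b)
    code = h
    for i, (W, b) in enumerate(decoder):
        h = h @ W + b
        if i < len(decoder) - 1:
            h = sigmoid(h)  # assumption: only the output layer is linear
    return code, h

# Hypothetical sizes: 20-dimensional input, 12 and 4 hidden units.
rng = np.random.default_rng(0)
rbms = [
    (0.1 * rng.standard_normal((20, 12)), np.zeros(12), np.zeros(20)),
    (0.1 * rng.standard_normal((12, 4)), np.zeros(4), np.zeros(12)),
]
encoder, decoder = unroll(rbms)
code, recon = autoencode(rng.standard_normal((5, 20)), encoder, decoder)
```

With two RBMs, the unrolled network has hidden layers of sizes 12, 4, and 12, which matches the three hidden layers mentioned in the text. Fine-tuning by back-propagation of the reconstruction error would then start from these weights.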
234 Deep Autoencoders — Unsupervised Learning
Figure 4.3: Top to bottom: The original spectrogram from the test set; reconstruc-
tion from the 312-bit VQ coder; reconstruction from the 312-bit autoencoder; coding
errors as a function of time for the VQ coder (blue) and autoencoder (red); spec-
trogram of the VQ coder residual; spectrogram of the deep autoencoder’s residual.
[after [100], © Elsevier].
4.3. Stacked denoising autoencoders 235
Figure 4.4: The original speech spectrogram and the reconstructed counterpart.
A total of 312 binary codes are used for each single frame.
Figure 4.5: Same as Figure 4.4 but with a different TIMIT speech utterance.
Figure 4.6: The original speech spectrogram and the reconstructed counterpart.
A total of 936 binary codes are used for three adjacent frames.
Figure 4.7: Same as Figure 4.6 but with a different TIMIT speech utterance.
Figure 4.8: Same as Figure 4.6 but with yet another TIMIT speech utterance.
higher dimension in the hidden or encoding layers than the input layer
is that it allows the autoencoder to capture a rich input distribution.
The trivial mapping problem discussed above can be prevented by
methods such as imposing sparseness constraints, or using the “dropout”
trick of randomly forcing certain values to zero and thus introducing
distortions in the input data [376, 375] or in the hidden layers [166]. For
Figure 4.9: The original speech spectrogram and the reconstructed counterpart.
A total of 2000 binary codes are used for each single frame.
Figure 4.10: Same as Figure 4.9 but with a different TIMIT speech utterance.
input sample is different, it greatly increases the training set size and
thus can alleviate the overfitting problem.
It is interesting to note that when the encoding and decoding
weights are forced to be the transpose of each other, such a denoising
autoencoder with a single sigmoidal hidden layer is strictly equivalent
to a particular Gaussian RBM. Instead of being trained by the
technique of contrastive divergence (CD) or persistent CD, however, it is
trained by a score matching principle, where the score is defined as the
derivative of the log-density with respect to the input [375]. Furthermore,
Alain and Bengio [5] generalized this result to any parameterization
of the encoder and decoder with squared reconstruction error and
Gaussian corruption noise. They show that as the amount of noise
approaches zero, such models estimate the true score of the underly-
ing data generating distribution. Finally, Bengio et al. [30] show that
any denoising autoencoder is a consistent estimator of the underly-
ing data generating distribution within some family of distributions.
This is true for any parameterization of the autoencoder, for any
type of information-destroying corruption process with no constraint
on the noise level except being positive, and for any reconstruction
loss expressed as a conditional log-likelihood. The consistency of the
estimator is achieved by associating the denoising autoencoder with
a Markov chain whose stationary distribution is the distribution esti-
mated by the model, and this Markov chain can be used to sample
from the denoising autoencoder.
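To make the tied-weight denoising autoencoder discussed above concrete, the following minimal sketch computes its squared reconstruction loss and gradients for a single sigmoidal hidden layer, with decoding weights equal to the transpose of the encoding weights and with Gaussian corruption noise. The linear decoder, the function names, and all sizes are assumptions of this sketch and are not prescribed by the cited works.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(X, sigma, rng):
    """Gaussian corruption: add zero-mean noise with standard deviation sigma."""
    return X + sigma * rng.standard_normal(X.shape)

def dae_loss_and_grads(X, X_tilde, W, b, c):
    """Squared reconstruction loss of a tied-weight denoising autoencoder.

    X:       (n, d) clean inputs, used as the reconstruction targets
    X_tilde: (n, d) corrupted inputs, fed to the encoder
    W:       (d, k) encoder weights; the decoder uses W.T (tied weights)
    b, c:    hidden and output biases
    """
    n = X.shape[0]
    H = sigmoid(X_tilde @ W + b)        # encode the corrupted input
    R = H @ W.T + c                     # decode (linear decoder: an assumption)
    err = R - X                         # compare against the CLEAN input
    loss = 0.5 * np.sum(err ** 2) / n
    dA = (err @ W) * H * (1.0 - H)      # back-propagate through the sigmoid
    gW = (X_tilde.T @ dA + err.T @ H) / n   # encoder plus decoder contributions
    gb = dA.sum(axis=0) / n
    gc = err.sum(axis=0) / n
    return loss, gW, gb, gc

# Hypothetical toy data and one plain gradient-descent step.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))
W, b, c = 0.1 * rng.standard_normal((5, 3)), np.zeros(3), np.zeros(5)
loss, gW, gb, gc = dae_loss_and_grads(X, corrupt(X, 0.3, rng), W, b, c)
W, b, c = W - 0.1 * gW, b - 0.1 * gb, c - 0.1 * gc
```

Because a fresh corruption is drawn each time `corrupt` is called, every pass over the data presents the network with different inputs while the targets stay clean.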
The deep autoencoder described above can extract faithful codes for
feature vectors due to many layers of nonlinear processing. However, the
code extracted in this way is transformation-variant. In other words,
the extracted code would change in ways chosen by the learner when the
input feature vector is transformed. Sometimes, it is desirable to have
the code change predictably to reflect the underlying transformation-
invariant property of the perceived content. This is the goal of the
transforming autoencoder proposed in [162] for image recognition.
In this section, we present the most widely used hybrid deep archi-
tecture — the pre-trained deep neural network (DNN), and discuss
the related techniques and building blocks including the RBM and
DBN. We discuss the DNN example here in the category of hybrid
deep networks before the examples in the category of deep networks for
supervised learning (Section 6). This is partly due to the natural flow
from the unsupervised learning models to the DNN as a hybrid model.
The discriminative nature of artificial neural networks for supervised
learning has been widely known, and thus would not be required for
understanding the hybrid nature of the DNN that uses unsupervised
pre-training to facilitate the subsequent discriminative fine tuning.
Part of the review in this chapter is based on recent publications in
[68, 161, 412].
An RBM is a special type of Markov random field that has one layer of
(typically Bernoulli) stochastic hidden units and one layer of (typically
Bernoulli or Gaussian) stochastic visible or observable units. RBMs can
242 Pre-Trained Deep Neural Networks — A Hybrid
where $E_{\text{data}}(v_i h_j)$ is the expectation observed in the training set (with
$h_j$ sampled given $v_i$ according to the model), and $E_{\text{model}}(v_i h_j)$ is that
same expectation under the distribution defined by the model. Unfortunately,
$E_{\text{model}}(v_i h_j)$ is intractable to compute. The contrastive divergence
(CD) approximation to the gradient was the first efficient method
proposed to approximate this expected value, where $E_{\text{model}}(v_i h_j)$ is
replaced by running the Gibbs sampler initialized at the data for one
or more steps. The steps in approximating $E_{\text{model}}(v_i h_j)$ are summarized
as follows:
• Initialize $\mathbf{v}_0$ at data
• Sample $\mathbf{h}_0 \sim p(\mathbf{h}|\mathbf{v}_0)$
• Sample $\mathbf{v}_1 \sim p(\mathbf{v}|\mathbf{h}_0)$
• Sample $\mathbf{h}_1 \sim p(\mathbf{h}|\mathbf{v}_1)$
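The four sampling steps above can be sketched for a Bernoulli-Bernoulli RBM as follows. This is a minimal illustration of one-step contrastive divergence (CD-1) under stated assumptions: the function names, the learning rate, and the use of sampled binary states in the sufficient statistics are choices of this sketch, and the pair (v1, h1) serves only as a rough sample from the model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_bernoulli(p, rng):
    """Draw binary samples with success probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

def cd1_step(v0, W, a, b, rng, lr=0.1):
    """One CD-1 update for a Bernoulli-Bernoulli RBM.

    v0: (n, n_visible) batch of binary training vectors (v0 initialized at data)
    W:  (n_visible, n_hidden) weights; a and b are visible and hidden biases
    """
    h0 = sample_bernoulli(sigmoid(v0 @ W + b), rng)    # h0 ~ p(h | v0)
    v1 = sample_bernoulli(sigmoid(h0 @ W.T + a), rng)  # v1 ~ p(v | h0)
    h1 = sample_bernoulli(sigmoid(v1 @ W + b), rng)    # h1 ~ p(h | v1)
    n = v0.shape[0]
    # Data statistics minus the one-step approximation of the model statistics.
    grad_W = (v0.T @ h0 - v1.T @ h1) / n
    grad_a = (v0 - v1).mean(axis=0)
    grad_b = (h0 - h1).mean(axis=0)
    return W + lr * grad_W, a + lr * grad_a, b + lr * grad_b

# Hypothetical toy batch: 16 binary vectors with 6 visible and 4 hidden units.
rng = np.random.default_rng(0)
v0 = (rng.random((16, 6)) < 0.5).astype(float)
W, a, b = 0.01 * rng.standard_normal((6, 4)), np.zeros(6), np.zeros(4)
W_new, a_new, b_new = cd1_step(v0, W, a, b, rng)
```

Running the Gibbs chain for more than one step before collecting the negative statistics would give CD-k. A common variant, not shown here, uses the hidden probabilities instead of sampled states when forming the statistics.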
Figure 5.1: A pictorial view of sampling from an RBM during RBM learning
(courtesy of Geoff Hinton).
decoder, and a logistic nonlinearity on the top of the encoder. The main
difference is that whereas the RBM is trained using (very approximate)
maximum likelihood, SESM is trained by simply minimizing the aver-
age energy plus an additional code sparsity term. SESM relies on the
sparsity term to prevent flat energy surfaces, while RBM relies on an
explicit contrastive term in the loss, an approximation of the log par-
tition function. Another difference is in the coding strategy in that the
code units are “noisy” and binary in the RBM, while they are quasi-
binary and sparse in SESM. The use of SESM in pre-training DNNs
for speech recognition can be found in [284].
Figure 5.3: Interface between DBN/DNN and HMM to form a DNN–HMM. This
architecture, developed at Microsoft, has been successfully used in speech recognition
experiments reported in [67, 68]. [after [67, 68], © IEEE].
correlated models more powerful than HMMs for the ultimate success
of speech recognition. Integrating such dynamic models that have real-
istic co-articulatory properties with the DNN and possibly other deep
learning models to form the coherent dynamic deep architecture is a
challenging new research direction.
6
Deep Stacking Networks and Variants —
Supervised Learning
6.1 Introduction
While the DNN just reviewed has been shown to be extremely power-
ful in connection with performing recognition and classification tasks
including speech recognition and image classification, training a DNN
has proven to be difficult computationally. In particular, conventional
techniques for training DNNs at the fine tuning phase involve the uti-
lization of a stochastic gradient descent learning algorithm, which is
difficult to parallelize across machines. This makes learning at large
scale nontrivial. For example, it has been possible to use a single,
very powerful GPU machine to train DNN-based speech recognizers
with dozens to a few hundred or thousands of hours of speech training
data with remarkable results. It is less clear, however, how to scale up
this success to much more training data. See [69] for recent work in
this direction.
Here we describe a new deep learning architecture, the deep stacking
network (DSN), which was originally designed with the learning scal-
ability problem in mind. This chapter is based in part on the recent
publications of [106, 110, 180, 181] with expanded discussions.
The central idea of the DSN design relates to the concept of stack-
ing, as proposed and explored in [28, 44, 392], where simple modules of
functions or classifiers are composed first and then they are “stacked”
on top of each other in order to learn complex functions or classifiers.
Various ways of implementing stacking operations have been developed
in the past, typically making use of supervised information in the sim-
ple modules. The new features for the stacked classifier at a higher
level of the stacking architecture often come from concatenation of the
classifier output of a lower module and the raw input features. In [60],
the simple module used for stacking was a conditional random field
(CRF). This type of deep architecture was further developed with hid-
den states added for successful natural language and speech recognition
applications where segmentation information is unknown in the train-
ing data [429]. Convolutional neural networks, as in [185], can also be
considered as a stacking architecture, but the supervision information
is typically not used until the final stacking module.
The DSN architecture was originally presented in [106] and was
referred to as the deep convex network or DCN to emphasize the convex
nature of a major portion of the algorithm used for learning the network.
The DSN makes use of supervision information for stacking each
of the basic modules, which takes the simplified form of a multilayer
perceptron. In the basic module, the output units are linear and the hidden
units are sigmoidal nonlinear. The linearity in the output units permits
highly efficient, parallelizable, and closed-form estimation (a result of
convex optimization) for the output network weights given the hidden
units’ activities. Due to the closed-form constraints between the input
and output weights, the input weights can also be elegantly estimated in
an efficient, parallelizable, batch-mode manner, which we will describe
in some detail in Section 6.3.
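The input–output stacking idea described above, in which each higher module receives the raw input concatenated with the output of the module below, can be sketched as follows. This is only an illustration under stated assumptions: the lower-layer weights are left random and untrained, a generic least-squares solve stands in for the closed-form estimate of the output weights, and all names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_module(X_in, T, n_hidden, rng):
    """Fit one module: sigmoidal hidden units followed by linear output units.

    X_in: (D_in, N) module input, T: (C, N) supervision targets.
    """
    # Assumption: random lower-layer weights, with no further estimation.
    W = 0.1 * rng.standard_normal((X_in.shape[0], n_hidden))
    H = sigmoid(W.T @ X_in)                          # (L, N) hidden activities
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)    # (L, C) output weights
    return W, U

def module_output(X_in, W, U):
    return U.T @ sigmoid(W.T @ X_in)

def train_dsn(X, T, n_modules, n_hidden, rng):
    """Stack modules; each uses supervision and sees [raw input; lower output]."""
    modules, X_in = [], X
    for _ in range(n_modules):
        W, U = train_module(X_in, T, n_hidden, rng)
        modules.append((W, U))
        X_in = np.vstack([X, module_output(X_in, W, U)])
    return modules

def dsn_predict(X, modules):
    X_in = X
    for W, U in modules:
        Y = module_output(X_in, W, U)
        X_in = np.vstack([X, Y])
    return Y

# Hypothetical toy problem: 6-dimensional inputs, 2-dimensional targets.
rng = np.random.default_rng(0)
X, T = rng.standard_normal((6, 50)), rng.standard_normal((2, 50))
modules = train_dsn(X, T, n_modules=3, n_hidden=10, rng=rng)
Y = dsn_predict(X, modules)
```

Note that the input dimension grows from 6 in the bottom module to 8 in the higher modules, because the 2-dimensional output of the module below is appended to the raw input.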
The name “convex” used in [106] accentuates the role of convex
optimization in learning the output network weights given the hidden
units’ activities in each basic module. It also points to the importance
of the closed-form constraints, derived from the convexity, between the
input and output weights. Such constraints make the learning of the
remaining network parameters (i.e., the input network weights) much
easier than otherwise, enabling batch-mode learning of the DSN that
Figure 6.1: A DSN architecture using input–output stacking. Four modules are
illustrated, each with a distinct color. Dashed lines denote copying layers. [after
[366], © IEEE].
Here, we provide some technical details on how the use of linear output
units in the DSN facilitates the learning of the DSN weights. For
simplicity, a single module is used to illustrate the advantage.
First, it is clear that the upper-layer weight matrix $\mathbf{U}$ can
be efficiently learned once the activity matrix $\mathbf{H}$ over all training
samples in the hidden layer is known. Let us denote the training vectors
by $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_i, \ldots, \mathbf{x}_N]$, in which each vector is denoted by
$\mathbf{x}_i = [x_{1i}, \ldots, x_{ji}, \ldots, x_{Di}]^T$, where $D$ is the dimension of the input
vector, which is a function of the block, and $N$ is the total number of
training samples. Denote by $L$ the number of hidden units and by $C$
the dimension of the output vector. Then the output of a DSN block is
$\mathbf{y}_i = \mathbf{U}^T \mathbf{h}_i$, where $\mathbf{h}_i = \sigma(\mathbf{W}^T \mathbf{x}_i)$ is the hidden-layer vector for sample
$i$, $\mathbf{U}$ is an $L \times C$ weight matrix at the upper layer of a block, $\mathbf{W}$ is a
$D \times L$ weight matrix at the lower layer of a block, and $\sigma(\cdot)$ is a sigmoid
function. Bias terms are implicitly represented in the above formulation
if $\mathbf{x}_i$ and $\mathbf{h}_i$ are augmented with ones.
Given target vectors in the full training set with a total of
$N$ samples, $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_i, \ldots, \mathbf{t}_N]$, where each vector is
$\mathbf{t}_i = [t_{1i}, \ldots, t_{ji}, \ldots, t_{Ci}]^T$, the parameters $\mathbf{U}$ and $\mathbf{W}$ are learned so as to
minimize the average of the total square error below:
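$$E = \frac{1}{2} \sum_i \|\mathbf{y}_i - \mathbf{t}_i\|^2 = \frac{1}{2} \mathrm{Tr}[(\mathbf{Y} - \mathbf{T})(\mathbf{Y} - \mathbf{T})^T]$$
where the output of the DSN block is
$$\mathbf{y}_i = \mathbf{U}^T \mathbf{h}_i = \mathbf{U}^T \sigma(\mathbf{W}^T \mathbf{x}_i) = G_i(\mathbf{U}, \mathbf{W})$$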
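The square error above, and the reason why linear output units make the upper-layer weights easy to learn, can be checked numerically with a short sketch. With the lower-layer weights W held fixed, E is quadratic in U, so setting its gradient to zero yields the normal equations (H Hᵀ) U = H Tᵀ. The small ridge term added for numerical safety, the random data, and all sizes are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hidden_activities(W, X):
    """H = sigma(W^T X), of shape (L, N), for X of shape (D, N)."""
    return sigmoid(W.T @ X)

def total_square_error(U, H, T):
    """E = 1/2 * sum_i ||y_i - t_i||^2 with Y = U^T H."""
    Y = U.T @ H
    return 0.5 * np.sum((Y - T) ** 2)

def solve_upper_weights(H, T, ridge=1e-8):
    """Least-squares U for fixed H: (H H^T + ridge * I) U = H T^T."""
    L = H.shape[0]
    return np.linalg.solve(H @ H.T + ridge * np.eye(L), H @ T.T)

# Hypothetical sizes: D inputs, L hidden units, C outputs, N samples.
rng = np.random.default_rng(0)
D, L, C, N = 8, 5, 3, 200
X = rng.standard_normal((D, N))
T = rng.standard_normal((C, N))
W = rng.standard_normal((D, L))     # lower-layer weights held fixed here
H = hidden_activities(W, X)
U = solve_upper_weights(H, T)
Y = U.T @ H
E_sum = total_square_error(U, H, T)               # sum-of-norms form of E
E_trace = 0.5 * np.trace((Y - T) @ (Y - T).T)     # trace form of E
grad_U = H @ (Y - T).T                            # dE/dU, near zero at the solution
```

The two expressions for E agree, and the gradient with respect to U is numerically zero at the solved U, so no iterative optimization is needed for the output weights once H is known.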
6.4 The tensor deep stacking network
The above DSN architecture has recently been generalized to its ten-
sorized version, which we call the tensor DSN (TDSN) [180, 181]. It
has the same scalability as the DSN in terms of parallelizability in
learning, but it generalizes the DSN by providing higher-order feature
interactions missing in the DSN.
The architecture of the TDSN is similar to that of the DSN in the
way that stacking operation is carried out. That is, modules of the
Figure 6.2: Comparisons of a single module of a DSN (left) and that of a tensor
DSN (TDSN). Two equivalent forms of a TDSN module are shown to the right.
[after [180], © IEEE].
The DSN architecture has also recently been generalized to its kernelized
version, which we call the kernel-DSN (K-DSN) [102, 171]. The
motivation of the extension is to increase the number of hidden units in
each DSN module without increasing the number of free parameters
to learn. This goal can be easily accomplished using the kernel trick,
resulting in the K-DSN which we describe below.
Figure 6.5: An example architecture of the K-DSN with three modules each of
which uses a Gaussian kernel with different kernel parameters. [after [102], © IEEE].
way and unlike the basic DSN there is no longer a nonconvex optimization
problem involved in training the K-DSN. The computation steps
make the K-DSN easier to scale up for parallel computing in distributed
servers than the DSN and TDSN. There are many fewer parameters
in the K-DSN to tune than in the DSN, TDSN, and DNN, and
there is no need for pre-training. It is found in the study of [102] that
regularization plays a much more important role in the K-DSN than
in the basic DSN and TDSN. Further, effective regularization
schedules developed for learning the K-DSN weights can be motivated
by intuitive insight from useful optimization tricks such as the heuristic
in Rprop or resilient backpropagation algorithm [302].
However, as is inherent in any kernel method, scalability also becomes
an issue for the K-DSN as the number of training and testing samples
becomes very large. A solution is provided in the study by Huang et al. [171],
based on the use of random Fourier features, which possess the strong
theoretical property of approximating the Gaussian kernel while rendering
efficient computation in both training and evaluation of the K-DSN
with large training sets. It is empirically demonstrated that, just like
the conventional K-DSN exploiting exact Gaussian kernels, the use
of random Fourier features also enables successful stacking of kernel
modules to form a deep architecture.
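The approximation property mentioned above can be illustrated with a minimal sketch of random Fourier features for the Gaussian kernel k(x, y) = exp(−‖x − y‖² / (2σ²)). The function names, the kernel width, and the number of random features are assumptions of this sketch; it does not reproduce the specific K-DSN training procedure of [171].

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-sq / (2.0 * sigma ** 2))

def make_rff_map(dim, n_features, sigma, rng):
    """Return a feature map phi such that phi(x) . phi(y) approximates k(x, y)."""
    omega = rng.standard_normal((dim, n_features)) / sigma   # random frequencies
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)   # random phases

    def phi(X):
        return np.sqrt(2.0 / n_features) * np.cos(X @ omega + phase)

    return phi

# Hypothetical toy data: 30 points in 5 dimensions, 10000 random features.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
phi = make_rff_map(dim=5, n_features=10000, sigma=2.0, rng=rng)
Z = phi(X)
K_exact = gaussian_kernel(X, X, sigma=2.0)
K_approx = Z @ Z.T
max_abs_error = np.max(np.abs(K_approx - K_exact))
```

Once the explicit features `Z` are available, a linear model on `Z` replaces operations on the full N × N kernel matrix, which is what makes training and evaluation efficient for large N.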
7
Selected Applications in Speech
and Audio Processing
7.1 Acoustic modeling for speech recognition
Figure 7.1: Illustration of the joint learning of filter parameters and the rest of the
deep network. [after [307], © IEEE].
the input at higher layers, which helps to achieve better speech recog-
nition accuracy.
At the extreme end, deep learning would promote the use of the lowest
level of raw speech features, i.e., speech sound waveforms, for speech
recognition, and learn the transformation automatically. As an initial
attempt toward this goal, the study carried out by Jaitly and Hinton
[183] makes use of speech sound waves as the raw input feature to an
RBM with a convolutional structure as the classifier. With the use
of rectified linear units in the hidden layer [130], it is possible, to a
limited extent, to automatically normalize the amplitude variation
in the waveform signal. Although the final results are disappointing,
the study shows that much work is needed along this direction. For
example, just as Sainath et al. [307] demonstrated that the use of
raw spectra as features requires more attention to normalization
than the use of MFCCs, the use of speech waveforms demands even more
attention to normalization [327]. This is true for both GMM-based
and deep learning based methods.
Table 7.1: Comparisons of the DNN–HMM architecture with the generative model
(e.g., the GMM–HMM) in terms of phone or word recognition error rates. From
sub-tables A to D, the amount of training data increases by approximately three
orders of magnitude.
Figure 7.2: Illustration of the use of bottleneck (BN) features extracted from a
DNN in a GMM–HMM speech recognizer. [after [425], © IEEE].
used in the DNN. By processing both the training and testing data
with the same algorithm, any consistent errors or artifacts introduced
by the enhancement algorithm can be learned by the DNN–HMM rec-
ognizer. This study also successfully explored the use of the noise aware
training paradigm for training the DNN, where each observation was
augmented with an estimate of the noise. Strong results were obtained
on the Aurora4 task. More recently, Kashiwagi et al. [191] applied the
SPLICE feature enhancement technique [82] to a DNN speech rec-
ognizer. In that study the DNN’s output layer was determined on
clean data instead of on noisy data as in the study reported by Seltzer
et al. [325].
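The idea of augmenting each observation with an estimate of the noise, as in the noise-aware training paradigm mentioned above, can be sketched as follows. How the noise is estimated is an assumption of this sketch: here it is taken as the average of the first few frames of an utterance, which are presumed to contain no speech. The function name and the parameter `n_edge` are hypothetical.

```python
import numpy as np

def noise_aware_features(frames, n_edge=10):
    """Append a per-utterance noise estimate to every frame.

    frames: (n_frames, n_dims) acoustic features of one utterance.
    Assumption: the first n_edge frames are non-speech, so their mean
    serves as a crude, stationary noise estimate.
    """
    noise = frames[:n_edge].mean(axis=0)
    noise_tiled = np.tile(noise, (frames.shape[0], 1))
    return np.hstack([frames, noise_tiled])

# Hypothetical utterance: 20 frames of 3-dimensional features.
frames = np.arange(60, dtype=float).reshape(20, 3)
augmented = noise_aware_features(frames, n_edge=4)
```

The augmented vectors have twice the original dimension, and the network input layer would be sized accordingly.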
Besides DNN, other deep architectures have also been proposed to
perform feature enhancement and noise-robust speech recognition. For
example, Maas et al. [235] applied a deep recurrent autoencoder neural
network to remove noise in the input features for robust speech recognition.
The model was trained on stereo (noisy and clean) speech features
to predict clean features given noisy input, similar to the SPLICE setup
but using a deep model instead of a GMM. Vinyals and Ravuri [379]
investigated the tandem approaches to noise-robust speech recognition,
where DNNs were trained directly with noisy speech to generate pos-
terior features. Finally, Rennie et al. [300] explored the use of a version
of the RBM, called the factorial hidden RBM, for noise-robust speech
recognition.
Most deep learning methods for speech recognition and other infor-
mation processing applications have focused on learning represen-
tations from input acoustic features without paying attention to
output representations. The recent 2013 NIPS Workshop on Learning
Output Representations
(http://nips.cc/Conferences/2013/Program/event.php?ID=3714)
was dedicated to bridging this gap. For example,
the Deep Visual-Semantic Embedding Model (described in [117],
to be discussed further in Section 11) exploits continuous-valued output
representations obtained from the text embeddings to assist in the
branch of the deep network for classifying images. For speech recogni-
tion, the importance of designing effective linguistic representations for
the output layers of deep networks is highlighted in [79].
Most current DNN systems use a high-dimensional output represen-
tation to match the context-dependent phonetic states in the HMMs.
For this reason, the output layer evaluation can cost 1/3 of the total
computation time. To improve the decoding speed, techniques such
as low-rank approximation are typically applied to the output layer.
In [310] and [397], the DNN with high-dimensional output layer was
trained first. The singular value decomposition (SVD)-based dimen-
sion reduction technique was then performed on the large output-layer
matrix. The resulting matrices are further combined and, as a result,
the original large weight matrix is approximated by a product of two
much smaller matrices. This technique in essence converts the origi-
nal large output layer to two layers — a bottleneck linear layer and
a nonlinear output layer — both with smaller weight matrices. The
converted DNN with reduced dimensionality is further refined. The
experimental results show that no reduction in speech recognition accuracy
was observed even when the size was cut in half, while the run-time
computation was significantly reduced.
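The low-rank conversion described above can be illustrated with a minimal NumPy sketch. It is not the implementation of [310] or [397]; the function name `factor_output_layer`, the toy layer sizes, and the nearly low-rank random weight matrix are all assumptions made for illustration, and the subsequent refinement (retraining) step is omitted.

```python
import numpy as np

def factor_output_layer(W, k):
    """Approximate W (n_out x n_hidden) by A @ B, keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # n_out x k: smaller output layer acting on the bottleneck
    B = Vt[:k, :]          # k x n_hidden: linear bottleneck layer
    return A, B

rng = np.random.default_rng(0)
n_out, n_hidden, k = 60, 40, 10   # hypothetical toy sizes, far smaller than a real DNN

# Build a weight matrix that is nearly rank k, plus a little noise.
W = rng.standard_normal((n_out, k)) @ rng.standard_normal((k, n_hidden))
W += 0.01 * rng.standard_normal((n_out, n_hidden))

A, B = factor_output_layer(W, k)
h = rng.standard_normal(n_hidden)   # activation of the last hidden layer

logits_full = W @ h        # original output layer (before the softmax)
logits_low = A @ (B @ h)   # bottleneck layer followed by the smaller output layer

params_full = W.size             # 60 * 40 = 2400 weights
params_low = A.size + B.size     # 60 * 10 + 10 * 40 = 1000 weights
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

In this toy setting the two smaller matrices hold fewer than half of the original weights, and both the storage and the per-frame multiply count shrink accordingly. How small `k` can be made without hurting accuracy depends on the singular value spectrum of the trained output layer.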
The output representations for speech recognition can benefit from
the structured design of the symbolic or phonological units of speech
as presented in [79]. The rich phonological structure of symbolic nature
in human speech has been well known for many years. Likewise, it has
also been well understood for a long time that the use of phonetic units
or their finer state sequences, even with contextual dependency, in engineering
speech recognition systems is inadequate for representing such
rich structure [86, 273, 355], thus leaving a promising open direction
for improving the performance of speech recognition systems. Basic
theories about the internal structure of speech sounds and their rel-
evance to speech recognition technology in terms of the specification,
design, and learning of possible output representations of the underly-
ing speech model for speech target sequences are surveyed in [76] and
more recently in [79].
There has been a growing body of deep learning work in speech
recognition with their focus placed on designing output representations
274 Selected Applications in Speech and Audio Processing
Perhaps the most notable deep architecture among all is the recur-
rent neural network (RNN) as well as its stacked or deep versions
[135, 136, 153, 279, 377]. While the RNN saw its early success in phone
recognition [304], it was not easy to duplicate due to the intricacy
in training, let alone to scale up for larger speech recognition tasks.
Learning algorithms for the RNN have been dramatically improved
since then, and much better results have been obtained recently using
the RNN [48, 134, 235], especially when the bi-directional LSTM (long
short-term memory) is used [135, 136]. The basic information flow in
the bi-directional RNN and a cell of LSTM is shown in Figures 7.3 and
7.4, respectively.
Learning the RNN parameters is known to be difficult due to van-
ishing or exploding gradients [280]. Chen and Deng [48] and Deng and
Figure 7.3: Information flow in the bi-directional RNN, with both diagrammatic
and mathematical descriptions. W’s are weight matrices, not shown but can be easily
inferred in the diagram. [after [136], @IEEE].
Figure 7.4: Information flow in an LSTM unit of the RNN, with both diagrammatic
and mathematical descriptions. W’s are weight matrices, not shown but can easily
be inferred in the diagram. [after [136], @IEEE].
with respect to the size of the training data. Dahl et al. [65] applied
dropout in conjunction with the ReLU units and to only the top few
layers of a fully-connected DNN. Seltzer and Yu [325] applied it to noise
robust speech recognition. Deng et al. [81], on the other hand, applied
dropout to all layers of a deep convolutional neural network, including
both the top fully connected DNN layers and the bottom locally connected
CNN layer and the pooling layer. It is found that the dropout
rate needs to be substantially smaller for the convolutional layer.
Subsequent work on applying dropout includes the study by Miao
and Metze [243], where DNN-based speech recognition is constrained
by low resources with sparse training data. Most recently, Sainath et al.
[306] combined dropout with a number of novel techniques described
in this section (including the use of deep CNNs, Hessian-free sequence
learning, the use of ReLU units, and the use of joint fMLLR and filter-
bank features, etc.) to obtain state of the art results on several large
vocabulary speech recognition tasks.
In summary, the initial success of deep learning methods for
speech analysis and recognition reported around 2010 has come a long
way over the past three years. An explosive growth in the work and
publications on this topic has been observed, and huge excitement has
been ignited within the speech recognition community. We expect that
the growth in the research on deep learning based speech recognition
will continue, at least in the near future. It is also fair to say that the
continuing large-scale success of deep learning in speech recognition as
surveyed in this chapter (up to the ASRU-2013 time frame) is a key
stimulant to the large-scale exploration and applications of the deep
learning methods to other areas, which we will survey in Sections 8–11.
framework. The deep learning techniques are thus expected to help the
acoustic modeling aspect of speech synthesis in overcoming the limita-
tions of the conventional shallow modeling approach.
A series of studies have been carried out recently on ways of overcoming
the above limitations using deep learning methods, inspired partly by
the intrinsically hierarchical processes in human speech production
and the successful applications of a number of deep learning methods
in speech recognition as reviewed earlier in this chapter. In Ling
et al. [227, 229], the RBM and DBN as generative models are used
to replace the traditional Gaussian models, achieving significant
quality improvement, in both subjective and objective measures,
of the synthesized voice. In the approach developed in [190], the
DBN as a generative model is used to represent the joint distribution of
linguistic and acoustic features. Both the decision trees and Gaussian
models are replaced by the DBN. The method is very similar to that
used for generating digit images by the DBN, where the issue of
temporal sequence modeling specific to speech (a non-issue for images)
is by-passed via the use of the relatively large, syllable-sized units in
speech synthesis. On the other hand, in contrast to the generative
deep models (RBMs and DBNs) exploited above, the study reported
in [435] makes use of the discriminative model of the DNN to represent
the conditional distribution of the acoustic features given the linguistic
features. Finally, in [115], the discriminative model of the DNN is used
as a feature extractor that summarizes high-level structure from the
raw acoustic features. Such DNN features are then used as the input
for the second stage for the prediction of prosodic contour targets
from contextual features in the full speech synthesis system.
The application of deep learning to speech synthesis is in its infancy,
and much more work is expected from that community in the near
future.
but only quite recently. As an example, the first major event of deep
learning for speech recognition took place in 2009, followed by a series of
events including a comprehensive tutorial on the topic at ICASSP-2012
and a special issue of IEEE Transactions on Audio, Speech, and
Language Processing, the premier publication for speech recognition,
in the same year. The first major event of deep learning for audio and
music processing appears to be the special session at ICASSP-2014,
titled Deep Learning for Music [14].
In the general field of audio and music processing, the areas impacted
by deep learning mainly include music signal processing and music
information retrieval [15, 22, 141, 177, 178, 179, 319]. Deep learning
presents a unique set of challenges in these areas. Music audio signals
are time series where events are organized in musical time, rather than
in real time, which changes as a function of rhythm and expression. The
measured signals typically combine multiple voices that are synchro-
nized in time and overlapping in frequency, mixing both short-term and
long-term temporal dependencies. The influencing factors include musi-
cal tradition, style, composer and interpretation. The high complexity
and variety give rise to signal representation problems well suited
to the high levels of abstraction afforded by the perceptually and bio-
logically motivated processing techniques of deep learning.
In the early work on audio signals as reported by Lee et al. [215]
and their follow-up work, the convolutional structure is imposed on
the RBM while building up a DBN. Convolution is performed in time by
sharing weights between hidden units in an attempt to detect the same
“invariant” feature over different times. Then a max-pooling operation
is performed where the maximal activations over small temporal neigh-
borhoods of hidden units are obtained, inducing some local temporal
invariance. The resulting convolutional DBN is applied to audio as well
as speech data for a number of tasks including music artist and genre
classification, speaker identification, speaker gender classification, and
phone classification, with promising results presented.
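The weight sharing in time and the max-pooling step can be sketched as follows. This is only a deterministic, feed-forward view of one filter; it does not implement the probabilistic convolutional RBM training of [215]. The filter `w`, the toy 12-frame inputs, and the pooling width of four are assumptions chosen for illustration.

```python
import numpy as np

def conv_in_time(x, w, b=0.0):
    """Slide one filter w (K x F) over frames x (T x F); the weights are shared across time."""
    K = w.shape[0]
    pre = np.array([np.sum(x[t:t + K] * w) + b for t in range(x.shape[0] - K + 1)])
    return 1.0 / (1.0 + np.exp(-pre))   # logistic activation of each hidden unit

def max_pool_time(h, width):
    """Keep the maximal activation in each non-overlapping temporal neighborhood."""
    n = len(h) // width
    return h[:n * width].reshape(n, width).max(axis=1)

w = np.ones((2, 2))        # hypothetical filter spanning 2 frames and 2 frequency bands

x_a = np.zeros((12, 2))
x_a[2:4] = 1.0             # an "event" at frames 2-3
x_b = np.zeros((12, 2))
x_b[3:5] = 1.0             # the same event one frame later

h_a = conv_in_time(x_a, w)
h_b = conv_in_time(x_b, w)
p_a = max_pool_time(h_a, 4)
p_b = max_pool_time(h_b, 4)
```

The hidden activations `h_a` and `h_b` differ because the event has moved, but the first pooled value is the same in both cases: the shift stays inside one pooling neighborhood, which is the local temporal invariance referred to above.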
The RNN has also been recently applied to music processing applications
[22, 40, 41], where the use of ReLU hidden units instead of
logistic or tanh nonlinearities is explored in the RNN. As reviewed in
Section 7.2, ReLU units compute y = max(x, 0), and lead to sparser
gradients, less diffusion of credit and blame in the RNN, and faster
training. The RNN is applied to the task of automatic recognition of
chords from audio music, an active area of research in music information
retrieval. The motivation of using the RNN architecture is its power
in modeling dynamical systems. The RNN incorporates an internal
memory, or hidden state, represented by a self-connected hidden layer
of neurons. This property makes the RNN well suited to model temporal
sequences, such as frames in a magnitude spectrogram or chord labels
in a harmonic progression. When well trained, the RNN is endowed
with the power to predict the output at the next time step given the
previous ones. Experimental results show that the RNN-based auto-
matic chord recognition system is competitive with existing state-of-
the-art approaches [275]. The RNN is capable of learning basic musical
properties such as temporal continuity, harmony and temporal dynam-
ics. It can also efficiently search for the most musically plausible chord
sequences when the audio signal is ambiguous, noisy or weakly discrim-
inative.
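A minimal sketch of the forward pass of such an RNN is given below, assuming a single self-connected hidden layer with ReLU units and a softmax over chord labels at every frame. The function name, the layer sizes, and the random weights are hypothetical; training and the search for the most plausible chord sequence are not shown.

```python
import numpy as np

def rnn_chord_posteriors(frames, W_in, W_rec, W_out, b_h, b_o):
    """Return one softmax distribution over chord labels per input frame."""
    h = np.zeros(W_rec.shape[0])       # hidden state: the internal memory of the RNN
    posteriors = []
    for x_t in frames:
        # ReLU hidden units, y = max(x, 0), fed by the current frame and the previous state
        h = np.maximum(W_in @ x_t + W_rec @ h + b_h, 0.0)
        z = W_out @ h + b_o
        p = np.exp(z - z.max())        # subtract the max for numerical stability
        posteriors.append(p / p.sum())
    return np.array(posteriors)

rng = np.random.default_rng(0)
n_bins, n_hidden, n_chords, T = 12, 8, 4, 5   # hypothetical toy sizes
W_in = 0.1 * rng.standard_normal((n_hidden, n_bins))
W_rec = 0.1 * rng.standard_normal((n_hidden, n_hidden))
W_out = 0.1 * rng.standard_normal((n_chords, n_hidden))
b_h = np.zeros(n_hidden)
b_o = np.zeros(n_chords)

frames = np.abs(rng.standard_normal((T, n_bins)))   # stand-in for spectrogram frames
post = rnn_chord_posteriors(frames, W_in, W_rec, W_out, b_h, b_o)
labels = post.argmax(axis=1)                         # frame-wise chord decisions
```

Because `h` is carried from frame to frame, the prediction at each time step depends on all earlier frames, not only on the current one.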
A recent review article by Humphrey et al. [179] provides a detailed
analysis on content-based music informatics, and in particular on why
the progress is decelerating throughout the field. The analysis con-
cludes that hand-crafted feature design is sub-optimal and unsustain-
able, that the power of shallow architectures is fundamentally limited,
and that short-time analysis cannot encode musically meaningful struc-
ture. These conclusions motivate the use of deep learning methods
aimed at automatic feature learning. By embracing feature learning, it
becomes possible to optimize a music retrieval system’s internal feature
representation or to discover it directly, since deep architectures are
especially well-suited to characterize the hierarchical nature of music.
Finally, we review the very recent work by van den Oord et al. [371]
on content-based music recommendation using deep learning methods.
Automatic music recommendation has become an increasingly significant
and useful technique in practice. Most recommender systems rely
on collaborative filtering, which suffers from the cold-start problem:
it fails when no usage data is available. Thus, collaborative filtering is
7.3. Audio and music processing 291
not effective for recommending new and unpopular songs. Deep learning
methods power the latent factor model for recommendation, which pre-
dicts the latent factors from music audio when they cannot be obtained
from usage data. A traditional approach using a bag-of-words representation
of the audio signals is rigorously compared with deep CNNs.
The results show that the latent factors predicted by deep CNNs produce
highly sensible recommendations. The study
demonstrates that a combination of convolutional neural networks and
richer audio features leads to such promising results for content-based
music recommendation.
Like speech recognition and speech synthesis, much more work is
expected from the music and audio signal processing community in the
near future.
8
Selected Applications in Language
Modeling and Natural Language Processing
8.1. Language modeling 293
learning over the current state of the art NLP methods has not been
as strong as speech or visual object recognition.
similar words get to be closer to each other in that space, at least along
some directions. A sequence of words can thus be transformed into a
sequence of these learned feature vectors. The neural network learns to
map that sequence of feature vectors to the probability distribution over
the next word in the sequence. The distributed representation approach
to LMs has the advantage that it allows the model to generalize well to
sequences that are not in the set of training word sequences, but that
are similar in terms of their features, i.e., their distributed representation.
Because neural networks tend to map nearby inputs to nearby
outputs, word sequences with similar
features are mapped to similar predictions.
The above ideas of NNLMs have been implemented in various
studies, some involving deep architectures. The idea of structuring
hierarchically the output of an NNLM in order to handle large
vocabularies was introduced in [18, 262]. In [252], the temporally
factored RBM was used for language modeling. Unlike the traditional
N-gram model, the factored RBM uses distributed representations
not only for context words but also for the words being predicted.
This approach is generalized to deeper structures as reported in [253].
Subsequent work on NNLM with “deep” architectures can be found in
[205, 207, 208, 245, 247, 248]. As an example, Le et al. [207] describe
an NNLM with structured output layer (SOUL–NNLM) where the pro-
cessing depth in the LM is focused in the neural network’s output rep-
resentation. Figure 8.1 illustrates the SOUL-NNLM architecture with
hierarchical structure in the output layers of the neural network, which
shares the same architecture with the conventional NNLM up to the
hidden layer. The hierarchical structure for the network’s output vocab-
ulary is in the form of a clustering tree, shown to the right of Figure 8.1,
where each word belongs to only one class and ends in a single leaf node
of the tree. As a result of the hierarchical structure, the SOUL–NNLM
enables the training of the NNLM with a full, very large vocabulary.
This gives advantages over the traditional NNLM which requires short-
lists of words in order to carry out the efficient computation in training.
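The benefit of a hierarchically structured output can be seen in a simplified two-level sketch, in which the probability of a word is factored into the probability of its class and the probability of the word within that class. This is an assumption-laden illustration rather than the SOUL–NNLM of [207], which uses a deeper clustering tree; the class assignment, the matrices `W_class` and `W_word`, and the sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def word_probability(h, word, word_class, class_members, W_class, W_word):
    """P(word | h) = P(class | h) * P(word | class, h), with one class per word."""
    c = word_class[word]
    members = class_members[c]
    p_class = softmax(W_class @ h)[c]
    # Only the words in the same class are normalized over, not the full vocabulary.
    p_word = softmax(W_word[members] @ h)[members.index(word)]
    return p_class * p_word

rng = np.random.default_rng(0)
class_members = [[0, 1], [2, 3, 4], [5]]       # toy vocabulary of 6 words in 3 classes
word_class = {w: c for c, ms in enumerate(class_members) for w in ms}

n_hidden = 4
W_class = rng.standard_normal((len(class_members), n_hidden))
W_word = rng.standard_normal((6, n_hidden))
h = rng.standard_normal(n_hidden)              # hidden-layer output for some word history

probs = [word_probability(h, w, word_class, class_members, W_class, W_word)
         for w in range(6)]
```

The factored probabilities still sum to one over the whole vocabulary, yet evaluating a single word touches only the class outputs and the words of one class. This is what makes training with a full, very large vocabulary tractable without resorting to a short-list.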
As another example of neural-network-based LMs, the work described
in [247, 248] and [245] makes use of RNNs to build large scale language
Figure 8.1: The SOUL–NNLM architecture with hierarchical structure in the out-
put layers of the neural network [after [207], @IEEE].
Figure 8.2: During the training of RNNLMs, the RNN unfolds into a deep feed-
forward network; based on Figure 3.2 of [245].
Machine learning has been a dominant tool in NLP for many years.
However, the use of machine learning in NLP has been mostly limited to
numerical optimization of weights for human designed representations
and features from the text data. The goal of deep or representation
learning is to automatically develop features or representations from
the raw text material appropriate for a wide range of NLP tasks.
Recently, neural network based deep learning methods have
been shown to perform well on various NLP tasks such as language
modeling, machine translation, part-of-speech tagging, named entity
recognition, sentiment analysis, and paraphrase detection. The most
attractive aspect of deep learning methods is their ability to perform
these tasks without external hand-designed resources or time-intensive
feature engineering. To this end, deep learning develops and makes use of
an important concept called “embedding,” which refers to the represen-
tation of symbolic information in natural language text at word-level,
phrase-level, and even sentence-level in terms of continuous-valued
vectors.
The early work highlighting the importance of word embedding
came from [62], [367], and [63], although the original form came from
[26] as a side product of language modeling. Raw symbolic word rep-
resentations are transformed from the sparse vectors via 1-of-V coding
with a very high dimension (i.e., the vocabulary size V or its square or
even its cube) into low-dimensional, real-valued vectors via a neural
network and then used for processing by subsequent neural network lay-
ers. The key advantage of using the continuous space to represent words
(or phrases) is its distributed nature, which enables sharing or grouping
the representations of words with a similar meaning. Such sharing is
not possible in the original symbolic space, constructed by 1-of-V cod-
ing with a very high dimension, for representing words. Unsupervised
Figure 8.3: The CBOW architecture (a) on the left, and the Skip-gram architecture
(b) on the right. [after [246], @ICLR].
nonlinearity in the upper neural network layer and share the projection
layer for all words. And second, the N-gram NNLM is trained
on top of the word vectors. So, after removing the second step in the
NNLM, the simple model is used to learn word embeddings, where the
simplicity allows the use of a very large amount of data. This gives rise
to a word embedding model called Continuous Bag-of-Words Model
(CBOW), as shown in Figure 8.3a. Further, since the goal is no longer
computing probabilities of word sequences as in LMs, the word embedding
system here is made more effective by not only predicting the
current word based on the context but also performing the inverse prediction,
known as the “Skip-gram” model, as shown in Figure 8.3b. In
the follow-up work [250] by the same authors, this word embedding
system including the Skip-gram model is extended by a much faster
learning method called negative sampling, similar to NCE discussed in
Section 8.1.
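The difference between the two prediction directions is easiest to see in the training examples each model consumes. The following sketch only builds those examples from a token sequence; the helper names and the symmetric window are assumptions, and neither the projection layer nor negative sampling is shown.

```python
def cbow_pairs(tokens, window):
    """CBOW: predict the current word from the surrounding context words."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window):
    """Skip-gram: the inverse task, predicting each context word from the current word."""
    return [(target, c)
            for context, target in cbow_pairs(tokens, window)
            for c in context]

tokens = ["deep", "learning", "for", "speech"]
cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)
```

For this four-word sequence CBOW yields one example per position, for instance `(["deep", "for"], "learning")`, whereas Skip-gram yields one example per (word, context word) pair, for instance `("learning", "deep")` and `("learning", "for")`.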
In parallel with the above development, Mnih and Kavukcuoglu
[254] demonstrate that NCE training of lightweight word embedding
Figure 8.4: The extended word-embedding model using a recursive neural network
that takes into account not only local context but also global context. The global
context is extracted from the document and put in the form of a global semantic
vector, as part of the input into the original word-embedding model with local
context. Taken from Figure 1 of [169]. [after [169], @ACL].
8.2. Natural language processing 303
Figure 8.5: Illustration of the basic approach reported in [122] for machine trans-
lation. Parallel pairs of source (denoted by f) and target (denoted by e) phrases
are projected into continuous-valued vector representations (denoted by the two y
vectors), and their translation score is computed by the distance between the pair in
this continuous space. The projection is performed by deep neural networks (denoted
by the two arrows) whose weights are learned on parallel training data. [after [121],
@NIPS].
both entities and relations. More recent work [340] adopts an alterna-
tive approach, based on the use of neural tensor networks, to attack
the problem of reasoning over a large joint knowledge graph for rela-
tion classification. The knowledge graph is represented as triples of a
relation between two entities, and the authors aim to develop a neu-
ral network model suitable for inference over such relationships. The
model they presented is a neural tensor network, with one layer only.
The network is used to represent entities as fixed-dimensional vectors,
which are created separately by averaging pre-trained word embedding
vectors. It then learns the tensor with the newly added relationship element
that describes the interactions among all the latent components
in each of the relationships. The neural tensor network can be visu-
alized in Figure 8.6, where each dashed box denotes one of the two
slices of the tensor. Experimentally, the paper [340] shows that this
tensor model can effectively classify unseen relationships in WordNet
and FreeBase.
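A sketch of how such a single-layer neural tensor network might score a candidate triple is given below. It assumes a commonly used formulation in which each tensor slice contributes one bilinear term, a standard linear layer acts on the concatenated entity vectors, and a tanh nonlinearity precedes a final linear scoring vector; the exact form used in [340] may differ. All parameter names, sizes, and the random "word embeddings" are hypothetical.

```python
import numpy as np

def ntn_score(e1, e2, W, V, b, u):
    """Score a pair of entity vectors under one relation with a bilinear tensor layer."""
    bilinear = np.einsum("i,kij,j->k", e1, W, e2)   # e1^T W[k] e2 for every slice k
    linear = V @ np.concatenate([e1, e2])           # ordinary layer on [e1; e2]
    return float(u @ np.tanh(bilinear + linear + b))

rng = np.random.default_rng(0)
d, k = 3, 2                                  # embedding size and number of tensor slices

# Entity vectors formed by averaging (stand-in) pre-trained word embeddings.
e1 = rng.standard_normal((2, d)).mean(axis=0)   # an entity named by two words
e2 = rng.standard_normal((1, d)).mean(axis=0)   # an entity named by one word

W = rng.standard_normal((k, d, d))           # relation-specific tensor with k slices
V = rng.standard_normal((k, 2 * d))
b = rng.standard_normal(k)
u = rng.standard_normal(k)

score = ntn_score(e1, e2, W, V, b, u)
```

Each relation would own its own set of parameters `W`, `V`, `b`, and `u`, and a higher score would indicate that the relation is more likely to hold between the two entities.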
As the final example of deep learning applied successfully to NLP,
we discuss here sentiment analysis applications based on recursive deep
Figure 8.6: Illustration of the neural tensor network described in [340], with two
relationships shown as two slices in the tensor. The tensor is denoted by $W^{[1:2]}$. The
network contains a bilinear tensor layer that directly relates the two entity vectors
(shown as $e_1$ and $e_2$) across three dimensions. Each dashed box denotes one of the
two slices of the tensor. [after [340], @NIPS].
9.2. Semantic hashing with deep autoencoders for document 309
the user. The process may then be iterated if the user wishes to refine
the query.
Based partly on [236], common IR methods consist of several
categories:
pass through the model with thresholding. Then the Hamming distances
between the query binary code and all the documents’ 128-bit
binary codes, especially those of the “neighboring” documents defined
in the semantic space, are computed extremely efficiently. The efficiency
is accomplished by looking up the neighboring bit vectors in
the hash table. The same idea as discussed here for coding text docu-
ments for information retrieval has been explored for audio document
retrieval and speech feature coding problems with some initial explo-
ration reported in [100], discussed in Section 4 in detail.
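The lookup step can be sketched in a few lines of plain Python. The example assumes 8-bit codes instead of 128-bit ones to keep the numbers readable, uses a dictionary as the hash table, and enumerates every code within a small Hamming radius of the query; the document names, the threshold of 0.5, and the function names are illustrative only.

```python
from itertools import combinations

def to_code(activations, threshold=0.5):
    """Threshold the code-layer activations into an integer bit vector."""
    code = 0
    for a in activations:
        code = (code << 1) | (1 if a > threshold else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two codes."""
    return bin(a ^ b).count("1")

def neighbors(code, n_bits, radius):
    """Yield every n_bits-wide code within the given Hamming radius of code."""
    for r in range(radius + 1):
        for positions in combinations(range(n_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p
            yield flipped

def lookup(query_code, table, n_bits, radius):
    """Collect the documents stored at the query's address and at nearby addresses."""
    found = []
    for c in neighbors(query_code, n_bits, radius):
        found.extend(table.get(c, []))
    return found

table = {0b10110010: ["doc_a"], 0b10110011: ["doc_b"], 0b01001101: ["doc_c"]}
query = to_code([0.9, 0.1, 0.8, 0.7, 0.2, 0.3, 0.6, 0.4])
hits = lookup(query, table, n_bits=8, radius=1)
```

The cost of the lookup depends on the number of addresses inside the Hamming ball, not on the size of the document collection, which is the source of the efficiency. With long codes the ball grows quickly with the radius, so only small radii are practical with this enumeration.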
and (2) these models are often trained in an unsupervised manner using
an objective function that is only loosely coupled with the evaluation
metric for the retrieval task. In order to improve semantic matching for
IR, two lines of research have been conducted to extend the above latent
semantic models. The first is the semantic hashing approach reviewed
in Section 9.2 above, based on the use of deep autoencoders
[165, 314]. While the hierarchical semantic structure embedded
in the query and the document can be extracted via deep learning,
the deep learning approach used for their models still adopts an unsu-
pervised learning method where the model parameters are optimized
for the re-construction of the documents rather than for differentiating
the relevant documents from the irrelevant ones for a given query. As
a result, the deep neural network models do not significantly outper-
form strong baseline IR models that are based on lexical matching. In
the second line of research, click-through data, which consists of a list
of queries and the corresponding clicked documents, is exploited for
semantic modeling so as to bridge the language discrepancy between
search queries and Web documents in recent studies [120, 124]. These
models are trained on click-through data using objectives tailored to
the document ranking task. However, these click-through-based models
are still linear, suffering from the issue of expressiveness. As a result,
these models need to be combined with the keyword matching models
(such as BM25) in order to obtain a significantly better performance
than baselines.
The DSSM approach reported in [172] aims to combine the
strengths of the above two lines of work while overcoming their weak-
nesses. It uses the DNN architecture to capture complex semantic prop-
erties of the query and the document, and to rank a set of documents
for a given query. Briefly, a nonlinear projection is performed first to
map the query and the documents to a common semantic space. Then,
the relevance of each document given the query is calculated as the
cosine similarity between their vectors in that semantic space. The
DNNs are trained using the click-through data such that the condi-
tional likelihood of the clicked document given the query is maximized.
Different from the previous latent semantic models that are learned
in an unsupervised fashion, the DSSM is optimized directly for Web
9.3. DSSM for document retrieval 313
Figure 9.1: The DNN component of the DSSM architecture for computing semantic
features. The DNN uses multiple layers to map high-dimensional sparse text features,
for both Queries and Documents into low-dimensional dense features in a semantic
space. [after [172], @CIKM].
\[
\begin{aligned}
l_1 &= W_1 x, \\
l_i &= f(W_i l_{i-1} + b_i), \quad i = 2, \ldots, N-1, \\
y &= f(W_N l_{N-1} + b_N),
\end{aligned}
\]
314 Selected Applications in Information Retrieval
where the tanh function is used at the output layer and the hidden layers
$l_i$, $i = 2, \ldots, N-1$:
\[
f(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}.
\]
The semantic relevance score between a query $Q$ and a document $D$
can then be computed as the cosine similarity
\[
R(Q, D) = \mathrm{cosine}(y_Q, y_D) = \frac{y_Q^{\top} y_D}{\|y_Q\| \, \|y_D\|},
\]
where $y_Q$ and $y_D$ are the concept vectors of the query and the docu-
ment, respectively. In Web search, given the query, the documents can
be sorted by their semantic relevance scores.
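The projection and the relevance score can be written compactly in NumPy. The sketch below follows the layer equations and the cosine formula given in the text, with the same weights applied to the query and the document; the layer sizes, the random weights, and the binary input vectors standing in for sparse text features are assumptions.

```python
import numpy as np

def dssm_project(x, weights, biases):
    """Map a sparse text feature vector x to its concept vector y."""
    l = weights[0] @ x                       # l_1 = W_1 x (no bias, no nonlinearity)
    for W, b in zip(weights[1:], biases):    # l_i = tanh(W_i l_{i-1} + b_i); the last one is y
        l = np.tanh(W @ l + b)
    return l

def relevance(y_q, y_d):
    """R(Q, D): cosine similarity between the two concept vectors."""
    return float(y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d)))

rng = np.random.default_rng(0)
sizes = [30, 8, 8, 4]                        # hypothetical input, hidden, and output sizes
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[2:]]   # b_2, ..., b_N

x_q = rng.integers(0, 2, size=sizes[0]).astype(float)   # stand-in query features
x_d = rng.integers(0, 2, size=sizes[0]).astype(float)   # stand-in document features

y_q = dssm_project(x_q, weights, biases)
y_d = dssm_project(x_d, weights, biases)
score = relevance(y_q, y_d)
```

Ranking a set of documents for a query then amounts to computing `relevance` for each document and sorting by the result. Note that `np.tanh` computes the same function as the expression for $f(x)$ above.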
Learning of the DNN weights $W_i$ and $b_i$ shown in Figure 9.1 is an
important contribution of the study of [172]. Compared with the DNNs
used in speech recognition where the targets or labels of the training
data are readily available, the DNN in the DSSM does not have such
label information well defined. That is, rather than using the common
cross entropy or mean square errors as the training objective function,
IR-centric loss functions need to be developed in order to train the DNN
weights in the DSSM using the available data such as click-through logs.
The click-through logs consist of a list of queries and their clicked
documents. A query is typically more relevant to the documents that
are clicked on than those that are not. This weak supervision information
can be exploited to train the DSSM. More specifically, the weight
matrices in the DSSM, $W_i$, are learned to maximize the posterior probability
of the clicked documents given the queries
\[
P(D \mid Q) = \frac{\exp(\gamma R(Q, D))}{\sum_{D' \in \mathbf{D}} \exp(\gamma R(Q, D'))}
\]
defined on the semantic relevance score $R(Q, D)$ between the Query ($Q$)
and the Document ($D$), where $\gamma$ is a smoothing factor set empirically
on a held-out data set, and $\mathbf{D}$ denotes the set of candidate documents
to be ranked. Ideally, $\mathbf{D}$ should contain all possible documents, as in
the maximum mutual information training for speech recognition where
all possible negative candidates may be considered [147]. However, in
where $\Lambda$ denotes the parameter set of the DNN weights $\{W_i\}$ in the
DSSM. In Figure 9.2, we show the overall DSSM architecture that
contains several DNNs. All these DNNs share the same weights but take
different documents (one positive and several negatives) as inputs when
training the DSSM parameters. Details of the gradient computation
of this approximate loss function with respect to the DNN weights
tied across documents and queries can be found in [172] and are not
elaborated here.
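For a single query, the posterior over one clicked document and a few sampled negatives can be sketched as follows. The loss equation itself is not reproduced in the text above, so the sketch assumes the negative log-posterior of the clicked document as the per-query loss; the relevance scores and the value of the smoothing factor are made up for illustration.

```python
import math

def posterior(scores, gamma):
    """Softmax over gamma * R(Q, D) for the candidate documents of one query."""
    m = max(gamma * s for s in scores)                 # shift for numerical stability
    exps = [math.exp(gamma * s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def click_loss(scores, gamma):
    """Negative log-posterior of the clicked document, assumed to be scores[0]."""
    return -math.log(posterior(scores, gamma)[0])

gamma = 10.0                          # hypothetical smoothing factor
scores = [0.9, 0.2, 0.1, -0.3]        # R(Q, D+) followed by three sampled negatives
p = posterior(scores, gamma)
loss_good = click_loss(scores, gamma)
loss_bad = click_loss([0.1, 0.2, 0.9, -0.3], gamma)   # clicked document scored low
```

The loss is small when the clicked document has the highest cosine score and grows when a negative outranks it, so its gradient pushes the shared DNN weights toward ranking clicked documents first.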
Most recently, the DSSM described above has been extended to its
convolutional version, or C-DSSM [328]. In the C-DSSM, semantically
similar words within context are projected to vectors that are close
to each other in the contextual feature space through a convolutional
structure. The overall semantic meaning of a sentence is found to be
determined by a few key words in the sentence, and thus the C-DSSM
uses an additional max pooling layer to extract the most salient local
features to form a fixed-length global feature vector. The global feature
vector is then fed to the remaining nonlinear DNN layer(s) to map it
to a point in the shared semantic space.
Figure 9.2: Architectural illustration of the DSSM for document retrieval (from
[170, 171]). All DNNs shown have shared weights. A set of n documents are shown
here to illustrate the random negative sampling discussed in the text for simplifying
the training procedure for the DSSM. [after [172], @CIKM].
Figure 9.3: The convolutional neural network component of the C-DSSM, with a
window size of three illustrated for the convolutional layer. [after [328], @WWW].
In parallel with the IR studies reviewed above, the deep stacking net-
work (DSN) discussed in Section 6 has also been explored recently
for IR with insightful results [88]. The experimental results suggest
that the classification error rate using the binary decision of “relevant”
versus “non-relevant” from the DSN, which is closely correlated with
the DSN training objective, is also generally correlated well with the
NDCG (normalized discounted cumulative gain) as the most common
this task are the first with the use of deep learning techniques (based
on the DSN architecture) on the ad-related IR problem. The preliminary
results from the experiments show a close correlation between the
MSE as the DSN training objective and the NDCG as the IR quality
measure over a wide NDCG range.
10
Selected Applications in Object Recognition
and Computer Vision
Over the past two years or so, tremendous progress has been made in
applying deep learning techniques to computer vision, especially in the
field of object recognition. The success of deep learning in this area
is now commonly accepted by the computer vision community. It is
the second area in which the application of deep learning techniques
is successful, following the speech recognition area as we reviewed and
analyzed in Sections 2 and 7.
Excellent surveys on the recent progress of deep learning for
computer vision are available in the NIPS-2013 tutorial
(https://nips.cc/Conferences/2013/Program/event.php?ID=4170, with video
recording at http://research.microsoft.com/apps/video/default.aspx?id=206976&l=i
and slides at http://cs.nyu.edu/~fergus/presentations/nips2013_final.pdf),
and also in the CVPR-2012 tutorial
(http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12). The reviews provided
in this section below are based partly on these tutorials, in connection
with the earlier deep learning material in this monograph. Another
excellent source which this section draws from is the most recent Ph.D.
thesis on the topic of deep learning for computer vision [434].
10.1. Unsupervised or generative feature learning 321
process starts from input x, a neutral face, and generates the output
y, the facial expression. In face expression classification experiments,
the learned unsupervised hidden features generated from this stochas-
tic network are appended to the image pixels and helped to obtain
superior accuracy to the baseline classifier based on the conditional
RBM/DBN [361].
Perhaps the most notable work in the category of unsupervised deep
feature learning for computer vision (prior to the recent surge of the
work on CNNs) is that of [209], a nine-layer locally connected sparse
autoencoder with pooling and local contrast normalization. The model
has one billion connections and is trained on a dataset of 10 million
images downloaded from the Internet. The unsupervised feature learning
methods allow the system to train a face detector without having to
label images as containing a face or not. Control experiments
show that this feature detector is robust not only to translation but
also to scaling and out-of-plane rotation.
Another set of popular studies on unsupervised deep feature learning
for computer vision is based on deep sparse coding models [226].
This type of deep model produced state-of-the-art accuracy results on
the ImageNet object recognition tasks prior to the rise of the CNN
architectures armed with supervised learning to perform joint feature
learning and classification, which we turn to now.
Selected Applications in Object Recognition and Computer Vision
Figure 10.2: The original convolutional neural network that is composed of mul-
tiple alternating convolution and pooling layers followed by fully connected layers.
[after [212], @IEEE].
Figure 10.3: The architecture of the deep-CNN system which won the 2012 Ima-
geNet competition by a large margin over the second-best system and the state of
the art by 2012. [after [198], @NIPS].
Figure 10.5: The top portion shows how a deconvolutional network’s layer (left)
is attached to a corresponding CNN’s layer (right). The deconvolutional network
reconstructs an approximate version of the CNN features from the layer below. The
bottom portion is an illustration of the unpooling operation in the deconvolutional
network, where “Switches” are used to record the location of the local max in each
pooling region during pooling in the CNN. [after [436], @arXiv].
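The unpooling operation described in the caption of Figure 10.5 can be illustrated with a minimal sketch in plain Python. This is not the implementation of [436]; the function names `max_pool_with_switches` and `unpool`, the non-overlapping 2x2 pooling regions, and the toy feature map are all assumptions made for illustration. The sketch records the location of each local max (the "switch") during pooling and then uses those locations to place values back during unpooling.

```python
def max_pool_with_switches(fmap, size=2):
    """Non-overlapping max pooling over a 2-D list of numbers.

    Returns the pooled map and the "switches": the (row, col) location
    of the maximum inside each pooling region.
    """
    rows, cols = len(fmap), len(fmap[0])
    pooled, switches = [], []
    for r in range(0, rows, size):
        pooled_row, switch_row = [], []
        for c in range(0, cols, size):
            # Locate the largest value inside the current pooling region.
            best = max(
                ((i, j)
                 for i in range(r, min(r + size, rows))
                 for j in range(c, min(c + size, cols))),
                key=lambda ij: fmap[ij[0]][ij[1]],
            )
            pooled_row.append(fmap[best[0]][best[1]])
            switch_row.append(best)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, shape):
    """Put each pooled value back at its recorded location; zeros elsewhere."""
    rows, cols = shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (i, j) in zip(pooled_row, switch_row):
            out[i][j] = value
    return out


# Toy 4x4 feature map (hypothetical values).
fmap = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [2.0, 6.0, 3.0, 1.0],
]
pooled, switches = max_pool_with_switches(fmap)
restored = unpool(pooled, switches, (4, 4))
```

In this sketch, `restored` is an approximate reconstruction of `fmap`: it keeps the local maxima at their original positions and sets every other entry to zero.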
Selected Applications in Multimodal and Multi-task Learning
Figure 11.1: Illustration of the multi-modal DeViSE architecture. The left portion
is an image recognition neural network with a softmax output layer. The right por-
tion is a skip-gram text model providing word embedding vectors; see Section 8.2
and Figure 8.3 for details. The center is the joint deep image-text model of DeViSE,
with the two Siamese branches initialized by the image and word embedding mod-
els below the softmax layers. The layer labeled “transformation” is responsible for
mapping the outputs of the image (left) and text (right) branches into the same
semantic space. [after [117], @NIPS].
combination of embedding vectors for the text label and the image
classes [270]. The main difference is as follows. DeViSE replaces the last,
softmax layer of a CNN image classifier with a linear transformation
layer, and the new transformation layer is then trained together with the
lower layers of the CNN. The method in [270] is much simpler: it keeps
the softmax layer of the CNN and does not retrain the CNN. For a
test image, the CNN first produces the N-best candidate classes. Then, the
convex combination of the corresponding N embedding vectors in the
semantic space is computed. This gives a deterministic transformation
from the outputs of the softmax classifier into the embedding space.
This simple multi-modal learning method is shown to work very well
on the ImageNet zero-shot learning task.
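The convex-combination step can be made concrete with a small sketch in plain Python. It is an illustration under stated assumptions, not the code of [270]: the function names, the renormalization of the N-best probabilities so that the weights sum to one, the use of cosine similarity to pick the nearest unseen label, and all toy probabilities and two-dimensional embeddings are hypothetical.

```python
import math


def convex_combination_embedding(class_probs, label_embeddings, top_n=2):
    """Map softmax outputs to the embedding space via the N-best classes."""
    # Pick the N most probable training classes.
    top = sorted(class_probs, key=class_probs.get, reverse=True)[:top_n]
    # Renormalize the N-best probabilities so the weights sum to one
    # (assumption: this is what makes the combination convex).
    total = sum(class_probs[c] for c in top)
    dim = len(label_embeddings[top[0]])
    combined = [0.0] * dim
    for c in top:
        weight = class_probs[c] / total
        for k in range(dim):
            combined[k] += weight * label_embeddings[c][k]
    return combined


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical softmax outputs over classes seen in training.
class_probs = {"lion": 0.6, "tiger": 0.3, "truck": 0.1}
# Hypothetical word embeddings of the seen class labels.
seen = {"lion": [1.0, 0.0], "tiger": [0.0, 1.0], "truck": [-1.0, 0.0]}
# Hypothetical word embeddings of labels never seen in training.
unseen = {"liger": [0.7, 0.7], "bus": [-1.0, 0.2]}

combined = convex_combination_embedding(class_probs, seen, top_n=2)
# Zero-shot prediction: the unseen label closest to the combined vector.
predicted = max(unseen, key=lambda name: cosine(combined, unseen[name]))
```

Because the CNN and its softmax layer are left untouched, the only computation added at test time is this weighted average followed by a nearest-neighbor search in the embedding space.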
Another thread of studies, separate from but related to the above
work on multi-modal learning involving text and image, has centered
on the use of multi-modal embeddings, where data from multiple
sources with separate modalities of text and image are projected into
the same vector space. For example, Socher and Fei-Fei [341] project
words and images into the same space using kernelized canonical
correlation analysis. Socher et al. [342] map images to single-word vectors
so that the constructed multi-modal system can classify images without
seeing any examples of the class, i.e., zero-shot learning similar
to the capability of DeViSE. The most recent work by Socher et al.
[343] extends their earlier work from single-word embeddings to those
of phrases and full-length sentences. The mechanism for mapping sentences,
instead of the earlier single words, into the multi-modal embedding
space is derived from the power of the recursive neural network
described in Socher et al. [347] as summarized in Section 8.2, and its
extension with dependency trees.
In addition to mapping text and images into the same
vector space or creating a joint image/text embedding space,
multi-modal learning for text and image can also be cast in the framework
of language models. The study in [196] focuses on a model of natural language
that is conditioned on other modalities such as images.
This type of multi-modal language model is used to (1) retrieve
images given complex description queries, (2) retrieve phrase descriptions
given image queries, and (3) generate text conditioned on images.
Figure 11.2: Illustration of the multi-modal DeViSE architecture. The left portion
is an image recognition neural network with a softmax output layer. The right por-
tion is a skip-gram text model providing word embedding vectors; see Section 8.2
and Figure 8.3 for details. The center is the joint deep image-text model of DeViSE,
with the two Siamese branches initialized by the image and word embedding mod-
els below the softmax layers. The layer labeled “transformation” is responsible for
mapping the outputs of the image (left) and text (right) branches into the same
semantic space. [after [196], @NIPS].
Figure 11.4: A DNN architecture for multitask learning that aims to
discover hidden explanatory factors shared among three tasks A, B, and C. [after [22],
@IEEE].
Figure 11.5: A DNN architecture for multilingual speech recognition. [after [170],
@IEEE].
Figure 11.6: A DNN architecture for speech recognition trained with mixed-
bandwidth acoustic data with 16-kHz and 8-kHz sampling rates. [after [221],
@IEEE].
Conclusion
References
[179] E. Humphrey, J. Bello, and Y. LeCun. Feature learning and deep archi-
tectures: New directions for music informatics. Journal of Intelligent
Information Systems, 2013.
[180] B. Hutchinson, L. Deng, and D. Yu. A deep architecture with bilinear
modeling of hidden representations: Applications to phonetic recogni-
tion. In Proceedings of International Conference on Acoustics Speech
and Signal Processing (ICASSP). 2012.
[181] B. Hutchinson, L. Deng, and D. Yu. Tensor deep stacking net-
works. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 35:1944–1957, 2013.
[182] D. Imseng, P. Motlicek, P. Garner, and H. Bourlard. Impact of deep
MLP architecture on different modeling techniques for under-resourced
speech recognition. In Proceedings of the Automatic Speech Recognition
and Understanding Workshop (ASRU). 2013.
[183] N. Jaitly and G. Hinton. Learning a better representation of speech
sound waves using restricted Boltzmann machines. In Proceedings of
International Conference on Acoustics Speech and Signal Processing
(ICASSP). 2011.
[184] N. Jaitly, P. Nguyen, and V. Vanhoucke. Application of pre-trained deep
neural networks to large vocabulary speech recognition. In Proceedings
of Interspeech. 2012.
[185] K. Jarrett, K. Kavukcuoglu, and Y. LeCun. What is the best multi-
stage architecture for object recognition? In Proceedings of International
Conference on Computer Vision, pages 2146–2153. 2009.
[186] H. Jiang and X. Li. Parameter estimation of statistical models using
convex optimization: An advanced method of discriminative training
for speech and language processing. IEEE Signal Processing Magazine,
27(3):115–127, 2010.
[187] B. Juang, S. Levinson, and M. Sondhi. Maximum likelihood estimation
for multivariate mixture observations of Markov chains. IEEE Trans-
actions on Information Theory, 32:307–309, 1986.
[188] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error
rate methods for speech recognition. IEEE Transactions On Speech
and Audio Processing, 5:257–265, 1997.
[189] S. Kahou et al. Combining modality specific deep neural networks for
emotion recognition in video. In Proceedings of International Conference
on Multimodal Interaction (ICMI). 2013.
[224] H. Liao, E. McDermott, and A. Senior. Large scale deep neural network
acoustic modeling with semi-supervised training data for YouTube video
transcription. In Proceedings of the Automatic Speech Recognition and
Understanding Workshop (ASRU). 2013.
[225] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C.-H. Lee. A study on
multilingual acoustic modeling for large vocabulary ASR. In Proceedings
of International Conference on Acoustics Speech and Signal Processing
(ICASSP). 2009.
[226] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang.
Large-scale image classification: Fast feature extraction and SVM train-
ing. In Proceedings of Computer Vision and Pattern Recognition
(CVPR). 2011.
[227] Z. Ling, L. Deng, and D. Yu. Modeling spectral envelopes using
restricted Boltzmann machines and deep belief networks for statistical
parametric speech synthesis. IEEE Transactions on Audio Speech
Language Processing, 21(10):2129–2139, 2013.
[228] Z. Ling, L. Deng, and D. Yu. Modeling spectral envelopes using
restricted Boltzmann machines for statistical parametric speech synthesis.
In International Conference on Acoustics Speech and Signal Processing
(ICASSP), pages 7825–7829. 2013.
[229] Z. Ling, K. Richmond, and J. Yamagishi. Articulatory control of HMM-
based parametric speech synthesis using feature-space-switched multi-
ple regression. IEEE Transactions on Audio, Speech, and Language
Processing, 21, January 2013.
[230] L. Lu, K. Chin, A. Ghoshal, and S. Renals. Joint uncertainty decoding
for noise robust subspace Gaussian mixture models. IEEE Transactions
on Audio, Speech, and Language Processing, 21(9):1791–1804, 2013.
[231] J. Ma and L. Deng. A path-stack algorithm for optimizing dynamic
regimes in a statistical hidden dynamical model of speech. Computer
Speech and Language, 2000.
[232] J. Ma and L. Deng. Efficient decoding strategies for conversational
speech recognition using a constrained nonlinear state-space model.
IEEE Transactions on Speech and Audio Processing, 11(6):590–602,
2003.
[233] J. Ma and L. Deng. Target-directed mixture dynamic models for spon-
taneous speech recognition. IEEE Transactions on Speech and Audio
Processing, 12(1):47–58, 2004.
[421] D. Yu, L. Deng, and F. Seide. Large vocabulary speech recognition using
deep tensor neural networks. In Proceedings of Interspeech. 2012.
[422] D. Yu, L. Deng, and F. Seide. The deep tensor neural network with
applications to large vocabulary speech recognition. IEEE Transactions
on Audio, Speech, and Language Processing, 21(2):388–396, 2013.
[423] D. Yu, J.-Y. Li, and L. Deng. Calibration of confidence measures in
speech recognition. IEEE Transactions on Audio, Speech and Language,
19:2461–2473, 2010.
[424] D. Yu, F. Seide, G. Li, and L. Deng. Exploiting sparseness in deep
neural networks for large vocabulary speech recognition. In Proceedings
of International Conference on Acoustics Speech and Signal Processing
(ICASSP). 2012.
[425] D. Yu and M. Seltzer. Improved bottleneck features using pre-trained
deep neural networks. In Proceedings of Interspeech. 2011.
[426] D. Yu, M. Seltzer, J. Li, J.-T. Huang, and F. Seide. Feature learning in
deep neural networks — studies on speech recognition. In Proceedings
of International Conference on Learning Representations (ICLR). 2013.
[427] D. Yu, S. Siniscalchi, L. Deng, and C. Lee. Boosting attribute and
phone estimation accuracies with deep neural networks for detection-
based speech recognition. In Proceedings of International Conference
on Acoustics Speech and Signal Processing (ICASSP). 2012.
[428] D. Yu, S. Wang, and L. Deng. Sequential labeling using deep-structured
conditional random fields. Journal of Selected Topics in Signal Process-
ing, 4:965–973, 2010.
[429] D. Yu, S. Wang, Z. Karam, and L. Deng. Language recognition using
deep-structured conditional random fields. In Proceedings of Interna-
tional Conference on Acoustics Speech and Signal Processing (ICASSP),
pages 5030–5033. 2010.
[430] D. Yu, K. Yao, H. Su, G. Li, and F. Seide. KL-divergence regularized
deep neural network adaptation for improved large vocabulary speech
recognition. In Proceedings of International Conference on Acoustics
Speech and Signal Processing (ICASSP). 2013.
[431] K. Yu, M. Gales, and P. Woodland. Unsupervised adaptation with dis-
criminative mapping transforms. IEEE Transactions on Audio, Speech,
and Language Processing, 17(4):714–723, 2009.
[432] K. Yu, Y. Lin, and H. Lafferty. Learning image representations from
the pixel level via hierarchical sparse coding. In Proceedings Computer
Vision and Pattern Recognition (CVPR). 2011.