Structured Deep Neural Networks For Speech Recognition
Chunyang Wu
Department of Engineering
University of Cambridge
This dissertation is the result of my own work carried out at the University of Cambridge
and includes nothing which is the outcome of any work done in collaboration except
where explicitly stated. It has not been submitted in whole or in part for a degree at
any other university. Some of the work has been previously presented in international
conferences (Ragni et al., 2017; Wu and Gales, 2015, 2017; Wu et al., 2016a,b) and
workshops, or published as journal articles (Karanasou et al., 2017; Wu et al., 2017).
The length of this thesis including footnotes, appendices and references is approximately
56100 words. This thesis contains 37 figures and 35 tables.
The work on multi-basis adaptive neural networks has been published in Karanasou
et al. (2017); Wu and Gales (2015); Wu et al. (2016a). I was responsible for the original
ideas, code implementation, experiments and paper writing of Wu and Gales (2015);
Wu et al. (2016a). Penny Karanasou contributed to the i-vector extraction, discussion
and paper writing of Karanasou et al. (2017).
The work on stimulated deep neural networks has been published in Ragni et al.
(2017); Wu et al. (2016a, 2017). This is an extension inspired by Tan et al. (2015a).
I was responsible for the ideas, code implementation, complete experiments on Wall
Street Journal and broadcast news English, and paper writing of Wu et al. (2016a, 2017).
For experiments on Babel languages, I was responsible for preliminary investigations on
Javanese, Pashto and Mongolian, and module scripts related to stimulated systems for
all languages in the option period 3. Anton Ragni was responsible for the paper writing
of Ragni et al. (2017), and other team members in the project were responsible for
the joint training systems and key-word-spotting performance for all Babel languages.
Penny Karanasou contributed to discussions of the model in weekly meetings.
The work on deep activation mixture models has been published in Wu and Gales
(2017). I was responsible for the original ideas, code implementation, experiments and
paper writing.
Chunyang Wu
March 2018
Acknowledgements
First of all, I would like to express my sincere and utmost gratitude to my supervisor,
Prof. Mark Gales, for his mentorship and support over the past four years. In the
period of writing this thesis, I revisited the documents I had prepared for our weekly meetings.
Looking back, I have learned a lot from his supervision and guidance. Particularly, I
want to thank him for the great patience in teaching me how to think and organise
research topics logically and thoroughly. His wisdom, insight and passion in research
and projects have influenced me a lot. I believe the profound influence will be there
with me for my whole life. Thank you, Mark.
Special thanks go to the NST program (EPSRC funded), the Babel program (IARPA
funded), the RATS program (DARPA funded) and research funding from Google and
Amazon for the financial support, providing me an excellent opportunity to be involved
in high-standard research and attend many international conferences and workshops.
I want to thank my advisor Prof. Phil Woodland for his constructive suggestions
in my research. Also, I owe my thanks to my colleagues in the Machine Intelli-
gence Laboratory. Particular thanks go to Dr. Xie Chen, Dr. Penny Karanasou,
Dr. Anton Ragni, Dr. Chao Zhang, Dr. Kate Knill, Dr. Yongqiang Wang, Dr. Shix-
iong Zhang, Dr Rogier van Dalen, Dr. Yu Wang, Dr. Yanmin Qian, Dr. Pierre Lan-
chantin, Dr. Jingzhou Yang, Moquan Wan and Jeremy Wong, for scintillating dis-
cussions, whether it be speech recognition, machine learning, or subjects less directly
related to our research. I also would like to thank Patrick Gosling and Anna Langley
for their reliable support in maintaining the computer facilities.
The friends I made in Cambridge will be a treasure forever. It is my honour to meet
so many kind people in the Department, University and Wolfson College. Especially, I
would like to thank my housemates and friends met at Barton House and AA57 House
of Academy. Their company brought countless joyful and unforgettable moments.
Finally, the biggest thanks go to my parents. For over twenty-seven years, they
have offered everything possible to support me. This thesis is dedicated to them.
Abstract
Deep neural networks (DNNs) and deep learning approaches yield state-of-the-art
performance in a range of machine learning tasks, including automatic speech recogni-
tion. The multi-layer transformations and activation functions in DNNs, or related
network variations, allow complex and difficult data to be well modelled. However,
the highly distributed representations associated with these models make it hard to
interpret the parameters. The whole neural network is commonly treated as a “black box”.
The behaviours of activation functions and the meanings of network parameters are
rarely controlled in standard DNN training. Though sensible performance can be achieved, the lack of interpretation of network structures and parameters makes better regularisation and adaptation of DNN models challenging. In regularisation, parameters have to be regularised universally and indiscriminately. For instance, the widely used L2 regularisation encourages all parameters to be zero. In adaptation, a large number of independent parameters have to be re-estimated. Adaptation schemes in this framework cannot be effectively performed when the adaptation data are limited.
This thesis investigates structured deep neural networks. Special structures are
explicitly designed, and desired interpretations are imposed on them to improve
DNN regularisation and adaptation. For regularisation, parameters can be separately
regularised based on their functions. For adaptation, parameters can be adapted in
groups or partially adapted according to their roles in the network topology. Three
forms of structured DNNs are proposed in this thesis. The contributions of these
models are presented as follows.
The first contribution of this thesis is the multi-basis adaptive neural network.
This form of structured DNN introduces a set of parallel sub-networks with restricted
List of tables
Nomenclature
1 Introduction
1.1 Deep Neural Network
1.2 Automatic Speech Recognition
1.3 Thesis Organisation
7 Experiments
7.1 Babel Languages
7.1.1 Experimental Setup
7.1.2 Results and Discussion
7.2 Broadcast News English
8 Conclusion
8.1 Review of Work
8.2 Future Work
References
List of figures
3.11 DNN with speaker codes. Speaker codes c(s) are introduced to several bottom layers to emphasise their importance.
3.12 Parametrised activation function.
List of tables
5.1 WSJ-SI84: Summary of training and evaluation sets. It includes the total hours, number of utterances (#Uttr) and average utterance duration (AvgUttr).
5.2 WSJ-SI84: Recognition performance (WER %) of stimulated DNNs using the KL regularisation with and without normalised activation (NormAct) on H1-Dev.
5.3 WSJ-SI84: Recognition performance (WER %) of sigmoid stimulated DNNs using the KL regularisation on H1-Dev. Different settings of the regularisation penalty η and the sharpness factor σ² are compared.
5.4 WSJ-SI84: Recognition performance (WER %) of stimulated DNNs using the KL regularisation.
5.5 WSJ-SI84: Recognition performance (WER %) of stimulated DNNs using different activation regularisations on H1-Dev.
5.6 WSJ-SI84: Recognition performance (WER %) of stimulated DNNs using different activation regularisations on H1-Eval.
7.1 Babel: Summary of used languages. Scripts marked with † utilise capital letters in the graphemic dictionary.
7.2 Babel: Recognition performance (WER %) of CE stimulated DNNs using different forms of activation regularisation in Javanese.
7.3 Babel: Recognition performance (WER %) comparing CE and MPE sigmoid stimulated DNNs using different forms of activation regularisation in Javanese.
General Notations
t frame index
s speaker index
u utterance index
L training criterion
R regularisation function
D distance function
ϵ learning rate
η regularisation penalty
κ hyper-parameter or constant
D training data
ω output class
φ activation function
θ model parameter
M canonical model
ρ correlation coefficient
K filter kernel
Acronyms / Abbreviations
AI Artificial Intelligence
AM Acoustic Model
BN Broadcast News
CE Cross Entropy
CV Cross Validation
LM Language Model
ML Maximum Likelihood
SD Speaker Dependent
SI Speaker Independent
YTB YouTube
Chapter 1
Introduction
Due to constraints on computing resources, early DNN models (LeCun et al., 1990; Waibel et al., 1989) were often evaluated in simple and small configurations. Benefiting from recent advances in computing resources, particularly graphics processing units (GPUs), large DNN models can now be optimised quickly and efficiently on large datasets, in contrast to those early studies. Very deep neural networks, including
the VGG (Simonyan and Zisserman, 2014) and residual networks (He et al., 2016),
have been investigated, and they can consist of tens of layers. In addition, to model
different types of data, a range of network variations have been proposed, including
convolutional neural networks (Krizhevsky et al., 2012; LeCun et al., 1998) (CNNs)
and recurrent neural networks (Bengio et al., 2003; Mikolov et al., 2010) (RNNs).
Although DNN models have achieved promising performance, there are several
issues with them. One is that DNNs are likely to over-fit to training data, which limits
their generalisation to unseen data. Usually, regularisation approaches are used during
training to reduce over-fitting. These include the weight decay and dropout (Srivastava
et al., 2014) methods. In addition, DNNs trained in the standard fashion are usually treated as
“black boxes”, and the highly distributed representations are difficult to interpret
directly. This issue restricts the potential in further network regularisation and model
post-modification.
In the 1970s, hidden Markov models (HMMs), particularly Gaussian mixture model HMMs (GMM-HMMs), were introduced to speech recognition (Baker, 1975; Jelinek, 1976). Since then, statistical methods have dominated this research area. Mathematically, given a sequence of features $\mathbf{x}_{1:T}$, of length $T$, extracted from the raw speech signal,
$$\mathbf{x}_{1:T} = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T, \qquad (1.1)$$
according to Bayes' decision rule, the most likely decoding hypothesis $\hat{\boldsymbol{\omega}}$ is given by
$$\hat{\boldsymbol{\omega}} = \arg\max_{\boldsymbol{\omega}} P(\boldsymbol{\omega}|\mathbf{x}_{1:T}) = \arg\max_{\boldsymbol{\omega}}\, p(\mathbf{x}_{1:T}|\boldsymbol{\omega})P(\boldsymbol{\omega}), \qquad (1.2)$$
where the hypothesis is a word sequence of length $M$,
$$\boldsymbol{\omega}_{1:M} = \omega_1, \omega_2, \ldots, \omega_M, \qquad (1.3)$$
$p(\mathbf{x}_{1:T}|\boldsymbol{\omega})$ is referred to as the acoustic model, and $P(\boldsymbol{\omega})$ is the language model.
In this configuration, the ASR system is described in a generative framework. HMMs
are applied to estimate the acoustic model, p(x1:T |ω). A range of effective extensions
were proposed for the HMM framework in the following decades, including state
tying (Young et al., 1994), discriminative training (Povey and Woodland, 2002), and
speaker/noise adaptation (Gales, 1998; Leggetter and Woodland, 1995). Meanwhile,
ASR systems have also evolved from recognising isolated words to large-vocabulary
continuous speech, from handling clean environments to complex scenarios such as
telephone conversation (Godfrey et al., 1992). Recent progress in integrating deep
learning to HMMs has significantly improved the performance of ASR systems (Dahl
et al., 2012; Deng et al., 2013; Hinton et al., 2012; Seide et al., 2011b). In addition, discriminative models (Cho et al., 2014; Graves et al., 2006; Sutskever et al., 2014), also known as end-to-end models, have been investigated. In these models, neural
networks are used to model the conditional probability P (ω|x1:T ), which is directly
related to the decision rule.
Efforts in the past half century have greatly improved the technology in speech
recognition. Such technology has started to change the lifestyle of human beings and
promote the progress of civilisation. Nevertheless, there are still a number of issues
remaining in such techniques. For instance, effective and rapid adaptation methods to
specific accents and corruptions, such as speech distorted by noise and reverberation,
remain major challenges in modern ASR systems, particularly DNN-based schemes.
They require further exploration in the realm of speech recognition as well as artificial
intelligence.
In recent years, deep neural networks have been widely used in supervised learning
tasks (LeCun et al., 2015). The goal of supervised learning is to infer prediction models
that are able to learn and generalise from a set of training data.
The prediction model maps the input feature vector xt onto the output vector y t ,
representing the predicted “scores” for all classes. Commonly, the scores are described
in a probabilistic fashion, with each score interpreted as the posterior probability of the corresponding class.
The class with the highest score is picked as the desired prediction. This framework is
usually formalised and described using Bayes’ decision rule, that is, the most likely
class ω̂ is given by
$$\hat{\omega} = \arg\max_{\omega} P(\omega|\mathbf{x}_t). \qquad (2.4)$$
Deep neural networks are an effective form of prediction model. The multiple layers of
non-linear processing units in DNNs allow complex data to be well modelled.
This chapter reviews the basic methodology used in deep neural networks and deep learning. It covers typical network architectures, training and optimisation schemes, regularisation methods, and network visualisation and interpretation.
Fig. 2.1 Feed-forward neural network.
Each hidden layer consists of a number of hidden units, with an activation function for each unit. The unit receives signals from the previous layer, processes the
information, and forwards the transformed signals to units in the next layer. Usually,
the units between two successive layers are fully connected. Finally, after a series of
hidden layers, the output layer yields scores for a range of classes.
Formally, given an input feature vector $\mathbf{x}_t$, the activation function inputs $\mathbf{z}_t^{(l)}$ and outputs $\mathbf{h}_t^{(l)}$ of the hidden layers are recursively defined as
$$\mathbf{z}_t^{(l)} = \mathbf{W}^{(l)T}\mathbf{h}_t^{(l-1)} + \mathbf{b}^{(l)}, \quad 1 \le l \le L, \qquad (2.5)$$
$$\mathbf{h}_t^{(l)} = \boldsymbol{\varphi}\!\left(\mathbf{z}_t^{(l)}\right), \quad 1 \le l < L, \qquad (2.6)$$
$$\mathbf{h}_t^{(0)} = \mathbf{x}_t, \qquad (2.7)$$
where L denotes the total number of layers, and φ(·) represents activation functions,
operating on each unit. An affine transformation is applied between two successive
layers associated with parameters W (l) and b(l) . The activation function specifies the
output signal of a hidden unit. It is often modelled as a nonlinear function to allow
the network to derive meaningful feature abstractions. This function can take a range
of forms, and they are discussed in detail in Section 2.2.
The output of the neural network is denoted by $\mathbf{y}_t$. For classification tasks, a softmax function is usually used in the output layer, so that the outputs can be directly interpreted as conditional class probabilities.
The output vector $\mathbf{y}_t$ can be viewed as a soft version of 1-of-K coding, which assigns
the prediction scores for different classes. In practice, the total number of classes may
vary in a large range. Simple tasks such as handwritten digit recognition contain only a
few classes. For modern speech recognition systems, there can be thousands of classes
involved (Dahl et al., 2012).
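A minimal NumPy sketch of this forward computation (Eqs. 2.5–2.7 with a softmax output layer) is given below; the layer sizes and random weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - np.max(z)              # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x, weights, biases):
    """Forward pass of Eqs. 2.5-2.7: h^(0) = x, z^(l) = W^(l)T h^(l-1) + b^(l)."""
    h = x                          # h^(0) = x_t  (Eq. 2.7)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W.T @ h + b            # affine transformation (Eq. 2.5)
        h = softmax(z) if l == len(weights) else sigmoid(z)   # hidden vs output layer
    return h                       # y_t: class posteriors

# Illustrative 3-layer network: 40-dim input, two hidden layers, 10 classes.
rng = np.random.default_rng(0)
dims = [40, 512, 512, 10]
weights = [rng.normal(0, 0.1, (dims[l], dims[l + 1])) for l in range(3)]
biases = [np.zeros(dims[l + 1]) for l in range(3)]
y = forward(rng.normal(size=40), weights, biases)
print(y.shape, y.sum())            # (10,) and probabilities summing to 1
```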
• Convolution: The convolution stage is a core part of CNN models. For example,
in image data, one pixel is highly correlated with its neighbours in the image
grid. Thus, properties in local regions are desired to be captured. In contrast
with fully connected network, the convolution stage of one CNN layer introduces
highly restricted connections to model local properties. It uses $m_l$ kernels for layer $l$, $\mathbf{K}_1^{(l)}, \mathbf{K}_2^{(l)}, \ldots, \mathbf{K}_{m_l}^{(l)}$, with trainable parameters to perform convolution operations,
Fig. 2.2 Typical layer configuration for convolutional neural networks: convolution, detector and pooling stages.
$$\mathbf{Z}_j^{(l)} = \mathbf{H}^{(l-1)*} * \mathbf{K}_j^{(l)}, \qquad (2.9)$$
where $*$ stands for the convolution operation and $\mathbf{H}^{(l-1)*}$ is the grid representation of the previous-layer outputs $\mathbf{h}^{(l-1)}$. Usually, $\mathbf{H}^{(l-1)*}$ is modelled as a 3D tensor. The convolution outputs $\mathbf{Z}_1^{(l)}, \mathbf{Z}_2^{(l)}, \ldots, \mathbf{Z}_{m_l}^{(l)}$ are then treated as “slices” and compounded to form a large tensor $\mathbf{Z}^{(l)*}$.
Equivalently, the convolution stage can be viewed as an affine transformation,
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)T}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}, \qquad (2.10)$$
where $\mathbf{z}^{(l)}$ is the original vector representation of $\mathbf{Z}^{(l)*}$. However, compared with fully connected layers in feed-forward DNNs, the transformation matrix $\mathbf{W}^{(l)}$ of a CNN layer is a very sparse matrix, where most elements are zero. Only a restricted number of tied parameters are introduced to focus on local regions. This allows CNN models to be robustly and efficiently trained.
• Detector: The outputs after the convolution operations are then run through non-linear activation functions,
$$\tilde{\mathbf{h}}^{(l)} = \boldsymbol{\varphi}\!\left(\mathbf{z}^{(l)}\right). \qquad (2.11)$$
This stage is usually referred to as the detector stage, which generates a range of high-level feature detectors. The outputs of the activation functions, $\tilde{\mathbf{h}}^{(l)}$, are subsequently used in the final pooling stage.
• Pooling: The pooling stage summarises the detector outputs over local regions. A common choice is max pooling,
$$h_i^{(l)} = \max_{j\in\mathcal{G}_i}\tilde{h}_j^{(l)}, \qquad (2.12)$$
where $\mathcal{G}_i$ stands for the index set of units of a rectangular region in the grid. Other popular pooling functions include the average, L2 norm, and Gaussian blur functions, which can be performed similarly on a rectangular neighbourhood.
CNN layers are usually combined with fully connected layers to obtain powerful DNN
models. For instance, AlexNet (Krizhevsky et al., 2012) introduces five CNN layers and
three fully connected layers; GoogLeNet (Szegedy et al., 2015) consists of 22 hidden
layers of different types.
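The three stages of a CNN layer can be illustrated with a small NumPy sketch; the single-channel input, 3×3 kernels and 2×2 pooling regions are illustrative choices, and the convolution is implemented as cross-correlation, as is common in practice.

```python
import numpy as np

def conv2d_valid(H, K):
    """Valid 2D cross-correlation of an input grid H with a kernel K (cf. Eq. 2.9)."""
    ih, iw = H.shape
    kh, kw = K.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(H[r:r + kh, c:c + kw] * K)
    return out

def max_pool(A, size=2):
    """Non-overlapping max pooling over size x size regions (cf. Eq. 2.12)."""
    h, w = A.shape[0] // size, A.shape[1] // size
    return A[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
H = rng.normal(size=(28, 28))                             # previous-layer grid
kernels = [rng.normal(size=(3, 3)) for _ in range(4)]     # m_l = 4 kernels
Z = [conv2d_valid(H, K) for K in kernels]                 # convolution stage
Htilde = [np.maximum(0.0, z) for z in Z]                  # detector stage (ReLU)
pooled = [max_pool(h) for h in Htilde]                    # pooling stage
print(pooled[0].shape)                                    # (13, 13)
```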
Fig. 2.3 Recurrent neural network. The loop design allows the network to be unfolded
to process sequential data of variable length.
A key challenge with such data is its variable length, which means that a fixed-length
input layer cannot be utilised directly. To resolve this issue, an internal “memory”
mechanism is modelled in RNNs, which can process sequences of variable lengths.
Usually, the RNN is used to model P (ω 1:T |x1:T ), where the input feature x1:T
and the output class ω 1:T are of variable length (denoted by T ). The probability
P (ω 1:T |x1:T ) is approximated by
$$P(\boldsymbol{\omega}_{1:T}|\mathbf{x}_{1:T}) \simeq \prod_{t=1}^{T} P(\omega_t|\mathbf{x}_{1:t}), \qquad (2.13)$$
where P (ωt |x1:t ) is recursively commuted by the RNN. An example of RNN with one
hidden layer is illustrated in Figure 2.3. In contrast to feed-forward neural networks, a
connection loop is designed on the RNN architecture. This design allows the network to
be unfolded to handle the variable-length issue. The loop in the hidden layer recurrently
feeds a history vector $\mathbf{v}_{t-1}$, i.e. the delayed activation outputs $\mathbf{h}_{t-1}^{(1)}$ at time $t-1$, into the input layer at time $t$. In this way, the hidden layer can represent information both from the current input feature and the history vectors:
$$\mathbf{h}_t^{(1)} = \boldsymbol{\varphi}\!\left(\mathbf{z}_t^{(1)}\right), \qquad (2.14)$$
$$\mathbf{z}_t^{(1)} = \mathbf{W}^{(1)T}\mathbf{x}_t + \mathbf{R}^{(1)T}\mathbf{v}_{t-1} + \mathbf{b}^{(1)} = \mathbf{W}^{(1)T}\mathbf{x}_t + \mathbf{R}^{(1)T}\mathbf{h}_{t-1}^{(1)} + \mathbf{b}^{(1)}, \qquad (2.15)$$
where $\mathbf{R}^{(1)}$ stands for the additional parameters of the recurrent connections. The history
vector v t−1 encodes a temporal representation for all past inputs, so effective history
information can be retained. The output at time $t$ is computed as
$$P(\omega_t|\mathbf{x}_{1:t}) \simeq P(\omega_t|\mathbf{x}_t, \mathbf{v}_{t-1}), \qquad (2.16)$$
which approximates the probability conditioned on the full past history $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{t-1}$ by the history vector $\mathbf{v}_{t-1}$.
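A minimal sketch of this recurrence (Eqs. 2.14–2.15) for a single hidden layer, assuming a tanh activation and illustrative dimensions:

```python
import numpy as np

def rnn_forward(x_seq, W, R, b):
    """Simple RNN hidden layer: h_t = tanh(W^T x_t + R^T h_{t-1} + b) (Eqs. 2.14-2.15)."""
    h = np.zeros(R.shape[0])                   # initial history vector
    hidden_states = []
    for x_t in x_seq:
        z_t = W.T @ x_t + R.T @ h + b          # Eq. 2.15
        h = np.tanh(z_t)                       # Eq. 2.14 (tanh activation assumed)
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
d_in, d_hid, T = 39, 128, 50
W = rng.normal(0, 0.1, (d_in, d_hid))
R = rng.normal(0, 0.1, (d_hid, d_hid))
b = np.zeros(d_hid)
states = rnn_forward(rng.normal(size=(T, d_in)), W, R, b)
print(states.shape)                            # (50, 128)
```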
The concept of recurrent units can be implemented in a variety of ways. Deep
RNNs (Pascanu et al., 2013) introduce recurrent units in multiple hidden layers. Rather
than capturing information only from past history, bidirectional RNNs (Graves et al.,
2013b; Schuster and Paliwal, 1997) combine information both moving forward and
backward through time to yield a prediction depending on the whole input sequence.
Another RNN generalisation is recursive neural networks (Socher et al., 2011). Instead
of sequential data with chain dependencies, they are designed to process tree-structured
data, such as syntactic trees.
RNN models are usually trained using gradient-based algorithms (see Section 2.3.2),
which require the calculation of parameter gradients. One challenge in training RNNs
is that long-term dependencies (Bengio et al., 1994) cause vanishing gradients, i.e. the
magnitude of the gradients tends to become very small in long sequences, making recurrent
network architectures difficult to optimise. Gated RNNs, such as long short-term
memory (Hochreiter and Schmidhuber, 1997) and gated recurrent units (Chung et al.,
2014), were developed to handle the issue of long-term dependency.
Long short-term memory (LSTM) has shown good performance in many practical
applications, including speech recognition (Graves and Jaitly, 2014; Graves et al.,
2013b). A diagram of an LSTM layer (also known as an LSTM block) is shown in
Figure 2.4. A key modification in the LSTM is the memory cell, which maintains
history information through time. The gates (marked in red) are explicitly designed to
control inward/outward information on the cell. The cell input is scaled by the input
gate, while the forget gate controls what history should be retained. The cell output is
Fig. 2.4 Long short-term memory. Red circles are gate activation functions, blue
circles are input/output activation functions, and black circles stand for element-wise
multiplication.
also dynamically scaled by an output gate. This process can be expressed as follows:
$$\mathbf{u}_t^{(l)} = \boldsymbol{\varphi}_{in}\!\left(\mathbf{W}_i^{(l)T}\mathbf{h}_t^{(l-1)} + \mathbf{R}_i^{(l)T}\mathbf{h}_{t-1}^{(l)} + \mathbf{b}_i^{(l)}\right) \qquad \text{block input} \quad (2.17)$$
$$\mathbf{g}_t^{(l)} = \boldsymbol{\varphi}_{gate}\!\left(\mathbf{W}_g^{(l)T}\mathbf{h}_t^{(l-1)} + \mathbf{R}_g^{(l)T}\mathbf{h}_{t-1}^{(l)} + \mathbf{b}_g^{(l)}\right) \qquad \text{input gate} \quad (2.18)$$
$$\mathbf{f}_t^{(l)} = \boldsymbol{\varphi}_{gate}\!\left(\mathbf{W}_f^{(l)T}\mathbf{h}_t^{(l-1)} + \mathbf{R}_f^{(l)T}\mathbf{h}_{t-1}^{(l)} + \mathbf{b}_f^{(l)}\right) \qquad \text{forget gate} \quad (2.19)$$
$$\mathbf{c}_t^{(l)} = \mathbf{g}_t^{(l)}\otimes\mathbf{u}_t^{(l)} + \mathbf{f}_t^{(l)}\otimes\mathbf{c}_{t-1}^{(l)} \qquad \text{cell state} \quad (2.20)$$
$$\mathbf{o}_t^{(l)} = \boldsymbol{\varphi}_{gate}\!\left(\mathbf{W}_o^{(l)T}\mathbf{h}_t^{(l-1)} + \mathbf{R}_o^{(l)T}\mathbf{h}_{t-1}^{(l)} + \mathbf{b}_o^{(l)}\right) \qquad \text{output gate} \quad (2.21)$$
$$\mathbf{h}_t^{(l)} = \mathbf{o}_t^{(l)}\otimes\boldsymbol{\varphi}_{out}\!\left(\mathbf{c}_t^{(l)}\right) \qquad \text{block output} \quad (2.22)$$
where ⊗ stands for element-wise multiplication, φgate (·), φin (·), and φout (·) are ac-
tivation functions, respectively, on the gate, block input, and output units. Usually,
φgate (·) is modelled as a sigmoid function to perform like “memory gates”. The design
of gating contributes to preserving effective history information over a long period.
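The gating computation of Eqs. 2.17–2.22 can be sketched as follows; peephole connections are omitted and the weight values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step following Eqs. 2.17-2.22 (no peephole connections)."""
    Wi, Ri, bi, Wg, Rg, bg, Wf, Rf, bf, Wo, Ro, bo = params
    u = np.tanh(Wi.T @ x + Ri.T @ h_prev + bi)      # block input (2.17)
    g = sigmoid(Wg.T @ x + Rg.T @ h_prev + bg)      # input gate  (2.18)
    f = sigmoid(Wf.T @ x + Rf.T @ h_prev + bf)      # forget gate (2.19)
    c = g * u + f * c_prev                          # cell state  (2.20)
    o = sigmoid(Wo.T @ x + Ro.T @ h_prev + bo)      # output gate (2.21)
    h = o * np.tanh(c)                              # block output (2.22)
    return h, c

rng = np.random.default_rng(0)
d_in, d_hid = 39, 64
params = []
for _ in range(4):                                  # block input and the three gates
    params += [rng.normal(0, 0.1, (d_in, d_hid)),
               rng.normal(0, 0.1, (d_hid, d_hid)),
               np.zeros(d_hid)]
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(20, d_in)):               # run over a 20-frame sequence
    h, c = lstm_step(x, h, c, params)
print(h.shape)                                      # (64,)
```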
There are a range of extensions on LSTMs, such as bidirectional LSTM (Graves
et al., 2013a) and variations on network connectivity. A peephole connection (Gers
et al., 2002) is a common setting in the latest LSTMs, which connects the cell unit
with different gates to learn precise timing. For deep LSTMs, highway connections
between cells in adjacent layers (Zhang et al., 2016b) are introduced for effective
training.
The simplest choice of activation function is a linear one,
$$\boldsymbol{\varphi}(\mathbf{z}) = \mathbf{W}^T\mathbf{z} + \mathbf{b}. \qquad (2.23)$$
This linear setting makes the overall DNN a model consisting of multiple affine transformations. Since combining multiple affine transformations is identical to a single affine transformation, this DNN model degrades to a simple linear one, which
means that the large number of parameters fails to increase the modelling capacity.
Non-linear activation functions can take a range of forms. There are some common
desirable properties of an appropriate function form. First, it should be non-linear,
as discussed above, to trigger non-trivial models. Second, it should be continuously
(sub-)differentiable3 for direct integration into gradient-based optimisation algorithms.
Third, it should be sufficiently smooth to make the gradient stable. Sigmoid, hyperbolic
tangent, rectified linear unit, maxout, softmax, Hermite polynomial, and radial basis
functions are also discussed in this section.
2 In this section, $\mathbf{z}_t^{(l)}$ is abbreviated as $\mathbf{z}$, where the superscript $l$ (layer index) and subscript $t$ (sample index) are omitted to simplify the notation for the discussion.
3 Some functions may not be differentiable at a few points in their domain. However, at these points, a set of values can be used as gradients, generalising the concept of the derivative. This is referred to as sub-differentiability, and the chosen “gradient” is called a sub-gradient.
Fig. 2.5 The sigmoid and tanh activation functions.
Sigmoid
The sigmoid activation function is a common choice in most neural network configura-
tions, defined as
$$\varphi_i(\mathbf{z}) = \mathrm{sig}(z_i) = \frac{1}{1+\exp(-z_i)}. \qquad (2.24)$$
Figure 2.5 illustrates the plot of a sigmoid activation function. This function has an
“S”-shaped curve, which can be viewed as a soft version of the desired “switch” design:
when zi is very positive, φi (z) is close to 1, and when zi is very negative, φi (z) is near
0.
Hyperbolic Tangent
The tanh activation function is defined as
$$\varphi_i(\mathbf{z}) = \tanh(z_i) = \frac{1-\exp(-2z_i)}{1+\exp(-2z_i)}. \qquad (2.25)$$
As shown in Figure 2.5, the tanh function also has an “S”-shaped curve, but it works
in a different dynamic range, [−1, 1], in contrast with the sigmoid function, [0, 1]. This
form of activation function is closely related to the sigmoid function, since
$$\tanh(z_i) = \frac{1-\exp(-2z_i)}{1+\exp(-2z_i)} = \frac{2}{1+\exp(-2z_i)} - 1 = 2\,\mathrm{sig}(2z_i) - 1. \qquad (2.26)$$
In addition to this analytic expression, the hard tanh function (Collobert, 2004), defined as
$$\varphi_i(\mathbf{z}) = \max\{-1, \min\{1, z_i\}\}, \qquad (2.27)$$
has also been proposed. This function has a similar shape to tanh but consists only of simple algebraic operations, forming a hard “S”-shaped curve, as indicated in its name.
Rectified Linear Unit
The rectified linear unit (ReLU) function (Nair and Hinton, 2010), also known as the ramp function, is defined as
$$\varphi_i(\mathbf{z}) = \max\{0, z_i\}. \qquad (2.28)$$
In its positive half domain, it is identical to a linear function, while it remains at zero
in its negative half domain. An advantage of this simple design of ReLU is that its
sub-gradient can take a very simple form,
$$\frac{\partial\varphi_i}{\partial z_i} = \begin{cases} 0, & z_i \le 0, \\ 1, & z_i > 0. \end{cases} \qquad (2.29)$$
A variation of ReLU introduces a non-zero slope in the negative half domain,
$$\varphi_i(\mathbf{z}) = \max\{\kappa_i z_i, z_i\}. \qquad (2.30)$$
The slope $\kappa_i$ can be either trained as a learnable parameter (He et al., 2015) or tuned in a heuristic fashion (Maas et al., 2013).
Maxout
The maxout activation function (Goodfellow et al., 2013) can be viewed as a generalised
version of the ReLU function. Instead of an element-wise function, it divides z into M
subsets, Z1 , . . . , ZM , with k elements in each. Maxout is then applied to each group,
defined as
$$\varphi_i(\mathbf{z}) = \max_{z\in\mathcal{Z}_i}\{z\}. \qquad (2.31)$$
This form of activation function does not specify the curve shape. Instead, it can
approximate an arbitrary convex function using k linear segments. Therefore, the
maxout function has the capacity to learn an appropriate activation function itself.
This activation function has a similar form to the max pooling (Eq. 2.12). In the
pooling operation, the candidates are usually designed as some neighbours that have
some explicit meaning, e.g., nearby feature detectors in CNNs. In comparison, the
candidates of the maxout function are learned automatically without this kind of
pre-defined physical meaning.
Softmax
The softmax function is commonly used in the output layer of neural network for
classification tasks with multiple classes:
$$\varphi_i(\mathbf{z}) = \frac{\exp\!\left(z_i^{(L)}\right)}{\sum_j\exp\!\left(z_j^{(L)}\right)}. \qquad (2.32)$$
A normalisation term, $\sum_j\exp(z_j^{(L)})$, is introduced, so that the outputs are non-negative and satisfy
$$\sum_i\varphi_i(\mathbf{z}) = 1. \qquad (2.33)$$
The outputs of a softmax layer can therefore be interpreted as a probability distribution over the classes.
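The activation functions discussed above can be sketched directly in NumPy; the leaky-ReLU slope of 0.01 is an illustrative value, and the softmax uses the standard max-shift for numerical stability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))        # Eq. 2.24

def tanh(z):
    return np.tanh(z)                      # Eq. 2.25, equal to 2*sigmoid(2z) - 1

def relu(z):
    return np.maximum(0.0, z)              # Eq. 2.28

def leaky_relu(z, kappa=0.01):
    return np.where(z > 0, z, kappa * z)   # non-zero slope in the negative half domain

def softmax(z):
    e = np.exp(z - np.max(z))              # shift by max(z) for numerical stability
    return e / e.sum()                     # non-negative outputs summing to one (Eq. 2.33)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), relu(z), softmax(z).sum(), sep="\n")
```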
Hermite Polynomial
The Hermite polynomial activation function is defined as a weighted sum of Hermite orthonormal functions,
$$\varphi_i(\mathbf{z}) = \sum_{r=1}^{R} c_{ir}\,g_r(z_i), \qquad (2.35)$$
where $\mathbf{c}_i$ are the parameters associated with this activation function, $R$ is the degree of the Hermite polynomial, and $g_r(z_i)$ is the $r$-th Hermite orthonormal function, recursively defined as
$$g_r(z_i) = \kappa_r G_r(z_i)\psi(z_i), \qquad (2.36)$$
where
$$\kappa_r = (r!)^{-\frac{1}{2}}\,\pi^{\frac{1}{4}}\,2^{-\frac{r-1}{2}}, \qquad (2.37)$$
$$\psi(z_i) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{z_i^2}{2}\right), \qquad (2.38)$$
$$G_r(z_i) = \begin{cases} 2z_i G_{r-1}(z_i) - 2(r-1)G_{r-2}(z_i), & r > 1, \\ 2z_i, & r = 1, \\ 1, & r = 0. \end{cases} \qquad (2.39)$$
Radial Basis Function
The radial basis function (RBF) is defined as
$$\varphi_i(\mathbf{z}) = \exp\!\left(-\frac{1}{\sigma_i^2}\|\mathbf{z}-\mathbf{c}_i\|_2^2\right), \qquad (2.40)$$
where σi and ci are the activation function parameters. This activation function defines
a desired template ci , and it becomes more active as z approaches the template. Neural
networks using this form of activation functions are commonly referred to as RBF
networks (Orr et al., 1996).
The cross-entropy (CE) criterion4 measures the divergence between the reference and predicted class distributions,
$$\mathcal{L}_{ce}(\boldsymbol{\theta};\mathcal{D}) = -\frac{1}{|\mathcal{D}|}\sum_{t=1}^{|\mathcal{D}|}\sum_{i} P_t^{ref}(i)\log P(i|\mathbf{x}_t), \qquad (2.41)$$
where, for the $t$-th training sample, $P(i|\mathbf{x}_t)$ stands for the predicted distribution, and $P_t^{ref}(i)$ represents the reference distribution. In the context of neural networks, $P(i|\mathbf{x}_t)$ is given by the network output, $y_{ti}$. Usually, a hard target label $\omega_t$ is given for each training sample with no uncertainty. Thus, the reference distribution $P_t^{ref}(i)$ can be expressed as
$$P_t^{ref}(i) = \begin{cases} 1, & C_i = \omega_t, \\ 0, & C_i \ne \omega_t, \end{cases} \qquad (2.42)$$
and the criterion simplifies to
$$\mathcal{L}_{ce}(\boldsymbol{\theta};\mathcal{D}) = -\frac{1}{|\mathcal{D}|}\sum_{t=1}^{|\mathcal{D}|}\log P(\omega_t|\mathbf{x}_t). \qquad (2.43)$$
4 Rather than training criteria, similar terminologies such as objective, loss, or cost functions are also used by different people. Training criteria can be either maximised or minimised. To maintain consistency in this thesis, a training criterion is defined as a function to minimise.
Gradient descent iteratively updates the model parameters,
$$\boldsymbol{\theta}^{(\tau+1)} = \boldsymbol{\theta}^{(\tau)} - \epsilon\left.\frac{\partial\mathcal{L}}{\partial\boldsymbol{\theta}}(\boldsymbol{\theta};\mathcal{D})\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(\tau)}}, \qquad (2.44)$$
where the hyper-parameter ϵ is referred to as the learning rate, which decides the step
size of this update. At iteration τ + 1, the parameters take a step proportional to
the negative gradient direction at iteration τ , resulting in a decrease of the training
criterion.
Other than gradient-based methods, a broad range of algorithms have been studied
to train DNN parameters. Second-order methods, such as Newton and quasi-Newton
methods (Bishop, 1995), utilise statistics from second-order derivatives to update the
parameters. Such schemes require higher computational complexity in both time and
space, but they can yield better local minima with fewer update iterations. In addition,
Hessian-free methods (Kingsbury et al., 2012; Martens, 2010) have also been investigated.
This thesis focuses on gradient descent, particularly stochastic gradient descent, which
has been widely used in DNN training.
For example, without randomisation, successive samples in a mini-batch may come from the same speaker or environmental condition, and the update on this mini-batch would degrade generalisation to arbitrary acoustic conditions.
Learning Rate
The learning rate in SGD determines how much the parameters are changed in one
update. If a large learning rate is used, training is likely to fluctuate and skip “good”
local minima. If it is too small, training will be very slow to converge, or it will fall
into a poor local minimum of the training criterion error surface. Several empirical and
heuristic approaches have been proposed. These approaches adaptively change the
learning rate to improve the performance of SGD training. For instance, the NewBob
method (Renals et al., 1991) adaptively determines the learning rate according to
temporary system performance during training. Decay methods (Bottou, 2010; Xu,
2011) reduce the learning rate gradually after each SGD update. Alternative methods
associate an individual learning rate with each parameter and adjust them according
to heuristic rules (Duchi et al., 2011; Riedmiller and Braun, 1993).
Momentum
The momentum method accumulates an exponentially decaying average of past gradients to smooth the parameter updates,
$$\boldsymbol{\theta}^{(\tau+1)} = \boldsymbol{\theta}^{(\tau)} - \bar{\boldsymbol{\Delta}}^{(\tau+1)}, \qquad (2.45)$$
$$\bar{\boldsymbol{\Delta}}^{(\tau+1)} = \kappa\bar{\boldsymbol{\Delta}}^{(\tau)} + \epsilon\left.\frac{\partial\mathcal{L}}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}=\boldsymbol{\theta}^{(\tau)}}, \quad \tau \ge 0, \qquad (2.46)$$
$$\bar{\boldsymbol{\Delta}}^{(0)} = \mathbf{0}, \qquad (2.47)$$
where $\kappa$ controls how quickly the contribution of previous gradients decays.
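A sketch of the momentum update of Eqs. 2.45–2.47, applied to a toy quadratic criterion with noisy gradients standing in for mini-batch estimates; the learning rate and momentum factor are illustrative.

```python
import numpy as np

def sgd_momentum_step(theta, grad, velocity, lr=0.01, kappa=0.9):
    """One update of Eqs. 2.45-2.47: velocity accumulates past gradients, scaled by kappa."""
    velocity = kappa * velocity + lr * grad      # Eq. 2.46 (with Delta^(0) = 0)
    theta = theta - velocity                     # Eq. 2.45
    return theta, velocity

# Minimise a toy quadratic criterion L(theta) = 0.5 * ||theta||^2 on noisy gradients.
rng = np.random.default_rng(0)
theta = rng.normal(size=10)
velocity = np.zeros_like(theta)
for step in range(100):
    grad = theta + 0.01 * rng.normal(size=10)    # noisy gradient, as in SGD mini-batches
    theta, velocity = sgd_momentum_step(theta, grad, velocity)
print(np.linalg.norm(theta))                     # close to zero after training
```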
For the hidden layers, the error can be recursively propagated backwards through the network,
$$\frac{\partial\mathcal{L}}{\partial\mathbf{h}_t^{(l)}} = \mathbf{W}^{(l+1)}\mathbf{D}_t^{(l+1)}\frac{\partial\mathcal{L}}{\partial\mathbf{h}_t^{(l+1)}}, \quad 1 \le l < L, \qquad (2.50)$$
where $\mathbf{D}_t^{(l)}$ is a matrix representing the gradient of the activation function,
$$d_{tij}^{(l)} = \frac{\partial h_{ti}^{(l)}}{\partial z_{tj}^{(l)}} = \varphi_i'\!\left(z_{tj}^{(l)}\right). \qquad (2.51)$$
This derivative depends on the choice of activation function. For simple forms such as the sigmoid and tanh functions, $\mathbf{D}_t^{(l)}$ is a diagonal matrix. The backward step
determines the error for each unit in the network, and by using Eq. 2.50, the gradient
with respect to network parameters can be written as
$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{(l)}} = \sum_t \mathbf{h}_t^{(l-1)}\left(\mathbf{D}_t^{(l)}\frac{\partial\mathcal{L}}{\partial\mathbf{h}_t^{(l)}}\right)^T, \qquad (2.52)$$
$$\frac{\partial\mathcal{L}}{\partial\mathbf{b}^{(l)}} = \sum_t \mathbf{D}_t^{(l)}\frac{\partial\mathcal{L}}{\partial\mathbf{h}_t^{(l)}}. \qquad (2.53)$$
As shown in Eq. 2.52 and 2.53, the gradient calculation can be performed in a recursive
way, according to the network topology. The error back-propagation algorithm reveals
a simple and efficient way to calculate parameter gradients.
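As an illustration of this recursive calculation, the following sketch computes the gradients of a single-hidden-layer sigmoid network with a softmax output and CE criterion, following the $\mathbf{W}^{(l)T}\mathbf{h}$ convention of Eq. 2.5; the sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_sample(x, label, W1, b1, W2, b2):
    """Gradients of the CE criterion for one sample, via the chain rule (cf. Eqs. 2.50-2.53)."""
    # Forward pass (Eqs. 2.5-2.6).
    z1 = W1.T @ x + b1
    h1 = sigmoid(z1)
    z2 = W2.T @ h1 + b2
    y = np.exp(z2 - z2.max()); y /= y.sum()          # softmax output

    # Backward pass.
    dz2 = y.copy(); dz2[label] -= 1.0                # dL/dz2 for softmax + cross-entropy
    dW2 = np.outer(h1, dz2)                          # dL/dW2 (cf. Eq. 2.52)
    db2 = dz2                                        # dL/db2 (cf. Eq. 2.53)
    dh1 = W2 @ dz2                                   # error propagated to the hidden layer (cf. Eq. 2.50)
    dz1 = dh1 * h1 * (1.0 - h1)                      # sigmoid derivative, diagonal D_t (cf. Eq. 2.51)
    dW1 = np.outer(x, dz1)
    db1 = dz1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (20, 50)), np.zeros(50)
W2, b2 = rng.normal(0, 0.1, (50, 5)), np.zeros(5)
grads = backprop_single_sample(rng.normal(size=20), 2, W1, b1, W2, b2)
print([g.shape for g in grads])                      # [(20, 50), (50,), (5,), (50, 5)]
```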
Random Initialisation
The network weights of layer $l$ can be drawn from a uniform distribution $\mathcal{U}(\cdot,\cdot)$ whose range is set according to $N_{in}^{(l)}$ and $N_{out}^{(l)}$, the input and output dimensions of layer $l$. In this way, the activation function inputs are initialised to evenly cover the interval $[-6, 6]$, which prevents the outputs from getting too close to either 0 or 1.
6 At the time of writing, the latest DNN systems on large datasets no longer require pre-training, such as deep ReLU networks (Glorot et al., 2011). However, pre-training plays a useful role with smaller datasets and is related to experimental settings in this thesis.
Generative Pre-training
Generative pre-training initialises the hidden layers using restricted Boltzmann machines (RBMs). An RBM is an undirected model of observed (visible) variables $\mathbf{v}$ and unobserved (hidden) variables $\mathbf{u}$, defined through an energy function7,
$$E(\mathbf{v},\mathbf{u}) = -\mathbf{c}_{rbm}^T\mathbf{v} - \mathbf{b}_{rbm}^T\mathbf{u} - \mathbf{v}^T\mathbf{W}_{rbm}\mathbf{u}, \qquad (2.56)$$
where $\mathbf{W}_{rbm}$, $\mathbf{b}_{rbm}$, and $\mathbf{c}_{rbm}$ are RBM parameters. In terms of the energy concept,
the joint probability of observed and unobserved variables can be expressed as
$$P(\mathbf{v},\mathbf{u}) = \frac{\exp(-E(\mathbf{v},\mathbf{u}))}{\sum_{\tilde{\mathbf{v}},\tilde{\mathbf{u}}}\exp(-E(\tilde{\mathbf{v}},\tilde{\mathbf{u}}))}. \qquad (2.57)$$
In practice, RBMs can be efficiently trained by maximising the log likelihood using the contrastive divergence algorithm (Hinton, 2002).
Generative pre-training is performed by stacking RBMs. From lower to upper, any
two adjacent layers h(l) and h(l+1) , prior to the output layer, are trained as an RBM.
By rewriting Eq. 2.57, the conditional probability of u given v can yield an interesting
form,
$$P(u_i = 1|\mathbf{v}) = \mathrm{sig}\!\left(\mathbf{w}_{rbm,i}^T\mathbf{v} + b_{rbm,i}\right). \qquad (2.58)$$
RBM parameters W rbm and brbm can then be used to initialise the transformation
matrix W (l) and bias vector b(l) on layer l. Finally, a randomly initialised output
layer is added to the top of stacked RBMs. An advantage of generative pre-training
is that it is performed in an unsupervised way, requiring no labels for training data.
Especially for resource-limited tasks, this scheme allows unlabelled data to be utilised
for parameter initialisation.
7 A simple configuration of RBM is presented here, where u and v are defined as binary variables.
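A sketch of one CD-1 update for a binary RBM under the energy function of Eq. 2.56; the learning rate, layer sizes and synthetic binary data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, lr=0.05, rng=np.random.default_rng(0)):
    """One CD-1 step for a binary RBM with energy -c^T v - b^T u - v^T W u (Eq. 2.56)."""
    # Positive phase: hidden probabilities given the data (Eq. 2.58).
    p_u0 = sigmoid(W.T @ v0 + b)
    u0 = (rng.random(p_u0.shape) < p_u0).astype(float)
    # Negative phase: one step of Gibbs sampling to reconstruct the visible units.
    p_v1 = sigmoid(W @ u0 + c)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_u1 = sigmoid(W.T @ v1 + b)
    # Approximate likelihood gradient: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v0, p_u0) - np.outer(v1, p_u1))
    b += lr * (p_u0 - p_u1)
    c += lr * (v0 - v1)
    return W, b, c

rng = np.random.default_rng(0)
n_vis, n_hid = 100, 50
W = rng.normal(0, 0.01, (n_vis, n_hid))
b, c = np.zeros(n_hid), np.zeros(n_vis)
for v in (rng.random((200, n_vis)) < 0.2).astype(float):   # unlabelled binary "data"
    W, b, c = cd1_update(v, W, b, c)
```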
Discriminative Pre-training
The training criteria of generative pre-training and the original task are usually different.
To overcome the criterion inconsistency, discriminative pre-training (Bengio et al.,
2007) initialises DNNs using the same training criterion as the original task. The basic
idea is to construct DNNs in a greedy way: a network with fewer layers is sensibly
trained at first, and new layers are added to the top of this shallow network. This
strategy designs a “curriculum” in DNN training: primitive representations are learned
in lower layers, and high-level representations can then be derived from them.
A layer-wise discriminative pre-training framework (Seide et al., 2011a) is illustrated
in Algorithm 2. For each iteration, a new layer l with associated parameters W (l) and
b(l) is added to the network configuration. A temporary last layer with parameters
W last and blast is also introduced. This temporary DNN is then updated for one
iteration. Usually, a relatively large learning rate is used in this pre-training phase,
which drives parameters close to a good local minimum.
Discriminative pre-training can also be performed using related tasks rather than
the original one. In speech recognition, Zhang and Woodland (2015a) initialised DNNs
on an easier context-independent phoneme task for more difficult context-dependent
ones. The autoencoder (Vincent et al., 2008), which trains a DNN to reconstruct its own input features, is another initialisation strategy that has been shown to yield useful initial high-level representations.
2.4 Regularisation
One crucial concern in machine learning is how well a model generalises to unseen data rather than just the training data. Regularisation methods are commonly used to improve generalisation and reduce over-fitting. Many forms of regularisation, such as L2 regularisation (Bishop, 1995), can be described in a framework that explicitly adds a regularisation term $\mathcal{R}(\boldsymbol{\theta};\mathcal{D})$ to the overall training criterion $\mathcal{F}(\boldsymbol{\theta};\mathcal{D})$,
$$\mathcal{F}(\boldsymbol{\theta};\mathcal{D}) = \mathcal{L}(\boldsymbol{\theta};\mathcal{D}) + \eta\,\mathcal{R}(\boldsymbol{\theta};\mathcal{D}), \qquad (2.59)$$
where $\eta$ is the regularisation penalty.
Parameter norm penalties (Bishop, 1995; Tibshirani, 1996) are a common form of
regularisation for machine learning algorithms. The regularisation term is defined as
$$\mathcal{R}(\boldsymbol{\theta};\mathcal{D}) = \frac{1}{p}\|\boldsymbol{\theta}\|_p^p. \qquad (2.60)$$
This regularisation encourages a small parameter norm during training. In the context
of neural networks, it forces the weights of the multiple affine transformations to “decay”
towards zero. Intuitively, this regularisation causes the network to prefer small numbers
of active parameters. Large numbers of active parameters will only be allowed if they
considerably improve the original training criterion $\mathcal{L}(\boldsymbol{\theta};\mathcal{D})$. It can be viewed as a way to balance the number of active parameters against minimising $\mathcal{L}(\boldsymbol{\theta};\mathcal{D})$.
In practice, the norm degree p is usually set to 2 or 1, referred to as L2 or L1
regularisation. The L2 regularisation (Bishop, 1995; Woodland, 1989), also known as
weight decay or Tikhonov regularisation, penalises the sum of the squares of individual
parameters,
$$\mathcal{R}(\boldsymbol{\theta};\mathcal{D}) = \frac{1}{2}\|\boldsymbol{\theta}\|_2^2 = \frac{1}{2}\sum_i\theta_i^2. \qquad (2.62)$$
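A small sketch of how the L2 penalty of Eq. 2.62 enters the overall criterion of Eq. 2.59 and its gradient; the penalty value of 1e-4 is illustrative.

```python
import numpy as np

def l2_regularised_criterion(loss, theta, eta=1e-4):
    """Overall criterion F = L + eta * R with R = 0.5 * ||theta||^2 (Eqs. 2.59, 2.62)."""
    penalty = 0.5 * sum(np.sum(w ** 2) for w in theta)
    return loss + eta * penalty

def l2_gradient(grad_loss, theta, eta=1e-4):
    """Gradient of the regularised criterion: the extra term eta*theta 'decays' the weights."""
    return [g + eta * w for g, w in zip(grad_loss, theta)]

theta = [np.ones((3, 3)), np.ones(3)]
grads = [np.zeros((3, 3)), np.zeros(3)]
print(l2_regularised_criterion(1.0, theta), l2_gradient(grads, theta)[1])
```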
Fig. 2.6 Multi-task learning for DNN models with shared lower layers.
Multi-task Learning
Multi-task learning (Caruana, 1997) introduces a set of auxiliary tasks along with the
primary one for regularisation. The primary and auxiliary tasks are usually related
to each other. This approach improves generalisation by using information in the
training signals from related tasks as “induction”. Induction is commonly achieved by
introducing a set of shared parameters across different tasks. Shared parameters are
trained to operate well on multiple tasks, reducing the risk of over-fitting to a specific
task.
For DNN models, a common form of multi-task learning is illustrated in Figure 2.6.
The whole model is generally divided into two categories of parameters,
1. Shared parameters (marked in red): The lower layers of the neural networks
are shared across different tasks. According to studies that examine neural
network behaviours (Yosinski et al., 2015; Zeiler and Fergus, 2013), lower layers
concentrate on primitive abstractions of the raw input features, and such relatively raw information is more likely to be shared across different tasks than the representations in upper layers.
2. Task-specific parameters: The upper layers, including the output layers, are kept separate for each task, modelling the abstractions specific to that task.
This framework allows primary and auxiliary tasks to be jointly optimised; regularisation is achieved through the set of shared parameters during training. Several multi-task
Fig. 2.7 A “thinned” neural network produced by dropout. Crossed units are dropped.
approaches have been proposed for DNN models. An example in speech recognition
is multi-lingual neural networks (Heigold et al., 2013), which extract generalised,
cross-language, hidden representations as features for recognition systems. It helps to
solve the data scarcity issue and reduce the performance gap between resource-rich
and resource-limited languages.
Dropout
Dropout (Srivastava et al., 2014) regularises the network by randomly omitting hidden units during training. On layer $l$, a mask vector $\mathbf{r}^{(l)}$ of independent Bernoulli random variables is drawn,
$$r_i^{(l)} \sim \mathrm{Bernoulli}(\kappa), \qquad (2.64)$$
where $\kappa$ is the probability of retaining a unit. The activation outputs are masked and propagated to the next layer,
$$\tilde{\mathbf{h}}^{(l)} = \mathbf{r}^{(l)}\otimes\mathbf{h}^{(l)}, \qquad (2.65)$$
$$\mathbf{z}^{(l+1)} = \mathbf{W}^{(l+1)T}\tilde{\mathbf{h}}^{(l)} + \mathbf{b}^{(l+1)}. \qquad (2.66)$$
The uncertainty in the vector r (l) yields a range of “thinned” sub-network configurations.
The training algorithm using dropout regularisation follows the feed-forward topol-
ogy described in Eq. 2.64 to 2.66. In the propagation phase, on the l-th layer, it
starts by drawing a sample of $\mathbf{r}^{(l)}$; a range of units are then temporarily dropped out, and the outputs of the retained units are propagated to the following layer. In the back-propagation phase, only the parameters associated with the retained units are
updated accordingly. At test time, it is often not feasible to explicitly generate and
combine all network configurations. Instead, an approximate averaging method can
be applied, where an overall network is used without dropout operations. Activation
function outputs are given as the expectation,
$$\tilde{\mathbf{h}}_{test}^{(l)} = \mathbb{E}_{\mathbf{r}^{(l)}}\!\left[\mathbf{r}^{(l)}\otimes\mathbf{h}^{(l)}\right] = \kappa\,\mathbf{h}^{(l)}. \qquad (2.67)$$
The output of any hidden unit is scaled by the factor $\kappa$, the retention probability. This scaling approach can be viewed as a combination of $2^{Ln}$ neural networks with shared parameters, where $n$ is the total number of hidden units in one layer. In practice, this efficient averaging method can improve generalisation and avoid sampling an infeasible number of networks.
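The train/test asymmetry of dropout (Eqs. 2.64–2.67) can be sketched as follows, with an illustrative retention probability of 0.8.

```python
import numpy as np

def dropout_train(h, kappa, rng):
    """Training pass: mask activations with Bernoulli(kappa) retention (Eqs. 2.64-2.65)."""
    r = (rng.random(h.shape) < kappa).astype(h.dtype)
    return r * h

def dropout_test(h, kappa):
    """Test pass: no sampling, scale by the retention probability kappa (Eq. 2.67)."""
    return kappa * h

rng = np.random.default_rng(0)
h = rng.random(1000)                 # toy activation outputs
kappa = 0.8
print(dropout_train(h, kappa, rng).mean(), dropout_test(h, kappa).mean())  # similar in expectation
```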
Early Stopping
When training a large model, particularly a DNN, the training criterion on the training
set often decreases consistently, but on some “held-out” cross validation (CV) data (not
used for training), the criterion increases at later iterations. This phenomenon indicates
over-fitting on training data. A model with lower validation set error, hopefully with
lower generalisation error, can be obtained using fewer iterations of parameter updates.
This strategy is referred to as early stopping.
This early stopping strategy is widely used in training DNN models. The full
dataset is randomly split into two sets, a training set and a validation set. Usually,
the validation set contains a small portion, such as 5% to 10%, of the full data. Note
that there is no overlap between the validation and training sets. A basic framework is
illustrated in Algorithm 3. The model parameters are only updated on the training set.
This strategy tracks the training criterion, i.e. error, on the validation set. Once the
validation error begins to increase, the training procedure terminates. Early stopping
is a very simple form of regularisation. Unlike parameter norm penalties or dropout, it
requires little modification to the underlying training algorithm.
There are a number of variations to implement early stopping. An example is
the widely used NewBob training scheduler (Renals et al., 1991), which dynamically
decreases the learning rate when the validation error rises. It helps to adapt and find
an appropriate learning rate for a specific task.
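A minimal sketch of such an early-stopping loop is shown below; it is not Algorithm 3 verbatim, and train_one_epoch and validation_error are hypothetical callbacks supplied by the user.

```python
# A minimal early-stopping loop (cf. the strategy described above, not Algorithm 3 verbatim).
# `train_one_epoch` and `validation_error` are hypothetical callbacks supplied by the user.
def train_with_early_stopping(params, train_one_epoch, validation_error, max_epochs=50):
    best_params, best_error = params, validation_error(params)
    for epoch in range(max_epochs):
        params = train_one_epoch(params)          # update parameters on the training set only
        error = validation_error(params)          # monitor the criterion on held-out CV data
        if error >= best_error:                   # validation error stopped improving
            return best_params                    # keep the best model seen so far
        best_params, best_error = params, error
    return best_params
```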
Data Augmentation
For large models such as DNNs, collecting more training data is a direct way to improve
generalisation. However, in practice, collecting more training data can be expensive and impractical. Alternatively, appropriate fake samples can be generated instead
of collecting real ones. Data augmentation introduces fake data to the training set to
regularise network training.
For machine learning tasks, particularly classification ones, data augmentation
can be simply performed via creating new samples from a real sample (x, y), through
domain-specific transformations on features x and keeping the target y fixed. In
computer vision, such transformations includes cropping, brightness changing, rotation,
and scaling on image data. Also, in speech recognition, data augmentation on audio
features, such as vocal tract length perturbation, stochastic feature mapping (Cui et al.,
2015b), and tempo/speed perturbing (Ko et al., 2015), can help to improve model
generalisation.
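As an illustration, the following sketch generates speed-perturbed copies of a waveform by simple resampling; the perturbation factors and toy signal are illustrative.

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Create a 'fake' training sample by resampling the waveform in time.

    factor > 1 speeds the audio up (shorter), factor < 1 slows it down; the
    transcription target is kept fixed. Factors such as 0.9 and 1.1 are common choices.
    """
    n_out = int(len(waveform) / factor)
    old_idx = np.arange(len(waveform))
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, old_idx, waveform)

rng = np.random.default_rng(0)
audio = rng.normal(size=16000)                    # one second of audio at 16 kHz (toy signal)
augmented = [speed_perturb(audio, f) for f in (0.9, 1.1)]
print([len(a) for a in augmented])                # [17777, 14545]
```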
2.5 Visualisation
Multi-layer transformations and non-linear activation functions in DNN models con-
tribute to modelling complex data. However, there is no explicit interpretation of how they process, manipulate, and transform raw features into useful high-level abstractions. A
DNN remains a “black box”, and the lack of interpretation restricts the potential for
further network improvement and post-modification. Research on network visualisation
provides qualitative comparisons of the representations learned in different layers. The representations are inverted and visualised in the input space, illustrating what the corresponding activation functions respond to.
Maximising Activation
One approach is to find the input pattern that maximises the activation of a particular hidden unit,
$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} h_i^{(l)}, \quad \text{subject to } \|\mathbf{x}\|_2^2 = \kappa, \qquad (2.68)$$
which can be optimised by gradient ascent on the input. The required gradient is obtained via the chain rule,
$$\frac{\partial h_i^{(l)}}{\partial\mathbf{x}} = \mathbf{W}^{(1)}\frac{\partial h_i^{(l)}}{\partial\mathbf{z}^{(1)}}, \qquad (2.69)$$
and the optimised input $\hat{\mathbf{x}}$ can then be visualised in the input space.
2.6 Summary
This chapter reviews the fundamentals of neural networks and deep learning. It begins
with basic network architectures, and three typical forms of DNN are presented:
feed-forward neural networks, convolutional neural networks, and recurrent neural
networks. Activation functions, a key element of neural networks for deriving meaningful non-linear representations, are then reviewed. They include the sigmoid, tanh, ReLU, maxout, softmax, Hermite polynomial, and RBF functions. The
following section describes the training and optimisation of DNN parameters. Training
criteria are described first, which specify the overall goal of training. The basic cross-
entropy criterion is presented as an example. Parameter optimisation focuses on
practical techniques, including stochastic gradient descent, learning rate adjustment,
and momentum. It is followed by the error back-propagation algorithm: according to
the chain rule, gradient calculation can be performed in a simple but efficient way. The
highly complex model topology in neural networks makes training difficult and causes it to fall into poor local minima. Therefore, parameters should be appropriately initialised
to alleviate such issues. The so-called “pre-training” schemes, in both generative and
discriminative fashions, are reviewed. Another crucial issue in training is over-fitting.
When a large number of parameters are introduced, DNN models are likely to over-fit
to training data. Regularisation is commonly used to reduce the risk of over-fitting
and improve generalisation. Several regularisation strategies are presented, such as
parameter norm penalties, multi-task learning, dropout, early stopping, and data
augmentation. The last section of this chapter reviews the methodology of network
visualisation, which aims at analysing network behaviour, according to the visualisation
and interpretation of hidden-layer representations.
Chapter 3
The objective of automatic speech recognition is to generate the correct word sequence,
or transcription, given a speech waveform. In this standard processing pipeline, the
raw speech waveform is first processed to extract a sequence of acoustic features x1:T ,
$$\mathbf{x}_{1:T} = \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T, \qquad (3.1)$$
where T is the length of sequence and xt represents the feature vector at time t. The
length of the sequence can vary from utterance to utterance. The ASR system yields
the most likely decoding hypothesis $\hat{\boldsymbol{\omega}}$, according to Bayes' decision rule,
$$\hat{\boldsymbol{\omega}} = \arg\max_{\boldsymbol{\omega}} P(\boldsymbol{\omega}_{1:M}|\mathbf{x}_{1:T}), \qquad (3.2)$$
where $P(\boldsymbol{\omega}_{1:M}|\mathbf{x}_{1:T})$ stands for the conditional probability of a hypothesis $\boldsymbol{\omega}$ given the features $\mathbf{x}_{1:T}$. This hypothesis can be expressed as a word sequence of length $M$,
$$\boldsymbol{\omega}_{1:M} = \omega_1, \omega_2, \ldots, \omega_M. \qquad (3.3)$$
Filter Bank
Filter bank analysis can be applied to the spectrum. The spectrum given by the DFT
is evenly distributed in the frequency domain. However, frequencies across the audio
spectrum are resolved in a non-linear fashion by the human ear. Filter bank analysis can
remove this kind of mismatch. The feature vectors after filter bank analysis are usually
warped by log(·) to rescale the dynamic range. The feature extracted by this process
is referred to as the filter bank feature.
Cepstral Features
The representation obtained from filter bank analysis can be further processed to obtain
cepstral features. There are two types of cepstral features widely used in speech
recognition, mel-frequency cepstral and perceptual linear predictive coefficients.
Mel-frequency cepstral coefficients (MFCCs) (Davis and Mermelstein, 1980) use filters equally spaced on the Mel scale to obtain a non-linear frequency resolution,
$$\mathrm{Mel}(f) = 2595\log_{10}\!\left(1 + \frac{f}{700}\right). \qquad (3.4)$$
For perceptual linear predictive coefficients, the filter bank outputs are processed by a non-linear transform based on equal-loudness and the intensity-loudness power law. Linear prediction (Atal and Hanauer, 1971) is finally performed to obtain the cepstral coefficients, referred to as PLP features.
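The Mel mapping of Eq. 3.4 and its inverse, used to place filter centre frequencies, can be sketched as:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale of Eq. 3.4."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place filter-bank centre frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies of 24 filters equally spaced on the Mel scale up to 8 kHz.
centres = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 24))
print(centres[:4])   # closely spaced at low frequencies, wider apart at high frequencies
```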
Raw Waveform
A complete machine learning approach would operate directly on raw waveforms with little manual design. In speech recognition, raw features are waveform representations in the time domain. Recent DNN-based methods (Sainath et al., 2015; Tüske et al., 2014) introduce special front-end modules to use raw waveforms as features and have achieved performance comparable to state-of-the-art systems.
Dynamic Feature
Acoustic features are extracted at the frame level, focusing on static information within the analysis window. To capture sequential properties, dynamic information from successive frames, such as first- and second-order time derivatives, can be appended to the static features.
Feature Normalisation
Feature normalisation aims to remove the irrelevant information from the features.
Additionally, it can be used to standardise the range of the features, which is practically
important for DNN models. Acoustic features may include a range of irrelevant factors,
such as accent, gender, environment noise and channel. Normalisation can reduce the
impact of such irrelevant factors on the features.
Traditional normalisation techniques include cepstral mean normalisation (Atal,
1974) (CMN), cepstral variance normalisation (Viikki and Laurila, 1998) (CVN) and
vocal tract length normalisation (Lee and Rose, 1996).
In the generative framework, the decision rule can be rewritten using Bayes' rule,
$$\hat{\boldsymbol{\omega}} = \arg\max_{\boldsymbol{\omega}}\frac{p(\boldsymbol{\omega},\mathbf{x}_{1:T})}{p(\mathbf{x}_{1:T})} = \arg\max_{\boldsymbol{\omega}}\, p(\boldsymbol{\omega},\mathbf{x}_{1:T}). \qquad (3.7)$$
The probability density function (PDF) of the feature sequence, p(x1:T ), can be
omitted, as x1:T is independent of ω. The joint distribution is usually factorised into
two components,
$$p(\boldsymbol{\omega}_{1:M},\mathbf{x}_{1:T}) = p(\mathbf{x}_{1:T}|\boldsymbol{\omega}_{1:M})P(\boldsymbol{\omega}_{1:M}), \qquad (3.8)$$
the likelihood p(x1:T |ω 1:M ), referred to as the acoustic model, and the prior P (ω 1:M ),
referred to as the language model. In general, these two models are separately trained.
It can be difficult to model p(x1:T |ω 1:M ) directly, as the sequence lengths, M and T ,
are different. To address this issue, a sequence of discrete latent variables ψ 1:T , referred
to as the alignment, are introduced to handle the mapping between the sequences.
Therefore,
$$p(\mathbf{x}_{1:T}|\boldsymbol{\omega}_{1:M}) = \sum_{\boldsymbol{\psi}_{1:T}\in\Psi^T_{\boldsymbol{\omega}_{1:M}}} p(\mathbf{x}_{1:T},\boldsymbol{\psi}_{1:T}|\boldsymbol{\omega}_{1:M}), \qquad (3.9)$$
where $\Psi^T_{\boldsymbol{\omega}_{1:M}}$ represents all valid alignments of length $T$ for the $M$-length word sequence
ω 1:M . A standard generative framework for speech recognition is illustrated in Fig-
ure 3.1. It consists of five principal components: front-end processing, acoustic model,
language model, lexicon and decoder. First, the front-end processing extracts acoustic
features from the waveforms. The decoder then uses the acoustic model, language
model and lexicon to find the most likely decoding hypothesis. This section discusses
Fig. 3.1 Standard generative framework for speech recognition: front-end processing, acoustic model, language model, lexicon and decoder.
Fig. 3.2 Probabilistic graphical model for HMM. Unobserved variables are marked in
white, and observed variables are in blue.
these components. The discussion focuses on the acoustic model p(x1:T |ω 1:M ), which
is a key component and the main focus of this thesis. It also covers a brief description
of the language model and decoding methods.
Using these Markovian approximations, the acoustic model $p(\mathbf{x}_{1:T}|\boldsymbol{\omega}_{1:M})$ can be rewritten as
$$p(\mathbf{x}_{1:T}|\boldsymbol{\omega}_{1:M}) \simeq \sum_{\boldsymbol{\psi}_{1:T}\in\Psi^T_{\boldsymbol{\omega}_{1:M}}}\prod_{t=1}^{T} p(\mathbf{x}_t|\psi_t)P(\psi_t|\psi_{t-1}). \qquad (3.12)$$
In speech recognition, a single HMM is used to model each basic acoustic unit. Using
the lexicon, HMMs can be composed together to represent words and sentences. Usually,
each basic-unit HMM has a fixed number of hidden states, including an initial and an accepting non-emitting state (i.e. states that cannot generate feature vectors).
The topology of a standard left-to-right1 HMM with five states (three emitting and
two non-emitting states) is illustrated in Figure 3.3. Two types of parameters are
associated with the HMM model:
• State transition probabilities, $\{a_{ij}\}$, defined as
$$a_{ij} = P(\psi_{t+1}=j|\psi_t=i), \qquad \sum_{j=1}^{N} a_{ij} = 1, \qquad (3.13)$$
where N is the total number of hidden states. Left-to-right HMMs restrict state
transition to self loops or the next state; thus, many of aij are zeros.
1 Here the term left-to-right means that the state transition cannot jump from a later state to a previous one.
Fig. 3.3 A left-to-right HMM with five states. Emitting states are red circles, and
non-emitting states are blue circles.
State emitting probabilities are at the heart of HMM models. For many years,
the state emitting PDFs were modelled as Gaussian mixture models (GMMs),
$$b_j(\mathbf{x}) = \sum_{k=1}^{K} c_k^{(j)}\,\mathcal{N}\!\left(\mathbf{x};\boldsymbol{\mu}_k^{(j)},\boldsymbol{\Sigma}_k^{(j)}\right), \qquad (3.15)$$
where each component is a multivariate Gaussian distribution,
$$\mathcal{N}\!\left(\mathbf{x};\boldsymbol{\mu}_k^{(j)},\boldsymbol{\Sigma}_k^{(j)}\right) = \frac{1}{\sqrt{(2\pi)^d\left|\boldsymbol{\Sigma}_k^{(j)}\right|}}\exp\left\{-\frac{1}{2}\left(\mathbf{x}-\boldsymbol{\mu}_k^{(j)}\right)^T\boldsymbol{\Sigma}_k^{(j)-1}\left(\mathbf{x}-\boldsymbol{\mu}_k^{(j)}\right)\right\}, \qquad (3.16)$$
where $d$ is the dimension of the feature vector. To make Eq. 3.15 a valid PDF, it also requires
$$c_k^{(j)} \ge 0 \quad \forall j,k; \qquad (3.17)$$
$$\sum_{k=1}^{K} c_k^{(j)} = 1 \quad \forall j. \qquad (3.18)$$
Likelihood Calculation
In a direct calculation of the likelihood, the outer summation over alignments may take as many as $O(N^T)$ steps. Thus, even for a small number of states and short sequences, this
summation cannot be computed in practice. However, the Markovian approximations
in HMM enable an efficient way to calculate the likelihood. The forward-backward
algorithm (Baum et al., 1970) is a dynamic programming scheme that can break down
the likelihood calculation for the complete sequence into a collection of sub-sequence
calculations.
The forward probability, fwd(t, j), is defined as the likelihood of a t-length sequence
that stays at state j at time t. For emitting states, 2 ≤ j ≤ N − 1, the forward
probability can be recursively computed via
$$\mathrm{fwd}(t,j) = p(\mathbf{x}_{1:t},\psi_t=j) = \left[\sum_{i=2}^{N-1}\mathrm{fwd}(t-1,i)\,a_{ij}\right] b_j(\mathbf{x}_t). \qquad (3.20)$$
As a consequence, the likelihood of the whole sequence is given by the forward probability of the non-emitting state $N$ at time $T$,
$$p(\mathbf{x}_{1:T}|\boldsymbol{\omega}_{1:M}) = \mathrm{fwd}(T,N) = \sum_{i=2}^{N-1}\mathrm{fwd}(T,i)\,a_{iN}. \qquad (3.21)$$
Alternatively, the decomposition can begin with the end of the sequence. The backward
probability, bwd(t, j), is defined as the conditional probability of a partial sequence
from time $t+1$ to the end, given $\psi_t = j$. Similarly, the calculation of $\mathrm{bwd}(t,j)$ can be performed recursively,
$$\mathrm{bwd}(t,j) = p(\mathbf{x}_{t+1:T}|\psi_t=j) = \sum_{i=2}^{N-1} a_{ji}\,b_i(\mathbf{x}_{t+1})\,\mathrm{bwd}(t+1,i). \qquad (3.22)$$
Therefore, the likelihood $p(\mathbf{x}_{1:T}|\boldsymbol{\omega}_{1:M})$ can be accumulated from the backward probabilities at the start of the sequence, via the non-emitting state 1. Using either the forward or backward probability, the likelihood calculation takes only $O(N^2T)$ steps, and can thus be performed efficiently in polynomial time.
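A sketch of the forward recursion in the log domain for a toy left-to-right HMM; the transition values are illustrative, and the emitting states are indexed from zero rather than from 2 to N−1.

```python
import numpy as np

def forward_likelihood(log_b, log_a_init, log_a, log_a_final):
    """HMM log likelihood via the forward recursion (Eqs. 3.20-3.21), in the log domain.

    log_b:       (T, N) frame log emission probabilities log b_j(x_t) for N emitting states
    log_a_init:  (N,)   log transition probs from the initial non-emitting state
    log_a:       (N, N) log transition probs between emitting states
    log_a_final: (N,)   log transition probs into the accepting non-emitting state
    """
    def logsumexp(v):
        m = v.max()
        return m + np.log(np.sum(np.exp(v - m)))

    fwd = log_a_init + log_b[0]                              # fwd(1, j)
    for t in range(1, log_b.shape[0]):
        fwd = np.array([logsumexp(fwd + log_a[:, j])
                        for j in range(log_a.shape[0])]) + log_b[t]
    return logsumexp(fwd + log_a_final)                      # accumulate into the final state

# Toy 3-emitting-state left-to-right HMM over a 10-frame sequence.
rng = np.random.default_rng(0)
log_b = np.log(rng.random((10, 3)))
a = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 0.6]])
print(forward_likelihood(log_b, np.log([1.0, 1e-30, 1e-30]),
                         np.log(a + 1e-30), np.log([1e-30, 1e-30, 0.4])))
```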
Acoustic Units
HMMs have been introduced to model the basic acoustic units for speech recognition.
Acoustic units can be designed at various levels. For simple tasks such as digit
recognition, where the vocabulary size is relatively small, units can be specified at the
word level, i.e. to train an HMM for each possible word. However, as the vocabulary
increases, it is not feasible to robustly estimate word-level HMMs. For example, the
typical vocabulary size in English varies between 40k and 100k. Because of the data
sparsity, building an HMM for each possible word is not practical. Alternatively, the
lexicon can be utilised to break down words into sub-word units, such as phonemes
or graphemes. A typical English phoneme set contains 40 to 60 phones only, which
is much smaller than the vocabulary size. Given phone-level HMMs, word or even
sentence HMM models can be built via concatenating related phone HMMs according
to the lexicon.
Two forms of phone-level acoustic units are used in speech recognition, context-
independent and context-dependent phones. Context-independent (CI) phones, also
known as monophones, use the original linguistic phonemes specified in the lexicon
as acoustic units. The limitation of monophones is that context information from
adjacent phones is not taken into account. The co-articulation phenomenon (Lee, 1988)
states that the acoustic property of a particular phone can be considerably influenced
by the preceding or following phones. Context-dependent (CD) phones, also known
as triphones, are used to address this issue. A triphone is composed of one central
phone and two context phones. For instance, /l-i+t/ stands for a triphone where
the central phone is /i/, the preceding phone is /l/, and the following phone is /t/.
Fig. 3.4 DNN-HMM hybrid model.
Hybrid Systems
The DNN-HMM hybrid system (Bourlard and Morgan, 1994; Dahl et al., 2012) replaces
the state emitting PDFs, p(xt |ψt ), by a deep neural network. DNNs are often trained
in a discriminative manner, such as modelling target posterior P (ψt |xt ). This cannot
directly represent a likelihood function p(xt |ψt ). By Bayes’ rule, it can be converted
to form a “pseudo-likelihood” equation,
$$p(\mathbf{x}_t|\psi_t) = \frac{P(\psi_t|\mathbf{x}_t)\,p(\mathbf{x}_t)}{P(\psi_t)} \propto \frac{P(\psi_t|\mathbf{x}_t)}{P(\psi_t)},$$
where P (ψt ) is the prior probability distribution for the state generated as ψt , and
p(xt ) can be omitted, as the feature vector xt is independent of the target ψt . The
state prior distribution P (ψt ) can be simply calculated from for frame state alignments
of the training data. Figure 3.4 illustrates the DNN-HMM hybrid model. The DNN
outputs are specified as context-dependent triphone states. The input feature consists
of several successive frames, not a single speech frame, to reinforce context information.
Hybrid systems have been extensively used in state-of-the-art ASR systems (Dahl
et al., 2012). Several approaches have been investigated to improve its performance.
These include large-scale training (Kingsbury et al., 2012), discriminative training
(Section 3.4), and speaker adaptation (Section 3.5).
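As a brief illustration of the pseudo-likelihood conversion, the following sketch assumes the DNN log-posteriors and the alignment-based state priors are available as arrays; the names are hypothetical.

```python
import numpy as np

def pseudo_log_likelihood(log_posteriors, state_priors):
    """Convert DNN state posteriors into HMM emission scores.

    log_posteriors : (T, S) log P(psi_t = s | x_t) from the DNN.
    state_priors   : (S,)  P(psi_t = s) estimated from frame-state
                     alignments of the training data.
    Returns (T, S) scores proportional to log p(x_t | psi_t = s);
    the p(x_t) term is dropped as it does not affect decoding.
    """
    return log_posteriors - np.log(state_priors)
```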
Tandem System
In contrast to the hybrid system, the DNN-HMM tandem system (Grezl et al., 2007;
Hermansky et al., 2000) is built on the GMM-HMM framework. Here, the neural network
is used as a feature extractor rather than as a discriminative classifier. As
shown in Figure 3.5, a bottleneck layer, consisting of much fewer units than other
layers, is designed prior to the output layer. The output of this bottleneck layer,
referred to as bottleneck features, is then concatenated with raw acoustic features,
such as PLP, to train the GMM-HMM model. Bottleneck features are a compact,
low-dimensional representation of the raw acoustic features, trained discriminatively
to distinguish between phone states.
Fig. 3.5 DNN-HMM tandem model with a bottleneck layer.
System Combination
The hybrid and tandem systems can be combined at the frame level by interpolating
their acoustic scores in the log domain,
\[ \log p_{\mathrm{j}}(x_t \mid \psi_t) \propto \kappa_{\mathrm{hyb}} \log p_{\mathrm{hyb}}(x_t \mid \psi_t) + \kappa_{\mathrm{tan}} \log p_{\mathrm{tan}}(x_t \mid \psi_t) \tag{3.27} \]
where κhyb and κtan stand for the interpolation weights of hybrid and tandem systems
respectively. The interpolation weights can be manually tuned. The combination result
log p_j(x_t | ψ_t) is then used as the combined acoustic-model score in decoding.
Fig. 3.6 Joint decoding for DNN-HMM hybrid and tandem systems.
3.2.3 Language Model
The language model provides the prior probability of a word sequence, which is
decomposed into a product of conditional word probabilities,
\[ P(\omega_{1:M}) = \prod_{m=2}^{M} P(\omega_m \mid \omega_{1:m-1}). \tag{3.28} \]
Two special word symbols are introduced: the sentence start symbol ⟨s⟩ and the
sentence end symbol ⟨/s⟩. Language models are trained on text-only data to
estimate the conditional distribution P(ω_m | ω_{1:m−1}). Directly modelling P(ω_m | ω_{1:m−1})
can be impractical due to data sparsity, so the word history ω_{1:m−1} needs to be
approximated to address this issue. There are two popular strategies of approximation.
N-gram LMs directly estimate P(ω_m | ω_{m−n+1:m−1}) using the maximum likelihood criterion,
\[ P(\omega_m \mid \omega_{m-n+1:m-1}) = \frac{\#(\omega_{m-n+1:m})}{\#(\omega_{m-n+1:m-1})} \tag{3.32} \]
where #(·) stands for the number of occurrences of an n-gram in the training corpus. The term
n is referred to as the order of the n-gram LM. In ASR systems, bi-gram (2-gram),
tri-gram (3-gram) and 4-gram LMs are often used. The simple expression in Eq. 3.32
makes it possible to efficiently train n-gram LMs on very large corpora.
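The maximum-likelihood estimate in Eq. 3.32 amounts to counting and dividing n-gram occurrences. A minimal, unsmoothed sketch is shown below; the sentence-boundary symbols and variable names are illustrative.

```python
from collections import Counter

def train_ngram_ml(sentences, n=3):
    """Maximum-likelihood n-gram estimates, Eq. (3.32).

    sentences : iterable of word lists, without sentence markers.
    Returns a dict mapping (history, word) -> P(word | history).
    """
    num, den = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] * (n - 1) + list(words) + ["</s>"]
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            num[(history, padded[i])] += 1   # #(w_{m-n+1:m})
            den[history] += 1                # #(w_{m-n+1:m-1})
    return {hw: c / den[hw[0]] for hw, c in num.items()}
```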
There are several extensions to improve n-gram LMs. Smoothing techniques adjust
the distribution to give non-zero, robust probabilities to all n-grams; examples include Katz
smoothing (Katz, 1987), absolute discounting (Ney et al., 1994) and Kneser-Ney
smoothing (Kneser and Ney, 1995). LM adaptation (Bellegarda, 2004; Gildea and
Hofmann, 1999) combines multiple LMs to resolve the issue of text mismatches in
different topics.
3.2.4 Decoding
Decoding is a key module in a speech recognition system. Given a sequence of acoustic
features x_{1:T}, the decoder searches for the "optimal" word sequence ω̂ using the acoustic
model, language model and lexicon,
\[ \hat{\omega} = \arg\max_{\omega} P(\omega \mid x_{1:T}) = \arg\max_{\omega} \left( P(\omega) \sum_{\psi_{1:T}} p(x_{1:T} \mid \psi_{1:T})\, P(\psi_{1:T} \mid \omega) \right). \tag{3.33} \]
A summation over exponentially many possible state sequences ψ 1:T is required, which
is not feasible in practice. The sum in Eq. 3.33 can be approximated by a max(·)
operator. Instead of the optimal word sequence, the approximated decoding algorithm
searches for the word sequence corresponding to the optimal state sequence
\[ \hat{\omega} \simeq \arg\max_{\omega} \left( P(\omega) \max_{\psi_{1:T}} p(x_{1:T} \mid \psi_{1:T})\, P(\psi_{1:T} \mid \omega) \right). \tag{3.34} \]
This search process can be viewed as finding the best path through a directed graph,
referred to as decoding network, constructed from the acoustic model, language model
and lexicon. It can be performed in either a breadth-first or depth-first fashion (Aubert,
2002). The Viterbi algorithm (Forney, 1973) is a dynamic programming algorithm
based on a breadth-first concept, and has been extensively used in ASR systems.
In detail, given the acoustic feature sequence x1:T , the search process is based on
computing the following term²,
\[ \mathrm{lik}(t, j) = \max_{\psi_{1:t-1}} p(x_{1:t}, \psi_{1:t-1}, \psi_t = j), \tag{3.35} \]
which represents the maximum likelihood of the partial “best” state sequence that stays
on state j at time t. According to the Markov property, this term can be recursively
accumulated by
\[ \mathrm{lik}(t, j) = \max_{i} \left\{ \mathrm{lik}(t-1, i)\, a_{ij}\, b_j(x_t) \right\}, \tag{3.36} \]
2 To simplify the discussion, a simple form without the language model is discussed here. For more
details regarding the decoding algorithm, please refer to Gales and Young (2008).
with the initial conditions
\[ \mathrm{lik}(1, 1) = 1, \tag{3.37} \]
\[ \mathrm{lik}(2, j) = a_{1j}\, b_j(x_1) \quad \forall j. \tag{3.38} \]
Once the recursion reaches time T, the best state sequence can be recovered by tracing back,
\[ \hat{\psi}_t = \arg\max_{j} \left\{ \mathrm{lik}(t, j)\, a_{j \hat{\psi}_{t+1}} \right\}, \quad 1 \le t \le T-1. \tag{3.39} \]
It takes O(N²T) steps to complete the search. In practice, the implementation
of the Viterbi algorithm can be complex because of the HMM topology,
language-model constraints and context-dependent acoustic units (Gales and Young, 2008).
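The following sketch illustrates the Viterbi recursion and trace-back for a single HMM, working in the log domain and ignoring the language model, as in the simplified form discussed above; the interface is hypothetical.

```python
import numpy as np

def viterbi(a, b):
    """Best-path search for a single HMM, cf. Eqs. (3.36)-(3.39).

    a : (N, N) log-domain transition matrix; states 0 and N-1 non-emitting.
    b : (T, N) log emission likelihoods for the emitting states.
    Returns the most likely emitting-state sequence.
    """
    T, N = b.shape
    lik = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    lik[0, 1:N-1] = a[0, 1:N-1] + b[0, 1:N-1]           # initialisation
    for t in range(1, T):                               # recursion, Eq. (3.36)
        for j in range(1, N-1):
            scores = lik[t-1, 1:N-1] + a[1:N-1, j]
            back[t, j] = 1 + np.argmax(scores)
            lik[t, j] = scores[back[t, j] - 1] + b[t, j]
    # Trace back from the state that best reaches the exit state, Eq. (3.39).
    path = [int(np.argmax(lik[T-1, 1:N-1] + a[1:N-1, N-1])) + 1]
    for t in range(T-1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```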
This algorithm can either be implemented on a dynamic decoding network (Odell
et al., 1994; Ortmanns et al., 1997) (i.e. constructing the network while decoding)
or on a static network such as a weighted finite-state transducer (Mohri et al., 2002;
Povey et al., 2011). The Viterbi algorithm is based on Markov properties. When
non-Markovian modules such as RNN LMs are used, it is no longer possible to simply
perform dynamic programming for decoding. Approximations for the RNN LMs (Liu
et al., 2014, 2015) have been investigated to use non-Markovian LMs in the Viterbi
decoding framework.
The hypothesis ω̂ given by the Viterbi algorithm is the most probable word sequence
with minimum error rate at the sentence level. However, recognition performance
is usually evaluated at the word level (Section 3.6), so the output of Viterbi decoding is
sub-optimal for word error rate. Minimum Bayes' risk decoding, such as confusion
network decoding (Evermann and Woodland, 2000; Mangu et al., 2000), is designed to
address this mismatch.
3.2.5 Lexicon
The lexicon, also known as the dictionary, is used in modern ASR systems to map words,
or characters, into sub-word units. This mapping allows acoustic-model parameters
to be robustly estimated, and unseen words in the training data to be modelled.
To build the lexicon, a standard approach is to map words into phones. The word
pronunciations can be generated manually by experts or automatically by grapheme-to-
phoneme algorithms (Bisani and Ney, 2008). For low-resource languages, it may be
impractical to manually generate a phonetic lexicon. An alternative approach is to
build a graphemic lexicon (Gales et al., 2015b; Kanthak and Ney, 2002; Killer et al.,
2003), where the “pronunciation” for a word is defined by the letters forming that word.
The graphemic lexicon enables ASR systems to be built with no phonetic information
provided.
where Ψ^T_{ω_{1:M}} consists of all possible alignment sequences that are of length T and
identify ω_{1:M}. In this way, the posterior of the word sequence P(ω_{1:M} | x_{1:T}) can be
rewritten as
\[ P(\omega_{1:M} \mid x_{1:T}) = \sum_{\psi_{1:T} \in \Psi^{T}_{\omega_{1:M}}} P(\psi_{1:T} \mid x_{1:T}). \]
In a similar fashion to the HMM, the CTC model converts the original unsegmented
labelling task to a segmented one. In practice, the label set may include a set of
characters, graphemes or phonemes. A special label ε is also added to the label set and
referred to as a blank label. The CTC blank label functions similarly to the silence
unit in HMMs, in that it can separate successive words in speech.
Figure 3.8 illustrates the framework for CTC models³. CTC approximates and
factorises the conditional probability of the alignment sequence, P(ψ_{1:T} | x_{1:T}), as
\[ P(\psi_{1:T} \mid x_{1:T}) \simeq \prod_{t=1}^{T} P(\psi_t \mid x_{1:t}) \simeq \prod_{t=1}^{T} P(\psi_t \mid x_t, v_{t-1}) \simeq \prod_{t=1}^{T} P(\psi_t \mid v_t) \tag{3.43} \]
\[ P(\omega_{1:M} \mid x_{1:T}) \simeq \sum_{\psi_{1:T} \in \Psi^{T}_{\omega_{1:M}}} \prod_{t=1}^{T} P(\psi_t \mid v_t). \tag{3.44} \]
3 Note that this diagram does not illustrate the exact probabilistic graphical model for the CTC.
The symbols v t and v t+1 in dotted circles are deterministic, not random variables. This thesis
introduces this type of diagram to emphasise the relationship across observed variables (in blue circles),
unobserved variables (in white circles), and DNN hidden units (in dotted circles). Figure 3.9 and 3.10
in the following discussion are designed using the same concept.
This form allows the conditional probability P(ω_{1:M} | x_{1:T}) to be efficiently calculated
via dynamic programming, in a similar manner to the forward-backward algorithm
(Section 3.2.1).
Decoding in CTC models can be performed using a best-path approximation, which
is similar to the Viterbi algorithm (Section 3.2.4). Given the feature sequence x_{1:T},
the most likely word sequence is approximated from the most probable alignment sequence,
\[ \hat{\omega} \simeq \mathcal{B}\!\left( \arg\max_{\psi_{1:T}} P(\psi_{1:T} \mid x_{1:T}) \right) \]
where B(·) maps an alignment sequence to the corresponding word sequence. The best-
path approximation cannot guarantee to find the most probable labelling. Alternatively,
Graves et al. (2006) discussed the prefix search decoding method, which performs
decoding with a prefix tree to efficiently accumulate the statistics of label prefixes.
Given enough time, prefix search decoding can find the most probable labelling.
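The best-path approximation can be illustrated with a short sketch: the most probable label is selected at each frame, and the collapsing function B(·) then removes repeated labels and blanks. The function below is a simplified illustration, not an implementation from a specific toolkit.

```python
import numpy as np

def ctc_best_path(log_probs, blank=0):
    """Best-path CTC decoding: pick the most probable label per frame,
    then apply B(.) by merging repeats and dropping blanks.

    log_probs : (T, L) frame-level log P(psi_t | v_t), with column
                `blank` holding the blank label epsilon.
    Returns the collapsed label sequence.
    """
    alignment = np.argmax(log_probs, axis=1)       # arg max per frame
    output, previous = [], None
    for label in alignment:
        if label != blank and label != previous:   # B(.): merge repeats, drop blanks
            output.append(int(label))
        previous = label
    return output
```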
Fig. 3.9 Encoder-decoder model.
This context vector c is then used by the decoder module to guide the word sequence
generation. Formally, the conditional probability P (ω 1:M |x1:T ) is factorised as
\[ P(\omega_{1:M} \mid x_{1:T}) = \prod_{i=1}^{M} P(\omega_i \mid \omega_{1:i-1}, x_{1:T}) \simeq \prod_{i=1}^{M} P(\omega_i \mid \omega_{i-1}, u_{i-2}, c) \tag{3.47} \]
Fig. 3.10 Attention-based encoder-decoder model.
In attention-based models, the fixed context vector is replaced by a position-dependent
context vector,
\[ c_i = \sum_{t} \alpha_{it}\, v_t \tag{3.48} \]
where interpolation weights, αi1 , αi2 , . . . , αiT , are dynamically determined via
\[ \alpha_{it} = \frac{\exp\left( s(v_t, u_{i-2}) \right)}{\sum_{k=1}^{T} \exp\left( s(v_k, u_{i-2}) \right)}. \tag{3.49} \]
The function s(v_t, u_{i−2}) is referred to as the attention score, measuring how well position t
in the input matches position i − 1 in the output. The conditional probability of the word
sequence is then approximated as
\[ P(\omega_{1:M} \mid x_{1:T}) \simeq \prod_{i=1}^{M} P(\omega_i \mid \omega_{i-1}, u_{i-2}, c_i) \simeq \prod_{i=1}^{M} P(\omega_i \mid u_{i-1}). \tag{3.50} \]
The attention mechanism allows the context vector c_i to dynamically attend to the
relevant frames when generating different words.
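A minimal sketch of the attention computation in Eqs. 3.48-3.49 is given below; the score function is left abstract, and the argument names are illustrative.

```python
import numpy as np

def attention_context(encoder_states, decoder_state, score_fn):
    """Content-based attention, cf. Eqs. (3.48)-(3.49).

    encoder_states : (T, D) encoder outputs v_1 ... v_T.
    decoder_state  : (H,)   previous decoder state u_{i-2}.
    score_fn       : callable returning the scalar attention score s(v_t, u).
    Returns (context vector c_i, attention weights alpha_i).
    """
    scores = np.array([score_fn(v, decoder_state) for v in encoder_states])
    scores -= scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over time, Eq. (3.49)
    context = alpha @ encoder_states                # weighted sum, Eq. (3.48)
    return context, alpha
```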
3.4 Training Criteria for Speech Recognition
The model parameters θ are estimated on a training set D = {(x_{1:T_u}^{(u)}, ω_ref^{(u)})}_{1≤u≤U},
where (x_{1:T_u}^{(u)}, ω_ref^{(u)}) stands for one training instance. The discussion in this section is split
into two parts. The maximum likelihood estimation for generative models is presented
first. The second part describes discriminative training criteria for both generative
and discriminative models.
The maximum likelihood (ML) criterion for a generative model is defined as
\[ \mathcal{L}^{\mathrm{ml}}(\theta; \mathcal{D}) = -\frac{1}{U} \sum_{u=1}^{U} \log p\!\left( x_{1:T_u}^{(u)} \,\middle|\, \omega_{\mathrm{ref}}^{(u)} \right). \tag{3.52} \]
For HMM models, the likelihood term p(x_{1:T_u}^{(u)} | ω_ref^{(u)}) can be calculated via the forward-backward
algorithm described in Section 3.2.1. The appropriateness of ML estimation
relies on a number of requirements, in particular training-data sufficiency and
model correctness. In general, neither of them can be satisfied for speech data (Brown,
1987). Alternative methods, such as discriminative training criteria, have been proposed
to overcome such mismatches.
Discriminative criteria are based on the posterior of the reference word sequence,
\[ P(\omega_{\mathrm{ref}}^{(u)} \mid x_{1:T_u}^{(u)}) = \frac{p(x_{1:T_u}^{(u)} \mid \omega_{\mathrm{ref}}^{(u)})\, P(\omega_{\mathrm{ref}}^{(u)})}{\sum_{\omega \in \Omega} p(x_{1:T_u}^{(u)} \mid \omega)\, P(\omega)} \tag{3.53} \]
where Ω denotes the hypothesis space containing all possible word sequences. In
practice, it is infeasible to explore the complete hypothesis space Ω to compute the
exact denominator of Eq. 3.53. The denominator calculation is usually approximated
by a smaller candidate set of possible word sequences, such as n-best lists, or decoding
lattices, generated by a sensible recognition system. Alternatively, the lattice-free
approach (Povey et al., 2016) can be applied to some discriminative criteria, such
as the maximum mutual information criterion, which can avoid the requirement of
decoding lattices. The term P (ω) is usually specified by a separate language model,
which is not trained in conjunction with this generative model.
This section discusses three forms of discriminative training criteria: maximum
mutual information, minimum classification error, and minimum Bayes’ risk.
Maximum mutual information (Bahl et al., 1986; Povey, 2004) (MMI) is a discriminative
training criterion closely related to the classification performance. The aim of MMI
is to minimise4 the negative mutual information between the word sequence ω and
the information extracted by a recogniser from the the associated feature sequence5
x: , I(ω, x: ). Because the joint distribution of the word-sequences and observations
is unknown, it is approximated by the empirical distributions over the training data.
This can be expressed as (Brown, 1987)
(u) (u)
1 XU p(ω ref , x1:Tu )
I(ω, x: ) ≃ − log (u) (u)
. (3.54)
U u=1 P (ω ref )p(x1:Tu )
Since the language model P(ω_ref^{(u)}) is fixed, this is equivalent to minimising the negative
average log-posterior probability of the correct word sequence. The MMI criterion can
be expressed as
\[ \mathcal{L}^{\mathrm{mmi}}(\theta; \mathcal{D}) = -\frac{1}{U} \sum_{u=1}^{U} \log P(\omega_{\mathrm{ref}}^{(u)} \mid x_{1:T_u}^{(u)}). \tag{3.55} \]
When this form of training criterion is used with discriminative models, it is also known
as the conditional maximum likelihood criterion.
Minimum classification error (Juang et al., 1997) (MCE) aims at minimising the
difference between the log-likelihood of the reference ω_ref^{(u)} and that of the other
competing decoding hypotheses ω,
\[ \mathcal{L}^{\mathrm{mce}}(\theta; \mathcal{D}) = \frac{1}{U} \sum_{u=1}^{U} \left( 1 + \left( \frac{P(\omega_{\mathrm{ref}}^{(u)} \mid x_{1:T_u}^{(u)})}{\sum_{\omega \neq \omega_{\mathrm{ref}}^{(u)}} P(\omega \mid x_{1:T_u}^{(u)})} \right)^{\xi} \right)^{-1} \tag{3.56} \]
⁴ Notice that this criterion is defined as a negative MMI to keep a consistent "minimisation" form
for training criteria.
⁵ The notation x_: is defined as a feature sequence of arbitrary length.
where ξ is a hyper-parameter for smoothness. There are two major differences between
the MMI and MCE criteria. One is that the denominator term of MCE excludes the
reference sequence ω_ref^{(u)}. The other is that the MCE criterion smooths the posterior
via a sigmoid function. When ξ = 1, the MCE criterion is given as
\[ \mathcal{L}^{\mathrm{mce}}(\theta; \mathcal{D}) = \frac{1}{U} \sum_{u=1}^{U} \left( 1 - P(\omega_{\mathrm{ref}}^{(u)} \mid x_{1:T_u}^{(u)}) \right). \tag{3.57} \]
Minimum Bayes' risk (Byrne, 2006; Kaiser et al., 2000) (MBR) aims at minimising the
expectation of a particular form of loss during recognition,
\[ \mathcal{L}^{\mathrm{mbr}}(\theta; \mathcal{D}) = \frac{1}{U} \sum_{u=1}^{U} \sum_{\omega \in \Omega} P(\omega \mid x_{1:T_u}^{(u)})\, \mathcal{D}(\omega, \omega_{\mathrm{ref}}^{(u)}) \tag{3.58} \]
where D(ω, ω_ref^{(u)}) is an appropriate loss function measuring the mismatch between
ω and ω_ref. The loss can be defined at a range of levels:
• Word The word-level loss is directly related to the expectation of the word error
rate, which is the ASR performance metric. This loss is commonly defined as the
word-level Levenshtein distance between ω and ω_ref. The MBR criterion using this loss
is referred to as the minimum word error (MWE) criterion (Mangu et al., 2000).
• Phone The phone-level loss is defined as the phone-level Levenshtein distance,
and the corresponding criterion is referred to as the minimum phone error (MPE)
criterion, which has been widely used in speech recognition. Rather than phone-level
loss, state-level losses have also been investigated, which yields the state MBR
(sMBR) criterion (Su et al., 2013; Zheng and Stolcke, 2005).
• Frame The frame-level loss is defined as the Hamming distance, which computes
the number of frames with incorrect phone labels. The MBR with this frame
loss is referred to as the minimum phone frame error (MPFE) criterion (Zheng
and Stolcke, 2005).
3.5 Adaptation
A fundamental assumption in machine learning algorithms is that the training and
test data have the same distribution; otherwise, their mismatch is likely to degrade the
performance of related systems. In speech recognition, unseen speakers or environment
conditions often exist, which may be poorly represented in the training data. One
solution to address the data-mismatch issue is adaptation. Adaptation allows a small
amount of data from an unseen speaker to be used to transform a model to more
closely match that speaker. It can be used either in the training phase to induce a more
robust model, or in the evaluation phase to reduce the recognition errors. This section
describes the adaptation methods for neural networks. The methodology of adaptation
can be roughly categorised into two groups: feature-based adaptation and model-based
adaptation. The feature-based approaches only depend on the acoustic features and
aim at compensating the features to a more compact representation. Alternatively, the
model-based approaches change the model parameters to achieve the compensation.
This thesis presents adaptation schemes in the context of speaker adaptation.
Speaker adaptation in speech recognition aims at adapting original models to handle
acoustic variations, such as accent and gender, across different speakers. The unadapted
neural network is referred to as a speaker-independent (SI) model. The adapted model
is referred to as a speaker-dependent (SD) one.
L2 Regularization
An adaptation criterion can be formed by adding an L2 regularisation term (Li and Bilmes, 2006;
Liao, 2013) to the overall training criterion. This can be expressed as
\[ \mathcal{F}(\theta; \mathcal{D}) = \mathcal{L}(\theta; \mathcal{D}) + \eta \left\| \theta - \theta_{\mathrm{SI}} \right\|_2^2 \]
where θ SI stands for the parameters in the SI model. The introduced L2 penalty term
decays the parameters of the adapted model towards the original SI model. In contrast
with the discussion in Section 2.4, this L2 regularisation term for adaptation is based
on a different concept, in which the well-tuned parameters θ_SI are treated as the prior,
rather than zero.
Instead of re-estimating the complete parameter set, the re-estimation can be performed
on a small set of parameters, which are sensitive to acoustic variations. Sensitive
parameter selection (Stadermann and Rigoll, 2005) chooses hidden units in the SI
DNN with higher activation variance over the adaptation data. These hidden units are
expected to significantly influence the output; thus, associated parameters are then
chosen to be re-estimated. Defining the index set of selected parameters as S(θ), only the
parameters indexed by S(θ) are re-estimated on the adaptation data.
This strategy keeps insensitive parameters unchanged to ensure that the SD model is
not driven far away from the SI model.
KL-divergence Regularization
Adaptation can also be based on the KL-divergence regularisation (Yu et al., 2013),
that is,
\[ \mathcal{F}(\theta; \mathcal{D}) = \mathcal{L}(\theta; \mathcal{D}) + \eta \frac{1}{|\mathcal{D}|} \sum_{t=1}^{|\mathcal{D}|} \sum_{y} P^{\mathrm{SI}}(y \mid x_t) \log \frac{P^{\mathrm{SI}}(y \mid x_t)}{P(y \mid x_t)} \tag{3.62} \]
where P SI (y|xt ) is the target distribution from the SI model. It encourages the
KL-divergence between the SI and SD target distributions to be small.
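For illustration, the per-frame KL regularisation term can be computed as sketched below, assuming the SI posteriors have been pre-computed and cached; the function name and interface are hypothetical.

```python
import numpy as np

def kl_adaptation_penalty(sd_log_posteriors, si_posteriors):
    """Average KL(P_SI || P_SD) over the adaptation data, cf. Eq. (3.62).

    sd_log_posteriors : (T, S) log-posteriors from the model being adapted.
    si_posteriors     : (T, S) posteriors from the fixed SI model.
    """
    kl = si_posteriors * (np.log(si_posteriors + 1e-12) - sd_log_posteriors)
    return kl.sum(axis=1).mean()
```

This penalty, weighted by η, would be added to the adaptation criterion so that the adapted target distribution stays close to the SI one.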
Feature Normalisation
Feature-space transformations, such as constrained maximum likelihood linear regression
(CMLLR) (Gales, 1998), which was originally proposed for GMM-HMMs, can be applied to normalise
acoustic features for DNN models (Rath et al., 2013; Seide et al., 2011a; Yoshioka
et al., 2014). In addition, denoising autoencoders (Feng et al., 2014; Ishii et al., 2013)
have also been used to improve the input feature representation.
I-vectors
I-vectors (Dehak et al., 2011; Glembek et al., 2011), i.e. information vectors, are a low-dimensional,
fixed-length vector representation of the speaker space, spanning the dimensions
of highest speech variability. They were initially proposed for speaker verification, but have
recently been used for DNN adaptation in speech recognition (Saon et al., 2013; Senior
and Lopez-Moreno, 2014b). The i-vector representation contains relevant information
about the identity of speakers, which can inform the DNN training about corresponding
acoustic variations and distortions. The details of i-vector estimation and extraction
are presented in Appendix A.
A range of i-vector variants have also been examined for DNN adaptation.
Informative prior (Karanasou et al., 2015) smooths i-vector representation to handle
highly distorted acoustic conditions. In addition, i-vector factorisation (Karanasou
et al., 2014) enriches its representation with multiple acoustic factors.
Speaker Codes
Speaker codes (Abdel-Hamid and Jiang, 2013a,b; Xue et al., 2014) are an alternative
type of auxiliary feature for DNN adaptation. In contrast to most auxiliary features, they
are trained jointly with the DNN parameters. In contrast with i-vectors, which are
extracted by an independent model, speaker codes are learned in conjunction with the
DNN model. This design avoids the potential modelling mismatch caused by separate
models.
The network configuration with speaker code is illustrated in Figure 3.11. To
reinforce their importance, speaker codes are introduced as the input signal to multiple
layers, often the bottom of the network. In these layers, the activation function inputs
are
\[ z_t^{(l,s)} = W^{(l)\mathsf{T}} h_t^{(l-1,s)} + A^{(l)\mathsf{T}} c^{(s)} + b^{(l)} \tag{3.63} \]
Fig. 3.11 DNN with speaker codes. Speaker codes c^(s) are introduced to several bottom
layers to emphasise their importance.
where c(s) stands for the speaker code for speaker s, and W (l) , A(l) , b(l) are parameters
associated with this layer. The network training can follow the error back-propagation
algorithm, and speaker codes c(s) are jointly updated for all the speakers. In the
adaptation phase, the speaker code is firstly learned on the adaptation data via
back-propagation. It is then used as input to perform decoding on test data.
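A sketch of one adapted layer with a speaker code (Eq. 3.63) is given below; the names are illustrative. During adaptation only the speaker code would be updated by back-propagation, while the layer parameters remain fixed.

```python
import numpy as np

def layer_with_speaker_code(h_prev, code, W, A, b, phi=np.tanh):
    """One adapted hidden layer with a speaker code, cf. Eq. (3.63).

    h_prev : (D_prev,) activation output of the previous layer.
    code   : (D_code,) speaker code c^(s).
    W, A   : weight matrices applied to the layer input and the code.
    b      : bias vector; phi is the activation function.
    """
    z = W.T @ h_prev + A.T @ code + b
    return phi(z)
```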
An alternative model-based scheme introduces a speaker-dependent linear transformation
into a hidden layer, so that the activation function inputs become
\[ z_t^{(l,s)} = A^{(l,s)\mathsf{T}} W^{(l)\mathsf{T}} h_t^{(l-1,s)} + b^{(l,s)} \tag{3.64} \]
where A(ls) and b(ls) are speaker-dependent parameters on layer l. Normally, A(ls) is a
large matrix that may contain several million parameters. Using the full matrix is im-
practical. This matrix can be restricted in different ways, such as block-diagonal (Seide
et al., 2011a) or diagonal (Yao et al., 2012). For DNNs with SD linear hidden layers, the
canonical model includes the parameters of SI neural network, while the SD transforms
includes A(ls) and b(ls) for each speaker.
An alternative way to model A^(l,s) is to use the subspace method (Dupont and
Cheboub, 2000), which only introduces a small number of SD parameters. In detail,
principal component analysis (PCA) is first performed on the transformation matrix
W^(l) to obtain a set of K eigenvector matrices,
where λ(s) represents the SD parameters for speaker s. Because λ(s) only consists of
a limited number of interpolation weights, it can be robustly trained even with very
limited adaptation data.
Transformation Interpolation
The previous two schemes handle network adaptation with affine transformations in
DNN models. Instead, the parametrised activation function (Siniscalchi et al., 2012;
Swietojanski and Renals, 2016; Swietojanski and Renals, 2014; Zhang and Woodland,
2015b) generalises the form of the activation function to compensate for speaker variations.
Figure 3.12 illustrates the general framework of parametrised activation function. It
introduces three types of SD parameters. The outputs of adapted activation functions
Fig. 3.12 Parametrised activation function with speaker-dependent parameters α^(s), β^(s) and γ^(s).
can be expressed as
\[ h_t^{(l,s)} = \alpha^{(l,s)} \otimes \phi\!\left( \gamma^{(l,s)} \otimes h_t^{(l-1,s)};\ \beta^{(l,s)} \right) \tag{3.68} \]
where α^(l,s), γ^(l,s) and β^(l,s) are the speaker-dependent parameters of layer l. To constrain
the scaling, the factor α^(l,s) can be re-parameterised through a sigmoid function,
\[ \tilde{\alpha}^{(l,s)} = \frac{2}{1 + \exp\left( -\alpha^{(l,s)} \right)}, \tag{3.69} \]
to restrict the output to be in the range (0, 2). This warping approach was reported to
yield a robust representation, especially for adaptation in noisy acoustic conditions.
One limitation of the parametrised activation function is that the number of parameters
to adapt is equal to the number of hidden units. It is difficult to robustly estimate such a
large number of parameters when only limited adaptation data are available.
3.6 Performance Evaluation
The standard evaluation metric for speech recognition is the word error rate (WER). To compute
the WER, recognised transcriptions are first aligned against the reference transcriptions.
This is usually performed via a dynamic programming algorithm that minimises the
Levenshtein distance between the two sequences. Given the alignment, the number of
substitution (Sub), deletion (Del) and insertion (Ins) errors are respectively counted
by comparing the words in the recognised and reference transcriptions. The WER is
then calculated via
\[ \mathrm{WER} = \frac{\mathrm{Sub} + \mathrm{Del} + \mathrm{Ins}}{\mathrm{Tot}} \times 100\% \tag{3.70} \]
where Tot is the total number of words in the reference. Word error rates are quoted as
percentages. Rather than the WER, the error rate can be evaluated at a different level for
specific tasks. For phone recognition tasks, the phone error rate is commonly used. For
languages such as Chinese, the character error rate is used to remove mismatches in word
segmentation. Another popular metric is the sentence error rate, which reports the
rate of recognised sentences that are fully correct in the test data.
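For illustration, the alignment and error counting behind Eq. 3.70 can be sketched as a standard Levenshtein computation, in which the total edit cost equals Sub + Del + Ins; the interface below is illustrative.

```python
def word_error_rate(ref, hyp):
    """WER via Levenshtein alignment, cf. Eq. (3.70).

    ref, hyp : lists of words (reference and recognised transcription).
    Returns the error rate as a percentage.
    """
    # d[i][j]: minimum edit cost aligning ref[:i] with hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```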
3.7 Summary
This chapter has described principal concepts in speech recognition and related deep
learning approaches for ASR. It started with the description of acoustic features,
including filter banks, PLPs and MFCCs, and post-processing methods such
as dynamic features and feature normalisation. Generative and discriminative models
for speech recognition were then discussed, respectively. The description of generative
model focused on hidden Markov models and the integration of DNNs to HMMs.
It also covered language models, decoding approaches and lexicon. The discussion
of discriminative models included three approaches of deep learning: connectionist
temporal classification, encoder-decoder models, and attention-based models. Training
criteria for sequential tasks were then presented. The discussion included a range
of discriminative training criteria, such as maximum mutual information, minimum
classification error and minimum Bayes’ risk. Next, the adaptation schemes for neural
networks were discussed, including conservative training, feature-based and model-based
adaptation. The last section presented the evaluation metrics for ASR tasks.
Chapter 4
Multi-basis Adaptive Neural Networks
Fig. 4.1 Multi-basis adaptive neural network.
Within basis k, the activation function outputs at layer l are computed as
\[ h_t^{(l,k)} = \phi(z_t^{(l,k)}), \tag{4.1} \]
\[ z_t^{(l,k)} = W^{(l,k)\mathsf{T}} h_t^{(l-1,k)} + b^{(l,k)}. \tag{4.2} \]
The hidden units between successive layers in one basis are fully connected, while there
is no connection between units from different bases. With this restricted configuration,
a basis should be able to model a specific aspect of the training data. The outputs of
the bases are then merged together at the combination stage.
The speaker-dependent combination is defined by a basis weight vector,
\[ \lambda_{\mathrm{mb}} = \left[ \lambda_{\mathrm{mb},1}, \lambda_{\mathrm{mb},2}, \ldots, \lambda_{\mathrm{mb},K} \right]^{\mathsf{T}} \tag{4.3} \]
and the combined output of the last hidden layer is expressed as
\[ \bar{h}_t^{(L-1)} = \sum_{k=1}^{K} \lambda_{\mathrm{mb},k}\, h_t^{(L-1,k)}. \tag{4.4} \]
This combined result is then propagated to the subsequent common layers. The structured
topology of the MBANN imposes a regularisation on the activation functions in the bases
prior to the combination. That is, the activation functions that are interpolated together
are tied to the same ordering and encouraged to behave similarly. This mitigates the
issue of arbitrarily ordered activation functions, which has the potential to improve
network regularisation.
To perform adaptation on an MBANN, the basis weight vector must be estimated.
(s)
By finding an appropriate λmb for each speaker s, the MBANN can be efficiently adapted
to any speaker. The details of parameter training and adaptation for MBANNs are
presented below.
The canonical model M consists of the basis and shared parameters,
\[ \mathcal{M} = \{ \theta^{(1)}, \ldots, \theta^{(K)}, \theta^{\mathrm{share}} \} \tag{4.5} \]
where θ^(k) represents the parameters in basis k, and θ^share denotes the parameters of
the shared layers. The speaker-dependent transform Λ^(s) for speaker s is defined as
\[ \Lambda^{(s)} = \{ \lambda_{\mathrm{mb}}^{(s)} \}. \tag{4.6} \]
Given the training data D and training criterion, F(θ, D), the canonical model M
and SD parameters for all training speakers {Λ(s) }1≤s≤S are jointly optimised in the
training phase.
Parameter Initialisation
In initialisation, the speaker-dependent basis weight vectors are constrained to sum to one,
\[ \sum_{k=1}^{K} \lambda_{\mathrm{mb},k}^{(s)} = 1. \tag{4.8} \]
This constraint is only used in the parameter initialisation. It is not introduced in the
parameter training, in order to reinforce the effect of speaker-dependent parameters to
the maximal extent.
Interleaved Training
The canonical model and SD transforms are updated iteratively in MBANN training.
The interleaved parameter training is described in Algorithm 4. The canonical model
M and the speaker-dependent parameters {Λ^(s)}_{1≤s≤S} are updated in an interleaved
fashion until convergence is achieved or the maximum number of iterations is reached.
In each iteration, parameters of
the canonical model and the basis weight vectors for all training speakers are updated
using stochastic gradient descent. The gradient calculation can be performed via error
back-propagation. In addition to the gradients used in standard DNN training, it
also requires the following gradients:
\[ \frac{\partial \mathcal{F}}{\partial h_t^{(L-1,k)}} = \lambda_{\mathrm{mb},k}^{(s)} \frac{\partial \mathcal{F}}{\partial \bar{h}_t^{(L-1)}}, \tag{4.9} \]
\[ \frac{\partial \mathcal{F}}{\partial \lambda_{\mathrm{mb},k}^{(s)}} = h_t^{(L-1,k)\mathsf{T}} \frac{\partial \mathcal{F}}{\partial \bar{h}_t^{(L-1)}}. \tag{4.10} \]
4.3 Adaptation
After the training phase, the SD transforms belonging to training speakers are wiped
out, and only the canonical model M is used for adaptation. By keeping M to be fixed,
the SD basis weight vector λ_mb^(s) for a test speaker can be optimised using Eq. 4.10.
The estimated SD transform is then combined with M to decode testing utterances.
The estimation of λ_mb^(s) requires the corresponding reference target labels aligned with
the feature vectors. In speech recognition, adaptation can either be performed in the
supervised or unsupervised fashion. In supervised adaptation, reference transcriptions
are available for the testing data; thus, speaker-dependent parameters can be estimated
using the true labels. In unsupervised adaptation, there is no reference transcription
available. In this case, adaptation can be performed using decoding hypotheses from
an SI system utilised as the reference. The potential errors in the hypotheses can
significantly influence the performance of adaptation. To alleviate the influence of
decoding errors, the cross-entropy criterion is normally used as the adaptation criterion.
If the combination is designed prior to the output layer of the MBANN, the optimisation
of λ_mb^(s) is a convex problem. This convex property brings two advantages. First, the
performance of adaptation is insensitive to the initialisation of the SD transform. Second,
performance of adaptation is insensitive to the initialisation of SD transform. Second,
the optimisation can be performed even more rapidly by second-order methods while
ensuring a sensible performance. The proof of convexity is discussed in Appendix B.
Fig. 4.3 Combining MBANN with I-vectors. Adaptable modules are coloured in red.
where g^pred(·) represents the predictive model. The adaptation performance of the MBANN
is then determined by the accuracy of the prediction mapping, which is independent of
the quality of the decoding hypothesis. Besides, the predictive procedure avoids the first
decoding pass, thus allowing adaptation to be performed efficiently.
The mismatch between the distribution of predicted interpolation weights and that
of the original weights is likely to degrade the performance. To reduce the degradation
caused by this sort of difference, an interleaving mode is utilised to update the MBANN
and predictor jointly. In each iteration, the predictor is trained on the estimated SD
transforms {Λ_mb^(s)}_s^esti for the training speakers from the current MBANN system, and
the re-estimated SD transforms {Λ_mb^(s)}_s^pred given by this trained predictor are then used
to update the neural network for the next iteration. The conjugate pair of neural
network and predictor of each iteration is then used in evaluation.
Using this form of SD transform, the output of the adapted MBANN can be expressed
as
\[ P(\psi = i \mid x_t) = \frac{\exp\left( \sum_{k} \lambda_{\mathrm{mb},k}^{(s,c(i))}\, z_{ti}^{(L,k)} \right)}{\sum_{j} \exp\left( \sum_{k} \lambda_{\mathrm{mb},k}^{(s,c(j))}\, z_{tj}^{(L,k)} \right)} \tag{4.13} \]
where c(i) maps target i to its corresponding target class, and z_t^(L,k) is defined as
\[ z_t^{(L,k)} = W^{(L)\mathsf{T}} h_t^{(L-1,k)} + b^{(L)}. \tag{4.14} \]
The gradient with respect to the target-dependent interpolation weights is given by
\[ \frac{\partial \mathcal{F}}{\partial \lambda_{\mathrm{mb},k}^{(s,i)}} = \sum_{j:\, c(j) = i} z_{tj}^{(L,k)} \frac{\partial \mathcal{F}}{\partial \bar{z}_{tj}^{(L)}} \tag{4.15} \]
where z̄_tj^(L) is defined as
\[ \bar{z}_{tj}^{(L)} = \sum_{k} \lambda_{\mathrm{mb},k}^{(s,c(j))}\, z_{tj}^{(L,k)}. \tag{4.16} \]
Fig. 4.5 MBANN with inter-basis connectivity.
where b̃^(l) concatenates the bias vectors {b^(l,k)}_k, and W̃^(l) consists of the matrix
parameters. If the inter-basis connectivity is turned off, W̃^(l) is a block-diagonal matrix,
and the diagonal blocks are the matrix parameters of the different bases {W^(l,k)}_k.
Introducing inter-basis connections allows the matrix parameters outside the diagonal
blocks to be non-zero.
When training an MBANN with inter-basis connectivity, the restriction on connections
between bases still distinguishes it from a fully connected network, as different bases should
be forced to capture different aspects of the training data. The network training is therefore
performed with aggressive L2 regularisation on the inter-basis parameters to
limit the inter-basis connections. The overall training criterion is expressed as
\[ \mathcal{F}(\theta; \mathcal{D}) = \mathcal{L}(\theta; \mathcal{D}) + \eta\, \mathcal{R}(\theta; \mathcal{D}) \tag{4.19} \]
where the regularisation term R(θ; D) penalises large parameters for inter-basis con-
nections,
\[ \mathcal{R}(\theta; \mathcal{D}) = \sum_{l} \sum_{(i,j) \in \mathcal{I}(l)} \tilde{w}_{ij}^{(l)2} \tag{4.20} \]
where I(l) stands for the index set of inter-basis parameters. Inter-basis connections
are penalised to be small, which can help to reinforce the importance of parameters
within each basis.
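A sketch of the inter-basis penalty of Eq. 4.20 is given below. It assumes, for simplicity, that each basis has the same number of input and output units at the layer, so that the within-basis blocks lie on the diagonal; the names are illustrative.

```python
import numpy as np

def inter_basis_penalty(W_tilde, basis_sizes):
    """L2 penalty on inter-basis weights, cf. Eq. (4.20).

    W_tilde     : (D, D) concatenated weight matrix of one layer.
    basis_sizes : list of per-basis layer widths, summing to D.
    Only entries outside the block-diagonal (within-basis) blocks are penalised.
    """
    mask = np.ones_like(W_tilde)
    start = 0
    for size in basis_sizes:
        mask[start:start+size, start:start+size] = 0.0   # exclude within-basis blocks
        start += size
    return np.sum((W_tilde * mask) ** 2)
```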
                 A      B      C        D
Type             clean  noise  channel  noise+channel
Total (hrs)      0.7    4.0    0.7      4.0
#Uttr            330    1980   330      1980
AvgUttr (secs)   7.3
Table 4.1 AURORA 4: Summary of evaluation sets. It includes the type of acoustic
distortion, total hours, number of utterances (#Uttr) and average utterance duration
(AvgUttr). “Noise” represents additive noise, and “channel” is channel distortion.
test sets. They were split into 4 sets: A, B, C and D. Table 4.1 summarises the four
sets used in the evaluation.
In this thesis, unless otherwise stated, the ASR systems were built using the DNN-
HMM hybrid framework. The relevant GMMs, DNNs and the proposed models were
implemented and trained on an extended version of the HTK Toolkit 3.5 (Young et al.,
2015).
In both training stages, 650 utterances belonging to 8 speakers were used as the cross-validation
set. This CE DNN was subsequently used to generate the lattices of the
training set and further tuned for four iterations on the MPE criterion to obtain the
baseline MPE DNN system. To fairly compare MBANNs and DNN baselines with
a comparable number of parameters, a larger DNN configuration, denoted as “DNN
(large)” in the following discussion, was trained. It included five hidden layers and
2000 units in each layer. The CE and MPE DNN systems of this large configuration
were trained in similar settings as the systems of the small configuration. In evaluation,
decoding was performed with the standard WSJ bi-gram language model.
Multi-basis adaptive neural networks were trained using similar settings as the
baseline DNN system. As discussed in Section 4.2, the MBANN model was initialised
by the well-trained DNN baseline and updated in the interleaved fashion. To prevent
over-fitting caused by the additional tuning epochs, a lower learning rate was used
to optimise the MBANN. Unless stated otherwise, the adaptation of
MBANNs was performed in an utterance-level unsupervised fashion. That is, the
speaker-dependent parameters, i.e. the basis weight vectors λ_mb^(s), were estimated per test
utterance, and the supervision was obtained from the decoding hypothesis of the SI
DNN baseline. Detailed discussions are presented below.
To achieve a sensible initial performance, the MBANN was initialised with the SI CE
DNN baseline. In parameter training, the bases were modelled to represent different
i-vector clusters. The utterance i-vectors of the training data were clustered into 2, 4
and 6 clusters by k-means. A 1-of-K vector (a vector with one element containing a
1 and all other elements as 0) was specified to each utterance as its initial SD basis
weight vector, representing its cluster index. The MBANNs were trained using the
Fig. 4.6 AURORA 4: Learning curves of MBANNs with 2, 4 and 6 bases. “CM” is an
update of the canonical model; “SD” is an update of the speaker-dependent parameters.
interleaved method (Algorithm 4) for five iterations on the CE criterion. The learning
curves of MBANNs with 2, 4 and 6 bases are illustrated in Figure 4.6. By performing
the interleaved training scheme, a lower cross entropy on the training set could be
obtained. As far fewer parameters are contained in the speaker-dependent transform
(i.e. the interpolation weights), the decrease in cross entropy during the update phase of
the SD parameters is smaller than that of the canonical model.
The recognition performance of CE MBANN systems is summarised in Table 4.2.
In the block "MBANN", the supervision for estimating λ_mb^(s) was obtained from the
decoding hypotheses of the CE SI DNN. The MBANNs with 2, 4 and 6 bases achieved
decoding hypotheses of the CE SI DNN. The MBANNs with 2, 4 and 6 bases achieved
similar performance, i.e. about 5% relative error reduction compared with the SI DNN
System       A    B    C    D     Avg
DNN          3.8  7.7  8.5  19.0  12.3
DNN (large)  3.9  7.7  8.0  18.5  12.1
MBANN        3.8  7.9  8.1  18.4  12.1
baseline, reducing the WER from 13.0% to 12.4%. As indicated in the results of Sets B
and D, increasing the number of bases can slightly improve the adaptation performance
on low-WER scenarios (Set B), while the performance in high-WER scenarios (Set D)
actually decreases. The "DNN (large)" system has a comparable number of parameters
to the 4-basis MBANN. Nevertheless, the corresponding MBANN still yielded lower
error rates, indicating the effectiveness of the structured design of the MBANN. To
illustrate the performance of MBANN to the maximal extent, the oracle adaptation,
which was performed using the reference transcriptions, is also reported in the block
“MBANN (oracle)”. Better performance was obtained with more bases in the oracle
experiments.
The MBANN with 2 bases was further tuned on the MPE criterion for two
interleaving iterations². To perform adaptation on MPE MBANNs, the decoding
hypotheses of the MPE SI DNN were used as supervision. The performance of the MPE
systems is reported in Table 4.3. In contrast to the SI MPE DNN baseline, the
adapted MPE MBANN reduced the WER from 12.3% to 12.1%. This performance is
similar to that of the "DNN (large)" system, which contained many more parameters than the
2-basis MBANN.
The MBANN with 2 bases was selected to evaluate the target-dependent interpolation
scheme3 . As discussed in Section 4.5, based on the meanings of DNN targets, there are
2 In network training, the SD transforms of MPE MBANNs were optimised on the MPE criterion.
In adaptation, the SD transforms were optimised on the CE criterion. This is a common configuration
for adaptation, which can alleviate the impact of errors contained in decoding hypothesis.
3 For the target-dependent interpolation scheme, parameter training with multiple target classes
did not yield significant gains. This thesis reports a simple configuration that trained the MBANN
with one class (default), and in adaptation, multiple target classes were used.
Clustering   A    B    C    D     Avg
–            4.0  8.1  9.4  18.7  12.4
sil/nonsil   4.0  8.0  8.7  18.8  12.4
k-means      4.0  8.0  8.7  18.8  12.4
Table 4.5 AURORA 4: Comparison of 1, 2 and 3 k-means target classes for target-
dependent interpolation scheme on the 2-basis CE MBANN. “Oracle” systems stand
for performing adaptation on reference transcriptions. Adaptation was performed at
the utterance level.
several ways to specify the target classes using prior knowledge or automatic clustering.
The DNN targets in this task were modelled as context-dependent triphone states.
Two types of target classes were investigated: silence/non-silence classes, and k-means
clustering classes. The silence/non-silence method split the targets into two classes:
states belonging to silence, and those belonging to triphones. This configuration is
based on the fact that silence frames are noticeably different to the non-silence ones.
To separately adapt silence/non-silence targets is likely to improve the adaptation
performance. The second type of target classes was obtained by the k-means clustering
on the column vectors of the last-layer transformation matrix of MBANN (i.e. W (L) ).
The adaptation performance of MBANN target-dependent interpolations using the
silence/non-silence and the k-means (2 clusters) methods are compared in Table 4.4.
On the 2-class setting, the k-means results were similar to those of the silence/non-silence
classes, as the k-means clusters mostly consisted of silence states and a range of states
belonging to voiceless consonants. The overall performance of both target-class settings
yielded little gain.
Inter-basis Connectivity
Next, the MBANN model with inter-basis connectivity (Section 4.6) was evaluated.
With the inter-basis connections enabled, the 2-basis MBANN and the "DNN (large)"
system have a comparable number of parameters; therefore, the inter-basis connectivity
experiments were conducted using the MBANN configuration with 2 bases. The impact
of different regularisation penalties on the inter-basis connections, ranging from 0 to infinity,
is summarised in Table 4.7. A higher penalty drives the inter-basis weights closer to
System       η            A    B    C    D     Avg
DNN          –            4.2  8.4  9.1  19.7  13.0
DNN (large)  –            4.2  8.2  8.2  19.3  12.7
MBANN        0            4.0  8.0  8.3  18.8  12.4
             10^-1        4.0  8.0  8.4  18.7  12.3
             1            4.0  8.0  8.3  18.7  12.3
             10^2         4.0  8.0  8.4  18.8  12.4
             ∞ (≈ 10^5)   4.0  8.1  9.4  18.7  12.4
zero. The "∞" row represents the default MBANN setting with inter-basis connections
turned off; its regularisation penalty η is approximately equivalent to 10⁵, the
reciprocal of the learning rate. Introducing the inter-basis connectivity yielded little
performance gain over the original MBANN model. This indicates that the
inter-basis connections are less important than those within each basis.
4.8 Summary
In this chapter, multi-basis adaptive neural networks were proposed. Conventional
model-based adaptation schemes (Section 3.5) for DNNs usually involve a large number
of parameters being adapted, which makes effective adaptation impractical when there
are limited adaptation data. The MBANN model aims at introducing structures to the
network topology, allowing network adaptation to be performed robustly and rapidly.
A set of parallel sub-networks, i.e. bases, are introduced. Weights are restricted to
connect within a single basis, and different bases share no connectivity. The outputs
among different bases are subsequently combined via speaker-dependent interpolation.
This chapter also discussed several extensions to the basic MBANN model. To
combine i-vector representation, two combination schemes were presented. The first
scheme appends i-vectors to the DNN input features. In this configuration, the bases are
explicitly informed about acoustic attributes, and the robustness to acoustic variations
can be reinforced. The second scheme uses i-vectors to directly predict the speaker-
dependent transform for MBANN. This avoids the requirement for decoding hypotheses
in adaptation, which helps to reduce the computational cost, as well as improve the
robustness to hypothesis errors. The target-dependent interpolation was discussed,
which introduces multiple sets of interpolation weights to separately adapt different
DNN targets. Lastly, the inter-basis connectivity generalises the MBANN framework
with parameters between different bases.
Chapter 5
Stimulated Deep Neural Networks
In the previous chapter, multi-basis adaptive neural networks were presented. A set of
bases, i.e. parallel sub-network structures, with restricted connections are modelled,
and the restricted connectivity allows different aspects of data to be modelled separately.
This design can be viewed as a "hard" way of grouping the hidden units.
Alternatively, the concept of hidden-unit grouping can be performed in a “soft”
way. In this chapter, stimulated deep neural networks (Ragni et al., 2017; Tan et al.,
2015a; Wu et al., 2016b) are proposed1 . This type of structured neural network
encourages activation function outputs in regions of the network to be related, aiding
the interpretability and visualisation of network parameters. In standard neural network
training, hidden units can take an arbitrary ordering; thus, it is difficult to relate
parameters to each other. The lack of interpretability can cause problems for network
regularisation and speaker adaptation. The design of stimulated DNN first resolves
the issue of arbitrary ordering of hidden units. In the network topology, the units of
each hidden layer are reorganised to form a grid. Activation functions with similar
behaviours are then learned to group together in this grid space. This objective is
achieved by introducing a special form of regularisation term to the overall training
criterion, referred to as activation regularisation. The activation regularisation is
designed to encourage the outputs of activation functions to satisfy a target pattern.
By defining appropriate target patterns, different visualising, partitioning or grouping
¹ The term "stimulated" follows the naming convention of stimulated learning (Tan et al., 2015a), which
performs stimulation on the outputs of activation functions to induce interpretation.
concepts can be imposed on the network. The stimulated DNN thus avoids the arbitrary-ordering
issue of standard DNN models, and this kind of manipulation is believed to
reduce over-fitting and improve model generalisation. In addition, based on the
similarity between activation functions in this spatial ordering, smoothness techniques
can be used on stimulated DNNs to regularise a range of DNN adaptation methods.
In the literature, a range of approaches have been proposed to interpret DNN parameters.
These schemes mainly focus on analysing a well-trained neural network instead of
inducing useful interpretations in parameter training. For instance, Garson’s algorithm
(Nguyen et al., 2015) was used to inspect feature importance in DNN models. In the area
of computer vision, weight analysis of neural networks has been examined to interpret
neural networks. In Mahendran and Vedaldi (2015); Nguyen et al. (2015); Simonyan
et al. (2013), the input feature was optimised to maximise the output of a given hidden
activation in the network. The visualisation of the feature implies the function of that
activation. However, stimulated DNNs achieve network interpretation in a different
fashion, in which the interpretations are induced during the training procedure. The strategy of inducing desired
network interpretations offers a flexible tool to analyse DNN models. Instead of being
deciphered from complex “black boxes”, DNN parameters can be regularised to present
pre-defined concepts.
This chapter discusses the network topology, the design of activation regularisation,
and adaptation methods for stimulated deep neural networks.
Fig. 5.1 Reorganise units to form a grid in one hidden layer. Non-contiguous elements
(in dotted boxes) can form a contiguous region in the grid representation.
The exact form of the representation H_t^{*(l)} depends on the dimensionality of the grid
representation. If a 1D grid is introduced, H_t^{*(l)} is a vector; in a 2D configuration,
H_t^{*(l)} is a matrix. For example, for a layer with n² hidden units, the 2D representation
H_t^{*(l)} can be expressed as
\[ H_t^{*(l)} = \begin{bmatrix} h_{t,1}^{(l)} & \cdots & h_{t,n}^{(l)} \\ \vdots & \ddots & \vdots \\ h_{t,n^2-n+1}^{(l)} & \cdots & h_{t,n^2}^{(l)} \end{bmatrix} \tag{5.2} \]
2 The superscript “*” is introduced to disambiguate vector and grid representations. In this chapter,
variables in the grid representation are marked with superscript “*”, and variables in the vector
representation are not. For example, h∗tij is the activation function output of unit (i, j) in the grid,
while hti is the activation function output of the i-th hidden unit in the vector representation.
For higher-dimensional configurations, H_t^{*(l)} can be expressed as a higher-order tensor.
This chapter concentrates on the 2D situation as an example for the discussion.
Figure 5.1 shows an example of unit reorganisation on a hidden layer with 9 units. Non-
contiguous nodes (in dotted boxes) can form a contiguous region in the corresponding
grid representation. This grid representation can be viewed as a Cartesian coordinate
system. The hidden unit located at (i, j) in the grid is located at a point in this space,
denoted as sij . The network topology of stimulated DNN defines a spatial order for
each hidden layer, on which the grouping or interpretable regions can be defined using
Euclidean metrics.
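As an illustration, reorganising the hidden-unit vector of a layer with n² units into the 2D grid of Eq. 5.2 is simply a reshape; the sketch below assumes row-major ordering and illustrative names.

```python
import numpy as np

def to_grid(h):
    """Reshape a layer's activation vector h_t^(l) (length n*n)
    into the 2D grid representation H_t^{*(l)} of Eq. (5.2)."""
    n = int(np.sqrt(h.size))
    assert n * n == h.size, "layer width must be a perfect square for a 2D grid"
    return h.reshape(n, n)   # row i holds units i*n+1 ... (i+1)*n

def to_vector(H):
    """Inverse mapping from the grid back to the vector representation."""
    return H.reshape(-1)
```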
The overall training criterion of a stimulated DNN combines the standard criterion with
an activation regularisation term,
\[ \mathcal{F}(\theta; \mathcal{D}) = \mathcal{L}(\theta; \mathcal{D}) + \eta\, \mathcal{R}(\theta; \mathcal{D}) \]
where L(θ; D) is the standard training criterion, and the hyper-parameter η determines
the contribution of the activation regularisation term R(θ; D). The complete framework
for activation regularisation can be described in three discrete stages:
1. the activation function outputs of each hidden layer are transformed, H̃_t^{*(l)} = T(H_t^{*(l)});
2. a target pattern G_t^{(l)} is specified. Several concepts can be embedded in the target
pattern, for example, interpretation or smoothness;
3. the mismatch between the transformed activation outputs and the target pattern is
accumulated to form the regularisation term R(θ; D).
The calculation of the gradients ∂H̃_t^{*(l)}/∂h_{tij}^{*(l)} and ∂R/∂H̃_t^{*(l)} is also discussed below.
\[ \tilde{H}_t^{*(l)} = \mathcal{T}\!\left( H_t^{*(l)} \right). \tag{5.6} \]
There are multiple possible transforms T(·). The most trivial option is the identity
transform, which yields the original activation function output,
\[ \tilde{H}_t^{*(l)} = H_t^{*(l)}. \tag{5.7} \]
Three types of transforms are investigated in this thesis: the normalised activation,
the probability mass function, and the high-pass filtering.
Normalised Activation
The normalised activation scales each activation function output by the magnitude of its
outgoing weights,
\[ \tilde{h}_{tij}^{*(l)} = \xi_{ij}^{(l)}\, h_{tij}^{*(l)} \tag{5.8} \]
where the term ξ_ij^{(l)} reflects the impact that the activation function has on the following
layer l + 1. This is expressed as
\[ \xi_{ij}^{(l)} = \sqrt{\sum_{k} w_{k,\, o(i,j)}^{(l+1)2}} \tag{5.9} \]
where o(i, j) represents the original node index in h_t^{(l)} of the (i, j)-th grid unit. This
provides a method to consider both aspects of the problem: the empirical range of the
activation function, and the influence of the next-layer parameters. The gradient with
respect to the raw activation function outputs is given by
\[ \frac{\partial \tilde{h}_{tmn}^{*(l)}}{\partial h_{tij}^{*(l)}} = \begin{cases} \xi_{ij}^{(l)} & m = i,\ n = j, \\ 0 & \text{otherwise.} \end{cases} \tag{5.10} \]
This normalised activation can be integrated with other transformations, such as the
probability mass function and the high-pass filtering presented below.
Probability Mass Function
The grid can also be treated as a discrete probability space, and the activation function
outputs can be transformed to yield a probability mass function (PMF). This probability
mass function is defined as follows,
\[ \tilde{h}_{tij}^{*(l)} = \frac{h_{tij}^{*(l)}}{\sum_{u,v} h_{tuv}^{*(l)}} \tag{5.11} \]
where the activation function outputs are normalised by their sum. There are some
constraints that need to be satisfied for this PMF transform. To ensure that H̃_t^{*(l)} is
a valid distribution, h̃_{tij}^{*(l)} should be non-negative. This restricts the potential choices
of activation function. Simple methods such as an exp(·) wrapping can be utilised
for an arbitrary function, but this may disable the effect of the negative range in the
activation function. The gradient with respect to the raw activation function outputs can
be calculated via
\[ \frac{\partial \tilde{h}_{tmn}^{*(l)}}{\partial h_{tij}^{*(l)}} = \begin{cases} \dfrac{1}{\sum_{u,v} h_{tuv}^{*(l)}} - \dfrac{h_{tmn}^{*(l)}}{\left( \sum_{u,v} h_{tuv}^{*(l)} \right)^{2}} & m = i,\ n = j, \\[2ex] -\dfrac{h_{tmn}^{*(l)}}{\left( \sum_{u,v} h_{tuv}^{*(l)} \right)^{2}} & \text{otherwise.} \end{cases} \tag{5.12} \]
High-pass Filtering
The high-pass filtering transform is designed to induce smoothness over the activation
function outputs. It includes information about nearby units via a convolution
operation,
\[ \tilde{H}_t^{*(l)} = H_t^{*(l)} * K \tag{5.13} \]
where K is a filter. The filter can take a range of forms. For example, a Gaussian
high-pass filter, used in Wu et al. (2016b), assigns the impact of other nodes according
to the distance; a simple 3 × 3 kernel, used in Xiong et al. (2016), only introduces
adjacent nodes.
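For illustration, the high-pass filtering transform with the simple 3 × 3 kernel used later in the preliminary experiments (central tap 1, surrounding taps −0.125) can be sketched as below; the boundary handling is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import convolve2d

def highpass_transform(H):
    """High-pass filtering of the activation grid, cf. Eq. (5.13),
    with a 3x3 kernel: central tap 1, surrounding taps -0.125.

    H : (n, n) grid of activation function outputs for one layer and frame.
    """
    K = np.full((3, 3), -0.125)
    K[1, 1] = 1.0
    return convolve2d(H, K, mode="same", boundary="symm")
```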
Time-variant Pattern
The activation grid can be split into a set of spatial regions. The meanings of regions
can take a variety of concepts, such as phonemes, noise types or speaker variations.
In this way, different grid regions can model and respond to different concepts in the
data. The time-variant pattern is designed over these regions: for different types of data, it
encourages high activation function outputs in the corresponding
regions. This design aids the network interpretability, that is, a particular unit can be
interpreted according to its location in the grid.
For example, phone-dependent patterns encourage the regions to model different
phones. In each hidden layer, a set of phoneme (or grapheme) dependent target patterns
is defined by the targets of training data. A point in this grid space is associated with
each phone /p/, denoted as s/p/ . These phoneme positions can be determined using a
range of methods, such as t-SNE (Maaten and Hinton, 2008) using the acoustic feature
means of the phones. It is then possible to apply a transform to target patterns in a
similar fashion to its activation function output transform
\[ g_{tij}^{(l)} = \frac{\exp\left( -\frac{1}{2\sigma^2} \left\| s_{ij} - \hat{s}_{/p_t/} \right\|_2^2 \right)}{\sum_{m,n} \exp\left( -\frac{1}{2\sigma^2} \left\| s_{mn} - \hat{s}_{/p_t/} \right\|_2^2 \right)} \tag{5.14} \]
where ŝ_{/p_t/} is the position in the grid space of the "correct" phoneme at time t, and the
sharpness factor σ controls the sharpness of the surface of the target pattern. For each
phoneme, a Gaussian contour is defined over its nearby region in the grid, encouraging
nodes corresponding to the same phoneme to be grouped in the same region. An
example of a phone-dependent target pattern is shown in Figure 5.2. The target pattern
induces a deterministic ordering of the hidden units in which activation functions with similar
Fig. 5.2 Phone-dependent target pattern. It includes an example of target pattern and
the corresponding activation function outputs yielded by stimulated and unstimulated
DNNs. The models were trained on the Wall Street Journal data used for preliminary
experiments in this section.
behaviours are grouped in nearby regions. This form of target pattern prevents
arbitrary ordering, which has the potential to improve regularisation.
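The phone-dependent target pattern of Eq. 5.14 can be sketched as a normalised Gaussian contour over the grid positions, as below; the array layout and names are illustrative.

```python
import numpy as np

def phone_target_pattern(grid_coords, phone_pos, sigma):
    """Phone-dependent target pattern, cf. Eq. (5.14).

    grid_coords : (n, n, 2) array with the 2D position s_ij of each grid unit.
    phone_pos   : (2,) position of the current phone in the grid space.
    sigma       : sharpness factor of the Gaussian contour.
    Returns an (n, n) pattern that sums to one.
    """
    sq_dist = np.sum((grid_coords - phone_pos) ** 2, axis=-1)
    g = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return g / g.sum()
```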
Time-invariant Pattern
Time variant patterns require time-dependent “labels” to be derived from the training
data. Alternatively, time-invariant patterns can be used to specify general, desirable
attributes of the network activation pattern. This can be expressed as
\[ G_t^{(l)} = G^{(l)} \quad \forall t. \tag{5.15} \]
This time-invariant pattern can impose a global concept on the activation grid.
The activation regularisation term accumulates, over frames and layers, the mismatch
between the transformed activation outputs and the target patterns,
\[ \mathcal{R}(\theta; \mathcal{D}) = \frac{1}{|\mathcal{D}|} \sum_{t=1}^{|\mathcal{D}|} \sum_{l} \mathcal{D}\!\left( \tilde{H}_t^{*(l)}, G_t^{(l)} \right) \tag{5.16} \]
where D(H̃_t^{*(l)}, G_t^{(l)}) measures the mismatch between the activation output and the
target pattern for layer l. There are a range of approaches to defining the function
D(·, ·). Three are investigated in this thesis: the mean squared error, the KL-divergence,
and the cosine similarity.
A simple way to measure the difference is the mean squared error (MSE) method,
defined as
\[ \mathcal{D}\!\left( \tilde{H}_t^{*(l)}, G_t^{(l)} \right) = \left\| \tilde{H}_t^{*(l)} - G_t^{(l)} \right\|_F^2. \tag{5.17} \]
The gradient ∂D/∂H̃_t^{*(l)} can be calculated by
\[ \frac{\partial \mathcal{D}}{\partial \tilde{H}_t^{*(l)}} = 2\left( \tilde{H}_t^{*(l)} - G_t^{(l)} \right). \tag{5.19} \]
This method minimises the element-wise squared error between H̃_t^{*(l)} and G_t^{(l)}. For
example, using the raw activation function output as the transformed one and a
time-invariant zero target pattern, G_t^{(l)} = 0, yields an L2 penalty on the activation
function outputs. Other target patterns can also be used, with appropriate
patterns depending on the activation function. For activation functions with fixed
ranges, such as sigmoid or tanh, G_t^{(l)} can easily be rescaled to an appropriate range.
However, for activation functions such as ReLU, in which the range is not restricted,
the rescaling of G_t^{(l)} tends to require empirical tuning.
KL-divergence
One way to address the dynamic-range issue between H̃_t^{*(l)} and G_t^{(l)} is to combine the
PMF transformation with a distribution distance such as the KL-divergence method.
The difference D(H̃_t^{*(l)}, G_t^{(l)}) is the KL-divergence between the two distributions, the target
pattern G_t^{(l)} and the activation distribution H̃_t^{*(l)},
\[ \mathcal{D}\!\left( \tilde{H}_t^{*(l)}, G_t^{(l)} \right) = \sum_{i,j} g_{tij}^{(l)} \log \frac{g_{tij}^{(l)}}{\tilde{h}_{tij}^{*(l)}}. \tag{5.22} \]
For example, by using the phoneme-dependent target pattern (Eq. 5.14) and the
probability mass function (Eq. 5.11), the KL-divergence can spur different regions in
the grid to correspond to different phonemes. The gradient ∂D/∂H̃_t^{*(l)} can be calculated by
\[ \frac{\partial \mathcal{D}}{\partial \tilde{h}_{tij}^{*(l)}} = -\frac{g_{tij}^{(l)}}{\tilde{h}_{tij}^{*(l)}}. \tag{5.23} \]
The KL-divergence regularisation requires H̃^{*(l)} to be positive to yield a valid distribution.
This limits the choices of activation function. There are several approaches
bution. This limits the choices of activation function. There are several approaches
to convert specific activation functions to be positive. For example, by using tanh +1,
instead of tanh, in Eq. 5.11, the KL-divergence regularisation can manipulate the
hyperbolic tangent function with a similar pattern as the sigmoid function. However,
these methods require a pre-defined lower bound on the activation function, which
cannot be applied in all cases.
Cosine Similarity
5.3 Smoothness Method for Adaptation
In the LHUC scheme, a speaker-dependent scaling factor α_i^{(l,s)}, where s stands for the
speaker index, is introduced independently for every activation function of each hidden
layer, scaling its output element-wise. Scaling factors are introduced per activation;
thus, the number of independent parameters to adapt is equal to the number of DNN
hidden units. The lack of interpretable meanings in standard-trained DNN causes that
scaling factors are modelled as independent components instead of groups based on
∗(ls)
functional similarities. However, in stimulated DNNs, H̃ is regularised to behave
as a smooth surface. That is, nearby activation functions in the spatial ordering
are likely to perform analogously. Based on this property, the LHUC adaptation
with smoothness regularisation can be performed, which aims to smooth the adapted
activation outputs by spatial neighbours. This regularisation is achieved by a special
adaptation regularisation term. Defining Λ(s) as the speaker-dependent transform for
LHUC, which consists of all scaling factors, gives
(ls)
Λ(s) = {αi }i,l. (5.27)
$$\mathcal{R}_L(\Lambda^{(s)}; \mathcal{D}) = \frac{1}{2T} \sum_{t=1}^{|\mathcal{D}|} \sum_{l} \sum_{i,j} \sum_{m,n} q_{mn}^{ij} \left( \tilde{h}_{tij}^{*(ls)} - \tilde{h}_{tmn}^{*(ls)} \right)^2 \quad (5.28)$$
$$q_{mn}^{ij} = \frac{1}{Q_{ij}} \exp\left( -\frac{1}{2\sigma_L^2} \left\| \mathbf{s}_{ij} - \mathbf{s}_{mn} \right\|_2^2 \right), \quad (5.29)$$
where the hyper-parameter $\sigma_L$ specifies the distance-decay factor, and $Q_{ij}$ is a normal-
isation term,
$$Q_{ij} = \sum_{\tilde{i},\tilde{j}} \exp\left( -\frac{1}{2\sigma_L^2} \left\| \mathbf{s}_{ij} - \mathbf{s}_{\tilde{i}\tilde{j}} \right\|_2^2 \right). \quad (5.30)$$
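A minimal sketch of this smoothness term (Eqs. 5.28–5.30) for a single frame of a single layer is given below; the grid size, the value of σ_L and the flattening of the grid into a vector are assumptions for illustration only.

```python
import numpy as np

def smoothness_penalty(h_grid, positions, sigma_L=0.1):
    """Gaussian-kernel weighted squared differences between the adapted
    activation outputs of spatial neighbours (Eqs. 5.28-5.30), for one
    frame of one layer. h_grid: (N,) activations; positions: (N, 2)."""
    # Pairwise squared distances between the 2D grid positions s_ij.
    d2 = np.sum((positions[:, None, :] - positions[None, :, :]) ** 2, axis=-1)
    q = np.exp(-d2 / (2.0 * sigma_L ** 2))       # Eq. 5.29 numerator
    q /= q.sum(axis=1, keepdims=True)            # normalisation Q_ij (Eq. 5.30)
    diff2 = (h_grid[:, None] - h_grid[None, :]) ** 2
    return 0.5 * np.sum(q * diff2)               # inner part of Eq. 5.28

# 8x8 toy grid with positions in the unit square.
side = 8
xs, ys = np.meshgrid(np.linspace(0, 1, side), np.linspace(0, 1, side))
pos = np.stack([xs.ravel(), ys.ravel()], axis=1)
h = np.random.rand(side * side)
print(smoothness_penalty(h, pos))
```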
5.4 Preliminary Experiments
Table 5.1 WSJ-SI84: Summary of training and evaluation sets. It includes the total
hours, number of utterances (#Uttr) and average utterance duration (AvgUttr).
1. KL: The KL system used the KL-divergence regularisation (Eq. 5.22) with the
activation PMF (Eq. 5.11) and the phoneme-dependent target pattern (Eq. 5.14).
2. Cos: The Cos system used the cosine similarity regularisation (Eq. 5.24) with
the normalised activation (Eq. 5.8) and the phoneme-dependent target pattern
(Eq. 5.14).
3. Smooth: The smooth system used the mean-squared-error regularisation (Eq. 5.17)
with the high-pass filtering activation transformation (Eq. 5.13) and the zero
target pattern (Eq. 5.21). A 3 × 3 kernel was used as the high-pass filter, in which
the central tap was 1 and the others were -0.125 (a code sketch of this filtering is
given after this list). In this way, the activation function outputs were smoothed
with their adjacent ones in the grid, and a smooth surface was formed over the
activation grid.
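The sketch below applies the 3 × 3 high-pass kernel described above (central tap 1, surrounding taps -0.125) to an activation grid and penalises the filtered output towards zero; the use of scipy's 2D convolution and the boundary handling are assumptions rather than the exact pipeline used here.

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 high-pass kernel: central tap 1, the other eight taps -0.125.
kernel = -0.125 * np.ones((3, 3))
kernel[1, 1] = 1.0

def smooth_regulariser(h_grid):
    # High-pass filter the activation grid (cf. Eq. 5.13; boundary handling
    # assumed), then apply the MSE criterion against a zero target
    # (Eqs. 5.17 and 5.21): only non-smooth variation is penalised.
    filtered = convolve2d(h_grid, kernel, mode='same', boundary='symm')
    return np.sum(filtered ** 2)

h = np.random.rand(32, 32)          # activation outputs on a 32x32 grid
print(smooth_regulariser(h))
```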
For the KL and Cos systems, where time-variant target patterns were required, 46
English phonemes were used to define the time-variant target patterns, and the 2D positions
of the phonemes were estimated via the t-SNE method (Maaten and Hinton, 2008)
applied to the averaged frames of each phoneme. They were then scaled to fit in a
unit square [0, 1] × [0, 1]. Figure 5.3 illustrates the phoneme positions. For the Cos
and Smooth systems, the sigmoid, ReLU and tanh DNNs were investigated. For the
KL system, only the sigmoid DNN was investigated, due to the positivity constraint on
the activation function required by the KL-divergence regularisation.
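For illustration, a phone-dependent target pattern can be sketched as a Gaussian bump centred at the current phone's 2D position in the unit square; the exact functional form of Eq. 5.14 is not reproduced here, so the bump shape, the sharpness value and the normalisation below are assumptions.

```python
import numpy as np

def phone_target_pattern(phone_xy, grid_side=32, sigma=0.1):
    """Hypothetical phone-dependent target: a Gaussian bump with sharpness
    sigma placed at the phone's 2D position in the unit square, normalised
    to a probability mass function over the grid."""
    coords = np.linspace(0.0, 1.0, grid_side)
    gx, gy = np.meshgrid(coords, coords)
    d2 = (gx - phone_xy[0]) ** 2 + (gy - phone_xy[1]) ** 2
    g = np.exp(-d2 / (2.0 * sigma ** 2))
    return g / g.sum()

# Example: a frame aligned to a phone located at (0.2, 0.3) in Fig. 5.3.
target = phone_target_pattern((0.2, 0.3))
print(target.shape, target.sum())
```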
KL Activation Regularisation
The first experiment investigated the impact of the normalised activation that is defined
in Eq. 5.8. As discussed in Section 5.2.1, the normalised activation (Eq. 5.8) can be
combined with other activation transformations. The combination of the normalised
  σ² \ η     0.1     0.2     0.3     0.5
  0.05      10.0     9.9      –       –
  0.1        9.9     9.7    10.1    10.6
  0.2        9.9     9.8      –       –

Table 5.2 WSJ-SI84: Recognition performance (WER %) of the KL system for different
values of σ² (rows) and the regularisation penalty η (columns).
activation and the PMF activation transformation (Eq. 5.11) was examined on the
KL system, as shown in Table 5.2. By combining the normalised activation, the KL
system could further reduce the word error rate. Similar results were also found in the
Cos and Smooth systems.
Fig. 5.4 WSJ-SI84: Cross-entropy and KL-divergence values of the CV set on different
regularisation penalties.
Fig. 5.5 WSJ-SI84: CE values of training and CV sets using different activation
regularisations on sigmoid stimulated DNNs.
functions around the phoneme “ay” location showed higher values than other regions.
The Smooth system, which does not have a specific target, simply yielded a smoothed
pattern. The presented grid outputs matched the corresponding target patterns,
indicating the effectiveness of activation regularisation in inducing the behaviour of the
activation functions.
Tables 5.5 and 5.6 summarise the decoding performance of different stimulated
DNNs on the H1-Dev and H1-Eval test sets. The Cos and Smooth regularisations on
sigmoid, ReLU and tanh DNNs yielded similar performance. On this relatively small
task, small but consistent gains were obtained.
Fig. 5.6 WSJ-SI84: Comparison of activation grid outputs of raw, KL, Cos and Smooth
systems on an “ay” frame.
5.5 Summary
Stimulated deep neural networks were presented in this chapter. This type of structured
neural network relates activation functions in regions of the network to aid interpretation
and visualisation. In the network topology of stimulated DNN, hidden units are
reorganised to form a grid, and activation functions with similar behaviours are then
grouped together in this grid space. This goal is achieved by introducing a special form
of regularisation, the activation regularisation, which is designed to encourage the outputs
of activation functions to satisfy a target pattern. By defining appropriate target patterns,
different learning, partitioning or grouping concepts can be imposed on the network. This
design prevents hidden units from taking an arbitrary order, which has the potential to
improve network regularisation.
Also, based on the restricted ordering of hidden units, smoothness techniques can be
used to improve the adaptation schemes on stimulated DNNs. This chapter used the
LHUC adaptation approach as an example to explain how the smoothness method can
be performed.
Chapter 6

Deep Activation Mixture Models
In the previous chapter, stimulated deep neural networks were presented. This type of
structured DNN constrains the hidden-layer representation towards a desired target pattern
using the activation regularisation. The concept of a target pattern can be viewed as a
“reference model”, which roughly controls the general behaviour of the activation functions.
Activation regularisation is an implicit stimulating mechanism that induces activation
functions to learn separate parts of the target pattern by themselves.
This chapter proposes deep activation mixture models (DAMMs) (Wu and Gales,
2017). Inspired by stimulated DNNs, the DAMM is also designed to relate the hidden
units in regions of the network. The activation functions in a DAMM are modelled as
the sum of a mixture model and a residual model. The mixture model forms an activation
contour that roughly describes the general behaviour of the activation functions. Rather
than being implemented as a regularisation term, as in stimulated DNNs, the induced
behaviour of the activation functions in a DAMM is controlled by a distinct network
structure, i.e. the mixture model. The residual model specifies a fluctuation term for each
activation function on this contour. Consequently, the resultant activation functions stay
close to a smooth contour controlled by the mixture model, which encourages activation
functions of nearby hidden units to be similar. The introduction of the mixture model in
the DAMM can be viewed as an informed regularisation that provides a dynamic prior
pattern for the activation functions. The activation functions in the DAMM are related
and controlled, which has the potential to improve network regularisation. In addition,
the highly restricted nature of the mixture model allows it to be robustly re-estimated.
This design
enables a novel approach to network adaptation, even when there is limited adaptation
data.

Fig. 6.1 Deep activation mixture model.

In contrast to mixture density networks (Bishop, 1994; Richmond, 2006; Variani
et al., 2015; Zen and Senior, 2014; Zhang et al., 2016a), deep activation mixture
models utilise the contour of mixture-model distributions, instead of estimating density
functions in a “deep” configuration.
The discussion in this chapter includes the network topology, parameter training
and adaptation methods for deep activation mixture models.
grid, which is the same as that in stimulated DNNs (Section 5.1). In the $l$-th hidden
layer, the output of the activation functions $\mathbf{h}_t^{(l)}$ is defined as the sum of a mixture
model $\mathbf{h}_t^{(\text{mix},l)}$ and a residual model $\mathbf{h}_t^{(\text{res},l)}$,
$$\mathbf{h}_t^{(l)} = \mathbf{h}_t^{(\text{mix},l)} + \mathbf{h}_t^{(\text{res},l)}.$$
The effect of mixture model is illustrated in Figure 6.2. It can be viewed as an informed
regularisation on activation functions. The contour, dynamically generated by the
mixture model, forces the outputs of activation functions to stay around it. By turning
off the mixture model, the DAMM with the residual model only will degrade to a
standard DNN. The outputs of activation functions of a standard DNN can be viewed
as fluctuations over a static plane that depends on the choice of activation function.
For example, if tanh is used, the plane is located at zero; if sigmoid is used, the plane
is located at 0.5. L2 regularisation is usually used in network training and its effect
can be viewed as encouraging the activation function outputs to stay around the static
plane. Though regularisation can often be achieved, the static plane itself carries no
meaningful interpretation. Instead, in a DAMM, the mixture model extends the static plane
to a dynamic surface that informs the activation functions about their rough outputs. This
informed design has the potential to improve the network regularisation. The mixture
and residual models are defined below.
$$h_{ti}^{(\text{mix},l)} = \zeta_t^{(l)} \sum_{k=1}^{K} \pi_{tk}^{(l)} \mathcal{N}\left(\mathbf{s}_i; \boldsymbol{\mu}_k^{(l)}, \boldsymbol{\Sigma}_k^{(l)}\right) \quad (6.2)$$
where $K$ stands for the number of Gaussian components. $\zeta_t^{(l)}$ is a scaling factor
that specifies the importance of the mixture model,
$$\zeta_t^{(l)} = \mathrm{sig}\left(\mathbf{g}^{(l)\mathsf{T}} \mathbf{h}_t^{(l-1)} + r^{(l)}\right), \quad (6.3)$$
where $\mathbf{g}^{(l)}$ and $r^{(l)}$ are the associated parameters. It is modelled by a sigmoid
function so that it lies in the range (0, 1). This scaling factor is introduced to
dynamically scale the output of the mixture model to an appropriate range.
The mixing weights $\boldsymbol{\pi}_t^{(l)}$ for the different mixture components are modelled by a
softmax function,
$$\pi_{tk}^{(l)} = \frac{\exp\left(\mathbf{a}_k^{(l)\mathsf{T}} \mathbf{h}_t^{(l-1)} + c_k^{(l)}\right)}{\sum_{\tilde{k}} \exp\left(\mathbf{a}_{\tilde{k}}^{(l)\mathsf{T}} \mathbf{h}_t^{(l-1)} + c_{\tilde{k}}^{(l)}\right)} \quad (6.4)$$
where $\mathbf{A}^{(l)}$ and $\mathbf{c}^{(l)}$ are its associated parameters. The softmax function is used so that
$\boldsymbol{\pi}_t^{(l)}$ is a valid distribution, satisfying the sum-to-one and positivity constraints on the
mixture weights. The Gaussian mean vectors $\{\boldsymbol{\mu}_k^{(l)}\}$ and covariance matrices $\{\boldsymbol{\Sigma}_k^{(l)}\}$
can be introduced with desired interpretations. For example, by setting $\{\boldsymbol{\mu}_k^{(l)}\}$ to the
2D projections of phonemes, the regions in the DAMM activation grid can be induced
with phone meanings.
The residual model is defined as
$$\mathbf{h}_t^{(\text{res},l)} = \tanh\left(\mathbf{W}^{(l)\mathsf{T}} \mathbf{h}_t^{(l-1)} + \mathbf{b}^{(l)}\right) \quad (6.5)$$
where $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are its parameters. It describes precise variations at different
locations over the contour, enriching the expressiveness of every activation function.
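A minimal numpy sketch of one DAMM hidden layer (Eqs. 6.2–6.5) is given below; the bivariate-Gaussian evaluation, the toy dimensions and the parameter initialisation are simplified assumptions rather than the configuration used in the experiments.

```python
import numpy as np

def gaussian_pdf(s, mu, cov):
    # Bivariate Gaussian density evaluated at the grid positions s: (N, 2).
    diff = s - mu
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(expo)

def damm_layer(h_prev, params, s):
    """One DAMM hidden layer: mixture contour plus residual fluctuation."""
    W, b = params['W'], params['b']                    # residual model
    A, c = params['A'], params['c']                    # mixing-weight model
    g, r = params['g'], params['r']                    # scaling factor
    mus, covs = params['mu'], params['Sigma']          # fixed Gaussian components
    zeta = 1.0 / (1.0 + np.exp(-(g @ h_prev + r)))     # Eq. 6.3
    logits = A @ h_prev + c
    pi = np.exp(logits - logits.max()); pi /= pi.sum() # Eq. 6.4
    h_mix = zeta * sum(pi[k] * gaussian_pdf(s, mus[k], covs[k])
                       for k in range(len(pi)))        # Eq. 6.2
    h_res = np.tanh(W.T @ h_prev + b)                  # Eq. 6.5
    return h_mix + h_res                               # sum of the two models

# Toy setup: 16 hidden units on a 4x4 grid, 8 inputs, 3 Gaussian components.
rng = np.random.default_rng(0)
N, D, K = 16, 8, 3
xs, ys = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
s = np.stack([xs.ravel(), ys.ravel()], axis=1)
params = dict(W=rng.normal(size=(D, N)) * 0.1, b=np.zeros(N),
              A=rng.normal(size=(K, D)) * 0.1, c=np.zeros(K),
              g=rng.normal(size=D) * 0.1, r=0.0,
              mu=rng.uniform(0, 1, size=(K, 2)),
              Sigma=np.array([0.1 * np.eye(2)] * K))
print(damm_layer(rng.normal(size=D), params, s).shape)
```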
The total number of Gaussian components is usually smaller than the number of units in
one hidden layer. The mixture model is therefore highly restricted, due to the small number
of parameters associated with it. This compact property allows the mixture model to be
robustly re-estimated, which is suitable for network post-modifications such as speaker
adaptation (Section 6.3).
$$\theta_{\text{mix}} = \left\{\mathbf{g}^{(l)}, r^{(l)}, \mathbf{A}^{(l)}, \mathbf{c}^{(l)}\right\}_{1 \le l < L}, \quad (6.6)$$
$$\theta_{\text{res}} = \left\{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\right\}_{1 \le l \le L}. \quad (6.7)$$
Note that the mean vectors $\{\boldsymbol{\mu}_k^{(l)}\}$ and covariance matrices $\{\boldsymbol{\Sigma}_k^{(l)}\}$ of the Gaussian com-
ponents are fixed during the training phase of the DAMM. This configuration forces the
mixture model to form a meaningful contour. According to the experiments on joint training
of the means and covariances (Section 6.4.2), unconstrained optimisation tends to deactivate
the effect of the mixture model, which is harmful to the overall model.
In Section 6.3, however, these parameters are used as the speaker-dependent transform
to adapt a well-trained DAMM.
The residual model contains many more parameters than the mixture model. If a
simple joint training scheme is used, the effect of the mixture model can easily be
absorbed by that of the residual model, leaving the mixture model unable to generate
a sensible activation contour to regularise the activation function outputs. To emphasise
the informed regularisation, the mixture model should be trained to its maximal extent.
Parameter training for the DAMM should therefore be organised to address these concerns,
in addition to optimising the primary training criterion. The outline of this training
mode is described in Algorithm 5. The DAMM is constructed in a layer-wise manner.
During the construction phase (Line 1–5), the l-th iteration first initialises and adds
parameters for the mixture and residual models for the l-th layer, denoted as θ (mix,l)
and θ (res,l) , respectively. The parameters of the mixture model are randomly initialised
to break the modelling symmetry. In contrast, the parameters of the residual model
are initialised as zeros, and this zero initialisation acts as an implicit regularisation
to encourage small parameters. The update of the mixture model is performed till
convergence (referred to as finetune). The residual model is fully optimised at last
(Line 6). Intuitively, this isolating mode specifies a “curriculum” to train different
network components in a pre-defined order. The complete network is constructed
greedily from shallow to deep, which is similar to layer-wise DNN pre-training. In each
layer, the mixture model is introduced at first, and optimised till convergence, prior to
the introduction of the residual one. This design ensures that the mixture model is
trained to its maximal extent.
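The isolating training mode of Algorithm 5 can be summarised by the short outline below; the helper returns only the order in which parameter groups are updated, and the per-stage iteration count is an illustrative assumption rather than the exact schedule used here.

```python
def isolating_schedule(num_layers, residual_iters=3):
    """Outline of the isolating training mode (Algorithm 5): the order in
    which the DAMM parameter groups are initialised and updated."""
    schedule = []
    for l in range(1, num_layers + 1):
        # Mixture parameters are randomly initialised to break symmetry;
        # residual parameters start at zero (implicit regularisation).
        schedule.append((f'layer {l}', 'init: mixture random, residual zero'))
        schedule.append((f'layer {l}', 'update mixture model until convergence'))
        schedule.append((f'layer {l}', f'update residual model for {residual_iters} iterations'))
    # Finally, the residual models are fully optimised.
    schedule.append(('all layers', 'fully optimise residual models'))
    return schedule

for stage in isolating_schedule(num_layers=5):
    print(stage)
```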
Parameter training for the DAMM is again based on stochastic gradient descent
and error back-propagation. In this thesis, the overall training criterion combines the
primary criterion with an L2 regularisation term $\mathcal{R}(\theta_{\text{res}}; \mathcal{D})$ weighted by a
hyper-parameter $\eta$. For DAMMs, the regularisation is only applied to the parameters of
the residual model $\theta_{\text{res}}$,
$$\mathcal{R}(\theta_{\text{res}}; \mathcal{D}) = \frac{1}{2} \sum_{l} \left( \sum_{i} b_i^{(l)2} + \sum_{i,j} w_{ij}^{(l)2} \right). \quad (6.9)$$
$$\frac{\partial \mathcal{F}}{\partial \mathbf{h}_t^{(\text{res},l)}} = \frac{\partial \mathcal{F}}{\partial \mathbf{h}_t^{(l)}}. \quad (6.10)$$
To calculate $\frac{\partial \mathcal{F}}{\partial \mathbf{A}^{(l)}}$ and $\frac{\partial \mathcal{F}}{\partial \mathbf{c}^{(l)}}$, it requires
$$\frac{\partial \mathcal{F}}{\partial \pi_{tk}^{(l)}} = \zeta_t^{(l)} \sum_{i} \mathcal{N}\left(\mathbf{s}_i; \boldsymbol{\mu}_k^{(l)}, \boldsymbol{\Sigma}_k^{(l)}\right) \frac{\partial \mathcal{F}}{\partial h_{ti}^{(l)}}. \quad (6.12)$$
These gradients can then be combined with stochastic gradient descent to perform the
network training.
6.3 Adaptation
This section discusses the adaptation of a well-trained DAMM. In standard DNN
configurations, since activation functions have no explicit meaning, independent (untied)
parameters are often introduced for each activation function to handle the adaptation.
In comparison, the DAMM uses the mixture model to form the rough outputs of the
activation functions, so adapting the mixture model affects all the activation functions
in a “tied” fashion. In this thesis, the adaptation of the mixture model is performed on
the Gaussian components. In the adaptation phase, the change of the contour should
effectively adapt the DAMM to an unseen speaker. The outputs of the adapted activation
functions are again given by the sum of the adapted mixture model and the residual model,
where $s$ stands for the speaker index. The mixture model $\mathbf{h}^{(\text{mix},ls)}$ is adapted to speaker $s$
through the mean vectors and covariance matrices of the Gaussian components,
$$h_{ti}^{(\text{mix},ls)} = \zeta_t^{(l)} \sum_{k=1}^{K} \pi_{tk}^{(l)} \mathcal{N}\left(\mathbf{s}_i; \boldsymbol{\mu}_k^{(ls)}, \boldsymbol{\Sigma}_k^{(ls)}\right). \quad (6.14)$$
In the context of adapting a DAMM, the canonical model includes the parameters of the
affine transformations in the mixture and residual models for all layers (optimised in
training). The speaker-dependent transform $\Lambda^{(s)}$ consists of the mean vectors and
covariance matrices of all mixture components for all layers,
$$\Lambda^{(s)} = \left\{\boldsymbol{\mu}_k^{(ls)}, \boldsymbol{\Sigma}_k^{(ls)}\right\}_{1 \le l \le L,\, 1 \le k \le K}. \quad (6.16)$$
The re-estimation of $\boldsymbol{\mu}_k^{(ls)}$ and $\boldsymbol{\Sigma}_k^{(ls)}$ changes the contour of the Gaussian mixture
model, which affects all activation functions in that layer to some degree.
To perform effective adaptation, the mean vector and covariance matrix of each
Gaussian component are parametrised as follows. The mean vector $\boldsymbol{\mu}_k^{(ls)}$ can be used as
SD parameters directly. However, to make $\boldsymbol{\Sigma}_k^{(ls)}$ a valid covariance matrix, it must
satisfy the positive-definite property. This requirement can be met by constrained
optimisation methods. Alternatively, the 2D grid configuration presented in this
chapter allows a simpler scheme. In this 2D configuration, $\boldsymbol{\Sigma}_k^{(ls)}$ is a 2 × 2 covariance
matrix of a bivariate Gaussian PDF, and thus can be factorised as
$$\boldsymbol{\Sigma}_k^{(ls)} = \begin{bmatrix} \sigma_{k1}^{(ls)2} & \rho_k^{(ls)}\sigma_{k1}^{(ls)}\sigma_{k2}^{(ls)} \\ \rho_k^{(ls)}\sigma_{k1}^{(ls)}\sigma_{k2}^{(ls)} & \sigma_{k2}^{(ls)2} \end{bmatrix} \quad (6.17)$$
where $\boldsymbol{\sigma}_k^{(ls)}$ represents the unit variance vector, which should be positive, and $\rho_k^{(ls)}$ is the
correlation coefficient, which should lie in the range $[-1, 1]$. They can be parametrised
as
$$\boldsymbol{\sigma}_k^{(ls)} = \exp\left(\tilde{\boldsymbol{\sigma}}_k^{(ls)}\right), \quad (6.18)$$
$$\rho_k^{(ls)} = \tanh\left(\tilde{\rho}_k^{(ls)}\right) \quad (6.19)$$
to comply with these constraints. $\tilde{\boldsymbol{\sigma}}_k^{(ls)}$ and $\tilde{\rho}_k^{(ls)}$ are then used as parameters
instead of the raw unit variance and correlation coefficient. By using the matrix form
in Eq. 6.17, the positive-definite property of $\boldsymbol{\Sigma}_k^{(ls)}$ is inherently satisfied, requiring
no additional constraints during optimisation.
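The parametrisation of Eqs. 6.17–6.19 can be sketched as follows: unconstrained parameters are mapped through exp and tanh, and the resulting 2 × 2 covariance is positive definite by construction (assuming the standard factorisation in terms of standard deviations and a correlation coefficient).

```python
import numpy as np

def build_covariance(sigma_tilde, rho_tilde):
    """Map unconstrained speaker-dependent parameters to a valid 2x2
    covariance matrix (Eqs. 6.17-6.19)."""
    sigma = np.exp(sigma_tilde)        # Eq. 6.18: positive 'unit variances'
    rho = np.tanh(rho_tilde)           # Eq. 6.19: correlation in (-1, 1)
    return np.array([
        [sigma[0] ** 2,             rho * sigma[0] * sigma[1]],
        [rho * sigma[0] * sigma[1], sigma[1] ** 2],
    ])

cov = build_covariance(np.array([-1.15, -1.15]), 0.0)  # roughly unit variance 0.1
print(cov, np.all(np.linalg.eigvals(cov) > 0))
```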
Given adaptation data and a criterion $\mathcal{F}(\Lambda^{(s)}; \mathcal{D})$, the speaker-dependent transform
can be re-estimated by stochastic gradient descent. Define a vector $\boldsymbol{\lambda}_k^{(ls)}$ consisting of
the five adaptable parameters (mean, unit variance and correlation coefficient) of the
$k$-th Gaussian mixture component in the $l$-th layer,
$$\boldsymbol{\lambda}_k^{(ls)} = \left[\mu_{k1}^{(ls)}, \mu_{k2}^{(ls)}, \sigma_{k1}^{(ls)}, \sigma_{k2}^{(ls)}, \rho_k^{(ls)}\right]^{\mathsf{T}}. \quad (6.20)$$
The gradient with respect to $\boldsymbol{\lambda}_k^{(ls)}$ is
$$\frac{\partial \mathcal{F}}{\partial \boldsymbol{\lambda}_k^{(ls)}} = \sum_t \zeta_t^{(l)} \pi_{tk}^{(l)} \sum_i \frac{\partial \tilde{\mathcal{N}}_i}{\partial \boldsymbol{\lambda}_k^{(ls)}} \mathcal{N}\left(\mathbf{s}_i; \boldsymbol{\mu}_k^{(ls)}, \boldsymbol{\Sigma}_k^{(ls)}\right) \frac{\partial \mathcal{F}}{\partial h_{ti}^{(ls)}} \quad (6.21)$$
where $\frac{\partial \tilde{\mathcal{N}}_i}{\partial \boldsymbol{\lambda}_k^{(ls)}}$ represents an expression with respect to the mean, unit variance and
correlation coefficient,
$$\frac{\partial \tilde{\mathcal{N}}_i}{\partial \boldsymbol{\mu}_k} = \boldsymbol{\Sigma}_k^{-1}\left(\mathbf{s}_i - \boldsymbol{\mu}_k\right), \quad (6.22)$$
$$\frac{\partial \tilde{\mathcal{N}}_i}{\partial \sigma_{k1}} = -\frac{1}{\sigma_{k1}} + \frac{(s_{i1} - \mu_{k1})\left(s_{i2}\rho_k\sigma_{k1} + \sigma_{k2}\mu_{k1} - \sigma_{k2}s_{i1} - \rho_k\mu_{k2}\sigma_{k1}\right)}{(\rho_k^2 - 1)\,\sigma_{k1}^3\,\sigma_{k2}}, \quad (6.23)$$
$$\frac{\partial \tilde{\mathcal{N}}_i}{\partial \sigma_{k2}} = -\frac{1}{\sigma_{k2}} + \frac{(s_{i2} - \mu_{k2})\left(s_{i1}\rho_k\sigma_{k2} + \sigma_{k1}\mu_{k2} - \sigma_{k1}s_{i2} - \rho_k\mu_{k1}\sigma_{k2}\right)}{(\rho_k^2 - 1)\,\sigma_{k2}^3\,\sigma_{k1}}, \quad (6.24)$$
$$\frac{\partial \tilde{\mathcal{N}}_i}{\partial \rho_k} = \frac{\left(s_{i2}\rho_k\sigma_{k1} - s_{i1}\sigma_{k2} - \rho_k\mu_{k2}\sigma_{k1} + \mu_{k1}\sigma_{k2}\right)\left(s_{i1}\rho_k\sigma_{k2} - s_{i2}\sigma_{k1} - \rho_k\mu_{k1}\sigma_{k2} + \mu_{k2}\sigma_{k1}\right)}{(\rho_k^2 - 1)^2\,\sigma_{k1}^2\,\sigma_{k2}^2} + \frac{\rho_k}{1 - \rho_k^2}. \quad (6.25)$$
In the adaptation, these parameters can be partially re-estimated, e.g. to only re-
estimate mean vectors for a compact SD transform.
components were given by the 2D projection described in Section 5.4.1. Every $\rho_k^{(l)}$ was
set to 0 and $\boldsymbol{\sigma}_k^{(l)}$ was empirically set to $[\sqrt{0.1}, \sqrt{0.1}]$, i.e. setting the unit variance to
0.1. This model configuration has a comparable number of parameters to the baseline
DNN system. The cross-entropy DAMM was initialised and trained in the isolating
training mode shown in Algorithm 5. In each layer, the mixture model was fully
optimised prior to the introduction of the residual model, and the residual model
was updated for three iterations. The penalty of the residual-model L2 regularisation,
$\eta$, was set to $10^{-4}$.
Fig. 6.3 WSJ-SI84: Learning curves of the DAMM and sigmoid DNN.
test sets. According to the analysis, many of the updated mean vectors $\{\boldsymbol{\mu}_k^{(l)}\}$ were
tuned to move far away from the unit square [0, 1] × [0, 1]. Thus, the contours generated
by these Gaussian components contributed little to the model. This can explain the
gains achieved by disabling the update of the Gaussian components in training. The
DAMM system without training of the Gaussian components was therefore investigated
in the following experiments.
The learning curves of the DAMM and the sigmoid DNN are shown in Figure 6.3.
During the layer-wise construction of the DAMM, the mixture model was tuned to its
maximal extent before enabling the residual model. Because of its highly restricted
nature, the mixture model alone cannot achieve good performance; however, it learned
the rough behaviour of the activation functions. Figure 6.4 illustrates the first-layer
activation function outputs of the mixture and residual models on one training frame.
The mixture model in Figure 6.4a constructed an activation contour, and the residual
model in Figure 6.4b added a small variation to each activation function, as expected
from the network training.
Lastly, the adaptation of the DAMM was investigated. The decoding hypotheses of the
SI DAMM were used as the supervision for adaptation. To adapt the DAMM, the mean
vectors and covariance matrices of the Gaussian components in all hidden layers were
tuned on this supervision. To examine rapid adaptation of the DAMM, adaptation
was performed at the utterance level. Table 6.2 reports the adaptation performance of
the SD DAMM. By performing the adaptation, small but consistent gains were obtained.
The relative WER reduction is up to 4%, compared to the performance of the sigmoid
DNN baseline.
6.5 Summary
This chapter proposed deep activation mixture models. This type of structured
neural network uses a mixture model and a residual model to jointly form activation
functions. The mixture model defines a smooth activation contour, and the residual
model describes fluctuations around this contour. The effect of mixture model can be
viewed as an informed regularisation that has the potential to improve the network
regularisation. Also, it allows novel adaptation schemes on this form of structured
DNN. The discussion started with the network topology of the DAMM. To address the
imbalanced numbers of parameters in the mixture and residual models, the isolating mode
for parameter training was presented. Lastly, the adaptation scheme on DAMM was
discussed.
Chapter 7
Experiments
This chapter presents the evaluation of the three forms of structured deep neural
networks: the multi-basis adaptive neural network in Chapter 4; the stimulated deep
neural network in Chapter 5; and the deep activation mixture model in Chapter 6. The
proposed models were evaluated on two large vocabulary continuous speech recognition
tasks: the Babel languages and broadcast news English.
Table 7.1 Babel: Summary of the languages used. Scripts marked with † utilise capital
letters in the graphemic dictionary.
(FLP) was used for each language. This consists of 40 hours of training data and 10 hours
of development data (Dev). The development data was used for evaluation.
For the Babel project, the performance of the system was evaluated in two ways:
the word error rate, for recognition performance; and the maximum term-weighted
value (MTWV), for keyword-spotting performance. In keyword spotting, the ASR
system is used to generate decoding lattices, and the keyword query is searched over all
possible paths in the lattices. Given the keyword list $\mathcal{Q}$, a metric known as the
term-weighted value (Fiscus et al., 2007) is defined as
$$\mathrm{TWV}(\xi; \mathcal{Q}) = 1 - \frac{1}{|\mathcal{Q}|} \sum_{\omega \in \mathcal{Q}} \left( P^{\mathrm{ms}}(\omega; \xi) + 999.9\, P^{\mathrm{fa}}(\omega; \xi) \right) \quad (7.1)$$
where $P^{\mathrm{ms}}(\omega; \xi)$ and $P^{\mathrm{fa}}(\omega; \xi)$ are, respectively, the rates of miss and false-alarm errors
at detection threshold $\xi$, defined as
$$P^{\mathrm{ms}}(\omega; \xi) = 1 - \frac{\#\mathrm{cor}(\omega; \xi)}{\#\mathrm{ref}(\omega)}, \qquad P^{\mathrm{fa}}(\omega; \xi) = \frac{\#\mathrm{incor}(\omega; \xi)}{\#\mathrm{trial}(\omega)}$$
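A direct transcription of Eq. 7.1 into code is shown below for a fixed detection threshold; the per-keyword counts and keyword names are illustrative, and the MTWV is then the maximum of TWV over candidate thresholds.

```python
def twv(p_miss, p_fa, beta=999.9):
    """Term-weighted value (Eq. 7.1) for per-keyword miss and
    false-alarm rates at one detection threshold."""
    keywords = list(p_miss)
    loss = sum(p_miss[w] + beta * p_fa[w] for w in keywords) / len(keywords)
    return 1.0 - loss

# Toy example with two keywords at one threshold (values are made up).
p_miss = {'keyword_a': 0.30, 'keyword_b': 0.25}
p_fa = {'keyword_a': 0.0002, 'keyword_b': 0.0001}
print(twv(p_miss, p_fa))

# MTWV: take the maximum of TWV over a grid of thresholds, e.g.
# mtwv = max(twv(p_miss_at[t], p_fa_at[t]) for t in thresholds)
```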
To remove the impact of threshold selection, the maximum term-weighted value is used
as the metric for keyword spotting.
To find the most effective activation regularisation for Babel languages, the first
experiment compared the performance of KL, Cos and Smooth systems. They were
evaluated in Javanese, which was picked as the development language in the evaluation,
using a simplified system configuration. That is, a single DNN using the RWTH
multi-language bottleneck features was investigated, and decoding was performed using a
tri-gram language model.

  Regularisation    CE      MPE
  L2               58.2    56.5
  KL               57.2    55.8
  Cos              57.9    56.2
  Smooth           57.9    56.3

Table 7.3 Babel: Recognition performance (WER %) to compare CE and MPE sigmoid
stimulated DNNs using different forms of activation regularisation in Javanese.

Three types of activation function were investigated: sigmoid;
tanh; and ReLU. The Cos and Smooth regularisations were used for stimulated DNNs
using sigmoid, ReLU and tanh activation functions. The KL regularisation was only
used in the stimulated DNN with sigmoid functions, due to the positivity constraint it
places on the activation function.
The recognition performance of the CE systems is summarised in Table 7.2. Across
the different activation-function settings, the Cos and Smooth systems outperformed
their corresponding baselines. The sigmoid function yielded better performance than the
other activation functions, and the best performance was achieved by the KL system.
The sigmoid CE systems were further tuned using the MPE criterion. Table 7.3 reports
the recognition performance of the MPE systems. MPE training yielded lower error
rates than the corresponding CE baselines for all systems. The KL, Cos and Smooth
DNNs all outperformed the baseline MPE DNN. The best performance was achieved by
the KL system, reducing the WER from 56.5% to 55.8%. The KL regularisation was
therefore investigated further in the following experiments.
Table 7.4 Babel: Recognition and Keyword-spotting performance (WER % and MTWV)
of joint decoding systems, with and without stimulated DNNs, in all languages. Stimu-
lated DNNs were trained using the KL activation regularisation.
The second experiment contrasted the impact of stimulated deep neural networks on all
languages in a more advanced configuration, combining four acoustic models and interpolated
FLP and web-data LMs in a single joint-decoding run. For these results, both Tandem
and Hybrid systems were combined using joint decoding, with stimulated DNNs only
being applied to the Hybrid systems. Stimulated DNNs were trained using the KL
regularisation. The results in Table 7.4 show that recognition gains are seen even after
system combination for all languages. Similarly, gains can be seen in keyword-spotting
performance for all languages. Because they were examined within the joint-decoding
system, the gains achieved by stimulated DNNs are relatively small. However, stimulated
DNNs still provided complementarity that was not achieved by the other system candidates.
Table 7.5 Babel: Performance (WER % and MTWV) of joint decoding with stimulated
DNNs of different grid sizes in the four most challenging languages, Pashto, Igbo,
Mongolian and Javanese. Stimulated DNNs were trained using the KL activation
regularisation.

7.2 Broadcast News English
                            BN                 YTB
                Train   Dev03   Eval03    GDev   GEval
Total (hrs)     144.2     2.7      2.6     7.4     7.0
#Uttr           44667     918      816    4288    3624
AvgUttr (secs)   11.6    10.7     10.9     6.2     6.9

Table 7.6 Broadcast News: Summary of training and evaluation sets, including total
hours, number of utterances and average utterance duration.
for neural network training. An improvement is observed in each testset by using the
priors. The extracted i-vectors were also used in multi-basis adaptive neural networks
with i-vector representation.
For multi-basis adaptive neural networks, basis parameters were initialised by the CE
SI DNN model to achieve a sensible initial performance, as discussed in Section 4.2.
The network was then optimised using the CE criterion in the interleaving mode
described in Algorithm 4. This CE system was then further tuned in the interleaving mode
using the MPE criterion for four iterations. For the MPE system, the bases were
initialised using the bases of the multi-basis CE system. In this task, the number
of bases was set to two, and the basis weight vector $\lambda_{\text{mb}}^{(s)}$ for each training speaker
was initialised by setting one weight to 1 and the other to 0 according to the speaker's
gender. $\lambda_{\text{mb}}^{(s)}$ was optimised for each speaker in training. To evaluate the effectiveness of
multi-basis systems in rapid-adaptation scenarios, $\lambda_{\text{mb}}^{(s)}$ was estimated for each test
utterance in an unsupervised fashion.
To combine i-vector representation with multi-basis models (Section 4.4), the i-
vectors extracted in Section 7.2.1 were again used. The first combination scheme
appended i-vectors to the acoustic features to form the DNN input, and a multi-basis
system with i-vector features was trained. The second combination scheme, predictive
multi-basis transform, used i-vectors to directly predict multi-basis interpolation weights.
A support-vector regression model with linear kernels was trained to estimate
$\lambda_{\text{mb}}^{(s)}$ for each speaker from the corresponding training i-vectors of the original
multi-basis system. The open-source toolkit SVMLight (Joachims, 2002;
http://svmlight.joachims.org) was used to estimate the prediction model. The initial
DNN and predictor were then updated in the interleaving mode to adjust the system to
the predictive $\lambda_{\text{mb}}^{(s)}$ space (Section 4.4.2).

                    BN                 YTB
System        Dev03   Eval03     GDev    GEval
DNN            12.5    10.9      58.5    62.1
 +ivec         11.1     9.9      57.0    60.2
MBANN          11.9    10.3      56.9    61.2
 +ivec         11.1     9.8      56.6    60.5

Table 7.7 Broadcast News: Recognition performance (WER %) of the DNN and MBANN
systems, with and without i-vector input features.

              BN                 YTB
Trn     Dev03   Eval03     GDev    GEval
5.4      5.9     5.8       10.0     9.4

Table 7.8 Broadcast News: Average i-vector distance of training and evaluation datasets.
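As an illustration of the predictive scheme described above, the sketch below fits a linear-kernel support-vector regressor from speaker i-vectors to MBANN interpolation weights; it uses scikit-learn rather than SVMLight, and the data shapes, hyper-parameters and synthetic targets are assumptions.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
num_speakers, ivec_dim, num_bases = 200, 100, 2

# Training data: speaker i-vectors and the interpolation weights estimated
# for those speakers by the original multi-basis system (synthetic here).
ivecs = rng.normal(size=(num_speakers, ivec_dim))
weights = rng.dirichlet(np.ones(num_bases), size=num_speakers)

# One linear-kernel SVR per interpolation weight.
predictor = MultiOutputRegressor(SVR(kernel='linear', C=1.0))
predictor.fit(ivecs, weights)

# At test time, a single utterance i-vector directly yields the weights,
# avoiding a decoding-hypothesis-based re-estimation pass.
test_ivec = rng.normal(size=(1, ivec_dim))
print(predictor.predict(test_ivec))
```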
Table 7.7 presents the results of using i-vector input in the MBANN. This
combination (row “MBANN+ivec”) slightly outperformed both the primary i-vector
(row “DNN+ivec”) and MBANN systems in the case of matched acoustic conditions
(columns “BN”), indicating a complementarity of the two approaches. The results are
not as consistent as far as mismatched YTB data are concerned. In this case, i-vectors
achieved the best performance for GEval, while the combination scheme improved
GDev.
The MBANN system with i-vector input features did not consistently outperform
the i-vector DNN baseline on the YTB GDev and GEval test sets. To explain this, the
average Euclidean distance between the test i-vectors and the mean of the training
i-vectors is compared across the evaluation sets in Table 7.8. The BN test sets present a
similar average i-vector distance to the training set, which indicates a similar span of
the BN test and training speaker spaces.

                    BN                 YTB
System        Dev03   Eval03     GDev    GEval
DNN            12.5    10.8      58.5    62.1
MBANN          11.9    10.3      56.9    61.2
 +pred         12.1    10.4      56.7    60.8
 +pred-updt    12.0    10.3      56.1    60.5

Table 7.9 Broadcast News: Recognition performance (WER %) of the MBANN system
using i-vectors to predict the interpolation weights.

The longer distances observed for the YTB i-vectors indicate
the presence of i-vector estimates that are not, or are not sufficiently, represented
by the training speaker space. This may explain why the i-vectors do not improve
the performance in the case of mismatched acoustic conditions. In addition, the
mismatched i-vector inputs seem to incorrectly compensate the hidden representations
among the bases of the multi-basis DNN system and degrade the performance of the
combined system.
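The diagnostic in Table 7.8 amounts to the computation sketched below with synthetic data: the average Euclidean distance between each test i-vector and the mean of the training i-vectors.

```python
import numpy as np

def avg_ivector_distance(train_ivecs, test_ivecs):
    """Average Euclidean distance between test i-vectors and the mean of
    the training i-vectors (the quantity reported in Table 7.8)."""
    centre = train_ivecs.mean(axis=0)
    return float(np.mean(np.linalg.norm(test_ivecs - centre, axis=1)))

rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 100))               # training-speaker i-vectors
matched = rng.normal(size=(200, 100))              # e.g. matched (BN-like) test data
mismatched = rng.normal(loc=0.5, size=(200, 100))  # e.g. mismatched (YTB-like) data
print(avg_ivector_distance(train, matched), avg_ivector_distance(train, mismatched))
```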
In Table 7.9, the second combination scheme is examined where the i-vectors were
used as fast predictors of the MBANN interpolation weights. This is indicated by
the suffix “+pred” in the naming conventions. Moreover, the MBANN with i-vector
predictor system was updated in the mode described in Section 4.4.2 for two iterations
to obtain the refined predictive systems (noted with the suffix “+pred-updt”). For
the BN test sets, the performance of the predictive system is similar to that of the
default MBANN. However, for the mismatched YTB sets, the performance gains by
the predictive model became more consistent. Thus, using the i-vector predictor in
these cases achieves the best results and the desired rapid adaptation.
The improvement observed by using the i-vector predictor in the MBANN system
is investigated in Figure 7.2. This figure compares the distribution of the MBANN
weights of the training set and of GDev set. The training set (blue dots) weights are
repeated in the right figure (Figure 7.2b) for a cleaner representation, as they are
mostly covered by the test weights in the left figure. Concerning Figure 7.2a, the
green dots present the MBANN weights using an i-vector predictor, while the red dots
present the MBANN weights extracted after alignment with the hypothesis extracted
from a first-pass decoding.

Fig. 7.2 Broadcast News: Comparison of MBANN interpolation weights of the training
speakers and YTBGdev test utterances.

It can be seen that the predicted test estimations and the
training ones are distributed in a linear space, presenting higher consistency than the
initial MBANN system (red dots), which may explain the better performance of the
predictive approach.
                    BN                 YTB
System        Dev03   Eval03     GDev    GEval
DNN            11.2    10.2      55.5    59.2
 +ivec         10.8     9.3      57.0    59.6
MBANN          10.7     9.5      55.4    60.3
 +pred         11.2     9.7      54.0    58.6
 +ivec         10.3     9.0      55.1    59.4
 +ivec+pred    10.4     8.9      55.2    59.1

Table 7.10 Broadcast News: Recognition performance (WER %) of the MPE systems.
The performance of the MPE models is summarised in Table 7.10. The multi-
basis system combined with i-vector input still gave a lower WER under matched
acoustic conditions (“BN” columns), and the MBANN system with the i-vector predictor
still achieved the best performance on the YouTube test sets.
For stimulated deep neural networks, only the sigmoid activation function was examined,
as it showed better performance than tanh and ReLU in the experiments in Sections 5.4
and 7.1. The stimulated DNN configuration consisted of 5 hidden layers, and each layer
formed a default 32 × 32 grid. The activation regularisation was performed on all hidden
layers. The investigation of stimulated deep neural networks included three types of
activation regularisation: KL, Cos and Smooth, as described in Section 5.4.1. For the
KL and Cos systems, the sharpness factor σ (defined in Eq. 5.14) was empirically set
to 0.1.
Table 7.11 reports the impact of the regularisation penalty (defined in Eq. 5.3) on the
CE KL systems. All the stimulated DNNs with penalties from 0.05 to 0.2 outperformed
the DNN baseline. The best performance was achieved with a penalty of 0.05, decreasing
the word error rate by 0.6% absolute. Table 7.12 summarises the performance of
different CE systems; all the systems using activation regularisation outperformed the
DNN baseline, and the KL system again achieved the best performance.
were unable to provide useful information, no enhancement was achieved on top of the
unstimulated system.
This CE stimulated DNN was then used to train the MPE stimulated system.
Table 7.15 reports the comparison of the performance of different MPE systems.
Similar to the CE systems, the MPE stimulated DNN outperformed the unstimulated
MPE baseline. The regularised LHUC on the stimulated system again achieved the best
performance, reducing the WER by up to 5% relative compared with the SI MPE
stimulated system.
For the deep activation mixture model, the network configuration and training followed
a similar procedure to that described in Section 6.4.1. The recognition performance of the
CE SI systems is summarised in Table 7.16. As discussed in Section 6.1, by disabling
the mixture model, the DAMM degrades to a tanh DNN. The DAMM outperformed
the tanh DNN baseline, yielding up to a 4% relative WER reduction. In addition, the
DAMM yielded slightly better performance than the sigmoid DNN system. The SD
performance of the adapted CE DAMM is given in Table 7.17, comparing the impact
of adapting the Gaussian mean vector, unit variance vector and correlation coefficient.
The change of unit variance applied similar effects to activations located on nearby
contour lines, while the move of the mean vector applied opposite effects to activations
on the same contour line, which did not correspond to the similarity of activations on
the contour. Thus, the adaptation of the covariance matrix yielded a more effective
impact than the mean vector. The relative WER reduction is up to 3%.

                        Adapt
System     Mean   Variance   Correlation   Dev03   Eval03
SI          ✗        ✗            ✗         12.3    10.6
SD          ✓        ✗            ✗         12.2    10.6
            ✗        ✓            ✗         12.1    10.5
            ✗        ✓            ✓         12.1    10.5
            ✓        ✓            ✗         12.1    10.4
            ✓        ✓            ✓         12.0    10.4

Table 7.17 Broadcast News: Recognition performance (WER %) of the CE DAMM when
adapting different combinations of the Gaussian mean, unit variance and correlation
coefficient.
The recognition performance of the MPE systems is compared in Table 7.18. The
MPE DAMM yielded similar performance to the sigmoid MPE DNN baseline. Adapting
the mean, variance and correlation coefficient together achieved further gains over the
SI MPE DAMM. The adaptation of the MPE DAMM obtained up to a 3% relative
WER reduction, similar to that of the CE DAMM.
To compare the performance of the three proposed models, a summary of the MPE
systems is shown in Table 7.19. All three types of structured deep neural network
outperformed the SI MPE DNN baseline. Also, by further performing the associated
adaptation schemes on the SI stimulated DNN and DAMM, consistent adaptation
gains could be achieved.
                        Adapt
System           Mean   Variance   Correlation   Dev03   Eval03
DNN (sigmoid)     –        –            –         11.4    10.1
DAMM              ✗        ✗            ✗         11.4    10.0
                  ✓        ✓            ✓         11.1     9.8

Table 7.18 Broadcast News: Recognition performance (WER %) of the MPE DAMM,
with and without adaptation of the Gaussian components.
Table 7.19 Broadcast News: Recognition performance (WER %) of the MPE multi-basis
adaptive neural network, stimulated deep neural network and deep activation mixture
model. Adaptation was performed at the utterance level.
Chapter 8
Conclusion
This thesis investigated structured deep neural networks for automatic speech recog-
nition. Three forms of structured deep neural networks were proposed: multi-basis
adaptive neural network, stimulated deep neural network and deep activation mixture
model. These structured DNNs explicitly introduce special structures into the network
topology, so that specific aspects of the data are modelled explicitly.
Standard DNN models are commonly treated as “black boxes”, in which parameters
are difficult to interpret and group. This makes regularisation and adaptation
challenging. The major contribution of this thesis is that the proposed structured
DNNs induce and impose interpretation on the introduced network structures. These
structured designs can improve regularisation and adaptation for DNN models. For
regularisation, parameters can be separately regularised according to their meanings,
rather than applying a universal, indiscriminate regularisation, such as L2, to all
parameters. For adaptation, parameters can be adapted in groups or partially
adapted according to their functions. This can help achieve robust adaptation when
limited adaptation data are provided. A brief review of the thesis and future work are
presented as follows.
connectivity, while different bases share no connectivity. The outputs
of the different bases are combined via linear interpolation. To perform adaptation
on an MBANN, only a compact set of parameters, i.e. the interpolation weights, needs
to be estimated. Therefore, rapid adaptation scenarios with limited data can be
resolved within this framework. Several extensions to this basic MBANN model
were also investigated. To combine i-vector representation, two combination schemes
were presented. The first scheme appends i-vectors as DNN input features. In
this configuration, the bases are explicitly informed with acoustic attributes, and
the robustness to acoustic variations can be reinforced. The second scheme uses
i-vectors to directly predict the speaker-dependent transform for MBANN. It avoids
the requirement of decoding hypotheses in adaptation, which helps to reduce the
computational cost as well as improve the robustness to hypothesis errors. The target-
dependent interpolation introduces multiple sets of interpolation weights to separately
adapt different DNN targets. The inter-basis connectivity generalises the MBANN
framework with parameters between different bases.
Stimulated deep neural networks were proposed in Chapter 5. This form of struc-
tured neural network relates activation functions in regions of the network to aid
interpretation and visualisation. In the network topology, hidden units are reorganised
to form a grid, and activation functions with similar behaviours can then be grouped
together in this grid space. This goal is obtained by introducing a special form of
regularisation, which is the activation regularisation. The activation regularisation is
designed to encourage the outputs of activation functions to satisfy a target pattern.
By defining appropriate target patterns, different learning, partitioning or grouping
concepts can be imposed and shaped on the network. This design prevents hidden units
from an arbitrary order, which has the potential to improve network regularisation.
Also, based on the restricted ordering of hidden units, smoothness techniques can be
used to improve the adaptation schemes on stimulated DNNs. The LHUC adaptation
approach was discussed as an example to explain how the smoothness method can be
performed. In contrast to multi-basis adaptive neural networks, which partition the
hidden units in a “hard” fashion, stimulated deep neural networks perform the hidden-unit
grouping in a “soft” fashion.

8.2 Future Work
• In this thesis, the structured DNNs are discussed using the feed-forward neu-
ral network architecture. These concepts can be extended to more complex
architectures, such as RNNs and CNNs.
• The structured DNNs were applied to speech recognition in this thesis. However,
these models can be used in other tasks as well. For
example, the target patterns of stimulated DNN can be designed using expertise
other than acoustic knowledge. To use stimulated deep neural networks for
language modelling, part-of-speech target patterns have the potential to improve
the regularisation of neural-network-based language models.
Appendix A
I-vector Estimation
supervector) and the canonical means, which for a particular Gaussian component
$m \in \mathcal{M}$ is given by
$$\boldsymbol{\mu}^{(sm)} = \boldsymbol{\mu}_0^{(m)} + \mathbf{M}^{(m)} \boldsymbol{\lambda}_{\mathrm{iv}}^{(s)} \quad (A.1)$$
where $\mathcal{M}$ is the canonical model to be estimated and $\hat{\mathcal{M}}$ is the “old” model. $\boldsymbol{\lambda}_{\mathrm{iv}}^{(s)}$ are
the i-vectors to be estimated and $\hat{\boldsymbol{\lambda}}_{\mathrm{iv}}^{(s)}$ the “old” i-vectors. $\gamma_t^{(m)}(s)$ is the posterior
probability of Gaussian component $m$ at time $t$, determined using the canonical model
parameters $\hat{\mathcal{M}}$ and the speaker i-vectors $\hat{\boldsymbol{\lambda}}_{\mathrm{iv}}^{(s)}$.
The training procedure uses the expectation-maximisation algorithm to estimate
the parameters. This is the standard CAT model-based training procedure (Gales, 2000)
or ML eigen-decomposition (Kuhn et al., 1998). By differentiating Eq. A.2 with
respect to the i-vector of a particular speaker and equating to zero, the i-vector for
speaker $s$ may be shown to be $\boldsymbol{\lambda}_{\mathrm{iv}}^{(s)} = \mathbf{G}_{\lambda_{\mathrm{iv}}}^{(s)-1} \mathbf{k}_{\lambda_{\mathrm{iv}}}^{(s)}$,
where $\mathbf{G}_{\lambda_{\mathrm{iv}}}^{(s)}$ and $\mathbf{k}_{\lambda_{\mathrm{iv}}}^{(s)}$ are given by
$$\mathbf{G}_{\lambda_{\mathrm{iv}}}^{(s)} = \sum_{m,t} \gamma_t^{(m)}(s)\, \mathbf{M}^{(m)\mathsf{T}} \boldsymbol{\Sigma}^{(m)-1} \mathbf{M}^{(m)}, \quad (A.4)$$
To estimate the factor matrix $\mathbf{M}^{(m)}$, it suffices to differentiate Eq. A.2 with
respect to $\mathbf{M}^{(m)}$ and equate to zero. Collecting the corresponding sufficient statistics gives
$$\mathbf{M}^{(m)} = \mathbf{K}_M^{(m)} \mathbf{G}_M^{(m)-1}. \quad (A.8)$$
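A minimal sketch of the per-speaker i-vector estimate is given below, assuming the closed-form solution λ = G⁻¹k with G as in Eq. A.4 and the usual first-order statistic for k (the exact k used here is not reproduced in this excerpt); the statistics are synthetic.

```python
import numpy as np

def estimate_ivector(gammas, feats, mu0, M, Sigma_inv):
    """Per-speaker i-vector via the closed form lambda = G^{-1} k.
    G follows Eq. A.4; the form of k (first-order statistics against the
    component means mu0) is assumed. gammas: (T, C) posteriors,
    feats: (T, D), mu0: (C, D), M: list of (D, V), Sigma_inv: list of (D, D)."""
    ivec_dim = M[0].shape[1]
    G = np.zeros((ivec_dim, ivec_dim))
    k = np.zeros(ivec_dim)
    for m in range(gammas.shape[1]):
        occ = gammas[:, m].sum()                     # sum_t gamma_t^(m)(s)
        G += occ * M[m].T @ Sigma_inv[m] @ M[m]      # Eq. A.4
        resid = gammas[:, m] @ (feats - mu0[m])      # sum_t gamma_t (o_t - mu0)
        k += M[m].T @ Sigma_inv[m] @ resid           # assumed k statistic
    return np.linalg.solve(G, k)

# Toy dimensions: 50 frames, 4 components, 13-dim features, 5-dim i-vector.
rng = np.random.default_rng(0)
T, C, D, V = 50, 4, 13, 5
gam = rng.dirichlet(np.ones(C), size=T)
print(estimate_ivector(gam, rng.normal(size=(T, D)),
                       rng.normal(size=(C, D)),
                       [rng.normal(size=(D, V)) for _ in range(C)],
                       [np.eye(D) for _ in range(C)]).shape)
```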
Appendix B
The input to the softmax function, $\mathbf{z}_t^{(Ls)}$, can then be rewritten as
$$\mathbf{z}_t^{(Ls)} = \mathbf{W}^{(L)\mathsf{T}} \mathbf{B}_t \boldsymbol{\lambda}_{\mathrm{mb}}^{(s)} + \mathbf{b}^{(L)}. \quad (B.2)$$
Defining the function $f(\mathbf{x}_t, j)$, which can be viewed as a feature extractor on the raw
features $\mathbf{x}_t$,
$$f(\mathbf{x}_t, j) = \mathbf{B}_t^{\mathsf{T}} \mathbf{w}_j^{(L)}. \quad (B.3)$$
It is unchanged during the optimisation of $\boldsymbol{\lambda}_{\mathrm{mb}}^{(s)}$. Using $f(\mathbf{x}_t, j)$, the CE criterion can
then be rewritten as
$$\mathcal{L}(\boldsymbol{\lambda}_{\mathrm{mb}}^{(s)}; \mathcal{D}) = \sum_t \left( \log \sum_j \exp\left( \boldsymbol{\lambda}_{\mathrm{mb}}^{(s)\mathsf{T}} f(\mathbf{x}_t, j) + b_j^{(L)} \right) - \left( \boldsymbol{\lambda}_{\mathrm{mb}}^{(s)\mathsf{T}} f(\mathbf{x}_t, y_t) + b_{y_t}^{(L)} \right) \right) \quad (B.4)$$
This form is identical to that of a log-linear model (Nelder and Baker, 1972), which is a
convex model. Therefore, the convexity of the MBANN interpolation-weight optimisation
holds.
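To make the log-linear form of Eq. B.4 concrete, the sketch below evaluates the criterion as a function of the interpolation weights using pre-computed features f(x_t, j); the dimensions and data are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def mbann_ce(lam, feats, bias, labels):
    """CE criterion of Eq. B.4 as a function of the interpolation weights.
    feats: (T, J, K) pre-computed f(x_t, j); bias: (J,); labels: (T,)."""
    logits = feats @ lam + bias                 # (T, J): lam^T f(x_t, j) + b_j
    lse = logsumexp(logits, axis=1)             # log sum_j exp(.)
    correct = logits[np.arange(len(labels)), labels]
    return float(np.sum(lse - correct))         # convex in lam (log-linear model)

rng = np.random.default_rng(0)
T, J, K = 20, 10, 2                             # frames, targets, bases
feats = rng.normal(size=(T, J, K))
bias = rng.normal(size=J)
labels = rng.integers(0, J, size=T)
print(mbann_ce(np.array([0.5, 0.5]), feats, bias, labels))
```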
B.2 Convexity in MBANN with Target-dependent Interpolation Weights

A similar approach to that of the previous section can be followed to define the feature
extractor $\tilde{f}(\mathbf{x}_t, j)$,
$$\tilde{f}(\mathbf{x}_t, j) = \tilde{\mathbf{B}}_t^{\mathsf{T}} \mathbf{w}_j^{(L)}. \quad (B.7)$$
References

Ossama Abdel-Hamid and Hui Jiang. Fast speaker adaptation of hybrid NN/HMM
model for speech recognition based on discriminative learning of speaker code.
In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International
Conference on, pages 7942–7946. IEEE, 2013a.
Ossama Abdel-Hamid and Hui Jiang. Rapid and effective speaker adaptation of convo-
lutional neural network based models for speech recognition. In INTERSPEECH,
pages 1248–1252, 2013b.
Victor Abrash, Horacio Franco, Ananth Sankar, and Michael Cohen. Connectionist
speaker normalization and adaptation. In in Eurospeech, pages 2183–2186, 1995.
Bishnu S Atal. Effectiveness of linear prediction characteristics of the speech wave
for automatic speaker identification and verification. the Journal of the Acoustical
Society of America, 55(6):1304–1312, 1974.
Bishnu S Atal and Suzanne L Hanauer. Speech analysis and synthesis by linear
prediction of the speech wave. The journal of the acoustical society of America, 50
(2B):637–655, 1971.
Xavier L Aubert. An overview of decoding techniques for large vocabulary continuous
speech recognition. Computer Speech & Language, 16(1):89–114, 2002.
Lalit R Bahl, Peter F Brown, Peter V de Souza, and Robert L Mercer. Maximum mutual
information estimation of hidden Markov model parameters for speech recognition.
Acoustics, Speech, and Signal Processing, IEEE International Conference on
ICASSP'86, 1986.
James Baker. The DRAGON system–an overview. IEEE Transactions on Acoustics,
Speech, and Signal Processing, 23(1):24–29, 1975.
Leonard E Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization
technique occurring in the statistical analysis of probabilistic functions of Markov
chains. The annals of mathematical statistics, 41(1):164–171, 1970.
Jerome R Bellegarda. Statistical language model adaptation: review and perspectives.
Speech communication, 42(1):93–108, 2004.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies
with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166,
1994.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural
probabilistic language model. Journal of machine learning research, 3(Feb):1137–
1155, 2003.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise
training of deep networks. In Advances in neural information processing systems,
pages 153–160, 2007.
Maximilian Bisani and Hermann Ney. Joint-sequence models for grapheme-to-phoneme
conversion. Speech communication, 50(5):434–451, 2008.
Christopher M Bishop. Mixture density networks. 1994.
Christopher M Bishop. Neural networks for pattern recognition. Oxford university
press, 1995.
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
Herve A Bourlard and Nelson Morgan. Connectionist speech recognition: a hybrid
approach, volume 247. Springer, 1994.
John S Bridle. Probabilistic interpretation of feedforward classification network outputs,
with relationships to statistical pattern recognition. In Neurocomputing, pages 227–
236. Springer, 1990.
Peter F Brown. The acoustic-modeling problem in automatic speech recognition.
Technical report, Carnegie Mellon University, Pittsburgh, PA, Department of Computer
Science, 1987.
William Byrne. Minimum Bayes risk estimation and decoding in large vocabulary con-
tinuous speech recognition. IEICE TRANSACTIONS on Information and Systems,
89(3):900–907, 2006.
Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell.
arXiv preprint arXiv:1508.01211, 2015.
Wen-Hsiung Chen, CH Smith, and SC Fralick. A fast computational algorithm for the
discrete cosine transform. IEEE Transactions on communications, 25(9):1004–1009,
1977.
X. Chen, X. Liu, Y. Qian, M. J. F Gales, and P. C. Woodland. CUED-RNNLM – an
open-source toolkit for efficient training and evaluation of recurrent neural network
language models. In ICASSP, 2016.
David Chiang, Kevin Knight, and Wei Wang. 11,001 new features for statistical
machine translation. In Proceedings of human language technologies: The 2009
annual conference of the north american chapter of the association for computational
linguistics, pages 218–226. Association for Computational Linguistics, 2009.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations
using RNN encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua
Bengio. Attention-based models for speech recognition. In Advances in Neural
Information Processing Systems, pages 577–585, 2015.
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
Ronan Collobert. Large scale machine learning. 2004.
J. Cui, J. Mamou, B. Kingsbury, and B. Ramabhadran. Automatic keyword selection
for keyword search development and tuning. In ICASSP, 2014.
Jia Cui, Brian Kingsbury, Bhuvana Ramabhadran, Abhinav Sethy, Kartik Audhkhasi,
Xiaodong Cui, Ellen Kislal, Lidia Mangu, Markus Nussbaum-Thom, Michael Picheny,
et al. Multilingual representations for low resource speech recognition and keyword
search. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE
Workshop on, pages 259–266. IEEE, 2015a.
Xiaodong Cui, Vaibhava Goel, and Brian Kingsbury. Data augmentation for deep
neural network acoustic modeling. IEEE/ACM Transactions on Audio, Speech and
Language Processing (TASLP), 23(9):1469–1477, 2015b.
George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained
deep neural networks for large-vocabulary speech recognition. Audio, Speech, and
Language Processing, IEEE Transactions on, 20(1):30–42, 2012.
George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep neural
networks for lvcsr using rectified linear units and dropout. In Acoustics, Speech
and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages
8609–8613. IEEE, 2013.
KH Davis, R Biddulph, and Stephen Balashek. Automatic recognition of spoken digits.
The Journal of the Acoustical Society of America, 24(6):637–642, 1952.
Steven Davis and Paul Mermelstein. Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. IEEE transactions
on acoustics, speech, and signal processing, 28(4):357–366, 1980.
Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet.
Front end factor analysis for speaker verification. IEEE Transactions on Audio,
Speech and Language Processing, 2010.
Najim Dehak, Patrick J Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet.
Front-end factor analysis for speaker verification. IEEE Transactions on Audio,
Speech, and Language Processing, 19(4):788–798, 2011.
Marc Delcroix, Keisuke Kinoshita, Takaaki Hori, and Tomohiro Nakatani. Context
adaptive deep neural networks for fast acoustic model adaptation. In 2015 IEEE
International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015,
South Brisbane, Queensland, Australia, April 19-24, 2015, pages 4535–4539, 2015.
Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, Takuya Yoshioka, Dung T Tran,
and Tomohiro Nakatani. Context adaptive neural network for rapid adaptation of
deep cnn based acoustic models. In INTERSPEECH, pages 1573–1577, 2016a.
Marc Delcroix, Keisuke Kinoshita, Chengzhu Yu, Atsunori Ogawa, Takuya Yoshioka,
and Tomohiro Nakatani. Context adaptive deep neural networks for fast acoustic
model adaptation in noisy conditions. In Acoustics, Speech and Signal Processing
(ICASSP), 2016 IEEE International Conference on, pages 5270–5274. IEEE, 2016b.
Li Deng, Jinyu Li, Jui-Ting Huang, Kaisheng Yao, Dong Yu, Frank Seide, Michael
Seltzer, Geoff Zweig, Xiaodong He, Jason Williams, et al. Recent advances in deep
learning for speech research at Microsoft. In Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on, pages 8604–8608. IEEE, 2013.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research, 12
(Jul):2121–2159, 2011.
Stéphane Dupont and Leila Cheboub. Fast speaker adaptation of artificial neural net-
works for automatic speech recognition. In Acoustics, Speech, and Signal Processing,
2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on, volume 3,
pages 1795–1798. IEEE, 2000.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing
higher-layer features of a deep network. University of Montreal, 1341:3, 2009.
Gunnar Evermann and PC Woodland. Posterior probability decoding, confidence esti-
mation and system combination. In Proc. Speech Transcription Workshop, volume 27,
page 78. Baltimore, 2000.
Xue Feng, Yaodong Zhang, and James Glass. Speech feature denoising and dereverber-
ation via deep autoencoders for noisy reverberant speech recognition. In Acoustics,
Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on,
pages 1759–1763. IEEE, 2014.
Xue Feng, Brigitte Richardson, Scott Amman, and James Glass. On using heterogeneous
data for vehicle-based speech recognition: a DNN-based approach. In Acoustics,
Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on,
pages 4385–4389. IEEE, 2015.
J. G. Fiscus et al. Results of the 2006 spoken term detection evaluation. In Proc. ACM
SIGIR Workshop on Searching Spontaneous Conversational Speech, 2007.
George Forman. An extensive empirical study of feature selection metrics for text
classification. Journal of machine learning research, 3(Mar):1289–1305, 2003.
References 160
G David Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
Sadaoki Furui. Speaker-independent isolated word recognition based on emphasized
spectral dynamics. In Acoustics, Speech, and Signal Processing, IEEE International
Conference on ICASSP’86., volume 11, pages 1991–1994. IEEE, 1986.
M. J. F. Gales, K. M. Knill, and A. Ragni. Unicode-based graphemic systems for
limited resource languages. In ICASSP, 2015a.
Mark Gales et al. Generative kernels and score-spaces for classification of speech.
http://mi.eng.cam.ac.uk/~mjfg/Kernel/index.html, 2013.
Mark Gales and Steve Young. The application of hidden Markov models in speech
recognition. Foundations and trends in signal processing, 1(3):195–304, 2008.
Mark JF Gales. Maximum likelihood linear transformations for HMM-based speech
recognition. Computer speech & language, 12(2):75–98, 1998.
Mark JF Gales. Cluster adaptive training of hidden Markov models. IEEE transactions
on speech and audio processing, 8(4):417–428, 2000.
Mark JF Gales, Kate M Knill, and Anton Ragni. Unicode-based graphemic systems for
limited resource languages. In Acoustics, Speech and Signal Processing (ICASSP),
2015 IEEE International Conference on, pages 5186–5190. IEEE, 2015b.
J-L Gauvain and Chin-Hui Lee. Maximum a posteriori estimation for multivariate
Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and
Audio Processing, 2(2):291–298, 1994.
Roberto Gemello, Franco Mana, Stefano Scanzio, Pietro Laface, and Renato De Mori.
Linear hidden transformations for adaptation of hybrid ANN/HMM models. Speech
Communication, 49(10):827–835, 2007.
Felix A Gers, Nicol N Schraudolph, and Jürgen Schmidhuber. Learning precise timing
with lstm recurrent networks. Journal of machine learning research, 3(Aug):115–143,
2002.
Daniel Gildea and Thomas Hofmann. Topic-based language models using em. In Sixth
European Conference on Speech Communication and Technology, 1999.
Ondřej Glembek, Lukáš Burget, Pavel Matějka, Martin Karafiát, and Patrick Kenny.
Simplification and optimization of i-vector extraction. In Acoustics, Speech and Signal
Processing (ICASSP), 2011 IEEE International Conference on, pages 4516–4519.
IEEE, 2011.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feed-
forward neural networks. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics, pages 249–256, 2010.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural
networks. In Proceedings of the Fourteenth International Conference on Artificial
Intelligence and Statistics, pages 315–323, 2011.
John F Hemdal and George W Hughes. A feature based computer recognition program
for the modeling of vowel perception. Models for the Perception of Speech and Visual
Form, Wathen-Dunn, W. Ed. MIT Press, Cambridge, MA, 1967.
Hynek Hermansky. Perceptual linear predictive (PLP) analysis of speech. the Journal
of the Acoustical Society of America, 87(4):1738–1752, 1990.
Hynek Hermansky, Daniel PW Ellis, and Sangita Sharma. Tandem connectionist
feature extraction for conventional HMM systems. In Acoustics, Speech, and Signal
Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on,
volume 3, pages 1635–1638. IEEE, 2000.
Geoffrey Hinton et al. Deep neural networks for acoustic modeling in speech recognition:
The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):
82–97, 2012.
Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence.
Neural computation, 14(8):1771–1800, 2002.
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for
deep belief nets. Neural computation, 18(7):1527–1554, 2006.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997.
Zhen Huang, Jinyu Li, Sabato Marco Siniscalchi, I-Fan Chen, Chao Weng, and Chin-
Hui Lee. Feature space maximum a posteriori linear regression for adaptation of
deep neural networks. In Fifteenth Annual Conference of the International Speech
Communication Association, 2014.
Zhen Huang, Sabato Marco Siniscalchi, I-Fan Chen, Jiadong Wu, and Chin-Hui Lee.
Maximum a posteriori adaptation of network parameters in deep models. arXiv
preprint arXiv:1503.02108, 2015.
Zhen Huang, Sabato Marco Siniscalchi, I-Fan Chen, and Chin-Hui Lee. Towards a
direct bayesian adaptation framework for deep models. In Signal and Information
Processing Association Annual Summit and Conference (APSIPA), 2016 Asia-Pacific,
pages 1–4. IEEE, 2016.
David H Hubel and Torsten N Wiesel. Binocular interaction in striate cortex of kittens
reared with artificial squint. Journal of neurophysiology, 28(6):1041–1059, 1965.
Mei-Yuh Hwang and Xuedong Huang. Shared-distribution hidden Markov models
for speech recognition. IEEE Transactions on Speech and Audio Processing, 1(4):
414–420, 1993.
Takaaki Ishii, Hiroki Komiyama, Takahiro Shinozaki, Yasuo Horiuchi, and Shingo
Kuroiwa. Reverberant speech recognition based on denoising autoencoder. In
Interspeech, pages 3512–3516, 2013.
Fumitada Itakura. Minimum prediction residual principle applied to speech recognition.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, 1975.
Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmen-
tation for speech recognition. In INTERSPEECH, pages 3586–3589, 2015.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
Roland Kuhn, Patrick Nguyen, Jean-Claude Junqua, Lloyd Goldwasser, Nancy Niedziel-
ski, Steven Fincke, Ken Field, and Matteo Contolini. Eigenvoices for speaker
adaptation. In International Conference on Spoken Language Processing, 1998.
Nagendra Kumar and Andreas G Andreou. Heteroscedastic discriminant analysis and
reduced rank HMMs for improved speech recognition. Speech communication, 26(4):
283–297, 1998.
Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard,
Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a
back-propagation network. In Advances in neural information processing systems,
pages 396–404, 1990.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436–444, 2015.
Kai-Fu Lee. On large-vocabulary speaker-independent continuous speech recognition.
Speech communication, 7(4):375–379, 1988.
Li Lee and Richard C Rose. Speaker normalization using efficient frequency warping
procedures. In Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Confer-
ence Proceedings., 1996 IEEE International Conference on, volume 1, pages 353–356.
IEEE, 1996.
Christopher J Leggetter and Philip C Woodland. Maximum likelihood linear regression
for speaker adaptation of continuous density hidden Markov models. Computer
Speech & Language, 9(2):171–185, 1995.
Bo Li and Khe Chai Sim. Comparison of discriminative input and output transforma-
tions for speaker adaptation in the hybrid NN/HMM systems. 2010.
Xiao Li and Jeff Bilmes. Regularized adaptation of discriminative classifiers. In
Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006
IEEE International Conference on, volume 1, pages I–I. IEEE, 2006.
Hank Liao. Speaker adaptation of context dependent deep neural networks. In Acoustics,
Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on,
pages 7947–7951. IEEE, 2013.
Xunying Liu, Yongqiang Wang, Xie Chen, Mark JF Gales, and Philip C Woodland.
Efficient lattice rescoring using recurrent neural network language models. In Acous-
tics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference
on, pages 4908–4912. IEEE, 2014.
Xunying Liu, Xie Chen, Mark JF Gales, and Philip C Woodland. Paraphrastic
recurrent neural network language models. In Acoustics, Speech and Signal Processing
(ICASSP), 2015 IEEE International Conference on, pages 5406–5410. IEEE, 2015.
Liang Lu, Xingxing Zhang, Kyunghyun Cho, and Steve Renals. A study of the
recurrent neural network encoder-decoder for large vocabulary speech recognition.
In INTERSPEECH, pages 3249–3253, 2015.
Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve
neural network acoustic models. In Proc. ICML, volume 30, 2013.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal
of Machine Learning Research, 9(Nov):2579–2605, 2008.
Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations
by inverting them. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 5188–5196, 2015.
Lidia Mangu, Eric Brill, and Andreas Stolcke. Finding consensus in speech recognition:
word error minimization and other applications of confusion networks. Computer
Speech & Language, 14(4):373–400, 2000.
James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th
International Conference on Machine Learning (ICML-10), pages 735–742, 2010.
G. Mendels, E. Cooper, V. Soto, J. Hirschberg, M. Gales, K. Knill, A. Ragni, and
H. Wang. Improving speech recognition and keyword search for low resource languages
using web data. In Interspeech, 2015.
Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur.
Recurrent neural network based language model. In Interspeech, volume 2, page 3,
2010.
Mehryar Mohri, Fernando Pereira, and Michael Riley. Weighted finite-state transducers
in speech recognition. Computer Speech & Language, 16(1):69–88, 2002.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th international conference on machine learning
(ICML-10), pages 807–814, 2010.
John Ashworth Nelder and R Jacob Baker. Generalized linear models. Wiley Online
Library, 1972.
Yurii Nesterov. A method of solving a convex programming problem with convergence
rate O(1/k²). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
Joao Neto, Luís Almeida, Mike Hochberg, Ciro Martins, Luís Nunes, Steve Renals,
and Tony Robinson. Speaker-adaptation for hybrid HMM-ANN continuous speech
recognition system. 1995.
Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic depen-
dences in stochastic language modelling. Computer Speech & Language, 8(1):1–38,
1994.
Jiquan Ngiam, Zhenghao Chen, Daniel Chia, Pang W Koh, Quoc V Le, and Andrew Y
Ng. Tiled convolutional neural networks. In Advances in neural information processing
systems, pages 1279–1287, 2010.
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled:
High confidence predictions for unrecognizable images. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015.
JJ Odell, V Valtchev, Philip C Woodland, and Steve J Young. A one pass decoder
design for large vocabulary recognition. In Proceedings of the workshop on Human
Language Technology, pages 405–410. Association for Computational Linguistics,
1994.
Mark JL Orr et al. Introduction to radial basis function networks, 1996.
Stefan Ortmanns, Hermann Ney, and Xavier Aubert. A word graph algorithm for
large vocabulary continuous speech recognition. Computer Speech & Language, 11
(1):43–72, 1997.
David S. Pallett, Jonathan G. Fiscus, Alvin Martin, and Mark A. Przybocki. 1997
broadcast news benchmark test results: English and non-English. In Proc. 1998
DARPA Broadcast News Transcription and Understanding Workshop, pages 5–11,
1998.
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to
construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.
Douglas B Paul and Janet M Baker. The design for the Wall Street Journal-based
CSR corpus. In Proceedings of the workshop on Speech and Natural Language, pages
357–362. Association for Computational Linguistics, 1992.
David Pearce and J Picone. Aurora working group: DSR front end LVCSR evaluation
AU/384/02. Inst. for Signal & Inform. Process., Mississippi State Univ., Tech. Rep,
2002.
Boris T Polyak. Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
Daniel Povey. Discriminative training for large vocabulary speech recognition. PhD
thesis, Cambridge University, 2004.
Daniel Povey and Philip C Woodland. Minimum phone error and i-smoothing for im-
proved discriminative training. In Acoustics, Speech, and Signal Processing (ICASSP),
2002 IEEE International Conference on, volume 1, pages I–105. IEEE, 2002.
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek,
Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz,
et al. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic
speech recognition and understanding, number EPFL-CONF-192584. IEEE Signal
Processing Society, 2011.
Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar,
Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purely sequence-trained neural
networks for ASR based on lattice-free MMI. In INTERSPEECH, pages 2751–2755,
2016.
Lawrence R Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
Lawrence R Rabiner and Bernard Gold. Theory and application of digital signal
processing. Prentice-Hall, Englewood Cliffs, NJ, 1975.
A Ragni, C Wu, MJF Gales, J Vasilakes, and KM Knill. Stimulated training for
automatic speech recognition and keyword search in limited resource conditions.
In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International
Conference on, pages 4830–4834. IEEE, 2017.
Shakti P Rath, Daniel Povey, Karel Veselỳ, and Jan Cernockỳ. Improved feature
processing for deep neural networks. In Interspeech, pages 109–113, 2013.
Steve Renals, Nelson Morgan, Herve Bourlard, Michael Cohen, Horacio Franco, Chuck
Wooters, and Phil Kohn. Connectionist speech recognition: Status and prospects.
Technical report, Technical Report TR-91-070, University of California at Berkeley,
1991.
Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. Speaker verification
using adapted Gaussian mixture models. Digital Signal Processing, 10(1-3):19–41,
2000.
Korin Richmond. A trajectory mixture density network for the acoustic-articulatory
inversion mapping. In Interspeech, 2006.
Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster back-
propagation learning: The RPROP algorithm. In Neural Networks, 1993., IEEE
International Conference on, pages 586–591. IEEE, 1993.
Frank Rosenblatt. The perceptron: A probabilistic model for information storage and
organization in the brain. Psychological review, 65(6):386, 1958.
David E. Rumelhart, James L. McClelland, and CORPORATE PDP Research Group,
editors. Parallel Distributed Processing: Explorations in the Microstructure of
Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA, USA, 1986. ISBN
0-262-68053-X.
David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representa-
tions by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals.
Learning the speech front-end with raw waveform CLDNNs. In Sixteenth Annual
Conference of the International Speech Communication Association, 2015.
George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny. Speaker adaptation
of neural network acoustic models using i-vectors. In Automatic Speech Recognition
and Understanding (ASRU), 2013 IEEE Workshop on, pages 55–59. IEEE, 2013.
Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11):2673–2681, 1997.
Frank Seide, Gang Li, Xie Chen, and Dong Yu. Feature engineering in context-
dependent deep neural networks for conversational speech transcription. In Automatic
Speech Recognition and Understanding (ASRU), 2011 IEEE Workshop on, pages
24–29. IEEE, 2011a.
Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-
dependent deep neural networks. In Twelfth Annual Conference of the International
Speech Communication Association, 2011b.
Oliver G Selfridge. Pandemonium: a paradigm for learning in mechanisation of thought
processes. 1958.
Andrew Senior and Ignacio Lopez-Moreno. Improving DNN speaker independence with
i-vector inputs. In Proc. of ICASSP, pages 225–229, 2014a.
Andrew Senior and Ignacio Lopez-Moreno. Improving DNN speaker independence
with i-vector inputs. In Acoustics, Speech and Signal Processing (ICASSP), 2014
IEEE International Conference on, pages 225–229. IEEE, 2014b.
Koichi Shinoda and C-H Lee. A structural Bayes approach to speaker adaptation.
IEEE Transactions on Speech and Audio Processing, 9(3):276–287, 2001.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional
networks: Visualising image classification models and saliency maps. arXiv preprint
arXiv:1312.6034, 2013.
Sabato Marco Siniscalchi, Jinyu Li, and Chin-Hui Lee. Hermitian based hidden
activation functions for adaptation of hybrid HMM/ANN models. In Thirteenth
Annual Conference of the International Speech Communication Association, 2012.
Sabato Marco Siniscalchi, Jinyu Li, and Chin-Hui Lee. Hermitian polynomial for
speaker adaptation of connectionist speech recognition systems. IEEE Transactions
on Audio, Speech, and Language Processing, 21(10):2152–2161, 2013.
Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. Parsing natural scenes
and natural language with recursive neural networks. In Proceedings of the 28th
international conference on machine learning (ICML-11), pages 129–136, 2011.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15(1):1929–1958, 2014.
J Stadermann and G Rigoll. Two-stage speaker adaptation of hybrid tied-posterior
acoustic models. In Acoustics, Speech, and Signal Processing, 2005. Proceed-
ings.(ICASSP’05). IEEE International Conference on, volume 1, pages 977–980.
IEEE, 2005.
Hang Su, Gang Li, Dong Yu, and Frank Seide. Error back propagation for sequence
training of context-dependent deep networks for conversational speech transcription.
In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International
Conference on, pages 6664–6668. IEEE, 2013.
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for
language modeling. In Thirteenth Annual Conference of the International Speech
Communication Association, 2012.
Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. From feedforward to recurrent
LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio,
Speech and Language Processing (TASLP), 23(3):517–529, 2015.
Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance
of initialization and momentum in deep learning. In International conference on
machine learning, pages 1139–1147, 2013.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in neural information processing systems, pages 3104–3112,
2014.
Pawel Swietojanski and Steve Renals. SAT-LHUC: Speaker adaptive training for
learning hidden unit contributions. In Acoustics, Speech and Signal Processing
(ICASSP), 2016 IEEE International Conference on, pages 5010–5014. IEEE, 2016.
Pawel Swietojanski and Steve Renals. Learning hidden unit contributions for unsuper-
vised speaker adaptation of neural network acoustic models. In Spoken Language
Technology Workshop (SLT), 2014 IEEE, pages 171–176. IEEE, 2014.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 1–9, 2015.
S. Tan, K. C. Sim, and M. Gales. Improving the interpretability of deep neural networks
with stimulated learning. In Automatic Speech Recognition and Understanding
(ASRU), 2015 IEEE Workshop on, pages 617–623, 2015a.
Tian Tan, Yanmin Qian, Maofan Yin, Yimeng Zhuang, and Kai Yu. Cluster adap-
tive training for deep neural network. In Acoustics, Speech and Signal Processing
(ICASSP), 2015 IEEE International Conference on, pages 4325–4329. IEEE, 2015b.
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
SE Tranter, MJF Gales, R Sinha, S Umesh, and PC Woodland. The development
of the Cambridge University RT-04 diarisation system. In Proc. Fall 2004 Rich
Transcription Workshop (RT-04), 2004.
Jan Trmal, Jan Zelinka, and Ludek Müller. Adaptation of a feedforward artificial
neural network using a linear transform. In TSD’10, pages 423–430, 2010.
Zoltán Tüske, Pavel Golik, Ralf Schlüter, and Hermann Ney. Acoustic modeling
with deep neural networks using raw time signal for LVCSR. In Fifteenth Annual
Conference of the International Speech Communication Association, 2014.
Zoltán Tüske, David Nolden, Ralf Schlüter, and Hermann Ney. Multilingual MRASTA
features for low-resource keyword search and speech recognition systems. In Acoustics,
Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on,
pages 7854–7858. IEEE, 2014.
Ehsan Variani, Erik McDermott, and Georg Heigold. A Gaussian mixture model
layer jointly optimized with discriminative features within a deep neural network
architecture. In 2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 4270–4274. IEEE, 2015.
Olli Viikki and Kari Laurila. Cepstral domain segmental feature vector normalization
for noise robust speech recognition. Speech Communication, 25(1):133–147, 1998.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Ex-
tracting and composing robust features with denoising autoencoders. In Proceedings
of the 25th international conference on Machine learning, pages 1096–1103. ACM,
2008.
Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J
Lang. Phoneme recognition using time-delay neural networks. IEEE transactions on
acoustics, speech, and signal processing, 37(3):328–339, 1989.
H. Wang, A. Ragni, M. J. F. Gales, K. M. Knill, P. C. Woodland, and C. Zhang.
Joint decoding of tandem and hybrid systems for improved keyword spotting on low
resource languages. In Interspeech, 2015a.
Haipeng Wang, Anton Ragni, Mark JF Gales, Kate M Knill, Philip C Woodland,
and Chao Zhang. Joint decoding of tandem and hybrid systems for improved
keyword spotting on low resource languages. In Sixteenth Annual Conference of the
International Speech Communication Association, 2015b.
Ronald J Williams and David Zipser. A learning algorithm for continually running
fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
PC Woodland. Weight limiting, weight quantisation and generalisation in multi-layer
perceptrons. In Artificial Neural Networks, 1989., First IEE International Conference
on (Conf. Publ. No. 313), pages 297–300. IET, 1989.
Philip C Woodland, Chris J Leggetter, JJ Odell, Valtcho Valtchev, and Steve J Young.
The 1994 HTK large vocabulary speech recognition system. In Acoustics, Speech, and
Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1,
pages 73–76. IEEE, 1995.
Chunyang Wu and Mark JF Gales. Multi-basis adaptive neural network for rapid adap-
tation in speech recognition. In Acoustics, Speech and Signal Processing (ICASSP),
2015 IEEE International Conference on, pages 4315–4319. IEEE, 2015.
Chunyang Wu and Mark JF Gales. Deep activation mixture model for speech recognition.
Proc. Interspeech 2017, pages 1611–1615, 2017.
Chunyang Wu, Penny Karanasou, and Mark JF Gales. Combining i-vector represen-
tation and structured neural networks for rapid adaptation. In Acoustics, Speech
and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages
5000–5004. IEEE, 2016a.
Chunyang Wu, Penny Karanasou, Mark JF Gales, and Khe Chai Sim. Stimulated deep
neural network for speech recognition. In Proc. Interspeech, pages 400–404, 2016b.
Chunyang Wu, Mark Gales, Anton Ragni, Penny Karanasou, and Khe Chai Sim. Im-
proving interpretation and regularisation in deep learning. Submitted to IEEE/ACM
Transactions on Audio, Speech and Language Processing (TASLP), 2017.
Yeming Xiao, Zhen Zhang, Shang Cai, Jielin Pan, and Yonghong Yan. An initial
attempt on task-specific adaptation for deep neural network-based large vocabulary
continuous speech recognition. In INTERSPEECH'12, 2012.
Wayne Xiong, Jasha Droppo, Xuedong Huang, Frank Seide, Mike Seltzer, Andreas
Stolcke, Dong Yu, and Geoffrey Zweig. Achieving human parity in conversational
speech recognition. arXiv preprint arXiv:1610.05256, 2016.
Wei Xu. Towards optimal one pass large scale learning with averaged stochastic
gradient descent. arXiv preprint arXiv:1107.2490, 2011.
Shaofei Xue, Ossama Abdel-Hamid, Hui Jiang, and Lirong Dai. Direct adaptation of
hybrid DNN/HMM model for fast speaker adaptation in LVCSR based on speaker
code. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International
Conference on, pages 6339–6343. IEEE, 2014.
Kaisheng Yao, Dong Yu, Frank Seide, Hang Su, Li Deng, and Yifan Gong. Adaptation
of context-dependent deep neural networks for automatic speech recognition. In
SLT, pages 366–369, 2012.
Takuya Yoshioka, Anton Ragni, and Mark JF Gales. Investigation of unsupervised
adaptation of DNN acoustic models with filter bank input. In Acoustics, Speech
and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages
6344–6348. IEEE, 2014.
Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understand-
ing neural networks through deep visualization. arXiv preprint arXiv:1506.06579,
2015.
Steve Young, Gunnar Evermann, Mark Gales, Thomas Hain, Dan Kershaw, Xunying A
Liu, Gareth Moore, Julian Odell, Dave Ollason, Dan Povey, Anton Ragni, Valtcho
Valtchev, Phil Woodland, and Chao Zhang. The HTK book (for HTK version 3.5).
2015.
Steve J Young. The use of state tying in continuous speech recognition. In Proc. of
Eurospeech’93, 1993.
Steve J Young, Julian J Odell, and Philip C Woodland. Tree-based state tying for high
accuracy acoustic modelling. In Proceedings of the workshop on Human Language
Technology, pages 307–312. Association for Computational Linguistics, 1994.
Dong Yu, Kaisheng Yao, Hang Su, Gang Li, and Frank Seide. KL-divergence regularized
deep neural network adaptation for improved large vocabulary speech recognition.
In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International
Conference on, pages 7893–7897. IEEE, 2013.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional
networks. arXiv preprint arXiv:1311.2901, 2013.
Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional
networks. In European conference on computer vision, pages 818–833. Springer, 2014.
Heiga Zen and Andrew Senior. Deep mixture density networks for acoustic modeling in
statistical parametric speech synthesis. In 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 3844–3848. IEEE, 2014.
C Zhang and PC Woodland. Context independent discriminative pre-training. Unpub-
lished work, 2015a.
Chao Zhang and Philip C Woodland. Parameterised sigmoid and ReLU hidden
activation functions for DNN acoustic modelling. In Sixteenth Annual Conference of
the International Speech Communication Association, 2015b.
Shiliang Zhang, Hui Jiang, and Lirong Dai. Hybrid orthogonal projection and estimation
(HOPE): A new framework to learn neural networks. Journal of Machine Learning
Research, 17(37):1–33, 2016a.
Yu Zhang, Guoguo Chen, Dong Yu, Kaisheng Yao, Sanjeev Khudanpur, and James
Glass. Highway long short-term memory RNNs for distant speech recognition. In Acous-
tics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference
on, pages 5755–5759. IEEE, 2016b.
Jing Zheng and Andreas Stolcke. Improved discriminative training using phone lattices.
In Ninth European Conference on Speech Communication and Technology, 2005.
Y-T Zhou, Rama Chellappa, Aseem Vaid, and B Keith Jenkins. Image restoration using
a neural network. IEEE Transactions on Acoustics, Speech, and Signal Processing,
36(7):1141–1151, 1988.