Data-Driven Neural Network Based Feature - Phd-Thesis
by
Samuel Thomas
Baltimore, Maryland
December, 2012
© Samuel Thomas 2012
the message that is being communicated, (2) the speakers who are communicating and
(3) the environment in which the communication occurs. Depending on the final goal,
before being used for subsequent pattern recognition applications. Feature extraction
front-ends for automatic speech recognition (ASR) are designed to derive features that
characterize underlying speech sounds in the signal that are useful in recognizing the
spoken message. Irrelevant variability from speakers and the environment should also be suppressed. This thesis addresses these goals by developing a data-driven feature extraction approach. The key element in this ap-
nize phonemes, which are basic units of speech occurring at intervals of 5-10
combining information from multiple acoustic features derived using novel signal pro-
(LVCSR) task, the proposed features provide about 14% relative reduction in word error rate.
LVCSR systems with only a few hours of training data. In conventional systems, the
propose several techniques to deal with these low-resource scenarios by using features
from data-driven feature extractors trained on data from different languages and
multilingual data transcribed using different phoneme sets. Our approaches show that
with this kind of prior training at the feature extraction level, data-driven features can
compensate significantly for the lack of large amounts of training data in downstream
low-resource task with only 1 hour of transcribed training data for acoustic modeling.
Apart from being used to generate features, we also show how outputs from
the proposed data-driven front-ends can be used for a host of other speech appli-
cations. In noisy environments we show how data-driven features can be used for
speech activity detection on acoustic data from multiple languages transmitted over
noisy radio communication channels. In a novel speaker recognition model using neu-
ral networks, posteriors of speech classes are used to model parts of each speaker's acoustic space corresponding to broad phonetic classes. In zero resource settings, tasks such as spoken term discovery
attempt to automatically discover repeated words and phrases in speech without any
transcriptions. With no transcripts to guide the process, results of the search largely
depend on the quality of the underlying speech representation being used. Our ex-
periments show that in these settings significant improvements can be obtained using
phoneme posterior outputs derived using the proposed front-ends. We also explore a
different application of these posteriors - as phonetic event detectors for speech recog-
nition. These event detectors are used along with Segmental Conditional Random Fields.
Thesis Committee
Prof. Mounya Elhilali, Prof. Aren Jansen (Reader) and Prof. Hynek Hermansky
Acknowledgments
This thesis would never have been in place without so many great people
around me. I would like to thank my advisor, Prof. Hynek Hermansky for his
different projects and work with other research groups. Thank you very much for all
the mentoring!
I owe much to Sriram for always being there for me as a great friend and
collaborator. We have worked together on many interesting ideas and projects, several
of which form the core of this thesis. He has always been around to help - many thanks
for also reading this thesis! My sincere thanks to my colleagues - Sivaram, Harish,
Keith, Vijay, Feipeng, Ehsan, Janu, Kailash, Sridhar, Mike, Bala, Deepu, Joel, Fabio,
Petr, Mathew, John, Tamara, Lakshmi, Hari, Weifeng and Phil. Graduate school
would never have been as it was, without all of you! Thank you very much for the
Xinhui, Daniel, Dmitry and Ramani), BBN (Spyros, Stavros, Tim, Long and Bing)
and BUT (Lukas, Petr, Pavel, Martin, Ondrej and Honza) on the IARPA BEST,
I spent three different summers working with CLSP summer workshop teams
led by Dan, Nagendra, Lukas and Richard in 2009, Geoff and Patrick in 2010 and
Aren, Mike and Ken in 2012. These were great workshops! Thanks for having me
on your teams! My sincere thanks to Sanjeev for organizing these workshops and the
my GBO to final defense. Thank you very much Aren for the great collaboration,
I pulled through all of this because of the love and prayers of my family -
my son Joshua, wife Jamie and our parents. No words can express my thanks to the
Dedication
Contents
Abstract ii
Acknowledgments v
List of Figures xv
1 Introduction 1
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Data selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6 Conclusions 110
Bibliography 117
Vita 139
List of Tables
2.1 FDLP model parameters that improve robustness of short-term spectral fea-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.2 FDLP model parameters that improve performance of long-term modulation
features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3 Phoneme Recognition Accuracies (%) for different feature extraction tech-
niques on the TIMIT database . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4 Word Recognition Accuracies (%) on the OGI Digits database for different
feature extraction techniques . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5 Word Recognition Accuracies (%) on RT05 Meeting data, for different feature
extraction techniques. TOT - total word recognition accuracy (%) for all
test sets, AMI, CMU, ICSI, NIST, VT - word recognition accuracies (%) on
individual test sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6 Recognition Accuracies (%) of broad phonetic classes obtained from confusion
matrix analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.1 Word Recognition Accuracies (%) using different Tandem features derived
using only 1 hour of English data . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2 Word Recognition Accuracies (%) using Tandem features enhanced using
cross-lingual posterior features . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.3 Word Recognition Accuracies (%) using multi-stream cross-lingual posterior
features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4 Word Recognition Accuracies (%) using two languages - Spanish and English 67
3.5 Word Recognition Accuracies (%) using three languages - Spanish, German
and English . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.1 Word Recognition Accuracies (%) using different amounts of Callhome data
to train the LVCSR system with conventional acoustic features . . . . . . . 77
4.2 Word Recognition Accuracies (%) with semi-supervised pre-training . . . . 83
4.3 Word Recognition Accuracies (%) at different word confidence thresholds . 89
4.4 Word Recognition Accuracies (%) with semi-supervised pre-training . . . . 90
4.5 Word Recognition Accuracies (%) with semi-supervised acoustic model training 91
5.1 Equal Error Rate (%) on different channels using different acoustic features
and combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Performance in terms of Min DCF (×10³) and EER (%) in parentheses on
different NIST-08 conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 Integrating MLP based event detectors with ASR . . . . . . . . . . . . . . . 108
List of Figures
2.1 Illustration of the all-pole modeling property of FDLP. (a) a portion of the
speech signal, (b) its Hilbert envelope (c) all pole model obtained using FDLP. 35
2.2 PLP (b) and FDLP (c) spectrograms for a portion of speech (a). . . . . . . 38
2.3 Schematic of the joint spectral envelope, modulation features for posterior
based ASR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1 Schematic of the proposed training technique with multiple output layers . 58
3.2 Deriving cross-lingual and multi-stream posterior features for low resource
LVCSR systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3 Tandem and bottleneck features for low-resource LVCSR systems. . . . . . 68
4.1 (a) Wide and (b) Deep neural network topologies for data-driven features . 71
4.2 Data driven front-end built using data from the same language but from a
different genre. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 A cross-lingual front-end built with data from the same language and with
large amounts of additional data from a different language but with same
acoustic conditions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 LVCSR word recognition accuracies (%) with 1 hour of task specific training
data using the proposed front-ends . . . . . . . . . . . . . . . . . . . . . . . 78
4.5 MLP posteriogram based phoneme occurrence count . . . . . . . . . . . . . 87
5.1 Schematic of (a) features and (b) the processing pipeline for speech activity
detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.2 Average precision for different configurations of the wide topology front-ends 105
Chapter 1
Introduction
This chapter introduces the automatic speech recognition problem and machinery.
The theme of the thesis - developing data-driven feature extractors for speech recognition - is
motivated, along with a discussion of techniques that have been developed in the past. The
Automatic speech recognition is the process of transcribing speech into text. Cur-
rent speech recognition systems solve this task in a probabilistic setting using four key components - acoustic features, an acoustic model, a pronunciation dictionary
and a language model. In a word recognition task, given an acoustic signal corresponding
acoustic model, pronunciation dictionary and a language model are then used to find the most likely word sequence.
p(X) is the a priori probability of observing a sequence of words in the language, inde-
pendent of any acoustic evidence and is modeled using the language model component.
p(Y |X) corresponds to the likelihood of the acoustic features Y being generated given the
word sequence X.
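In the usual formulation, the recognizer combines these two quantities to find the most likely word sequence,

\hat{X} = \operatorname*{arg\,max}_{X} \; p(X \mid Y) = \operatorname*{arg\,max}_{X} \; p(X)\, p(Y \mid X),

where the marginal probability of the acoustics, p(Y), is dropped since it does not depend on the hypothesized word sequence X.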
In current ASR systems, both the language model and the acoustic model are
stochastic models trained using large amounts of training data [1, 2]. Hidden Markov Models
(HMMs) or a hybrid combination of neural networks and HMMs [3] are typically used as
acoustic models.
For large vocabulary speech recognition, not all words have an adequate number of
acoustic examples in the training data. The acoustic data also covers only a limited vocabulary. Rather than modeling whole words
or utterances using limited examples, acoustic models for basic speech sounds are instead
built. By using these basic units, recognizers can also recognize words without acoustic
training examples.
To compute the likelihood p(Y |X), each word in the hypothesized word sequence X
is first broken down into its constituent phones using the pronunciation dictionary. A single
composite model for the hypothesis is then constructed by combining individual phone
HMMs. In practice, to account for the large variability of basic speech sounds, HMMs
of context dependent speech units with continuous density output distributions are used.
There exist efficient algorithms like the Baum-Welch algorithm to learn the parameters of
these models. The language model provides the a priori probability p(X) [2]. Although p(X) is the probability of a sequence of words,
N -grams model this probability assuming that the probability of any word xi depends only on
the N−1 preceding words. These probability distributions are estimated from simple frequency
counts that can be obtained directly from large amounts of text. To account for the inability
to estimate counts for all possible N -gram sequences, techniques like discounting and back-off are used.
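As an illustration of how such counts turn into probabilities, the sketch below estimates bigram probabilities with an absolute-discount back-off to unigrams. The toy corpus, the discount value and the function names are illustrative only, and the back-off normalization is deliberately simplified compared to a full Katz or Kneser-Ney scheme.

```python
from collections import Counter

def train_bigram_backoff(sentences, discount=0.5):
    """Estimate bigram probabilities from frequency counts,
    backing off to unigram estimates for unseen word pairs."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded[:-1], padded[1:]))
    total = sum(unigrams.values())

    def prob(prev, word):
        if bigrams[(prev, word)] > 0:
            # Discounted relative frequency of the observed bigram.
            return (bigrams[(prev, word)] - discount) / unigrams[prev]
        # Mass freed by discounting is redistributed over the unigram back-off.
        seen = sum(1 for (p, _) in bigrams if p == prev)
        alpha = discount * seen / unigrams[prev] if unigrams[prev] else 1.0
        return alpha * unigrams[word] / total

    return prob

p = train_bigram_backoff([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(p("the", "dog"), p("cat", "the"))  # seen bigram vs. backed-off estimate
```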
Front-ends for ASR which have traditionally evolved from coding techniques like
linear predictive coding (LPC) [6] start by performing a short-term analysis of the speech
signal. Based on the assumption that speech is stationary in sufficiently short-time intervals,
the power spectrum (squared magnitude of the short-time Fourier spectrum) of the signal is computed in each analysis window
[7, 8]. This spectral representation of speech is then transformed into an auditory-like
representation by warping the frequency axis to the Mel or Bark scale and applying a non-linear compression. Mel frequency cepstral coefficient (MFCC)
[9] or Perceptual Linear Prediction (PLP) [10] features for speech recognition are cepstral
coefficients derived by projecting the auditory-like representation onto a set of discrete cosine
transform (DCT) basis functions. Since these techniques analyze the speech signal only in
short analysis windows, information about local dynamics of the underlying speech signal is
often provided by augmenting these features with derivatives of the cepstral trajectories at
each instant [11]. In speech recognition applications, the first 13 cepstral coefficients along with their first and second order derivatives are typically used as features.
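The overall pipeline can be summarized in the short sketch below. It uses triangular Mel filters and a log non-linearity purely for illustration, whereas PLP as described here uses Bark-spaced trapezoidal integrators, cube-root compression and an all-pole fit; the sampling rate, window sizes and helper names are assumptions, not values taken from the thesis.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced on the Mel scale (illustrative layout)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        if centre > left:
            fb[i, left:centre] = np.linspace(0.0, 1.0, centre - left, endpoint=False)
        if right > centre:
            fb[i, centre:right] = np.linspace(1.0, 0.0, right - centre, endpoint=False)
    return fb

def cepstral_features(signal, sr=8000, frame=200, hop=80, n_filters=23, n_ceps=13):
    """Short-term analysis: Hamming-windowed frames, power spectrum, auditory-like
    (Mel) warping, log compression and a DCT projection to cepstral coefficients."""
    frames = np.stack([signal[i:i + frame] * np.hamming(frame)
                       for i in range(0, len(signal) - frame, hop)])
    power = np.abs(np.fft.rfft(frames, n=frame)) ** 2
    auditory = power @ mel_filterbank(n_filters, frame, sr).T
    return dct(np.log(auditory + 1e-10), axis=1, norm='ortho')[:, :n_ceps]
```

In a full front-end these cepstra would then be augmented with their first and second time derivatives, as described above.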
Pattern classification involves assigning one of several class labels to an entity given an N dimensional feature vector x. One approach to this problem,
involves inferring posterior class probabilities p(Cj |x) of each class given the features. The
entity is then assigned to the class that gives the highest class posterior probability [12].
The posterior probability p(Cj |x) of each class can be estimated in multiple ways.
In a Bayesian formulation, p(Cj |x) can be expanded as p(x|Cj )p(Cj )/p(x). Each of the quantities
p(x|Cj ) and p(Cj ) are then separately computed from generative models trained to capture
these distributions from data. The probability p(Cj |x) can also be estimated directly from
a parametric model, whose parameters have also been optimized using the training data.
A third approach uses discriminant functions that predict the class label of the input [13]. In this framework, classification
is viewed as partitioning of the input feature space into different classes using decision
boundaries or surfaces. For a simple two class problem, a linear discriminant function can
be constructed as the linear combination of the input feature vector with a weight vector
w as

f(x, w) = w^T x + w_0.     (1.2)

In the N-dimensional input space, the function f(x, w) = w^T x + w_0 forms an N−1 di-
of the form

f(x, w) = w^T φ(x) + w_0,

where φ(.) is a fixed linear or non-linear vector function of the original input vector x. Using
these functions, for the J class problem we can design, for example, a J-class discriminant
ing avenue for integrating information from the data through the data dependent transfor-
mations of the input features. An example of a linear discriminant function is Fisher’s linear
discriminant. In this method, instead of using the linear combination of the input vector
to form a hyperplane for class assignment, the linear combination is used as a dimension-
ality reduction technique. The weight vector w is designed as a set of basis functions that
projects the feature vector x to a lower dimension such that there is maximal separation
between class means and the variance within each class is minimized. A common criterion used is the ratio of between-class to within-class scatter,

J(w) = \frac{w^T S_b w}{w^T S_w w},
where S_w and S_b are the within-class and between-class covariance matrices of the data. If
the dimensionality of the new projection space is M , the weight vector can be shown to
be the set of basis functions corresponding to the M eigenvectors of S_w^{-1} S_b with the largest
eigenvalues [12].
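A minimal numpy sketch of this construction, computing the Fisher/LDA basis as the leading eigenvectors of S_w^{-1} S_b, is given below; the small regularization term and the function name are implementation conveniences, not part of the original formulation.

```python
import numpy as np

def lda_basis(X, labels, n_components):
    """Fisher/LDA basis: leading eigenvectors of Sw^{-1} Sb.
    X: (n_samples, n_features) feature vectors, labels: class index per sample."""
    mean_all = X.mean(axis=0)
    n_feat = X.shape[1]
    Sw = np.zeros((n_feat, n_feat))   # within-class scatter
    Sb = np.zeros((n_feat, n_feat))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Solve the generalized eigenvalue problem via Sw^{-1} Sb (lightly regularized).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(n_feat), Sb))
    order = np.argsort(-eigvals.real)
    return eigvecs[:, order[:n_components]].real  # columns form the projection W

# Projecting features with Y = X @ W gives the reduced, class-separating representation.
```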
functions. In feed-forward neural networks, which are classic examples of these models, the
f(x, w) = g\left( \sum_{k=1}^{K} w_k \, φ_k(x) \right),     (1.6)
where g(.) is a non-linear activation function and φ_k(.) is now a non-linear basis function. During the training
phase, the basis functions and the weights are adjusted using the training data [13].
In a two layer neural network for example, the processing starts by creating linear
combinations of the N dimensional feature vector at each of the K hidden layer units. With
each of the hidden nodes being connected to every input node through a set of weights, an
activation

a_k = \sum_{n=1}^{N} w_{nk} x_n + w_{k0},     (1.7)
is first produced at each node. Each node activation then passes through a differentiable,
non-linear function ψ(.) to give a hidden output b_k = ψ(a_k). Commonly used activation functions are nonlinear sigmoidal functions like the logistic sigmoid or the
‘tanh’ function. Weight wnk is a trainable parameter connecting input node n and hidden
node k. w_{k0} is a fixed bias term of the hidden node. Activation outputs of the hidden layer
are then linearly combined again to form output unit activations. Each of the M output
nodes then computes an activation

a_m = \sum_{k=1}^{K} w_{km} b_k + w_{m0}     (1.8)
to produce an output of the form cm = σ(am ), where σ(.) is the ‘softmax’ activation function
defined as
σ(a_m) = \frac{\exp(a_m)}{\sum_{m'} \exp(a_{m'})},     (1.9)
for multi-class classification problems. Using (1.7)-(1.9), the overall network function can
be written as
h_m(x, w) = σ\left( \sum_{k=0}^{K} w_{km} \, ψ\left( \sum_{n=0}^{N} w_{nk} x_n \right) \right).     (1.10)
Comparing (1.6) with (1.10) shows how the non-linear basis functions ψ(.) are also now
learnt like the weight parameters. There are different training algorithms to learn these
parameters. In commonly used training methods, model parameters are optimized by us-
ing a cross-entropy error criterion and techniques like error back-propagation. For speech recognition, networks are typically trained to estimate posterior probabilities
of speech classes like phonemes, conditioned on the input features [3, 14].
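The forward pass defined by (1.7)-(1.10) can be written compactly as in the sketch below; the layer sizes are placeholders, and training by back-propagation with the cross-entropy criterion is not shown.

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """Forward pass of the two-layer network of (1.7)-(1.10).
    x: (N,) input features, W1: (K, N), b1: (K,), W2: (M, K), b2: (M,)."""
    a_hidden = W1 @ x + b1                       # eq. (1.7): hidden activations
    b_hidden = 1.0 / (1.0 + np.exp(-a_hidden))   # sigmoidal basis functions psi(.)
    a_out = W2 @ b_hidden + b2                   # eq. (1.8): output activations
    e = np.exp(a_out - a_out.max())              # eq. (1.9): softmax, shifted for stability
    return e / e.sum()                           # estimated class posteriors

# Example with placeholder sizes: 39 inputs, 1000 hidden units, 39 output classes.
rng = np.random.default_rng(0)
x = rng.standard_normal(39)
posteriors = mlp_posteriors(x,
                            rng.standard_normal((1000, 39)), np.zeros(1000),
                            rng.standard_normal((39, 1000)), np.zeros(39))
print(posteriors.sum())  # sums to 1.0
```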
Both the transforms reviewed above - transforms with linear basis functions and
transforms with non-linear basis functions form starting points for the development of more
[Figure 1.1 (block diagram): the speech signal's time-frequency representation is passed through front-end feature transforms - built from either linear or non-linear basis functions - to produce features.]
complex data-driven feature transforms and acoustic model backends in speech recognition.
Although transforms like the discrete Fourier transform and the discrete cosine transform
have been used, neither of these transforms is data driven. There has hence been considerable interest in improving these front-ends with more powerful data-driven techniques.
Data-driven feature transforms for ASR can be broadly classified. There are clearly two distinct sets of transformation
classes - while one set of transforms are strongly tied with the feature extraction module,
the second set is strongly coupled with the acoustic model and its training criteria. We call
the first class front-end feature transforms and the second class back-end feature transforms.
representations of speech. As shown in Figure 1.1, these transforms can be further cate-
gorized into two broad groups - data independent projections and data-driven projections.
Examples of data independent projections are the DCT transforms discussed earlier. Al-
though these are a set of fixed cosine basis functions, they are very similar to basis functions
that can be derived from a direct principal component analysis (PCA) [15] on the auditory representations themselves. PCA transforms a set of correlated
variables into a new set of values corresponding to linearly uncorrelated variables or prin-
cipal components. Figure 1.2 (reproduced from [16]) shows a set of spectral basis functions derived
using the data-dependent Karhunen-Loeve transform (KLT) on filter bank outputs using 2
hours of speech from the OGI Stories database [17]. The basis functions are very similar to
the cosine functions used in conventional features. The flatness of the first basis function
shows that the variation in the average energy is what contributes the most to the variance
of auditory representations.
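A sketch of how such a KLT basis can be derived from data is given below; the input is assumed to be a matrix of log critical-band energies, and the shapes, band counts and function name are illustrative.

```python
import numpy as np

def klt_basis(auditory_spectra, n_basis=6):
    """Karhunen-Loeve (PCA) spectral basis from critical-band spectra.
    auditory_spectra: (n_frames, n_bands) matrix of log filter-bank outputs."""
    centered = auditory_spectra - auditory_spectra.mean(axis=0)
    cov = np.cov(centered, rowvar=False)      # (n_bands, n_bands) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric matrix -> eigh
    order = np.argsort(-eigvals)
    return eigvecs[:, order[:n_basis]]        # columns: spectral basis functions
```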
LDA using the Fisher discriminant criterion described earlier has been used as a
useful tool in the development of many techniques in the second class of projections - data-
dependent projections. This class is sub-divided further into two groups - a set of transforms
that use linear basis derived by solving a generalized eigenvalue decomposition problem
Figure 1.2: Spectral basis functions derived using PCA on the bark-spectrum of speech from
the OGI stories database - Eigenvalues of the KLT basis, total covariance matrix projected
on the first 8 KLT vectors, first 6 KL spectral basis functions derived by PCA analysis.
and those which use neural network based techniques with non-linear basis functions. In
early work, Brown [18] and Hunt [19] have used LDA on features in speech recognition.
Hunt and his colleagues integrated LDA with Mel-auditory representations of speech in a
framework they called IMELDA - integrated Mel-scale representation with LDA [19, 20]. A
host of techniques have since been developed based on using LDA with HMM based speech
recognizers to improve recognition performances. These techniques have focused on the use
of different types of output classes like phones, subphones or HMM states and improvements
Studies by Malayath, van Vuuren, Valente and Hermansky [21–23] have analyzed the usefulness
of LDA with phonemes as output classes. Table 1.1 summarizes their key observations of
using LDA with different time-frequency representations of speech. All these techniques,
while decorrelating the input feature vectors, also maximize the class separability of the
resulting representations. Among the observations summarized in Table 1.1 are that discriminant vectors
derived from short-term spectra emphasize formant frequencies, and that discriminants
applied to long segments of critical-band trajectories are consistent with the RASTA filter and the delta features
proposed by Furui [24].
While the PCA and LDA techniques described above are useful in describing
transforms in the Euclidean space, manifold based techniques characterize data as being
embedded in a manifold space [25–27]. Several generic manifold learning techniques have
been adopted to be applied on speech data. While learning the manifold structure, several
of these techniques also model both global and local relationships between data points in
the manifold space as constraints. These learning problems are usually solved as constrained optimization problems.
Figure 1.3: LDA-derived spectral basis functions of the critical band spectral space derived
from the OGI Numbers corpus.
The second important class of front-end transforms uses neural networks. For acoustic
modeling, multilayer perceptron (MLP) based systems are trained on different kinds of acoustic features to estimate posterior probabilities of speech classes, typically
phonemes, conditioned on the input features [14]. Neural network based acoustic models have several attractive properties -
Training criteria - Neural networks are trained to discriminate between output classes
using non-linear basis functions, with a cross-entropy training criterion. This training
Figure 1.4: (a) Frequency and impulse responses of the first three discriminant vectors
derived by applying LDA on trajectories of critical-band energies from clean Switchboard
database, (b) Frequency and impulse responses of the RASTA filter and the RASTA filter
combined with the delta and double-delta filters.
Input feature assumptions - These networks can model high dimensional input features
without any strong assumptions about the probability distribution of these features.
Several different kinds of correlated feature streams can also be integrated together
Output representations - MLPs trained on large amounts of data from a diverse col-
lection of speakers and environments can achieve invariance to these unwanted vari-
abilities. Since posterior probabilities are produced by these networks, outputs from
In hybrid HMM/MLP systems [3], these posterior probabilities are used directly
as the scaled likelihoods of sound classes in HMM states instead of conventional state-
emission probabilities from GMM models (discussed in detail in Chapter 2). Alternatively,
these posteriors can be converted to features that replace conventional acoustic features,
in HMM/GMM based systems via the Tandem technique [28] (also discussed in detail in
Chapter 2). Features from intermediate layers of neural networks have also been shown to
Pinto et al. [31, 32] use a Volterra series based analysis to understand the behavior of
the non-linear transforms that are learned by MLPs trained to estimate phoneme posterior
probabilities. The linear Volterra kernels used to analyze MLPs trained on Mel-filter bank
features reveal interesting spectro-temporal patterns learnt by the trained system for each
phoneme class. An extended study on a hierarchy of MLPs using the same framework,
shows that when a second MLP classifier is trained on posteriors estimated by an initial
MLP, it learns phonetic temporal patterns in the posterior features. These patterns include
phonetic confusions at the output of the first MLP as well as phonotactics of the language
As shown in Figure 1.1, acoustic features after front-end level transforms are used
to train acoustic models. The distributions of basic speech sounds like phones are typically
represented by a Hidden Markov Model (HMM). Phone HMMs are constructed as finite state
machines with typically five states - a start state, three emitting states and an end state. In each emitting state,
continuous density Gaussian mixture models are used to model the emission probability
distribution of feature vectors. To cover the large phonetic variability, separate HMMs
are trained for every basic speech unit, typically a phone, in context with a left and right
neighboring phone. Individual Gaussian parameters along with the mixing coefficients of the
Gaussian mixture models are estimated in a maximum likelihood framework [2]. However
since the number of trainable tri-phone parameters is huge, additional techniques like state-
tying with phonetic decision trees are used. In a second stage of training, the acoustic
models are then discriminatively trained using objective functions such as maximum mutual
information (MMI) [33, 34], minimum phone error (MPE) [35] or minimum classification
error (MCE) [35]. To improve the performance in each of these two passes of acoustic
model training, separate feature transforms which adapt features to each of the training
phases have been proposed. This set of transforms forms the second major class of feature
In the past linear discriminant analysis has been investigated in several different
settings - to process feature vectors [18], as a transform to improve the discrimination be-
tween HMM states [36] and also as a feature rotation and reduction technique in a maximum
likelihood setting [37]. Kumar and Andreou generalized LDA with Heteroscedastic linear
discriminant analysis (HLDA) [38] by relaxing the assumption of sharing the same covari-
ance matrix among all output classes. Also developed in a maximum likelihood setting, the
Maximum Likelihood Linear Transform (MLLT) [39] has been shown to be a special case
Feature space transforms like fMMI [40] and fMPE [41] on the other hand, are
linear transforms also applied on feature vectors but in a discriminative framework to opti-
mize the MMI/MPE objective functions. Similar to the early work in [42], region dependent
linear transforms (RDLT) [43] are an extension of fMMI/fMPE that first partitions the
feature space into different regions using a GMM. Each feature vector is then transformed
by a linear transform corresponding to the region that the vector belongs to, via posterior prob-
transforms. Studies like [44] have shown that although these transforms are separately
applied at the feature and model level, they can be combined to significantly improve ASR
performances.
The feature extraction module plays a very crucial “gate-keeper ” role in any pat-
tern recognition task. If useful information for classification is discarded from the signal by a poorly de-
signed feature extractor, it cannot be recovered again and the classification
task suffers. On the other hand if the feature extraction module allows irrelevant and re-
dundant detail to remain in the features, the classification module has to be additionally
developed to cope with this. In speech recognition, a similar setting exists - a feature
extraction front-end first produces features for a pattern recognition back-end to recognize
words. To improve the performances in this setting, this thesis focuses on developing better
The review presented above describes one avenue of improvement for current
speech recognition feature front-ends - the development of better data-driven features. Fig-
ure 1.5 reiterates this again. The primary goal of speech recognition is to extract the mes-
sage that the human communicator produced using an inventory of basic speech units. However
the message is embedded with several constituent components of the speech signal as it
passes through a communication channel influenced by the human speaker, the transmis-
sion mechanism and the environment before it is captured by a machine using a microphone.
It is the goal of the feature extractor module to remove these irrelevant variabilities while
extracting useful features for the speech recognition back-end to recover the message.
Current speech recognition front-ends largely rely on information in the short-term spectrum
of speech. This representation is however very fragile and easily corruptible by channel
sources of knowledge. The best source of information is the data itself. This thesis hence
In earlier sections, several techniques that allow data integration into feature ex-
traction were reviewed. Neural networks provide very interesting mechanisms of integrating
information not only because they are discriminatively trained and use non-linear basis func-
tions to transform the data but also because they have been shown to have several other
Figure 1.5: Thesis contributions to developing better data-driven neural network features for
the ASR pipeline.
key advantages. For example they can accommodate large feature dimensions and do not
place strong assumptions on the distributions of these features. A very significant advan-
tage is that they can also directly produce posterior probabilities of speech classes, making
the posteriogram representation of speech - the evolution of the posteriors of speech classes
like phonemes over time - a useful source of information for speech recognition (see Figure
1.5). As can be seen, this representation is largely devoid of speaker and channel variabilities and is
linked more closely to the underlying speech message encoded using basic speech units like
phonemes.
eral factors. The MLP estimates posterior probabilities of phoneme classes ci conditioned
on the input acoustic features x and the model parameters w as p(ci |x, w). The factors
(a) The input acoustic features: Robust acoustic features which capture information from
(b) The amount of training data: Significant amounts of task dependent data needs to be
(c) Network architectures: Suitable network architectures have to be used to learn the
data-driven transforms.
(a) Exploiting temporal dynamics of speech: We adopt a novel signal processing tech-
nique based on Frequency Domain Linear Prediction (FDLP) to better model sub-band
temporal envelopes of speech. Features from these representations are used to build data-driven front-ends. On tasks ranging
from a small vocabulary continuous digit recognition task to a large vocabulary continu-
ous speech recognition (LVCSR) task, the proposed data-driven features provide about 14% relative reduction in word error rate (Chapter 2).
(b) Working with limited amounts of training data: With significant amounts of train-
ing data the proposed data-driven features can perform well (Chapter 2). However, in
several real-world scenarios this is not always the case. In the development of ASR
technologies for new languages and domains, for example, very few hours of transcribed
data is available initially. We hence focus on data-driven features in low resource sce-
narios where only up to 1 hour of transcribed task dependent data is available to train
acoustic models. As with every data-driven technique, the performances of the proposed features also degrade with limited training data. Several of our
techniques are based on the use of task independent data. In many cases, these sources
of data cannot be used directly. For example, if data from different languages is used to
build ASR systems for a new language, differences in phone sets used to transcribe each
language come into play. We propose techniques to deal with these kinds of issues in training
data-driven front-ends.
use of several neural network architectures to allow task independent data to be used.
Using data transcribed with different phone sets from different languages, these im-
provements allow better neural network models to be built. Our contributions lead to
an absolute WER reduction of about 15% on a low-resource task with only 1 hour of
data-driven front-ends are presented, apart from using them to generate features for
ASR (Chapter 5). These applications include - speech activity detection in noisy en-
1. Chapter 2 is an overview of different feature extraction techniques for ASR and in-
troduces a set of new features using Frequency Domain Linear Prediction. The usefulness of
niques - performances of these systems drop significantly with only a few hours of tran-
scribed training data. We show how this can be compensated using the proposed
data-driven front-ends which also are affected in these scenarios. The proposed ap-
wider and deeper neural network architectures in low resource settings. Typically
these kinds of networks cannot be trained well with only a few hours of transcribed
training data. We however show how task independent data can be used in these
settings as well.
features for speech activity detection, probabilities of broad phonetic classes to model
parts of each speaker's acoustic space in a neural network based speaker verification
system, feature representations for zero-resource applications and event detectors for
speech recognition.
Chapter 2
From Acoustic Features to Data-driven Features
This chapter introduces a novel acoustic feature extraction technique for ASR. Data-driven
front-ends are developed using these features and evaluated on different ASR tasks. Significant
improvements are demonstrated by using the proposed features with neural network based front-ends.
Conventional front-ends analyze the speech signal in short windows, typically 20-30 ms long, shifted
every 10 ms. Although speech is a non-stationary signal, over sufficiently short time inter-
vals, the signal can be considered stationary. In each of these analysis windows, the power
spectrum - squared magnitude of the short-term Fourier spectrum is then computed, before
being processed along two dimensions - across frequency and across time. The processing
across frequency attempts to model the gross shape of the spectrum. Temporal dynamics
of the spectrum are, on the other hand, captured by the processing across time.
Processing across the frequency axis has two primary objectives. Through a se-
quence of steps the resolution of the spectrum is first modified to be non-uniform instead
of the inherent uniform Fourier transform resolution. The non-uniform resolution has been
shown to be useful for discriminating between basic speech sounds [22]. The spectrum is also
smoothed to capture only its gross shape and remove any rapidly varying fluctuations.
In Perceptual Linear Prediction (PLP) [10], the first objective is achieved through
a set of operations motivated by human auditory perception which convert the power spec-
• Using a filter-bank of trapezoidal filters to warp the power spectrum to a Bark fre-
quency scale. Outputs of these integrators are consistent with the notion of integration
• Emphasizing each sub-band frequency signal using a scaling function based on the
emphasis in the time domain. Pre-emphasis is performed to remove the overall spectral
• Compressing the sub-band signals using the cubic root function. This step is moti-
vated by the power law of hearing that relates intensity and perceived loudness.
The gross shape of the auditory spectrum of speech is finally approximated using an auto-
regressive model. The prediction coefficients of this model are obtained via a recursive auto-
correlation based method on the inverse Fourier transform of the auditory spectrum [10]. As
described in the previous chapter, the features for ASR are cepstral coefficients obtained by
projecting the smoothed auditory-like representation onto a set of discrete cosine trans-
form (DCT) basis functions. Based on the source-filter interpretation of LPC, smoothing
the spectrum using LPC allows the features to capture vocal tract filter properties which
are useful in characterizing speech sounds. Apart from its decorrelation and dimensionality
reduction properties, truncating the DCT coefficients also removes higher order coefficients
that capture speaker specifics in the spectrum. This further smooths the spectrum.
Just as the gross shape of the spectrum is useful in characterizing speech sounds,
temporal dynamics of the spectrum are also key to classification. Several important obser-
vations [46] useful while capturing these dynamics include facts that -
• speech is produced at a typical rate by vocal tract movements. The rate of change of
Traditionally temporal dynamics have been captured through first and second or-
der time derivatives of cepstral coefficients [11]. These operations can also be interpreted as filters that emphasize modulation frequencies typical of
speech while suppressing other higher and lower components. In the RASTA processing of
speech, the above mentioned observations are explicitly integrated into the PLP pipeline by
filtering the temporal trajectories of the spectrum to suppress constant factors while preserving components that vary at rates typical of speech. Similar
to the RASTA technique, a bank of bandpass filters with varying resolutions has also been proposed [47].
As illustrated in Figure 1.5, the speech signal is often modified by channel and
Differences in vocal tract anatomies lead to significant variability in the spec-
trum between speakers and genders. Other extrinsic characteristics that produce speaker
variabilities are the socio-linguistic background and emotional state of the speakers. While
some of these artifacts are compensated for by techniques like PLP, techniques such as vocal
tract length normalization (VTLN) [48] are often used by state-of-the-art feature extraction
techniques.
Effects from the channel or environment are usually modeled as additive or convo-
lutive distortions. When the speech signal is corrupted by additive noise, the recorded speech
signal is expressed as

ns[m] = cs[m] + n[m],
where ns[m], cs[m], n[m] are discrete representations of the noisy speech, clean speech and
the corrupting noise respectively. If the speech and noise are assumed to be uncorrelated, their power spectra are additive,

P_ns(m, ω_k) = P_cs(m, ω_k) + P_n(m, ω_k),
where Pns (m, ωk ), Pcs (m, ωk ), Pn (m, ωk ) are the short-term power spectral densities at fre-
quency ωk of the noisy speech, clean speech and noise respectively. Conventional feature
extraction techniques for ASR estimate the short term (10-30 ms) power spectral density
(PSD) of speech in bark or mel scale. Hence, most of the recently proposed noise robust
feature extraction techniques apply some kind of spectral subtraction in which an estimate
of the noise PSD is subtracted from the noisy speech PSD. The estimate of noise PSD
is usually computed using a speech activity detector from regions likely to contain only
noise (for example the ETSI front-end [49]). A survey of many other common techniques
is available in [50].
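A minimal sketch of this idea is given below, assuming the noise-only frames have already been marked by a speech activity detector; the averaging, the spectral floor and the function name are simplifications and are not meant to reproduce the ETSI front-end.

```python
import numpy as np

def spectral_subtraction(noisy_psd, noise_frames, floor=0.01):
    """Subtract an estimated noise PSD from the noisy-speech PSD.
    noisy_psd: (n_frames, n_bins) short-term power spectra,
    noise_frames: boolean mask, True for frames a speech activity
    detector marked as containing noise only."""
    noise_psd = noisy_psd[noise_frames].mean(axis=0)    # average noise estimate
    clean_est = noisy_psd - noise_psd                   # assumes uncorrelated noise
    return np.maximum(clean_est, floor * noisy_psd)     # spectral floor avoids negatives
```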
When the speech signal is instead convolved with a channel impulse response or room impulse response, the noisy speech can be written as

ns[m] = cs[m] ∗ r[m],
where ns[m], cs[m], r[m] are the noisy speech, clean speech and the corrupting room or chan-
nel impulse response respectively. These kinds of convolutive distortions are multiplicative
in the spectral domain and additive in the log-spectral domain. However these assumptions require the impulse response to
be shorter than the short-term Fourier transform (STFT) analysis window. If the artifact
is assumed to be constant in the short analysis window, its effect appears as an offset term
in the final cepstral representation. If the channel remains the same for each recorded ut-
terance, the artifact can then be removed by subtracting the mean of the cepstral sequences, a technique known as cepstral mean subtraction (CMS), which compensates for such convolutive
distortions [51]. Usually this is done on a per speaker basis as well to achieve additional
speaker normalization. In the log-DFT mean normalization technique [52], mean subtrac-
tion is done on a linear frequency scale instead of a warped scale as in CMS since the
assumption that response functions might have a constant value in each critical band is not
always valid.
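The cepstral mean subtraction step itself is a one-liner; the sketch below assumes per-utterance normalization, with per-speaker normalization obtained simply by pooling the frames of one speaker before taking the mean.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Remove a stationary convolutive channel (an additive offset in the
    cepstral domain) by subtracting the mean of each cepstral trajectory.
    cepstra: (n_frames, n_coeffs) cepstra of one utterance or one speaker."""
    return cepstra - cepstra.mean(axis=0)
```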
To handle longer reverberant distortions, it becomes necessary to estimate the log-spectrum in much longer analysis windows. This is because
room impulse responses characterized by their T60 reverberation times usually have ranges
between 200-800 ms. T60 denotes the amount of time required for the reverberant signal to
reduce by 60 dB from the initial direct component value. Successful techniques like long-
term spectral subtraction (LTLSS) [53] hence use analysis windows as long as 2 seconds to
above, acoustic features for speech recognition are derived from the representation. The
most common feature representations are cepstral vectors of the processed power spectral
envelope derived using 20-30 ms analysis windows every 10 ms. These features are then
typically augmented with time derivatives (first, second and third derivatives). In some sys-
tems instead of using the time derivatives, 9-21 successive frames are concatenated together
and used after a projection to a lower dimension using various transforms [54–56].
Aggregating information from only such a limited temporal context could, however, be limiting.
This argument is further strengthened by information theoretic results showing that features
from longer time intervals (up to several hundred milliseconds) are useful for better discrimina-
tion between speech sounds [58]. This limitation has been addressed using several different
Through a number of studies it has been shown that speech perception is sensitive
to relatively slow modulations of the temporal envelope of speech [59, 60]. Most of the
energy in the modulation spectrum peaks around 4 Hz, which also corresponds to the syllabic
rate of speech. In the presence of noise, although these components are affected [61, 62],
Information from the modulation spectrum can be derived from a spectral analysis of sub-band temporal trajectories. To obtain
sufficient spectral resolution at the low modulation frequencies described above,
relatively long segments of the speech signal need to be analyzed. For example to capture
necessary. Analysis windows of this length are also consistent with the time intervals of co-articulation
phenomena and the linguistic concept of the syllable [64]. By deriving features for ASR using
these kinds of analysis windows, information about the dynamics of spectral components
In [65, 66], 1 second long temporal trajectories of individual critical sub-band en-
ergies were used for phoneme recognition experiments. In this multi-stream framework,
separate neural network classifiers were trained on long-term features from each sub-band
before being combined together by a second level neural network. Since features from
each sub-band were used independently, the comparable performance of this feature ex-
traction technique with conventional short-term spectral features demonstrates that there
is significant information in the local temporal dynamics being captured. These temporal
pattern features (TRAPS) have been extended in different configurations (for example [67])
as modulation features after applying a cosine transform [68] or filtering using modulation
filters [47].
The modulation features discussed above are extracted from sub-band energies
of speech using long analysis windows. The sub-band energies are not directly modeled
but are instead produced with an inherent limited resolution as outputs of Bark/Mel scale filter banks (Section
2.1.1). For more effective features that capture the evolution of the temporal envelopes, it
in time to effectively capture spectral resonances. Based on duality properties, LPC can
similarly be performed in the frequency domain to directly model and capture important
temporal events. This framework is based on the notion that speech can be considered to
analytical signals. The squared magnitude of the analytical signal is also called the Hilbert
envelope and is a description of temporal energy. Instead of computing the analytic signal
directly, an auto-regressive modeling approach can be used. This modeling approach also
called Frequency Domain Linear Prediction (FDLP) is the dual of conventional time domain
linear prediction used to model the power spectrum of speech [69, 70]. Instead of modeling
the power spectrum, FDLP models the evolution of signal energy in the time domain
by the application of linear prediction in the frequency domain using the discrete cosine
transform of the signal. This parametric model can be used as an alternate technique to
The modulation features described in the earlier sections are typically high dimen-
sional correlated features. Both these limitations prevent them from being used directly
with ASR systems. These features have hence been used in conjunction with neural net-
works which have much more relaxed assumptions on feature distributions. As described
in the previous chapter neural networks can be trained to estimate posterior probabilities
of speech classes. These probabilities can then be used directly as scaled likelihoods in the
posteriors to features similar to traditional acoustic features for ASR systems. In the Tan-
dem processing approach [28], posterior features from neural networks are post-processed in a two-
step procedure - a log transform is first applied to the posteriors to Gaussianize the vectors, which are then decorrelated using a KLT. Other approaches
have been proposed to derive features from the outputs of neural networks. In the HATS
technique [73] non-linear outputs from the penultimate layer of a network have been used.
This has been further extended to deriving features from an intermediate bottleneck layer
A key benefit from the development of long-term features is the significant
LVCSR gain obtained from combining these features with conventional short-term fea-
tures [73]. The best combination of features is obtained by first training neural networks
using both the long-term modulation features and short-term spectral energy based features
separately. The outputs of the neural networks are then combined using a merger neural
network or using different combination rules before being used as data-driven features for
LVCSR tasks [74]. As discussed in [75], this approach is useful for several reasons -
• The MLP features derived from neural networks trained on conventional short-term
• Although the MLPs are trained on different inputs, since they have the same target
• During the training phase the neural networks are able to discriminatively learn
class boundaries and produce data-driven features that are useful for classification
• After the application of post-processing techniques like Tandem, the data-driven neu-
ral network features can easily be modeled by HMM-GMM based LVCSR systems.
We propose a novel feature extraction scheme along the lines of the techniques
described above, to derive two kinds of features - short-term spectral features and long-
term modulation features for ASR. The technique starts by creating a two-dimensional
auditory spectrogram representation of the input signal. This is formed by stacking sub-band temporal envelopes, each describing the evolution of energy in one frequency band over
time.
The sub-band temporal envelopes are obtained by analyzing speech using Fre-
quency Domain Linear Prediction (FDLP). The FDLP technique, as described earlier, fits
an all pole model to the Hilbert envelope of the signal (See Figure 2.1). These representa-
tions of the speech signal are able to capture fine temporal events associated with transient
Figure 2.1: Illustration of the all-pole modeling property of FDLP. (a) a portion of the
speech signal, (b) its Hilbert envelope (c) all pole model obtained using FDLP.
events like stop bursts while at the same time summarizing the signal's gross temporal evo-
lution [76]. Short-term features are derived by integrating the auditory spectrogram in
short analysis windows. Long-term modulation frequency components are obtained after
the application of the cosine transform on compressed (static and adaptive compression)
The FDLP time-frequency representation is created through the following steps [72] -
sub-band frequency signal. To facilitate this, the speech signal is first projected into
(b) Analysis of speech into sub-band frequency signals - Sub-band frequency signals are
obtained by windowing the DCT transform using a set of overlapping Gaussian windows
(c) Computation of auto-correlation coefficients via a series of dual operations of time do-
main linear prediction (TDLP) - Among the many approaches, one way of applying
TDLP is using the auto-correlation of the time signal. The auto-correlation coeffi-
cients are in turn derived from the power spectrum since the power spectrum and
auto-correlation of the time signal form Fourier transform pairs. In the FDLP case, the
Hilbert envelope and the auto-correlation of the DCT signal form Fourier transform
pairs.
Since the sub-band DCT signals have already been derived in the previous step, their Hilbert envelopes can be obtained as the squared
magnitude of the inverse discrete Fourier (IDFT) transform of the DCT signal. The Fourier transform of each envelope in turn yields the required auto-correlation coefficients.
(d) Application of linear prediction - By solving a system of linear equations, the auto-
regressive model of each sub-band Hilbert envelope is finally derived from the auto-
correlation coefficients. Using the set of prediction coefficients {ai } the estimated
Hilbert envelope in each sub-band, \hat{HE}_s(n), can be represented as

\hat{HE}_s(n) = \frac{G}{\left| \sum_{i=0}^{p} a_i \, e^{-i 2\pi k n} \right|^2}.     (2.4)
The parameter G is called the gain of the model. In [77], by normalizing the gain G, the
estimated sub-band envelopes have been shown to become robust to convolutive distor-
tions like reverberations and telephone channel artifacts. Additional robustness to additive
distortions by short-term subtraction of an estimate of noise has also been shown in [78].
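A compact full-band sketch of the all-pole model in (2.4) is given below. It applies the autocorrelation method of linear prediction directly to the DCT of the signal; in the thesis the DCT is first split into Bark-spaced sub-bands with Gaussian windows, and the gain normalization of [77] is not shown. The model order, helper names and sampling details are assumptions.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def fdlp_envelope(signal, order=20, n_points=None):
    """All-pole (FDLP) estimate of the Hilbert envelope of `signal`:
    linear prediction applied to the DCT of the signal (full band here)."""
    y = dct(signal, norm='ortho')                      # frequency-domain "signal"
    n_points = n_points or len(signal)
    # Autocorrelation method of linear prediction, applied to the DCT sequence.
    r = np.correlate(y, y, mode='full')[len(y) - 1:len(y) + order]
    lp = solve_toeplitz(r[:order], -r[1:order + 1])    # predictor coefficients
    a = np.concatenate(([1.0], lp))
    gain = r[0] + np.dot(lp, r[1:order + 1])           # prediction error power
    # Sampling the squared magnitude response of G / A(z) gives the envelope, as in (2.4).
    _, H = freqz(1.0, a, worN=n_points)
    return gain * np.abs(H) ** 2
```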
There are several parameters that control the temporal resolution of the estimated envelopes
as well as the type and extent of analysis windows for different applications. These have
Figure 2.2 shows the PLP and FDLP spectrograms for a portion of speech. Criti-
cally spaced sub-band energies of speech are derived in short analysis windows in the PLP
case. The representation is hence smooth across frequencies in each analysis window. In-
dividual sub-bands of speech are directly modeled in the FDLP technique, resulting in a better
temporal resolution - for example the transient regions are well captured in this representa-
tion. Two kinds of features are derived from the two-dimensional time-frequency representation
Figure 2.2: PLP (b) and FDLP (c) spectrograms for a portion of speech (a).
first integrated using Mel/Bark integrators in short analysis windows to create sub-band
sub-band trajectories of spectral energy, identical distributions of energy in the time domain
(sub-band Hilbert envelopes) are estimated. Short-term cepstral features can be derived
This is done by first integrating the envelopes in short term analysis Hamming
windows (of the order of 25 ms with a shift of 10 ms). The integrated sub-band energies
are then converted to cepstral coefficients by applying the log transform and taking the
DCT transform across the spectral bands in each of the frames. For most applications we
use 13 cepstral coefficients. First and second derivatives of these cepstral coefficients are
also appended to form a 39 dimensional feature vector [79, 80], similar to conventional PLP
features. In [72, 81], a set of FDLP modeling parameters that improve the performance
of these short-term features for ASR in noisy environments has been identified. These
parameters and their effects are summarized in Table 2.1. In all these experiments, both
clean and noisy reverberant test data is evaluated on models trained with clean speech.
Modulation features are derived by analyzing temporal trajectories of spectral energy estimates in indi-
vidual sub-bands using long analysis windows. As described earlier, since FDLP estimates
the temporal envelope in sub-bands, modulation features can be derived from these en-
Before we derive the long-term features, we compress the sub-band temporal en-
velopes both statically and dynamically. The envelopes are compressed statically using the logarithmic function, and dynamically using an adaptation circuit which consists of five consecutive nonlinear adaptation loops proposed in [83].
These loops are designed so that sudden transitions in the sub-band envelope that are fast
compared to the time constants of the adaptation loops are amplified linearly at the out-
put, while the steady state regions of the input signal are compressed logarithmically. The
compressed temporal envelopes are then transformed using the Discrete Cosine Transform
(DCT) in long term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding a modulation spectrum in the 0-
35 Hz range with a resolution of 2.5 Hz [84]. The static and dynamic modulation frequency
components of each critical band are then stacked together before being used as features.
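The sketch below computes the static (log-compressed) modulation stream from one sub-band envelope; the adaptive-compression stream built from the five adaptation loops of [83] is omitted, and the envelope frame rate is an assumed value chosen so that a 200 ms window with 14 DCT coefficients spans 0-35 Hz at 2.5 Hz resolution.

```python
import numpy as np
from scipy.fftpack import dct

def modulation_features(envelope, fs_env=100, win_ms=200, shift_ms=10, n_coeffs=14):
    """Static-stream modulation features from one sub-band temporal envelope.
    envelope: energy trajectory sampled at fs_env frames per second."""
    log_env = np.log(envelope + 1e-10)                 # static (logarithmic) compression
    win = int(fs_env * win_ms / 1000)                  # 200 ms window
    shift = int(fs_env * shift_ms / 1000)              # 10 ms shift
    feats = []
    for start in range(0, len(log_env) - win + 1, shift):
        segment = log_env[start:start + win]
        feats.append(dct(segment, norm='ortho')[:n_coeffs])  # low modulation frequencies
    return np.stack(feats)                             # (n_frames, n_coeffs)
```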
In [85], the proposed modulation features have been compared with other similar
modulation feature techniques - Modulation Spectrogram (MSG) [86], MRASTA [47] and
Fepstum [87]. In these experiments FDLP based modulations are significantly better than
features derived from the other approaches. An additional set of FDLP modeling parameters
that improve the performance of these long-term features for ASR have also been identified
based on a set of phoneme recognition experiments. These parameters and their effects are
Figure 2.3: Schematic of the joint spectral envelope and modulation features for posterior based
ASR
Each of these acoustic features is converted into data-driven features by using
them to first train two separate 3-layer multilayer perceptrons to estimate posterior prob-
abilities of phoneme classes. Each frame of the short-term spectral envelope features is
used with a context of 9 frames during training. As described earlier, static and dynamic
modulation frequency features of each critical band are stacked together and used to train
a separate MLP network. The spectral envelope and modulation frequency features are
then combined at the phoneme posterior level using the Dempster Shafer (DS) theory of
evidence [74]. These phoneme posteriors are first Gaussianized by using the log function
and then decorrelated using the Karhunen-Loeve Transform (KLT) [28]. This reduces the
dimensionality of the feature vectors by retaining only the feature components which con-
tribute most to the variance of the data. We use 25 dimensional features in our Tandem
representations similar to [75]. Figure 2.3 shows the schematic of the proposed feature
extraction technique.
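As an illustration of the Tandem post-processing described above, the sketch below Gaussianizes combined posteriors with the log function and applies a KLT, implemented here as a projection onto the principal eigenvectors of the covariance matrix, retaining 25 components. The combination of the two posterior streams is assumed to have been performed already.

```python
import numpy as np

def tandem_features(posteriors, n_components=25, eps=1e-10):
    """Tandem post-processing of combined MLP phoneme posteriors.

    posteriors : (n_frames, n_phonemes) array of posterior probabilities
    Returns a (n_frames, n_components) array of decorrelated features.
    """
    # Gaussianize the posteriors with the log function
    logp = np.log(posteriors + eps)
    # Karhunen-Loeve Transform: keep the components with the largest variance
    mean = logp.mean(axis=0)
    centered = logp - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order]
```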
We evaluate the proposed spectral envelope and modulation frequency features along with other state-of-the-art fea-
tures for ASR. These include a phoneme recognition task, a small vocabulary continuous
digit recognition task and a large vocabulary continuous speech recognition (LVCSR) task.
For each of these experiments, we train three layered MLPs to estimate phoneme posterior
probabilities using these features. The proposed features are compared with three other
feature extraction techniques - PLP features [10] with a 9 frame context which are similar
to spectral envelope features derived using FDLP (FDLP-S), M-RASTA features [47] and
Modulation Spectro-Gram (MSG) features [86] with a 9 frame context, which are both long-term representations comparable to FDLP-M. We combine the FDLP-S and FDLP-M features using the DS theory of evidence to obtain a joint spectro-temporal feature set (FDLP-S+FDLP-M). Similarly, we derive two more feature sets by combining PLP with M-RASTA and with MSG (PLP+M-RASTA, PLP+MSG). 25 dimensional Tandem representations of these features are used for our experiments. We also
experiment with 39 dimensional PLP features without any Tandem processing (PLP-D).
In our first experiment, we use these features for a phoneme recognition task using HMMs. We perform experiments on the
TIMIT database, excluding ‘sa’ dialect sentences. All speech files are sampled at 16 kHz.
The training data consists of 3000 utterances from 375 speakers, cross validation data set
consists of 696 utterances from 87 speakers and the test data set consists of 1344 utterances
from 168 speakers. The TIMIT database, which is hand-labeled using 61 labels, is mapped to the standard set of 39 phonemes [89]. A three layered MLP is used to estimate the phoneme posterior probabilities. The network, consisting of 1000 hidden neurons and 39 output neurons (with softmax nonlinearity) representing the phoneme classes, is trained using the standard back propagation algorithm with the cross entropy error criterion. The learning rate and stopping criterion are controlled by the frame-based phoneme classification error on the cross-validation data.
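The following PyTorch sketch illustrates the kind of three-layer posterior estimator described above; the input dimensionality, hidden nonlinearity, learning rate and optimizer are illustrative assumptions rather than the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

# Minimal sketch of a 3-layer posterior estimator: input layer, one hidden
# layer of 1000 units and a 39-way output layer trained with cross-entropy.
input_dim = 351            # e.g. 39-dim features with a 9-frame context (assumed)
mlp = nn.Sequential(
    nn.Linear(input_dim, 1000),
    nn.Sigmoid(),
    nn.Linear(1000, 39),   # one output per phoneme class (softmax via the loss)
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.01)

def train_step(features, phone_labels):
    """One back-propagation step; learning-rate scheduling and stopping would
    be driven by frame classification error on held-out data."""
    optimizer.zero_grad()
    loss = criterion(mlp(features), phone_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```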
The Tandem representation of each feature set is used along with a decision tree
clustered triphone HMM with 3 states per triphone, trained using standard HTK maximum
likelihood training procedures. The emission probability density in each HMM state is mod-
eled with 11 diagonal covariance Gaussians. We use a simple word-loop grammar model
using the same standard set of 39 phonemes. Table 2.3 shows the results for phoneme recog-
nition accuracies across all individual phoneme classes for these techniques. The proposed
Table 2.3: Phoneme Recognition Accuracies (%) for different feature extraction techniques on the TIMIT database

Features           Accuracy (%)
PLP-D              68.3
PLP                70.1
FDLP-S             70.1
M-RASTA            66.8
MSG                65.1
FDLP-M             70.6
PLP+M-RASTA        71.2
PLP+MSG            71.4
FDLP-S+FDLP-M      72.5
In our second experiment, we use these features for small vocabulary continuous digit recognition (OGI Digits database) to recognize eleven (0-9 and zero) digits with 28 pronunciation variants [47]. MLPs are trained using these features to estimate posterior probabilities of 29 English phonemes using the whole Stories database plus the training part of the Numbers95 database, with approximately 10% of the data for cross-validation. Tandem representations of the features are used along with a phoneme-based HMM system with each state modeled by 32 Gaussian mixture components [47]. Table 2.4 shows the results for word recognition
accuracies. For this task, the proposed spectral envelope features (FDLP-S) and modulation
Table 2.4: Word Recognition Accuracies (%) on the OGI Digits database for different feature extraction techniques

Features           Accuracy (%)
PLP-D              95.9
PLP                96.2
FDLP-S             96.6
M-RASTA            96.3
MSG                96.0
FDLP-M             96.8
PLP+M-RASTA        97.1
PLP+MSG            97.0
FDLP-S+FDLP-M      97.1
frequency features (FDLP-M) improve word recognition accuracies compared to PLP and the other baseline features.
In our third experiment, we use these features on an LVCSR task using the AMI
LVCSR system for meeting transcription [90]. The training data for this system uses indi-
vidual headset microphone (IHM) data from four meeting corpora; NIST (13 hours), ISL
(10 hours), ICSI (73 hours) and a preliminary part of the AMI corpus (16 hours). MLPs
are trained on the whole training set in order to obtain estimates of phoneme posteriors for
each of the feature sets. Acoustic models are phonetically state tied triphone models trained
using standard HTK maximum likelihood training procedures. The recognition experiments
Table 2.5: Word Recognition Accuracies (%) on RT05 Meeting data, for different feature
extraction techniques. TOT - total word recognition accuracy (%) for all test sets, AMI,
CMU, ICSI, NIST, VT - word recognition accuracies (%) on individual test sets
are conducted on the NIST RT05 [91] evaluation data. The AMI-Juicer large vocabulary
decoder is used for recognition with a pruned trigram language model [92]. This is used
along with reference speech segments provided by NIST for decoding and the pronuncia-
tion dictionary used in AMI NIST RT05s system. Table 2.5 shows the results for word
recognition accuracies for these techniques on the RT05 meeting corpus. The proposed features (FDLP-S+FDLP-M) obtain significant relative improvements for the LVCSR task.
Table 2.6: Recognition Accuracies (%) of broad phonetic classes obtained from confusion matrix analysis (columns: M-RASTA, FDLP-M)
2.5 Conclusions
showed that this technique based on FDLP can capture important details in speech
• Two kinds of acoustic features - a short-term spectral feature and a long-term modula-
tion feature. Table 2.6 shows the results for phoneme recognition accuracies across all
individual phoneme classes for the proposed techniques using the TIMIT database.
The FDLP-S features provide results comparable to the PLP features. The modulation features (FDLP-M) result in improved recognition rates for all the broad phonetic classes.
• A combination of the feature streams at the phoneme posterior level - From Table 2.6, the joint spectral envelope and modulation features yield improved broad class recognition rates.
• A Tandem post-processing of the phoneme posteriors - This post-processing allows these features to be used for ASR systems. In all our experi-
ments, Tandem representations of the proposed features improve ASR accuracies over
other features.
In the following chapters we will use this data-driven framework in many other
scenarios. The key scenario is a low-resource setting where the amount of training data
is limited, unlike the ASR settings assumed in this chapter where the amount of training
data is not restricted. We devise techniques to improve the effectiveness of the proposed data-driven features in these low-resource settings.
Chapter 3
Data-driven Features in Low-resource Scenarios
This chapter presents two novel techniques for building data-driven front-ends in
low-resource settings with very limited amounts of transcribed data for acoustic model train-
ing. Both the techniques improve performance in the low-resource settings using data from
multiple languages, circumventing issues with the different phone sets used in each language.
3.1 Overview
The performance of LVCSR systems depends strongly on the amount of available transcribed training data. When LVCSR systems are built for new languages or domains with only a few hours of transcribed data, the performance is considerably lower. To improve
performance, unlabeled data from this new language or domain has been used to increase
the size of the training set [93]. This is done by first recognizing the unlabeled data and
incrementally adding reliable portions to the original training set. For these self-training
techniques to be effective, a low error rate recognizer is required to annotate the unlabeled
data. However in several scenarios like ASR systems for new languages, recognizers built
using limited amounts of training data have very high error rates. Additional improvements are hence difficult to obtain with self-training alone in these settings.
Another potential solution to this problem is to use transcribed data available from
other languages to build acoustic models which can be shared with the low-resource language
[94, 95]. However training such systems requires all the multilingual data to be transcribed
using a common phone set across the different languages. This common phone set can
be derived either in a data driven fashion or using phonetic sets such as the International
Phonetic Alphabet (IPA) [96]. More recently, cross-lingual training with Subspace Gaussian Mixture Models (SGMMs) [97, 98] has also been proposed for this task.
An alternative approach to this problem moves the focus from using the shared
data to build acoustic models, to training data-driven front-ends. The key element in this approach is a neural network front-end trained on large amounts of task independent data. In [99, 100], a task independent approach has been
used to first train MLPs with large amounts of data. Features derived from these nets
are then shown to reduce the requirement of task specific data to train subsequent HMM
stages. In these experiments, although the task specific data comes from the same language
as the task independent data, the data sources are collected in different domains. More
recently this approach has been shown useful also in cross-domain and cross-lingual LVCSR
tasks [75,101]. In [101], Tandem features trained on English CTS data are shown to improve
performance when used in other domains (meeting data) within the language and even
in other languages (Mandarin and Arabic). Even though MLPs are trained on different
phone sets in different languages, Tandem features are able to capture common phonetic attributes across languages.
In this chapter we consider low-resource settings with only limited amounts of transcribed task specific data to train the acoustic models. To improve over the poor performance of acoustic models using conventional features in these settings, we use data-
driven feature front-ends that integrate the following additional sources of information -
(a) Multilingual task independent data - Transcribed data from languages other than the target language is first used to train initial neural network models. These task-
independent models are then adapted using limited amounts of task-specific data.
(b) Multiple feature representations - Significant gains were demonstrated in the previous chapter using different feature representations. We show how these features can be combined in the low-resource setting as well.
One of the key problems in training neural network systems using data from multiple domains is the difference in how the data sources are transcribed. Although there are phoneme sets like the IPA which can be used to uniformly label data across languages, only a few data sources are labeled using such sets. This chapter proposes techniques that can be used in these settings without requiring a common labeling of all the data.
Apart from the acoustic models, other components of an LVCSR system, for example the language model or pronunciation dictionary, are also affected in low-resource settings. We however focus our attention only on the feature extraction module and acoustic models.
In this section we describe a training approach using two data sets - H and L. H
is a task independent data set with significantly more amounts of training data than the
low-resource data set - L. Both H and L are transcribed using different phoneme sets H and L respectively. The training proceeds in the following steps -
(a) Train an initial network using data set H - We start by training a multilayer perceptron (MLP) on the high resource task independent data set. After it has been trained, this network serves as the starting point for the subsequent steps.
(b) Find a mapping between phoneme sets H and L - If the two phoneme sets share the
same phonetic transcription scheme, for example the IPA, it is relatively easy to find a correspondence between them. In the proposed training scheme we investigate the use of a data-driven technique based on phoneme confusion matrices (CM) instead. Confusion matrices have been used in the past to measure the reliability of human speech recognition [102]. More
recently they have also been used to study the performance of ASR systems [103, 104].
We start by forward passing the low-resource task specific data L through the MLP
trained on task independent data in step (a) to obtain phoneme posteriors. To un-
derstand the relationship between phonemes, we treat the phoneme recognition system as a noisy channel, with the true task-specific phoneme labels belonging to L acting as source symbols to the system. Using the recognized phonemes belonging to H at the output of the recognizer as received symbols, confusion matrices that characterize the relationship between the two phoneme sets can be constructed.
Each time a feature vector corresponding to phoneme li is passed through the trained MLP, posterior probabilities for the phonemes of H are obtained at the output of the MLP. We treat each of these posterior probabilities as soft-counts, from which a confusion matrix of soft counts can be derived. Entry (i, j) of the confusion matrix corresponds to the soft count aggregate c(i, j) of the total number of times task-specific phoneme li was recognized as task-independent phoneme hj. Marginal count c(i) of each row is the total number of times phoneme li occurred in the task-specific data. Similarly, count c(j) of each column is the total number of times phoneme hj of the task-independent data set was recognized.
Given such a CM, we would like to find the best map for every phoneme li among the
phones of H based on these counts. A useful information theoretic quantity that can
be used is the empirical point wise mutual information [105]. In [104], the use of this
quantity in conjunction with confusion matrices has been shown. For an input alphabet
A and output alphabet B, using the count based confusion matrix, the empirical point
wise mutual information between two symbols ai from A and bj from B is expressed as
\hat{I}_{AB}(a_i, b_j) = \log \frac{N_{ij} \, N}{N_i \, N_j}    (3.1)
where N_{ij} is the number of times the joint event (A = a_i, B = b_j) occurs, N_i = \sum_j N_{ij} and N_j = \sum_i N_{ij} are the marginal counts, and N is the total count.
Using our soft count based confusion matrix between two phone sets H and L, we
similarly define the empirical point wise mutual information between phoneme pairs
(li, hj) as

\hat{I}(l_i, h_j) = \log \frac{c(i,j) \, c}{c(i) \, c(j)}    (3.2)

using the quantities defined earlier, where c is the total soft count. For a given task specific phoneme li we compare this quantity across all hj; since the logarithm is monotonic and the total count c is the same for all pairs, the simpler quantity

J(l_i, h_j) = \frac{c(i,j)}{c(i) \, c(j)}    (3.3)

is instead used.
Using this measure, for each label li, the more frequently a particular label hj co-occurs with it, the higher the value of J(li, hj). We hence map each phoneme li in the task specific phoneme set to the phoneme hj in the task independent set which has the highest J(li, hj) (a code sketch of this mapping is given after the list of steps below).
If a strict one-to-one mapping between the phoneme sets is desired, assignments of several task-specific phonemes to the same task-independent phoneme can be avoided. This can be done by removing an assigned phoneme from the list of candidates for the remaining assignments.
(c) Re-transcribe L using a new mapped phone set Ĥ - Using the mapping derived from the confusion matrices above, the task specific data L can now be re-transcribed into phonemes of the task independent set.
(d) Adapt the network using data set L - The initial task independent neural network can now be adapted using the task specific data since it has been mapped to the same phone set. The neural network is adapted by retraining it on the new data after initializing it with the weights of the task independent network.
(e) Extract data-driven features - Posterior features are derived for ASR after Tandem post-processing of the adapted network's outputs, as described in the previous chapter.
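The sketch below illustrates the count-based mapping of steps (a)-(b): a soft-count confusion matrix is accumulated from the posteriors of the task independent MLP on the task specific data, and each task specific phoneme is mapped to the task independent phoneme with the highest J value. The one-to-one constraint discussed above is not enforced in this sketch.

```python
import numpy as np

def map_phonemes(posteriors, labels, n_task_phones):
    """Data-driven mapping of task-specific phonemes to task-independent phonemes.

    posteriors    : (n_frames, h) posteriors from the task-independent MLP,
                    obtained by forward passing the task-specific data
    labels        : (n_frames,) task-specific phoneme index l_i of each frame
    n_task_phones : number of task-specific phonemes
    Returns an array mapping each task-specific phoneme to one of the h phones.
    """
    n_frames, h = posteriors.shape
    # Soft-count confusion matrix: c(i, j) accumulates the posterior mass of
    # task-independent phoneme h_j over frames labeled with phoneme l_i
    C = np.zeros((n_task_phones, h))
    for i in range(n_task_phones):
        C[i] = posteriors[labels == i].sum(axis=0)
    c_i = C.sum(axis=1, keepdims=True)   # marginal count of each l_i
    c_j = C.sum(axis=0, keepdims=True)   # marginal count of each h_j
    # J(l_i, h_j) = c(i, j) / (c(i) * c(j)); choose the best h_j for each l_i
    J = C / (c_i * c_j + 1e-12)
    return J.argmax(axis=1)
```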
In this section we propose a second training technique for training neural network
systems across different data sets without having to map all the data using a common
phoneme set. As before we describe the training approach using two data sets - H and
L. H is a task independent data set with significantly more training data than the low-resource data set - L. Both H and L are transcribed using different phoneme sets H and L, with cardinalities h and l respectively. The network is trained using an acoustic feature input of dimensionality d, in the following steps -
(a) Train the MLP on the task independent set H - We start by training a 4 layer MLP of size d×m1×m2×h on the high resource language with randomly initialized weights. While the input and output nodes are linear, the hidden nodes are non-linear. While the first hidden layer m1 is large, the second hidden layer m2 is a narrow bottleneck layer. We are motivated to introduce the bottleneck layer to allow the network to learn a compact representation that can be shared across data sets.
[Figure 3.1 schematic: a d×m1×m2 network whose layers are common across data-sets, with m2 as the bottleneck layer; an intermediate output layer of size h specific to phoneme set H, and a final output layer of size l specific to phoneme set L, whose weights are initialized from a single layer perceptron.]
Figure 3.1: Schematic of the proposed training technique with multiple output layers
(b) Initialize the network to train on task specific set L - To continue training on the low-
resource data set which has a different phoneme set size, we create a new 4 layer MLP
of size d×m1×m2×l. The first 3 layer weights of this new network are initialized using
weights from the MLP trained on the high resource data set. Instead of using random
weights between the last two layers, we initialize these weights from a separately trained
single layer perceptron. To train the single layer perceptron, non-linear representations
of the low-resource training data are derived by forward passing the data through the
first 3 layers of the MLP. The data is then used to train a single layer network of size
m2×l.
(c) Train the MLP on task specific set L - Once the 4 layer MLP of size d×m1×m2×l has been initialized, we re-train the MLP on the task specific data. By sharing weights across data sets the MLP is now able to train better on limited amounts of task specific data (a code sketch of this initialization and re-training follows the list of steps).
(d) Derive data-driven features - The proposed 4 layer MLPs are trained to estimate phoneme posterior probabilities using the standard back propagation algorithm with the cross entropy error criterion. We derive two kinds of features for LVCSR tasks -
A. Tandem features - These features are derived from the posteriors estimated by
the MLP at the fourth layer. When networks are trained on multiple feature repre-
sentations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. Phoneme posteriors are then converted to features by Gaussianizing the posteriors using the log function and decorrelating them using the KLT. A dimensionality reduction is also performed by retaining only the feature components which contribute most to the variance of the data.
B. Bottleneck features - Unlike Tandem features, bottleneck features are derived as linear outputs of the neurons from the bottleneck layer. These outputs are used directly as features for LVCSR systems without applying any transforms. When bottleneck
features are derived from multiple feature representations, these features are appended
together and a dimensionality reduction is performed using KLT to retain only relevant
components.
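A minimal PyTorch sketch of the initialization in steps (a)-(c) is given below, using the layer sizes reported later in this chapter (351×1000×25 with 28 Spanish outputs and 47 English outputs). Details such as the placement of nonlinearities and the training loops are simplifying assumptions of this sketch, not the exact configuration used in the experiments.

```python
import torch.nn as nn

def build_bottleneck_mlp(d, m1, m2, n_out):
    """d x m1 x m2 x n_out network: large hidden layer, bottleneck, output layer."""
    return nn.Sequential(
        nn.Linear(d, m1), nn.Sigmoid(),
        nn.Linear(m1, m2), nn.Sigmoid(),   # m2 is the bottleneck layer
        nn.Linear(m2, n_out),              # phoneme-set specific output layer
    )

# Task-independent network with h outputs (phoneme set H)
net_H = build_bottleneck_mlp(d=351, m1=1000, m2=25, n_out=28)
# ... train net_H on the high-resource data ...

# Task-specific network with l outputs (phoneme set L): copy the shared layers
net_L = build_bottleneck_mlp(d=351, m1=1000, m2=25, n_out=47)
net_L[0].load_state_dict(net_H[0].state_dict())   # d  x m1 weights
net_L[2].load_state_dict(net_H[2].state_dict())   # m1 x m2 weights

# The m2 x l output weights come from a single layer perceptron trained on
# bottleneck representations of the low-resource data (forward passed through
# the first three layers of net_H); `slp` stands in for that trained network.
slp = nn.Linear(25, 47)
# ... train slp on bottleneck activations of the 1 hour of task-specific data ...
net_L[4].load_state_dict(slp.state_dict())
# Finally, re-train net_L on the task-specific data.
```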
We use the English, German and Spanish parts of the Callhome corpora collected
by LDC for our experiments [106–108]. The conversational nature of speech along with high
out-of-vocabulary rates, use of foreign words and telephone channel distortions make the
task of speech recognition on this database challenging. The English training conversations, corresponding to about 15 hours of speech, form the complete training data. We use 1 hour
of randomly chosen speech covering all the speakers from the complete train set for our
experiments as an example of data from a low-resource language. The English MLPs and
subsequent HMM-GMM systems use this one hour of data. Two sets of 20 conversations,
roughly containing 1.8 hours of speech each, form the test and development sets. Similar to
the English database, the German and Spanish databases consist of 100 and 120 spontaneous telephone conversations respectively. 15 hours of German and 16 hours of Spanish are used as examples of task independent high resource languages for training the MLPs. Each of these languages uses a different phoneme set - 47 phonemes for English, 28 phonemes for Spanish and a separate phoneme set for German.
We train a single pass HTK [109] based recognizer with 600 tied states and 4
mixtures per state on the 1 hour of data. We use fewer states and mixtures per state since
the amount of training data is low. The recognizer uses a 62K trigram language model with
an OOV rate of 0.4%, built using the SRILM tools. The language model is interpolated from
individual models created using the English Callhome corpus, the Switchboard corpus [110],
the Gigaword corpus [111] and some web data. The web data is obtained by crawling the
web for sentences containing high frequency bigrams and trigrams occurring in the training
text of the Callhome corpus [97]. The 90K PRONLEX dictionary [112] with 47 phones is
used as the pronunciation dictionary for the system. The test data is decoded using the
HTK decoder - HDecode, and scored with the NIST scoring scripts [91].
We use the steps described in Section 3.2 to build a data-driven front-end for
low-resource settings.
(a) Build a multilingual task independent MLP - We train cross-lingual MLP systems on
data from two other languages - German and Spanish using a phone set that covers
phonemes from both the languages. We derive spectral envelope and modulation fre-
quency features from 15 hours of German and 16 hours of Spanish data. Even though
these languages have different phonemes from English, they share several common pho-
netic attributes of speech. The cross-lingual MLPs capture these attributes from each of the languages.
(b) Construct the data-driven map for English - One hour of English data is forward passed
using the cross lingual MLP to obtain phoneme posteriors in terms of 52 cross-lingual
phones. The true labels for English data contains 47 English phonemes. Using the
mapping technique described earlier we then determine to which phone in the German-
Spanish set each English phoneme can be mapped. This one-to-one mapping is created
by associating each English phoneme to the phone which gives the highest count based similarity measure J described earlier.
(c) Build low-resource MLPs using task specific data - We train a set of low resource MLP
systems for each of the feature streams by adapting the cross-lingual system using 1
[Figure 3.2 schematic: modulation and spectral envelope feature streams are each passed through a cross-lingual MLP (trained on German and Spanish data and then adapted using 1 hour of English data); the two low-resource posterior streams are merged, and posterior probability processing yields Tandem features for ASR.]
Figure 3.2: Deriving cross-lingual and multi-stream posterior features for low resource
LVCSR systems
hour of English data after mapping it to the new phone set. By adapting the nets it is
observed that the systems are able to discriminate better between phonetic classes of
the low resource language. The primary challenge in adapting an MLP system using
additional data from different language is to effectively map the phonetic units of the
new language to the phone set on which the system has already been trained. We
construct the map as described earlier between the existing and new language phone
sets. This adaptation allows the systems to capture information about the phonetic classes of the target language while retaining common phonetic attributes learned from the other languages. We adapt the MLP by retraining it using the new data after initializing it with the weights of the cross-lingual MLP.
(d) Extract data-driven features - We use the two FDLP based acoustic streams proposed in
Table 3.1: Word Recognition Accuracies (%) using different Tandem features derived using
only 1 hour of English data
the earlier chapter for our experiments. We derive short-term features (FDLP-S) from
sub-band temporal envelopes, modeled using FDLP by integrating the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). These short term sub-band
energies are converted into 13 cepstral features along with their first and second deriva-
tives. Each frame of these spectral envelope features is used with a context of 9 frames to train an MLP. For the long-term modulation features (FDLP-M), we first compress the sub-band temporal envelopes statically using the logarithmic func-
tion and dynamically with an adaptation circuit consisting of five consecutive nonlinear
adaptation loops. The compressed temporal envelopes are then transformed using the
DCT in long term windows (200 ms long, with a shift of 10 ms). We use 14 modulation
frequency components from each cosine transform, yielding modulation spectrum in the
0-35 Hz range. The static and dynamic modulation frequency features of each sub-band
are stacked together and used to train an MLP network. For telephone channel speech, the envelopes are extracted from 17 bark spaced sub-bands covering the available bandwidth.
Table 3.2: Word Recognition Accuracies (%) using Tandem features enhanced using cross-lingual posterior features
Posterior features from the two acoustic streams (FDLP-M and FDLP-S) are combined
at the posterior level. This allows us to obtain more accurate and robust estimates of the phoneme posteriors. Tandem features derived from the combined posteriors are used to train the subsequent HMM-GMM system. Figure 3.2 shows the proposed data-driven front-end.
Table 3.3: Word Recognition Accuracies (%) using multi-stream cross-lingual posterior features
Table 3.1 summarizes the baseline results for our experiments using different fea-
tures with only 1 hour of English data. In our second set of experiments we derive Tandem
features for the 1 hour of English data from the cross-lingual systems. It is clear that systems
built using low amounts of training data perform very poorly. Our subsequent experiments
aim to improve these performances using multi-stream and cross-lingual data. Table 3.2
shows the experiments using Tandem features derived from the spectral envelope and mod-
ulation features using the cross-lingual systems. These experiments show the improvements
as more cross-lingual data is used. Adapting the systems with the limited amount of data from the task specific language improves the performance of each system further. As described earlier, posterior streams derived from two different feature representations are now combined to obtain further improvements.
Table 3.3 shows the results of combining the posterior streams from the final cross-lingual systems (System 4 of Table 3.2) of both streams using the Dempster Shafer
(DS) theory of evidence [74]. The results show significant improvements after combining
posterior streams over the results from individual streams compared to the baseline PLP
system.
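The following sketch illustrates a simplified version of this posterior combination: when each stream's posterior vector is treated as a basic probability assignment with all mass on singleton phoneme classes, Dempster's rule of combination reduces to an element-wise product followed by renormalization. The combination actually used in [74] may distribute belief over compound hypotheses, so this is only an approximation of that scheme.

```python
import numpy as np

def combine_streams(post_a, post_b, eps=1e-12):
    """Combine two posterior streams frame by frame.

    With all belief mass placed on singleton phoneme classes, Dempster's rule
    of combination becomes an element-wise product followed by renormalization
    (the normalizer removes the conflicting mass).
    post_a, post_b : (n_frames, n_phonemes) posterior probabilities
    """
    product = post_a * post_b
    norm = product.sum(axis=1, keepdims=True) + eps
    return product / norm
```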
We use the same experimental setup described in the previous section to demonstrate the usefulness of the second technique. The primary advantage of this new technique is that it does not require the multilingual data to be mapped to a common phone set beforehand.
In our first set of experiments we train a 4 layer MLP system on two languages -
Spanish and English as outlined in Sec. 3.3. We start by training two separate networks
on the task independent language using 16 hours of Spanish. Both these systems have a
first hidden layer of 1000 nodes, a bottleneck layer of 25 nodes and a final output layer
of 28 nodes corresponding to the size of the Spanish phoneme set. 39 dimensional PLP
features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to train
the first network with architecture - 351×1000×25×28. A second system is trained on 476
dimensional modulation features derived using FDLP. These features correspond to 28 static
and dynamic modulation frequency components extracted from 17 bark spaced bands. This
system has an architecture of 476×1000×25×28. Both the systems are trained using the standard back propagation algorithm with the cross entropy error criterion. The learning rate and stopping criterion are controlled by the frame-based phoneme classification error on the cross-validation data.
After the task independent networks have been trained, the task specific networks
Table 3.4: Word Recognition Accuracies (%) using two languages - Spanish and English
to be trained on 1 hour of English are initialized in two stages as discussed in Sec. 3.3. In
the first stage, all weights except the weights between the bottleneck layer and the output
layer are initialized directly from the Spanish network. The second set of weights are
initialized from a single layer network trained on non-linear representations of the 1 hour
of English data derived by forward passing the English data through the Spanish network
till the bottleneck layer. This network has an architecture of 25×47 corresponding to the
dimensionality of the non-linear representations from the bottleneck layer of the Spanish
network and the size of the English phoneme set. Such single layer networks are trained for both the PLP and FDLPM streams.
Once the networks have been initialized, PLP and FDLPM features derived from 1 hour of English are used to train the new task specific low-resource networks. The networks trained on PLP and FDLPM features now have architectures of 351×1000×25×47 and 476×1000×25×47 respectively. Phoneme posteriors from the two networks are combined using the Dempster Shafer (DS) theory of evidence before deriving the 25
dimensional Tandem set. The 2 sets of 25 dimensional bottleneck features from each of the
networks are appended together before applying a dimensionality reduction to form a final
25 dimensional bottleneck feature vector. Both the Tandem and bottleneck features are
used to train the subsequent low-resource HMM-GMM system on 1 hour of training data.
Figure 3.3: Tandem and bottleneck features for low-resource LVCSR systems.
Table 3.4 shows the results of using the proposed MLP based features. We next train networks using all three languages - Spanish, German and English. The training procedure starts as outlined earlier with 15 hours of Spanish. The networks are then initialized to train with the German data in two stages - with weights from the Spanish system up to the bottleneck layer and with weights from a single layer network trained on the German data. After the net has been
trained on the German data, we do a re-training using the 1 hour of English data. Figure
3.3 is a schematic of the training and feature extraction procedure. Table 3.5 shows the
Table 3.5: Word Recognition Accuracies (%) using three languages - Spanish, German and
English
The above results show the advantage of the proposed approach to training MLPs on multilingual data. Unlike earlier approaches, we are able to train on multiple languages without first mapping all the data to a common phone set.
3.5 Conclusions
In this chapter we have demonstrated the advantages of the proposed data-driven front-ends over conventional features in low-resource settings. In these settings, data-driven fea-
tures are built using task independent data. However in most cases, this data is transcribed
using different phoneme sets. We have addressed this issue using two methods. Features
extracted using these techniques are used to train LVCSR systems in the low-resource lan-
guage. In our experiments, the proposed features provide a relative improvement of about 30% in a low-resource LVCSR setting with only one hour of training data. In the next chapter we explore how more complex neural network architectures can be used in these settings.
Chapter 4
Wide and Deep MLP Architectures in Low-resource Settings
In state-of-the-art systems, additional processing layers have been added to neural network front-ends. To train these additional parameters, large amounts of training data are also required. This chapter explores how these additional layers can be incorporated in low-resource settings with only a few hours of transcribed training data.
4.1 Overview
In the previous chapter, data-driven front-ends were built using multiple feature representations of the acoustic signal. To allow these parallel streams of information to be trained, task independent data from different languages were used in
[Figure 4.1 schematic: (a) a wide topology in which parallel MLPs interact via intermediate outputs before feature post-processing produces data-driven features; (b) a deep topology in which several MLP layers are stacked one after the other before feature post-processing.]
Figure 4.1: (a) Wide and (b) Deep neural network topologies for data-driven features
conjunction with simple neural network topologies. In this chapter, in addition to these
parallel feature streams, we explore if more complex neural network architectures which are
currently being used in state-of-the-art ASR systems can also be trained in low-resource
settings.
In [113], these complex neural network architectures have been broadly classified
into two categories - wide networks and deep networks. In wide networks, several parallel
neural network modules that interact with each other are used. On the other hand, in deep network topologies, several interacting neural network layers are stacked one after the other.
Several wide network topologies have been used in processing long-term modula-
tion features for example the architectures used in the TRAPS [66] or HATS framework [73].
In a more recent approach [114], modulation features are first divided into two separate
streams as shown in Figure 4.1. The phoneme posterior outputs of a neural network trained
on high modulations (> 10Hz) are then combined with low modulation features to train
a second network. Tandem processed features from the second network are then used for
ASR.
Hierarchical networks where the outputs of one neural network processing stage are
further processed by a second neural network have been used in [100, 115]. More recently,
Deep Belief Networks with several layers (5-6 hidden layers) have been used in acoustic modeling. In this approach, individual layers of the deep network are usually pre-trained before the entire network is trained discriminatively.
In this chapter we discuss techniques to train both these classes of complex net-
works in low-resource settings. Faced with limited amounts of task specific data in these
scenarios we demonstrate the use of task independent data to build these networks.
We use two kinds of task independent data sources in building the proposed front-ends -
(a) Up to 20 hours of data from the same language collected for a different task. Although
this data has a different genre, it has similar acoustic channel conditions as the low-resource data.
[Figure 4.2 schematic: speech from the low-resource setting is converted to PLP acoustic features and passed through an MLP trained on N hours ({1/2/5/10/15/20} hours) of same-language, different-genre data, producing posterior features for LVCSR / spoken term detection.]
Figure 4.2: Data driven front-end built using data from the same language but from a
different genre.
(b) 200 hours of data from a different language but with similar acoustic channel conditions.
We build two kinds of front-ends on varying amounts of these task independent training
data.
1. A monolingual front-end trained on varying amounts of data from the same language as
the low-resource task. As shown in Figure 4.2, we train different configurations of this front-end using 1, 2, 5, 10, 15 and 20 hours of task independent data. The intuition behind this front-end is that even though the genre is different, the MLP learns useful information
that characterizes the acoustics of the language. This improves as the amount of training
data increases. For our current experiments we also choose task independent data from
similar acoustic conditions as the low resource setting. Features generated using this
front-end are hence enhanced with knowledge about the language and have unwanted
variabilities from the channel and speaker removed. We use conventional short-term PLP acoustic features as inputs to this front-end.
2. A cross-lingual front-end that uses large amounts of data from a different language.
In most low-resource settings, it is less likely to have sufficient transcribed data in the target language itself, while large amounts of transcribed data from other languages might be available.
[Figure 4.3 schematic: PLP and FDLPM acoustic features are passed through MLPs trained on M hours of data from a different language (200 hours available); their posteriors are combined and Tandem-processed, and the resulting multilingual posterior features are appended to the PLP features of the low-resource speech. The enhanced features then train an MLP on N hours ({1/2/5/10/15/20} hours) of same-language, different-genre data to produce posterior features for LVCSR.]
Figure 4.3: A cross-lingual front-end built with data from the same language and with large
amounts of additional data from a different language but with same acoustic conditions.
Figure 4.3 outlines the components of the cross-lingual
front-end that we train to include additional data from a different language. This front-
end has two parts. The first part is similar to the monolingual front-end described above
and consists of an MLP trained on various amounts of data from same language but
different genre (N hours). The second part includes a set of MLPs trained on large
amounts of data from a different language (M hours). Outputs from these MLPs are
used to enhance the input acoustic features for the former part.
Although languages share common attributes, data from these languages is transcribed using different phone sets and needs to be combined before it can be used.
In the previous chapter, we used two different approaches to deal with this - a count based
data driven approach to find a common phone set and an MLP training scheme with
intermediate language specific layers. Both these approaches finally involve adaptation of the networks using task specific data. In this chapter we do not adapt any MLPs; instead we keep the front-end fixed by using the multilingual MLP to derive posterior features.
When MLPs trained on a particular language are used to derive phoneme posteriors from
a different language, the language mismatch results in less sharp posteriors than from an
MLP trained on the same language. However an association can still be seen between
similar speech sounds from the different languages. We use this information to enhance
acoustic features of the task specific language. Phoneme posteriors from two complementary acoustic streams are combined to improve the quality of the posteriors before
they are converted to features using the Tandem technique. The multilingual posterior
features are finally appended to short-term acoustic features to train a second level of
MLPs on varying amounts of data from the same language as the low-resource task. This
procedure is hence similar to the approaches described earlier with modulation features
and the TRAPS/HATS configurations used to build wide neural network topologies (see
Figure 4.1).
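The assembly of these enhanced features can be sketched as below, assuming 39 dimensional PLP frames spliced with a 9-frame context and 20 dimensional multilingual Tandem posterior features; whether the posterior features are appended before or after the context splicing is an assumption of this sketch.

```python
import numpy as np

def splice(frames, context=4):
    """Stack each frame with its +/- context neighbours (a 9-frame window)."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

def enhanced_features(plp, multilingual_post):
    """Append cross-lingual Tandem posterior features to spliced PLP frames.

    plp               : (n_frames, 39) short-term acoustic features
    multilingual_post : (n_frames, 20) Tandem-processed posterior features
                        from the MLPs trained on the other language
    """
    return np.hstack([splice(plp), multilingual_post])
```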
We train two data-driven front-ends for the low-resource LVCSR task as described
in Sec. 4.2.1. We train the monolingual front-end on a separate task independent training set
of 20 hours from the Switchboard corpus. Although this training set has similar telephone
channel conditions as the low-resource task used for our experiments, it has a different genre. The phone labels for this set are obtained by force aligning word transcripts to the acoustic data. 39 dimensional PLP features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames. We
train separate MLPs on subsets of 1, 2, 5, 10, 15 and 20 hours to understand how the amount of task independent data affects performance.
In addition to the Switchboard corpus, we train Spanish MLPs on 200 hours of tele-
phone speech from the LDC Spanish Switchboard and Callhome corpora for the cross-lingual
front-end. Phone labels for this database are obtained by force aligning word transcripts
using BBN’s Byblos recognition system using 27 phones. We use two acoustic features
- short-term 39 dimensional PLP features with 9 frames of context and 476 dimensional
long-term modulation features (FDLPM). When networks are trained on multiple feature
representations, better posterior estimates can be derived by combining the outputs from
different systems using posterior probability combination rules. We use the Dempster-Shafer
rule of combination for our experiments. Posteriors from multiple streams are combined to
reduce the effects of language mismatch and improve posteriors. Phoneme posteriors are
then converted to features by Gaussianizing the posteriors using the log function and decorrelating them using the KLT. A dimensionality reduction is also performed by retaining only the top 20 feature components which contribute most to the variance of the data.
The English MLPs in the cross-lingual setting are trained on enhanced acoustic
features. These features are created by appending posterior features derived from the
Table 4.1: Word Recognition Accuracies (%) using different amounts of Callhome data to
train the LVCSR system with conventional acoustic features
Spanish MLPs to the PLP features used in monolingual training. We similarly also train these second-level English MLPs on each of the different Switchboard subsets using the enhanced features.
In our first experiment we use 39 dimensional PLP features directly for the 1
hour Callhome LVCSR task. The acoustic models have a low word accuracy of 28.8%.
These features are then replaced by 25 dimensional posterior features using the monolingual
and cross-lingual front-ends, each trained on varying amounts of task independent data
from the Switchboard corpus. Figure 4.4 shows how the performance changes for both the
monolingual and cross-lingual systems. Using the data-driven front-ends, the word accuracy
improves from 28.8% to 30.1% and 37.1% with just 1 hour of task independent training
data using the monolingual and cross-lingual front-ends respectively. These improvements
continue to 37.2% and 41.5% with the same 1 hour of Callhome LVCSR training data as
the amount of task-independent data is increased for both the front-ends. We draw the following conclusions from these experiments -
1. With very few hours of task specific training data, posterior features can provide significant gains over conventional acoustic features. Table 4.1 shows the word accu-
racies when different amounts of Callhome data are used to train the LVCSR system.
By using the cross-lingual front-end, features from only 1 hour of data perform close
to 5-10 hours of the Callhome data with conventional features. This demonstrates
[Figure 4.4 plot: word recognition accuracy (%) versus the amount of task independent training data, comparing 1 hr of acoustic features only, 1 hr of posterior features using the monolingual front-end, and 1 hr of posterior features using the cross-lingual front-end.]
Figure 4.4: LVCSR word recognition accuracies (%) with 1 hour of task specific training
data using the proposed front-ends
the usefulness of our approach of using task independent data in low-resource settings.
2. When data from a different language is used, additional gains of 4-7% absolute are achieved over just using task independent data from the same language. It is interesting to observe that the performance with the cross-lingual front-end starts improving already with just 1 hour of task independent data.
A deep neural network (DNN) is a multilayer perceptron with several more layers than
traditionally used networks. The layers of a DNN are often initialized using a pretraining
algorithm before the network is trained to completion using the error back-propagation
algorithm [119]. In this section we discuss the development of a DNN for low-resource
scenarios.
The purpose of the pretraining step is to initialize a DNN network with a better set
of weights than a randomly selected set. Networks trained from these kinds of initial weights
are observed to be well regularized and converge to a better local optimum than randomly initialized networks [120, 121]. As with traditional ANNs, deep neural networks have been
used both as acoustic models that directly model context-dependent states of HMMs [117]
and also to derive data-driven features [122, 123]. In both cases, the performances of these systems improve with pretraining.
In the deep belief network (DBN) pretraining procedure [124], by treating layers of the MLP as restricted Boltzmann machines (RBM), the parameters of the network are initialized in an unsupervised, layer-wise generative fashion without using the class labels, which in turn decreases the effectiveness of this approach when the number of layers is increased [119].
A different algorithm that has been shown to be equally effective for pretraining
DNNs is called discriminative pretraining [119, 125]. This pretraining procedure starts by
training an MLP with 1 hidden layer. After this MLP has been trained discriminatively
with the error back-propagation algorithm, a new randomly initialized hidden layer and
softmax layer are introduced to replace the initial soft-max layer of the first network. The
deeper network is then trained again discriminatively. This procedure is repeated until the desired number of hidden layers has been added.
Although pretraining algorithms are effective in initializing DNNs, the key con-
straint in low resource settings is often the insufficient amount of data to train these net-
works. We show that in these scenarios, task independent data can instead be used to
pretrain and initialize a DNN before it is finally adapted and used with limited amounts of task specific data. We describe the procedure below for a network with three hidden layers; the algorithm is however general and can be extended to more hidden layers. The MLP has a
linear input layer with a size d corresponding to the dimension of the input feature vector,
followed by three non-linear layers m1, m2, m3 and a final linear layer with a size h corre-
sponding to the phone set of the task independent data on which the DNN is being trained. While the dimensions of m1 and m2 are quite high, m3 is a low dimensional bottleneck layer. Similar
to data driven networks described in the previous chapter, both posterior and bottleneck
features can be derived from the DNN. We use the following steps to pretrain a DNN -
1. Initializing the network - We first train a network with 1 hidden layer - d×m1×h. Starting with randomly initialized weights connecting all the layers of the network, we train this network with one pass of the entire data.
2. Growing the network - The d×m1×h network is now grown by inserting a new hidden layer m2. The resulting d×m1×m2×h network is again trained with one pass of the entire data using the standard back-propagation algorithm. The weights d−m1 are copied from the initialization step and are kept fixed.
The network is grown once more by inserting the bottleneck layer m3. While the weights d−m1 and m1−m2 are copied from the previous step, new weights for the remaining connections are randomly initialized.
3. Final training - With all the layers of the network in place, the complete network is trained with the back-propagation algorithm.
We use task independent data in all these steps, as sketched below. The DNN is next adapted to the low-resource task.
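The layer-wise growing procedure can be sketched as follows in PyTorch, using the 351×1000×1000×25×52 configuration reported later in this chapter. Freezing all previously copied weights during each growing pass and the choice of sigmoid nonlinearities are assumptions of this sketch, not a definitive implementation.

```python
import torch.nn as nn

def grow_network(prev_layers, new_hidden, n_out):
    """Grow a pretrained stack by one hidden layer plus a fresh output layer.

    prev_layers : list of already trained (and now frozen) layers, excluding
                  the previous output layer
    new_hidden  : size of the newly inserted hidden layer
    n_out       : size of the phoneme output layer
    """
    for layer in prev_layers:                 # copied weights are kept fixed
        for p in layer.parameters():
            p.requires_grad = False
    in_dim = prev_layers[-2].out_features     # last Linear before its nonlinearity
    return nn.Sequential(
        *prev_layers,
        nn.Linear(in_dim, new_hidden), nn.Sigmoid(),
        nn.Linear(new_hidden, n_out),         # randomly initialized output layer
    )

# Stage 1: d x m1 x h network, trained with one pass over the data
d, m1, m2, m3, h = 351, 1000, 1000, 25, 52
stage1 = nn.Sequential(nn.Linear(d, m1), nn.Sigmoid(), nn.Linear(m1, h))
# ... one pass of back-propagation on the task independent data ...

# Stage 2: insert m2, reusing (and freezing) the d-m1 weights
stage2 = grow_network([stage1[0], stage1[1]], new_hidden=m2, n_out=h)
# Stage 3: insert the bottleneck m3, reusing the d-m1 and m1-m2 weights
stage3 = grow_network(list(stage2)[:-1], new_hidden=m3, n_out=h)
# Final training: unfreeze all parameters and train the complete network.
```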
As described in the previous chapter, one limitation while adapting between domains is the difference in the phoneme sets. We have proposed a neural network based tech-
nique for this in the previous chapter that replaces the last language specific layer. We use
this technique in the following steps for adapting the DNN -
1. Initialize the network to train on task specific set - To continue training on the task
specific set which has a different phoneme set size l, we create a new 5 layer DNN of
size d×m1×m2×m3×l. The first 4 layer weights of this new network are initialized
using weights from the DNN trained on the task independent data set. Instead of
using random weights between the last two layers, we initialize these weights from a
separately trained single layer perceptron. To train the single layer perceptron, non-
linear representations of the low-resource training data are derived by forward passing
the data through the first 4 layers of the DNN. The data is then used to train a single layer network of size m3×l.
2. Train the DNN on the task specific set - Once the 5 layer DNN of size d×m1×m2×m3×l
has been initialized, we re-train the network on the low-resource language. By sharing weights across languages the network is now able to train better on limited amounts of task specific data.
We use the outputs of the bottleneck hidden layer of the final DNN as features for ASR.
We build a cross-lingual DNN front-end using data from 3 different languages - Spanish, German and English.
Separate DNNs are trained on two different feature representations - PLP and FDLPM.
Bottleneck features from these front-ends are then combined and used for ASR experiments.
32 hours of cross-lingual data from Spanish (16 hours), German (15 hours) and
English (1 hour) are used to train a 6 layer DNN network with 3 hidden layers. The cross-
lingual data uses a combined phoneme set of size 52 derived from the count-based mapping described in the previous chapter. 39 dimensional PLP features (13 cepstral + Δ + ΔΔ features) are used along with a context of 9 frames to train the first DNN. A second DNN is trained on modulation features derived using FDLP. These features (FDLPM) correspond
to 28 static and dynamic modulation frequency components extracted from 17 bark spaced
bands. A reduced feature set from only 9 alternate odd bands is used to train a system
with an architecture of 252×1000×1000×25×52. Both the systems are trained with the standard back propagation algorithm and the cross entropy error criterion. The learning rate and stopping criterion are controlled by the frame-based phoneme classification error on the cross-validation data.
The DNN networks are built in stages as described in the previous section. For the
DNN trained using PLP features, a three layer MLP (351×1000×52) initialized with random
weights, is first trained using one pass of the cross-lingual data. In the next step, a four
layer MLP (351×1000×1000×52) is trained starting with copied weights from the 351×1000
section of the earlier network and random weights for the 1000×1000×52 section. A single
pass of the cross-lingual data is used to train this network keeping the copied weights fixed.
The final 6 layer network (351×1000×1000×25×52) is constructed with copied weights for
the 351×1000×1000 section and random weights for the 1000×25×52 part. The network is then trained on the complete data. A similar procedure is followed for the DNN trained on FDLPM features.
Each of the DNN networks trained on task independent data are then adapted
to the low-resource setting with task-specific 1 hour of English data. The networks are
adapted after the task dependent output layer of the cross-lingual DNN has been replaced.
In the first step, all weights except the weights between the bottleneck layer and
the output layer are initialized directly from the cross-lingual network. The second set
of weights are initialized from a single layer network trained on non-linear representations
of the 1 hour of English data derived by forward passing the English data through the
cross-lingual network till the bottleneck layer. This network has an architecture of 25×47, corresponding to the dimensionality of the bottleneck layer of the cross-lingual network and the size of the English phoneme set.
Once the networks have been initialized, PLP and FDLPM features derived from 1
hour of English are used to train the new low-resource networks. The networks trained on
PLP and FDLPM features now have an architecture of 351×1000×25×47 and 252×1000×25×47
respectively. These networks are then used to derive bottleneck features. The 2 sets of 25
dimensional bottleneck features from each of the networks are appended together before applying a KLT based dimensionality reduction to form the final bottleneck features
for ASR.
We use the same ASR setup on Callhome English described earlier. The baseline HMM-GMM system is trained on 1 hour of data using 39 dimensional PLP features. Table 4.2 shows the recognition accuracies on this task using different approaches. The DNN features significantly improve ASR accuracies when compared with equivalent systems built using the shallower networks of the previous chapter.
4.4.1 Overview
Semi-supervised training has been successfully applied to ASR systems in several languages and conditions [93, 126–128]. In this section we describe the development of semi-supervised training for the proposed DNN front-ends in low-resource settings.
We start by using the best acoustic models trained in the low-resource setting to decode the available untranscribed data. The decoded data is then used along with the original transcribed data to re-train the system in a semi-supervised fashion.
Since the error rates of recognizers trained on limited data are high, the quality of the decoded untranscribed data is also poor. It is hence useful to select reliable portions of the untranscribed data for semi-supervised training. This selection is done using confidence scores computed for each decoded utterance. Confidence scores are derived in two ways -
1. LVCSR based word confidences - LVCSR lattice outputs can be treated as directed
graphs with arcs representing hypothesized words. Each arc spans a duration of
time (ts, tf) during which the word is hypothesized to be present in the speech signal and is also associated with acoustic and language model scores. Using these scores, word posterior probabilities can be computed for each arc.
For any given hypothesized word wi, at a given time frame t, several instances of the word can be present on different lattice arcs simultaneously. A frame-based word posterior can then be computed as

p(w_i \mid t) = \sum_j p(w_i^j \mid t)    (4.1)
where j corresponds to all the different instances of wi that are present at time frame
t [130]. In our proposed selection technique we use a word confidence measure Cmax based on these frame level word posteriors [130], given as the maximum word confidence over the hypothesized time span -

C_{max}(w_i, t_s, t_f) = \max_{t_s \le t \le t_f} p(w_i \mid t)    (4.2)

(a code sketch of this measure follows the list below).
[Figure: posteriorgram of the constituent phonemes p1-p4 of a hypothesized word W over the time frames ts to tf, showing the presence of each phoneme and the path along which phoneme occurrences are counted.]
2. MLP posteriogram based phoneme occurrence confidence - Similar to the above mentioned confidence from the LVCSR classifier, we also derive confidence scores from the MLP front-end. A phoneme posteriogram is obtained by forward passing the acoustic features corresponding to the utterance through the trained MLP classifier. For each hypothesized word wi in the LVCSR transcripts, we first look up its set of constituent phonemes from the pronunciation dictionary. Posteriors corresponding to each phoneme are then selected from the utterance's posteriogram and counted as occurrences whenever they exceed a pre-set threshold. The average number of times the constituent phonemes appear in the hypothesized time span (ts, tf) along a Viterbi search path is then used as the confidence measure. The selected path is designed to produce the occurrence count while visiting all constituent phonemes in sequence. The rationale behind this measure is that if a word is hypothesized correctly, it is likely that all its constituent phonemes will occur with high posterior probability within the hypothesized span. The confidence measure is given as

C_{occ}(w_i, t_s, t_f) = \frac{c}{N}    (4.3)
where c is the total number of phoneme occurrences counted along the path and N is the number of constituent phonemes of the word.
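A sketch of the Cmax computation from lattice arcs is given below; arc word posteriors and frame indices are assumed to be available from the decoder lattice, and posteriors of overlapping arcs are summed per frame as in Equation (4.1).

```python
import numpy as np

def cmax_confidence(arcs, t_start, t_end):
    """Maximum frame-level word posterior within the hypothesized time span.

    arcs    : list of (posterior, ts, tf) tuples, one for every lattice arc
              carrying the hypothesized word (posterior is the arc posterior)
    t_start : first frame of the hypothesized word
    t_end   : last frame of the hypothesized word
    """
    frame_post = np.zeros(t_end - t_start + 1)
    for posterior, ts, tf in arcs:
        lo = max(ts, t_start)
        hi = min(tf, t_end)
        if lo <= hi:
            # accumulate the posterior over the frames this arc is active in
            frame_post[lo - t_start:hi - t_start + 1] += posterior
    return frame_post.max()
```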
The two confidence measures are finally combined using logistic regression. The
regressor is trained to predict a combined confidence using the word confidence and phoneme occurrence confidence as inputs, using a held out portion of transcribed data from the complete 15 hour Callhome English data set. In our semi-supervised experiments, the remaining data of this training set is treated as untranscribed.
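The confidence combination can be sketched with a standard logistic regression, as below; the held-out supervision (whether each hypothesized word is correct, obtained by aligning decodes with reference transcripts) is represented here by synthetic placeholder data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder held-out supervision: LVCSR word confidence, phoneme occurrence
# confidence, and a correct / incorrect label for each hypothesized word.
rng = np.random.default_rng(0)
word_conf = rng.uniform(size=500)
occ_conf = rng.uniform(size=500)
correct = (0.6 * word_conf + 0.4 * occ_conf + 0.1 * rng.normal(size=500)) > 0.5

combiner = LogisticRegression()
combiner.fit(np.column_stack([word_conf, occ_conf]), correct)

# Combined confidence for a new hypothesized word with the two raw scores;
# utterance-level scores are averages of these word-level confidences.
combined = combiner.predict_proba([[0.8, 0.4]])[:, 1]
```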
Data selection
Using the ASR system trained with features from the cross-lingual DNN front-
end, the 14 hour data set is first decoded. Word lattices also produced during the decoding
process are used to generate word-confidences for each hypothesized word as described
above. The cross-lingual DNN front-end is also used to produce phoneme posterior outputs
from which phoneme occurrence based confidences are derived. Combination weights for
these confidence scores are then estimated by training a logistic regressor on a 45 minute held out set of transcribed data.
After every hypothesized word in the decoded output has been given a score using
the trained logistic regression module, each utterance is assigned an utterance-level score.
This utterance level score is the average of all word-level scores in the utterance.
Table 4.3: Word Recognition Accuracies (%) at different word confidence thresholds
Different thresholds are applied to the utterance level scores for the held out data. The word recognition accuracy (%) is then evaluated on the selected sentences at different threshold levels. Table 4.3 shows the word recognition accuracies at different thresholds. As the threshold increases, fewer but more reliable utterances are selected.
The initial cross-lingual DNN training experiments described earlier were based on
only 1 hour of transcribed data. For semi-supervised training of DNNs we include additional data with noisy transcripts. These utterances are selected from the untranscribed data based on the combined confidence scores described above.
To avoid detrimental effects from noisy semi-supervised data during discriminative training, we take the following precautions -
(a) The semi-supervised data is de-weighted relative to the fully transcribed data. This is done by multiplying the cross-entropy error with a small multiplicative factor during training.
(b) The semi-supervised data is used only in the final pre-training stage, after all the layers of the network are in place.
For our experiments we select about 4.5 hours of data using utterances with a
score of 0.3 and greater. This data is then combined with the cross-lingual pre-training
data set of 15 hours of German, 16 hours of Spanish and 1 hour of English. During the
DNN training, we use a multiplicative factor of 0.3 to de-weight the cross-entropy error for the semi-supervised utterances.
The semi-supervised data is used in the final pre-training stage (Section 4.3.1,
step 3) to train both the DNN networks - the network using PLP features (351x1000x1000x25x52) and the network using the second acoustic feature stream. After this pre-training,
both the networks are adapted with 1 hour of English as before. Bottleneck features from
both the networks are combined and used to train the low-resource ASR system with 1 hour
of data as before. Table 4.4 shows the performance of the system after using semi-supervised data in DNN pre-training.
Table 4.5: Word Recognition Accuracies (%) with semi-supervised acoustic model training

Hours of semi-supervised data    Word Accuracy (%)
 0                               42.7
 2                               43.3
 4                               44.0
 8                               44.3
14                               44.8
The DNN front-end trained with semi-supervised data is then used to extract data-driven features for semi-supervised training of the ASR system. Similar to the weighting of semi-supervised data during the DNN training, we also use a simple corpus weighting while training the ASR systems. This is done by adding the 1 hour of fully supervised data with a higher weight to the semi-supervised training set. Table 4.5 shows the word recognition performance using different amounts of semi-supervised data. From Table 4.5 we observe that as we double the amount of semi-supervised data, there is roughly a 0.5% absolute increase in performance.
4.5 Conclusions
In this chapter we have shown how complex neural network architectures can be built in low resource settings. Using large amounts of multilingual, task independent data, we have shown that such data can significantly improve performance when transcribed task specific data is limited, compensating for the lack of large amounts of in-domain training material. Both the deep and wide networks trained with this task independent data provide significant gains over systems trained only on the 1 hour of task specific data.
Chapter 5
Applications of Data-driven
Front-end Outputs
In the previous chapters, the outputs of data-driven front-ends were used as features for
automatic speech recognition. In this chapter, we describe how these front-ends can be used
in other applications - to derive features for speech activity detection, combination weights in neural network based speaker recognition models, feature representations for zero resource tasks like spoken term discovery, and phonetic event detectors for speech recognition.

5.1 Speech Activity Detection

5.1.1 Overview
Speech activity detection (SAD) is the first step in most speech processing ap-
plications like speech recognition, speech coding and speaker verification. This module is
an important component that helps subsequent processing blocks focus resources on the
speech parts of the signal. In each of these applications, several approaches have been
used to build reliable SAD modules. These techniques are usually variants of decision rules
based on features from the audio signal like signal energy [131], pitch [132], zero crossing
rate [133] or higher order statistics in the LPC residual domain [134]. Acoustic features
have also been used to train multi-layer perceptrons (MLPs) [135] and hidden Markov
models (HMMs) [136] to differentiate between speech and non-speech classes. All these approaches use the trained models to directly produce speech/non-speech (S/NS) decisions.

Traditionally, acoustic features derived from the spectrum of speech have been used to differentiate between speech and other acoustic events. In a different approach, we train MLPs on large amounts of data to differentiate between two classes - speech versus non-speech. Instead of using these models to directly produce S/NS decisions, the models are used to derive posterior features which are then modeled for speech activity detection.
The proposed front-end has a multi-stream architecture with several levels of MLPs
[137]. The motivation behind this multi-stream front-end is to use parallel streams of data
that carry complementary or redundant information while at the same time degrading
differently in noisy environments [138]. We form 3 feature streams by dividing the sub-
band trajectories derived using FDLP on a mel-scale with 45 filters equally into 3 groups.
Similar to deriving short-term spectral features, we then integrate the envelopes in short
term frames (of the order of 25 ms with a shift of 10 ms). We also use a context of
about 1 second by appending 50 frames from the right and left with each sub-band feature
vector to form TRAP like features [65]. The two other streams are formed by dividing the
14 modulation features into 2 groups - the first 5 DCT coefficients corresponding to slow modulations and the remaining 9 coefficients corresponding to faster modulations.
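A rough sketch of how the three sub-band streams with about one second of temporal context could be assembled; the array shapes, the edge padding at utterance boundaries and the function name are assumptions made for illustration:

    import numpy as np

    def make_subband_streams(envelopes, num_streams=3, context=50):
        # envelopes : (num_frames, 45) short-term integrated FDLP sub-band energies
        num_frames, num_bands = envelopes.shape
        bands_per_stream = num_bands // num_streams              # 45 / 3 = 15 bands
        padded = np.pad(envelopes, ((context, context), (0, 0)), mode='edge')
        streams = []
        for s in range(num_streams):
            bands = padded[:, s * bands_per_stream:(s + 1) * bands_per_stream]
            # stack 50 frames of left and right context around every frame
            frames = [bands[t:t + 2 * context + 1].ravel() for t in range(num_frames)]
            streams.append(np.asarray(frames))
        return streams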
Speech activity detection is carried out on the proposed features in three main
steps. In the first step, the input frame-level features are projected to a lower-dimensional
space. The reduced features are then used to compute per-frame log likelihood scores with
respect to speech and non-speech classes, each class being represented separately by a GMM.
The frame level log likelihood scores are mapped to S/NS classification decisions to produce
final segmentation outputs in the last step. Figure 5.1 is a brief schematic of the proposed
approach and the processing pipeline for SAD. Each of these steps is described in detail
in [139].
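The second and third steps can be sketched with two GMMs and a per-frame log likelihood ratio; the number of mixture components, the random placeholder data and the simple threshold below are illustrative, and the actual configuration is the one described in [139]:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Placeholder HLDA-reduced 45 dimensional training frames for the two classes.
    speech_frames = np.random.randn(5000, 45)
    nonspeech_frames = np.random.randn(5000, 45) + 1.0

    gmm_speech = GaussianMixture(n_components=16).fit(speech_frames)
    gmm_nonspeech = GaussianMixture(n_components=16).fit(nonspeech_frames)

    def frame_llr(features):
        # per-frame log likelihood ratio used for the S/NS decision
        return gmm_speech.score_samples(features) - gmm_nonspeech.score_samples(features)

    test_frames = np.random.randn(300, 45)
    decisions = frame_llr(test_frames) > 0.0  # threshold, then smooth into segments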
The proposed features are evaluated in terms of speech activity detection (SAD)
accuracy on noisy radio communications audio provided by the Linguistic Data Consortium
(LDC) for the DARPA RATS program [140, 141]. The audio data for the DARPA RATS
program is collected under both controlled and uncontrolled field conditions over highly
degraded, weak and/or noisy communication channels making the SAD task very challeng-
ing [140]. Most of the RATS data released for SAD were obtained by retransmitting existing
audio collections - such as the DARPA EARS Levantine/English Fisher conversational tele-
phone speech (CTS) corpus - over eight radio channels, labeled A through H, covering a wide range of transmission characteristics and noise conditions.
Figure 5.1: Schematic of (a) features and (b) the processing pipeline for speech activity
detection.
The SAD data consists of audio from the Arabic Levantine and English Fisher CTS corpus, retransmitted over the
eight channels. The training corpus consists of 73 hours of audio (62 hours from the Fisher
collection, and 11 from new RATS collection). Although the entire data was also retrans-
mitted over eight channels, since some data from channel F was unusable, all data from this channel was left out of our experiments.
The MLPs used for extracting data-driven features are trained on close to 660
hours of audio from the RATS development corpus using LDC provided S/NS annotations. A first level MLP is trained on each of the 5 feature streams described above; outputs from these 5 sub-systems are then fused by a merger MLP at the second level to derive the final S/NS posterior features. These features are taken from the pre-softmax outputs of the merger MLP.
Table 5.1: Equal Error Rate (%) on different channels using different acoustic features and combinations

Feature       Dim  Context  Input dim   EER (%) across channels
PLP            15    31       465       3.55  3.00  5.03  2.51  2.75  3.48  2.34  3.34
FDLPS          15    31       465       3.42  3.10  4.46  2.42  2.78  3.40  2.29  3.20
FDLPM         340     1       340       3.88  3.80  4.12  3.26  3.52  3.60  2.51  4.15
PLP+MLP        17    31       527       3.10  2.84  3.20  2.25  2.63  2.96  2.07  2.84
FDLPS+MLP      17    31       527       3.15  2.94  3.04  2.17  2.67  2.89  1.93  2.82
FDLPM+MLP     402     1       402       3.02  2.90  3.73  2.26  2.84  2.42  1.89  2.88
SAD models are trained on both acoustic and data-driven features, as well as on feature combinations. In each case, HLDA was used to reduce dimensionality prior to GMM training. Table 5.1 shows the dimensionality of the original space, prior to the application of HLDA, for each feature type used. A context of 31 frames was used for short-term features. In all cases, the output dimensionality of HLDA was set to 45. A single Gaussian was used to represent each of the two classes (speech, non-speech) during HLDA estimation, while GMMs were used for the final S/NS classification.
of GMM components were optimized using separate experiments [142]. The derived SAD
models were evaluated on the development set in terms of equal error rate (EER%), which is
the operating point at which the falsely rejected speech rate (probability of missed speech)
is equal to the falsely accepted non-speech rate (probability of false alarm). The results
are shown in Table 5.1 for conventional features (PLP), short-term features derived using
FDLP (FDLPS), long-term modulation features (FDLPM) and data-driven features (MLP).
Although the feature sets vary in performance on the individual noisy channels, they are comparable to each other in terms of overall SAD performance. In a second set of experiments, acoustic and data-driven features, which capture different kinds of information about speech, are combined. We observe close to 15% relative improvement
when the acoustic features are used in conjunction with the data-driven features. We draw the following conclusion from these experiments:

1. MLP based models, which are traditionally used to directly produce S/NS decisions, can also be used to derive posterior features that complement conventional acoustic features and improve SAD performance.

5.2 Mixture of AANNs for Speaker Verification
5.2.1 Overview
The goal of speaker verification is to verify the truth of a speaker's claimed identity. The majority of current speaker verification systems model the overall acoustic feature vector space using a GMM based Universal Background Model (UBM), trained on large amounts of data
from multiple speakers [143, 144]. In this section we discuss the development of a mixture of AANNs for speaker verification. The mixture consists of several AANNs tied using posterior probabilities of broad phonetic classes.
AANNs are feed-forward neural networks with several layers trained to reconstruct
the input at its output through a hidden compression layer. This is typically done by
modifying the parameters of the network using the back-propagation algorithm such that
the average squared error between the input and output is minimized over the entire training
data. More formally, for an input vector x, the network produces an output x̂(x, W) which
depends both on the input x and the parameters W of the network (the set of weights and
biases). For simplicity, we denote the network output as x̂(W). The training process then solves

    \min_{\{W\}} \; E\left[\, \| x - \hat{x}(W) \|^{2} \,\right]. \qquad (5.1)
This method of training ensures that for a well trained network, the average reconstruction
error of input vectors that are drawn from the distribution of the training data will be small
compared to vectors drawn from a different distribution [145]. The likelihood of the data x under the model can hence be approximated as

    p(x;\, W) \propto \exp\left(-E\, \| x - \hat{x}(W) \|^{2}\right). \qquad (5.2)
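To make the reconstruction error concrete, a small illustrative forward pass and error computation is sketched below; the layer sizes, random weights and function names are placeholders, and training by back-propagation is not shown:

    import numpy as np

    def aann_forward(x, weights, biases):
        # tanh hidden layers followed by a linear output layer
        h = x
        for W, b in zip(weights[:-1], biases[:-1]):
            h = np.tanh(h @ W + b)
        return h @ weights[-1] + biases[-1]

    def reconstruction_error(x, weights, biases):
        x_hat = aann_forward(x, weights, biases)
        return float(np.sum((x - x_hat) ** 2))

    # toy AANN: 39 -> 160 -> 20 -> 39 -> 39
    dims = [39, 160, 20, 39, 39]
    rng = np.random.default_rng(0)
    weights = [0.1 * rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(b) for b in dims[1:]]
    x = rng.standard_normal(39)
    # p(x; W) is proportional to exp(-reconstruction_error(x, W)), as in Eqn. (5.2)
    error = reconstruction_error(x, weights, biases)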
In [146, 147], these properties have been used to model acoustic data for speaker verification. A speaker independent UBM-AANN is first trained on
acoustic features from large amounts of data containing multiple speakers. Since data from
many speakers are used, the AANN model learns a speaker independent distribution of the
acoustic vectors. For each speaker in the enrollment set, the UBM-AANN is then adapted to
learn speaker dependent distributions by retraining the entire network using each speaker’s
enrollment data. During the test phase, the average reconstruction error of the test data is
computed using both the UBM-AANN and the claimed speaker AANN model. In an ideal
case, if the claim is true, the average reconstruction error under the speaker specific model
will be smaller than under the UBM-AANN and vice versa if false.
Conventional GMM based systems use maximum a posteriori probability (MAP) adaptation to obtain speaker specific models.
In the MAP adaptation of GMMs, only those components that are well represented in the
adaptation data get significantly modified. However in the case of neural networks, there
is no similar mechanism by which only parts of the model can be adapted. This limits the
ability of a single AANN to capture the distribution of acoustic vectors especially when
the space of speakers is large. To address this issue, we introduce a mixture of AANNs as described below.
Mixture of AANNs
In this model, each component AANN of the mixture is trained to model a separate part of the acoustic feature space [148]. In our experiments we partition the
acoustic space into 5 classes corresponding to the broad phoneme classes of speech - vowels,
fricatives, nasals, stops and silence. The assignment of a feature vector to one of these classes
is done using posterior probabilities of these classes estimated using a separate multilayer
perceptron (MLP). This additional information is incorporated into the objective function
in Eqn. (5.1) as -
    \min_{\{W_j\}} \; E\left[\, \sum_{j=1}^{c} P(C_j/x)\, \| x - \hat{x}(W_j) \|^{2} \,\right] \qquad (5.3)
where c denotes the number of mixture components or number of broad phoneme classes,
and the set Wj consists of parameters of the j th AANN of the mixture. P (Cj /x) is the
posterior probability of j th broad phonetic class Cj given x estimated using the MLP. During
back propagation training, since the error is weighted with the class posterior probabilities, each component AANN learns to model the distribution of feature vectors belonging to its broad phonetic class.
Similar to the single AANN case, a UBM-AANN is first trained on large amounts of
data. For each speaker in the enrollment, the UBM is then adapted using speaker specific
enrollment data. Broad class phoneme posteriors are used in both these cases to guide
the training of each class specific mixture component on appropriate set of frames. This
approach helps to alleviate the limitation of a single AANN model described earlier since
only parts of the UBM-AANN are now adapted based on the speaker data.
The average reconstruction error for a data set D = {x_1, . . . , x_n} is then given by

    e(D;\, W_1, \ldots, W_c) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{c} P(C_j/x_i)\, \| x_i - \hat{x}_i(W_j) \|^{2}. \qquad (5.4)
During the test phase, likelihood scores based on reconstruction errors from both the UBM-
AANN and the claimed speaker models are used to make a decision. In our experiments,
since the amount of adaptation data is usually limited, we adapt only the last layer weights
of each AANN component. We also restrict the number of nodes of the third hidden layer to keep the number of adapted parameters small. The UBM-AANN needs to be trained on sufficiently large amounts of data to serve as UBM. Gender specific UBMs are trained on a
telephone development data set consisting of audio from the NIST 2004 speaker recognition
database, the Switchboard II Phase III corpora and the NIST 2006 speaker recognition
database. We use only 400 male and 400 female utterances each corresponding to about 17
hours of speech. The acoustic features used in our experiments are 39 dimensional FDLP
features [149].
Posteriors used to train the UBM are derived from an MLP trained on 300 hours of conversational telephone speech (CTS) [88]. The 45 phoneme posteriors are combined into posteriors of the 5 broad phonetic classes.
Each AANN component of the UBM has a linear input and a linear output layer
along with three nonlinear (tanh nonlinearity) hidden layers. Both input and output layers
have 39 nodes corresponding to the dimensionality of the input FDLP features. We use 160
nodes in the first hidden layer, 20 nodes in the compression layer and 39 nodes in the third
hidden layer. Speaker specific models are obtained by adapting (retraining) only the last layer weights of each component using the speaker's enrollment data.
Once the UBMs and speaker models have been trained, a score for a trial is
computed as the difference between the average reconstruction error values (given by (5.4)) of the test utterance under the UBM model and under the claimed speaker model.
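A sketch of the average error of Eqn. (5.4) and the resulting trial score; the AANN components are represented simply as reconstruction functions, and all names are illustrative:

    import numpy as np

    def mixture_average_error(frames, class_posteriors, aann_components):
        # frames           : (n, dim) acoustic feature vectors of an utterance
        # class_posteriors : (n, c) broad phonetic class posteriors from the MLP
        # aann_components  : list of c functions, each mapping a frame to its
        #                    reconstruction under one component AANN
        n = len(frames)
        total = 0.0
        for i in range(n):
            for j, component in enumerate(aann_components):
                x_hat = component(frames[i])
                total += class_posteriors[i, j] * np.sum((frames[i] - x_hat) ** 2)
        return total / n

    def trial_score(frames, posteriors, ubm_components, speaker_components):
        # larger scores support the claimed identity
        return (mixture_average_error(frames, posteriors, ubm_components)
                - mixture_average_error(frames, posteriors, speaker_components))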
As a baseline, we train a conventional UBM-GMM system with a large number of Gaussian components on FDLP features. The UBM-GMM is trained using the entire development data described in the section above. The speaker specific GMM models are obtained by MAP adapting the UBM-GMM with a relevance factor of 16. As a second baseline, we train gender specific AANN systems. These systems use 160 nodes in both the second and fourth hidden layers and 20 nodes in the compression layer. The UBMs are trained using the same development data.
Table 5.2: Performance in terms of Min DCF (×10³) and EER (%) in parentheses on different NIST-08 conditions

System    C6    C7    C8
The systems are evaluated on three NIST-08 conditions (C6, C7 and C8) consisting of 3851 trials from 188 speakers. Table 5.2 lists both minimum
detection cost function (DCF) and equal error rate (EER) of various systems. The proposed
mixture of AANNs system performs much better than the baseline AANN system and
yields comparable results to the conventional GMM system. The score combination (equal
weighting) of GMM baseline and the proposed system further improves the performance.
However, state-of-the-art GMM systems use factor analysis to obtain much better gains.
In [150, 151], the AANN based approach has been further developed to use factor analysis.
5.3 Posterior Features for Spoken Term Discovery

In zero resource settings, tasks such as spoken term discovery attempt to auto-
matically identify repeated words and phrases in speech without any transcriptions [152].
In recent approaches [152–154] to address this task, a dynamic time warping (DTW) search
of the speech corpus is performed against itself to discover repeated patterns. With no
transcripts to guide the process, results of the search largely depend on the quality of the
underlying speech representation being used. In [155], multiple information retrieval met-
rics have been proposed to evaluate the quality of different speech representations on this
task. These metrics operate by using a large collection of pre-segmented word examples to
first compute the DTW distance between all example pairs and then quantify how well the
DTW distances can differentiate between same-word and different-word example pairs. Better scores
with these metrics are indicative of good speaker independence and high word discrim-
inability of feature representations. Since these are also desirable properties of features for
other downstream recognition applications, these metrics are also predictive of how different
features will perform in those applications. We evaluate posterior features from both the
multilingual and cross-lingual front-ends (Chapter 4, Sec. 4.2) for spoken term discovery using these metrics.
The evaluation metric uses 11K words from the Switchboard corpus resulting in
60.7M word pairs of which 96K are same word pairs [155]. Similarity between word pairs (wi, wj) is measured using the minimum DTW alignment cost DTW(wi, wj) between wi and wj.
Figure 5.2: Average precision for different configurations of the wide topology front-ends, plotted against the amount of task-independent training data (1 to 20 hours); average precision values range from 0 to about 0.5.
A word pair is considered a match when its DTW alignment cost falls below a decision threshold τ. Computing DTW distances also requires a distance metric to be defined between the feature vector frames that make up the words. For this evaluation, the cosine distance is used for comparing frames of raw acoustic features, while a more meaningful symmetric KL-divergence is used for assessing similarities between the phoneme posterior vectors generated by the proposed data-driven front-ends.
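A sketch of the frame level symmetric KL-divergence and the DTW alignment cost between two posteriogram word examples; the evaluation in [155] may additionally normalize the cost by the alignment path length, which is omitted here:

    import numpy as np

    def symmetric_kl(p, q, eps=1e-8):
        # symmetric KL-divergence between two posterior vectors
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    def dtw_cost(A, B, dist=symmetric_kl):
        # minimum DTW alignment cost between word examples A (m x d) and B (n x d)
        m, n = len(A), len(B)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d = dist(A[i - 1], B[j - 1])
                D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[m, n]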
The entire set of word pairs is now used in the context of an information retrieval task where the goal is to retrieve the same word pairs from the different word impostors. Precision-recall curves are computed for each feature setting and can then be characterized by several criteria.
We use the average precision metric defined as the area under the precision-recall curve for
our experiments, which summarizes the system performance across all operating points.
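Given same/different labels and similarity scores for the word pairs, the average precision can be computed as sketched below; the scikit-learn helper and the toy values are placeholders:

    import numpy as np
    from sklearn.metrics import average_precision_score

    # 1 for same-word pairs, 0 for different-word pairs; scores are negated DTW
    # costs so that larger values indicate more similar examples.
    labels = np.array([1, 0, 1, 0, 0, 1])
    scores = -np.array([1.2, 3.4, 0.8, 2.9, 3.1, 1.0])

    average_precision = average_precision_score(labels, scores)  # area under the PR curve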
Figure 5.2 shows the average precision scores for the two front-ends with varying
amounts of training data. The plot shows that posterior features perform significantly
better than the raw acoustic features (39D PLP features with zero mean/unit variance)
which have a very low score of only 0.177. As in the LVCSR case (Chapter 4, Sec. 4.2),
posterior features from the cross-lingual front-end perform even better. Both front-ends
improve as the amount of task independent data increases. Since this evaluation metric is
based on DTW distances over a moderately large set of words, improved performances on
this metric imply more accurate spoken term discovery. These experiments clearly show
the potential of data-driven front-ends not only in low-resource settings but also in zero-resource settings.
5.4 Phoneme Posteriors as Event Detectors for Speech Recognition

In [156], we present a new application of phoneme posteriors for ASR. We use MLP based phoneme posteriors to detect phonetic events in the acoustic signal. These phoneme detectors are then used along with Segmental Conditional Random Fields (SCRFs) [157] to improve word recognition. The MLPs estimate posterior probabilities of phoneme classes given the acoustic evidence. Each output unit of the MLP is associated with a particular phoneme, and the posteriors, scaled by class priors, are used as emission likelihoods in a hybrid phoneme recognition system. The Viterbi algorithm is then applied on the hybrid system to decode phoneme sequences. Each time frame in the acoustic signal is associated with a phoneme in the
decoded output. We use the output phonemes along with their corresponding time stamps
of the time span in which a phoneme is present. These phoneme detections are subsequently used as event detector streams for the SCRF based recognition system described below. Posterior probabilities of phonetic sound classes are estimated using a hierarchical configuration of MLPs. We use both short-term spectral and long-term modulation acoustic features as inputs to these MLPs.
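A minimal sketch of turning a frame level phoneme decode into detection events; how each event is time-stamped (start/end versus mid-point) is an assumption here:

    def detections_from_decode(frame_phonemes, frame_rate=100):
        # frame_phonemes : decoded phoneme label for every 10 ms frame
        # returns a list of (phoneme, start_time, end_time) events in seconds
        events, start = [], 0
        for t in range(1, len(frame_phonemes) + 1):
            if t == len(frame_phonemes) or frame_phonemes[t] != frame_phonemes[start]:
                events.append((frame_phonemes[start], start / frame_rate, t / frame_rate))
                start = t
        return events

    # detections_from_decode(['sil', 'sil', 'k', 'k', 'ae', 'ae', 'ae', 't'])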
The SCRF decoder hypothesizes the word sequence using a log-linear model over features of the detector streams. SCARF [158] uses four basic kinds of features to relate the events present in the observation stream to the words being hypothesized - expectation, Levenshtein, existence and baseline features - along with standard language model features. The expectation and Levenshtein features measure
the similarity between expected and observed phoneme strings, while the existence features
indicate simple co-occurrence between words and phonemes. The baseline feature indicates
(dis)agreement between the label on a lattice link and the word which occurs in the same time span of the baseline system's output.
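The quantity underlying the Levenshtein features is the edit distance between an expected and an observed phoneme sequence, which can be computed as follows (a standard dynamic programming sketch, not SCARF's internal implementation):

    def levenshtein(expected, observed):
        m, n = len(expected), len(observed)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if expected[i - 1] == observed[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,         # deletion
                              D[i][j - 1] + 1,         # insertion
                              D[i - 1][j - 1] + cost)  # substitution
        return D[m][n]

    # levenshtein(['k', 'ae', 't'], ['k', 'ah', 't']) -> 1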
The phoneme detections that we now include capture phonetic events that occur
in the underlying acoustic signal. During the training process SCARF learns weights for
each of the features. In the testing phase, SCARF uses the inputs from the detectors to score competing word hypotheses and produce the final recognition output.
We use SCARF along with the earlier described event detectors on the Broad-
cast News task [159]. Table 5.3 shows the results of using the word detector stream along
with all the phoneme detector streams in combination. In this experiment we observe fur-
ther improvements with the phoneme detectors even after the word detectors have been
used. Both the experiments clearly show that additional information in the underlying
acoustic signal is being captured by the detectors and hence the further reduction in error
rates. It should be noted that these improvements are on top of results using state-of-the-art
recognition systems.
5.5 Conclusions
In this chapter we have shown how outputs from the proposed data-driven front-ends can be used for four different applications. For speech activity detection, the data-driven front-ends are used to derive features which improve speech detection in very noisy environments. In the
second application we use broad class posteriors to improve neural network based speaker
verification. By the introduction of this side information, a mixture of neural networks can
be trained similar to conventional GMM based models. This technique improves the neural
network framework significantly and makes its performance comparable with state-of-the-art
systems.
The posterior outputs of the front-ends are also useful for zero resource speech applications like spoken term discovery, which operate without any transcribed speech to train systems. The proposed features provide significant gains over conventional acoustic features on various information retrieval metrics for this task. In this chapter we have also explored a different application of phoneme posteriors - as phonetic event detectors for speech recognition. We show how these detectors can be built to reliably capture phonetic events in the acoustic signal by integrating both acoustic and phonetic information about the underlying speech.
Chapter 6
Conclusions
6.1 Contributions
In this thesis we have proposed novel data-driven feature front-ends for differ-
ent speech applications. This approach is different from conventional feature extraction
techniques which derive information only from the spectrum of speech in short analysis
windows.
Building on these features, we have explored the use of various data-driven front-ends in different speech applications. Several training and adaptation techniques have been proposed to improve the performance of these front-ends when only limited amounts of task specific transcribed data are available. We have also
demonstrated the use of these front-ends for other speech applications like speech activity detection, speaker verification, spoken term discovery and phonetic event detection. The main contributions of this thesis are -
Data-driven features for speech recognition (Chap. 2, Sec. 2.3) - We have pro-
posed a new set of data-driven features for speech recognition. These features are derived
by combining posterior outputs of MLPs trained on FDLP based short-term spectral and long-term modulation features. The combined features improve performances on various ASR tasks - phoneme recognition, digit recognition and large vocabulary continuous speech recognition.
(Chap. 3, Sec. 3.2) - We have developed a count based technique to map between phoneme
classes used to transcribe data in different languages and domains. This technique is
based on a measure that uses posteriors of phoneme classes as soft counts. We have
demonstrated the use of this approach in combining data from three languages - English,
Spanish and German, to train neural network systems. Significant gains are observed
when data-driven features derived using these multilingual MLPs are used in low-resource
settings [162].
(Chap. 3, Sec. 3.3) - Instead of using a mapping scheme to combine data from different sources
before training, we have developed an approach to train neural networks using domain
specific output layers that are modified as training progresses across different domains.
This approach has been shown to be useful in sharing trained network layers across
different domains especially in low-resource settings [163]. Both the above mentioned
techniques address a key issue usually encountered while training neural networks
with data transcribed using different phoneme sets from multiple sources.
b. Wide neural network topology using data from multiple languages (Chap.
4, Sec. 4.2) - We have explored the use of a wide neural network topology that uses
several MLPs trained on large amounts of task independent data for low-resource and zero-resource applications. Our experiments show that when task dependent training data is scarce, task independent multi-lingual data can significantly improve the performance of data-driven front-ends.
c. Deep neural network with pre-training using task independent data (Chap.
4, Sec. 4.3) - To allow deep neural networks to be effectively trained in low resource
settings, we have investigated the use of multilingual data for initialization and train-
ing. By using deep neural networks, significant gains are observed on a low-resource
task using only 1 hour of training data. We also illustrate the use of unsupervised
acoustic model training in these settings. Table 6.1 summarizes the gains obtained by
using the proposed techniques in a low-resource experimental setup with only 1 hour of transcribed training data.
Table 6.1 (partial): Spanish MLP and 20 hours of English MLP from different domain (Contribution 3b) - 41.5% word accuracy.
The above results clearly show that data-driven features are able to significantly improve recognition accuracies in low resource settings. With only a small fraction of task specific training data, the proposed approaches are able to achieve performances (44.8%) very close to those obtained with conventional features when all of the available transcribed data is used.
(Chap. 5, Sec. 5.1) - Neural networks have traditionally been used only as acoustic models for speech activity detection. We have proposed the use of data-driven features derived using MLPs for this task. When combined with acoustic features, significant improvements are observed on very noisy radio communication channels.
(Chap. 5, Sec. 5.2) - To allow neural network models to effectively capture the distribution of acoustic features across a large space of speakers, independent AANNs are trained on different parts of the acoustic space corresponding to broad phoneme classes of speech. The assignment of a feature vector to one of these classes is done using posterior probabilities of these classes estimated using a separate MLP.
We have also evaluated the proposed posterior features for zero resource tasks like spoken term discovery, where these features provide significant gains over conventional acoustic features. Finally, phoneme posterior probabilities estimated using MLPs are used both as scaled likelihoods in a hybrid system and as phonetic event detectors for speech recognition.
6.2 Summary
In this chapter, we have summarized the contributions of this thesis. Although the proposed data-driven feature extraction techniques have been shown to be useful in many applications, they have limitations related to their training and use. These include -
1. Labeled training data - For the neural network systems to be trained, sufficient data with frame level phonetic transcriptions are required. These labels are often produced by forced alignment with an existing recognizer. In zero resource settings where no such transcripts are available, building neural network based front-ends remains difficult.
2. Mismatch conditions - Neural networks are sensitive to mismatches in train and test conditions. Neural network based front-ends can be useful for deriving features only when the test conditions are reasonably matched to those seen during training.
These current limitations open up several interesting avenues for future work. It
would be interesting to see if any of the techniques currently being developed for unsuper-
vised sub-word acoustic model training using universal background models [165], successive
state splitting algorithms for HMMs [166], estimation of sub-word HMMs [167], discrimi-
native clustering objectives [168], non-parametric Bayesian estimation of HMMs [169], au-
tomatically discovered context independent sub-word units [170] can be used to build data-driven front-ends when no transcribed data is available. Another promising direction is the multi-stream paradigm for processing of corrupted signals, which has been studied for more than a decade [137]. In this paradigm, the signal is decomposed and classified in separate processing channels in order to provide for a possibility to adaptively alleviate the corrupted channels while preserving the uncorrupted channels for further processing. More robust data-driven front-ends could be built using this technique to deal with unexpected or unseen noise environments.
Bibliography
[2] ——, Statistical methods for speech recognition. MIT press, 1998.
[5] S. Katz, “Estimation of probabilities from sparse data for the language model com-
[6] J. Makhoul, “Linear prediction: A tutorial review,” Proceedings of the IEEE, vol. 63,
Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.
[10] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” The Journal
[14] M. Richard and R. Lippmann, “Neural network classifiers estimate Bayesian a poste-
[16] H. Hermansky and N. Malayath, “Spectral basis functions from discriminant analy-
[17] R. Cole, M. Fanty, M. Noel, and T. Lander, “Telephone speech corpus development
[19] M. Hunt, “A statistical approach to metrics for word and syllable recognition,” The
Journal of The Acoustical Society of America, vol. 66, no. S1, pp. S35–S36, 1979.
[20] M. Hunt and C. Lefebvre, “A comparison of several acoustic representations for speech
1989.
[22] N. Malayath and H. Hermansky, “Data-driven spectral basis functions for automatic
speech recognition,” Speech communication, vol. 40, no. 4, pp. 449–466, 2003.
[24] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE
Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254–272,
1981.
[25] A. Jansen and P. Niyogi, “Intrinsic Fourier analysis on the manifold of speech sounds,”
[26] V. Jain and L. Saul, “Exploratory analysis and visualization of speech and music by
[27] A. Errity and J. McKenna, “An investigation of manifold learning for speech analysis,”
[28] H. Hermansky, D. Ellis, and S. Sharma, “Tandem connectionist feature extraction for
tions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 225–241, 2011.
[32] J. Pinto, G. Sivaram, H. Hermansky, and M. Magimai-Doss, “Volterra series for ana-
2009.
[34] P. Woodland, D. Povey et al., “Large scale MMIE training for conversational telephone
[38] N. Kumar and A. Andreou, “Heteroscedastic discriminant analysis and reduced rank
HMMs for improved speech recognition,” Speech communication, vol. 26, no. 4, pp.
283–297, 1998.
[39] R. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classi-
[41] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Dis-
IEEE, 2005.
[42] N. Kambhatla and T. Leen, “Dimension reduction by local principal component anal-
discriminative feature, transform, and model training for large vocabulary speech
[45] E. Zwicker, G. Flottorp, and S. Stevens, “Critical band width in loudness summation,”
The Journal of the Acoustical Society of America, vol. 29, no. 5, pp. 548–557, 1957.
[48] L. Lee and R. Rose, “Speaker normalization using efficient frequency warping proce-
[49] ETSI, “Speech processing, transmission and quality aspects (STQ); Distributed
[51] M. Gales and S. Young, “The application of hidden Markov models in speech recog-
[53] D. Gelbart and N. Morgan, “Double the trouble: handling noise and reverberation in
[55] B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz, “Long span features and minimum
“Advances in speech transcription at IBM under the DARPA EARS program,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1596–
1608, 2006.
P. Jain, H. Hermansky, D. Ellis et al., “Pushing the envelope - aside: Beyond the
tures for phonetic and speaker-channel classification,” Speech Communication, vol. 31,
[59] R. Drullman, J. Festen, and R. Plomp, “Effect of reducing slow temporal modulations
on speech reception,” The Journal of the Acoustical Society of America, vol. 95, p.
2670, 1994.
[60] T. Arai, M. Pavel, H. Hermansky, and C. Avendano, “Syllable intelligibility for tem-
porally filtered LPC cepstral trajectories,” The Journal of the Acoustical Society of
[61] T. Houtgast and H. Steeneken, “The modulation transfer function in room acoustics as
transfer functions and speech intelligibility,” The Journal of the Acoustical Society of
[63] T. Houtgast and H. Steeneken, “A review of the MTF concept in room acoustics and
its use for estimating speech intelligibility in auditoria,” The Journal of the Acoustical
[67] P. Schwarz, “Phoneme recognition based on long temporal context,” Ph.D. disserta-
[68] P. Jain and H. Hermansky, “Beyond a single critical-band in TRAP based ASR,” in
[69] J. Herre and J. Johnston, “Enhancing the performance of perceptual audio coders by
[70] R. Kumaresan and A. Rao, “Model-based approach to envelope and positive instan-
taneous frequency estimation of signals with speech applications,” The Journal of the
[71] M. Athineos, “Linear prediction of temporal envelopes for speech and audio applica-
[73] B. Chen, Q. Zhu, and N. Morgan, “Learning long-term temporal features in LVCSR
[75] Q. Zhu, B. Chen, N. Morgan, A. Stolcke et al., “On using MLP features in LVCSR,”
Transactions on Signal Processing, vol. 55, no. 11, pp. 5237–5245, 2007.
using frequency domain linear prediction,” IEEE Signal Processing Letters, vol. 15,
robust phoneme recognition using modulation spectrum,” The Journal of the Acous-
[80] ——, “Phoneme recognition using spectral envelope and modulation frequency fea-
[82] S. Thomas, S. Ganapathy, and H. Hermansky, “Hilbert envelope based features for
far-field speech recognition,” Machine Learning for Multimodal Interaction, pp. 119–
124, 2008.
[83] T. Dau, D. Püschel, and A. Kohlrausch, “A quantitative model of the effective signal
processing in the auditory system. i. model structure,” The Journal of the Acoustical
phoneme recognition in noisy speech,” The Journal of the Acoustical Society of Amer-
[86] B. Kingsbury, N. Morgan, and S. Greenberg, “Robust speech recognition using the
modulation spectrogram,” Speech Communication, vol. 25, no. 1, pp. 117–132, 1998.
[88] S. Ganapathy, S. Thomas, and H. Hermansky, “Static and dynamic modulation spec-
[89] K. Lee and H. Hon, “Speaker-independent phone recognition using hidden Markov
models,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37,
D. Moore, V. Wan, R. Ordelman et al., “The 2005 AMI system for the transcription
2006.
[91] J. Fiscus, N. Radde, J. Garofolo, A. Le, J. Ajot, and C. Laprun, “The rich transcrip-
tion 2005 spring meeting recognition evaluation,” Machine Learning for Multimodal
[92] D. Moore, J. Dines, M. Doss, J. Vepa, O. Cheng, and T. Hain, “Juicer: A weighted
[93] G. Zavaliagkos, M. Siu, T. Colthurst, and J. Billa, “Using untranscribed training data
[94] H. Lin, L. Deng, D. Yu, Y. Gong, A. Acero, and C. Lee, “A study on multilingual
2009.
[95] D. Imseng, H. Bourlard, and P. Garner, “Using KL-divergence and multilingual infor-
IEEE, 2012.
[96] IPA, Handbook of the International Phonetic Association: A guide to the use of the
IEEE, 2010.
[98] Y. Qian, D. Povey, and J. Liu, “State-level data borrowing for low-resource speech
2011.
[99] S. Sivadas and H. Hermansky, “On use of task independent training data in tandem
[100] J. Pinto, “Multilayer perceptron based hierarchical acoustic modeling for automatic
2010.
[102] G. Miller and P. Nicely, “An analysis of perceptual confusions among some English
consonants,” The Journal of the Acoustical Society of America, vol. 27, no. 2, pp.
338–352, 1955.
confusions using formal concept analysis,” The Journal of the Acoustical Society of
press, 1961.
[108] A. Canavan and G. Zipperlen, “CALLHOME Spanish speech,” Linguistic Data Con-
sortium, 1997.
vol. 3, 2002.
[111] D. Graff, J. Kong, K. Chen, and K. Maeda, “English gigaword,” Linguistic Data
[113] N. Morgan, “Deep and wide: Multiple layers in automatic speech recognition,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 7–13,
2012.
[115] G. Sivaram and H. Hermansky, “Sparse multilayer perceptron for phoneme recogni-
tion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1,
[116] A. Mohamed, G. Dahl, and G. Hinton, “Acoustic modeling using deep belief net-
works,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1,
[117] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neu-
ral networks for large vocabulary speech recognition,” IEEE Transactions on Audio,
Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
“Making deep belief networks effective for large vocabulary continuous speech recog-
[119] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-dependent deep
IEEE, 2011.
does unsupervised pre-training help deep learning?” The Journal of Machine Learning
[121] D. Yu, L. Deng, and G. Dahl, “Roles of pre-training and fine-tuning in context-
[122] D. Yu and M. Seltzer, “Improved bottleneck features using pretrained deep neural
[124] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,”
deep networks,” Advances in neural information processing systems, vol. 19, p. 153,
2007.
[126] T. Kemp and A. Waibel, “Unsupervised training of a speech recognizer: Recent ex-
[129] T. Kemp and T. Schaaf, “Estimating confidence using word lattices,” in Proceedings
IEEE, 2008.
[131] K. Woo, T. Yang, K. Park, and C. Lee, “Robust voice activity detection algorithm
for estimating noise spectrum,” IET Electronics Letters, vol. 36, no. 2, pp. 180–181,
2000.
ISCA, 1999.
729 optimized for V. 70 digital simultaneous voice and data applications,” IEEE
[134] E. Nemer, R. Goubran, and S. Mahmoud, “Robust voice activity detection using
[135] J. Dines, J. Vepa, and T. Hain, “The segmentation of multi-channel meeting record-
2006.
T. Ng, B. Zhang, L. Nguyen et al., “Acoustic and data-driven features for robust
[140] K. Walker and S. Strassel, “The RATS radio traffic collection system,” in Proceedings
[141] X. Ma, D. Graff, and K. Walker, “RATS - first incremental SAD audio delivery,”
N. Mesgarani, “Developing a speech activity detection system for the DARPA RATS
Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Pro-
[144] D. Reynolds, T. Quatieri, and R. Dunn, “Speaker verification using adapted Gaussian
mixture models,” Digital signal processing, vol. 10, no. 1, pp. 19–41, 2000.
[145] B. Yegnanarayana and S. Kishore, “AANN: an alternative to GMM for pattern recog-
[147] K. Murty and B. Yegnanarayana, “Combining evidence from residual phase and
MFCC features for speaker recognition,” IEEE Signal Processing Letters, vol. 13,
[149] S. Ganapathy, J. Pelecanos, and M. Omar, “Feature normalization for speaker verifi-
[151] S. Garimella, “Alternative regularized neural network architectures for speech and
[152] A. Jansen, K. Church, and H. Hermansky, “Towards spoken term discovery at scale
[154] Y. Zhang and J. Glass, “Towards multi-speaker unsupervised speech pattern discov-
2011.
[156] S. Thomas, P. Nguyen, G. Zweig, and H. Hermansky, “MLP based phoneme detectors
[157] G. Zweig and P. Nguyen, “A segmental CRF approach to large vocabulary continuous
[158] ——, “SCARF: A segmental conditional random field toolkit for speech recognition,”
ditional random fields: A summary of the JHU CLSP 2010 summer workshop,” in
[161] ——, “Tandem representations of spectral envelope and modulation frequency fea-
[162] ——, “Cross-lingual and multistream posterior features for low resource LVCSR sys-
[163] ——, “Multilingual MLP features for low-resource LVCSR systems,” in Proceedings
[165] Y. Zhang and J. Glass, “Unsupervised spoken keyword spotting via segmental DTW
[167] M. Siu, H. Gish, S. Lowe, and A. Chan, “Unsupervised audio patterns discovery using
[169] C. Lee and J. Glass, “A non-parametric Bayesian approach to acoustic model discov-
Vita
Science and Engineering from Cochin University of Science and Technology, Kerala,
India in 2000 and Master of Science by Research degree from the Indian Institute of
Technology, Madras in 2006. He completed his Ph.D. in Electrical and Computer Engineering while affiliated with the Center for Language and Speech Processing (CLSP) at the Johns Hopkins University, and is currently with the IBM T.J. Watson Research Center, Yorktown Heights, USA. His research in-
terests include speech recognition, speaker recognition, speech synthesis and machine
learning. In the past he has been part of several summer workshops at the CLSP and
has also worked at IDIAP Research, Switzerland and the IBM India Research Lab.