0% found this document useful (0 votes)

43 views130 pages

Speech Recognition in Noisy Environments

Uploaded by

Abdelkbir Ws

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views130 pages

Speech Recognition in Noisy Environments

Uploaded by

Abdelkbir Ws

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 130

Speech Recognition in Noisy Environments

Pedro J. Moreno
April 22, 1996

Department of Electrical
and Computer Engineering
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213

Submitted in Partial Fulfillment of the Requirements

for the Degree of Doctor of Philosophy in Electrical and Computer Engineering

Copyright (c) 1996, by Pedro J. Moreno

Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1. Thesis goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2. Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Chapter 2
The SPHINX-II Recognition System . . . . . . . . . . . . . . . . . . . . . . 17
2.1. An Overview of the SPHINX-II System . . . . . . . . . . . . . . . . . . 17
2.1.1. Signal Processing . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1.2. Hidden Markov Models . . . . . . . . . . . . . . . . . . . . . . 20
2.1.3. Recognition Unit . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.4. Training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.5. Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2. Experimental Tasks and Corpora . . . . . . . . . . . . . . . . . . . . . 24
2.2.1. Testing databases . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2. Training database . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3. Statistical Significance of Differences in Recognition Accuracy . . . . . . . . . 25
2.4. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Chapter 3
Previous work in Environmental Compensation . . . . . . . . . . . . . . . . . 29
3.1. Cepstral Mean Normalization . . . . . . . . . . . . . . . . . . . . . . 29
3.2. Data-driven compensation methods . . . . . . . . . . . . . . . . . . . . 29
3.2.1. POF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.2. FCDCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3. Model-based compensation methods. . . . . . . . . . . . . . . . . . . . 30
3.3.1. CDCN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2. PMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.3. MLLR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5. Algorithms proposed in this thesis. . . . . . . . . . . . . . . . . . . . . 32
Chapter 4
Effects of the Environment on
Distributions of Clean Speech. . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1. A Generic Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2. One dimensional simulations using artificial data . . . . . . . . . . . . . . . 39
4.3. Two dimensional simulations with artificial data . . . . . . . . . . . . . . . 41
4.4. Modeling the effects of the environment as correction factors . . . . . . . . . . 43
4.5. Why do speech recognition system degrade in performance in the presence of unknown
environments? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Chapter 5
A Unified View of Data-Driven Environment Compensation . . . . . . . . . . . . 49
iv

5.1. A unified view . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.2. Solutions for the correction factors and . . . . . . . . . . . . . . . . . . 51
5.2.1. Non-stereo-based solutions . . . . . . . . . . . . . . . . . . . . . 52
5.2.2. Stereo-based solutions . . . . . . . . . . . . . . . . . . . . . . . 54
5.3. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 6
The RATZ Family of Algorithms. . . . . . . . . . . . . . . . . . . . . . . . 57
6.1. Overview of RATZ and Blind RATZ. . . . . . . . . . . . . . . . . . . . 57
6.2. Overview of SNR-Dependent RATZ and Blind-RATZ . . . . . . . . . . . . . 59
6.3. Overview of Interpolated RATZ and Blind RATZ. . . . . . . . . . . . . . . 62
6.4. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.4.1. Effect of an SNR-dependent structure . . . . . . . . . . . . . . . . . 64
6.4.2. Effect of the number of adaptation sentences . . . . . . . . . . . . . . 65
6.4.3. Effect of the number of Gaussian Mixtures . . . . . . . . . . . . . . . 65
6.4.4. Stereo based RATZ vs. Blind based RATZ . . . . . . . . . . . . . . . 65
6.4.5. Effect of environment interpolation on RATZ . . . . . . . . . . . . . . 67
6.4.6. Comparisons with FCDCN . . . . . . . . . . . . . . . . . . . . . 67
6.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Chapter 7
The STAR Family of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 71
7.1. Overview of STAR and Blind STAR . . . . . . . . . . . . . . . . . . . . 71
7.2. Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 74
7.2.1. Effect of the number of adaptation sentences . . . . . . . . . . . . . . 74
7.2.2. Stereo vs. non-stereo adaptation databases . . . . . . . . . . . . . . . 75
7.2.3. Comparisons with other algorithms . . . . . . . . . . . . . . . . . . 76
7.3. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Chapter 8
A Vector Taylor Series Approach to Robust Speech Recognition . . . . . . . . . . 79
8.1. Theoretical assumptions. . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2. Taylor series approximations . . . . . . . . . . . . . . . . . . . . . . . 81
8.3. Truncated Vector Taylor Series approximations . . . . . . . . . . . . . . . 83
8.3.1. Comparing Taylor series approximations to exact solutions . . . . . . . . 84
8.4. A Maximum Likelihood formulation for the case of unknown environmental parame-
ters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.5. Data compensation vs. HMM mean and variance adjustment . . . . . . . . . . 89
8.6. Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 91
8.7. Experiments using real data . . . . . . . . . . . . . . . . . . . . . . . 92
8.8. Computational complexity. . . . . . . . . . . . . . . . . . . . . . . . 94
8.9. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Chapter 9
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.1. Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . 97
9.2. Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.3. Suggestions for Future Work . . . . . . . . . . . . . . . . . . . . . . . 100
Appendix A
Comparing Data Compensation to Distribution Compensation . . . . . . . . . . . 103
v

Appendix B
Solutions for the SNR-RATZ Correction Factors . . . . . . . . . . . . . . . . . 109
Appendix C
Solutions for the Distribution Parameters for Clean Speech using SNR-RATZ. . . . . 115
Appendix D
EM Solutions for the n and q Parameters for the VTS Algorithm . . . . . . . . . . 121
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
vi

List of Figures
Figure 2-1.: Block diagram of SPHINX-II. . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Figure 2-2.: Block diagram of SPHINX-II’s front end. . . . . . . . . . . . . . . . . . . . . . 21
Figure 2-3.: The topology of the phonetic HMM used in the SPHINX-II system. . . . . . . . 23
Figure 3-1: Outline of the algorithms for environment compensation presented in this thesis. 33
Figure 4-1: : A model of the environment for additive noise and filtering by a linear channel. rep-
resents the clean speech signal, represents the additive noise and represents the re-
sulting noisy speech signal. represents a linear channel . . . . . . . . . . . . . . 35
Figure 4-2: : Estimate of the distribution of noisy data via Monte-Carlo simulations. The continu-
ous line represents the pdf of the clean signal. The dashed line represents the real pdf
of the noise-contaminated signal. The dotted line represents the Gaussian-approxi-
mated pdf of the noisy signal. The original clean signal had a mean of 5.0 and a vari-
ance of 3.0, the channel was set to 5.0. The mean of the noise was set to 7.0 and the
variance of the noise was set to 0.5. . . . . . . . . . . . . . . . . . . . . . . . . 40
Figure 4-3: : Estimate of the distribution of noisy signal at a lower SNR level via Monte-Carlo
methods. The continuous line represents the pdf of the clean signal. The dashed line
represents the real pdf of the noise-contaminated signal. The dotted line represents the
Gaussian-approximated pdf of the noisy signal. The original clean signal had a mean
of 5.0 and a variance of 3.0. Channel was set to 5.0. The mean of the noise was set to
9.0 and the variance of the noise was set to 0.5. . . . . . . . . . . . . . . . . . . 40
Figure 4-4: :Estimate of the distribution of noisy signal at a lower SNR level via Monte-Carlo
methods. The continuous line represents the pdf of the clean signal. The dashed line
represent the real pdf of the noise-contaminated signal. The dotted line represents the
Gaussian-approximated pdf of the noisy signal. The original clean signal had a mean
of 5.0 and a variance of 3.0. The channel was set to 5.0. The mean of the noise was
set to 11.0 and the variance of the noise was set to 0.5. . . . . . . . . . . . . . . 41
Figure 4-5: : Contour plot of the distribution of the clean signal. . . . . . . . . . . . . . . . . 42
Figure 4-6: : Contour plot of the distribution of the clean signal and of the noisy signal. . . . 43
Figure 4-7: : Decision boundary for a single two-class classification problem. The shaded region
represents the probability of error. An incoming sample xi will be classified as be-
longing to class H1 or H2 comparing it to the decision boundary . If xi is less than
it will be classified as belonging to class H1, otherwise it will be classified as belong-
ing to class H2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Figure 4-8: : When the classification is performed using the wrong decision boundaries, the error
region is composed of two terms, the optimal one assuming the optimal decision
boundary is known (banded area above), and an addition term introduced by using the
wrong decision boundary (shaded are above between and ). . . . . . . . . . . . 46
Figure 5-1: A state with a mixture of Gaussians is equivalent to a set of states where each of them
contains a single Gaussian and the transition probabilities are equivalent to the a pri-
ori probabilities of each of the mixture Gaussians. . . . . . . . . . . . . . . . . . 51
Figure 6-1: : Contour plot illustrating joint pdfs of the structural mixture densities for the compo-
nents and . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Figure 6-2.: Comparison of RATZ algorithms with and without a SNR dependent structure. We
compare an 8.32 SNR-RATZ algorithm with a normal RATZ algorithm with 256
Gaussians. We also compare a 4.16 SNR-RATZ algorithm with a normal RATZ al-
vii

gorithm with only 64 Gaussians. . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Figure 6-3.: Study of the effect of the number of adaptation sentences on a 8.32 SNR dependent
RATZ algorithm. We observe that even with only 10 sentences available for adapta-
tion the performance of the algorithm does not seem to suffer. . . . . . . . . . . 66
Figure 6-4.: Study of the effect of the number of Gaussians on the performance of the RATZ al-
gorithms. In general a 256 configuration seems to perform better than a 64 or 16. 66
Figure 6-5.: Comparison of a stereo based 4.16 RATZ algorithm with a blind 4.16 RATZ algo-
rithm. The stereo-based algorithm outperforms the blind algorithm at almost all
SNRs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 6-6.: Effect of environment interpolation in recognition accuracy. The curve labeled
RATZ interpolated (A) was computed excluding the correct environment from the list
of environments. The curve labeled RATZ interpolated (B) was computing with all
environments available for interpolation. . . . . . . . . . . . . . . . . . . . . . 68
Figure 6-7.: Effect of environment interpolation on the performance of the RATZ algorithm. In
this case we compare the effect of removing the right environmental correction fac-
tors from the list of environments. We can observe that removing the right environ-
ment does not affect the performance of the algorithm. . . . . . . . . . . . . . . 69
Figure 7-1.: Effect of the number of adaptation sentences used to learn the correction factors rk
and Rk on the recognition accuracy of the STAR algorithm. The bottom dotted line
represents the performance of the system with no compensation. . . . . . . . . . 75
Figure 7-2.: Comparison of the Blind STAR, and original STAR algorithms. The line with dia-
mond symbols represents the original blind STAR algorithm while the line with tri-
angle symbols represents the blind STAR algorithm bootstrapped from the closest in
SNR sense distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Figure 7-3.: Comparison of the stereo STAR, blind STAR, stereo RATZ, and blind RATZ algo-
rithms. The adaptation set was the same for all algorithms and consisted of 100 sen-
tences. The dotted line at the bottom represents the performance of the system with
no compensation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Figure 8-1.: Comparison between Taylor series approximations to the mean and the actual value
of the mean of noisy data. A Taylor series of order zero seems to capture most of the
effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 8-2.: Comparison between Taylor series approximations to the variance and the actual val-
ue of the variance of noisy data. A Taylor series of order one seems to capture most
of the effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 8-3.: Flow chart of the vector Taylor series algorithm of order one for the case of unknown
environmental parameters. Given a small amount of data the environmental parame-
ters are learned in an iterative procedure. . . . . . . . . . . . . . . . . . . . . . 89
Figure 8-4.: Comparison of the VTS algorithms of order zero and one with CDCN. The VTS al-
gorithms outperform CDCN at all SNRs. . . . . . . . . . . . . . . . . . . . . . 91
Figure 8-5.: Comparison of the VTS algorithms of order zero and one with the stereo-based
RATZ and STAR algorithms. The VTS algorithm of order one performs as well as
the STAR algorithm up to a SNR or 10 dB. For lower SNRs only the STAR algorithm
produces lower error rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Figure 8-6.: Comparison of the VTS of order one algorithm with the CDCN algorithm on a real
database. Each points consists of one hundred sentences collected at different distanc-
es from the mouth of the speaker to the microphone. . . . . . . . . . . . . . . . 93
viii

Figure 8-7.: Comparison of several algorithms on the 1994 Spoke 10 evaluation set. The upper
line represents the accuracy on clean data while the lower dotted line represents the
recognition accuracy with no compensation. The RATZ algorithm provides the best
recognition accuracy at all SNRs. . . . . . . . . . . . . . . . . . . . . . . . . . 95
Figure 8-8.: Comparison of the real time performance of the VTS algorithms with the RATZ and
CDCN compensation algorithms. VTS-1 requires about 6 times the computational ef-
fort of CDCN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9

Abstract
The accuracy of speech recognition systems degrades severely when the systems are operated
in adverse acoustical environments. In recent years many approaches have been developed to ad-
dress the problem of robust speech recognition, using feature-normalization algorithms, micro-
phone arrays, representations based on human hearing, and other approaches.

Nevertheless, to date the improvement in recognition accuracy afforded by such algorithms has
been limited, in part because of inadequacies in the mathematical models used to characterize the
acoustical degradation. This thesis begins with a study of the reasons why speech recognition sys-
tems degrade in noise, using Monte Carlo simulation techniques. From observations about these
simulations we propose a simple and yet effective model of how the environment affects the pa-
rameters used to characterize speech recognition systems and their input.

The proposed model of environment degradation is applied to two different approaches to en-
vironmental compensation, data-driven methods and model-based methods. Data-driven methods
learn how a noisy environment affects the characteristics of incoming speech from direct compar-
isons of speech recorded in the noisy environment with the same speech recorded under optimal
conditions. Model-based methods use a mathematical model of the environment and attempt to use
samples of the degraded speech to estimate the parameters of the model.

In this thesis we argue that a careful mathematical formulation of environmental degradation

improves recognition accuracy for both data-driven and model-based compensation proce-
dures.The representation we develop for data-driven compensation approaches can be applied both
to incoming feature vectors and to the stored statistical models used by speech recognition systems.
These two approaches to data-driven compensation are referred to as RATZ and STAR, respective-
ly. Finally, we introduce a new approach to model-based compensation with solution based on vec-
tor Taylor series, referred to as the VTS algorithms.

The proposed compensation algorithms are evaluated in a series of experiments measuring rec-
ognition accuracy for speech from the ARPA Wall Street Journal database that is corrupted by ad-
ditive noise that is artificially injected at various signal-to-noise ratios (SNRs). For any particular
SNR, the upper bound on recognition accuracy provided by practical compensation algorithms is
the recognition accuracy of a system trained with noisy data at that SNR. The RATZ, VTS, and
STAR algorithms achieve this bound at global SNRs as low as 15, 10, and 5 dB, respectively. The
experimental results also demonstrate that the recognition error rate obtained using the algorithms
proposed in this thesis is significantly better than what could be achieved using the previous state
of the art. We include a small number of experimental results that indicate that the improvements
in recognition accuracy provided by our approaches extend to degraded speech recorded in natural
environments as well.
10

We also introduce a generic formulation of the environment compensation problem and its so-
lution via vector Taylor series. We show how the use of vector Taylor series in combination with a
Maximum Likelihood formulation produces dramatic improvements in recognition accuracy.
11

Acknowledgments
There are a lot of people I must recognize for their help in completing this project I started
almost five years ago. First I must thank my thesis advisor, Professor Richard M. Stern. His scien-
tific method has always been an inspiration for me. From him I have learned to ask the “whys” and
“hows” in my research. Also, his excellent writing skills have always considerably improved my
research papers (including this thesis!).

I must also recognize the other members of my thesis committee, Vijaya Kumar, Raj Reddy,
Alejandro Acero and Bishnu Atal. I am indebted to Raj for creating the CMU SPHINX group and
providing the research infrastructure used in this thesis. He has also generously provided the
funding for my final year as a graduate student at CMU. Professor Kumar made excellent sugges-
tions to improve this manuscript. Alejandro Acero is in part responsible for my joining CMU. He
advised me in my early years at CMU and provided me with some of the very first robust speech
recognition algorithms that are the seeds of the work presented here. He also carefully reviewed
this manuscript and suggested several improvements. Finally, Bishnu Atal provided me with a
more general perspective of my work and with valuable insights.

I must also thank the “Ministerio de Educación y Ciencia” from Spain and the Fulbright schol-
arship Program for their generous support during my first four years of stay at CMU.

During these five years at CMU I have had several colleagues and friends that have helped me
in several ways. Evandro Gouvea has been willing to help in any kind of research experiment. The
SNR plots I present in this thesis were pioneered by him in the summer of 1994. Matthew Siegler
is one of the major contributors to the creation of the Robust Speech Group. His efforts in main-
taining our software, directory structures, and other issues have made my experiments infinitely
easier. I am also grateful to Sam-Joo Doh for his help in proofreading the final versions of this
document. Eric Thayer and Ravi Mosur have been the core of the SPHINX-II system. It is only fair
to say that without them the SPHINX-II system would not exist.

My good friend Bhiksha Raj deserves special mention. His arrival to our group made a big dif-
ference in my research. In a way my research changed dramatically (to the better) as a result of my
collaboration and interactions with Bhiksha Raj. I have lost track of the many discussions we have
had at 3 am. The algorithms in this thesis are the result of many discussions (over dinner and lots
of beers) with him. He is also responsible for the real-time experiments reported in the thesis.

My friend Daniel Tapias, from Telefónica I+D, also has had a big influence in my work. His
approach to research is relentless. He has showed me how little by little, any concept, no matter
how difficult it is, can be mastered. Some of the derivations using the EM algorithm are based on
a presentation he gave here at CMU in 1995.

I must mention too some other members of the SPHINX group for their occasional help and
12

advice. Bob Weide, Sunil Isar, Lin Chase, Uday Jain, and Roni Rosenfeld have always been there
when needed. I am also thankful to my office mates, Rich Buskens, Mark Stahl and Mark Bearden
for their friendship over the years.

John Hampshire and Radu Jasinschi deserve special mention. They have been a model of sci-
entific honesty and integrity. I am lucky to have them as friends.

Finally, I want to dedicate this thesis to my parents, Pedro José and María Dolores, and sisters,
Belén and Salud, for their love and support. They have always been there when I needed them the
most. They have always encouraged me to follow my dreams.

Last but not least, I also want to dedicate this thesis to Carolina, my future wife, for her support
and love. She has endured my busy schedule over the years always cheering me up when I felt dis-
appointed. I could not think of a better person with whom to share the rest of my life.
13

Chapter 1
Introduction
The goal of errorless continuous speech recognition systems has remained unattainable over
the years. Commercial systems have been developed to handle small to medium vocabularies with
moderate performance. Large vocabulary systems able to handle 10,000 to 60,000 words have been
developed and demonstrated under laboratory conditions. However, all these systems suffer sub-
stantial degradations in recognition accuracy when there is any kind of difference between the con-
ditions in which the system is trained and the conditions in which the systems is finally tested.

Among other causes, differences between training and testing conditions can be due to:

• the speaking style

• the linguistic content of the task
• the environment
This dissertation focuses on the latter problem, environmental robustness.

1.1. Thesis goals

In recent years the field of environmental robustness has gained wide acceptance as one of the
primary areas of research in the speech recognition field. Several approaches have been studied
[e.g. 26, 52]. Microphone arrays [14, 53], auditory-based representations of speech features [51,
16], approaches based on filtering of features [2, 20] and other algorithms have been studied and
showed to increase the recognition accuracy.

Some of the most successful approaches to environmental compensation have been based on
modifying the feature vectors that are input to a speech recognition system or modifying the statis-
tics that are at the heart of the internal models used by recognition systems. These modifications
may be based on empirical comparisons of high-quality and degraded speech data, or they may be
based on analytical models of the degradation. Empirically-based methods tend to achieve faster
compensation while model-based methods tend to be more accurate.

In this dissertation we show that speech recognition accuracy can be further improved by mak-
ing use of more accurate models of degraded speech than had been used previously. We apply our
techniques to both empirically-based methods and model-based methods, using a variety of opti-
Chapter 1: Introduction 14

mal estimation procedures.

We also believe that the development of improved environmental compensation procedures is

facilitated by a rigorous understanding of how noise and filtering affect the parameters used to
characterize speech recognition systems and their input. Toward this end we attempt to elucidate
the nature of these degradations in an intuitive fashion.

Another major effort in this thesis is that of experimentation at different signal-to-noise ratios
(SNRs). Traditional environmental techniques have generally been tested at high SNRs where as
we will show most environmental techniques achieve similar recognition results. Hence the relative
merit of a particular environmental compensation technique can be better explored by looking at a
complete range of SNRs.

The goals of this thesis include:

• Development of compensation procedures that approach the recognition accuracy of fully re-
trained systems (that have been trained with data from the same environment as the testing
set).

• Development of a useful generic formulation of the problem of environmental robustness.

• Presentation of a generic characterization of the effects of the environment on the distribu-

tions of the cepstral vectors of clean speech along with a simple model to characterize these
effects.

• Presentation of a unified formulation for data-driven compensation of incoming feature vec-

tors as well as the internal statistical representation used by the recognition system.

• Presentation of applications of this unified formulation for two particular cases:

• the Multivariate-Gaussian-Based Cepstral Normalization (RATZ) algorithm that com-

pensates incoming feature vectors.

• the Statistical Reestimation (STAR) compensation algorithm that compensates internal

distributions.

• Presentation of an improved general approach to model-based compensation and its applica-

tion, the Vector Taylor Series (VTS) compensation algorithm.
Chapter 1: Introduction 15

• Demonstration that a more detailed mathematical formulation of the problem of environment

compensation results in greater recognition accuracy and flexibility in implementation than
previous methods.

1.2. Dissertation Outline

The thesis outline is as follows. Chapter 2 provides a brief description of the CMU SPHINX-
II speech recognition system, and it describes the databases used for training and experimentation.

In Chapter 3 we describe some relevant previously-developed environment compensation

techniques, and we introduce the compensation techniques that form the core of this thesis. We also
discuss the major differences between the compensation algorithms proposed in this thesis and the
previous ones.

In Chapter 4 we study the effects of the environment on the distributions of log spectra of clean
speech by using simulated data. We also discuss reasons for the degradation in recognition accura-
cy introduced by the environment.

In Chapter 5 we present a unified view of data-driven environmental compensation methods.

We show how approaches that modify the feature vectors of noisy input cepstra and approaches
that modify the internal distributions representing the cepstra of clean speech can be described
within a common theoretical framework.

In Chapter 6 we present the RATZ family of algorithms, which modify incoming feature vec-
tors. We describe in detail the mathematical structure of the algorithms, and we present experimen-
tal results exploring some of the dimensions of the algorithm.

In Chapter 7 we present the STAR algorithms, which modify the mean vectors and covariance
matrices of the distributions used by the recognition system to model speech. We describe the al-
gorithms and present experimental results. We present comparisons of the STAR and RATZ algo-
rithms and we show that the STAR compensation results in greater recognition accuracy. We also
explore the effect of initialization in the blind STAR algorithms.

In Chapter 8 we introduce the Vector Taylor Series (VTS) approach to robust speech recogni-
tion. We present a generic formulation for the problem of model-based environment compensation,
and we introduce the use of vector Taylor series as a more tractable approximation to characterize
Chapter 1: Introduction 16

the environmental degradation. We present a mathematical formulation of the algorithm and con-
clude with experimental results.

Finally, Chapter 9 contains our results and conclusions as well as suggestions for future work.
17

Chapter 2
The SPHINX-II Recognition System
Since the environmental adaptation algorithms to be developed will be evaluated in the context
of continuous speech recognition, this chapter provides an overview of the basic structure of the
recognition system used for the experiments described in this thesis. Most of the algorithms devel-
oped in this thesis are independent of the recognition engine used, and in fact they can be imple-
mented as completely separate modules. Hence, the results and conclusions of this thesis should
be applicable to other recognition systems.

The most important topic of this chapter is a description of various aspects of the SPHINX-II
recognition system. We also summarize the databases used for evaluation in the thesis.

2.1. An Overview of the SPHINX-II System

SPHINX-II is a large-vocabulary, speaker-independent, Hidden Markov Model (HMM)-based
continuous speech recognition system, like its predecessor, the original SPHINX system. SPHINX
was developed at CMU in 1988 [31, 32] and was one of the first systems to demonstrate the feasi-
bility of accurate, speaker-independent, large-vocabulary continuous speech recognition.

Figure 2-1 shows the fundamental structure of the SPHINX-II [22] system. We describe the
functions of each block briefly.

2.1.1. Signal Processing

Almost all speech recognition systems use a parametric representation of speech rather than the
waveform itself as the basis for pattern recognition. The parameters usually carry the information
about the short-time spectrum of the signal. SPHINX-II uses mel-frequency cepstral coefficients
(MFCC) as static features for speech recognition [11]. First-order and second-order time deriva-
tives of the cepstral coefficients are then obtained, and power information is included as a fourth
feature.

In this thesis we will use the cepstrum and log spectrum signal feature representations for en-
vironment compensation procedure. In each section we will clearly define which features we use
and the reason for them.
Chapter 2: The SPHINX-II System 18

Testing
Training Data
Data

Signal Processing
Feature
Codebook

Signal Processing

HMM
Senone
Re-estimation

Multipass Search
VQ Clustering and
Quantization

Lexicon

Senonic Semi-
Continuous HMM

Language
Model

SPHINX-II Training SPHINX-II Testing

Figure 2-1. Block diagram of SPHINX-II.

The front end of SPHINX-II is illustrated in Figure 2-2. We summarize this feature extraction
procedure as follows:
Chapter 2: The SPHINX-II System 19

1. The input speech signal is digitized at a sampling rate of 16 kHz.

2. A pre-emphasis filter H ( z ) = 1 – 0.97z –1 is applied to the speech samples. The pre-emphasis

filter is used to reduce the effects of the glottal pulses and radiation impedance [38] and to
focus on the spectral properties of the vocal tract.

3. Hamming windows of 25.6-ms duration are applied to the pre-emphasized speech samples
at an analysis rate (frame rate) of 100 windows/sec.

4. The power spectrum of the windowed signal in each frame is computed using a 512-point
DFT.

5. 40 mel-frequency spectral coefficients (MFSC) [11] are derived based on mel-frequency

bandpass filters using 13 constant-bandwidth filters from 100 Hz to 1 kHz and 27 constant-
Q filters from 1 kHz to 7 kHz.

6. For each 10-ms time frame, 13 mel-frequency cepstral coefficients (MFCCs) are computed
using the cosine transform, as shown in Equation (2.1)

39
xt [ k ] = ∑ X t [ i ] cos [ k ( i + 1 ⁄ 2 ) ( π ⁄ 40 ) ] 0 ≤ k ≤ 12 (2.1)
i=0

where X t [ i ] represents the log-energy output of the ith mel-frequency bandpass filter at time frame

t and x t [ k ] represents the kth cepstral vector component at time frame t. Note that unlike other

speech recognition systems [e.g. 57], the x t [ 0 ] cepstrum coefficient here is the sum of the log spec-
tral band energies as opposed to the logarithm of the sum of the spectral band energies. The rela-
tionship between the cepstrum vector and the log spectrum vector can be expressed in matrix form
as

xt [ 0 ] Xt[0]
... ...
xt [ k ] = d k, i Xt[i] (2.2)
... ...
x t [ 12 ] X t [ 39 ]
d k, i = cos [ k ( i + 1 ⁄ 2 ) ( π ⁄ 40 ) ]
Chapter 2: The SPHINX-II System 20

where d k, i is a 13x40 dimensional matrix.

7. The derivative features are computed from the static MFCCs as follows,

(a) Differenced cepstral vectors consist of 40-ms and 80-ms differences with 24 coefficients

∆ x t [ k ] = x t + 2 [ k ] – x t – 2 [ k ] ,1 ≤ k ≤ 12
(2.3)
∆ x' t [ k ] = x t + 4 [ k ] – x t – 4 [ k ] ,1 ≤ k ≤ 12

(b) Second-order differenced MFCCs are then derived in similar fashion, with 12 dimen-
sions.

∆∆ x t [ k ] = ∆x t + 1 [ k ] – ∆x t – 1 [ k ] ,1 ≤ k ≤ 12 (2.4)

x t [ 0 ] = x t [ 0 ] – max { x i [ 0 ] }
∆ xt [ 0 ] = xt + 2 [ 0 ] – xt – 2 [ 0 ] (2.5)
∆∆ x t [ 0 ] = ∆x t + 1 [ 0 ] – ∆x t – 1 [ 0 ]

Thus, the speech representation uses 4 sets of features including: (1) 12 Mel-frequency cepstral
coefficients (MFCC); (2) 12 40-ms differenced MFCC and 12 80-ms differenced MFCC; (3) 12
second-order differenced cepstral vectors; and (4) power, 40-ms differenced power, and second-
order differenced power. These features are all assumed to be statistically independent for mathe-
matical and implementational simplicity.

2.1.2. Hidden Markov Models

In the context of statistical methods for speech recognition, hidden Markov models (HMM)
have become a well known and widely used statistical approach to characterizing the spectral prop-
erties of frames of speech. As a stochastic modeling tool, HMMs have an advantage of providing
a natural and highly reliable way of recognizing speech for a wide variety of applications. Since
the HMM also integrates well into systems incorporating information about both acoustics and se-
mantics, it is currently the predominant approach for speech recognition. We present here a brief
summary of the fundamentals of HMMs. More details about the fundamentals of HMMs can be
found in [6, 7, 25, 32, 34].
Chapter 2: The SPHINX-II System 21

Speech Waveform

Pre-emphasis

25.6-ms Hamming Window

Discrete Fourier Transform

Mel-frequency
bandpass filtering

cosine transform

normalized power &

40-ms & 80-ms 2nd-order
Cepstrum 40-ms diff. power &
diff. cepstrum diff. cepstrum 2nd-order diff. power

Figure 2-2. Block diagram of SPHINX-II’s front end.

Hidden Markov models are a “doubly stochastic process” in which the observed data are
viewed as the result of having passed the true (hidden) process through a function that produces the
second process (observed). The hidden process consists of a collection of states (which are pre-
sumed abstractly to correspond to states of the speech production process) connected by transi-
tions. Each transition is described by two sets of probabilities:

• A transition probability, which provides the probability of making a transition from one
state to another.

• An output probability density function, which defines the conditional probability of observ-
Chapter 2: The SPHINX-II System 22

ing a set of speech features when a particular transition takes place. For semicontinuous
HMM systems (such as SPHINX-II) or fully continuous HMMs [27], pre-defined continuous
distribution functions are used for observations that are multi-dimensional vectors. The con-
tinuous density function most frequently used for this purpose is the multivariate Gaussian
mixture density function.

The goal of the decoding (or recognition) process in HMMs is to determine a sequence of (hid-
den) states (or transitions) that the observed signal has gone through. The second goal is to define
the likelihood of observing that particular event given a state determined in the first process. Given
the definition of hidden Markov models, there are three problems of interest:

• The Evaluation Problem: Given a model and a sequence of observations, what is the prob-
ability that the model generated the observations? This solution can be found using the for-
ward-backward algorithm [47, 9].

• The Decoding Problem: Given a model and a sequence of observations, what is the most
likely state sequence in the model that produced the observation? This solution can be found
using the Viterbi algorithm [55].

• The Learning Problem: Given a model and a sequence of observations, what should the
model’s parameters be so that it has the maximum probability of generating the observations?
This solution can be found using the Baum-Welch algorithm (or the forward-backward algo-
rithm) [9, 8].

2.1.3. Recognition Unit

An HMM can be used to model a specific unit of speech. The specific unit of speech can be a
word, a subword, or a complete sentence or paragraph. In large-vocabulary systems, HMMs are
usually used to model subword units [6, 30, 10, 29] such as phonemes, while in small-vocabulary
systems HMMs tend to be used to model the words themselves.

SPHINX-II is based on phonetic models because the amount of training data and storage re-
quired for word models is enormous. In addition, phonetic models are easily trainable. However,
the phone model is inadequate to capture the variability of acoustical behavior for a given phoneme
in different contexts. In order to enable detailed modeling of these co-articulation effects, triphone
models were proposed [50] to account for the influence by the neighboring contexts.

Because the number of triphones to model can be too large and because triphone modeling does
Chapter 2: The SPHINX-II System 23

not take into account the similarity of certain phones in their effect on neighboring phones, a pa-
rameter-sharing technique called distribution sharing [24] is used to describe the context-depen-
dent characteristics for the same phones.

2.1.4. Training

B1 B2 M E1 E2

B1 B2 M E1

Figure 2-3. The topology of the phonetic HMM used in the SPHINX-II system.

SPHINX-II is a triphone-based HMM speech recognition system. Figure 2-3 shows the basic
structure of the phonetic model for HMMs used in SPHINX-II. Each phonetic model is a left-to-
right Bakis HMM [7] with 5 distinct output distributions.

SPHINX-II [22] uses a subphonetic clustering approach to share parameters among models.
The output of clustering is a pre-specified number of shared distributions, which are called senones
[24]. The senone, then, is a state-related modeling unit. By using subphonetic units for clustering,
the distribution-level clustering provides more flexibility in parameter reduction and more accurate
acoustic representation than the model-level clustering based on triphones.

The training procedure involves optimizing HMM parameters given an ensemble of training
data. An iterative procedure, the Baum-Welch algorithm [9,47] or forward-backward algorithm, is
employed to estimate transition probabilities, output distributions, and codebook means and vari-
ances under a unified probabilistic framework.

The optimal number of senones varies from application to application. It depends on the
amount of available training data and the number of triphones present in the task. For the training
corpus and experiments in this thesis which will be described in Section 2.2., we use 7000 senones
Chapter 2: The SPHINX-II System 24

for the ARPA Wall Street Journal task with 7200 training sentences.

2.1.5. Recognition
For continuous speech recognition applied to large-vocabulary tasks, the search algorithm
needs to apply all available acoustic and linguistic knowledge to maximize recognition accuracy.
In order to integrate the use of all the lexical, linguistic and acoustic source of knowledge SPHINX-
II uses a multi-pass search approach [5]. This approach uses the Viterbi algorithm [55] as a fast-
match algorithm, and a detailed re-scoring approach to the N-best hypotheses [49] to produce the
final recognition output.

SPHINX-II is designed to exploit all available acoustic and linguistic knowledge in three
search phases. In Phase One a Viterbi beam search is applied in a left-to-right fashion, as a forward
search, to produce best-matched word hypotheses, along with information about word ending times
and associate scores, using detailed between-word triphone models and a bigram language model.

In Phase Two, a Viterbi beam search is performed in a right-to-left fashion, as a backward

search, to generate all possible word beginning times and scores using the between-word triphone
models and a bigram language model. In Phase Three, an A* search [44] is used to produce a set
of N-best hypotheses for the test utterance by combining the results of Phases One and Phase Two
for reescoring by a trigram language model.

2.2. Experimental Tasks and Corpora

To evaluate the algorithms proposed in this thesis, we have used the Wall Street Journal (WSJ)
database. This database consists of several subsets encompassing different vocabulary sizes, envi-
ronmental conditions, foreign accents, etc [45].

2.2.1. Testing databases

We have focused on the 5,000-word vocabulary size “clean” speech subset of the database sub-
mitted for evaluation on 1993. This subset was recorded over a Sennheiser close-talking, headset-
mounted noise-cancelling microphone (HMD-410 or HMD-414). These data were contaminated
with additive white Gaussian noise at several SNRs. The testing set contained 215 sentences with
a total number of 4066 words belonging to 10 different native speakers of American English. The
testing set contained five male and five female speakers. In all of our experiments we provide the
Chapter 2: The SPHINX-II System 25

performance of the compensation algorithms at SNRs from zero to thirty decibels. A reasonable
upper bound on the recognition accuracy of each compensation algorithm is the recognition accu-
racy of a fully retrained system. The expected lower bound on recognition accuracy is the accuracy
of a system with no compensation enabled.

A second focus of our experiments was the Spoke 10 subset released in 1994 for study of en-
vironment compensation algorithms in the presence of automobile noise, which contains a vocab-
ulary of 5,000-words. The testing set contained 113 sentences with a total number of 1937 words
from 10 different speakers. The automobile noise was collected through an omni-directional mi-
crophone mounted on the drivers’s side sun visor while the car was traveling at highway speeds.
The windows of the car were closed and the air-conditioning was turned on. A single lengthy sam-
ple of the noise was collected so that it could span across the entire test set. The noise was scaled
to three different levels so that the resulting SNR of the noise plus speech equalled three target
SNRs picked by NIST and unknown to the speech recognition system. A one-minute sample of
noise was provided for adaptation although in our experiments this was not used.

2.2.2. Training database

In our studies we use the official speaker-independent training corpus, referred to as “WSJ0-
si_trn”, supplied by the National Institute of Standards and Technology (NIST) containing 7240
utterances of read WSJ text. These sentences were recorded using a Sennheiser close-talking noise-
cancelling headset. The training utterances are collected from 84 speakers. All these data are used
to build a single set of gender independent HMMs to train the SPHINX-II system.

2.3. Statistical Significance of Differences in Recognition Accuracy

The algorithms we propose in this dissertation are evaluated in terms of recognition accuracy
observed using a common standardized corpus of speech material for testing and training. Recog-
nition accuracy is obtained by comparing the word-string output produced from the recognizer
(hereafter referred to as the hypothesis) to the word string that had been actually uttered (hereafter
referred to as the reference). Based on a standard nonlinear string-matching program, word error
rate is computed as the percentage of errors including insertion, deletion and substitution of words.

It is important to know whether any apparent difference in performance of the algorithms is sta-
tistically significant in order to interpret experimental results in an objective manner. Gillick and
Chapter 2: The SPHINX-II System 26

Cox [17] proposed the use of the McNemar’s test and a matched-pairs test for determining the sta-
tistical significance of recognition results. Recognition errors are assumed to be independent in the
McNemar’s test or independent across different sentence segments in the matched-pairs test, re-
spectively. Picone and Doddington [46] also advocated a phone-mediated alternative to the conven-
tional alignment of reference and hypothesis word strings for the purpose of analyzing word errors.
NIST has implemented several automated benchmark scoring programs to evaluate statistical sig-
nificance of performance differences between systems.

Many results produced by different algorithms do not differ from each other by a very substan-
tial margin, and it is to our interest to know whether these performance differences are statistically
significant. A straightforward solution is to apply the NIST “standard” benchmark scoring program
to compare a pair of results.

In general, the statistical significance of a particular performance improvement is closely relat-

ed to the differences in error rates, and it also varies with the number of testing utterances, the task
vocabulary size, the positions of errors, the grammar, and the range of overall accuracy. Neverthe-
less, for the ARPA WSJ0 task with the SPHINX-II system, a rule of thumb we have observed is
that performance improvement is usually considered to be significant if the absolute difference in
accuracy between two results is greater than 1%. There is usually no statistically significant differ-
ence if differences in error rate are less than 0.7%.

2.4. Summary
In this chapter, we reviewed the overall structure of SPHINX-II that will be used as the primary
recognition system in our study. We also described the training and evaluation speech corpora that
we employ to evaluate the performance of our algorithms in the following chapters. The primary
vehicle for research of this thesis will be the WSJ0 5,000-word 7240 sentences training corpora
from which a single set of gender independent 7000-senonic HMMs will be constructed. Using
these models we will evaluate the environmental compensation algorithms proposed in this thesis
with the 1993 WSJ0 5,000-word clean speech evaluation set, adding white noise at different SNRs
Chapter 2: The SPHINX-II System 27

and with the 1994 5,000-word Spoke 10 multimicrophone evaluation set.

Chapter 2: The SPHINX-II System 28
29

Chapter 3
Previous work in Environmental Compensation
In this section we discuss some of the latest and most relevant algorithms in environmental
compensation that relate to those presented in this thesis. They all share similar assumptions, name-
ly:

• they use a cepstrum feature vector representation of the speech signal

• they use a statistical characterization of the feature vector based on mixtures of Gaussians,
Vector Quantization (VQ), or even a more detailed modelling provided by HMMs.

In this chapter we briefly review these algorithms and relate them to the algorithms proposed
in this thesis. Finally, we also present a taxonomy of the algorithms presented in this dissertation.

3.1. Cepstral Mean Normalization

Cepstral mean normalization (CMN) [37] is perhaps one of the most effective algorithms con-
sidering its simplicity. It is a de facto standard in most large vocabulary speech recognition sys-
tems. The algorithm computes a long-term mean value of the feature vectors and subtracts this
mean value from each of the vectors. In this way it assures that the mean value of the incoming
feature stream is zero. This helps in reducing the variability of the data and also allows for a simple
and yet effective channel and speaker normalization. The procedure is applied to both the training
and testing data. In the experiments described in this thesis it is always used just before training or
recognition but always after compensation.

However, the effectiveness of CMN is limited when the environment is not adequately modeled
by a linear channel. For those situations more sophisticated algorithms are needed.

3.2. Data-driven compensation methods

As we will describe in Chapter 4, the effect of the environment on cepstra and log spectra of
clean speech feature vectors can frequently be modeled by additive correction factors. These cor-
rection factors can be computed using “examples” of how clean speech vectors are affected by the
environment. A simple and effective way of directly observe the effect of the environment on
speech feature vectors is through the use of simultaneously recorded clean and noisy speech data,
also known as “stereo-recorded” data.
Chapter 3: Previous Work in Environment Compensation 30

In this section we describe the Probabilistic Optimum Filtering (POF) [43] and the Fixed Code-
word Dependent Cepstral Normalization (FCDCN) [1] algorithms as primary examples of these
approaches.

3.2.1. POF
The POF algorithm [43] uses a VQ description of the distribution of clean speech cepstrum
combined with a codeword-dependent multidimensional transversal filter. The role of the multidi-
mensional transversal filter is to capture temporal correlations across previous and past frame vec-
tors.

POF learns the parameters of the VQ cell-dependent transversal filters (matrices) for each of
the cells and for each environment through the minimization of an error function defined as the
norm of the difference between the clean speech vectors and the noisy speech vectors. To do so it
requires the use of stereo data.

One of the limitations of the POF algorithm is its dependency on stereo-recorded speech data.
The use of a weak statistical representation of the clean speech cepstrum distributions as modeled
by VQ makes the algorithm usable only when stereo-recorded data are available. Even if large
amounts of noisy adaptation speech data are available, the algorithm cannot make use of them
without parallel recordings of clean speech.

3.2.2. FCDCN
FCDCN [1] is similar in structure to POF. It uses a VQ representation for the distribution of
clean speech cepstrum vectors and computes a codeword-dependent correction vector based on si-
multaneously recorded speech data. It suffers the same limitations as POF. The use of a weak sta-
tistical representation of cepstral vector distributions of clean speech based on VQ makes also the
algorithm dependent on the availability of stereo-recorded data.

3.3. Model-based compensation methods

The previous compensation methods did not make any assumption about the environment since
its effect on the cepstral vectors was directly modeled through the use of simultaneously-recorded
clean and noisy speech. In this section we present methods that assume a model of the environment
characterized by additive noise and linear filtering that do not require simultaneously recorded
Chapter 3: Previous Work in Environment Compensation 31

speech data.

We describe the Codeword Dependent Cepstrum Normalization method (CDCN) [1] and the
Parallel Model Combination method (PMC) [15]. Although strictly speaking it is not a model-
based method, we also describe the Maximum Likelihood Linear Regression method (MLLR) [33]
because of its similarity to PMC.

3.3.1. CDCN
CDCN [1] models the distributions of cepstra of clean speech by a mixture of Gaussian distri-
butions. It analytically models the effect of the environment on the distributions of clean speech
cepstrum. The algorithm works in two steps. The goal of the first step is to estimate the values of
the environmental parameters (noise and channel vectors) that maximize the likelihood of the ob-
served noisy cepstrum vectors. In the second step Minimum Mean Squared Estimation (MMSE) is
applied to find the unobserved cepstral vector of clean speech given the cepstral vector of noisy
speech.

The algorithm works on a sentence-by-sentence basis, needing only the sentence to be recog-
nized to estimate environmental parameters.

3.3.2. PMC
The Parallel Model Combination approach [15] assumes the same model of the environment
used by CDCN. Assuming perfect knowledge of the noise and channel vectors, it tries to transform
the mean vectors and covariance matrices of the acoustical distributions of the HMMs to make
them more similar to the ideal distributions of the cepstra of the noisy speech. Several possible al-
ternatives exists to transform the mean vectors and covariance matrices.

However, all these versions of the PMC algorithm need previous knowledge of the noise and
channel vectors. Its estimation is done beforehand using different approximations. Typically sam-
ples of isolated noise are needed to adequately estimate the parameters of PMC.

3.3.3. MLLR
MLLR [33] was originally designed as a speaker adaptation method but it has also proved to
be effective for environment compensation [56]. The algorithm updates the mean vectors and co-
variance matrices of the distributions of cepstra for clean speech as modeled by the HMMs, given
Chapter 3: Previous Work in Environment Compensation 32

a small amount of adaptation data. It finds a set of transformation matrices that maximize the like-
lihood of observing the noisy cepstrum vectors.

The algorithm does not make use of any implicit model of the environment. It only assumes
that the mean vectors of the clean speech cepstrum distributions can be rotated and shifted by the
environment.

3.4. Summary and discussion

The data-driven algorithms proposed in this thesis (RATZ and STAR) use richer and more de-
tailed models of the distributions of cepstral vectors of speech than the POF and FCDCN algo-
rithms. They generally use mixtures of Gaussian distributions in which all the parameters (priors,
mean vectors, covariance matrices) are learned optimally from training data consisting of clean
speech. The use of mixtures of Gaussians allows a representation of the feature space in which vec-
tors are described as weighted sums of mixtures. In contrast, POF and FCDCN use a vector quan-
tization representation of the feature space in which hard decisions are made to assign a vector as
belonging to a particular codeword.

The data-driven methods we will propose are designed within a maximum likelihood frame-
work. This framework facilitates easy extension to new conditions such as:

• unavailability of simultaneously recorded clean and noisy speech data.

• interpolation between different environmental conditions.

The model-based techniques proposed in this thesis (VTS-0 and VTS-1) also use a maximum
likelihood framework. The formulation we introduce for these algorithms allows for an easy gen-
eralization to any kind of environment. They are able to work with a single sentence. Unlike PMC
they do not need any extra information to estimate the noise or the channel.

Finally, all the algorithms proposed in this thesis assume a model of the effect of the environ-
ment on the distributions of cepstra or log spectra of clean speech that is closer to reality. Changes
in both mean vectors and covariance matrices are modelled.

3.5. Algorithms proposed in this thesis

Figure 3-1 summarizes the environmental compensation techniques presented in this thesis. We
first consider the data-driven compensation methods, RATZ and STAR, which make use of simul-
Chapter 3: Previous Work in Environment Compensation 33

Compensation Algorithms

Data-driven algorithms Model-based algorithms

Chapter 5 (Compensation of log spectra)
VTS Chapter 7

Compensation of incoming Compensation of HMM

cepstra Chapter 6 distributions (cepstra
∆-cepstra, ∆2-cepstra)
STAR Chapter 7

RATZ SNR RATZ

Section 6.1 Section 6.2
Stereo-based Blind
STAR STAR

SNR-dep SNR-dep
Stereo-based Blind stereo-based VTS-1 VTS-0
RATZ blind RATZ
RATZ RATZ

Figure 3-1 Outline of the algorithms for environment compensation presented in this thesis.

taneously-recorded clean and noisy speech data. In the case of RATZ we also explore the effect of
SNR-dependent structures in modeling the distributions of clean speech cepstrum. In the case of
STAR we explore the benefits of compensation of the distributions of cepstra of clean speech.

For the model-based VTS methods we will explore the use of vector Taylor series approxima-
tion to the complex analytical function describing the environments.
Chapter 3: Previous Work in Environment Compensation 34
35

Chapter 4
Effects of the Environment on
Distributions of Clean Speech
In this chapter we describe and discuss several sources of degradation that the environment im-
poses on speech recognition systems. We will analyze how these sources of degradation affect the
statistics of clean speech and how they impact on speech recognition accuracy. We will also explain
why speech recognition systems degrade in performance in the presence of unknown environ-
ments. Finally, based on this explanation we propose two generic solutions to the problem.

4.1. A Generic Formulation

Throughout this thesis we will assume that any environment can be characterized by the fol-
lowing equation

y = x + g ( x, a 1, a 2, .... ) (4.1)

where x represents the clean speech log spectral or cepstral vector that characterize the speech, and
a 1 , a 2 and so on represent parameters (vectors, scalars, matrices,...) that define the environment.

While this generic mathematical formulation can be particularized for many cases, in the rest
of this section we present a detailed analysis for the case of convolutional and additive noise. In
this case we can assume the environment can be modeled as represented in Figure 4-1. This kind

“Clean” speech “Dirty” speech

x[m] h[m] + y[m]

Linear filtering
n[m]

Additive noise
Figure 4-1: A model of the environment for additive noise and filtering by a linear channel.
x [ m ] represents the clean speech signal, n [ m ] represents the additive noise and y [ m ]
represents the resulting noisy speech signal. h [ m ] represents a linear channel
Chapter 4: Effects of the Environment on Distributions of Clean Speech 36

of environment was originally proposed by Acero [1] and later used by Liu [35] and Gales [15]. It
is a reasonable model of the environment.

The effect of the noise and filtering on clean speech in the power spectral domain can be rep-
resented as

2
PY ( ωk ) = H ( ωk ) P X ( ωk ) + P N ( ωk ) (4.2)

where P Y ( ω k ) represents the spectra of the noisy speech y [ m ] , P N ( ω k ) represents the power

2
spectra of the noise n [ m ] , P X ( ω k ) the power spectra of the clean speech x [ m ] , H ( ω k ) the

power spectra of the channel h [ m ] , and ω k represents a particular mel-spectral band.

To transform to the log spectral domain we apply the logarithm operator at both sides of the
expression (4.2) resulting in

2
10 log 10 ( P Y ( ω k ) ) = 10 log 10 ( H ( ω k ) P X ( ω k ) + P N ( ω k ) ) (4.3)

and defining the noisy speech, noise, and clean speech,

y [ k ] = 10 log 10 ( P Y ( ω k ) )
n [ k ] = 10 log 10 ( P N ( ω k ) )
(4.4)
x [ k ] = 10 log 10 ( P X ( ω k ) )
2
h [ k ] = 10 log 10 ( H ( ω k ) )

results in equations

x[k ] + h[k ] n[k ]

 ---------------------------
10
- ----------
10
10 log 10 ( P Y ( ω k ) ) = 10 log 10  10 + 10 
 
(4.5)
n[k ] – x[k ] – h[k ]
 --------------------------------------------
10
y [ k ] = x [ k ] + h [ k ] + 10 log 10  1 + 10 
 
Chapter 4: Effects of the Environment on Distributions of Clean Speech 37

2
Where h [ k ] is the logarithm of H ( ω k ) , and similar relationships exist between n [ k ] and

P N ( ω k ) , x [ k ] and P X ( ω k ) , and y [ k ] and P Y ( ω k ) .

Following our initial formulation proposed in Equation (4.1) for the case of additive noise and
linear channel this expression can be written as

y [ k ] = x [ k ] + g ( x [ k ], h [ k ], n [ k ] ) (4.6)

or in vector form

y = x + g ( x, h, n ) (4.7)

where

n[k ] – x[k ] – h[k ]

 --------------------------------------------
10
g ( x [ k ], h [ k ], n [ k ] ) = h [ k ] + 10 log 10  1 + 10  (4.8)
 

From these equations we can formulate a relationship between the log spectral features repre-
senting clean and noisy speech. However, the relation is cumbersome and not so easy to under-
stand. A simple way to understand how noise and channel affect speech is by observing how the
statistics of clean speech are transformed by the environment.

Let’s assume that the log spectral vectors that characterize clean speech follow a Gaussian dis-
tribution N x ( µ x, Σ x ) and that the noise and channel are perfectly known. These circumstances

produce a transformation of random variables leading to a new distribution for the log spectra of
noisy speech equal to

n– y –1
L⁄2 
------------
 10 
p ( y µ x, Σ x, n, h ) =  ( 2π Σ x )  I – 10  
   
n– y T n– y (4.9)
1  ------------ 
–1    ------------  
10 10
– ---  y – h – µ x + 10log 10  i – 10   Σ x  y – h – µ x + 10  10log 10  i – 10 
2      
e
Chapter 4: Effects of the Environment on Distributions of Clean Speech 38

where L is the dimensionality of the log spectral vector random variable, i is the unitary vector, and
I is the identity matrix.

The resulting distribution p ( y ) is clearly non-Gaussian. However, since most speech recogni-
tion systems assume Gaussian distributions, a Gaussian distribution assigned to p ( y ) can still cap-
ture part of the effect of the environment on speech statistics. To characterize the Gaussian
distributions we need only compute the mean vector and covariance matrices of these new distri-
butions. The new mean vector can be computed as

µ y = E ( x + f ( x, h, n ) ) = µ x + E ( f ( x, h, n ) )

µ y = µx + ∫ g ( x, h, n )N x ( µ x, Σ x )dx
X (4.10)
n–x–h
 ----------------------
10
µ y = µ x + h + ∫ 10 log 10  i + 10  N x ( µ x, Σ x )dx
X  

and the covariance matrix can be computed as

T T
Σ y = E ( ( x + g ( x, h, n ) ) ( x + g ( x, h, n ) ) ) – µ y µ y
T T T T
Σ y = E ( x x ) + E ( g ( x, h, n ) g ( x, h, n ) ) + 2E ( x g ( x, h, n ) ) – µ y µ y
T T
Σ y = Σ x + µ x µ x + ∫ ( g ( x, h, n ) g ( x, h, n ) )N x ( µ x, Σ x )dx + (4.11)
X
T T
+ 2 ∫ ( x g ( x, h, n ) )N x ( µ x, Σ x )dx – µ y µ y
X

In both equations the integrals don’t have a closed form solution. Therefore we must use nu-
merical methods to estimate the mean vector and covariance matrix of the distribution.

Since in most cases the noise must be estimated and is not known a priori, a more realistic mod-
el would be to assign a Gaussian distribution N n ( µ n, Σ n ) to the noise. To simplify the resulting

equations we can also assume that the noise and the speech are statistically independent. The prob-
ability density function (pdf) of the log spectrum of the noisy speech under these assumptions can-
not be computed analytically, but it can be estimated using Monte-Carlo methods.
Chapter 4: Effects of the Environment on Distributions of Clean Speech 39

The mean vector and the covariance matrix of the log spectrum of the noisy speech will have
the form

µ y = µ x + ∫ N x ( µ x, Σ x ) ∫ g ( x, h, n )N n ( µ n, Σ n )dxdn (4.12)
X N

T T
Σ y = Σ x + µ x µ x + ∫ N x ( µ x, Σ x ) ∫ ( g ( x, h, n ) g ( x, h, n ) ) N n ( µ n, Σ n )dndx +
X N
(4.13)
T T
+ 2 ∫ N x ( µ x, Σ x ) ∫ ( x g ( x, h, n ) ) N n ( µ n, Σ n )dndx – µ yµ y
X N

Again, as in the previous case the resulting equations have no closed-form solution, and we can
only estimate the resulting mean vector and covariance matrix through numerical methods.

4.2. One dimensional simulations using artificial data

To visualize the resulting distributions of noisy data we present results obtained with artificially
produced one-dimensional data. These artificial data can simulate the simplified case of a log spec-
trum feature vector of speech using a single dimension.

2
Simulated clean data were produced according to a Gaussian distribution N x ( µ x, σ x ) and con-

2
taminated with artificially produced noise according to a Gaussian distribution N n ( µ n, σ n ) . A

channel h was also defined. The artificially-produced clean data, noise, and channel were com-
bined according to Equation (4.5) producing a noisy data set Y = { y 0, y 1, ..., y N – 1 } . From this
noisy data set we directly estimated the mean and variance of a maximum likelihood (ML) fit as

N–1 N–1
1 2 1 2
µ y, ML = ----
N ∑ yi σ y, ML = ----
N ∑ ( y i – µ y, ML ) (4.14)
i=0 i=0

We also computed a histogram of the noisy data set to estimate directly its real distribution. The
contamination was performed at different speech-to-noise ratios defined by µ x – µ n .

Figure 4-2 shows an example of the original distribution of the clean signal, the original noisy
signal after going through a transformation of Equation (4.5) and producing the distribution of
Chapter 4: Effects of the Environment on Distributions of Clean Speech 40

Equation (4.9), and the best Gaussian fit to the distribution of the noisy signal. The SNR for the
noisy signal is 3 dB.
0.20

0.16

0.12

0.08

0.04

-5 0.000 5 10 15 20 25
Log Power Spectra (dB)
Figure 4-2: Estimate of the distribution of noisy data via Monte-Carlo simulations. The con-
tinuous line represents the pdf of the clean signal. The dashed line represents the real pdf
of the noise-contaminated signal. The dotted line represents the Gaussian-approximated
pdf of the noisy signal. The original clean signal had a mean of 5.0 and a variance of 3.0,
the channel was set to 5.0. The mean of the noise was set to 7.0 and the variance of the
noise was set to 0.5.

0.32
0.28
0.24
0.20
0.16
0.12
0.08
0.04

-5 0.000 5 10 15 20 25
Log Power Spectra (dB)
Figure 4-3: Estimate of the distribution of noisy signal at a lower SNR level via Monte-Carlo
methods. The continuous line represents the pdf of the clean signal. The dashed line repre-
sents the real pdf of the noise-contaminated signal. The dotted line represents the Gauss-
ian-approximated pdf of the noisy signal. The original clean signal had a mean of 5.0 and a
variance of 3.0. Channel was set to 5.0. The mean of the noise was set to 9.0 and the vari-
ance of the noise was set to 0.5.
Chapter 4: Effects of the Environment on Distributions of Clean Speech 41

Figure 4-3 and Figure 4-4 show similar results as those of Figure 4-2 but with different SNRs.
0.56
0.48
0.40
0.32
0.24
0.16
0.08

-5 0.000 5 10 15 20 25
Log Power Spectra (dB)
Figure 4-4:Estimate of the distribution of noisy signal at a lower SNR level via Monte-Carlo
methods. The continuous line represents the pdf of the clean signal. The dashed line repre-
sent the real pdf of the noise-contaminated signal. The dotted line represents the Gaussian-
approximated pdf of the noisy signal. The original clean signal had a mean of 5.0 and a vari-
ance of 3.0. The channel was set to 5.0. The mean of the noise was set to 11.0 and the vari-
ance of the noise was set to 0.5.

The noisy signal in Figure 4-3 had a SNR of 1 dB and in Figure 4-4 it had a SNR of -1 dB.

We can see how for some cases (e.g. Figure 4-2) the pdf of the resulting noisy signal can be
bimodal and clearly non-Gaussian. However, if the noise mean is higher (9.0) (e.g. Figure 4-2) the
bimodality of the resulting noisy signal pdf is lost and we can also observe the compression of the
resulting noisy signal distribution. We also see that a Gaussian fit to the noisy signal captures some
of the effect of this particular environment on the clean signal.

In general, the effect of this particular type of environment on speech statistics can be reason-
ably accurately modelled as a shift in the mean of the pdfs and a decrease in the variance of the
resulting pdf. Notice, however, that this compression of the variance will happen only if the vari-
ance of the distribution of the noise is smaller than the variance of the distribution of the clean sig-
nal. The change in variance can be represented by a additive factor in the covariance matrix.

4.3. Two dimensional simulations with artificial data

Speech representation are normally multidimensional. In the case of the SPHINX-II system a
40-dimensional log spectral representation is created which is then transformed to a 13-dimension-
Chapter 4: Effects of the Environment on Distributions of Clean Speech 42

al cepstral representation. Because of the inherent multidimensionality of speech representations,

it is worthwhile to visualize how the noise affects the multidimensional statistics of speech. Since
log spectral features are highly correlated, the behavior of frequency band ω i is very similar to that

of bands ω i + 1 and ω i – 1 . As a result the covariance matrix of these features is non-diagonal.

A simple way to visualize the effects of noise on correlated distributions of speech features is
to use a simplified representation with 2 dimensions. In the following sections we repeat the same
simulations but with a set of simulated 2-dimensional artificial data where both the covariance ma-
trices of both the signal and noise are non-diagonal.

Figure 4-5 shows the pdf of the clean signal assuming a two-dimensional Gaussian distribution

−2

−3

−4

−5

−6

−7

−8

−12 −10 −8 −6 −4 −2 0 2

Figure 4-5: Contour plot of the distribution of the clean signal.

with mean and covariance matrix

µ x = –5 Σx = 3 0.5 (4.15)
–5 0.5 6

This signal is passed through a channel h and added to a noise with mean vector and covari-
Chapter 4: Effects of the Environment on Distributions of Clean Speech 43

−2

−4

−6

−8

−12 −10 −8 −6 −4 −2 0 2 4 6 8

Figure 4-6: Contour plot of the distribution of the clean signal and of the noisy signal.

ance matrix

h = 5 µn = 5 Σn = 0.05 – 0.01 (4.16)

5 5 – 0.01 0.025

Figure 4-6 shows the pdf of the resulting noisy signal. As we can see the resulting distribution
has been shifted and compressed. A maximum likelihood (ML) Gaussian fit to the resulting noisy
data yields these estimates for the mean vector and covariance matrix

µ y, ML = 6.24 Σ y, ML = 0.35 0.006 (4.17)

6.30 0.006 0.53

The above conclusions apply for the case in which the variance of the distribution of the noise
is smaller than the variance of the distribution of the clean signal. If the clean signal has a very nar-
row distribution or if the noise distribution is very wide, we will observe an expansion of the pdf
of the resulting noisy signal.

4.4. Modeling the effects of the environment as correction factors

From the one- and two-dimensional simulations with artificial data sets we can clearly observe
the following effects
Chapter 4: Effects of the Environment on Distributions of Clean Speech 44

• The pdf of the signal is shifted according to the SNRs.

• The pdf of the signal is expanded if Σ x < Σ n or compressed if Σ x > Σ n and this expansion/
compression depends on the SNR.
• The pdf of the noisy signal are clearly non Gaussian, under particular conditions they exhibit
a bimodal shape.
However, since most speech recognition systems model speech statistics as mixtures of Gaus-
sians it is still convenient to keep modelling these resulting noisy speech pdfs as Gaussians. A sim-
ple way to achieve this is by

• modelling the mean of the noisy speech distribution as the mean of the clean signal plus a
correction vector
µ y = µx + r (4.18)

• modelling the covariance matrix of the noisy speech distribution as the covariance matrix of
the clean speech plus a correction covariance matrix
Σ y = Σx + R (4.19)

The R matrix will be symmetric and will have positive or negative elements according to the
value of the covariance matrix of the noise compared with that of the clean signal.

This approach will be extensively used in the RATZ family of algorithms (Chapter 6).

Another alternative is to model the environment effects by attempting to solve some of the
equations presented in this chapter via Taylor series approximations (e.g. (4.10), (4.11), (4.12) and
(4.13)). This kind of approach will be exploited in the VTS family of algorithms (Chapter 8).

4.5. Why do speech recognition system degrade in performance in the

presence of unknown environments?
For completeness it is helpful to review some of the ways in which the degradations to the sta-
tistics of speech described earlier in this chapter can degrade recognition accuracy.

We frame the speech recognition problem as one of pattern classification [12]. The simplest
type of pattern classification is illustrated in Figure 4-7. In this case we assume two classes, H1 and
H2, each represented by a single Gaussian distribution, and each with equal a priori probabilities
Chapter 4: Effects of the Environment on Distributions of Clean Speech 45

accept H1 accept H2
R1 R2

γx

decision boundary at cross over point

Figure 4-7: Decision boundary for a single two-class classification problem. The shaded re-
gion represents the probability of error. An incoming sample xi will be classified as belong-
ing to class H1 or H2 comparing it to the decision boundary γ . If xi is less than γ it will be
x x
classified as belonging to class H1, otherwise it will be classified as belonging to class H2.

and variances. In this case the maximum a posteriori (MAP) decision rule is expressed as a ratio
of likelihoods

choose H 1 if P [ class=H 1 x ] ≥ P [ class=H 2 x ]

(4.20)
choose H 2 if P [ class=H 1 x ] ≤ P [ class=H 2 x ]

Solving the previous equation will yield a decision boundary of the form

µ x, H + µ x, H
γ x = ----------------------------------
1 2
- (4.21)
2

this decision boundary is guaranteed to minimize the probability of error P e , and therefore it pro-
vides the optimal classifier. It specifies in effect a decision boundary that will be used by the system
to classify incoming signals. However, as we have seen, the effects of the environment on that data
are three fold:

• The resulting distributions change shape, becoming non-Gaussian.

• Even assuming a Gaussian shape, the means of the noisy distributions are shifted.
• Even assuming a Gaussian shape, the covariance matrices of the noisy distributions are com-
pressed or expanded depending on the relation between noise and clean signal covariance
Chapter 4: Effects of the Environment on Distributions of Clean Speech 46

matrices.
For every particular environment the distributions of the noisy data change, and therefore the opti-
mal decision boundaries also change. We call the optimal noisy decision boundary γ y . If the clas-

sification is done using a decision boundaries γ x that had been derived on the basis of the statistics

of the clean signal, the result is suboptimal in that the minimal probability of error will not be ob-
tained.

Figure 4-8 illustrates how the error region is composed of two areas, the optimal error region

accept H1 accept H2
R1 R2
p( y H 1) p( y H 2)

wrong decision boundary γ x γ y optimal decision boundary

Figure 4-8: When the classification is performed using the wrong decision boundaries, the
error region is composed of two terms, the optimal one assuming the optimal decision
boundary is known (banded area above), and an addition term introduced by using the
wrong decision boundary (shaded are above between γ and γ ).
x y

that would be obtained using the γ y decision boundary, and a secondary error surface that is pro-

duced using the γ x decision boundary.

This explanation suggests two possible ways to compensate for the effects of the environment
on speech statistics:

• Modify the statistics of the speech recognition system (mean vectors and covariance matri-
ces) to make them more similar to those of the incoming noisy speech. In other words, make
sure that the “classifiers” as represented by the HMMs, use the optimal decision boundaries.
This will be the optimal solution as it guarantees minimal probability of error.
• Modify the incoming noisy speech data to produce a pseudo-clean speech data set whose dis-
tributions resemble as much as possible the clean speech distributions. This solution does not
modify any of the parameters (mean vectors and covariance matrices) of the HMMs. In ad-
Chapter 4: Effects of the Environment on Distributions of Clean Speech 47

dition, this will be a suboptimal solution since at the most it can yield similar results to the
previous solution.
Notice however that the first approach would imply the use of non-Gaussian distributions and
the solution of difficult equations. Therefore only approximations to the first approach will be pro-
posed in this thesis.

In this thesis we provide several algorithms that attempt to approximate both approaches. In
addition, Appendix A illustrates the difference between both types of approaches and discusses
why algorithms that modify the statistics seem to perform better than those that attempt to com-
pensate the incoming noisy speech data.

4.6. Summary
In this chapter we have analyzed the effect of the environment on the statistics of clean speech,
considering the particular case of additive noise and linear filtering. We have provided several sim-
ulations using one- and two-dimensional data as a tool to explore how speech statistics are modified
by the environment. From these experiments we have concluded that the effects of the environment
on speech statistics are:

• The resulting distributions are not always Gaussian.

• The means of the resulting distributions are shifted.
• The covariance matrices of the resulting distributions are compressed or expanded.
As we have already mentioned the compression of the variances really depends on whether the
variance of the distribution of the noise is smaller than that of the clean signal. In our experiments
with real data we have observed the compression described in the simulation presented in this chap-
ter. However, there might be situations in which this does not happen. The model of environmental
degradation proposed in this chapter does indeed allow for an expansion of the variance.

It is also important to mention the fact that the resulting non-Gaussian distributions of the noisy
signal are not necessarily problematical for speech modeling. The feature vectors representing the
clean speech signal are probably not Gaussian in nature to begin with. We just fit a Gaussian model
to them for convenience. Therefore applying a Gaussian model to the distributions of the feature
vectors of noisy speech is as valid as when done with clean speech feature vectors.
Chapter 4: Effects of the Environment on Distributions of Clean Speech 48

The simulations described in this chapter have been done assuming log spectra feature vectors.
It is important to mention that all the conclusions derived in this chapter are also valid for the case
of cepstral feature vector as the relationship between cepstral vectors and log spectral vectors is
linear.

Finally, we have introduced an explanation for why speech recognition systems fail in the pres-
ence of unknown environments using some examples with one dimensional data.
49

Chapter 5
A Unified View of Data-Driven
Environment Compensation
In the previous chapter we have seen how the effect of the environment on the distributions of
the log-spectra or cepstra of clean speech can be modeled as shifts to the mean vectors and com-
pressions or expansions to the covariance matrices, or in more general terms as correction factors
applied to the mean vectors and covariance matrices.

In this chapter we present some techniques that attempt to learn these effects directly from sam-
ple data. In other words, by observing sets of noisy and clean vectors these techniques try to learn
the appropriate correction factors. This approach does not explicitly assume any model of the en-
vironment, but uses empirical observations to infer environmental characteristics. Any environ-
ment that only affects speech features by shifting their means and compressing their variances will
be compensated by the techniques proposed in this chapter.

In previous years several techniques have been proposed to address environmental robustness
using direct observation of the environment without structural assumptions about the nature of the
degradation. Some of these techniques have attempted to address the problem by applying com-
pensation factors to the cepstrum vectors (FCDCN [1], POF [43]) and others have applied compen-
sation factors to the means and covariances of the distributions of HMMs [35]. However, both kind
of approaches have been presented as separate techniques. In this chapter we present a unified view
of data-driven compensation methods. We will argue that techniques that attempt to modify the in-
coming cepstra vectors and techniques that modify the parameters (means and variances) of the
distributions of the HMMs are two aspects of the same theory. We will show how both these two
approaches to environment compensation share the same basic assumptions and internal structure
but differ in whether they modify the incoming cepstra vectors or the classifier statistics.

In this chapter we first introduce a unified view of environmental compensation and then pro-
vide solutions for the compensation factors. We then particularize these solutions for the case of
stereo adaptation data. After this we particularize the generic solutions for two family of tech-
niques; the Multivariate Gaussian Based Cepstral Normalization (RATZ) techniques [39, 40], and
the Statistical Reestimation (STAR) techniques [40, 41]. Their performance for several databases
Chapter 5: A Unified View of Data-Driven Environment Compensation 50

and experimental conditions is compared in the next chapters.

5.1. A unified view

We model the distribution of the tth vector x t of a cepstral vector sequence of length T,
X = { x 1, x 2, ....., x T } , generically as

K
p ( xt ) = ∑ ak ( t )N x ( µ x, k, Σ x, k ) (5.1)
k=1

i.e., as a summation of K Gaussian components with a priori probabilities that are time dependent.
Assuming that each vector x t is independent and identically distributed (i.i.d.) the overall likeli-
hood for the full observation sequence X becomes

T T
l( X ) = ∏ p ( xt ) = ∏ ∑ ak ( t )N x ( µ x, k, Σ x, k ) (5.2)
t=1 t=1 k

The above likelihood equation offers a double interpretation. For the methods that modify the
incoming features, we set the a priori probabilities a k ( t ) to be independent of t; this defines a con-
ventional mixture of Gaussian distributions for the entire training set of cepstral vectors. Another
possible interpretation is that the cepstral speech vectors are represented by a single HMM state
with K Gaussians that transitions to itself with probability unity.

The interpretation is slightly different for the methods that modify the distributions represent-
ing speech (HMMs). We assume that the cepstral speech vectors are emitted by a HMM with K
states in which each state emission p.d.f. is composed of a single Gaussian. In this case the a k ( t )
terms define the probability of being in state k at time t. Under these assumptions the expression of
the likelihood for the full observation sequence is exactly as expressed in Equation (5.2).

The assumption of a single Gaussian per state is not limiting at all. Specifically, any state with
a mixture of Gaussians for emission probabilities can also be represented by multiple states where
the output distributions are single Gaussians and where the incoming transition probabilities of the
state are the same as the a priori a k ( t ) probabilities of the Gaussians and the exiting transition
probability is unity. The next figure illustrates this idea
Chapter 5: A Unified View of Data-Driven Environment Compensation 51

N x ( µ x, 1, Σ x, 1 ) 1
a1 ( t )

a2 ( t ) 1
N x ( µ x, 2, Σ x, 2 )
a3 ( t ) 1
N x ( µ x, 3, Σ x, 3 )

Figure 5-1A state with a mixture of Gaussians is equivalent to a set of states where each of
them contains a single Gaussian and the transition probabilities are equivalent to the a priori
probabilities of each of the mixture Gaussians.
These probabilities a k ( t ) depend only on the Markov chain topology and are represented in the
form

T
a ( t ) = a 1 ( t ) a 2 ( t ) .... a K ( t ) = A
t
π (5.3)

t
where A represents the transition matrix, A represents the transition matrix after t transitions and
π represents the initial state probability vector of the HMM. The N x ( µ x, k, Σ x, k ) terms of Equation
(5.1) refer to the Gaussian densities associated with each of the K states of the HMM.

As we have mentioned before the changes to the mean vectors and covariance matrices can be
expressed as

µ y, k = r k + µ x, k Σ y, k = R k + Σ x, k (5.4)

where r k and R k represent the corrections applied to the mean vector and covariance matrix respec-
tively of the kth Gaussian. These two correction factors account for the effect of the environment on
the distributions of the cepstra of clean speech. Finding these two correction factors will be the first
step of the RATZ and STAR algorithms.

5.2. Solutions for the correction factors k and k

The solutions for the correction factors will depend upon the availability of stereo data, i.e., si-
multaneous recordings of clean and noisy adaptation data. We first describe the generic solution
for the case in which only samples of noisy speech are available, the so-called “blind” case. We
Chapter 5: A Unified View of Data-Driven Environment Compensation 52

then describe how to particularize these solutions for the stereo case.

In this section we make extensive use of the EM algorithm [13]. Our goal is not to describe the
EM algorithm itself but to show its use in the solution for the correction parameters r k and R k . Ref-
erences [21] and [13] give a detailed and full explanation of the EM algorithm.

5.2.1. Non-stereo-based solutions

We begin with an observed set of T noisy vectors Y = { y 1, y 2, ....., y T } , and assuming that these
vectors have been produced by a probability density function

K
p ( yt ) = ∑ ak ( t )N y ( µ y, k, Σ y, k ) (5.5)
k=1

which is a summation of K Gaussians where each component relates to the corresponding kth Gaus-
sian of clean speech according to Equation (5.4). We define a likelihood function l ( Y ) as

T T
l(Y ) = ∏ p ( yt ) = ∏ ∑ ak ( t )N y ( µ y, k, Σ y, k ) (5.6)
t=1 t=1 k

We can also express l ( Y ) in terms of the original parameters of clean speech and the correction
terms r k and R k

T T
l ( Y ) = l ( Y r 1, ..., r K , R 1, ..., R K ) = ∏ p ( yt ) = ∏ ∑ ak ( t )N y ( rk + µ x, k, Rk + Σ x, k ) (5.7)
t=1 t=1 k

For convenience we express the above equation in the logarithm domain defining the log like-
lihood L ( Y ) as

T T
L ( Y ) = log ( l ( Y ) ) = ∑ log ( p ( y t ) ) = ∑ log  ∑ ak ( t )N y ( rk + µ x, k, Rk + Σ x, k ) (5.8)
t=1 t=1 k

Our goal is to find the complete set of K terms r k and R k that maximize the likelihood (or log
likelihood). As it turns out there is no direct solution to this problem and some indirect method is
necessary. The Expectation-Maximization (EM) algorithm is one of this methods.
Chapter 5: A Unified View of Data-Driven Environment Compensation 53

The EM algorithm defines a new auxiliary function Q ( φ, φ ) as

Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] (5.9)

where the ( Y , S ) pair represent the complete data, composed of the observed data Y (the noisy vec-
tors) and the unobserved data S (indicating which Gaussian/state produced an observed data vec-
tor). This equation can be easily related to the Baum-Welch equations used in Hidden Markov
Modelling. The φ symbol represents the set of parameters (K correction vectors and K correction
matrices) that maximize the observed data

φ = { r 1, ..., r K , R 1, ..., R K } (5.10)

The φ symbol represents the same set of parameters as φ but with different values. The basis
of the EM algorithm lies in the fact that given two sets of parameters φ and φ , if Q ( φ, φ ) ≥ Q ( φ, φ ) ,
then L ( Y , φ ) ≥ L ( Y , φ ) . In other words, maximizing Q ( φ, φ ) with respect to the φ parameters is
guaranteed to increase the likelihood L ( Y , φ ) .

Since the unobserved data S are represented by a discrete random variable (the mixture index
in our case), Equation (5.9) can be expanded as

T K
p ( y t, s t ( k ) φ )
Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] = ∑ ∑ --------------------------------
p ( yt φ )
- log ( p ( y t, s t ( k ) φ ) ) (5.11)
t = 1k = 1

hence

T K
L L
Q ( φ, φ ) = ∑ ∑ P [ st ( k ) y t, φ ] { log a k ( t ) – --- log ( 2π ) – --- log R k + Σ x, k +
2 2
t = 1k = 1 (5.12)
1 T –1
– --- ( y t – µ x, k – r k ) ( Σ x, k + R k ) ( y t – µ x, k – r k ) }
2

where L is the dimensionality of the cepstrum vector. The expression can be further simplified to

T K
L
Q ( φ, φ ) = constant + ∑ ∑ P [ st ( k ) y t, φ ] { – --- log R k + Σ x, k +
2
t = 1k = 1 (5.13)
1 T –1
– --- ( y t – µ x, k – r k ) ( Σ x, k + R k ) ( y t – µ x, k – r k ) }
2
Chapter 5: A Unified View of Data-Driven Environment Compensation 54

To find the φ parameters we simply take derivatives and set equal to zero,

T
d –1
d rk
Q ( φ, φ ) = ∑ P [ st ( k ) y t, φ ] ( Σ x, k + R k ) ( y t – µ x, k – r k ) = 0
t=1 (5.14)
T
d T
–1
Q ( φ, φ ) = ∑ P ( yt st ( k ), φ ) { ( Σ x, k + Rk ) –( yt – µ x, k – rk ) ( yt – µ x, k – rk ) } = 0
d Rk t=1

hence

∑ P [ s t ( k ) y t, φ ] y t
rk = t=1
-----------------------------------------------
T
- – µ x, k (5.15)

∑ P [ st ( k ) y t, φ ]
t=1

T
T
∑ P [ st ( k ) y t, φ ] ( ( y t – µ x, k – r k ) ( y t – µ x, k – r k ) )
=1
T
- – Σ x, k
R k = t---------------------------------------------------------------------------------------------------------------------------- (5.16)

∑ P [ st ( k ) y t, φ ]
t=1

Equation (5.15) and Equation (5.16) for the basis of an iterative algorithm. The EM algorithm guar-
antees that each iteration increases the likelihood of the observed data.

5.2.2. Stereo-based solutions

When simultaneously recorded clean and noisy speech (also called stereo-recorded) adaptation
data are available the information about the environment is encoded in the stereo pairs. By observ-
ing how each clean speech vector x i is transformed into a noisy speech vector y i we can learn the
correction factors more directly.

We can readily assume that the a posteriori probabilities P [ s t ( k ) y t, φ ] can be directly esti-
mated by P [ s t ( k ) x t ] . This is equivalent to assuming that the probabilities of a vector being pro-
duced by each of the underlying classes do not change due to the environment. We call this
assumption a posteriori invariance. This assumption, although it is not strictly correct, seems to
Chapter 5: A Unified View of Data-Driven Environment Compensation 55

be a good approximation. If we expand the P [ s t ( k ) y t, φ ] and P [ s t ( k ) x t ] terms we obtain

P [ s t ( k ) ] p ( y t s t ( k ), φ )
P [ s t ( k ) y t, φ ] = ---------------------------------------------------------------
K
- (5.17)

∑ p ( yt st ( j ), φ )P [ st ( j ) ]
j=1

P [ st ( k ) ] p ( xt st ( k ) )
P [ s t ( k ) x t, φ ] = ---------------------------------------------------------
K
(5.18)

∑ p ( xt st ( j ) )P [ st ( j ) ]
j=1

for the two above expressions to be equal each of the terms in the summation must be equal.
This would imply that each Gaussian is shifted exactly the same amount and not compressed. How-
ever, at high SNR conditions the shift for each Gaussian is quite similar and the compression in the
variances is almost non-existent. Therefore, at high SNR the a posteriori invariance is almost valid
and at lower SNR conditions it is less valid. In addition, this assumption avoids the need to iterate
in Equation (5.15) and Equation (5.16).

A second assumption we make is that the µ x, k term can be replaced by x t . This change has been
experimentally proven to improve recognition performance. After these changes the resulting esti-
mates of the correction factors are

–1
 T  T 
r k =  ∑ P [ s t ( k ) x t ] ( y t – x t )  ∑ P [ s t ( k ) x t ] (5.19)
t = 1  t = 1 
T
T
∑ P [ st ( k ) xt ] ( ( yt – xt – rk ) ( yt – xt – rk ) )
(5.20)
=1
T
- – Σ x, k
R k = t-----------------------------------------------------------------------------------------------------------
∑ P [ st ( k ) xt ]
t=1

5.3. Summary
In this chapter we have presented a unified view of adaptation based data-driven environmental
compensation algorithms. We have showed how both approaches of environment compensation
share the same algorithmic structure and differ only in minor details. Solutions have been provided
Chapter 5: A Unified View of Data-Driven Environment Compensation 56

for the case of stereo- and non stereo-based solutions. In the next chapters we particularize this dis-
cussion for the RATZ and STAR family of algorithms.
57

Chapter 6
The RATZ Family of Algorithms
In this chapter we particularize the generic solutions described in Chapter 3 for the Multivari-
ate-Gaussian-Based Cepstral Normalization (RATZ) family of algorithms. We present an overview
of the algorithms and describe in detail the steps followed in RATZ-based compensation. We de-
scribe the generic stereo based and blind versions of the algorithms as well as the SNR dependent
versions of RATZ and Blind RATZ. In addition we also describe the interpolated versions of the
RATZ algorithms. Finally, we provide some experimental results using several databases and en-
vironmental conditions, followed by our conclusions.

6.1. Overview of RATZ and Blind RATZ

The algorithms work in the three following stages which are described as follows:

• Estimation of the statistics of clean speech

• Estimation of the statistics of noisy speech (stereo and non stereo cases)

• Compensation of noisy speech

Estimation of the statistics of clean speech. The pdf for the features of clean speech is mod-
eled as a mixture of multivariate Gaussian distributions. Under these assumptions the distribution
of the cepstral vectors of clean speech can be written as

K
p ( xt ) = ∑ a k N x ( µ x, k , Σ x, k ) (6.1)
k=1

which is equivalent to Equation (5.1) for the case of a k ( t ) being time independent.

The a k , µ x, k and Σ x, k represent respectively the a priori probabilities, mean vector and covari-
ance matrix of each multivariate Gaussian mixture element k. These parameters are learned
through traditional maximum likelihood EM methods [21]. The covariance matrix is assumed to
be diagonal.

Estimation of the statistics of noisy speech. As we mentioned in Chapter 4 we will assume

that the effect of the environment on speech statistics can be accurately modeled by applying the
proper correction factors to the mean vectors and covariance matrices. Therefore, our goal will be
Chapter 6: The RATZ Family of Algorithms 58

to compute these correction factors to estimate the statistics of noisy speech.

If we particularize the solutions of Chapter 5 we will have several solutions, depending on whether
or not stereo data are available for learning the correction factors.

If stereo data are not available we obtain these solutions

∑ P[k y t, φ ] y t – µ x, k
t=1
- – µ x, k
= ------------------------------------------------------
T
(6.2)

∑ P[k y t, φ ]
t=1

T
T
∑ P[k y t, φ ] ( ( y t – µ x, k – r k ) ( y t – µ x, k – r k ) )
=1
T
- – Σ x, k
R k = t------------------------------------------------------------------------------------------------------------------- (6.3)

∑ P[k y t, φ ]
t=1

where the P [ k y t, φ ] term represents the a posteriori probability of an observation noisy vector y t
being produced by Gaussian k given the set of estimated correction parameters φ . The solutions are
iterative and each iteration guarantees greater likelihood.

If stereo data are available we obtain these solutions

∑ P[k xt ] ( yt – xk )
t=1
r k = ------------------------------------------------
T
- (6.4)

∑ P[k xt ]
t=1

∑ P[k
T
x t ] ( ( y t – µ x, k – r k ) ( y t – µ x, k – r k ) )
=1
T
- – Σ x, k
R k = t------------------------------------------------------------------------------------------------------------- (6.5)

∑ P[k xt ]
t=1

In this case the solution is non iterative.

Compensation of noisy speech. The solution for the correction factors { r 1, ..., r K , R 1, ..., R K }
helps us learn the new distributions of noisy speech cepstral vectors. With this knowledge we can
Chapter 6: The RATZ Family of Algorithms 59

estimate what correction factor to apply to each incoming noisy vector y to obtain an estimated
clean vector x̂ . To do so we use a Minimum Mean Squared Error (MMSE) estimator

x̂ MMSE = E ( x y ) = ∫ x. p ( x y )dx (6.6)

Since this equation requires the knowledge of the marginal distribution p ( x y ) and this might
be difficult or impossible to get in closed form (see Chapter 4), some simplifications are needed. In
particular we will first assume that the x vector can be presented as x = y – r ( x ) . In this case Equa-
tion (6.2) simplifies to

K
x̂ MMSE = y – ∫ r ( x ) p ( x y )dx = y – ∫ ∑ r ( x ) p ( x, k y )dx
X Xk = 1
K
= y– ∑ P[k y ] ∫ r ( x ) p ( x k, y )dx
k=1 X
(6.7)
K
≅ y– ∑ rk P [ k y ] ∫ p ( x k, y )dx
k=1 X
K
≅ y– ∑ rk P [ k y]
k=1

where we have further simplified the r ( x ) expression to r k . This is equivalent to assuming that the
r ( x ) term can be well approximated by a constant value within the region in which p ( x k, y ) has
a significant value.

6.2. Overview of SNR-Dependent RATZ and Blind-RATZ

RATZ and Blind RATZ as described in [39] and [40] used a standard Gaussian mixture distri-
bution as defined in Equation (5.1) where the a k ( t ) terms are taken to be independent of time to
model the statistics of cepstra of clean speech. While this model is valid, it is constrained in that it
resolves all the cepstral components equally, i.e., into the same number of Gaussians. In particular,
the frame energy parameter, 0, has the same resolution in terms of number of Gaussians as the oth-
er cepstral parameters.

Previous work by Acero [1] and Liu [35] suggests that a more fine modelling of 0 is necessary.
In Acero’s FCDCN algorithm [1] the covariance matrix used to model the distribution of the cep-
Chapter 6: The RATZ Family of Algorithms 60

strum is diagonal. This gives higher importance to the 0 coefficient and makes a more detailed
modeling of the 0 coefficient very helpful. Inspired by Liu and Acero’s previous work, SNR-
RATZ and SNR-BRATZ use a more structured model for the distribution whereby the number of
Gaussians used to define the 0 statistics can be different from the number used for the other cep-
stral components.

T
Figure 6-1 illustrates this idea for a two-dimensional vector x = [ x 0 x 1 ] . In this example the

pdf has the following structure

1 2

∑ ∑ P [ j i ]N x
2 2
p( x) = P [ i ]N x0 i ( µ x0, i, σ x0, i ) 1 i, j ( µ x 1, i, j, σ x1, i, j ) (6.8)
i=0 j=0

In this example there are two distributions for the 0 component and each 0 has three associ-
ated marginal distributions in the x 1 variable. Note that the means of the mixtures that comprise
the pdf of 1 associated with each mixture component of 0 can take on any value, and they gener-
ally differ for different values of 0.

Figure 6-1: Contour plot illustrating joint pdfs of the structural mixture densities for the com-
ponents 0 and 1

The SNR dependent versions of RATZ also work in the three basic steps mentioned in Section
4.1, namely

• Estimation of the statistics of clean speech

• Estimation of the statistics of noisy speech (stereo and non stereo cases)

• Compensation of noisy speech

Chapter 6: The RATZ Family of Algorithms 61

Estimation of the statistics of clean speech. In our implementation of the SNR-RATZ and
T T
blind SNR-RATZ algorithms we split the cepstral vector in two parts, x = [ x 0 x 1 ] , where x 1

is itself a vector composed of the x 1 , x 2 ..., x L – 1 components of the original cepstral vector.

The resulting distribution for the clean cepstrum vectors has the following structure

M N

∑ ∑ ai, j N x
2
p( x) = a i N x0 i ( µ x0, i, σ x0, i ) i, j ( µ x 1, i, j, Σ x1, i, j ) (6.9)
1
i=0 j=0

The means, variances, and a priori probabilities of the individual Gaussians are learned by
standard EM methods [13]. Appendix C summarizes the resulting solutions.

Estimation of the statistics of noisy speech. As in the conventional RATZ algorithm we as-
sume that the effect of the environment on the means and variances of the cepstral distributions of
clean speech can be adequately modelled by additive correction factors.

The resulting means and variances of the statistics of noisy speech are

2 2
µ y0, i = r i + µ x0, i σ y0, i = R i + σ x0, i
(6.10)
µ y1, i, j = r i, j + µ x1, i, j Σ y1, i, j = R i, j + Σ x1, i, j

where r i , R i , r i, j and R i, j represent the correction factors applied to the clean speech cepstrum dis-
tribution.

To obtain these correction factors we can use techniques very similar to those used in the case
of non-SNR RATZ. Notice that the only difference lies in the structure of p ( x ) . Appendix B gives
a detailed explanation of the procedure used to obtain the optimal correction factors for the stereo-
based and non-stereo-based cases.

Compensation of noisy speech. Once the r i , R i , r i, j and R i, j correction factors are computed
we can apply a MMSE procedure that is similar to the one used in the non-SNR-based RATZ case.
The MMSE estimator will have the form

x̂ MMSE = E ( x y ) = ∫ x. p ( x y )dx (6.11)

X
Chapter 6: The RATZ Family of Algorithms 62

As in the non-SNR-dependent case we will first assume that the vector x can be represented as
x = y – s ( x ) . In this case Equation (6.2) simplifies to

M N
x̂ MMSE = y – ∫ s ( x ) p ( x y )dx = y – ∫ ∑ ∑ s ( x ) p ( x, i, k y )dx
X Xi = 0 j = 0
M N
= y– ∫ ∑ ∑ s( x) p( x y, i, j )P [ i, j y ]dx
Xi = 0 j = 0
(6.12)
M N
≅ y– ∫ ∑ ∑ s i, j p ( x y, i, j )P [ i, j y ]dx
Xi = 0 j = 0
M N
≅ y– ∑ ∑ si, j P [ i, j y]
i=0j=0

where we have further simplified the expression for s ( x ) to s i, j , a vector composed of the concat-
enation of the correction terms r i and r i, j . This is equivalent to assuming that the s ( x ) term can be
well approximated by a constant value within the region in which p ( x i, j, y ) has its maximum, i.e.,
the mean.

6.3. Overview of Interpolated RATZ and Blind RATZ

The previous versions of the RATZ algorithms assumed that the environment where the recog-
nition is going to be performed is known, enabling the RATZ algorithm to make use of previously-
learned correction factors. However, in more realistic conditions this might not be possible. Even
though there might be enough adaptation data to learn the correction factors of a number of envi-
ronments we might not know what environment is presented to us for recognition.

The basic idea of Interpolated RATZ is to estimate the a posteriori probabilities of each of E
possible environments over the whole ensemble of cepstrum vectors for the utterance Y

T
P [ i ] ∏ p ( yt i )
P [ environment=i Y ] = -------------------------------------------------
E
t=1
T
(6.13)

∑ P [ e ] ∏ p ( yt e )
e=1 t=1

The a priori probability of each environment is P [ i ] . We normally assume that all the environ-
ments are equiprobable.
Chapter 6: The RATZ Family of Algorithms 63

The p ( y t i ) terms are defined as

K
p ( yt i ) = ∑ ak N y ( µ x, k + r x, k, i, Σ x, k + R x, k, i ) (6.14)
k=1

where r x, k, i and R x, k, i represent the correction terms for the environment i.

Once the a posteriori probabilities for each of the putative environments are computed we can
use them to weight each of the environment dependent correction factors

E K
x̂ MMSE = y – ∫ ∑ ∑ r ( x ). p ( x, k, e y )dx
Xe = 1k = 1
E K
x̂ MMSE = y – ∑ ∑ ∫ r ( x ). p ( x k, e, y )P [ k e, y ]P [ e y ]dx
e = 1k = 1X
(6.15)
E K
x̂ MMSE ≅ y – ∑ P [ e Y ] ∑ rk, e P [ k e, y ] ∫ p ( x k, e, y )dx
e=1 k=1 X
E K
x̂ MMSE ≅ y – ∑ P[e Y ] ∑ rk, e P [ k e, y ]
e=1 k=1

where we have approximated P [ e y ] by P [ e Y ] , using all the cepstrum vectors in the utterance to
compute the a posteriori probability of the environment.

Similar extensions are also possible for the case of the SNR-dependent RATZ algorithms.

6.4. Experimental Results

In this section we describe several experiments designed to evaluate the performance of the
RATZ family of algorithms. We explore several of the dimensions of the algorithms, such as:

• the impact of SNR dependence on recognition accuracy

• the impact of the number of adaptation sentences on recognition accuracy

• the optimal number of Gaussians

• the impact of interpolation on the performance of the algorithm

The experiments described here are performed on the 5,000-word Wall Street Journal 1993
evaluation set with white Gaussian noise added at several SNR levels. The SNR is computed on a
Chapter 6: The RATZ Family of Algorithms 64

sentence by sentence basis. For each sentence the energy is computed and white artificially gener-
ated Gaussian noise is added at the decided SNR below the signal energy level. In all the experi-
ments with this database, the upper dotted line represents the performance of the system when fully
trained on noisy data while the lower dotted line represents the performance of the system when
no compensation is used.

6.4.1. Effect of an SNR-dependent structure

In this section we compare the possible benefits of an SNR-dependent structure as described in
Section 4.2. To explore the effect of an SNR dependent structure in our experiments we compare
the performance of the RATZ and SNR-RATZ algorithms using the same number of Gaussians.

In the case of SNR-RATZ we use two configurations. The first configuration which we call an
8.32 SNR-RATZ configuration contains 8 x 0 Gaussians and 32 x 0 -dependent Gaussians while the
second one called a 4.16 SNR-RATZ configuration contains 4 x 0 -Gaussians and 16 x 0 dependent
Gaussians. In the case of regular RATZ we also use two configurations. The first configuration con-
tains 256 Gaussians and the second one 64 Gaussians. The correction factors were computed from
the same stereo databases. One hundred adaptation sentences were used to learn the correction fac-
tors.

Figure 6-2 shows the performance for this particular database for all the four mentioned con-
ditions.

As we can see, contrary to previous results by Liu and Acero, a SNR-based structure does not
seem to provide clear benefits in reducing the error rate. In all four configurations the results are
comparable. Perhaps this is due to the fact that Acero and Liu in their FCDCN [1, 35] algorithms
use a weaker model to represent the distributions of clean speech cepstrum based on vector quan-
tization (VQ). In particular their model does not take into account the compression of the distribu-
tions. Another possible reason is that FCDCN uses a diagonal covariance matrix in which all the
elements have the same value. This weights too much the x 0 component of the cepstral vector in
computing likelihoods. Under this conditions, a SNR-dependent structure might have significant
benefits.
Chapter 6: The RATZ Family of Algorithms 65

Word Accuracy (%) 100

60 Retrained
RATZ 8.32
40 RATZ 256
RATZ 4.16
RATZ 64
20 CMN

0 5 10 15 20 25
SNR (dB)
Figure 6-2. Comparison of RATZ algorithms with and without a SNR dependent structure. We compare an
8.32 SNR-RATZ algorithm with a normal RATZ algorithm with 256 Gaussians. We also compare a 4.16 SNR-
RATZ algorithm with a normal RATZ algorithm with only 64 Gaussians.

6.4.2. Effect of the number of adaptation sentences

To explore the effect the number of adaptation sentences has on recognition accuracy we
learned correction factors using a varied number of adaptation sentences. We present results for the
8.32 configuration of SNR-dependent RATZ.

Figure 6-3 demonstrates that even with a very small number of adaptation sentences the RATZ
algorithm is able to compensate for the effect of the environment. In fact the performance seems to
be quite insensitive to the number of adaptation sentences. Only when the number of sentences is
lower than 10 do we observe a decrease in accuracy.

6.4.3. Effect of the number of Gaussian Mixtures

To explore the effect the number of Gaussians has on recognition accuracy we compared sev-
eral RATZ algorithms. In particular we used 256, 64, and 16 stereo RATZ configurations with cor-
rection factor learned from 100 adaptation sentences for our study. Figure 4-4 shows that as the
number of Gaussians increases, the recognition performance increases. The differences are more
significant at lower SNR levels.

6.4.4. Stereo based RATZ vs. Blind based RATZ

In this section we consider not having stereo data to learn the correction factors on recognition
Chapter 6: The RATZ Family of Algorithms 66

Word Accuracy (%) 100

60 Retrained
600 sents
40 100 sents
40 sents
10 sents
20 4 sents
CMN
0 5 10 15 20 25
SNR (dB)
Figure 6-3. Study of the effect of the number of adaptation sentences on a 8.32 SNR dependent RATZ algo-
rithm. We observe that even with only 10 sentences available for adaptation the performance of the algorithm
does not seem to suffer.

100
Word Accuracy (%)

60 Retrained
RATZ 256
40 RATZ 64
RATZ 16
CMN
20

0 5 10 15 20 25
SNR (dB)
Figure 6-4. Study of the effect of the number of Gaussians on the performance of the RATZ algorithms. In
general a 256 configuration seems to perform better than a 64 or 16.

performance. We compare two identical 4.16 SNR-dependent RATZ algorithms with the only dif-
ference being the presence or absence of simultaneously-recorded (“stereo”) clean and noisy sen-
tences in learning the correction factors. The correction factors were trained using one hundred
sentences in each case. Figure 6-5 shows that not having stereo data is detrimental to recognition
Chapter 6: The RATZ Family of Algorithms 67

Word Accuracy (%) 100

60 Retrained
RATZ (stereo)
40 Blind RATZ
CMN

0 5 10 15 20 25
SNR (dB)
Figure 6-5. Comparison of a stereo based 4.16 RATZ algorithm with a blind 4.16 RATZ algorithm. The stereo-
based algorithm outperforms the blind algorithm at almost all SNRs.

accuracy. However when compared to not doing any compensation at all (the CMN case) Blind
RATZ still provides considerable benefits.

6.4.5. Effect of environment interpolation on RATZ

In this section we compare the performance of interpolated versions of RATZ with that of
equivalent not interpolated versions. We compare two 8.32 SNR-dependent RATZ algorithms
where the number of adaptation sentences used to learn the corrections was set to 100. In the stan-
dard version the correction factors are chosen from the correct environment. In the interpolated ver-
sion the correction factors are chosen from a list of possible environments containing correction
factors learned from data contaminated at different SNRs.

Figure 6-6 shows that not knowing the environment has almost no significant effect on the per-
formance of the algorithm. In fact it seems to improve accuracy slightly. If the correct environment
is removed from the list of environments at each SNR the algorithms does not seem to suffer either.

6.4.6. Comparisons with FCDCN

The FCDCN family of algorithms introduced by Acero [1] and further studied by Liu [35] is
compared in this section with the RATZ family of algorithms. FCDCN can be considered to be a
particular case of the RATZ algorithms where a VQ codebook is used to represent the statistics of
Chapter 6: The RATZ Family of Algorithms 68

Word Accuracy (%) 100

60 Retrained
RATZ interpolated (A)
40 RATZ interpolated (B)
RATZ
CMN
20

0 5 10 15 20 25
SNR (dB)
Figure 6-6. Effect of environment interpolation in recognition accuracy. The curve labeled RATZ interpolated
(A) was computed excluding the correct environment from the list of environments. The curve labeled RATZ
interpolated (B) was computing with all environments available for interpolation.

clean speech and the effect of the environment on this codebook is modelled by shifts in the cen-
troids of the codebook. In FCDCN all the centroids in the codebook have the same variance. When
compensation is applied a single correction factor is applied; the one that minimizes the VQ dis-
tortion.

Figure 6-7 compares a RATZ configuration with 256 Gaussians with an equivalent FCDCN
configuration. The same sentences were used to learn the statistics or VQ codebook of clean speech
and the same 100 stereo sentences were used to learn the correction factors. As we can see the
RATZ algorithm outperforms FCDCN at all SNRs. This can be explained by the better representa-
tion used by RATZ to represent clean speech distributions and by the better model used by RATZ
to represent the effect of the environment on clean speech distributions. The difference in accuracy
is more obvious at lower SNRs.

The same experiment was reproduced with a SNR-dependent structure comparing FCDCN
with RATZ and the same result was observed.

6.5. Summary
In this section we have presented the RATZ family of algorithms as a particular case of the uni-
fied approach described in the previous chapter. We have described the RATZ, SNR-dependent
Chapter 6: The RATZ Family of Algorithms 69

100
Word Accuracy (%)
80

60 Retrained
RATZ 256
40 FCDCN 256
CMN

0 5 10 15 20 25
SNR (dB)
Figure 6-7. Effect of environment interpolation on the performance of the RATZ algorithm. In this case we
compare the effect of removing the right environmental correction factors from the list of environments. We can
observe that removing the right environment does not affect the performance of the algorithm.

RATZ, Blind RATZ, and Interpolated RATZ algorithms.

We have explored some of the dimensions of the algorithm including the number of distribu-
tions used to describe the clean speech cepstrum acoustics, the impact of a SNR dependent distri-
bution, the effect of not having stereo data available to learn the environment correction factor, and
the effect of interpolation of the correction factors on recognition performance. Finally we have
compared the RATZ algorithm with the FCDCN algorithm developed by Acero [1].

From the experimental results presented in this chapter we conclude that contrary to previous
results by Liu and Acero [1, 35], an SNR-dependent structure does not provide any improvement
in recognition accuracy. An explanation for this difference in behavior is as follows. The use of di-
agonal covariance matrices in FCDCN in which all the elements are equal gives a disproportionate
weight to the x 0 coefficient in its contribution to the likelihood. Therefore, a partition of the data
according to the x 0 coefficient will reduce its variability and this in turn will reduce its contribution
to the likelihood. In the RATZ family of algorithms we use diagonal covariance matrices with all
the elements learned. This weights all the components of the cepstral vector equally making a ex-
plicit division of the data according to x 0 unnecessary.

Also, contrary to previous results with FCDCN [36], the RATZ family of algorithms seem to
Chapter 6: The RATZ Family of Algorithms 70

be quite insensitive to the number of sentences used to learn the environmental correction factors.
In fact, the algorithms seem to work quite well even with only 10 sentences (or about 70 seconds
of speech). Perhaps the use of a Maximum Likelihood formulation, where each data sample con-
tributes to learn all the parameters, is responsible for this. We have also shown that the use of a
Maximum Likelihood formulation allows for a natural extension of the RATZ algorithms for the
case in which only noisy data is available to learn the correction factors. Unlike FCDCN, which is
based on VQ, the extension of RATZ to work without simultaneously-recorded clean and noisy
data is quite natural. This Maximum Likelihood structure also allowed us to extend naturally the
RATZ algorithms for the case of environment interpolation. The experiments with interpolated
RATZ provided almost no loss in recognition accuracy when compared with the case of known en-
vironment.

In general we can observe how the RATZ algorithms are able to achieve the recognition per-
formance of a fully retrained system up to a SNR of about 15 dB. For lower signal-to-noise ratios
the algorithms provides partial recovery for the degradation introduced by the environment.
71

Chapter 7
The STAR Family of Algorithms
In this chapter we particularize the generic solutions described in Chapter 3 for the STAtistical
Reestimation (STAR) algorithms. Unlike the previous RATZ family of algorithms that apply cor-
rection factors to the incoming cepstral vectors of noisy speech, STAR tries to modify some of the
parameters of the acoustical distributions in the HMM structure. Therefore there is no need for
“compensation” since the noisy speech cepstrum vectors are used for recognition. Because of this
STAR is called a “distribution compensation” algorithm.

As we discussed before, there are theoretical reasons that support the idea that algorithms that
attempt to adapt the distributions of the acoustics to those of the noisy target speech are optimal.
The experimental results presented in this chapter support this conclusion.

We present an overview of the STAR algorithm and describe in detail all the steps followed in
STAR based compensation. We also provide some experimental results on several databases and
environmental conditions. Finally, we present our conclusions.

7.1. Overview of STAR and Blind STAR

The idea of data-driven algorithms that adapt the HMMs to the environment has been intro-
duced before. For example, the Tied-mixture normalization algorithm proposed by Anastasakos [4]
and the Dual Channel Codebook adaptation algorithm proposed by Liu [35] are similar in spirit to
STAR. However, they are based on VQ indices rather than Gaussian a posteriori probabilities and
use a weaker model of the effect of the environment on Gaussian distributions in which the cova-
riance matrices are not corrected. Furthermore, they only model the effect of the environment on
cepstrum distributions without modelling the effect on the other feature streams (see
Section 2.1.1.) such as delta cepstrum, double delta cepstrum and energy.

Since the clean speech cepstrum is represented by the HMM distribution, in the STAR family
of algorithms there is no need to explore a SNR-dependent structure as this would imply changing
the underlying HMM structure to have SNR-dependencies. Furthermore, since in our experiments
with the RATZ algorithms we did not find evidence to support the use of a SNR-dependent struc-
ture we decided not to explore this issue.
Chapter 7: The STAR Family of Algorithms 72

The algorithm works in the two following stages which are described as follows:

• Estimation of the statistics of clean speech

• Estimation of the statistics of noisy speech

In the next paragraphs we explain in detail each of the two stages of the STAR algorithm.

Estimation of the statistics of clean speech. The STAR algorithm uses the acoustical distri-
butions modeled by the HMMs to represent clean speech. Therefore, strictly speaking this is not a
step related to the STAR algorithm. The algorithm just takes advantage of the information con-
tained on the HMMs.

In the SPHINX-II system the distributions representing the cepstra of clean speech are modeled
as a mixture of multivariate Gaussian distributions. Under these assumptions the distribution for
clean speech can be written as

K
p ( xt ) = ∑ ak ( t ) N x ( µ x, k, Σ x, k ) (7.1)
k=1

where the a k ( t ) term represents the a priori probability of each of the Gaussians of each of the
possible states with the restriction that the total number of Gaussians is limited to 256 and shared
across all states. Reference [23] describes the SPHINX-II HMM topology and acoustical model-
ling assumptions in detail.

The µ x, k and Σ x, k terms represent the mean vector and covariance matrix of each multivariate
Gaussian mixture element k. These parameters are learned through the well known Baum-Welch
algorithm [25].

Estimation of the statistics of noisy speech. As we mentioned in Chapter 4, we assume that

the effect of the environment on the distributions of speech cepstra can be well modeled by apply-
ing the proper correction factors to the mean vectors and covariance matrices. Therefore, our goal
will be to compute these correction factors to estimate the statistics of noisy speech.

Using the methods described in Chapter 4 we have several solutions, depending on whether or not
stereo data are available to learn the correction factors.
Chapter 7: The STAR Family of Algorithms 73

If stereo data are not available we obtain these solutions

∑ P [ s t ( k ) y t, φ ] y t
rk = t=1
-----------------------------------------------
T
- – µ x, k (7.2)

∑ P [ st ( k ) y t, φ ]
t=1

T
T
∑ P [ st ( k ) y t, φ ] ( ( y t – µ x, k – r k ) ( y t – µ x, k – r k ) )
–1
T
- – Σ x, k
R k = t--------------------------------------------------------------------------------------------------------------------------- (7.3)

∑ P [ st ( k ) y t, φ ]
t=1

where the P [ s t ( k ) y t, φ ] term represent the a posteriori probability of an observation noisy vector

y t being produced by Gaussian k in state s t ( k ) given the set of estimated correction parameters φ .
The solutions are iterative and each iteration guarantees higher likelihood. Notice that in this case
the solutions are very similar to the Baum-Welch reestimation solutions commonly used for HMM
training [21].

If stereo data are available we obtain these solutions

∑ P [ st ( k ) xt ] ( yt – xt )
r k = t---------------------------------------------------------
=1
T
- (7.4)

∑ P [ st ( k ) xt ]
t=1

T
T
∑ P [ st ( k ) x t ] ( ( y t – µ x, k – r k ) ( y t – µ x, k – r k ) )
=1
T
- – Σ x, k
R k = t---------------------------------------------------------------------------------------------------------------------- (7.5)

∑ P [ st ( k ) xt ]
t=1

where the solutions are non iterative. Notice that in the stereo case the substitution of P [ s t ( k ) y t ]
by P [ s t ( k ) x t ] assumes implicitly that the a posteriori probabilities do not change due to the en-
vironment.
Chapter 7: The STAR Family of Algorithms 74

As we explained in Chapter 2, in the SPHINX-II system we use not only the cepstral vector as
to represent the speech signal but also a first-difference cepstral vector, a second-difference cepstral
vector and a fourth vector composed of three components: the energy, first-difference energy, and
second-difference energy. Each of these four streams of vectors is modelled with a different set of
256 Gaussians whose probabilities are combined.

The STAR family of algorithms assumes that each of these four streams will be affected by the
environment in a similar way as the cepstral stream. Even though we have not presented evidence
to support this assumption our experimental results as well as results by Gales [15] support this
assumption.

The STAR algorithm models the effect of the environment on the mean vectors and covariance
matrices of the distributions by additional correction factors. We estimate each of the additional
correction factors using formulas that are equivalent to those used to estimate the correction factors
for the cepstral stream.

Once the correction factors are estimated we can perform recognition using the distributions of
noisy speech estimated as distributions of clean speech corrected by the appropriate factors.

7.2. Experimental Results

In this section we describe several experiments designed to evaluate the performance of the
STAR family of algorithms. We explore several of the dimensions of the algorithm, such as:

• the impact of the number of adaptation sentences on speech recognition performance

• the effect of having stereo data to learn the correction factors

• comparison of the STAR algorithm to other algorithms such as RATZ.

The experiments described here are performed on the 5,000-word Wall Street Journal 1993 evalu-
ation set with white Gaussian noise added at several SNR levels. In all the following figures the
upper dotted line represents the performance of the system when fully trained on noisy data while
the lower dotted line represents the performance of the system when no compensation is used.

7.2.1. Effect of the number of adaptation sentences

In this section we study the sensitivity of the algorithm to the number of adaptation sentences
used to learn the correction factors. The correction factors were learned from five sets of 10, 40,
Chapter 7: The STAR Family of Algorithms 75

Word Accuracy (%) 100

60 Retrained
STAR 600 sents
40 STAR 100 sents
STAR 40 sents
STAR 10 sents
20 CMN

0 5 10 15 20 25
SNR (dB)
Figure 7-1. Effect of the number of adaptation sentences used to learn the correction factors rk and Rk on the
recognition accuracy of the STAR algorithm. The bottom dotted line represents the performance of the system
with no compensation.

100, 200, and 600 stereo adaptation sentences at different signal-to-noise ratios.

Figure 7-1 shows the error rates for this particular database for all the aforementioned condi-
tions. As we can see as the number of adaptation sentences grows the accuracy of the algorithm
improves. However, with only 40 sentences the algorithm seems to capture all the needed informa-
tion with no further improvements with more sentences. It is interesting to note that at SNRs larger
than 12.5 dB the performance seems to be quite independent of the number of adaptation sentences
even with only 10 adaptation sentences.

7.2.2. Stereo vs. non-stereo adaptation databases

In this section we explore the effect of not having stereo databases available for learning the
correction factors. We also explore different alternatives to bootstrap the iterative learning equa-
tions when no stereo data are available.

Figure 7-2 shows the recognition accuracy for STAR and Blind STAR where the number of
adaptation sentences used was 100. Ten iterations of the reestimation formulas were used for the
Blind STAR experiments. To explore the effect the initial distributions have on the Blind STAR al-
gorithm, we initialized the algorithm both from the distributions for clean speech and also from the
closest distributions. We observe that when using the closest SNR distributions to bootstrap the
Chapter 7: The STAR Family of Algorithms 76

Word Accuracy (%) 100

60 Retrained
STAR stereo
40 STAR Blind SNR init.
STAR Blind orig.
CMN
20

0 5 10 15 20 25
SNR (dB)
Figure 7-2. Comparison of the Blind STAR, and original STAR algorithms. The line with diamond symbols rep-
resents the original blind STAR algorithm while the line with triangle symbols represents the blind STAR algo-
rithm bootstrapped from the closest in SNR sense distributions.

STAR algorithm its performance improves considerably. However, the stereo-based STAR seems
to outperform the blind STAR versions at SNRs of 5 dB and above. For lower SNRs we hypothe-
size that a properly-initialized blind STAR algorithm might have some advantages over the stereo-
based STAR. In particular, the assumption that the a posteriori probabilities P [ s t ( k ) y t ] can be
replaced by P [ s t ( k ) x t ] is not completely valid at lower SNRs.

7.2.3. Comparisons with other algorithms

In this section we compare the performance of STAR with other previously developed algo-
rithms. We make comparisons where the number of adaptation sentences is equivalent and where
the availability of stereo data to learn the correction factors is also equivalent.

Figure 7-3 compares the STAR, blind STAR, RATZ, and blind RATZ algorithms using 100 sen-
tences for learning the correction factors. In general the STAR algorithms always outperform the
RATZ algorithms. This supports our claim that algorithms that modify the distributions of clean
speech approach the ideal condition of minimization of the probability of error region much better
than those that apply correction factors to the cepstrum data vectors.

The STAR algorithm is able to produce almost the same performance of a fully retrained sys-
tem up to a SNR of 5 dB. For lower SNRs the assumptions made by the algorithm (a posteriori
Chapter 7: The STAR Family of Algorithms 77

Word Accuracy (%) 100

60 Retrained
STAR stereo
40 Blind STAR
RATZ 256
Blind RATZ 256
20 CMN

0 5 10 15 20 25
SNR (dB)
Figure 7-3. Comparison of the stereo STAR, blind STAR, stereo RATZ, and blind RATZ algorithms. The adap-
tation set was the same for all algorithms and consisted of 100 sentences. The dotted line at the bottom rep-
resents the performance of the system with no compensation.

invariance) are not appropriate and recognition accuracy suffers.

7.3. Summary
The STAR family of algorithms modify the mean vectors and covariance matrices of the dis-
tributions of clean speech to make them more similar to those of noisy speech. We observe that this
family of algorithms outperforms the previously-described RATZ algorithms at almost all SNRs.

The results presented in this section suggest that data-driven compensation algorithm are better
when applied to the distributions of clean speech cepstrum. These results support our earlier sug-
gestion (see Appendix A) that compensation of the internal statistics approximates better the per-
formance of a minimum error classifier than compensation of incoming features. In fact, our
experimental results show that the STAR algorithm is able to produce almost the performance of a
fully retrained system up to a SNR of 5 dB.

We also studied the effect of the initial distributions used for initializing the blind STAR algo-
rithms and have observed that good initial distributions can radically change the performance of
the algorithm.
Chapter 7: The STAR Family of Algorithms 78
79

Chapter 8
A Vector Taylor Series Approach to Robust Speech
Recognition
In Chapter 4 we described how some of the parameters of the distributions of clean speech are
affected by the environment. We have seen how the mean vectors are shifted and how the covari-
ance matrices are compressed. Furthermore, we have shown how in some circumstances the result-
ing distributions representing noisy speech have no closed-form analytical solutions.

In Chapter 6 and 7 we have described methods that model the effects of the environment on the
mean vectors and covariance matrices of clean speech distributions directly from observations of
cepstral data, i.e., no model assumptions are directly made. While these techniques, called RATZ
and STAR, provide good performance, they are limited by the requirement of extra adaptation data.
A sizable amount of noisy data must be observed before performing compensation.

In this chapter we present a new model-based approach to robust speech recognition. Unlike
the previously-mentioned methods that learn correction terms directly from adaptation data, we
present a group of algorithms that learn these correction terms analytically. These methods reduce
the amount of adaptation data to a single sentence, namely, the sentence to be recognized. They
take advantage of the extra knowledge provided by the model to reduce the data requirements.
These methods are collectively referred to as the Vector Taylor Series (VTS) approach.

8.1. Theoretical assumptions1

We will assume that when a vector x representing clean speech is affected by the environment,
the resulting vector y representing noisy speech can be described by the following equation

y = x + g ( x, a 1, a 2, .... ) (8.1)

where the function g( ) is called the environmental function and a 1 , a 2 represents parameters
(vectors, scalars, matrices,..) that define the environment. We will assume that the g( ) function is
perfectly known although we will not require knowledge of the environment parameters a 1 , a 2 .

Figure 4-1 showed a typical environmental function g( ) and parameters. This kind of model

1. The notation used in this section as well as the derivation of some of these formulas was introduced in Section 4.1..
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 80

for the environment was originally proposed by Acero [1].

In this case the relationship between clean and noisy speech vectors can be expressed in the
log-spectral domain as

y [ k ] = x [ k ] + g ( x [ k ], h [ k ], n [ k ] ) (8.2)

or in vector notation as

y = x + g ( x, h, n ) (8.3)

where the environmental function term g ( x, h, n ) is expanded as

n–x–h
 ---------------------- 
10
g ( x, h, n ) = h + 10 log 10 i + 10  (8.4)
 

where i is a unity vector. The dimension1 of all vectors is L.

In this case the environmental parameters are the vectors

h[0] n[0]
h = ... n = ... (8.5)
h[ L – 1] n[ L – 1]

where each of the components h [ k ] is the kth log spectral mel component of the power spectrum
2
of the channel H ( ω k ) . Similarly each of the components n [ k ] is the kth log spectra mel com-

ponent of the power spectrum of the noise N ( ω k )

As in the case of the RATZ family of algorithms our second assumption in this chapter will be
that the clean speech log-spectrum random vector variable can be represented by a mixture of
Gaussian distributions

K–1
p( xt ) = ∑ p k N xt ( µ x, k, Σ x, k ) (8.6)
k=0

1. In the SPHINX-II system L is set to 40. See Section 2.1.1.

Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 81

8.2. Taylor series approximations

Given these assumptions, we would like to compute the distribution of the log spectral vectors
of the noisy speech. If the pdf of x and the analytical relationship between the random variables x
and y are known, it is possible to compute the distribution for y . For the case in which the environ-
mental parameters h and n are deterministic, we showed in Chapter 4 that the resulting distribution
for y was non-Gaussian and had the form

n– y –1
L⁄2 
 ------------  
10
p ( y µ x, Σ x, n, h ) =  ( 2π Σ x )  I – 10 
  
T
1 
n– y
------------ 
n– y (8.7)
–1    ------------ 
10 10
– ---  y – h – µ x + 10 log 10 i – 10  Σ x  y – h – µ x + 10 10 log 10 i – 10 
2      
e

For the more realistic case in which the noise itself is a random variable modeled with a Gaus-
sian distribution N n ( µ n, Σ n ) , there is no closed-form solution for p ( y µ x, Σ x, n, h ) . In general, ex-
cept for environments for which the environmental function g ( x, a 1, a 2, .... ) is very simple, the
resulting distribution for the log spectral vectors of the noisy speech has no closed-form solution.

In order to obtain a solution for the pdf of y , we make the further simplification that the result-
ing distribution is still Gaussian in nature. As we showed in Chapter 4, this assumption is not un-
reasonable and it makes the problem mathematically tractable. However, the resulting equations
for the mean and covariance of p ( y µ x, Σ x, n, h ) are still not solvable.

To simplify the problem even further we propose to replace the environmental vector function
g ( x, a 1, a 2, .... ) by its vector Taylor series approximation. This simplification only requires that the

environmental function g( ) be analytical. Under this assumption the resulting relationship be-
tween the random variable x and y becomes

y = x + g ( x 0, a 1, a 2, .... ) + g' ( x 0, a 1, a 2, .... ) ( x – x 0 )

1 (8.8)
+ --- g'' ( x 0, a 1, a 2, .... ) ( x – x 0 ) ( x – x 0 ) + ...
2
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 82

For the type of environment described in Figure 4-1 the Vector Taylor approximation is

y = x + g ( x 0, h, n ) + g' ( x 0, h, n ) ( x – x 0 )
1 (8.9)
+ --- g'' ( x 0, h, n ) ( x – x 0 ) ( x – x 0 ) + ...
2

where the term g ( x 0, h, n ) is expanded as

n – xo – h
 ------------------------ 
10
g ( x 0, h, n ) = h + 10 log 10 i + 10  (8.10)
 

i.e., the environment vector function g ( . ) evaluated at the vector point x 0 . The term g' ( x 0, h, n ) is
the derivative of the environment vector function g ( . ) with respect to the vector variable x evalu-
ated at the vector point x 0

x 0, i + h i – n i – 1
  ------------------------------  
10
g' ( x 0, h, n ) = diag – 1 + 10   (8.11)
   

x 0, i + h i – n i – 1
 ------------------------------ 
10
i.e., a diagonal matrix with L entries in the main diagonal, each of the form 1 + 10  .
 

Higher order derivatives result in tensors [3] of order three and higher. For example, the second
derivative of the vector function g ( . ) with respect to the vector variable x evaluated at the vector
point x 0 would be

x +h –n x +h –n –2
 ------------------------------
0, i j k
- 0, i j k
-
------------------------------
ln10
 10 10
 1 + 10
10
 ----------- i = j = k
g'' ( x 0, h, n ) = f '' ijk =    10 (8.12)

 0 otherwise

i.e., a diagonal tensor with only the diagonal elements different from zero. In this particular type
of environment higher order derivatives of g ( . ) always result in diagonal tensors.
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 83

8.3. Truncated Vector Taylor Series approximations

In this section we compute the mean vector and covariance matrix of the noisy vector y assum-
ing that we have perfect knowledge of the environmental parameters h and n .

The original expression for the mean vector is

µ y = E ( y ) = E ( x + g ( x, h, n ) ) = E ( x ) + E ( g ( x, h, n ) ) (8.13)

approximating the environmental function g ( . ) by its vector Taylor series the above expression is
simplified to

µ y = E ( y ) = E ( x + g ( x, h, n ) ) = E ( x ) + g ( x 0, h, n ) + g' ( x 0, n, h )E ( x – x 0 ) +
1 (8.14)
--- g'' ( x 0, n, h )E ( ( x – x 0 ) ( x – x 0 ) ) + ...
2

where each of the expected value terms can be easily computed.

For the covariance matrix the original expression is

T T T T T T
Σ y = E ( ( y y ) ) – µ y µ y = E ( x x ) + E ( g ( x, h, n ) g ( x, h, n ) ) + 2E ( x g ( x, h, n ) ) – µ y µ y (8.15)

approximating each of the g ( . ) terms by its vector Taylor series results also in a expression where
each of the individual terms is solvable.

To simplify the expressions we retain terms of the Taylor series up to a certain order. This will
result in approximate equations describing the mean and covariance matrices of the log spectrum
random variable y for noisy speech. The more terms of the Taylor series we keep, the better the
approximation.

For example, for a Taylor series of order zero the expressions for the mean vector and covari-
ance matrix are

µ y = E ( y ) ≅ E ( x + g ( x 0, h, n ) ) = µ x + g ( x 0, h, n ) (8.16)

T
Σ y = E { ( y – µ y ) ( y – µ y ) } ≅ E{ ( x + g ( x 0, h, n ) – µ x – g ( x 0, h, n ) )
T
( x + g ( x 0, h , n ) – µ x – g ( x 0, h , n ) ) } (8.17)
T
Σ y ≅ E {( x – µx)( x – µx) } = Σx
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 84

From these expressions we conclude that a zeroth order vector Taylor series models the effect of
the environment on clean speech distributions only as a shift of the mean.

For a Taylor series of order one, the expressions for the mean and covariance matrix are

µ y = E ( y ) ≅ E ( x + g ( x 0, h, n ) + g' ( x 0, h, n ) ( x – x 0 ) )
(8.18)
µ y ≅ ( I + g' ( x 0, h, n ) )µ x + g ( x 0, h, n ) – g' ( x 0, h, n ) x 0

T T
Σ y = E { y y } – µ yµ y
T
E { yy } ≅ E{ ( x + g ( x 0, h, n ) + g' ( x 0, h, n ) ( x – x 0 ) )( x + g ( x 0, h, n ) +
T
+ g' ( x 0, h, n ) ( x – x 0 )) }
(8.19)
T
µ y µ y ≅ ( ( I + g' ( x 0, h, n ) )µ x + g ( x 0, h, n ) – g' ( x 0, h, n ) x 0 )( ( I + g' ( x 0, h, n ) )µ x +
T
+ g ( x 0, h, n ) – g' ( x 0, h, n ) x 0 )
T
Σ y ≅ ( I + g' ( x 0, h, n ) )Σ x ( I + g' ( x 0, h, n ) )

From these expressions we conclude that a first order vector Taylor series models the effect of the
environment on clean speech distributions as a shift of the mean and as a compression of the cova-
riance matrix. This can be proved by realizing that each of the elements of the ( I + g' ( x 0, h, n ) )
matrix that multiplies the covariance matrix Σ x is smaller than one

x 0, i + h i – n i – 1 x 0, i + h i – n i – 1
  ------------------------------  
10  – ------------------------------  
10
( I + g' ( x 0, h, n ) ) = diag 1 – 1 + 10   = diag 1 + 10   (8.20)
      

8.3.1. Comparing Taylor series approximations to exact solutions

In this section we explore the accuracy of the Taylor series approximation for the kind of envi-
ronment described in Figure 4-1. We compare the actual values of the means and variances of noisy
distributions at different signal-to-noise ratios with those computed using Taylor series approxima-
tions of different orders.

The experiments were performed on artificially generated data in a similar way to the experi-
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 85

ments done on Chapter 4. A clean set X = { x 0, x 1, ..., x S – 1 } of one-dimensional data points was
randomly generated according to a Gaussian distribution. Another set of noise data
N = { n 0, n 1, ..., n S – 1 } was randomly produced according to a Gaussian distribution with a small

variance. Both sets were combined according to the formula

n–y
 ----------- 
10
y = x + 10 log 10 1 + 10  (8.21)
 

resulting in a noisy data set Y = { y 0, y 1, ..., y S – 1 } . The mean and variance of this set was estimated
directly from the data as

S–1 S–1
1 1
µ y = --- ∑ y t σ y = ------------ ∑ ( y t – µ y )
2 2
(8.22)
S S–1
i=0 i=0

The mean and the variance were also computed using Taylor series approximations of different
orders using the previously described formulas. The experiment was repeated at different signal to
noise ratios1.

Figure 8-1 compares Taylor series approximations of order zero and two with the actual values
of the mean. As can be seen, a Taylor approximation of order zero seems to capture most of the
effect of the environment on the mean. Figure 8-2 compares Taylor series approximations of order
zero and one with the actual value of the variance of the noisy data. A Taylor series of order one
seems to be able to capture most of the effect of the environment on the variance. From these sim-
ulations we conclude that a Taylor series approximation of order one might be enough to capture
most of the effects of the environment of log spectral distributions of clean speech.

1. The SNR here is defined as µ x – µ n , the difference between the means of the distributions for clean and noisy speech.
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 86

Noisy signal mean

10
Actual mean
5 Taylor 0th
Taylor 2nd

-12 -8 -4 0 4 8 12
SNR (dB)
Figure 8-1. Comparison between Taylor series approximations to the mean and the actual value of the mean
of noisy data. A Taylor series of order zero seems to capture most of the effect.

6
Noisy signal variance

3
Actual var
2 Taylor 0th
Taylor 1st
1

-12 -8 -4 0 4 8 12
SNR (dB)
Figure 8-2. Comparison between Taylor series approximations to the variance and the actual value of the vari-
ance of noisy data. A Taylor series of order one seems to capture most of the effect.

8.4. A Maximum Likelihood formulation for the case of unknown

environmental parameters
In the previous sections we described the generic use of vector Taylor series approximations as
a way of solving for the means and covariances of the noisy speech log-spectrum distributions.
However, knowledge of the environment parameters was assumed in the above discussion. In prac-
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 87

tical situations we might know the generic form of the environmental function g ( x, a 1, a 2, .... ) but
not the exact values of the environmental parameters a 1 , a 2 . Therefore, a method that is able to
estimate these parameters from noisy observations is necessary. Once these environmental param-
eters are estimated we can directly compute the log-spectrum mean vectors and covariance matri-
ces for noisy speech using a vector Taylor series approximation as described in the previous
sections.

In this section we outline an iterative procedure that combines a Taylor series approach with a
maximum likelihood formulation to estimate the environmental parameters a 1 , a 2 from noisy ob-
servations. The algorithm then estimates the mean vector and covariance matrix of the distributions
of log spectral vectors for the noisy speech. The procedure is particularized for the type of environ-
ments described in Figure 4-1.

Given the following assumptions

• A set of noisy speech log-spectrum vectors Y = { y 0, y 1, ..., y S – 1 }

•A distribution for the clean speech log-spectrum vector random variable

K–1
p( xt ) = ∑ p k N xt ( µ x, k, Σ x, k )
k=0

• A set of initial values for the environmental parameters n 0 = min { Y } , h 0 = mean { Y } – µ x ,

and x 0 = µ x

we now define a vector Taylor1 series around the set of points µ x , n 0 and h 0

y = x + g ( x, n, h ) ≅ x + g ( µ x, n 0, h 0 ) + ∇ x g ( µ x, n 0, h 0 ) ( x – µ x ) +
(8.23)
∇n g ( µ x, n 0, h 0 ) ( n – n 0 ) + ∇h g ( µ x, n 0, h 0 ) ( h – h 0 ) + ...

Given a particular order in the Taylor approximation we compute the mean vector and covari-
ance matrix of noisy speech as functions of the unknown variables n and h .

1. Since the environmental function is a vector function of vectors, i.e. a vector field or in more general terms a tensor field,
we are forced to use a new notation for partial derivatives using the gradient operator ∇ x g( ) . Notice also that this is only
valid for the first derivative. For higher order derivatives we would be forced to use tensor notation. See [3] for more details
on tensor notation and tensor calculus.
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 88

For example, for the case of a first order Taylor series approximation we obtain

µ y ≅ µ x + ( I + ∇h g ( µ x, n 0, h 0 ) )h + ∇n g ( µ x, n 0, h 0 ) n +
g ( µ x, n 0, h 0 ) – ∇h g ( µ x, n 0, h 0 ) h 0 – ∇n g ( µ x, n 0, h 0 ) n 0 (8.24)
µ y ≅ µ x + w ( h, n, µ x, n 0, h 0 )

T
Σ y ≅ ( I + ∇h g ( µ x, n 0, h 0 ) )Σ x ( I + ∇h g ( µ x, n 0, h 0 ) )
(8.25)
T
Σ y ≅ W ( µ x, n 0, h 0 ) Σ x W ( µ x, n 0, h 0 )

In this case, the expression for the mean of the log-spectral distribution of noisy speech is a lin-
ear function of the unknown variables n and h can be rewritten as

µy ≅ a + B h + C n
a = µ x + g ( µ x, n 0, h 0 ) – ∇h g ( µ x, n 0, h 0 ) h 0 – ∇n g ( µ x, n 0, h 0 ) n 0 (8.26)
B = I + ∇h g ( µ x, n 0, h 0 ) C = ∇n g ( µ x, n 0, h 0 )

while the expression for the covariance matrix is dependent only on the initial values µ x , n 0 , and
h 0 , which are known.

Therefore, given the observed noisy data we can define a likelihood function

S–1
L ( Y = { y 0, y 1, ..., y S – 1 } ) = ∑ log ( p ( yt h, n ) ) (8.27)
t=0

where the only unknowns are the variables n and h . To find these unknowns we can use a tradi-
tional iterative EM approach. Appendix D describes in detail all the steps needed to estimate these
solutions.

Once we obtain the solutions for the variables n and h we readjust the Taylor approximations
to the mean vectors and covariance matrices by substituting n 0 by n and h 0 by h . Once the Taylor
approximations are readjusted we iterate the procedure by defining again the new mean vectors and
covariance matrices of the noisy speech and redefining a maximum likelihood function. The whole
procedure is stopped when no significant change is observed in the estimated values of n and h .
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 89

Figure 8-3 shows a block diagram of the whole estimation procedure.

Estimate n 0, h 0

µ y ≅ µ x + w ( h, n, µ x, n 0, h 0 ) Equation (8.25)
T
Σ y ≅ W ( µ x, n 0, h 0 ) Σ x W ( µ x, n 0, h 0 )
Equation (8.26)

EM loop: find ( n, h ) to maximize

N
L ( Z )=∑ log  p zi h, n

i=1
EM loop

Replace ( n 0, h 0 ) by ( n, h )

Taylor approx
readjust loop
Converged?

Optimal ( n, h ) found, obtain ( µ z, Σ z )

Recognize using noisy stats

Figure 8-3. Flow chart of the vector Taylor series algorithm of order one for the case of un-
known environmental parameters. Given a small amount of data the environmental parameters
are learned in an iterative procedure.

8.5. Data compensation vs. HMM mean and variance adjustment

As we mentioned in Chapter 4, once the distributions of the noisy speech log-spectrum are
known there are two choices for environment compensation. The first choice consists of perform-
ing the classification with the feature vectors directly and using the distributions of the noisy speech
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 90

feature vectors. The second choice consists of applying correction factors to features representing
the noisy speech to “clean” them and performing the classification using distributions derived from
clean speech.

The VTS algorithm as presented here can be used for both cases. However, because a speech
feature compensation approach is easier to implement and can be implemented independently of
the recognition engine the experimental results we will present correspond to the second case.
Therefore an additional step is needed to perform data compensation once the distributions of the
noisy speech are learned via the VTS method.

We propose to use an approximated Minimum Mean Squared Error (MMSE) method similar
to the one used in the RATZ family of algorithms

x̂ MMSE = E ( x y ) = ∫ x. p ( x y )dx (8.28)

expressing x as a function of y and the environmental function g( )

K–1
x̂ MMSE = y – ∫ g ( x, n, h ) p ( x y )dx = y – ∫ ∑ g ( x, n, h ) p ( x, k y )dx
X X k=0
(8.29)
K–1
x̂ MMSE = y – ∑ P[k y ] ∫ g ( x, n, h ) p ( x k, y )dx
k=0 X

and approximating g( ) by its Taylor series approximation we will get different solutions. For ex-
ample, for a Taylor series approximation of order zero we obtain the following result

K–1 K–1
x̂ MMSE ≅ y – ∑ P [ k y ] ∫ g ( µ k, x, n, h ) p ( x k, y )dx = y – ∑ P[k y ] g ( µ k, x, n, h ) (8.30)
k=0 X k=0

For a Taylor series of order one we obtain the following result

y ≅ x + g ( µ k, x, n, h ) + g' ( µ k, x, n, h ) ( x – µ k, x ) (8.31)
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 91

K–1
x̂ MMSE ≅ ∑ P[k y ] ∫ ( y – g ( µ k, x, n, h ) ) p ( x k, y )dx
k=0 X
K–1
(8.32)
x̂ MMSE = y – ∑ P[k y ] g ( µ k, x, n, h )
k=0

Solutions for higher order approximations are also possible.

8.6. Experimental results

In this section we compare the VTS algorithms of order zero and one with other environmental
compensation algorithms. The experiments described here were performed on the 5,000-word Wall
Street Journal 1993 evaluation set with white Gaussian noise added at several SNRs. As in the pre-
vious experiments with this database, the upper dotted line represents the performance of the sys-
tem when fully trained on noisy data while the lower dotted line represents the performance of the
system when no compensation is used.

Figure 8-4 compares the VTS algorithms with the CDCN algorithm. All the algorithms used

100
Word Accuracy (%)

80
Retrained
60 1st-Order VTS
0th-Order VTS
40
CDCN
20 CMN

0 5 10 15 20 25
SNR (dB)
Figure 8-4. Comparison of the VTS algorithms of order zero and one with CDCN. The VTS algorithms out-
perform CDCN at all SNRs.

256 Gaussians to represent the distributions of clean speech feature vectors. The VTS algorithms
used a 40-dimensional log spectral vector as features. The CDCN algorithm used a 13-dimensional
cepstral vector. The VTS algorithm of order one outperforms the VTS algorithm of order zero
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 92

which in turn it outperforms the CDCN algorithm at all SNRs. The approximations of the VTS al-
gorithm model the effects of the environment on the distributions of clean speech log spectrum bet-
ter than those of CDCN.

Figure 8-5 compares the VTS algorithms with the best configurations of the RATZ and STAR

100
Word Accuracy (%)

80
Retrained
60 STAR stereo
1st-Order VTS
40
RATZ 256
0th-Order VTS
20
CMN

0 5 10 15 20 25
SNR (dB)
Figure 8-5. Comparison of the VTS algorithms of order zero and one with the stereo-based RATZ and STAR
algorithms. The VTS algorithm of order one performs as well as the STAR algorithm up to a SNR or 10 dB. For
lower SNRs only the STAR algorithm produces lower error rates.

algorithm. For this experiments the STAR and RATZ algorithm were trained with one hundred ad-
aptation sentences per SNR. The environment was perfectly known. The VTS algorithms worked
on a sentence by sentence level using the same sentence for learning the distributions of the noisy
speech log spectrum and then compensating the same sentence.

We can observe how the VTS algorithm of order one yields similar recognition accuracy as the
STAR algorithm or a fully retrained system. Only at lower than 10 dB the STAR algorithm outper-
forms the VTS algorithm of order one. Perhaps this is due to the fact that it is at this lower SNRs
where algorithms that modify the distributions of clean speech cepstrum approximate better the op-
timal minimum error classifier.

8.7. Experiments using real data

In this section we describe a series of experiments in which we study the performance of the
algorithms proposed in this thesis with real databases.
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 93

The first experiment compares the performance of the CDCN and the VTS of order one algo-
rithms. The task consists of one hundred sentences collected at a seven different distances between
the speaker and a desktop microphone. Each of the one hundred sentences subsets is different in
terms of speech and noise. However, the same speakers read the same sentences at each mike-to-
mouth distance. The vocabulary size of the task was about 2,000 words. The sentences were re-
corded in an office environment with computer background noise as well as some impulsive noises.
The next figure illustrates our experimental results.

100
Word Accuracy (%)

40
1st-Order VTS

20 CDCN
CMN

0 6 12 18 24 30 36 42
distance mic. (inches)
Figure 8-6. Comparison of the VTS of order one algorithm with the CDCN algorithm on a real database. Each
points consists of one hundred sentences collected at different distances from the mouth of the speaker to the
microphone.

As we can see both algorithms behave similarly up to a distance of 18 inches. At that point the
VTS algorithm improves the recognition rate.

The second experiment described in this section compares the performance of all the algo-
rithms introduced in this thesis and the CDCN algorithm. The database used was the 1994 Spoke
10 evaluation set described in Section 2.2.1. This database consists of three sets of 113 sentences
contaminated with car-recorded noise at different signal-to-noise levels. The recognition system
used was trained on the whole Wall Street Journal corpora consisting of about 37,000 sentences or
57 hours of speech. Male and female models were produced with about 10,000 senonic clusters
each.
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 94

Each sentences was recognized with the male and female models and the most likely candidate
chosen as the correct one.

The CDCN and VTS algorithms of order zero and one were ran on a sentence by sentence basis.
Even though the Spoke 10 evaluations conditions allow for the use of some extra noise samples to
correctly estimate the noise, we did not make use of those samples.

The RATZ algorithm used a statistical representation of the clean speech based on 256 Gauss-
ians. Fifteen sets of correction factors were learned from stereo-recorded data distributed by NIST.
The data consisted of sets of 53 utterances contaminated with noise collected from three different
cars and added at five different SNRs. An interpolated version of the RATZ algorithm was used.

Using the same stereo-recorded data fifteen different sets of STAR-based correction factors
were also estimated. For each of the three evaluation sets the three most likely correction sets were
applied to the HMM means and variances and the resulting hypothesis produced by the decoder
were combined and the most likely chosen.

Figure 8-7 presents our experimental results. As we can see most of the algorithms perform
quite well and differences in recognition performance are only clear and apparent at the lowest
SNR. In that case the interpolated RATZ algorithm exhibits the greatest accuracy.

Contrary with our results based on artificial data the STAR algorithm does not outperform the
results achieved with RATZ. The lack of a proper mechanism for interpolation in STAR might be
responsible for this lower performance. In general, all the algorithms perform similarly perhaps be-
cause the SNR of each of the testing sets is high enough.

8.8. Computational complexity

In a very informal study of the computational complexity of the compensation algorithms pro-
posed in this thesis we observed how long it took to compensate a set of five noisy sentences using
the RATZ, CDCN, VTS-0, and VTS-1 algorithms.

No optimization effort was made to improve the performance of each of the algorithms. The
experiments were done on a DEC Alpha workstation model 3K600. The numbers we report in Fig-
ure 8-8 represent the number of seconds it took to compensate the five sentences divided by the
duration of the five sentences.
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 95

Word Accuracy (%) 100

90
Clean
80 Interpolated RATZ 256
STAR
70 1st-Order VTS
0th-Order VTS
60 CDCN
CMN
50
15 20 25 30
SNR (dB)
Figure 8-7. Comparison of several algorithms on the 1994 Spoke 10 evaluation set. The upper line represents
the accuracy on clean data while the lower dotted line represents the recognition accuracy with no compensa-
tion. The RATZ algorithm provides the best recognition accuracy at all SNRs.

8
Times real time

7
6
5
4
3
2
1
0
RATZ CDCN VTS-0 VTS-1
Algorithm
Figure 8-8. Comparison of the real time performance of the VTS algorithms with the RATZ and CDCN com-
pensation algorithms. VTS-1 requires about 6 times the computational effort of CDCN.

The RATZ experiment assumes that the environment (and its correction parameters) is known.
If an interpolated version of RATZ is used, the load of the algorithm increases linearly with the
number of environments. Therefore, if the number of environments to interpolate is about twenty,
the computational load of RATZ reaches the level of VTS-1. Even though VTS-0 and CDCN are
Chapter 8: A Vector Taylor Series Approach to Robust Speech Recognition 96

very similar algorithms, the use of log spectra as the feature representation of VTS-1, makes the
algorithm more computationally intensive as the dimensionality of a log spectral vector is three
times that of a cepstral vector.

It is interesting to note that the increase in computational load going from VTS-0 to VTS-1 is
not that great.

8.9. Summary
In this chapter we have introduced the Vector Taylor Series (VTS) algorithms. We have shown
how this approach allows us to estimate simultaneously the parameters defining the environment
as well as the mean vectors and covariance matrices of log spectral distributions of noisy speech.

We also have shown the algorithm performance as a feature-vector compensation method. It

compares favorably with other model-based algorithms such as CDCN as well as with algorithms
that learn the effect of the environment on speech distributions by adaptation sets, such as RATZ.
In fact the VTS algorithms perform as well as fully retrained systems up to a SNR of 10 dB.

We have explored the computational load of the VTS algorithms and shown to be higher than
RATZ or CDCN.

Finally, we have studied the performance of all the algorithms proposed on this thesis on the
1994 Spoke 10 evaluation database. We have observed how all the algorithms proposed in this the-
sis provide significant improvements in recognition accuracy.
97

Chapter 9
Summary and Conclusions
This dissertation addresses the problem of environmental robustness using current speech rec-
ognition technology. Starting with a study of the effects of the environment on speech distributions
we proposed a mathematical framework based on the EM algorithm for environment compensa-
tion. Two generic approaches have been proposed. The first approach uses data that is simulta-
neously recorded in the training and testing environments to learn how speech distributions are
affected by the environment, and the second approach uses a Taylor series approximation to model
the effects of the environment using an analytical vector function.

In this chapter we summarize our conclusions and findings based on our simulations with arti-
ficial data and experiments using real and artificially-contaminated data containing real speech. We
review the major contributions of this work and present several suggestions for future work.

9.1. Summary of Results

We performed a series of simulations using artificial data to study in a controlled manner to
study the effects of the environment on speech-like log spectral distributions. From these simula-
tions we draw the following conclusions:

• the distributions of log spectra of speech are no longer Gaussians when submitted to additive
noise and linear channel distortions.

• the means of the resulting noisy distributions are shifted

• the variances of the resulting noisy distributions are compressed

Based on this simulations we modeled the effects of the environment on Gaussian speech dis-
tributions as correction factors to be applied to the mean vectors and covariance matrices.

We developed two families of algorithms for data-driven environmental compensation. The

first set of algorithms specified the use of correction factors to be applied to the incoming vector
representation of noisy speech. The second family of algorithms provided corrections to the speech
distributions that represent noisy speech in the HMM classifier.
Chapter 9: Summary and Conclusions 98

The values of the correction factors were learned in three different ways:

• Using simultaneously-recorded clean and noisy speech databases (“stereo” databases) to

learn correction factors directly from data (stereo RATZ and STAR).

• Iteratively learning the correction factors directly from noisy data alone (blind RATZ and
STAR).

• Using only a small amount of noisy data (i.e., the same sentence to be recognized) in combi-
nation with an analytical model of the environment to learn iteratively the parameters of the
environment and the correction factors (VTS).

We also presented a unified framework for the RATZ and STAR algorithms showing that tech-
niques that attempt to modify the incoming cepstra vectors and techniques that modify the param-
eters of the distributions of the HMMs can be described by the same theory.

Our experimental results demonstrate that all techniques proposed in this thesis produce signif-
icant improvements in recognition accuracy. In agreement with our predictions, the STAR tech-
niques that modify the parameters of the distributions that internally represent speech outperform
the RATZ techniques that modify the incoming cepstral vectors of noisy speech, given equivalent
experimental conditions.

We have also shown how data-driven techniques seem to perform quite well even with only ten
sentences of adaptation data. Contrary to previous results by Liu and Acero using the FCDCN al-
gorithm [1], the use of an SNR-dependent structure did not help to reduce the error rate for our
data-driven algorithms. Perhaps the use of a better mathematical model makes it unnecessary to
partition distributions according to SNR, as had been done by the FCDCN algorithm. We have also
shown how this mathematical structure allows for a natural extension to incorporate the concept of
environment interpolation. Our results using environment interpolation show that not knowing the
target environment might not be very detrimental.

When comparing the proposed algorithms with the performance of a fully retrained system we
can conclude that they provide the same performance for SNRs as low as

• 15 dB for the RATZ family of algorithms

Chapter 9: Summary and Conclusions 99

• 10 dB for the VTS family of algorithms

• 5 dB for the STAR family of algorithms

Finally, we have shown that an analytical characterization of the environment can enable mean-
ingful compensation even with very limited amounts of data. Our results with the VTS algorithm
are better than equivalent model-based algorithms such as CDCN [1].

9.2. Contributions
We summarize below the major contributions of this thesis.

• We used simulations with artificial data as a way of learning the effects of the environment

on the distributions of the log spectra of clean speech. We believe that experimentation in

controlled conditions is key to understanding the problem in hand and to obtaining important

insights into the nature of the effects of the environment on the distributions of clean speech.

Based on these insights we modeled the effects of the environment on the distributions of cep-

stra of clean speech as shifts of the mean vectors and compressions of the covariance matri-

ces.

• We showed that compensation of the distributions of clean speech is the optimal solution for

speech compensation as it minimizes the probability of error. Our results with STAR seem to

support this assertion, since the STAR algorithm (which modifies internal distributions) out-

performs all other algorithms that compensate incoming features. However, it is important to

note that algorithms such as STAR are only approximations to the theoretical optimum as

they make several assumptions that result in distributions that do not exactly minimize the

probability of error.

• We presented a unified statistical formulation of the data-driven noise compensation problem.

We argued that compensation methods that modify incoming data and compensation methods

that modify the parameters of the distributions of the HMMs share the same common as-

sumptions and differ primarily in how the correction factors are applied. This statistical for-
Chapter 9: Summary and Conclusions 100

mulation has been naturally extended to the cases of SNR-dependent distributions and

environment interpolation.

• We also presented a generic formulation of the problem of environment compensation when

the environment is known through an analytical function. We introduced the use of vector

Taylor series as a generic formulation for solving analytically for the correction factors to be

applied to the distributions of the log spectra of clean speech. Our results are consistently bet-

ter than the previous best-performing model-based method, CDCN.

9.3. Suggestions for Future Work

The majority of the current environmental compensation techniques are limited by the feature
representations used to characterize the speech signal and by the model assumptions made in the
classifier itself. As we have shown in our experimental results, even when the system is fully re-
trained at each of the SNRs used in the experiments, the performance of the system degrades.

The major reason for this degradation is that as the noise becomes more and more dominant the
inherent differences between the classes become smaller, and the classification error increases. It
is not possible to recover from the effects of additive noise at extremely low SNRs, so our only
choice is to keep improving the performance of speech recognition systems at medium SNRs by
changes to the feature representation and/or the classifier itself.

We would like to use feature representations that are inherently more resilient to the effects of
the environment. We certainly would like to avoid the compressions produced by the use of a log-
arithmic function. Perhaps the use of mel-cepstral vectors should be reconsidered. In this sense
techniques motivated by our knowledge of speech production and perception mechanisms have
great potential, as confirmed by recent results using the Perceptual Linear Prediction technique
(PLP) [18] in large vocabulary systems [56] [28].

With respect to the classifier, we would like to use more contextual and temporal information.
Certainly the signal contains more structure and detailed information than is currently used in
speech recognition systems. The use of feature trajectories [18] instead of feature vectors is a pos-
sibility. Also, the current recognition paradigm focuses more on the stationary parts of the signal
Chapter 9: Summary and Conclusions 101

than on the transitional parts. Techniques that model the transitional part of the signal seem also
promising [42].

In this thesis we have presented two data-driven approaches to environment compensation

(RATZ and STAR) and a model-based approach (VTS). The data-driven approaches make minimal
assumptions about the environment. The model-based approaches apply “structural” knowledge of
the environmental degradation on the distributions of log spectrum of clean speech, thus reducing
the need for large data sets. Perhaps there is scope for hybrid approaches between VTS and RATZ
that assume some initial model of the environment and make a more efficient use of the data as
these are available. In this sense MAP approaches are worth exploring.

Another possibility for improvement of the data-driven methods is to explore the a posteriori
invariance assumption proposed in Section 5.2.2. We know from our results and other results [15]
that this assumption is not accurate at lower SNRs. A study of ways to learn how the a posteriori
probabilities are changed by the effects of the environment might be useful.

The VTS approach was only introduced in this thesis. We believe this compensation algorithm
should be further explored in the following ways:

• Although we have presented a generic formulation for any kind of environment, the VTS al-

gorithm should be modified for testing in different kinds of environments such as telephone

channels, radio broadcast channels, etc. This would involve the exploration of different envi-

ronmental functions.

• The VTS algorithms should be extended to compensate the HMM acoustical distributions as

does STAR. We would expect further improvements when used in this manner.

• If perfect analytical knowledge of the environment is not available, perhaps methods that at-

tempt to learn the environmental function and its derivatives should be explored.

• For a given order in the polynomial series expansion there might be more optimal power se-

ries than the Taylor series. The use of optimal Vector Power Series should be explored.

Another topic worth exploring would be to study the extent to which the ideas proposed in this
Chapter 9: Summary and Conclusions 102

thesis can be applied to the area of speaker adaptation. Both problem domains share similar as-
sumptions and can probably be unified.

Finally, to avoid the complexity that the use of Gaussian distributions introduces it is important
to explore the use of other distributions that either result in equations with analytical solutions or
that result in equations where simpler approximations can be used.
103

Appendix A
Comparing Data Compensation to Distribution
Compensation
In this appendix, we provide a simple explanation of why data compensation methods such as
RATZ, VTS-0th and VTS-1st provide worse performance than those that modify some of the pa-
rameters of the speech distributions such as STAR.

A.1. Basic Assumptions

We frame our problem as a simple binary detection problem. We assume that there are two
classes (classes H1 and H2) with a priori probabilities P(H1) and P(H2) respectively. Our goal will
be that of deriving a decision rule that maximizes a performance measure (likelihood) based on the
probabilities of correct and incorrect decisions.

Both classes have Gaussian pdf’s

px H1( x H 1 ) = N x ( µ x, H 1, σ x, H 1 ) px H2( x H 2 ) = N x ( µ x, H 2, σ x, H 2 ) (A.1)

Our decision rule will have the form

choose H 1 if ( P [ class=H 1 x ] ≥ P [ class=H 2 x ] )

(A.2)
choose H 2 if ( P [ class=H 1 x ] ≤ P [ class=H 2 x ] )

which is the maximum a posteriori (MAP) decision rule. From this decision rule we can divide the
space into decision regions. For example, for the case where the two variances are equal and the
two a priori probabilities are not equal the decision region is

2 P ( H 2 )
µ x , H 1 + µ x, H 2 σ log ---------------
x , H P ( H 1 )
x ≥ ------------------------------- - = γx
- – ------------------------------------------ (A.3)
2 ( µ x, H 2 – µ x , H 1 )
Appendix A: Comparing Data Compensation to Distribution Compensation 104

Or presenting this graphically for a one dimensional random variable

accept H1 accept H2
R1 R2
µ x, H 1 P( H 2) px ( x H2)
H2
P( H 1) px H1
( x H1) µ x, H 2

σ x, H

γx

Figure A-1: The shadowed area represents the probability of making an error when clas-
sifying an incoming signal x as belonging to class H1 or H2. The line at the middle point di-
vides the space into regions R1 and R2.

The shadowed area represents the probability of making a wrong decision when classifying an in-
coming signal x as belonging to classes H1 or H2. The space is divided into two regions, R1 and R2.
Depending on what region the vector x is located we will classify the signal as belonging to class
H1 or H2.

The probability of error can be computed as

∞ γx

Pe = P ( H 1 ) ∫ p x H1( x H 1 )dx + P ( H 2 ) ∫ p x H ( x H 2 )dx (A.4)

2
γx –∞

where γ x is the decision threshold.

In the two following sections we compare model compensation with data compensations for
the simple case of additive noise, i.e., the signal x has been contaminated by a noise n with pdf
N n ( µ n, σ n ) , resulting in an observed noisy signal y. We also assume that the noise pdf is perfectly

known.
Appendix A: Comparing Data Compensation to Distribution Compensation 105

A.2. Speech Distribution parameter Compensation

In this case a model compensation method would modify the parameters in the distributions,
mean vectors and covariance matrices and perform classification with the noisy signal directly.

For the simple case of additive noise the new mean and covariance matrices can be easily cal-
culated as

µ y, H = µ x, H + µ n Σ y, H = Σ x, H + Σ n
1 1 1 1
(A.5)
µ y, H 2 = µ x, H 2 + µ n Σ y, H 2 = Σ x, H 2 + Σ n

For the previous case where the two covariance matrices where equal and the a priori probabilities
were different, the new decision rule with the noisy data example will be

µ y, H 1 + µ y, H 2 ( µ y, H 2 – µ y, H 1 ) P ( H 2 )
y ≥ ------------------------------- - log  ---------------
- – -------------------------------------------------------------------------------------- = γy (A.6)
2 t – 1  P ( H 1 )
( µ y, H 2 – µ y, H 1 ) Σ y, H ( µ y, H 2 – µ y, H 1 )

And the new probability of error will be

∞ γy

Pe = P ( H 1 ) ∫ p y H1( y H 1 )dy + P ( H 2 ) ∫ p y H ( y H 2 )dy (A.7)

2
γy –∞

For example, for the case where the two covariance matrices of the clean signal classes are equal
and the a priori probabilities of both classes are also equal Figure A-2 shows how the error as rep-
resented by the shadowed area increases as the variance of the noise increases. The relative distance
between the two classes remains constant as both are shifted the same distance, µ n . However, the
two distributions are wider as their width increases due to the effect of the noise covariance matrix
Σn

A.3. Data Compensation

In this case a data compensation method would apply a correction term to the noisy data pro-
ducing an estimated “clean” data vector. In most of the techniques proposed in this thesis the clean
data are obtained using a Minimum Mean Squared Error Estimation technique (MMSE).

x̂ = E { x y } = E { y – n y } = y – µ n (A.8)
Appendix A: Comparing Data Compensation to Distribution Compensation 106

accept H1 accept H2
R1 R2
µ y, H 2 P( H 2) py ( y H2)
P( H 1) py H1
( y H1) H2

µ y, H 1
σ y, H

γy

Figure A-2: The shadowed area represents the probability of making an error when clas-
sifying an incoming signal y as belonging to class H1 or H2. The line at the middle point di-
vides the space into regions R1 and R2.

The pdf of the estimated clean data would be also Gaussian with these mean and variance param-
eters

µ x̂, H 1 = µ y, H 1 – µ n = µ x, H 1 Σ x̂, H 1 = Σ y, H 1 = Σ x, H 1 + Σ n
(A.9)
µ x̂, H 2 = µ y, H 2 – µ n = µ x, H 2 Σ x̂, H 2 = Σ y, H 2 = Σ x, H 2 + Σ n

From these pdf’s the decision threshold for our graphical example would be

µ x, H 1 + µ x, H 2 ( µ x, H 2 – µ x, H 1 ) P( H 2)
γ x̂ ≥ -------------------------------
- – ------------------------------------------------------------------------------------------------------------- log  ---------------  (A.10)
2 t – 1  P ( H 1 )
( µ x, H 2 – µ x, H 1 ) ( Σ x , H + Σ n ) ( µ x, H 2 – µ x , H 1 )

However, the classification would be done using the clean signal statistics which are different
from the estimated clean statistics in the covariance terms. Using these clean signal pdf’s would
yield a decision threshold

µ x , H 1 + µ x, H 2 ( µ x, H 2 – µ x, H 1 ) P ( H 2 )
γ x ≥ ------------------------------- - log  ---------------
- – ------------------------------------------------------------------------------------------- (A.11)
2 t – 1  P ( H 1 )
( µ x, H 2 – µ x, H 1 ) Σ x, H ( µ x, H 2 – µ x, H 1 )
Appendix A: Comparing Data Compensation to Distribution Compensation 107

Performing the classification with the clean signal distributions would introduce an additional error
due to using the wrong decision threshold. In the next figure this additional error is marked as a
small shadowed surface between the two thresholds γ x and γ x̂ .

accept H1 accept H2
R1 R2
µ x, H 2 P ( H 2 ) p x̂ ( x H2)
P ( H 1 ) p x̂ H1
( x̂ H 1 ) H2

µ x, H 1
Σ x̂, H

γ x γ x̂

Figure A-3: The shadowed area represents the probability of making an error when clas-
sifying an incoming signal y as belonging to class H1 or H2. The area is split into two regions,
the stripped one represent the normal error due to the classifier, the shadowed one (smaller
and above the other one) represents the additional error due to improper modeling of the
effect of the environment on the variances of the signal distributions.

In general data compensation methods incur greater errors compared to model compensation
methods, due to improper modelling of the effects of the environment in the variances of the dis-
tributions.
Appendix A: Comparing Data Compensation to Distribution Compensation 108
109

Appendix B
Solutions for the SNR-RATZ Correction Factors
In this appendix we provide solutions for the SNR-RATZ correction factors i, R i, r i, j and R i, j .
The solutions depend on the availability of simultaneous clean and noisy (stereo) recordings. We
first describe the generic solutions for the case in which only samples of noisy speech are available
(the blind case) and we then describe how to particularize the previous solutions for the stereo case.

B.1. Non stereo based solutions

Given an observed noisy set of vectors Y = { y 1, y 2, ....., y T } of length T, and assuming that
these vectors have been produced by the probability density function

M N

∑ ∑ a i, j N y
2
p( y) = a i N y i ( µ y , i, σ y , i ) i, j ( µ y 1, i, j, Σ y1, i, j ) (B.1)
0 0 0 1
i=0 j=0

i.e., a double summation of Gaussians where each of them relates to the corresponding clean
speech Gaussian according to

2 2
µ y0, i = r i + µ x0, i σ y0, i = R i + σ x0, i
(B.2)
µ y1, i, j = r i, j + µ x1, i, j Σ y1, i, j = R i, j + Σ x1, i,

we can define a log likelihood function L ( Y ) as

T
 T   M N

L ( Y ) = log  ∏ p ( y t ) = ∑ log  ∑ ai N y0 i ( µ y0, i, σ y0, i ) ∑ ai, j N y1
2
i, j ( µ y 1, i, j, Σ y1, i, j ) (B.3)
t = 1  t=1 i=0 j=0 

We can also express it in terms of the original clean speech parameters and the correction terms
i, R i, r i, j and R i, j

T
 M N

L ( Y ) = ∑ log  ∑ a i N y0 i ( r i + µ x0, i, R i + σ x0, i ) ∑ a i, j N y1
2
i, j ( r i, j + µ x1, i, j, R i, j + Σ x1, i, j ) (B.4)
t=1 i = 0 j=0 

Our goal is to find all the terms i, R i, r i, j and R i, j that maximize the log likelihood. For this

problem we can also use the EM algorithm defining a new auxiliary function Q ( φ, φ ) as

Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] (B.5)
Appendix B: Solutions for the SNR-RATZ Correction Factors 110

where the ( Y , S ) pair represents the complete data, composed of the observed data Y (the noisy
vectors) and the unobserved data S (it indicates what Gaussian produced an observed data vector).
The φ symbol represents the correction terms i, R i, r i, j and R i, j .

Equation (B.5) can we expanded as

T M N
p ( y t , i, j φ )
Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] = ∑ ∑ ∑ ----------------------------
p ( yt φ )
- log ( p ( y t, i, j φ ) ) (B.6)
t=1 i=1 j=1

hence,

T M N
1 1
∑ ∑ ∑ P [ i, j
2
Q ( φ, φ ) = y t, φ ] { log a i – --- log ( 2π ) – --- log R i + σ x0, i +
2 2
t=1 i=1 j=1
1 2 2 L–1 L–1 (B.7)
– --- ( y 0, t – ( r i + µ x 0, i ) ) ⁄ ( R i + σ x0, i ) + log a i, j – ------------ log ( 2π ) – ------------ log R i, j + Σ x1, i, j +
2 2 2
1 T –1
– --- ( y 1, t – ( r i, j + µ x1, i, j ) ) ( R i, j + Σ x1, i, j ) ( y 1, t – ( r i, j + µ x1, i, j ) ) }
2

where L is the dimensionality of the cepstrum vector. The expression can be further simplified to

T M N
1
∑ ∑ ∑ P [ i, j
2
Q ( φ, φ ) = constants + y t, φ ] { – --- log R i + σ x0, i +
2
t=1 i=1 j=1
1 2 2 L–1 (B.8)
– --- ( y 0, t – ( r i + µ x0, i ) ) ⁄ ( R i + σ x0, i ) – ------------ log R i, j + Σ x1, i, j +
2 2
1 T –1
– --- ( y 1, t – ( r i, j + µ x1, i, j ) ) ( R i, j + Σ x1, i, j ) ( y 1, t – ( r i, j + µ x1, i, j ) ) }
2

B.1.1 Solutions for the r i and Ri parameters

To find the r i and R i parameters we simply take derivatives and equate to zero,

T N

∑ ∑ P [ i, j
2
∇r i Q ( φ, φ ) = y t, φ ] ( y 0, t – ( r i + µ x0, i ) ) ⁄ ( R i + σ x0, i ) = 0 (B.9)
t=1 j=1

T N
1 2
Q ( φ, φ ) = – ∑ ∑ P [ i, j
2
∇ 2 
–1 y t, φ ] --- ( ( R i + σ x0, i ) – ( y 0, t – ( r i + µ x0, i ) ) ) = 0 (B.10)

Ri + σ x 2
0, i t=1 j=1
Appendix B: Solutions for the SNR-RATZ Correction Factors 111

hence,

T N T

∑ ∑ P [ i, j y t, φ ] ( y 0, t – µ x0, i ) ∑ P[i y t, φ ] ( y 0, t – µ x0, i )

t=1 j=1 t=1
ri = --------------------------------------------------------------------------------
T N
- = ---------------------------------------------------------------
T
- (B.11)

∑ ∑ P [ i, j y t, φ ] ∑ P[i y t, φ ]
t=1 j=1 t=1

T N

∑ ∑ P [ i, j
2
y t, φ ] ( ( y 0, t – ( r i + µ x0, i ) ) )
2
=1 j=1
T N
- – σ x0, i
R i = t------------------------------------------------------------------------------------------------------
∑ ∑ P [ i, j y t, φ ] (B.12)
t=1 j=1

 T 2  
T

=  ∑ P [ i y t, φ ] ( ( y 0, t – ( r i + µ x0, i ) ) ) ⁄  ∑ P [ i y t, φ ] – σ x0, i
2

t = 1  t = 1 

B.1.2 Solutions for the ri, j and i, j parameters

To find the r i, j and R i, j parameters we simply take derivatives and equate to zero,

∑ P [ i, j
–1
∇ri, j Q ( φ, φ ) = y t, φ ] ( R i, j + Σ x1, i, j ) ( y 1, t – ( r i, j + µ x1, i, j ) ) = 0 (B.13)
t=1

T
∇ –1 Q ( φ, φ ) = – ∑ P [ i, j y t, φ ]{ ( R i, j + Σ x1, i, j ) –
( R i, j + Σ x )
1, i, j t=1 (B.14)
T
( y 1, t – ( r i, j + µ x1, i, j ) ) ( y 1, t – ( r i, j + µ x1, i, j ) ) }

resulting in the following solutions

∑ P [ i, j y t, φ ] ( y 1, t – µ x1, i, j )
r i, j = t-------------------------------------------------------------------------
=1
T
- (B.15)
∑ P [ i, j y t , φ ]
t=1

∑ P [ i, j
T
y t, φ ] ( ( y 1, t – ( r i, j + µ x1, i, j ) ) ( y 1, t – ( r i, j + µ x1, i, j ) ) )
R i, j = t-----------------------------------------------------------------------------------------------------------------------------------------------------------
=1
T
– Σ x1, i, j (B.16)

∑ P [ i, j y t, φ ]
t=1
Appendix B: Solutions for the SNR-RATZ Correction Factors 112

These equations for the basis of an iterative algorithm. The EM algorithm guarantees that each
iteration produces better estimates in a ML sense.

B.2. Stereo based solutions

When stereo adaptation data is available we can readily assume that the a posteriori probabil-
ities P [ i, j y t, φ ] can be directly estimated by P [ i, j x t, φ ] . We call this assumption a posteriori in-
variance.

B.2.1 Solutions for the r i and Ri parameters

The resulting estimates for the r i and R i parameters are

∑ P[i x t, φ ] ( y 0, t – µ x0, i )
r i = t---------------------------------------------------------------
=1
T
- (B.17)

∑ P[i x t, φ ]
t=1

∑ P[i
2
x t, φ ] ( ( y 0, t – ( r i + µ x0, i ) ) )
2
=1
T
- – σ x 0, i
R i = t------------------------------------------------------------------------------------- (B.18)

∑ P[i x t, φ ]
t=1

Further simplification can be achieved by substituting the ( y 0, t – µ x0, i ) term by ( y 0, t – x 0, t ) re-

sulting in the formulas

∑ P[i x t, φ ] ( y 0, t – x 0, t )
t=1
r i = ------------------------------------------------------------- (B.19)
 T 
 ∑ P [ i x t, φ ] 
t = 1 

∑ P[i
2
x t, φ ] ( ( y 0, t – ( x 0, t + r i ) ) )
2
=1
- – σx
R i = t----------------------------------------------------------------------------------- (B.20)
T 0, i

∑ P[i x t, φ ]
t=1
Appendix B: Solutions for the SNR-RATZ Correction Factors 113

B.2.2 Solutions for the i, j and i, j parameters

Using the a posteriori invariance property we obtain the following formulas

∑ P [ i, j x t, φ ] ( y 1, t – µ x
1, i, j )
t=1
r i, j = -------------------------------------------------------------------------
T
- (B.21)

∑ P [ i, j x t, φ ]
t=1

∑ P [ i, j
T
x t, φ ] ( ( y 1, t – ( r i, j + µ x1, i, j ) ) ( y 1, t – ( r i, j + µ x1, i, j ) ) )
R i, j = t-----------------------------------------------------------------------------------------------------------------------------------------------------------
=1
T
– Σ x1, i, j (B.22)

∑ P [ i, j x t, φ ]
t=1

Further simplification can be achieved by substituting the ( y 1, t – µ x1, i, j ) term by ( y 1, t – x 1, t )

resulting in

∑ P [ i, j x t, φ ] ( y 1, t – x 1, t )
t=1
r i, j = -------------------------------------------------------------------
T
- (B.23)

∑ P [ i, j x t, φ ]
t=1

∑ P [ i, j
T
x t, φ ] ( ( y 1, t – x 1, t – r i, j ) ( y 1, t – x 1, t – r i, j ) )
=1
T
- – Σ x1, i, j
R i, j = t------------------------------------------------------------------------------------------------------------------------------------ (B.24)

∑ P [ i, j x t, φ ]
t=1

This concludes the derivation of the correction factors for the stereo and non stereo based SNR
RATZ compensation algorithms.
Appendix B: Solutions for the SNR-RATZ Correction Factors 114
115

Appendix C
Solutions for the Distribution Parameters for Clean
Speech using SNR-RATZ
In this appendix we provide the EM solutions for the parameters of the SNR-RATZ clean
speech cepstrum distributions.

It is important to notice also that our goal is only to show the use of the Expectation-Maximi-
zation (EM) algorithm in solving the Maximum Likelihood equations that will appear in this sec-
tion. The reader is referred to [9][13] for a detailed explanation of the EM algorithm.

C.1. Basic Assumptions

The SNR-RATZ distribution for the clean speech cepstrum vectors has the following structure

M N

∑ ∑ a i, j N x
2
p( x φ) = a i N x0 i ( µ x0, i, σ x0, i ) i, j ( µ x 1, i, j, Σ x1, i, j ) (C.1)
1
i=1 j=1

where we define φ as

2 2
φ = {a 1, .., a M, µ x0, 1, ..., µ x0, M, σ x0, 1, ..., σ x0, M,
(C.2)
a 1, 1, ..., a M, N , µ x , ..., µ x1, M, N , Σ x1, 1, 1, ..., Σ x1, M, N }
1, 1, 1

T T
i.e., the set of parameters that are unknown, and where the cepstrum vector x = [ x 0 x 1 ] is split

in two parts, the energy component x 0 , and x 1 , a vector composed of the x 1 , x 2 ..., x L – 1 components
of the original cepstrum vector.

As in many other Maximum Likelihood problems given an ensemble of T clean vectors or ob-
servations X = { x 1, x 2, ..., x T } , we can define a log likelihood function,

T
L( X φ) = ∑ log ( p ( xt φ ) ) (C.3)
t=1

Our goal is to find the set of parameters φ that maximize the log likelihood of the observed data
X.
Appendix C: Solutions for the Distribution Parameters for Clean Speech using SNR-RATZ 116

2
C.2. EM solutions for the µ x , i, σ x , i, µ x , i, j, Σ x , i, j parameters
0 0 1 1
As it turns out there is no direct solution to this problem and indirect methods are necessary.
The Expectation-Maximization (EM) algorithm is one of these methods. The EM algorithm defines
a new auxiliary function Q ( φ, φ ) as

Q ( φ, φ ) = E [ L ( X, S φ ) X, φ ] (C.4)

where the ( X , S ) pair represents the complete data, composed of the observed data X (the clean
cepstrum vectors) and the unobserved data S (it indicates what two Gaussians produced an ob-
served data vector).

The basis of the EM algorithm lies in the fact that given two sets of parameters φ and φ , if
Q ( φ, φ ) ≥ Q ( φ, φ ) , then L ( X, φ ) ≥ L ( X, φ ) . In other words, maximizing Q ( φ, φ ) with respect to

the φ parameters is guaranteed to increase the likelihood L ( X, φ ) .

Since the unobserved data S are described by a discrete random variable (the mixture index in
our case), Equation (C.4) can we expanded as

T M N
p ( x t , i, j φ )
Q ( φ, φ ) = E [ L ( X, S φ ) X, φ ] = ∑ ∑ ∑ ----------------------------
p ( xt φ )
- log ( p ( x t, i, j φ ) ) (C.5)
t=1 i=1 j=1

This can be further expanded to

T M N
1 1
∑ ∑ ∑ P [ i, j
2
Q ( φ, φ ) = x t, φ ] { log a i – --- log ( 2π ) – --- log σ x0, i +
2 2
t=1 i=1 j=1
1 2 2 L–1 L–1 (C.6)
– --- ( x 0, t – µ x0, i ) ⁄ σ x0, i + log a i, j – ------------ log ( 2π ) – ------------ log Σ x1, i, j +
2 2 2
1 T –1
– --- ( x 1, t – µ x1, i, j ) Σ x1, i, j ( x 1, t – µ x1, i, j ) }
2

where L is the dimensionality of the cepstrum vector. The expression can be further simplified to
Appendix C: Solutions for the Distribution Parameters for Clean Speech using SNR-RATZ 117

T M N
1
∑ ∑ ∑ P [ i, j
2
Q ( φ, φ ) = constants + x t, φ ] { log a i – --- log σ x0, i +
2
t=1 i=1 j=1
1 2 2 L–1 (C.7)
– --- ( x 0, t – µ x0, i ) ⁄ σ x0, i + log a i, j – ------------ log Σ x1, i, j +
2 2
1 T –1
– --- ( x 1, t – µ x , i, j ) Σ x1, i, j ( x 1, t – µ x , i, j ) }
2 1 1

To find the φ parameters we simply take derivatives and equate to zero. The solutions for the
2
µx and σ x0, i parameters are
0, i

T N

∑ ∑ P [ i, j
2
∇µ x , i Q ( φ, φ ) = x t, φ ] ( x 0, t – µ x0, i ) ⁄ σ x0, i = 0 (C.8)
0
t=1 j=1

T N
1 2
∇σ–2 Q ( φ, φ ) = – ∑ ∑ P [ i, j
2
x t, φ ] --- ( σ x0, i – ( x 0, t – µ x0, i ) ) = 0 (C.9)
x 0, i 2
t=1 j=1

hence,

T N T

∑ ∑ P [ i, j x t, φ ]x 0, t ∑ P[i x t, φ ]x 0, t
µ x0, i = t----------------------------------------------------------
=1 j=1
T N
- = t-----------------------------------------
=1
T
- (C.10)

∑ ∑ P [ i, j x t , φ ] ∑ P[i x t, φ ]
t=1 j=1 t=1

T N T

∑ ∑ P [ i, j ∑ P[i
2 2
x t, φ ] ( x 0, t – µ x0, i ) x t, φ ] ( x 0, t – µ x0, i )
–2
σ x0, i = t-----------------------------------------------------------------------------------
=1 j=1
T N
t=1
= -----------------------------------------------------------------
T
- (C.11)

∑ ∑ P [ i, j x t, φ ] ∑ P[i x t, φ ]
t=1 j=1 t=1

Similarly, the solutions for the µ x1, i, j and Σ x1, i, j parameter are

∑ P [ i, j
–1
∇µ x , i, j Q ( φ, φ ) = x t, φ ]Σ x1, i, j ( x 1, t – µ x1, i, j ) = 0 (C.12)
1
t=1
Appendix C: Solutions for the Distribution Parameters for Clean Speech using SNR-RATZ 118

T
1
Q ( φ, φ ) = – ∑ P [ i, j x t, φ ] --- ( Σ x1, i, j – ( x 1, t – µ x1, i, j ) ( x 1, t – µ x1, i, j ) ) = 0
T
∇ –1 (C.13)
Σx 2
1, i, j
t=1

hence,

∑ P [ i, j x t, φ ] x 1, t
t=1
µ x , i, j = -----------------------------------------------
T
- (C.14)
1
∑ P [ i, j xt, φ ]
t=1

∑ P [ i, j
T
x t, φ ] ( x 1, t – µ x1, i, j ) ( x 1, t – µ x1, i, j )
Σ x1, i, j = t----------------------------------------------------------------------------------------------------------------
=1
T
(C.15)

∑ P [ i, j x t, φ ]
t=1

C.3. EM solutions for the i, ai, j parameters

The solutions for the a i and the a i, j parameters cannot be obtained by simple derivatives. These
parameters have the following additional constraints

M N

∑ ai = 1 ∑ aij = 1 (C.16)
=1 j=1

therefore to find the solutions for these parameters we need to use a Lagrange multiplier.

We can build two auxiliary functions f aux and g aux as

M N
f aux = α ∑ a i + Q ( φ, φ ) g aux = β ∑ aij + Q ( φ, φ ) (C.17)
i=1 j=1

where α and β are the Lagrange multipliers.

Taking the partial derivative of f aux and g aux with respect to a i and a ij respectively and equating
to zero
Appendix C: Solutions for the Distribution Parameters for Clean Speech using SNR-RATZ 119

T N T N
∂ f aux 1
∂ ai
= α+ ∑ ∑ P [ i, j x t, φ ] ---- = 0
ai
⇒ αa i + ∑ ∑ P [ i, j x t, φ ] = 0 (C.18)
t=1 j=1 t=1 j=1

T T
∂g aux 1
∂ a ij
= β+ ∑ P [ i, j x t, φ ] ----- = 0
a ij
⇒ βa ij + ∑ P [ i, j x t, φ ] = 0 (C.19)
t=1 t=1

summing over i and j respectively

M T N M T N M
α ∑ ai + ∑ ∑ ∑ P [ i, j x t, φ ] = 0 ⇒ α=–∑ ∑ ∑ P [ i, j x t, φ ] = – T (C.20)
i=1 t=1 j=1 i=1 t=1 j=1 i=1

N T N T N
β ∑ aij + ∑ ∑ P [ i, j x t, φ ] = 0 ⇒ β=–∑ ∑ P [ i, j x t, φ ] (C.21)
j=1 t=1 j=1 t=1 j=1

entering the value of α in Equation (C.16) and the value of β in Equation (C.17)

T N T N T
1 1
– T ai + ∑ ∑ P [ i, j x t, φ ] = 0 ⇒ a i = --- ∑
T ∑ P [ i, j x t, φ ] = --- ∑ P [ i x t, φ ]
T
(C.22)
t=1 j=1 t=1 j=1 t=1

T N T ∑ P [ i, j x t , φ ]
– a ij ∑ ∑ P [ i, j x t, φ ] + ∑ P [ i, j x t, φ ] = 0 ⇒ a ij = T
t=1
-------------------------------------------------
N
- (C.23)
t=1 j=1 t=1
∑ ∑ P [ i, j x t, φ ]
t=1 j=1

This concludes the derivation of the EM solutions for the parameters of the SNR-RATZ clean
speech cepstrum distributions.
Appendix C: Solutions for the Distribution Parameters for Clean Speech using SNR-RATZ 120
121

Appendix D
EM Solutions for the n and q Parameters for the VTS
Algorithm
In this appendix, we provide the EM solutions for the environmental parameters for the case of
an additive noise and linear channel environment. We also provide solutions for the Vector Taylor
series or order zero and one.

D.1. Solutions for the vector Taylor series of order one

As we mentioned in Section 8.4. for the case of a first order Taylor series approximation the
mean vector and covariance matrix of each of the individual mixtures of the Gaussian mixture can
be expressed as

µ k, y ≅ µ k, x + ( I + ∇h g ( µ k, x, n 0, h 0 ) ) h + ∇n g ( µ k, x, n 0, h 0 ) n +
(D.1)
g ( µ k, x, n 0, h 0 ) – ∇h g ( µ k, x, n 0, h 0 ) h 0 – ∇n g ( µ k, x, n 0, h 0 ) n 0

T
Σ k, y ≅ ( I + ∇h g ( µ k, x, n 0, h 0 ) )Σ k, x ( I + ∇h g ( µ k, x, n 0, h 0 ) ) (D.2)

In this case, the expression for the mean of the noisy speech log-spectrum distribution is a linear
function of the unknown variables n and h can be rewritten as

µ k, y ≅ a k + B k h + C k n
a k = µ k, x + g ( µ k, x, n 0, h 0 ) – ∇h g ( µ k, x, n 0, h 0 ) h 0 – ∇n g ( µ k, x, n 0, h 0 ) n 0 (D.3)
B k = I + ∇h g ( µ k, x, n 0, h 0 ) C k = ∇n g ( µ k, x, n 0, h 0 )

while the expression for the covariance matrix is dependent only of the initial values µ x , n 0 and h 0
which are known. Therefore, given the observed noisy data we can define a likelihood function

S–1
L ( Y = { y 0, y 1, ..., y S – 1 } ) = ∑ log ( p ( yt h, n ) ) (D.4)
t=0
Appendix D: EM Solutions for the n and q Parameters for the VTS algorithm 122

where the only unknowns are the n and h variables. To find these unknowns we can use a tradi-
tional iterative EM approach.

We define an auxiliary function Q ( φ, φ ) as

Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] (D.5)

Equation (D.5) can we expanded as

S–1 K–1
p ( y t, k φ )
Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] = ∑ ∑ -----------------------
p ( yt φ )
- log ( p ( y t, k φ ) ) (D.6)
t=0 k=0

hence,

S–1 K–1
L
Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] = ∑ ∑ P[k y t, φ ]{ log p k + --- log 2π +
2
t=0 k=0 (D.7)
L 1 T –1
– --- log Σ k, y – --- ( y t – ( a k + B k h + C k n ) ) Σ k, y ( y t – ( a k + B k h + C k n ) )}
2 2

where L is the dimension of the log-spectrum vector and the terms a k , B k and C k are the terms de-
scribed in Equation (D.3) particularize for each of the individual Gaussians of p ( x t ) . The expres-
sion can be further simplified to

S–1 K–1
1
Q ( φ, φ ) = constants – --- ∑ ∑ P[k y t, φ ]
2 (D.8)
t=0 k=0
T –1
( y t – ( a k + B k h + C k n ) ) Σ k, y ( y t – ( a k + B k h + C k n ) )

To find the n and h parameters we simply take derivatives and equate to zero,

S–1 K–1
(D.9)

∑ ∑ P[k
T –1
∇n Q ( φ, φ ) = y t, φ ]{C k Σ k, y ( y t – ( ak + Bk h + Ck n ) ) = 0
t=0 k=0
Appendix D: EM Solutions for the n and q Parameters for the VTS algorithm 123

(D.10)
S–1 K–1

∑ ∑ P[k
T –1
∇h Q ( φ, φ ) = y t, φ ]{B k Σ k, y ( y t – ( a k + B k h + C k n ) ) = 0
t=0 k=0

The above two vector equations can be simplified to

d–E h–F n = 0
(D.11)
g–H h–J n = 0

where each of the d, E, F, g, H, and J terms is expanded as

S–1 K–1 S–1 K–1

∑ ∑ P[k ∑ ∑ P[k
T –1 T –1
d = y t, φ ]{C k Σ k, y ( y t – ak ) g = y t, φ ]{B k Σ k, y ( y t – a k )
t=0 k=0 t=0 k=0
S–1 K–1 S–1 K–1
(D.12)
∑ ∑ P[k ∑ ∑ P[k
T –1 T –1
E = y t, φ ]{C k Σ k, y B k H = y t, φ ]{B k Σ k, y B k
t=0 k=0 t=0 k=0
S–1 K–1 S–1 K–1

∑ ∑ P[k ∑ ∑ P[k
T –1 T –1
F = y t, φ ]{C k Σ k, y C k J = y t, φ ]{B k Σ k, y C k
t=0 k=0 t=0 k=0

Equation (D.5) can be rewritten as

d = E F h (D.13)
g H J n

where we have created an expanded matrix composed of the E, F, H, and J matrices. We also have
created an expanded vector composed of the concatenation of the d and g vectors. The above linear
system yields the following solutions

–1 –1 –1
h = (H – J F E) ( g – J F d)
(D.14)
–1 –1 –1
n = (J – H F E) ( g – H F d)

There is no solutions strictly speaking if the extended matrix E F is not invertible. We might
H J

be faced with a situation where there are no solution or there are an infinite number of solutions.
This occurs when the solutions obtained for the h and n vectors converge to ± infinity. To avoid
Appendix D: EM Solutions for the n and q Parameters for the VTS algorithm 124

this behavior we impose the empirical constraint in the space of solutions that any log-spectrum
component h i or n i can only exist in the range h i, max ≤ h i ≤ h i, min or n i, max ≤ n i ≤ n i, min . The upper
and lower boundaries are set experimentally.

Once the solutions for h and n are found we can substitute them for h and n and iterate the
procedure until convergence is obtained.

D.2. Solutions for the vector Taylor series of order one

For the case of a zeroth-order Taylor series approximation the mean vector and covariance ma-
trix of each of the individual mixtures of the Gaussian mixture can be expressed as

µ k, y ≅ µ k, x + h + g ( µ k, x, n 0, h 0 ) (D.15)

Σ k, y ≅ Σ k, x (D.16)

In this case, the expression for the mean of the log-spectral distribution of the noisy speech is
a linear function of the unknown variable h and can be rewritten as

µ k, y ≅ a k + h
(D.17)
a k = µ k, x + g ( µ k, x, n 0, h 0 )

Therefore, given the observed noisy data we can define a likelihood function

S–1
L ( Y = { y 0, y 1, ..., y S – 1 } ) = ∑ log ( p ( yt h ) ) (D.18)
t=0

where the only unknown is the h variable. To find h we can use again a traditional iterative EM

approach. We define an auxiliary function Q ( φ, φ ) as

Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] (D.19)

that can we expanded as

S–1 K–1
L
Q ( φ, φ ) = E [ L ( Y , S φ ) Y , φ ] = ∑ ∑ P[k y t, φ ]{ log p k + --- log 2π +
2
t=0 k=0 (D.20)
L 1 T –1
– --- log Σ k, y – --- ( y t – ( a k + h ) ) Σ k, y ( y t – ( a k + h ) )}
2 2
Appendix D: EM Solutions for the n and q Parameters for the VTS algorithm 125

where L is the dimension of the log-spectrum vector and the term a k is the term described in Equa-
tion (D.3). The expression can be further simplified to

S–1 K–1
1 T
Q ( φ, φ ) = constants – --- ∑ ∑ P[k
–1
y t, φ ] ( ( y t – ( a k + h ) ) Σ k, x ( y t – ( a k + h ) ) ) (D.21)
2
t=0 k=0

To find the h parameter we simply take derivative and set equal to zero,

S–1 K–1

∑ ∑ P[k
–1
∇h Q ( φ, φ ) = y t, φ ]{Σ k, x ( y t – ( a k + h ) ) = 0 (D.22)
t=0 k=0

The above vector equation yields the following solution for h

–1
S – 1 K–1
–1  
S–1 K–1

h = ∑ ∑ k, x   ∑ ∑ P[k
–1
P [ k y t , φ ]{Σ y t, φ ]{Σ k, x ( y t – a k ) (D.23)
t = 0 k=0  t = 0 k=0 

Once the h variable is found we can substitute it for h and iterate the procedure until conver-
gence is obtained.

As we can see a zeroth-order Taylor approximation does not provide solution for the parameter
n. To remedy this problem we can redefine the relationship between noisy speech and clean speech
as

y ≅ n + f ( x, n, h ) (D.24)

With this new environmental equation we can define a zeroth-order Taylor expansion

y ≅ n + f ( x 0, n 0, h 0 ) (D.25)

Taking expected values in both sides of the equation yields the expression

µ y ≅ n + f ( µ x, n 0, h 0 ) (D.26)

We now can use this expression to define a likelihood function that can be maximized via the
EM algorithm resulting in the following iterative solution

–1 S – 1 K – 1
S – 1 K–1
–1   
n = ∑ ∑ P[k ∑ ∑ P[k
–1
y t, φ ]{Σ k, x y t, φ ]{Σ k, x ( y t – f ( µ k, x, n 0, h 0 ) ) (D.27)
t = 0 k=0  t = 0 k=0 
Appendix D: EM Solutions for the n and q Parameters for the VTS algorithm 126

Although this is not a rigorous solution it works reasonably well in practice and solves the prob-
lem of not having a proper way of estimating the noise vector for zeroth-order Taylor approxima-
tions.
127

REFERENCES
[1] A. Acero, “Acoustical and Environmental Robustness in Automatic Speech Recognition”, Ph.D.
Thesis, Department of Electrical and Computer Engineering, Carnegie Mellon University, Sept.
1990.
[2] K. Aikawa, H. Singer, H. Kawahara, Y. Tohkura, “A dynamic cepstrum incorporating time-frequen-
cy masking and its application to continuous speech recognition”, IEEE International Conference
on Acoustics, Speech, and Signal Processing, May, 1993
[3] A. Akivis & V. V. Goldberg, An introduction to Linear Algebra and Tensors. Dover Publications,
1990.
[4] A. Anastasakos, F. Kubala, J. Makhoul and R. Schwartz, “Adaptation to new microphones using
Tied-Mixtures Normalization”. Proceedings of the Spoken Language Technology Workshop,
March, 1994.
[5] F. Alleva, X. Huang, and M. Hwang, “An Improved Search Algorithm for Continuous Speech Rec-
ognition”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. II 307-
310, May, 1993.
[6] J. Baker, “Stochastic Modeling as a Means of Automatic Speech Recognition”, Ph.D. Thesis, Com-
puter Science Department, Carnegie Mellon University, April 1975.
[7] R. Bakis, “Continuous Speech Recognition via Centisecond Acoustic States”, 91st Meeting of the
Acoustical Society of America, April, 1976.
[8] L. Bahl, F. Jelinek, and R. Mercer, “A Maximum Likelihood Approach to Continuous Speech Rec-
ognition” IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5(2):179-190,
March 1983.
[9] L. Baum, “An Inequality and Associated Maximization Technique in Statistical Estimation of Prob-
abilistic Functions of Markov Processes”, Inequalities 3:1-8, 1972.
[10] Y. Chow, M. Dunham, O. Kimball, M. Krasner, F. Kubala, J. Markoul, S. Roucos, and R. Schwartz,
“BYBLOS: The BBN Continuous Speech Recognition System”, IEEE International Conference on
Acoustics, Speech, and Signal Processing, pp.89-92, April, 1987.
[11] S. Davis, and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word
Recognition in Continuously Spoken Sentences”, IEEE Transactions on Acoustics, Speech, and
Signal Processing, vol. ASSP-28, No. 4, pp. 357-366, August 1980.
[12] R. O. Duda & P. E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, 1973.
[13] A. P. Depster, N. M. Laird & D. B. Rubin, “Maximum Likelihood from incomplete data via the EM
algorithm (with discussion)”, Journal of the Royal Stat. Society, Series B, Vol.39, pp.1-38.
[14] J. Flanagan, J. Johnston, R. Zahn, and G. Elko, “Computer-steered Microphone Arrays for Sound
Transduction in Large Rooms”, the Journal of Acoustical Society of America, Vol. 78, pp. 1508-
1518, Nov. 1985.
[15] M. F. Gales, “Model-Based Techniques for Noise Robust Speech Recognition”. Ph.D. Thesis, En-
gineering Department, Cambridge University, Sept. 1995.
[16] O. Ghitza, “Auditory Nerve Representation as a Front-End for Speech Recognition in a Noisy En-
References 128

vironment”, Computer Speech and Language, Vol. 1, pp. 109-130, 1986.

[17] L. Gillick, and S. Cox, “Some Statistical Issues in the Comparison of Speech Recognition Algo-
rithms”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 532-535,
May 1989.
[18] Y. Gong and J. P. Haton, “Stochastic trajectory modeling for speech recognition”, IEEE Interna-
tional Conference on Acoustics, Speech, and Signal Processing, pp. I-57 - 60, April, 1994.
[19] H. Hermansky, “Perceptual Linear Prediction (PLP) Analysis for Speech”. J. Acoust. Soc. Amer.
Vol. 87, pp. 1738-1752.
[20] H. Hermansky, N. Morgan, and H. Hirsch, “Recognition of Speech in Additive and Convolutional
Noise Based on RASTA Spectral Processing”, IEEE International Conference on Acoustics,
Speech, and Signal Processing, pp. II-83 - 86, April, 1993.
[21] Huang, Ariki & Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press,
1990.
[22] X. Huang, F. Alleva, H. Hon, M. Hwang, K. Lee, R. Rosenfeld, “The SPHINX-II Speech Recogni-
tion System: An Overview”, Computer Speech and Language, vol. 2, pp. 137-148, 1993.
[23] M. Hwang, “Subphonetic Acoustic Modeling for Speaker-Independent Continuous Speech Recog-
nition”, Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, Dec. 1993.
[24] M. Hwang, and X. Huang, “Shared-Distribution Hidden Markov Models for Speech Recognition”,
IEEE Transactions on Speech and Audio Processing, vol. 1, pp. 414-420, 1993.
[25] F. Jelinek, “Continuous Speech Recognition by Statistical Methods”, Proceedings of the IEEE
64(4):532-556, April 1976.
[26] B. Juang, “Speech Recognition in Adverse Environments”, Computer Speech and Language, Vol.
5, pp. 275-294, 1991.
[27] B. Juang, and L. Rabiner, “Mixture Autoregressive Hidden Markov Models for Speech Signals”,
IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-33, pp. 1404-1413, 1985.
[28] D. J Kershaw, A. J. Robinson & S. J. Renals, “The 1995 Abbot Hybrid Connectionist-HMM Large-
Vocabulary Recognition System”, Proceeding of the 1996 ARPA Speech Recognition Workshop,
Feb. 1996.
[29] C. Lee, L. Rabiner, R. Pieraccini, and J. Wilpon, “Acoustic Modeling for Large Vocabulary Speech
Recognition” Computer Speech and Language, vol. 4, 1990.
[30] K. Lee and H. Hon, “Large-Vocabulary Speaker-Independent Continuous Speech Recognition”,
IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 123-126, April
1988.
[31] K. Lee, H. Hon, and R. Reddy, “An Overview of the SPHINX Speech Recognition”, IEEE Trans-
actions on Acoustics, Speech, and Signal Processing, pp. 35-45, Jan. 1990.
[32] K. Lee, “Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX
System”, Ph.D. Thesis, School of Computer Science, Carnegie Mellon University, April. 1988.
[33] C. J. Leggeter and P. C. Woodland, “Maximum Likelihood Linear Regression for Speaker Adapta-
tion of Continuous Density Hidden Markov Models”. Computer Speech & Language, Vol. 9, pp.
References 129

171-185.
[34] S. Levinson, L. Rabiner, M. Sondhi, “An Introduction to the Application of the Theory of Probabi-
listic Function on a Markov Process to Automatic Speech Recognition”, the Bell System Technical
Journal 62(4), April, 1983.
[35] F.H. Liu, “Environment Adaptation for Robust Speech Recognition”. Ph.D. Thesis, dept. of ECE,
Carnegie Mellon University, June 1994.
[36] F. H. Liu, Personal Communication.
[37] F.H. Liu, A. Acero, and R. Stern, “Efficient Joint Compensation of Speech For the Effects of Addi-
tive Noise and Linear Filtering”, IEEE International Conference on Acoustics, Speech, and Signal
Processing, pp. I-257 - I-260, March, 1992
[38] J. Markel, and A. Gray, Linear Prediction of Speech, Springer-Verlag, 1976.
[39] P. J. Moreno, B. Raj, R. M. Stern, “Multivariate Gaussian-Based Cepstral normalization”, IEEE In-
ternational Conference on Acoustics, Speech, and Signal Processing, May, 1995.
[40] P. J. Moreno, B. Raj, R. M. Stern, “A Unified Approach to Robust Speech Recognition”, Proceed-
ings of Eurospeech 1995, Madrid, Spain.
[41] P. J. Moreno, B. Raj, R. M. Stern, “Approaches to Environment Compensation in Automatic Speech
Recognition”, Proceeding of the 1995 International Conference in Acoustics ICA’95, Throndhein,
Norway, June 1995.
[42] N. Morgan, H. Boulard, S. Greenberg and H. Hermansky, “Stochastic Perceptual Auditory-based
models for speech recognition”, Proceedings of the 1994 International Conference on Spoken Lan-
guage Processing (ICSLP). Vol. 4, pp. 1943-6, Yokohama, Japan.
[43] L. Neumeyer and M. Weintraub, “Probabilistic Optimum Filtering for Robust Speech Recogni-
tion”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I417-I420,
May 1994
[44] N. Nilsson, Principles of Artificial Intelligence, Tioga Publishing Co., 1980.
[45] D. Paul, and J. Baker, “The Design of the Wall Street Journal-based CSR Corpus”, Proceedings of
ARPA Speech and Natural Language Workshop, pp. 357-362, Feb., 1992.
[46] J. Picone, G. Doddington, and D. Pallett, “Phone-mediated Word Alignment for Speech Recogni-
tion Evaluation”, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-38,
pp. 559-562, March 1990.
[47] L. Rabiner, and B. Juang, “An Introduction to Hidden Markov Models”, IEEE ASSP Magazine
3(1):4-16, Jan. 1986.
[48] B. Raj, Personal Communication.
[49] R. Schwartz, and Y. Chow, “The Optimal N-Best Algorithm: An Efficient Procedure for Finding
Multiple Sentence Hypotheses”, IEEE International Conference on Acoustics, Speech, and Signal
Processing, April 1990.
[50] R. Schwartz, Y. Chow, S. Roucos, M. Krasner, J. Makhoul, “Improved Hidden Markov Modeling
of Phonemes for Continuous Speech Recognition”, IEEE International Conference on Acoustics,
Speech, and Signal Processing, 1984
References 130

[51] S. Seneff, “A Joint Synchrony/Mean-Rate Model of Auditory Speech Processing”, Journal of Pho-
netics, Vol. 16, pp. 55-76, January 1988.
[52] R. M. Stern, A. Acero, F. H. Liu, Y. Ohshima, “Signal Processing for Robust Speech Recognition”,
in Automatic Speech and Speaker Recognition, edited by Lee, Soong and Paliwal, Kluwer Academ-
ic Publishers, 1996.
[53] T. Sullivan, and R. Stern, “Multi-Microphone Correlation-Based Processing For Robust Speech
Recognition”, IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 91-
94, April, 1993.
[54] D. Tapias-Merino. Personal Communication.
[55] A. Viterbi, “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Al-
gorithm”, IEEE Transactions on Information Theory, vol. IT-13, pp. 260-269, 1967.
[56] P. C. Woodland, M. J. F. Gales, D. Pye & V. Valtchev, “The HTK Large Vocabulary Recognition
System for the 1995 ARPA H3 Task”. Proceeding of the 1996 ARPA Speech Recognition Workshop,
Feb. 1996.
[57] S. J. Young & P. C. Woodland, HTK Version 1.5: User, Reference and Programmer Manual, Cam-
bridge University Engineering Dept., Speech Group, 1993.

Wiki Book 4th Industrial Revolution
No ratings yet
Wiki Book 4th Industrial Revolution
184 pages
Uncertainty Reduction Theory
100% (1)
Uncertainty Reduction Theory
6 pages
Budget of Work ILLUSTRATION 9
100% (1)
Budget of Work ILLUSTRATION 9
10 pages
Xiangl PHD
No ratings yet
Xiangl PHD
139 pages
American - Sign - Language - Progress Final
No ratings yet
American - Sign - Language - Progress Final
44 pages
ENIT TIC Report Template
No ratings yet
ENIT TIC Report Template
22 pages
American Sign Language Progress (1) Final
No ratings yet
American Sign Language Progress (1) Final
44 pages
Machine Learning For Environmental Noise Classification in Smart Cities (Ali Othman Albaji) (Z-Library)
100% (2)
Machine Learning For Environmental Noise Classification in Smart Cities (Ali Othman Albaji) (Z-Library)
179 pages
Deep Learning
No ratings yet
Deep Learning
100 pages
The Perception Processor: by Binu K. Mathew
No ratings yet
The Perception Processor: by Binu K. Mathew
167 pages
ENIT TIC Report Template
No ratings yet
ENIT TIC Report Template
35 pages
Wang Asu 0010N 21448
No ratings yet
Wang Asu 0010N 21448
81 pages
Bowon Yang Deposit-19
No ratings yet
Bowon Yang Deposit-19
86 pages
Big Data Analysis of Synchrophasor Data Outcomes of Research Activities Supported by DOE FOA 1861 (PNNL, 2022)
No ratings yet
Big Data Analysis of Synchrophasor Data Outcomes of Research Activities Supported by DOE FOA 1861 (PNNL, 2022)
39 pages
Atheoretical Survey On Mahalanobis-Taguchi System PDF
No ratings yet
Atheoretical Survey On Mahalanobis-Taguchi System PDF
11 pages
Artur Proposta Dissertacao
No ratings yet
Artur Proposta Dissertacao
50 pages
Identification of Unknown Landscape Types Using CNN Transfer Lear
No ratings yet
Identification of Unknown Landscape Types Using CNN Transfer Lear
105 pages
Communications Protocol
No ratings yet
Communications Protocol
462 pages
Introduction To Machine Learning - Wikipedia
No ratings yet
Introduction To Machine Learning - Wikipedia
456 pages
BT4431 Report of Project Ete 7TH Sem Plag Report Attachted
No ratings yet
BT4431 Report of Project Ete 7TH Sem Plag Report Attachted
69 pages
Thesis 2022-Bayesian Convolutional Neural Network With Prediction Smoothing A
No ratings yet
Thesis 2022-Bayesian Convolutional Neural Network With Prediction Smoothing A
65 pages
Computer Science 2
No ratings yet
Computer Science 2
66 pages
Napolitano Amri 200912 PHD
No ratings yet
Napolitano Amri 200912 PHD
235 pages
Advanced Strategies For Improving The Robustness of Deep Learning
No ratings yet
Advanced Strategies For Improving The Robustness of Deep Learning
152 pages
Roadway Surface Profiling Using An Onboard Data Logger
No ratings yet
Roadway Surface Profiling Using An Onboard Data Logger
67 pages
Cristian-Stefan Tutuianu PDF
No ratings yet
Cristian-Stefan Tutuianu PDF
40 pages
Cross Modal Survey
No ratings yet
Cross Modal Survey
39 pages
Project Proposal 260 Copy
No ratings yet
Project Proposal 260 Copy
38 pages
Information Security
No ratings yet
Information Security
337 pages
Assignment 2 551628 1391838575
No ratings yet
Assignment 2 551628 1391838575
13 pages
Script
No ratings yet
Script
460 pages
Advances in Data Analysis Theory and Applications To Reliability and Inference, Data Mining, Bioinformatics, Lifetime Data, and Neural Networks PDF
100% (17)
Advances in Data Analysis Theory and Applications To Reliability and Inference, Data Mining, Bioinformatics, Lifetime Data, and Neural Networks PDF
17 pages
Brain-Machine Interface - Amir Zjajo
100% (1)
Brain-Machine Interface - Amir Zjajo
176 pages
Assessing Threat of Terrorism
No ratings yet
Assessing Threat of Terrorism
169 pages
A Survey On The Application of Recurrent Neural Ne
No ratings yet
A Survey On The Application of Recurrent Neural Ne
39 pages
Tong 2020
No ratings yet
Tong 2020
14 pages
On The Applicability of Deep Learning To Construct Process Models From Natural Text 16 05
No ratings yet
On The Applicability of Deep Learning To Construct Process Models From Natural Text 16 05
66 pages
A Survey On Large Language Models With Some Insights
No ratings yet
A Survey On Large Language Models With Some Insights
174 pages
Neural Transfer Learning For NLP
No ratings yet
Neural Transfer Learning For NLP
329 pages
1177 Modular Deep Learning
No ratings yet
1177 Modular Deep Learning
76 pages
GonzalezAdrian Thesis2017
No ratings yet
GonzalezAdrian Thesis2017
117 pages
Report Latex Code A08 1
No ratings yet
Report Latex Code A08 1
57 pages
Seed1.5-VL Technical Report
No ratings yet
Seed1.5-VL Technical Report
77 pages
Zhu 2017
No ratings yet
Zhu 2017
15 pages
1 s2.0 S1367578819300094 Main
No ratings yet
1 s2.0 S1367578819300094 Main
23 pages
History and Evolution of The Johnson Criteria: Sandia Report
No ratings yet
History and Evolution of The Johnson Criteria: Sandia Report
40 pages
A Comparative Analysis of Machine Learning Techniques On Fake News Detection 1
No ratings yet
A Comparative Analysis of Machine Learning Techniques On Fake News Detection 1
42 pages
American Sign Language Progress
No ratings yet
American Sign Language Progress
50 pages
Munier - Methodology To Select A Set of Urban Sustainability Indicators To Measure The State of The City, and Perf (8258)
No ratings yet
Munier - Methodology To Select A Set of Urban Sustainability Indicators To Measure The State of The City, and Perf (8258)
7 pages
qt2hv873np Nosplash
No ratings yet
qt2hv873np Nosplash
70 pages
Speech Recognition Using Artificial Neural Networks
No ratings yet
Speech Recognition Using Artificial Neural Networks
50 pages
Recent Trends On Hybrid Modeling For Indust - 2021 - Computers - Chemical Engine
No ratings yet
Recent Trends On Hybrid Modeling For Indust - 2021 - Computers - Chemical Engine
21 pages
Speech and Language Processing
100% (1)
Speech and Language Processing
623 pages
Nordby - Environmental Sound Classification On Microcontrollers Using Convolutional Neural Networks
No ratings yet
Nordby - Environmental Sound Classification On Microcontrollers Using Convolutional Neural Networks
70 pages
Speech and Language Processing: Third Edition Draft
No ratings yet
Speech and Language Processing: Third Edition Draft
287 pages
Mestrado-Engenharia Informatica-Eduardo Farofia Medeiros
No ratings yet
Mestrado-Engenharia Informatica-Eduardo Farofia Medeiros
103 pages
Modern ML
No ratings yet
Modern ML
146 pages
Unit 2 Ref Book
No ratings yet
Unit 2 Ref Book
136 pages
Pattern Recognition 2nd Ed. (2009)
No ratings yet
Pattern Recognition 2nd Ed. (2009)
113 pages
自动控制书籍
No ratings yet
自动控制书籍
13 pages
Artificial Intelligence in Education Technologies: New Development and Innovative Practices
No ratings yet
Artificial Intelligence in Education Technologies: New Development and Innovative Practices
224 pages
Performanceanalysisof ASRModelfor Santhalilanguageon Kaldiand Matlab Toolkit
No ratings yet
Performanceanalysisof ASRModelfor Santhalilanguageon Kaldiand Matlab Toolkit
5 pages
Artificial Intelligence in Education An Argument of Chat-GPT Use in Education
No ratings yet
Artificial Intelligence in Education An Argument of Chat-GPT Use in Education
6 pages
Merajul Hasan 1
No ratings yet
Merajul Hasan 1
11 pages
Applicationnof AIin Educational MGT
No ratings yet
Applicationnof AIin Educational MGT
9 pages
Noise Effect On Amazigh Digits in Speech
No ratings yet
Noise Effect On Amazigh Digits in Speech
8 pages
Cosentino 2020
No ratings yet
Cosentino 2020
5 pages
1 s2.0 S0040162523007618 Main
No ratings yet
1 s2.0 S0040162523007618 Main
18 pages
Comparative Analysis of Automatic Speech Recognition Techniques
No ratings yet
Comparative Analysis of Automatic Speech Recognition Techniques
8 pages
Spasov Ski 2015
No ratings yet
Spasov Ski 2015
8 pages
DEMAND: A Collection of Multi-Channel Recordings of Acoustic Noise in Diverse Environments
No ratings yet
DEMAND: A Collection of Multi-Channel Recordings of Acoustic Noise in Diverse Environments
6 pages
Advancing RNN Transducer Technology For Speech Recognition
No ratings yet
Advancing RNN Transducer Technology For Speech Recognition
5 pages
Noise Effect On Arabic Alphadigits in Au
No ratings yet
Noise Effect On Arabic Alphadigits in Au
4 pages
1 s2.0 S0885230819302992 Main
No ratings yet
1 s2.0 S0885230819302992 Main
16 pages
1 s2.0 S0167639322000292 Main
No ratings yet
1 s2.0 S0167639322000292 Main
16 pages
1 s2.0 S1877050916300588 Main
No ratings yet
1 s2.0 S1877050916300588 Main
8 pages
Comparing Open-Source Speech Recognition Toolkits
No ratings yet
Comparing Open-Source Speech Recognition Toolkits
12 pages
Arabic Language Learning Assistance Base
No ratings yet
Arabic Language Learning Assistance Base
7 pages
pxc3872774 PDF
No ratings yet
pxc3872774 PDF
7 pages
Moroccan Dialect Speech Recognition System Based On Cmu Sphinxtools
No ratings yet
Moroccan Dialect Speech Recognition System Based On Cmu Sphinxtools
5 pages
Automatic Isolated Digit Recognition System: An Approach Using HMM
No ratings yet
Automatic Isolated Digit Recognition System: An Approach Using HMM
3 pages
Chapter I. Introduction 1-13
No ratings yet
Chapter I. Introduction 1-13
5 pages
A Complete Kaldi Recipe For Building Arabic Speech Recognition Systems
No ratings yet
A Complete Kaldi Recipe For Building Arabic Speech Recognition Systems
5 pages
Arabic Speech Recognition Systems
No ratings yet
Arabic Speech Recognition Systems
8 pages
Isolated Digit Recognizer Using Gaussian Mixture Models
No ratings yet
Isolated Digit Recognizer Using Gaussian Mixture Models
44 pages
Speaker and Language Recognition by GMM
No ratings yet
Speaker and Language Recognition by GMM
5 pages
Rank Mismatch in Argument (Fortran 90)
No ratings yet
Rank Mismatch in Argument (Fortran 90)
3 pages
Examination Solution Form Subject Code: Matb 253 SEMESTER: 2, 2013/2014 EXAMINER (S) : Fadhilah Abd Razak, Zarina Abd. Rahman No. Solution Marks/Comments
No ratings yet
Examination Solution Form Subject Code: Matb 253 SEMESTER: 2, 2013/2014 EXAMINER (S) : Fadhilah Abd Razak, Zarina Abd. Rahman No. Solution Marks/Comments
10 pages
Second Semester Syllabus: 19chy102 Engineering Chemistry-B (2 1 0 3)
No ratings yet
Second Semester Syllabus: 19chy102 Engineering Chemistry-B (2 1 0 3)
12 pages
PE ZC213 / TA ZC233 Engineering Measurements L-3: BITS Pilani
No ratings yet
PE ZC213 / TA ZC233 Engineering Measurements L-3: BITS Pilani
17 pages
Lines, Angles, and Triangles SLM
No ratings yet
Lines, Angles, and Triangles SLM
11 pages
Math 9 Q1 M2 W2 Revised Final
No ratings yet
Math 9 Q1 M2 W2 Revised Final
16 pages
Mech LND 2019r2 en Le07
No ratings yet
Mech LND 2019r2 en Le07
67 pages
s5 Frictionontheinclinedplane
No ratings yet
s5 Frictionontheinclinedplane
15 pages
Representation Theory in Chiral Conformal Field Theory From Fields To Observables
No ratings yet
Representation Theory in Chiral Conformal Field Theory From Fields To Observables
82 pages
Maths Notes
No ratings yet
Maths Notes
27 pages
Principal Stresses & Strain
No ratings yet
Principal Stresses & Strain
39 pages
PR2 - SLHT 1 - January 4 To 8
No ratings yet
PR2 - SLHT 1 - January 4 To 8
5 pages
Eee 2204 Engineering Mathematics IV
No ratings yet
Eee 2204 Engineering Mathematics IV
3 pages
Number Representation
No ratings yet
Number Representation
59 pages
Prediction of Ground Vibration
No ratings yet
Prediction of Ground Vibration
10 pages
Lesson Plan - Math8 - Q3 - Day 1 Correspondence
No ratings yet
Lesson Plan - Math8 - Q3 - Day 1 Correspondence
7 pages
Clasification of 5 Mark Questions - New Book: Dr. K. Thirumurugan, GHSS
50% (2)
Clasification of 5 Mark Questions - New Book: Dr. K. Thirumurugan, GHSS
2 pages
Gesture Recognition For Home Automation
No ratings yet
Gesture Recognition For Home Automation
13 pages
3D Kinematics: Presented By: Amir Patel PHD (Mechatronics) Cape Town
No ratings yet
3D Kinematics: Presented By: Amir Patel PHD (Mechatronics) Cape Town
32 pages
11 Algebra
No ratings yet
11 Algebra
358 pages
Sixth Grade End of Year-Reflection Coloring Sheet
No ratings yet
Sixth Grade End of Year-Reflection Coloring Sheet
2 pages
الإحصاء الهندسي
No ratings yet
الإحصاء الهندسي
64 pages
PMOS, NMOS and CMOS Transmission Gate Characteristics.
No ratings yet
PMOS, NMOS and CMOS Transmission Gate Characteristics.
13 pages
Valuation and Capital Budgeting For Levered Firm (Chaper 18)
No ratings yet
Valuation and Capital Budgeting For Levered Firm (Chaper 18)
26 pages
Ascet Manual1
No ratings yet
Ascet Manual1
14 pages
Sliding Mode Control For Electro-Hydraulic Servo System: IJCCCE Vol.15, No.3, 2015
No ratings yet
Sliding Mode Control For Electro-Hydraulic Servo System: IJCCCE Vol.15, No.3, 2015
10 pages
Tutorial 2
No ratings yet
Tutorial 2
5 pages
Miller Indices
No ratings yet
Miller Indices
18 pages