Signals and Communication Technology
Automatic Speech Recognition
A Deep Learning Approach

Dong Yu, Microsoft Research, Bothell, WA, USA
Li Deng, Microsoft Research, Redmond, WA, USA
Foreword
This is the first book on automatic speech recognition (ASR) that is focused on the
deep learning approach, and in particular, deep neural network (DNN) technology.
This landmark book represents a major milestone in the journey of DNN technology, which has achieved overwhelming success in ASR over the past few years. Following the authors' recent book, Deep Learning: Methods and Applications, this new book digs deeply and exclusively into ASR technology and applications, which were covered only relatively lightly in the previous book alongside numerous other applications of deep learning. Importantly, this book provides the background material of ASR and the technical details of DNNs, including rigorous mathematical descriptions and software implementation, invaluable for ASR experts as well as advanced students.
One unique aspect of this book is that it broadens the view of deep learning beyond DNNs, as commonly adopted in ASR by now, to encompass deep generative models, which have the advantage of naturally embedding domain knowledge and problem constraints. The background material does justice to the incredible richness of the deep and dynamic generative models of speech developed by ASR researchers since the early 1990s, without losing sight of the principles that unify them with the recent, rapid development of deep discriminative models such as DNNs. Comprehensive comparisons of the relative strengths of these two very different types of deep models, using the example of recurrent neural networks versus hidden dynamic models, are particularly insightful, opening an exciting and promising direction for the development of deep learning in ASR as well as in other signal and information processing applications. From a historical perspective, four generations of ASR technology have recently been analyzed. The fourth-generation technology is embodied in the deep learning elaborated in this book, especially when DNNs are seamlessly integrated with deep generative models that would enable extended knowledge processing in a natural fashion.
All in all, this beautifully produced book is likely to become a definitive reference for ASR practitioners in the deep learning era of fourth-generation ASR. The book masterfully covers the basic concepts required to understand the ASR field as a whole, and it also details in depth the powerful deep learning methods that have transformed the field over the past two years. Readers of this book will become conversant with the new state of the art in ASR established by DNN technology, and will be poised to build new ASR systems that match or exceed human performance.
By Sadaoki Furui, President of Toyota Technological Institute at Chicago, and
Professor at the Tokyo Institute of Technology.
Preface
Along with the development of the field over the past two decades, a number of useful books on ASR and on machine learning related to ASR have appeared, some of which are listed here:
• Deep Learning: Methods and Applications, by Li Deng and Dong Yu (June
2014)
• Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods,
by Joseph Keshet, Samy Bengio (January 2009)
• Speech Recognition Over Digital Channels: Robustness and Standards, by
Antonio Peinado and Jose Segura (September 2006)
• Pattern Recognition in Speech and Language Processing, by Wu Chou and
Biing-Hwang Juang (February 2003)
• Speech Processing—A Dynamic and Optimization-Oriented Approach, by Li
Deng and Doug O’Shaughnessy (June 2003)
• Spoken Language Processing: A Guide to Theory, Algorithm and System
Development, by Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon (April
2001)
• Digital Speech Processing, Synthesis, and Recognition, Second Edition, by Sadaoki Furui (June 2001)
• Speech Communications: Human and Machine, Second Edition, by Douglas
O’Shaughnessy (June 2000)
• Speech and Language Processing—An Introduction to Natural Language Pro-
cessing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky
and James Martin (April 2000)
• Speech and Audio Signal Processing, by Ben Gold and Nelson Morgan (April
2000)
• Statistical Methods for Speech Recognition, by Fred Jelinek (June 1997)
• Fundamentals of Speech Recognition, by Lawrence Rabiner and Biing-Hwang
Juang (April 1993)
• Acoustical and Environmental Robustness in Automatic Speech Recognition, by
Alex Acero (November 1992).
All these books, however, were either published before the rise of deep learning for ASR in 2009 or, like our 2014 overview book, covered deep learning for ASR in less technical depth than is desired. These earlier books did not include the deep learning techniques developed after 2010 in the technical and mathematical detail demanded by ASR or deep learning specialists. Unlike the books above, and in addition to some necessary background material, the current book is mainly a collation of the most recent advances in deep learning and in discriminative and hierarchical models as applied specifically to the field of ASR. It presents the insights and theoretical foundations of a series of deep learning models such as the deep neural network (DNN),
restricted Boltzmann machine (RBM), denoising autoencoder, deep belief network,
recurrent neural network (RNN) and long short-term memory (LSTM) RNN, and
their application in ASR through a variety of techniques including the DNN-HMM
hybrid system, the tandem and bottleneck systems, multi-task and transfer learning,
sequence-discriminative training, and DNN adaptation. The book further discusses
practical considerations, tricks, setups, and speedups on applying the deep learning
models and related techniques in building real-world real-time ASR systems. To set
the background, our book also includes two chapters that introduce GMMs and
HMMs with their variants. However, we omit details of the GMM–HMM tech-
niques that do not directly relate to the theme of the book—the hierarchical
modeling or deep learning approach. Our book is thus complementary to, rather than a replacement for, the published books listed above on many similar topics. We believe this book will be of interest to advanced graduate students, researchers, practitioners, engineers, and scientists in the speech processing and machine learning fields. We hope our book not only serves as a reference for many of the techniques used in the field but also ignites new ideas to advance the field further.
During the preparation of the book, we have received encouragement and help
from Alex Acero, Geoffrey Zweig, Qiang Huo, Frank Seide, Jasha Droppo, Mike
Seltzer, and Chin-Hui Lee. We also thank Springer editors, Agata Oelschlaeger and
Kiruthika Poomalai, for their kind and timely help in polishing up the book and for
handling its publication.
Contents

1 Introduction . . . 1
   1.1 Automatic Speech Recognition: A Bridge for Better Communication . . . 1
       1.1.1 Human–Human Communication . . . 2
       1.1.2 Human–Machine Communication . . . 2
   1.2 Basic Architecture of ASR Systems . . . 4
   1.3 Book Organization . . . 5
       1.3.1 Part I: Conventional Acoustic Models . . . 6
       1.3.2 Part II: Deep Neural Networks . . . 6
       1.3.3 Part III: DNN-HMM Hybrid Systems for ASR . . . 7
       1.3.4 Part IV: Representation Learning in Deep Neural Networks . . . 7
       1.3.5 Part V: Advanced Deep Models . . . 7
   References . . . 8

Index . . . 317
Acronyms

Symbols