
Signals and Communication Technology

More information about this series at http://www.springer.com/series/4748


Dong Yu · Li Deng

Automatic Speech Recognition
A Deep Learning Approach

Dong Yu
Microsoft Research
Bothell, WA, USA

Li Deng
Microsoft Research
Redmond, WA, USA

ISSN 1860-4862 ISSN 1860-4870 (electronic)


ISBN 978-1-4471-5778-6 ISBN 978-1-4471-5779-3 (eBook)
DOI 10.1007/978-1-4471-5779-3

Library of Congress Control Number: 2014951663

Springer London Heidelberg New York Dordrecht

© Springer-Verlag London 2015


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of
the Copyright Law of the Publisher’s location, in its current version, and permission for use must always
be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright
Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To my wife and parents
Dong Yu

To Lih-Yuan, Lloyd, Craig, Lyle, Arie, and Axel
Li Deng
Foreword

This is the first book on automatic speech recognition (ASR) that focuses on the deep learning approach, and in particular on deep neural network (DNN) technology. This landmark book represents a major milestone in the journey of DNN technology, which has achieved overwhelming success in ASR over the past few years. Following the authors' recent book "Deep Learning: Methods and Applications", this new book digs deeply and exclusively into ASR technology and applications, which were covered only relatively lightly in the previous book alongside numerous other applications of deep learning. Importantly, this book provides the background material on ASR and the technical details of DNNs, including rigorous mathematical descriptions and software implementation, making it invaluable for ASR experts as well as advanced students.
One unique aspect of this book is that it broadens the view of deep learning beyond DNNs, as commonly adopted in ASR by now, to also encompass deep generative models, which have the advantage of naturally embedding domain knowledge and problem constraints. The background material does justice to the incredible richness of deep and dynamic generative models of speech developed by ASR researchers since the early 1990s, without losing sight of the unifying principles that connect them to the recent rapid development of deep discriminative models such as DNNs. Comprehensive comparisons of the relative strengths of these two very different types of deep models, using the example of recurrent neural networks versus hidden dynamic models, are particularly insightful, opening an exciting and promising direction for new developments of deep learning in ASR as well as in other signal and information processing applications. From a historical perspective, four generations of ASR technology have recently been analyzed. The fourth-generation technology is embodied in the deep learning approach elaborated in this book, especially when DNNs are seamlessly integrated with deep generative models that would enable extended knowledge processing in a most natural fashion.
All in all, this beautifully produced book is likely to become a definitive reference for ASR practitioners in the deep learning era of fourth-generation ASR. The book masterfully covers the basic concepts required to understand the ASR field as a whole, and it also details in depth the powerful deep learning methods that have transformed the field in the past two years. Readers of this book will become articulate in the new state of the art of ASR established by DNN technology, and will be poised to build new ASR systems to match or exceed human performance.

Sadaoki Furui, President of Toyota Technological Institute at Chicago, and Professor at the Tokyo Institute of Technology
Preface

Automatic Speech Recognition (ASR), which aims to enable natural human–machine interaction, has been an intensive research area for decades. Many core technologies, such as Gaussian mixture models (GMMs), hidden Markov models (HMMs), mel-frequency cepstral coefficients (MFCCs) and their derivatives, n-gram language models (LMs), discriminative training, and various adaptation techniques, were developed along the way, mostly prior to the new millennium. These techniques greatly advanced the state of the art in ASR and its related fields. Compared to these earlier achievements, the advancement in the research and application of ASR in the decade before 2010 was relatively slow and less exciting, although important techniques such as GMM–HMM sequence-discriminative training were made to work well in practical systems during this period.
In the past several years, however, we have observed a new surge of interest in ASR. In our opinion, this change has been driven by the increased demand for ASR on mobile devices and by the success of new speech applications in the mobile world, such as voice search (VS), short message dictation (SMD), and virtual speech assistants (e.g., Apple's Siri, Google Now, and Microsoft's Cortana). Equally important is the development of deep learning techniques for large vocabulary continuous speech recognition (LVCSR), powered by big data and significantly increased computing power. A combination of deep learning techniques has led to an error-rate reduction of more than one-third over the conventional state-of-the-art GMM–HMM framework on many real-world LVCSR tasks and has helped ASR pass the adoption threshold for many real-world users. For example, the word accuracy in English or the character accuracy in Chinese now exceeds 90 % in most SMD systems, and even 95 % on some systems.
Given the recent surge of interest in ASR in both industry and academia, we, as researchers who have actively participated in and closely witnessed many of the recent exciting developments in deep learning technology, believe the time is ripe to write a book that summarizes the advancements in the ASR field, especially those made during the past several years.

Along with the development of the field over the past two decades or so, we
have seen a number of useful books on ASR and on machine learning related to
ASR, some of which are listed here:
• Deep Learning: Methods and Applications, by Li Deng and Dong Yu (June
2014)
• Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods,
by Joseph Keshet, Samy Bengio (January 2009)
• Speech Recognition Over Digital Channels: Robustness and Standards, by
Antonio Peinado and Jose Segura (September 2006)
• Pattern Recognition in Speech and Language Processing, by Wu Chou and
Biing-Hwang Juang (February 2003)
• Speech Processing—A Dynamic and Optimization-Oriented Approach, by Li
Deng and Doug O’Shaughnessy (June 2003)
• Spoken Language Processing: A Guide to Theory, Algorithm and System
Development, by Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon (April
2001)
• Digital Speech Processing: Synthesis, and Recognition, Second Edition, by
Sadaoki Furui (June 2001)
• Speech Communications: Human and Machine, Second Edition, by Douglas
O’Shaughnessy (June 2000)
• Speech and Language Processing—An Introduction to Natural Language Pro-
cessing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky
and James Martin (April 2000)
• Speech and Audio Signal Processing, by Ben Gold and Nelson Morgan (April
2000)
• Statistical Methods for Speech Recognition, by Fred Jelinek (June 1997)
• Fundamentals of Speech Recognition, by Lawrence Rabiner and Biing-Hwang
Juang (April 1993)
• Acoustical and Environmental Robustness in Automatic Speech Recognition, by
Alex Acero (November 1992).
All these books, however, were either published before the rise of deep learning for ASR in 2009 or, like our 2014 overview book, focused less on the technical aspects of deep learning for ASR than is desired. These earlier books did not cover the new deep learning techniques developed after 2010 with the technical and mathematical detail demanded by ASR or deep learning specialists. Different from the above books, and in addition to some necessary background material, the current book is mainly a collation of research on the most recent advances in deep learning, or discriminative and hierarchical models, as applied specifically to the field of ASR. It presents the insights and theoretical foundations of a series of deep learning models, such as the deep neural network (DNN), restricted Boltzmann machine (RBM), denoising autoencoder, deep belief network, recurrent neural network (RNN), and long short-term memory (LSTM) RNN, and their application to ASR through a variety of techniques, including the DNN-HMM hybrid system, the tandem and bottleneck systems, multitask and transfer learning, sequence-discriminative training, and DNN adaptation. The book further discusses practical considerations, tricks, setups, and speedups for applying the deep learning models and related techniques to building real-world, real-time ASR systems. To set the background, the book also includes two chapters that introduce GMMs and HMMs with their variants; however, we omit details of the GMM–HMM techniques that do not directly relate to the theme of the book, namely the hierarchical modeling or deep learning approach. Our book is thus complementary to, rather than a replacement for, the published books listed above on many similar topics. We believe this book will be of interest to advanced graduate students, researchers, practitioners, engineers, and scientists in the speech processing and machine learning fields. We hope it not only serves as a reference for many of the techniques used in the field but also ignites new ideas to further advance the field.
During the preparation of the book, we received encouragement and help from Alex Acero, Geoffrey Zweig, Qiang Huo, Frank Seide, Jasha Droppo, Mike Seltzer, and Chin-Hui Lee. We also thank the Springer editors, Agata Oelschlaeger and Kiruthika Poomalai, for their kind and timely help in polishing the book and for handling its publication.

Seattle, USA, July 2014
Dong Yu
Li Deng
Contents

1 Introduction ... 1
  1.1 Automatic Speech Recognition: A Bridge for Better Communication ... 1
    1.1.1 Human–Human Communication ... 2
    1.1.2 Human–Machine Communication ... 2
  1.2 Basic Architecture of ASR Systems ... 4
  1.3 Book Organization ... 5
    1.3.1 Part I: Conventional Acoustic Models ... 6
    1.3.2 Part II: Deep Neural Networks ... 6
    1.3.3 Part III: DNN-HMM Hybrid Systems for ASR ... 7
    1.3.4 Part IV: Representation Learning in Deep Neural Networks ... 7
    1.3.5 Part V: Advanced Deep Models ... 7
  References ... 8

Part I Conventional Acoustic Models

2 Gaussian Mixture Models ... 13
  2.1 Random Variables ... 13
  2.2 Gaussian and Gaussian-Mixture Random Variables ... 14
  2.3 Parameter Estimation ... 17
  2.4 Mixture of Gaussians as a Model for the Distribution of Speech Features ... 18
  References ... 20

3 Hidden Markov Models and the Variants ... 23
  3.1 Introduction ... 23
  3.2 Markov Chains ... 25
  3.3 Hidden Markov Sequences and Models ... 26
    3.3.1 Characterization of a Hidden Markov Model ... 27
    3.3.2 Simulation of a Hidden Markov Model ... 29
    3.3.3 Likelihood Evaluation of a Hidden Markov Model ... 29
    3.3.4 An Algorithm for Efficient Likelihood Evaluation ... 30
    3.3.5 Proofs of the Forward and Backward Recursions ... 32
  3.4 EM Algorithm and Its Application to Learning HMM Parameters ... 33
    3.4.1 Introduction to EM Algorithm ... 33
    3.4.2 Applying EM to Learning the HMM—Baum-Welch Algorithm ... 35
  3.5 Viterbi Algorithm for Decoding HMM State Sequences ... 39
    3.5.1 Dynamic Programming and Viterbi Algorithm ... 39
    3.5.2 Dynamic Programming for Decoding HMM States ... 40
  3.6 The HMM and Variants for Generative Speech Modeling and Recognition ... 42
    3.6.1 GMM-HMMs for Speech Modeling and Recognition ... 43
    3.6.2 Trajectory and Hidden Dynamic Models for Speech Modeling and Recognition ... 44
    3.6.3 The Speech Recognition Problem Using Generative Models of HMM and Its Variants ... 46
  References ... 48

Part II Deep Neural Networks

4 Deep Neural Networks ... 57
  4.1 The Deep Neural Network Architecture ... 57
  4.2 Parameter Estimation with Error Backpropagation ... 59
    4.2.1 Training Criteria ... 60
    4.2.2 Training Algorithms ... 61
  4.3 Practical Considerations ... 65
    4.3.1 Data Preprocessing ... 65
    4.3.2 Model Initialization ... 67
    4.3.3 Weight Decay ... 68
    4.3.4 Dropout ... 69
    4.3.5 Batch Size Selection ... 70
    4.3.6 Sample Randomization ... 72
    4.3.7 Momentum ... 73
    4.3.8 Learning Rate and Stopping Criterion ... 73
    4.3.9 Network Architecture ... 75
    4.3.10 Reproducibility and Restartability ... 75
  References ... 76

5 Advanced Model Initialization Techniques ... 79
  5.1 Restricted Boltzmann Machines ... 79
    5.1.1 Properties of RBMs ... 81
    5.1.2 RBM Parameter Learning ... 83
  5.2 Deep Belief Network Pretraining ... 86
  5.3 Pretraining with Denoising Autoencoder ... 89
  5.4 Discriminative Pretraining ... 91
  5.5 Hybrid Pretraining ... 92
  5.6 Dropout Pretraining ... 93
  References ... 94

Part III Deep Neural Network-Hidden Markov Model Hybrid Systems for Automatic Speech Recognition

6 Deep Neural Network-Hidden Markov Model Hybrid Systems ... 99
  6.1 DNN-HMM Hybrid Systems ... 99
    6.1.1 Architecture ... 99
    6.1.2 Decoding with CD-DNN-HMM ... 101
    6.1.3 Training Procedure for CD-DNN-HMMs ... 102
    6.1.4 Effects of Contextual Window ... 104
  6.2 Key Components in the CD-DNN-HMM and Their Analysis ... 106
    6.2.1 Datasets and Baselines for Comparisons and Analysis ... 106
    6.2.2 Modeling Monophone States or Senones ... 108
    6.2.3 Deeper Is Better ... 109
    6.2.4 Exploit Neighboring Frames ... 111
    6.2.5 Pretraining ... 111
    6.2.6 Better Alignment Helps ... 112
    6.2.7 Tuning Transition Probability ... 113
  6.3 Kullback-Leibler Divergence-Based HMM ... 113
  References ... 114

7 Training and Decoding Speedup ... 117
  7.1 Training Speedup ... 117
    7.1.1 Pipelined Backpropagation Using Multiple GPUs ... 118
    7.1.2 Asynchronous SGD ... 121
    7.1.3 Augmented Lagrangian Methods and Alternating Directions Method of Multipliers ... 124
    7.1.4 Reduce Model Size ... 126
    7.1.5 Other Approaches ... 127
  7.2 Decoding Speedup ... 127
    7.2.1 Parallel Computation ... 128
    7.2.2 Sparse Network ... 130
    7.2.3 Low-Rank Approximation ... 132
    7.2.4 Teach Small DNN with Large DNN ... 133
    7.2.5 Multiframe DNN ... 134
  References ... 135

8 Deep Neural Network Sequence-Discriminative Training ... 137
  8.1 Sequence-Discriminative Training Criteria ... 137
    8.1.1 Maximum Mutual Information ... 137
    8.1.2 Boosted MMI ... 139
    8.1.3 MPE/sMBR ... 140
    8.1.4 A Uniformed Formulation ... 141
  8.2 Practical Considerations ... 142
    8.2.1 Lattice Generation ... 142
    8.2.2 Lattice Compensation ... 143
    8.2.3 Frame Smoothing ... 145
    8.2.4 Learning Rate Adjustment ... 146
    8.2.5 Training Criterion Selection ... 146
    8.2.6 Other Considerations ... 147
  8.3 Noise Contrastive Estimation ... 147
    8.3.1 Casting Probability Density Estimation Problem as a Classifier Design Problem ... 148
    8.3.2 Extension to Unnormalized Models ... 150
    8.3.3 Apply NCE in DNN Training ... 151
  References ... 153

Part IV Representation Learning in Deep Neural Networks

9 Feature Representation Learning in Deep Neural Networks ... 157
  9.1 Joint Learning of Feature Representation and Classifier ... 157
  9.2 Feature Hierarchy ... 159
  9.3 Flexibility in Using Arbitrary Input Features ... 162
  9.4 Robustness of Features ... 163
    9.4.1 Robust to Speaker Variations ... 163
    9.4.2 Robust to Environment Variations ... 165
  9.5 Robustness Across All Conditions ... 167
    9.5.1 Robustness Across Noise Levels ... 167
    9.5.2 Robustness Across Speaking Rates ... 169
  9.6 Lack of Generalization Over Large Distortions ... 170
  References ... 173

10 Fuse Deep Neural Network and Gaussian Mixture Model Systems ... 177
  10.1 Use DNN-Derived Features in GMM-HMM Systems ... 177
    10.1.1 GMM-HMM with Tandem and Bottleneck Features ... 177
    10.1.2 DNN-HMM Hybrid System Versus GMM-HMM System with DNN-Derived Features ... 180
  10.2 Fuse Recognition Results ... 182
    10.2.1 ROVER ... 183
    10.2.2 SCARF ... 184
    10.2.3 MBR Lattice Combination ... 185
  10.3 Fuse Frame-Level Acoustic Scores ... 186
  10.4 Multistream Speech Recognition ... 187
  References ... 189

11 Adaptation of Deep Neural Networks ... 193
  11.1 The Adaptation Problem for Deep Neural Networks ... 193
  11.2 Linear Transformations ... 194
    11.2.1 Linear Input Networks ... 195
    11.2.2 Linear Output Networks ... 196
  11.3 Linear Hidden Networks ... 198
  11.4 Conservative Training ... 199
    11.4.1 L2 Regularization ... 199
    11.4.2 KL-Divergence Regularization ... 200
    11.4.3 Reducing Per-Speaker Footprint ... 202
  11.5 Subspace Methods ... 204
    11.5.1 Subspace Construction Through Principal Component Analysis ... 204
    11.5.2 Noise-Aware, Speaker-Aware, and Device-Aware Training ... 205
    11.5.3 Tensor ... 209
  11.6 Effectiveness of DNN Speaker Adaptation ... 210
    11.6.1 KL-Divergence Regularization Approach ... 210
    11.6.2 Speaker-Aware Training ... 212
  References ... 213

Part V Advanced Deep Models

12 Representation Sharing and Transfer in Deep Neural Networks ... 219
  12.1 Multitask and Transfer Learning ... 219
    12.1.1 Multitask Learning ... 219
    12.1.2 Transfer Learning ... 220
  12.2 Multilingual and Crosslingual Speech Recognition ... 221
    12.2.1 Tandem/Bottleneck-Based Crosslingual Speech Recognition ... 222
    12.2.2 Shared-Hidden-Layer Multilingual DNN ... 223
    12.2.3 Crosslingual Model Transfer ... 226
  12.3 Multiobjective Training of Deep Neural Networks for Speech Recognition ... 230
    12.3.1 Robust Speech Recognition with Multitask Learning ... 230
    12.3.2 Improved Phone Recognition with Multitask Learning ... 230
    12.3.3 Recognizing both Phonemes and Graphemes ... 231
  12.4 Robust Speech Recognition Exploiting Audio-Visual Information ... 232
  References ... 233

13 Recurrent Neural Networks and Related Models ... 237
  13.1 Introduction ... 237
  13.2 State-Space Formulation of the Basic Recurrent Neural Network ... 239
  13.3 The Backpropagation-Through-Time Learning Algorithm ... 240
    13.3.1 Objective Function for Minimization ... 241
    13.3.2 Recursive Computation of Error Terms ... 241
    13.3.3 Update of RNN Weights ... 242
  13.4 A Primal-Dual Technique for Learning Recurrent Neural Networks ... 244
    13.4.1 Difficulties in Learning RNNs ... 244
    13.4.2 Echo-State Property and Its Sufficient Condition ... 245
    13.4.3 Learning RNNs as a Constrained Optimization Problem ... 245
    13.4.4 A Primal-Dual Method for Learning RNNs ... 246
  13.5 Recurrent Neural Networks Incorporating LSTM Cells ... 249
    13.5.1 Motivations and Applications ... 249
    13.5.2 The Architecture of LSTM Cells ... 250
    13.5.3 Training the LSTM-RNN ... 250
  13.6 Analyzing Recurrent Neural Networks—A Contrastive Approach ... 251
    13.6.1 Direction of Information Flow: Top-Down versus Bottom-Up ... 251
    13.6.2 The Nature of Representations: Localist or Distributed ... 254
    13.6.3 Interpretability: Inferring Latent Layers versus End-to-End Learning ... 255
    13.6.4 Parameterization: Parsimonious Conditionals versus Massive Weight Matrices ... 256
    13.6.5 Methods of Model Learning: Variational Inference versus Gradient Descent ... 258
    13.6.6 Recognition Accuracy Comparisons ... 258
  13.7 Discussions ... 259
  References ... 261

14 Computational Network ... 267
  14.1 Computational Network ... 267
  14.2 Forward Computation ... 269
  14.3 Model Training ... 271
  14.4 Typical Computation Nodes ... 275
    14.4.1 Computation Node Types with No Operand ... 276
    14.4.2 Computation Node Types with One Operand ... 276
    14.4.3 Computation Node Types with Two Operands ... 281
    14.4.4 Computation Node Types for Computing Statistics ... 287
  14.5 Convolutional Neural Network ... 288
  14.6 Recurrent Connections ... 291
    14.6.1 Sample by Sample Processing Only Within Loops ... 292
    14.6.2 Processing Multiple Utterances Simultaneously ... 293
    14.6.3 Building Arbitrary Recurrent Neural Networks ... 293
  References ... 297

15 Summary and Future Directions ... 299
  15.1 Road Map ... 299
    15.1.1 Debut of DNNs for ASR ... 299
    15.1.2 Speedup of DNN Training and Decoding ... 302
    15.1.3 Sequence Discriminative Training ... 302
    15.1.4 Feature Processing ... 303
    15.1.5 Adaptation ... 304
    15.1.6 Multitask and Transfer Learning ... 305
    15.1.7 Convolution Neural Networks ... 305
    15.1.8 Recurrent Neural Networks and LSTM ... 306
    15.1.9 Other Deep Models ... 306
  15.2 State of the Art and Future Directions ... 307
    15.2.1 State of the Art—A Brief Analysis ... 307
    15.2.2 Future Directions ... 308
  References ... 309

Index ... 317
Acronyms

ADMM Alternating directions method of multipliers
AE-BN Autoencoder bottleneck
ALM Augmented Lagrangian multiplier
AM Acoustic model
ANN Artificial neural network
ANN-HMM Artificial neural network-hidden Markov model
ASGD Asynchronous stochastic gradient descent
ASR Automatic speech recognition
BMMI Boosted maximum mutual information
BP Backpropagation
BPTT Backpropagation through time
CD Contrastive divergence
CD-DNN-HMM Context-dependent-deep neural network-hidden Markov model
CE Cross entropy
CHiME Computational hearing in multisource environments
CN Computational network
CNN Convolutional neural network
CNTK Computational network toolkit
CT Conservative training
DAG Directed acyclic graph
DaT Device-aware training
DBN Deep belief network
DNN Deep neural network
DNN-GMM-HMM Deep neural network-Gaussian mixture model–hidden Markov model
DNN-HMM Deep neural network-hidden Markov model
DP Dynamic programming
DPT Discriminative pretraining

EBW Extended Baum–Welch algorithm
EM Expectation-maximization
fDLR Feature-space discriminative linear regression
fMLLR Feature-space maximum likelihood linear regression
FSA Feature-space speaker adaptation
F-smoothing Frame-smoothing
GMM Gaussian mixture model
GPGPU General-purpose graphical processing units
HDM Hidden dynamic model
HMM Hidden Markov model
HTM Hidden trajectory model
IID Independent and identically distributed
KLD Kullback–Leibler divergence
KL-HMM Kullback–Leibler divergence-based HMM
LBP Layer-wise backpropagation
LHN Linear hidden network
LIN Linear input network
LM Language model
LON Linear output network
LSTM Long short-term memory (recurrent neural network)
LVCSR Large vocabulary continuous speech recognition
LVSR Large vocabulary speech recognition
MAP Maximum a posteriori
MBR Minimum Bayesian risk
MFCC Mel-frequency cepstral coefficient
MLP Multi-layer perceptron
MMI Maximum mutual information
MPE Minimum phone error
MSE Mean square error
MTL Multitask learning
NAT Noise adaptive training
NaT Noise-aware training
NCE Noise contrastive estimation
NLL Negative log-likelihood
oDLR Output-feature discriminative linear regression
PCA Principal component analysis
PLP Perceptual linear prediction
RBM Restricted Boltzmann machine
ReLU Rectified linear unit
RKL Reverse Kullback–Leibler divergence
RNN Recurrent neural network
ROVER Recognizer output voting error reduction
RTF Real-time factor
SaT Speaker-aware training
SCARF Segmental conditional random field
SGD Stochastic gradient descent
SHL-MDNN Shared-hidden-layer multilingual DNN
SIMD Single instruction multiple data
SKL Symmetric Kullback–Leibler divergence
sMBR State minimum Bayesian risk
SMD Short message dictation
SVD Singular value decomposition
SWB Switchboard
UBM Universal background model
VS Voice search
VTLN Vocal tract length normalization
VTS Vector Taylor series
WTN Word transition network
Symbols

General Mathematical Operators


x A vector
x_i The i-th element of x
|x| Absolute value of x
‖x‖ Norm of vector x
x^T Transpose of vector x
a^T b Inner product of vectors a and b
a b^T Outer product of vectors a and b
a ⊙ b Element-wise product of vectors a and b
a × b Cross product of vectors a and b
A A matrix
A_ij The element value at the i-th row and j-th column of matrix A
tr(A) Trace of matrix A
A ⊛ B Khatri-Rao product of matrices A and B
A ⊘ B Element-wise division of A and B
A • B Inner product of vectors applied on matrices A and B column-wise
A ◦ B Inner product of vectors applied on matrices A and B row-wise
A^-1 Inverse of matrix A
A^† Pseudoinverse of matrix A
A^α Element-wise power of matrix A
vec(A) The vector formed by concatenating columns of A
I_n n × n identity matrix
1_{m,n} m × n matrix with all 1's
E Statistical expectation operator
V Statistical covariance operator
⟨x⟩ Average of vector x
∗ Convolution operator
H Hessian matrix
J Jacobian matrix
p(x) Probability density function of random vector x
P(x) Probability of x
∇ Gradient operator

More Specific Mathematical Symbols


w* Optimal w
ŵ Estimated value of w
R Correlation matrix
Z Partition function
v Visible units in a network
h Hidden units in a network
o Observation (feature) vector
y Output prediction vector
ε Learning rate
θ Threshold
λ Regularization parameter
N(x; μ, Σ) Random variable x follows a Gaussian distribution with mean vector μ and covariance matrix Σ
μ_i i-th component of the mean vector μ
σ²_i The i-th variance component
c_m Gaussian mixture component weight for the m-th Gaussian
a_{ij} HMM transition probability from state i to state j
b_i(o) HMM emission probability for observation o at state i
Λ The entire model parameter set
q HMM state sequence
π HMM initial state probability
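
As a quick illustration of how several of these symbols are combined later in the book, the following LaTeX sketch writes out the standard GMM state-emission density and the GMM–HMM joint likelihood of an observation sequence in this notation. The per-state subscripting (e.g., c_{i,m}) and the exact typesetting are illustrative assumptions, not a verbatim excerpt from the book.

% Illustrative sketch (not a verbatim excerpt): the GMM emission density
% b_i(o) of HMM state i, built from mixture weights c_{i,m} and Gaussians
% N(o; mu_{i,m}, Sigma_{i,m}), and the joint likelihood of an observation
% sequence o_1,...,o_T with a state sequence q = (q_1,...,q_T) under model
% parameters Lambda, using transition probabilities a_{ij} and initial
% state probabilities pi.
\begin{align}
  b_i(\mathbf{o}) &= \sum_{m=1}^{M} c_{i,m}\,
      \mathcal{N}\!\left(\mathbf{o};\, \boldsymbol{\mu}_{i,m},\, \boldsymbol{\Sigma}_{i,m}\right) \\
  p(\mathbf{o}_1,\ldots,\mathbf{o}_T, q \mid \Lambda) &= \pi_{q_1}\, b_{q_1}(\mathbf{o}_1)
      \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(\mathbf{o}_t)
\end{align}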
