
Signals and Communication Technology

More information about this series at http://www.springer.com/series/4748


Dong Yu · Li Deng

Automatic Speech Recognition
A Deep Learning Approach

Dong Yu
Microsoft Research
Bothell, WA, USA

Li Deng
Microsoft Research
Redmond, WA, USA

ISSN 1860-4862 ISSN 1860-4870 (electronic)


ISBN 978-1-4471-5778-6 ISBN 978-1-4471-5779-3 (eBook)
DOI 10.1007/978-1-4471-5779-3

Library of Congress Control Number: 2014951663

Springer London Heidelberg New York Dordrecht

© Springer-Verlag London 2015


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed. Exempted from this legal reservation are brief
excerpts in connection with reviews or scholarly analysis or material supplied specifically for the
purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the
work. Duplication of this publication or parts thereof is permitted only under the provisions of
the Copyright Law of the Publisher’s location, in its current version, and permission for use must always
be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright
Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To my wife and parents
Dong Yu

To Lih-Yuan, Lloyd, Craig, Lyle, Arie, and Axel
Li Deng
Foreword

This is the first book on automatic speech recognition (ASR) that focuses on the deep learning approach, and in particular on deep neural network (DNN) technology. This landmark book represents a major milestone in the journey of DNN technology, which has achieved overwhelming success in ASR over the past few years. Following the authors' recent book "Deep Learning: Methods and Applications", this new book digs deeply and exclusively into ASR technology and applications, which were covered only relatively lightly in the previous book alongside numerous other applications of deep learning. Importantly, this book provides the background material on ASR and the technical details of DNNs, including rigorous mathematical descriptions and software implementation, making it invaluable for ASR experts as well as advanced students.
One unique aspect of this book is that it broadens the view of deep learning beyond DNNs, as commonly adopted in ASR by now, to also encompass deep generative models, which have the advantage of naturally embedding domain knowledge and problem constraints. The background material does justice to the incredible richness of deep and dynamic generative models of speech developed by ASR researchers since the early 1990s, without losing sight of the unifying principles that connect them to the recent rapid development of deep discriminative models such as DNNs. Comprehensive comparisons of the relative strengths of these two very different types of deep models, using the example of recurrent neural networks versus hidden dynamic models, are particularly insightful, opening an exciting and promising direction for new developments of deep learning in ASR as well as in other signal and information processing applications. From a historical perspective, four generations of ASR technology have recently been analyzed. The fourth-generation technology is embodied in the deep learning approach elaborated in this book, especially when DNNs are seamlessly integrated with deep generative models that would enable extended knowledge processing in a most natural fashion.
All in all, this beautifully produced book is likely to become a definitive reference for ASR practitioners in the deep learning era of fourth-generation ASR. The book masterfully covers the basic concepts required to understand the ASR field as a whole, and it also details in depth the powerful deep learning methods that have transformed the field in the past two years. Readers of this book will become articulate in the new state of the art of ASR established by DNN technology, and will be poised to build new ASR systems to match or exceed human performance.

Sadaoki Furui, President of Toyota Technological Institute at Chicago, and Professor at the Tokyo Institute of Technology
Preface

Automatic Speech Recognition (ASR), which aims to enable natural human–machine interaction, has been an intensive research area for decades. Many core technologies, such as Gaussian mixture models (GMMs), hidden Markov models (HMMs), mel-frequency cepstral coefficients (MFCCs) and their derivatives, n-gram language models (LMs), discriminative training, and various adaptation techniques, were developed along the way, mostly prior to the new millennium. These techniques greatly advanced the state of the art in ASR and its related fields. Compared to these earlier achievements, the advancement in the research and application of ASR in the decade before 2010 was relatively slow and less exciting, although important techniques such as GMM–HMM sequence-discriminative training were made to work well in practical systems during this period.
In the past several years, however, we have observed a new surge of interest in ASR. In our opinion, this change has been driven by the increased demand for ASR on mobile devices and by the success of new speech applications in the mobile world, such as voice search (VS), short message dictation (SMD), and virtual speech assistants (e.g., Apple's Siri, Google Now, and Microsoft's Cortana). Equally important is the development of deep learning techniques for large vocabulary continuous speech recognition (LVCSR), powered by big data and significantly increased computing power. A combination of deep learning techniques has led to an error-rate reduction of more than one-third over the conventional state-of-the-art GMM–HMM framework on many real-world LVCSR tasks and has helped ASR pass the adoption threshold for many real-world users. For example, the word accuracy in English or the character accuracy in Chinese now exceeds 90 % in most SMD systems, and even 95 % on some systems.
Given the recent surge of interest in ASR in both industry and academia, we, as researchers who have actively participated in and closely witnessed many of the recent exciting developments in deep learning technology, believe the time is ripe to write a book that summarizes the advancements in the ASR field, especially those made during the past several years.

Along with the development of the field over the past two decades or so, we
have seen a number of useful books on ASR and on machine learning related to
ASR, some of which are listed here:
• Deep Learning: Methods and Applications, by Li Deng and Dong Yu (June
2014)
• Automatic Speech and Speaker Recognition: Large Margin and Kernel Methods,
by Joseph Keshet, Samy Bengio (January 2009)
• Speech Recognition Over Digital Channels: Robustness and Standards, by
Antonio Peinado and Jose Segura (September 2006)
• Pattern Recognition in Speech and Language Processing, by Wu Chou and
Biing-Hwang Juang (February 2003)
• Speech Processing—A Dynamic and Optimization-Oriented Approach, by Li
Deng and Doug O’Shaughnessy (June 2003)
• Spoken Language Processing: A Guide to Theory, Algorithm and System
Development, by Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon (April
2001)
• Digital Speech Processing: Synthesis, and Recognition, Second Edition, by
Sadaoki Furui (June 2001)
• Speech Communications: Human and Machine, Second Edition, by Douglas
O’Shaughnessy (June 2000)
• Speech and Language Processing—An Introduction to Natural Language Pro-
cessing, Computational Linguistics, and Speech Recognition, by Daniel Jurafsky
and James Martin (April 2000)
• Speech and Audio Signal Processing, by Ben Gold and Nelson Morgan (April
2000)
• Statistical Methods for Speech Recognition, by Fred Jelinek (June 1997)
• Fundamentals of Speech Recognition, by Lawrence Rabiner and Biing-Hwang
Juang (April 1993)
• Acoustical and Environmental Robustness in Automatic Speech Recognition, by
Alex Acero (November 1992).
All these books, however, were either published before the rise of deep learning for ASR in 2009 or, like our 2014 overview book, focused less on the technical aspects of deep learning for ASR than is desired. These earlier books did not cover the new deep learning techniques developed after 2010 with the technical and mathematical detail demanded by ASR or deep learning specialists. Different from the above books, and in addition to some necessary background material, the current book is mainly a collation of research on the most recent advances in deep learning, or discriminative and hierarchical models, as applied specifically to the field of ASR. It presents the insights and theoretical foundations of a series of deep learning models, such as the deep neural network (DNN), restricted Boltzmann machine (RBM), denoising autoencoder, deep belief network, recurrent neural network (RNN), and long short-term memory (LSTM) RNN, and their application to ASR through a variety of techniques, including the DNN-HMM hybrid system, the tandem and bottleneck systems, multitask and transfer learning, sequence-discriminative training, and DNN adaptation. The book further discusses practical considerations, tricks, setups, and speedups for applying the deep learning models and related techniques to building real-world, real-time ASR systems. To set the background, the book also includes two chapters that introduce GMMs and HMMs with their variants; however, we omit details of the GMM–HMM techniques that do not directly relate to the theme of the book, namely the hierarchical modeling or deep learning approach. Our book is thus complementary to, rather than a replacement for, the published books listed above on many similar topics. We believe this book will be of interest to advanced graduate students, researchers, practitioners, engineers, and scientists in the speech processing and machine learning fields. We hope it not only serves as a reference for many of the techniques used in the field but also ignites new ideas to further advance the field.
During the preparation of the book, we received encouragement and help from Alex Acero, Geoffrey Zweig, Qiang Huo, Frank Seide, Jasha Droppo, Mike Seltzer, and Chin-Hui Lee. We also thank the Springer editors, Agata Oelschlaeger and Kiruthika Poomalai, for their kind and timely help in polishing the book and for handling its publication.

Seattle, USA, July 2014
Dong Yu
Li Deng
Contents

1 Introduction ... 1
  1.1 Automatic Speech Recognition: A Bridge for Better Communication ... 1
    1.1.1 Human–Human Communication ... 2
    1.1.2 Human–Machine Communication ... 2
  1.2 Basic Architecture of ASR Systems ... 4
  1.3 Book Organization ... 5
    1.3.1 Part I: Conventional Acoustic Models ... 6
    1.3.2 Part II: Deep Neural Networks ... 6
    1.3.3 Part III: DNN-HMM Hybrid Systems for ASR ... 7
    1.3.4 Part IV: Representation Learning in Deep Neural Networks ... 7
    1.3.5 Part V: Advanced Deep Models ... 7
  References ... 8

Part I Conventional Acoustic Models

2 Gaussian Mixture Models ... 13
  2.1 Random Variables ... 13
  2.2 Gaussian and Gaussian-Mixture Random Variables ... 14
  2.3 Parameter Estimation ... 17
  2.4 Mixture of Gaussians as a Model for the Distribution of Speech Features ... 18
  References ... 20

3 Hidden Markov Models and the Variants ... 23
  3.1 Introduction ... 23
  3.2 Markov Chains ... 25
  3.3 Hidden Markov Sequences and Models ... 26
    3.3.1 Characterization of a Hidden Markov Model ... 27
    3.3.2 Simulation of a Hidden Markov Model ... 29
    3.3.3 Likelihood Evaluation of a Hidden Markov Model ... 29
    3.3.4 An Algorithm for Efficient Likelihood Evaluation ... 30
    3.3.5 Proofs of the Forward and Backward Recursions ... 32
  3.4 EM Algorithm and Its Application to Learning HMM Parameters ... 33
    3.4.1 Introduction to EM Algorithm ... 33
    3.4.2 Applying EM to Learning the HMM—Baum-Welch Algorithm ... 35
  3.5 Viterbi Algorithm for Decoding HMM State Sequences ... 39
    3.5.1 Dynamic Programming and Viterbi Algorithm ... 39
    3.5.2 Dynamic Programming for Decoding HMM States ... 40
  3.6 The HMM and Variants for Generative Speech Modeling and Recognition ... 42
    3.6.1 GMM-HMMs for Speech Modeling and Recognition ... 43
    3.6.2 Trajectory and Hidden Dynamic Models for Speech Modeling and Recognition ... 44
    3.6.3 The Speech Recognition Problem Using Generative Models of HMM and Its Variants ... 46
  References ... 48

Part II Deep Neural Networks

4 Deep Neural Networks ... 57
  4.1 The Deep Neural Network Architecture ... 57
  4.2 Parameter Estimation with Error Backpropagation ... 59
    4.2.1 Training Criteria ... 60
    4.2.2 Training Algorithms ... 61
  4.3 Practical Considerations ... 65
    4.3.1 Data Preprocessing ... 65
    4.3.2 Model Initialization ... 67
    4.3.3 Weight Decay ... 68
    4.3.4 Dropout ... 69
    4.3.5 Batch Size Selection ... 70
    4.3.6 Sample Randomization ... 72
    4.3.7 Momentum ... 73
    4.3.8 Learning Rate and Stopping Criterion ... 73
    4.3.9 Network Architecture ... 75
    4.3.10 Reproducibility and Restartability ... 75
  References ... 76

5 Advanced Model Initialization Techniques ... 79
  5.1 Restricted Boltzmann Machines ... 79
    5.1.1 Properties of RBMs ... 81
    5.1.2 RBM Parameter Learning ... 83
  5.2 Deep Belief Network Pretraining ... 86
  5.3 Pretraining with Denoising Autoencoder ... 89
  5.4 Discriminative Pretraining ... 91
  5.5 Hybrid Pretraining ... 92
  5.6 Dropout Pretraining ... 93
  References ... 94

Part III Deep Neural Network-Hidden Markov Model Hybrid Systems for Automatic Speech Recognition

6 Deep Neural Network-Hidden Markov Model Hybrid Systems ... 99
  6.1 DNN-HMM Hybrid Systems ... 99
    6.1.1 Architecture ... 99
    6.1.2 Decoding with CD-DNN-HMM ... 101
    6.1.3 Training Procedure for CD-DNN-HMMs ... 102
    6.1.4 Effects of Contextual Window ... 104
  6.2 Key Components in the CD-DNN-HMM and Their Analysis ... 106
    6.2.1 Datasets and Baselines for Comparisons and Analysis ... 106
    6.2.2 Modeling Monophone States or Senones ... 108
    6.2.3 Deeper Is Better ... 109
    6.2.4 Exploit Neighboring Frames ... 111
    6.2.5 Pretraining ... 111
    6.2.6 Better Alignment Helps ... 112
    6.2.7 Tuning Transition Probability ... 113
  6.3 Kullback-Leibler Divergence-Based HMM ... 113
  References ... 114

7 Training and Decoding Speedup ... 117
  7.1 Training Speedup ... 117
    7.1.1 Pipelined Backpropagation Using Multiple GPUs ... 118
    7.1.2 Asynchronous SGD ... 121
    7.1.3 Augmented Lagrangian Methods and Alternating Directions Method of Multipliers ... 124
    7.1.4 Reduce Model Size ... 126
    7.1.5 Other Approaches ... 127
  7.2 Decoding Speedup ... 127
    7.2.1 Parallel Computation ... 128
    7.2.2 Sparse Network ... 130
    7.2.3 Low-Rank Approximation ... 132
    7.2.4 Teach Small DNN with Large DNN ... 133
    7.2.5 Multiframe DNN ... 134
  References ... 135

8 Deep Neural Network Sequence-Discriminative Training ... 137
  8.1 Sequence-Discriminative Training Criteria ... 137
    8.1.1 Maximum Mutual Information ... 137
    8.1.2 Boosted MMI ... 139
    8.1.3 MPE/sMBR ... 140
    8.1.4 A Uniformed Formulation ... 141
  8.2 Practical Considerations ... 142
    8.2.1 Lattice Generation ... 142
    8.2.2 Lattice Compensation ... 143
    8.2.3 Frame Smoothing ... 145
    8.2.4 Learning Rate Adjustment ... 146
    8.2.5 Training Criterion Selection ... 146
    8.2.6 Other Considerations ... 147
  8.3 Noise Contrastive Estimation ... 147
    8.3.1 Casting Probability Density Estimation Problem as a Classifier Design Problem ... 148
    8.3.2 Extension to Unnormalized Models ... 150
    8.3.3 Apply NCE in DNN Training ... 151
  References ... 153

Part IV Representation Learning in Deep Neural Networks

9 Feature Representation Learning in Deep Neural Networks ... 157
  9.1 Joint Learning of Feature Representation and Classifier ... 157
  9.2 Feature Hierarchy ... 159
  9.3 Flexibility in Using Arbitrary Input Features ... 162
  9.4 Robustness of Features ... 163
    9.4.1 Robust to Speaker Variations ... 163
    9.4.2 Robust to Environment Variations ... 165
  9.5 Robustness Across All Conditions ... 167
    9.5.1 Robustness Across Noise Levels ... 167
    9.5.2 Robustness Across Speaking Rates ... 169
  9.6 Lack of Generalization Over Large Distortions ... 170
  References ... 173

10 Fuse Deep Neural Network and Gaussian Mixture Model Systems ... 177
  10.1 Use DNN-Derived Features in GMM-HMM Systems ... 177
    10.1.1 GMM-HMM with Tandem and Bottleneck Features ... 177
    10.1.2 DNN-HMM Hybrid System Versus GMM-HMM System with DNN-Derived Features ... 180
  10.2 Fuse Recognition Results ... 182
    10.2.1 ROVER ... 183
    10.2.2 SCARF ... 184
    10.2.3 MBR Lattice Combination ... 185
  10.3 Fuse Frame-Level Acoustic Scores ... 186
  10.4 Multistream Speech Recognition ... 187
  References ... 189

11 Adaptation of Deep Neural Networks ... 193
  11.1 The Adaptation Problem for Deep Neural Networks ... 193
  11.2 Linear Transformations ... 194
    11.2.1 Linear Input Networks ... 195
    11.2.2 Linear Output Networks ... 196
  11.3 Linear Hidden Networks ... 198
  11.4 Conservative Training ... 199
    11.4.1 L2 Regularization ... 199
    11.4.2 KL-Divergence Regularization ... 200
    11.4.3 Reducing Per-Speaker Footprint ... 202
  11.5 Subspace Methods ... 204
    11.5.1 Subspace Construction Through Principal Component Analysis ... 204
    11.5.2 Noise-Aware, Speaker-Aware, and Device-Aware Training ... 205
    11.5.3 Tensor ... 209
  11.6 Effectiveness of DNN Speaker Adaptation ... 210
    11.6.1 KL-Divergence Regularization Approach ... 210
    11.6.2 Speaker-Aware Training ... 212
  References ... 213

Part V Advanced Deep Models

12 Representation Sharing and Transfer in Deep Neural Networks ... 219
  12.1 Multitask and Transfer Learning ... 219
    12.1.1 Multitask Learning ... 219
    12.1.2 Transfer Learning ... 220
  12.2 Multilingual and Crosslingual Speech Recognition ... 221
    12.2.1 Tandem/Bottleneck-Based Crosslingual Speech Recognition ... 222
    12.2.2 Shared-Hidden-Layer Multilingual DNN ... 223
    12.2.3 Crosslingual Model Transfer ... 226
  12.3 Multiobjective Training of Deep Neural Networks for Speech Recognition ... 230
    12.3.1 Robust Speech Recognition with Multitask Learning ... 230
    12.3.2 Improved Phone Recognition with Multitask Learning ... 230
    12.3.3 Recognizing both Phonemes and Graphemes ... 231
  12.4 Robust Speech Recognition Exploiting Audio-Visual Information ... 232
  References ... 233

13 Recurrent Neural Networks and Related Models ... 237
  13.1 Introduction ... 237
  13.2 State-Space Formulation of the Basic Recurrent Neural Network ... 239
  13.3 The Backpropagation-Through-Time Learning Algorithm ... 240
    13.3.1 Objective Function for Minimization ... 241
    13.3.2 Recursive Computation of Error Terms ... 241
    13.3.3 Update of RNN Weights ... 242
  13.4 A Primal-Dual Technique for Learning Recurrent Neural Networks ... 244
    13.4.1 Difficulties in Learning RNNs ... 244
    13.4.2 Echo-State Property and Its Sufficient Condition ... 245
    13.4.3 Learning RNNs as a Constrained Optimization Problem ... 245
    13.4.4 A Primal-Dual Method for Learning RNNs ... 246
  13.5 Recurrent Neural Networks Incorporating LSTM Cells ... 249
    13.5.1 Motivations and Applications ... 249
    13.5.2 The Architecture of LSTM Cells ... 250
    13.5.3 Training the LSTM-RNN ... 250
  13.6 Analyzing Recurrent Neural Networks—A Contrastive Approach ... 251
    13.6.1 Direction of Information Flow: Top-Down versus Bottom-Up ... 251
    13.6.2 The Nature of Representations: Localist or Distributed ... 254
    13.6.3 Interpretability: Inferring Latent Layers versus End-to-End Learning ... 255
    13.6.4 Parameterization: Parsimonious Conditionals versus Massive Weight Matrices ... 256
    13.6.5 Methods of Model Learning: Variational Inference versus Gradient Descent ... 258
    13.6.6 Recognition Accuracy Comparisons ... 258
  13.7 Discussions ... 259
  References ... 261

14 Computational Network ... 267
  14.1 Computational Network ... 267
  14.2 Forward Computation ... 269
  14.3 Model Training ... 271
  14.4 Typical Computation Nodes ... 275
    14.4.1 Computation Node Types with No Operand ... 276
    14.4.2 Computation Node Types with One Operand ... 276
    14.4.3 Computation Node Types with Two Operands ... 281
    14.4.4 Computation Node Types for Computing Statistics ... 287
  14.5 Convolutional Neural Network ... 288
  14.6 Recurrent Connections ... 291
    14.6.1 Sample by Sample Processing Only Within Loops ... 292
    14.6.2 Processing Multiple Utterances Simultaneously ... 293
    14.6.3 Building Arbitrary Recurrent Neural Networks ... 293
  References ... 297

15 Summary and Future Directions ... 299
  15.1 Road Map ... 299
    15.1.1 Debut of DNNs for ASR ... 299
    15.1.2 Speedup of DNN Training and Decoding ... 302
    15.1.3 Sequence Discriminative Training ... 302
    15.1.4 Feature Processing ... 303
    15.1.5 Adaptation ... 304
    15.1.6 Multitask and Transfer Learning ... 305
    15.1.7 Convolution Neural Networks ... 305
    15.1.8 Recurrent Neural Networks and LSTM ... 306
    15.1.9 Other Deep Models ... 306
  15.2 State of the Art and Future Directions ... 307
    15.2.1 State of the Art—A Brief Analysis ... 307
    15.2.2 Future Directions ... 308
  References ... 309

Index ... 317
Acronyms

ADMM Alternating directions method of multipliers
AE-BN Autoencoder bottleneck
ALM Augmented Lagrangian multiplier
AM Acoustic model
ANN Artificial neural network
ANN-HMM Artificial neural network-hidden Markov model
ASGD Asynchronous stochastic gradient descent
ASR Automatic speech recognition
BMMI Boosted maximum mutual information
BP Backpropagation
BPTT Backpropagation through time
CD Contrastive divergence
CD-DNN-HMM Context-dependent-deep neural network-hidden Markov model
CE Cross entropy
CHiME Computational hearing in multisource environments
CN Computational network
CNN Convolutional neural network
CNTK Computational network toolkit
CT Conservative training
DAG Directed acyclic graph
DaT Device-aware training
DBN Deep belief network
DNN Deep neural network
DNN-GMM-HMM Deep neural network-Gaussian mixture model–hidden Markov model
DNN-HMM Deep neural network-hidden Markov model
DP Dynamic programming
DPT Discriminative pretraining

EBW Extended Baum–Welch algorithm
EM Expectation-maximization
fDLR Feature-space discriminative linear regression
fMLLR Feature-space maximum likelihood linear regression
FSA Feature-space speaker adaptation
F-smoothing Frame-smoothing
GMM Gaussian mixture model
GPGPU General-purpose graphical processing units
HDM Hidden dynamic model
HMM Hidden Markov model
HTM Hidden trajectory model
IID Independent and identically distributed
KLD Kullback–Leibler divergence
KL-HMM Kullback–Leibler divergence-based HMM
LBP Layer-wise backpropagation
LHN Linear hidden network
LIN Linear input network
LM Language model
LON Linear output network
LSTM Long short-term memory (recurrent neural network)
LVCSR Large vocabulary continuous speech recognition
LVSR Large vocabulary speech recognition
MAP Maximum a posteriori
MBR Minimum Bayesian risk
MFCC Mel-frequency cepstral coefficient
MLP Multi-layer perceptron
MMI Maximum mutual information
MPE Minimum phone error
MSE Mean square error
MTL Multitask learning
NAT Noise adaptive training
NaT Noise-aware training
NCE Noise contrastive estimation
NLL Negative log-likelihood
oDLR Output-feature discriminative linear regression
PCA Principal component analysis
PLP Perceptual linear prediction
RBM Restricted Boltzmann machine
ReLU Rectified linear unit
RKL Reverse Kullback–Leibler divergence
RNN Recurrent neural network
ROVER Recognizer output voting error reduction
RTF Real-time factor
SaT Speaker-aware training
SCARF Segmental conditional random field
SGD Stochastic gradient descent
SHL-MDNN Shared-hidden-layer multilingual DNN
SIMD Single instruction multiple data
SKL Symmetric Kullback–Leibler divergence
sMBR State minimum Bayesian risk
SMD Short message dictation
SVD Singular value decomposition
SWB Switchboard
UBM Universal background model
VS Voice search
VTLN Vocal tract length normalization
VTS Vector Taylor series
WTN Word transition network
Symbols

General Mathematical Operators


x A vector
x_i The i-th element of x
|x| Absolute value of x
‖x‖ Norm of vector x
x^T Transpose of vector x
a^T b Inner product of vectors a and b
a b^T Outer product of vectors a and b
a ⊙ b Element-wise product of vectors a and b
a × b Cross product of vectors a and b
A A matrix
A_ij The element value at the i-th row and j-th column of matrix A
tr(A) Trace of matrix A
A ⊛ B Khatri-Rao product of matrices A and B
A ⊘ B Element-wise division of A and B
A • B Inner product of vectors applied on matrices A and B column-wise
A ◦ B Inner product of vectors applied on matrices A and B row-wise
A^-1 Inverse of matrix A
A^† Pseudoinverse of matrix A
A^α Element-wise power of matrix A
vec(A) The vector formed by concatenating columns of A
I_n n × n identity matrix
1_{m,n} m × n matrix with all 1's
E Statistical expectation operator
V Statistical covariance operator
⟨x⟩ Average of vector x
∗ Convolution operator
H Hessian matrix
J Jacobian matrix
p(x) Probability density function of random vector x
P(x) Probability of x
∇ Gradient operator

More Specific Mathematical Symbols


w* Optimal w
ŵ Estimated value of w
R Correlation matrix
Z Partition function
v Visible units in a network
h Hidden units in a network
o Observation (feature) vector
y Output prediction vector
ε Learning rate
θ Threshold
λ Regularization parameter
N(x; μ, Σ) Random variable x follows a Gaussian distribution with mean vector μ and covariance matrix Σ
μ_i i-th component of the mean vector μ
σ²_i The i-th variance component
c_m Gaussian mixture component weight for the m-th Gaussian
a_{ij} HMM transition probability from state i to state j
b_i(o) HMM emission probability for observation o at state i
Λ The entire model parameter set
q HMM state sequence
π HMM initial state probability
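
As a quick illustration of how several of these symbols are combined later in the book, the following LaTeX sketch writes out the standard GMM state-emission density and the GMM–HMM joint likelihood of an observation sequence in this notation. The per-state subscripting (e.g., c_{i,m}) and the exact typesetting are illustrative assumptions, not a verbatim excerpt from the book.

% Illustrative sketch (not a verbatim excerpt): the GMM emission density
% b_i(o) of HMM state i, built from mixture weights c_{i,m} and Gaussians
% N(o; mu_{i,m}, Sigma_{i,m}), and the joint likelihood of an observation
% sequence o_1,...,o_T with a state sequence q = (q_1,...,q_T) under model
% parameters Lambda, using transition probabilities a_{ij} and initial
% state probabilities pi.
\begin{align}
  b_i(\mathbf{o}) &= \sum_{m=1}^{M} c_{i,m}\,
      \mathcal{N}\!\left(\mathbf{o};\, \boldsymbol{\mu}_{i,m},\, \boldsymbol{\Sigma}_{i,m}\right) \\
  p(\mathbf{o}_1,\ldots,\mathbf{o}_T, q \mid \Lambda) &= \pi_{q_1}\, b_{q_1}(\mathbf{o}_1)
      \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(\mathbf{o}_t)
\end{align}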
