
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.DOI

arXiv:2111.10592v1 [cs.SD] 20 Nov 2021

Deep Spoken Keyword Spotting: An Overview

IVÁN LÓPEZ-ESPEJO¹, ZHENG-HUA TAN¹ (Senior Member, IEEE), JOHN HANSEN² (Fellow, IEEE), AND JESPER JENSEN¹,³
¹Department of Electronic Systems, Aalborg University, 9220 Aalborg, Denmark (e-mail: {ivl,zt,jje}@es.aau.dk)
²Erik Jonsson School of Engineering and Computer Science, The University of Texas at Dallas, Richardson, TX 75080, USA (e-mail: [email protected])
³Oticon A/S, 2765 Smørum, Denmark (e-mail: [email protected])

Corresponding author: Iván López-Espejo (e-mail: [email protected]).

This work was supported, in part, by the Demant Foundation.

ABSTRACT Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices for different purposes, such as the activation of voice assistants. Prospects suggest sustained growth in the social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvements and computational complexity reduction. This context motivates this paper, in which we conduct a literature review of deep spoken KWS to assist practitioners and researchers interested in this technology. Specifically, this overview is comprehensive in nature, covering a thorough analysis of deep KWS systems (including speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, the performance of deep KWS systems, and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.

INDEX TERMS Keyword spotting, deep learning, acoustic model, small footprint, robustness.

I. INTRODUCTION

Interacting with machines via voice is not science fiction anymore. Quite the opposite, speech technologies have become ubiquitous in today's society. The proliferation of voice assistants like Amazon's Alexa, Apple's Siri, Google's Assistant and Microsoft's Cortana is good proof of this [1]. A distinctive feature of voice assistants is that, in order to be used, they first have to be activated by means of a spoken wake-up word or keyword, thereby avoiding running far more computationally expensive automatic speech recognition (ASR) when it is not required [2]. More specifically, voice assistants deploy a technology called spoken keyword spotting —or simply keyword spotting—, which can be understood as a subproblem of ASR [3]. Particularly, keyword spotting (KWS) can be defined as the task of identifying keywords in audio streams comprising speech. Apart from activating voice assistants, KWS has plenty of applications such as speech data mining, audio indexing, phone call routing, etc. [4].

Over the years, different techniques have been explored for KWS. One of the earliest approaches is based on the use of large-vocabulary continuous speech recognition (LVCSR) systems [5]–[7]. These systems are employed to decode the speech signal, and then the keyword is searched in the generated lattices (i.e., in the representations of the different sequences of phonetic units that, given the speech signal, are likely enough). One of the advantages of this approach is the flexibility to deal with changing/non-predefined keywords [8]–[10] (although there is often a drop in performance when keywords are out of vocabulary [11]). The main disadvantage of LVCSR-based KWS systems might reside in the computational complexity dimension: these systems need to generate rich lattices, which requires high computational resources [9], [12] and also introduces latency [13]. While this should not be an issue for some applications like offline audio search [9], [14], LVCSR systems are not suitable for the lately-popular KWS applications¹ intended for small electronic devices (e.g., smartphones, smart speakers and wearables) characterized by notable memory, computation and power constraints [12], [15]–[17].

¹By lately-popular KWS applications we mean activation of voice assistants, voice control, etc.

A still attractive and lighter alternative to LVCSR is the keyword/filler hidden Markov model (HMM) approach, which was proposed around three decades ago [18]–[20]. In this approach, a keyword HMM and a filler HMM are trained to model keyword and non-keyword audio segments, respectively, as illustrated by Figure 1. Originally, the acoustic features were modeled by means of Gaussian mixture models (GMMs) to produce the state emission likelihoods in keyword/filler HMM-based KWS [18]–[20]. Nowadays, similarly to the case of ASR, deep neural networks (DNNs) have replaced GMMs for this purpose [21]–[24] due to the consistently superior performance of the former. Viterbi decoding [25] is applied at runtime to find the best path in the decoding graph, and, whenever the likelihood ratio of the keyword model versus the filler model is larger than a predefined threshold, the KWS system is triggered [13]. While this type of KWS system is rather compact and performs well, it still needs Viterbi decoding, which, depending on the HMM topology, can be computationally demanding [12], [22].

FIGURE 1. Scheme of a keyword/filler HMM-based KWS system [13] when the system keyword is "keyword" (the keyword HMM is a left-to-right sequence of phone states /k/ /i/ /w/ /ɜr/ /d/; the filler HMM is a speech/non-speech loop). While typically the keyword is modeled by a context-dependent triphone-based HMM, a monophone-based HMM is depicted instead for illustrative purposes. The filler HMM is often a speech/non-speech monophone loop.

The arrival of 2014 represented a milestone for KWS technology as a result of the publication of the first deep spoken KWS system [22]. In this paradigm (new at the time), the sequence of word posterior probabilities yielded by a DNN is directly processed to determine the possible existence of keywords without the intervention of any HMM (see Figure 2). The deep KWS paradigm has recently attracted much attention [16], [26] for a threefold reason:

1) It does not require a complicated sequence search algorithm (i.e., Viterbi decoding); instead, a significantly simpler posterior handling suffices;
2) The complexity of the DNN producing the posteriors (acoustic model) can be easily adjusted [9], [26] to fit the computational resource constraints;
3) It brings consistent, significant improvements over the keyword/filler HMM approach in small-footprint (i.e., low memory and low computational complexity) scenarios in both clean and noisy conditions [17], [22].

This threefold reason makes it very appealing to deploy the deep KWS paradigm in a variety of consumer electronics with limited resources like earphones and headphones [27], smartphones, smart speakers and so on. Thus, much research on deep KWS has been conducted from 2014 until today, e.g., [15], [22], [26], [28]–[32]. What is more, we can expect that deep KWS will continue to be a hot topic in the future despite all the progress made.

In this paper, we present an overview of deep spoken keyword spotting technology. We believe that this is a good time to look back and analyze the development trajectory of deep KWS to elucidate future challenges. It is worth noticing that only a small number of KWS overview articles is presently available in the literature [33]–[36]; at best, they shallowly encompass state-of-the-art deep KWS approaches, along with the most relevant datasets. Furthermore, while some relatively recent ASR overview articles covering acoustic modeling —which is a central part of KWS, see Figure 2— can also be found [37], [38], (deep) KWS still involves inherent issues that need to be specifically addressed. Some of these inherent issues are related to posterior handling (see Figure 2), the class-imbalance problem [39], technology applications, datasets and evaluation metrics. To sum up, we can state that 1) deep spoken KWS is currently a hot topic², 2) available KWS overview articles are outdated and/or offer only a limited treatment of the latest progress, and 3) deep KWS involves unique inherent issues compared to general-purpose ASR. Thus, this article aims at providing practitioners and researchers who are interested in the topic of keyword spotting with an up-to-date, comprehensive overview of this technology.

²A proof of this is the organization of events like the Auto-KWS 2021 Challenge [40].

The rest of this article is organized as follows: in Section II, the general approach to deep spoken KWS is introduced. Then, in Sections III, IV and V, respectively, the three main components constituting a modern KWS system are analyzed, i.e., speech feature extraction, acoustic modeling and posterior handling. In Section VI, we review current methods to strengthen the robustness of KWS systems against different sources of distortion. Applications of KWS are discussed in Section VII. Then, in Section VIII, we analyze the speech corpora currently employed for experimentally validating the latest KWS developments. The most important evaluation metrics for KWS are examined in Section IX. In Section X, a comparison among some of the latest deep KWS systems in terms of both KWS performance and computational complexity is presented. Section XI comprises a short review of the literature on audio-visual KWS. Finally, concluding remarks and comments about future directions in the field are given in Section XII.
FIGURE 2. General pipeline of a modern deep spoken keyword spotting system (blocks: Speech Feature Extraction → Deep Learning-based KWS Acoustic Model f(·|θ) → Posterior Handling → Decision, with signals x(m), X^{i} and y^{i}): 1) features are extracted from the speech signal, 2) a DNN acoustic model uses these features to produce posteriors over the different keyword and filler (non-keyword) classes, and 3) the temporal sequence of these posteriors is processed (posterior handling) to determine the possible existence of keywords.

II. DEEP SPOKEN KEYWORD SPOTTING APPROACH

Figure 2 depicts the general pipeline of a modern deep spoken keyword spotting system [15], [22], [28], [41]–[43], which is composed of three main blocks: 1) the speech feature extractor converting the input signal to a compact speech representation, 2) the deep learning-based acoustic model producing posteriors over the different keyword and filler (non-keyword) classes from the speech features (see the example of Figure 3), and 3) the posterior handler processing the temporal sequence of posteriors to determine the possible existence of keywords in the input signal.

FIGURE 3. Illustrative example of how a DNN acoustic model performs. There are N = 4 different classes representing the keywords "right" and "left", other speech and silence/noise. The acoustic model receives a speech segment X^{i} (log-Mel spectrogram) comprising the keyword "left". The DNN produces a posterior distribution over the N = 4 different classes (0.1, 0.8, 0.1 and 0.0 for "right", "left", other speech and silence/noise, respectively). Keyword "left" is given the highest posterior probability, 0.8.

Let x(m) be a finite acoustic time signal comprising speech. In the first place, the speech feature extractor computes an alternative representation of x(m), namely, X. It is desirable for X to be compact (i.e., lower-dimensional, to limit the computational complexity of the task), discriminative in terms of the phonetic content and robust to acoustic variations [44]. Speech features X are traditionally represented by a two-dimensional matrix composed of a time sequence of K-dimensional feature vectors x_t (t = 0, ..., T − 1) as in

X = (x_0, \ldots, x_t, \ldots, x_{T-1}) \in \mathbb{R}^{K \times T}, \qquad (1)

where T, the total number of feature vectors, depends on the length of the signal x(m). Speech features X can be based on a diversity of representation types, such as, e.g., spectral [22], [28], [45], cepstral [16], [46] and time-domain ones [47]. Further details about the different types of speech features used for KWS are provided in Section III.

The DNN acoustic model receives X as input and outputs a sequence of posterior probabilities over the different keyword and non-keyword classes. Particularly, the acoustic model sequentially consumes time segments

X^{\{i\}} = (x_{is-P}, \ldots, x_{is}, \ldots, x_{is+F}) \qquad (2)

of X until the whole feature sequence X is processed. In Eq. (2), i = ⌈P/s⌉, ..., ⌊(T − 1 − F)/s⌋ is an integer segment index and s represents the time frame shift. Moreover, P and F denote, respectively, the number of past and future frames (temporal context) in each segment X^{i} ∈ R^{K×(P+F+1)}. While s is typically designed to have some degree of overlap between consecutive segments X^{i} and X^{i+1}, many works consider acoustic models classifying non-overlapping segments that are sufficiently long (e.g., one second) to cover an entire keyword [16], [30], [48]–[53]. With regard to P and F, a number of approaches considers F < P to reduce latency without significantly sacrificing performance [12], [22], [28], [41]. In addition, voice activity detection [54] is sometimes used to reduce power consumption by only inputting to the acoustic model segments X^{i} in which voice is present [11], [22], [55]–[57].

Then, let us suppose that the DNN acoustic model f(·|θ): R^{K×(P+F+1)} → I^N has N output nodes representing N different classes, where θ and I = [0, 1] denote the parameters of the acoustic model and the unit interval, respectively. Normally, the output nodes represent either words [12], [16], [22], [28], [30], [41], [43], [48]–[53], [57]–[59] or subword units like context-independent phonemes [31], [60]–[62], the latter especially in the context of sequence-to-sequence models [63]–[65] (see Subsection IV-C for further details). Let subscript n refer to the n-th element of a vector. For every input segment X^{i}, the acoustic model yields

y_n^{\{i\}} = f_n\big(X^{\{i\}} \mid \theta\big), \quad n = 1, \ldots, N, \qquad (3)

where y_n^{i} = P(C_n | X^{i}, θ) is the posterior of the n-th class C_n given the input feature segment X^{i}. To ensure that Σ_{n=1}^{N} y_n^{i} = 1 ∀i, deep KWS systems commonly employ a fully-connected layer with softmax activation [66] as an output layer, e.g., [16], [43], [47], [52], [60], [67]–[72].
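For illustration purposes only, the following minimal Python sketch shows how Eqs. (1)–(3) can be realized: a K × T feature matrix is segmented with P past frames, F future frames and frame shift s, and a softmax acoustic model maps each segment to a posterior vector. The toy model and all dimensions are hypothetical placeholders rather than a configuration taken from any cited work; any architecture reviewed in Section IV could be plugged in instead.

```python
import torch
import torch.nn as nn

def segment_features(X, P, F, s):
    """Slice a K x T feature matrix into segments X{i} with P past and F future
    frames around frame i*s (Eq. (2)); i runs over ceil(P/s)..floor((T-1-F)/s)."""
    K, T = X.shape
    i_min = -(-P // s)              # ceil(P / s)
    i_max = (T - 1 - F) // s        # floor((T - 1 - F) / s)
    segments = [X[:, i * s - P : i * s + F + 1] for i in range(i_min, i_max + 1)]
    return torch.stack(segments)    # shape: (num_segments, K, P + F + 1)

class TinyAcousticModel(nn.Module):
    """Hypothetical acoustic model f(.|theta): flattened segment -> N class posteriors."""
    def __init__(self, K, P, F, N):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(K * (P + F + 1), 128), nn.ReLU(),
            nn.Linear(128, N),
            nn.Softmax(dim=-1),     # guarantees the posteriors sum to 1 (Eq. (3))
        )

    def forward(self, x):
        return self.net(x)

K, T, P, F, s, N = 40, 300, 30, 10, 3, 4        # illustrative dimensions
X = torch.randn(K, T)                            # stand-in for log-Mel features
model = TinyAcousticModel(K, P, F, N)
posteriors = model(segment_features(X, P, F, s))  # (num_segments, N), i.e., the y{i}
```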
FIGURE 4. Example of the processing of two consecutive feature segments X^{i} and X^{i+1}, from X comprising the keyword "right", by a DNN acoustic model: (a) when using an overlapping segmentation window (both segments assign a high posterior, 0.9, to "right"), and (b) when using a smaller, non-overlapping one (the posterior of "right" drops to 0.45 and 0.4, respectively).

The parameters of the model, θ, are usually estimated by discriminatively training f(·|θ) with backpropagation on annotated speech data characterizing the N different classes. The most popular loss function employed to this end is the cross-entropy loss [73], [74].

Figure 3 shows an example, illustrating the above paragraph, in which there are N = 4 different classes. Two of these classes represent the keywords "right" (C_1) and "left" (C_2). The other two classes are the filler classes other speech (C_3) and silence/noise (C_4). A segment X^{i} consisting of a log-Mel spectrogram comprising the keyword "left" is input to the DNN acoustic model, which then generates a posterior distribution y^{i} over the N = 4 classes. Keyword "left" is given the highest posterior probability, namely, y_2^{i} = P(C_2 | X^{i}, θ) = 0.8.

Most of the research that has been conducted on deep KWS has focused on its key part, which is the design of increasingly accurate and decreasingly computationally complex acoustic models f(·|θ) [32], [75].

Finally, KWS is not a static task but a dynamic one in which the KWS system has to continuously listen to the input signal x(m) to yield the sequence of posteriors y^{i}, i = ⌈P/s⌉, ..., ⌊(T − 1 − F)/s⌋, in order to detect keywords in real time. In the example in Figure 3, a straightforward way to do this could just be choosing the class Ĉ^{i} with the highest posterior, that is,

\hat{C}^{\{i\}} = \arg\max_{C_n} y_n^{\{i\}} = \arg\max_{C_n} P\big(C_n \mid X^{\{i\}}, \theta\big). \qquad (4)

Nevertheless, this approach is not robust, as discussed in what follows. Continuing with the illustration of Figure 3, Figure 4 exemplifies the processing by the acoustic model of two consecutive feature segments X^{i} and X^{i+1} from X comprising the keyword "right". Figure 4a shows the typical case of using an overlapping segmentation window. As we can see, following the approach of Eq. (4) might lead to detecting the same keyword realization twice, yielding a false alarm. In addition, Figure 4b depicts the case in which a non-overlapping segmentation window is employed. In this situation, the energy of the keyword realization leaks into two different segments in such a manner that neither the posterior P(C_1 | X^{i}, θ) nor P(C_1 | X^{i+1}, θ) is sufficiently strong for the keyword to be detected, thereby yielding a missed detection. Hence, proper handling of the sequence of posteriors y^{i} (i = ⌈P/s⌉, ..., ⌊(T − 1 − F)/s⌋) is a very important component for effective keyword detection [2], [4], [15], [22], [29], [41]–[43], [45], [46], [56], [76]–[79]. Posterior handling is examined in Section V.

FIGURE 5. Classical pipeline for extracting log-Mel spectral and Mel-frequency cepstral speech features using the fast Fourier transform (FFT): pre-emphasis, framing and windowing, FFT and squared magnitude, Mel-frequency warping and log compression yield the log-Mel spectrogram; a discrete cosine transform of the latter yields the Mel-frequency cepstrum.

III. SPEECH FEATURE EXTRACTION

In the following subsections, we walk through the most relevant speech features used in deep KWS: Mel-scale-related features, recurrent neural network features, low-precision features, learnable filterbank features and other features.

A. MEL-SCALE-RELATED FEATURES

Speech features based on the perceptually-motivated Mel-scale filterbank [80], like the log-Mel spectral coefficients and Mel-frequency cepstral coefficients (MFCCs) [81], have been widely used for decades in the fields of ASR and, indeed, KWS. Despite the multiple attempts to learn optimal, alternative representations from the speech signal (see Subsection III-D for more details), Mel-scale-related features are still a solid, competitive and safe choice nowadays [82]. Figure 5 depicts the well-known classical pipeline for extracting log-Mel spectral and MFCC features. In deep KWS, both types of speech features are commonly normalized to have zero mean and unit standard deviation before being input to the acoustic model, thereby stabilizing and speeding up training as well as improving model generalization [83].

Mel-scale-related features are, by far, the most widely used speech features in deep KWS. For example, MFCCs with temporal context and, sometimes, their first- and second-order derivatives are used in [16], [30], [46], [51]–[53], [84]–[91]. As can be seen from Figure 5, MFCCs are obtained
from the application of the discrete cosine transform to the log-Mel spectrogram. This transform produces approximately decorrelated features, which are well suited to, e.g., acoustic models based on GMMs that, for computational efficiency reasons, use diagonal covariance matrices. However, deep learning models are able to exploit spectro-temporal correlations, so using the log-Mel spectrogram instead of MFCCs yields equivalent or better ASR and KWS performance [92]. As a result, a good number of deep KWS works consider log-Mel or Mel filterbank speech features with temporal context, e.g., [8], [9], [15], [22], [26], [28], [29], [31], [43], [45], [48], [55], [58], [60], [62], [68], [71], [72], [78], [93]–[101]. In addition, [79] proposes instead the use of the first derivative of the log-Mel spectrogram to improve robustness against signal gain changes. The number of filterbank channels in the above works ranges from 20 to 128. In spite of this wide channel range, experience suggests that (deep) KWS performance is not significantly sensitive to the value of this parameter as long as the Mel-frequency resolution is not very poor [82]. This fact could promote the use of a lower number of filterbank channels in order to limit computational complexity.
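As an illustration of the classical pipeline of Figure 5, the short sketch below computes mean- and variance-normalized log-Mel and MFCC features. It assumes the third-party librosa library and a 16 kHz waveform; the file name and the specific frame and filterbank settings are arbitrary choices for the example, not values prescribed in this overview.

```python
import librosa

def logmel_and_mfcc(wav, sr=16000, n_mels=40, n_mfcc=13):
    # Power spectrogram followed by Mel-frequency warping (Figure 5).
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels, power=2.0)
    log_mel = librosa.power_to_db(mel)                       # log compression
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=n_mfcc)    # DCT of the log-Mel spectrogram
    # Zero-mean, unit-variance normalization per coefficient, as commonly done in deep KWS.
    normalize = lambda F: (F - F.mean(axis=1, keepdims=True)) / (F.std(axis=1, keepdims=True) + 1e-8)
    return normalize(log_mel), normalize(mfcc)

wav, _ = librosa.load("keyword.wav", sr=16000)               # hypothetical input file
log_mel, mfcc = logmel_and_mfcc(wav)
```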
B. RECURRENT NEURAL NETWORK FEATURES

Recurrent neural networks (RNNs) are helpful to summarize variable-length data sequences into fixed-length, compact feature vectors, also known as embeddings. Due to this fact, RNNs are very suitable for template matching problems like query-by-example (QbE) KWS, which involves keyword detection by determining the similarity between feature vectors (successively computed from the input audio stream) and keyword templates. In, e.g., [11], [56], [102]–[104], long short-term memory (LSTM) and gated recurrent unit (GRU) neural networks are employed to extract word embeddings. Generally, these are compared, by means of some distance function like cosine similarity [105] and particularly for QbE KWS, with keyword embeddings obtained during an enrollment phase.

While QbE KWS based on RNN feature extraction follows a different approach from that outlined in Section II, which is the main scope of this manuscript, we consider it pertinent to allude to it for the following twofold reason. First, there is little difference between the general pipeline of Figure 2 and QbE KWS based on RNN feature extraction, since acoustic modeling is implicitly carried out by the RNN³. Second, QbE KWS based on RNN feature extraction is especially useful for personalized, open-vocabulary KWS, by which a user is allowed to define her/his own keywords by just recording a few keyword samples during an enrollment phase. Alternatively, in [103], a clever RNN mechanism to generate keyword templates from text instead of speech inputs is proposed. Notice that incorporating new keywords in the context of the deep spoken KWS approach introduced in Section II might require system re-training, which is not always feasible.

³Actually, in [11], [102], the LSTM networks used there are pure acoustic models, and the word embeddings correspond to the activations prior to the output softmax layer.

QbE KWS based on RNN feature extraction has been shown to be more efficient and better performing than classical QbE KWS approaches based on LVCSR [106] and dynamic time warping (DTW) [107]. Therefore, the RNN feature approach is a good choice for on-device KWS applications providing keyword personalization.

C. LOW-PRECISION FEATURES

A way to diminish the energy consumption and memory footprint of deep KWS systems to be run on resource-constrained devices consists, e.g., of quantization —i.e., precision reduction— of the acoustic model parameters. Research like [108], [109] has demonstrated that it is possible to (closely) achieve the accuracy provided by full-precision acoustic models while drastically decreasing memory footprint by means of 4-bit quantization of the model's weights. The same philosophy can be applied to speech features. Emerging research [69] studies two kinds of low-precision speech representations: a linearly-quantized log-Mel spectrogram and the power variation over time, derived from the log-Mel spectrogram and represented by only 2 bits. Experimental results show that using 8-bit log-Mel spectra yields the same KWS accuracy as employing full-precision MFCCs. Furthermore, KWS performance degradation is insignificant when exploiting 2-bit precision speech features. As the authors of [69] state, this fact might indicate that much of the spectral information is superfluous when attempting to spot a set of keywords. In [82], we independently arrived at the same finding. In conclusion, there appears to be large room for future work on the design of new extremely light and compact (from a computational point of view) speech features for small-footprint KWS (see also the next subsection).
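To illustrate the idea of low-precision features, the following toy sketch (our own simplified example, not the exact scheme of [69]) linearly quantizes a log-Mel spectrogram to a configurable number of bits, e.g., 8 or 2.

```python
import numpy as np

def linear_quantize(log_mel, n_bits):
    """Uniformly quantize features to 2**n_bits levels over their observed range."""
    levels = 2 ** n_bits
    lo, hi = log_mel.min(), log_mel.max()
    step = (hi - lo) / (levels - 1)
    codes = np.round((log_mel - lo) / step).astype(np.uint8)   # assumes n_bits <= 8
    dequantized = lo + codes * step        # what the acoustic model would actually consume
    return codes, dequantized

rng = np.random.default_rng(0)
log_mel = rng.normal(size=(40, 100))       # stand-in for a 40-channel log-Mel spectrogram
codes8, feats8 = linear_quantize(log_mel, n_bits=8)
codes2, feats2 = linear_quantize(log_mel, n_bits=2)   # only 4 levels per coefficient
```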
D. LEARNABLE FILTERBANK FEATURES

The development of end-to-end deep learning systems in which feature extraction is optimized in line with the task and training criterion is a recent trend (e.g., [110], [111]). This approach aspires to become an alternative to the use of well-established handcrafted features like log-Mel features and MFCCs, which are preferred for many speech-related tasks, including deep KWS (see Subsection III-A).

Optimal filterbank learning is part of such an end-to-end training strategy, and it has been explored for deep KWS in [70], [82]. In this context, filterbank parameters are tuned towards optimizing word posterior generation. Particularly, in [70], the acoustic model parameters are optimized jointly with the cut-off frequencies of a filterbank based on sinc-convolutions (SincConv) [112]. Similarly, in [82], we studied two filterbank learning approaches: one consisting of filterbank matrix learning in the power spectral domain and another based on parameter learning of a psychoacoustically-motivated gammachirp filterbank [113]. While the use of SincConv is not compared with using handcrafted speech features in [70], in [82], we found no statistically significant
KWS accuracy differences between employing a learned filterbank and log-Mel features. This finding is in line with research on filterbank learning for ASR, e.g., [114]–[116]. In [82], it is hypothesized that such a finding might be an indication of information redundancy⁴. As suggested in Subsection III-C, this should encourage research on extremely light and compact speech features for small-footprint KWS. In conclusion, handcrafted speech features currently provide state-of-the-art KWS performance, while optimal feature learning requires further research to become the preferred alternative.

⁴With a sufficiently powerful DNN acoustic model, the actual input feature representation is of less importance (as long as it represents the relevant information about the input signal).

E. OTHER SPEECH FEATURES

A small number of works has explored the use of alternative speech features with a relatively low computational impact. For example, [47] introduced the so-called multi-frame shifted time similarity (MFSTS). MFSTS are time-domain features consisting of a two-dimensional speech representation comprised of constrained-lag autocorrelation values. Despite their computational simplicity, which can make them attractive for low-power KWS applications, features like MFCCs provide much better KWS accuracy [47].

A more interesting approach is that examined by [117], [118], which fuses two different KWS paradigms: DTW and deep KWS. First, a DTW warping matrix measuring the similarity between an input speech utterance and the keyword template is calculated. From the deep KWS perspective, this matrix can be understood as speech features that are input to a deep learning binary (i.e., keyword/non-keyword) classifier playing the role of an "acoustic model". This hybrid approach brings the best of both worlds: 1) the powerful modeling capabilities of deep KWS, and 2) the flexibility of DTW KWS to deal with both open-vocabulary and language-independent scenarios. In spite of its potential, further research on this methodology is needed, since, e.g., it is prone to overfitting [118].

IV. ACOUSTIC MODELING

This section is devoted to reviewing the core of deep spoken KWS systems: the acoustic model. The natural trend is the design of increasingly accurate models of decreasing computational complexity. In approximate chronological order, Subsections IV-A, IV-B and IV-C review advances in acoustic modeling based on fully-connected feedforward networks, convolutional networks, and recurrent and time-delay neural networks, respectively. Finally, Subsection IV-D is dedicated to how these acoustic models are trained.

A. FULLY-CONNECTED FEEDFORWARD NEURAL NETWORKS

Deep spoken KWS made its debut in 2014 [22] employing acoustic modeling based on the most widespread type of neural architecture at the time: the fully-connected feedforward neural network (FFNN). A simple stack of three fully-connected hidden layers with 128 neurons each and rectified linear unit (ReLU) activations, followed by a softmax output layer, greatly outperformed, with fewer parameters, a (then) state-of-the-art keyword/filler HMM system in both clean and noisy acoustic conditions. However, since the constant goal is the design of more accurate/robust and computationally lighter acoustic models, the use of fully-connected FFNNs was quickly relegated to a secondary level. Nowadays, state-of-the-art acoustic models use convolutional and recurrent neural networks (see Subsections IV-B and IV-C), since they can provide better performance with fewer parameters, e.g., [9], [28]. Even so, standard FFNN acoustic models and variants of them⁵ are considered in recent literature for either comparison purposes or studying different aspects of KWS such as training loss functions, e.g., [9], [17], [42], [56].

⁵For example, [32] evaluates an FFNN acoustic model integrating an intermediate pooling layer, which yields improved KWS accuracy in comparison with a standard FFNN using a similar number of parameters.
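As a reference point, the sketch below shows an acoustic model in the spirit of the FFNN just described (three fully-connected hidden layers of 128 ReLU units followed by a softmax output); the input dimensionality and the number of classes are illustrative assumptions, not the exact configuration of [22].

```python
import torch.nn as nn

class FFNNAcousticModel(nn.Module):
    """Fully-connected feedforward acoustic model: stacked context frames -> class scores."""
    def __init__(self, input_dim=40 * 41, num_classes=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),   # softmax applied at inference time
        )

    def forward(self, x):                     # x: (batch, K, P + F + 1)
        return self.net(x)                    # logits over keyword/filler classes
```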
Closely related and computationally cheaper alternatives to fully-connected FFNNs are the single value decomposition filter (SVDF) [31], [71], [119] and spiking neural networks [41], [53], [120]. Proposed in [119] to approximate fully-connected layers by low-rank approximations, SVDF managed to reduce the FFNN acoustic model size of the first deep KWS system [22] by 75% with no drop in performance. A similar idea was explored in [121], where a high degree of acoustic model compression is accomplished by means of low-rank weight matrices. The other side of the same coin is that modeling power can be enhanced by increasing the number of neurons while keeping the original number of multiplications fixed [121]. In this way, the performance of the first deep KWS system [22] was improved without substantially altering the computational resource usage of the algorithm. Higuchi et al. [59] have shown that an SVDF neural network is a special case of a stacked one-dimensional convolutional neural network (CNN), so the former can be easily implemented as the latter.

On the other hand, spiking neural networks (SNNs) are human brain-inspired neural networks that, in contrast to artificial neural networks (ANNs), process the information in an event-driven manner, which greatly alleviates the computational load when such information is sparse, as in KWS [41], [53], [120]. To make them work, in the first place, real-valued input data like speech features have to be transformed into a sequence of spikes encoding real values in either its frequency (spike rate) or the relative time between spikes. Then, spikes propagate throughout the SNN to eventually fire the corresponding output neurons, which represent word classes in KWS [41]. SNNs can yield KWS performance similar to that of equivalent ANNs while providing a computational cost reduction above 80% [41] and an energy saving of dozens of times [53]. Apart from having been applied to fully-connected FFNNs for KWS
[41], [53], the SNN paradigm has also been recently applied to CNN acoustic modeling [53], which is reviewed in the next subsection.

FIGURE 6. Example of shortcut connections linking non-consecutive layers (layers l − 1, l and l + 1) in residual learning models.

B. CONVOLUTIONAL NEURAL NETWORKS

Moving from fully-connected FFNN to CNN acoustic modeling was a natural step taken back in 2015 [28]. Thanks to exploiting local speech time-frequency correlations, CNNs are able to outperform, with fewer parameters, fully-connected FFNNs for acoustic modeling in deep KWS [28], [32], [72], [86], [96], [117], [122]–[125]. One of the attractive features of CNNs is that the number of multiplications of the model can be easily limited to meet computational constraints by adjusting different hyperparameters like, e.g., filter striding, and kernel and pooling sizes. Moreover, this may be done without necessarily sacrificing much performance [28].

Residual learning, proposed by He et al. [126] for image recognition, is widely used to implement state-of-the-art acoustic models for deep KWS [30], [32], [50]–[52], [57], [67], [69], [78]. In short, residual learning models are constructed by introducing a series of shortcut connections linking non-consecutive layers (as exemplified by Figure 6), which helps to better train very deep CNN models. To the best of our knowledge, Tang and Lin [30] were the first authors to explore deep residual learning for deep KWS. They also integrated dilated convolutions increasing the network's receptive field in order to capture longer time-frequency patterns⁶ without increasing the number of parameters, as also done by a number of subsequent deep KWS systems, e.g., [47], [51], [78]. In this way, Tang and Lin greatly outperformed, with fewer parameters, standard CNNs [28] in terms of KWS performance, establishing a new state of the art back in 2018. Their powerful deep residual architecture, so-called res15, has been employed to carry out different KWS studies in areas like robustness for hearing assistive devices [128], [129], filterbank learning [82], and robustness to acoustic noise [130], among others.

⁶In [49], the authors achieve this same effect by means of graph convolutional networks [127].

Largely motivated by this success, later work further explored the use of deep residual learning. For example, [67] uses a variant of DenseNet [131], which can be interpreted as an extreme case of residual network comprising a hive of skip connections and requiring fewer parameters. The use of an acoustic model inspired by WaveNet [132], involving both skip connections and gated activation units, is evaluated in [78]. Choi et al. [50] proposed utilizing one-dimensional convolutions along the time axis (temporal convolutions) while treating the (MFCC) features as input channels within a deep residual learning framework (TC-ResNet). This approach could help to overcome the challenge of simultaneously capturing both high and low frequency features by means of not very deep networks —although we think that this can also be accomplished, to a great extent, by two-dimensional dilated convolutions increasing the network's receptive field—. The proposed temporal convolution yields a significant reduction of the computational burden with respect to a two-dimensional convolution with the same number of parameters. As a result, TC-ResNet matches Tang and Lin's [30] KWS performance while dramatically decreasing both latency and the number of floating-point operations per second on a mobile device [50]. In [32], where an interesting deep KWS system comparison is presented, TC-ResNet, exhibiting one of the smallest latencies and model sizes, is top-ranked in terms of KWS performance, outperforming competitive acoustic models based on standard CNNs, convolutional recurrent neural networks (CRNNs) [75], and RNNs with an attention mechanism [133] (see also the next subsection), among others. Furthermore, very recently, Zhou et al. [134] adopted a technique called AdderNet [135] to replace multiplications by additions in TC-ResNet, thereby drastically reducing its power consumption while maintaining competitive accuracy.

Another appealing way to reduce the computation and size of standard CNNs is by depthwise separable convolutions [136]. They work by factorizing a standard convolution into a depthwise convolution and a pointwise (1×1) convolution combining the outputs from the depthwise one to generate new feature maps [136]. Depthwise separable CNNs (DS-CNNs) are a good choice to implement well-performing acoustic models in embedded systems [43], [45]. For example, the authors of [70] are able to reproduce the outstanding performance of TC-ResNet [50] using fewer parameters thanks to exploiting depthwise separable convolutions. Furthermore, the combination of depthwise separable convolutions with residual learning has recently been explored for deep KWS acoustic modeling [51], [52], [57], [100], generally outperforming standard residual networks [30], plain DS-CNNs and TC-ResNet with less computational complexity.

Upon this review, we believe that a modern CNN-based acoustic model should ideally encompass the following three aspects (see also the sketch after this list):

1) A mechanism to exploit long time-frequency dependencies like, e.g., the use of temporal convolutions [50] or dilated convolutions.
2) Depthwise separable convolutions [136] to substantially reduce both the memory footprint and computation of the model without sacrificing performance.
3) Residual connections [126] to quickly and effectively train deeper models providing enhanced KWS performance.
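The following sketch (our own illustrative PyTorch code, not an architecture from any cited work) combines the three ingredients above in a single residual block: a depthwise temporal convolution with dilation for a large receptive field, a pointwise convolution, and a shortcut connection.

```python
import torch
import torch.nn as nn

class DSResidualBlock(nn.Module):
    """Depthwise-separable, dilated 1-D convolutional block with a residual shortcut."""
    def __init__(self, channels, kernel_size=9, dilation=2):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, time), e.g., MFCCs as channels
        residual = x
        out = self.relu(self.bn1(self.depthwise(x)))
        out = self.bn2(self.pointwise(out))
        return self.relu(out + residual)  # shortcut connection (cf. Figure 6)

block = DSResidualBlock(channels=40)
out = block(torch.randn(8, 40, 101))      # output has the same shape as the input
```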

C. RECURRENT AND TIME-DELAY NEURAL NETWORKS

Speech is a temporal sequence with strong time dependencies. Therefore, the utilization of RNNs for acoustic modeling —and also of time-delay neural networks (TDNNs), which are shaped by a set of layers operating on different time scales— naturally arises. For example, LSTM neural networks [137], which overcome the exploding and vanishing gradient problems suffered by standard RNNs, are used for KWS acoustic modeling in, e.g., [4], [29], [76], [78], [84], clearly outperforming FFNNs [29]. When latency is not a strong constraint, bidirectional LSTMs (BiLSTMs) can be used instead to capture both causal and anticausal dependencies for improved KWS performance [76], [138]. Alternatively, bidirectional GRUs are explored in [32] for KWS acoustic modeling. When there is no need to model very long time dependencies, as is the case in KWS, GRUs might be preferred over LSTMs since the former demand less memory and are faster to train while performing similarly or even better [93].

Besides, [58] studies a two-stage TDNN consisting of an LVCSR acoustic model followed by a keyword classifier. The authors of [58] also investigate the integration of frame skipping and caching to decrease computation, thereby outperforming classical CNN acoustic modeling [28] while halving the number of multiplications.

As we already suggested in Subsection IV-B, CNNs might have difficulties modeling long time dependencies. To overcome this point, they can be combined with RNNs to build so-called CRNNs. Thus, it may be stated that CRNNs bring the best of two worlds: first, convolutional layers model local spectro-temporal correlations of speech and, then, recurrent layers follow suit by modeling long-term time dependencies in the speech signal. Some works explore the use of CRNNs for acoustic modeling in deep spoken KWS using either unidirectional or bidirectional LSTMs or GRUs [32], [48], [76], [93], [109], [118]. Generally, the use of CRNNs allows for outperforming standalone CNNs and RNNs [48].

1) Connectionist Temporal Classification

As for the majority of acoustic models, the above-reviewed RNN acoustic models are typically trained to produce frame-level posterior probabilities. At training time, in the case of employing, e.g., cross-entropy loss, frame-level annotated data are required, which may be cumbersome to get. In the context of RNN acoustic modeling, connectionist temporal classification (CTC) [63] is an attractive alternative letting the model locate and align the phonetic unit labels at training time in an unsupervised manner [4]. In other words, frame-level alignments of the target label sequences are not required for training.

Mathematically speaking, let C = (c_0, ..., c_{m−1}) be the sequence of phonetic units or, e.g., characters corresponding to the sequence of feature vectors X = (x_0, ..., x_{T−1}), where m < T and the accurate alignment between C and X is unknown. CTC is an alignment-free algorithm whose goal is to maximize [63]

P(C \mid X) = \sum_{A \in \mathcal{A}_{X,C}} \prod_{t=0}^{T-1} P_t\big(c \mid x_0, \ldots, x_t\big), \qquad (5)

where c is the whole set of recognizable phonetic units or characters plus a blank symbol (modeling confusion information of the speech signal [4]), and the summation is performed over the set of all valid alignments A_{X,C}. From Eq. (5), the acoustic model outputs can be understood as the probability distribution over all the possible label sequences given the sequence of input features X [46].

The very first attempt to apply CTC to KWS was carried out by Fernández et al. [46] using a BiLSTM for acoustic modeling. At training time, this system just needs, along with the training speech signals, the list of training words in order of occurrence. After this first attempt, several works have explored variants of this approach using different RNN architectures like LSTMs [4], [60], [61], [139], BiLSTMs [84], [98] and GRUs [61], [140], as well as considering different phonetic units such as phonemes [60], [84] and Mandarin syllables [8], [139]. In general, these systems are shown to be superior to both LVCSR- and keyword/filler HMM-based KWS systems with little or no additional computational cost [4], [8], [139]. Notice that since CTC requires searching for the keyword phonetic unit sequence on a lattice, this approach is also suitable for open-vocabulary KWS.
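As a minimal illustration of CTC training for a subword-level KWS acoustic model, the sketch below uses PyTorch's built-in torch.nn.CTCLoss; the RNN, the phone inventory and all dimensions are hypothetical stand-ins rather than a configuration taken from the cited works.

```python
import torch
import torch.nn as nn

num_phones = 40                             # hypothetical phone set; index 0 is the CTC blank
rnn = nn.LSTM(input_size=40, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, num_phones + 1)
ctc_loss = nn.CTCLoss(blank=0)

feats = torch.randn(8, 100, 40)             # (batch, frames, feature dimension)
targets = torch.randint(1, num_phones + 1, (8, 12))    # unaligned phone label sequences
input_lengths = torch.full((8,), 100, dtype=torch.long)
target_lengths = torch.full((8,), 12, dtype=torch.long)

hidden, _ = rnn(feats)                                  # frame-level hidden states
log_probs = classifier(hidden).log_softmax(dim=-1)      # (batch, frames, num_phones + 1)
# CTCLoss expects (frames, batch, classes); no frame-level alignment is needed.
loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
loss.backward()
```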
2) Sequence-to-Sequence Models

CTC assumes conditional label independence, i.e., past model outputs do not influence current predictions (see Eq. (5)). Hence, in the context of KWS and ASR in general, CTC may need an external language model to perform well. Therefore, a more convenient approach for KWS acoustic modeling might be the use of sequence-to-sequence (Seq2Seq) models, first proposed in [141] for language translation. Figure 7 illustrates an example of a Seq2Seq model. In short, Seq2Seq models are comprised of an RNN encoder⁷ summarizing the variable-length input sequence into a fixed-dimensional vector, followed by an RNN decoder generating a variable-length output sequence conditioned on both the encoder output and past decoder predictions.

⁷In [9], Shan et al. show, for KWS, the superiority of CRNN encoders with respect to GRU ones, which, in turn, are better than LSTM encoders.

FIGURE 7. Example of a sequence-to-sequence (Seq2Seq) model: an RNN encoder reads the input vectors x_0, ..., x_{T−1} and its final hidden state h_{T−1} conditions an RNN decoder that, starting from a "<sos>" ("start of sequence") token, produces softmax outputs y^{0}, ..., y^{T−1}, each conditioned on the previous output. See the text for further details.
Besides related tasks like QbE KWS [142], Seq2Seq models such as the RNN-Transducer (RNN-T) have also been studied for deep spoken KWS [60], [62], [101], [143]. RNN-T, integrating both acoustic and language models (and predicting phonemes), is able to outperform a CTC KWS system even when the latter exploits an external phoneme N-gram language model [60].

3) The Attention Mechanism

As aforementioned, in Seq2Seq models, the encoder has to condense all the needed information into a fixed-dimensional vector regardless of the (variable) length of the input sequence, which might be challenging. The attention mechanism [144], similarly to human listening attention, might assist in this context by focusing on the speech sections that are more likely to comprise a keyword [9].

Let h_t be the hidden state of the RNN encoder of a Seq2Seq model at time step t:

h_t = \mathrm{Encoder}(x_t, h_{t-1}). \qquad (6)

Before decoding, the whole input sequence X = (x_0, ..., x_{T−1}) has to be read, since h_{T−1} is the fixed-dimensional vector summarizing the whole input sequence that is finally input to the decoder (see Figure 7). To assist the decoder, a context-relevant subset of {h_0, ..., h_{T−1}} can be attended to yield A, which is used instead of h_{T−1}:

A = \sum_{t=0}^{T-1} \alpha_t h_t, \qquad (7)

where α_t = Attend(h_t), with Attend(·) an attention function [144] and Σ_t α_t = 1.
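The sketch below shows one simple way to realize Eqs. (6) and (7): a GRU encoder produces the hidden states h_t, a small scoring network plays the role of Attend(·), and a softmax guarantees that the weights α_t sum to one. It is a generic soft-attention sketch under our own assumptions, not the exact mechanism of any cited system.

```python
import torch
import torch.nn as nn

class AttentiveEncoder(nn.Module):
    """GRU encoder with soft attention: returns the context vector A of Eq. (7)."""
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)   # Eq. (6)
        self.attend = nn.Linear(hidden, 1)                            # scoring function

    def forward(self, x):                      # x: (batch, T, feat_dim)
        h, _ = self.encoder(x)                 # h: (batch, T, hidden), all hidden states h_t
        scores = self.attend(h).squeeze(-1)    # (batch, T)
        alpha = torch.softmax(scores, dim=-1)  # attention weights, sum to 1 over time
        A = torch.sum(alpha.unsqueeze(-1) * h, dim=1)   # weighted sum of h_t -> (batch, hidden)
        return A, alpha

model = AttentiveEncoder()
A, alpha = model(torch.randn(8, 100, 40))      # context vector A fed to a decoder/classifier
```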
The integration of an attention mechanism (including a variant called multi-head attention [144]) in (primarily) Seq2Seq acoustic models in order to focus on the keyword(s) of interest has successfully been accomplished by a number of works, e.g., [26], [32], [60], [68], [133], [143], [145]. These works find that incorporating attention provides KWS performance gains with respect to counterpart Seq2Seq models without attention.

Lastly, let us notice that attention has also been studied in conjunction with TDNNs for KWS [12], [16]. Particularly, in [16], thanks to exploiting shared-weight self-attention, Bai et al. reproduce the performance of the deep residual learning model res15 of Tang and Lin [30] by using 20 times fewer parameters, i.e., only around 12k parameters.

D. ACOUSTIC MODEL TRAINING

Once the acoustic model architecture has been designed (see the previous subsections) or optimally "searched" [95], [146], it is time to discriminatively estimate its parameters according to an optimization criterion —defined by a loss function— by means of backpropagation [147] and using labeled/annotated speech data (see Section VIII in the latter respect).

1) Loss Functions

Apart from CTC [63], which has been examined in the previous subsection, cross-entropy loss [73], [74] is, by far, the most popular loss function for training deep spoken KWS acoustic models. For example, cross-entropy loss L_CE is considered by [12], [16], [22], [29]–[32], [42], [43], [76], [93], [121], [123], [124], and, retaking the notation of Section II, can be expressed as

\mathcal{L}_{\mathrm{CE}} = -\sum_i \sum_{n=1}^{N} l_n^{\{i\}} \log y_n^{\{i\}}, \qquad (8)

where l_n^{i} is the binary true (training) label corresponding to the input feature segment X^{i}. Notice that when the acoustic model is intended to produce subword-level posteriors, training labels are commonly generated by forced alignment using an LVCSR system [22], [31], [42], which will condition the subsequent KWS system performance.

First proposed in [148], max-pooling loss is an alternative to cross-entropy loss that has also been studied for KWS purposes [29], [39], [71]. In the context of KWS, the goal of max-pooling loss is to teach the acoustic model to only trigger at the highest-confidence time near the end of the keyword [29]. Let L̂ be the set of all the indices of the input feature segments in a minibatch belonging to any non-keyword class. In addition, let y_p^⋆ be the largest target posterior corresponding to the p-th keyword sample in the minibatch, where p = 1, ..., P and P is the total number of keyword samples in the minibatch. Then, max-pooling loss can be expressed as

\mathcal{L}_{\mathrm{MP}} = -\sum_{i \in \hat{L}} \sum_{n=1}^{N} l_n^{\{i\}} \log y_n^{\{i\}} - \sum_{p=1}^{P} \log y_p^{\star}. \qquad (9)

From Eq. (9), we can see that max-pooling loss is cross-entropy loss for any non-keyword class (left summand) while, for each keyword sample, the error is backpropagated for a single input feature segment only (right summand). Max-pooling loss has proven to outperform cross-entropy loss in terms of KWS performance, especially when the acoustic model is initialized by cross-entropy loss training [29]. Weakly-constrained and smoothed max-pooling loss variants are proposed in [39] and [71], respectively, which benefit from lowering the dependence on the accuracy of LVCSR forced alignment.
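A compact way to see the difference between Eqs. (8) and (9) is the sketch below, which computes both losses from a batch of posteriors. The batch layout, the keyword/non-keyword bookkeeping and all tensor names are illustrative assumptions of ours, not code from the cited works.

```python
import torch

def cross_entropy_loss(posteriors, labels):
    """Eq. (8): summed negative log-posterior of the true class of every segment."""
    return -torch.log(posteriors[torch.arange(len(labels)), labels] + 1e-12).sum()

def max_pooling_loss(posteriors, labels, sample_ids, keyword_classes):
    """Eq. (9): cross-entropy over non-keyword segments plus, for each keyword
    sample (utterance), the log of its single highest target posterior."""
    is_kw = torch.tensor([l.item() in keyword_classes for l in labels])
    loss = cross_entropy_loss(posteriors[~is_kw], labels[~is_kw])
    for sid in sample_ids[is_kw].unique():
        seg = (sample_ids == sid) & is_kw
        target_class = labels[seg][0]          # segments of one sample share the keyword label
        loss = loss - torch.log(posteriors[seg, target_class].max() + 1e-12)
    return loss

posteriors = torch.softmax(torch.randn(16, 4), dim=-1)   # 16 segments, N = 4 classes
labels = torch.randint(0, 4, (16,))                       # say classes 1 and 2 are keywords
sample_ids = torch.arange(16) // 4                        # 4 consecutive segments per sample
loss = max_pooling_loss(posteriors, labels, sample_ids, keyword_classes={1, 2})
```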
2) Optimization Paradigms

In deep KWS, the most frequently used optimizers are stochastic gradient descent (SGD) [149] (normally with momentum), e.g., see [30], [31], [49]–[51], [53], [76], [91], [100], [121], [138], [143], [146], and Adam [150], e.g., see [9], [12], [16], [26], [32], [42], [43], [47], [48], [68],
[70], [78], [90], [98], [151]. It is also common practice to implement a mechanism shrinking the learning rate over epochs [9], [12], [16], [22], [29], [43], [48], [49], [51], [53], [68], [70], [76], [121], [152]. Furthermore, many deep KWS works, e.g., [9], [49]–[51], [90], [100], [143], deploy a form of parameter regularization like weight decay and dropout. While random acoustic model parameter initialization is the normal approach, initialization based on transfer learning from LVCSR acoustic models has proven to lead to better KWS models by, e.g., alleviating overfitting [22], [58], [101].

V. POSTERIOR HANDLING

In order to come up with a final decision about the presence or not of a keyword in an audio stream, the sequence of posteriors yielded by the acoustic model, y^{i}, needs to be processed. We differentiate between two main posterior handling modes: non-streaming (static) and streaming (dynamic) modes.

A. NON-STREAMING MODE

Non-streaming mode refers to standard multi-class classification of independent input segments comprising a single word each (i.e., isolated word classification). To cover the duration of an entire word, input segments have to be long enough, e.g., around 1 second long [153], [154]. In this mode, commonly, a given input segment X^{i} is assigned to the class with the highest posterior probability as in Eq. (4). This approach is preferred over picking classes yielding posteriors above a sensitivity (decision) threshold to be set, since experience tells [82], [128]–[130] that non-streaming deep KWS systems tend to produce very peaked posterior distributions. This might be attributed to the fact that non-streaming systems do not have to deal with inter-class transition data as in the dynamic case (see the next subsection), but with well-defined, isolated class realizations.

As mentioned in Section II, KWS is not a static task but a dynamic one, which means that a KWS system has to continuously process an input audio stream. Therefore, it is obvious that the non-streaming mode lacks some realism from a practical point of view. Despite this, isolated word classification is considered by a number of deep KWS works, e.g., [16], [30], [32], [48]–[52], [58], [69], [82], [89], [99], [109], [125], [128]–[130]. We believe that this is because of the simpler experimental framework with respect to that of the dynamic or streaming case. Fortunately, non-streaming performance and streaming performance seem to be highly correlated [129], [130], which makes non-streaming KWS research more relevant than it might look at first sight.

B. STREAMING MODE

Streaming mode alludes to the continuous processing (normally in real time) of an input audio stream in which keywords are not isolated/segmented. Hence, in this mode, any given segment may or may not contain (parts of) a keyword. In this case, the acoustic model yields a time sequence of (raw) posteriors (..., y^{i−1}, y^{i}, y^{i+1}, ...) with strong local correlations. Due to this, the sequence of raw posteriors, which is inherently noisy, is typically smoothed over time —e.g., by moving average— on a class basis [15], [22], [29], [42], [43], [45], [56], [58], [72], [76], [77] before further processing.

Let us denote by ȳ^{i} the smoothed version of the raw posteriors y^{i}. Furthermore, let us assume that each of the N classes of a deep KWS system represents a whole word (which is a common case). Then, the smoothed word posteriors ȳ^{i} are often directly used to determine the presence or not of a keyword, either by comparing them with a sensitivity threshold⁸ [29], [43], [58] or by picking, within a time sliding window, the class with the highest posterior [76]. Notice that since consecutive input segments (..., X^{i−1}, X^{i}, X^{i+1}, ...) may cover fragments of the same keyword realization, false alarms may occur as a result of recognizing the same keyword realization multiple times from the smoothed posterior sequence (..., ȳ^{i−1}, ȳ^{i}, ȳ^{i+1}, ...). To prevent this problem, a simple, yet effective mechanism consists of forcing the KWS system not to trigger for a short period of time right after a keyword has been spotted [29], [43].

⁸This decision threshold might be set by optimizing, on a development set, some kind of figure of merit (see also Section IX on evaluation metrics).

Differently from the above case, let us now consider the two following scenarios:
1) Each of the N classes still represents a whole word but keywords are composed of multiple words (e.g., "OK Google").
2) Each of the N classes represents a subword unit (e.g., a syllable) instead of a whole word.

To tackle such scenarios, the first deep spoken KWS system [22] proposed a simple method processing the smoothed posteriors ȳ^{i} in order to produce a keyword presence decision. Let us assume that the first class C_1 corresponds to the non-keyword class and that the remaining N − 1 classes represent subunits of a single keyword⁹. Then, a time sequence of confidence scores S_c^{i} can be computed as [22]

S_c^{\{i\}} = \sqrt[N-1]{\prod_{n=2}^{N} \max_{h_{\max}(i) \le k \le i} \bar{y}_n^{\{k\}}}, \qquad (10)

where h_max(i) indicates the onset of the time sliding window. A keyword is detected every time S_c^{i} exceeds a sensitivity threshold to be tuned. This approach has been widely used in the deep KWS literature, e.g., [45], [56], [77].

⁹This method can easily be extended to deal with more than one keyword [22].
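The streaming logic described above can be summarized in a few lines. The sketch below (an illustrative implementation under our own assumptions about window sizes and the threshold value) applies moving-average smoothing to the raw posterior sequence and computes the confidence score of Eq. (10) over a sliding window.

```python
import numpy as np

def smooth_posteriors(y, w_smooth=10):
    """Per-class moving average of the raw posterior sequence y (num_segments x N)."""
    y_bar = np.zeros_like(y)
    for i in range(len(y)):
        start = max(0, i - w_smooth + 1)
        y_bar[i] = y[start:i + 1].mean(axis=0)
    return y_bar

def confidence_scores(y_bar, w_max=30):
    """Eq. (10): geometric mean, over the keyword subunits (classes 1..N-1; class 0 is the
    filler class), of their maximum smoothed posterior within a sliding window."""
    num_segments, N = y_bar.shape
    scores = np.zeros(num_segments)
    for i in range(num_segments):
        h = max(0, i - w_max + 1)                      # window onset h_max(i)
        peaks = y_bar[h:i + 1, 1:].max(axis=0)         # per-subunit maximum over the window
        scores[i] = peaks.prod() ** (1.0 / (N - 1))    # (N-1)-th root of the product
    return scores

y = np.random.dirichlet(np.ones(4), size=200)          # toy raw posteriors, N = 4
scores = confidence_scores(smooth_posteriors(y))
detections = scores > 0.8                               # sensitivity threshold to be tuned
```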
In [15], Eq. (10) is subject to the constraint that the keyword subunits trigger in the correct order of occurrence within the keyword, which contributes to decreasing false alarms. This improved version of the above posterior handling method is also considered by a number of deep KWS systems, e.g., [42], [72].

When each of the N classes of a deep KWS system represents a subword unit like a syllable or context-independent
phoneme, a searchable lattice may be built from the time sequence of posteriors y^{i}. Actually, this is typically done in the context of CTC [4], [8]. Then, the goal is to find, from the lattice, the most similar subword unit sequence to that of the target keyword. If the score resulting from the search on the lattice is greater than a predefined score threshold, a keyword is spotted. Notice that this approach, despite its higher complexity, provides great flexibility by, for example, allowing a user to define her/his own keywords.

VI. ROBUSTNESS IN KEYWORD SPOTTING

Normalizing the effect of acoustic variability factors such as background noise and room reverberation is paramount to assure good KWS performance in real-life conditions. This section is intended to review the scarce literature on KWS robust against, primarily but not only, background noise and far-field conditions. The motivation behind primarily dealing with these two acoustic variability factors lies in typical use cases of KWS technology¹⁰.

This section has been arranged according to a taxonomy that segregates front- and back-end methods, which reflects the available literature on KWS robustness. Let us stress that these are normally cross-cutting methods, since they either come from or can be applied to other areas like ASR.

A. FRONT-END METHODS

Front-end methods refer to those techniques that modify the speech signal before it is fed to the DNN acoustic model. In this subsection, we further differentiate among gain control for far-field conditions, DNN feature enhancement, adaptive noise cancellation and beamforming methods.

1) Gain Control for Far-Field Conditions

Keyword spotting deployment is many times conceived to facilitate real hands-free communication with devices such as smart speakers or in-vehicle systems that are located at a certain distance from the speaker. This means that communication might take place in far-field conditions, and, due to distance attenuation, background noise and reverberation can be particularly harmful.

Prabhavalkar et al. [15] were the first to propose the use of automatic gain control (AGC) [155] to provide robustness against background noise and far-field conditions for deep [...]

Using the notation of [156], E(t, f) represents (Mel) filterbank energy at time frame t and frequency bin f, and

M(t, f) = (1 - s)\,M(t-1, f) + s\,E(t, f) \qquad (11)

is a time-smoothed version of E(t, f), where 0 < s < 1 is a smoothing coefficient. Thus, PCEN is intended to replace the typical log compression of filterbank features as follows:

\mathrm{PCEN}(t, f) = \left( \frac{E(t, f)}{\left(\epsilon + M(t, f)\right)^{\alpha}} + \delta \right)^{r} - \delta^{r}, \qquad (12)

where ε prevents division by zero, α ∈ (0, 1) defines the gain normalization strength, and δ and r determine the root compression. As we can see from Eq. (12), the energy contour of E(t, f) is dynamically normalized by M(t, f) on a frequency band basis, which yields significant KWS performance gains under far-field conditions since M(t, f) mirrors the loudness profile of E(t, f) [156].

An appealing aspect of PCEN is that all its operations are differentiable. As a result, PCEN can be integrated in the DNN acoustic model in order to comfortably tune its set of parameters —i.e., s, ε, α, δ and r— towards the optimization of KWS performance during acoustic model training [156].
these are normally cross-cutting methods, since they either
come from or can be applied to other areas like ASR. 2) DNN Feature Enhancement
The powerful modeling capabilities of DNNs can also be ex-
A. FRONT-END METHODS ploited to clean the noisy speech features (usually, magnitude
spectral features) before these are input to the KWS acoustic
Front-end methods refer to those techniques that modify the
model. A variety of approaches can be followed:
speech signal before it is fed to the DNN acoustic model. In
this subsection, we further differentiate among gain control 1) Enhancement mask estimation: The aim of this ap-
for far-field conditions, DNN feature enhancement, adaptive proach is to estimate, from the noisy observation (e.g.,
noise cancellation and beamforming methods. noisy Mel spectra [157]) and using a neural network
(e.g., a CRNN [157]), a multiplicative de-noising time-
frequency mask to be applied to the noisy observation
1) Gain Control for Far-Field Conditions
[157], [158]. The result is then passed to the acoustic
Keyword spotting deployment is many times conceived to model.
facilitate real hands-free communication with devices such 2) Noise estimation: A DNN (e.g., a CNN with dilated
as smart speakers or in-vehicle systems that are located at a convolutions and residual connections [159]) might
certain distance from the speaker. This means that commu- also be used to provide an estimate of the distortion that
nication might take place in far-field conditions, and, due to contaminates the target speech signal. The estimated
distance attenuation, background noise and reverberation can distortion can then be subtracted from the noisy obser-
be particularly harmful. vation before feeding the acoustic model with it [159].
Prabhavalkar et al. [15] were the first to propose the use 3) Clean speech estimation: In this case, the DNN front-
of automatic gain control (AGC) [155] to provide robustness end directly produces an estimate of the clean speech
against background noise and far-field conditions for deep features, from the noisy observation, to be input to the
KWS. The philosophy behind AGC is based on selectively acoustic model. While this approach has been studied
amplifying the audio signal depending on whether speech is for robust ASR [160], to the best of our knowledge and
present or absent. This type of selective amplification is able surprisingly, this has not been the case for KWS.
to yield a significant reduction of miss detections in the far- 4) Filter parameter estimation: The parameters of an en-
field scenario [15]. hancement filter (e.g., a Wiener filter [158]) to be ap-
Later, a more popular [61], [93], [94], [122] and simpler plied to the noisy observation before further processing
AGC method called PCEN (Per-Channel Energy Normal- can be estimated by means of a DNN. Similarly to the
ization) [156] was proposed for KWS. Keeping the original above case, while this has been studied for robust ASR
[158], this has not been the case for KWS.
10 For example, activation of voice assistants typically takes place at home Regardless of the chosen approach, the DNN front-end and
in far-field conditions and with some TV or music background noise. the KWS acoustic model can be jointly trained following
VOLUME 4, 2016 11
I. López-Espejo et al.: Deep Spoken KWS: An Overview

Noisy signal is uncertain), the most recent ANC weights are used
+ to filter/clean the input signal and the presence of a
keyword is rechecked.
-
De-noised In a similar vein, a so-called hotword cleaner was reported
Adaptive signal in [94], which overcomes one of the shortcomings of the
Filter above ANC approach [122]: the increased latency and CPU
Noise usage derived from having to run the acoustic model twice
reference
(one to provide feedback to the de-noising front-end and
FIGURE 8. Block diagram of adaptive noise cancellation. A signal of interest another for KWS itself). The hotword cleaner [94] leverages
s(m) is retrieved from a noisy observation x(m) = s(m) + v(m) by
subtracting an estimate of v(m), v̂(m). This estimate is obtained by filtering a
the following two characteristics of the KWS scenario to
noise reference v 0 (m) that originates from the same noise source as v(m) deploy a simple, yet effective de-noising method: 1) there is
(i.e., v(m) and v 0 (m) are highly correlated). The filter weights are typically no speech just before a keyword, and 2) keywords
continuously adapted to typically minimize the power of the estimated signal of
interest ŝ(m). are of short duration. Bearing these two characteristics in
mind, the hotword cleaner [94] simply works by continuously
computing fast-RLS ANC weights that are stored and applied
a multi-task learning scheme to account the complemen- to the input signal with a certain delay to clean and not
tary objectives of the front-end and the acoustic model. By damage the keyword. This methodology was generalized to
making the DNN front-end aware of the global keyword an arbitrary number of microphones in [162]. Overall, all
detection goal [157], [159], superior KWS performance can of these ANC-based methods bring significant KWS per-
be achieved in comparison with independent training of the formance improvements in everyday noisy conditions that
two components. include strong TV and music background noise.
One conclusion is that, oddly, DNN feature enhancement
is a rather unexplored area in the context of KWS. This 4) Beamforming
contrasts with the case of robust ASR, which has widely and Spatial filtering, also known as beamforming, enables the
successfully studied the application of this type of de-noising exploitation of spatial cues in addition to time and frequency
front-ends [158], [160]. Immediate future work on robust information to boost speech enhancement quality [166]. Sim-
KWS could address this imbalance, especially by exploring ilarly to the aforementioned case with DNN feature enhance-
promising time domain solutions that can benefit from phase ment, KWS lags several steps behind ASR regarding the
information [161]. integration of beamforming as done, e.g., in [167].
To the best of our knowledge, [97] is the first research
3) Adaptive Noise Cancellation studying beamforming for deep KWS. In particular, [97]
Presumably thinking of voice assistant use cases, Google applies four fixed beamformers that are arranged to uniformly
developed a series of noise-robust KWS methods based on sample the horizontal plane. The acoustic model is then fed
dual-microphone adaptive noise cancellation (ANC) to par- with the four resulting beamformed signals plus a reference
ticularly deal with speech interference [94], [122], [162]. The signal picked from one of the array microphones to avoid de-
working principle of ANC is outlined in Figure 8. The reason grading performance at higher signal-to-noise ratios (SNRs)
for accounting a dual-microphone scenario is that Google’s [97]. The acoustic model incorporates an attention mecha-
smart speaker Google Home has two microphones [163]. It is nism [144] to weigh the five input signals, which can be
interesting to point out that the authors of this series of ANC thought as a steering mechanism pointing the effective beam
works also tried to apply beamforming and multi-channel towards the target speaker. Actually, the motivation behind
Wiener filtering11 , but they only found marginal performance using fixed beamformers lies in the difficulty of estimating
gains by doing so [94]. the target direction in noisy conditions. However, notice that
In [122], Google researchers proposed a de-noising front- the attention mechanism implicitly estimates it.
end inspired by the human auditory system. In short, the de- The same authors of [97] went farther in [96] by replacing
noising front-end works by exploiting posterior probability the set of fixed beamformers by a set of data-dependent,
feedback from the KWS acoustic model: multiplicative spectral masks playing an equivalent role. The
latter masks, which are estimated by a neural network, can
1) If the acoustic model finds that voice is absent, the be interpreted as semi-fixed beamformers. This is because
weights of a recursive least squares (RLS) ANC filter though they are data-dependent, mask look directions (equiv-
working in the short-time Fourier transform (STFT) alent to look directions of beamformers) are still fixed. This
domain are updated; beamforming front-end, which is trained jointly with the
2) If the posterior probabilities computed by the acoustic acoustic model, outperforms the previous fixed beamforming
model are inconclusive (i.e., the presence of a keyword approach, especially at lower signal-to-interference ratios
11 Notice that multi-channel Wiener filtering is equivalent to minimum (SIRs) [96].
variance distortionless response (MVDR) beamforming followed by single- There is still a long way to go regarding the application
channel Wiener post-filtering [164], [165]. of beamforming to deep KWS. More specifically, despite
12 VOLUME 4, 2016
I. López-Espejo et al.: Deep Spoken KWS: An Overview

the aforementioned steering role of the attention mechanism, 3) Robustness to Keyword Data Scarcity
we believe that deep beamforming that does not pre-arrange To effectively train a KWS acoustic model, a sufficient
the look direction but estimates it continuously based on amount of speech data is required. This normally includes
microphone signals —as in, e.g., [168]— is worth to explore. a substantial number of examples of the specific keyword(s)
to be recognized. However, there is a number of possible rea-
B. BACK-END METHODS sons for which we might suffer from keyword data scarcity.
Back-end methods refer to techniques applied within the Certainly, collecting additional keyword samples can help to
acoustic model to primarily improve its generalization ability overcome the problem. Nevertheless, speech data collection
to a variety of acoustic conditions. The rest of this subsection can be costly and time-consuming, and is often infeasible. In-
is devoted to discuss the following matters: multi-style and stead, a smart way to obtain additional keyword samples for
adversarial training, robustness to keyword data scarcity, the model training is by synthetically generating them through
class-imbalance problem and other back-end methods. text-to-speech technology. This type of data augmentation
has proven to be highly effective by significantly improving
1) Multi-Style Training KWS performance in low-resource keyword settings [62],
One of the most popular and effective back-end methods to, [175], [176]. In particular, in [62], it is found that it is im-
especially, deal with background noise and reverberation is portant that synthetic speech reflects a wide variety of tones
multi-style training of the KWS acoustic model (see, e.g., of voice (i.e., speaker diversity) for good KWS performance.
[15], [32], [39], [43], [50], [60], [71], [88], [90], [118], [169],
[170]). Multi-style training, which has some regularization 4) The Class-Imbalance Problem
effect preventing overfitting [118], simply consists of training The class-imbalance problem refers to the fact that, typi-
the acoustic model with speech data contaminated by a cally, many more non-keyword than keyword samples are
variety of distortions trying to better reflect what is expected available for KWS acoustic model training. Actually, the
to be found at test time. class-imbalance problem can be understood as a relative
Usually, distorted speech data are generated by contami- keyword data scarcity problem: for obvious reasons, it is
nating —e.g., by background noise addition at different SNR almost always easier to access a plethora of non-keyword
levels— clean speech data in an artificial manner (see Section than keyword samples. The issue lies in that class imbalance
VIII for practical details). This artificial distortion procedure can lead to under-training of the keyword class with respect
is known as data augmentation [171]. For instance, a series of to the non-keyword one.
data augmentation policies like time and frequency masking To reach class balance for acoustic model training, one can
is defined by a tool like SpecAugment [172]. First proposed imagine many different things that can be done based on data
for end-to-end ASR, SpecAugment has recently become a augmentation:
popular way for generating distorted speech data, also for
1) Generation of adversarial examples yielding miss de-
KWS training purposes [32], [39], [90], [100], [143], [151].
tections, e.g., through FGSM [174], to re-train the
acoustic model in a class-balanced way;
2) Adversarial Training
2) Generation of additional synthetic keyword samples by
Deep neural networks often raise the following issue: net-
means of text-to-speech technology [62], [175].
works’ outputs might not be smooth with respect to inputs
[173], e.g., because of the lack of enough training data. To the best of our knowledge, the above two data augmenta-
This might involve, for example, that a keyword correctly tion approaches have not been studied for tackling the class-
classified by the acoustic model is misclassified when a very imbalance problem.
small perturbation is added to such a keyword. This kind Differently, a series of works has proposed to essentially
of subtly distorted input to the network is what we call an focus on challenging non-keyword samples12 at training time
adversarial example. Interestingly, adversarial examples can instead of fully exploiting all the non-keyword samples avail-
be generated by means of techniques like the fast gradient able [39], [42], [177]. For instance, Liu et al. [42]
 suggested

{i}
sign method (FGSM) [174] to re-train with them a well- to weigh cross-entropy loss LCE (see Eq. (8)) by 1 − yn
trained KWS acoustic model. The goal of this is to improve to come up with focal loss LFL :
robustness by smoothing the distribution of the acoustic N  γ  
XX
model. This approach, which can be interpreted as a type of LFL = − 1 − yn{i} ln{i} log yn{i} , (13)
data augmentation, has shown to be effective to drastically i n=1
decrease false alarms and miss detections for an attention-
where γ is a tunable focusing parameter. As one can easily
based Seq2Seq acoustic model [26]. Alternatively, [45] pro-
reason, weighing cross-entropy loss as in Eq. (13) helps to
poses to replace, with the same goal, adversarial example
focus training on challenging samples. While this weighting
re-training by adversarial regularization in the loss function.
procedure is more effective than regular cross-entropy in
Wang et al. [45] demonstrate that the latter outperforms the
former under far-field and noisy conditions when using a DS- 12 A challenging non-keyword sample can be, e.g., one exhibiting similar-
CNN acoustic model for KWS. ities with the keyword in terms of phonetics.

VOLUME 4, 2016 13
I. López-Espejo et al.: Deep Spoken KWS: An Overview

class-imbalanced scenarios [42], notice that it might be able tional ones like voice-dialing, interaction with a call cen-
to strengthen the model in a wide sense. Because focal loss ter and speech retrieval to nowadays flagship application,
LFL operates on a frame basis, [177] improved it by also namely, the activation of voice assistants.
considering the time context when computing the weight In addition to the above, KWS technology could be useful,
for cross-entropy loss. Particularly, such an improvement e.g., to assist disabled people like vision-impaired pedes-
is equivalent to assigning bigger weights to those frames trians when it comes to the activation of pedestrian call
belonging to non-keyword samples yielding false alarms. buttons in crosswalks. For example, [87] proposes the use
An alternative approach —so-called regional hard- of a CRNN-based KWS system [93] for the activation of
example mining— for dealing with the class-imbalance prob- pedestrian call buttons via voice, thereby contributing to im-
lem was described in [39]. Regional hard-example mining prove accessibility in public areas to people with the above-
subsamples the available non-keyword training data to keep a mentioned disability.
certain balance between keyword and non-keyword samples. In-vehicle systems can also benefit from voice control.
Non-keyword sample mining is based on the selection of the For example, in [77], Tan et al. explore multi-source fusion
most difficult non-keyword samples in the sense that they exploiting variations of vehicle’s speed and direction for
yield the highest keyword posteriors. online sensitivity threshold selection. The authors of [77]
demonstrate that this strategy improves KWS accuracy with
5) Other Back-End Methods respect to using a fixed, predetermined sensitivity threshold
A few other methods for robustness purposes not falling into for the posteriors yielded by the DNN acoustic model.
any of the above categories can be found in the literature. Moreover, it is worth noticing that KWS is a technology
For instance, [72] extracts embeddings characterizing the that is sometimes better suited than ASR to the solution of
acoustic environment that are passed to the acoustic model certain problems where the latter is typically employed. This
to carry out KWS which is robust to far-field and noisy is the case, for instance, of by-topic audio classification and
conditions. In this way, by making the acoustic model aware audio sentiment detection [181], [182], since the accuracy of
of the acoustic environment, better keyword prediction can these tasks rather relies on being able to correctly spot a very
be achieved. focused (i.e., quite limited) vocabulary in the utterances. In
We also recently contributed to noise-robust KWS in other words, lexical evidence is sparse for such tasks.
[130], where we proposed to interpret every typical KWS Some work has explored KWS also for voice control of
acoustic model as the concatenation of a keyword embedding videogames [138], [152]. Particularly, [138] points out how
extractor followed by a linear classifier consisting of the KWS becomes an extremely difficult task when it comes
typical final fully-connected layer with softmax activation to dealing with children controlling videogames with their
for word classification (see Section II). The goal is to, first, voice due to excitement and, generally speaking, the nature
multi-style train the keyword embedding extractor by means of children and children’s voice [183]. To partially deal with
of a (CN,2 + 1)-pair loss function extending the idea behind this, the authors of [138] propose the detection of overlap-
tuple-based losses like N -pair [178] and triplet [179] losses ping keywords in the context of a multiplayer side-scroller
(the latter used both standalone [103] and combined with the game called Mole Madness. Since BiLSTMs have proven to
reversed triplet and hinge losses [56] for keyword embedding work well for children’s speech [184], a BiLSTM acoustic
learning). In comparison with these and similar losses also model with 2N output classes —where N is the number of
employed for word embedding learning (e.g., a prototypical keywords— is used to represent all possible combinations of
loss angular variant [180]), in [130], we demonstrate that the overlapping keywords. It is found that, under the videogame
(CN,2 + 1)-pair loss reaches larger inter-class and smaller conditions, modeling the large variations of children’s speech
intra-class embedding variation13 . Secondly, the final fully- time structure is challenging even for a relatively large BiL-
connected layer with softmax activation is trained by multi- STM.
style keyword embeddings employing cross-entropy loss. Other KWS applications include voice control of home
This two-stage training strategy is much more effective than automation [185], even the navigation of complex procedures
standard end-to-end multi-style training when facing unseen in the International Space Station [186], etc.
noises [130]. Moreover, another appealing feature of this
two-stage training strategy is that it increases neither the A. PERSONALIZED KEYWORD SPOTTING SYSTEMS
number of parameters nor the number of multiplications of For some of the above applications, having a certain degree
the model. of personalization in the sense that only a specific user is
allowed to utilize the KWS system can be a desirable feature.
VII. APPLICATIONS Towards this personalization goal, some research has studied
Keyword spotting technology (including deep KWS) has a the combination of KWS and speaker verification [10], [76],
number of applications, which range from the more tradi- [140], [159]. While [10], [140] employ independently trained
13 This is because the (C
deep learning models to perform both tasks, [76], [159] ad-
N,2 + 1)-pair loss constrains the way the training
samples belonging to different classes relate to each other in terms of dress, following a multi-task learning scheme, joint KWS and
embedding distance. speaker verification with contradictory conclusions, since
14 VOLUME 4, 2016
I. López-Espejo et al.: Deep Spoken KWS: An Overview

HEY
ASSISTANT!
VOLUME UP! VOLUME UP!
Client
Wake-up word Server
+ Query KWS
+
ASR
KWS
KEYWORD EXTERNAL SPEAKER
DETECTED DETECTED

(a) Legitimate user detected. (b) External speaker detected.


FIGURE 9. Typical voice assistant client-server framework. FIGURE 10. Users’ own voice/external speaker detection in the context of
voice control of hearing aids. Red and blue dots symbolize the two
microphones of a hearing aid sitting behind the ear.

KWS performance is negatively and positively affected in


[76] and [159], respectively, by the integration of speaker the context of Google’s Assistant [2], it is shown that this
verification. A reason for this could be that, unlike in [159], server-side wake-up word check drastically reduces the false
higher-level features are shared for both tasks in [76], so this alarm rate while marginally increasing the rate of miss detec-
further preservation of speaker information may contaminate tions. Notice that this server-side check could be useful for
the phonetic information required to carry out KWS. mitigating privacy issues as a result of pseudo-query audio
Personalization can be of particular interest for voice acti- leakage if it were not for the fact that the supposed wake-
vation of voice assistants [187] as well as for voice control of up word audio and query audio are inseparably streamed
hearing assistive devices like hearing aids. These two KWS to the server. Interestingly, Garg et al. [190] have recently
applications are reviewed in a bit more detail in the next proposed a streaming Transformer encoder carrying out the
subsections. double check efficiently on the client-side, which can truly
help to mitigate privacy issues.
B. VOICE ACTIVATION OF VOICE ASSISTANTS
The flagship application of (deep) KWS is the activation of C. VOICE CONTROL OF HEARING ASSISTIVE DEVICES
voice assistants like Amazon’s Alexa, Apple’s Siri, Google’s Manually operating small, body-worn devices like hearing
Assistant and Microsoft’s Cortana. Actually, without fear of aids is not always feasible or can be cumbersome. One
error, we can say that revitalization of KWS research over reason could be that hands are busy doing other activities like
the last years is owed to this application [28]. And there is cooking or driving. Another cause could be that the wearer is
a compelling reason for this: forecasts suggest that, by 2024, an elderly person with reduced fine motor skills. Whatever
the number of voice assistant units will exceed that of world’s the reason is, KWS can help to deploy voice interfaces to
population [188]. comfortably operate such a kind of devices. Furthermore,
Figure 9 illustrates the typical voice assistant client-server these devices are personal devices, so it is desirable that the
framework. The client consists of an electronic device like user is the only person who can handle them.
a smartwatch or a smart speaker integrating the client-side In the above respect, in [128], [129] we studied an alterna-
of a voice assistant and an always-on KWS system to detect tive way to speaker verification to provide robustness against
when a user wakes up the assistant by uttering a trigger external speakers (i.e., personalization) in KWS for hearing
word/phrase, e.g., “hey assistant!”. To limit the impact on aids as exemplified by Figure 10. Particularly, we extended
the battery life, the KWS system has to be necessarily light. the deep residual learning model proposed by Tang and Lin
In this vein, Apple employs a two-pass detection strategy [30] to jointly perform KWS and users’ own voice/external
[187]. By this, a very light, always-on KWS system listens speaker detection following a multi-task learning scheme.
for the corresponding wake-up word. If this is detected, a A keyword prediction is then taken into account if and
more complex and accurate KWS system —also placed on only if the multi-task network determines that the spot-
the client device— is used to double check whether or not ted keyword was uttered by the legitimate user. Thanks to
the wake-up word has been really uttered. exploiting GCC-PHAT (Generalized Cross-Correlation with
When the wake-up word is spotted on the client-side, the PHAse Transform) [191] coefficients from dual-microphone
supposed wake-up word audio and subsequent query audio hearing aids in the perceptually-motivated constant-Q trans-
are sent to a server on which, first, the presence of the wake- form [192] domain, we achieve almost flawless users’ own
up word is checked for a second or third time by using much voice/external speaker detection [129]. This is because phase
more powerful and robust LVCSR-based KWS [2], [187], difference information is extremely useful to characterize
[189]. If, finally, the LVCSR-based KWS system determines the virtually time-invariant position of the user’s mouth with
that the wake-up word is not present, the subsequent au- respect to that of the hearing aid. It is worth noting that
dio is discarded and the process is ended. Otherwise, ASR this experimental validation was carried out on a hearing aid
is applied to the supposed query audio and the result is speech database created by convolving the Google Speech
further processed —e.g., using natural language processing Commands Dataset v2 [154] with acoustic transfer functions
techniques— to provide the client device with a response. In measured in a hearing aids set-up.
VOLUME 4, 2016 15
I. López-Espejo et al.: Deep Spoken KWS: An Overview

VIII. DATASETS parison presented in Section X is carried out among KWS


Data are an essential ingredient of any machine learning systems that are evaluated on the Google Speech Commands
system for both training the parameters of the algorithm Dataset. Further information on this corpus is provided in
(primarily, in our context, the acoustic model parameters) Subsection VIII-A.
and validating it. Some well-known speech corpora that have Also from Table 1, we can observe that the great majority
been extensively used over the years in the field of ASR are of datasets are noisy, which means that speech signals are
now also being employed for the development of deep KWS distorted in different ways, e.g., by natural and realistic back-
systems. For example, LibriSpeech [196] has been used by ground acoustic noise or room acoustics. This is generally
[55], [76], [142], [151], [169], TIDIGITS [197], by [140], a must if we want to minimize the mismatch between the
TIMIT [198], by [41], [84], [117], [118], [199], and the KWS performance at the lab phase and that one observable in
Wall Street Journal (WSJ) corpus [200], by [4], [76], [103]. the inherently-noisy real-life conditions. In particular, dataset
The main problem with these speech corpora is that they acoustic conditions should be as close as possible as those
were not developed for KWS, and, therefore, they do not that we expect to find when deploying KWS systems in real-
standardize a way of utilization facilitating KWS technology life [202]. Noisy corpora can be classified as natural and/or
reproducibility and comparison. By contrast, KWS research simulated noisy speech:
work exploiting these corpora employs them in a variety of
1) Natural noisy speech: Some of the datasets in Ta-
ways, which is even reflected by, e.g., the set of considered
ble 1 (e.g., [17], [31], [45], [56], [59], [122], [152],
keywords.
[193]) were partially or totally created from natu-
In the following we focus on those datasets particularly in-
ral noisy speech recorded —many times in far-field
tended for KWS research and development, which, normally,
conditions— by electronic devices such as smart
are comprised of hundreds or thousands of different speakers
speakers, smartphones and tablets. Often, recording
who do not overlap across sets (i.e., training, development
scenarios consist of home environments with back-
and test sets), e.g., [17], [26], [56], [61], [68], [78], [93],
ground music or TV sound, since this is the target
[102], [154], [201]. Table 1 shows a wide selection of the
scenario of many KWS systems.
most significant speech corpora available for training and
2) Simulated noisy speech: Some other noisy datasets
testing deep KWS systems. From this table, the first inference
conceived for KWS —e.g., [15], [22], [28], [31], [42],
that we can draw is that the advancement of the KWS
[58], [93]— were partially or totally generated by
technology is led by the private sector of the United States
artificially distorting clean speech signals through a
of America (USA) and China. Seven and five out of the
procedure called data augmentation [171]. Typically,
seventeen different dataset developers included in Table 1
given a clean speech signal, noisy copies of it are
are, respectively, North American and Chinese corporations.
created by adding different types of background noises
Actually, except for the “Narc Ya” corpus [152], which is
(e.g., daily life noises like babble, café, car, music and
in Korean, all the datasets shown in this table are in either
street noises) in such a manner that the resulting SNR
English or Mandarin Chinese.
levels (commonly, within the range [−5, 20] dB) are
A problem with the above is that the majority of the speech
under control. Filtering and Noise-adding Tool (FaNT)
corpora of interest for KWS research and development are
[203] is a useful software to create such noisy copies.
not publicly available (P.A.), but they are for (company)
For example, FaNT was employed in [43], [130] to
internal use only. On many occasions, these datasets are col-
generate, in a controlled manner, noisier versions of the
lected by companies to improve their wake-up word detection
already noisy Google Speech Commands Dataset. Nor-
systems for voice assistants running on smart speakers. For
mally, background noises for data augmentation come
example, this is the case for the speech corpora reported
from publicly available databases like TUT [204],
in [26], [122] and [9], which were collected, respectively,
DEMAND [205], MUSAN [206], NOISEX-92 [207]
from Mobvoi’s TicKasa Fox, Google’s Google Home and
and CHiME [208], [209]. In addition, alteration of
Xiaomi’s AI Speaker smart speakers. Unfortunately, only
room acoustics, e.g., to simulate far-field conditions
seven out of twenty six datasets in Table 1 are publicly
from close-talk speech [93], is another relevant data
available: one from Sonos [169], two different arrangements
augmentation strategy.
of AISHELL-2 [194] (used in [98]), the Google Speech Com-
mands Dataset v1 [153] and v2 [154], the Hey Snapdragon Collecting a good amount of natural noisy speech data in the
Keyword Dataset [195], and Hey Snips [78], [201] (also used desired acoustic conditions is not always feasible. In such
in, e.g., [53], [177]). In case of interest in getting access cases, simulation of noisy speech is a smart and cheaper
to any of these speech corpora, the reader is pointed to the alternative allowing us for obtaining similar technology per-
corresponding references indicated in Table 1. Among these formance [210].
publicly available datasets, the Google Speech Commands We can clearly see from Table 1 that the number of
Dataset (v1 and v2) is, by far, the most popular, and has keywords per dataset is mostly 1 or 2. A reason for this is
become the de facto open reference for KWS development that datasets mainly fit the application of KWS that, lately, is
and evaluation. Because of this, the KWS performance com- boosting research on this technology: wake-up word detec-
16 VOLUME 4, 2016
I. López-Espejo et al.: Deep Spoken KWS: An Overview

TABLE 1. A selection of the most significant speech datasets employed for training and validating deep KWS systems. “P.A.” stands for “publicly available”, while
“Y” and “N” mean “yes” and “no”, respectively. Furthermore, “+ sampl.” (“- sampl.”) refers to the size of the positive/keyword (negative/non-keyword) subset, and “Size”
denotes the magnitude of the whole set. Such sizes are given, depending on the available information, in terms of either the number of samples or time length in
hours (h). Unknown information is indicated by hyphens.

Ref. Name Developer P.A.? Language Noisy? No. of KW Training set Test set
Size + sampl. - sampl. Size + sampl. - sampl.

[193] - Alibaba N Mandarin Y 1 24k h - - - 12k 600 h


[93] - Baidu N English Y 1 12k - - 2k - -
Chinese
[42] - Academy N Mandarin Y 2 47.8k 8.8k 39k - 1.7k -
of Sciences
[58] - Fluent.ai N English Y 1 50 h 5.9k - 22 h 1.6k -
[22] - Google N English Y 10 >3k h 60.7k 133k 81.2k 11.2k 70k
[28] - Google N English Y 14 326.8k 10k 316.8k 61.3k 1.9k 59.4k
Harbin
[61] - Institute N Mandarin - 1 115.2k 19.2k 96k 28.8k 4.8k 24k
of Technology
[103] - Logitech N English - 14 - - - - - -
[26] - Mobvoi N Mandarin Y 1 67 h 20k 54k 7h 2k 5.9k
[169] - Sonos Y English Y 16 0 0 0 1.1k 1.1k 0
[96] - Tencent N Mandarin Y 1 339 h 224k 100k - - -
[45] - Tencent N Mandarin Y 1 65.9 h 6.9 h 59 h 8.7 h 0.9 h 7.8 h
[56] - Tencent N Mandarin Y 42 22.2k 15.4k 6.8k 10.8k 7.4k 3.4k
[9] - Xiaomi N Mandarin - 1 1.7k h 188.9k 1M 52.2 h 28.8k 32.8k
[194] AISHELL-2 (13) AISHELL Y Mandarin N 13 24.8 h >24k - 16.7 h >8.4k -
[194] AISHELL-2 (20) AISHELL Y Mandarin N 20 35 h >34k - 23.9 h >12k -
[108] “Alexa” Amazon N English Y 1 495 h - - 100 h - -
Google Speech
[153] Google Y English Y 10 51.7k 18.9k 32.8k 6.5k 2.4k 4.1k
Commands Dataset v1
Google Speech
[154] Google Y English Y 10 84.6k 30.8k 53.8k 10.6k 3.9k 6.7k
Commands Dataset v2
[59] “Hey Siri” Apple N English Y 1 500k 250k 250k - 6.5k 2.7k h
Hey Snapdragon
[195] Qualcomm Y English N 4 - - - 4.3k 4.3k -
Keyword Dataset
[78] Hey Snips Snips Y English Y 1 50.5 h 5.9k 45.3k 23.1 h 2.6k 20.8k
[152] “Narc Ya” Netmarble N Korean Y 1 130k 50k 80k 800 400 400
[31] “Ok/Hey Google” Google N English Y 2 - 1M - >3k h 434k 213k
[122] “Ok/Hey Google” Google N English Y 2 - - - 247 h 4.8k 7.5k
[17] Ticmini2 Mobvoi N Mandarin Y 2 157.5k 43.6k 113.9k 72.9k 21.3k 51.6k

tion for voice assistants. TABLE 2. List of the words included in the Google Speech Commands
Dataset v1 (first six rows) and v2 (all the rows). Words are broken down by the
Finally, the right part of Table 1 tells some informa- standardized 10 keywords (first two rows) and non-keywords (last five rows).
tion about the sizes of the training and test sets14 of the
different corpora in terms of either the number of sam- yes no up down left KW
Version 1 (v1)

ples (i.e., words, normally) or time length in hours (h) — right on off stop go
Version 2 (v2)

depending on the available information—. Specifically, “+ zero one two three four
sampl.” (“- sampl.”) refers to the size of the positive/keyword
Non-KW

five six seven eight nine


(negative/non-keyword) subset, and “Size” denotes the mag- bed bird cat dog happy
nitude of the whole set. Unknown information is indicated by house Marvin Sheila tree wow
hyphens. From this table, we note that, as a trend, publicly backward forward follow learn visual
available datasets tend to be smaller than in-house ones.
Furthermore, while the ratio between the sizes of the training
and test sets is greater than 1 in all the reported cases except to accurately reflect potential scenarios of use consisting of
[169], ratio values tend to differ from one corpus to another. always-on KWS applications like wake-up word detection,
Also, mainly, the ratio between the sizes of the correspond- in which KWS systems, most of the time, will be exposed to
ing negative/non-keyword and positive/keyword subsets is other types of words instead of keywords.
greater than 1, that is, +- sampl.
sampl. > 1. This is purposely done
A. GOOGLE SPEECH COMMANDS DATASET
14 Many of these corpora also include a development set. However, this The publicly available Google Speech Commands Dataset
part has been omitted for the sake of clarity. [153], [154] has become the de facto open benchmark for
VOLUME 4, 2016 17
I. López-Espejo et al.: Deep Spoken KWS: An Overview

(deep) KWS development and evaluation. This crowdsourced Ground truth NK NK KW NK NK KW NK NK NK NK


database was captured at a sampling rate of 16 kHz by means
of phone and laptop microphones, being, to some extent, SYS1 NK NK KW NK NK NK KW NK NK NK

noisy. Its first version, v1 [153], was released in August 2017 SYS2 NK NK NK NK NK NK NK NK NK NK
under a Creative Commons BY 4.0 license [211]. Recorded
by 1,881 speakers, this first version consists of 64,727 one- FIGURE 11. Example of two different KWS systems SYS1 and SYS2
recognizing a sequence of keywords (KW) and non-keywords (NK). The
second (or less) long speech segments covering one word ground truth sequence is also shown on top.
each out of 30 possible different words. The main difference
between the first version and the second version —which was
made publicly available in 2018— is that the latter incorpo- duced three outcomes revolving around the Google Speech
rates 5 more words (i.e., a total of 35 words), more speech Commands Dataset v2: 1) a variant of it emulating hearing
segments, 105,829, and more speakers, 2,618. Table 2 lists aids as a capturing device (employed, as mentioned in Sub-
the words included in the Google Speech Commands Dataset section VII-C, for KWS for hearing assistive devices robust
v1 (first six rows) and v2 (all the rows). In this table, words to external speakers) [128], [129], 2) another noisier variant
are broken down by the standardized 10 keywords (first two with a diversity of noisy conditions15 (i.e., types of noise
rows) and non-keywords (last five rows). To facilitate KWS and SNR levels) [130], and 3) manually-annotated speaker
technology reproducibility and comparison, this benchmark gender labels16 .
also standardizes the training, development and test sets, as
well as other crucial aspects of the experimental framework, IX. EVALUATION METRICS
including a training data augmentation procedure involving Obviously, the gold plate test of any speech communication
background noises (see, e.g., [30] for further details). Mul- system is a test with relevant end-users. However, such tests
tiple recent deep KWS works have employed either the first tend to be costly and time-consuming. Instead (or in addition
version [16], [30], [32], [43], [48]–[52], [57], [58], [67], [69], to subjective tests), one adheres to objective performance
[70], [86], [90], [100], [125] or the second version [32], [47], metrics for estimating system performance. It is important to
[48], [53], [70], [82], [89], [90], [99], [100], [109], [128]– choose a meaningful objective evaluation metric that allows
[130], [159], [175] of the Google Speech Commands Dataset. us to determine the goodness of a system and is highly
Despite how valuable this open reference is for KWS correlated to the subjective user experience. In what follows,
research and development, we can raise two relevant points we review and provide some criticism of the most common
of criticism: metrics considered in the field of KWS. These metrics are
1) Class balancing: The different keyword and non- rather intended for binary classification —e.g., keyword/non-
keyword classes are rather balanced (i.e., they ap- keyword— tasks. In the event of having multiple keywords, a
pear with comparable frequencies) in this benchmark, common approach consists of applying the metric computa-
which, as we know, is generally not realistic. See Sub- tion for every keyword and, then, the result is averaged, e.g.,
section IX-A for further comments on this question. see [30], [129], [130].
2) Non-streaming mode: Most of the above-referred
works using the Google Speech Commands Dataset A. ACCURACY
performs, due to the nature of this corpus, KWS eval- Accuracy can be defined as the ratio between the number
uations in non-streaming mode, namely, multi-class of correct predictions/classifications and the total number
classification of independent short input segments. In of them [212]. In the context of binary classification (e.g.,
this mode, a full keyword or non-keyword is surely keyword/non-keyword), accuracy can also be expressed from
present within every segment. However, real-life KWS the number of true positives (TP), false positives (FP), true
involves the continuous processing of an input audio negatives (TN) and false negatives (FN) as follows [213]:
stream. TP + TN
Accuracy = . (14)
A few deep KWS research works [43], [58], [129], [130] TP + TN + FP + FN
have proposed to overcome the above two limitations by Accuracy ∈ [0, 1], where 0 and 1 indicate, respectively,
generating more realistic streaming versions of the Google worst and perfect classification.
Speech Commands Dataset by concatenation of one-second It is reasonable to expect that, in real-life applications like
long utterances in such a manner that the resulting word class wake-up word detection, KWS systems will hear other word
distribution is unbalanced. Even though the author of the types rather than keywords most of the time. In other words,
Google Speech Commands Dataset reports some streaming KWS is a task in which, in principle, the keyword and non-
evaluations in the database description manuscript [154], keyword classes are quite unbalanced. Under these circum-
still, we think that this point should be standardized for the
15 Tools to create this noisy dataset can be freely downloaded from http:
sake of reproducibility and comparison, thereby enhancing
//ilopez.es.mialias.net/misc/NoisyGSCD.zip
the usefulness of this valuable corpus. 16 These labels are publicly available at https://ilopezes.files.wordpress.
Lastly, we wish to draw attention to the fact that we pro- com/2019/10/gscd_spk_gender.zip

18 VOLUME 4, 2016
I. López-Espejo et al.: Deep Spoken KWS: An Overview

Receiver operating characteristic Detection error trade-off Notice that Eq. (15) is the probability that a positive sample
Perfect classifier Perfect classifier
1.0 1.0 SYS2 (i.e., a keyword in this paper) is correctly detected as such.

se
Be
Similarly, let FPR be the false positive rate —also known

or
tte

W
rW
0.8 0.8

r
as false alarm rate—, namely, the probability that a negative

False negative rate

tte
True positive rate

or

Be
se

R
sample (i.e., a non-keyword in our case) is wrongly classified

an
0.6 SYS1
0.6

do
er
ifi

m
as a positive one [217]:
ss

SYS1

cl
a

0.4 0.4

as
cl

sif
m

FP

ie
do

r
an

0.2 0.2 FPR = . (16)


R

FP + TN
SYS2
0.0 0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Then, a better and prominent way of evaluating the per-
False positive rate False positive rate formance of a KWS system is by means of the receiver
operating characteristic (ROC) curve, which consists of the
FIGURE 12. Outlining of the receiver operating characteristic (left) and
detection error trade-off (right) curves. The location of SYS1 and SYS2 is plot of pairs of false positive and true positive rate values that
indicated by green and red crosses, respectively. See the text for further are obtained by sweeping the sensitivity (decision) threshold
explanation.
[218]. The left part of Figure 12 outlines example ROC
curves. Coordinate (FPR = 0, TPR = 1) in the upper
left corner represents a perfect classifier. The closer to this
stances, accuracy tends to be an unsuitable evaluation metric point a ROC curve is, the better a classification system. In
yielding potentially misleading conclusions [214], [215]. Let addition, a system performing on the ROC space identity
us illustrate this statement with the following example. Let us line would be randomly guessing. The area under the curve
consider two different KWS systems SYS1 and SYS2. While (AUC), which equals the probability that a classifier ranks
SYS1 is a relatively decent system, SYS2 is a totally useless a randomly-chosen positive sample higher than a randomly-
one, since it always outputs “non-keyword” regardless of chosen negative one [218], is also often employed as a ROC
the input. Figure 11 depicts, along with an example ground summary for KWS evaluation, e.g., [76], [85], [123], [145],
truth sequence, the sequences of keywords (KW) and non- [152], [219]–[221]. The larger the AUC ∈ [0, 1], the better a
keywords (NK) predicted by SYS1 and SYS2. In this situation, system is [222].
both KWS systems perform with 80% accuracy, even though Let us return for a moment to the example of Figure 11.
SYS2 is useless while SYS1 is not. Thus, particularly in It is easy to check that the KWS systems SYS1 and SYS2
unbalanced situations, more appropriate evaluation metrics would be characterized, in the ROC space, by the coordinates
than accuracy may be required, and these are discussed in the (FPR = 0.125, TPR = 0.5) and (FPR = 0, TPR = 0),
next subsections. respectively (see Figure 12). Unlike what happened when
In spite of its disadvantage in unbalanced situations, accu- using accuracy, now we can rightly assess that SYS1 (above
racy is a widely used evaluation metric for deep KWS, espe- the random guessing line) is much better than SYS2 (on the
cially when performing evaluations on the popular Google random guessing line).
Speech Commands Dataset [153], [154] in non-streaming An alternative (with no particular preference) to the ROC
mode [16], [30], [32], [48]–[52], [58], [69], [89], [91], curve (e.g., [24], [138], [177], [223]) is the detection error
[99], [109], [125]. In this latter case, accuracy can still be trade-off (DET) curve [224]. From the right part of Figure 12,
considered a meaningful metric, since the different word it can be seen that a DET curve is like a ROC curve except
classes are rather balanced in the Google Speech Commands for the y-axis being false negative rate —also known as miss
Dataset benchmark. Hence, the main criticism that might rate [225]—, FNR:
be raised here is the lack of realism of the benchmark FN
itself, as discussed in Subsection VIII-A. Nevertheless, we FNR = . (17)
FN + TP
have experimentally observed for KWS a strong correlation
between accuracy on a quite balanced scenario and more This time, coordinate (FPR = 0, FNR = 0) in the bottom
suitable metrics like F-score (see Subsection IX-C) on a left corner represents a perfect classifier. The closer to this
more realistic, unbalanced scenario [129], [130]. This might point a DET curve is, the better a classification system.
suggest that the employment of accuracy, although not ideal, Therefore, the smaller the AUC ∈ [0, 1] in this case, the
can still be useful under certain experimental conditions to better a system is. Notice that, as FNR = 1 − TPR, the DET
adequately explain the goodness of KWS systems. curve is nothing else but a vertically-flipped version of the
ROC curve. From the DET curve we can also straightfor-
wardly obtain the equal error rate (EER) as the intersection
B. RECEIVER OPERATING CHARACTERISTIC AND
point between the identity line and the DET curve (i.e., the
DETECTION ERROR TRADE-OFF CURVES
point at which FNR = FPR) [226]. Certainly, the lower
Let TPR denote the true positive rate —also known as recall the EER value, the better. Though the use of EER is much
[216]—, which is defined as the ratio more widespread in the field of speaker verification [227]–
TP [229], this DET summary is sometimes considered for KWS
TPR = Recall = . (15) evaluation [4], [76], [117], [123], [159], [220], [230].
TP + FN
VOLUME 4, 2016 19
I. López-Espejo et al.: Deep Spoken KWS: An Overview

Precision-recall F-score AUC ∈ [0, 1], the better a classifier. This time, a (precision-
Perfect classifier Perfect classifier
1.0 1.0
recall) random guessing line has not been drawn, since it
r
depends on the proportion of the positive class within both
tte
Be

Worse Better
0.8 0.8
se

classes [240]. For example, while in a balanced scenario


or
W

random guessing would be characterized by a horizontal line


Precision

0.6 0.6

F-score
at a precision of 0.5, we can expect that such a line is closer
0.4 0.4
to 0 precision in the event of the KWS problem due to the
0.2 0.2 highly imbalance nature of it.
0.0 0.0 The close relationship between the ROC (and DET) and
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 precision-recall curves can be intuited, and, in fact, there
Recall Sensitivity threshold
exists a one-to-one correspondence between both of them
FIGURE 13. Outlining of the precision-recall (left) and F-score (right) curves. [239]. However, the precision-recall curve is considered to
See the text for further explanation.
be a more informative visual analysis tool than the ROC one
in our context [240]. This is because, thanks to the use of
In real-world KWS applications, typically, the cost of a precision, the precision-recall curve allows us to better focus
false alarm is significantly greater than that of a miss detec- on the minority positive (i.e., keyword) class of interest (see
tion17 [231]. This is for example the case for voice activation Eq. (19)). On the precision-recall plane, while SYS1 lies on
of voice assistants, where privacy is a major concern [232] the point (Recall = 0.5, Precision = 0.5), precision is
since this application involves streaming voice to a cloud undefined (i.e., Precision = 0/0) for SYS2, which should
server. As a result, a popular variant of the ROC and DET alert us to the existence of a problem with the latter system.
curves is that one replacing false positive rate along the x- From precision and recall we can formulate the F-score
axis by the number of false alarms per hour [8], [28], [31], metric [241], F1 , which is often used for KWS evaluation,
[59], [60], [156], [162], [195]. By this, a practitioner can just e.g., [12], [129], [130], [140], [151], [242]. F-score is the
set a very small number of false alarms per hour (e.g., 1) harmonic mean of precision and recall, that is,
and identify the system with the highest (lowest) true positive 2 2TP
F1 = = , (20)
(false negative) rate for deployment. An alternative good se- Recall−1 + Precision−1 2TP + FP + FN
lection criterion consists of picking up the system maximiz-
where 0 ≤ F1 ≤ 1, and the larger F1 , the better. Indeed,
ing, at a particular system-dependent sensitivity threshold,
as for precision and recall, F-score can be calculated as a
the so-called term-weighted value (TWV) [88], [231], [233]–
function of the sensitivity threshold and plotted as exempli-
[238]. Given a sensitivity threshold, TWV is a weighted
fied by the right part of Figure 13. In this representation,
linear combination of the false negative and false positive
we assume that a KWS system provides confidence scores
rates as in
resulting from posterior probabilities, and this is why the
TWV = 1 − (FNR + βFPR) , (18)
sensitivity threshold ranges from 0 to 1. The larger the
where β  1 (e.g., β = 999.9 [231]) is a constant expressing AUC ∈ [0, 1], the better a system is. A perfect classifier
the greater cost of a false alarm with respect to that of a miss would be characterized by an AUC of 1. As in the case of
detection. the precision-recall curve, a random guessing line has not
been drawn either on the F-score space, since this similarly
C. PRECISION-RECALL AND F-SCORE CURVES depends on the proportion between the positive and negative
The precision-recall curve [239] is another important visual classes. Finally, let us notice that F-score is 0.5 and 0 for
performance analysis tool for KWS systems (e.g., [12], [77], SYS1 and SYS2, respectively, which clearly indicates the
[129], [140]). Let precision, also known as positive predictive superiority of SYS1 with respect to SYS2.
value [240], be the probability that a sample that is classified
as positive is actually a positive sample: X. PERFORMANCE COMPARISON
TP In this section, we present a performance comparison among
.
Precision = (19) some of the latest and most relevant deep KWS systems
TP + FP
reviewed throughout this manuscript. This comparison is
Then, the precision-recall curve plots pairs of recall (equiva- carried out in terms of both KWS performance and compu-
lently, TPR, see Eq. (15)) and precision values that, as in the tational complexity of the acoustic model, which is the main
case of the ROC and DET curves, are obtained by sweeping distinctive component of every system.
the sensitivity threshold. This definition is schematized by To measure KWS performance, we examine accuracy of
the left part of Figure 13, where a perfect classifier lies on systems in non-streaming mode on the Google Speech Com-
the coordinate (Recall = 1, Precision = 1). The closer mands Dataset (GSCD) v1 and v2 (described in Subsection
to this point a precision-recall curve is and the larger the VIII-A), which standardize 10 keywords (see Table 2). In
17 Evidently, in these circumstances, EER may not be a good metric this way, since the publicly available GSCD has become the
candidate for system comparison. de facto open benchmark for deep KWS, we can straightfor-
20 VOLUME 4, 2016
I. López-Espejo et al.: Deep Spoken KWS: An Overview

TABLE 3. Performance comparison among some of the latest deep KWS systems in terms of both accuracy (%) and computational complexity (i.e., number of
parameters and multiplications) of the acoustic model. Accuracy, provided with confidence intervals for some systems, is on the Google Speech Commands Dataset
(GSCD) v1 and v2. The reported values are directly taken from the references in the “Description” column. Unknown information is indicated by hyphens.

ID Description Year Accuracy (%) Computational complexity


GSCD v1 GSCD v2 No. of params. No. of mults.

1 Standard FFNN with a pooling layer [32] 2020 91.2 90.6 447k –
2 DenseNet with trainable window function and mixup data augmentation [67] 2018 92.8 – – –
3 Two-stage TDNN [58] 2018 94.3 – 251k 25.1M
4 CNN with striding [32] 2018 95.4 95.6 529k –
5 BiLSTM with attention [133] 2018 95.6 96.9 202k –
6 Residual CNN res15 [30] 2018 95.8 ± 0.484 – 238k 894M
7 TDNN with shared weight self-attention [16] 2019 95.81 ± 0.191 – 12k 403k
8 DenseNet+BiLSTM with attention [48] 2019 96.2 97.3 223k –
9 Residual CNN with temporal convolutions TC-ResNet14 [50] 2019 96.2 – 137k –
10 SVDF [32] 2019 96.3 96.9 354k –
11 SincConv+(Grouped DS-CNN) [70] 2020 96.4 97.3 62k –
12 Graph convolutional network CENet-40 [49] 2019 96.4 – 61k 16.18M
13 GRU [32] 2020 96.6 97.2 593k –
14 SincConv+(DS-CNN) [70] 2020 96.6 97.4 122k –
15 Temporal CNN with depthwise convolutions TENet12 [52] 2020 96.6 – 100k 2.90M
16 Residual DS-CNN with squeeze-and-excitation DS-ResNet18 [51] 2020 96.71 ± 0.195 – 72k 285M
17 TC-ResNet14 with neural architecture search NoisyDARTS-TC14 [146] 2021 96.79 ± 0.30 97.18 ± 0.26 108k 6.3M
18 LSTM [32] 2020 96.9 97.5 – –
19 DS-CNN with striding [32] 2018 97.0 97.1 485k –
20 CRNN [32] 2020 97.0 97.5 467k –
21 BiGRU with multi-head attention [32] 2020 97.2 98.0 743k –
22 CNN with neural architecture search NAS2_6_36 [125] 2020 97.22 – 886k –
23 Keyword Transformer KWT-3 [90] 2021 97.49 ± 0.15 98.56 ± 0.07 5.3M –
24 Variant of TC-ResNet with self-attention LG-Net6 [91] 2021 97.67 96.79 313k –
25 Broadcasted residual CNN BC-ResNet-8 [100] 2021 98.0 98.7 321k 89.1M
Regarding accuracy as an evaluation metric, recall that this metric, although not ideal, is still meaningful under the GSCD experimental conditions to explain the goodness of KWS systems, as discussed in Subsection IX-A.

On the other hand, the number of parameters and multiplications of the acoustic model is used to evaluate the computational complexity of the systems. Notice that these measures are a good approximation to the complexity of the entire deep KWS system, since the acoustic model is, by far, the most demanding component in terms of computation. Actually, in [86], Tang et al. show that the number of parameters and, especially, the number of multiplications of the acoustic model are solid proxies predicting the power consumption of these systems.
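To make these two complexity measures concrete, the sketch below (a made-up toy acoustic model, not one of the systems in Table 3) counts parameters directly from the model and estimates multiplications analytically, assuming one multiplication per weight use:

```python
import torch
import torch.nn as nn

# Toy CNN acoustic model (hypothetical architecture, for illustration only):
# input is a 1 x 40 x 101 log-Mel patch (40 Mel bands, 101 frames).
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # output: 16 x 40 x 101
    nn.ReLU(),
    nn.AvgPool2d((40, 101)),                     # global average pooling -> 16 x 1 x 1
    nn.Flatten(),
    nn.Linear(16, 12),                           # e.g., 10 keywords + "unknown" + "silence"
)

# Number of parameters: total count of trainable weights and biases.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Number of multiplications (rough proxy): one multiply per weight use.
conv_mults = (3 * 3 * 1) * (16 * 40 * 101)  # kernel multiplies x number of conv outputs
fc_mults = 16 * 12                          # in_features x out_features
num_mults = conv_mults + fc_mults

print(f"{num_params / 1e3:.1f}k parameters, {num_mults / 1e6:.2f}M multiplications")
```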
FIGURE 14. Location of some of the deep KWS systems of Table 3 on the plane defined by the dimensions “number of parameters” and “accuracy” (on the Google Speech Commands Dataset v1). Better systems can be found on the lower right corner of this plane. The systems are identified by the numbers in the “ID” column of Table 3. More recent systems are marked with a darker color.

Table 3 shows a performance comparison among some of the latest deep KWS systems in terms of both accuracy on the GSCD v1 and v2 (in percentages), and complexity of the acoustic model. The reported values are directly taken from the references in the “Description” column, while hyphens indicate non-available information. Notice that some of the accuracy values in Table 3 are shown along with confidence intervals that are calculated across different acoustic models trained with different random model initialization.
Deep KWS systems are listed in ascending order in terms of their accuracy on the first version of the GSCD. From Table 3, it can be observed that KWS performance on GSCD v2 tends to be slightly better than that on the first version of this dataset. This behavior could be related to the fact that the second version of this dataset has more word samples (see Table 1), which might lead to better trained acoustic models.

Also from Table 3, we can see the wide variety of architectures (e.g., standard FFNNs, SVDFs, TDNNs, CNNs, RNNs and CRNNs) integrating different elements (e.g., attention, residual connections and/or depthwise separable convolutions) that has been explored for deep KWS. It is not surprising that the worst-performing system is the one whose acoustic model is based on a standard and relatively heavy (447k parameters) FFNN [32] (ID 1 in Table 3). Besides, the most frequently used acoustic model type is based on CNNs. This surely is because CNNs are able to provide a highly competitive performance, thanks to exploiting local speech time-frequency correlations, while typically involving lower computational complexity than other well-performing types of models like RNNs.

Furthermore, it is interesting to note the capability of neural architecture search techniques [243] to automatically produce acoustic models performing better than those manually designed. Thus, the performance of the residual CNN with temporal convolutions TC-ResNet14 [50] (ID 9) is improved when NoisyDARTS-TC14 [146] (ID 17) automatically searches for kernel sizes, additional skip connections and whether or not to enable squeeze-and-excitation [244]. Even better, this is achieved by employing fewer parameters, i.e., 108k versus the 137k of TC-ResNet14. In addition, the CNN with neural architecture search NAS2_6_36 [125] (ID 22) reaches an outstanding performance (97.22% accuracy on the GSCD v1), though at the expense of using a large number of parameters (886k).

The effectiveness of CRNNs combining CNNs and RNNs (see Subsection IV-C) can also be assessed from Table 3. For instance, the combination of DenseNet [131] with a BiLSTM network with attention as in [48] (ID 8) yields superior KWS accuracy with respect to using either a standalone DenseNet [67] (ID 2) or a standalone BiLSTM network with attention [133] (ID 5). Moreover, we can see that the performance of a rather basic CRNN incorporating a GRU layer [32] (ID 20) is quite competitive.

18 Recall that DenseNet is an extreme case of residual CNN with a hive of skip connections.

Due to the vast number of disparate factors contributing to the performance of the deep KWS systems of Table 3, it is extremely difficult to draw strong conclusions and even trends far beyond the ones indicated above. Figure 14 gives another perspective of Table 3 by plotting the location of some of the systems of this table on the plane defined by the dimensions “number of parameters” and “accuracy” (on the GSCD v1). In this figure, the systems are identified by the numbers in the “ID” column of Table 3.

Since more recent deep KWS systems are marked with a darker color in Figure 14, it can be clearly observed that, primarily, the driving force is the optimization of KWS performance, while computational complexity, although important, is relegated to a secondary position. A good example of this is the so-called Keyword Transformer KWT-3 [90] (ID 23), a fully self-attentional Transformer [144] that is an adaptation of the Vision Transformer [245] to the KWS task. KWT-3 (not included in Figure 14), which achieves state-of-the-art performance (97.49% and 98.56% accuracy on the GSCD v1 and v2, respectively), has the extraordinary amount of more than 5 million parameters. That being said, generally, we will be more interested in systems exhibiting both high accuracy and a small footprint, i.e., in systems that can be found on the lower right corner of the plane in Figure 14. In this region of the plane we have the following two groups of systems:
1) Systems with IDs 14, 15, 16 and 17: These systems are characterized by a good KWS performance along with a particularly reduced number of parameters. All of them are based on CNNs, while most of them integrate residual connections and/or depthwise separable convolutions. Furthermore, the three best-performing systems (with IDs 15, 16 and 17) integrate either dilated or temporal convolutions to exploit long time-frequency dependencies.
2) Systems with IDs 24 and 25: These two systems are characterized by an outstanding KWS performance along with a relatively small number of parameters. Both of them are based on CNNs, and they integrate residual connections and a mechanism to exploit long time-frequency dependencies: dilated convolutions in System 25, and temporal convolutions and self-attention layers in System 24. System 25 also incorporates depthwise separable convolutions.
The analysis of the above two groups of systems very much reinforces our summary reflections concluding Subsection IV-B. In other words, a state-of-the-art KWS system comprising a CNN-based acoustic model should cover the following three elements in order to reach a high performance with a small footprint: a mechanism to exploit long time-frequency dependencies, depthwise separable convolutions [136] and residual connections [126].
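The following PyTorch sketch illustrates, in a generic way (it is not an exact reimplementation of any system in Table 3), how these three elements can be combined in a single convolutional block:

```python
import torch
import torch.nn as nn

class ResidualDSBlock(nn.Module):
    """Toy residual block combining the three ingredients discussed above:
    a dilated convolution over time for long temporal context, depthwise
    separable convolutions, and a residual (skip) connection. The channel
    count and dilation are arbitrary choices made for illustration."""

    def __init__(self, channels: int = 64, dilation: int = 4):
        super().__init__()
        # Depthwise 1D convolution over time with dilation (groups=channels).
        self.depthwise = nn.Conv1d(channels, channels, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=channels, bias=False)
        # Pointwise (1x1) convolution mixing channels.
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); the skip connection adds the input back.
        return self.act(x + self.bn(self.pointwise(self.depthwise(x))))

# Usage on a dummy 64-channel, 101-frame feature map:
block = ResidualDSBlock()
out = block(torch.randn(2, 64, 101))  # -> shape (2, 64, 101)
```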
XI. AUDIO-VISUAL KEYWORD SPOTTING
In face-to-face human communication, observable articulators like the lips are an important information source. In other words, human speech perception is bimodal, since it relies on both auditory and visual information. Similarly, speech processing systems such as ASR systems can benefit from exploiting visual information along with the audio information to enhance their performance [246]–[249]. This can be particularly fruitful in real-world scenarios where severe acoustic distortions (e.g., strong background noise and reverberation) are present, since the visual information is not affected by acoustic distortions.
FIGURE 15. General diagram of a modern audio-visual keyword spotting system. (The diagram comprises a speech feature extraction block fed by the speech signal, a visual feature extraction block fed by the visual signal, and an audio-visual decision fusion block producing the final decision.)

While fusion of audio-visual information is a quite active research area in ASR (e.g., see [246]–[249]), very few works have studied it for (deep) KWS [250]–[252]. Figure 15 illustrates the general diagram of a modern audio-visual KWS system. First, speech and visual features are extracted. In former audio-visual KWS work [250], [251], visual feature extraction consists of a pipeline comprising face detection and lip localization (via landmark estimation), followed by visual feature extraction itself from the lips crop. Nowadays, the use of a deep learning model fed with raw images containing the uncropped speaker’s face seems to be the preferred approach for visual feature extraction [252]. Finally, the extracted audio-visual information is fused in order to decide whether or not a keyword is present. Typically, one of the two following fusion strategies is considered in practice [253] (see the sketch after this list):
1) Feature-level fusion: Speech and visual features are somehow combined (e.g., concatenated) before their joint classification using a neural network model.
2) Decision-level fusion: The final decision is formed from the combination of the decisions from separate speech and visual neural network-based classifiers. This well-performing approach seems to be preferred [250]–[252] over the feature-level fusion scheme and is less data-hungry than feature-level fusion [250].
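As an illustration of the second strategy, the sketch below fuses per-class keyword posteriors from hypothetical audio and visual classifiers through a simple log-linear weighting; the weighting scheme and the fixed audio weight are illustrative assumptions rather than the exact method of [250]–[252]:

```python
import torch

def decision_level_fusion(audio_posteriors: torch.Tensor,
                          visual_posteriors: torch.Tensor,
                          audio_weight: float = 0.7) -> torch.Tensor:
    """Combine per-class keyword posteriors from separate audio and visual
    classifiers with a fixed log-linear weighting (audio_weight is a tunable
    hyperparameter; in practice it could depend, e.g., on the estimated SNR)."""
    eps = 1e-8
    log_fused = (audio_weight * torch.log(audio_posteriors + eps)
                 + (1.0 - audio_weight) * torch.log(visual_posteriors + eps))
    return torch.softmax(log_fused, dim=-1)  # renormalize to a distribution

# Toy usage: posteriors over [keyword, non-keyword] from the two branches.
audio = torch.tensor([0.80, 0.20])
visual = torch.tensor([0.55, 0.45])
print(decision_level_fusion(audio, visual))
```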
Notice that, thanks to the integration of visual information (which, as aforementioned, is not affected by acoustic distortions), audio-visual KWS achieves the greatest relative improvements with respect to audio-only KWS at lower SNRs [250]–[252].

For those who are interested in audio-visual KWS research, the following realistic and challenging audio-visual benchmarks can be of interest: the Lip Reading in the Wild (LRW) [254], Lip Reading Sentences 2 (LRS2) [255] and Lip Reading Sentences 3 (LRS3) [256] datasets. While LRW comprises single-word utterances from BBC TV broadcasts, LRS2 and LRS3 consist of thousands of spoken sentences from BBC TV and TED(x) talks, respectively.

XII. CONCLUSIONS AND FUTURE DIRECTIONS
The goal of this article has been to provide a comprehensive overview of state-of-the-art KWS technology, namely, of deep KWS. We have seen that the core of this paradigm is a DNN-based acoustic model whose goal is the generation, from speech features, of posterior probabilities that are subsequently processed to detect the presence of a keyword. Deep spoken KWS has revitalized KWS research by enabling a massive deployment of this technology for real-world applications, especially in the area of voice assistant activation. We foresee that, as has been happening to date, advances in ASR research will continue to have a dramatic impact on the field of KWS. In particular, we think that the expected progress in end-to-end ASR [257] (replacing handcrafted speech features by optimal feature learning integrated in the acoustic model) will also be reflected in KWS.

Immediate future work will keep focusing on advancing acoustic modeling towards two goals simultaneously: 1) improving KWS performance in real-life acoustic conditions, and 2) reducing computational complexity. With these two goals in mind, acoustic model research will surely remain mainly focused on the development of novel and efficient convolutional blocks. This is because of the good properties of CNNs allowing us to achieve an outstanding performance with a small footprint, as has been widely discussed throughout this paper. Furthermore, based on its promising initial results for KWS [125], [146], we expect that neural architecture search will play a greater role in acoustic model architecture design.

Specifically within the context of computational complexity reduction, acoustic model compression will be, more than ever, a salient research line [101]. Indeed, this is driven by the numerous applications of KWS that involve embedding KWS technology in small electronic devices characterized by severe memory, computation and power constraints. Acoustic model compression entails three major advantages: 1) reduced memory footprint, 2) decreased inference latency, and 3) less energy consumption. All of this is of utmost importance for, e.g., enabling on-device acoustic model retraining for robustness purposes or personalized keyword inclusion. Acoustic model compression research will undoubtedly encompass model parameter quantization, neural network pruning and knowledge distillation [258], among other approaches.
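As a minimal illustration of the first of these ingredients, the sketch below applies symmetric, per-tensor, 8-bit post-training quantization to a made-up weight matrix; deployed systems typically rely on more elaborate schemes such as quantization-aware training:

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric 8-bit post-training quantization of a weight tensor.
    Returns the int8 weights and the scale needed to dequantize them."""
    scale = weights.abs().max() / 127.0  # map the largest magnitude to 127
    q_weights = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)
    return q_weights, scale

def dequantize(q_weights: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q_weights.float() * scale

# Toy usage: quantize a random 64x64 weight matrix and measure the error.
w = torch.randn(64, 64)
q, s = quantize_int8(w)
print("max abs quantization error:", (w - dequantize(q, s)).abs().max().item())
# Storing int8 instead of float32 shrinks this layer's memory footprint by ~4x.
```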
Another line of research that might experience a notable growth in the short term could be semi-supervised learning [259] for KWS. Especially in an industrial environment, it is easy to collect, on a daily basis, a vast amount of speech data from users of cloud speech services. These data are potentially valuable to strengthen KWS acoustic models. However, the cost of labeling such an enormous amount of data for discriminative model training can easily be prohibitively expensive. In order not to “waste” these unlabeled speech data, semi-supervised learning methodologies can help by allowing hybrid learning based on a small volume of labeled data together with a large volume of unlabeled data.
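One common semi-supervised recipe that fits this scenario is pseudo-labeling (self-training), sketched below under the assumption of a trained acoustic model producing class logits; the model, confidence threshold and data handling are placeholders:

```python
import torch

def pseudo_label(model: torch.nn.Module, unlabeled_batch: torch.Tensor,
                 threshold: float = 0.95):
    """Self-training step: keep only unlabeled utterances whose predicted
    keyword posterior is confident enough, and use the prediction as label."""
    with torch.no_grad():
        posteriors = torch.softmax(model(unlabeled_batch), dim=-1)
        confidence, labels = posteriors.max(dim=-1)
    mask = confidence >= threshold
    return unlabeled_batch[mask], labels[mask]

# The retained (features, pseudo-label) pairs are then mixed with the small
# labeled set for a new round of supervised acoustic model training.
```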
On the other hand, consumers seem to increasingly demand or, at least, value a certain degree of personalization when it comes to consumer electronics. While some research has already addressed some KWS personalization aspects (as we have discussed in this article), we foresee
that KWS personalization will become even more relevant in the immediate future. This means that we can expect new research going deeper into topics like efficient open-vocabulary (personalized) KWS and joint KWS and speaker verification [260], [261].

Last but not least, recall that KWS technology is often intended to run on small devices like smart speakers and wearables that typically embed more than one microphone. This type of multi-channel information has been successfully leveraged by ASR in different ways (including, e.g., beamforming) to provide robustness against acoustic distortions [44], [262]. Surprisingly, and as previously outlined in Section VI, multi-channel KWS has only been marginally studied. Therefore, we expect that this rather unexplored area is worth examining, which could lead to contributions further improving KWS performance in real-life (i.e., noisy) conditions.
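For illustration only, the sketch below implements the simplest of these multi-channel techniques, a time-domain delay-and-sum beamformer, on synthetic two-microphone signals; practical KWS front-ends typically rely on more advanced adaptive beamforming:

```python
import numpy as np

def delay_and_sum(mics, delays_in_samples):
    """Time-domain delay-and-sum beamformer.

    mics: array of shape (num_channels, num_samples) with the microphone signals.
    delays_in_samples: per-channel integer delays that time-align the target
    speech across channels (they depend on the look direction and array geometry).
    """
    num_channels = mics.shape[0]
    output = np.zeros(mics.shape[1])
    for channel, delay in zip(mics, delays_in_samples):
        output += np.roll(channel, -delay)  # advance each channel to align the target
    return output / num_channels            # averaging reinforces speech over diffuse noise

# Toy usage: the second microphone receives the target 3 samples later.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
mic1 = target + 0.5 * rng.standard_normal(16000)
mic2 = np.roll(target, 3) + 0.5 * rng.standard_normal(16000)
enhanced = delay_and_sum(np.stack([mic1, mic2]), [0, 3])
```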
REFERENCES
[1] M. Hoy, “Alexa, Siri, Cortana, and more: An introduction to voice assistants,” Medical Reference Services Quarterly, vol. 37, pp. 81–88, 2018.
[2] A. H. Michaely, X. Zhang, G. Simko, C. Parada, and P. Aleksic, “Keyword spotting for Google Assistant using contextual speech recognition,” in Proceedings of ASRU 2017 – IEEE Automatic Speech Recognition and Understanding Workshop, December 16-20, Okinawa, Japan, 2017, pp. 272–278.
[3] O. Vinyals and S. Wegmann, “Chasing the metric: Smoothing learning algorithms for keyword detection,” in Proceedings of ICASSP 2014 – 39th IEEE International Conference on Acoustics, Speech and Signal Processing, May 4-9, Florence, Italy, 2014, pp. 3301–3305.
[4] Y. Zhuang, X. Chang, Y. Qian, and K. Yu, “Unrestricted vocabulary keyword spotting using LSTM-CTC,” in Proceedings of INTERSPEECH 2016 – 17th Annual Conference of the International Speech Communication Association, September 8-12, San Francisco, USA, 2016, pp. 938–942.
[5] M. Weintraub, “Keyword-spotting using SRI’s DECIPHER large-vocabulary speech-recognition system,” in Proceedings of ICASSP 1993 – 18th IEEE International Conference on Acoustics, Speech and Signal Processing, April 27-30, Minneapolis, USA, 1993, pp. 463–466.
[6] D. R. H. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection,” in Proceedings of INTERSPEECH 2007 – 8th Annual Conference of the International Speech Communication Association, August 27-31, Antwerp, Belgium, 2007, pp. 314–317.
[7] G. Chen, O. Yilmaz, J. Trmal, D. Povey, and S. Khudanpur, “Using proxies for OOV keywords in the keyword search task,” in Proceedings of ASRU 2013 – IEEE Automatic Speech Recognition and Understanding Workshop, December 8-12, Olomouc, Czech Republic, 2013, pp. 416–421.
[8] Y. Wang and Y. Long, “Keyword spotting based on CTC and RNN for Mandarin Chinese speech,” in Proceedings of ISCSLP 2018 – 11th International Symposium on Chinese Spoken Language Processing, November 26-29, Taipei, Taiwan, 2018, pp. 374–378.
[9] C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end models for small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2018 – 19th Annual Conference of the International Speech Communication Association, September 2-6, Hyderabad, India, 2018, pp. 2037–2041.
[10] R. Rikhye, Q. Wang, Q. Liang, Y. He, D. Zhao, Y. A. Huang, A. Narayanan, and I. McGraw, “Personalized keyphrase detection using speaker and environment information,” in Proceedings of INTERSPEECH 2021 – 22nd Annual Conference of the International Speech Communication Association, August 30-September 3, Brno, Czechia, 2021, pp. 4204–4208.
[11] G. Chen, C. Parada, and T. N. Sainath, “Query-by-example keyword spotting using long short-term memory networks,” in Proceedings of ICASSP 2015 – 40th IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, Australia, 2015, pp. 5236–5240.
[12] S. Chai, Z. Yang, C. Lv, and W.-Q. Zhang, “An end-to-end model based on TDNN-BiGRU for keyword spotting,” in Proceedings of IALP 2019 – International Conference on Asian Language Processing, November 15-17, Shanghai, China, 2019, pp. 402–406.
[13] M. Sun, D. Snyder, Y. Gao, V. Nagaraja, M. Rodehorst, S. Panchapagesan, N. Strom, S. Matsoukas, and S. Vitaladevuni, “Compressed time delay neural network for small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2017 – 18th Annual Conference of the International Speech Communication Association, August 20-24, Stockholm, Sweden, 2017, pp. 3607–3611.
[14] D. Can and M. Saraclar, “Lattice indexing for spoken term detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 2338–2347, 2011.
[15] R. Prabhavalkar, R. Alvarez, C. Parada, P. Nakkiran, and T. N. Sainath, “Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks,” in Proceedings of ICASSP 2015 – 40th IEEE International Conference on Acoustics, Speech and Signal Processing, April 19-24, Brisbane, Australia, 2015, pp. 4704–4708.
[16] Y. Bai, J. Yi, J. Tao, Z. Wen, Z. Tian, C. Zhao, and C. Fan, “A time delay neural network with shared weight self-attention for small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2019 – 20th Annual Conference of the International Speech Communication Association, September 15-19, Graz, Austria, 2019, pp. 2190–2194.
[17] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, “Region proposal network based small-footprint keyword spotting,” IEEE Signal Processing Letters, vol. 26, pp. 1471–1475, 2019.
[18] J. Rohlicek, W. Russell, S. Roukos, and H. Gish, “Continuous hidden Markov modeling for speaker-independent word spotting,” in Proceedings of ICASSP 1989 – 14th IEEE International Conference on Acoustics, Speech and Signal Processing, May 23-26, Glasgow, UK, 1989, pp. 627–630.
[19] R. Rose and D. Paul, “A hidden Markov model based keyword recognition system,” in Proceedings of ICASSP 1990 – 15th IEEE International Conference on Acoustics, Speech and Signal Processing, April 3-6, Albuquerque, USA, 1990, pp. 129–132.
[20] J. Wilpon, L. Miller, and P. Modi, “Improvements and applications for key word recognition using hidden Markov modeling techniques,” in Proceedings of ICASSP 1991 – 16th IEEE International Conference on Acoustics, Speech and Signal Processing, April 14-17, Toronto, Canada, 1991, pp. 309–312.
[21] I.-F. Chen and C.-H. Lee, “A hybrid HMM/DNN approach to keyword spotting of short words,” in Proceedings of INTERSPEECH 2013 – 14th Annual Conference of the International Speech Communication Association, August 25-29, Lyon, France, 2013, pp. 1574–1578.
[22] G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks,” in Proceedings of ICASSP 2014 – 39th IEEE International Conference on Acoustics, Speech and Signal Processing, May 4-9, Florence, Italy, 2014, pp. 4087–4091.
[23] M. Sun, V. Nagaraja, B. Hoffmeister, and S. Vitaladevuni, “Model shrinking for embedded keyword spotting,” in Proceedings of ICMLA 2015 – 14th IEEE International Conference on Machine Learning and Applications, December 9-11, Miami, USA, 2015, pp. 369–374.
[24] S. Panchapagesan, M. Sun, A. Khare, S. Matsoukas, A. Mandal, B. Hoffmeister, and S. Vitaladevuni, “Multi-task learning and weighted cross-entropy for DNN-based keyword spotting,” in Proceedings of INTERSPEECH 2016 – 17th Annual Conference of the International Speech Communication Association, September 8-12, San Francisco, USA, 2016, pp. 760–764.
[25] A. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Transactions on Information Theory, vol. 13, pp. 260–269, 1967.
[26] X. Wang, S. Sun, C. Shan, J. Hou, L. Xie, S. Li, and X. Lei, “Adversarial examples for improving end-to-end attention-based small-footprint keyword spotting,” in Proceedings of ICASSP 2019 – 44th IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-17, Brighton, UK, 2019, pp. 6366–6370.
[27] B. D. Scott and M. E. Rafn, “Suspending noise cancellation using keyword spotting,” U.S. Patent 9 398 367, Jul. 19, 2016. [Online]. Available: https://www.google.com/patents/US9398367B1
[28] T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2015
– 16th Annual Conference of the International Speech Communication [48] M. Zeng and N. Xiao, “Effective combination of DenseNet and BiLSTM
Association, September 6-10, Dresden, Germany, 2015, pp. 1478–1482. for keyword spotting,” IEEE Access, vol. 7, pp. 10 767–10 775, 2019.
[29] M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, [49] X. Chen, S. Yin, D. Song, P. Ouyang, L. Liu, and S. Wei, “Small-footprint
S. Matsoukas, N. Strom, and S. Vitaladevuni, “Max-pooling loss training keyword spotting with graph convolutional network,” in Proceedings of
of long short-term memory networks for small-footprint keyword spot- ASRU 2019 – IEEE Automatic Speech Recognition and Understanding
ting,” in Proceedings of SLT 2016 – IEEE Spoken Language Technology Workshop, December 14-18, Singapore, Singapore, 2019, pp. 539–546.
Workshop, December 13-16, San Diego, USA, 2016, pp. 474–480. [50] S. Choi, S. Seo, B. Shin, H. Byun, M. Kersner, B. Kim, D. Kim, and
[30] R. Tang and J. Lin, “Deep residual learning for small-footprint keyword S. Ha, “Temporal convolution for real-time keyword spotting on mobile
spotting,” in Proceedings of ICASSP 2018 – 43rd IEEE International devices,” in Proceedings of INTERSPEECH 2019 – 20th Annual Confer-
Conference on Acoustics, Speech and Signal Processing, April 15-20, ence of the International Speech Communication Association, September
Calgary, Canada, 2018, pp. 5484–5488. 15-19, Graz, Austria, 2019, pp. 3372–3376.
[31] R. Alvarez and H.-J. Park, “End-to-end streaming keyword spotting,” [51] M. Xu and X.-L. Zhang, “Depthwise separable convolutional ResNet
in Proceedings of ICASSP 2019 – 44th IEEE International Conference with squeeze-and-excitation blocks for small-footprint keyword spot-
on Acoustics, Speech and Signal Processing, May 12-17, Brighton, UK, ting,” in Proceedings of INTERSPEECH 2020 – 21st Annual Conference
2019, pp. 6336–6340. of the International Speech Communication Association, October 25-29,
[32] O. Rybakov, N. Kononenko, N. Subrahmanya, M. Visontai, and S. Lau- Shanghai, China, 2020, pp. 2547–2551.
renzo, “Streaming keyword spotting on mobile devices,” in Proceedings [52] X. Li, X. Wei, and X. Qin, “Small-footprint keyword spotting with multi-
of INTERSPEECH 2020 – 21st Annual Conference of the International scale temporal convolution,” in Proceedings of INTERSPEECH 2020
Speech Communication Association, October 25-29, Shanghai, China, – 21st Annual Conference of the International Speech Communication
2020, pp. 2277–2281. Association, October 25-29, Shanghai, China, 2020, pp. 1987–1991.
[33] B. K. Deka and P. Das, “A review of keyword spotting as an audio [53] E. Yılmaz, Özgür Bora Gevrek, J. Wu, Y. Chen, X. Meng, and H. Li,
mining technique,” International Journal of Computer Sciences and “Deep convolutional spiking neural networks for keyword spotting,”
Engineering, vol. 7, pp. 757–769, 2019. in Proceedings of INTERSPEECH 2020 – 21st Annual Conference of
[34] S. Tabibian, “A survey on structured discriminative spoken keyword the International Speech Communication Association, October 25-29,
spotting,” Artificial Intelligence Review, vol. 53, pp. 2483–2520, 2020. Shanghai, China, 2020, pp. 2557–2561.
[35] L. Mary and D. G, Searching Speech Databases: Features, Techniques [54] Z.-H. Tan, A. kr. Sarkar, and N. Dehak, “rVAD: An unsupervised
and Evaluation Measures, A. Neustein, Ed. Switzerland: Springer, 2018. segment-based robust voice activity detection method,” Computer Speech
[36] A. Mandal, K. R. P. Kumar, and P. Mitra, “Recent developments in spoken & Language, vol. 59, pp. 1–21, 2020.
term detection: a survey,” International Journal of Speech Technology, [55] L. Lugosch and S. Myer, “DONUT: CTC-based query-by-example key-
vol. 17, pp. 183–198, 2013. word spotting,” in Proceedings of NIPS 2018 – 32nd Annual Conference
[37] D. Yu and J. Li, “Recent progresses in deep learning based acoustic on Neural Information Processing Systems, December 2-8, Montreal,
models,” IEEE/CAA Journal of Automatica Sinica, vol. 4, pp. 396–409, Canada, 2018, pp. 1–9.
2017. [56] Y. Yuan, Z. Lv, S. Huang, and L. Xie, “Verifying deep keyword spotting
[38] D. Wang, X. Wang, and S. Lv, “An overview of end-to-end automatic detection with acoustic word embeddings,” in Proceedings of ASRU 2019
speech recognition,” MDPI Symmetry, vol. 11, pp. 1–27, 2019. – IEEE Automatic Speech Recognition and Understanding Workshop,
[39] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, “Mining December 14-18, Singapore, Singapore, 2019, pp. 613–620.
effective negative training samples for keyword spotting,” in Proceedings [57] C. Yang, X. Wen, and L. Song, “Multi-scale convolution for robust
of ICASSP 2020 – 45th IEEE International Conference on Acoustics, keyword spotting,” in Proceedings of INTERSPEECH 2020 – 21st Annual
Speech and Signal Processing, May 4-8, Barcelona, Spain, 2020, pp. Conference of the International Speech Communication Association,
7444–7448. October 25-29, Shanghai, China, 2020, pp. 2577–2581.
[40] J. Wang, Y. He, C. Zhao, Q. Shao, W.-W. Tu, T. Ko, H. yi Lee, and [58] S. Myer and V. S. Tomar, “Efficient keyword spotting using time delay
L. Xie, “Auto-KWS 2021 Challenge: Task, datasets, and baselines,” in neural networks,” in Proceedings of INTERSPEECH 2018 – 19th Annual
Proceedings of INTERSPEECH 2021 – 22nd Annual Conference of the Conference of the International Speech Communication Association,
International Speech Communication Association, August 30-September September 2-6, Hyderabad, India, 2018, pp. 1264–1268.
3, Brno, Czechia, 2021, pp. 4244–4248. [59] T. Higuchi, M. Ghasemzadeh, K. You, and C. Dhir, “Stacked 1D convo-
[41] B. U. Pedroni, S. Sheik, H. Mostafa, S. Paul, C. Augustine, and lutional networks for end-to-end small footprint voice trigger detection,”
G. Cauwenberghs, “Small-footprint spiking neural networks for power- in Proceedings of INTERSPEECH 2020 – 21st Annual Conference of
efficient keyword spotting,” in Proceedings of BioCAS 2018 – IEEE the International Speech Communication Association, October 25-29,
Biomedical Circuits and Systems Conference, October 17-19, Cleveland, Shanghai, China, 2020, pp. 2592–2596.
USA, 2018. [60] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin, and I. Mc-
[42] B. Liu, S. Nie, Y. Zhang, S. Liang, Z. Yang, and W. Liu, “Focal loss Graw, “Streaming small-footprint keyword spotting using sequence-to-
and double-edge-triggered detector for robust small-footprint keyword sequence models,” in Proceedings of ASRU 2017 – IEEE Automatic
spotting,” in Proceedings of ICASSP 2019 – 44th IEEE International Speech Recognition and Understanding Workshop, December 16-20,
Conference on Acoustics, Speech and Signal Processing, May 12-17, Okinawa, Japan, 2017, pp. 474–481.
Brighton, UK, 2019, pp. 6361–6365. [61] X. Xuan, M. Wang, X. Zhang, and F. Sun, “Robust small-footprint
[43] P. M. Sørensen, B. Epp, and T. May, “A depthwise separable convolu- keyword spotting using sequence-to-sequence model with connectionist
tional neural network for keyword spotting on an embedded system,” temporal classifier,” in Proceedings of ICICSP 2019 – 2nd Interna-
EURASIP Journal on Audio, Speech, and Music Processing, vol. 10, pp. tional Conference on Information Communication and Signal Process-
1–14, 2020. ing, September 28-30, Weihai, China, 2019, pp. 400–404.
[44] I. López-Espejo, “Robust speech recognition on intelligent mobile de- [62] E. Sharma, G. Ye, W. Wei, R. Zhao, Y. Tian, J. Wu, L. He, E. Lin, and
vices with dual-microphone,” Ph.D. dissertation, University of Granada, Y. Gong, “Adaptation of RNN transducer with text-to-speech technology
2017. for keyword spotting,” in Proceedings of ICASSP 2020 – 45th IEEE
[45] X. Wang, S. Sun, and L. Xie, “Virtual adversarial training for DS-CNN International Conference on Acoustics, Speech and Signal Processing,
based small-footprint keyword spotting,” in Proceedings of ASRU 2019 May 4-8, Barcelona, Spain, 2020, pp. 7484–7488.
– IEEE Automatic Speech Recognition and Understanding Workshop, [63] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist
December 14-18, Singapore, Singapore, 2019, pp. 607–612. temporal classification: labelling unsegmented sequence data with recur-
[46] S. Fernández, A. Graves, and J. Schmidhuber, “An application of recur- rent neural networks,” in Proceedings of ICML 2006 – 23rd International
rent neural networks to discriminative keyword spotting,” in Proceedings Conference on Machine Learning, June 25-29, Pittsburgh, USA, 2006,
of ICANN 2007 – 17th International Conference on Artificial Neural pp. 369–376.
Networks, September 9-13, Porto, Portugal, 2007, pp. 220–229. [64] A. Graves, “Sequence transduction with recurrent neural networks,” in
[47] E. A. Ibrahim, J. Huisken, H. Fatemi, and J. P. de Gyvez, “Keyword spot- Proceedings of ICML 2012 – 29th International Conference on Machine
ting using time-domain features in a temporal convolutional network,” Learning, June 26-July 1, Edinburgh, Scotland, 2012.
in Proceedings of DSD 2019 – 22nd Euromicro Conference on Digital [65] A. Graves, A. rahman Mohamed, and G. Hinton, “Speech recognition
System Design, August 28-30, Kallithea, Greece, 2019, pp. 313–319. with deep recurrent neural networks,” in Proceedings of ICASSP 2013
– 38th IEEE International Conference on Acoustics, Speech and Signal [85] T. Fuchs and J. Keshet, “Spoken term detection automatically adjusted for
Processing, May 26-31, Vancouver, Canada, 2013, pp. 6645–6649. a given threshold,” IEEE Journal of Selected Topics in Signal Processing,
[66] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, vol. 11, pp. 1310–1317, 2017.
2006. [86] R. Tang, W. Wang, Z. Tu, and J. Lin, “An experimental analysis of
[67] X. Du, M. Zhu, M. Chai, and X. Shi, “End to end model for keyword the power consumption of convolutional neural networks for keyword
spotting with trainable window function and Densenet,” in Proceedings spotting,” in Proceedings of ICASSP 2018 – 43rd IEEE International
of DSP 2018 – 23rd IEEE International Conference on Digital Signal Conference on Acoustics, Speech and Signal Processing, April 15-20,
Processing, November 19-21, Shanghai, China, 2018. Calgary, Canada, 2018, pp. 5479–5483.
[68] M. Lee, J. Lee, H. J. Jang, B. Kim, W. Chang, and K. Hwang, “Or- [87] M. Muhsinzoda, C. C. Corona, D. A. Pelta, and J. L. Verdegay, “Ac-
thogonality constrained multi-head attention for keyword spotting,” in tivating accessible pedestrian signals by voice using keyword spotting
Proceedings of ASRU 2019 – IEEE Automatic Speech Recognition and systems,” in Proceedings of ISC2 2019 – IEEE International Smart Cities
Understanding Workshop, December 14-18, Singapore, Singapore, 2019, Conference, October 14-17, Casablanca, Morocco, 2019, pp. 531–534.
pp. 86–92. [88] B. Pattanayak, J. K. Rout, and G. Pradhan, “Adaptive spectral smoothen-
[69] A. Riviello and J.-P. David, “Binary speech features for keyword spotting ing for development of robust keyword spotting system,” IET Signal
tasks,” in Proceedings of INTERSPEECH 2019 – 20th Annual Conference Processing, vol. 13, pp. 544–550, 2019.
of the International Speech Communication Association, September 15- [89] Y. Chen, T. Ko, L. Shang, X. Chen, X. Jiang, and Q. Li, “An investigation
19, Graz, Austria, 2019, pp. 3460–3464. of few-shot learning in spoken term classification,” in Proceedings of
[70] S. Mittermaier, L. Kürzinger, B. Waschneck, and G. Rigoll, “Small- INTERSPEECH 2020 – 21st Annual Conference of the International
footprint keyword spotting on raw audio data with sinc-convolutions,” Speech Communication Association, October 25-29, Shanghai, China,
in Proceedings of ICASSP 2020 – 45th IEEE International Conference 2020, pp. 2582–2586.
on Acoustics, Speech and Signal Processing, May 4-8, Barcelona, Spain, [90] A. Berg, M. O’Connor, and M. T. Cruz, “Keyword Transformer: A self-
2020, pp. 7454–7458. attention model for keyword spotting,” in Proceedings of INTERSPEECH
[71] H.-J. Park, P. Violette, and N. Subrahmanya, “Learning to detect keyword 2021 – 22nd Annual Conference of the International Speech Communica-
parts and whole by smoothed max pooling,” in Proceedings of ICASSP tion Association, August 30-September 3, Brno, Czechia, 2021, pp. 4249–
2020 – 45th IEEE International Conference on Acoustics, Speech and 4253.
Signal Processing, May 4-8, Barcelona, Spain, 2020, pp. 7899–7903. [91] L. Wang, R. Gu, N. Chen, and Y. Zou, “Text anchor based metric learning
[72] H. Wu, Y. Jia, Y. Nie, and M. Li, “Domain aware training for far-field for small-footprint keyword spotting,” in Proceedings of INTERSPEECH
small-footprint keyword spotting,” in Proceedings of INTERSPEECH 2021 – 22nd Annual Conference of the International Speech Communica-
2020 – 21st Annual Conference of the International Speech Communica- tion Association, August 30-September 3, Brno, Czechia, 2021, pp. 4219–
tion Association, October 25-29, Shanghai, China, 2020, pp. 2562–2566. 4223.
[73] J. S. Bridle, “Probabilistic interpretation of feedforward classification [92] S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, New Era for
network outputs, with relationships to statistical pattern recognition,” Robust Speech Recognition. Springer, 2017.
Neurocomputing, pp. 227–236, 1990. [93] S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner,
R. Prenger, and A. Coates, “Convolutional recurrent neural networks
[74] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
for small-footprint keyword spotting,” in Proceedings of INTERSPEECH
2016, http://www.deeplearningbook.org.
2017 – 18th Annual Conference of the International Speech Communi-
[75] Y. Zhang, N. Suda, L. Lai, and V. Chandra, “Hello edge: Keyword
cation Association, August 20-24, Stockholm, Sweden, 2017, pp. 1606–
spotting on microcontrollers,” arXiv:1711.07128v3, 2018.
1610.
[76] R. Kumar, V. Yeruva, and S. Ganapathy, “On convolutional LSTM mod- [94] Y. A. Huang, T. Z. Shabestary, and A. Gruenstein, “Hotword Cleaner:
eling for joint wake-word detection and text dependent speaker verifica- Dual-microphone adaptive noise cancellation with deferred filter coef-
tion,” in Proceedings of INTERSPEECH 2018 – 19th Annual Conference ficients for robust keyword spotting,” in Proceedings of ICASSP 2019
of the International Speech Communication Association, September 2-6, – 44th IEEE International Conference on Acoustics, Speech and Signal
Hyderabad, India, 2018, pp. 1121–1125. Processing, May 12-17, Brighton, UK, 2019, pp. 6346–6350.
[77] Y. Tan, K. Zheng, and L. Lei, “An in-vehicle keyword spotting system [95] H. Mazzawi, X. Gonzalvo, A. Kracun, P. Sridhar, N. Subrahmanya,
with multi-source fusion for vehicle applications,” in Proceedings of I. L. Moreno, H. J. Park, and P. Violette, “Improving keyword spotting
WCNC 2019 – IEEE Wireless Communications and Networking Confer- and language identification via neural architecture search at scale,” in
ence, April 15-18, Marrakesh, Morocco, 2019. Proceedings of INTERSPEECH 2019 – 20th Annual Conference of the In-
[78] A. Coucke, M. Chlieh, T. Gisselbrecht, D. Leroy, M. Poumeyrol, and ternational Speech Communication Association, September 15-19, Graz,
T. Lavril, “Efficient keyword spotting using dilated convolutions and Austria, 2019, pp. 1278–1282.
gating,” in Proceedings of ICASSP 2019 – 44th IEEE International [96] M. Yu, X. Ji, B. Wu, D. Su, and D. Yu, “End-to-end multi-look keyword
Conference on Acoustics, Speech and Signal Processing, May 12-17, spotting,” in Proceedings of INTERSPEECH 2020 – 21st Annual Confer-
Brighton, UK, 2019, pp. 6351–6355. ence of the International Speech Communication Association, October
[79] Y. Gao, N. D. Stein, C.-C. Kao, Y. Cai, M. Sun, T. Zhang, and S. Vita- 25-29, Shanghai, China, 2020, pp. 66–70.
ladevuni, “On front-end gain invariant modeling for wake word spotting,” [97] X. Ji, M. Yu, J. Chen, J. Zheng, D. Su, and D. Yu, “Integration of multi-
in Proceedings of INTERSPEECH 2020 – 21st Annual Conference of look beamformers for multi-channel keyword spotting,” in Proceedings
the International Speech Communication Association, October 25-29, of ICASSP 2020 – 45th IEEE International Conference on Acoustics,
Shanghai, China, 2020, pp. 991–995. Speech and Signal Processing, May 4-8, Barcelona, Spain, 2020, pp.
[80] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the measure- 7464–7468.
ment of the psychological magnitude pitch,” Journal of the Acoustical [98] H. Yan, Q. He, and W. Xie, “CRNN-CTC based Mandarin keywords
Society of America, vol. 8, pp. 185–190, 1937. spotting,” in Proceedings of ICASSP 2020 – 45th IEEE International Con-
[81] S. Davis and P. Mermelstein, “Comparison of parametric representations ference on Acoustics, Speech and Signal Processing, May 4-8, Barcelona,
for monosyllabic word recognition in continuously spoken sentences,” Spain, 2020, pp. 7489–7493.
IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, [99] P. Zhang and X. Zhang, “Deep template matching for small-footprint and
pp. 357–366, 1980. configurable keyword spotting,” in Proceedings of INTERSPEECH 2020
[82] I. López-Espejo, Z.-H. Tan, and J. Jensen, “Exploring filterbank learning – 21st Annual Conference of the International Speech Communication
for keyword spotting,” in Proceedings of EUSIPCO 2020 – 28th European Association, October 25-29, Shanghai, China, 2020, pp. 2572–2576.
Signal Processing Conference, January 18-21, Amsterdam, Netherlands, [100] B. Kim, S. Chang, J. Lee, and D. Sung, “Broadcasted residual learning
2021, pp. 331–335. for efficient keyword spotting,” in Proceedings of INTERSPEECH 2021
[83] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient Back- – 22nd Annual Conference of the International Speech Communication
Prop,” in Neural Networks: Tricks of the Trade. Springer, 2012, vol. Association, August 30-September 3, Brno, Czechia, 2021, pp. 4538–
7700, pp. 9–48. 4542.
[84] M. Wöllmer, B. Schuller, and G. Rigoll, “Keyword spotting exploiting [101] Y. Tian, H. Yao, M. Cai, Y. Liu, and Z. Ma, “Improving RNN trans-
long short-term memory,” Speech Communication, vol. 55, pp. 252–265, ducer modeling for small-footprint keyword spotting,” in Proceedings
2013. of ICASSP 2021 – 46th IEEE International Conference on Acoustics,
Speech and Signal Processing, June 6-11, Toronto, Canada, 2021, pp. Annual Conference of the International Speech Communication Associa-
5624–5628. tion, September 2-6, Hyderabad, India, 2018, pp. 117–121.
[102] J. Hou, L. Xie, and Z. Fu, “Investigating neural network based query- [118] E.-T. Albert, C. Lemnaru, M. Dinsoreanu, and R. Potolea, “Keyword
by-example keyword spotting approach for personalized wake-up word spotting using dynamic time warping and convolutional recurrent net-
detection in Mandarin Chinese,” in Proceedings of ISCSLP 2016 – works,” in Proceedings of ICCP 2019 – 15th International Conference
10th International Symposium on Chinese Spoken Language Processing, on Intelligent Computer Communication and Processing, September 5-7,
October 17-20, Tianjin, China, 2016. Cluj-Napoca, Romania, 2019, pp. 53–60.
[103] N. Sacchi, A. Nanchen, M. Jaggi, and M. Cernak, “Open-vocabulary [119] P. Nakkiran, R. Alvarez, R. Prabhavalkar, and C. Parada, “Compressing
keyword spotting with audio and text embeddings,” in Proceedings of deep neural networks using a rank-constrained topology,” in Proceedings
INTERSPEECH 2019 – 20th Annual Conference of the International of INTERSPEECH 2015 – 16th Annual Conference of the International
Speech Communication Association, September 15-19, Graz, Austria, Speech Communication Association, September 6-10, Dresden, Ger-
2019, pp. 3362–3366. many, 2015, pp. 1473–1477.
[104] J. Huang, W. Gharbieh, H. S. Shim, and E. Kim, “Query-by-example [120] H. Mostafa, “Supervised learning based on temporal coding in spiking
keyword spotting system using multi-head attention and soft-triple loss,” neural networks,” IEEE Transactions on Neural Networks and Learning
in Proceedings of ICASSP 2021 – 46th IEEE International Conference on Systems, vol. 29, pp. 3227–3235, 2018.
Acoustics, Speech and Signal Processing, June 6-11, Toronto, Canada, [121] G. Tucker, M. Wu, M. Sun, S. Panchapagesan, G. Fu, and S. Vitaladevuni,
2021, pp. 6858–6862. “Model compression applied to small-footprint keyword spotting,” in
[105] A. Singhal, “Modern information retrieval: A brief overview,” Bulletin of Proceedings of INTERSPEECH 2016 – 17th Annual Conference of the
the IEEE Computer Society Technical Committee on Data Engineering, International Speech Communication Association, September 8-12, San
vol. 24, pp. 35–43, 2001. Francisco, USA, 2016, pp. 1878–1882.
[106] C. Parada, A. Sethy, and B. Ramabhadran, “Query-by-example spoken [122] Y. Huang, T. Hughes, T. Z. Shabestary, and T. Applebaum, “Supervised
term detection for OOV terms,” in Proceedings of ASRU 2009 – IEEE noise reduction for multichannel keyword spotting,” in Proceedings of
Automatic Speech Recognition and Understanding Workshop, December ICASSP 2018 – 43rd IEEE International Conference on Acoustics, Speech
13-17, Moreno, Italy, 2009, pp. 404–409. and Signal Processing, April 15-20, Calgary, Canada, 2018, pp. 5474–
[107] K. Levin, K. Henry, A. Jansen, and K. Livescu, “Fixed-dimensional 5478.
acoustic embeddings of variable-length segments in low-resource set- [123] R. Menon, H. Kamper, J. Quinn, and T. Niesler, “Fast ASR-free and
tings,” in Proceedings of ASRU 2013 – IEEE Automatic Speech Recog- almost zero-resource keyword spotting using DTW and CNNs for hu-
nition and Understanding Workshop, December 8-12, Olomouc, Czech manitarian monitoring,” in Proceedings of INTERSPEECH 2018 – 19th
Republic, 2013, pp. 410–415. Annual Conference of the International Speech Communication Associa-
[108] Y. Mishchenko, Y. Goren, M. Sun, C. Beauchene, S. Matsoukas, O. Ry- tion, September 2-6, Hyderabad, India, 2018, pp. 2608–2612.
bakov, and S. N. P. Vitaladevuni, “Low-bit quantization and quantization- [124] H. Liu, A. Abhyankar, Y. Mishchenko, T. Sénéchal, G. Fu, B. Kulis,
aware training for small-footprint keyword spotting,” in Proceedings of N. Stein, A. Shah, and S. N. P. Vitaladevuni, “Metadata-aware end-to-end
ICMLA 2019 – 18th IEEE International Conference on Machine Learning keyword spotting,” in Proceedings of INTERSPEECH 2020 – 21st Annual
and Applications, December 16-19, Boca Raton, USA, 2019, pp. 706– Conference of the International Speech Communication Association,
711. October 25-29, Shanghai, China, 2020, pp. 2282–2286.
[109] B. Liu, Y. Sun, and B. Liu, “Translational bit-by-bit multi-bit quantization [125] T. Mo, Y. Yu, M. Salameh, D. Niu, and S. Jui, “Neural architecture
for CRNN on keyword spotting,” in Proceedings of CyberC 2019 – search for keyword spotting,” in Proceedings of INTERSPEECH 2020
International Conference on Cyber-Enabled Distributed Computing and – 21st Annual Conference of the International Speech Communication
Knowledge Discovery, October 17-19, Guilin, China, 2019, pp. 444–451. Association, October 25-29, Shanghai, China, 2020, pp. 1982–1986.
[110] J.-W. Jung, H.-S. Heo, I.-H. Yang, H.-J. Shim, and H.-J. Yun, “A complete [126] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
end-to-end speaker verification system using deep neural networks: From recognition,” in Proceedings of CVPR 2016 – Conference on Computer
raw signals to verification result,” in Proceedings of ICASSP 2018 – Vision and Pattern Recognition, June 26-July 1, Las Vegas, USA, 2016,
43rd IEEE International Conference on Acoustics, Speech and Signal pp. 770–778.
Processing, April 15-20, Calgary, Canada, 2018, pp. 5349–5353. [127] T. N. Kipf and M. Welling, “Semi-supervised classification with graph
[111] H. Muckenhirn, M. Magimai.-Doss, and S. Marcel, “Towards directly convolutional networks,” in Proceedings of ICLR 2017 – 5th International
modeling raw speech signal for speaker verification using CNNs,” in Conference on Learning Representations, April 24-26, Toulon, France,
Proceedings of ICASSP 2018 – 43rd IEEE International Conference on 2017.
Acoustics, Speech and Signal Processing, April 15-20, Calgary, Canada, [128] I. López-Espejo, Z.-H. Tan, and J. Jensen, “Keyword spotting for hearing
2018, pp. 4884–4888. assistive devices robust to external speakers,” in Proceedings of INTER-
[112] M. Ravanelli and Y. Bengio, “Speaker recognition from raw waveform SPEECH 2019 – 20th Annual Conference of the International Speech
with SincNet,” in Proceedings of SLT 2018 – IEEE Spoken Language Communication Association, September 15-19, Graz, Austria, 2019, pp.
Technology Workshop, December 18-21, Athens, Greece, 2018, pp. 1021– 3223–3227.
1028. [129] ——, “Improved external speaker-robust keyword spotting for hearing
[113] T. Irino and M. Unoki, “An analysis/synthesis auditory filterbank based assistive devices,” IEEE/ACM Transactions on Audio, Speech, and Lan-
on an IIR implementation of the gammachirp,” Journal of the Acoustical guage Processing, vol. 28, pp. 1233–1247, 2020.
Society of Japan, vol. 20, pp. 397–406, 1999. [130] ——, “A novel loss function and training strategy for noise-robust
[114] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, keyword spotting,” IEEE/ACM Transactions on Audio, Speech, and Lan-
“Learning the speech front-end with raw waveform CLDNNs,” in Pro- guage Processing, 2021.
ceedings of INTERSPEECH 2015 – 16th Annual Conference of the Inter- [131] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely
national Speech Communication Association, September 6-10, Dresden, connected convolutional networks,” in Proceedings of CVPR 2017 –
Germany, 2015, pp. 1–5. Conference on Computer Vision and Pattern Recognition, July 21-26,
[115] H. Seki, K. Yamamoto, and S. Nakagawa, “A deep neural network inte- Honolulu, USA, 2017, pp. 4700–4708.
grated with filterbank learning for speech recognition,” in Proceedings [132] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
of ICASSP 2017 – 42nd IEEE International Conference on Acoustics, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu,
Speech and Signal Processing, March 5-9, New Orleans, USA, 2017, pp. “WaveNet: A generative model for raw audio,” in Proceedings of SSW
5480–5484. 2016 – The ISCA Speech Synthesis Workshop, September 13-15, Sunny-
[116] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, vale, USA, 2016, p. 125.
“End-to-end speech recognition from the raw waveform,” in Proceedings [133] D. C. de Andrade, S. Leo, M. L. D. S. Viana, and C. Bernkopf, “A neural
of INTERSPEECH 2018 – 19th Annual Conference of the International attention model for speech command recognition,” arXiv:1808.08929v1,
Speech Communication Association, September 2-6, Hyderabad, India, 2018.
2018, pp. 781–785. [134] H. Zhou, W. Hu, Y. T. Yeung, and X. Chen, “Energy-friendly keyword
[117] R. Shankar, C. Vikram, and S. Prasanna, “Spoken keyword detection spotting system using add-based convolution,” in Proceedings of INTER-
using joint DTW-CNN,” in Proceedings of INTERSPEECH 2018 – 19th SPEECH 2021 – 22nd Annual Conference of the International Speech
Communication Association, August 30-September 3, Brno, Czechia, [152] S. An, Y. Kim, H. Xu, J. Lee, M. Lee, and I. Oh, “Robust keyword
2021, pp. 4234–4238. spotting via recycle-pooling for mobile game,” in Proceedings of IN-
[135] H. Chen, Y. Wang, C. Xu, B. Shi, C. Xu, Q. Tian, and C. Xu, “AdderNet: TERSPEECH 2019 – 20th Annual Conference of the International Speech
Do we really need multiplications in deep learning?” in Proceedings of Communication Association, September 15-19, Graz, Austria, 2019, pp.
CVPR 2020 – Conference on Computer Vision and Pattern Recognition, 3661–3662.
June 14-19, Virtual, 2020, pp. 1468–1477. [153] P. Warden. (2017) Launching the Speech Commands
[136] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, Dataset. [Online]. Available: https://ai.googleblog.com/2017/08/
M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural launching-speech-commands-dataset.html
networks for mobile vision applications,” arXiv:1704.04861v1, 2017. [154] ——, “Speech Commands: A dataset for limited-vocabulary speech
[137] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural recognition,” arXiv:1804.03209v1, 2018.
Computation, vol. 9, pp. 1735–1780, 1997. [155] J. P. A. Pérez, S. Celma, and B. C. López, Automatic Gain Control:
[138] H. Sundar, J. F. Lehman, and R. Singh, “Keyword spotting in multi-player Techniques and Architectures for RF Receivers. Springer, 2011.
voice driven games for children,” in Proceedings of INTERSPEECH 2015 [156] Y. Wang, P. Getreuer, T. Hughes, R. F. Lyon, and R. A. Saurous, “Train-
– 16th Annual Conference of the International Speech Communication able frontend for robust and far-field keyword spotting,” in Proceedings
Association, September 6-10, Dresden, Germany, 2015, pp. 1660–1664. of ICASSP 2017 – 42nd IEEE International Conference on Acoustics,
[139] Y. Bai, J. Yi, H. Ni, Z. Wen, B. Liu, Y. Li, and J. Tao, “End-to- Speech and Signal Processing, March 5-9, New Orleans, USA, 2017, pp.
end keywords spotting based on connectionist temporal classification 5670–5674.
for Mandarin,” in Proceedings of ISCSLP 2016 – 10th International [157] Y. Gu, Z. Du, H. Zhang, and X. Zhang, “A monaural speech
Symposium on Chinese Spoken Language Processing, October 17-20, enhancement method for robust small-footprint keyword spotting,”
Tianjin, China, 2016. arXiv:1906.08415v1, 2019.
[140] E. Ceolini, J. Anumula, S. Braun, and S.-C. Liu, “Event-driven pipeline [158] T. Menne, R. Schlüter, and H. Ney, “Investigation into joint optimization
for low-latency low-compute keyword spotting and speaker verification of single channel speech enhancement and acoustic modeling for robust
IVÁN LÓPEZ-ESPEJO received the M.Sc. degree in Telecommunications Engineering, the M.Sc. degree in Electronics Engineering and the Ph.D. degree in Information and Communications Technology, all from the University of Granada, Granada (Spain), in 2011, 2013 and 2017, respectively. In 2018, he was the leader of the speech technology team of Veridas, Pamplona (Spain). Since 2019, he is a post-doctoral researcher at the section for Artificial Intelligence and Sound at the Department of Electronic Systems of Aalborg University, Aalborg (Denmark). His research interests include speech enhancement and robust speech recognition, multi-channel speech processing, and speaker verification.
ZHENG-HUA TAN (M’00–SM’06) received the B.Sc. and M.Sc. degrees in electrical engineering from Hunan University, Changsha, China, in 1990 and 1996, respectively, and the Ph.D. degree in electronic engineering from Shanghai Jiao Tong University (SJTU), Shanghai, China, in 1999. He is a Professor in the Department of Electronic Systems and a Co-Head of the Centre for Acoustic Signal Processing Research at Aalborg University, Aalborg, Denmark. He was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, USA, an Associate Professor at SJTU, Shanghai, China, and a postdoctoral fellow at KAIST, Daejeon, Korea. His research interests include machine learning, deep learning, pattern recognition, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics. He has (co)-authored over 200 refereed publications. He is the Chair of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee (MLSP TC). He is an Associate Editor for the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING. He has served as an Editorial Board Member for Computer Speech and Language and was a Guest Editor for the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING and Neurocomputing. He was the General Chair for IEEE MLSP 2018 and a TPC Co-Chair for IEEE SLT 2016.
JOHN H. L. HANSEN (Fellow, IEEE) received the B.S.E.E. degree from the College of Engineering, Rutgers University, New Brunswick, NJ, USA, and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, GA, USA. In 2005, he joined the Erik Jonsson School of Engineering and Computer Science, the University of Texas at Dallas, Richardson, TX, USA, where he is currently an Associate Dean for research and a Professor of electrical and computer engineering. He also holds the Distinguished University Chair in telecommunications engineering and a joint appointment as a Professor of speech and hearing with the School of Behavioral and Brain Sciences. From 2005 to 2012, he was the Head of the Department of Electrical Engineering, the University of Texas at Dallas. At UT Dallas, he established the Center for Robust Speech Systems. From 1998 to 2005, he was the Department Chair and a Professor of speech, language, and hearing sciences, and a Professor of electrical and computer engineering with the University of Colorado Boulder, Boulder, CO, USA, where he co-founded and was an Associate Director of the Center for Spoken Language Research. In 1988, he established the Robust Speech Processing Laboratory. He has supervised 92 Ph.D. or M.S. thesis students, which include 51 Ph.D. and 41 M.S. or M.A. He has authored or coauthored 765 journal and conference papers including 13 textbooks in the field of speech processing and language technology, signal processing for vehicle systems, co-author of the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000), Vehicles, Drivers and Safety: Intelligent Vehicles and Transportation (vol. 2 De Gruyter, 2020), Digital Signal Processing for In-Vehicle Systems and Safety (Springer, 2012), and the lead author of The Impact of Speech Under ‘Stress’ on Military Speech Technology (NATO RTO-TR-10, 2000). His research interests include machine learning for speech and language processing, speech processing, analysis, and modeling of speech and speaker traits, speech enhancement, signal processing for hearing impaired or cochlear implants, machine learning-based knowledge estimation and extraction of naturalistic audio, and in-vehicle driver modeling and distraction assessment for human–machine interaction. He is an IEEE Fellow for contributions to robust speech recognition in stress and noise, and ISCA Fellow for contributions to research for speech processing of signals under adverse conditions. He was the recipient of Acoustical Society of America’s 25 Year Award in 2010, and is currently serving as ISCA President (2017–2022). He is also a Member and the past Vice-Chair on U.S. Office of Scientific Advisory Committees (OSAC) for OSAC-Speaker in the voice forensics domain from 2015 to 2021. He was the IEEE Technical Committee (TC) Chair and a Member of the IEEE Signal Processing Society: Speech-Language Processing Technical Committee (SLTC) from 2005 to 2008 and from 2010 to 2014, elected the IEEE SLTC Chairman from 2011 to 2013, and elected an ISCA Distinguished Lecturer from 2011 to 2012. He was a Member of the IEEE Signal Processing Society Educational Technical Committee from 2005 to 2010, a Technical Advisor to the U.S. Delegate for NATO (IST/TG-01), an IEEE Signal Processing Society Distinguished Lecturer from 2005 to 2006, an Associate Editor for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING from 1992 to 1999 and the IEEE SIGNAL PROCESSING LETTERS from 1998 to 2000, Editorial Board Member for the IEEE Signal Processing Magazine from 2001 to 2003, and the Guest Editor in October 1994 for Special Issue on Robust Speech Recognition for the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING. He is currently an Associate Editor for the JASA, and was on the Speech Communications Technical Committee for Acoustical Society of America from 2000 to 2003. In 2016, he was awarded the honorary degree Doctor Technices Honoris Causa from Aalborg University, Aalborg, Denmark in recognition of his contributions to the field of speech signal processing and speech or language or hearing sciences. He was the recipient of the 2020 Provost’s Award for Excellence in Graduate Student Supervision from the University of Texas at Dallas and the 2005 University of Colorado Teacher Recognition Award. He organized and was General Chair for ISCA Interspeech-2002, Co-Organizer and Technical Program Chair for the IEEE ICASSP-2010, Dallas, TX, and Co-Chair and Organizer for IEEE SLT-2014, Lake Tahoe, NV. He will be the Tech. Program Chair for the IEEE ICASSP-2024, and Co-Chair and Organizer for ISCA INTERSPEECH-2022.

JESPER JENSEN received the M.Sc. degree in electrical engineering and the Ph.D. degree in signal processing from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively. From 1996 to 2000, he was with the Center for Person Kommunikation (CPK), Aalborg University, as a Ph.D. student and Assistant Research Professor. From 2000 to 2007, he was a Post-Doctoral Researcher and Assistant Professor with Delft University of Technology, Delft, The Netherlands, and an External Associate Professor with Aalborg University. Currently, he is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, where his main responsibility is scouting and development of new signal processing concepts for hearing aid applications. He is a Professor with the Section for Artificial Intelligence and Sound (AIS), Department of Electronic Systems, at Aalborg University. He is also a co-founder of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University. His main interests are in the area of acoustic signal processing, including signal retrieval from noisy observations, coding, speech and audio modification and synthesis, intelligibility enhancement of speech signals, signal processing for hearing aid applications, and perceptual aspects of signal processing.