Recognizing Speech Commands Using Recurrent Neural Networks With Attention
by Douglas Coimbra de Andrade - Towards Data Science
Speech recognition has become an integral part of human-computer interfaces (HCIs). It is present in personal assistants such as Google Assistant, Microsoft Cortana, Amazon Alexa and Apple Siri, in self-driving car HCIs, and in activities where employees need to wear a lot of protective equipment (the oil and gas industry, for example).
Waveform, neural attention weights and mel-frequency spectrogram for word “one”. Neural attention helps
models focus on parts of the audio that really matter.
Much of this processing is done in the cloud, using powerful neural networks that have
been trained on enormous amounts of data. However, speech command models,
which are able to recognize a single word like “start”, “stop”, “left” or “right” (or “Hey
Google”, “Alexa”, “Echo”) usually run locally for a variety of reasons. First, it would
become expensive to continuously stream all audio acquired from a given device to the
cloud. Second, this would not even be possible in some applications: consider an operator on an offshore platform who wants to use simple voice commands to control an auxiliary robot; there may not be an internet connection available, or the latency could be too high.
In this work, the Google Speech Commands dataset was used. It contains short 1-second audio clips of various spoken words and is an excellent starting point to learn how to apply deep learning to speech.
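As a concrete starting point, each clip can be loaded as a plain 16 kHz waveform. A minimal sketch, assuming librosa and a hypothetical file path (not code from the article's repository):

```python
# Minimal sketch: load one 1-second Speech Commands clip as a 16 kHz waveform.
import librosa

wav_path = "speech_commands/right/0a7c2a8d_nohash_0.wav"  # hypothetical example file
audio, sr = librosa.load(wav_path, sr=16000)              # resample to 16 kHz mono
print(audio.shape)                                        # (16000,) for a full 1 s clip
```

Clips shorter than one second can simply be zero-padded to 16,000 samples before feature extraction.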
Voice formants reveal frequency regions of greater energy concentration. They are peaks of greater amplitude in the sound spectrum and are inherent to the configuration adopted by the vocal tract during vowel speech. When a word is spoken, formants are associated with the natural resonance frequencies of the vocal tract and depend on the position of the tongue relative to the inner structures and on lip movement. The system can be approximated by a tube with one closed extremity (the larynx) and one open extremity (the lips), modified by the movement of the tongue, lips and pharynx. The resonances that occur in the cavities of this tube are called formants.
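To make the closed-open tube picture concrete, a quarter-wave resonator has resonances at f_n = (2n - 1) * c / (4L). The back-of-the-envelope sketch below uses typical textbook values (not measurements from the article) and lands close to the formants of a neutral vowel:

```python
# Quarter-wave resonator approximation of the vocal tract: f_n = (2n - 1) * c / (4 * L).
c = 343.0   # speed of sound in air, m/s (typical value)
L = 0.17    # approximate vocal tract length of an adult male, m (typical value)

for n in range(1, 4):
    f = (2 * n - 1) * c / (4 * L)
    print(f"~F{n}: {f:.0f} Hz")   # ~504, ~1513, ~2522 Hz
```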
Interestingly, deep learning practitioners were quick to ignore all of this information. During his time at Baidu, researcher Andrew Ng went as far as saying that phonemes, the smallest components of sound, didn't matter. To a certain extent, what matters is that voice is mostly a (quasi-)periodic signal when the time intervals are small enough. This leads to the idea of ignoring the phase of the signal and using only its power spectrum as a source of information. The fact that sound can be reconstructed from its power spectrum (with the Griffin-Lim algorithm or a neural vocoder, for example) supports this simplification.
Diphthongs highlighted in the word “saia”. Top: phonetic notation. Middle: sound waveform. Bottom: spectrogram with the fundamental frequency highlighted in blue and the first three formants highlighted in red.
One of the design goals is to make the model explainable by computing importance weights over the input.
Recurrent neural network with attention mechanism. Numbers between [brackets] are tensor dimensions. raw_len is the WAV audio length (16,000 for 1 s clips sampled at 16 kHz). spec_len is the sequence length of the generated mel-scale spectrogram. nMel is the number of mel bands. nClasses is the number of desired classes. The activation of the last Dense layer is softmax. The activation of the 64- and 32-unit dense classification layers is the rectified linear unit (ReLU).
The proposed architecture uses convolutions to extract short-term dependencies and RNNs with attention to extract long-term dependencies. The implementation details can be found in this repository.
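For orientation, below is a simplified Keras sketch of this kind of architecture: a convolutional front-end for short-term features, bidirectional LSTMs, a dot-product attention step, and the 64/32-unit dense head. Layer sizes, the 1D convolution and the choice of query frame are illustrative assumptions; the repository linked above contains the actual implementation.

```python
# Simplified sketch of a conv + bidirectional-RNN + attention classifier (assumed sizes).
import tensorflow as tf
from tensorflow.keras import layers

n_mel, spec_len, n_classes = 80, 126, 22           # assumed dimensions

inp = layers.Input(shape=(spec_len, n_mel))        # [spec_len, nMel]
x = layers.Conv1D(32, 5, padding="same", activation="relu")(inp)   # short-term features
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)

# Attention: project one reference frame into a query, score every timestep
# against it, and take the softmax-weighted average of the RNN outputs.
query = layers.Dense(128)(x[:, spec_len // 2, :])  # middle frame as query
scores = layers.Dot(axes=[1, 2])([query, x])       # [spec_len] attention logits
weights = layers.Softmax(name="attention_weights")(scores)
context = layers.Dot(axes=[1, 1])([weights, x])    # weighted sum over time

h = layers.Dense(64, activation="relu")(context)
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(n_classes, activation="softmax")(h)

model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```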
On the Google Speech Commands dataset, the proposed model achieves over 94% accuracy when identifying one of 20 words, silence or unknown. In addition, the attention layer makes the model explainable. For example, in the case of the word “right”, note that the network puts a lot of attention on the transition from r to i. This is intuitive considering that the final t might not be audible in some cases. Note also that placing the attention weights on the transition does not mean the model ignores the other parts of the audio: the bidirectional RNN brings information from the past and from the future.
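The attention weights shown in the figures can be read directly from the softmax layer. A short sketch, assuming the model and the `log_mel` spectrogram from the earlier sketches (where the softmax layer was named "attention_weights"):

```python
# Sketch: expose the attention weights of the model above for visualization.
import tensorflow as tf

att_model = tf.keras.Model(model.input,
                           model.get_layer("attention_weights").output)
att = att_model.predict(log_mel.T[None, ...])   # shape (1, spec_len)
# Plot att[0] over time next to the spectrogram to see which frames
# (e.g. the r-to-i transition in "right") receive the most weight.
```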
Conclusion
Speech command recognition is present in a wide range of devices and used by many HCIs. In many situations, it is desirable to obtain lightweight, high-accuracy models that can run locally. This article described one possible RNN architecture that achieves state-of-the-art performance on the Google Speech Commands recognition task while keeping a small footprint in terms of trainable parameters. Source code is available on GitHub. It uses the Google Speech Commands dataset (v1 and v2) to demonstrate how to train models that are able to identify, for example, 20 commands plus silence or an unknown word.
The architecture is able to extract short- and long-term dependencies and uses an attention mechanism to pinpoint which region holds the most useful information, which is then fed to a sequence of dense layers.
In engineering applications, being able to explain which features were used to select a particular category is important. As shown above, the attention mechanism explains which parts of the audio are important for classification and matches the intuition that regions of vowel transitions are relevant for recognizing words. For completeness, confusion matrices can be found in this article and show that the word pairs tree-three and no-down are difficult to distinguish without extra context.
References
Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. https://arxiv.org/abs/1804.03209
Formant tuning strategies in professional male opera singers, Journal of Voice 27 (3) (2013) 278–288. https://www.sciencedirect.com/science/article/pii/S0892199712002093
Respiratory and acoustical differences between belt and neutral style of singing, Journal of Voice 29 (4) (2015) 418–5. https://www.sciencedirect.com/science/article/pii/S089219971400204
Attention Mechanisms