
Douglas Coimbra de Andrade · Published in Towards Data Science · Dec 27, 2018 · 6 min read

Recognizing Speech Commands Using Recurrent Neural Networks with Attention

How to implement an explainable neural model for speech applications

Voice spectrogram
Speech recognition has become an integral part of human-computer interfaces (HCI).
It is present everywhere from personal assistants like Google Assistant, Microsoft Cortana,
Amazon Alexa and Apple Siri to self-driving car HCIs and activities where employees
need to wear a lot of protective equipment (the oil and gas industry, for example).

Waveform, neural attention weights and mel-frequency spectrogram for word “one”. Neural attention helps
models focus on parts of the audio that really matter.

Much of this processing is done in the cloud, using powerful neural networks that have
been trained on enormous amounts of data. However, speech command models,
which recognize a single word like “start”, “stop”, “left” or “right” (or “Hey
Google”, “Alexa”, “Echo”), usually run locally, for a variety of reasons. First, it would
be expensive to continuously stream all audio acquired from a given device to the
cloud. Second, this would not even be possible in some applications. Consider an
operator on an offshore platform who wants to use simple voice commands to
control an auxiliary robot: there may not be an internet connection available, or the
latency could be too high.

In this work, the Google Speech Commands dataset is used. It contains short audio
clips, 1 s long, of various spoken words, and it is an excellent starting point for learning
how to apply deep learning to speech.
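As a concrete starting point, a single clip can be loaded and padded to a fixed 1 s length. The sketch below uses SciPy; the file path is a hypothetical example and should point at a local copy of the dataset.

```python
import numpy as np
from scipy.io import wavfile

# Hypothetical path: point it at a local copy of the Speech Commands dataset.
sample_rate, samples = wavfile.read("speech_commands/right/0a7c2a8d_nohash_0.wav")
print(sample_rate, samples.shape)  # 16000, (16000,) -> 1 s of 16-bit PCM audio

# A few clips are slightly shorter than 1 s; zero-pad them to a fixed length.
samples = np.pad(samples, (0, max(0, sample_rate - len(samples))))
```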

Fundamentals of Human Voice


When analyzing the human voice, a very important aspect is filtering: the vocal tract acts as a
frequency-selective transmission system that lets energy through at some frequencies
and not others. As shown in the picture below, the pharynx, oral cavity and lips play an
important role in human speech.

Voice formants reveal frequency regions of greater energy concentration. They are
peaks of greater amplitude in the sound spectrum and are inherent to a particular
configuration adopted by the vocal tract during vowel production. When a word is spoken,
formants are associated with the natural resonance frequencies of the vocal tract, and
depend on the position of the tongue relative to the inner structures and on lip movement. The
system can be approximated by a tube with one closed extremity (the larynx) and one open
(the lips), modified by tongue, lip and pharynx movement. The resonance that occurs
in the cavities of this tube is called a formant.
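As a back-of-the-envelope illustration (not part of the original article), the resonances of a tube closed at one end and open at the other fall at odd multiples of c/4L. For a typical 17 cm vocal tract, this already lands close to the formants of a neutral vowel:

```python
# Resonances of a tube closed at one end (larynx) and open at the other (lips):
# f_n = (2n - 1) * c / (4 * L), n = 1, 2, 3, ...
c = 343.0   # speed of sound in air, m/s
L = 0.17    # typical vocal tract length, m

for n in (1, 2, 3):
    f = (2 * n - 1) * c / (4 * L)
    print(f"F{n} ≈ {f:.0f} Hz")
# F1 ≈ 504 Hz, F2 ≈ 1513 Hz, F3 ≈ 2522 Hz -- roughly the formants of a neutral vowel
```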
Interestingly, deep learning practitioners were quick to set most of this information aside.
During his time at Baidu, researcher Andrew Ng went as far as saying that phonemes,
the smallest components of sound, didn’t matter. To a certain extent, what matters in
practice is that voice is mostly a (quasi-)periodic signal over sufficiently small time
intervals. This leads to the idea of ignoring the phase of the signal and using only its
power spectrum as a source of information. The fact that sound can be reconstructed
from its power spectrum (with the Griffin-Lim algorithm or a neural vocoder, for
example) supports this view.
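A quick way to convince yourself of this is the sketch below, which assumes librosa is installed and uses a hypothetical right.wav clip: it discards the phase, keeps only the magnitude spectrogram (the power spectrum is just its square), and resynthesizes intelligible audio with Griffin-Lim.

```python
import librosa

# Load a short clip at 16 kHz (any speech recording works here).
y, sr = librosa.load("right.wav", sr=16000)

# Keep only the magnitude of the STFT, discarding the phase.
magnitude = abs(librosa.stft(y, n_fft=512, hop_length=128))

# Griffin-Lim iteratively estimates a phase that is consistent with the magnitudes.
y_reconstructed = librosa.griffinlim(magnitude, n_iter=60, hop_length=128)
```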

Though there is a considerable amount of work being done on processing raw
waveforms (for example, in this recent Facebook study), spectral methods still
prevail. At the moment, the standard way of preprocessing audio is to compute the
short-time Fourier transform (STFT) with a given hop size from the raw waveform. The
result, called a spectrogram, is a two-dimensional array that shows the frequency
content and intensity of the audio as a function of time, as shown in the picture
below. Another useful trick is to “stretch” the lower frequencies in order to mimic
human perception, which is less sensitive to changes at higher frequencies. This can
be done by mapping the spectrogram onto the mel scale, yielding a mel-frequency
spectrogram. There are many online resources that explain the mel scale in detail
should this be of interest (Wikipedia is a good start).
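This preprocessing can be sketched in a few lines with librosa. The parameter values below (FFT size, hop size, number of mel bands) are illustrative, not the ones used in the article.

```python
import librosa

y, sr = librosa.load("right.wav", sr=16000)   # hypothetical 1 s clip

# STFT with a fixed hop size -> power spectrogram -> mel scale.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=128, n_mels=80, power=2.0
)
log_mel = librosa.power_to_db(mel)            # compress dynamic range (dB)
print(log_mel.shape)                          # (n_mels, spec_len)
```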

Diphthongs in the word “saia”. Top: phonetic notation. Middle: sound waveform. Bottom: spectrogram
with the fundamental frequency highlighted in blue and the first three formants highlighted in red.

Neural Attention Architecture


Now that the foundations of speech processing have been covered, it is possible to propose a
neural network that can handle command recognition while still keeping a small
footprint in terms of the number of trainable parameters. A recurrent model with
attention brings several advantages:

It makes the model explainable by computing importance weights over the input;

It learns to identify which parts of the audio are important;

Recurrent neural network (RNN) architectures, such as the long short-term memory
(LSTM) and the gated recurrent unit (GRU), have a proven record of carrying
information over long sequences while keeping vanishing/exploding gradients under control;

Since it is usually acceptable to respond with a 1 s delay, a bidirectional RNN allows
the model to extract past and future dependencies at any given point of the audio.

Recurrent neural network with attention mechanism. Numbers between [brackets] are tensor dimensions.
raw_len is the WAV audio length (16000 for 1 s clips sampled at 16 kHz). spec_len is
the sequence length of the generated mel-scale spectrogram. nMel is the number of mel bands. nClasses is the
number of desired classes. The activation of the last Dense layer is softmax. The activation of the 64- and 32-unit
dense classification layers is the rectified linear unit (ReLU).
The proposed architecture uses convolutions to extract short-term dependencies, and
RNNs with attention to extract long-term dependencies. The implementation details
can be found in this repository.
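For reference, here is a rough Keras sketch of this kind of architecture: convolutions over the mel spectrogram, a bidirectional RNN, dot-product attention and a small dense head. The layer sizes and the exact attention wiring are illustrative assumptions, not the author’s exact model; the linked repository contains the actual implementation.

```python
from tensorflow.keras import layers, Model

n_mels, spec_len, n_classes = 80, 126, 22    # illustrative values for nMel, spec_len, nClasses

inputs = layers.Input(shape=(spec_len, n_mels, 1))                    # [spec_len, nMel, 1]

# Convolutions along time extract short-term dependencies.
x = layers.Conv2D(10, (5, 1), padding="same", activation="relu")(inputs)
x = layers.Conv2D(1, (5, 1), padding="same", activation="relu")(x)
x = layers.Reshape((spec_len, n_mels))(x)                             # back to a sequence

# Bidirectional RNNs carry context from the past and the future.
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)   # [spec_len, 128]

# Dot-product attention: a query derived from one timestep scores every timestep;
# the normalized scores are the attention weights shown in the figures.
last_step = layers.Lambda(lambda t: t[:, -1])(x)                      # [128]
query = layers.Dense(128)(last_step)
scores = layers.Dot(axes=[1, 2])([query, x])                          # [spec_len]
weights = layers.Softmax(name="attention_weights")(scores)
context = layers.Dot(axes=[1, 1])([weights, x])                       # weighted sum, [128]

# Small dense head: 64 -> 32 -> softmax over the classes.
h = layers.Dense(64, activation="relu")(context)
h = layers.Dense(32, activation="relu")(h)
outputs = layers.Dense(n_classes, activation="softmax")(h)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```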

On the Google Speech Commands dataset, the proposed model achieves over 94% accuracy
when identifying one of 20 words, silence or unknown. In addition, the attention
layer makes the model explainable. For example, in the case of the word “right”, note
that the network puts a lot of attention on the transition from “r” to “i”. This is
intuitive considering that the “t” might not be audible in some cases. Also note that the
fact that the attention weights are concentrated on the transition does not mean that the
model ignores the other parts of the audio: the bidirectional RNN carries information
from the past and from the future.

Waveform, attention weights and mel-frequency spectrogram of word “right”
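To reproduce plots like the one above, the attention weights can be read back from a trained model by building a second model that also exposes the attention layer. This assumes the attention_weights layer name and the log_mel spectrogram from the sketches above.

```python
import numpy as np
from tensorflow.keras import Model

# Expose the attention weights alongside the class probabilities.
explain_model = Model(model.input,
                      [model.output, model.get_layer("attention_weights").output])

x_input = log_mel.T[np.newaxis, ..., np.newaxis]   # (1, spec_len, nMel, 1)
probs, att = explain_model.predict(x_input)
print(att.shape)                                   # (1, spec_len): one weight per spectrogram frame
```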

Conclusion
Speech command recognition is present in a wide range of devices and used by
many human-computer interfaces. In many situations, it is desirable to have lightweight,
high-accuracy models that can run locally. This article presented one possible RNN
architecture that achieves state-of-the-art performance on the Google Speech
Commands recognition task while keeping a small footprint in terms of trainable
parameters. Source code is available on GitHub. It uses the Google Speech Commands
Dataset (v1 and v2) to demonstrate how to train models that are able to identify, for
example, 20 commands plus silence or an unknown word.

The architecture extracts short- and long-term dependencies and uses an
attention mechanism to pinpoint which region carries the most useful information, which is
then fed to a sequence of dense layers.

In engineering applications, being able to explain what features were used to select a
particular category is important. As shown above, the attention mechanism explains
what parts of the audio are important for classification and also matches the intuition
that regions of vowel transitions are relevant to recognize words. For completeness,
confusion matrices can be found in this article; they show that the word pairs tree/three
and no/down are difficult to distinguish without extra context.

References
Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. https://arxiv.org/abs/1804.03209

A neural attention model for speech command recognition. https://arxiv.org/abs/1808.08929

Formant tuning strategies in professional male opera singers. Journal of Voice 27 (3)
(2013) 278–288. http://www.sciencedirect.com/science/article/pii/S089219971200209320

Respiratory and acoustical differences between belt and neutral style of singing.
Journal of Voice 29 (4) (2015) 418–5. http://www.sciencedirect.com/science/article/pii/S089219971400204

A Robust Frequency-Domain Method For Estimation Of Intended Fundamental
Frequency In Voice Analysis. International Journal of Innovation Science and Research.

wav2letter++: The Fastest Open-source Speech Recognition System. https://arxiv.org/abs/1812.07625

Thanks to Ludovic Benistant
