Audio Data Analysis Using Machine Learning and Deep Learning
Systems like Audio Analytic ‘listen’ to the events inside and outside your car,
enabling the vehicle to make adjustments in order to increase a driver’s
safety. Another example is SoundSee technology by Bosch that can analyze
machine noises and facilitate predictive maintenance to monitor equipment
health and prevent costly failures.
Healthcare is another field where environmental sound recognition comes
in handy. It offers a non-invasive type of remote patient monitoring to
detect events like falling. Besides that, analysis of coughing, sneezing,
snoring, and other sounds can facilitate pre-screening, identifying a
patient's status, assessing the infection level in public spaces, and so on.
The time period is how long it takes a sound wave to complete one cycle of vibration, measured in
seconds.
Frequency measured in Hertz (Hz) indicates how many sound vibrations happen per
second. People interpret frequency as low or high pitch.
The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier transform.
• The short-time Fourier transform (STFT) is a sequence of Fourier transforms converting a waveform into a spectrogram.
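To make this concrete, here is a minimal sketch of turning a waveform into a spectrogram with the STFT using librosa; the file name and parameters are only illustrative assumptions.

```python
import numpy as np
import librosa

# Load an audio file (hypothetical path); sr=None keeps the original sample rate.
y, sr = librosa.load("recording.wav", sr=None)

# Short-time Fourier transform: a sequence of FFTs over overlapping windows.
stft = librosa.stft(y, n_fft=2048, hop_length=512)

# Magnitude spectrogram converted to decibels for visualization.
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
print(spectrogram_db.shape)  # (frequency bins, time frames)
```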
• Audio analysis software
• Of course, you don’t need to perform transformations manually. Nor do you need to
understand the complex mathematics behind FT, STFT, and other techniques used in
audio analysis. All these and many other tasks are done automatically by audio
analysis software that in most cases supports the following operations:
• import audio data,
• add annotations (labels),
• edit recordings and split them into pieces,
• remove noise,
• convert signals into corresponding visual representations (waveforms, spectrum
plots, spectrograms, mel spectrograms),
• do preprocessing operations,
• analyze time and frequency content,
• extract audio features, and more.
• The most advanced platforms also allow you to train machine learning models and
even provide you with pre-trained algorithms.
Audacity is a free and open-source audio editor to split recordings, remove noise,
transform waveforms to spectrograms, and label them. Audacity doesn’t require coding
skills. Yet, its toolset for audio analysis is not very sophisticated. For further steps, you
need to load your dataset to Python or switch to a platform specifically focusing on
analysis and/or machine learning.
• Tensorflow-io package for preparation and augmentation of audio data lets you
perform a wide range of operations — noise removal, converting waveforms to
spectrograms, frequency and time masking,
and more. The tool belongs to the open-source TensorFlow ecosystem, covering
end-to-end machine learning workflow. So, after preprocessing you can train an
ML model on the same platform.
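As a rough sketch of that workflow (the file name, sample rate, and parameters are assumptions, not a prescribed recipe), the tfio.audio module can take a waveform to a mel spectrogram and apply masking like this:

```python
import tensorflow as tf
import tensorflow_io as tfio

# Read a WAV file (hypothetical path, assumed 16 kHz mono, 16-bit) into a float tensor.
audio = tfio.audio.AudioIOTensor("snore.wav")
waveform = tf.cast(tf.squeeze(audio.to_tensor(), axis=-1), tf.float32) / 32768.0

# Waveform -> spectrogram -> mel spectrogram in dB.
spectrogram = tfio.audio.spectrogram(waveform, nfft=512, window=512, stride=256)
mel = tfio.audio.melscale(spectrogram, rate=16000, mels=128, fmin=0, fmax=8000)
mel_db = tfio.audio.dbscale(mel, top_db=80)

# SpecAugment-style frequency and time masking, typically used for augmentation.
augmented = tfio.audio.freq_mask(mel_db, param=10)
augmented = tfio.audio.time_mask(augmented, param=10)
```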
• Torchaudio is an audio processing library for PyTorch. It offers several tools for
handling and transforming audio data. It supports various audio formats and
provides essential data loading and preprocessing capabilities.
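A minimal Torchaudio sketch, assuming a hypothetical recording and illustrative parameters, might look like this:

```python
import torchaudio

# Load audio (hypothetical path); returns a (channels, samples) tensor and its sample rate.
waveform, sample_rate = torchaudio.load("recording.wav")

# Resample to 16 kHz and compute a mel spectrogram.
resample = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=64
)
mel_spectrogram = mel_transform(resample(waveform))
print(mel_spectrogram.shape)  # (channels, n_mels, time frames)
```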
• Librosa is an open-source Python library that has almost everything you need for
audio and music analysis. It enables displaying characteristics of audio files,
creating all types of audio data visualizations, and extracting features from
them, to name just a few capabilities.
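For instance, a short librosa sketch (hypothetical file name, illustrative settings) that plots a waveform and a mel spectrogram could look like this:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Hypothetical file path; sr=None keeps the native sampling rate.
y, sr = librosa.load("recording.wav", sr=None)

# Waveform and mel spectrogram, two of the visualizations mentioned above.
fig, (ax_wave, ax_mel) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(y, sr=sr, ax=ax_wave)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
librosa.display.specshow(librosa.power_to_db(mel, ref=np.max),
                         sr=sr, x_axis="time", y_axis="mel", ax=ax_mel)
plt.show()
```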
Sound libraries are collections of free audio pieces grouped by theme. Sources like Freesound and BigSoundBank
offer voice recordings, environmental sounds, noises, and all kinds of other material. For example, you
can find a soundscape of applause or a set of skateboard sounds.
Keep in mind, though, that sound libraries are not specifically prepared for machine learning
projects, so extra work on set completion, labeling, and quality control is required.
Audio datasets are, on the contrary, created with particular machine learning tasks in mind. For
instance, the Bird Audio Detection dataset by the Machine Listening Lab has more than 7,000
excerpts collected during bio-acoustics monitoring projects. Another example is the
ESC-50: Environmental Sound Classification dataset, containing 2,000 labeled audio recordings. Each
file is 5 seconds long and belongs to one of 50 semantic classes organized into five categories.
One of the largest audio data collections is AudioSet by Google. It includes over 2 million human-
labeled 10-second sound clips, extracted from YouTube videos. The dataset covers 632 classes, from
music and speech to splinter and toothbrush sounds.
• Commercial datasets
• Commercial audio sets for machine learning are definitely more
reliable in terms of data integrity than free ones. One example is
ProSoundEffects, which sells datasets for training models in
speech recognition, environmental sound classification, audio
source separation, and other applications. In total, the company
has 357,000 files recorded by experts in film sound and classified
into 500+ categories.
But what if the sound data you’re looking for is too specific or
rare? What if you need full control over the recording and labeling?
Then it’s better to collect it in partnership with reliable specialists
from the same industry as your machine learning project.
• Expert datasets
• When working with Sleep.ai, our task was to create a model
capable of identifying grinding sounds that people with bruxism
typically make during sleep. Clearly, we needed special data, not
available through open sources. Also, the data reliability and quality
had to be the best so we could get trustworthy results.
Though labeling involves assistance from software tools and some degree of
automation, for the most part it is still performed manually by professional
annotators and/or domain experts. In our bruxism detection project, sleep
experts listened to audio recordings and marked them with grinding or snoring
labels.
• Audio data preprocessing
• Besides enriching data with meaningful tags, we have to preprocess sound data to
achieve better prediction accuracy. Here are the most basic steps for speech
recognition and sound classification projects.
Framing means cutting the continuous stream of sound into short pieces (frames) of
the same length (typically, of 20-40 ms) for further segment-wise processing.
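A minimal framing sketch with librosa, assuming a hypothetical 16 kHz recording and a common 25 ms frame with a 10 ms hop:

```python
import librosa

y, sr = librosa.load("recording.wav", sr=16000)  # hypothetical path

# 25 ms frames with a 10 ms hop (a common choice within the 20-40 ms range).
frame_length = int(0.025 * sr)   # 400 samples at 16 kHz
hop_length = int(0.010 * sr)     # 160 samples at 16 kHz
frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
print(frames.shape)  # (frame_length, number of frames)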
Windowing applies a window function to each frame to smooth its edges and reduce spectral leakage.
Basically, all windows do the same thing: reduce or smooth the amplitude at the start
and the end of each frame while increasing it at the center to preserve the average
value.
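For illustration, here is a small sketch of applying a Hann window to a batch of frames; the frames below are just random placeholders standing in for real framed audio.

```python
import numpy as np

# Placeholder frames with shape (frame_length, n_frames), as produced by the framing step.
rng = np.random.default_rng(0)
frames = rng.standard_normal((400, 100))

window = np.hanning(400)                     # smooth taper: ~0 at the edges, 1 at the center
windowed = frames * window[:, np.newaxis]    # broadcast the window over all frames
```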
• The overlap-add (OLA) method prevents the loss of information that windowing can cause.
OLA provides 30-50 percent overlap between adjacent frames, allowing you to
modify them without the risk of distortion. In this case, the original signal can be accurately
reconstructed from the windowed frames.
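As a quick check of this idea, SciPy can verify the constant-overlap-add (COLA) condition for a Hann window with 50 percent overlap, which is what guarantees distortion-free reconstruction:

```python
from scipy import signal

# With a periodic Hann window and 50 percent overlap, the overlapping windows
# sum to a constant, so the original signal can be reconstructed exactly.
frame_length = 400
hop_length = 200  # 50 percent overlap
print(signal.check_COLA(signal.windows.hann(frame_length, sym=False),
                        frame_length, frame_length - hop_length))  # True
```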
• Feature extraction
• Audio features or descriptors are properties of signals, computed from visualizations of
preprocessed audio data. They can belong to one of three domains:
• time domain represented by waveforms,
• frequency domain represented by spectrum plots, and
• time and frequency domain represented by spectrograms.
• Time-domain features
• As we mentioned before, time-domain or temporal features are extracted directly from original
waveforms. Notice that waveforms don't contain much information on how the piece would really
sound; they indicate only how the amplitude changes over time. In the image below we can see
that the air conditioner and siren waveforms look alike, though the sounds are clearly different.
• Now let’s move to some key features we can draw from waveforms.
Amplitude envelope (AE) traces amplitude peaks within the frame and shows how they
change over time. With AE, you can automatically measure the duration of distinct parts of a
sound (as shown in the picture below). AE is widely used for onset detection, to indicate
when a certain signal starts, and for music genre classification.
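A simple way to sketch the amplitude envelope is to take the maximum absolute amplitude per frame; the decaying tone below is just a stand-in for a real recording.

```python
import numpy as np

def amplitude_envelope(y, frame_length=1024, hop_length=512):
    """Maximum absolute amplitude in each frame: a simple amplitude envelope."""
    return np.array([
        np.max(np.abs(y[start:start + frame_length]))
        for start in range(0, len(y), hop_length)
    ])

# Example with a decaying 440 Hz tone instead of a real recording.
sr = 16000
t = np.arange(sr) / sr
y = np.exp(-3 * t) * np.sin(2 * np.pi * 440 * t)
ae = amplitude_envelope(y)
```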
• Short-time energy (STE) shows the energy variation within a short
speech frame.
Zero-crossing Rate (ZCR) counts how many times the signal wave
crosses the horizontal axis within a frame. It’s one of the most important
acoustic features, widely used to detect the presence or absence of speech,
and differentiate noise from silence and music from speech.
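Both features are easy to sketch in code, for example with librosa and NumPy (hypothetical file name, illustrative frame settings):

```python
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=None)  # hypothetical path

# Zero-crossing rate per frame: fraction of sign changes within each frame.
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)

# Short-time energy: sum of squared samples per frame.
frames = librosa.util.frame(y, frame_length=2048, hop_length=512)
ste = np.sum(frames ** 2, axis=0)
```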
• Frequency domain features
• Frequency-domain features are more difficult to extract than temporal ones as the process
involves converting waveforms into spectrum plots or spectrograms using FT or STFT. Yet,
it’s the frequency content that reveals many important sound characteristics invisible or
hard to see in the time domain.
Mel-frequency cepstral coefficients (MFCCs) are among the most popular of these features: they describe the spectral envelope of a sound on the perceptually motivated mel scale.
No surprise that the initial application of MFCCs is speech and voice recognition.
But they also proved to be effective for music processing and
acoustic diagnostics for medical purposes, including snoring detection. For
example, one of the recent deep learning models developed by the School of
Engineering (Eastern Michigan University) was trained on 1000 MFCC images
(spectrograms) of snoring sounds.
The waveform of a snoring sound (a) and its MFCC spectrogram (b), compared
with the waveform of a toilet flush sound (c) and the corresponding MFCC image
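In practice, MFCCs are usually computed with a library call rather than by hand; a librosa sketch (hypothetical file, illustrative parameters) looks like this:

```python
import librosa

y, sr = librosa.load("snoring.wav", sr=16000)   # hypothetical path

# 13 MFCCs per frame; delta and delta-delta features are discussed later in the text.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)  # (13, number of frames)
```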
• Speech Recognition — Feature Extraction MFCC & PLP
• Machine learning (ML) extracts features from raw data and creates a dense representation of the
content. If done correctly, this forces the model to learn the core information, without the noise, and
make inferences from it.
• Back to speech recognition: our objective is to find the best sequence of words
corresponding to the audio, based on the acoustic model and the language model.
A few alternatives for w are the Hamming window and the Hanning (Hann) window. The following
diagram indicates how a sinusoidal waveform is chopped off by these windows. As
shown, for the Hamming and Hanning windows, the amplitude drops off near the edges. (The
Hamming window has a slight sudden drop at the edge, while the Hanning window does not.)
• The corresponding equations for w are:
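For a frame of N samples, the standard definitions are:

Hamming: w[n] = 0.54 - 0.46 cos(2πn / (N - 1)), for 0 ≤ n ≤ N - 1
Hanning (Hann): w[n] = 0.5 - 0.5 cos(2πn / (N - 1)), for 0 ≤ n ≤ N - 1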
At the top right below is a sound wave in the time domain that is mainly composed of only two
frequencies. As shown, the frame chopped with a Hamming or Hanning window maintains the
original frequency information better, with less noise, compared to a rectangular window.
• All these mappings are non-linear. In feature extraction,
we apply triangular band-pass filters to convert the
frequency information in a way that mimics what a human
perceives.
• First, we square the output of the DFT. This reflects the power of the speech at each
frequency (x[k]²), and we call it the DFT power spectrum. We apply the triangular Mel-
scale filter banks to transform it into the Mel-scale power spectrum. The output for each Mel-
scale power spectrum slot represents the energy from the frequency bands that
it covers. This mapping is called Mel binning. The precise equation for slot m is:
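In its standard form, with X[k] the DFT of the frame and H_m[k] the m-th triangular Mel filter, the binned energy is:

S[m] = Σ_k |X[k]|² · H_m[k]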
The triangular band-pass filters are wider at higher frequencies to reflect the fact that human
hearing is less sensitive at high frequencies. Specifically, the filters are spaced linearly below
1000 Hz and logarithmically above it.
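As a small illustration of this spacing, librosa can build the triangular Mel filter bank directly (the parameters below are illustrative):

```python
import librosa

# Triangular Mel filter bank: filters are narrow and densely packed at low
# frequencies, wider and sparser at high frequencies.
mel_filters = librosa.filters.mel(sr=16000, n_fft=512, n_mels=40, fmin=0, fmax=8000)
print(mel_filters.shape)  # (40 filters, 257 FFT bins)

# The Mel scale itself: roughly linear below ~1000 Hz, logarithmic above.
print(librosa.hz_to_mel([500, 1000, 2000, 4000]))
```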
• All these efforts try to mimic how the basilar membrane in
our ear senses the vibration of sounds. The basilar
membrane has about 15,000 hairs inside the cochlea at
birth. The diagram below demonstrates the frequency
response of those hairs. So the curve-shaped response below
is simply approximated by triangles in the Mel filter bank.
• We imitate how our ears perceive sound through those
hairs. In short, it is modeled by the triangular filters
of the Mel filter bank.
• Log
• The Mel filter bank outputs a power spectrum. Humans are
less sensitive to small energy changes at high energy levels
than to small changes at low energy levels; in fact, perception is
logarithmic. So our next step takes the log of the
output of the Mel filter bank. This also reduces
acoustic variants that are not significant for speech
recognition. Next, we need to address two more
requirements. First, we need to remove the F0
information (the pitch); second, we need to make the extracted
features independent of one another.
• Cepstrum — IDFT
• Below is the model of how speech is produced.
• Our articulations control the shape of the vocal tract. The source-
filter model combines the vibrations produced by the vocal folds
with the filter created by our articulations. The glottal source
waveform will be suppressed or amplified at different frequencies
by the shape of the vocal tract.
• "Cepstrum" is the word "spectrum" with its first four letters reversed. Our next
step is to compute the cepstrum, which separates the glottal
source and the filter. Diagram (a) is the
spectrum with the y-axis being the magnitude. Diagram (b) takes
the log of the magnitude. Look closer: the wave fluctuates about 8
times between 1000 and 2000; in fact, it fluctuates about 8
times for every 1000 units. That corresponds to about 125 Hz — the source
vibration frequency of the vocal folds.
As observed, the log spectrum (the first diagram below) is composed of information related to the
phone (the second diagram) and the pitch (the third diagram). The peaks in the second diagram
identify the formants that distinguish phones. But how can we separate them?
• Recall that a period in the time or frequency domain is
inverted after the transformation.
Recall that the pitch information has short periods in the frequency domain. We can apply the
inverse Fourier transform to separate the pitch information from the formants. As shown
below, the pitch information shows up in the middle and on the right side. The peak in the middle
actually corresponds to F0, and the phone-related information is located at the far left.
• Here is another visualization. The solid line on the left
diagram is the signal in the frequency domain. It is
composed of the phone information, drawn in the dotted
line, and the pitch information. After the IDFT (inverse
Discrete Fourier Transform), the pitch information with
period 1/T is transformed into a peak near T on the
right side.
• So for speech recognition, we just need the coefficients
on the far left and can discard the others. In fact, MFCC
takes just the first 12 cepstral values. There is another
important property related to these 12 coefficients: the log
power spectrum is real and symmetric, so its inverse DFT
is equivalent to a discrete cosine transform (DCT).
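Putting the whole chain together, here is a compact sketch of MFCC extraction (framing, windowing, DFT power spectrum, Mel binning, log, and DCT), with a hypothetical file and illustrative parameters:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

y, sr = librosa.load("speech.wav", sr=16000)          # hypothetical path

# Frame, window, and DFT power spectrum.
frames = librosa.util.frame(y, frame_length=400, hop_length=160)
windowed = frames * np.hamming(400)[:, np.newaxis]
power_spectrum = np.abs(np.fft.rfft(windowed, n=512, axis=0)) ** 2

# Mel binning with triangular filters, then log compression.
mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=26)
log_mel_energy = np.log(mel_fb @ power_spectrum + 1e-10)

# DCT of the log Mel energies; keep the first 12 cepstral coefficients (skipping c0).
mfcc = dct(log_mel_energy, type=2, axis=0, norm="ortho")[1:13]
```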
In pronunciation, context and dynamic information are important. Articulations, like stop closures
and releases, can be recognized by the formant transitions. Characterizing how features change over
time provides the context information for a phone. Another 13 values are the delta
values d(t) below, which measure the change in each feature from the previous frame to the next
frame. This is the first-order derivative of the features.
• The last 13 parameters are the dynamic changes of d(t)
from the previous frame to the next frame. They act as the
second-order derivative of c(t).
• So the 39 MFCC feature parameters are the 12 cepstral
coefficients plus the energy term, and then two more
sets of 13 corresponding to the delta and the double-delta
values.
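A librosa sketch of assembling a 39-dimensional feature vector (hypothetical file; here librosa's c0 stands in for the energy term):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)          # hypothetical path
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # 12 cepstra plus an energy-like c0

delta = librosa.feature.delta(mfcc)                   # first-order derivative
delta2 = librosa.feature.delta(mfcc, order=2)         # second-order derivative

features_39 = np.concatenate([mfcc, delta, delta2], axis=0)
print(features_39.shape)  # (39, number of frames)
```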
• Cepstral mean and variance normalization
• Next, we can perform feature normalization. We
subtract the mean from each feature and divide it by its
standard deviation, so each feature has roughly unit
variance. The mean and variance are computed for each
feature dimension j over all the frames in a single
utterance. This allows us to adjust the values to
counteract the variations in each recording.
However, if the audio clip is short, this may not be reliable. Instead, we may compute the
average and variance values based on speakers, or even over the entire training dataset. This
type of feature normalization will effectively cancel the pre-emphasis done earlier. That is how
we extract MFCC features. As a last note, MFCC is not very robust against noise.
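A minimal NumPy sketch of per-utterance cepstral mean and variance normalization:

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Cepstral mean and variance normalization over all frames of one utterance.
    `features` has shape (n_features, n_frames)."""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)

# Example: normalize the 39-dimensional MFCC features computed above.
# normalized = cmvn(features_39)
```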
• Perceptual Linear Prediction (PLP)
• PLP is very similar to MFCC. Motivated by hearing
perception, it uses equal loudness pre-emphasis and
cube-root compression instead of the log compression.
• It also uses linear predictive analysis to derive the final cepstral
coefficients. PLP has slightly better accuracy and
slightly better noise robustness, but MFCC is generally
considered a safe choice. Throughout this series, when
we say we extract MFCC features, PLP features can be
extracted instead.
Discrete Fourier Transform (DFT)
Next, we apply DFT to extract information in the frequency domain.
Mel filterbank
As mentioned in the previous article, equipment measurements are not the same as our
hearing perception. For humans, perceived loudness changes with frequency. Also,
perceived frequency resolution decreases as frequency increases, i.e. humans are less sensitive to
higher frequencies. The diagram on the left indicates how the Mel scale maps the measured
frequency to the perceived frequency in terms of frequency resolution.
• To train a model for the Sleep.ai project, our data scientists selected a set of the most relevant
features from both the time and frequency domains. In combination, they created rich profiles of
grinding and snoring sounds.
• Selecting and training machine learning models
• Since audio features often come in visual form (mostly as spectrograms), they can be treated as an
object of image recognition, which relies on deep neural networks. There are several popular
architectures showing good results in sound detection and classification. Here, we focus on only
two that are commonly used to identify sleep problems by sound.
• Long short-term memory networks (LSTMs)
• Long short-term memory networks (LSTMs) are known for their ability to spot long-term
dependencies in data and remember information from numerous prior steps. According to sleep
apnea detection research, LSTMs can achieve an accuracy of 87 percent when using MFCC
features as input to separate normal snoring sounds from abnormal ones.
Another study shows even better results: the LSTM classified normal and abnormal snoring
events with an accuracy of 95.3 percent. The neural network was trained using five types of
features including MFCCs and short-time energy from the time domain. Together, they represent
different characteristics of snoring.
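As an illustration only (not the architecture from the cited studies), a minimal Keras LSTM classifier over MFCC frame sequences could be sketched like this; the input shape is an assumption:

```python
import tensorflow as tf

# A minimal sketch: an LSTM that classifies a sequence of MFCC frames
# as normal vs. abnormal snoring.
n_frames, n_mfcc = 300, 13   # illustrative input shape

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_frames, n_mfcc)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```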
• Convolutional neural networks (CNNs)
• Convolutional neural networks lead the pack in computer vision in healthcare and
other industries. They are often referred to as a natural choice for image recognition
tasks. The efficiency of CNN architecture in spectrogram processing proves the
validity of this statement one more time.
Almost the same results are reported for the combination of CNN and LSTM
architectures. A group of scientists from the Eindhoven University of Technology
applied a CNN model to extract features from spectrograms and then ran an
LSTM to classify the CNN output into snore and non-snore events. The accuracy
values range from 94.4 to 95.9 percent, depending on the location of the microphone
used for recording snoring sounds.
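For illustration (not AltexSoft's or the cited paper's exact model), a small Keras CNN over fixed-size mel spectrogram inputs might be sketched as follows; the input shape is an assumption:

```python
import tensorflow as tf

# A rough sketch: a small CNN that classifies fixed-size mel spectrogram
# "images" into snore / non-snore.
input_shape = (128, 128, 1)   # illustrative: mel bins x time frames x 1 channel

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=input_shape),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```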
For the Sleep.ai project, the AltexSoft data science team used two CNNs (for snoring and
grinding detection) and trained them on the TensorFlow platform. After the models achieved
an accuracy of over 80 percent, they were launched to production. Their results
have been constantly improving with the growing number of inputs collected
from real users.
• Keep in mind, though, that no health app, however
smart it is, can replace a real doctor. The conclusion
made by AI must be verified by your dentist, physician,
or another medical expert.