
Audio Data Analysis Using Machine Learning and Deep Learning and Its Application in Medical Science
Contents
• Introduction
We live in the world of sounds: Pleasant and annoying, low and high,
quiet and loud, they impact our mood and our decisions. Our brains are
constantly processing sounds to give us important information about
our environment. But acoustic signals can tell us even more if we analyze
them using modern technologies.

Today, we have AI and machine learning to extract insights, inaudible to


human beings, from speech, voices, snoring, music, industrial and
traffic noise, and other types of acoustic signals. In this article, we’ll
share what we’ve learned when creating AI-based sound recognition
solutions for healthcare projects.
Particularly, we’ll explain how to obtain audio data, prepare it for
analysis, and choose the right ML model to achieve the highest
prediction accuracy. But first, let’s go over the basics: What is audio
analysis, and what makes audio data so challenging to deal with?
What is audio analysis?
Audio analysis is a process of transforming, exploring,
and interpreting audio signals recorded by digital
devices. Aiming at understanding sound data, it applies a
range of technologies, including state-of-the-art
deep learning algorithms. Audio analysis has already
gained broad adoption in various industries, from
entertainment to healthcare to manufacturing. Below
we’ll give the most popular use cases.
• Speech recognition
• Speech recognition is about the ability of computers to distinguish
spoken words with natural language processing techniques. It allows us
to control PCs, smartphones, and other devices via voice commands and
dictate text to machines instead of typing it manually. Siri by Apple,
Alexa by Amazon, Google Assistant, and Cortana by Microsoft are
popular examples of how deeply the technology has penetrated into our
daily lives.
• Voice recognition
• Voice recognition is meant to identify people by the unique
characteristics of their voices rather than to isolate separate words. The
approach finds applications in security systems for user authentication.
For instance, Nuance Gatekeeper biometric engine verifies employees
and customers by their voices in the banking sector.
• Music recognition
• Music recognition is a popular feature of such apps as Shazam that helps
you identify unknown songs from a short sample. Another application of
musical audio analysis is genre classification: Say, Spotify runs its proprietary
algorithm to group tracks into categories (its database holds more than
5,000 genres).
• Environmental sound recognition
• Environmental sound recognition focuses on the identification of noises
around us, promising a bunch of advantages to automotive and
manufacturing industries. It’s vital for understanding surroundings in
IoT applications.

Systems like Audio Analytic ‘listen’ to the events inside and outside your car,
enabling the vehicle to make adjustments in order to increase a driver’s
safety. Another example is SoundSee technology by Bosch that can analyze
machine noises and facilitate predictive maintenance to monitor equipment
health and prevent costly failures.
Healthcare is another field where environmental sound recognition comes
in handy. It offers a non-invasive type of remote patient monitoring to
detect events like falling. Besides that, analysis of coughing, sneezing,
snoring, and other sounds can facilitate pre-screening, identifying a
patient's status, assessing the infection level in public spaces, and so on.

A real-life use case of such analysis is Sleep.ai which detects teeth


grinding and snoring sounds during sleep. The solution created by
AltexSoft for a Dutch healthcare startup helps dentists identify and
monitor bruxism to eventually understand the causes of this abnormality
and treat it.

No matter what type of sounds you analyze, it all starts with an


understanding of audio data and its specific characteristics.
• What is audio data?
• Audio data represents analog sounds in a digital form, preserving the main
properties of the original. As we know from school lessons in physics, a sound is a
wave of vibrations traveling through a medium like air or water and finally reaching
our ears. It has three key characteristics to be considered when analyzing audio data
— time period, amplitude, and frequency.

Time period is how long a certain sound lasts or, in other words, how many seconds
it takes to complete one cycle of vibrations.

Amplitude is the sound intensity measured in decibels (dB) which we perceive as


loudness.

Frequency measured in Hertz (Hz) indicates how many sound vibrations happen per
second. People interpret frequency as low or high pitch.

While frequency is an objective parameter, the pitch is subjective. The human


hearing range lies between 20 and 20,000 Hz. Scientists claim that most people
perceive as low pitch all sounds below 500 Hz — like the plane engine roar. In turn,
high pitch for us is everything beyond 2,000 Hz (for example, a whistle.)
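To make these characteristics concrete, here is a minimal sketch (assuming the librosa library and a hypothetical recording.wav file) that loads a clip and inspects its sampling rate, duration, and peak amplitude:

```python
# Minimal sketch: inspect basic properties of an audio clip.
# Assumes librosa is installed; "recording.wav" is a placeholder file name.
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=None)   # keep the original sampling rate
duration = librosa.get_duration(y=y, sr=sr)      # length of the clip in seconds
peak = np.max(np.abs(y))                         # amplitude, normalized to [-1, 1] (not dB)
print(f"sampling rate: {sr} Hz, duration: {duration:.2f} s, peak amplitude: {peak:.3f}")
```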
• Audio data file formats
• Similar to texts and images, audio is unstructured data meaning that it’s not arranged in tables
with connected rows and columns. Instead, you can store audio in various file formats like
• WAV or WAVE (Waveform Audio File Format) developed by Microsoft and IBM. It’s a lossless or
raw file format meaning that it doesn’t compress the original sound recording;
• AIFF (Audio Interchange File Format) developed by Apple. Like WAV, it works with uncompressed
audio;
• FLAC (Free Lossless Audio Codec) developed by Xiph.Org Foundation that offers free multimedia
formats and software tools. FLAC files are compressed without losing sound quality.
• MP3 (MPEG-1 Audio Layer 3) developed by the Fraunhofer Society in Germany and supported
globally. It’s the most common file format since it makes music easy to store on portable devices
and send back and forth via the Internet. Though MP3 compresses audio, it still offers
acceptable sound quality.
• We recommend using AIFF and WAV files for analysis as they don’t lose any information present in
the analog sounds. At the same time, keep in mind that none of these audio files can be
fed directly to machine learning models. To make audio understandable for computers, data
must undergo a transformation.
• Audio data transformation basics to know
• Before diving deeper into the processing of audio files, we
need to introduce specific terms that you will encounter at
almost every step of our journey from sound data collection
to getting ML predictions. It’s worth noting that audio analysis
involves working with images rather than listening.

A waveform is a basic visual representation of an audio


signal that reflects how the amplitude changes over time. The
graph displays the time on the horizontal (X) axis and the
amplitude on the vertical (Y) axis but it doesn’t tell us what’s
happening to frequencies.
Audio Signal Processing for Machine Learning
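As a quick illustration, here is a minimal sketch (librosa and matplotlib assumed, file name is a placeholder) that draws the waveform of a clip, with time on the X axis and amplitude on the Y axis:

```python
# Minimal sketch: plot a waveform (time vs. amplitude).
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)
plt.figure(figsize=(10, 3))
librosa.display.waveshow(y, sr=sr)   # called waveplot() in older librosa versions
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.tight_layout()
plt.show()
```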
• A spectrum or spectral plot is a graph where the X-
axis shows the frequency of the sound wave while the Y-
axis represents its amplitude. This type of sound
data visualization helps you analyze frequency content
but misses the time component.
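A spectrum plot can be produced with a plain FFT; the sketch below (numpy and librosa assumed) computes the magnitude of the one-sided FFT of the whole clip, losing the time dimension as noted above:

```python
# Minimal sketch: a spectrum plot (frequency vs. amplitude) via the FFT.
import numpy as np
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)
magnitude = np.abs(np.fft.rfft(y))              # one-sided FFT magnitude
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)     # frequency of each bin in Hz
plt.plot(freqs, magnitude)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.show()
```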
• A spectrogram is a detailed view of a signal that covers all three
characteristics of sound. You can learn about time from the x-axis,
frequencies from the y-axis, and amplitude from color. The louder the event
the brighter the color, while silence is represented by black. Having three
dimensions on one graph is very convenient: it allows you to track how
frequencies change over time, examine the sound in all its fullness, and
spot various problem areas (like noises) and patterns by sight.
A mel spectrogram, where mel stands for melody, is a type of spectrogram based
on the mel scale, which describes how people perceive sound characteristics. Our ear
can distinguish low frequencies better than high frequencies. You can check it
yourself: Try to play tones from 500 to 1000 Hz and then from 10,000 to 10,500 Hz.
The former frequency range would seem much broader than the latter, though, in
fact, they are the same. The mel spectrogram incorporates this unique feature of
human hearing, converting the values in Hertz into the mel scale. This approach is
widely used for genre classification, instrument detection in songs, and speech
emotion recognition.
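For reference, here is a minimal librosa sketch (file name is a placeholder) that computes a mel spectrogram and displays it in decibels:

```python
# Minimal sketch: compute and display a mel spectrogram.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)   # mel-scale power spectrogram
S_db = librosa.power_to_db(S, ref=np.max)                    # convert power to decibels
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram")
plt.show()
```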
• The Fourier transform (FT) is a mathematical function that breaks a signal into spikes of different amplitudes and
frequencies. We use it to convert waveforms into corresponding spectrum plots to look at the same signal from a different
angle and perform frequency analysis. It’s a powerful instrument for understanding signals and troubleshooting errors in them.

The Fast Fourier Transform (FFT) is the algorithm computing the Fourier transform.
• The short-time Fourier transform (STFT) is a sequence of Fourier transforms converting a waveform into a spectrogram.
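Putting these together, a regular spectrogram is obtained by applying the STFT to the waveform. A minimal sketch with librosa (the parameter values are common defaults, not requirements):

```python
# Minimal sketch: waveform -> spectrogram via the short-time Fourier transform.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)
D = librosa.stft(y, n_fft=2048, hop_length=512)          # complex STFT matrix
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)    # magnitude in decibels
librosa.display.specshow(S_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram (STFT)")
plt.show()
```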
• Audio analysis software
• Of course, you don’t need to perform transformations manually. Nor do you need to
understand the complex mathematics behind FT, STFT, and other techniques used in
audio analysis. All these and many other tasks are done automatically by audio
analysis software that in most cases supports the following operations:
• import audio data
• add annotations (labels),
• edit recordings and split them into pieces,
• remove noise,
• convert signals into corresponding visual representations (waveforms, spectrum
plots, spectrograms, mel spectrograms),
• do preprocessing operations,
• analyze time and frequency content,
• extract audio features and more.
• The most advanced platforms also allow you to train machine learning models and
even provide you with pre-trained algorithms.

Audacity is a free and open-source audio editor to split recordings, remove noise,
transform waveforms to spectrograms, and label them. Audacity doesn’t require coding
skills. Yet, its toolset for audio analysis is not very sophisticated. For further steps, you
need to load your dataset to Python or switch to a platform specifically focusing on
analysis and/or machine learning.
• The Tensorflow-io package for preparation and augmentation of audio data lets you
perform a wide range of operations — noise removal, converting waveforms to
spectrograms, frequency and time masking to make the sound clearly audible,
and more. The tool belongs to the open-source TensorFlow ecosystem, covering
end-to-end machine learning workflow. So, after preprocessing you can train an
ML model on the same platform.
• Torchaudio is an audio processing library for PyTorch. It offers several tools for
handling and transforming audio data. It supports various audio formats and
provides essential data loading and preprocessing capabilities.
• Librosa is an open-source Python library that has almost everything you need for
audio and music analysis. It enables displaying characteristics of audio files,
creating all types of audio data visualizations, and extracting features from
them, to name just a few capabilities.

Audio Toolbox by MathWorks offers numerous instruments for audio data


processing and analysis, from labeling to estimating signal metrics to extracting
certain features. It also comes with pre-trained machine learning and deep
learning models that can be used for speech analysis and sound recognition.
• Audio data analysis steps
• Now that we have a basic understanding of sound data, let’s take a glance at the key stages of an
end-to-end audio analysis project.
1. Obtain project-specific audio data stored in standard file formats.
2. Prepare data for your machine learning project using software tools.
3. Extract audio features from visual representations of sound data.
4. Select the machine learning model and train it on audio features.
• Voice and sound data acquisition
• You have three options to obtain data to train machine learning models: use free sound libraries or
audio datasets, purchase it from data providers, or collect it involving domain experts.
• Free data sources
• There are a lot of such sources out there on the web. But what we do not control in this case is
data quality and quantity, and the general approach to recording.

Sound libraries are free audio pieces grouped by theme. Sources like Freesound and BigSoundBank
offer voice recordings, environment sounds, noises, and honestly all kinds of stuff. For example, you
can find a soundscape of applause or a set of skateboard sounds.

The most important thing is that sound libraries are not specifically prepared for machine learning
projects. So, we need to perform extra work on set completion, labeling, and quality control.

Audio datasets are, on the contrary, created with particular machine learning tasks in mind. For
instance, the Bird Audio Detection dataset by the Machine Listening Lab has more than 7,000
excerpts collected during bio-acoustics monitoring projects. Another example is the
ESC-50: Environmental Sound Classification dataset, containing 2,000 labeled audio recordings. Each
file is 5 seconds long and belongs to one of 50 semantic classes organized in five categories.

One of the largest audio data collections is AudioSet by Google. It includes over 2 million human-
labeled 10-second sound clips, extracted from YouTube videos. The dataset covers 632 classes, from
music and speech to splinter and toothbrush sounds.
• Commercial datasets
• Commercial audio sets for machine learning are definitely more
reliable in terms of data integrity than free ones. We can
recommend ProSoundEffects selling datasets to train models for
speech recognition, environmental sound classification, audio
source separation, and other applications. In total, the company
has 357,000 files recorded by experts in film sound and classified
into 500+ categories.

But what if the sound data you’re looking for is way too specific or
rare? What if you need full control of the recording and labeling?
Well, then better do it in a partnership with reliable specialists
from the same industry as your machine learning project.
• Expert datasets
• When working with Sleep.ai, our task was to create a model
capable of identifying grinding sounds that people with bruxism
typically make during sleep. Clearly, we needed special data, not
available through open sources. Also, the data reliability and quality
had to be the best so we could get trustworthy results.

To obtain such a dataset, the startup partnered with sleep


laboratories, where scientists monitor people while they are
sleeping to define healthy sleep patterns and diagnose sleep
disorders. Experts use various devices to record brain activity,
movements, and other events. For us, they prepared a labeled data
set with about 12,000 samples of grinding and snoring sounds.
• Audio data preparation
• In the case of Sleep.ai, our team skipped this step, entrusting sleep experts with
the task of data preparation for our project. The same applies to those who buy
annotated sound collections from data providers. But if you have only raw data,
meaning recordings saved in one of the audio file formats, you need to get them
ready for machine learning.
• Audio data labeling
• Data labeling or annotation is about tagging raw data with correct answers to
run supervised machine learning. In the process of training, your model will
learn to recognize patterns in new data and make the right predictions, based
on the labels. So, their quality and accuracy are critical for the success of ML
projects.

Though labeling suggests assistance from software tools and some degree of
automation, for the most part, it’s still performed manually, by professional
annotators and/or domain experts. In our bruxism detection project, sleep
experts listened to audio recordings and marked them with grinding or snoring
labels.
• Audio data preprocessing
• Besides enriching data with meaningful tags, we have to preprocess sound data to
achieve better prediction accuracy. Here are the most basic steps for speech
recognition and sound classification projects.

Framing means cutting the continuous stream of sound into short pieces (frames) of
the same length (typically 20-40 ms) for further segment-wise processing.

Windowing is a fundamental audio processing technique to minimize spectral leakage


— the common error that results in smearing the frequency and degrading the
amplitude accuracy. There are several window functions (Hamming, Hanning, Flat Top,
etc) applied to different types of signals, though the Hanning variation works well for
95 percent of cases.

Basically, all windows do the same thing: reduce or smooth the amplitude at the start
and the end of each frame while increasing it at the center to preserve the average
value.
• The overlap-add (OLA) method prevents the loss of vital information that windowing can cause.
OLA provides 30-50 percent overlap between adjacent frames, allowing you to modify them
without the risk of distortion. In this case, the original signal can be accurately
reconstructed from the windowed frames.
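Here is a minimal sketch of framing, windowing, and overlap (numpy and librosa assumed; the 25 ms frame and 50 percent overlap are illustrative choices):

```python
# Minimal sketch: cut a signal into overlapping frames and apply a Hanning window.
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=None)
frame_length = int(0.025 * sr)        # 25 ms frames
hop_length = frame_length // 2        # 50 percent overlap between adjacent frames

num_frames = 1 + (len(y) - frame_length) // hop_length
frames = np.stack([
    y[i * hop_length : i * hop_length + frame_length]
    for i in range(num_frames)
])                                    # shape: (num_frames, frame_length)

window = np.hanning(frame_length)     # smooths the amplitude at the frame edges
windowed_frames = frames * window     # broadcast over all frames
print(windowed_frames.shape)
```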
• Feature extraction
• Audio features or descriptors are properties of signals, computed from visualizations of
preprocessed audio data. They can belong to one of three domains:
• time domain represented by waveforms,
• frequency domain represented by spectrum plots, and
• time and frequency domain represented by spectrograms.
• Time-domain features
• As we mentioned before, time domain or temporal features are extracted directly from original
waveforms. Notice that waveforms don't contain much information on how the piece would really
sound. They indicate only how the amplitude changes with time. In the image below we can see
that the air conditioner and siren waveforms look alike, but surely those sounds would not be similar.
• Now let’s move to some key features we can draw from waveforms.

Amplitude envelope (AE) traces amplitude peaks within the frame and shows how they
change over time. With AE, you can automatically measure the duration of distinct parts of a
sound (as shown in the picture below). AE is widely used for onset detection to indicate
when a certain signal starts, and for music genre classification.
• Short-time energy (STE) shows the energy variation within a short
speech frame.

It’s a powerful tool to separate voiced and unvoiced segments.

Root mean square energy (RMSE) gives you an understanding of the


average energy of the signal. It can be computed from a waveform or a
spectrogram. In the first case, you’ll get results faster. Yet, a spectrogram
provides a more accurate representation of energy over time. RMSE is
particularly useful for audio segmentation and music genre classification.

Zero-crossing Rate (ZCR) counts how many times the signal wave
crosses the horizontal axis within a frame. It’s one of the most important
acoustic features, widely used to detect the presence or absence of speech,
and differentiate noise from silence and music from speech.
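A minimal sketch of these three time-domain features with numpy and librosa (frame and hop sizes are illustrative):

```python
# Minimal sketch: amplitude envelope, RMS energy, and zero-crossing rate per frame.
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=None)
frame_length, hop_length = 1024, 512

# Amplitude envelope: the maximum absolute amplitude within each frame
ae = np.array([
    np.max(np.abs(y[i:i + frame_length]))
    for i in range(0, len(y) - frame_length + 1, hop_length)
])

# Root mean square energy and zero-crossing rate via librosa
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length, hop_length=hop_length)[0]
print(ae.shape, rms.shape, zcr.shape)
```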
• Frequency domain features
• Frequency-domain features are more difficult to extract than temporal ones as the process
involves converting waveforms into spectrum plots or spectrograms using FT or STFT. Yet,
it’s the frequency content that reveals many important sound characteristics invisible or
hard to see in the time domain.

The most common frequency domain features include


• mean or average frequency,
• median frequency when the spectrum is divided into two regions with equal amplitude,
• signal-to-noise ratio (SNR) comparing the strength of the desired sound against the background
noise,
• band energy ratio (BER) depicting the relation between higher and lower frequency bands. In
other words, it measures how dominant low frequencies are over high ones.
• Of course, there are numerous other properties to look at in this domain. To recap, it tells
us how the sound energy spreads across frequencies while the time domain shows how a
signal changes over time.
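As an example, the band energy ratio can be computed from an STFT; in the sketch below the 2,000 Hz split frequency is an illustrative choice, not a standard value:

```python
# Minimal sketch: per-frame band energy ratio (low-band energy / high-band energy).
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=None)
n_fft = 2048
spec = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2          # power spectrogram
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)       # center frequency of each bin

split_hz = 2000                                           # everything below counts as "low"
low = spec[freqs < split_hz].sum(axis=0)
high = spec[freqs >= split_hz].sum(axis=0)
ber = low / (high + 1e-10)                                # avoid division by zero
print(ber.mean())
```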
• Time-frequency domain features
• This domain combines both time and frequency components and uses various
types of spectrograms as a visual representation of a sound. You can get a
spectrogram from a waveform applying the short-time Fourier transform.

One of the most popular groups of time-frequency domain features is mel-


frequency cepstral coefficients (MFCCs). They work within the human
hearing range and as such are based on the mel scale and mel spectrograms we
discussed earlier.

No surprise that the initial application of MFCCs is speech and voice recognition.
But they also proved to be effective for music processing and
acoustic diagnostics for medical purposes, including snoring detection. For
example, one of the recent deep learning models developed by the School of
Engineering (Eastern Michigan University) was trained on 1000 MFCC images
(spectrograms) of snoring sounds.
The waveform of snoring sound (a) and its MFCC spectrogram (b) compared
with the waveform of the toilet flush sound (c) and corresponding MFCC image
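Before walking through the MFCC pipeline step by step in the next section, note that libraries such as librosa can compute the coefficients in a single call; a minimal sketch:

```python
# Minimal sketch: extract 13 MFCCs per frame from an audio file with librosa.
import librosa

y, sr = librosa.load("recording.wav", sr=None)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfccs.shape)
```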
• Speech Recognition — Feature Extraction MFCC & PLP
• Machine learning (ML) extracts features from raw data and creates a dense representation of the
content. This forces the model to learn the core information without the noise in order to make
inferences (if it is done correctly).
• Back to speech recognition: our objective is to find the best sequence of words
corresponding to the audio, based on the acoustic and language models.

To create an acoustic model, our observation X is represented by a sequence of acoustic feature


vectors (x₁, x₂, x₃, …). In the previous article, we learned how people articulate and perceive
speech. In this article, we discuss how audio features are extracted from what we learned.
• Requirement
• Let’s first define some of the requirements for feature extraction in ASR (automatic speech
recognition). Given an audio segment, we use a sliding window 25 ms wide to
extract audio features.
• This 25 ms width is large enough to capture sufficient information, yet
short enough that the signal inside the frame remains relatively stationary. If we speak 3
words per second, each with 4 phones, and each phone is sub-divided into 3
states, then there are 36 states per second, or about 28 ms per state. So the 25 ms
window is about right.
• Context is very important in speech. Pronunciation changes
according to the articulation before and after a phone. Successive
windows are about 10 ms apart, so we can capture the dynamics among
frames and model the proper context.
• Pitch varies from person to person but plays little role in recognizing
what is said. F0 is related to the pitch; it provides little value in
speech recognition and should be removed. What matters more are
the formants F1, F2, F3, … For those who have trouble following
these terms, we suggest reading the previous article first.
• We also want the extracted features to be robust to who the speaker
is and to noise in the environment. Also, as in any ML problem, we
want the extracted features to be independent of one another. It is easier to
develop and train models with independent features.
• One popular audio feature extraction method is Mel-frequency cepstral
coefficients (MFCC), which yields 39 features per frame. The feature count is small enough to force
the model to learn the core information of the audio. Twelve of the parameters are related to the
amplitude of frequencies, providing enough frequency channels to analyze the audio.
• Below is the flow of extracting the MFCC features.
• The key objectives are:
• Remove vocal fold excitation (F0) — the pitch
information.
• Make the extracted features independent.
• Adjust to how humans perceive loudness and frequency
of sound.
• Capture the dynamics of phones (the context).
• Mel-frequency cepstral coefficients (MFCC)
• Let’s cover each step one at a time.
• A/D conversion
• A/D conversion samples the audio clip and digitizes the content, i.e., converts the
analog signal into a discrete representation. A sampling frequency of 8 or 16 kHz is often used.
• Pre-emphasis
• Pre-emphasis boosts the amount of energy in the high frequencies. For voiced segments like
vowels, there is more energy at the lower frequencies than the higher frequencies. This is called
spectral tilt which is related to the glottal source (how vocal folds produce sound). Boosting the
high-frequency energy makes information in higher formants more available to the acoustic
model. This improves phone detection accuracy. For humans, we start having hearing problems
when we cannot hear these high-frequency sounds. Also, noise has a high frequency. In the
engineering field, we use pre-emphasis to make the system less susceptible to noise introduced
in the process later. For some applications, we just need to undo the boosting at the end.
• Pre-emphasis uses a filter to boost higher frequencies. Below is the before and after signal on
how the high-frequency signal is boosted.
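A minimal sketch of pre-emphasis as a first-order difference filter, y[n] = x[n] - a·x[n-1], where a is typically around 0.95-0.97:

```python
# Minimal sketch: pre-emphasis boosts high-frequency content before framing.
import numpy as np
import librosa

x, sr = librosa.load("recording.wav", sr=16000)
alpha = 0.97
y = np.append(x[0], x[1:] - alpha * x[:-1])   # y[n] = x[n] - alpha * x[n-1]
# Recent librosa versions also provide: librosa.effects.preemphasis(x, coef=alpha)
```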
• Windowing
• Windowing involves the slicing of the audio waveform
into sliding frames.
• But we cannot just chop the signal off at the edge of the frame.
A sudden fall in amplitude would create a lot of
noise that shows up in the high frequencies. To slice the
audio, the amplitude should gradually drop off near the
edge of a frame.
• Let’s say w is the window applied to the original audio
clip in the time domain.

A few alternatives for w are the Hamming window and the Hanning window. The following
diagram indicates how a sinusoidal waveform will be chopped off using these windows. As
shown, for Hamming and Hanning window, the amplitude drops off near the edge. (The
Hamming window has a slight sudden drop at the edge while the Hanning window does not.)
• The corresponding equations for w (for a frame of N samples) are: Hamming: w(n) = 0.54 - 0.46·cos(2πn/(N-1)); Hanning: w(n) = 0.5 - 0.5·cos(2πn/(N-1)).

On the top right below is a sound wave in the time domain composed mainly of two
frequencies. As shown, a frame chopped with a Hamming or Hanning window maintains the
original frequency information better, with less noise, than a rectangular window.
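A minimal sketch of the two windows for a 25 ms frame at 16 kHz, using the standard formulas (equivalent to numpy's built-in np.hamming and np.hanning):

```python
# Minimal sketch: Hamming and Hanning windows applied to one frame.
import numpy as np

N = 400                          # 25 ms at a 16 kHz sampling rate
n = np.arange(N)
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # small residual value at the edges
hanning = 0.5 - 0.5 * np.cos(2 * np.pi * n / (N - 1))    # tapers fully to zero at the edges

frame = np.random.randn(N)       # stand-in for one audio frame
windowed = frame * hanning       # gradually attenuates the frame near its edges
```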
• All these mappings are non-linear. In feature extraction,
we apply triangular band-pass filters to convert the
frequency information in a way that mimics what a human
perceives.
• First, we square the output of the DFT. This reflects the power of the speech at each
frequency (|x[k]|²), and we call it the DFT power spectrum. We then apply the triangular Mel-
scale filter banks to transform it into a Mel-scale power spectrum. The output for each Mel-
scale power spectrum slot represents the energy from the frequency bands that the filter
covers. This mapping is called Mel binning. The value for slot m is the filter-weighted sum of the
power spectrum: s(m) = Σₖ Hₘ(k)·|x(k)|², where Hₘ is the m-th triangular filter.

The triangular bandpass filters are wider at higher frequencies to reflect that human hearing is less
sensitive at high frequencies. Specifically, the filters are spaced linearly below 1000 Hz and
logarithmically above it.
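A minimal sketch of Mel binning with librosa (the number of filters and FFT size are illustrative): build the triangular filters, apply them to the DFT power spectrum, and take the log as described in the Log step below:

```python
# Minimal sketch: DFT power spectrum -> triangular mel filter bank -> log mel energies.
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=16000)
n_fft = 512
power_spec = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2        # shape: (n_fft//2 + 1, n_frames)
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=40)   # triangular filters, shape: (40, n_fft//2 + 1)
mel_power = mel_fb @ power_spec                               # mel-scale power spectrum ("mel binning")
log_mel = np.log(mel_power + 1e-10)                           # log compression (see the Log step below)
```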
• All these efforts try to mimic how the basilar membrane in
our ear senses the vibration of sounds. The basilar
membrane has about 15,000 hair cells inside the cochlea at
birth. The diagram below demonstrates the frequency
response of those hairs. So the curve-shaped response below
is simply approximated by triangles in the Mel filter bank.
• We imitate how our ears perceive sound through those
hairs. In short, it is modeled by the triangular filters
using the Mel filter bank.
• Log
• The Mel filter bank outputs a power spectrum. Humans are
less sensitive to small energy changes at high energy levels
than to small changes at low energy levels; in fact, the perception is
logarithmic. So our next step is to take the log of the
output of the Mel filter bank. This also reduces
acoustic variants that are not significant for speech
recognition. Next, we need to address two more
requirements: remove the F0 information (the pitch) and make
the extracted features independent of one another.
• Cepstrum — IDFT
• Below is the model of how speech is produced.
• Our articulations control the shape of the vocal tract. The source-
filter model combines the vibrations produced by the vocal folds
with the filter created by our articulations. The glottal source
waveform will be suppressed or amplified at different frequencies
by the shape of the vocal tract.
• The word cepstrum reverses the first four letters of the word
“spectrum”. Our next step is to compute the cepstrum, which
separates the glottal source and the filter. Diagram (a) is the
spectrum with the y-axis being the magnitude. Diagram (b) takes
the log of the magnitude. Look closer: the wave fluctuates about 8
times between 1000 and 2000, and in fact about 8
times for every 1000 units. That is about 125 Hz — the source
vibration of the vocal folds.
As observed, the log spectrum (the first diagram below) is composed of information related to the
phone (the second diagram) and the pitch (the third diagram). The peaks in the second diagram
identify the formants that distinguish phones. But how can we separate them?
• Recall that periods in the time or frequency domain are
inverted after transformation.

Recall that the pitch information has a short period in the frequency domain. We can apply the
inverse Fourier transform to separate the pitch information from the formants. As shown
below, the pitch information shows up in the middle and on the right side. The peak in the middle
actually corresponds to F0, and the phone-related information is located at the far left.
• Here is another visualization. The solid line on the left
diagram is the signal in the frequency domain. It is
composed of the phone information, drawn as the dotted
line, and the pitch information. After the IDFT (inverse
Discrete Fourier Transform), the pitch information with
period 1/T is transformed into a peak near T on the
right side.
• So for speech recognition, we just need the coefficients
on the far left and discard the others. In fact, MFCC just
takes the first 12 cepstral values. There is another
important property related to these 12 coefficients: the log
power spectrum is real and symmetric, so its inverse DFT
is equivalent to a discrete cosine transform (DCT).

DCT is an orthogonal transformation. Mathematically, the transformation produces uncorrelated
features, so MFCC features are largely uncorrelated. In ML, this makes our models easier to
build and train. If we model these parameters with a multivariate Gaussian distribution, all the
off-diagonal values in the covariance matrix will be zero. Mathematically, the output of this stage
is the set of cepstral coefficients, i.e., the DCT of the log filter bank energies.
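A minimal sketch of this step (the log mel energies here are stand-in values, reusing the shapes from the earlier filter bank sketch): take the DCT along the filter axis and keep the first coefficients:

```python
# Minimal sketch: cepstral coefficients as the DCT of the log mel filter bank energies.
import numpy as np
from scipy.fft import dct

log_mel = np.random.randn(40, 100)    # stand-in for (n_mels, n_frames) log mel energies
cepstra = dct(log_mel, type=2, axis=0, norm="ortho")[:13]   # keep the first 13 coefficients
print(cepstra.shape)                  # (13, n_frames)
```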
• The following is the visualization of the 12 Cepstrum
coefficients.
• Dynamic features (delta)
• MFCC has 39 features. We have covered 12 so far; what are the
rest? The 13th parameter is the energy in each frame, which
helps us identify phones.

In pronunciation, context and dynamic information are important. Articulations, like stop closures
and releases, can be recognized by the formant transitions. Characterizing feature changes over
time provides the context information for a phone. Another 13 values are the delta
values d(t) below, which measure the change in features from the previous frame to the next
frame. This is the first-order derivative of the features.
• The last 13 parameters are the dynamic changes of d(t)
from the last frame to the next frame. They act as the
second-order derivative of c(t).
• So the 39 MFCC feature parameters are the 12 cepstral
coefficients plus the energy term, followed by two more
sets corresponding to the delta and double-delta
values.
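A minimal sketch assembling a 39-dimensional vector with librosa (here the 0th cepstral coefficient stands in for the energy term):

```python
# Minimal sketch: 13 base coefficients plus their deltas and delta-deltas = 39 features.
import numpy as np
import librosa

y, sr = librosa.load("recording.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # base coefficients
delta = librosa.feature.delta(mfcc)                      # first-order differences ("velocity")
delta2 = librosa.feature.delta(mfcc, order=2)            # second-order differences ("acceleration")
features = np.concatenate([mfcc, delta, delta2], axis=0) # shape: (39, n_frames)
print(features.shape)
```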
• Cepstral mean and variance normalization
• Next, we can perform feature normalization. We
subtract the mean from each feature and divide it by its
standard deviation. The mean and variance are computed for
each feature dimension j over all the frames in a single
utterance. This allows us to adjust values to
counteract the variations in each recording.

However, if the audio clip is short, this may not be reliable. Instead, we may compute the
average and variance values based on speakers, or even over the entire training dataset. This
type of feature normalization will effectively cancel the pre-emphasis done earlier. That is how
we extract MFCC features. As a last note, MFCC is not very robust against noise.
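A minimal sketch of per-utterance cepstral mean and variance normalization with numpy (the feature matrix is a stand-in):

```python
# Minimal sketch: normalize each feature dimension to zero mean and unit variance
# over all frames of a single utterance.
import numpy as np

features = np.random.randn(39, 200)                 # stand-in for (n_features, n_frames)
mean = features.mean(axis=1, keepdims=True)
std = features.std(axis=1, keepdims=True)
normalized = (features - mean) / (std + 1e-10)      # small epsilon avoids division by zero
```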
• Perceptual Linear Prediction (PLP)
• PLP is very similar to MFCC. Motivated by hearing
perception, it uses equal loudness pre-emphasis and
cube-root compression instead of the log compression.
• It also uses linear prediction (an autoregressive model) to derive the cepstral
coefficients. PLP has slightly better accuracy and
slightly better noise robustness, but MFCC is generally
considered a safe choice. Throughout this series, when
we say we extract MFCC features, PLP features can be
extracted instead.
Discrete Fourier Transform (DFT)
Next, we apply DFT to extract information in the frequency domain.

Mel filterbank
As mentioned in the previous article, the equipment measurements are not the same as our
hearing perception. For humans, the perceived loudness changes according to frequency. Also,
perceived frequency resolution decreases as frequency increases, i.e., humans are less sensitive to
higher frequencies. The diagram on the left indicates how the Mel scale maps the measured
frequency to what we perceive in terms of frequency resolution.
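One common form of this mapping (the variant used by many toolkits) is m = 2595·log10(1 + f/700); a minimal sketch:

```python
# Minimal sketch: a common Hz-to-mel mapping, roughly linear below ~1000 Hz
# and logarithmic above it.
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000), hz_to_mel(4000))   # equal Hz steps shrink on the mel axis at high frequencies
```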
• To train a model for the Sleep.ai project, our data scientists selected a set of most relevant
features from both the time and frequency domains. In combination, they created rich profiles of
grinding and snoring sounds.
• Selecting and training machine learning models
• Since audio features often come in visual form (mostly as spectrograms), they can be treated as an
image recognition problem that relies on deep neural networks. There are several popular
architectures showing good results in sound detection and classification. Here, we focus on only
two, commonly used to identify sleep problems by sound.
• Long short-term memory networks (LSTMs)
• Long short-term memory networks (LSTMs) are known for their ability to spot long-term
dependencies in data and remember information from numerous prior steps. According to sleep
apnea detection research, LSTMs can achieve an accuracy of 87 percent when using MFCC
features as input to separate normal snoring sounds from abnormal ones.

Another study shows even better results: the LSTM classified normal and abnormal snoring
events with an accuracy of 95.3 percent. The neural network was trained using five types of
features including MFCCs and short-time energy from the time domain. Together, they represent
different characteristics of snoring.
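As an illustration only (not the architecture from the cited studies), here is a minimal Keras sketch of an LSTM that classifies fixed-length sequences of MFCC frames into normal vs. abnormal snoring:

```python
# Hedged sketch: an LSTM over MFCC frame sequences for binary snore classification.
import tensorflow as tf

n_frames, n_mfcc = 100, 13    # illustrative input shape: 100 frames of 13 MFCCs each
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_frames, n_mfcc)),
    tf.keras.layers.LSTM(64),                          # remembers information across frames
    tf.keras.layers.Dense(1, activation="sigmoid"),    # normal vs. abnormal snoring
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```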
• Convolutional neural networks (CNNs)
• Convolutional neural networks lead the pack in computer vision in healthcare and
other industries. They are often referred to as a natural choice for image recognition
tasks. The efficiency of CNN architecture in spectrogram processing proves the
validity of this statement one more time.

In the above-mentioned project by the School of Engineering (Eastern Michigan


University) a CNN-based deep learning model hit an accuracy of 96 percent in the
classification of snoring vs non-snoring sounds.

Almost the same results are reported for the combination of CNN and LSTM
architectures. The group of scientists from the Eindhoven University of Technology
applied the CNN model to extract features from spectrograms and then run the
LSTM to classify the CNN output into snore and non-snore events. The accuracy
values range from 94.4 to 95.9 percent depending on the location of the microphone
used for recording snoring sounds.

For the Sleep.ai project, the AltexSoft data science team used two CNNs (for snoring and
grinding detection) and trained them on the TensorFlow platform. After the models achieved
an accuracy of over 80 percent, they were launched into production. Their results
have been steadily improving with the growing number of inputs collected
from real users.
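For illustration (this is not AltexSoft's production model), a minimal Keras CNN that takes fixed-size spectrogram "images" and classifies snore vs. non-snore events might look like this:

```python
# Hedged sketch: a small CNN over spectrogram images for snore / non-snore classification.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 1)),            # illustrative spectrogram size
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),         # snore vs. non-snore
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```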
• Keep in mind, though, that no health app, however
smart it is, can replace a real doctor. The conclusion
made by AI must be verified by your dentist, physician,
or another medical expert.
