

Audio Pre-Processing For Deep Learning
Saman Arzaghi
School of Mathematics, Statistics and Computer Science
College of Science, University of Tehran
[email protected]
Dec 2020

Abstract
Audio signals are continuous (analog) signals whose amplitude gradually decreases as the
sound source fades. Computers, on the other hand, store their data digitally, as streams of
zero and one bits. Digital data is inherently discrete: each zero-or-one value is only defined
at a given instant. Therefore, the continuous analog audio signal must be converted to
a discrete digital form before a computer can store or process it. Of course, the digital
data must be converted back to analog form so that it can be heard through an audio system. This
two-way conversion between analog and digital signals is the primary job of sound cards
and similar adapter cards.
In this article, we discuss different ways to represent audio (Waveform, FFT, STFT,
and MFCC), the differences between them, how to convert each into another, and some code for each
representation.

1 What is Sound?
Well, sound is produced when an object vibrates. Those vibrations set the surrounding air
molecules oscillating, creating an alternation of high and low air pressure. This alternation of
pressure travels as a wave, and we can represent that wave using a waveform.

Simple waveform sound

When we talk about the sounds around us, like our own voice or the sound of a whistle, we are talking
about continuous, analog waveforms. Obviously, we cannot store analog waveforms directly; we
need a way of digitizing them, and for that we use the analog-to-digital conversion (ADC)
process, which consists of two steps:

• Sampling −→ sample the signal at fixed time intervals

• Quantization −→ quantize each sampled amplitude and represent it with a limited number of bits
(note that the more bits we use to store the amplitude, the better the quality of the sound will be);
a minimal sketch of both steps is given after the figure below

A sound wave, in red, represented digitally, in blue (after sampling and 4-bit quantisation)
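As a rough, self-contained sketch of those two steps (the 8 kHz sample rate, the 440 Hz test tone, and
the 4-bit depth below are arbitrary values chosen only for this illustration, not settings used elsewhere
in the article):

import numpy as np

sample_rate = 8000              # Samples per second (arbitrary for this sketch)
duration = 0.01                 # Seconds of "analog" signal to digitize
bits = 4                        # Bit depth used for quantization

# Sampling: evaluate the continuous signal only at discrete time instants
t = np.arange(0, duration, 1.0 / sample_rate)
analog_like = np.sin(2 * np.pi * 440 * t)      # A 440 Hz tone with amplitude in [-1, 1]

# Quantization: map every sampled amplitude onto one of 2**bits levels
levels = 2 ** bits
codes = np.round((analog_like + 1) / 2 * (levels - 1))    # Integer codes 0..15
reconstructed = codes / (levels - 1) * 2 - 1              # Back to the [-1, 1] range

print("max quantization error:", np.max(np.abs(analog_like - reconstructed)))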

2 Start Working With an Audio File


Ok, now let’s take a look at the ending solo of High Hopes by Pink Floyd, which happens to be my
favorite. The plot below shows the loudness (amplitude) of the sound wave changing with time, where
amplitude = 0 represents silence. (From now on, we will analyze this audio.)

Plot audio wave in time domain


(This amplitude is actually the amplitude of the oscillating air particles, driven
by the pressure changes in the atmosphere due to the sound)

You can use the Python code below to extract the waveform from a raw .wav file and then show the plot:
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY WAVEFORM
def waveform():
    plt.figure(figsize=FIG_SIZE)
    librosa.display.waveplot(signal, sr=sample_rate, alpha=0.4)
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.title("Waveform")
    plt.show()
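Note that each snippet in this article only defines its plotting function; assuming highhopes.wav sits
next to the script, appending a call such as waveform() (and later fft(), stft(), or mfccs()) produces
the corresponding plot.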

3 Fourier Transform
If you take a look at the plot of the audio wave in the time domain, it looks too complex to understand,
but nature has given us an incredible way of learning quite a lot about complex sounds, and that is the
Fourier transform (FT). A Fourier transform decomposes a complex periodic sound into a sum of
sine waves oscillating at different frequencies.

FT decomposes complex audio to simpler ones

It is great because now we can decompose complex sounds into simpler ones and analyze them.
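As a small illustration of that idea on a toy signal rather than the song itself (the 50 Hz and 120 Hz
components and the 1 kHz sample rate below are made up for the example):

import numpy as np

sr = 1000                                      # Sample rate of the toy signal
t = np.arange(0, 1, 1.0 / sr)                  # One second of samples
mixture = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(mixture))        # Magnitude spectrum of the real signal
freqs = np.fft.rfftfreq(len(mixture), d=1.0 / sr)

# The two strongest bins sit exactly at the component frequencies
peaks = sorted(freqs[np.argsort(spectrum)[-2:]])
print(peaks)                                   # [50.0, 120.0]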

4 Fast Fourier Transform


The "Fast Fourier Transform" (FFT) is an important measurement method in the science of audio
and acoustics measurement. It converts a signal into its individual spectral components and thereby
provides frequency information about the signal. FFTs are used for fault analysis, quality control, and
condition monitoring of machines and systems. You might be wondering what the difference between
the FFT and the FT is. The Discrete Fourier Transform (DFT) is the discrete version of the Fourier
Transform (FT): it transforms a signal (a discrete sequence) from its time-domain representation to its
representation in the frequency domain. The FFT, in turn, is any efficient algorithm for computing the
DFT and its inverse, for example by factorization into sparse matrices. It may seem a little confusing,
because it is, but you do not need to go that deep into the details; if you want more, I suggest you take
a look at Wikipedia.
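For reference, with the usual sign convention, the DFT of a length-$N$ sequence $x_0, \dots, x_{N-1}$ is
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i 2\pi k n / N}, \qquad k = 0, 1, \dots, N-1,$$
and an FFT algorithm computes all $N$ coefficients in $O(N \log N)$ operations instead of the $O(N^2)$
required by evaluating the sum directly.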

How FFT works

FFT analysis for the same song that we analyzed earlier is shown below:

Plot FFT of the song

You can use the Python code below to extract the FFT from a raw .wav file and then show the plot.
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY FFT TO POWER SPECTRUM
def fft():
    # Perform Fourier transform
    fft = np.fft.fft(signal)
    # Calculate abs values on complex numbers to get magnitude
    spectrum = np.abs(fft)
    # Create frequency variable
    f = np.linspace(0, sample_rate, len(spectrum))
    # Take half of the spectrum and frequency
    left_spectrum = spectrum[:int(len(spectrum)/2)]
    left_f = f[:int(len(spectrum)/2)]
    # Plot spectrum
    plt.figure(figsize=FIG_SIZE)
    plt.plot(left_f, left_spectrum, alpha=0.4)
    plt.xlabel("Frequency")
    plt.ylabel("Magnitude")
    plt.title("Power spectrum")
    plt.show()

5 Short-Time Fourier Transform


Note that when we apply the Fourier transform we move from the time domain to the frequency
domain, and in doing so we lose information about time. At first it seems we have lost a lot of
information, but there is a solution to that, and it is called the short-time Fourier transform (STFT).
If you are curious what its plot looks like, take a look at the figure below.

Plot STFT of the song

You can use the Python code below to extract the STFT from a raw .wav file and then show the plot.
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY STFT TO SPECTROGRAM
def stft():
    hop_length = 512   # In num. of samples
    n_fft = 2048       # Window in num. of samples
    # Calculate duration of hop length and window in seconds
    hop_length_duration = float(hop_length) / sample_rate
    n_fft_duration = float(n_fft) / sample_rate
    print("STFT hop length duration is: {}s".format(hop_length_duration))
    print("STFT window duration is: {}s".format(n_fft_duration))
    # Perform STFT
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    # Calculate abs values on complex numbers to get magnitude
    spectrogram = np.abs(stft)
    # Display spectrogram
    plt.figure(figsize=FIG_SIZE)
    librosa.display.specshow(spectrogram, sr=sample_rate, hop_length=hop_length)
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    plt.colorbar()
    plt.title("Spectrogram")
    plt.show()
    # Apply logarithm to cast amplitude to Decibels
    log_spectrogram = librosa.amplitude_to_db(spectrogram)
    plt.figure(figsize=FIG_SIZE)
    librosa.display.specshow(log_spectrogram, sr=sample_rate, hop_length=hop_length)
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    plt.colorbar(format="%+2.0f dB")
    plt.title("Spectrogram (dB)")
    plt.show()

But how does the STFT work, and how can we build it? It computes several Fourier transforms over
successive intervals of the signal, and in doing so it preserves information about time and about the
way the sound evolves. The interval over which each Fourier transform is performed is given by the
frame size: a frame is a fixed number of consecutive samples, and sliding this frame along the signal
gives us the spectrogram. A naive version of this procedure is sketched after the figure below.

How STFT works
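As a minimal sketch of that framing procedure (ignoring the window function, padding, and other
refinements that librosa.stft applies, so this is only meant to show the idea):

import numpy as np

def naive_stft(signal, frame_size=2048, hop_length=512):
    # Slice the signal into overlapping frames and take an FFT of each frame;
    # a real STFT (e.g. librosa.stft) also multiplies each frame by a window
    # function such as a Hann window before transforming it.
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop_length):
        frames.append(np.fft.rfft(signal[start:start + frame_size]))
    # Shape (num_frequency_bins, num_frames), matching librosa's convention
    return np.array(frames).T

# spectrogram = np.abs(naive_stft(signal))   # One magnitude per (frequency, time) cell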

Now, you may be wondering why we had to learn about spectrograms. The answer is that spectrograms
are fundamental for deep learning applications on audio data: the whole pre-processing pipeline for
audio in deep learning is based on them. We pass the waveform audio through the STFT, obtain a
spectrogram, and use that spectrogram as the input to our deep learning model.

Data conversion steps (using STFT)
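To make the last step of that pipeline concrete, the sketch below assumes the signal loaded in the earlier
snippets and a hypothetical 2-D model (for example a CNN) that consumes image-like arrays; no
particular deep learning framework is implied.

import numpy as np
import librosa

# Assuming `signal` was loaded with librosa.load as in the earlier snippets
stft_matrix = librosa.stft(signal, n_fft=2048, hop_length=512)
log_spectrogram = librosa.amplitude_to_db(np.abs(stft_matrix))

# Shape (n_frequency_bins, n_time_frames): an "image" of the sound
print(log_spectrogram.shape)

# Many 2-D architectures expect (batch, height, width, channels)
model_input = log_spectrogram[np.newaxis, ..., np.newaxis]
print(model_input.shape)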

6 Mel Frequency Cepstral Coefficient


Let’s introduce another feature that is as fundamental and as important as the spectrogram for deep
learning: the Mel Frequency Cepstral Coefficients (MFCCs). MFCCs capture the timbral/textural aspects
of sound. This means that if, for example, a piano and a violin play the same melody, they would have
essentially the same pitch content and the same frequencies, but the quality of the sound would differ;
MFCCs are capable of capturing that information and those differences. The figure below shows what a
plot of MFCCs looks like.

Plot MFCC of the song
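Under the hood, the coefficients come from a fairly standard pipeline: a mel-scaled spectrogram (the mel
scale maps frequency $f$ in Hz to roughly $m = 2595 \log_{10}(1 + f/700)$, compressing high frequencies
the way our hearing does), then log compression, then a discrete cosine transform whose first few
coefficients are kept. The sketch below reproduces those stages only roughly, assuming the signal and
sample_rate variables from the earlier snippets; librosa.feature.mfcc, used in the snippet after it, handles
the details and defaults for you.

import numpy as np
import librosa
import scipy.fftpack

# Assuming `signal` and `sample_rate` were loaded as in the earlier snippets
mel_spec = librosa.feature.melspectrogram(y=signal, sr=sample_rate,
                                          n_fft=2048, hop_length=512, n_mels=40)
log_mel = librosa.power_to_db(mel_spec)                              # Log compression
mfcc_like = scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:13]    # Keep 13 coefficients
print(mfcc_like.shape)                                               # (13, num_frames)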

You can use the Python code below to extract MFCCs from a raw .wav file and then show the plot.
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY MFCCs
def mfccs():
    hop_length = 512   # In num. of samples
    n_fft = 2048       # Window in num. of samples
    # Extract 13 MFCCs
    MFCCs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_fft=n_fft,
                                 hop_length=hop_length, n_mfcc=13)
    # Display MFCCs
    plt.figure(figsize=FIG_SIZE)
    librosa.display.specshow(MFCCs, sr=sample_rate, hop_length=hop_length)
    plt.xlabel("Time")
    plt.ylabel("MFCC coefficients")
    plt.colorbar()
    plt.title("MFCCs")
    plt.show()
And MFCCs are used in the pipeline in just the same way as the STFT spectrogram.

Data conversion steps (using MFCC)

7 What Did We Use to Do? (Optional)


In the past, the traditional machine learning pre-processing for audio was quite different. Let’s
take a look at it: we can extract a lot of information from waveforms, for example time-domain features
from the waveform itself and frequency-domain features from the spectrogram. So we would start from
the waveform, extract the features we wanted, combine them, and feed them to machine learning
algorithms like logistic regression or a support vector machine. Fortunately, with the advent of deep
learning, the whole process has become more straightforward. After all, we no longer need that much
feature engineering, because we use the spectrogram directly. This is why deep learning models for
audio are called end-to-end models: you just use some basic representation of the input without
worrying too much about extracting specific features.

