Saman Arzaghi
University of Tehran
All content following this page was uploaded by Saman Arzaghi on 16 December 2020.
Abstract
Audio signals are continuous (analog) signals whose amplitude gradually decays as the sound source fades. Computers, on the other hand, store their data digitally, as streams of zero and one bits. Digital data is inherently discrete: each value of zero or one is only valid at a given instant. A continuous analog audio signal must therefore be converted to a discrete digital form before a computer can store or process it, and the digital data must be converted back to analog form before it can be heard through an audio system. This two-way conversion between analog and digital signals is the primary operation of all adapter cards and sound cards.
In this article, we discuss different ways to represent audio (waveform, FFT, STFT, and MFCC), the differences between them, how to convert one into another, and example code for each representation.
1 What is Sound?
Well, sound is produced when an object vibrates. Those vibrations set the surrounding air molecules oscillating, creating alternating regions of high and low air pressure. This alternation propagates as a wave, which we can represent as a waveform.
The sounds around us, such as our own voice or the sound of a whistle, are continuous, analog waveforms. We obviously cannot store analog waveforms directly, so we need a way of digitizing them. For that we use the analog-to-digital conversion (ADC) process, which consists of two steps:
• Sampling −→ Sample the amplitude of the signal at regular time intervals (the sampling rate determines the highest frequency that can be captured)
• Quantization −→ Quantize each sampled amplitude and represent it with a limited number of bits
(note that the more bits we use to store the amplitude, the better the sound quality will be)
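The ADC steps above can be sketched in a few lines of NumPy. This is a synthetic example with illustrative parameter values (a 440 Hz sine, 8 kHz sampling, 4-bit quantization), not from the original article:

```python
import numpy as np

# One second of a 440 Hz sine wave standing in for the analog signal
sample_rate = 8000                          # sampling step: 8000 samples per second
t = np.arange(sample_rate) / sample_rate
analog = np.sin(2 * np.pi * 440 * t)

# Quantization step: map amplitudes in [-1, 1] onto 2**4 = 16 discrete levels
bits = 4
levels = 2 ** bits
quantized = np.round((analog + 1) / 2 * (levels - 1)).astype(int)

# Map back to [-1, 1]; the difference from the original is the quantization error
reconstructed = quantized / (levels - 1) * 2 - 1
max_error = np.abs(analog - reconstructed).max()
print(f"{bits} bits -> {levels} levels, max quantization error ~ {max_error:.3f}")
```

With more bits the grid of levels becomes finer, and the maximum error roughly halves per extra bit, which is the sense in which more bits give better quality.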
You can use the Python code below to extract the waveform from a raw .wav file and then show the plot:
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# Display waveform
plt.figure(figsize=FIG_SIZE)
librosa.display.waveshow(signal, sr=sample_rate, alpha=0.4)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.show()
3 Fourier Transform
If you look at the plot of an audio wave in the time domain, it looks too complex to understand. But nature has given us an incredible way of learning quite a lot about complex sounds, and that is the Fourier transform (FT). A Fourier transform decomposes a complex periodic sound into a sum of sine waves oscillating at different frequencies.
This is great because we can now decompose complex sounds into simpler ones and analyze them.
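As a quick sanity check of that idea, the sketch below (a synthetic example, not taken from the article) builds a "complex" signal from two sine waves and recovers their frequencies with NumPy's FFT:

```python
import numpy as np

sr = 1000                       # sample rate in Hz (illustrative value)
t = np.arange(sr) / sr          # one second of samples
# A "complex" periodic sound: 50 Hz and 120 Hz sine waves added together
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.fft(signal))
freqs = np.fft.fftfreq(len(signal), d=1 / sr)

# Look only at the positive half and pick the two strongest components
half = len(signal) // 2
peaks = sorted(freqs[:half][np.argsort(spectrum[:half])[-2:]].astype(int))
print(peaks)  # -> [50, 120]: the two sine waves we started from
```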
The FFT analysis of the same song that we analyzed earlier is shown below:
You can use the Python code below to extract the FFT from a raw .wav file and then show the plot:
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY FFT TO POWER SPECTRUM
def fft():
    # Perform Fourier transform
    fft = np.fft.fft(signal)
    # Calculate abs values on complex numbers to get magnitude
    spectrum = np.abs(fft)
    # Create frequency variable
    f = np.linspace(0, sample_rate, len(spectrum))
    # Take half of the spectrum and frequency (the FFT of a real signal is symmetric)
    left_spectrum = spectrum[:int(len(spectrum)/2)]
    left_f = f[:int(len(spectrum)/2)]
    # Plot spectrum
    plt.figure(figsize=FIG_SIZE)
    plt.plot(left_f, left_spectrum, alpha=0.4)
    plt.xlabel("Frequency")
    plt.ylabel("Magnitude")
    plt.title("Power spectrum")
    plt.show()

fft()
The Fourier transform tells us the frequency content of the whole signal but loses all time information. The short-time Fourier transform (STFT) addresses this by computing the FFT over short, overlapping windows. You can use the Python code below to extract the STFT from a raw .wav file and then show the plot:
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY STFT TO SPECTROGRAM
def stft():
    hop_length = 512  # In num. of samples
    n_fft = 2048      # Window in num. of samples
    # Calculate duration of hop length and window in seconds
    hop_length_duration = float(hop_length) / sample_rate
    n_fft_duration = float(n_fft) / sample_rate
    print("STFT hop length duration is: {}s".format(hop_length_duration))
    print("STFT window duration is: {}s".format(n_fft_duration))
    # Perform STFT
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
    # Calculate abs values on complex numbers to get magnitude
    spectrogram = np.abs(stft)
    # Display spectrogram
    plt.figure(figsize=FIG_SIZE)
    librosa.display.specshow(spectrogram, sr=sample_rate, hop_length=hop_length)
    plt.xlabel("Time")
    plt.ylabel("Frequency")
    plt.colorbar()
    plt.title("Spectrogram")
    plt.show()

stft()
But how does the STFT work, and how do we build it? It computes several Fourier transforms at different intervals, and in doing so it preserves information about time and the way the sound evolves. The intervals at which we perform the Fourier transform are given by the frame size: a frame is a bunch of samples, and by fixing the number of samples per frame and sliding the frame along the signal, we obtain a spectrogram.
Now we may be wondering why we had to learn about spectrograms. Spectrograms are fundamental for deep learning applications on audio data: the whole pre-processing pipeline for audio deep learning is based on them. We pass waveform audio through the STFT, get a spectrogram, and use the spectrogram as the input to our deep learning model.
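That frame-by-frame picture can be reproduced with plain NumPy. The sketch below (a synthetic signal and a hand-rolled STFT, used here only to illustrate the mechanics) slices the signal into overlapping frames and FFTs each one:

```python
import numpy as np

sample_rate = 22050
signal = np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)

n_fft = 2048       # frame (window) size in samples
hop_length = 512   # samples between the starts of successive frames

# Slice the signal into overlapping frames, window each one, and FFT it
num_frames = 1 + (len(signal) - n_fft) // hop_length
frames = np.stack([signal[i * hop_length : i * hop_length + n_fft]
                   for i in range(num_frames)])
window = np.hanning(n_fft)                  # taper to reduce spectral leakage
spectrogram = np.abs(np.fft.rfft(frames * window, axis=1)).T

print(spectrogram.shape)  # -> (1025, 40): frequency bins x time frames
```

librosa.stft does the same thing, plus zero-padding at the signal edges, which is why its frame count differs slightly from this bare-bones version.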
MFCCs (Mel-frequency cepstral coefficients) capture the overall shape of the spectrum on the perceptually motivated mel scale. You can use the Python code below to extract MFCCs from a raw .wav file and then show the plot:
import numpy as np
import librosa, librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

file = "highhopes.wav"

# Load audio file with Librosa
signal, sample_rate = librosa.load(file, sr=22050)

# DISPLAY MFCCs
def mfccs():
    hop_length = 512  # In num. of samples
    n_fft = 2048      # Window in num. of samples
    # Extract 13 MFCCs (recent librosa versions require keyword arguments)
    MFCCs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_fft=n_fft,
                                 hop_length=hop_length, n_mfcc=13)
    # Display MFCCs
    plt.figure(figsize=FIG_SIZE)
    librosa.display.specshow(MFCCs, sr=sample_rate, hop_length=hop_length)
    plt.xlabel("Time")
    plt.ylabel("MFCC coefficients")
    plt.colorbar()
    plt.title("MFCCs")
    plt.show()

mfccs()
And we use MFCCs in the same way we use the STFT: as input features for a model.
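For instance, a common way to prepare MFCCs as model input is to transpose them to (time, features) and append frame-to-frame delta features. The shapes and the delta computation below are illustrative assumptions, not from the article:

```python
import numpy as np

# Stand-in for librosa's output: 13 coefficients over 100 frames
MFCCs = np.random.randn(13, 100)

# First-order deltas: how each coefficient changes from frame to frame
deltas = np.gradient(MFCCs, axis=1)

# Models usually expect (time, features): one row per frame
features = np.concatenate([MFCCs, deltas]).T
print(features.shape)  # -> (100, 26)
```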