SR_Lab File

The document outlines a series of experiments conducted using MATLAB R2020a, focusing on audio signal processing. Each experiment has specific aims, such as recording audio, detecting speech regions, and computing the Fast Fourier Transform (FFT) of audio signals, along with detailed methodologies and conclusions. The results demonstrate successful execution of audio-related functions and analysis, providing insights into speech detection and frequency analysis.


Contents

1. Experiment 1: Recording, reading and writing an audio signal
2. Experiment 2: Detecting regions of speech in an audio signal
3. Experiment 3: Fast Fourier Transform (FFT) of an audio signal
4. Experiment 4: Short Time Fourier Transform (STFT) of an audio signal
5. Experiment 5: Scalogram and multi-resolution analysis using the Discrete Wavelet Transform
6. Experiment 6: MFCC data, frequency-domain voice activity detection and cepstral feature extraction
7. Experiment 7: Spectral descriptors of an audio signal
8. Experiment 8: Recognition of emotion in a speech signal
1. Experiment 1

Aim: - To record, read and write an audio signal and execute other related functions
Software Used: - MATLAB R2020a
Theory: -
Production of speech in the human vocal system takes place in four stages, listed below in their sequence of occurrence.
1. Breathing (lungs, diaphragm, rib muscles): intake of the pulmonic air stream up to full capacity using the diaphragm.
2. Phonation (larynx, vocal cords, trachea): production of voice through vibration of the vocal cords.
3. Resonation (upper part of the larynx, pharynx, nasal cavity, oral cavity): voice amplification and modification.
4. Articulation (uvula, velum, tongue, lower lip, upper jaw): production and characterization of phonemes and accents.

The following functions are used in the experiment to achieve the objective:
1. r = audiorecorder(fs,nBits,NumChannels,ID): creates an audio recorder object for the device specified by the device identifier ID, with NumChannels channels, sampled at fs and quantized to nBits.
2. r = getaudiodata(recorder,datatype): obtains the audio data from the recorder object and converts it to the specified datatype.
3. p = play(recobj,[start stop]): plays the audio between the samples specified by start and stop.
4. audiowrite(filename,y,Fs,Name,Value): writes the matrix of audio data y into a file of the specified name; the name-value pairs include, inter alia, bits per sample, bit rate, quality, artist, title and comments.
5. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
6. player = audioplayer(y,Fs,nBits,ID): creates an audio player object with the specified parameters.
7. P = get(recorder): obtains the property values of the specified object.
8. record(obj): records data and event information for the specified object without blocking execution.
9. pause: stops execution temporarily (pause(recorder) pauses the recorder object itself).
10. recorder.resume: resumes a paused recording.
11. recorder.stop: stops recording.
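As an illustration of the audiowrite name-value pairs listed above, a minimal sketch (separate from the lab program; the file name tone_demo.flac is hypothetical, and FLAC is used here because it stores both the bits-per-sample setting and the metadata fields):

fs = 44100;
y = 0.5*sin(2*pi*440*(0:fs-1)'/fs);       % 1 s, 440 Hz test tone
audiowrite('tone_demo.flac',y,fs, ...
    'BitsPerSample',24, ...
    'Title','Test tone', ...
    'Artist','SR Lab', ...
    'Comment','audiowrite metadata example');
info = audioinfo('tone_demo.flac')        % the stored metadata appears in the returned info struct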

Program: -
clc;
clear all;
close all;

Fs = 44100;
nbits = 16;
nchannels = 2;
ID = -1;
t = 5;
file = 'audio_rec.wav';
chk_rec = false;

if(chk_rec)
recorder = audiorecorder(Fs,nbits,nchannels,ID);
disp('Start Speaking');
recordblocking(recorder,t);
disp('Stop Speaking');

audio_data = getaudiodata(recorder,'double'); % Object function 1


player = play(recorder); % Object function 2
audiowrite(file,audio_data,Fs);
else
audio_rec = audioread(file);
end

if ~chk_rec
audio_data = audioread(file);
audio_player = audioplayer(audio_data,Fs,nbits,ID);
end

t_start = 2;
t_stop = 4;
samples = [2 * Fs,4* Fs];
audio_data_res = audioread(file,samples);
duration = 1:1:Fs * t;
audio_player_res = audioplayer(audio_data_res,Fs,nbits,ID);
disp('Playing resampled audio');
pause(2);
play(audio_player_res);

figure(1);
subplot(2,1,1);
plot(duration / Fs,audio_data);
xlabel('Time');
ylabel('Amplitude');
title('Original Recorded Audio','FontSize',14);
subplot(2,1,2);
plot(duration(2 * Fs:4 * Fs) / Fs,audio_data_res);
xlabel('Time');
ylabel('Amplitude');
title('Resampled Audio Data','FontSize',14);

% Property values of audio recorder object

if chk_rec
properties = get(recorder); % Object function 3
else
properties = get(audio_player);
end

% Change sampling frequency

fs = 8000;
if ~chk_rec
audio_player_resampled = audioplayer(audio_data,fs,nbits,ID);
end

% Change number of bits

nBits = 8;
if ~chk_rec
audio_player_reformed = audioplayer(audio_data,Fs,nBits,ID);
end

% Pause/resume

recorder_pr = audiorecorder(Fs,nbits,nchannels,ID);
disp('Start Speaking');
record(recorder_pr);            % Non-blocking recording
pause(1);                       % Let 1 s of audio accumulate
recorder_pr.pause;              % Pause the recorder object itself
recorder_pr.resume;             % Object function 4 - resume recording
pause(1);
disp('Stop Speaking');
recorder_pr.stop;               % Object function 5
play(recorder_pr);
Output: -

Conclusions: -
1. recordblocking function does not relinquish control to the main program until the
recording is completed
2. To sub-sample the audio signal, the time period is multiplied by the sampling
frequency to isolate the samples in the desired interval
3. While plotting the sub-sampled signals, the independent (time) axis variable is divided by
the sampling frequency so as to identify the interval of sub-sampling
4. Resampling the audio data at a lower sampling frequency severely affects the audio
characteristics of the data
5. Change in the number of bits has no perceptible effect on the recorded audio data
6. For pausing, recording and stopping functionalities, the record function is used instead
of recordblocking.

Result: -
Audio signal was recorded and related functions executed thereon, as evidenced by the
observations and conclusions drawn therefrom.
2. Experiment 2

Aim: - To detect regions of speech in an audio signal


Software Used: - MATLAB R2020a
Theory: -
Speech recognition involves the following components
a. Analog to Digital Conversion – Conversion of analog acoustic signal into digital format
b. Acoustic/Language Modelling – Modelling of the speech as per the statistical
template/predictive language models
c. Speech Engine – Deciphering the contents of the speech signal, phoneme by
phoneme
d. Display – Display the inferred speech
e. Feedback – To train the speech engine as per the speech structure and phoneme
constitution of the given language
The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. idx = detectSpeech(audioIn,fs): returns indices corresponding to the presence of speech in the given audio signal audioIn, sampled at the rate fs.
3. idx = detectSpeech(audioIn,fs,Name,Value): detects speech under the specified constraints; the name-value pairs include the window (type and attributes), overlap length, merge distance and thresholds.
Program: -
clc;
clear all;
close all;

file_name = 'audio_rec.wav';
% Detecting speech regions
[audioIn,fs] = audioread(file_name);
figure(1);
detectSpeech(audioIn(:,1),fs); % Plot detected speech over the entire length of the audio
xlabel('Time');
ylabel('Amplitude');
title('Detected Speech in the Audio Signal');
% Thresholding using Windowing, Overlap Length and Merge Distance

window_duration = 0.074; % Range [2,size(audioIn,1)]


window_samples = round(window_duration * fs);
window_type = 'hanning'; % Chebyshev, Hanning, Bartlett, Blackman, Gaussian, Tukey, Kaiser, Taylor, Bohman
f = str2func(window_type); % Returns function handle
window = f(window_samples,'periodic');

overlap_percent = 10;
overlapping_samples = round(window_samples * overlap_percent / 100); % Number of overlapping samples in adjacent windows

merge_duration = 0.1;
merge_distance = round(merge_duration * fs); % Number of samples to be merged on occurrence of a positive detection of speech

figure(2);
detectSpeech(audioIn(:,1),fs,'Window',window,'OverlapLength',overlapping_samples, ...
    'MergeDistance',merge_distance);
xlabel('Time');
ylabel('Amplitude');
title('Speech detection using custom window');

% Reuse of decision thresholds on the segments of the same signal

split_position = 0.3; % Specifying the splitting ratio


t = numel(audioIn(:,1)) / fs;
split_loc = split_position * t;
first_part = audioIn(1:round(split_loc * fs),1);
second_part = audioIn(round(split_loc * fs + 1):end,1);
[idx,thresholds] = detectSpeech(first_part,fs); % Thresholds from the first part

figure(3);
detectSpeech(second_part,fs,'Thresholds',thresholds);
xlabel('Time');
ylabel('Amplitude');
title('Speech detection using the predetected thresholds');
Output: -
The following figures were produced (images not reproduced here):
1. Detected boundaries in the audio signal without any custom window.
2. Detected boundaries in the audio signal using a Hanning window of 0.074 s length, 10% overlap and 0.1 s merge duration.
3. Detected boundaries in the speech signal using thresholds obtained from the initial 30% segment of the signal.
Conclusions: -
1. Different window functions viz. Chebyshev, Hanning, Bartlett, Blackman, Gaussian,
Tukey, Kaiser, Taylor, Bohman may be used with relevant attributes for the detection
of the speech segments in the audio signal.
2. Window length, overlap length and merge distance are specified in terms of the number of samples.
3. The audio signal is split in the desired ratio and the thresholds detected in one part
are used to identify the speech segments in the other part.
Result: -
Regions of speech were detected in the recorded audio signal, as evidenced by the
observations and conclusions drawn above.
3. Experiment 3

Aim: - To compute the Fast Fourier Transform (FFT) of an audio signal


Software Used: - MATLAB R2020a
Theory: -
The continuous-time Fourier Transform of the signal x(t) is given by

X(f) = \int_{-\infty}^{\infty} x(t) \, e^{-j 2 \pi f t} \, dt

Its inverse is expressed as

x(t) = \int_{-\infty}^{\infty} X(f) \, e^{j 2 \pi f t} \, df

The Discrete Fourier Transform (DFT) is computed efficiently using the Fast Fourier Transform (FFT), which employs a divide-and-conquer (cascading) approach to reduce the number of additions and multiplications from O(N^2) to O(N log N). The DFT of an N-length discrete-time sequence x[n] is given by

X_k = \sum_{n=0}^{N-1} x[n] \, e^{-j 2 \pi k n / N}

Its inverse is given as

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{j 2 \pi k n / N}

The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. y = fft(X,n,dim): returns the n-point Fast Fourier Transform along the specified dimension dim.
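As a quick check of the DFT definition above (not part of the lab program), the output of fft can be compared against a direct evaluation of the summation on a short test vector:

x = [1 2 3 4];                       % short test sequence
N = numel(x);
n = 0:N-1;
k = (0:N-1)';
Xdft = exp(-1j*2*pi*k*n/N) * x(:);   % direct evaluation of the DFT summation above
Xfft = fft(x(:));                    % fft computes the same coefficients
max(abs(Xdft - Xfft))                % difference is at numerical precision (~1e-15)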
Program: -
clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
t = 0.01:0.01:10;
f1 = 10;
f2 = 20;
f3 = 30;
split_p1 = 0.2;
split_p2 = 0.5;
split_p3 = 1 - (split_p1 + split_p2);
s1 = sin(2 * pi * f1 * t) + sin(2 * pi * f2 * t) + sin(2 * pi * f3 * t); % Multitone signal
s2 = cat(2,sin(2 * pi * f1 * t(1:split_p1 * length(t))), ...
    sin(2 * pi * f2 * t(split_p1 * length(t) + 1:split_p1 * length(t) + split_p2 * length(t))), ...
    sin(2 * pi * f3 * t((split_p1 + split_p2) * length(t) + 1:end))); % Synthetic non-stationary signal
N = randn(size(t));

Y_s1 = fft(s1 + N);


Y1 = abs(Y_s1/length(t));
Y1 = Y1(1:length(t) / 2 + 1);
Y1(2:end - 1) = 2 * Y1(2:end - 1);

figure(1);
subplot(4,1,1);
plot(t,s1);
xlabel('Time');
ylabel('Amplitude');
title('Multitone signal');
subplot(4,1,2);
plot(t,s1 + N);
xlabel('Time');
ylabel('Amplitude');
title('Multitone signal with AWGN');
subplot(4,1,3);
plot((1:length(t) / 2 + 1) / 10,Y1);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of noisy multitone signal');
subplot(4,1,4);
plot(t,ifft(Y_s1));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of noisy multitone signal');

Y_s2 = fft(s2 + N);


Y2 = abs(Y_s2/length(t));
Y2 = Y2(1:length(t) / 2 + 1);
Y2(2:end - 1) = 2 * Y2(2:end - 1);

figure(2);
subplot(4,1,1);
plot(t,s2);
xlabel('Time');
ylabel('Amplitude');
title('Synthetic Non-stationary signal');
subplot(4,1,2);
plot(t,s2 + N);
xlabel('Time');
ylabel('Amplitude');
title('Synthetic Non-stationary signal with AWGN');
subplot(4,1,3);
plot((1:length(t) / 2 + 1) / 10,Y2);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of noisy Synthetic Non-stationary signal');
subplot(4,1,4);
plot(t,ifft(Y_s2));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of noisy Synthetic Non-stationary signal');

d1 = 1:split_p1 * length(t);
d2 = 1:(split_p1 + split_p2) * length(t);
d3 = (split_p1 + split_p2) * length(t) + 1:length(t);
Y_s2_p1 = window_fft(s2,d1,N);
Y_s2_p2 = window_fft(s2,d2,N);
Y_s2_p3 = window_fft(s2,d3,N);
figure(3);
subplot(3,2,1);
plot(t(1:split_p1 * length(t)),s2(1:split_p1 * length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 1');
subplot(3,2,2);
plot(((1:length(d1) / 2 + 1) / 2 - 0.5) / (split_p1 * 10 / 2),Y_s2_p1);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 1');
subplot(3,2,3);
plot(t(1:(split_p1 + split_p2) * length(t)),s2(1:(split_p1 + split_p2) * length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 2');
subplot(3,2,4);
plot(((1:length(d2) / 2 + 1) / 2 - 0.5) / ((split_p1 + split_p2) * 10 / 2),Y_s2_p2);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 2');
subplot(3,2,5);
plot(t((split_p1 + split_p2) * length(t) + 1:length(t)),s2((split_p1 + split_p2) * length(t) + 1:length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 3');
subplot(3,2,6);
plot(((1:length(d3) / 2 + 1) / 2 - 0.5) / (split_p3 * 10 / 2),Y_s2_p3);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 3');

Y_s3 = fft(audioData(:,1));
Y3 = abs(Y_s3/length(audioData));
Y3 = Y3(1:length(audioData) / 2 + 1);
Y3(2:end - 1) = 2 * Y3(2:end - 1);

figure(4);
subplot(3,1,1);
plot((1:length(audioData))/fs,audioData);
xlabel('Time');
ylabel('Amplitude');
title('Recorded Audio signal');
subplot(3,1,2);
plot((0:length(audioData) / 2) * fs / length(audioData),Y3); % Frequency axis in Hz for the audio sample rate
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of audio signal');
subplot(3,1,3);
plot((1:length(audioData))/fs,ifft(Y_s3));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of audio signal');
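The helper window_fft used above is not listed in this document. A plausible implementation, assuming it returns the single-sided amplitude spectrum of the noisy signal restricted to the sample indices d (saved as window_fft.m on the MATLAB path), is:

function Y = window_fft(s,d,N)
% Plausible reconstruction of the helper (assumption): single-sided amplitude
% spectrum of the noisy signal s + N restricted to the sample indices in d.
x = s(d) + N(d);                 % noisy segment
L = length(d);
Y = abs(fft(x)/L);               % normalized two-sided spectrum
Y = Y(1:floor(L/2) + 1);         % keep the single-sided half
Y(2:end-1) = 2*Y(2:end-1);       % account for the energy in the discarded half
end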
Output: -
The following figures were produced (images not reproduced here):
1. FFT of the multitone signal consisting of harmonics of 10, 20 and 30 Hz, and its 1000-point IFFT.
2. FFT of the noisy non-stationary signal consisting of harmonics of 10, 20 and 30 Hz for 20%, 50% and the remaining 30% of the signal duration, and its 1000-point IFFT.
3. Segmentation of the non-stationary signal consisting of harmonics of 10, 20 and 30 Hz for 20%, 50% and the remaining 30% of the signal duration into the corresponding duration windows, and computation of the FFT for each window.
4. FFT of the recorded audio signal and its 220500-point IFFT.
Conclusions: -
1. The Fast Fourier Transform (FFT) is able to distinctly identify the frequencies present in a multi-tone signal in the presence of Additive White Gaussian Noise (AWGN) of unit variance.
2. The FFT is also able to detect the frequencies present in a non-stationary signal in the presence of noise, though the detection region is widened with respect to the frequency identified, revealing a shortcoming of the FFT in the spectral analysis of non-stationary signals.
3. The FFT is able to identify the frequencies present in the different segments of the audio signal, irrespective of the remaining segments, though the region of detection is widened with respect to the frequency detected, as before.
4. For the non-stationary audio signal, the FFT reveals the presence of frequencies in the lower range of the spectrum, typically less than 1 kHz.
5. For the non-stationary audio signal, even a 220500-point IFFT is not able to reconstruct the original speech signal accurately, i.e. it is unable to model the sharp discontinuities and fast fluctuations in the audio signal, thus highlighting another shortcoming of the FFT.
Result: -
The Fast Fourier Transform (FFT) of the recorded audio signal was computed and different operations performed thereon, as evidenced by the observations and conclusions drawn above.
4. Experiment 4

Aim: - To compute the Short Time Fourier Transform (STFT) of an audio signal
Software Used: - MATLAB R2020a
Theory: -
Short Time Fourier Transform involves the segmentation of the signal into narrow
intervals and computation of Fourier Transform in each such interval. This is the proposed
methodology for obtaining the time-frequency information by windowing the incoming
signal.

\mathrm{STFT}_f(t', u) = \int [f(t) \, W(t - t')] \, e^{-j 2 \pi u t} \, dt

where f(t) is the incoming signal, W(t - t') is the window function centered at t', and the frequency parameter is denoted by the variable u.
A wide window provides good frequency resolution but poor time resolution; the converse holds for a narrow window.
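A minimal sketch of this trade-off (a hypothetical 1 s, 8 kHz test signal whose tone switches from 1 kHz to 1.2 kHz halfway through; not part of the lab program):

fs = 8000;
t = (0:fs-1)/fs;
x = [sin(2*pi*1000*t(1:end/2)), sin(2*pi*1200*t(end/2+1:end))];
figure;
stft(x,fs,'Window',hamming(512,'periodic'),'OverlapLength',256); % wide window: sharp frequency peaks, smeared switching instant
figure;
stft(x,fs,'Window',hamming(64,'periodic'),'OverlapLength',32);   % narrow window: sharp switching instant, broad frequency peaks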
The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. s = stft(x,'FFTLength',...,'Window',...,'OverlapLength',...,'Centered',...): returns the Short Time Fourier Transform of the input signal, calculated in accordance with the specified parameters.

Program: -
clc;
clear all;
close all;

segment_length = 10000;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(1:segment_length);
N = 5000; % In millisecond
nf = 50; % Normalization factor
t = (0:N - 1) / nf; % Normalization necessary
f1 = 75;
f2 = 50;
f3 = 25;
f4 = 10;
fft_length = 2048;
window_length = 128;
window_type = 'hamming';
overlap_length = 64;
freq_centering = false;
fun = str2func(window_type);
x = cat(2,sin(2 * f1 * t(1:length(t) / 4)), ...
    sin(2 * f2 * t(length(t) / 4 + 1:length(t) / 2)), ...
    sin(2 * f3 * t(length(t) / 2 + 1:3 * length(t) / 4)), ...
    sin(2 * f4 * t(3 * length(t) / 4 + 1:end)));
[s,f,t] = stft(x,'FFTLength',fft_length,'Window',fun(window_length), ...
    'OverlapLength',overlap_length,'Centered',freq_centering);
[s_s,f_s,t_s] = stft(audioData,'FFTLength',fft_length,'Window',fun(window_length), ...
    'OverlapLength',overlap_length,'Centered',freq_centering);

figure(1);
surf(t / nf,f * nf / 2,abs(s));
xlabel('Time (ms)');
ylabel('Frequency (Hz)');
zlabel('Amplitude');
title('Short Time Fourier Transform of Non-Stationary Signal');
colormap jet

figure(2);
surf(abs(s_s));
xlabel('Time (ms)');
ylabel('Frequency (Hz)');
zlabel('Amplitude');
title('Short Time Fourier Transform of Speech Signal');
colormap jet
Output: -
The following figures were produced (images not reproduced here):
1. STFT of a 5-second-long synthetic non-stationary signal with harmonics of 10, 25, 50 and 75 Hz.
2. STFT of the recorded audio signal.
Conclusions: -
1. Short Time Fourier Transform provides a good time-frequency resolution for
multitoned and non-stationary signals containing frequencies on the lower harmonic
scale.
2. STFT is able to clearly distinguish the frequencies in the non-stationary signal,
occurring in their respective points of time.
3. Setting frequency centering to false produces a spectrum whose angular frequencies span the range [0, 2π) (equivalently [0, fs)), rather than being centered on [-π, π).
Result: -
The Short Time Fourier Transform (STFT) of the synthetic non-stationary signal and the recorded audio signal was computed and different operations performed thereon, as evidenced by the observations and conclusions drawn above.
5. Experiment 5

Aim: - To compute the scalogram of the audio signal using Discrete Wavelet Transform
and conduct its multi-resolution analysis
Software Used: - MATLAB R2020a
Theory: -
Wavelet Transform
Short Time Fourier Transform involves the segmentation of the signal into narrow
intervals and computation of Fourier Transform in each such interval. This is the proposed
methodology for obtaining the time-frequency information by windowing the incoming
signal.

\mathrm{STFT}_f(t', u) = \int [f(t) \, W(t - t')] \, e^{-j 2 \pi u t} \, dt

where f(t) is the incoming signal, W(t - t') is the window function centered at t', and the frequency parameter is denoted by the variable u. A wide window provides good frequency resolution but poor time resolution; the converse holds for a narrow window.
The wavelet transform is given by

\gamma(s, \tau) = \int f(t) \, \psi_{s,\tau}^{*}(t) \, dt

Its inverse is expressed as

f(t) = \iint \gamma(s, \tau) \, \psi_{s,\tau}(t) \, d\tau \, ds

All the wavelets are derived from the mother wavelet \psi through scaling and shifting:

\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}} \, \psi\!\left(\frac{t - \tau}{s}\right)

where s is the scaling parameter and \tau is the shift parameter of the wavelet.
The basis function is taken as the wavelet instead of complex exponentials as in the case
of Fourier Transform and STFT.
Multi-Resolution Analysis of DWT
We perform the multi-resolution analysis (MRA) of the maximal overlap discrete wavelet transform (MODWT) computed for a given signal. The same wavelet filter is used both in computing the MODWT and in its multi-resolution analysis.
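As a minimal illustration of this property (a toy random signal and the sym4 wavelet, as used in the program below), the MRA components add back to the analysed signal to within numerical precision:

x = randn(1,1024);               % toy signal
w = modwt(x,'sym4',5);           % 5-level MODWT
mra = modwtmra(w,'sym4');        % one row per detail level plus the final smooth
max(abs(x - sum(mra,1)))         % reconstruction error on the order of 1e-12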
The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. s = stft(x,'FFTLength',...,'Window',...,'OverlapLength',...,'Centered',...): returns the Short Time Fourier Transform of the input signal, calculated in accordance with the specified parameters.
3. y = cwt(audioData,wname): computes the continuous wavelet transform of the given data using the specified wavelet.
4. y1 = modwt(audioData,level,wname): calculates the maximal overlap discrete wavelet transform of the given data using the specified number of levels and wavelet.
5. y2 = modwtmra(y1,wname): performs the multi-resolution analysis of the computed MODWT using the same wavelet.

Program: -
Scalogram

clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);
fft_length = 2048;
window_length = 128;
window_type = 'hamming';
overlap_length = 64;
freq_centering = false;
fun = str2func(window_type);
[s_s,f_s,t_s] = stft(audioData,'FFTLength',fft_length,'Window',fun(window_length), ...
    'OverlapLength',overlap_length,'Centered',freq_centering);

figure(1);
plot(t / fs,audioData);
xlabel('Time (s)');
ylabel('Amplitude (V)');
title('Recorded Speech Signal');

wname = 'morse';
wt = cwt(audioData,wname);
figure(2);
subplot(2,1,1);
imagesc(t / fs,1:size(wt,1),abs(wt));
colorbar
xlabel('Time (s)');
ylabel('Frequency (Hz)');
title('Scalogram of Speech Signal');

dim = 1:size(wt,1) / 2;
s_s_norm = s_s(dim,:);
subplot(2,1,2);
imagesc(t / fs,1:size(s_s_norm,1),abs(s_s_norm));
colorbar
xlabel('Time (s)');
ylabel('Frequency (Hz)');
title('STFT of Speech Signal');

Multi–Resolution Analysis

clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);

level = 10;
wname = 'sym4';
mdwt_sp = modwt(audioData',level,wname);
mra = modwtmra(mdwt_sp,wname);
err = abs((audioData' - sum(mra)));

f1 = figure;
f2 = figure;
figure(f1);
subplot(level / 2 + 1,1,1);
plot(t / fs,audioData);
title('Original Recorded Audio');
for i = 1:level + 1
if i < level / 2 + 1
figure(f1);
subplot(level / 2 + 1,1,i + 1)
elseif i == level / 2 + 1
figure(f1);
xlabel('Time (s)');
figure(f2);
subplot(level / 2 + 1,1,i - level / 2)
else
figure(f2);
subplot(level / 2 + 1,1,i - level / 2)
end
x = ['D',num2str(i)];
plot(t / fs,mra(i,:));
title(x);
end
xlabel('Time (s)');
set(gcf,'Position', [0, 0, 2000, 2000]);

figure(3);
subplot(3,1,1);
plot(t / fs,audioData');
title('Recorded Audio Signal');
xlabel('Time (s)');
ylabel('Amplitude');
subplot(3,1,2);
plot(t / fs,sum(mra));
title('Reconstructed Audio Signal');
xlabel('Time (s)');
ylabel('Amplitude');
subplot(3,1,3);
plot(t / fs,err);
title('Reconstruction Error');
xlabel('Time (s)');
ylabel('Amplitude');
Output: -
The following figures were produced (images not reproduced here):
1. Time-series representation of the recorded audio signal.
2. Scalogram of the recorded audio signal.
3. STFT of the recorded audio signal.
4. Original signal and resolutions 1-5.
5. Resolutions 6-11.
6. Comparison between the original and reconstructed signal, and the reconstruction error.
Conclusions: -
1. The Short Time Fourier Transform provides an inferior time-frequency resolution in comparison to the wavelet transform: its spectral amplitudes appear fairly constant across frequency and span the entire frequency range, with a roughly uniform concentration over the frequency scale.
2. Further, in regions of the read audio file where no speech is present, the STFT still shows some spectral amplitude that does not appear in the scalogram computed using the wavelet transform, demonstrating the higher accuracy of the wavelet-based time-frequency representation.
3. A seven-level decomposition yields a component that is similar in appearance to the original audio signal and retains its characteristic non-stationarity; further levels of decomposition disintegrate the signal into near-fundamental-frequency components that are almost stationary.
4. The reconstruction error obtained with the 10-level multi-resolution analysis is of the order of 10^-13.
Result: -
The scalogram of the audio signal was computed using the Discrete Wavelet Transform
and its multi-resolution analysis was conducted, as evidenced by the observations and
conclusions drawn above.
6. Experiment 6

Aim: - To
i. Obtain MFCC data for an audio signal
ii. Perform frequency Domain Voice Activity Detection and Cepstral feature
extraction
Software Used: - MATLAB R2020a
Theory: -
The extraction of Mel Frequency Cepstral Coefficients is driven by human speech
perception and speech production. The coefficients so extracted represent the
information originating from the vocal tract filter, separated from the information content
of the glottal source. Further, the different coefficients tend to be largely uncorrelated with one another. The procedure for MFCC computation is as follows:
a. Calculation of the frequency spectrum and application of Mel binning
b. Application of the inverse DFT to the logarithm of the mel-warped spectrum to produce the cepstrum
c. Assembly of the 39-dimensional MFCC feature vector, consisting of the first 12 significant cepstral coefficients plus the frame energy (the sum of the power of the frame samples), 13 delta coefficients and 13 double-delta coefficients
The features may also be extracted in the frequency domain itself, by first transforming the audio signal into the frequency domain (as done in the second program below). The MFCC extraction procedure outlined above is depicted in the figure below.

Figure 1: MFCC Extraction
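As a rough illustration of steps (a) and (b) above, the following sketch computes MFCC-like coefficients for a single hypothetical frame. It is a sketch of the idea, not the internals of cepstralFeatureExtractor, and it assumes the Audio Toolbox function designAuditoryFilterBank is available:

fs = 16000;
frame = randn(400,1).*hamming(400);                                % one windowed 25 ms frame (placeholder audio)
spec = abs(fft(frame,512)).^2;                                     % power spectrum
fb = designAuditoryFilterBank(fs,'FFTLength',512,'NumBands',26);   % mel filter bank (26 bands x 257 bins)
melSpec = fb*spec(1:257);                                          % mel binning of the one-sided spectrum
c = dct(log(melSpec + eps));                                       % "inverse DFT" (DCT) of the log mel spectrum
mfcc13 = c(1:13);                                                  % first 13 cepstral coefficients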

The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. cepFeatures = cepstralFeatureExtractor('SampleRate',fs): creates a cepstralFeatureExtractor object for extracting cepstral features from an audio segment sampled at fs samples per second.
3. afr = dsp.AudioFileReader: returns an audio file reader System object that reads audio from an audio file.
4. asyncBuff = dsp.AsyncBuffer: returns an async buffer System object, used to write samples to and read samples from a first-in, first-out (FIFO) buffer.
5. ss = dsp.SignalSink: returns a signal sink that logs 2-D input data in the object.
6. VAD = voiceActivityDetector('InputDomain','Frequency'): creates a voice activity detector System object, VAD, that accepts frequency-domain input.
Program: -
MFCC Extraction

clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);

duration = round(0.04 * fs); % 40 ms audio segment


audioSegment = audioData(40000:40000+duration-1);
cepFeatures = cepstralFeatureExtractor('SampleRate',fs);

[coeffs,delta,deltaDelta] = cepFeatures(audioSegment);
[filterbank, freq] = getFilters(cepFeatures);

audioSegmentTwo = audioData(58200:58200+duration-1); % Number of cepstral coefficients determined by NumCoeffs
[coeffsTwo,deltaTwo,deltaDeltaTwo] = cepFeatures(audioSegmentTwo); % Subtracting 2 deltas gives deltadelta

audioSegmentThree = audioData(20000:20000+duration-1); % Number of cepstral coefficients determined by NumCoeffs
[coeffsThree,deltaThree,deltaDeltaThree] = cepFeatures(audioSegmentThree); % Subtracting 2 deltas gives deltadelta

subplot(3,1,1);
plot(deltaTwo);
title('DeltaTwo');
subplot(3,1,2);
plot(deltaThree);
title('DeltaThree');
subplot(3,1,3);
plot(deltaDeltaThree);
title('DeltaDeltaThree');

Voice Activity Detection and Cepstral Feature Extraction in Frequency Domain

clc;
clear all;
close all;

file_name = 'Counting-16-44p1-mono-15secs.wav';
fileReader = dsp.AudioFileReader(file_name); % Audio file reader
fs = fileReader.SampleRate;

samplesPerFrame = ceil(0.03 * fs); % 30 ms frames with 10 ms hop, i.e. 20 ms overlap
samplesPerHop = ceil(0.01 * fs);
samplesPerOverlap = samplesPerFrame - samplesPerHop;

fileReader.SamplesPerFrame = samplesPerHop;
buffer = dsp.AsyncBuffer; % Asynchronous buffer

VAD = voiceActivityDetector('InputDomain','Frequency'); % VAD object
cepFeatures = cepstralFeatureExtractor('InputDomain','Frequency', ...
    'SampleRate',fs,'LogEnergy','Replace'); % Cepstral feature extractor object
sink = dsp.SignalSink; % Sink to buffer

threshold = 0.5;
nanVector = nan(1,13);
while ~isDone(fileReader)
audioIn = fileReader();
write(buffer,audioIn); % Reading each hop

overlappedAudio = read(buffer,samplesPerFrame,samplesPerOverlap); % Read a frame with the stipulated overlap length
X = fft(overlappedAudio,2048); % Conversion into frequency domain

probabilityOfSpeech = VAD(X); % Probability of existence of speech


if probabilityOfSpeech > threshold
[xFeatures,delta,deltadelta] = cepFeatures(X); % Extract cepstral features if speech is present
sink(xFeatures')
else
sink(nanVector) % Store Nan otherwise
end
end

timeVector = linspace(0,15,size(sink.Buffer,1));
figure(1);
plot(timeVector,sink.Buffer)
title('Cepstral Coefficients');
xlabel('Time (s)')
ylabel('MFCC Amplitude')
legend('Log-Energy','c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12')

Output: -
The following figures were produced (images not reproduced here):
1. MFCC coefficients, delta and double-delta features.
2. Delta and double-delta features of the cepstrum.
3. Cepstral coefficients extracted from the frequency-domain audio signal.
Conclusions: -
1. Delta and Double Delta features of the initial audio segment are always zero.
2. Double delta feature is the difference of the delta features of the previous 2 audio
segments.
3. In the cepstral coefficients extracted in the frequency domain, certain coefficients stand out with respect to the other coefficients in the strength of their response.
4. Therefore, every feature can be uniquely isolated through appropriate filtering
schemes.
Result: -
The MFCC data for an audio signal was obtained and frequency domain Voice Activity
Detection and subsequent Cepstral feature extraction was undertaken, as evidenced by
the observation and conclusions drawn above.
7. Experiment 7

Aim: - To calculate different spectral descriptors of an audio signal


Software Used: - MATLAB R2020a
Theory: -
Spectral descriptors are used to characterize the nature and shape of an audio segment.
They are widely used in speaker identification and recognition, acoustic scene
recognition, instrument identification, music genre classification, mood recognition and
voice activity detection.
The following spectral descriptors are computed for the audio signal:
a. Spectral Centroid - The spectral centroid represents the "center of gravity" of the spectrum and is used as an indication of energy localization (a manual computation sketch follows this list). It is expressed as

\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k s_k}{\sum_{k=b_1}^{b_2} s_k}

where f_k is the frequency corresponding to bin k, s_k is the spectral value at bin k, and b_1 and b_2 are the band edges, in bins.

b. Spectral Spread - It represents the "instantaneous bandwidth" of the spectrum and is used as an indication of the dominance of a tone. It is given by

\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^2 s_k}{\sum_{k=b_1}^{b_2} s_k}}

where \mu_1 is the spectral centroid.

c. Spectral Skewness - The spectral skewness assesses the symmetry around the centroid. In phonetics, it is often referred to as spectral tilt and is used with other spectral moments to distinguish the place of articulation. For harmonic signals, it indicates the relative strength of higher and lower harmonics. It is given by

\mu_3 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^3 s_k}{\mu_2^3 \sum_{k=b_1}^{b_2} s_k}

where \mu_1 is the spectral centroid and \mu_2 is the spectral spread.

d. Spectral Kurtosis - The spectral kurtosis measures the flatness, or non-Gaussianity, of the spectrum around its centroid; conversely, it can be used to measure the peakiness of the spectrum. It is computed as

\mu_4 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^4 s_k}{\mu_2^4 \sum_{k=b_1}^{b_2} s_k}

e. Spectral Entropy - It has been used successfully in voiced/unvoiced decisions for automatic speech recognition, as unvoiced regions tend to have higher entropy than voiced regions due to their greater randomness. It is expressed as

entropy = \frac{-\sum_{k=b_1}^{b_2} s_k \log(s_k)}{\log(b_2 - b_1)}

f. Spectral Flatness - It is an indication of the peakiness of the spectrum. A higher spectral flatness indicates noise, while a lower spectral flatness indicates tonality. It is given as

flatness = \frac{\left( \prod_{k=b_1}^{b_2} s_k \right)^{\frac{1}{b_2 - b_1}}}{\frac{1}{b_2 - b_1} \sum_{k=b_1}^{b_2} s_k}

g. Spectral Slope - The spectral slope is directly related to the resonant characteristics of the vocal folds and has also been applied to speaker identification. It is a socially important aspect of timbre, and slope differences can be discriminated early in childhood development. The spectral slope is most pronounced when the energy in the lower formants is much greater than the energy in the higher formants. It is given by

slope = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_f)(s_k - \mu_s)}{\sum_{k=b_1}^{b_2} (f_k - \mu_f)^2}

where \mu_f is the mean frequency and \mu_s is the mean spectral value.

h. Spectral Decrease - Along with the slope, it is used in the analysis of music, particularly in instrument recognition. It is given by

decrease = \frac{\sum_{k=b_1+1}^{b_2} \frac{s_k - s_{b_1}}{k - 1}}{\sum_{k=b_1+1}^{b_2} s_k}

i. Spectral Roll-off Point - It measures the bandwidth of the audio signal by finding the frequency bin below which a given fraction of the spectral energy is concentrated. It is primarily used in detection and classification tasks on different types of acoustic signals. It is expressed as

\mathrm{Rolloff\ Point} = i \ \text{such that} \ \sum_{k=b_1}^{i} |s_k| = \kappa \sum_{k=b_1}^{b_2} s_k

where \kappa is the specified energy threshold, usually 95% or 85%.
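A manual computation sketch for the centroid in (a), compared against spectralCentroid operating on the same one-sided spectrum (hypothetical chirp test signal; the two values are expected to agree up to the function's default band edges):

fs = 8000;
x = chirp((0:8191)'/fs,200,1,1500).*hann(8192);   % test signal: 200 Hz to 1.5 kHz chirp
s = abs(fft(x));
s = s(1:4097);                                    % one-sided magnitude spectrum
f = (0:4096)'*fs/8192;                            % bin frequencies in Hz
mu1 = sum(f.*s)/sum(s)                            % centroid from the formula in (a)
centroid = spectralCentroid(s,f)                  % same quantity from the toolbox function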

The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. centroid = spectralCentroid(x,f): returns the spectral centroid of the signal x over time.
3. spread = spectralSpread(x,f): returns the spectral spread of the signal x over time.
4. skewness = spectralSkewness(x,f): returns the spectral skewness of the signal x over time.
5. kurtosis = spectralKurtosis(x,f): returns the spectral kurtosis of the signal x over time.
6. entropy = spectralEntropy(x,f): returns the spectral entropy of the signal x over time.
7. flatness = spectralFlatness(x,f): returns the spectral flatness of the signal x over time.
8. slope = spectralSlope(x,f): returns the spectral slope of the signal x over time.
9. decrease = spectralDecrease(x,f): returns the spectral decrease of the signal x over time.
10. rolloffPoint = spectralRolloffPoint(x,f): returns the spectral roll-off point of the signal x over time.

In each case the interpretation of x depends on the shape of f: a scalar f is treated as the sample rate of a time-domain signal, while a vector f is treated as the frequencies of a frequency-domain input.

Program: -
clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = sum(audioData,2)/size(audioData,2); % Downmix to mono

centroid = spectralCentroid(audioData,fs); % Spectral Centroid

figure(1);
subplot(2,1,1)
t_ca = linspace(0,size(audioData,1)/fs,size(audioData,1));
plot(t_ca,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');
subplot(2,1,2)
t_cc = linspace(0,size(audioData,1)/fs,size(centroid,1));
plot(t_cc,centroid)
xlabel('Time (s)')
ylabel('Centroid (Hz)')
title('Centroid of the Recorded Audio Signal');

spread = spectralSpread(audioData,fs); % Spectral Spread

figure(2);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),2048,fs,'yaxis')
title('Spectrogram of Audio Signal');
subplot(2,1,2)
t_ss = linspace(0,size(audioData,1)/fs,size(spread,1));
plot(t_ss,spread)
xlabel('Time (s)')
ylabel('Spread')
title('Spectral Spread of Audio Signal');

skewness = spectralSkewness(audioData,fs); % Spectral Skewness


t_s = linspace(0,size(audioData,1)/fs,size(skewness,1))/60;

figure(3);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power')
view([-58 33])
title('Recorded Audio Signal');

subplot(2,1,2)
plot(t_s,skewness)
xlabel('Time (minutes)')
ylabel('Skewness')
title('Skewness of Audio Signal');

kurtosis = spectralKurtosis(audioData,fs); % Spectral Kurtosis

t_k = linspace(0,size(audioData,1)/fs,size(audioData,1));

figure(4);
subplot(2,1,1)
plot(t_k,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');

t_k = linspace(0,size(audioData,1)/fs,size(kurtosis,1)); % Spectral Kurtosis


subplot(2,1,2)
plot(t_k,kurtosis)
xlabel('Time (s)')
ylabel('Kurtosis')
title('Kurtosis of Audio Signal');

entropy = spectralEntropy(audioData,fs); % Spectral Entropy

t_e = linspace(0,size(audioData,1)/fs,size(audioData,1));
figure(5);
subplot(2,1,1)
plot(t_e,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');

t_e = linspace(0,size(audioData,1)/fs,size(entropy,1));
subplot(2,1,2)
plot(t_e,entropy)
xlabel('Time (s)')
ylabel('Entropy')
title('Entropy of Audio Signal');

flatness = spectralFlatness(audioData,fs); % Spectral Flatness

figure(6);
subplot(2,1,1)
t_f = linspace(0,size(audioData,1)/fs,size(audioData,1));
plot(t_f,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');

subplot(2,1,2)
t_f = linspace(0,size(audioData,1)/fs,size(flatness,1));
plot(t_f,flatness)
ylabel('Flatness')
xlabel('Time (s)')
title('Flatness of Audio Signal');

specslope = spectralSlope(audioData,fs); % Spectral Slope


t_ss = linspace(0,size(audioData,1)/fs,size(specslope,1));

figure(7);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power');
title('Spectrogram of Audio Signal');
subplot(2,1,2)
plot(t_ss,specslope)
title('Spectral Slope')
ylabel('Slope')
xlabel('Time (s)')

spectral_decrease = spectralDecrease(audioData,fs); % Spectral Decrease


t_d = linspace(0,size(audioData,1)/fs,size(spectral_decrease,1));
figure(8);
plot(t_d,spectral_decrease)
title('Spectral Decrease')
ylabel('Decrease')
xlabel('Time (s)')

spectral_rolloff = spectralRolloffPoint(audioData,fs); % Spectral Rolloff Point
t_sr = linspace(0,size(audioData,1)/fs,size(spectral_rolloff,1));
figure(9);
plot(t_sr,spectral_rolloff)
title('Spectral Rolloff Point')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')

Output: -
The following figures were produced (images not reproduced here):
1. Spectral centroid.
2. Spectral spread.
3. Skewness.
4. Kurtosis.
5. Entropy.
6. Flatness.
7. Spectral slope.
8. Spectral decrease.
9. Spectral roll-off point.
Conclusions: -
1. The centroid deviates towards the portions of the signal with a higher amplitude scale.
2. The spectral spread increases in regions where the bandwidth is larger, owing to the tones being spread farther apart.
3. The skewness represents the tilt of the spectrum about the centroid.
4. Kurtosis is lower where the audio signal is nearly uniform.
5. Regions of voiced speech have lower entropy than the unvoiced regions.
6. Higher spectral flatness occurs in segments with noise/unvoiced regions; in voiced regions the flatness is low.
7. The spectral slope accurately displays the amount of decrement in the spectrum of the audio signal.
8. The spectral decrease models the amount of decrease in the spectrum.
9. The spectral roll-off point is able to distinguish between voiced and unvoiced regions, and locates the frequency bin below which a given percentage of the spectral energy falls, thus measuring the associated bandwidth.
Result: -
The different spectral descriptors of an audio signal were calculated, as evidenced by the
observation and conclusions drawn above.
8. Experiment 8

Aim: - Recognition of emotion in a speech signal


Software Used: - MATLAB R2020a
Theory: -
A simple speech emotion recognition (SER) system is implemented using a BiLSTM
network which was trained on a small German-language database, containing 535
utterances spoken by 10 actors intended to convey one of the following text independent
emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, or neutral. A pre-
trained network is used for the categorization of the emotions, wherein the sample rate of
the data set is considered. The features were chosen using the sequential feature
selection. Subsequently, the feature sequences are fed into the network for prediction
and mean prediction is calculated. Further, the probability distribution of the chosen
emotions is also plotted.
Network training performed using the 10-fold yielded an average of 60% cross validation
accuracy because of insufficient training data, which leads to both overfitting and under
fitting. This can be enhanced by increasing the size of the data set, which is done keeping
in mind the tradeoff between processing time and accuracy improvement.
Deployment training is done using all available speakers in the dataset. While system
validation training, in order to provide an accurate assessment of the model, training and
validation is undertaken using leave-one-speaker-out (LOSO) k-fold cross validation. In
this method, we train using k−1 speakers and then validate on the left-out speaker. The
process is repeated k times, with the final validation accuracy being the average of the k
folds.
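A minimal sketch of the LOSO partitioning used by the helper function in the program below (it assumes the speaker labels are stored in ads.Labels.Speaker, as they are here):

speakers = categories(ads.Labels.Speaker);
for k = 1:numel(speakers)
    idxVal = ads.Labels.Speaker == speakers(k);   % hold out one speaker
    adsValidation = subset(ads,idxVal);
    adsTraining = subset(ads,~idxVal);            % train on the remaining k-1 speakers
    % ... extract features from adsTraining, train the network, validate on adsValidation ...
end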
The following functions are used in the experiment to achieve the objective:
1. ADS = audioDatastore(location): creates a datastore ADS based on an audio file or a collection of audio files in location; it is used to manage a collection of audio files where each individual file may fit in memory but the collection as a whole may not.
2. aFE = audioFeatureExtractor(): creates an audio feature extractor with default property values.
3. aug = audioDataAugmenter(): creates an audio data augmenter object with default property values. It is used to enlarge an audio dataset using audio-specific augmentation techniques such as pitch shifting, time-scale modification, time shifting, noise addition and volume control, and to build cascaded or parallel augmentation pipelines that apply multiple algorithms deterministically or probabilistically.
4. layer = sequenceInputLayer(inputSize): creates a sequence input layer, used to input sequence data to the network, and sets the InputSize property.
5. layer = bilstmLayer(numHiddenUnits): creates a bidirectional Long Short-Term Memory (BiLSTM) layer and sets the NumHiddenUnits property. A BiLSTM layer learns bidirectional long-term dependencies between time steps of time-series or sequence data; these dependencies are useful when the network should learn from the complete time series at each time step.
6. layer = dropoutLayer: creates a dropout layer, which randomly sets input elements to zero with a given probability.
7. net = trainNetwork(sequences,Y,layers,options): trains a network for sequence classification and regression problems (for example, an LSTM or BiLSTM network), where sequences contains the sequence or time-series predictors and Y contains the responses. For classification problems, Y is a categorical vector or a cell array of categorical sequences; for regression problems, Y is a matrix of targets or a cell array of numeric sequences.

Program: -
clc;
clear all;
close all;

url = "http://emodb.bilderbar.info/download/download.zip";
downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,"Emo-DB");

if ~exist(datasetFolder,'dir')
disp('Downloading Emo-DB (40.5 MB)...')
unzip(url,datasetFolder)
end

ads = audioDatastore(fullfile(datasetFolder,"wav"));

filepaths = ads.Files;
emotionCodes = cellfun(@(x)x(end-5),filepaths,'UniformOutput',false);
emotions = replace(emotionCodes,{'W','L','E','A','F','T','N'}, ...
{'Anger','Boredom','Disgust','Anxiety/Fear','Happiness','Sadness','Neutral'});

speakerCodes = cellfun(@(x)x(end-10:end-9),filepaths,'UniformOutput',false);
labelTable = cell2table([speakerCodes,emotions],'VariableNames',{'Speaker','Emotion'});
labelTable.Emotion = categorical(labelTable.Emotion);
labelTable.Speaker = categorical(labelTable.Speaker);
summary(labelTable)

ads.Labels = labelTable;

load('network_Audio_SER.mat','net','afe','normalizers');
fs = afe.SampleRate;

speaker = categorical("03");    % Speaker code present in the dataset (selected via a dropdown in the original live script)
emotion = categorical("Anger"); % One of the emotion labels defined above (selected via a dropdown in the original live script)

adsSubset = subset(ads,ads.Labels.Speaker==speaker & ads.Labels.Emotion == emotion);

audio = read(adsSubset);
sound(audio,fs)

features = (extract(afe,audio))';

featuresNormalized = (features - normalizers.Mean)./normalizers.StandardDeviation;


numOverlap = 10;
featureSequences = HelperFeatureVector2Sequence(featuresNormalized,20,numOverlap);

YPred = double(predict(net,featureSequences));
average = 'median'; % 'mean', 'median' or 'mode' (the plotted output uses the median)
switch average
case 'mean'
probs = mean(YPred,1);
case 'median'
probs = median(YPred,1);
case 'mode'
probs = mode(YPred,1);
end

pie(probs./sum(probs),string(net.Layers(end).Classes))

numAugmentations = 50;
augmenter = audioDataAugmenter('NumAugmentations',numAugmentations, ...
'TimeStretchProbability',0, ...
'VolumeControlProbability',0, ...
...
'PitchShiftProbability',0.5, ...
...
'TimeShiftProbability',1, ...
'TimeShiftRange',[-0.3,0.3], ...
...
'AddNoiseProbability',1, ...
'SNRRange', [-20,40]);
currentDir = pwd;
writeDirectory = fullfile(currentDir,'augmentedData');
mkdir(writeDirectory)

N = numel(ads.Files)*numAugmentations;
myWaitBar = HelperPoolWaitbar(N,"Augmenting Dataset...");

reset(ads)

numPartitions = 18;

tic
parfor ii = 1:numPartitions
adsPart = partition(ads,numPartitions,ii);
while hasdata(adsPart)
[x,adsInfo] = read(adsPart);
data = augment(augmenter,x,fs);

[~,fn] = fileparts(adsInfo.FileName);
for i = 1:size(data,1)
augmentedAudio = data.Audio{i};
augmentedAudio = augmentedAudio/max(abs(augmentedAudio),[],'all');
augNum = num2str(i);
if numel(augNum)==1
iString = ['0',augNum];
else
iString = augNum;
end

audiowrite(fullfile(writeDirectory,sprintf('%s_aug%s.wav',fn,iString)),augmentedAudio,fs);
increment(myWaitBar)
end
end
end

delete(myWaitBar)
fprintf('Augmentation complete (%0.2f minutes).\n',toc/60)

adsAug = audioDatastore(writeDirectory);
adsAug.Labels = repelem(ads.Labels,augmenter.NumAugmentations,1);

win = hamming(round(0.03*fs),"periodic");
overlapLength = 0;

afe = audioFeatureExtractor( ...
    'Window',win, ...
    'OverlapLength',overlapLength, ...
    'SampleRate',fs, ...
    'gtcc',true, ...
    'gtccDelta',true, ...
    'mfccDelta',true, ...
    'SpectralDescriptorInput','melSpectrum', ...
    'spectralCrest',true);

adsTrain = adsAug;
tallTrain = tall(adsTrain);
featuresTallTrain = cellfun(@(x)extract(afe,x),tallTrain,"UniformOutput",false);
featuresTallTrain = cellfun(@(x)x',featuresTallTrain,"UniformOutput",false);
featuresTrain = gather(featuresTallTrain);

allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');

featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);

featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
[sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence( ...
    featuresTrain,featureVectorsPerSequence,featureVectorOverlap);

labelsTrain = repelem(adsTrain.Labels.Emotion,[sequencePerFileTrain{:}]);

emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];

dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(size(sequencesTrain{1},1))
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,"OutputMode","last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer
classificationLayer];

miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
"MiniBatchSize",miniBatchSize, ...
"InitialLearnRate",initialLearnRate, ...
"LearnRateDropPeriod",learnRateDropPeriod, ...
"LearnRateSchedule","piecewise", ...
"MaxEpochs",maxEpochs, ...
"Shuffle","every-epoch", ...
"Verbose",false, ...
"Plots","Training-Progress");

net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

saveSERSystem = true; % Checkbox in the original live script; set false to skip saving the trained system
if saveSERSystem
normalizers.Mean = M;
normalizers.StandardDeviation = S;
save('network_Audio_SER.mat','net','afe','normalizers')
end
speaker = ads.Labels.Speaker;
numFolds = numel(speaker);
summary(speaker)
[labelsTrue,labelsPred] = HelperTrainAndValidateNetwork(ads,adsAug,afe);
for ii = 1:numel(labelsTrue)
foldAcc = mean(labelsTrue{ii}==labelsPred{ii})*100;
fprintf('Fold %1.0f, Accuracy = %0.1f\n',ii,foldAcc);
end
labelsTrueMat = cat(1,labelsTrue{:});
labelsPredMat = cat(1,labelsPred{:});
figure
cm = confusionchart(labelsTrueMat,labelsPredMat);
valAccuracy = mean(labelsTrueMat==labelsPredMat)*100;
cm.Title = sprintf('Confusion Matrix for 10-Fold Cross-Validation\nAverage Accuracy = %0.1f',valAccuracy);
sortClasses(cm,categories(emptyEmotions))
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';

function [sequences,sequencePerFile] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
    if featureVectorsPerSequence <= featureVectorOverlap
        error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')
    end

if ~iscell(features)
features = {features};
end
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
sequences = {};
sequencePerFile = cell(numel(features),1);
for ii = 1:numel(features)
        sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
idx2 = 1;
for j = 1:sequencePerFile{ii}
            sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1);
idx1 = idx1 + 1;
idx2 = idx2 + hopLength;
end
end
end

function [trueLabelsCrossFold,predictedLabelsCrossFold] = HelperTrainAndValidateNetwork(varargin)
if nargin == 3
ads = varargin{1};
augads = varargin{2};
extractor = varargin{3};
elseif nargin == 2
ads = varargin{1};
augads = varargin{1};
extractor = varargin{2};
end
speaker = categories(ads.Labels.Speaker);
numFolds = numel(speaker);
emptyEmotions = (ads.Labels.Emotion);
emptyEmotions(:) = [];

trueLabelsCrossFold = {};
predictedLabelsCrossFold = {};

for i = 1:numFolds

idxTrain = augads.Labels.Speaker~=speaker(i);
augadsTrain = subset(augads,idxTrain);
augadsTrain.Labels = augadsTrain.Labels.Emotion;
tallTrain = tall(augadsTrain);
idxValidation = ads.Labels.Speaker==speaker(i);
adsValidation = subset(ads,idxValidation);
adsValidation.Labels = adsValidation.Labels.Emotion;
tallValidation = tall(adsValidation);

        tallTrain = cellfun(@(x)x/max(abs(x),[],'all'),tallTrain,"UniformOutput",false);
        tallFeaturesTrain = cellfun(@(x)extract(extractor,x),tallTrain,"UniformOutput",false);
        tallFeaturesTrain = cellfun(@(x)x',tallFeaturesTrain,"UniformOutput",false);
        [~,featuresTrain] = evalc('gather(tallFeaturesTrain)');
        tallValidation = cellfun(@(x)x/max(abs(x),[],'all'),tallValidation,"UniformOutput",false);
        tallFeaturesValidation = cellfun(@(x)extract(extractor,x),tallValidation,"UniformOutput",false);
        tallFeaturesValidation = cellfun(@(x)x',tallFeaturesValidation,"UniformOutput",false);
        [~,featuresValidation] = evalc('gather(tallFeaturesValidation)');
allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);
for ii = 1:numel(featuresTrain)
idx = find(isnan(featuresTrain{ii}));
if ~isempty(idx)
featuresTrain{ii}(idx) = 0;
end
end
        featuresValidation = cellfun(@(x)(x-M)./S,featuresValidation,'UniformOutput',false);
for ii = 1:numel(featuresValidation)
idx = find(isnan(featuresValidation{ii}));
if ~isempty(idx)
featuresValidation{ii}(idx) = 0;
end
end
featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
        [sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence( ...
            featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
        [sequencesValidation,sequencePerFileValidation] = HelperFeatureVector2Sequence( ...
            featuresValidation,featureVectorsPerSequence,featureVectorOverlap);

labelsTrain = [emptyEmotions;augadsTrain.Labels];
labelsTrain = labelsTrain(:);
labelsTrain = repelem(labelsTrain,[sequencePerFileTrain{:}]);

dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(size(sequencesTrain{1},1))
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,"OutputMode","last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer
classificationLayer];

miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
"MiniBatchSize",miniBatchSize, ...
"InitialLearnRate",initialLearnRate, ...
"LearnRateDropPeriod",learnRateDropPeriod, ...
"LearnRateSchedule","piecewise", ...
"MaxEpochs",maxEpochs, ...
"Shuffle","every-epoch", ...
"Verbose",false);

net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

predictedLabelsPerSequence = classify(net,sequencesValidation);
trueLabels = categorical(adsValidation.Labels);
predictedLabels = trueLabels;
idx1 = 1;
for ii = 1:numel(trueLabels)
            predictedLabels(ii,:) = mode(predictedLabelsPerSequence(idx1:idx1 + sequencePerFileValidation{ii} - 1,:),1);
idx1 = idx1 + sequencePerFileValidation{ii};
end
trueLabelsCrossFold{i} = trueLabels;
predictedLabelsCrossFold{i} = predictedLabels;
end
end
Output: -
The following figures were produced (images not reproduced here):
1. Pie chart of the median probability of the different emotions pertaining to the selected subject.
2. Training progress over 1212 iterations using 10-fold cross validation.
3. Confusion matrix for 10-fold cross-validation.
Conclusions: -
1. Features used herein were selected using sequential feature selection.
2. The training and validation is done using leave-one-speaker-out (LOSO) k-fold cross
validation.
3. A model trained on insufficient data may suffer from overfitting or underfitting. This is alleviated using signal augmentation undertaken via pitch shifting, time-scale modification, time shifting, noise addition, and volume control.
Result: -
Emotion recognition in speech signal was performed, as evidenced by the observation
and conclusions drawn above.
