SR_Lab File

The document outlines a series of experiments conducted using MATLAB R2020a, focusing on audio signal processing. Each experiment has specific aims, such as recording audio, detecting speech regions, and computing the Fast Fourier Transform (FFT) of audio signals, along with detailed methodologies and conclusions. The results demonstrate successful execution of audio-related functions and analysis, providing insights into speech detection and frequency analysis.


Contents

1. Experiment 1: Recording, reading and writing an audio signal
2. Experiment 2: Detecting regions of speech in an audio signal
3. Experiment 3: Fast Fourier Transform (FFT) of an audio signal
4. Experiment 4: Short Time Fourier Transform (STFT) of an audio signal
5. Experiment 5: Scalogram and multi-resolution analysis using the Discrete Wavelet Transform
6. Experiment 6: MFCC data, frequency-domain voice activity detection and cepstral feature extraction
7. Experiment 7: Spectral descriptors of an audio signal
8. Experiment 8: Recognition of emotion in a speech signal
1. Experiment 1

Aim: - To record, read and write an audio signal and execute other related functions
Software Used: - MATLAB R2020a
Theory: -
Production of speech in the human vocal system takes place in four stages, listed below in their sequence of occurrence.
1. Breathing (lungs, diaphragm, rib muscles): intake of the pulmonic air stream up to full capacity using the diaphragm.
2. Phonation (larynx, vocal cords, trachea): production of voice through vibration of the vocal cords.
3. Resonation (upper part of the larynx, pharynx, nasal cavity, oral cavity): voice amplification and modification.
4. Articulation (uvula, velum, tongue, lower lip, upper jaw): production and characterization of phonemes and accents.

The following functions are used in the experiment to achieve the objective:
1. r = audiorecorder(fs,nBits,NumChannels,ID): creates an audio recorder object for the device specified by the device identifier ID, with NumChannels channels, sampled at fs and quantized to nBits.
2. r = getaudiodata(recorder,datatype): obtains the audio data from the recorder object and converts it to the specified datatype.
3. p = play(recobj,[start stop]): plays the audio between the samples specified by start and stop.
4. audiowrite(filename,y,Fs,Name,Value): writes the matrix of audio data y into a file of the specified name; the name-value pairs include, inter alia, bits per sample, bit rate, quality, artist, title and comments.
5. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
6. player = audioplayer(y,Fs,nBits,ID): creates an audio player object with the specified parameters.
7. P = get(recorder): obtains the property values of the specified object.
8. record(obj): records data and event information for the specified object without blocking execution.
9. pause: stops execution temporarily (pause(recorder) pauses the recorder object itself).
10. recorder.resume: resumes a paused recording.
11. recorder.stop: stops recording.
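As an illustration of the audiowrite name-value pairs listed above, a minimal sketch (separate from the lab program; the file name tone_demo.flac is hypothetical, and FLAC is used here because it stores both the bits-per-sample setting and the metadata fields):

fs = 44100;
y = 0.5*sin(2*pi*440*(0:fs-1)'/fs);       % 1 s, 440 Hz test tone
audiowrite('tone_demo.flac',y,fs, ...
    'BitsPerSample',24, ...
    'Title','Test tone', ...
    'Artist','SR Lab', ...
    'Comment','audiowrite metadata example');
info = audioinfo('tone_demo.flac')        % the stored metadata appears in the returned info struct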

Program: -
clc;
clear all;
close all;

Fs = 44100;
nbits = 16;
nchannels = 2;
ID = -1;
t = 5;
file = 'audio_rec.wav';
chk_rec = false;

if(chk_rec)
recorder = audiorecorder(Fs,nbits,nchannels,ID);
disp('Start Speaking');
recordblocking(recorder,t);
disp('Stop Speaking');

audio_data = getaudiodata(recorder,'double'); % Object function 1


player = play(recorder); % Object function 2
audiowrite(file,audio_data,Fs);
else
audio_rec = audioread(file);
end

if ~chk_rec
audio_data = audioread(file);
audio_player = audioplayer(audio_data,Fs,nbits,ID);
end

t_start = 2;
t_stop = 4;
samples = [2 * Fs,4* Fs];
audio_data_res = audioread(file,samples);
duration = 1:1:Fs * t;
audio_player_res = audioplayer(audio_data_res,Fs,nbits,ID);
disp('Playing resampled audio');
pause(2);
play(audio_player_res);

figure(1);
subplot(2,1,1);
plot(duration / Fs,audio_data);
xlabel('Time');
ylabel('Amplitude');
title('Original Recorded Audio','FontSize',14);
subplot(2,1,2);
plot(duration(2 * Fs:4 * Fs) / Fs,audio_data_res);
xlabel('Time');
ylabel('Amplitude');
title('Resampled Audio Data','FontSize',14);

% Property values of audio recorder object

if chk_rec
properties = get(recorder); % Object function 3
else
properties = get(audio_player);
end

% Change sampling frequency

fs = 8000;
if ~chk_rec
audio_player_resampled = audioplayer(audio_data,fs,nbits,ID);
end

% Change number of bits

nBits = 8;
if ~chk_rec
audio_player_reformed = audioplayer(audio_data,Fs,nBits,ID);
end

% Pause/resume

recorder_pr = audiorecorder(Fs,nbits,nchannels,ID);
disp('Start Speaking');
record(recorder_pr);            % Non-blocking recording
pause(1);                       % Let 1 s of audio accumulate
recorder_pr.pause;              % Pause the recorder object itself
recorder_pr.resume;             % Object function 4 - resume recording
pause(1);
disp('Stop Speaking');
recorder_pr.stop;               % Object function 5
play(recorder_pr);
Output: -

Conclusions: -
1. recordblocking function does not relinquish control to the main program until the
recording is completed
2. To sub-sample the audio signal, the time period is multiplied by the sampling
frequency to isolate the samples in the desired interval
3. While plotting the sub-sampled signals, the independent (time) axis variable is divided by
the sampling frequency so as to identify the interval of sub-sampling
4. Resampling the audio data at a lower sampling frequency severely affects the audio
characteristics of the data
5. Change in the number of bits has no perceptible effect on the recorded audio data
6. For pausing, recording and stopping functionalities, the record function is used instead
of recordblocking.

Result: -
Audio signal was recorded and related functions executed thereon, as evidenced by the
observations and conclusions drawn therefrom.
2. Experiment 2

Aim: - To detect regions of speech in an audio signal


Software Used: - MATLAB R2020a
Theory: -
Speech recognition involves the following components
a. Analog to Digital Conversion – Conversion of analog acoustic signal into digital format
b. Acoustic/Language Modelling – Modelling of the speech as per the statistical
template/predictive language models
c. Speech Engine – Deciphering the contents of the speech signal, phoneme by
phoneme
d. Display – Display the inferred speech
e. Feedback – To train the speech engine as per the speech structure and phoneme
constitution of the given language
The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. idx = detectSpeech(audioIn,fs): returns indices corresponding to the presence of speech in the given audio signal audioIn, sampled at the rate fs.
3. idx = detectSpeech(audioIn,fs,Name,Value): detects speech under the specified constraints; the name-value pairs include the window (type and attributes), overlap length, merge distance and thresholds.
Program: -
clc;
clear all;
close all;

file_name = 'audio_rec.wav';
% Detecting speech regions
[audioIn,fs] = audioread(file_name);
figure(1);
detectSpeech(audioIn(:,1),fs); % Plot detected speech over the entire length of the audio
xlabel('Time');
ylabel('Amplitude');
title('Detected Speech in the Audio Signal');
% Thresholding using Windowing, Overlap Length and Merge Distance

window_duration = 0.074; % Range [2,size(audioIn,1)]


window_samples = round(window_duration * fs);
window_type = 'hanning'; % Chebyshev, Hanning, Bartlett, Blackman, Gaussian, Tukey, Kaiser, Taylor, Bohman
f = str2func(window_type); % Returns function handle
window = f(window_samples,'periodic');

overlap_percent = 10;
overlapping_samples = round(window_samples * overlap_percent / 100); % Number of overlapping samples in adjacent windows

merge_duration = 0.1;
merge_distance = round(merge_duration * fs); % Number of samples to be merged on occurrence of a positive detection of speech

figure(2);
detectSpeech(audioIn(:,1),fs,'Window',window,'OverlapLength',overlapping_samples, ...
    'MergeDistance',merge_distance);
xlabel('Time');
ylabel('Amplitude');
title('Speech detection using custom window');

% Reuse of decision thresholds on the segments of the same signal

split_position = 0.3; % Specifying the splitting ratio


t = numel(audioIn(:,1)) / fs;
split_loc = split_position * t;
first_part = audioIn(1:round(split_loc * fs),1);
second_part = audioIn(round(split_loc * fs + 1):end,1);
[idx,thresholds] = detectSpeech(first_part,fs); % Thresholds from the first part

figure(3);
detectSpeech(second_part,fs,'Thresholds',thresholds);
xlabel('Time');
ylabel('Amplitude');
title('Speech detection using the predetected thresholds');
Output: -
The following figures were produced (images not reproduced here):
1. Detected boundaries in the audio signal without any custom window.
2. Detected boundaries in the audio signal using a Hanning window of 0.074 s length, 10% overlap and 0.1 s merge duration.
3. Detected boundaries in the speech signal using thresholds obtained from the initial 30% segment of the signal.
Conclusions: -
1. Different window functions viz. Chebyshev, Hanning, Bartlett, Blackman, Gaussian,
Tukey, Kaiser, Taylor, Bohman may be used with relevant attributes for the detection
of the speech segments in the audio signal.
2. Window length, overlap length and merge distance are specified in terms of the number of samples.
3. The audio signal is split in the desired ratio and the thresholds detected in one part
are used to identify the speech segments in the other part.
Result: -
Regions of speech were detected in the recorded audio signal, as evidenced by the
observations and conclusions drawn above.
3. Experiment 3

Aim: - To compute the Fast Fourier Transform (FFT) of an audio signal


Software Used: - MATLAB R2020a
Theory: -
The continuous-time Fourier Transform of the signal x(t) is given by

X(f) = \int_{-\infty}^{\infty} x(t) \, e^{-j 2 \pi f t} \, dt

Its inverse is expressed as

x(t) = \int_{-\infty}^{\infty} X(f) \, e^{j 2 \pi f t} \, df

The Discrete Fourier Transform (DFT) is computed efficiently using the Fast Fourier Transform (FFT), which employs a divide-and-conquer (cascading) approach to reduce the number of additions and multiplications from O(N^2) to O(N log N). The DFT of an N-length discrete-time sequence x[n] is given by

X_k = \sum_{n=0}^{N-1} x[n] \, e^{-j 2 \pi k n / N}

Its inverse is given as

x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{j 2 \pi k n / N}

The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. y = fft(X,n,dim): returns the n-point Fast Fourier Transform along the specified dimension dim.
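As a quick check of the DFT definition above (not part of the lab program), the output of fft can be compared against a direct evaluation of the summation on a short test vector:

x = [1 2 3 4];                       % short test sequence
N = numel(x);
n = 0:N-1;
k = (0:N-1)';
Xdft = exp(-1j*2*pi*k*n/N) * x(:);   % direct evaluation of the DFT summation above
Xfft = fft(x(:));                    % fft computes the same coefficients
max(abs(Xdft - Xfft))                % difference is at numerical precision (~1e-15)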
Program: -
clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
t = 0.01:0.01:10;
f1 = 10;
f2 = 20;
f3 = 30;
split_p1 = 0.2;
split_p2 = 0.5;
split_p3 = 1 - (split_p1 + split_p2);
s1 = sin(2 * pi * f1 * t) + sin(2 * pi * f2 * t) + sin(2 * pi * f3 * t); % Multitone signal
s2 = cat(2,sin(2 * pi * f1 * t(1:split_p1 * length(t))), ...
    sin(2 * pi * f2 * t(split_p1 * length(t) + 1:split_p1 * length(t) + split_p2 * length(t))), ...
    sin(2 * pi * f3 * t((split_p1 + split_p2) * length(t) + 1:end))); % Synthetic non-stationary signal
N = randn(size(t));

Y_s1 = fft(s1 + N);


Y1 = abs(Y_s1/length(t));
Y1 = Y1(1:length(t) / 2 + 1);
Y1(2:end - 1) = 2 * Y1(2:end - 1);

figure(1);
subplot(4,1,1);
plot(t,s1);
xlabel('Time');
ylabel('Amplitude');
title('Multitone signal');
subplot(4,1,2);
plot(t,s1 + N);
xlabel('Time');
ylabel('Amplitude');
title('Multitone signal with AWGN');
subplot(4,1,3);
plot((1:length(t) / 2 + 1) / 10,Y1);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of noisy multitone signal');
subplot(4,1,4);
plot(t,ifft(Y_s1));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of noisy multitone signal');

Y_s2 = fft(s2 + N);


Y2 = abs(Y_s2/length(t));
Y2 = Y2(1:length(t) / 2 + 1);
Y2(2:end - 1) = 2 * Y2(2:end - 1);

figure(2);
subplot(4,1,1);
plot(t,s2);
xlabel('Time');
ylabel('Amplitude');
title('Synthetic Non-stationary signal');
subplot(4,1,2);
plot(t,s2 + N);
xlabel('Time');
ylabel('Amplitude');
title('Synthetic Non-stationary signal with AWGN');
subplot(4,1,3);
plot((1:length(t) / 2 + 1) / 10,Y2);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of noisy Synthetic Non-stationary signal');
subplot(4,1,4);
plot(t,ifft(Y_s2));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of noisy Synthetic Non-stationary signal');

d1 = 1:split_p1 * length(t);
d2 = 1:(split_p1 + split_p2) * length(t);
d3 = (split_p1 + split_p2) * length(t) + 1:length(t);
Y_s2_p1 = window_fft(s2,d1,N);
Y_s2_p2 = window_fft(s2,d2,N);
Y_s2_p3 = window_fft(s2,d3,N);
figure(3);
subplot(3,2,1);
plot(t(1:split_p1 * length(t)),s2(1:split_p1 * length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 1');
subplot(3,2,2);
plot(((1:length(d1) / 2 + 1) / 2 - 0.5) / (split_p1 * 10 / 2),Y_s2_p1);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 1');
subplot(3,2,3);
plot(t(1:(split_p1 + split_p2) * length(t)),s2(1:(split_p1 + split_p2) * length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 2');
subplot(3,2,4);
plot(((1:length(d2) / 2 + 1) / 2 - 0.5) / ((split_p1 + split_p2) * 10 / 2),Y_s2_p2);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 2');
subplot(3,2,5);
plot(t((split_p1 + split_p2) * length(t) + 1:length(t)),s2((split_p1 + split_p2) * length(t) + 1:length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 3');
subplot(3,2,6);
plot(((1:length(d3) / 2 + 1) / 2 - 0.5) / (split_p3 * 10 / 2),Y_s2_p3);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 3');

Y_s3 = fft(audioData(:,1));
Y3 = abs(Y_s3/length(audioData));
Y3 = Y3(1:length(audioData) / 2 + 1);
Y3(2:end - 1) = 2 * Y3(2:end - 1);

figure(4);
subplot(3,1,1);
plot((1:length(audioData))/fs,audioData);
xlabel('Time');
ylabel('Amplitude');
title('Recorded Audio signal');
subplot(3,1,2);
plot((0:length(audioData) / 2) * fs / length(audioData),Y3); % Frequency axis in Hz for the audio sample rate
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of audio signal');
subplot(3,1,3);
plot((1:length(audioData))/fs,ifft(Y_s3));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of audio signal');
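The helper window_fft used above is not listed in this document. A plausible implementation, assuming it returns the single-sided amplitude spectrum of the noisy signal restricted to the sample indices d (saved as window_fft.m on the MATLAB path), is:

function Y = window_fft(s,d,N)
% Plausible reconstruction of the helper (assumption): single-sided amplitude
% spectrum of the noisy signal s + N restricted to the sample indices in d.
x = s(d) + N(d);                 % noisy segment
L = length(d);
Y = abs(fft(x)/L);               % normalized two-sided spectrum
Y = Y(1:floor(L/2) + 1);         % keep the single-sided half
Y(2:end-1) = 2*Y(2:end-1);       % account for the energy in the discarded half
end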
Output: -
The following figures were produced (images not reproduced here):
1. FFT of the multitone signal consisting of harmonics of 10, 20 and 30 Hz, and its 1000-point IFFT.
2. FFT of the noisy non-stationary signal consisting of harmonics of 10, 20 and 30 Hz for 20%, 50% and the remaining 30% of the signal duration, and its 1000-point IFFT.
3. Segmentation of the non-stationary signal consisting of harmonics of 10, 20 and 30 Hz for 20%, 50% and the remaining 30% of the signal duration into the corresponding duration windows, and computation of the FFT for each window.
4. FFT of the recorded audio signal and its 220500-point IFFT.
Conclusions: -
1. The Fast Fourier Transform (FFT) is able to distinctly identify the frequencies present in a multi-tone signal in the presence of Additive White Gaussian Noise (AWGN) of unit variance.
2. The FFT is also able to detect the frequencies present in a non-stationary signal in the presence of noise, though the detection region is widened with respect to the frequency identified, revealing a shortcoming of the FFT in the spectral analysis of non-stationary signals.
3. The FFT is able to identify the frequencies present in the different segments of the audio signal, irrespective of the remaining segments, though the region of detection is widened with respect to the frequency detected, as before.
4. For the non-stationary audio signal, the FFT reveals the presence of frequencies in the lower range of the spectrum, typically less than 1 kHz.
5. For the non-stationary audio signal, even a 220500-point IFFT is not able to reconstruct the original speech signal accurately, i.e. it is unable to model the sharp discontinuities and fast fluctuations in the audio signal, thus highlighting another shortcoming of the FFT.
Result: -
The Fast Fourier Transform (FFT) of the recorded audio signal was computed and different operations performed thereon, as evidenced by the observations and conclusions drawn above.
4. Experiment 4

Aim: - To compute the Short Time Fourier Transform (STFT) of an audio signal
Software Used: - MATLAB R2020a
Theory: -
Short Time Fourier Transform involves the segmentation of the signal into narrow
intervals and computation of Fourier Transform in each such interval. This is the proposed
methodology for obtaining the time-frequency information by windowing the incoming
signal.

\mathrm{STFT}_f(t', u) = \int [f(t) \, W(t - t')] \, e^{-j 2 \pi u t} \, dt

where f(t) is the incoming signal, W(t - t') is the window function centered at t', and the frequency parameter is denoted by the variable u.
A wide window provides good frequency resolution but poor time resolution; the converse holds for a narrow window.
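A minimal sketch of this trade-off (a hypothetical 1 s, 8 kHz test signal whose tone switches from 1 kHz to 1.2 kHz halfway through; not part of the lab program):

fs = 8000;
t = (0:fs-1)/fs;
x = [sin(2*pi*1000*t(1:end/2)), sin(2*pi*1200*t(end/2+1:end))];
figure;
stft(x,fs,'Window',hamming(512,'periodic'),'OverlapLength',256); % wide window: sharp frequency peaks, smeared switching instant
figure;
stft(x,fs,'Window',hamming(64,'periodic'),'OverlapLength',32);   % narrow window: sharp switching instant, broad frequency peaks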
The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. s = stft(x,'FFTLength',...,'Window',...,'OverlapLength',...,'Centered',...): returns the Short Time Fourier Transform of the input signal, calculated in accordance with the specified parameters.

Program: -
clc;
clear all;
close all;

segment_length = 10000;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(1:segment_length);
N = 5000; % In millisecond
nf = 50; % Normalization factor
t = (0:N - 1) / nf; % Normalization necessary
f1 = 75;
f2 = 50;
f3 = 25;
f4 = 10;
fft_length = 2048;
window_length = 128;
window_type = 'hamming';
overlap_length = 64;
freq_centering = false;
fun = str2func(window_type);
x = cat(2,sin(2 * f1 * t(1:length(t) / 4)), ...
    sin(2 * f2 * t(length(t) / 4 + 1:length(t) / 2)), ...
    sin(2 * f3 * t(length(t) / 2 + 1:3 * length(t) / 4)), ...
    sin(2 * f4 * t(3 * length(t) / 4 + 1:end)));
[s,f,t] = stft(x,'FFTLength',fft_length,'Window',fun(window_length), ...
    'OverlapLength',overlap_length,'Centered',freq_centering);
[s_s,f_s,t_s] = stft(audioData,'FFTLength',fft_length,'Window',fun(window_length), ...
    'OverlapLength',overlap_length,'Centered',freq_centering);

figure(1);
surf(t / nf,f * nf / 2,abs(s));
xlabel('Time (ms)');
ylabel('Frequency (Hz)');
zlabel('Amplitude');
title('Short Time Fourier Transform of Non-Stationary Signal');
colormap jet

figure(2);
surf(abs(s_s));
xlabel('Time (ms)');
ylabel('Frequency (Hz)');
zlabel('Amplitude');
title('Short Time Fourier Transform of Speech Signal');
colormap jet
Output: -
The following figures were produced (images not reproduced here):
1. STFT of a 5-second-long synthetic non-stationary signal with harmonics of 10, 25, 50 and 75 Hz.
2. STFT of the recorded audio signal.
Conclusions: -
1. Short Time Fourier Transform provides a good time-frequency resolution for
multitoned and non-stationary signals containing frequencies on the lower harmonic
scale.
2. STFT is able to clearly distinguish the frequencies in the non-stationary signal,
occurring in their respective points of time.
3. Setting frequency centering to false produces a spectrum whose angular frequencies span the range [0, 2π) (equivalently [0, fs)), rather than being centered on [-π, π).
Result: -
The Short Time Fourier Transform (STFT) of the synthetic non-stationary signal and the recorded audio signal was computed and different operations performed thereon, as evidenced by the observations and conclusions drawn above.
5. Experiment 5

Aim: - To compute the scalogram of the audio signal using Discrete Wavelet Transform
and conduct its multi-resolution analysis
Software Used: - MATLAB R2020a
Theory: -
Wavelet Transform
Short Time Fourier Transform involves the segmentation of the signal into narrow
intervals and computation of Fourier Transform in each such interval. This is the proposed
methodology for obtaining the time-frequency information by windowing the incoming
signal.

\mathrm{STFT}_f(t', u) = \int [f(t) \, W(t - t')] \, e^{-j 2 \pi u t} \, dt

where f(t) is the incoming signal, W(t - t') is the window function centered at t', and the frequency parameter is denoted by the variable u. A wide window provides good frequency resolution but poor time resolution; the converse holds for a narrow window.
The wavelet transform is given by

\gamma(s, \tau) = \int f(t) \, \psi_{s,\tau}^{*}(t) \, dt

Its inverse is expressed as

f(t) = \iint \gamma(s, \tau) \, \psi_{s,\tau}(t) \, d\tau \, ds

All the wavelets are derived from the mother wavelet \psi through scaling and shifting:

\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}} \, \psi\!\left(\frac{t - \tau}{s}\right)

where s is the scaling parameter and \tau is the shift parameter of the wavelet.
The basis function is taken as the wavelet instead of complex exponentials as in the case
of Fourier Transform and STFT.
Multi-Resolution Analysis of DWT
We perform the multi-resolution analysis (MRA) of the maximal overlap discrete wavelet transform (MODWT) computed for a given signal. The same wavelet filter is used both in computing the MODWT and in its multi-resolution analysis.
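As a minimal illustration of this property (a toy random signal and the sym4 wavelet, as used in the program below), the MRA components add back to the analysed signal to within numerical precision:

x = randn(1,1024);               % toy signal
w = modwt(x,'sym4',5);           % 5-level MODWT
mra = modwtmra(w,'sym4');        % one row per detail level plus the final smooth
max(abs(x - sum(mra,1)))         % reconstruction error on the order of 1e-12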
The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. s = stft(x,'FFTLength',...,'Window',...,'OverlapLength',...,'Centered',...): returns the Short Time Fourier Transform of the input signal, calculated in accordance with the specified parameters.
3. y = cwt(audioData,wname): computes the continuous wavelet transform of the given data using the specified wavelet.
4. y1 = modwt(audioData,level,wname): calculates the maximal overlap discrete wavelet transform of the given data using the specified number of levels and wavelet.
5. y2 = modwtmra(y1,wname): performs the multi-resolution analysis of the computed MODWT using the same wavelet.

Program: -
Scalogram

clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);
fft_length = 2048;
window_length = 128;
window_type = 'hamming';
overlap_length = 64;
freq_centering = false;
fun = str2func(window_type);
[s_s,f_s,t_s] = stft(audioData,'FFTLength',fft_length,'Window',fun(window_length), ...
    'OverlapLength',overlap_length,'Centered',freq_centering);

figure(1);
plot(t / fs,audioData);
xlabel('Time (s)');
ylabel('Amplitude (V)');
title('Recorded Speech Signal');

wname = 'morse';
wt = cwt(audioData,wname);
figure(2);
subplot(2,1,1);
imagesc(t / fs,1:size(wt,1),abs(wt));
colorbar
xlabel('Time (s)');
ylabel('Frequency (Hz)');
title('Scalogram of Speech Signal');

dim = 1:size(wt,1) / 2;
s_s_norm = s_s(dim,:);
subplot(2,1,2);
imagesc(t / fs,1:size(s_s_norm,1),abs(s_s_norm));
colorbar
xlabel('Time (s)');
ylabel('Frequency (Hz)');
title('STFT of Speech Signal');

Multi–Resolution Analysis

clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);

level = 10;
wname = 'sym4';
mdwt_sp = modwt(audioData',level,wname);
mra = modwtmra(mdwt_sp,wname);
err = abs((audioData' - sum(mra)));

f1 = figure;
f2 = figure;
figure(f1);
subplot(level / 2 + 1,1,1);
plot(t / fs,audioData);
title('Original Recorded Audio');
for i = 1:level + 1
if i < level / 2 + 1
figure(f1);
subplot(level / 2 + 1,1,i + 1)
elseif i == level / 2 + 1
figure(f1);
xlabel('Time (s)');
figure(f2);
subplot(level / 2 + 1,1,i - level / 2)
else
figure(f2);
subplot(level / 2 + 1,1,i - level / 2)
end
x = ['D',num2str(i)];
plot(t / fs,mra(i,:));
title(x);
end
xlabel('Time (s)');
set(gcf,'Position', [0, 0, 2000, 2000]);

figure(3);
subplot(3,1,1);
plot(t / fs,audioData');
title('Recorded Audio Signal');
xlabel('Time (s)');
ylabel('Amplitude');
subplot(3,1,2);
plot(t / fs,sum(mra));
title('Reconstructed Audio Signal');
xlabel('Time (s)');
ylabel('Amplitude');
subplot(3,1,3);
plot(t / fs,err);
title('Reconstruction Error');
xlabel('Time (s)');
ylabel('Amplitude');
Output: -
The following figures were produced (images not reproduced here):
1. Time-series representation of the recorded audio signal.
2. Scalogram of the recorded audio signal.
3. STFT of the recorded audio signal.
4. Original signal and resolutions 1-5.
5. Resolutions 6-11.
6. Comparison between the original and reconstructed signal, and the reconstruction error.
Conclusions: -
1. The Short Time Fourier Transform provides an inferior time-frequency resolution in comparison to the wavelet transform: its spectral amplitudes appear fairly constant across frequency and span the entire frequency range, with a roughly uniform concentration over the frequency scale.
2. Further, in regions of the read audio file where no speech is present, the STFT still shows some spectral amplitude that does not appear in the scalogram computed using the wavelet transform, demonstrating the higher accuracy of the wavelet-based time-frequency representation.
3. A seven-level decomposition yields a component that is similar in appearance to the original audio signal and retains its characteristic non-stationarity; further levels of decomposition disintegrate the signal into near-fundamental-frequency components that are almost stationary.
4. The reconstruction error obtained with the 10-level multi-resolution analysis is of the order of 10^-13.
Result: -
The scalogram of the audio signal was computed using the Discrete Wavelet Transform
and its multi-resolution analysis was conducted, as evidenced by the observations and
conclusions drawn above.
6. Experiment 6

Aim: - To
i. Obtain MFCC data for an audio signal
ii. Perform frequency Domain Voice Activity Detection and Cepstral feature
extraction
Software Used: - MATLAB R2020a
Theory: -
The extraction of Mel Frequency Cepstral Coefficients is driven by human speech
perception and speech production. The coefficients so extracted represent the
information originating from the vocal tract filter, separated from the information content
of the glottal source. Further, the different coefficients tend to be largely uncorrelated with one another. The procedure for MFCC computation is as follows:
a. Calculation of the frequency spectrum and application of Mel binning
b. Application of the inverse DFT to the logarithm of the mel-warped spectrum to produce the cepstrum
c. Assembly of the 39-dimensional MFCC feature vector, consisting of the first 12 significant cepstral coefficients plus the frame energy (the sum of the power of the frame samples), 13 delta coefficients and 13 double-delta coefficients
The features may also be extracted in the frequency domain itself, by first transforming the audio signal into the frequency domain (as done in the second program below). The MFCC extraction procedure outlined above is depicted in the figure below.

Figure 1: MFCC Extraction
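As a rough illustration of steps (a) and (b) above, the following sketch computes MFCC-like coefficients for a single hypothetical frame. It is a sketch of the idea, not the internals of cepstralFeatureExtractor, and it assumes the Audio Toolbox function designAuditoryFilterBank is available:

fs = 16000;
frame = randn(400,1).*hamming(400);                                % one windowed 25 ms frame (placeholder audio)
spec = abs(fft(frame,512)).^2;                                     % power spectrum
fb = designAuditoryFilterBank(fs,'FFTLength',512,'NumBands',26);   % mel filter bank (26 bands x 257 bins)
melSpec = fb*spec(1:257);                                          % mel binning of the one-sided spectrum
c = dct(log(melSpec + eps));                                       % "inverse DFT" (DCT) of the log mel spectrum
mfcc13 = c(1:13);                                                  % first 13 cepstral coefficients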

The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. cepFeatures = cepstralFeatureExtractor('SampleRate',fs): creates a cepstralFeatureExtractor object for extracting cepstral features from an audio segment sampled at fs samples per second.
3. afr = dsp.AudioFileReader: returns an audio file reader System object that reads audio from an audio file.
4. asyncBuff = dsp.AsyncBuffer: returns an async buffer System object, used to write samples to and read samples from a first-in, first-out (FIFO) buffer.
5. ss = dsp.SignalSink: returns a signal sink that logs 2-D input data in the object.
6. VAD = voiceActivityDetector('InputDomain','Frequency'): creates a voice activity detector System object, VAD, that accepts frequency-domain input.
Program: -
MFCC Extraction

clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);

duration = round(0.04 * fs); % 40 ms audio segment


audioSegment = audioData(40000:40000+duration-1);
cepFeatures = cepstralFeatureExtractor('SampleRate',fs);

[coeffs,delta,deltaDelta] = cepFeatures(audioSegment);
[filterbank, freq] = getFilters(cepFeatures);

audioSegmentTwo = audioData(58200:58200+duration-1); % Number of cepstral coefficients determined by NumCoeffs
[coeffsTwo,deltaTwo,deltaDeltaTwo] = cepFeatures(audioSegmentTwo); % Subtracting 2 deltas gives deltadelta

audioSegmentThree = audioData(20000:20000+duration-1); % Number of cepstral coefficients determined by NumCoeffs
[coeffsThree,deltaThree,deltaDeltaThree] = cepFeatures(audioSegmentThree); % Subtracting 2 deltas gives deltadelta

subplot(3,1,1);
plot(deltaTwo);
title('DeltaTwo');
subplot(3,1,2);
plot(deltaThree);
title('DeltaThree');
subplot(3,1,3);
plot(deltaDeltaThree);
title('DeltaDeltaThree');

Voice Activity Detection and Cepstral Feature Extraction in Frequency Domain

clc;
clear all;
close all;

file_name = 'Counting-16-44p1-mono-15secs.wav';
fileReader = dsp.AudioFileReader(file_name); % Audio file reader
fs = fileReader.SampleRate;

samplesPerFrame = ceil(0.03 * fs); % 30 ms frames with 10 ms hop, i.e. 20 ms overlap
samplesPerHop = ceil(0.01 * fs);
samplesPerOverlap = samplesPerFrame - samplesPerHop;

fileReader.SamplesPerFrame = samplesPerHop;
buffer = dsp.AsyncBuffer; % Asynchronous buffer

VAD = voiceActivityDetector('InputDomain','Frequency'); % VAD object
cepFeatures = cepstralFeatureExtractor('InputDomain','Frequency', ...
    'SampleRate',fs,'LogEnergy','Replace'); % Cepstral feature extractor object
sink = dsp.SignalSink; % Sink to buffer

threshold = 0.5;
nanVector = nan(1,13);
while ~isDone(fileReader)
audioIn = fileReader();
write(buffer,audioIn); % Reading each hop

overlappedAudio = read(buffer,samplesPerFrame,samplesPerOverlap); % Read a frame with the stipulated overlap length
X = fft(overlappedAudio,2048); % Conversion into frequency domain

probabilityOfSpeech = VAD(X); % Probability of existence of speech


if probabilityOfSpeech > threshold
[xFeatures,delta,deltadelta] = cepFeatures(X); % Extract cepstral features if speech is present
sink(xFeatures')
else
sink(nanVector) % Store Nan otherwise
end
end

timeVector = linspace(0,15,size(sink.Buffer,1));
figure(1);
plot(timeVector,sink.Buffer)
title('Cepstral Coefficients');
xlabel('Time (s)')
ylabel('MFCC Amplitude')
legend('Log-Energy','c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12')

Output: -
The following figures were produced (images not reproduced here):
1. MFCC coefficients, delta and double-delta features.
2. Delta and double-delta features of the cepstrum.
3. Cepstral coefficients extracted from the frequency-domain audio signal.
Conclusions: -
1. Delta and Double Delta features of the initial audio segment are always zero.
2. Double delta feature is the difference of the delta features of the previous 2 audio
segments.
3. In the cepstral coefficients extracted in the frequency domain, certain coefficients stand out with respect to the other coefficients in the strength of their response.
4. Therefore, every feature can be uniquely isolated through appropriate filtering
schemes.
Result: -
The MFCC data for an audio signal was obtained and frequency domain Voice Activity
Detection and subsequent Cepstral feature extraction was undertaken, as evidenced by
the observation and conclusions drawn above.
7. Experiment 7

Aim: - To calculate different spectral descriptors of an audio signal


Software Used: - MATLAB R2020a
Theory: -
Spectral descriptors are used to characterize the nature and shape of an audio segment.
They are widely used in speaker identification and recognition, acoustic scene
recognition, instrument identification, music genre classification, mood recognition and
voice activity detection.
The following spectral descriptors are computed for the audio signal:
a. Spectral Centroid - The spectral centroid represents the "center of gravity" of the spectrum and is used as an indication of energy localization (a manual computation sketch follows this list). It is expressed as

\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k s_k}{\sum_{k=b_1}^{b_2} s_k}

where f_k is the frequency corresponding to bin k, s_k is the spectral value at bin k, and b_1 and b_2 are the band edges, in bins.

b. Spectral Spread - It represents the "instantaneous bandwidth" of the spectrum and is used as an indication of the dominance of a tone. It is given by

\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^2 s_k}{\sum_{k=b_1}^{b_2} s_k}}

where \mu_1 is the spectral centroid.

c. Spectral Skewness - The spectral skewness assesses the symmetry around the centroid. In phonetics, it is often referred to as spectral tilt and is used with other spectral moments to distinguish the place of articulation. For harmonic signals, it indicates the relative strength of higher and lower harmonics. It is given by

\mu_3 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^3 s_k}{\mu_2^3 \sum_{k=b_1}^{b_2} s_k}

where \mu_1 is the spectral centroid and \mu_2 is the spectral spread.

d. Spectral Kurtosis - The spectral kurtosis measures the flatness, or non-Gaussianity, of the spectrum around its centroid; conversely, it can be used to measure the peakiness of the spectrum. It is computed as

\mu_4 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^4 s_k}{\mu_2^4 \sum_{k=b_1}^{b_2} s_k}

e. Spectral Entropy - It has been used successfully in voiced/unvoiced decisions for automatic speech recognition, as unvoiced regions tend to have higher entropy than voiced regions due to their greater randomness. It is expressed as

entropy = \frac{-\sum_{k=b_1}^{b_2} s_k \log(s_k)}{\log(b_2 - b_1)}

f. Spectral Flatness - It is an indication of the peakiness of the spectrum. A higher spectral flatness indicates noise, while a lower spectral flatness indicates tonality. It is given as

flatness = \frac{\left( \prod_{k=b_1}^{b_2} s_k \right)^{\frac{1}{b_2 - b_1}}}{\frac{1}{b_2 - b_1} \sum_{k=b_1}^{b_2} s_k}

g. Spectral Slope - The spectral slope is directly related to the resonant characteristics of the vocal folds and has also been applied to speaker identification. It is a socially important aspect of timbre, and slope differences can be discriminated early in childhood development. The spectral slope is most pronounced when the energy in the lower formants is much greater than the energy in the higher formants. It is given by

slope = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_f)(s_k - \mu_s)}{\sum_{k=b_1}^{b_2} (f_k - \mu_f)^2}

where \mu_f is the mean frequency and \mu_s is the mean spectral value.

h. Spectral Decrease - Along with the slope, it is used in the analysis of music, particularly in instrument recognition. It is given by

decrease = \frac{\sum_{k=b_1+1}^{b_2} \frac{s_k - s_{b_1}}{k - 1}}{\sum_{k=b_1+1}^{b_2} s_k}

i. Spectral Roll-off Point - It measures the bandwidth of the audio signal by finding the frequency bin below which a given fraction of the spectral energy is concentrated. It is primarily used in detection and classification tasks on different types of acoustic signals. It is expressed as

\mathrm{Rolloff\ Point} = i \ \text{such that} \ \sum_{k=b_1}^{i} |s_k| = \kappa \sum_{k=b_1}^{b_2} s_k

where \kappa is the specified energy threshold, usually 95% or 85%.
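A manual computation sketch for the centroid in (a), compared against spectralCentroid operating on the same one-sided spectrum (hypothetical chirp test signal; the two values are expected to agree up to the function's default band edges):

fs = 8000;
x = chirp((0:8191)'/fs,200,1,1500).*hann(8192);   % test signal: 200 Hz to 1.5 kHz chirp
s = abs(fft(x));
s = s(1:4097);                                    % one-sided magnitude spectrum
f = (0:4096)'*fs/8192;                            % bin frequencies in Hz
mu1 = sum(f.*s)/sum(s)                            % centroid from the formula in (a)
centroid = spectralCentroid(s,f)                  % same quantity from the toolbox function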

The following functions are used in the experiment to achieve the objective:
1. [y,fs] = audioread(filename,samples): reads the audio data in the given sample range from the specified file and returns the data in y together with the sample rate fs.
2. centroid = spectralCentroid(x,f): returns the spectral centroid of the signal x over time.
3. spread = spectralSpread(x,f): returns the spectral spread of the signal x over time.
4. skewness = spectralSkewness(x,f): returns the spectral skewness of the signal x over time.
5. kurtosis = spectralKurtosis(x,f): returns the spectral kurtosis of the signal x over time.
6. entropy = spectralEntropy(x,f): returns the spectral entropy of the signal x over time.
7. flatness = spectralFlatness(x,f): returns the spectral flatness of the signal x over time.
8. slope = spectralSlope(x,f): returns the spectral slope of the signal x over time.
9. decrease = spectralDecrease(x,f): returns the spectral decrease of the signal x over time.
10. rolloffPoint = spectralRolloffPoint(x,f): returns the spectral roll-off point of the signal x over time.

In each case the interpretation of x depends on the shape of f: a scalar f is treated as the sample rate of a time-domain signal, while a vector f is treated as the frequencies of a frequency-domain input.

Program: -
clc;
clear all;
close all;

file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = sum(audioData,2)/size(audioData,2); % Downmix to mono

centroid = spectralCentroid(audioData,fs); % Spectral Centroid

figure(1);
subplot(2,1,1)
t_ca = linspace(0,size(audioData,1)/fs,size(audioData,1));
plot(t_ca,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');
subplot(2,1,2)
t_cc = linspace(0,size(audioData,1)/fs,size(centroid,1));
plot(t_cc,centroid)
xlabel('Time (s)')
ylabel('Centroid (Hz)')
title('Centroid of the Recorded Audio Signal');

spread = spectralSpread(audioData,fs); % Spectral Spread

figure(2);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),2048,fs,'yaxis')
title('Spectrogram of Audio Signal');
subplot(2,1,2)
t_ss = linspace(0,size(audioData,1)/fs,size(spread,1));
plot(t_ss,spread)
xlabel('Time (s)')
ylabel('Spread')
title('Spectral Spread of Audio Signal');

skewness = spectralSkewness(audioData,fs); % Spectral Skewness


t_s = linspace(0,size(audioData,1)/fs,size(skewness,1))/60;

figure(3);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power')
view([-58 33])
title('Recorded Audio Signal');

subplot(2,1,2)
plot(t_s,skewness)
xlabel('Time (minutes)')
ylabel('Skewness')
title('Skewness of Audio Signal');

kurtosis = spectralKurtosis(audioData,fs); % Spectral Kurtosis

t_k = linspace(0,size(audioData,1)/fs,size(audioData,1));

figure(4);
subplot(2,1,1)
plot(t_k,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');

t_k = linspace(0,size(audioData,1)/fs,size(kurtosis,1)); % Spectral Kurtosis


subplot(2,1,2)
plot(t_k,kurtosis)
xlabel('Time (s)')
ylabel('Kurtosis')
title('Kurtosis of Audio Signal');

entropy = spectralEntropy(audioData,fs); % Spectral Entropy

t_e = linspace(0,size(audioData,1)/fs,size(audioData,1));
figure(5);
subplot(2,1,1)
plot(t_e,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');

t_e = linspace(0,size(audioData,1)/fs,size(entropy,1));
subplot(2,1,2)
plot(t_e,entropy)
xlabel('Time (s)')
ylabel('Entropy')
title('Entropy of Audio Signal');

flatness = spectralFlatness(audioData,fs); % Spectral Flatness

figure(6);
subplot(2,1,1)
t_f = linspace(0,size(audioData,1)/fs,size(audioData,1));
plot(t_f,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');

subplot(2,1,2)
t_f = linspace(0,size(audioData,1)/fs,size(flatness,1));
plot(t_f,flatness)
ylabel('Flatness')
xlabel('Time (s)')
title('Flatness of Audio Signal');

specslope = spectralSlope(audioData,fs); % Spectral Slope


t_ss = linspace(0,size(audioData,1)/fs,size(specslope,1));

figure(7);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power');
title('Spectrogram of Audio Signal');
subplot(2,1,2)
plot(t_ss,specslope)
title('Spectral Slope')
ylabel('Slope')
xlabel('Time (s)')

spectral_decrease = spectralDecrease(audioData,fs); % Spectral Decrease


t_d = linspace(0,size(audioData,1)/fs,size(spectral_decrease,1));
figure(8);
plot(t_d,spectral_decrease)
title('Spectral Decrease')
ylabel('Decrease')
xlabel('Time (s)')

spectral_rolloff = spectralRolloffPoint(audioData,fs); % Spectral Rolloff Point
t_sr = linspace(0,size(audioData,1)/fs,size(spectral_rolloff,1));
figure(9);
plot(t_sr,spectral_rolloff)
title('Spectral Rolloff Point')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')

Output: -
The following figures were produced (images not reproduced here):
1. Spectral centroid.
2. Spectral spread.
3. Skewness.
4. Kurtosis.
5. Entropy.
6. Flatness.
7. Spectral slope.
8. Spectral decrease.
9. Spectral roll-off point.
Conclusions: -
1. The centroid deviates towards the portions of the signal with a higher amplitude scale.
2. The spectral spread increases in regions where the bandwidth is larger, owing to the tones being spread farther apart.
3. The skewness represents the tilt of the spectrum about the centroid.
4. Kurtosis is lower where the audio signal is nearly uniform.
5. Regions of voiced speech have lower entropy than the unvoiced regions.
6. Higher spectral flatness occurs in segments with noise/unvoiced regions; in voiced regions the flatness is low.
7. The spectral slope accurately displays the amount of decrement in the spectrum of the audio signal.
8. The spectral decrease models the amount of decrease in the spectrum.
9. The spectral roll-off point is able to distinguish between voiced and unvoiced regions, and locates the frequency bin below which a given percentage of the spectral energy falls, thus measuring the associated bandwidth.
Result: -
The different spectral descriptors of an audio signal were calculated, as evidenced by the
observation and conclusions drawn above.
8. Experiment 8

Aim: - Recognition of emotion in a speech signal


Software Used: - MATLAB R2020a
Theory: -
A simple speech emotion recognition (SER) system is implemented using a BiLSTM
network which was trained on a small German-language database, containing 535
utterances spoken by 10 actors intended to convey one of the following text independent
emotions: anger, boredom, disgust, anxiety/fear, happiness, sadness, or neutral. A pre-
trained network is used for the categorization of the emotions, wherein the sample rate of
the data set is considered. The features were chosen using the sequential feature
selection. Subsequently, the feature sequences are fed into the network for prediction
and mean prediction is calculated. Further, the probability distribution of the chosen
emotions is also plotted.
Network training performed using the 10-fold yielded an average of 60% cross validation
accuracy because of insufficient training data, which leads to both overfitting and under
fitting. This can be enhanced by increasing the size of the data set, which is done keeping
in mind the tradeoff between processing time and accuracy improvement.
Deployment training is done using all available speakers in the dataset. While system
validation training, in order to provide an accurate assessment of the model, training and
validation is undertaken using leave-one-speaker-out (LOSO) k-fold cross validation. In
this method, we train using k−1 speakers and then validate on the left-out speaker. The
process is repeated k times, with the final validation accuracy being the average of the k
folds.
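A minimal sketch of the LOSO partitioning used by the helper function in the program below (it assumes the speaker labels are stored in ads.Labels.Speaker, as they are here):

speakers = categories(ads.Labels.Speaker);
for k = 1:numel(speakers)
    idxVal = ads.Labels.Speaker == speakers(k);   % hold out one speaker
    adsValidation = subset(ads,idxVal);
    adsTraining = subset(ads,~idxVal);            % train on the remaining k-1 speakers
    % ... extract features from adsTraining, train the network, validate on adsValidation ...
end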
The following functions are used in the experiment to achieve the objective:
1. ADS = audioDatastore(location): creates a datastore ADS based on an audio file or a collection of audio files in location; it is used to manage a collection of audio files where each individual file may fit in memory but the collection as a whole may not.
2. aFE = audioFeatureExtractor(): creates an audio feature extractor with default property values.
3. aug = audioDataAugmenter(): creates an audio data augmenter object with default property values. It is used to enlarge an audio dataset using audio-specific augmentation techniques such as pitch shifting, time-scale modification, time shifting, noise addition and volume control, and to build cascaded or parallel augmentation pipelines that apply multiple algorithms deterministically or probabilistically.
4. layer = sequenceInputLayer(inputSize): creates a sequence input layer, used to input sequence data to the network, and sets the InputSize property.
5. layer = bilstmLayer(numHiddenUnits): creates a bidirectional Long Short-Term Memory (BiLSTM) layer and sets the NumHiddenUnits property. A BiLSTM layer learns bidirectional long-term dependencies between time steps of time-series or sequence data; these dependencies are useful when the network should learn from the complete time series at each time step.
6. layer = dropoutLayer: creates a dropout layer, which randomly sets input elements to zero with a given probability.
7. net = trainNetwork(sequences,Y,layers,options): trains a network for sequence classification and regression problems (for example, an LSTM or BiLSTM network), where sequences contains the sequence or time-series predictors and Y contains the responses. For classification problems, Y is a categorical vector or a cell array of categorical sequences; for regression problems, Y is a matrix of targets or a cell array of numeric sequences.

Program: -
clc;
clear all;
close all;

url = "http://emodb.bilderbar.info/download/download.zip";
downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,"Emo-DB");

if ~exist(datasetFolder,'dir')
disp('Downloading Emo-DB (40.5 MB)...')
unzip(url,datasetFolder)
end

ads = audioDatastore(fullfile(datasetFolder,"wav"));

filepaths = ads.Files;
emotionCodes = cellfun(@(x)x(end-5),filepaths,'UniformOutput',false);
emotions = replace(emotionCodes,{'W','L','E','A','F','T','N'}, ...
{'Anger','Boredom','Disgust','Anxiety/Fear','Happiness','Sadness','Neutral'});

speakerCodes = cellfun(@(x)x(end-10:end-9),filepaths,'UniformOutput',false);
labelTable = cell2table([speakerCodes,emotions],'VariableNames',{'Speaker','Emotion'});
labelTable.Emotion = categorical(labelTable.Emotion);
labelTable.Speaker = categorical(labelTable.Speaker);
summary(labelTable)

ads.Labels = labelTable;

load('network_Audio_SER.mat','net','afe','normalizers');
fs = afe.SampleRate;

speaker = categorical("03");    % Speaker code present in the dataset (selected via a dropdown in the original live script)
emotion = categorical("Anger"); % One of the emotion labels defined above (selected via a dropdown in the original live script)

adsSubset = subset(ads,ads.Labels.Speaker==speaker & ads.Labels.Emotion == emotion);

audio = read(adsSubset);
sound(audio,fs)

features = (extract(afe,audio))';

featuresNormalized = (features - normalizers.Mean)./normalizers.StandardDeviation;


numOverlap = 10;
featureSequences = HelperFeatureVector2Sequence(featuresNormalized,20,numOverlap);

YPred = double(predict(net,featureSequences));
average = 'median'; % 'mean', 'median' or 'mode' (the plotted output uses the median)
switch average
case 'mean'
probs = mean(YPred,1);
case 'median'
probs = median(YPred,1);
case 'mode'
probs = mode(YPred,1);
end

pie(probs./sum(probs),string(net.Layers(end).Classes))

numAugmentations = 50;
augmenter = audioDataAugmenter('NumAugmentations',numAugmentations, ...
'TimeStretchProbability',0, ...
'VolumeControlProbability',0, ...
...
'PitchShiftProbability',0.5, ...
...
'TimeShiftProbability',1, ...
'TimeShiftRange',[-0.3,0.3], ...
...
'AddNoiseProbability',1, ...
'SNRRange', [-20,40]);
currentDir = pwd;
writeDirectory = fullfile(currentDir,'augmentedData');
mkdir(writeDirectory)

N = numel(ads.Files)*numAugmentations;
myWaitBar = HelperPoolWaitbar(N,"Augmenting Dataset...");

reset(ads)

numPartitions = 18;

tic
parfor ii = 1:numPartitions
adsPart = partition(ads,numPartitions,ii);
while hasdata(adsPart)
[x,adsInfo] = read(adsPart);
data = augment(augmenter,x,fs);

[~,fn] = fileparts(adsInfo.FileName);
for i = 1:size(data,1)
augmentedAudio = data.Audio{i};
augmentedAudio = augmentedAudio/max(abs(augmentedAudio),[],'all');
augNum = num2str(i);
if numel(augNum)==1
iString = ['0',augNum];
else
iString = augNum;
end

audiowrite(fullfile(writeDirectory,sprintf('%s_aug%s.wav',fn,iString)),augmentedAudio,fs);
increment(myWaitBar)
end
end
end

delete(myWaitBar)
fprintf('Augmentation complete (%0.2f minutes).\n',toc/60)

adsAug = audioDatastore(writeDirectory);
adsAug.Labels = repelem(ads.Labels,augmenter.NumAugmentations,1);

win = hamming(round(0.03*fs),"periodic");
overlapLength = 0;

afe = audioFeatureExtractor( ...
    'Window',win, ...
    'OverlapLength',overlapLength, ...
    'SampleRate',fs, ...
    'gtcc',true, ...
    'gtccDelta',true, ...
    'mfccDelta',true, ...
    'SpectralDescriptorInput','melSpectrum', ...
    'spectralCrest',true);

adsTrain = adsAug;
tallTrain = tall(adsTrain);
featuresTallTrain = cellfun(@(x)extract(afe,x),tallTrain,"UniformOutput",false);
featuresTallTrain = cellfun(@(x)x',featuresTallTrain,"UniformOutput",false);
featuresTrain = gather(featuresTallTrain);

allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');

featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);

featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
[sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence( ...
    featuresTrain,featureVectorsPerSequence,featureVectorOverlap);

labelsTrain = repelem(adsTrain.Labels.Emotion,[sequencePerFileTrain{:}]);

emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];

dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(size(sequencesTrain{1},1))
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,"OutputMode","last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer
classificationLayer];

miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
"MiniBatchSize",miniBatchSize, ...
"InitialLearnRate",initialLearnRate, ...
"LearnRateDropPeriod",learnRateDropPeriod, ...
"LearnRateSchedule","piecewise", ...
"MaxEpochs",maxEpochs, ...
"Shuffle","every-epoch", ...
"Verbose",false, ...
"Plots","Training-Progress");

net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

saveSERSystem = true; % Checkbox in the original live script; set false to skip saving the trained system
if saveSERSystem
normalizers.Mean = M;
normalizers.StandardDeviation = S;
save('network_Audio_SER.mat','net','afe','normalizers')
end
speaker = ads.Labels.Speaker;
numFolds = numel(speaker);
summary(speaker)
[labelsTrue,labelsPred] = HelperTrainAndValidateNetwork(ads,adsAug,afe);
for ii = 1:numel(labelsTrue)
foldAcc = mean(labelsTrue{ii}==labelsPred{ii})*100;
fprintf('Fold %1.0f, Accuracy = %0.1f\n',ii,foldAcc);
end
labelsTrueMat = cat(1,labelsTrue{:});
labelsPredMat = cat(1,labelsPred{:});
figure
cm = confusionchart(labelsTrueMat,labelsPredMat);
valAccuracy = mean(labelsTrueMat==labelsPredMat)*100;
cm.Title = sprintf('Confusion Matrix for 10-Fold Cross-Validation\nAverage Accuracy = %0.1f',valAccuracy);
sortClasses(cm,categories(emptyEmotions))
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';

function [sequences,sequencePerFile] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
    if featureVectorsPerSequence <= featureVectorOverlap
        error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')
    end

if ~iscell(features)
features = {features};
end
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
sequences = {};
sequencePerFile = cell(numel(features),1);
for ii = 1:numel(features)
        sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
idx2 = 1;
for j = 1:sequencePerFile{ii}
            sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1);
idx1 = idx1 + 1;
idx2 = idx2 + hopLength;
end
end
end

function [trueLabelsCrossFold,predictedLabelsCrossFold] = HelperTrainAndValidateNetwork(varargin)
if nargin == 3
ads = varargin{1};
augads = varargin{2};
extractor = varargin{3};
elseif nargin == 2
ads = varargin{1};
augads = varargin{1};
extractor = varargin{2};
end
speaker = categories(ads.Labels.Speaker);
numFolds = numel(speaker);
emptyEmotions = (ads.Labels.Emotion);
emptyEmotions(:) = [];

trueLabelsCrossFold = {};
predictedLabelsCrossFold = {};

for i = 1:numFolds

idxTrain = augads.Labels.Speaker~=speaker(i);
augadsTrain = subset(augads,idxTrain);
augadsTrain.Labels = augadsTrain.Labels.Emotion;
tallTrain = tall(augadsTrain);
idxValidation = ads.Labels.Speaker==speaker(i);
adsValidation = subset(ads,idxValidation);
adsValidation.Labels = adsValidation.Labels.Emotion;
tallValidation = tall(adsValidation);

        tallTrain = cellfun(@(x)x/max(abs(x),[],'all'),tallTrain,"UniformOutput",false);
        tallFeaturesTrain = cellfun(@(x)extract(extractor,x),tallTrain,"UniformOutput",false);
        tallFeaturesTrain = cellfun(@(x)x',tallFeaturesTrain,"UniformOutput",false);
        [~,featuresTrain] = evalc('gather(tallFeaturesTrain)');
        tallValidation = cellfun(@(x)x/max(abs(x),[],'all'),tallValidation,"UniformOutput",false);
        tallFeaturesValidation = cellfun(@(x)extract(extractor,x),tallValidation,"UniformOutput",false);
        tallFeaturesValidation = cellfun(@(x)x',tallFeaturesValidation,"UniformOutput",false);
        [~,featuresValidation] = evalc('gather(tallFeaturesValidation)');
allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);
for ii = 1:numel(featuresTrain)
idx = find(isnan(featuresTrain{ii}));
if ~isempty(idx)
featuresTrain{ii}(idx) = 0;
end
end
        featuresValidation = cellfun(@(x)(x-M)./S,featuresValidation,'UniformOutput',false);
for ii = 1:numel(featuresValidation)
idx = find(isnan(featuresValidation{ii}));
if ~isempty(idx)
featuresValidation{ii}(idx) = 0;
end
end
featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
        [sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence( ...
            featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
        [sequencesValidation,sequencePerFileValidation] = HelperFeatureVector2Sequence( ...
            featuresValidation,featureVectorsPerSequence,featureVectorOverlap);

labelsTrain = [emptyEmotions;augadsTrain.Labels];
labelsTrain = labelsTrain(:);
labelsTrain = repelem(labelsTrain,[sequencePerFileTrain{:}]);

dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(size(sequencesTrain{1},1))
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,"OutputMode","last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer
classificationLayer];

miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
"MiniBatchSize",miniBatchSize, ...
"InitialLearnRate",initialLearnRate, ...
"LearnRateDropPeriod",learnRateDropPeriod, ...
"LearnRateSchedule","piecewise", ...
"MaxEpochs",maxEpochs, ...
"Shuffle","every-epoch", ...
"Verbose",false);

net = trainNetwork(sequencesTrain,labelsTrain,layers,options);

predictedLabelsPerSequence = classify(net,sequencesValidation);
trueLabels = categorical(adsValidation.Labels);
predictedLabels = trueLabels;
idx1 = 1;
for ii = 1:numel(trueLabels)
            predictedLabels(ii,:) = mode(predictedLabelsPerSequence(idx1:idx1 + sequencePerFileValidation{ii} - 1,:),1);
idx1 = idx1 + sequencePerFileValidation{ii};
end
trueLabelsCrossFold{i} = trueLabels;
predictedLabelsCrossFold{i} = predictedLabels;
end
end
Output: -
The following figures were produced (images not reproduced here):
1. Pie chart of the median probability of the different emotions pertaining to the selected subject.
2. Training progress over 1212 iterations using 10-fold cross validation.
3. Confusion matrix for 10-fold cross-validation.
Conclusions: -
1. Features used herein were selected using sequential feature selection.
2. The training and validation is done using leave-one-speaker-out (LOSO) k-fold cross
validation.
3. A model trained on insufficient data may suffer from overfitting or underfitting. This is alleviated using signal augmentation undertaken via pitch shifting, time-scale modification, time shifting, noise addition, and volume control.
Result: -
Emotion recognition in speech signal was performed, as evidenced by the observation
and conclusions drawn above.
