SR_Lab File
1. Experiment 1
2. Experiment 2
3. Experiment 3
4. Experiment 4
5. Experiment 5
6. Experiment 6
7. Experiment 7
8. Experiment 8
1. Experiment 1
Aim: - To record, read and write an audio signal and execute other related functions
Software Used: - MATLAB R2020a
Theory: -
Production of speech in the human vocal system is done in four stages. They are briefly listed below, in the sequence of occurrence.
1. Breathing (Lungs, Diaphragm, Rib Muscles): Intake of the pulmonic air stream up to full capacity using the diaphragm.
2. Phonation (Larynx, Vocal Cords, Trachea): Production of voice through vibration of the vocal cords.
3. Resonation (Upper part of the larynx, pharynx, nasal cavity, oral cavity): Voice amplification and modification.
4. Articulation (Uvula, Velum, Tongue, Lower Lip, Upper Jaw): Production and characterization of phonemes and accents.
The below functions are used in the experiment for the achievement of the objective
1. r = audiorecorder(fs,nBits,NumChannels,ID): Creates an audio recorder object for the input device specified by the device identifier ID, with NumChannels channels, sampled at fs and quantized to nBits.
2. r = getaudiodata(recorder,datatype): Obtains the audio data from the recorder object and converts it to the specified datatype.
3. p = play(recobj,[start stop]): Plays the audio between the samples specified by start and stop.
4. audiowrite(filename,y,Fs,Name,Value): Writes the matrix of audio data y, sampled at Fs, to the file of the specified name. The name-value pairs include, inter alia, bits per sample, bit rate, quality, artist, title and comments.
5. [y,fs] = audioread(filename,samples): Reads the audio data in the given sample range from the specified file and returns the data in y, sampled at fs.
6. player = audioplayer(y,Fs,nBits,ID): Creates an audio player object with the specified parameters.
7. P = get(recorder): Obtains the property values of the specified object.
8. record(obj): Records audio from the input device into the recorder object without blocking program execution.
9. recorder.pause: Pauses the recording temporarily.
10. recorder.resume: Resumes recording.
11. recorder.stop: Stops recording.
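As a minimal illustration of functions 4 and 5 above (a sketch with assumed values; the file name and tone are illustrative and not part of the lab recording), a signal can be written to disk and read back as follows:
Fs = 8000; % assumed sampling rate
y = 0.5 * sin(2 * pi * 440 * (0:Fs - 1)' / Fs); % one second of a 440 Hz tone
audiowrite('tone_demo.wav',y,Fs,'BitsPerSample',16); % write 16-bit PCM audio
[y_read,fs_read] = audioread('tone_demo.wav'); % read the full file back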
Program: -
clc;
clear all;
close all;
Fs = 44100;
nbits = 16;
nchannels = 2;
ID = -1;
t = 5;
file = 'audio_rec.wav';
chk_rec = false; % Set to true to record fresh audio; false to read the existing file
if chk_rec
recorder = audiorecorder(Fs,nbits,nchannels,ID);
disp('Start Speaking');
recordblocking(recorder,t);
disp('Stop Speaking');
audio_data = getaudiodata(recorder); % Retrieve the recorded samples
audiowrite(file,audio_data,Fs); % Save the recording to audio_rec.wav
end
if ~chk_rec
audio_data = audioread(file);
audio_player = audioplayer(audio_data,Fs,nbits,ID);
end
t_start = 2;
t_stop = 4;
samples = [2 * Fs,4* Fs];
audio_data_res = audioread(file,samples);
duration = 1:1:Fs * t;
audio_player_res = audioplayer(audio_data_res,Fs,nbits,ID);
disp('Playing resampled audio');
pause(2);
play(audio_player_res);
figure(1);
subplot(2,1,1);
plot(duration / Fs,audio_data);
xlabel('Time');
ylabel('Amplitude');
title('Original Recorded Audio','FontSize',14);
subplot(2,1,2);
plot(duration(2 * Fs:4 * Fs) / Fs,audio_data_res);
xlabel('Time');
ylabel('Amplitude');
title('Resampled Audio Data','FontSize',14);
if chk_rec
properties = get(recorder); % Object function 3
else
properties = get(audio_player);
end
fs = 8000;
if ~chk_rec
audio_player_resampled = audioplayer(audio_data,fs,nbits,ID);
end
nBits = 8;
if ~chk_rec
audio_player_reformed = audioplayer(audio_data,Fs,nBits,ID);
end
% Pause/resume
recorder_pr = audiorecorder(Fs,nbits,nchannels,ID);
disp('Start Speaking');
record(recorder_pr); % Non-blocking recording
pause(2); % Let the recording run for 2 seconds
recorder_pr.pause; % Pause the recording
recorder_pr.resume; % Object function 4
pause(2); % Record for 2 more seconds
disp('Stop Speaking');
recorder_pr.stop; % Object function 5
play(recorder_pr);
Output: -
Conclusions: -
1. The recordblocking function does not relinquish control to the main program until the recording is completed.
2. To sub-sample the audio signal, the time interval (in seconds) is multiplied by the sampling frequency to isolate the samples in the desired interval (see the sketch below).
3. While plotting the sub-sampled signals, the time-axis variable is divided by the sampling frequency so that the axis shows the sub-sampled interval in seconds.
4. Resampling the audio data at a lower sampling frequency severely degrades the perceived audio quality.
5. Changing the number of bits has no perceptible effect on the recorded audio data.
6. For the pause, resume and stop functionalities, the record function is used instead of recordblocking.
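A minimal sketch of the interval-to-sample conversion referred to in conclusions 2 and 3 (the interval values are illustrative and the file name is assumed to be the one recorded above):
Fs = 44100;
t_start = 2; t_stop = 4; % interval in seconds
samples = [round(t_start * Fs),round(t_stop * Fs)]; % corresponding sample indices
[seg,Fs] = audioread('audio_rec.wav',samples); % read only that interval
plot((samples(1):samples(2)) / Fs,seg(:,1)); % divide by Fs so the x-axis is in seconds
xlabel('Time (s)');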
Result: -
Audio signal was recorded and related functions executed thereon, as evidenced by the
observations and conclusions drawn therefrom.
2. Experiment 2
Aim: - To detect the regions of speech in a recorded audio signal
Software Used: - MATLAB R2020a
Program: -
file_name = 'audio_rec.wav';
% Detecting speech regions
[audioIn,fs] = audioread(file_name);
figure(1);
detectSpeech(audioIn(:,1),fs); % Plot detected speech over the entire length of the audio
xlabel('Time');
ylabel('Amplitude');
title('Detected Speech in the Audio Signal');
% Thresholding using Windowing, Overlap Length and Merge Distance
window_duration = 0.05; % Analysis window duration in seconds (assumed value)
window_samples = round(window_duration * fs); % Window length in samples
window = hann(window_samples,'periodic'); % Custom analysis window
overlap_percent = 10;
overlapping_samples = round(window_samples * overlap_percent / 100); % Number of overlapping samples in adjacent windows
merge_duration = 0.1;
merge_distance = round(merge_duration * fs); % Number of samples to be merged on occurrence of a positive detection of speech
figure(2);
detectSpeech(audioIn(:,1),fs,'Window',window,'OverlapLength',overlapping_samples,'MergeDistance',merge_distance);
xlabel('Time');
ylabel('Amplitude');
title('Speech detection using custom window');
% Thresholds learnt on the initial 30% of the signal and reused on the remainder
split_fraction = 0.3; % Assumed split ratio, consistent with the conclusions below
split_point = round(split_fraction * size(audioIn,1));
first_part = audioIn(1:split_point,1);
second_part = audioIn(split_point + 1:end,1);
[~,thresholds] = detectSpeech(first_part,fs); % Second output returns the detected thresholds
figure(3);
detectSpeech(second_part,fs,'Thresholds',thresholds);
xlabel('Time');
ylabel('Amplitude');
title('Speech detection using the predetected thresholds');
Output: -
Figure 1: Detected boundaries in the audio signal without any custom window.
Figure 3: Detected boundaries in the speech signal using thresholds obtained from the initial 30% segment of the signal.
Conclusions: -
1. Different window functions, viz. Chebyshev, Hanning, Bartlett, Blackman, Gaussian, Tukey, Kaiser, Taylor and Bohman, may be used with the relevant attributes for detecting the speech segments in the audio signal (see the sketch below).
2. Window length, overlap length and merge distance are specified in terms of the number of samples.
3. The audio signal is split in the desired ratio, and the thresholds detected in one part are used to identify the speech segments in the other part.
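A minimal sketch of conclusion 1 using a Chebyshev window (the window length, sidelobe attenuation and file name are assumed, illustrative values):
[audioIn,fs] = audioread('audio_rec.wav');
win = chebwin(round(0.05 * fs),100); % 50 ms Chebyshev window with 100 dB sidelobe attenuation
figure;
detectSpeech(audioIn(:,1),fs,'Window',win,'OverlapLength',round(0.025 * fs));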
Result: -
Regions of speech were detected in the recorded audio signal, as evidenced by the
observations and conclusions drawn above.
3. Experiment 3
Aim: - To compute the Fast Fourier Transform (FFT) of an audio signal and perform related operations
Software Used: - MATLAB R2020a
Theory: -
The N-point Discrete Fourier Transform, computed efficiently by the FFT, is given by
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1
The below functions are used in the experiment for the achievement of the objective
1. [y,fs] = audioread(filename,samples): Reads the audio data in the given sample range from the specified file and returns the data in y, sampled at fs.
2. y = fft(X,n,dim): Returns the n-point Fast Fourier Transform along the specified dimension dim.
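Since the program below repeatedly forms a single-sided amplitude spectrum from the fft output, a minimal sketch of that computation is given here (the tone frequency, sampling rate and length are illustrative, assumed values):
fs = 1000; L = 1000;
t = (0:L - 1) / fs;
x = sin(2 * pi * 50 * t);
Y = fft(x);
P = abs(Y / L);
P = P(1:L / 2 + 1); % keep the non-negative frequencies
P(2:end - 1) = 2 * P(2:end - 1); % fold the negative-frequency half into the positive half
f = fs * (0:L / 2) / L; % frequency axis for the one-sided spectrum
plot(f,P);
xlabel('Frequency (Hz)');
ylabel('Amplitude');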
Program: -
clc;
clear all;
close all;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
t = 0.01:0.01:10;
f1 = 10;
f2 = 20;
f3 = 30;
split_p1 = 0.2;
split_p2 = 0.5;
split_p3 = 1 - (split_p1 + split_p2);
s1 = sin(2 * pi * f1 * t) + sin(2 * pi * f2 * t) + sin(2 * pi * f3 * t); % Multitone signal
s2 = cat(2,sin(2 * pi * f1 * t(1:split_p1 * length(t))),sin(2 * pi * f2 * t(split_p1 * length(t) + 1:split_p1 * length(t) + split_p2 * length(t))),...
sin(2 * pi * f3 * t((split_p1 + split_p2) * length(t) + 1:end))); % Synthetic non-stationary signal
N = randn(size(t)); % AWGN of unit variance
Y_s1 = fft(s1 + N); % FFT of the noisy multitone signal
Y1 = abs(Y_s1 / length(t));
Y1 = Y1(1:length(t) / 2 + 1);
Y1(2:end - 1) = 2 * Y1(2:end - 1); % Single-sided amplitude spectrum
Y_s2 = fft(s2 + N); % FFT of the noisy non-stationary signal
Y2 = abs(Y_s2 / length(t));
Y2 = Y2(1:length(t) / 2 + 1);
Y2(2:end - 1) = 2 * Y2(2:end - 1);
figure(1);
subplot(4,1,1);
plot(t,s1);
xlabel('Time');
ylabel('Amplitude');
title('Multitone signal');
subplot(4,1,2);
plot(t,s1 + N);
xlabel('Time');
ylabel('Amplitude');
title('Multitone signal with AWGN');
subplot(4,1,3);
plot((1:length(t) / 2 + 1) / 10,Y1);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of noisy multitone signal');
subplot(4,1,4);
plot(t,real(ifft(Y_s1)));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of noisy multitone signal');
figure(2);
subplot(4,1,1);
plot(t,s2);
xlabel('Time');
ylabel('Amplitude');
title('Synthetic Non-stationary signal');
subplot(4,1,2);
plot(t,s2 + N);
xlabel('Time');
ylabel('Amplitude');
title('Synthetic Non-stationary signal with AWGN');
subplot(4,1,3);
plot((1:length(t) / 2 + 1) / 10,Y2);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of noisy Synthetic Non-stationary signal');
subplot(4,1,4);
plot(t,real(ifft(Y_s2)));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of noisy Synthetic Non-stationary signal');
d1 = 1:split_p1 * length(t);
d2 = 1:(split_p1 + split_p2) * length(t);
d3 = (split_p1 + split_p2) * length(t) + 1:length(t);
Y_s2_p1 = window_fft(s2,d1,N);
Y_s2_p2 = window_fft(s2,d2,N);
Y_s2_p3 = window_fft(s2,d3,N);
figure(3);
subplot(3,2,1);
plot(t(1:split_p1 * length(t)),s2(1:split_p1 * length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 1');
subplot(3,2,2);
plot(((1:length(d1) / 2 + 1) / 2 - 0.5) / (split_p1 * 10 / 2),Y_s2_p1);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 1');
subplot(3,2,3);
plot(t(1:(split_p1 + split_p2) * length(t)),s2(1:(split_p1 + split_p2) * length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 2');
subplot(3,2,4);
plot(((1:length(d2) / 2 + 1) / 2 - 0.5) / ((split_p1 + split_p2) * 10 / 2),Y_s2_p2);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 2');
subplot(3,2,5);
plot(t((split_p1 + split_p2) * length(t) + 1:length(t)),s2((split_p1 + split_p2) * length(t) + 1:length(t)));
xlabel('Time');
ylabel('Amplitude');
title('Signal in window 3');
subplot(3,2,6);
plot(((1:length(d3) / 2 + 1) / 2 - 0.5) / (split_p3 * 10 / 2),Y_s2_p3);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum in window 3');
Y_s3 = fft(audioData(:,1));
Y3 = abs(Y_s3/length(audioData));
Y3 = Y3(1:length(audioData) / 2 + 1);
Y3(2:end - 1) = 2 * Y3(2:end - 1);
figure(4);
subplot(3,1,1);
plot((1:length(audioData))/fs,audioData);
xlabel('Time');
ylabel('Amplitude');
title('Recorded Audio signal');
subplot(3,1,2);
plot((1:length(audioData) / 2 + 1) / 10,Y3);
xlabel('Frequency');
ylabel('Amplitude');
title('Single sided amplitude spectrum of audio signal');
subplot(3,1,3);
plot((1:length(audioData))/fs,real(ifft(Y_s3)));
xlabel('Time');
ylabel('Amplitude');
title('IFFT of audio signal');
% Local function assumed by the calls above: single-sided amplitude spectrum
% of the noisy signal restricted to the sample window d
function Y = window_fft(s,d,N)
x = s(d) + N(d);
Y_s = fft(x);
Y = abs(Y_s / length(d));
Y = Y(1:floor(length(d) / 2) + 1);
Y(2:end - 1) = 2 * Y(2:end - 1);
end
Output: -
Figure 1: FFT of the multitone signal consisting of harmonics of 10, 20 and 30 Hz, and its 1000-point IFFT.
Figure 2: FFT of the noisy non-stationary signal consisting of harmonics of 10, 20 and 30 Hz for 20%, 50% and the remaining 30% of the signal duration, and its 1000-point IFFT.
Figure 3: Segmentation of the non-stationary signal consisting of harmonics of 10, 20 and 30 Hz for 20%, 50% and the remaining 30% of the signal duration into the corresponding windows, and computation of the FFT in each window.
Figure 4: FFT of the recorded audio signal and its 220500-point IFFT.
Conclusions: -
1. The Fast Fourier Transform (FFT) is able to distinctly identify the frequencies present in a multitone signal in the presence of Additive White Gaussian Noise (AWGN) of unit variance.
2. The FFT is also able to detect the frequencies present in a non-stationary signal in the presence of noise, though the detected peaks are broadened around the identified frequencies, revealing a shortcoming of the FFT in the spectral analysis of non-stationary signals.
3. The FFT is able to identify the frequencies present in the different segments of the signal, irrespective of the remaining segments, though the region of detection is again widened around the detected frequency, as earlier.
4. For the non-stationary audio signal, the FFT reveals the presence of frequencies in the lower range of the spectrum, typically below 1 kHz.
5. For the non-stationary audio signal, even a 220500-point IFFT is not able to reconstruct the original speech signal accurately, i.e. it cannot model the sharp discontinuities and fast fluctuations in the audio signal, highlighting another shortcoming of the FFT.
Result: -
Fast Fourier Transform (FFT) of the recorded audio signal was computed and different
functions operations performed thereon, as evidenced by the observations and
conclusions drawn above.
4. Experiment 4
Aim: - To compute the Short Time Fourier Transform (STFT) of an audio signal
Software Used: - MATLAB R2020a
Theory: -
Short Time Fourier Transform involves the segmentation of the signal into narrow
intervals and computation of Fourier Transform in each such interval. This is the proposed
methodology for obtaining the time-frequency information by windowing the incoming
signal.
\mathrm{STFT}\{f\}(t',u) = \int f(t)\, W(t - t')\, e^{-j u t}\, dt
where f(t) is the incoming signal, W(t - t') is the centered window function, and the frequency parameter is denoted by the variable u.
A wide window provides good frequency resolution but poor time resolution; the converse holds for a narrow window.
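A brief sketch of this window-length trade-off (the chirp test signal and window lengths are assumed, illustrative choices, not part of the lab recording):
fs = 1000;
t = 0:1/fs:2;
x = chirp(t,50,2,250); % frequency sweeps from 50 Hz to 250 Hz
figure;
stft(x,fs,'Window',hamming(256,'periodic'),'OverlapLength',128,'FFTLength',256); % wide window: fine frequency, coarse time
title('Wide window (256 samples)');
figure;
stft(x,fs,'Window',hamming(32,'periodic'),'OverlapLength',16,'FFTLength',128); % narrow window: coarse frequency, fine time
title('Narrow window (32 samples)');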
The below functions are used in the experiment for the achievement of the objective
S.No. Function Variants Description
1 [y,fs] = audioread(filename,samples) Reads the audio data in the
given range from the
specified file and returns data
in y sampled at fs
2 y = stft(X, FFT Length, Window, Overlap Returns the Short Time
Length, Frequency Centering) Fourier transform of the input
signal, calculated in
accordance with the specified
parameters .
Program: -
clc;
clear all;
close all;
segment_length = 10000;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(1:segment_length);
N = 5000; % Number of samples
nf = 50; % Normalization factor
t = (0:N - 1) / nf; % Normalization necessary
f1 = 75;
f2 = 50;
f3 = 25;
f4 = 10;
fft_length = 2048;
window_length = 128;
window_type = 'hamming';
overlap_length = 64;
freq_centering = false;
fun = str2func(window_type);
x = cat(2,sin(2 * f1 * t(1:length(t) / 4)),sin(2 * f2 * t(length(t) / 4 + 1:length(t) / 2)),...
sin(2 * f3 * t(length(t) / 2 + 1:3 * length(t) / 4)),sin(2 * f4 * t(3 * length(t) / 4 + 1:end))); % Synthetic non-stationary signal
[s,f,t] = stft(x,'FFTLength',fft_length,'Window',fun(window_length),'OverlapLength',overlap_length,'Centered',freq_centering);
[s_s,f_s,t_s] = stft(audioData,'FFTLength',fft_length,'Window',fun(window_length),'OverlapLength',overlap_length,'Centered',freq_centering);
figure(1);
surf(t / nf,f * nf / 2,abs(s));
xlabel('Time (ms)');
ylabel('Frequency (Hz)');
zlabel('Amplitude');
title('Short Time Fourier Transform of Non-Stationary Signal');
colormap jet
figure(2);
surf(abs(s_s));
xlabel('Time (ms)');
ylabel('Frequency (Hz)');
zlabel('Amplitude');
title('Short Time Fourier Transform of Speech Signal');
colormap jet
Output: -
Figure 1: STFT of a 5-second synthetic non-stationary signal with harmonics of 10, 25, 50 and 75 Hz.
Figure 2: STFT of the recorded audio signal.
Conclusions: -
1. The Short Time Fourier Transform provides a good time-frequency resolution for multitone and non-stationary signals containing frequencies on the lower harmonic scale.
2. The STFT is able to clearly distinguish the frequencies in the non-stationary signal at their respective points of occurrence in time.
3. Setting frequency centering to false produces a two-sided STFT whose normalized frequency axis spans [0, 2π), instead of being centered on [-π, π).
Result: -
Short Time Fourier Transform (STFT) of the synthetic non-stationary signal and recorded
audio signal was computed and different functions operations performed thereon, as
evidenced by the observations and conclusions drawn above.
5. Experiment 5
Aim: - To compute the scalogram of the audio signal using Discrete Wavelet Transform
and conduct its multi-resolution analysis
Software Used: - MATLAB R2020a
Theory: -
Wavelet Transform
Short Time Fourier Transform involves the segmentation of the signal into narrow
intervals and computation of Fourier Transform in each such interval. This is the proposed
methodology for obtaining the time-frequency information by windowing the incoming
signal.
\mathrm{STFT}\{f\}(t',u) = \int f(t)\, W(t - t')\, e^{-j u t}\, dt
where f(t) is the incoming signal, W(t - t') is the centered window function, and the frequency parameter is denoted by the variable u.
A wide window provides good frequency resolution but poor time resolution; the converse holds for a narrow window.
The wavelet transform is given by
\gamma(s,\tau) = \int f(t)\, \psi^{*}_{s,\tau}(t)\, dt
where s is the scale, \tau is the translation and \psi_{s,\tau}(t) is the scaled and shifted mother wavelet.
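A minimal numerical sketch of this integral at a single scale (the wavelet shape, scale and test tone are assumed, illustrative choices; the integral is evaluated by discrete correlation):
fs = 1000; % sampling rate (Hz)
t = 0:1/fs:2; % 2-second time axis
f0 = 25; % tone frequency (Hz)
x = sin(2 * pi * f0 * t); % test signal f(t)
s = 0.032; % scale chosen so the wavelet centre frequency is near 25 Hz
tau = -0.2:1/fs:0.2; % support of the scaled wavelet
psi = (1 / sqrt(s)) * cos(5 * tau / s) .* exp(-(tau / s).^2 / 2); % real Morlet-like wavelet at scale s
gamma_s = conv(x,fliplr(psi),'same') / fs; % gamma(s,tau) evaluated at every shift tau
plot(t,abs(gamma_s));
xlabel('\tau (s)');
ylabel('|\gamma(s,\tau)|');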
Program: -
Scalogram
clc;
clear all;
close all;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);
fft_length = 2048;
window_length = 128;
window_type = 'hamming';
overlap_length = 64;
freq_centering = false;
fun = str2func(window_type);
[s_s,f_s,t_s] = stft(audioData,'FFTLength',fft_length,'Window',fun(window_length),'OverlapLength',overlap_length,'Centered',freq_centering);
figure(1);
plot(t / fs,audioData);
xlabel('Time (s)');
ylabel('Amplitude (V)');
title('Recorded Speech Signal');
wname = 'morse';
wt = cwt(audioData,wname);
figure(2);
subplot(2,1,1);
imagesc(t / fs,1:size(wt,1),abs(wt));
colorbar
xlabel('Time (s)');
ylabel('Frequency (Hz)');
title('Scalogram of Speech Signal');
dim = 1:size(wt,1) / 2;
s_s_norm = s_s(dim,:);
subplot(2,1,2);
imagesc(t / fs,1:size(s_s_norm,1),abs(s_s_norm));
colorbar
xlabel('Time (s)');
ylabel('Frequency (Hz)');
title('STFT of Speech Signal');
Multi–Resolution Analysis
clc;
clear all;
close all;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);
level = 10;
wname = 'sym4';
mdwt_sp = modwt(audioData',level,wname);
mra = modwtmra(mdwt_sp,wname);
err = abs((audioData' - sum(mra)));
f1 = figure;
f2 = figure;
figure(f1);
subplot(level / 2 + 1,1,1);
plot(t / fs,audioData);
title('Original Recorded Audio');
for i = 1:level + 1
if i < level / 2 + 1
figure(f1);
subplot(level / 2 + 1,1,i + 1)
elseif i == level / 2 + 1
figure(f1);
xlabel('Time (s)');
figure(f2);
subplot(level / 2 + 1,1,i - level / 2)
else
figure(f2);
subplot(level / 2 + 1,1,i - level / 2)
end
x = ['D',num2str(i)];
plot(t / fs,mra(i,:));
title(x);
end
xlabel('Time (s)');
set(gcf,'Position', [0, 0, 2000, 2000]);
figure(3);
subplot(3,1,1);
plot(t / fs,audioData');
title('Recorded Audio Signal');
xlabel('Time (s)');
ylabel('Amplitude');
subplot(3,1,2);
plot(t / fs,sum(mra));
title('Reconstructed Audio Signal');
xlabel('Time (s)');
ylabel('Amplitude');
subplot(3,1,3);
plot(t / fs,err);
title('Reconstruction Error');
xlabel('Time (s)');
ylabel('Amplitude');
Output: -
Figure 1: Time-series representation of the recorded audio signal.
Figure 2: Scalogram of the recorded audio signal.
Figure 3: STFT of the recorded audio signal.
Figure 4: Original signal and resolutions 1-5.
Figure 5: Resolutions 6-11.
Figure 6: Comparison between the original and reconstructed signals, and the reconstruction error.
Conclusions: -
1. The Short Time Fourier Transform provides an inferior time-frequency resolution in comparison to the wavelet transform: the spectral amplitudes appear fairly constant with frequency, span the entire frequency range, and are distributed almost uniformly across the frequency scale.
2. Further, in the region where the read audio file contains no speech, the STFT still detects some spectral amplitude, which is absent in the scalogram computed using the wavelet transform. This demonstrates the higher accuracy of the time-frequency representation obtained with the wavelet transform.
3. A seven-level resolution yields a signal that is similar in appearance to the original audio signal and retains its characteristic non-stationarity. Further levels of decomposition disintegrate the signal into near-fundamental frequency components that are almost stationary.
4. The reconstruction error obtained with the 10-level multi-resolution analysis is of the order of 10^-13.
Result: -
The scalogram of the audio signal was computed using the Discrete Wavelet Transform
and its multi-resolution analysis was conducted, as evidenced by the observations and
conclusions drawn above.
6. Experiment 6
Aim: - To
i. Obtain MFCC data for an audio signal
ii. Perform frequency Domain Voice Activity Detection and Cepstral feature
extraction
Software Used: - MATLAB R2020a
Theory: -
The extraction of Mel Frequency Cepstral Coefficients is driven by human speech
perception and speech production. The coefficients so extracted represent the
information originating from the vocal tract filter, separated from the information content
of the glottal source. Further, the different coefficients tend to be mutually uncorrelated. The procedure for MFCC computation is as follows:
a. Calculation of the frequency spectrum and application of Mel binning
b. Application of the inverse DFT to the logarithm of the mel-warped spectrum to produce the cepstrum
c. Assembly of the 39-dimensional MFCC feature vector from the first 12 significant cepstral coefficients, the energy (sum of the power of the frame samples), 13 delta and 13 double-delta coefficients
The extraction of the features may also be performed in the frequency domain itself, by
efficiently transforming the audio signal into frequency domain.
The MFCC extraction procedure outlined above has been depicted in the figure below.
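A minimal sketch of the pipeline above using the Audio Toolbox mfcc function (the recording file name is assumed and default framing parameters are used):
[audioIn,fs] = audioread('audio_rec.wav');
[coeffs,delta,deltaDelta] = mfcc(audioIn(:,1),fs); % per-frame cepstral, delta and double-delta features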
The below functions are used in the experiment for the achievement of the objective
1. [y,fs] = audioread(filename,samples): Reads the audio data in the given sample range from the specified file and returns the data in y, sampled at fs.
2. cepFeatures = cepstralFeatureExtractor('SampleRate',fs): Creates a cepstralFeatureExtractor object for extracting cepstral features from an audio segment sampled at fs samples per second.
3. afr = dsp.AudioFileReader: Returns an audio file reader System object that reads audio from an audio file.
4. asyncBuff = dsp.AsyncBuffer: Returns an async buffer System object, used to write samples to and read samples from a first-in, first-out (FIFO) buffer.
5. ss = dsp.SignalSink: Returns a signal sink that logs 2-D input data in the object.
6. VAD = voiceActivityDetector('InputDomain','Frequency'): Creates a voice activity detector System object, VAD, that accepts frequency-domain input.
Program: -
MFCC Extraction
clc;
clear all;
close all;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = audioData(:,1);
t = 1:length(audioData);
segLen = 1024; % Assumed analysis segment length in samples
cepFeatures = cepstralFeatureExtractor('SampleRate',fs);
audioSegment = audioData(1:segLen); % First audio segment
[coeffs,delta,deltaDelta] = cepFeatures(audioSegment);
[filterbank,freq] = getFilters(cepFeatures);
[coeffsTwo,deltaTwo,deltaDeltaTwo] = cepFeatures(audioData(segLen + 1:2 * segLen)); % Second segment
[coeffsThree,deltaThree,deltaDeltaThree] = cepFeatures(audioData(2 * segLen + 1:3 * segLen)); % Third segment
figure(1);
subplot(3,1,1);
plot(deltaTwo);
title('DeltaTwo');
subplot(3,1,2);
plot(deltaThree);
title('DeltaThree');
subplot(3,1,3);
plot(deltaDeltaThree);
title('DeltaDeltaThree');
% Frequency-domain voice activity detection and cepstral feature extraction
clc;
clear all;
close all;
fileReader = dsp.AudioFileReader('audio_rec.wav'); % Assumed source file
fs = fileReader.SampleRate;
windowLength = ceil(0.03 * fs); % 30 ms analysis window (assumed)
hopLength = ceil(0.01 * fs); % 10 ms hop (assumed)
overlapLength = windowLength - hopLength;
fileReader.SamplesPerFrame = hopLength;
win = hann(windowLength,'periodic');
VAD = voiceActivityDetector('InputDomain','Frequency');
cepFeatures = cepstralFeatureExtractor('SampleRate',fs);
buffer = dsp.AsyncBuffer;
sink = dsp.SignalSink;
threshold = 0.5;
nanVector = nan(1,13);
while ~isDone(fileReader)
    audioIn = fileReader();
    write(buffer,audioIn); % Buffering each hop
    while buffer.NumUnreadSamples >= hopLength
        overlappedAudio = read(buffer,windowLength,overlapLength);
        X = fft(overlappedAudio .* win,2048); % Frequency-domain frame
        probabilityOfSpeech = VAD(X);
        if probabilityOfSpeech > threshold
            xFeatures = cepFeatures(overlappedAudio); % Extract cepstral features for speech frames
            sink(xFeatures');
        else
            sink(nanVector); % No speech detected in this frame
        end
    end
end
timeVector = linspace(0,size(sink.Buffer,1) * hopLength / fs,size(sink.Buffer,1));
figure(1);
plot(timeVector,sink.Buffer)
title('Cepstral Coefficients');
xlabel('Time (s)')
ylabel('MFCC Amplitude')
legend('Log-Energy','c1','c2','c3','c4','c5','c6','c7','c8','c9','c10','c11','c12')
Output: -
Figure 1: MFCC coefficients, delta and double-delta features.
Figure 2: Delta and double-delta features of the cepstrum.
Figure 3: Cepstral coefficients extracted from the frequency-domain audio signal.
Conclusions: -
1. The delta and double-delta features of the initial audio segment are always zero.
2. The double-delta feature is the difference of the delta features of the previous two audio segments (see the sketch below).
3. In the cepstral coefficients extracted in the frequency domain, certain coefficients stand out from the other coefficients in their strength of response.
4. Therefore, every feature can be uniquely isolated through appropriate filtering schemes.
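A small illustrative sketch of conclusions 1 and 2, treating the delta as a segment-to-segment difference (the file name is assumed and mfcc defaults are used; this mirrors, rather than reproduces, the extractor's internal computation):
[audioIn,fs] = audioread('audio_rec.wav');
coeffsAll = mfcc(audioIn(:,1),fs); % frames-by-coefficients matrix
deltaAll = [zeros(1,size(coeffsAll,2)); diff(coeffsAll,1,1)]; % first frame has no predecessor, hence zeros
deltaDeltaAll = [zeros(1,size(deltaAll,2)); diff(deltaAll,1,1)]; % difference of successive delta rows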
Result: -
The MFCC data for an audio signal was obtained and frequency domain Voice Activity
Detection and subsequent Cepstral feature extraction was undertaken, as evidenced by
the observation and conclusions drawn above.
7. Experiment 7
Aim: - To compute the spectral descriptors of an audio signal
Software Used: - MATLAB R2020a
Theory: -
a. Spectral Centroid - It represents the "center of gravity" of the spectrum and is used as an indication of brightness (a numerical sketch of this formula is given after the list of descriptors). It is given by
\mu_1 = \frac{\sum_{k=b_1}^{b_2} f_k s_k}{\sum_{k=b_1}^{b_2} s_k}
where 𝑓𝑘 is the frequency corresponding to bin 𝑘, 𝑠𝑘 is the spectral value at bin 𝑘 and
𝑏1 and 𝑏2 are the band edges, in bins.
b. Spectral Spread - It represents the "instantaneous bandwidth" of the spectrum and is
used as an indication of the dominance of a tone. It is given by
\mu_2 = \sqrt{\frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^2 s_k}{\sum_{k=b_1}^{b_2} s_k}}
where 𝑓𝑘 is the frequency corresponding to bin 𝑘, 𝑠𝑘 is the spectral value at bin 𝑘 and
𝑏1 and 𝑏2 are the band edges, in bins and 𝜇1 is the spectral centroid.
c. Spectral Skewness - The spectral skewness assesses the symmetry around the
centroid. In phonetics, it is often referred to as spectral tilt and is used with other
spectral moments to distinguish the place of articulation. For harmonic signals, it
indicates the relative strength of higher and lower harmonics. It is given by
\mu_3 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^3 s_k}{\mu_2^3 \sum_{k=b_1}^{b_2} s_k}
where 𝑓𝑘 is the frequency corresponding to bin 𝑘, 𝑠𝑘 is the spectral value at bin 𝑘 and
𝑏1 and 𝑏2 are the band edges, in bins, 𝜇1 is the spectral centroid and 𝜇2 is the spectral
spread.
d. Spectral Kurtosis - The spectral kurtosis measures the flatness, or non-Gaussianity,
of the spectrum around its centroid. Conversely, it is used to measure the peakiness
of a spectrum. It is computed as
\mu_4 = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_1)^4 s_k}{\mu_2^4 \sum_{k=b_1}^{b_2} s_k}
e. Spectral Entropy - It measures the peakiness of the spectrum and has been used for voiced/unvoiced detection. It is given by
\mathrm{entropy} = \frac{-\sum_{k=b_1}^{b_2} s_k \log(s_k)}{\log(b_2 - b_1)}
where 𝑓𝑘 is the frequency corresponding to bin 𝑘, 𝑠𝑘 is the spectral value at bin 𝑘 and
𝑏1 and 𝑏2 are the band edges, in bins.
f. Spectral Flatness - It measures the ratio of the geometric mean of the spectrum to its arithmetic mean and indicates the noisiness of the signal. It is given by
\mathrm{flatness} = \frac{\left(\prod_{k=b_1}^{b_2} s_k\right)^{\frac{1}{b_2-b_1}}}{\frac{1}{b_2-b_1}\sum_{k=b_1}^{b_2} s_k}
where 𝑠𝑘 is the spectral value at bin 𝑘 and 𝑏1 and 𝑏2 are the band edges, in bins.
g. Spectral Slope - Spectral slope is directly related to the resonant characteristics of the vocal folds and has also been applied to speaker identification. It is a socially important aspect of timbre, and its discrimination has been shown to occur in early childhood development. Spectral slope is most pronounced when the energy in the lower formants is much greater than the energy in the higher formants. It is given by
\mathrm{slope} = \frac{\sum_{k=b_1}^{b_2} (f_k - \mu_f)(s_k - \mu_s)}{\sum_{k=b_1}^{b_2} (f_k - \mu_f)^2}
where f_k is the frequency corresponding to bin k, s_k is the spectral value at bin k, b_1 and b_2 are the band edges in bins, \mu_f is the mean frequency and \mu_s is the mean spectral value.
h. Spectral Decrease - It represents the amount of decrease of the spectrum, while emphasizing the slopes of the lower frequencies. It is given by
\mathrm{decrease} = \frac{\sum_{k=b_1+1}^{b_2} \frac{s_k - s_{b_1}}{k-1}}{\sum_{k=b_1+1}^{b_2} s_k}
where 𝑠𝑘 is the spectral value at bin 𝑘 and 𝑏1 and 𝑏2 are the band edges, in bins.
i. Spectral Roll-off Point – It measures the bandwidth of the audio signal by finding the
energy concentration in the frequency bins. It is primarily used in detection and
classification activities on different types of acoustic signals. It is expressed as
\mathrm{RolloffPoint} = i \ \text{such that} \ \sum_{k=b_1}^{i} |s_k|^2 = \kappa \sum_{k=b_1}^{b_2} |s_k|^2
where 𝑠𝑘 is the spectral value at bin 𝑘 and 𝑏1 and 𝑏2 are the band edges, in bins and
𝜅 is the specified energy threshold, usually 95% or 85%.
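A minimal numerical sketch of the spectral centroid formula above, applied to a single spectrum (the tone frequencies, sampling rate and one-frame treatment are assumed, illustrative choices rather than the framewise computation performed by spectralCentroid):
fs = 8000;
t = 0:1/fs:1 - 1/fs;
x = sin(2 * pi * 500 * t) + 0.5 * sin(2 * pi * 1500 * t);
S = abs(fft(x)); % magnitude spectrum s_k
S = S(1:numel(x) / 2 + 1); % one-sided spectrum
f = (0:numel(S) - 1) * fs / numel(x); % bin frequencies f_k
mu1 = sum(f .* S) / sum(S); % spectral centroid per the formula for mu_1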
The below functions are used in the experiment for the achievement of the objective
1. [y,fs] = audioread(filename,samples): Reads the audio data in the given sample range from the specified file and returns the data in y, sampled at fs.
2. centroid = spectralCentroid(x,f): Returns the spectral centroid of the signal x over time.
3. spread = spectralSpread(x,f): Returns the spectral spread of the signal x over time.
4. skewness = spectralSkewness(x,f): Returns the spectral skewness of the signal x over time.
5. kurtosis = spectralKurtosis(x,f): Returns the spectral kurtosis of the signal x over time.
6. entropy = spectralEntropy(x,f): Returns the spectral entropy of the signal x over time.
7. flatness = spectralFlatness(x,f): Returns the spectral flatness of the signal x over time.
8. slope = spectralSlope(x,f): Returns the spectral slope of the signal x over time.
9. decrease = spectralDecrease(x,f): Returns the spectral decrease of the signal x over time.
10. rolloffPoint = spectralRolloffPoint(x,f): Returns the spectral roll-off point of the signal x over time.
In each case, the interpretation of x depends on the shape of f.
Program: -
clc;
clear all;
close all;
file_name = 'audio_rec.wav';
[audioData,fs] = audioread(file_name);
audioData = sum(audioData,2) / size(audioData,2); % Convert to mono by averaging the channels
% Spectral descriptors, computed framewise with the default parameters of each function
centroid = spectralCentroid(audioData,fs);
spread = spectralSpread(audioData,fs);
skewness = spectralSkewness(audioData,fs);
kurtosis = spectralKurtosis(audioData,fs);
entropy = spectralEntropy(audioData,fs);
flatness = spectralFlatness(audioData,fs);
specslope = spectralSlope(audioData,fs);
decrease = spectralDecrease(audioData,fs);
rolloffPoint = spectralRolloffPoint(audioData,fs);
figure(1);
subplot(2,1,1)
t_ca = linspace(0,size(audioData,1)/fs,size(audioData,1));
plot(t_ca,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');
subplot(2,1,2)
t_cc = linspace(0,size(audioData,1)/fs,size(centroid,1));
plot(t_cc,centroid)
xlabel('Time (s)')
ylabel('Centroid (Hz)')
title('Centroid of the Recorded Audio Signal');
figure(2);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),2048,fs,'yaxis')
title('Spectrogram of Audio Signal');
subplot(2,1,2)
t_ss = linspace(0,size(audioData,1)/fs,size(spread,1));
plot(t_ss,spread)
xlabel('Time (s)')
ylabel('Spread')
title('Spectral Spread of Audio Signal');
figure(3);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power')
view([-58 33])
title('Recorded Audio Signal');
subplot(2,1,2)
t_s = linspace(0,size(audioData,1) / fs,size(skewness,1));
plot(t_s,skewness)
xlabel('Time (s)')
ylabel('Skewness')
title('Skewness of Audio Signal');
t_k = linspace(0,size(audioData,1)/fs,size(audioData,1));
figure(4);
subplot(2,1,1)
plot(t_k,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');
subplot(2,1,2)
t_k = linspace(0,size(audioData,1) / fs,size(kurtosis,1));
plot(t_k,kurtosis)
xlabel('Time (s)')
ylabel('Kurtosis')
title('Kurtosis of Audio Signal');
t_e = linspace(0,size(audioData,1)/fs,size(audioData,1));
figure(5);
subplot(2,1,1)
plot(t_e,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');
t_e = linspace(0,size(audioData,1)/fs,size(entropy,1));
subplot(2,1,2)
plot(t_e,entropy)
xlabel('Time (s)')
ylabel('Entropy')
title('Entropy of Audio Signal');
figure(6);
subplot(2,1,1)
t_f = linspace(0,size(audioData,1)/fs,size(audioData,1));
plot(t_f,audioData)
ylabel('Amplitude')
title('Recorded Audio Signal');
subplot(2,1,2)
t_f = linspace(0,size(audioData,1)/fs,size(flatness,1));
plot(t_f,flatness)
ylabel('Flatness')
xlabel('Time (s)')
title('Flatness of Audio Signal');
figure(7);
subplot(2,1,1)
spectrogram(audioData,round(fs*0.05),round(fs*0.04),round(fs*0.05),fs,'yaxis','power');
title('Spectrogram of Audio Signal');
subplot(2,1,2)
t_sl = linspace(0,size(audioData,1) / fs,size(specslope,1));
plot(t_sl,specslope)
title('Spectral Slope')
ylabel('Slope')
xlabel('Time (s)')
figure(8);
t_d = linspace(0,size(audioData,1) / fs,size(decrease,1));
plot(t_d,decrease)
title('Spectral Decrease')
ylabel('Decrease')
xlabel('Time (s)')
figure(9);
t_r = linspace(0,size(audioData,1) / fs,size(rolloffPoint,1));
plot(t_r,rolloffPoint)
title('Spectral Rolloff Point')
ylabel('Rolloff Point (Hz)')
xlabel('Time (s)')
Output: -
Figure 1: Spectral Centroid
Figure 2: Spectral Spread
Figure 3: Skewness
Figure 4: Kurtosis
Figure 5: Entropy
Figure 6: Flatness
Figure 7: Spectral Slope
Figure 8: Spectral Decrease
Figure 9: Spectral Rolloff Point
Conclusions: -
1. The centroid deviates towards the portion of the signal with a higher amplitude scale.
2. The spectral spread increases in the regions where the bandwidth is higher, owing to the tones being spread farther apart.
3. The skewness represents the tilt of the spectrum about the centroid.
4. Kurtosis is lower where the audio signal is nearly uniform.
5. Regions of voiced speech have lower entropy than the unvoiced regions.
6. Higher spectral flatness occurs in the segments with noise/unvoiced regions; in voiced regions, the flatness is low.
7. The spectral slope accurately captures the rate of decay of the spectrum of the audio signal.
8. The spectral decrease models the amount of decrease in the spectrum.
9. The spectral roll-off point is able to distinguish between voiced and unvoiced regions, and locates the frequency bin below which a given percentage of the spectral energy falls, thereby measuring the associated bandwidth.
Result: -
The different spectral descriptors of an audio signal were calculated, as evidenced by the
observation and conclusions drawn above.
8. Experiment 8
Aim: - To perform emotion recognition in a speech signal
Software Used: - MATLAB R2020a
Program: -
clc;
clear all;
close all;
url = "http://emodb.bilderbar.info/download/download.zip";
downloadFolder = tempdir;
datasetFolder = fullfile(downloadFolder,"Emo-DB");
if ~exist(datasetFolder,'dir')
disp('Downloading Emo-DB (40.5 MB)...')
unzip(url,datasetFolder)
end
ads = audioDatastore(fullfile(datasetFolder,"wav"));
filepaths = ads.Files;
emotionCodes = cellfun(@(x)x(end-5),filepaths,'UniformOutput',false);
emotions = replace(emotionCodes,{'W','L','E','A','F','T','N'}, ...
{'Anger','Boredom','Disgust','Anxiety/Fear','Happiness','Sadness','Neutral'});
speakerCodes = cellfun(@(x)x(end-10:end-9),filepaths,'UniformOutput',false);
labelTable = cell2table([speakerCodes,emotions],'VariableNames',{'Speaker','Emotion'});
labelTable.Emotion = categorical(labelTable.Emotion);
labelTable.Speaker = categorical(labelTable.Speaker);
summary(labelTable)
ads.Labels = labelTable;
load('network_Audio_SER.mat','net','afe','normalizers');
fs = afe.SampleRate;
speaker = categorical("08"); % Assumed speaker code from the Emo-DB file names
emotion = categorical("Anger"); % Assumed emotion to audition
adsSubset = subset(ads,ads.Labels.Speaker == speaker & ads.Labels.Emotion == emotion); % Files of the chosen speaker and emotion
audio = read(adsSubset);
sound(audio,fs)
features = (extract(afe,audio))';
features = (features - normalizers.Mean) ./ normalizers.StandardDeviation; % Normalize with the saved statistics
featureSequences = HelperFeatureVector2Sequence(features,20,10); % 20 feature vectors per sequence, 10 overlapping
YPred = double(predict(net,featureSequences));
average = 'median'; % Assumed aggregation ('mean', 'median' or 'mode'); the output shows the median
switch average
case 'mean'
probs = mean(YPred,1);
case 'median'
probs = median(YPred,1);
case 'mode'
probs = mode(YPred,1);
end
pie(probs./sum(probs),string(net.Layers(end).Classes))
numAugmentations = 50;
augmenter = audioDataAugmenter('NumAugmentations',numAugmentations, ...
'TimeStretchProbability',0, ...
'VolumeControlProbability',0, ...
'PitchShiftProbability',0.5, ...
'TimeShiftProbability',1, ...
'TimeShiftRange',[-0.3,0.3], ...
'AddNoiseProbability',1, ...
'SNRRange',[-20,40]);
currentDir = pwd;
writeDirectory = fullfile(currentDir,'augmentedData');
mkdir(writeDirectory)
N = numel(ads.Files)*numAugmentations;
myWaitBar = HelperPoolWaitbar(N,"Augmenting Dataset...");
reset(ads)
numPartitions = 18;
tic
parfor ii = 1:numPartitions
adsPart = partition(ads,numPartitions,ii);
while hasdata(adsPart)
[x,adsInfo] = read(adsPart);
data = augment(augmenter,x,fs);
[~,fn] = fileparts(adsInfo.FileName);
for i = 1:size(data,1)
augmentedAudio = data.Audio{i};
augmentedAudio = augmentedAudio/max(abs(augmentedAudio),[],'all');
augNum = num2str(i);
if numel(augNum)==1
iString = ['0',augNum];
else
iString = augNum;
end
audiowrite(fullfile(writeDirectory,sprintf('%s_aug%s.wav',fn,iString)),augmentedAudio,fs);
increment(myWaitBar)
end
end
end
delete(myWaitBar)
fprintf('Augmentation complete (%0.2f minutes).\n',toc/60)
adsAug = audioDatastore(writeDirectory);
adsAug.Labels = repelem(ads.Labels,augmenter.NumAugmentations,1);
win = hamming(round(0.03*fs),"periodic");
overlapLength = 0;
adsTrain = adsAug;
tallTrain = tall(adsTrain);
featuresTallTrain = cellfun(@(x)extract(afe,x),tallTrain,"UniformOutput",false);
featuresTallTrain = cellfun(@(x)x',featuresTallTrain,"UniformOutput",false);
featuresTrain = gather(featuresTallTrain);
allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);
featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
[sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
labelsTrain = repelem(adsTrain.Labels.Emotion,[sequencePerFileTrain{:}]);
emptyEmotions = ads.Labels.Emotion;
emptyEmotions(:) = [];
dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(size(sequencesTrain{1},1))
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,"OutputMode","last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer
classificationLayer];
miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
"MiniBatchSize",miniBatchSize, ...
"InitialLearnRate",initialLearnRate, ...
"LearnRateDropPeriod",learnRateDropPeriod, ...
"LearnRateSchedule","piecewise", ...
"MaxEpochs",maxEpochs, ...
"Shuffle","every-epoch", ...
"Verbose",false, ...
"Plots","Training-Progress");
net = trainNetwork(sequencesTrain,labelsTrain,layers,options);
saveSERSystem = false; % Assumed flag; set to true to overwrite the saved network_Audio_SER.mat
if saveSERSystem
normalizers.Mean = M;
normalizers.StandardDeviation = S;
save('network_Audio_SER.mat','net','afe','normalizers')
end
speaker = ads.Labels.Speaker;
numFolds = numel(categories(speaker)); % One fold per speaker (leave-one-speaker-out)
summary(speaker)
[labelsTrue,labelsPred] = HelperTrainAndValidateNetwork(ads,adsAug,afe);
for ii = 1:numel(labelsTrue)
foldAcc = mean(labelsTrue{ii}==labelsPred{ii})*100;
fprintf('Fold %1.0f, Accuracy = %0.1f\n',ii,foldAcc);
end
labelsTrueMat = cat(1,labelsTrue{:});
labelsPredMat = cat(1,labelsPred{:});
figure
cm = confusionchart(labelsTrueMat,labelsPredMat);
valAccuracy = mean(labelsTrueMat==labelsPredMat)*100;
cm.Title = sprintf('Confusion Matrix for 10-Fold Cross-Validation\nAverage Accuracy = %0.1f',valAccuracy);
sortClasses(cm,categories(emptyEmotions))
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
function [sequences,sequencePerFile] = HelperFeatureVector2Sequence(features,featureVectorsPerSequence,featureVectorOverlap)
    if featureVectorsPerSequence <= featureVectorOverlap
        error('The number of overlapping feature vectors must be less than the number of feature vectors per sequence.')
    end
if ~iscell(features)
features = {features};
end
hopLength = featureVectorsPerSequence - featureVectorOverlap;
idx1 = 1;
sequences = {};
sequencePerFile = cell(numel(features),1);
for ii = 1:numel(features)
        sequencePerFile{ii} = floor((size(features{ii},2) - featureVectorsPerSequence)/hopLength) + 1;
        idx2 = 1;
        for j = 1:sequencePerFile{ii}
            sequences{idx1,1} = features{ii}(:,idx2:idx2 + featureVectorsPerSequence - 1);
idx1 = idx1 + 1;
idx2 = idx2 + hopLength;
end
end
end
function [trueLabelsCrossFold,predictedLabelsCrossFold] = HelperTrainAndValidateNetwork(varargin)
if nargin == 3
ads = varargin{1};
augads = varargin{2};
extractor = varargin{3};
elseif nargin == 2
ads = varargin{1};
augads = varargin{1};
extractor = varargin{2};
end
speaker = categories(ads.Labels.Speaker);
numFolds = numel(speaker);
emptyEmotions = (ads.Labels.Emotion);
emptyEmotions(:) = [];
trueLabelsCrossFold = {};
predictedLabelsCrossFold = {};
for i = 1:numFolds
idxTrain = augads.Labels.Speaker~=speaker(i);
augadsTrain = subset(augads,idxTrain);
augadsTrain.Labels = augadsTrain.Labels.Emotion;
tallTrain = tall(augadsTrain);
idxValidation = ads.Labels.Speaker==speaker(i);
adsValidation = subset(ads,idxValidation);
adsValidation.Labels = adsValidation.Labels.Emotion;
tallValidation = tall(adsValidation);
        tallTrain = cellfun(@(x)x/max(abs(x),[],'all'),tallTrain,"UniformOutput",false);
        tallFeaturesTrain = cellfun(@(x)extract(extractor,x),tallTrain,"UniformOutput",false);
        tallFeaturesTrain = cellfun(@(x)x',tallFeaturesTrain,"UniformOutput",false);
        [~,featuresTrain] = evalc('gather(tallFeaturesTrain)');
        tallValidation = cellfun(@(x)x/max(abs(x),[],'all'),tallValidation,"UniformOutput",false);
        tallFeaturesValidation = cellfun(@(x)extract(extractor,x),tallValidation,"UniformOutput",false);
        tallFeaturesValidation = cellfun(@(x)x',tallFeaturesValidation,"UniformOutput",false);
[~,featuresValidation] = evalc('gather(tallFeaturesValidation)');
allFeatures = cat(2,featuresTrain{:});
M = mean(allFeatures,2,'omitnan');
S = std(allFeatures,0,2,'omitnan');
featuresTrain = cellfun(@(x)(x-M)./S,featuresTrain,'UniformOutput',false);
for ii = 1:numel(featuresTrain)
idx = find(isnan(featuresTrain{ii}));
if ~isempty(idx)
featuresTrain{ii}(idx) = 0;
end
end
        featuresValidation = cellfun(@(x)(x-M)./S,featuresValidation,'UniformOutput',false);
for ii = 1:numel(featuresValidation)
idx = find(isnan(featuresValidation{ii}));
if ~isempty(idx)
featuresValidation{ii}(idx) = 0;
end
end
featureVectorsPerSequence = 20;
featureVectorOverlap = 10;
        [sequencesTrain,sequencePerFileTrain] = HelperFeatureVector2Sequence(featuresTrain,featureVectorsPerSequence,featureVectorOverlap);
        [sequencesValidation,sequencePerFileValidation] = HelperFeatureVector2Sequence(featuresValidation,featureVectorsPerSequence,featureVectorOverlap);
labelsTrain = [emptyEmotions;augadsTrain.Labels];
labelsTrain = labelsTrain(:);
labelsTrain = repelem(labelsTrain,[sequencePerFileTrain{:}]);
dropoutProb1 = 0.3;
numUnits = 200;
dropoutProb2 = 0.6;
layers = [ ...
sequenceInputLayer(size(sequencesTrain{1},1))
dropoutLayer(dropoutProb1)
bilstmLayer(numUnits,"OutputMode","last")
dropoutLayer(dropoutProb2)
fullyConnectedLayer(numel(categories(emptyEmotions)))
softmaxLayer
classificationLayer];
miniBatchSize = 512;
initialLearnRate = 0.005;
learnRateDropPeriod = 2;
maxEpochs = 3;
options = trainingOptions("adam", ...
"MiniBatchSize",miniBatchSize, ...
"InitialLearnRate",initialLearnRate, ...
"LearnRateDropPeriod",learnRateDropPeriod, ...
"LearnRateSchedule","piecewise", ...
"MaxEpochs",maxEpochs, ...
"Shuffle","every-epoch", ...
"Verbose",false);
net = trainNetwork(sequencesTrain,labelsTrain,layers,options);
predictedLabelsPerSequence = classify(net,sequencesValidation);
trueLabels = categorical(adsValidation.Labels);
predictedLabels = trueLabels;
idx1 = 1;
for ii = 1:numel(trueLabels)
            predictedLabels(ii,:) = mode(predictedLabelsPerSequence(idx1:idx1 + sequencePerFileValidation{ii} - 1,:),1);
idx1 = idx1 + sequencePerFileValidation{ii};
end
trueLabelsCrossFold{i} = trueLabels;
predictedLabelsCrossFold{i} = predictedLabels;
end
end
Output: -
Figure 1: Pie chart of the median probability of the different emotions for the selected subject.
Figure 2: Training over 1212 iterations using 10-fold cross-validation.
Figure 3: Confusion matrix for 10-fold cross-validation.
Conclusions: -
1. The features used herein were selected using sequential feature selection.
2. Training and validation are done using leave-one-speaker-out (LOSO) k-fold cross-validation (see the sketch below).
3. A model trained on insufficient data may overfit. This is alleviated by signal augmentation, undertaken via pitch shifting, time-scale modification, time shifting, noise addition and volume control.
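A minimal sketch of the leave-one-speaker-out split referred to in conclusion 2 (illustrative; it mirrors the fold construction inside HelperTrainAndValidateNetwork and assumes the ads datastore defined in the program above):
speakers = categories(ads.Labels.Speaker);
for k = 1:numel(speakers)
    idxTrain = ads.Labels.Speaker ~= speakers(k); % all other speakers form the training fold
    idxVal = ads.Labels.Speaker == speakers(k); % the held-out speaker forms the validation fold
    adsTrainFold = subset(ads,idxTrain);
    adsValFold = subset(ads,idxVal);
    % ... extract features, train and validate on this fold
end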
Result: -
Emotion recognition in speech signal was performed, as evidenced by the observation
and conclusions drawn above.