
Affective Computing

Prof. Jainendra Shukla


Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Lecture - 15
Tutorial: Emotion Recognition by Speech Signal

Hello everyone. My name is Gulshan Sharma and I am the Teaching Assistant for this
NPTEL Affective Computing course. First and foremost, I would like to welcome everyone
to this very first tutorial of the course. In this tutorial, we will attempt to learn emotion
recognition through speech.

(Refer Slide Time: 00:40)

With recent advances in the field of machine learning, interest in emotion recognition from
speech signals has increased dramatically. Various theoretical and experimental studies have
been conducted to identify a person's emotional state by examining their speech signals. The
speech emotion recognition pipeline includes preparation of an appropriate dataset, selection
of promising features, and design of an appropriate classification method.

(Refer Slide Time: 01:13)

So, in this tutorial, we will be utilizing a publicly available dataset known as RAVDESS. The
RAVDESS dataset consists of 7356 files. The database includes speech and song recordings
from 24 actors, 12 male and 12 female. Emotion classes include calm, happiness, sadness,
anger, fear, surprise and disgust. The dataset is available in 3 formats: audio-only, video-only,
and audio-video. For our task, we will use only the speech part, that is, the audio-only files.

(Refer Slide Time: 01:57)

Now, moving towards the file name convention in this dataset: each file name consists of 7
identifiers. The first identifier tells us about the modality, whether it is a full audio-video
file, a video-only file or an audio-only file. The second identifier tells us about the vocal
channel, whether it is a speech file or a song file. The third and most important identifier is
the emotion identifier, which tells us the class of the emotion: neutral, calm, happy, sad,
angry, fearful, disgust or surprised.

The fourth identifier is the emotional intensity, whether the emotion is of normal or strong
intensity. The fifth identifier tells us which statement was spoken, and the sixth tells us the
repetition of that statement. The seventh identifier tells us the actor: odd-numbered actors are
male and even-numbered actors are female.

(Refer Slide Time: 02:55)

So, before starting the coding part, let me first give you a complete overview of this tutorial.
We will start by downloading the dataset. After downloading the dataset, we will import it
into Google Drive. The reason for importing the dataset into Google Drive is that we will be
using Google Colab for our experimentation.

Our experimentation will start with reading audio files in Python. Then, we will extract the
fundamental frequency, zero-crossing rate and Mel-frequency cepstral coefficients as features
from the audio files. After that, we will employ some classification algorithms such as
Gaussian Naive Bayes, Linear Discriminant Analysis and Support Vector Machine.

In the end, we will also try to build a 1-dimensional convolutional neural network over the
raw audio for emotion classification.

(Refer Slide Time: 04:06)

So, starting with our very first exercise, which is the dataset download: we can download this
dataset by simply searching for RAVDESS on Google.

(Refer Slide Time: 04:10)

After searching for the RAVDESS keyword on Google, you will find a Zenodo link over here.
Zenodo is a general-purpose open-access repository where we can store data up to 50 GB.

(Refer Slide Time: 04:29)

So, we can simply click on this link, and find the dataset.

(Refer Slide Time: 04:32)

So, this dataset is released under a Creative Commons Attribution license, so one can openly
use it in publications.

(Refer Slide Time: 04:42)

To download the exact part we need, which is the audio speech actors dataset, we can simply
click on this link.

(Refer Slide Time: 04:47)

It will take some time to download, but in my case I have already downloaded this dataset.
And I can show you.

(Refer Slide Time: 05:05)

After unzipping the downloaded file, this dataset will look something like this. So, there will
be 24 folders each belonging to one actor.

(Refer Slide Time: 05:19)

And after going through one folder, we will have a couple of files over here. Or maybe I can
just play a couple of files just for your reference.

Dogs are sitting by the door. Dogs are sitting by the door. Dogs are sitting by the door. Dogs
are sitting by the door. Dogs are sitting by the door.

So, as you can see, there are multiple emotions expressed while saying this line, 'dogs are
sitting by the door'. And if I move to some other folder, let us say the actor 2 folder, and play
a couple of files:

Kids are talking by the door. Kids are talking by the door. Kids are talking by the door. Kids
are talking by the door.

So, as we can see, there are variations in the speaking style representing different types of
emotions. Now that we have downloaded this dataset, our next task is to upload it to Google
Drive so that we can easily access it through Google Colab. I believe most of us can easily
upload a folder to Google Drive, so I will be skipping that part.

However, some participants may have a low-bandwidth internet connection; these
participants can take any one of the folders and upload it to Google Drive. So, let us suppose
you are taking folder number 1.

(Refer Slide Time: 07:10)

Folder number 1, I believe, is about 25.9 MB, so it will not be a very big file to upload to
Google Drive.

(Refer Slide Time: 07:14)

So, now I will shift to Google Colab and we will start writing a program for emotion
recognition.

(Refer Slide Time: 07:33)

So, before starting our programming exercise, I assume that everyone has some experience
with the Python programming language and is aware of the Google Colab interface. We will
start this exercise by importing a couple of libraries and helper functions. To save some time,
I have already copied the required import code. So, everyone who is programming along with
me can pause this video and write this code in their own environment.

After importing the required libraries and helper functions, we will start reading the audio
files using Python. But before that, we need to mount Google Drive in our Colab interface.
To do so, we will first click on the files icon over here and then select the mount drive
option.

(Refer Slide Time: 08:44)

After pressing this button, you will find a dialog box over here asking for permission to
access Google Drive. We will simply click on connect to Google Drive. After mounting
Google Drive in the Colab environment, we will now import the data. I am also assuming
that some participants do not have a powerful enough machine or a high-speed internet
connection.

So, to simplify our job, I will be using data from a single subfolder. Since we are importing
audio data, to read it in our Python environment I will be using the librosa dot load function.

(Refer Slide Time: 09:27)

I will create two variables called data and sample rate, then I will simply write librosa dot
load and pass the file name inside the brackets. As you can see, this function has run
successfully, and now I will show you the shape of the data. It consists of 72838 sample
values and our sample rate is 22,050 Hz.

For further simplification, I am planning to use just the first 3 seconds of the audio data. To
calculate how many samples make up the first 3 seconds, I can simply multiply our sample
rate fs by 3, which is equivalent to 66150 samples.
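A sketch of this step is shown below; the file path is a placeholder for whichever RAVDESS file you load first, not necessarily the one used in the video.

```python
# Load one audio file; librosa resamples to 22,050 Hz by default.
# The path below is a placeholder for a file inside your mounted Drive.
data, fs = librosa.load('/content/drive/MyDrive/Actor_01/03-01-01-01-01-01-01.wav')
print(data.shape, fs)   # e.g. (72838,) 22050

time = 3 * fs           # number of samples in 3 seconds = 66150
data = data[:time]      # keep only the first 3 seconds
```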

Now, we have imported just one file. We need to import all the data inside this folder. To do
so, I need to write a piece of code that sequentially reads all the files and saves them into a
Python list. Let us start with our code. I will name my variable data_all, and I will also be
extracting all the labels. Since we have already seen that each file name contains its
respective label, we need to extract the relevant label as well.

I will start with a loop where I read all the file names, sorted, from os dot listdir, and to
listdir I will pass the path of the data folder. So, maybe I can just create another variable
called data_path; this will be a string holding the exact path of this folder. I will simply copy
the path from this place and paste it over here, ok.

Now, I can simply use this variable wherever I need the data path. First, I will try to extract
all the labels. For that, I will be appending the labels to the label_all list: I will read each file
name and extract the substring corresponding to the emotion identifier. I also need to convert
these classes to integers so that they can be treated as labels.

Now, I can read my data and my sample rate with librosa dot load. This line of code will take
each file name, join it with our data path, and then the librosa function will read the
corresponding file. I need to store all those files in our data_all list. To do so, I will write
data_all dot append, and I will also keep only the first 3 seconds, so I will put a colon and
type the number of samples over here.

So, let me just run this code. It might take a couple of seconds to finish, ok. The code has
been executed now. I will simply convert these lists into numpy arrays. To do so, I will write
data_all equal to np dot array, and similarly for label_all.. sorry, there is a mistake over here;
I need to write an underscore, not a hyphen.

After converting these lists into numpy arrays, I can check their exact shapes. I will simply
write data_all dot shape, ok. So, now we have read 60 files, each consisting of 66150
samples, which corresponds to the initial 3 seconds of data. I can also see the label shape,
which is equal to 60. So, I guess we are good to go. Maybe I can also show you the exact
labels.

(Refer Slide Time: 16:01)

So, you can simply print them, ok. These are the labels corresponding to our 60 files. Maybe
I will do a simple pre-processing step over here, as I want my labels to start from 0. So, I will
simply write a line of code, ok. Let me print these labels again, perfect. Now, I will also show
you a utility in IPython. From IPython dot display, we have imported Audio. This function
can play an audio file directly in our Python environment. So, let us try to play an audio file,
ok.

In my case, suppose I use the file at index 0. Our rate will be equal to fs, which was 22,050
(Refer Time: 17:24), so I will simply type fs over here.
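The label shift and the playback utility could look like the sketch below.

```python
# RAVDESS emotion codes run from 1 to 8; shift them so the labels start at 0.
label_all = label_all - 1
print(label_all)

# Play the file at index 0 directly inside the notebook.
Audio(data_all[0], rate=fs)
```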

‘Kids are talking by the door’.

So, as you can see, using this utility I can play the exact audio file in my Colab. Maybe I can
play one more file over here.

‘Kids are talking by the door’.

Sounds good. So, after importing our data, I will now move towards the feature extraction
phase. In the feature extraction phase, we will be using the fundamental frequency,
zero-crossing rate and Mel-frequency cepstral coefficients as our basic features.

So, let me show you how to extract the fundamental frequency from these audio files. To
extract the fundamental frequency, I will simply make a variable f0, and I will be using a
library function called librosa dot yin. In this function, I just need to pass a data instance
along with a range of frequency values, that is, the minimum and maximum frequency.

So, let me just extract the fundamental frequency for a single instance, and then I will show
you how to do it for the whole folder, ok.

(Refer Slide Time: 19:10)

So, it is working. I will simply print the exact values of this fundamental frequency, or
maybe I can also show you a plot where these values are plotted against time. To do so, I will
simply type plt dot plot and pass the array.
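A sketch of this single-file extraction is given below; the frequency range passed to yin is an assumption (roughly C2 to C7), not necessarily the range used in the video.

```python
# Fundamental frequency of one file using the YIN algorithm.
f0 = librosa.yin(data_all[0],
                 fmin=librosa.note_to_hz('C2'),   # assumed lower bound
                 fmax=librosa.note_to_hz('C7'))   # assumed upper bound
print(f0.shape)        # one f0 value per analysis frame, e.g. (130,)

plt.plot(f0)
plt.xlabel('frame')
plt.ylabel('f0 (Hz)')
plt.show()
```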

(Refer Slide Time: 19:36)

So, yeah, as you can see, the initial values are somewhere around 882, then there is some
variation in this part, and then it goes back to around 882 over here. Maybe I should also
show you another file, let us say file 5, and then print another plot over here. So, as you can
see, there is some difference between the fundamental frequencies of two different emotions.

(Refer Slide Time: 20:25)

Now, let me extract the fundamental frequency for all the data. To do so, I will create another
list and use a for loop, where I will iterate over all the data and extract the fundamental
frequency. I will simply append each result, and later I will convert this exact list into a
numpy array. This will take some time to run.
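The loop over all files could look like this sketch, reusing the same assumed frequency range as above.

```python
# Extract the fundamental frequency for every file in the folder.
f_frequency_all = []
for d in data_all:
    f0 = librosa.yin(d,
                     fmin=librosa.note_to_hz('C2'),
                     fmax=librosa.note_to_hz('C7'))
    f_frequency_all.append(f0)

f_frequency_all = np.array(f_frequency_all)
print(f_frequency_all.shape)   # e.g. (60, 130): one row of frame-wise f0 per file
```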

Now, let me print the shape of this variable, ok. So, we have extracted the fundamental
frequency for each and every file in folder 1. Now, I will move to another feature known as
the zero-crossing rate. To extract the zero-crossing rate, I will again be using the librosa
library; there is a function in it called zero crossing rate which gives us the zero-crossing rate
corresponding to these audio files.

Let me use the variable name zcr; let me show you with a single file first. (Refer Time:
23:49) It is working. Let me print the zcr, ok. So, you can see there are some differences over
here. Maybe I can give you a better visualization by simply plotting this over time.

(Refer Slide Time: 24:11)

So, for that I will write plt dot plot, ok. Maybe I will also compute the zcr for another
emotion, say file number 5, and then plot it, ok. There is some issue over here, ok; it is not
scr, it is zcr.

Yeah, here we can also observe a significant difference between these two features. So, I will
write another piece of code to extract the zcr values for all the files.
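A sketch of both steps, the single-file zero-crossing rate and the loop over all files, is shown below.

```python
# Zero-crossing rate of one file; the result has shape (1, n_frames).
zcr = librosa.feature.zero_crossing_rate(data_all[0])
plt.plot(zcr[0])
plt.show()

# Zero-crossing rate for every file, reshaped to (n_files, n_frames)
# so that it stays consistent with the f0 feature matrix.
zcr_all = np.array([librosa.feature.zero_crossing_rate(d) for d in data_all])
zcr_all = zcr_all.reshape(zcr_all.shape[0], -1)
print(zcr_all.shape)   # e.g. (60, 130)
```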

(Refer Slide Time: 25:19)

Now, to keep it consistent with my previous feature shape, which was 60 by 130, we can also
reshape this zcr_all array. Now, if I run my print function again, yeah, we get a similar shape
over here. This is just to keep consistency among all the features. Next, I will show you how
to extract another feature called the Mel-frequency cepstral coefficients.

For that, let us write the MFCC part separately over here. I can simply extract the MFCCs
from librosa. There is a parameter in the MFCC function called the number of MFCC
coefficients; in our case, let us say we will be extracting the first 13 coefficients.
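A single-file MFCC sketch is shown below.

```python
# MFCCs of one file; n_mfcc=13 keeps the first 13 coefficients.
mfcc = librosa.feature.mfcc(y=data_all[0], sr=fs, n_mfcc=13)
print(mfcc.shape)    # (13, n_frames), e.g. (13, 130)
plt.plot(mfcc.T);    # one line per coefficient; the semicolon suppresses the text output
```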

(Refer Slide Time: 26:27)

So, these are the MFCC coefficients. I would also like to print their shape. Yeah, it is 13 by
130 for a single file (Refer Time: 26:42). In this case we are getting 13 rows and 130
columns: each of the 13 rows corresponds to one MFCC coefficient, and for each row there
are 130 values over time.

Now, to visualize this I can simply plot the MFCC, ok, or maybe just to avoid printing these
values I can simply put a semicolon.

(Refer Slide Time: 27:12)

(Refer Slide Time: 27:23)

Now, let me plot it for another file, let us say the 5th file, and our MFCC coefficients will
look something like this. So, are there some significant differences over here?

(Refer Slide Time: 27:42)

I believe yes, I can see some differences over here, and there are some differences over here
also. Even though these are very complex and tightly packed lines, yeah, there are some
significant differences between these two files. So, now, again we will simply extract the
MFCCs for all the files.
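A sketch of the extraction over all files is given below; flattening each 13-by-130 matrix into one feature vector is an assumption made so that it can be fed to the sklearn classifiers, and the exact reshape used in the video may differ.

```python
# MFCCs for every file, flattened so that each file gives one feature vector.
mfcc_all = np.array([librosa.feature.mfcc(y=d, sr=fs, n_mfcc=13) for d in data_all])
mfcc_all = mfcc_all.reshape(mfcc_all.shape[0], -1)
print(mfcc_all.shape)   # e.g. (60, 13 * 130) = (60, 1690)
```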

(Refer Slide Time: 28:17)

So, now, yeah, it looks consistent with my prior representations. We have now extracted all 3
basic features: MFCC, fundamental frequency and zero-crossing rate. After this, I will move
to the basic machine learning part. For the machine learning algorithms, I will be using
Gaussian Naive Bayes, linear discriminant analysis and support vector machines.

Now, moving towards the machine learning part: we will take one feature, divide it into train
and test parts, and then run our classifiers over it. Starting with our very first feature, which
is the fundamental frequency, let me first divide it into train and test parts. For this division, I
will be using the train test split function from the sklearn library. The code will look
something like this, ok.
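A sketch of the split is given below; the test_size and random_state values are assumptions, not necessarily those used in the video.

```python
# Split the fundamental-frequency features into train and test parts.
x_train, x_test, y_train, y_test = train_test_split(
    f_frequency_all, label_all, test_size=0.2, random_state=42)
```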

Now, I can simply create my classifier as clf equal to my first classifier, which is Gaussian
Naive Bayes. I have already imported it, with from sklearn dot naive_bayes import
GaussianNB.

(Refer Slide Time: 29:37)

So, I will simply copy the function over here and fit it on my training set, ok. Now, my
classifier is fit on our training set. Let me check the training accuracy over here, ok. We are
getting 89 percent training accuracy, and let me also check the testing score. Next, using
LDA, I am getting 83 percent training accuracy over here, which is lower than our Gaussian
Naive Bayes.

Let me also check out my test score, ok. So, yeah, the test score has also gone down, to 0.25.
So, I believe Gaussian Naive Bayes is performing better on our fundamental frequency
feature. So, guys, let me try my third classifier now, which is the support vector machine. For
that, I will again reuse my code, and instead of Gaussian Naive Bayes, I will be using the
SVC function over here.

(Refer Slide Time: 30:53)

So, I will replace GaussianNB with SVC, and inside SVC I need to define which kernel I will
be using. Let us say we start with a linear kernel, and let me check, ok, yeah. So, the classifier
is fit on our training data. Let us see our train score, ok. With the linear kernel we are getting
100 percent training accuracy. Let me check the test score over here. We are getting 58
percent accuracy, which is higher than all the other classifiers.

Maybe I can also check it with another kernel, called the RBF kernel, ok. The RBF kernel is
not getting that good an accuracy, and yeah, of course, our test score also decreased to around
16 percent, which I believe is chance level. So, yeah, in our case, for the fundamental
frequency feature, we can easily see that the support vector machine with a linear kernel
gives the best results, ok.
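To keep the whole comparison in one place, the classifiers discussed above can be run in a single loop, as in the sketch below; the printed accuracies will not exactly match the numbers quoted in the video.

```python
# Fit and score each classifier on the f0 features.
for clf in [GaussianNB(),
            LinearDiscriminantAnalysis(),
            SVC(kernel='linear'),
            SVC(kernel='rbf')]:
    clf.fit(x_train, y_train)
    print(type(clf).__name__,
          'train:', clf.score(x_train, y_train),
          'test:', clf.score(x_test, y_test))
```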

Now, let us try similar classifiers using another feature. After the fundamental frequency, our
next feature was the zero-crossing rate. Let me code similar stuff for zcr. Again, I will simply
reuse my code over here; instead of f_frequency_all, I will be using zcr_all, and the rest of
the code stays the same, ok. Now, my train and test variables are set over the zcr features.

So, again I will simply reuse my code. I will be using Gaussian Naive Bayes over here, and
again I have to show my train and test accuracies, so I will simply use this code, ok. For
Gaussian Naive Bayes on the zero-crossing rate, the train accuracy is somewhere around 85
percent, but the test accuracy is around chance level only.

(Refer Slide Time: 33:27)

So, we will try another classifier, which is linear discriminant analysis. Again, I will simply
reuse my code. For linear discriminant analysis we are getting a train accuracy of 100 percent
and a test accuracy of around 8 percent, which is even lower than chance level. This is a clear
example of overfitting; in fact, the previous case was also an example of overfitting.

Let me try the support vector machine now. We simply change the function to SVC and use
kernel equal to linear, ok. There is some problem over here, ok; I forgot to put the equals
sign. In this case, the results look a little better than Gaussian Naive Bayes and linear
discriminant analysis, but there is still a huge gap between the train score and the test score,
so it is another example of overfitting. Let me try the RBF kernel; same case, overfitting.

So, now, moving towards the final manually extracted feature, MFCC, let us try to run
similar code using MFCC, again reusing the code.

(Refer Slide Time: 35:24)

Now, training the Gaussian Naive Bayes classifier on the MFCC feature, ok, again, ok. These
are comparatively better results: we are getting a train accuracy of 95 percent and a test
accuracy of 50 percent. Now, let us try linear discriminant analysis, ok: 70 and 41 percent.
And in the case of our support vector machine with a linear kernel, 1.0 and 0.75, which is a
good result for the linear kernel.

In the case of the RBF kernel, ok (Refer Time: 36:34), it runs over here as well. So, yeah, as
we can see, the support vector machine with a linear kernel is giving the best results. Now, as
we can all see, we have used our basic features with our basic classifiers. After this, maybe I
can do one more exercise where we will use the raw audio data with a one-dimensional
convolutional neural network.

That one-dimensional network will automatically extract relevant features from the audio
data, and using a softmax classification layer we will classify the emotion classes. To do so, I
will take the help of a library called Keras. Before that, I need to make some minor
adjustments in my environment in terms of the data shape.

(Refer Slide Time: 37:30)

I just need to reshape my data so that I can feed it into a convolutional neural network. What
I am basically doing here is.. I mean, let me show you the exact shape of our data before
running this code. My original data shape was 60 by 66150, ok, and after reshaping, the data
gets an extra channel dimension. My next part will be how to code a 1D CNN. We will be
using a library called Keras; I have already imported this library.

To start with, I will simply write model equal to models dot Sequential, where the S is
capital. I will first define the input shape, which is essentially a layer; the input shape will be
a tuple consisting of 66150 by 1. I will simply copy it. After inputting data of this shape, I
will add a convolution layer over it.

The code will look something like model dot add. For the activation function we will be
using ReLU, and the padding will be 'same'. After the convolution layer, maybe I will try
another layer called max pooling or average pooling; let us say I will use max pooling, ok.
Maybe I will just use a single convolution layer and see how my results change with this
network.

Maybe I can also put a batch normalization layer, then a dense layer with model dot add
(Refer Time: 39:53), with the number of neurons equal to 128 and activation equal to relu,
ok. Now, maybe we can also add a dropout layer just to avoid overfitting. And as the final
layer, we will simply add a dense layer with the number of neurons equal to 8, which is equal
to the number of classes, and the activation will be softmax, for classification purposes, yeah.
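A sketch of the network described above is given below. The filter count, kernel size and dropout rate are assumptions, and a Flatten layer is added between the convolutional part and the dense layers (needed to go from the Conv1D output to a Dense layer); it is not mentioned explicitly in the video.

```python
# 1D CNN sketch for raw-audio emotion classification.
model = models.Sequential()
model.add(layers.Input(shape=(66150, 1)))                     # 3 s of audio, 1 channel
model.add(layers.Conv1D(16, kernel_size=9,
                        activation='relu', padding='same'))   # assumed filters/kernel size
model.add(layers.MaxPooling1D())
model.add(layers.BatchNormalization())
model.add(layers.Flatten())                                   # assumed, see note above
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.3))                                # assumed dropout rate
model.add(layers.Dense(8, activation='softmax'))              # 8 emotion classes
model.summary()
```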

Now, maybe I can just print the summary of this model, ok. There is a syntax error over here,
ok; I forgot to put an equals sign, ok. Another error: for activation I again forgot to put the
equals sign. Next, no attribute max pooling 1D. Let me check, ok.

(Refer Slide Time: 41:14)

Another attribute error with max pooling, ok: the P in MaxPooling1D should be capital, ok.
One more error, ok, the epsilon spelling is wrong.

(Refer Slide Time: 41:34)

One more error: I forgot to put an s over here.

(Refer Slide Time: 41:45)

Yeah, now it is working. This is the summary of our model, and this is the total number of
parameters that the model will be tuning. Out of these parameters, this many are trainable
and the others are non-trainable. Now that we have defined the structure of our network, we
can simply compile the model using model dot compile.

In the compilation of the model, we have to define a loss function, that is, which exact loss
function we will be using, and an optimizer, that is, what sort of optimizer we will be using
to optimize that loss function, ok. There is an error over here: the sequential model has no
attribute compile, ok; I have written the wrong spelling. So, after compiling my model, I just
need to fit it on the training data.
model over a training data.

But before dividing our data into train and test splits, I need to convert my labels into
categorical classes, that is, one-hot encoded vectors, since we are using the categorical
cross-entropy loss. For that, I will be using a built-in function in TensorFlow Keras, ok. This
has changed my labels into one-hot encoded vectors. Let me show you, yeah.

(Refer Slide Time: 43:23)

So, instead of a single value over here, we have these vectors.

(Refer Slide Time: 43:31)

Now, I can use my train and test splitting code to divide this data into train and test splits. I
will simply copy my previous code and paste it over here, and this time my training data will
be the raw audio data, ok. After dividing into train and test splits, we simply have to fit the
model, which we have already defined and compiled, through model dot fit. It might take
some time, as right now we are just using the CPU on Google Colab, so it might take some
time to learn, ok.
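Putting the compilation, label encoding, split and training steps together, a sketch could look like this; the optimizer, test_size, number of epochs and batch size are assumptions rather than the exact values used in the video.

```python
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])

labels_onehot = to_categorical(label_all, num_classes=8)   # one-hot encode the labels
x = data_all.reshape(-1, 66150, 1)                         # add a channel axis for Conv1D

x_train, x_test, y_train, y_test = train_test_split(
    x, labels_onehot, test_size=0.2, random_state=42)

model.fit(x_train, y_train, epochs=20, batch_size=8,
          validation_data=(x_test, y_test))
print(model.evaluate(x_test, y_test))                      # [test loss, test accuracy]
```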

We can see that our accuracy is improving over here, ok. So, my model has now completed
all its epochs. Let us evaluate the test accuracy over here. We are getting a training accuracy
of somewhere around 70 percent and a test accuracy of 0 percent, ok. So, this network
architecture is not learning anything useful as of now.

Maybe I can start with some hyperparameter tuning: adding a couple of layers, using
different activation functions, or decreasing or increasing the number of neurons in the dense
layers. All that sort of hyperparameter tuning can be done to make this network a better
classifier.

So, guys, this was all for this tutorial. I hope I was able to give you a basic idea about
programming these sorts of networks. If you have any doubts, feel free to post a question in
the discussion forum.

Thank you.

Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Week - 05
Lecture - 01
Speech Based Emotion Recognition

Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology, Ropar.
Friends, this is a lecture in the series on Affective Computing. Today, we will be discussing
how we can recognize the emotions of a user by analyzing their voice. So, we will be
focusing on the voice modality. The content which we will be covering in this lecture is as
follows.

(Refer Slide Time: 00:53)

First, I will give you some examples of why speech is an extremely important cue for
understanding the emotion of a user. Then we will discuss several applications where speech
and voice based affective computing are already being used. After that, we will switch gears
and talk about the first component required to create a voice based affective computing
system, which is labeled datasets.

In this pursuit, we will also discuss the different attributes of the data and the conditions in
which it has been recorded. Now, if you look at me, let us say I have to make a statement
about how I am feeling today and I say, well, today is a nice day and I am feeling contented.

Now, let me look down a bit and say, today is a wonderful day and I am feeling great. In the
first case, you could hear me and you could see my face very clearly. In the second case, my
face was partially visible, but you could still hear me clearly. I am sure you can make out that
in the first case I was showing a neutral expression, and in the second case, even though my
face was facing downwards and not directly looking into the camera, I sounded more
positive, right.

So, there was a happier emotion which could be heard in my speech. This is one of the
reasons why we use voice as one of the primary modalities in affective computing. When
you talk to a friend, you look at the person's face and understand their facial expressions, but
in parallel you are also listening to what that person is speaking. So, you can actually tell
how that person is feeling from their speech. That is why we will be looking at the different
ways in which voice can be used in this setting.

(Refer Slide Time: 03:36)

So, here is an example. This is a video which I am going to play from an audio-visual group
affect dataset. Let me play the video.

Protects.

Protects. (Refer Time: 03:47) So, you protect yourself. No.

The other delivers protects.

(Refer Time: 03:52).

You protect yourself no.

(Refer Time: 03:55).

The other delivers.

So, if you notice, the body language of the subjects here is a bit aggressive, so this looks like
a training video. But if you hear the voice-over, the explanatory voice in this video, you can
tell that there is no fight going on, there is no aggressive behavior; it is simply a training
session. And how are we able to tell that? By simply listening to the tonality of the voice.

If it was actually a fight or some aggressive behavior shown by the subjects in the video, and
the voice was also from one of the subjects, we would hear a similar pattern which would tell
us that the subjects could be angry. But in this case, even though the body language and
facial expressions suggest an aggressive pose, from the voice we can tell that this is actually a
training video, so the environment is actually neutral. Now, let us look at and hear another
video.

(Refer Time: 05:10) [FL].

Now, in this case, the video has been blacked out. You can hear the audio, and you can tell
that there are several subjects in the audio-video sample and that the subjects are happy,
right. How are we able to tell that? We can hear the laughter.

(Refer Slide Time: 05:34)

Now, if I play the video, this is the video which we had earlier blacked out. You can of
course look at the facial expressions, but even without looking at the facial expressions and
the gestures, just by listening you can tell that the subjects are happy. So, this gives us
enough motivation to pursue voice as a primary modality in affective computing.

Now, as I mentioned earlier, there are a large number of applications where the speech of the
user is analyzed to understand affect. Friends, this is similar to what we discussed in the last
lecture about facial expression analysis, where several applications were in health and in
education.

We find similar use cases for voice based affect, but they are applicable in different
circumstances, in scenarios where it could be non-trivial to have a camera look at the user.
Of course, there is a privacy concern which comes with the camera as well. So, instead we
can use microphones and analyze the spoken speech and the information which is there in
the background.

(Refer Slide Time: 07:03)

Now, the first and quite obvious application of voice in affective computing is making
man-machine interaction more natural. What does that mean? Let us say there is a social
robot. The robot greets the user, who has just entered the room, and the user replies back.

Now, based on the voice of the user and the expression which is being conveyed, the
machine, which is the robot in this case, is able to understand the emotion of the user, and
the feedback to the user can then be based on that emotional state. Let us say the user is not
so cheerful.

So, the robot reacts accordingly and tries to follow up with a question which could make the
user more comfortable and relaxed, or the robot tries to investigate a bit, so that it can have a
conversation which is appropriate with respect to the emotion of the user.

The second application we see is in entertainment, particularly in indexing movies. Friends,
in this case we are talking about the aspect of indexing. Let us say you have a large repository
of movies, ok. Now, the user wants to search, let us say, for all those videos which belong to
a happy event, ok.

You can think of it as the set of videos in your phone's gallery, and you want to fetch those
videos which, let us say, are from events such as birthdays, which are generally cheerful,
right. So, from each audio-visual sample we can analyze the audio, which means the content
spoken by the subject and the background sounds, which could be music. We can analyze
these and get the emotion.

Now, this emotion information can be stored as metadata in the gallery. So, let us say the
user searches for all the happy videos. We look through the metadata, which tells us that,
when we analyzed the audio, these particular audio-video samples sounded cheerful based on
their spoken content and their background voices or music. The same are then shown to the
user.

(Refer Slide Time: 10:42)

Now, moving on to another very important application here; let me first clear the screen a
bit. Looking at the aspect of operator safety, let us say there is a driver and the driver is
operating complex heavy machinery. You can think of an environment, for example in
mining, where a driver is handling a big machine which has several controls. What does that
imply? Well, a harsh working environment, a large number of machine controls, and a high
cost of error.

So, the driver would be required to be attentive, right. Now, you can understand the state of
the driver quite clearly by listening to what they are speaking and how they are speaking.
From the voice pattern, one could easily figure out things such as whether the person sounds
tired, is not attentive, or has negative emotion. If these attributes can be figured out, the
machine, let us say the car or the mining machine, can give feedback to the user.

An example feedback can be: please take a break; right before any accident happens, please
take a break, because when I analyzed your voice I could figure out that you sounded tired,
distracted, or showed an indication of some negative emotion, which can hamper productivity
and affect the safety of the user and of the people in the environment around the user.

Now, friends, the other extremely important area where we use voice based affect is health
and well-being. An example of this, which is currently being experimented with in a large
number of academic and industrial labs, is looking at mental health through voice patterns.
For example, let us say we want to analyze data of patients and healthy controls in a study
where the patients have been clinically diagnosed with unipolar depression.

There we would observe psychomotor retardation, which I briefly mentioned in the facial
expression recognition lecture as well. From the changes in the speech, that is, the frequency
of the words which are spoken, the intensity and the pitch, you could learn a machine
learning model which can predict the intensity of depression. Similarly, from the same
perspective of objective diagnostic tools which can assist clinicians, let us say there is a
patient with ADHD.

When a clinician or an expert is interacting with the patient, we can record the speech of the
interaction, we can record the voices, and then we can analyze how the patient was
responding to the expert, to the clinician, and what emotion was elicited when a particular
question was asked. That can give very vital and useful information to the clinician.

(Refer Slide Time: 15:36)

Now, another area where voice based affective computing is being used is automatic
translation systems. In this case, a speaker would be, let us say, communicating with another
party through a translator, right. Let me give you an example to understand this. Let us say
we have speakers of language one, a group of people who are trying to negotiate a deal with
a group of people who speak language two, and both parties do not really understand each
other's language.

Now, here comes a translator; it could be a machine or a real person, who is listening to
group 1, translating to group 2, and vice versa. Along with the task of translation from
language 1 to language 2 and vice versa, there is a very subtle yet extremely important piece
of information which the translator needs to convey.

Since the scenario is about negotiation, let us say a deal is being struck, the emotional aspect,
that is, the emotion which is conveyed when the speakers of language 1 are trying to make a
point to the other team, also needs to be conveyed. Based on this, simply by understanding
the emotional and behavioral part, one could understand whether, let us say, the
communication is clear.

And whether the two parties are going in the intended direction. You can think of it as an
interrogation scenario as well. Let us say the interrogator speaks one language and the person
who is being interrogated speaks another language, right. How do we understand the
direction of the communication, whether they are actually able to understand each other, and
when the context of the communication has changed, for example when, all of a sudden, a
person who was cooperating stops cooperating, but speaks another language?

That is where we analyze the voice; when you analyze the voice, you can understand the
emotion, and that is an extremely useful cue in this kind of dyadic conversation or multiparty
interaction. Of course, in this case the same is applicable to human-machine interaction
across different languages as well. Friends, another use case is mobile communication. Let us
say you are talking over a device, for example using a mobile phone.

Now, from a strictly privacy aware health and well-being perspective, can the device compute
the emotional state of the user and then, let us say, after the call or communication is over,
perhaps in a subtle way suggest some feedback to the user: to perhaps calm down, or simply
indicate that you have been using the device for n hours, which is actually quite long, and
you may like to take a break, right.

Now, of course, in all these kinds of passive analysis of the emotion of the user, the privacy
aspect is extremely important. So, either that information is analyzed and used as is on the
device, and the user is also aware that there is a feature like this on the device, or it could be
something which is prescribed or suggested to the user by an expert. So, confidentiality and
privacy need to be taken care of.

Now, this is a very interesting aspect, friends. On one hand, we were saying, well, when you
use a camera to understand the facial expression of a person there is a major privacy concern;
therefore, a microphone could be a better sensor, and analysis of voice could be a better
medium.

However, the same applies to voice based analysis as well, because we can infer the identity
of the subject from the voice, and when you speak there could also be personal information.
So, where does the processing to understand affect through voice happen, is it on the user's
device, and where is the data stored? These are all extremely important considerations which
come into the picture when we are talking about these applications.

(Refer Slide Time: 21:04)

Now, let us discuss some difficulties in understanding the emotional state through voice.
According to Borden and others, there are three major factors which pose challenges in
understanding emotion through voice. The first is what is said. This is information of
linguistic origin and depends on the way words are pronounced as representatives of the
language. What did the person actually say, right? The content, for example, 'I am feeling
happy today', right.

So, the interpretation of the spoken content based on the pronunciation of the speaker can
vary; if it varies, or if there is any noise in the understanding of the content which is being
spoken, then that can lead to a noisy interpretation of the emotion as well. The second
challenge is how it is said, that is, how a particular statement is said.
statement said.

This carries paralinguistic information which is related to the speaker's emotional state. An
example: let us say you were in a discussion with a friend and you asked, ok, do you agree
with what I am saying? In scenario 1, the person replies, 'Yes. Yes, I agree with what you are
saying.' In scenario 2, the person says, 'hm, yes. Hm, I agree.' Now, in these two examples
there is a difference, right: the difference in how the same words were said, and that
difference was the emotion.

Let us say, in this particular example, the confidence with which the person agreed with the
other, whether the person really agreed or not, or was a bit hesitant. So, we have to
understand how the content is being spoken, which would indicate the emotion of the
speaker. Now, looking at the third challenge, the third difficulty in understanding emotion
from voice, which is who says it, ok. This refers to the cumulative information regarding the
speaker's basic attributes and features, for example, age, gender and even body size.

In this case, let us say a young individual says 'I am not feeling any pain', as an example,
versus an adult saying 'I am not feeling any pain', right: a young individual versus an adult
speaking the same content. Maybe the young individual is a bit hesitant; maybe the adult who
is speaking is being too cautious. So, what that means is that the attributes, the characteristics
of the speaker, are based not only on their age, gender and body type, but also on their
cultural context.

In some cultures it could be a bit frowned upon to express a certain type of emotion in a
particular context, right. That means, if we want to understand emotion through the voice of
a user from a particular culture or a particular age range, we need to equip our affective
computing system with this meta information, so that the machine learning model can be
made aware during training itself that there could be differences in the emotional expression
of users based on their background and their culture.

So, this means that to understand emotion we need to be able to understand what is spoken,
ok; you can think of it as speech-to-text conversion. Then, how it is said: a very simple way
to explain this is, you have the text, what was the duration in which it was said, were there
any breaks, were there any umms and repetitions of the same words, you know.

That would indicate how it is being said, and then there are the attributes of the speaker. So,
we would require all this information when designing a voice based affective computing
system.

(Refer Slide Time: 27:11)

Now, as we discussed earlier when we were talking about facial expression analysis through
cameras, the extremely important requirement for creating a system is access to data which
has these examples that you can use to train the system, right. When we are talking about
voice based affect, there are three kinds of databases, three broad categories of databases
which exist and have been proposed in the community.

The attributes of these databases are essentially based on how the emotion has been elicited.
We will see what that means, and what is the context in which the participants of the
database have been recorded, ok. So, let us look at the categories. The first is natural data,
ok. You can very easily link this to facial expressions again; we are talking about
spontaneous speech in this case.

Spontaneous speech is, let us say, what you get when creating a group discussion: you give a
topic to the participants and they start discussing it. Let us say they are not provided with
many constraints; it is supposed to be a free-form discussion, and during that discussion
within the group of participants you record the data, ok. That would give spontaneous
replies, spontaneous questions, and within that we will have the emotion which is expressed
by a particular speaker.

Other environments and scenarios where you could get this kind of spontaneous speech data,
which is representative of a natural environment, include, for example, call center
conversations, ok. In this case, let us say a customer calls in, there is a call center
representative, the conversation goes on, and if it is a real one, then that could give you
spontaneous speech.

Similarly, you could have cockpit recordings during abnormal conditions. In this case, let us
say there is an adverse, abnormal condition; the pilot or the user would be communicating
the way they generally communicate when they are under stress, and in that whole exercise
we would get emotional speech, right.

Then there are also conversations between a patient and a doctor. I already gave you an
example when we were talking about how voice could be used for affective computing in the
case of health and well-being, right: a patient replying to the questions of the doctor, and in
that case we would have these conversations carrying emotions. The same goes for
communication which happens in public places as well.

Now, friends, the other category of voice based datasets is simulated or acted. In this case,
the speech utterances, the voice patterns, are collected from experienced, trained professional
artists. So, in this case you would have, let us say, actors who come to a recording studio;
they could be given a script or a topic to speak about, and that data would be recorded.

There are several advantages when it comes to simulated data. Well, it is relatively easier to
collect as compared to natural data. Since the speakers are already informed about the
content or the theme which they are supposed to speak, and they have also given an
agreement, privacy, as compared to natural data, could be better handled in this case, or I
should say easier to handle.

The issue, of course, is that when you are talking about simulated, acted data, not all
examples which you capture in your dataset will be good examples of how user behavior will
be in the real world. Now, the third category, friends, is elicited emotion, which is induced.

In this case, an example of course is, let us say, you show a stimulus, a video which contains
positive or negative affect, and after the user has seen the video, you could ask them to
answer certain questions about that video. The assumption is that the stimulus would have
elicited some emotion in the user, right, and that affect would be represented, shown, when
the speaker, the user in the study, is answering the questions.

(Refer Slide Time: 33:46)

Now, let us look at the databases which are very actively used in the community. The first is
the AIBO database by Batliner and others. This contains interactions between children and
the AIBO robot, contains 110 dialogues, and the emotion categories, the labels, are anger,
boredom, empathetic, helpless, ironic and so forth.

So, the children are interacting with this robot; the robot is a cute Sony AIBO dog robot. The
assumption here is that the participant would get a bit comfortable with the robot and then
emotion would be elicited within the participant. We can have these labels, these emotion
categories, annotated afterwards into the data which has been recorded during the interaction
between the robot and the children.

The other dataset which is very commonly used is the Berlin Database of Emotional Speech,
which was proposed by Burkhardt and others in 2005. This contains 10 subjects, and the data
consists of 10 German sentences which are recorded in different emotions. Notice that this
one is acted, ok. So, this is the acted type of dataset wherein the content was already
provided to the participants, the actors, and they try to speak it in different emotions.

What this means is that the quality of the emotions which are reflected is based on the
content and on the quality of acting by the participant. Now, friends, the third dataset is the
Ryerson Audio-Visual Database of Emotional Speech and Song. Again, this is an acted
dataset; it contains professional actors, and these actors were given the task of vocalizing two
statements in a North American accent.

Of course, the cultural context comes into the picture here as well. This is also acted, and if
you compare it with the first dataset, the AIBO dataset, that one was more spontaneous, ok.
Of course, you will understand that getting that type of interaction is non-trivial.

So, it is extremely important to be careful about privacy and all the ethics approvals which
are required to be taken. In these kinds of databases where you have actors, it is relatively
easier to scale the database, because you could hire actors, you can have multiple recording
sessions, and you can give different content as well.

(Refer Slide Time: 37:15)

Now, moving on to other databases. Friends, the next is the IITKGP SESC dataset, which
was proposed in 2009 by Koolagudi and others. Again, this is an acted dataset with 10
professional artists. This is a non-English dataset; it is in an Indian language, Telugu. Each
artist here spoke 15 sentences trying to represent 8 basic emotions in one session. Another
dataset is again from the same lab, called the IITKGP SEHSC, again by Rao and Koolagudi
and others, 2011.

In this case, you again have 10 professional actors, but the recordings come from radio
jockeys, from All India Radio. So, these are extremely good speakers, and the emotions are
in 8 categories, again acted. But if you have high-quality actors, then the assumption is that
we would be able to get emotional speech as directed during the creation of the dataset.

Moving forward, friends, looking at the number of utterances, this is a fairly good sized
dataset: 15 sentences, 8 emotions and 10 artists, recorded in 10 sessions. So, we have 12000
samples available for training an emotional speech analysis system.
in system.

(Refer Slide Time: 39:01)

Now, in the community there are several projects going on, looking at different aspects of
affect and behavior. One such is the Empathetic grant in the EU; there as well, there are
dataset resources which are used for the analysis of emotions in speech. Another extremely
useful and very commonly used platform is the Computational Paralinguistics Challenge
platform by Schuller and Batliner.

(Refer Slide Time: 39:42)

This is actually hosted as part of a conference called Interspeech, which is a very reputed
conference on speech analysis. In the ComParE benchmarking challenge, every year the
organizers propose different sub-challenges related to speech analysis, a large number of
which are related to emotion analysis, covering different tasks and different settings in which
we would like to understand the emotion of a user or a group of users.

There are some acted and some spontaneous datasets which are available on this
benchmarking platform.

(Refer Slide Time: 40:40)

Now, moving on from the speech databases: we have seen that the databases which are
collected could be acted or spontaneous. The next task is to generate the annotations, the
labels. Of course, before the recording is done, the design of the experiment would already
consider the type of emotions which are expected to be annotated and generated from the
data. One popularly used tool for the annotation of speech is the audino tool.

Here is a quick walkthrough of how the labeling is done. Let us say, friends, here is the
waveform representation. The labeler would listen to a particular chunk, let us say this
chunk; they can re-listen and move forwards and backwards. They could also have access to
the transcript of what is being spoken during this time.

What they can do then is add the emotion which they interpret from this audio data point.
They can also label different things such as the topic of the spoken text, who the speaker is,
and other metadata. So, once they listen to the content, they generate the labels, then they can
save and move forward.

It is extremely important to have the right annotation tool, because you may be planning to
create a large dataset representing different settings. In that case, you would also have
multiple labelers, right, and if you have multiple labelers, the tool needs to be scalable.

And as friends we have already discussed in the facial expression recognition lectures as
well. If you would have multiple labelers you will have to look at things such as consistency
for each labeler how consistent they are in the labeling process with respect to the sanity of
the labels.

You would like the labels to be affected as little as possible by things such as confirmation
bias. So, after the database has been collected and annotated by multiple labelers, you may
like to do a statistical analysis of the labels generated by different labelers for the same
samples, so that at the end we have one or more coherent labels for each data point.
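To make this concrete, here is a minimal sketch of such a statistical check, assuming two labelers who have each assigned one categorical emotion label per audio sample; the labels and the use of Cohen's kappa here are illustrative, not part of the annotation tool itself.

```python
# A minimal sketch of checking labeler agreement, assuming two labelers have
# each assigned one categorical emotion label per audio sample.
from sklearn.metrics import cohen_kappa_score
from collections import Counter

labeler_a = ["happy", "angry", "neutral", "sad", "happy", "neutral"]
labeler_b = ["happy", "angry", "sad",     "sad", "happy", "neutral"]

# Cohen's kappa: chance-corrected agreement between the two labelers.
kappa = cohen_kappa_score(labeler_a, labeler_b)
print(f"Cohen's kappa: {kappa:.2f}")

# A simple way to obtain one coherent label per sample is a majority vote
# (with only two labelers, ties would need a third opinion or adjudication).
for sample_labels in zip(labeler_a, labeler_b):
    label, count = Counter(sample_labels).most_common(1)[0]
    consensus = label if count > 1 else "needs adjudication"
    print(sample_labels, "->", consensus)
```

With more than two labelers one would typically use a measure such as Fleiss' kappa and keep only those samples where a clear majority emerges.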

(Refer Slide Time: 44:01)

Now, let us look at some of the limitations, and these are also linked to the challenges in voice
based affect analysis. What we have seen till now is that there is limited work on non-English
affect analysis. You already saw the IIT KGP datasets which were around the Telugu language
and the Hindi language. There are a few German datasets as well, and Mandarin as well, but
they are fewer in number and smaller in size as compared to English-only datasets.

So, what that typically would mean is: let us say you have a system which is analyzing the
emotion of a speaker speaking in English. You take an English dataset and train a system on
it. Now, you would like to test it on other users who are speaking some other language.

Now, this cross-dataset performance across different languages is a big challenge right now in
the community. Why is it a challenge? Because you have already seen the three challenges:
what is being said, who said it and how it was said. These will vary across different
languages.

The other limitation is the limited number of speakers. If you want to create a voice based
emotion detection system which is supposed to work on a large number of users at a large
scale, you would ideally like a dataset with a large number of users, so that we learn the
variability which is there when we speak: different people speak differently and have
different styles of speaking and expressing emotions.

Now, with respect to the datasets, of course, there is a limitation on the number of speakers
which you can have; there is a practical limit. Let us say you wanted to create a spontaneous
dataset. If you try to increase the number of participants, there could be challenges such as
getting the approvals, including consent from the participants themselves, and so forth.

On the same lines, friends, another issue is that there are limited natural databases, and I have
already explained why: creating a spontaneous dataset is a challenge, because if the user is
aware that they are being recorded, that could add a small bias. The other aspect is that
privacy concerns need to be taken into the picture: for spontaneous conversations, whether the
proper ethics based considerations and permissions are in place or not. All of that affects the
number and size of the natural databases.

Now, this next limitation is fairly new, but extremely relevant: as of today there is not a large
amount of work on emotional speech synthesis. Friends, till now I have been talking about the
setting where someone spoke, the machine analyzed the speech pattern, and we understood
the emotion. But remember, we have been saying that affect sensing is only the first part of
affective computing, and then the feedback has to be there.

So, in the case of speech we can have emotional synthesis. The user speaks to and interacts
with the system, the system understands the emotion of the user, and then the reply, let us say
that is also through speech, can have emotion in it as well. Now, with respect to the progress
in synthetic data generation, we have of course seen large strides in the visual domain: face
generation, facial movement generation.

But comparatively there is less progress in the case of emotional speech, due to the challenges
which I have just mentioned above. This is, of course, being worked upon in several labs
across the globe, but how to add emotion to synthesized speech is currently a challenge. Some
examples you can check out, for example, from developer.amazon.com, where a few styles
and emotions have been added. But essentially the issue is as follows.

Let us say I want to create a text-to-speech system which is emotion aware. I could input into
this TTS system the text from which I want to generate the speech and, let us say as a one-hot
vector, the emotion as well. Now, this will give me the emotional speech. But how do you
scale across a large number of speakers? Typically, high quality TTS systems are subject
specific; you will have one subject's text-to-speech model.
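As a small, purely illustrative sketch, this is how the emotion could be encoded as a one-hot vector to be passed alongside the text; the function tts_synthesize is hypothetical, not a real library call.

```python
# A minimal sketch of one-hot emotion encoding for a hypothetical
# emotion-aware TTS interface; `tts_synthesize` is illustrative only.
import numpy as np

EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "disgust", "surprise", "calm"]

def one_hot_emotion(emotion: str) -> np.ndarray:
    """Encode an emotion class as a one-hot vector."""
    vec = np.zeros(len(EMOTIONS), dtype=np.float32)
    vec[EMOTIONS.index(emotion)] = 1.0
    return vec

text = "I am playing a single hand and what looks like a losing game."
emotion_vector = one_hot_emotion("sad")
# speech = tts_synthesize(text, emotion_vector)   # hypothetical TTS call
print(emotion_vector)
```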

Of course, there are newer systems which are based on machine learning techniques such as
zero-shot or one-shot learning, where you would require a lesser amount of data for training.
In the zero-shot setting what you are saying is: I have the same text-to-speech system, which
has been trained on a large number of speakers along with the text and emotion, and I would
also add a speech sample from the speaker for whom I want the new speech to be generated
from this text input, right.

So, that is the challenge: how do you scale your text-to-speech system across different
speakers and have the emotion synthesized. The other extremely important limitation
currently, friends, is cross-lingual emotion recognition. I have already given you an example
when we were talking about the limited number of non-English emotion recognition works.

So, you train a system for detecting emotion on language 1 and test it on language 2; generally
a large performance drop is observed. But one thing to understand is that for some languages
it is far more difficult to collect data and create databases as compared to other languages:
some languages are spoken more and have a larger number of speakers, while other languages
may be older and spoken by fewer people. So, obviously, creating datasets for them would be
a challenge.

Therefore, in the pursuit of cross-lingual emotion recognition we would also like to have
systems where, let us say, you train a system on language 1, which is very widely spoken and
for which the assumption is that you can actually create a large dataset. Then we can learn
systems on that dataset and later do things such as domain adaptation: adapt from that learned
model, borrow information, and fine-tune on another language where we have smaller
datasets.

So that we can now do emotion recognition on data from the other, smaller dataset. Now,
another challenge, and this is applicable not just to voice or speech but to other modalities as
well, is the explanation part: why, given a speech sample, did the system say the person is
feeling happy?

If you use traditional machine learning systems, for example decision trees or support vector
machines, it is a bit easier to understand why the system reached a particular decision, why it
predicted a certain emotion. However, this is much harder for speech based emotion
recognition through deep learning.

So, this is deep learning, friends, DL, deep learning based methods. Even though they have
state-of-the-art performance, the explanation of why the system reached a certain consensus
about the perceived emotion from the speech of a user is still a very active area of research.

So, we would like to understand why the system reached a certain conclusion with respect to
the emotion of the user, because if you are using this information about the emotional state of
the user in, let us say, a serious application such as health and well-being, we would like to
understand how the system reached that consensus.

So, friends, with this we reach the end of lecture one for voice based emotion recognition. We
have seen why speech analysis is important, why it is useful for emotion recognition, and
what the challenges are in understanding emotion from speech. From there, we moved on to
the different characteristics of the databases, the data which is available for learning voice
based emotion recognition systems.

And then we concluded with the limitations which are currently there in voice based emotion
recognition systems.

Thank you.

Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Week - 05
Lecture - 02
Automatic Speech Analysis based Affect Recognition

Hello and welcome. I am Abhinav Dhall from the Indian Institute of Technology Ropar.
Friends, today we are going to discuss about the aspects of Automatic Speech Analysis based
Affect Recognition. This is part of the Affective Computing course.

(Refer Slide Time: 00:45)

So, in the last lecture, we discussed about why speech and voice-based emotion recognition is
a useful modality for understanding the affective state of the user. We discussed about the
application and challenges. Then we looked at some of the commonly used speech-based
affect recognition databases.

Today, we are going to discuss about the feature analysis aspect. So, I will mention to you
about some of the commonly used hand engineered features which are used in speech
analysis. Then we will look at a system for normalization of the speech feature and later on
we will see an example of affect induced speech synthesis.

So, we are not only interested in understanding the emotion of the user using the speech
modality; we are also interested in the case where the feedback is in the form of speech
generated by a machine, and in how appropriate emotions can be added to that generated
speech for a more engaging and productive interaction with the user, alright.

(Refer Slide Time: 02:11)

So, let us dive in. First we will talk about automatic feature extraction for speech. Friends, the
very commonly used features are referred to as prosody features. These relate to the rhythm,
stress and intonation of the speech. They are generally computed in the form of the
fundamental frequency, the short-term energy of the input signal and simple statistics such as
the speech rate, syllable rate and phoneme rate.

Along with this, we also have features for speech analysis based on measurement of the
spectral characteristics of the input speech. These are related to the harmonic or resonant
structures, and typically they are based on your Mel frequency cepstral coefficients, the
MFCCs, and the Mel filter bank energy coefficients.

In fact, MFCCs are one of the most commonly used speech analysis features, not only for
emotion recognition, but also for other speech related applications such as automatic speech
recognition.

(Refer Slide Time: 03:43)

Now, the MFCCs are, as I mentioned, the most commonly used feature. From a practitioner's
perspective, you can extract the MFCCs with libraries such as librosa or openSMILE. The
steps are as follows: you have continuous audio coming in, and since it is a continuous audio
signal, we divide it into short frames. We then apply the discrete Fourier transform, which
moves the incoming audio signal into the Fourier domain.

Then we apply the Mel filter banks to the power spectra which we get from the spectral
analysis of the DFT outputs, and we sum the energy in each filter. Later on, we compute the
logarithm of these filter bank energies and take the discrete cosine transform. This gives us
the MFCCs, where we keep certain DCT coefficients and discard the others.
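Since the lecture mentions librosa, here is a minimal sketch of extracting MFCCs with it; the file name speech.wav and the parameter values (number of coefficients, frame and hop sizes) are illustrative assumptions.

```python
# A minimal sketch of MFCC extraction with librosa; the file name and the
# parameter values (n_mfcc, n_fft, hop_length) are illustrative.
import librosa
import numpy as np

# Load the audio; sr=None keeps the file's original sampling rate.
signal, sr = librosa.load("speech.wav", sr=None)

# librosa internally performs the framing, DFT, Mel filter bank, log and DCT steps.
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                             n_fft=2048, hop_length=512)
print(mfccs.shape)  # (n_mfcc, number of frames)

# A common utterance-level representation is the mean (and std) over frames.
mfcc_mean = np.mean(mfccs, axis=1)
mfcc_std = np.std(mfccs, axis=1)
feature_vector = np.concatenate([mfcc_mean, mfcc_std])
```

Averaging over frames, as in the last lines, is one simple way to obtain a fixed-length utterance-level feature vector for a classifier.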

(Refer Slide Time: 05:08)

Now, if you look at the prosody-based features, we are interested in analysing the intensity
and amplitude of the input signal, wherein you can use loudness, which is a measure of the
energy in your input acoustic signal. We also very commonly use the fundamental frequency,
which is based on an approximation of the frequency of the quasi-periodic structure of voiced
speech.

And based on the fundamental frequency, from the perception perspective, we have the
feature attribute called pitch, which is essentially the lowest periodic cycle of your input
acoustic signal. It is commonly referred to as the perception of the fundamental frequency.

Typically, quantifying pitch would require human listeners, but from a compute perspective,
wherein we have an automatic system, we compute the fundamental frequency instead. Then,
friends, moving on, we have the formant frequencies F1 and F2.

You can characterize the quality of a voice through these, wherein we look at the
concentration of acoustic energy around the first and second formants. Another commonly
used prosody feature is the speech rate; as the name suggests, we are interested in computing
the velocity of the speech, which is basically the number of complete utterances or elements
produced per time unit. The other one which is very commonly used is the spectral energy.

Through this you can compute the timbre, which is essentially the relative energy in the
different frequency bands of your input signal. Now, a point to note here: these features are
very fundamental, basic features that have been extensively used in speech analysis for
emotion understanding and in other applications. But features such as MFCCs are hand
engineered features.
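For the prosody side, a minimal sketch along the same lines, again assuming a local speech.wav, could use librosa's RMS energy as a loudness proxy and the pYIN estimator for the fundamental frequency; formants and speech rate would need additional tooling (for example Praat via parselmouth) and are not shown here.

```python
# A minimal sketch of two prosody-related measures with librosa; the file
# name and the F0 search bounds are illustrative.
import librosa
import numpy as np

signal, sr = librosa.load("speech.wav", sr=None)

# Loudness proxy: root-mean-square energy per frame.
rms = librosa.feature.rms(y=signal)[0]

# Fundamental frequency (F0) estimate with the pYIN algorithm; unvoiced
# frames are returned as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    signal, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

print("mean RMS energy:", float(np.mean(rms)))
print("mean F0 over voiced frames:", float(np.nanmean(f0)))
```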

So, friends, this is similar to what we discussed for automatic facial expression recognition,
wherein early on histogram of oriented gradients and scale-invariant feature transform
features were used, but the community in academia and industry moved to representation
learning through deep learning, and we now have pre-trained networks through which we
extract the features.

(Refer Slide Time: 08:22)

Now, later on I will show you an example of how the community in speech analysis has
moved to a representation called the spectrogram, which is ideal for use with convolutional
neural networks. Based on the features we have just discussed, what is observed is that
positive voices are generally loud, with considerable variability in the loudness attribute.

They have a high and variable pitch, and high energy in the first two formant frequencies F1
and F2. Further, it is observed that the variations in pitch tell us about the differences between
high arousal emotions, so again we are talking about the valence and arousal dimensions,
where a high arousal emotion could be, for example, joy, and friends, low arousal emotions
would be ones such as tenderness and lust, when compared with neutral vocalizations. So,
pitch is an important feature for looking at high and low arousal emotions.

(Refer Slide Time: 09:44)

Now, what typically happens is that after we have extracted a feature, we will have these
features extracted from n different speech data points which are coming from different
sources, different speakers.

So, there is a very important step involved in the pipeline for speech-based emotion
recognition, which is the normalization of the features. An example which sets the motivation
is as follows: when you have angry speech, it is observed that it has higher fundamental
frequency values compared with the fundamental frequency values for neutral speech.

Now, this difference between the two emotions is actually blurred by inter-speaker
differences. What that means is: you have n speakers who, let us say, each speak both an
angry utterance and a neutral utterance. We observe differences between the fundamental
frequency values of the angry and the neutral speech within one subject.

But if you have multiple subjects, then because of the difference in speaking style of these
subjects, because of this intra-class variability, the difference between the fundamental
frequency of the angry speech and the neutral speech will vary. So, you will observe a
variation across different speakers: the observation that angry speech has a higher
fundamental frequency than neutral speech would be blurred in some speakers' cases, but
would be very evident in others.

Therefore, speaker normalization is used to accommodate the kind of differences which are
introduced in the dataset due to the differences between speakers. Now, here is an example
based on gender, friends: for the fundamental frequency, men would typically have a signal
between 50 and 250 hertz, while for women subjects it would be between 120 and 500 hertz.

So, there is a difference in the range here. Now, for feature normalization there is a very
simple and commonly used approach, the Z-score normalization. What do we do in Z-score
normalization? We take each feature and transform it so that it has zero mean and unit
variance across all the data.

Now, please notice this is actually a very common technique which is not limited to speech; it
is applied to vision and text features as well. Another approach is the min-max approach,
where you take the minimum and the maximum value across the whole dataset as the bounds.

Then you map all the features from all the speech data points based on this minimum and
maximum, so you are either stretching or squeezing the feature values for a particular data
point. Then, friends, other approaches are based on normalization of the distribution, where
you would like the feature extracted from speech to follow a normal distribution.
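Here is a minimal sketch of the Z-score and min-max approaches using scikit-learn; the feature matrix X below is synthetic and simply stands in for, say, per-utterance MFCC statistics.

```python
# A minimal sketch of Z-score and min-max feature normalization with
# scikit-learn; X is assumed to be an (n_samples, n_features) array of
# speech features such as MFCC statistics.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 13))  # placeholder features

# Z-score: zero mean, unit variance per feature dimension.
X_z = StandardScaler().fit_transform(X)

# Min-max: each feature dimension mapped to the [0, 1] range.
X_mm = MinMaxScaler().fit_transform(X)

print(X_z.mean(axis=0).round(2), X_z.std(axis=0).round(2))
print(X_mm.min(axis=0), X_mm.max(axis=0))
```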

Now, here is a problem with these standard yet very commonly used techniques: such feature
normalization can adversely affect the emotional discrimination of the features. Our
observation was that, due to the differences between speakers, the observable difference
between the feature values for different emotions varies.

So, we apply some normalization technique. However, since this normalization is generally
applied across the whole dataset, it can sometimes reduce the discriminative power of the
feature for certain subjects, as we are normalizing all the subjects together.

(Refer Slide Time: 14:40)

So, to mitigate this, I will now mention one simple yet very effective technique for feature
normalization which is called Iterative Feature Normalization. The motivation for IFN,
iterative feature normalization, is as follows: as we have seen, applying a single normalization
across the entire corpus can adversely affect the emotional discrimination of the features.

Therefore, let us try to work on this aspect, which is applying the normalization to the entire
dataset at once. What Carlos Busso and his team proposed in 2011 was to estimate the
normalization parameters using only the neutral, non-emotional samples. So, let me have a
baseline, a reference, and what can be a good reference? Well, the neutral emotion utterances
in the audio samples.

Now, once I know what the neutral utterances or neutral samples in my data are for a subject,
I can treat them as a baseline to normalize the other samples. Friends, similar methods are also
used in facial expression analysis with videos, wherein one could normalize the geometric
features of a face based on the neutral expression. Typically, in a neutral expression you will
observe that the lips are closed.

So, the distance between those points is very small, nearly negligible, and that is used as a
baseline for comparison with, let us say, when the mouth is wide open. Now, let us come back
to speech and talk about iterative feature normalization. You get the input audio signal and we
are interested in feature normalization. So, what do we do?

We use automatic emotional speech detection, which gives us two labels indicating whether a
segment is neutral speech or emotional speech. Once we know which segments are neutral,
we use those to estimate the normalization parameters. We do this iteratively, and then we
converge to the final normalization.

(Refer Slide Time: 17:24)

So, here are the steps one by one. First, we take the acoustic features, which could be any of
the features, friends, your MFCCs, your F0 and so forth, without any normalization; we do not
apply any Z-score normalization or any min-max approach. We then use these features to
detect expressive speech, essentially deciding which part of the speech is neutral and which is
showing some emotion, so it is a kind of binary class problem.

Now, the observations which are labeled as neutral, so those parts of the speech or those
speech samples labeled as neutral, are used to re-estimate the normalization parameters. As
the approximation of the normalization parameters improves, the performance of the
detection algorithm is expected to improve, which in turn gives you better normalization
parameters.

So, this is an iterative process, and it is repeated until only a certain percentage of files in the
emotional database change label between successive iterations. In the original work this
threshold is set to 5 percent, but you can vary that empirically as well. So, what you are doing
is: detect neutral speech, estimate the parameters, re-normalize, again detect neutral and
expressive speech, get the newer parameters, and keep running this iterative process.
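Here is a minimal sketch of the iterative loop described above, assuming a pre-trained binary neutral-versus-emotional detector is available as a function; the detector, the per-speaker feature matrix and the stopping threshold are assumptions for illustration, not the authors' exact implementation.

```python
# A minimal sketch of the iterative feature normalization idea, assuming a
# binary detector `detect_neutral(features)` that returns a boolean mask
# (True = neutral) for the samples of one speaker.
import numpy as np

def iterative_feature_normalization(features, detect_neutral,
                                    change_threshold=0.05, max_iter=10):
    """features: (n_samples, n_dims) array of acoustic features for one speaker."""
    normalized = features.copy()
    prev_labels = np.zeros(len(features), dtype=bool)

    for _ in range(max_iter):
        # 1. Detect which samples currently look neutral.
        neutral_mask = detect_neutral(normalized)
        if not neutral_mask.any():
            break  # nothing detected as neutral; keep the current normalization

        # 2. Re-estimate the normalization parameters from neutral samples only.
        mu = features[neutral_mask].mean(axis=0)
        sigma = features[neutral_mask].std(axis=0) + 1e-8
        normalized = (features - mu) / sigma

        # 3. Stop when few samples change label between successive iterations.
        changed = np.mean(neutral_mask != prev_labels)
        if changed <= change_threshold:
            break
        prev_labels = neutral_mask

    return normalized
```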

(Refer Slide Time: 19:04)

Now, friends, what have we done? We said we have a bunch of features which can be
extracted for analysis of the acoustic signal, and we then discussed the feature normalization
part. After that we can apply different machine learning techniques. Here I am mentioning the
commonly used machine learning techniques, and the reason to discuss this is as follows.

It depends on how we are extracting the audio feature. Let us say this is your timeline and we
are extracting some audio feature: what is the duration of the window, and at what frequency
am I getting the input data? And let us say in this signal I consider one part and call it P1, and
call a later part P of N: how important is it for me to be able to correctly predict the emotion
for P of N with or without prior information, let us say P1?

So, how much of the prior information is required? In other words, how much temporal
variation is required across the windows, how much history do you need? That depends on
the use case, and it is one of the primary factors, along with the window frequency and time
duration, and let me also add the computational complexity, in deciding which machine
learning technique you are going to use.

Commonly in speech analysis, state-based models have been used: hidden Markov models
and conditional random fields. Researchers have also used support vector machines and
random forests, wherein they first compute the feature from the whole of the speech sample,
so you take the whole speech sample, extract the features, and then run a support vector
machine or a random forest.
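As a minimal sketch of this utterance-level setup, assuming the feature matrix X has already been extracted (the data below is synthetic), a support vector machine can be trained with scikit-learn as follows.

```python
# A minimal sketch of an utterance-level SVM classifier on extracted features;
# X stands in for per-utterance feature vectors (e.g. MFCC statistics) and
# y for emotion labels; both are synthetic placeholders here.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 26))        # placeholder feature vectors
y = rng.integers(0, 4, size=200)      # placeholder emotion labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```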

Recently we have also seen researchers using deep learning based techniques, so you can use
either convolutional neural networks or recurrent neural networks. Now, an obvious question
comes up: if you want to use a convolutional neural network you would need, let us say, an
image-like feature, right? We will come to this in the coming slides.

For RNN based learning the motivation is very similar to this: you have your speech signal,
you divide it into chunks, and you want to learn the feature from each chunk while also using
the information from the prior chunks, from the context. So, let us say I am at this cell; I
would not only analyse the feature for this chunk, but I would also want some learning carried
over from the earlier chunks. That is how you would commonly use RNNs as well.

(Refer Slide Time: 22:49)

Now, coming to the representation which I have been talking about, which is commonly used
with convolutional neural networks and has been shown to give highly accurate speech-based
emotion recognition: the representation of the audio signal in the form of spectrograms.

So, this is essentially a visualization of frequency against time, which also gives you the
amplitude. What you see here, friends, are two spectrograms from the same subject:
spectrogram one is when the speech was neutral and spectrogram two is when the speech was
angry. You can very well see the differences in these two visualizations.

And since we are able to visualize this, I can treat a spectrogram as an image; I can assume
that it is actually an image. Now, if this is an image, then I can train a convolutional neural
network with the spectrogram as the input and the emotion classes as the output.
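A minimal sketch of producing such an image-like representation with librosa is shown below; the file name and the mel/FFT parameters are illustrative.

```python
# A minimal sketch of turning an audio file into a (log-)mel spectrogram that
# can be fed to a CNN as an image-like array.
import librosa
import numpy as np

signal, sr = librosa.load("speech.wav", sr=None)

mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=64,
                                     n_fft=2048, hop_length=512)
log_mel = librosa.power_to_db(mel, ref=np.max)   # decibel scale

print(log_mel.shape)  # (n_mels, n_frames): a 2D "image" of the utterance
```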

(Refer Slide Time: 24:18)

So, here is an example of a work by Satt and others, who proposed an emotion recognition
system wherein you have the spectrogram as input. Then, similar to a traditional deep
convolutional neural network, you have a series of convolutions, max pooling and so forth,
and to capture the time dimension as well, there is a bi-directional LSTM. So, what are you
doing?

You are saying: this is my audio signal over time, and I have spectrogram S1, spectrogram S2,
of course you could have overlapping windows as well, but this is just for visualization, S3
and so forth till S of N. These spectrograms are input here, you get a feature representation for
each spectrogram, and then you use a recurrent neural network to finally predict the emotion
categories, the emotion classes.
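Here is a minimal sketch, in PyTorch, of this general convolution-plus-bidirectional-LSTM idea; it is not the authors' exact architecture, and the layer sizes, chunking and number of emotion classes are illustrative assumptions.

```python
# A minimal sketch of a CNN feature extractor over per-chunk spectrograms
# followed by a bi-directional LSTM; all sizes are illustrative.
import torch
import torch.nn as nn

class SpectrogramCRNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.rnn = nn.LSTM(input_size=32 * 4 * 4, hidden_size=64,
                           batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, n_classes)

    def forward(self, x):
        # x: (batch, n_chunks, 1, n_mels, n_frames) - one spectrogram per chunk
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))          # (b*t, 32, 4, 4)
        feats = feats.flatten(1).view(b, t, -1)    # (b, t, 512)
        out, _ = self.rnn(feats)                   # (b, t, 128)
        return self.classifier(out[:, -1])         # logits per emotion class

# Example: batch of 2 utterances, 5 chunks each, 64x94 log-mel spectrograms.
model = SpectrogramCRNN()
dummy = torch.randn(2, 5, 1, 64, 94)
print(model(dummy).shape)  # torch.Size([2, 4])
```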

(Refer Slide Time: 25:24)

Now, with this we have seen how emotion is predicted using the speech signal, and friends, in
the tutorial which follows this lecture, for this very week, you will also see a detailed example
of how to create a simple speech-based emotion recognition system, starting from getting a
dataset and then extracting different features and trying out different classifiers.

So, up to this point we have covered the recognition part, which falls under the affect sensing
step of affective computing. Now, let me give you an example of speech synthesis where
emotion is also added. Here we have two samples which are generated using Amazon's Alexa.
The first one is speech synthesis where the speaker sounds disappointed; let us play that. "I am
playing a single hand and what looks like a losing game."

The second one is when the speaker sounds excited; let us hear this one. "I am playing a
single hand and what looks like a losing game." By listening to these two samples, one can
easily tell what emotion is reflected, right. So, what this means is: when we want to generate
speech, we have a text-to-speech system which takes the text as input and generates the
required speech. For emotional speech synthesis, along with the text input, we also need to
give as an input the emotion class, or, let us say, the valence and arousal intensities. Now, with
these two inputs to your TTS, one would then get emotion enhanced speech synthesis. So, this
is a very nascent area and we see that there is a lot of work going on to generate emotion
enhanced speech.

(Refer Slide Time: 28:09)

Now, here is an example by Sivaprasad and others, who generate speech with emotional
prosody control. Let us look at the framework, the system which is proposed. First we have
the text input; this is the text for which they are going to generate the speech.

Then we have the second input, which is the speaker style, the speaker reference waveform,
and third, friends, is the target emotion, the values for arousal and valence. Now, the text goes
into a phoneme encoder, which extracts a representation of the required phonemes for the
input text statement.

Then the speaker waveform is input into a speaker encoder, which extracts the characteristics
of the speaker, essentially the style with which one speaks. These are concatenated and input
into an encoder. In parallel we have the AV, the arousal and valence values, which are
concatenated, and later we concatenate that with the output of the phoneme and speaker
encoders.

There is then a duration predictor to predict the duration of the words which are going to be
generated. Further, this is input into a length regulator; the energy of the required word is
predicted, the pitch is predicted, and the outputs of the length regulator, the energy predictor
and the pitch predictor are added together and passed into a decoder.

So, this is your decoder. Now, friends, this gives you a spectrogram, the visualization of your
frequencies, and from this we can generate speech which is controlled by the emotional state,
the target emotion. What this means is that, from a bird's eye view, the system requires the
emotion, a representation for the emotion, and a series of encoders and decoders.

(Refer Slide Time: 30:37)

Now, let us look at some of the open challenges in speech-based affect analysis. The first is
inter- and intra-speaker variability. We have different speaking styles; add to this, let us say,
people coming from different cultures speaking the same language. Let us say you have a
person of Indian descent, a person of Asian descent, and a Caucasian person, and all three are
speaking English.

So, these are, let us say, the ethnicities of the subjects in your dataset. Now, this is going to
lead to a lot of variation, even though everyone is speaking the same language, because of
differences in pronunciation and differences in the style in which we speak. So, for
generalization of a speech based affect analysis system, we would require generic
representations.

Now, we will observe that when we want these generic representations, they will also have to
be agnostic to, or at least to have analysed and generalized over, the different ways in which
emotions are displayed and the differences in individuals' vocal structures. Even if, let us say,
all your speakers are Indian, they will still have different vocal structures, which will lead to
variability in the data which is captured.

Further, a speaker can express an emotion in any number of ways and is influenced by the
context. The context here can be where the person is and with whom the person is interacting;
the emotion would be reflected in different ways. Let us say you make the same statement to a
friend and compare that with speaking the same statement to, let us say, an elder.

The same statement could differ in the intensity of the emotion or could carry a different
emotion altogether. This adds to the variability: the linguistic content remains the same, but
the emotion changes. The second open challenge is: which aspects of the emotion
production-perception mechanism are actually captured by the acoustic features?

We have seen different features proposed for analysis. Some features extract a particular
attribute of the signal, and when you are trying to perceive the emotion, one feature could be
better, let us say, for arousal and another could be better for valence. How do you actually
choose the right balance? Maybe fusion could be one answer.

The third open challenge, friends, is that speech based affect recognition can be exhaustive
and computationally expensive, and hence it can have limited real-time applicability. We saw
in lecture 1 the trade-off for speech between the duration of the sample window and how
close to real time the system needs to be.

So, if you have a longer duration sample, it can be more computationally demanding, but it
may have far more detailed information, which could be required for an accurate prediction.
However, a smaller duration sample could have less information but could be computed
closer to real time. So, this is an open challenge: if you want the rich information, how can we
get it closer to real time?

Now, along with this, one more aspect comes into the picture. Even if you have a window of
audio which gives you features good enough for predicting the emotion of the person at that
very time stamp, there could be things such as background noise in that sample, say
background music where the subject is. So, we would require a noise removal step as well,
either before feature extraction or after feature extraction.

So, this could also affect the computational aspect and how close to real time the system can be.

(Refer Slide Time: 35:33)

Now, let us look at some of the research challenges, friends. A very commonly used
benchmarking platform is the Interspeech 2009 Emotion Challenge. Another one which is
very commonly used in the community is the Audio/Visual Emotion Challenge, the AVEC
challenge, which has several tasks for emotion; it is also used for multimodal, audiovisual
emotion recognition.

But the audio here is also very rich. Then we also have the Interspeech Computational
Paralinguistics Challenge. This one is a very commonly used benchmark in the speech-based
affect recognition community and has been running for years with different tasks related to
affect. Then we also have the Emotion Recognition in the Wild challenge.

Here you have audio and video, but the audio itself is a combination of background music,
background noise and the speaker's voice, so it is also used for understanding affect from
audio. These are openly available resources which anyone can access under the respective
licensing agreements, and you can then use them for creating and evaluating speech-based
emotion recognition works.

Along with this, I would also like to mention the Multimodal Emotion Challenge, the MEC
2016 and 2017 challenges, which are in the Chinese language. So, friends, with this we come
to the end of the second lecture on speech-based emotion recognition. In it we briefly touched
upon the hand-engineered features which have been commonly used in speech analysis, the
prosody-based features and the MFCCs.

Later on, we talked about an important step in speech analysis for emotion prediction, which
is the normalization of the features; to this end we looked at the iterative feature
normalization technique. Then we looked at the different machine learning techniques which
are used for affect prediction, and finally we touched upon the concept of speech synthesis
where emotion is also induced into the generated speech.

Thank you.
