
Affective Computing

Prof. Jainendra Shukla


Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Lecture - 28

Hello everyone, welcome to this affective computing tutorial on multimodal emotion recognition. In this tutorial, we will try to learn emotion recognition using audio and video information together. By integrating information from different modalities such as audio and video, one can gain a more complete and accurate understanding of a person's emotional state. There are multiple techniques for integrating information from multiple modalities, including feature-level fusion, decision-level fusion, and hybrid approaches that combine both. Each approach has its own strengths and weaknesses, and the choice of technique will depend upon the specific application and data availability. So in this tutorial, we will be using the RAVDESS dataset.

The RAVDESS dataset is a popular dataset used for emotion recognition research. It contains recordings of actors speaking and performing various emotional states, including anger, disgust, fear, happiness, sadness, and surprise. The dataset includes both audio and video recordings, which makes it well suited for multimodal emotion recognition research. The audio recordings consist of 24 actors speaking scripted sentences in each of the six emotional states.

The video recordings consist of the same actors performing facial expressions and body movements to convey the same six emotional states. Now, coming to the file naming convention in this dataset: each file name consists of seven numerical identifiers, of which the third identifier tells us the emotion class. So, to get started with multimodal emotion recognition using the RAVDESS dataset, we first need to import the audio and video data and extract relevant features from both modalities. After that, we can use a variety of machine learning models to classify the emotional states based on the extracted features. It is worth noting that the performance of your multimodal emotion recognition system will depend on the quality of the features extracted and the fusion techniques used, as well as the choice of machine learning model.
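As a quick illustration of this naming convention, here is a hedged sketch of how the emotion class could be read from such a file name. The example name below is purely illustrative; only the hyphen-separated, seven-identifier pattern is taken from the description above.

```python
# Illustrative only: parse the emotion class from a RAVDESS-style file name.
# The seven hyphen-separated numeric identifiers are assumed; the third one
# (index 2) encodes the emotion class, as described above.
filename = "02-01-05-01-01-01-01.mp4"          # hypothetical example name

identifiers = filename.split(".")[0].split("-")
emotion_class = int(identifiers[2])             # e.g. 5 -> "angry" under this coding
print(emotion_class)
```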

So it is important to carefully evaluate the performance of the system and fine-tune the network. We will start by downloading and importing the dataset into Google Drive. After that, we will use some predefined Python libraries to read the audio and video files in the Python environment. Later, we will do some feature extraction, namely MFCC for the audio data, and we will extract video features using a VGG16-Face model.

After that, we will try to compare the performance of the individual features versus the fused features, and we will be using Gaussian Naive Bayes, linear discriminant analysis, support vector machines, and a CNN for this analysis.

And later on, we will perform early and late fusion, and we will be using a CNN for this. So now let's jump directly into the coding part. We will start by installing a couple of libraries that might not be pre-installed in the Google Colab environment. Once we have installed all these necessary libraries, we can begin importing them in the code, and then we will start pre-processing the data and extracting the required features for our multimodal emotion recognition. I'm also assuming that you have already downloaded the dataset and stored it in your respective Google Drive.

So our library installation code will look something like this, and it might take a couple of seconds to download all these files. After downloading the relevant packages, I will import all the relevant libraries for this tutorial, and the code will look something like this. After downloading and importing the libraries, I will start by defining some path variables.
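A plausible version of the installation and import cells just described is sketched below; the exact package list is an assumption (most of these already ship with Colab), and the sketch simply gathers the libraries used later in the tutorial.

```python
# In a Colab cell, missing packages can be installed like this (assumed list):
# !pip install moviepy librosa opencv-python

import os
import numpy as np
import cv2                                  # reading video frames
import librosa                              # audio loading and MFCC extraction
from moviepy.editor import VideoFileClip    # splitting audio out of the videos (moviepy 1.x API)
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from sklearn.model_selection import train_test_split
```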

In this tutorial, I will basically be defining three path variables. The first will be the video path, which is the dataset path where our original video files are stored. Then I will define two other paths where I will store the video-only information and the audio-only information extracted from the original dataset. One more thing: I will only be using Actor 1 data for this analysis, and the only purpose of using Actor 1 data is to reduce the computational demand. So as of now, my path variables are all set.
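A sketch of how these path variables might be set up follows; the folder names and Drive layout below are assumptions, not the tutorial's actual paths.

```python
# Mount Google Drive and define the three path variables (hypothetical folder names).
from google.colab import drive
drive.mount('/content/drive')

video_path      = '/content/drive/MyDrive/RAVDESS/Actor_01/'    # original videos (Actor 1 only)
video_only_path = '/content/drive/MyDrive/RAVDESS/video_only/'  # extracted video-only files
audio_only_path = '/content/drive/MyDrive/RAVDESS/audio_only/'  # extracted audio-only files

import os
os.makedirs(video_only_path, exist_ok=True)
os.makedirs(audio_only_path, exist_ok=True)
```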

So I will write the first piece of code, which will take the video files from the original data, extract the video and audio information from each video file, and save them to two separate folders. The code will look something like this. Let me give you an overall overview of this piece of code. What we are essentially doing here is iterating over all the files in the dataset folder, and we will only be using emotion number two and emotion number five, which correspond to the calm and angry emotions. The only reason I'm using just two emotion classes is to reduce the computational load, as in the general case Google Colab only provides about 12 GB of RAM.

For computational purposes only, I will be using just two of the emotion classes. Then, after reading the videos from their location, I will resize them to a smaller size; again, the reason is to reduce the computational requirement. After that, I will simply consider only the first four seconds of each video, both to reduce the computational requirement and to keep all the videos consistent, since one video might be five seconds long and another only four seconds.

So I'm using the smallest common duration, 4 seconds, for every video so that all my data is consistent along the time dimension. Then I will extract the audio information using the command videoclip.audio, and later I will store these separate audio and video contents in the two folders. So now, running this code; it might take a good amount of time, as we are reading and writing the data.
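A hedged sketch of this splitting loop is given below, assuming the moviepy 1.x API and the path variables above; the resize target and the output file naming are assumptions.

```python
# Keep only emotions 02 (calm) and 05 (angry), take the first 4 seconds,
# shrink the frames, and write audio-only and video-only versions separately.
from moviepy.editor import VideoFileClip
import os

for fname in os.listdir(video_path):
    emotion = fname.split('.')[0].split('-')[2]
    if emotion not in ('02', '05'):
        continue

    clip = VideoFileClip(os.path.join(video_path, fname))
    clip = clip.subclip(0, 4)                 # first 4 seconds only, for consistency
    clip = clip.resize((227, 128))            # smaller frames (width x height) to save memory

    clip.audio.write_audiofile(
        os.path.join(audio_only_path, fname.replace('.mp4', '.wav')), fps=8000)
    clip.write_videofile(os.path.join(video_only_path, fname), audio=False)
```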

Once the audio and video files have been created from the original dataset, I will write another piece of code to bring those audio-only and video-only files into my Python environment, and my code will look something like this. I will declare two lists, video_label_all and video_only_all; these two lists will store the labels and the actual video-only data.

I will iterate through all the files in my video-only path and append them to these two lists. Later, I will convert these two lists into NumPy arrays for ease of use, and I will write another small piece of code that converts the labels into 0/1 binary labels. After executing this code, I will use a similar sort of code to bring the audio-only data into my Python environment, and the code will look something like this. So now, as you can see, I have brought my video-only and audio-only data into the Python environment, and the shape of the video-only data is 32 × 120 × 128 × 227 × 3. What these numbers mean is that there are 32 video files, and each video file contains 120 frames: since the video was sampled at 30 frames per second and we are taking just 4 seconds of data, there will be 120 frames. Each frame has dimensions 128 × 227, and there are 3 channels (red, green and blue). Similarly, for the audio data, we consider the first 4 seconds of audio, and all the audio is sampled at 8,000 Hz, which means that for one second of data there are 8,000 samples.
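A rough sketch of this loading step follows, assuming the folder layout above, 30 fps video, 8 kHz audio, and that every clip has at least 120 frames; variable names follow the description.

```python
import os, cv2, librosa
import numpy as np

video_label_all, video_only_all = [], []
for fname in sorted(os.listdir(video_only_path)):
    video_label_all.append(fname.split('-')[2])            # third identifier = emotion class

    cap = cv2.VideoCapture(os.path.join(video_only_path, fname))
    frames = []
    while len(frames) < 120:                                # 4 s at 30 fps
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    video_only_all.append(np.stack(frames))

video_only_all = np.array(video_only_all)                   # (32, 120, 128, 227, 3)
video_label_all = (np.array(video_label_all) == '05').astype(int)   # 0/1 binary labels

audio_only_all = []
for fname in sorted(os.listdir(audio_only_path)):
    y, _ = librosa.load(os.path.join(audio_only_path, fname), sr=8000, duration=4.0)
    audio_only_all.append(y[:32000])                         # 4 s x 8000 Hz = 32000 samples
audio_only_all = np.array(audio_only_all)                    # (32, 32000)
```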

Using this strategy, we collect all 32 files, and each file consists of 32,000 samples, which is simply four seconds multiplied by the 8,000 Hz sample rate. Now that I have the audio-only and video-only files in my Python environment, I will start with the feature extraction part. My first feature will be the MFCC feature, that is, the Mel-Frequency Cepstral Coefficients. It is a commonly used feature extraction technique for audio analysis, including in the context of multimodal emotion recognition. MFCCs represent a compact yet efficient way of capturing the spectral characteristics of an audio signal.

They have been shown to be effective at capturing emotion information in speech signals. There are some basic steps involved in computing the MFCCs from an audio signal. The very first step consists of applying a pre-emphasis (high-pass) filter to the signal to amplify the higher-frequency components and reduce the impact of the lower-frequency components. The second step in the MFCC process is frame segmentation, where the signal is divided into short frames, usually 20 to 30 milliseconds, with some overlap.

The third step is windowing: a window function such as a Hamming window is applied to each frame to reduce spectral leakage and smooth out the signal. Then we perform a Fourier transform, where the windowed signal is transformed into the frequency domain using the discrete Fourier transform. After that, there is mel frequency warping: the DFT spectrum is converted onto a mel frequency scale, which approximates the way humans perceive sound.

After that, the usual cepstral analysis is performed over these warped mel frequencies: the mel frequency spectrum is transformed into the cepstral domain using the discrete cosine transform, and a subset of the resulting cepstral coefficients is selected as the MFCC coefficients. These MFCC coefficients can be used as features for emotion recognition and classification, either on their own or in combination with other features. For our purposes, we will be using a standard library called librosa, which provides functions to easily compute MFCCs from audio signals.

To do so, our code will look something like this. In this code, we declare a list where we will store our MFCC coefficients, iterate through all the audio-only files, and pass each file to the predefined function librosa.feature.mfcc. There is a parameter in this function for the number of MFCC coefficients, and for that we have chosen a value of 40. After computing these MFCC coefficients, we convert the list into an array and reshape it as number of samples by number of coefficients.
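A minimal sketch of this MFCC step with librosa is shown below, using n_mfcc=40 as described; the hop length is left at librosa's default, so the exact number of frames per file is an assumption.

```python
import numpy as np
import librosa

mfcc_list = []
for audio in audio_only_all:                         # each row: 32000 samples at 8 kHz
    mfcc = librosa.feature.mfcc(y=audio, sr=8000, n_mfcc=40)
    mfcc_list.append(mfcc)

mfcc_features = np.array(mfcc_list)                  # (32, 40, n_frames)
mfcc_features = mfcc_features.reshape(mfcc_features.shape[0], -1)   # samples x coefficients
```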

After extracting the MFCC features from the audio files, we will move on to extracting video features from the video-only files. To do so, we will take advantage of a VGG16-based architecture. VGG16 is a deep convolutional neural network architecture that was originally designed for image classification tasks. The architecture consists of 16 layers, including 13 convolutional layers and three fully connected layers. In the standard architecture, the input to the network is an RGB image of size 224 × 224 pixels.

This input layer is followed by the 13 convolutional layers, each with a 3 × 3 filter size and a stride of one pixel. In this architecture, the number of filters increases as we go deeper into the network, ranging from 64 in the first layers to 512 in the last layers. After every block of convolutional layers, a max pooling operation is applied to reduce the spatial resolution of the feature maps coming out of the convolutional layers. After that come the fully connected layers, consisting of 4096 neurons each; in the standard architecture, this multi-layer perceptron head performs the final prediction using a softmax layer. In our case, we will be using the standard VGG16-Face model.

It's a pre-trained model, and to use it our code will look something like this. In this code, instead of using a 224 × 224 image, we will be using a 128 × 227 dimensional image. I have also downloaded the pre-trained weight file to my local drive, and I have already given you the exact code for downloading this file. So after running this cell, our model is initialized.
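One way this initialization might look is sketched below, assuming the downloaded VGG16-Face weight file is in a Keras-compatible HDF5 format and stored at a path of your choosing (both assumptions); the 128 × 227 × 3 input shape follows the description above.

```python
from tensorflow.keras.applications import VGG16

# Convolutional base only (the 13 conv layers), sized for our 128 x 227 x 3 frames.
vgg_base = VGG16(weights=None, include_top=False, input_shape=(128, 227, 3))

weights_path = '/content/drive/MyDrive/vgg16_face_weights.h5'   # hypothetical file name
vgg_base.load_weights(weights_path, by_name=True, skip_mismatch=True)
```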

One essential step we have to take with this pre-trained model is to freeze all the convolutional layers so that they extract relevant features according to their pre-trained settings. To do so, my code will look something like this: I iterate through all the layers in my VGG model and simply set layer.trainable = False. After making the feature learning layers non-trainable, we can put another multi-layer perceptron on top, and the code will look something like this.

So what is this top model doing? It takes our VGG model architecture, then adds a flatten layer, which flattens the extracted features into a single vector, then a batch normalization layer, and then a 21-neuron layer with ReLU activation. We will name this layer the feature layer; this will be our main layer for extracting features. Then we simply perform a classification operation on our dataset and extract the features learned at this layer. You can see that our VGG model gives a 4 × 7 × 512 dimensional feature shape.

We flatten that out, and after performing batch normalization, we take the 21-dimensional feature. Now I can simply compile my model, using categorical cross-entropy loss and the Adam optimizer with a learning rate of 0.0001. After this, I can fit my model on the video data for the two classes, with a batch size of 128 and a number of epochs of, let's say, 20.
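Pulling the freezing step and the top model together, a hedged sketch of how this might be written is shown below; the layer and variable names are assumptions, and every frame of a video is given that video's label, as described.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Freeze the pre-trained convolutional layers.
for layer in vgg_base.layers:
    layer.trainable = False

# Top model: flatten -> batch norm -> 21-unit ReLU "feature_layer" -> 2-way softmax.
model = models.Sequential([
    vgg_base,
    layers.Flatten(),                                   # 4 x 7 x 512 -> single vector
    layers.BatchNormalization(),
    layers.Dense(21, activation='relu', name='feature_layer'),
    layers.Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.Adam(learning_rate=1e-4),
              metrics=['accuracy'])

# Frame-level training data: 32 videos x 120 frames; each frame inherits its video's label.
frames = video_only_all.reshape(-1, 128, 227, 3)
frame_labels = tf.keras.utils.to_categorical(np.repeat(video_label_all, 120), num_classes=2)

model.fit(frames, frame_labels, batch_size=128, epochs=20)
```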

Fitting this model over the 20 epochs might also take some time, so please have some patience. As you can see, we have now trained our model, and we are getting an accuracy of around 86.41%, which is a good discriminative accuracy, showing that we are able to discriminate between the two emotion classes using the VGG16 features. Now, we will simply write some code to extract that feature layer. For that, I will write a simple piece of code where I create a feature extractor which takes the model input as its input, and whose output is the intermediate layer, our feature layer.

Now I can pass all my data points to this feature extractor, and we will get the corresponding features. For that I can write a simple piece of code, and it will look something like this. Here I have declared a list called video_only_feature_list, and I iterate through all the files, pass each one to the feature extractor, extract the features, convert them into a NumPy array, and append them to this video_only_feature_list; then I convert that list into a NumPy array for ease of use. One other thing to note here is that after extracting the NumPy array, I reshape it to 120 × 21. What this means is that since our original data consists of 32 files, each containing 120 images of dimension 128 × 227 × 3, for each file I pass all 120 images through the extractor, and for each image I get a 21-dimensional feature from the VGG16 architecture.
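A sketch of that per-video feature extraction, assuming the model and layer name from the sketch above:

```python
import numpy as np
from tensorflow.keras.models import Model

feature_extractor = Model(inputs=model.input,
                          outputs=model.get_layer('feature_layer').output)

video_only_feature_list = []
for video in video_only_all:                        # video: (120, 128, 227, 3)
    feats = feature_extractor.predict(video)        # (120, 21): one vector per frame
    video_only_feature_list.append(feats.reshape(-1))   # concatenate -> (2520,)

video_only_features = np.array(video_only_feature_list)  # (32, 2520)
```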

Now I simply append these features sequentially so that I get one long feature vector for the 120 images, and these 120 images correspond to our four seconds of video. In this way, I create features for all 32 videos. Running this code, we get 32 files, each with dimension 2,520. By now, we have created our audio features as well as our video features.

Now, we will first try to see how these audio features and video features perform with our basic classifiers, which are Gaussian Naive Bayes, linear discriminant analysis, and support vector machines. To do so, I import all these classifiers from the standard sklearn library, and my classification code will look something like this. First of all, I use my audio features, which are the MFCC audio-only features, and split them into train and test sets, where my test split size is 20 percent. I first fit the Gaussian Naive Bayes classifier on the training set, then the linear discriminant analysis classifier, then a support vector machine with a linear kernel, and after that a support vector machine with an RBF kernel. Let's see how these features actually perform with these basic classifiers. Okay, as you can see, we are getting training scores of around 76% using Gaussian Naive Bayes, linear discriminant analysis, and the support vector classifiers.
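For reference, a minimal sketch of this classifier comparison on the MFCC features might look as follows; the variable names follow the earlier sketches, and the scores quoted in this tutorial come from the actual run, not from this sketch.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    mfcc_features, video_label_all, test_size=0.2, random_state=42)

for name, clf in [('Gaussian Naive Bayes', GaussianNB()),
                  ('LDA', LinearDiscriminantAnalysis()),
                  ('SVM (linear)', SVC(kernel='linear')),
                  ('SVM (RBF)', SVC(kernel='rbf'))]:
    clf.fit(X_train, y_train)
    print(name, 'train:', clf.score(X_train, y_train), 'test:', clf.score(X_test, y_test))
```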

The support vector machine with the RBF kernel is giving a lower score, which may indicate that the discriminating boundary between these two classes is closer to a linear boundary.

After this, we will do the same thing with our video features, and the code will look something like this. It is similar to the code we used for the audio features; instead of the audio features, I simply use my video-only features and their corresponding labels, the test size is again 20%, and we split into train and test sets. Then we call our Gaussian Naive Bayes, linear discriminant analysis, support vector machine with a linear kernel, and support vector machine with an RBF kernel.

So let's run this code and see how our video features perform. Okay, as we can see, our video features are really good: they give an accuracy of around 100% in each instance, including the test scores. Since we are already getting this much accuracy, let's see how the accuracy changes if we simply combine these two feature sets. What I'm basically saying is that I will concatenate my audio-only features and my video-only features together, then pass these fused features to the same basic machine learning classifiers and see how the accuracy changes. For the fusion part, my code will look something like this: I simply concatenate the audio features and video features. Let's see how the dimensions turn out: okay, we have 32 files, each with 5,040-dimensional features. I have also printed the audio labels and video labels just to show you that they are the same with respect to indexing, so we can use them interchangeably. Now let's reuse our classification code and see how the fused features work.
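The fusion step itself is just a concatenation along the feature axis (a sketch, reusing the arrays from the earlier sketches):

```python
import numpy as np

fused_features = np.concatenate([mfcc_features, video_only_features], axis=1)
print(fused_features.shape)     # expected (32, 5040): audio and video feature vectors side by side
```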

For this, I simply reuse my code, and instead of the video-only features I use the fused features. Since our audio labels and video labels are the same with respect to indexing, I can use either of the labels. Again, the test set size is 20%. Let's see how it works. Okay, as we can see, these are performing slightly worse than our video-only features in the first case, which is the Gaussian Naive Bayes case, and for the LDA case the accuracy also drops. But for the SVM case, we get a similar accuracy to what the video features were giving. So this basically says that feature fusion gives us better, or at least the same, performance with the SVM classifier. As for why the classification accuracy decreases, there could be multiple factors; one potential factor is the curse of dimensionality, since my feature dimensionality is very high right now. Gaussian Naive Bayes and linear discriminant analysis might be suffering from this curse of dimensionality, which is why they are giving poorer performance, but in the case of the linear SVM I am getting a similar accuracy here.

So far we have tried traditional machine learning classifiers; from now on we will be using a 1D CNN architecture to classify the image and audio data encodings into emotion classes. First we will test the CNN architecture separately on the audio and video data; later on, we will use the fused modality as the CNN input. In this case, we are hypothesizing that the audio and video embeddings we generated using MFCC and VGG-Face respectively are true representations of the original data, and since we created these embeddings by sequentially concatenating each time frame, we can treat them as time-domain data and use a 1D CNN classifier on top of them. So our CNN architecture will look something like this: we pass the input shape, and there are two convolutional layers, each followed by a max pooling operation.

Then we flatten the output from the second max pooling layer, pass the flattened output through a batch normalization layer, and then use a 128-dimensional dense layer followed by a dropout layer to reduce any chance of overfitting. With the last dense layer, which consists of two neurons with softmax activation, we classify the given data into the emotion classes. We use categorical cross-entropy as the loss function, the optimizer is the Adam optimizer with a learning rate of 0.0001, and the metric is accuracy. So let me run this function. Now that our classifier is ready, we just need to reshape our video and MFCC audio features in a way that can be input to this CNN architecture. For that, I use simple reshaping operations, and the corresponding data will look something like this for both the video and MFCC features.
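A hedged sketch of this 1D CNN is given below; the filter counts, kernel sizes, and dropout rate are assumptions (the transcript does not specify them), while the overall layout, loss, optimizer, and learning rate follow the description above. The 128-unit dense layer is given a name so its activations can be pulled out later for late fusion.

```python
from tensorflow.keras import layers, models, optimizers

def build_1d_cnn(input_shape):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(32, kernel_size=3, activation='relu'),   # filter counts are assumptions
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(64, kernel_size=3, activation='relu'),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.BatchNormalization(),
        layers.Dense(128, activation='relu', name='feature_representation'),
        layers.Dropout(0.3),                                    # dropout rate is an assumption
        layers.Dense(2, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizers.Adam(learning_rate=1e-4),
                  metrics=['accuracy'])
    return model

# Reshape the flat feature vectors to (length, 1) so Conv1D can slide over them.
video_cnn_input = video_only_features.reshape(video_only_features.shape[0], -1, 1)
audio_cnn_input = mfcc_features.reshape(mfcc_features.shape[0], -1, 1)
```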

Later, I simply train our model. For that, I first convert our labels into categorical labels, as we are using the categorical cross-entropy function as our loss. We divide our data into the respective train and test sets with a 20% test size, and we simply fit our model with a batch size of 8 and 10 epochs. Let's see how it performs.
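A sketch of this training step on the video features, with the same caveats as above:

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

labels_cat = tf.keras.utils.to_categorical(video_label_all, num_classes=2)
X_train, X_test, y_train, y_test = train_test_split(
    video_cnn_input, labels_cat, test_size=0.2, random_state=42)

cnn_video = build_1d_cnn(X_train.shape[1:])
cnn_video.fit(X_train, y_train, batch_size=8, epochs=10)
print(cnn_video.evaluate(X_test, y_test))
```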

So here you can see the summary of the model, and the model is now trained. We can see that our CNN architecture gives a good training accuracy, but the test accuracy is only around 42%, which is essentially below chance level, and this is happening with the video features only.

Okay? So let me try the same thing with the audio features and see how my CNN architecture works with them. In the case of the audio features, we again divide the audio data into train and test splits with a test size of 20%, call a fresh instance of our model, and fit it on the training data with a batch size of 8 and, say, 10 epochs.

Then we evaluate it on the test data. Let's see how it performs. Okay, as we can see, using this 1D CNN, our audio features give a better test accuracy than our video features. Now we are interested to see how the fusion of these two feature sets performs in the 1D CNN architecture, so for that I simply use the fused version of the features.

The code will look something like this. Here, I use the fused features with a test set size of 20%, split them into the respective train and test sets, call a fresh instance of our CNN architecture, and fit it on the training data with, again, the same batch size of 8 and 10 epochs. In this case, we can see that our model is able to reach at least 71%, which is roughly equivalent to the audio-only features. So by fusing features we get at least the accuracy of the single best-performing feature. And again, we haven't done any sort of hyperparameter tuning here, and the data is also very limited.

There is still a chance that we could get better accuracy here. This is a classic example of early feature fusion: we took two feature sets, concatenated them together, and passed them to our network. After performing early fusion, we will now perform late feature fusion, which refers to a method where features extracted from multiple sources, such as different sensors or modalities, are combined at a later stage in the processing pipeline, after the initial processing of the individual sources. This allows for a more flexible and efficient feature representation, as it enables the combination of diverse and complementary information.

In the late fusion technique, the features extracted from different sources are first processed independently, often using different techniques or architectures, and then combined at a later stage of the pipeline, such as in a fully connected layer. The combination can be done using various techniques, such as concatenation of the two features coming from different modalities, element-wise addition or element-wise multiplication, or some form of weighted averaging of the two features. In our case, we will simply take our video features and audio features and pass them through separate CNN architectures. From the feature representation layer of each CNN, we will collect the latent features, then manually concatenate them into a late-fused embedding. Using that embedding, we will train a multi-layer perceptron and see the efficacy of our late fusion technique.

To do so, we will first create a model, which we will call the late CNN model. The purpose of this model is to extract relevant features from the feature representation layer, which consists of 128 neurons with ReLU activation. So, running this code.

After that, we can simply instantiate our late CNN model as an audio model and fit it on the MFCC audio-only features with a batch size of 8 and 10 epochs. After we fit the model, we again build a feature extractor and extract the features from this feature representation layer; the code will look something like this. We then do the same thing with our video features. Now that we have our audio and video models trained, we simply extract their feature layers using this code and save them into separate variables.
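A sketch of this extraction step, assuming cnn_audio and cnn_video are instances of the 1D CNN above, trained on the audio and video inputs respectively, with the 128-unit layer named feature_representation:

```python
from tensorflow.keras.models import Model

def latent_features(trained_model, inputs):
    # Cut the trained CNN at its 128-unit feature representation layer.
    extractor = Model(inputs=trained_model.input,
                      outputs=trained_model.get_layer('feature_representation').output)
    return extractor.predict(inputs)

audio_latent = latent_features(cnn_audio, audio_cnn_input)   # (32, 128)
video_latent = latent_features(cnn_video, video_cnn_input)   # (32, 128)
```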

If I show the shape of this variable, it is a tensor of 32 × 128: 32 is our number of samples, and 128 is the size of the latent representation. Similarly, for the video we have the same shape. Now, we can simply concatenate these two latent representations to make a late-fused latent representation. For that, our code will look something like this.
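A minimal sketch of that concatenation:

```python
import numpy as np

late_fused = np.concatenate([audio_latent, video_latent], axis=1)   # (32, 256)
```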

Now that we have this late-fused representation, we will create another multi-layer perceptron, train it on the training data coming from these late-fused features, and test it on the respective test data to see how this late fusion technique works. Our model will look something like this: it's a very simple model where we simply flatten the features.

Then we perform a batch normalization, and there are two layers consisting of 16 neurons each with ReLU activation. I also add a dropout layer with 10% probability to regularize the network, and finally we classify the emotion classes using a softmax classifier. I again train this model with the categorical cross-entropy loss function and the Adam optimizer at a learning rate of 0.0001. Now that the model is compiled, I simply divide my late-fused data into train and test splits (the code will look something like this), with a test size of again 20%. Now I will simply run my model and evaluate it on the test set.
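A hedged sketch of this final multi-layer perceptron and its evaluation, following the description above (two 16-unit ReLU layers, 10% dropout, softmax, categorical cross-entropy, Adam at 0.0001); labels_cat is reused from the earlier sketch:

```python
from tensorflow.keras import layers, models, optimizers
from sklearn.model_selection import train_test_split

mlp = models.Sequential([
    layers.Input(shape=(late_fused.shape[1],)),
    layers.BatchNormalization(),
    layers.Dense(16, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dropout(0.1),                      # 10% dropout to regularize the network
    layers.Dense(2, activation='softmax'),
])
mlp.compile(loss='categorical_crossentropy',
            optimizer=optimizers.Adam(learning_rate=1e-4),
            metrics=['accuracy'])

X_train, X_test, y_train, y_test = train_test_split(
    late_fused, labels_cat, test_size=0.2, random_state=42)
mlp.fit(X_train, y_train, batch_size=8, epochs=10)
print(mlp.evaluate(X_test, y_test))
```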

Let's see how it performs now. OK, as we can see here, we are getting better accuracy in the late-fusion case compared to early fusion: in late fusion we get a test accuracy of 85%, whereas in the early-fusion case our test accuracy was 71%. So, in line with the literature, we had two different modalities, we extracted relevant features from these two modalities using machine learning algorithms (in our case, convolutional neural networks), and then we extracted the latent representations from these CNNs, combined those latent representations, and performed late fusion.

And it gives better accuracy than the earlier techniques. So, finally, concluding this tutorial: we started by working with audio and video information. We extracted MFCC features from the audio modality, and we used a pre-trained VGG16-Face architecture to get features from our video data. Later, using these features, we first trained our machine learning models separately and saw how they performed. Then we fused the given embeddings and saw how these classifiers performed again. Later we took a CNN-based approach, where we first tested the early feature fusion method and then the late feature fusion method.

And we saw that when data from multiple modalities is available, these late fusion techniques work better. Thank you.

Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Week - 09
Lecture - 01
Empathy and Empathic Agent

Hi friends, welcome to this week's module which is Emotional Empathy in Machines.

(Refer Slide Time: 00:31)

So, in this week we are going to talk about how we can create machines which are not only emotionally intelligent but also empathetic, so that they have some sense of empathy. It sounds like sci-fi, but there are ways in which we can at least approximate it. So, here is an outline for today's class. In today's class we are going to look at empathy, try to understand what empathy itself is and what we mean by empathetic agents.

We will next talk about how we can develop artificial empathy. Of course, in order to develop artificial empathy, we need to understand how empathy is developed naturally among humans. Next, we will talk about how we can evoke empathy using the techniques that we have learned from the natural development of empathy in humans.

Then of course, we will look at how this empathy can be used, or has been used, in virtual and robotic agents. We will talk about how empathy is not just an emotional state, it goes beyond an emotional state, and how it can be evoked. And we will finish the module with the evaluation of some performance metrics that can be used to understand the naturalness or the empathy of the (Refer Time: 01:50) empathetic interactions that happen between humans and agents, perfect.

(Refer Slide Time: 01:56)

So, with that, let us dive in with the first topic, empathy and the empathetic agents.

(Refer Slide Time: 02:02)

So, the very first thing that we want to ask ourselves is: why empathetic agents? In order to answer that, let us first revise what empathy is. Empathy, as we all know, is the capacity of humans to understand what other humans are experiencing, and you may understand that this is different from sympathy.

So, this is basically an experience in which you try to feel what others are going through, and in turn this helps you to comfort them in a better way, right. That is what empathy is all about. And as humans, you may have noticed that we prefer interaction with people, we want to be around those friends, colleagues and family members who are more empathetic to us rather than those who are not so empathetic.

The same idea is there behind empathetic agents as well. The underlying hypothesis is that if virtual agents, be they robots, machines, services, or any emotionally intelligent component, can have empathetic responses, then that will lead to better, more positive and appropriate interactions with humans. And this is what we want to enable.

And this is what already happens when two humans who are empathetic to each other interact, right. There is a better conversation and interaction, the interaction is more positive, and it is also appropriate: when to laugh and when not to laugh, for example, and things like that. So, this is the same thing that we want the agents to have, so that they can behave in a similar fashion towards us and the entire experience can be more positive for the humans.

The reason why giving these empathetic responses is going to make virtual agents more empathetic and more interactive is the underlying assumption that we humans always prefer to interact with machines in the same way that we prefer to interact with other people.

So, we want other people to be empathetic; similarly, we would like machines to be empathetic. And this is not a very new phenomenon; it has been studied for a long time. The underlying idea is that, for whatever entities we are interacting with, we can try to attribute human-like qualities to them.

If we can attribute human-like qualities to those entities, then that entity becomes more familiar to us, even if it is a non-human-like entity. And in turn, if it becomes more familiar to us, then the entire experience around this entity becomes more explainable or predictable to us, and hence we are very comfortable with it. So, to give you an idea:

(Refer Slide Time: 05:05)

We can look at something that is known as anthropomorphic design. Some of you may have heard about it, but for those who have not, we are going to look at it now. The underlying assumption we are drawing on, that humans would like to interact with machines that have human-like properties, comes from the fact that we always have a tendency to feel more comfortable around things that have human-like features. Hence, we develop a tendency to attribute human-like characteristics to non-life-like artifacts, and this entire design approach is known as anthropomorphism.

Anthropomorphism, of course, is a Greek word made from two words, anthropos and morphe; anthropos means human, and morphe means shape or form. So, basically, anthropomorphism is all about having a human-like shape or form. But we will see that it is not only about the shape and the form, it is also about the behaviour.

Let us take an example here. The word anthropomorphism itself refers to a human-like form or shape, but as I said, it is not only about the form or the shape anymore, it is also about the way these agents interact with you. We would like them to interact the way humans interact with us, right.

For example, looking at you with 2 eyes rather than 4 eyes: who is stopping you from giving 4 eyes to the robots or the agents or the systems? No one, right. But it turns out that humans feel more familiar and comfortable around robots which have 2 eyes, because they are more human-like, because of our anthropomorphic nature.

So this is about behaviour and also about interaction, and this is what we are talking about with empathic interaction: we would like non-human-like artifacts to interact with us just the way humans interact with us, right. For example, imagine that you have a robot which can communicate with you using your brain signals. That sounds fascinating, but at the same time this is not how humans communicate with other humans; well, most humans will not do that, though some may claim to, right.

So, the way humans communicate or interact with other humans is, for example, by using voice and by using gestures, the way I am using them and the way we use them with each other. Similarly, we would like to have machines which can communicate with us using voice and gestures and so on, and of course facial expressions and all that, right. And this is where we are talking about empathetic interaction as well.

You can see this image on the right-hand side; I hope it is clearly visible to you. If you look at this image, it looks cute to us, it looks appealing, at least to children and to many of us as well. And this is the underlying idea of why we like cartoons, why we like caricatures and so on. Here, for example, you can see a picture of a mouse, but the mouse is depicting human-like characteristics.

What characteristics? Many different ones. For example, it is wearing glasses; it is funny for a mouse to have glasses, but humans have glasses, so it makes it a bit more funny and more human-like. It is reading, of course; it has a dress like humans; it is sleeping like humans on a pillow; it has socks like humans, and so on, right.

So, basically, all of these are human characteristics. In the form, it has glasses and a somewhat human-like shape; in behaviour, it is reading; there is no interaction here, but it could also interact, maybe using voice and so on.

So, this is what anthropomorphism is all about, and because of anthropomorphism we have a tendency to attribute human-like characteristics to non-human-like artifacts. And why do we do that? We do it because it makes the entire interaction more comfortable for us, and hence we get a more positive experience for the humans. So, that is the underlying idea.

(Refer Slide Time: 09:05)

But then we have to look at this anthropomorphism with a bit of a catch as well. There are mainly two things that we are talking about here: one is the (Refer Time: 09:18) appearance, which is the shape or the form, and the other is the function, which is the behaviour, and the function can include both the behaviour and the interaction.

It turns out that this impacts how we perceive the agent, how we interact with it, and in what sense we build long-term relationships with it, right. And this is going to be very important for people in industry, because one problem there is the retention of customers of the services, and of the customer base itself, right.

So, customers may start using your product, your services, your agents, your robots and so on, but over time they are going to lose interest unless they are able to build a long-term relationship with it, and long-term relationships can be built easily if the services can provide human-like experiences to them, ok.

But there is a catch here. The catch that I wanted to talk about concerns the appearance and the capabilities, or the appearance and the function, of an agent. We can talk about an agent because it is easier to understand, but as I have said multiple times, the same idea can be extended to robots, services and things like that.

So, the appearance of a robot should also match its capabilities, in the sense that the form and the behaviour should be in sync, and in turn it has to match the user's expectations. Just to make this a bit clearer, please look at the picture on your right side. I am not sure how many of you know it, but this is the Sophia humanoid robot, ok.

Sophia is supposed to be, as of now, the most advanced humanoid robot. If you look at the entire image of the Sophia robot, it is an anthropomorphic design. Why is it an anthropomorphic design? It looks like a human, right, and it is behaving somewhat like a human: it is smiling like humans, to a certain extent it has eyes like a human, it has a human-like form, and ears, and all that.

But again, as I said, we want the appearance and the capabilities to match each other, and when they do not match each other, then what happens? Then creepiness comes into the picture, and the entire experience becomes more uncomfortable rather than comfortable, right.

Of course, I am not sure, but for some it may seem that while it looks like a human, at the same time, if you look at the smile, it has some components which do not look exactly human. So there is a mismatch between these two, and maybe it is not going to give the very strong sense of comfort that you wanted it to have, right.

(Refer Slide Time: 12:11)

It turns out this is a very well-studied and established phenomenon, known as the uncanny valley effect. So what is the uncanny valley effect? The term itself was coined by the roboticist Masahiro Mori (Refer Time: 12:24) in the 1970s, and it describes the dip in the emotional response of humans as human-like characteristics are increasingly achieved by robots or agents, right.

For example, just look at this graph and it will become very clear. On the x-axis, what you have is human likeness, or human-like characteristics; it is already written here. So, basically, human likeness increases as you go from left to right.

What you see here is that there are certain robotic characters which are less human-like, such as R2-D2, which you may recall as a very famous character from the Star Wars movies, and Wall-E, which you may also recall. As you go further to the right-hand side, you come to, for example, the synth characters in the Humans (Refer Time: 13:21) TV series.

Then there are the Terminator's T-800 and Dolores from Westworld, which look very much like humans. So, on the x-axis you have robotic agents looking less human-like on the left-hand side and more human-like as you go to the right. Similarly, on the y-axis you have the familiarity, the sense of comfort or familiarity, as rated by humans, right.

So, what this graph is roughly trying to show you is that as the human likeness of the robotic characters increases, the familiarity of these agents with humans also increases at first, and that makes sense, right: the more they look like humans, the more familiar you feel with them and around them.

For example, if you recall R2-D2, or the Wall-E character that is in front of you, it was so popular because, while it was not human, it had some human-like behaviour and shape as well. For example, it had two cute eyes and it was able to navigate, and things like that. So it was not exactly like a human, but it had some human-like form, and people loved it, liked it a lot.

Similarly, that was the case with R2-D2 and so on. And if you recall another very popular character from the Star Wars movies, C-3PO: C-3PO had even more human-like characteristics, and hence the familiarity, or the popularity, of this character was also quite high.

But then something starts to happen: you see this dip, and this entire region is the dip that we are talking about. Suddenly, in this area, even though the human likeness of the characters keeps increasing, there is a sudden dip in the familiarity.

For example, we can talk about the Sonny character, if you recall, from I, Robot; this is the Sonny character from I, Robot. There is also the T-800 skeletal form of the Terminator, similarly the synths, as I said, and so on.

So, basically, what is happening in the case of all these characters, including the Gunslinger in Westworld, is that they have started to look more like humans, but at the same time there is a mismatch between their form and their behaviour and, more importantly, the users' expectations. I will note it down again so that you can recall it: there is a mismatch between form, behaviour and, more importantly, the user's expectations. What do we mean by that?

If you look at the form, the form looks quite human-like, ok. For example, if you look at the Sonny robot itself, it looks quite like a human, but at the same time its behaviour was not exactly like a human's: it has some superpowers, and there are certain things that look creepy, for example, all those things coming out of its head, and things like that, right.
Similarly, if you look at for example, there a skeletal form of the Terminator I mean this
particular character. It is looking very much like human, but at the same time it looks like a
distorted human body. I mean this does not look very good to us I mean this does not look
very comfortable to us right, I mean this looks like I mean it is missing some limbs, some
parts and this and that. So, it is like horrific it is not very so.

So, then of course, you know like we do not feel comfortable around this kind of designs.
Because of course, one thing is the form is looking like human, but at the same time is not
like humans and it is creepy itself. And then the form and the behavior maybe they both are
also not matching you know ok, I mean if you are looking like human, if the robot is looking
like human, it should also behave like human and it is really hard to behave like humans.

That is because of the naturalness that is there in human interaction, the voice and all that; it is not easy to plug in, and of course there is a lot of research going on around these things. And more importantly, there is a mismatch with the users' expectations.

When you show a character which looks like a human, the user's expectation is that it is going to behave like a human; but when it looks like a human and does not behave like a human (and behaving like a human, as I said, is always challenging), then there is a mismatch with the users' expectations. And that is where the familiarity, or the popularity, starts dipping, and that is essentially what is happening in the uncanny valley region.

But then what happens as we cross a particular threshold, at 90 percent or so, something that is very, very close to human? For example, this character, played by (Refer Time: 17:53) in this case, and then of course Dolores in Westworld: what is happening? With these, the human likeness has increased a lot, but at the same time the familiarity, or the popularity, has also increased a lot.

One simple thing: they are looking very much like humans and they are behaving very much like humans. It is as simple as that; there is a very good match with the expectations of the users: if they are going to look like humans, they are going to behave like humans, and that is how they are behaving in this particular case.

And that is the point where you can also say, if you know or recall the Turing test, that to a certain extent they may have passed the Turing test for humans.

Because for a normal human it may be hard to differentiate whether, for example, this character is a human or a robotic agent, since they are looking very much like humans and behaving very much like humans.

So maybe they are passing the Turing test after this certain region. Nevertheless, it is really hard to reach this particular region, but what you do not want is to get your design trapped in the uncanny valley, ok. So that is what the uncanny valley effect is about.

Coming back to the summary: what did we want to understand? We wanted to understand why we want to have empathetic agents. We want empathetic agents because it makes the entire interaction more positive, as simple as that. And why does it make the entire interaction more positive? Because it gives us a sense of familiarity when the agents interact with us just like humans do, and this is what is known as anthropomorphism, or anthropomorphic design.

What do we want in an anthropomorphic design? Anthropomorphic design is all about the tendency to attribute human-like characteristics to non-life-like artifacts, which may include changing their shape to be human-like, changing their behaviour to be human-like, and changing their interaction to be human-like.

While we are doing this, and this is the underlying hypothesis of empathetic interaction, we should be cautious about the uncanny valley, because it may happen that we make their form human-like, but they are not able to achieve human-like interaction, and that is where there will be a dip in the emotional response, or in the familiarity, or in the likeness.

So, basically, what we would like to have, and this can be called our holy grail, which can be difficult to achieve, is human-like form, human-like behaviour and human-like empathetic reaction; but maybe we can reach this point, and that can also be quite good, or, for example, even here, or anywhere in this region, right. So, that is about why empathy, what we mean by empathetic agents, and why we like them so much.

(Refer Slide Time: 20:36)

Now, let us try to understand the next module of the class.

Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Week - 09
Lecture - 02
Development of Artificial Empathy

So now, having understood the basics of empathy, let us try to understand how artificial empathy can be generated.

(Refer Slide Time: 00:32)

For that, let us first try to understand how empathy has been understood in the classical sense with respect to humans. Empathy can be summarized as three major subprocesses: one is emotional simulation, another is perspective taking, and another is emotion regulation.

Now, what is emotional simulation? Emotional simulation is all about trying to analyze and understand the emotional state of an individual. This is what we have been doing so far, essentially, by trying to understand the emotions of an individual via different physiological and behavioural cues.

Now, what is perspective taking? Perspective taking is about understanding, once you are going to give a response to the user, what the user is going to feel about it, how the user is going to feel about it.

So, what will the user feel, or how would I feel, for example, if I were in the user's situation and received this particular type of response? We will see an example, and it will become very clear. And then, of course, emotion regulation concerns the fact that the aim of empathetic responses, or interactions, is to provide an empathetic response while not transferring any negative emotion towards the individual.

And that is what emotion regulation is. Traditionally, this is how we as humans approach empathy, or empathetic interaction, with each other. Having understood this will also help us simplify what it means for an intelligent machine to be empathetic, ok.

(Refer Slide Time: 02:18)

Let us see this with an example. Imagine that there is an AI-empowered chatbot that you want to create for mental health issues, where individuals who have any type of mental health issue can go and discuss, interact, and have some sort of counselling with the chatbot. Now, if we have to make this AI-empowered chatbot a bit more empathetic, then how will it look?

Ok, the very first thing, of course, is that it will have to handle emotional simulation. When we say that it has to handle emotional simulation, what it means is that it needs to have the ability to identify the emotional state of an individual, and in this particular case it needs to identify the common triggers for anxiety while it is interacting with humans, right.

So, how will it be able to do that, to identify the common triggers for anxiety, or when the user is feeling anxious? Of course, it needs to have access to a large amount of data, perhaps both physiological and behavioural types of data.

And then it needs to be trained on that type of data, and using that trained model it should be able to identify when there is a trigger for anxiety, or when the individual who is interacting with the chatbot is getting anxious, from how they are interacting. So, that is what emotional simulation is.

Now, of course, the next thing is that it has to understand how the user wants to feel; it needs to understand, "I, as a chatbot, need to use language which is going to help my users feel that they are being heard and understood." And how can that be done?

The chatbot can be programmed in advance to respond to the users in a you know very calm
and reassuring manner and then that is how they are going to that is how this is going to
interact with the users an intense user as going to feel good about it.

Because, of course, you have to understand that if the chatbot, having understood what the
emotional state of the individual is, is going to respond to them in an aggressive manner, or,
for example, in a higher pitch or higher tone, then of course the users are not going to feel
comfortable about it.

So, this is what it has to understand: depending upon the state of the user, how should I
respond so that they feel comfortable around it. And that is what perspective taking is, how
will they feel if I respond in a particular way. And of course, then emotion regulation is, ok,
while all this is being done, the chatbot has to ensure that it does not trigger any negative
emotions or feelings in the users, so that there are no negative impacts, right.

So, for example, if the user has been talking about a particular scenario which is causing
distress, it may not necessarily want to poke the user to talk more about that particular
scenario. Of course, keeping aside how the therapy works and all that, it has to take all these
things into account: in what different ways am I going to respond so that it does not invoke
negative emotions in the user, which I definitely do not want to do.

So, you know, this is how empathy can traditionally be understood, and then this is how it can
be transferred to artificial intelligent systems.

(Refer Slide Time: 05:52)

Now, let us try to understand two major components related to artificial empathy and empathy
generation. One is the empathy analysis, and then we will next talk about the empathy
simulation. So, what is empathy analysis? Empathy analysis is basically the first part that we
just saw: we want to understand the emotional state of the user and we want to understand,
overall, whether the interaction that is happening is empathetic or not.

Now, before we talk about the agents or the systems let us understand how traditionally it is
being done or it can be done without making it automated. So, you know in the behavioral
studies it is very very common to study empathy during the interactions and this could be you
know empathy during the interaction for example, of a doctor with a patient and so on so
forth. And then based on this thing there are different types of training which are also
provided.

So, for example, another example could be an interaction that is happening between a call
centre employee and a user who recently had some issues with a credit card or something like
that, right. So, this is how these kinds of interactions have traditionally been analyzed, and
people are being trained on that so that they can become a bit more empathetic while doing
the interactions.

Now, in the absence of AI, or without making use of intelligent systems, how was it being
done? Typically, human raters, or external annotators as we call them, who were not part of
the entire interaction, were involved in this case.

And they used to make use of the behavioral cues of the target, which in this case could be,
for example, the patient, or the client, the customer who has called the customer care; they
were going to look at the behavioral cues and try to infer what the overall state of the
interaction is.

Of course, if the user is continuously feeling frustrated about it, maybe there is not much
empathetic interaction happening, right. So, basically, these external human annotators are,
number one, going to observe the behavioral cues of the target, which could be the client, the
patient and so on, and having observed those, they would like to infer whether an empathetic
interaction is happening, and at the same time they would like to annotate it.

For example, at what point of time the empathetic interaction was happening, and what an
empathetic interaction looks like, right. So, this is how it has been done traditionally in the
case of behavioral studies.

And when I said the behavioral cues, I mean there could be both physiological and behavioral
cues. So, for example, when they are making a call to the customer care, of course, the only
modality that you have access to is, you guessed it right, the audio modality.

So, of course, they can just monitor the audio modality of the client who has made a call to
the customer care, and using this audio modality they can try to understand what the
emotional state of the individual is, and at the same time, overall, whether the interaction is
empathetic or not.

Of course, in order to understand whether it is empathetic or not they also need to understand
what is the interaction that is the entire interaction and the other party that is involved in the
interaction which for example, could be the doctor or for example, could be the customer care
employee who is taking the call of the client, right.

And of course, physiological cues as you rightly guessed physiological cues could be a bit
cumbersome for example, while analyzing the customer care client. But for example, it could
be very much feasible if the target is a patient who is interacting with the doctor.

For example, in a mental health counselling. Of course, we are not talking about very
intrusive physiological cues, but, for example, the user can always come with, or wear, a
wearable sensor through which some physiological cues such as heart rate, even oxygen levels
for that matter, and things like that can be monitored and understood.

And this can be analyzed to understand how the user is feeling and, overall, whether an
empathetic process has taken place. Perfect. So, this is how it is traditionally being done in
behavioral studies.

Now, what is the idea of artificial empathy? The computational empathy analysis studies that
are focused on developing this artificial empathy want to similarly capture and model the
multi-modal behavioral cues related to empathy, the way, for example, an external annotator
was doing in the case of a behavioral study, right. So, this is how it has traditionally been
done, and you guessed it right that there are multiple physiological and behavioral cues.

Some of these we have already studied, to understand how emotions can be recognized from
those cues. The same types of physiological and behavioral cues can also be used to
understand the empathy, rather than just the emotional state; they can be used to understand
the empathy that is occurring in the entire interaction and how the user is feeling about it.
And let us try to look at some of these cues.

So, for example, one particular cue is as simple as the lexical cues. Now, what are lexical
cues? Lexical cues are basically text based data. Now, you may want to ask, ok, where will we
have access to the text data, and how can text data be used to analyze the empathetic
response? So, for example, for the entire interaction that is happening between a therapist and
a client, or between a patient and a doctor, we can have access to the transcripts of the entire
interaction.

And using those transcripts we can understand whether there has been any empathetic
interaction throughout between the patient and the doctor. Of course, you guessed it right that
we will have to use some sort of you know NLP here, some sort of you know NLP Natural
Language Processing here in order to understand that what sort of features are there in this
transcripts that we can analyze.

And on top of those features, we may have to use some machine learning models to
understand whether there has been any empathetic interaction. So, let us talk about one
particular study which is of quite some interest in this case. Xiao et al., in 2013, made use of
an N-gram language model to analyze the language transcriptions of the therapists and the
clients in a motivational interviewing type of counselling.

So, basically let us break this down for you. Ok, what is motivational interviewing type
counselling? So, basically motivational interviewing type counselling is the type of the
counselling where you know the whole job of the therapists or the doctors is to motivate the
clients, motivate the patients towards certain goal or towards certain objectives and that is
what is the motivational interviewing counselling is.

And now, in this case, what is N-gram type modelling? So, what is an N-gram? You may have
gone through it while we were talking about emotions in text, but basically, an N-gram is a
sequence of words, a sequence of N words. For example, let us take a simple example, the
phrase "it looks like". Ok, how many words are there? There are three different words.

So, what does it mean? How can we model it using N-grams? So, for example, if we are using
a unigram, then we are going to look at only one single word at a time.

(Refer Slide Time: 13:58)

So, for example, as simple as that, the probability of occurrence of "it", P(it), is what a
unigram model looks at. Similarly, if we were to look at the probability of "looks" given that
"it" has already occurred, P(looks | it), then that is what we can call a bigram model.

So, this is a unigram and this is a bigram model, right. Basically, when we are looking at two
words at a time that is a bigram model, and when we are looking at one word that is a unigram
model. Similarly, we can also have a trigram model, you guessed it right, where, for example,
we want to know the probability of "like" occurring when the model has already looked at
"it looks", that is, P(like | it looks), right.

So, basically, that is what a trigram model is. There are different characteristics of these
N-gram models, and depending upon the type of task that you are looking at, you may want to
use a bigram, a trigram or an even higher order sequence model. So, basically, in this
particular study, what Xiao and his group did is that they made use of an N-gram model.
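As a small, hypothetical illustration (not code from the study), the Python sketch below counts unigrams, bigrams and trigrams over a toy transcript and estimates the conditional probabilities we just discussed; the toy sentence and the function name are my own assumptions for the example.

from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams (tuples of n consecutive tokens) in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy transcript standing in for therapist/client turns (illustrative only).
corpus = "it looks like you are feeling anxious it looks like a hard week".split()

uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))

# Unigram probability P(it) = count(it) / total tokens.
p_it = uni[("it",)] / sum(uni.values())

# Bigram probability P(looks | it) = count(it looks) / count(it).
p_looks_given_it = bi[("it", "looks")] / uni[("it",)]

# Trigram probability P(like | it looks) = count(it looks like) / count(it looks).
p_like_given_it_looks = tri[("it", "looks", "like")] / bi[("it", "looks")]

print(p_it, p_looks_given_it, p_like_given_it_looks)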

In this case, they used a trigram model to analyze the entire language transcription between
the therapist and the client. And basically, what did they do? They showed that this can be
done just by using a maximum likelihood classifier, which is a very common type of classifier
that looks at the maximum likelihood when assigning a class.

So, in this case, they were looking at a classifier which could assign the entire transcription to
either an empathetic category or an other category. So, there are two classes, right; I hope you
can understand this. This is class 1, empathetic, and this is class 2, other, which could be
non-empathetic or even neutral.

So, basically, what they showed in this study is that by making use of a maximum likelihood
classifier, which just looks at the maximum probability of assigning a particular class based
on these N-gram language models that we just saw, they were able to automatically identify
the empathetic utterances. And that is fascinating, actually.

So, basically, they were just looking at the language transcriptions, making use of the N-gram
language model, applying a maximum likelihood classifier on top of it, and they were able to
identify when an empathetic interaction was happening and when it was not, right.
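And as a rough, hypothetical sketch of this kind of pipeline (not the authors' actual implementation), one could pair N-gram count features with a simple probabilistic classifier; here scikit-learn's CountVectorizer and MultinomialNB stand in for the N-gram language models and the maximum likelihood decision, and the tiny labelled utterances are invented purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy utterances with labels: 1 = empathetic, 0 = other.
utterances = [
    "that sounds really difficult, I understand how hard this is for you",
    "I hear you, it makes sense that you feel this way",
    "please fill out the intake form before we continue",
    "your next appointment is on monday at ten",
]
labels = [1, 1, 0, 0]

# Unigram + bigram counts as features, Naive Bayes as the probabilistic classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(utterances, labels)

print(model.predict(["I understand, that must feel overwhelming"]))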

(Refer Slide Time: 16:40)

So, for example, this is how you can make use of the lexical cues or the text based data to
analyze the empathetic process, to understand the empathy, not only the emotional state, but
also to understand the empathy, whether the empathetic interaction has been has happened.

Of course, you got it right that you need to have annotated data in order to make a supervised
classifier work in this case as well. Similarly, of course, text is limiting to a certain extent;
but nevertheless, one of the most common modalities through which empathy has been
studied in the research, in the literature, is the vocal cues.

So, basically in the voice, and of course, you already understand that the vocal cues or the
voice modalities of the humans is highly dependent on the internal state. And hence, this is a
very, very good indicator of the empathetic, the empathetic state of the individual, right.

And there have been many different studies done on this. For example, one study, again by
Xiao et al. and his group, that was done in 2014, analyzed the prosodic patterns related to
empathy assessments. Prosodic patterns are, again, if you recall your class on emotions in
speech, the prosodic features of the speech. This was, for example, again in the same type of
setting, where a motivational interviewing based therapy was happening.

So, you can look at this diagram and understand there is a therapist, of course, of course,
there is a client, the therapist and the patient they are talking to each other. So, basically the
whole idea was that entire audio was getting recorded. The next step, of course, audio
recording was being done, that is how you obtain the vocal cues.

Of course, you will have to do some sort of denoising on the top of it, background noise or
some sorts of other noise. So, then you first you recorded the audio, you recorded the entire
audio, you did the denoising to remove the low frequency or the high frequency noises. Then
of course, you may want to segment the utterances.

So, for example, the way we were doing trying to identify a word unigram, bigram and
N-gram and so on and so forth, in the same way you want to segment the entire speech into
different utterances, speech corresponding to one word, speech corresponding to two word
and so on and so forth.

And then of course, you may want to, you know, after the segmentation of the utterances, you
may want to extract the prosodic features within those segments. So, it may happen that you
know you may have entire segment of audio like this. So, for example, time t, but then you
may want to chunk in different segments and for example, where this is belonging to maybe
word 1, this is word 2, this is word 3, this is word 4, this is word 5, this is word 6 and so on
and so forth.

And of course, this is just an example; you can segment the entire audio sequence depending
upon many things, into one word or two words, or, as simple as that, into uniform segments of
5 seconds or 10 seconds each, and so on and so forth.

So, once you have segments then you know for each particular segment, then what they did?
They extracted the prosodic features related to the speech and of course, they did the feature,
some feature quantization on the top of it, they looked at the entire distribution of the
prosodic patterns in the different over the different segments.

And of course, you need to have always, as always you need to have an external annotator
and external observer who can look at the entire interaction and can annotate for you that, ok,
this particular segment, yes, it was empathetic, this particular segment it was not empathetic.

So, there is an external annotator or observer who provided the empathy ratings of the
therapist, that is, whether the therapist was being empathetic in a particular segment or not.
And then, of course, once you have the empathy ratings, you simply run some model and do
the automatic inference of the entire thing.

So, that was the model that Xiao et al. proposed for the vocal cues. Now, for the prosodic
features, for each speech segment, they looked at both the therapist's and the client's data.
You may want to ask: if they wanted to predict the therapist's empathy ratings, then why did
they collect the client's prosodic features, why did they look at the client's audio signal? Any
guesses?

Ok, this is quite easy to understand. Your idea is to understand what the empathetic response
was on the target; the target here, in this case, is the client. You want your client to have an
empathetic feeling, and unless and until you analyze the vocal modality of the client, you can
never know whether whatever the therapist was intending was actually having an impact or
not, right.

So, of course, you will have to look at both the therapist and the client. And the prosodic
features included many different features, such as vocal pitch, energy of the signal, jitter,
shimmer, and the speech segment duration itself, which is one simple feature in this case, ok.
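As a rough sketch of such a pipeline (not the authors' code), the snippet below chops an audio file into uniform 10 second segments and extracts a few of the prosodic features mentioned above, namely pitch, energy and segment duration, using librosa; jitter and shimmer usually come from tools such as Praat/parselmouth and are left out here, and the file name is hypothetical.

import numpy as np
import librosa

def prosodic_features(path, segment_sec=10.0):
    # Load the recording (mono) at 16 kHz.
    y, sr = librosa.load(path, sr=16000)
    seg_len = int(segment_sec * sr)
    features = []
    for start in range(0, len(y), seg_len):
        seg = y[start:start + seg_len]
        if len(seg) < sr:  # skip segments shorter than 1 second
            continue
        # Fundamental frequency (pitch) track via pYIN; NaN where unvoiced.
        f0, _, _ = librosa.pyin(seg, fmin=librosa.note_to_hz('C2'),
                                fmax=librosa.note_to_hz('C7'), sr=sr)
        # Short-time energy via RMS.
        rms = librosa.feature.rms(y=seg)[0]
        features.append({
            "duration_sec": len(seg) / sr,
            "mean_pitch_hz": float(np.nanmean(f0)),
            "mean_energy": float(rms.mean()),
        })
    return features

# Hypothetical usage: one feature dict per 10 s segment of a session recording.
# feats = prosodic_features("therapy_session_001.wav")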

So, they looked at this entire set of features, and what they could show with the help of the
results is that they were able to find a group of significant empathy indicators which were
able to predict, for example, what was a low and what was a high empathy rating in this case.

And I think I have an example here. For instance, they were able to find that an increased
distribution of medium length segments with high energy and high pitch was associated with
a low empathy situation. And it should not be surprising, because what it means is that, if
there is a segment with high energy and high pitch, it may suggest that the therapist is making
use of a louder voice and has a raised intonation.

And because of this, the therapist's response within that particular segment may not have
been considered as highly empathic. So, these are some of the ways in which they were able
to understand and identify how the different prosodic features are related to low or high
empathetic responses.

And this is how they were able to show that, using vocal cues also, you can analyze the
empathy in a particular interaction, perfect.

(Refer Slide Time: 23:23)

So, I will not talk about how that how the empathetic responses can be evaluated in all
different modalities, but I guess you got the idea. So, of course, you already saw that
empathetic responses can be understood with the help of the lexical analysis; you just look at
the text data of a transcription of the entire interaction.

You can also look at, for example, vocal cues and understand whether an empathetic
interaction has happened. Similarly, it is quite straightforward to see that facial expressions
can be used as well: if you can analyze the facial expressions of the respondents, you can
understand whether they have been experiencing an empathetic interaction or not.

And motivated by this idea, one of the first studies, which Kumano and their group did in
2011, wanted to understand whether the co-occurrence of facial expression patterns between
different individuals who were part of the same group can be related to the empathy labels, to
the empathetic responses, right.

So, for example, you can see in this particular picture. So, on the left hand side you have a
high level view of the entire interaction the way it happened. Of course, this is an image from
the same paper, where for example, there were four participants, they were sitting facing each
other and of course, having a discussion and this is the individual participants' camera
highlighting individual's a facial expressions.

So, now in this study what they try to do? Of course, to take a take a step back, they wanted
to understand if the co-occurrence of the facial expression patterns means the facial
expression of participant 1 for example, with participant 2, with participant 3, with participant
4 can be used to understand that there has been an empathetic response or there has been an
empathetic interaction within that particular segment of course.

So, what did they do? They categorized the facial expressions into six types, which you
already know: neutral, smile, laughter, wry smile, thinking, and of course others. Others could
be something which is not among these.

For example, it could be disgust, shame and those kinds of things, right; basically, other
feelings. So, these were the six types into which they classified the facial expressions of an
individual. How to do the classification of facial expressions? I think you already know the
answer to it; we have already talked about it in emotions in facial expressions.

Next, what they did? They not only looked at the facial expressions, they also looked at the
gaze patterns. And you can understand that ok, if when we are looking at the facial
expressions along with the gaze patterns, it means simply we are creating a multi-modal
system.

And of course, the reason why we make use of a multi-modal system is the hypothesis that the
response of a multi-modal system would be better in comparison to a uni-modal system, such
as one making use of facial expressions only, ok.

So, in this case, they also looked at the gaze patterns of all the participants and then classified
the gaze patterns into three categories. One is the mutual gaze, what is mutual gaze? Mutual
gaze is the labelling where the participants, they were looking at each other.

For example, they were like participants 1 and 2 and maybe they were looking at each other.
So, that there was a mutual gaze between these participants. Now, similarly, what is one-way
gaze? So, one-way gaze is basically ok, one of the participants is looking at other participant,
but other participant is not looking back at the first participant.

So, that is one-way gaze. And what is mutually averted gaze? You guessed it right: mutually
averted gaze is when neither of the participants is looking at the other. So, nevertheless: facial
expressions, six states; gaze patterns, three states. Their objective was to understand whether,
by looking at facial expressions and gaze patterns, we can predict the empathy label of the
interaction in a particular segment.

For the empathy, in this case, rather than using a binary classification, they made use of a
three class classification problem, where the classes were empathy, unconcern and antipathy.
So, they looked at antipathy as well, as the third class; there were three classes. So, you got
the idea, right, ok.
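A minimal, hypothetical sketch of how such inputs and labels might be set up for a pair of participants (my own illustration, not the authors' pipeline): the facial expression categories of the two participants and the gaze category are encoded as a feature vector, and a standard classifier is trained to predict one of the three empathy labels.

from sklearn.ensemble import RandomForestClassifier
import numpy as np

FACE = ["neutral", "smile", "laughter", "wry_smile", "thinking", "other"]
GAZE = ["mutual", "one_way", "averted"]
LABELS = ["empathy", "unconcern", "antipathy"]

# Toy per-segment observations for a pair of participants (invented values):
# (facial expression of A, facial expression of B, gaze category, empathy label)
segments = [
    ("smile", "smile", "mutual", "empathy"),
    ("laughter", "smile", "mutual", "empathy"),
    ("neutral", "thinking", "averted", "unconcern"),
    ("wry_smile", "neutral", "one_way", "antipathy"),
]

def encode(face_a, face_b, gaze):
    # Simple ordinal encoding of the categorical cues into a feature vector.
    return [FACE.index(face_a), FACE.index(face_b), GAZE.index(gaze)]

X = np.array([encode(a, b, g) for a, b, g, _ in segments])
y = np.array([LABELS.index(lbl) for *_, lbl in segments])

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Predict the empathy label for a new, unseen segment (illustrative only).
pred = clf.predict(np.array([encode("smile", "laughter", "mutual")]))
print(LABELS[pred[0]])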

Facial expressions, gaze patterns were the data; input data, empathy state is the label. Now,
let us see what they found. So, the results showed that the facial expressions were effective
predictors of the empathy labels, not very hard to understand, but nevertheless, it was good to
see that using an automated analysis, they were able to understand that how the empathy can
be analyzed in maybe making use of facial expressions.

And I think, if you go ahead and read the paper, you will find that they had some interesting
observations about the gaze patterns as well, and also about what was happening when they
were mixing the facial expressions and the gaze patterns. So, nevertheless, overall, you got the
idea that in order to analyze the empathy during an interaction, you can look at the
physiological data and you can look at the behavioral cues.

And there are n number of examples, such as we just talked about three, which looked at the
how, for example, lexical cues; how, for example, the vocal cues and how, for example, the
facial expressions can be used to understand the level of empathy between during an
interaction, ok.

And similarly, you can take the idea further and make use of multiple other modalities, such
as the ones that we talked about during our previous classes to understand the emotional
states; you can also use them to understand the relation of all those modalities with the
empathy during a particular interaction.

(Refer Slide Time: 29:17)

So, this is how you can do the analysis of the empathy, perfect. So, now the next step is what?
Next step is the empathy simulation that is closely related to the development of artificial
empathy. So, once you have understood that how the how the empathy can be analyzed, of
course, the next step is what?

To understand how we can have an artificial embodiment and display of these empathetic
behaviors in virtual or robotic agents, which can display artificial empathy as it can be
perceived by the human users. So, we already know how it can be analyzed.

Now, the next step would be to put this into the virtual and the robotic agents. Of course, a bit
of a warning here: whatever empathetic responses these agents are going to display, they are
not going to be truly empathetic, right. They are not going to be truly empathetic, as simple as
that.

Of course, because they are not sentient beings; no virtual agent or robotic agent is going to
be a sentient being. And unless and until it is a sentient being, it can never feel the emotion,
and since it can never feel the emotion, it cannot display a truly empathic response.

So, I hope we understand this; but nevertheless, there is hope. And the hope is that it has been
shown by previous research, such as, for example, by Tapus and her group, that we do not
really need the agents to show a truly empathetic behavior.

What we really need is that, even if we just have a simulation of a human-like behavior by the
virtual agents, one that can invoke a perception of empathy in the user, this is feasible and this
is useful for most of the experimentation and the application purposes, right. So, the idea is
not to make the agents feel empathy, but just to simulate an empathetic behavior, a human-like
empathetic behavior, in the agents, and that is going to serve us well in the longer run.

And that is what gives us a lot of hope, actually, and that is why we are talking about what
artificial empathy is and how we can generate empathy, in the hope that this artificial
empathy, even though it is not true empathy, can really help us in making the response more
empathetic and hence more interactive.

Now, it turns out that there have been mainly two different directions in which the previous
research, the literature, has been trying to do this simulation of human-like behavior. Of
course, we understand that a truly empathetic response cannot be generated, so we can just do
a simulation of a human-like behavior, and previous research has been mainly focusing on
two different methodologies to generate this simulation of a human-like behavior.

So, one particular direction has been driven by a computational model of the emotional space.
It is very much driven by the theory behind the generation of empathy in humans, and the
broader idea is: if we can understand the theory behind the generation of empathy in humans,
we can somehow create a cognitive or computational model of it, and using that we can
generate the empathy in the virtual agents or the robotic agents. Of course, it is not so easy to
do.

The other approach that many of the studies have been following is mostly driven by the data,
where it is driven by the user and the context in which we want to apply the empathy for a
particular application. And if we have data related to that, the previous research has been
trying to make use of that data to generate the empathetic behavior. So, we will basically look
at both of these in different ways to understand how it can be done, perfect.

So, as I said, there are two ways: one is the computational model, and the other one we can
call a data driven model. So, let us try to understand both of them, and how we can make use
of these two models.

(Refer Slide Time: 33:45)

How has the previous research used these two different methods to generate the empathy?
The computational model, as I said before, is basically where we want to understand how the
empathy is generated in humans and how we can replicate the same in the machines. And
there have been many different models proposed in this sense to understand how the empathy
is generated among humans.

So, for example, one particular model that has been proposed by Boukricha and his group
says that generating the empathetic reaction has three steps. The first step is the empathy
mechanism, which is the understanding of the emotional state of the user, of the target in this
case. The second is the empathy modulation.

So, what do you want to do here? You want to understand, as simple as that, what the level or
severity of the emotional state is, and what the relationship is between the target and the
observer, the other individual who is interacting.

And, for example, what the demographic information associated with the user is. Using all of
these, there is usually a modulation of the empathy, and it is weighted by many different
factors. Of course, if there is a sense of familiarity between the patient and the doctor, there is
going to be more empathy; if there is a liking between the patient and the doctor, there can be
a bit more empathy.

So, all these aspects come under the empathy modulation. So, first, you try to understand what
the emotional state of the user is; second, how it can be modulated; and the third thing is, of
course, the expression of the empathy, which is making use of the different physiological and
behavioral cues to express the empathic behavior, which of course looks like empathetic
behavior to the users in this case. And this is how, traditionally, one three-step model has been
proposed.

(Refer Slide Time: 35:55)

And let us try to see how this can be applied: if we were to use this particular model to
generate empathy, how can it be applied to our earlier example, which is the AI empowered
chatbot for mental health purposes? So, basically, what can you do? Of course, the first step is
the empathy mechanism.

So, basically, you want to understand and analyze what the user's emotional state is. In the
case of the chatbot, which is interacting with the human, maybe it has access to the user's
voice data, as we just saw, and maybe it has access to the transcription of the entire interaction
that is happening.

So, basically it can look at those cues and it can look for specific words or the phrases, for
example, it can look at prosodic features, it can look at the N-grams. Remember, we just
talked about that in previous and it can understand that you know, like what is the emotional
state of the user in this interaction.

Having understood what the emotional state of the individual is, it can then also adopt a
particular emotional state in response to that particular emotional state of the target; this is
very, very important, please pay attention to this particular point.

For example, if the target is feeling stressed, anxious and so on, then of course you do not
want your chatbot to respond in a very jolly manner. It needs to look like it is concerned about
the user's or the target's current state, right. So, the state of the chatbot, or we can call it the
mood of the chatbot, should show a bit of concern, a bit of empathy, rather than, for example,
a very jolly behavior, always smiling and laughing, which is not going to be very sensitive
towards this interaction. So, that is the empathy mechanism.

Next, of course, once it has understood the user's emotional state, and what the chatbot's
mood should be according to that, the chatbot would like to modulate its empathetic response
based on many different things. For example, if it has access to the user's history with the
chatbot, it may know some other context, some more information, and it can modulate with
respect to that.

Of course, it can also take into account the severity of the anxiety. Maybe the individual is
feeling only a little anxious, maybe the individual is feeling highly anxious, and according to
that, the responses that the chatbot is going to give can be modulated.

And of course, as I said, it also needs to look at the user's demographic information, such as
the age, the gender and so on; they all may play a role in this case. Using all this information,
the response of the chatbot can be modulated, right. For example, the chatbot may make its
response more or less empathic depending upon the situation: if the user is very concerned,
maybe more empathetic; if the user is less concerned, maybe less empathetic.

Now, of course, the next thing is that, having understood what the emotional or empathetic
state of the user is, and having understood the degree to which the response has to be
modulated, the chatbot has to express the empathy. Now, how can the chatbot express
empathy?

So, the chatbot can express empathy through its messages, for example, if it is interacting
using chat, by using words and phrases that convey support, understanding and validation.
And the same thing can be done by making use of the vocal cues, the voice modality, if, for
example, the chatbot is a voice assistant.

So, this is how the expression can be done, and you rightly guessed that you may not want a
very high pitch or high tone in the voice of the chatbot when it is expressing the empathy,
because we just saw, a few moments back, that in one of the studies Xiao et al. showed that
high pitch and high energy are associated with less empathy.

So, this is how you modulate the degree to which the empathetic response has to be shown to
the target user. I hope that, so far, it is making sense how empathy can be simulated, with the
help of a computational model, in, for example, an AI empowered chatbot.
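To tie the three steps together, here is a minimal, hypothetical Python sketch of such a pipeline for a mental health chatbot; every function, threshold and canned response below is my own illustrative assumption, not a prescribed design, and a real system would plug in trained emotion recognition models over the lexical or vocal cues in place of the placeholder detector.

def empathy_mechanism(user_text: str) -> tuple[str, float]:
    """Placeholder emotion detector: returns (state, severity in 0..1).
    In practice this would be a trained classifier over lexical/vocal cues."""
    anxious_words = {"worried", "anxious", "scared", "panic"}
    hits = sum(w in user_text.lower() for w in anxious_words)
    return ("anxious", min(1.0, hits / 2)) if hits else ("neutral", 0.0)

def empathy_modulation(severity: float, familiarity: float) -> float:
    # Weight the empathy level by severity and by how familiar the user is
    # with the chatbot (both illustrative factors from the discussion above).
    return min(1.0, 0.5 * severity + 0.5 * familiarity)

def empathy_expression(state: str, level: float) -> str:
    # Choose calmer, more validating phrasing as the empathy level rises.
    if state == "anxious" and level > 0.6:
        return "That sounds really hard. I'm here with you; let's take it slowly."
    if state == "anxious":
        return "I hear that this is worrying you. Would you like to talk about it?"
    return "Thanks for sharing. How are you feeling right now?"

# Example turn (all values illustrative).
state, severity = empathy_mechanism("I'm really worried and anxious about tomorrow")
level = empathy_modulation(severity, familiarity=0.8)
print(empathy_expression(state, level))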

(Refer Slide Time: 40:55)

So, that was the first way in which empathy can be generated: with the help of a
computational model, trying to understand how empathy evolves in humans and how it is
expressed in humans, and, having understood that, trying to replicate the same in an artificial
agent, which may not be so easy to do.

This is a bit difficult actually, and we will talk about the challenges associated with it. But the
method that is more common among the community, and that has been exploited more, is
basically the data driven approach. In the data driven approach, the model is very specific to a
particular application and is, of course, oriented towards the user and based on a particular
context only. We will see more about it in order to understand.

So, what happens in this case? It tries to learn the context of the human empathetic behavior;
it tries to understand, in a particular scenario, when humans display emotions and how they
display emotions, ok. It is not concerned with what the cognitive model behind the generation
of empathy in humans is.

It is more concerned with: can I understand when they are displaying emotion, and, the other
thing, how they are displaying emotion; that is it. I do not want to understand the model that
is behind the when and the how, for example, the cognitive mechanism guiding it. And that is
what the data driven approach has been.

So, for example, in one particular work by McQuiggan and Lester in 2007, they designed a
framework which they call the CARE framework. In this CARE framework, they collected
the behaviors of a virtual agent that was being manipulated by a human who was acting in an
empathetic manner, right.

For example, expressing frustration when the user was losing a particular game. The entire
data of this interaction was being recorded, and this data was then further used to train a
virtual agent to display empathetic behavior: in response to when the user was displaying a
particular behavior, and, accordingly, to how the user was displaying that behavior.

So, let us try to see this with the help of this particular example. Imagine that this is your
virtual agent, which you want to train to show some empathetic behaviour. And then there is a
human who is the empathizer in this case; you want to see how this human does it, and then
you want to learn from this particular human.

Basically, what is going to happen in this case is that there is a trainer interface, and through
this trainer interface the virtual agent and the empathizer, the human, are going to interact; or,
rather, through this interface the human is going to control how this virtual agent is going to
respond.

Of course, it is completely controlled by the human: the virtual agent's behavior, for example,
what type of message it is going to send or what it is going to say, everything is being
controlled by this human.

Now, there is a virtual environment which has access to all the data of this entire interaction.
When I say that the virtual environment has access to the entire data, this could be all the
temporal attributes, for example, what was happening at which particular time, the locational
attributes, the spatial properties, and the intentional attributes.

So, basically all sort of you know physiological and the behavioral cues that was getting
observed and in what particular time they were being observed, what was the synchronization
between all the cues and everything. So, all this data virtual environment was able to observe.

Now, once the virtual environment had access to all this data, what did it do? You can see this
particular relation: using all this data, they trained an empathy learner module, which is an
intelligent module. In this case, they simply made use of a Naive Bayesian decision tree,
which you may recall from your ML classes before.

They made use of a Naive Bayesian decision tree, which is actually a very simple model, and
trained it on this observational data, where they had access to all the physiological and
behavioral cues, when they were occurring, and, at the same time, the type of response that
was being generated by the human, and in response to what.

For example, we just saw that when the user was playing a game and was losing the game, the
other agent, or rather the human, is going to feel frustrated about it. So, they are feeling
frustrated, and the question is how they display the frustration, maybe by saying, I am losing
the game.

And when they said it, what they felt before that and after that, how they felt it; all this data is
being fed to a machine learning model, the empathy learner in this case. They simply made
use of a Naive Bayesian decision tree, and with the help of this they were ultimately able to
create a model that could be deployed in real time.

So, what is this model used for? This model is fed to the empathetic behavior manager, which
can generate the empathetic behavior, and making use of this model it can do two things.

Of course, it can understand whether the entire interaction is being empathetic or not; we just
saw how. It can interpret the empathetic interactions, and it can understand when it should
display empathy and how it should display that type of empathy.

So, for example: maybe increase my pitch and my tone, or reduce my pitch and my tone, or
say, for example, oh, I am not feeling happy about this, and things like that. These are all the
behaviors that the empathetic behavior manager now has knowledge of, thanks to this
empathy learner module, which has been trained on the observational data with the help of
the humans.

And then, all of this is fed to a user interface through which a new user is going to interact.
So, imagine that this is your AI chatbot with which another user is now interacting; but now
the AI chatbot has been trained on data that in turn was generated from humans.

Now it is going to show empathy, in terms of timing and in terms of expressions, based on
similar data, and to an end user it may look like human-like behavior, or human-like
empathetic behavior, right; and that was the end goal of the entire model. So, a pretty
interesting framework that this group created. Of course, for more information, I would invite
you to please go through this particular paper.
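As a loose, hypothetical sketch of this data driven idea (not the CARE implementation itself), one could log the situational attributes at each moment together with the reaction the human empathizer chose, and fit a simple tree based learner on them; scikit-learn's DecisionTreeClassifier is used below as a stand-in for their Naive Bayesian decision tree, and all the feature names and values are invented for illustration.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented observation log: situational attributes at each moment, plus the
# empathetic reaction the human trainer chose at that moment (the label).
log = pd.DataFrame({
    "user_score_trend": [-3, -2, 0, 2, -1, 3],     # falling score suggests losing
    "time_in_level_s":  [120, 200, 30, 45, 180, 60],
    "user_idle_s":      [15, 25, 2, 1, 20, 3],
    "reaction":         ["console", "console", "none", "praise", "console", "praise"],
})

X = log.drop(columns="reaction")
y = log["reaction"]

# Stand-in empathy learner: a decision tree over the observational attributes.
learner = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# At run time, the behavior manager queries the learner with the current situation.
current = pd.DataFrame([{"user_score_trend": -2, "time_in_level_s": 150, "user_idle_s": 18}])
print(learner.predict(current)[0])   # e.g. "console" -> trigger an empathetic utterance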

(Refer Slide Time: 48:24)

Similarly, the CARE framework has not been the only work to generate empathetic behavior
following the data driven approach. As I said, most of the current work, and some of the
previous work, has used the data driven approach to generate the affective, empathetic
behavior.

So, for example, in one work from D'Mello and their group in 2013, they built a very nice
module known as the Affective AutoTutor, where they created an artificial virtual agent,
which I hope you can see here, and programmed this virtual empathetic agent to act in an
empathetic and motivational manner towards students.

For example, those who are feeling frustrated about the module that is being taught and you
know the course that is being taught to them in real time. And of course, the way they were
doing it, as I said, it was entirely data driven. So, the system of course, prepared in advance a
set of facial, prosodic and verbal responses. What sort of facial, prosodic and verbal
responses?

Of course, they were able to look at you know the response of the teachers. For example, that
how the teachers respond, what are the type of facial expression they make when they see
their students are struggling, what is the type of the speech modulation that they do when
they see their students are struggling and what is the type of the verbal response that they
give you know when they see their students are struggling.

So, for example, they may say that, ok I know that you know this material maybe can be a bit
difficult, but ok, do not worry, like we can do it together and let me help you understand it for
example. So, all these things were programmed in advance for this affective agent.

(Refer Slide Time: 50:10)

And then of course, in order to do it in real time, they were detecting, for example, if you
look at the, now I hope that this particular diagram is visible to you. So, then they were
looking at many different sorts of cues of the users, such as for example, they were looking at
their facial expressions, right. So, for example, they were looking at their facial expressions,
they were also looking at their gaze patterns and of course, you know they were looking at
their postures, body postures also you know.

So, for example, they were able to monitor the movements within the chair where the user
was sitting, and they were looking at all the different types of body postures and facial
expressions, and of course they had access to the conversational cues as well, in terms of the
voice data. And by observing all these things, what were they able to do? They created a rule
based scheme to select the proper response.

So, as simple as that: if, for example, the pitch, the tone or the energy in the user's voice looks
like this, maybe the user is frustrated, so generate this particular type of response.
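A minimal sketch of what such an if-then-else rule scheme could look like (purely illustrative; the thresholds, cue names and canned responses are my own assumptions, not the actual AutoTutor rules):

def select_tutor_response(mean_pitch_hz: float, mean_energy: float,
                          leaning_back: bool) -> str:
    # Crude frustration heuristic from prosody and posture (illustrative thresholds).
    frustrated = mean_pitch_hz > 220 and mean_energy > 0.08
    disengaged = leaning_back and mean_energy < 0.02

    if frustrated:
        return ("I know this material can be difficult, but don't worry, "
                "we can work through it together.")
    if disengaged:
        return "Let's try a quicker example to get back into it."
    return "Great, let's move on to the next step."

# Example: high pitch and energy detected for the current student turn.
print(select_tutor_response(mean_pitch_hz=250, mean_energy=0.12, leaning_back=False))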

So, this can be an if-then-else kind of rule. And they actually evaluated this model, and they
were able to show with the help of the results that the students who had low prior knowledge
of the subject gained more knowledge with the help of this Affective AutoTutor in
comparison to a neutral version of it.

So, basically, since the tutor was affective, empathetic, maybe the students were able to gain
more understanding, right. So, this is another pretty interesting study where it was entirely
data driven: they were able to look at previous data and generate an affective tutor which was
empathetic in its responses, perfect.

(Refer Slide Time: 51:58)

So, now I hope you understand that there are two different ways in which the community, the
previous research, has been able to generate artificial empathy. One is, of course, by making
use of a computational model of empathy, and the other is the data driven approach.

Now, let us try to understand some of the limitations of both these approaches. For the
computational model of the empathy, the very first problem is that empathy itself is a very
complex construct. It is expressed through multiple multi-modal behavioral cues, and it
involves at least two individuals to show an empathetic behavior; remember, a doctor and a
patient, a customer care agent and a client, for example, and in many cases it can be more,
such as when there is a group conversation happening.

So, this is a very complex scenario, and hence it makes the entire job very challenging: being
able to understand the group dynamics that are happening, what is happening when there is a
(Refer Time: 53:03) conversation taking place, and what the cognitive modelling behind it is.

So, it has not been very easy to understand, and in order to understand it there are many
different things that have to be looked at. Of course, one has to understand what the stimuli
are that trigger the empathetic reaction, and then how the behavior that is there between two
agents, or within a group, can be perceived.

Of course, we already see that physiological and the behavioral cues and these kind of things,
but of course, what could be the most feasible and optimal way to do it. Of course, once we
have understood the perception of it the behavior, then we need to understand how can we
you know modulate the empathy by looking at very different factors such as for example, the
demographic information, the severity for example, of the emotional state and so on so forth.

And then, of course, once we have understood this, we also need to look at how the empathy
can be expressed. So, these are the different things that need to be looked at in order to create
an empathetic response or behaviour, and all of this together is quite complex. We also have
to understand that the definition of empathy itself is not very commonly agreed upon.

Of course, there are some definitions that researchers have been using in different contexts,
but neither has the definition of empathy been agreed upon, nor is there a very established
understanding of what the cognitive model of empathy is. And hence, to make computational
use of that model, or to put it into virtual agents, has been a bit of a challenging task so far.

(Refer Slide Time: 54:48)

So, that is about the computational modelling of the empathy. Now, what are the problems or
limitations of the other approach, which is the data driven approach? I think you may have
guessed it right: in the case of the data driven approach, the problem is the data itself, right.
And it turns out that researchers have been facing two different types of limitations, both in
terms of the quantity and in terms of the variety or the quality of the data.

Now, what do we mean by the quantity? It turns out that existing works, as we talked about
before, have mostly been making use of audio recordings only, that is, a unimodal system, to
understand the empathetic reaction; that is one issue. But those audio recordings have also
been obtained from only a few large scale psychotherapy studies, totaling maybe thousands of
sessions, not more than that.

And then, you also have to understand that there are lots of ethical issues involved around
getting this kind of data; hence the availability of this type of data has not been very good.
And on top of that, we do not just want the data.

You have to understand that if you want to create a machine learning model for it, then you
want the entire data to be annotated as well, and that is where the other problem is: first, the
data itself is limited, and then whatever data is available is itself not annotated properly.

And when I said it is not annotated properly it means of course, the very first thing that we
want to have is we want to have the annotation of what is the emotional state or the
psychological assessments of the mental and the behavioral state of the target and as well as
for example, the doctor for example, or the customer care employee. And then at the same
time we also need to see that all these have to be very very time synced, right.

We need to have the time synced transcripts, we need to have the time synced speech
segments in order to train and validate a proper machine learning model. So, this is where it
has been fairly limited in terms of quantity. And as I said in terms of the quality also it has
been fairly limited with respect to modalities and also with respect to the scenarios.

So, what do we mean by with respect to modalities? Most of the time, only the audio data has
been made available to the community to analyze and understand, and many times this is due
to confidentiality and privacy issues as well.

Because, for example, in the case of psychological counselling, where most of the work has
been done so far, the video and the physiological data are actually not collected and hence not
available. Nevertheless, the other limitation is with respect to the scenarios: the impact of
empathetic interactions matters in other scenarios as well, such as education, customer service
and medical care.

And basically, in these scenarios you know we can generate more data which is time synced
data which is having multiple modalities also and then where the most of the work can be
done to make more efficient data driven models to generate the empathy, right.

Affective Computing
Prof. Jainendra Shukla
Prof. Abhinav Dhall
Department of Computer Science and Engineering
Indraprastha Institute of Information Technology, Delhi

Week - 09
Lecture - 03
Evoking Empathy

So, when we want to talk about Evoking Empathy in these type of agents, then there are
certain elements that we need to consider and most of the time,

(Refer Slide Time: 00:37)

These are the elements that we would like to manipulate to obtain a particular type of
empathetic response. Number one is of course, we need to look at the type of the behaviours,
that these agents are going to exhibit. We need to look at the appearance and the features.

These 3 things, the type of behaviour, the appearance and the features, are again motivated by
the anthropomorphic design. So, you want to see what different types of behaviours your
robot, for example, is going to have: is it going to speak, is it going to walk, is it going to
gesture, and so on and so forth.

Appearance: is it going to have an appearance like a human, like an animal, like a toy, or a
caricatured one, and so on. Similarly, what type of features is it going to have: is it going to
be, for example, black in color or white in color, and what is going to be its height; for
example, if it is going to look like a human, what will be the height of this human-like
machine or robot or agent, and so on and so forth.

So, mostly these 3 things are already well understood and are being motivated by the
anthropomorphic design that we just saw. Many a times of course, we also want to look into
the context and the situation that is characterizing the occurrence of the event, which is
leading to a particular type of emotion.

So, for example, you may want to look into this: if you are trying to create a website for a user and you have an embodied agent as a chatbot for the user on the website, then why exactly is your user going to show a particular type of emotion towards this chatbot, this embodied agent that you have placed on the site?

Maybe there could be a context where the agent was doing very well and it helped the user in a particular situation, and accordingly a particular type of emotion can be evoked in the humans when they are interacting with this type of embodied agent. Then there are different types of mediating factors that can also be considered and manipulated.

So, for example, we can also take into account whether there is some type of relationship between the observer and the target, for example, between the agent here and the human in whom you are trying to evoke a particular type of emotion by looking at this particular type of agent, as simple as that.

So, for example, your user is a female who is coming to the site with some query. Accordingly, maybe you want to create an embodied agent which is going to look more comfortable to this female user; it could be male, it could be female, depending upon the characteristics that you think can be helpful in making it more appealing to that particular user.

Then there are other mediating factors. For example, you may want to look into the mood of the observer itself, right. So, if you are looking at the user, then what is the mood of the user, in general, when they are trying to access the particular service.

So, these are certain elements that you want to consider and manipulate when you want to evoke an empathetic response among humans, for example, or among other agents, while they are observing a particular agent. Next, there is a very important question: how are we going to measure the empathetic responses that occur in these or, for example, in any other type of situation?

We are going to talk about this in a bit more detail in the upcoming slides. But in general, these are the 5 different things that you may want to look into while trying to evoke an empathetic response.

(Refer Slide Time: 04:47)

We will see some examples and it will become a bit clearer. The first example that we may want to look into is how we are going to evoke an empathetic response with virtual agents, right. So, on the right hand side, what you are looking at is a figure from a paper by Slater, M. and his group, which is a paper on bystander responses to a violent incident in an immersive virtual environment.

Now, the idea is quite simple here. We all know that we see lots of violent incidents while passing through our daily life. But it does not always happen that, whenever there is a violent incident, we intervene and try to stop it, or that the public gets involved and tries to stop that particular incident.

And it turns out that, among many different things, social identity is very, very critical when we want to analyse the bystander's response in these situations. What this means is that it has been shown, for example, that if there is a conflict happening with a particular individual or a particular group that you associate yourself with, then chances are high that you are going to get involved.

So, for example, if you, as a student, are passing along a street and you see some other student being harassed or bullied by a group of, I do not know, adults or something like that, then chances are high that you are going to intervene, because you associate yourself as a student; there is an association of identity, a social identity of being a student that you both share.

Similarly, it can be on the basis of race, religion and so on. Of course, the thing is that it is very, very hard to analyse and understand this particular type of response in real settings. You cannot just create these settings and analyse them; it could be quite unethical.

So, what they did in this paper is that they created two different types of agents. For example, there is one agent, and maybe you can see that there is a particular logo here. This logo signifies that this individual is a supporter of a particular football club.

And, of course, the other individual is a supporter of a different football club. That is the social identity they are sharing, and it is clearly visible on the type of T-shirt they are wearing. For example, you can think of one as the supporter of football club X and the other as the supporter of football club Y.

Now, this was the scenario presented to the human participants, and the question is very simple: how many times are the humans going to respond or intervene when there is this type of conflict, which slowly escalates from verbal aggression to maybe very, very physical aggression where there is pushing?

And then they tried to look into when exactly the participants would be intervening. It turns out, for example, that the more the participants perceived that the victim was looking to them for help, the greater the number of interventions that occurred in the in-group rather than in the out-group.

So, for example, if one individual was a supporter of football club X, the other individual was a supporter of football club Y, and the supporter of X was being harassed, then when the participant also identified themselves as a supporter of football club X, the participants showed more interest in intervening in the situation than when the victim belonged to, let us say, the other group.

So, it clearly validated the hypothesis that social identity is critical as an explanatory variable in trying to understand the bystander's response. This is a very nice example of how virtual agents can evoke empathy among humans through their design, their appearance and their behaviour.

You can see, for example, that in this case this particular agent had an appearance which made it look like a fan of a particular football club, and if you also identify yourself as a fan of that particular football club, then you are willing to intervene and help this particular agent. That is the type of emotional response or empathetic behaviour that you are going to show towards this virtual agent.

And the only reason you are able to do that is because the virtual agent evoked that particular response in you with its own particular type of design, appearance and behaviour, right. So, this is a very good example of how empathy can be evoked by virtual agents.

(Refer Slide Time: 10:23)

Here we have another very good example of how empathy can be evoked by robots. For those who do not know about this particular robot, it is the Kismet robot. The Kismet robot has a very, very expressive mechanical face with lots of anthropomorphic features.

And what we are going to do is look at a particular video of the Kismet robot, in which Professor Cynthia Breazeal talks about how this robot was created and what different capabilities it has.

That will give you an idea of how, through its mechanical expressions, Kismet can evoke some sort of emotional response among the observers or the people it is interacting with. So, let us try to look at the video now.

(Refer Slide Time: 11:34)

So, here we have the video. Kismet is an anthropomorphic robotic head that is specialized for
face-to-face interaction between humans and this robot. Kismet can express in three
modalities; one is through tone of voice. So, we can actually have the robot sound angry
when it is angry, sound sad when it is sad and so forth.

Another is through facial expression, which we have talked about: smiling when it is happy, frowning when it is sad and so forth. And body posture is also critical: approaching, leaning forward when it likes something, withdrawing when it does not like something.

(Refer Slide Time: 12:25)

So, another important skill for the robot, to be able to learn from people, is being able to recognize communicative attempts. And the way we have done that with Kismet right now is to have the robot recognize, by tone of voice, whether you are praising it or scolding it. We then have the robot give the person expressive feedback: for the case of praise, the robot smiles.

Look at my smile. For the case of prohibition, the robot frowns. Yeah, I do. Where did you put your body? For an attentional bid, the robot perks up. Hey, Kismet, ok (Refer Time: 14:15) Kismet, do you like the toy?

So, again, to close the loop, it is critical not only that the robot elicits this kind of prosody that people will naturally give, but also that people can actually see from the robot's expression and face that the robot understood. One very critical point of Kismet is that its responses have to be well matched to what people expect and to what is familiar to people. By doing so, we make the robot's behavior intuitively understandable to people, so they know how to react to it and shape their responses to it.

By following ideas and theories from psychology, from developmental psychology, from
evolution, from all of the study of natural systems and putting these theories into the robot
has the advantage of making the robot's behavior familiar, because it is essentially life like.

I like you, Kismet, you are a pretty funny person.

(Refer Time: 15:20).

Do you laugh at all? I laugh a lot (Refer Time: 15:23) I laugh a lot.

(Refer Time: 15:27).

I kind of laugh a lot.

(Refer Time: 15:30).

Ok, it is very adorable.

Yeah, I do.

Who are you? What are you?

(Refer Time: 15:44).

I want to show you something.

(Refer Time: 15:47).

(Refer Slide Time: 15:49)

This is a watch that my; this is a watch that my girlfriend gave me.

(Refer Time: 15:53).

Yeah, look, it has got a little blue light in it too. I almost lost it this week.

I do not know how you do that. You know what it is like to lose something?

We do not, too.

You can borrow.

(Refer Time: 16:13).

Oh, I think there is something here between us.

No.

Stop, you gotta let me talk. Shh, shh, shh. Kismet, I think we got something going on here.

(Refer Time: 16:26).

You and me. You're amazing (Refer Time: 16:31).

So, now you can see how the Kismet robot is interacting with the humans, how it is able to show some empathetic responses to the humans and how, in turn, it is able to elicit some empathetic responses from the humans. It is quite mechanical, but it is very, very anthropomorphic in this sense, right. So, this is another example of how robots can generate empathy.

(Refer Slide Time: 17:00)

So, we first looked into how empathy can be evoked by the agents among the humans and how they can manipulate the emotional response of the humans in their own favour or as per the situation. Now, let us try to understand how we can generate some empathy in the virtual and robotic agents as per the emotional state or the response of the humans.

(Refer Slide Time: 17:31)

So, now we are going to talk about the second type of agents, which respond emotionally to situations in a way that is more congruent with the user's or, for example, another agent's emotional situation. That is the second type of agent that we are going to look into.

(Refer Slide Time: 17:45)

And here we are going to look into a very famous example of this, the empathic companion, which was a paper by Helmut Prendinger and Mitsuru Ishizuka. They talked about a character-based interface which can measure the affective information of the user and can even address the user's affect by employing an embodied character.

On the right, I hope you can see this diagram, there is a character-based interface with a virtual interview scenario. In this virtual interview scenario, certain questions are presented to the candidate, and you, as the candidate, can answer them. While these questions are being presented to you, the physiological data of the candidate, which in this case includes the skin conductance and the electromyography (EMG), is being analysed in real time.

So, it is being captured, processed and classified, and with this they are trying to understand the emotional state of the user. As per the emotional state of the user, the character then tries to give a certain response; it tries to generate an empathetic response to that particular situation.

So, for example, maybe a question is posed to you such as how long you have been working, whether you are a fresher or whether you have some professional experience, and then you have certain options. Of course, we all agree that this is not a very comfortable question, especially when you do not have a lot of relevant experience.

In that situation, maybe your skin conductance and EMG are going to indicate that you are experiencing stress or a negative emotion. Identifying this, the character is able to give an empathetic response by saying, for example, it seems that you did not like this question so much, or maybe you are under stress, or let me ask a different question, as per the situation.
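
To make this concrete, here is a minimal Python sketch of the kind of sense-classify-respond loop such an empathic companion could run. The feature names, thresholds and the read_physiology() helper are hypothetical placeholders, not the actual implementation from the paper; they only illustrate mapping real-time skin conductance and EMG readings to an empathetic utterance.

    import random

    def read_physiology():
        # Hypothetical sensor read: skin conductance (microsiemens) and EMG (millivolts).
        # In a real system this would come from the biosensor hardware.
        return {"scl": random.uniform(1.0, 10.0), "emg": random.uniform(0.1, 2.0)}

    def classify_affect(sample, scl_threshold=6.0, emg_threshold=1.2):
        # Very rough rule-based stand-in for the real-time classifier:
        # high skin conductance together with high EMG is treated as stressed.
        if sample["scl"] > scl_threshold and sample["emg"] > emg_threshold:
            return "stressed"
        return "neutral"

    def empathic_feedback(state):
        # The embodied character adapts its utterance to the inferred state.
        if state == "stressed":
            return "It seems this question made you uncomfortable. Let me ask a different one."
        return "Great, let us move on to the next question."

    for question in ["Do you have professional experience?", "Why should we hire you?"]:
        print("Interviewer:", question)
        print("Companion:", empathic_feedback(classify_affect(read_physiology())))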

So, this was the type of interaction that happened in this character-based interface, and it turns out that whenever the character provided this empathic feedback to the user, it had a positive effect on the interviewee's stress level.

Of course, you will have to go to the paper to read more about it, but what they showed is that when they compared the empathic feedback with non-empathic feedback, or with no feedback at all, the overall interviewee stress level was lower than when no empathic feedback was provided.

So again, a very good example of how an agent can observe your emotional state and adapt to it in order to make you feel a bit less stressed. That is the empathic companion; I will definitely invite you to go ahead and look at the paper to get more details.

(Refer Slide Time: 21:25)

Here we have another very good example, this time of a robot. This robot is known as the Shimi robot and it is from Georgia Tech, and we are again going to look at a video of this robot in order to understand what it does.

(Refer Slide Time: 21:47)

Hey, Shimi, can you sing Opera?

(Refer Slide Time: 21:49)

Right (Refer Time: 21:52).

So, Shimi is a personal robot that can communicate with humans, but its communication is driven by music. Everything we do here at the Center for Music Technology at Georgia Tech is driven by music, and the way he communicates, both verbally, with audio, and with gestures, is based on deep learning analysis of music datasets and motion datasets, which allows him to analyze the emotions in our speech and actually respond in an emotional way to us.

Hey, Shimi, I had a great day of work today, I got a promotion and I am feeling so good.

(Refer Time: 22:44).

Hey, Shimi, I am feeling really down in the dumps today, I am pretty sad.

(Refer Time: 23:00).

Shimi will understand your emotion based on how you speak and respond with this kind of emotional response, both in gestures and in voice, allowing you to have a companion, a companion that is driven by music. What we did in order to let him understand emotions and project emotion is that we analyzed datasets of musicians playing angry music, sad music, happy music, and we put it into a deep learning system powered by NVIDIA.

That system tries to capture features from these kinds of musical phrases, and this is what drives Shimi's contour, prosody and rhythm and the way he actually moves and speaks, because we feel that music is a great medium for projecting emotions. And if Shimi's communication is abstract like music, but also emotional like music, we feel that this can avoid the uncanny valley and allow for great interaction.

(Refer Slide Time: 24:11)

Ok. So, I hope that you enjoyed the video of the robot.

(Refer Slide Time: 24:18)

So, what does this Shimi robot do? It tries to understand the emotional state of the human who is interacting with it and then it tries to respond accordingly to that particular emotion. One interesting thing about the Shimi robot is that it does not use the verbal language that we use; rather, it uses a musical language which is based on some indigenous languages from Australia.

And now you have seen the type of musical response that the robot had. As of now, the Shimi robot's capabilities are a bit limited, in the sense that it only looks at valence and arousal, and by looking at the valence and the arousal it is able to classify or understand only four different emotional states on the valence-arousal scale.

So, you can quickly figure out that it tries to analyse the valence by semantically analysing the spoken language, looking for words that represent positive and negative feelings. For example, "I had a bad day": there is the word "bad" in it, and so on.
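
As a rough illustration of this idea, here is a minimal Python sketch, under my own assumptions rather than Shimi's actual implementation, of scoring valence from positive and negative keywords and mapping a valence-arousal pair to one of four quadrant emotions. The word lists and the arousal value are hypothetical placeholders; in practice arousal would come from prosody.

    POSITIVE = {"great", "good", "happy", "promotion", "awesome"}
    NEGATIVE = {"bad", "sad", "down", "terrible", "angry"}

    def valence_from_text(utterance):
        # Keyword-based valence in [-1, 1]: +1 per positive word, -1 per negative word.
        words = utterance.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return max(-1.0, min(1.0, score / 3.0))  # crude normalisation

    def quadrant(valence, arousal):
        # Map a (valence, arousal) pair to one of four coarse emotional states.
        if valence >= 0:
            return "happy/excited" if arousal >= 0 else "calm/content"
        return "angry/stressed" if arousal >= 0 else "sad/depressed"

    # Arousal values here are made up for the example.
    print(quadrant(valence_from_text("I had a bad day"), arousal=-0.4))          # sad/depressed
    print(quadrant(valence_from_text("I am feeling so good today"), arousal=0.7))  # happy/excited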

(Refer Slide Time: 25:43)

So, we have already talked about how emotions can be evoked by virtual agents and how virtual agents can respond to the emotions that are evoked in humans. Now, let us try to look into how we can have more empathetic responses, which go beyond just the analysis of the emotional states.

(Refer Slide Time: 26:16)

So, we first have to understand that emotions are usually beyond just the basic emotions. In this sense, the agent's ability to perceive things which are beyond emotions, such as beliefs, desires and intentions, can be quite limited, and these do have a very important role in the emotional responses that are being evoked or generated.

As of now, most of the focus of the affective computing community is on representing the basic emotional states, which are quickly activated, short and quite focused. An example could be Ekman's basic emotions or, for example, what we saw in the case of the Shimi robot.

And then there are other affective states, such as mood, personality, emotional intelligence, etcetera, which also play a very, very important role in evoking emotions among humans or in evoking an empathic response.

So now what we definitely want to understand is whether we can look into any of these factors as well, while we are trying to evoke an emotional response or while we are trying to provide empathic feedback.

(Refer Slide Time: 27:45)

In order to do so, we would like to look at a very interesting term, which is the theory of mind. Theory of mind is basically the capacity to understand other people by ascribing mental states to them. So, basically, this is the idea that we do not only know about ourselves, but we also acknowledge that others' mental states could be different from our own.

Hence their desires, their beliefs, their intentions, their thoughts and their emotions can be different from ours, and their knowledge in general is different from our knowledge. This acknowledgement itself is known as the theory of mind. And why is it helpful?

The theory of mind is helpful because it allows you to infer the intentions of others and to understand what exactly is going on in someone else's mind, including what they are hopeful about, what their beliefs are, what their expectations are and maybe what they fear.

It is a very, very interesting and important paradigm in psychology, and to test this theory of mind there is a very influential experiment known as the false belief test, which is usually done with kids to understand to what extent they possess this theory of mind.

(Refer Slide Time: 29:28)

So, I will let this video play and let you go through this false belief test in order to understand what it is all about. I hope that you enjoyed the video.

Now, in the false belief test, essentially what you saw is that, of the two characters, one kid is able to understand what the other person will be thinking, while the other kid is not able to understand or apprehend what the other person will do in that situation, what the other person will think about, for example, where the trolley is.

(Refer Slide Time: 30:13)

Now, motivated by this, what can we do? We can create an affective agent which also possesses this theory of mind, so that it can, in a sense, have mind-reading skills. And why do we want to give this sort of mind-reading skill to the agent? Because then it can understand not only its own beliefs, desires and intentions, but also, for example, what other humans' beliefs and misconceptions may be.

What are their intentions, which could be valid or invalid? Accordingly, it can come up with the logic of how to best help the humans obtain, for example, the object or goal of their desire. If the agent understands what their beliefs and intentions are, then the robot can accordingly help the humans obtain that particular object or goal.

(Refer Slide Time: 31:17)

And for this, of course, they need to share attention between the two of them. There is again a very nice paper by Cynthia Breazeal and her group which talks about Leonardo, the humanoid robot, and I would like to again play a video so that you can understand what I am talking about.

(Refer Slide Time: 31:44)

In this video the robot Leonardo demonstrates his ability to recognize the intentions of his human partners, even when their actions are based on incorrect information. Leo keeps track of objects in his environment based on data from his sensors.

(Refer Slide Time: 31:56)

(Refer Slide Time: 31:59)

At the same time Leo also models the individual perspectives of his human partners. Here
everyone watches as Jesse places cookies in the box on the right and chips in the box on the
left. Since both people are present, everyone's beliefs are the same.

(Refer Slide Time: 32:14)

Leo's cognitive architecture based on ideas from psychology known as simulation theory,
reuses its own core mechanisms of behavior generation to understand and predict the
behavior of others.

(Refer Slide Time: 32:26)

In this demonstration Leo tracks sensory data from an optical motion capture system. This
same data is presented to duplicate systems, which represent the unique visual perspectives of
his human partners. Now, as Matt leaves the room Jesse decides to play a trick on him and
switches the locations of the two snacks.

(Refer Slide Time: 32:51)

Since Matt is absent Leo only updates his model of Jesse's beliefs. Now, Jesse seals the boxes
with combination locks, preventing easy access to the snacks. When Matt returns hungry for
a bag of chips, he tries to guess the combination to the box where he remembers seeing the
chips. As Leo watches Matt reaching for the lock, he tries to infer Matt's intention by
searching for an activity.

(Refer Slide Time: 33:46)

model that matches the observed motion and task context. Once a matching activity is found,
Leo uses his model of Matt's beliefs to predict what Matt's goal might be.

(Refer Slide Time: 33:57)

Then Leo uses his own model of the true state of the world to search for a way to help Matt
achieve his goal. Having correctly inferred Matt's intention, Leo assists him by opening a box
connected to his control panel providing Matt with the chips he desires. Thanks Leo.

Now, Jesse returns and tries to open the same box. Leo correctly infers that Jesse wants the
cookies, since Jesse is aware of the actual contents of the boxes. Matt and Jesse both perform
the same physical action, but Leo's ability to model their individual beliefs, allows him to
correctly assist them in achieving their different goals.

(Refer Slide Time: 35:04)

(Refer Slide Time: 35:16)

So, basically, you saw in the video how Leonardo was able to understand what the desire of the human was. This particular individual was looking for a packet of cookies, and since Leonardo knew where the packet was, it was able to help the human obtain it.

This is how the robot is able to do so: the robot reuses its own belief construction system from the visual perspective of the human, and it actually predicts what the human believes at that particular point of time and whether that belief is true or not.

It does this by looking at all the visual sensors that it has and applying the theory of mind, and in doing so it is able to recognize and reason about what exactly the individual wants at that particular point of time and accordingly help the user obtain that particular goal, in this case getting a packet of cookies.

Hence, it is really important: if we can enable the robot with such a capability, then the robot can not only generate an empathetic response, but it will be able to generate a response which also takes into account the beliefs, the desires and the intentions that the humans have. Perfect.
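
To illustrate the flavour of this belief modelling, here is a minimal Python sketch, entirely my own simplification of the idea rather than Leonardo's actual architecture. It keeps one belief store per observer and only updates the beliefs of people who were present when an object moved; helping then means acting on the true world state while interpreting the request against the requester's possibly false belief.

    class BeliefTracker:
        def __init__(self, agents):
            self.world = {}                          # true locations: object -> box
            self.beliefs = {a: {} for a in agents}   # each agent's believed locations

        def place(self, obj, box, present):
            # Record where an object really is; only observers who are present update their beliefs.
            self.world[obj] = box
            for agent in present:
                self.beliefs[agent][obj] = box

        def infer_goal(self, agent, reached_box):
            # Guess which object the agent wants, based on what *they* believe is in the box.
            for obj, box in self.beliefs[agent].items():
                if box == reached_box:
                    return obj
            return None

        def help(self, agent, reached_box):
            goal = self.infer_goal(agent, reached_box)
            if goal is None:
                return "cannot infer the goal"
            return f"open the {self.world[goal]} (it really holds the {goal})"

    tracker = BeliefTracker(["Matt", "Jesse"])
    tracker.place("chips", "left box", present=["Matt", "Jesse"])
    tracker.place("cookies", "right box", present=["Matt", "Jesse"])
    # Matt leaves; Jesse swaps the snacks, so only Jesse's beliefs are updated.
    tracker.place("chips", "right box", present=["Jesse"])
    tracker.place("cookies", "left box", present=["Jesse"])
    print(tracker.help("Matt", "left box"))   # Matt falsely believes the chips are here -> open the right box
    print(tracker.help("Jesse", "left box"))  # Jesse knows the cookies are here -> open the left box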

(Refer Slide Time: 37:05)

So, that was about how we can generate or evoke empathetic responses beyond the basic emotional states. Now we will be talking about how we can assess these empathetic responses.

(Refer Slide Time: 37:22)

And it is not an easy question. By now we know how to create an empathetic agent, what the different types of empathetic agents are, why we want to create them, and so on.

But unless and until there is a performance metric, we do not know how we can measure the success in building these affective agents. In general, it is a really hard problem and there is no consensus yet in the community on how to assess these empathic responses. And, more so, in interactive, online, real-time settings it can be really difficult to judge the implications of the empathy or of its evocation.

One thing that can be done, for example, I briefly mentioned the term Turing test in the beginning, is to look at how well the system is imitating human behavior while trying to generate empathy. The idea is very simple: through anthropomorphic design, we want the agents to imitate human behavior.

And if we are able to create a system which imitates human behavior as well as a human would, then maybe it has passed the Turing test, and this is the best that, as an agent or a machine, it can do. So, that can be one criterion: is the agent able to evoke an empathetic response to the extent that a human could have done? If it is able to do so, it has passed the Turing test, and that is a very good measure of its effectiveness.

But then that can be really hard, and before we even get there we may, of course, get stuck in the uncanny valley itself, and so on. So, there are different psychological benchmarks that we can look into. For example, we can look into how autonomous the agents are. No matter what type of agent we are talking about, before even looking at its empathetic responses, the first thing we may want to check is

whether the agents are autonomous, or whether the empathetic responses that they are generating are non-autonomous and being controlled by other humans, because, of course, if the response is being controlled by other humans then maybe it is not very empathetic, right; it is a very artificial response.

But, of course, while doing so, you may have to answer the question of whether the humans themselves are autonomous. Without going too much into the psychology of this, sociobiologists have one theory about it and moral researchers have another. The sociobiologists, for example, say that

everything that is done is controlled by the genes and is the result of evolution, and hence humans may not be autonomous. But then come the moral researchers and philosophers, like Aristotle and Socrates, who say that

if everything is the result of genes and evolution, then of course humans cannot have autonomy, and if humans cannot have autonomy, then they cannot be held morally accountable. Nevertheless, without going into the psychological discussion here, we may want to see whether the responses that the agents are generating are autonomous or not; that is one criterion.

The other thing, for example, that we can do is look into imitation as well. Imitation is as simple as this: if we like a particular character, if we like the behavior of a particular character, even from movies or series, we start imitating them, maybe sub-consciously. This can be a very good criterion to look at:

whether the humans are imitating the humanoid robots, the machines or the services that you are creating, and if so, how that compares with human-to-human imitation, whether the imitation is of the same extent or less. This can become a very good criterion for trying to judge how effective the empathetic response or the empathetic interaction of the agent was, or is.

And, of course, all this can be evaluated in terms of, for example, a Likert-type scale, where you can give a score of 1 to 9: for example, 9 may represent completely autonomous and 1 may represent zero autonomy, or 9 can represent that they are imitating 100 percent and 1 can represent that they are imitating 0 percent, and so on.
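
As a small, hedged illustration of how such Likert-style benchmark ratings might be aggregated, here is a Python sketch under my own assumptions; the benchmark names and ratings are made up for the example, not taken from any published evaluation.

    from statistics import mean, stdev

    # Hypothetical 1-9 Likert ratings from five participants per psychological benchmark.
    ratings = {
        "autonomy":             [7, 6, 8, 7, 6],
        "imitation":            [5, 6, 5, 7, 6],
        "moral_values":         [4, 5, 4, 3, 5],
        "moral_accountability": [3, 4, 4, 3, 2],
        "privacy_comfort":      [6, 7, 6, 5, 7],
        "reciprocity":          [7, 8, 7, 6, 8],
    }

    def summarise(scores):
        # Mean and spread of a benchmark, plus a normalised 0-1 score for comparison.
        return {
            "mean": round(mean(scores), 2),
            "sd": round(stdev(scores), 2),
            "normalised": round((mean(scores) - 1) / 8, 2),  # map 1-9 onto 0-1
        }

    for benchmark, scores in ratings.items():
        print(benchmark, summarise(scores))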

(Refer Slide Time: 42:33)

Similarly, for example, we can look into moral values. The question is: when humans are interacting with these agents, are they willing to ascribe intrinsic moral values to these humanoid robots or agents, and if so, to what extent? Again, we can have a scale of 1 to 9.

Because if they are willing to ascribe moral values to these robots, then maybe they are thinking that the robot is very human-like, very empathic, and maybe that is where it is very successful in making this interaction very human-like.

Similarly, you can look into moral accountability: when we are ascribing some moral values, is the robot just, is the robot fair, and so on, in its adaptive interactions, and can it be held accountable as well?

For example, when it gives positive feedback, or when something goes wrong, can we say that the agent should be held accountable for it, because the agent made an emotional adaptation that was not supposed to be made and it is causing the human harm?

So, the idea is to what extent people can hold this agent responsible, and on the basis of this moral accountability itself it can again be evaluated on a scale of 1 to 9. Again, when we are talking about this emotional adaptation or this empathetic interaction, we can also look into privacy: to what extent is it invading the privacy of the humans?

For example, in order to understand the emotions, is it looking at the facial expressions, is it looking at the identity of the human, is it looking at the race of the human, and so on? In that sense, is it getting information that it maybe should not get, and to what extent are people comfortable sharing that particular type of information with the robot or with the machine?

So, privacy could be one aspect on the basis of which this particular agent, or these empathic responses, can be evaluated. Another very important criterion could be reciprocity. Reciprocity is as simple as this: usually, when someone is being empathetic with you, you would like to be empathetic with that individual in return.

It is like you behave with an individual to the same extent that the individual behaves with you, right. In that sense, are people willing to reciprocate this behavior with the humanoids, with the robots, as well? If so, then maybe the robot is quite successful in generating an empathetic response, because the humans are treating it like a human; it could be an agent, a machine or a robot, as I said before.

It turns out, of course, that it can be a bit tricky to make use of all these psychological benchmarks. In that case, you can simply use some self-reported questionnaires: for example, you may want to list down the properties of the social robots, the agents or the machines, and get these rated by the humans.

The questionnaire could cover all the psychological benchmarks, or it could be as simple as asking, did you like the empathetic interaction that you had with the agent?

Then, based on these questionnaires, or alternatively through content analysis, you can do the assessment. For example, if there is a conversational agent which is chatting with humans, you may want to look at the transcript, see what type of content is being generated, and judge to what extent it was empathetic and to what extent it was successful.
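
As a toy illustration of such content analysis, here is a short Python sketch based on my own made-up marker phrases rather than any validated coding scheme; it simply counts how many of the agent's turns in a transcript contain an empathetic marker.

    # Hypothetical marker phrases that a human coder might treat as empathetic.
    EMPATHY_MARKERS = ["i understand", "that sounds", "i am sorry", "it seems that you", "take your time"]

    transcript = [
        ("user",  "I am really stressed about this interview."),
        ("agent", "It seems that you are under some stress. Take your time."),
        ("agent", "Next question: what are your salary expectations?"),
        ("user",  "I am not sure."),
        ("agent", "I understand, we can come back to this later."),
    ]

    agent_turns = [text.lower() for speaker, text in transcript if speaker == "agent"]
    empathetic = sum(any(marker in turn for marker in EMPATHY_MARKERS) for turn in agent_turns)
    print(f"{empathetic}/{len(agent_turns)} agent turns contained an empathetic marker")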

So, these are a few ways in which you can evaluate, in general, the empathetic responses of the agents. It is a very fascinating area, there has been limited work on this so far, and accordingly the assessment can be a bit tricky, but this is what we have so far, and hopefully it will improve down the line.

(Refer Slide Time: 47:00)

Perfect. So now we come to the conclusion of the class. To conclude, in this module we talked about empathy and empathetic agents: how empathy can be evoked by virtual agents among humans, and how empathy can be generated in virtual and robotic agents for humans. We also talked about how empathy can be generated beyond the expression of the emotional states, and we briefly looked at how we can assess the empathetic responses.

When we talked about empathy and empathetic agents, we understood that we want to have empathetic agents because the presence of these empathetic responses by agents leads to better, more positive and appropriate interactions. That is where we also learned about anthropomorphic design, the uncanny valley and so on.

When we were talking about how empathy can be evoked by the virtual agents among the humans, we saw that the appearance and the function of a particular agent play a very crucial role in how people are going to perceive it and, accordingly, how they are going to interact with it. This again comes from the anthropomorphic design: you really want to look into the appearance, the functions and the interaction of the machine, the agent or the service with the humans.

While talking about empathy beyond emotional states, we understood that just by analyzing the basic emotional states we may not be able to generate very empathic reactions and interactions, and hence we want the agents to have the ability to perceive the beliefs, desires and intentions of the humans, which can really help them align their empathetic responses with the goals and the desires of the humans.

And we also looked into how the evaluation of these empathetic responses can be done. While there is no general agreement about it, we looked into how, for example, some of the psychological benchmarks can be used on a Likert scale, say 1 to 9 or 0 to 5, to assess the empathetic responses and to try to understand how helpful and how successful they were in being empathetic.

So, with that we finish this module and we will see you in the next module. Great learning.
