Understanding Emotions in Text Using Deep Learning and Big Data
PII: S0747-5632(18)30615-0
DOI: https://doi.org/10.1016/j.chb.2018.12.029
Reference: CHB 5851
Please cite this article as: Chatterjee A., Gupta U., Chinnakotla M.K., Srikanth R., Galley M. & Agrawal
P., Understanding emotions in text using deep learning and big data, Computers in Human Behavior
(2019), doi: https://doi.org/10.1016/j.chb.2018.12.029.
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to
our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and all
legal disclaimers that apply to the journal pertain.
Ankush Chatterjee1, Umang Gupta1, Manoj Kumar Chinnakotla2, Radhakrishnan Srikanth2, Michel Galley3 and Puneet Agrawal1

1. Microsoft, Hyderabad, India
2. Microsoft, Bellevue, USA
3. Microsoft Research, Redmond, USA
Corresponding Author
Puneet Agrawal
Email: [email protected]
Abstract
Big data and deep learning algorithms combined with enormous computing power have paved ways for significant technological advancements. Technology is evolving to anticipate, understand and address our unmet needs. However, to fully meet human needs, machines or computers must deeply understand human behavior, including emotions. Emotions are physiological states generated in humans as a reaction to internal or external events. They are complex and studied across numerous fields, including computer science. As humans, on reading "Why don't you ever text me!", we can interpret it as either a sad or an angry emotion, and the same ambiguity exists for machines as well. Lack of facial expressions and voice modulations makes detecting emotions in text a challenging problem. In this paper, we propose a novel Deep Learning based approach to detect emotions - Happy, Sad and Angry - in textual dialogues. The essence of our approach lies in combining both semantic and sentiment based representations for more accurate emotion detection. We use semi-automated techniques to gather large scale training data with diverse ways of expressing emotions to train our model.
1. Introduction

In order to anticipate our needs, it is essential for machines or computers to be able to deeply understand human behavior. Human behavior is very complex. Culture, social norms, faith and language, among many other things, play a role in defining human behavior. In particular, understanding and expressing emotion is a key element of human behavior. Emotions must be deeply understood by machines and computers to be able to anticipate human needs.
Emotions such as happiness, anger and sadness are physiological states that humans routinely experience. In the field of cognitive computing, where we develop technologies to mimic the functioning of the human brain, understanding emotions is an important area of research [1]. With the growing prominence of messaging platforms like WhatsApp and Twitter, there is increased interaction using textual dialogues. Several digital agents and chat bots on these messaging platforms are currently being used by a large number of online users. The success of these agents depends on their ability to modulate responses based on user emotions, for which it is imperative to be able to detect emotions in textual dialogues and avoid responding inappropriately [2].
Figure 1: A sample 3-turn conversation from our dataset.
If emotion detection is implemented, the application can take appropriate action in such cases, such as popping up a warning to the user before sending a message. Emotion detection also finds social applications such as flagging content representing bullying, depression etc. from Twitter streams or online fora. Thus, emotion detection in textual dialogue finds several applications in today's online world.
Emotions have been studied by researchers [3], [4], [5] in the fields of psychology, sociology, medicine, computer science etc. for the past several years. Some of the prominent work in understanding and categorizing emotions includes Ekman's six-class categorization [6] and Plutchik's "Wheel of Emotion", which suggested eight primary bipolar emotions [7]. Given the vast nature of emotions, we limit the scope of this work to classifying a user utterance in a textual dialogue as Happy, Sad, Angry or Others.
Even for humans, detecting emotion in a textual dialogue can be difficult; for instance, in Figure 1, the emotion of the user whose messages are on the left could be interpreted as angry or sad. The challenge of understanding emotions is further compounded by the difficulty of understanding context, sarcasm, class size imbalance, natural language ambiguity and rapidly growing Internet slang. However, big data and powerful deep learning algorithms have paved the way for us to attack this problem.
In this paper, we propose an end-to-end trainable deep learning model, called "Sentiment and Semantic Based Emotion Detector (SS-BED)", for detecting emotions in textual dialogues. The essence of our approach lies in leveraging both the sentiment and semantic representations of the user utterance for accurate emotion detection. The motivation behind combining sentiment and semantic representations can be understood from the following example. Consider the utterance "On road again... miss my amazing partner though!". This utterance contains a negative sentiment word 'miss' as well as a positive sentiment word 'amazing', but the overall emotion of the utterance is Sad. By combining the sentiment of different words in the utterance with a semantic understanding of the sentence, we can detect the emotion in this case. Hence, we intuitively feel that combining both sentiment and semantic features helps in improving classification accuracy. Our model learns both sentiment and semantic representations of the input words and combines them into a unified representation for the entire utterance, which is used for predicting the emotion.
In real-world conversations with digital agents, we notice that users often express a variety of emotions, such as being nervous about exams, excited about a new job, or feeling sad about a break-up. In such cases, the boundaries between computers and humans blur, and users expect computers to deeply understand human behavior, including emotions. Understanding these emotions and providing an emotionally aware response not only creates a deeper and sustained engagement with users but takes us a step closer to deeply understanding humans and anticipating their psychological needs.

The rest of the paper is organized as follows: Section 2 provides a summary of related work. Section 3 describes our approach (SS-BED) in detail. Our experimental setup is discussed in Section 4 and our results are in Section 5. Finally, Section 6 concludes the paper, followed by future directions for our work.
2. Related Work
A lot of work has happened in the space of image based emotion recognition [8], [9]. However, classifying textual dialogues based on emotions is relatively less explored. (a) Machine Learning Based Approaches: Many early approaches rely on keyword and lexicon based features. Part-of-Speech taggers like the Stanford Parser are also used to exploit the structure of keywords in a sentence. These pattern/dictionary based approaches, although attaining high precision scores, suffer from low recall. A recent work by Yenala et al. [17] on detecting offensive queries points out this issue in such pattern based approaches. For example, an "angry" emotion might be detected
in “The service is plain bullshit!”, due to the keyword ‘bullshit’, but a slight
change of the same sentence will manage to fool a pattern based approach -
“The service is plain horseshit!”. One workaround would be to include the word
'horseshit' in the dictionary of swear words, but having a human incrementally update the dictionary defeats the purpose of an automated approach to detecting emotions. Dimensionality reduction based clustering approaches have also been tried for text documents [18], [19]. However, our problem requires clustering at the sentence/utterance level, and we also have supervised data for identifying these emotional intents, so such techniques may not be directly applicable.
Hasan et al., Purver et al., Suttles et al. and Wang et al. have also harnessed cues from emoticons and hashtags [20], [21], [22], [23], [24]. For example, the hashtag used in the sentence "Summer officially ends today. #sadness" makes it easier to predict the underlying emotion. Other methods [25], [26], [27], [28], [29] rely on extracting statistical features, such as the presence of frequent n-grams, negation, punctuation, emoticons and hashtags, to form representations of sentences, which are then used as input by classifiers such as Decision Trees and SVMs, among others, to predict the output. A more detailed analysis is provided in the work of Canales et al. [30]. Vosoughi et al. extract tweets based on location, time and author and use context to model priors in Bayesian models [31]. However, all of these methods require extensive feature engineering and do not achieve high recall due to the diverse ways of expressing emotions. For example, "Trust me! I am never gonna order again" contains no affective words despite conveying emotion.

(b) Deep Learning Based Approaches: Deep Neural Networks have enjoyed
considerable success in varied tasks in text, speech and image domains. Varia-
tions of Recurrent Neural Networks, such as Long Short Term Memory networks
(LSTM) [38] and BiLSTM [39] have been effective in modeling sequential in-
formation [40]. Also, Convolutional Neural Networks [41] have been a popular
choice in the image domain. The lower layers of the network capture local fea-
tures whereas higher layers unravel more abstract task based features for the
image. Their introduction to the text domain has proven their ability to decipher abstract concepts from raw signals [42], [43].

Approach | Input | Judgment | Works
Deep Learning | Facial image | Human judged | Wang et al. [8], Dachapally et al. [34]
Non-Deep Learning | Facial image | Human judged | Zhang et al. [9]
Deep Learning | Text | Human judged | Mundra et al. [35], Our Approach
Deep Learning | Text | Automatic | Abdul et al. [36]
Non-Deep Learning | Text | Human judged | Kozareva et al. [12], Strapparava [37], Yan et al. [29], Balahur et al. [10]
Non-Deep Learning | Text | Automatic | Sykora et al. [14], Hasan et al. [20], [21]

Table 1: Comparison of existing emotion detection systems
Recently, approaches which employ Deep Learning for emotion detection in text have been proposed. Zahiri et al. predict emotions in TV show transcripts [44]. Unlike TV shows, textual dialogues are full of spelling errors, Internet slang etc. Abdul et al. and Koper et al. try to understand the emotions of tweets [36], [45]. Tweets often use cues like hashtags, whereas our dataset of textual dialogues is missing such cues. For instance, in the tweet "The moment of the day when you have to start to plaster a smile in your face. #depression", the hashtag makes the emotion explicit; such cues are absent in textual dialogues. The emotion analysis of short texts is closest to the focus of our study [46]. Felbo et al. learn representations based on emojis and use them for emotion detection [47]. The approach is evaluated on tweets, news headlines and self-reported emotional experiences created by a large group of participants. News headlines are short, expressive and written to evoke reactions, for example, "Cisco sues Apple over iPhone name". On the other hand, self-reported emotional experiences often contain key emotion words like anger, sad etc., for instance, "I felt very sad as I saw my father being brought home in a casket". Textual dialogues, in contrast, are informal and laden with misspellings, which pose serious challenges for automatic emotion detection approaches.
Figure 2: The architecture of the Sentiment and Semantic Based Emotion Detector (SS-BED) model.
To the best of our knowledge, the work done by Mundra et al. is the only one which has tackled the problem of emotion detection in English textual dialogues [35]. Hence, we evaluate our technique against their approach. Table 1 summarizes how our work is placed with respect to other emotion detection systems.
3. Our Approach

We model emotion detection as a multi-class classification problem, where a user utterance is classified as belonging to one of four output classes - Happy, Sad, Angry and Others. The architecture of our proposed SS-BED model is shown in Figure 2. Our model uses LSTMs [38], which are effective in processing sequential information. The input user utterance is fed into two LSTM layers using two different word embedding matrices. One layer uses a semantic word embedding, whereas the other layer uses a sentiment word embedding. These two layers learn semantic and sentiment feature representations and encode sequential patterns in the user utterance. These two feature representations are then concatenated and passed to a fully connected network with one hidden layer, which models interactions between these features and outputs probabilities per emotion class. Further details of
training data used to train the model, sentiment and semantic embeddings, and
model training are provided below.
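To make the data flow concrete, the following is a minimal illustrative sketch of this architecture in PyTorch. It is a stand-in, not the authors' implementation (they trained with the Microsoft Cognitive Toolkit), and the layer sizes (50-d SSWE, 100-d GloVe, 64-unit LSTMs and hidden layer) are read off Figure 2; the vocabulary size is a placeholder.

```python
import torch
import torch.nn as nn

class SSBED(nn.Module):
    """Illustrative SS-BED sketch: two LSTM encoders over sentiment (SSWE) and
    semantic (GloVe) embeddings, concatenated and fed to a small classifier."""

    def __init__(self, vocab_size, n_classes=4):
        super().__init__()
        # In practice both embedding tables would be initialized from
        # pretrained SSWE / GloVe vectors rather than learned from scratch.
        self.sswe = nn.Embedding(vocab_size, 50)    # sentiment embeddings
        self.glove = nn.Embedding(vocab_size, 100)  # semantic embeddings
        self.sent_lstm = nn.LSTM(50, 64, batch_first=True)
        self.sem_lstm = nn.LSTM(100, 64, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),        # fully connected hidden layer
            nn.LeakyReLU(),
            nn.Dropout(0.25),          # dropout probability reported in Section 3
            nn.Linear(64, n_classes),  # softmax is applied inside the loss
        )

    def forward(self, token_ids):
        _, (h_sent, _) = self.sent_lstm(self.sswe(token_ids))
        _, (h_sem, _) = self.sem_lstm(self.glove(token_ids))
        joint = torch.cat([h_sent[-1], h_sem[-1]], dim=-1)  # unified representation
        return self.classifier(joint)  # logits over Happy/Sad/Angry/Others
```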
3.1. Training Data Collection

We harness the advantage of big data and crowd intelligence to solve our problem. A large amount of training data is collected using a semi-automated approach. A dataset of 17.62 million tweet conversational pairs, i.e., tweets (Twitter-Qs) and their responses (Twitter-As; collectively referred to as Twitter Q-A pairs below), extracted from the Twitter Firehose and covering the four-year period from 2012 through 2015, is constructed. This data is further cleaned to remove Twitter handles and serves as the base data for our two training data collection techniques.
Technique 1: In this technique, we start with a small set (approximately 300) of annotated utterances per emotion class, obtained by showing a randomly selected sample from Twitter-Qs and Twitter-As to human judges. Using a variation of the model described in [48], we create sentence embeddings for these annotated utterances as well as for Twitter-Qs and Twitter-As. We identify potential candidate utterances for each emotion class using threshold-based cosine similarity between the annotated utterances and Twitter-Qs and Twitter-As. Various heuristics, like the presence of opposite emoticons (for example ":'(" in a potential candidate set for the Happy emotion class) and the length of utterances, are used to further prune the candidate set. The candidate set is then shown to human judges to determine whether or not the utterances belong to the emotion class. Since emotion-class utterances form a very small fraction of a random sample of conversations, our method cuts the number of human judgments required by a factor of five compared to showing a random sample of utterances and choosing emotion class utterances from them.
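A minimal sketch of the candidate selection step follows, assuming sentence embeddings have already been computed; the similarity threshold of 0.8 is illustrative, since the paper does not report the value used.

```python
import numpy as np

def mine_candidates(seed_embs, corpus_embs, corpus_texts, threshold=0.8):
    """Return corpus utterances whose cosine similarity to any annotated
    seed utterance of an emotion class exceeds the threshold."""
    seeds = seed_embs / np.linalg.norm(seed_embs, axis=1, keepdims=True)
    corpus = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = corpus @ seeds.T                 # (N, S) cosine similarities
    keep = sims.max(axis=1) >= threshold    # best match against any seed
    return [text for text, k in zip(corpus_texts, keep) if k]
```

The retained candidates would then pass through the heuristic filters and human judgment described above.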
Technique 2: Starting from the emotion-class utterances gathered by the method described above, we take all the utterances that belong to Twitter-Qs and find their corresponding Twitter-As. These Twitter-As are then aggregated by their frequency and the top Twitter-As are chosen. For example, in the Angry emotion class, "There, there"1 is a popular response among Twitter-As. Twitter-Qs corresponding to these top Twitter-As per emotion class are picked as potential utterances in that class and are further shown to human judges for pruning.
Negative data (belonging to the class Others) is collected by randomly selecting utterances from both Twitter-Qs and Twitter-As. Those which have a high cosine similarity score (using Technique 1) with any of the utterances in the emotion classes (Happy, Sad, Angry) are discarded. We finally obtained 456k utterances in the Others category, 28k for Happy, 34k for Sad, and 36k for Angry.
3.2. Pre-processing

The vocabulary is built from the 17.62 million Twitter conversation pairs. Every out-of-vocabulary word is replaced with a special "UNK" token.
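As a small illustration of this step (the vocabulary below is hypothetical):

```python
def encode(tokens, vocab):
    """Map tokens to ids, replacing out-of-vocabulary words with the "UNK" id."""
    return [vocab.get(t, vocab["UNK"]) for t in tokens]

vocab = {"UNK": 0, "my": 1, "plan": 2, "is": 3, "cancelled": 4}
print(encode(["my", "plan", "is", "kancelled"], vocab))  # [1, 2, 3, 0]
```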
1. A phrase frequently used in the popular American sitcom "The Big Bang Theory".
Embedding | Happy (F1) | Sad (F1) | Angry (F1) | Macro F1
Word2Vec | 64.44 | 74.71 | 59.28 | 66.14
FastText | 64.58 | 76.68 | 59.98 | 67.08
GloVe | 66.11 | 78.99 | 63.79 | 69.63

Table 2: Comparison of different semantic word embeddings used with an LSTM network.
Word pair | GloVe | SSWE
depression, :'( | 0.23 | 0.63
happy, sad | 0.59 | -0.42
best, great | 0.78 | 0.15

Table 3: Comparison of GloVe and SSWE embeddings w.r.t. cosine similarity of different word pairs.
We also normalize emoticons. For example, we convert the utterance "Yeah! :((( My plan is cancelled" into "Yeah! :( My plan is cancelled". This helps us deal with Out of Vocabulary (OOV) issues arising from the infinitely many possible combinations of emoticon characters.
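A minimal sketch of such normalization, covering only the elongated emoticons shown here (the paper's full canonical emoticon set is not specified):

```python
import re

def normalize_emoticons(text):
    """Collapse elongated emoticons to a canonical form, e.g. ':(((' -> ':('."""
    text = re.sub(r":-?\({2,}", ":(", text)
    text = re.sub(r":-?\){2,}", ":)", text)
    return text

print(normalize_emoticons("Yeah! :((( My plan is cancelled"))
# Yeah! :( My plan is cancelled
```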
3.3. Semantic and Sentiment Word Embeddings

For each word in the input utterance, we have multiple options for obtaining the semantic word representation. We try Word2Vec [49], GloVe [50] and FastText [51]. To get the sentiment representations, we consider Sentiment Specific Word Embeddings (SSWE) [52].
Among the semantic embeddings, GloVe gives the best macro F1 score, as shown in Table 2. We also observe that GloVe and SSWE behave very differently; a few examples are in Table 3. SSWE embeddings give a high cosine similarity for "depression" and ":'(", whereas GloVe gives a low score even though the two words have similar sentiment. For the "happy" and "sad" pair, SSWE rightly gives a low score but GloVe outputs a reasonably high score. However, semantically similar words like "best" and "great" have a low cosine similarity with SSWE but a high score from GloVe. Based on these observations, along with our motivation for combining sentiment and semantic features from Section 1, we choose GloVe and SSWE as the embeddings for the Semantic and Sentiment LSTM layers, respectively.
3.4. Model Training

The hyper-parameters we tune include the learning rate, dropout rate, number of epochs to train for, and batch size. Each of these hyper-parameters is randomly selected from a predefined set of values; for each sampled configuration, a model is trained and a held-out set is used for validation.
We use the Microsoft Cognitive Toolkit2 for training SS-BED. The parameters of SS-BED are trained to maximize prediction accuracy given the target labels in the training set. To avoid over-fitting and to generalize learning, dropout [53] is used. We use Cross Entropy with Softmax as our loss function [54] and Stochastic Gradient Descent (SGD) as our learner. The optimal batch size is found to be 4000, with a learning rate of 0.005 and 0.25 as the dropout probability. It is worth noting that, to train sequence classification models, the Microsoft Cognitive Toolkit uses the sum of the lengths of sequences across utterances (not the number of utterances) when forming batches of a particular size.

2. https://www.microsoft.com/en-us/cognitive-toolkit/
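Continuing the earlier PyTorch sketch, a single training step with this configuration might look as follows; this again stands in for the authors' CNTK setup, and the vocabulary size is a placeholder.

```python
import torch
import torch.nn as nn

model = SSBED(vocab_size=50_000)  # SSBED as sketched in Section 3
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
criterion = nn.CrossEntropyLoss()  # "Cross Entropy with Softmax"

def training_step(token_ids, labels):
    optimizer.zero_grad()
    loss = criterion(model(token_ids), labels)  # dropout (p=0.25) active in train mode
    loss.backward()
    optimizer.step()
    return loss.item()
```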
4. Experimental Setup

In this section, we describe details of the evaluation dataset used to compare various techniques and the baseline methods used for comparison.

4.1. Evaluation Dataset
For this task there are three datasets reported in the research literature: (a) the ISEAR dataset3, (b) the SemEval-2007 Affective Text dataset [55] and (c) the WASSA'17 Shared Task on Emotion Intensity [56]. However, we find all these datasets unsuitable for evaluating our task. The ISEAR dataset consists of user reactions when asked to remember a circumstance which aroused certain emotions in them. For example, "When my mother slapped me in the face, I felt anger at that moment." is one of the statements in the ISEAR dataset and has a different form than what one would typically expect in a dialogue. The SemEval-2007 dataset consists of news headlines which are expressive and self-contained. For example, "Cisco sues Apple over iPhone name" is one of the headlines in the dataset. Finally, the WASSA'17 dataset consists of tweets, where emotion detection often hinges on the impact of hashtags - "The moment of the day when you have to start to plaster a smile in your face. #depression".

3. http://www.affective-sciences.org/en/home/research/materials-and-online-research/research-material/
Label | Happy | Sad | Angry | Others | Total
% | 4.90 | 4.81 | 4.04 | 86.25 | 100

Table 4: Emotion class label statistics of the evaluation dataset.
To overcome these challenges, we sample 3-turn conversations from Twitter, i.e., User 1's tweet, User 2's response to the tweet, and User 1's response to User 2. We use the Twitter Firehose to extract these 3-turn conversations covering the year 2016. We sample from conversations where the last turn is the third turn, as well as from those where the third turn is in the middle of the conversation. We follow the necessary pre-processing steps from Section 3.2. Our dataset finally comprises 2226 3-turn conversations along with their emotion class labels (Happy, Sad, Angry, Others) provided by human judges. The details of the dataset along with emotion class label statistics are shown in
Table 4. To gather the emotion class labels, we show the third turn of the conversation, along with the context of the previous 2 turns, to human judges and ask them to mark the emotion of the third turn after considering the context. To gather high quality judgments, each conversation is shown to 5 judges, and a majority vote is taken to decide the emotion class. After several rounds of training and auditing on mock sets, the final inter-annotator agreement based on Fleiss' kappa [57] is found to be 0.59. This kappa value, while slightly on the lower side, reflects the inherent subjectivity of emotion labeling. This dataset serves as a blind evaluation dataset, on which the models were run and the numbers were reported. This set was not used for any debugging purposes, hence the performance on this set provides a reliable measure of how well the models might generalize on unseen data.
# | Feature | Description
2 | WordNet-Affect-Presence | # of direct affective words classified by WordNet-Affect under relevant categories
3 | SSWE | SSWE word embeddings
4 | POS | Part-of-Speech tags
5 | Emoticons | # of happy, sad, angry emoticons
6 | Misc. | # of words, exclamation marks, question marks, sequences of punctuation marks
7 | Negations | Presence of negations

Table 5: Features used for machine learning baselines
4.2. Baseline Methods

As machine learning baselines, we train a Support Vector Machine (SVM) classifier [58], a Gradient Boosted Decision Tree (GBDT) classifier [59] and a Naive Bayes (NB) classifier [59]. The SVM, GBDT and NB classifiers are trained using Scikit-learn [60]. We did extensive feature engineering for the above-mentioned baselines; the feature set is explained in Table 5.
After tuning parameters using the validation set as described in Section 3.4, we find that SVM gives the best performance with a linear kernel and a regularization constant of 0.0005. In the case of GBDT, the best performance is achieved with 50 trees and a minimum of 10 samples per leaf.
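In scikit-learn terms, these settings correspond roughly to the following; whether the reported "regularization constant" maps to the C parameter is our assumption.

```python
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

svm = SVC(kernel="linear", C=0.0005)
gbdt = GradientBoostingClassifier(n_estimators=50, min_samples_leaf=10)
```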
Model | Happy P | Happy R | Happy F1 | Sad P | Sad R | Sad F1 | Angry P | Angry R | Angry F1 | Macro F1 | Micro F1
NB (Feat-1) | 41.35 | 50.46 | 45.45 | 70.87 | 68.22 | 69.52 | 38.16 | 32.22 | 34.94 | 49.97 | 50.81
SVM (Feat-1) | 66.67 | 25.69 | 37.09 | 86.49 | 59.81 | 70.71 | 85.42 | 45.56 | 59.42 | 55.74 | 56.59
GBDT (Feat-1) | 75.76 | 22.94 | 35.21 | 89.47 | 63.55 | 74.31 | 86.00 | 47.78 | 61.43 | 56.98 | 58.49
NB (Feat-2) | 43.27 | 57.36 | 49.33 | 70.83 | 69.10 | 69.95 | 68.26 | 42.96 | 52.73 | 57.34 | 57.42
SVM (Feat-2) | 73.33 | 33.79 | 46.26 | 87.02 | 61.23 | 71.88 | 86.73 | 46.33 | 60.39 | 59.51 | 60.42
GBDT (Feat-2) | 78.46 | 25.00 | 37.92 | 94.25 | 58.92 | 72.51 | 88.98 | 50.26 | 64.23 | 58.22 | 58.95
CNN-NAVA | 63.32 | 42.29 | 50.71 | 79.37 | 68.69 | 73.64 | 67.42 | 45.79 | 54.54 | 59.63 | 60.15
CNN-SSWE | 67.69 | 40.37 | 50.57 | 77.45 | 73.83 | 75.60 | 80.95 | 37.77 | 51.51 | 59.23 | 60.97
CNN-GloVe | 52.29 | 52.29 | 52.29 | 93.72 | 67.29 | 74.61 | 67.82 | 65.55 | 66.66 | 64.52 | 64.93
LSTM-SSWE | 70.69 | 37.61 | 49.10 | 83.87 | 72.89 | 78.00 | 73.24 | 57.77 | 64.60 | 63.90 | 64.77
LSTM-GloVe | 64.18 | 39.45 | 48.86 | 72.88 | 80.37 | 76.44 | 72.15 | 63.33 | 67.45 | 64.25 | 65.26
SS-BED | 69.51 | 52.29 | 59.68 | 85.42 | 76.63 | 80.79 | 87.69 | 63.33 | 73.55 | 71.34 | 71.40

Table 6: Comparison of various models on the evaluation dataset. SS-BED results are statistically significant with p < 0.005.
As our deep learning baseline, we implement the approach defined in [35] (CNN-NAVA in Table 6), as this approach also attempts to understand emotion classes in chat conversations. We also train off-the-shelf CNN and LSTM models with different embeddings, namely GloVe and SSWE.
We use the precision, recall and F1 score of each class, along with macro and micro F1 scores (where macro and micro F1 are calculated over the 3 emotion classes, i.e., Happy, Sad and Angry). Macro F1 is computed from the class-wise averages of Precision and Recall. The individual True Positives, False Positives and False Negatives are summed up to get the micro averages of Precision and Recall.
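The two aggregate scores, as defined here, can be computed as in the following sketch (the per-class counts are illustrative):

```python
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def macro_micro_f1(counts):
    """counts: {class: (tp, fp, fn)} over the three emotion classes."""
    ps = [tp / (tp + fp) for tp, fp, fn in counts.values()]
    rs = [tp / (tp + fn) for tp, fp, fn in counts.values()]
    macro = f1(sum(ps) / len(ps), sum(rs) / len(rs))  # F1 of averaged P and R
    TP = sum(tp for tp, _, _ in counts.values())
    FP = sum(fp for _, fp, _ in counts.values())
    FN = sum(fn for _, _, fn in counts.values())
    micro = f1(TP / (TP + FP), TP / (TP + FN))        # F1 of pooled counts
    return macro, micro

print(macro_micro_f1({"Happy": (40, 20, 35), "Sad": (70, 15, 20), "Angry": (55, 10, 30)}))
```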
5. Results

Figure 3: Comparison of Micro and Macro F1 values of different models.
The results of all models on the evaluation dataset are shown in Table 6; the comparison can be seen more clearly in Figure 3. The performance of SS-BED over all other models is particularly significant (p < 0.005), as measured by McNemar's test [62]. Our results thus indicate that combining sentiment and semantic features in SS-BED outperforms the individual LSTM-SSWE and LSTM-GloVe models. SS-BED is also significantly better than the CNN based approaches, including CNN-NAVA. Also, when comparing across models using Macro and Micro F1 scores, the Deep Learning models outperform NB, SVM and GBDT. Adding a rich set of features helps improve the performance of NB, SVM and GBDT, but they still do not come up to par with the Deep Learning models. In terms of runtime (Figure 4), the SS-BED model, containing two LSTM layers, inevitably takes more time for training as well as inference.
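The significance test can be reproduced with statsmodels; the 2x2 disagreement counts below are purely illustrative, not the paper's.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: SS-BED correct / wrong; columns: baseline correct / wrong,
# counted over the same evaluation utterances.
table = [[1500, 260], [140, 326]]
print(mcnemar(table, exact=False).pvalue)
```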
Table 7 highlights some examples from the evaluation set and compares the performance of our models on these examples. In example #1, we observe that in the absence of keywords with an obvious sentiment polarity associated with them, LSTM-SSWE fails. LSTM-GloVe and SS-BED are able to infer the sadness of the user in this somewhat subtle statement.

Figure 4: Runtime performance of various models during Train and Test phases.

In example #2, the presence of opposite polarity words, "miss" and "amazing", confuses both LSTM-GloVe and
LSTM-SSWE, but SS-BED predicts it correctly by using both semantic and
sentiment feature sets. Similarly, in a complicated and long utterance #3, all
baseline approaches fail, but SS-BED is able to harness the advantage of combining sentiment and semantic features to predict the correct emotion. In some utterances, as in #5, the context of the conversation plays an important role in determining the underlying emotion. SS-BED does not consider context and hence fails, as do all other models.
# | True Label | User 1's tweet | User 2's response | User 1's response | Comment
1 | Sad | Man even food delivery apps in bangalore won't deliver till 6:( | Yea well it is a bandh | Yeah well i do not have anything at home :/ | LSTM-SSWE fails as there is no keyword with an obvious negative polarity, but LSTM-GloVe and SS-BED are correct.
2 | Sad | — | — | On road again... miss my amazing partner though! | Presence of the opposite polarity words 'miss' and 'amazing' confuses both LSTM-GloVe and LSTM-SSWE, and they both fail.
3 | Angry | :) Good for both of us! | It's better not to interact with a girl with so much ego. Attitude is still fine | It is not an ego or attitude. U started first! U asked me stupid ques! :/ | SS-BED is the only model which could correctly predict this rather complicated user utterance.
4 | Angry | Pathetic delivery services. Very disappointed | Sir, can you please state exact problem so that we can work on it. | Yes. I guess your amazing delivery service has not yet arrived. | All models including SS-BED are unable to understand sarcasm and fail in this example.
5 | Happy | I just qualified for... | WOOT! That's... | I started crying... | All models predicted it...

Table 7: Sample conversations from the evaluation dataset and model behavior on them.
5.2. Comparison of Output Encodings

For each model, the final learned representation of an utterance can be viewed as an output encoding: the concatenated representation that is fed to the fully connected network is called the output encoding for SS-BED. In the case of LSTM-SSWE and LSTM-GloVe, the output of the respective LSTM layer serves as the encoding. We denote these encodings by $E_{SSWE}$, $E_{GloVe}$ and $E_{SS\text{-}BED}$.
Encoding | Happy | Sad | Angry
E_SSWE | 0.325 | 0.494 | 0.639
E_SS-BED | 0.504 | 0.546 | 0.689

Table 8: Fraction of top-5 utterances belonging to the same emotion class as the annotated utterance, averaged over each class in the evaluation set.
We sample utterances belonging to each of the classes Happy, Sad and Angry from the evaluation dataset. We use $E_{GloVe}$, $E_{SSWE}$ and $E_{SS\text{-}BED}$ to find the top 5 utterances based on cosine similarity for each of these utterances. These top 5 utterances are fetched from the corpus of 17.62 million tweet pairs described in Section 3.1.
Hence, for a sample utterance $S_{li}$, the $i$-th utterance in our evaluation set belonging to an emotion class $l$, where $l \in \{\text{Happy}, \text{Sad}, \text{Angry}\}$, we find the top 5 most similar utterances. Let these texts be denoted by $\{T^e_{li1}, T^e_{li2}, \ldots, T^e_{li5}\}$, where the encoding vector used is $e \in \{E_{GloVe}, E_{SSWE}, E_{SS\text{-}BED}\}$. We then annotate these 5 utterances via human judges, following which we have the corresponding labels $\{L^e_{li1}, L^e_{li2}, \ldots, L^e_{li5}\}$. Subsequently, we calculate the fraction of these utterances belonging to the emotion class $l$. Let this fraction be denoted by

$$p^e_{li} = \frac{\sum_{k=1}^{5} \mathbb{1}(L^e_{lik} = l)}{5}$$
This number represents the proportion of the 5 retrieved utterances $\{T^e_{li1}, \ldots, T^e_{li5}\}$ that have the same label as the original sample. These fractions are averaged over the $N_l$ sample utterances of an emotion class $l$, producing

$$P^e_l = \frac{\sum_{i=1}^{N_l} p^e_{li}}{N_l}$$
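A sketch of this evaluation for a single sampled utterance (the judged labels shown are illustrative):

```python
def p_fraction(retrieved_labels, l):
    """p^e_li: fraction of the top-5 retrieved utterances judged to have label l."""
    return sum(lab == l for lab in retrieved_labels) / len(retrieved_labels)

def class_average(fractions):
    """P^e_l: average of p^e_li over the N_l sampled utterances of class l."""
    return sum(fractions) / len(fractions)

print(p_fraction(["Sad", "Sad", "Others", "Sad", "Angry"], "Sad"))  # 0.6
```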
Table 8 shows a comparison of the different output encodings across the three emotion classes. We observed that SS-BED's output encoding gives the best metric across all emotion classes. This indicates that by combining semantic and sentiment features in SS-BED, we are able to generate a better representation of input utterances in the output encoding space, as compared to the representations from LSTM-GloVe and LSTM-SSWE.

# | User 1's tweet | User 2's response | User 1's response
3 | I had a match today. | And did you win? | Yes!! And I am super happy :)

Table 9: Sample conversations indicating challenges in Happy emotion class
5.3. Discussion on Ambiguity in Happy Class

Table 9 shows sample conversations that illustrate the challenges in detecting the Happy emotion class.

6. Conclusion
Detecting emotions in textual dialogues is an important problem for many chat-based applications. For this problem, we harness the power of deep learning and big data and propose a Deep Learning based approach called "Sentiment and Semantic Based Emotion Detector (SS-BED)". This approach harnesses both sentiment and semantic based features for more accurately predicting user emotions from their utterances. Evaluation on real world textual dialogue shows that our approach significantly outperforms baseline approaches proposed in the literature, as well as off-the-shelf deep learning and feature engineering based machine learning models.
7. Future Work

As part of our future work, we plan to extend this approach to detect more emotion classes such as Surprise, Fear, Disgust etc. Currently, our model is limited by the fact that it does not train on the context of the dialogue. We plan to train models that also take the dialogue context into account besides the current utterance.
References
[1] J. Thilmany, The emotional robot: Cognitive computing and the quest for artificial intelligence, EMBO Reports Vol. 8, pages 992–994.

[2] A. S. Miner, A. Milstein, S. Schueller, R. Hegde, C. Mangurian, E. Linos, Smartphone-based conversational agents and responses to questions about mental health, interpersonal violence, and physical health, JAMA Internal Medicine Vol. 176, pages 619–625.

[3] R. Plutchik, The psychology and biology of emotion, New York, NY, US: HarperCollins College Publishers, 1994.
[6] P. Ekman, An argument for basic emotions, Cognition & Emotion Vol. 6, pages 169–200.

[7] R. Plutchik, H. Kellerman, Emotion: Theory, research and experience, Academic Press, New York, 1986.

[8] S.-H. Wang, P. Phillips, Z.-C. Dong, Y.-D. Zhang, Intelligent facial emotion recognition based on stationary wavelet entropy and Jaya algorithm, Neurocomputing 272 (2018) 668–676.

[9] Y.-D. Zhang, Z.-J. Yang, H.-M. Lu, X.-X. Zhou, P. Phillips, Q.-M. Liu, S.-H. Wang, Facial emotion recognition based on biorthogonal wavelet entropy, fuzzy support vector machine, and stratified cross validation, IEEE Access 4 (2016) 8375–8385.
[10] A. Balahur, J. M. Hermida, A. Montoyo, Detecting implicit expressions of sentiment in text based on commonsense knowledge, in: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, 2011.

[11] F.-R. Chaumartin, UPAR7: A knowledge-based system for headline sentiment tagging, in: Proceedings of the 4th International Workshop on Semantic Evaluations, 2007.

[13] C. Strapparava, R. Mihalcea, Learning to identify emotions in text, in: 2008 ACM Symposium on Applied Computing, 2008, pp. 1556–1560.
[16] A. Esuli, F. Sebastiani, SentiWordNet: A high-coverage lexical resource for opinion mining, Evaluation (2007) 1–26.

[17] H. Yenala, M. Chinnakotla, J. Goyal, Convolutional bi-directional LSTM for detecting inappropriate query suggestions in web search, in: PAKDD, Springer, 2017, pp. 3–16.

[18] ... Computing (2018) 1–15.

[19] H.-T. Zheng, Z. Wang, W. Wang, A. K. Sangaiah, X. Xiao, C. Zhao, Learning-based topic detection using multiple features, Concurrency and Computation: Practice and Experience 30 (15).
[20] M. Hasan, E. Agu, E. Rundensteiner, Using hashtags as labels for supervised learning of emotions in twitter messages, in: ACM SIGKDD Workshop on Health Informatics, 2014.

[23] J. Suttles, N. Ide, Distant supervision for emotion classification with discrete binary values, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2013.
[25] C. O. Alm, D. Roth, R. Sproat, Emotions from text: Machine learning for text-based emotion prediction, in: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, ACL, 2005, pp. 579–586.

[26] ... classification: A new approach, International Journal of Applied Information Systems Vol. 4, pages 48–53.

[27] D. Davidov, O. Tsur, A. Rappoport, Enhanced sentiment learning using twitter hashtags and smileys, in: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, ACL, 2010, pp. 241–249.

[28] F. Kunneman, C. Liebrecht, A. van den Bosch, The (un)predictability of emotional hashtags in twitter, in: European Chapter of the Association for Computational Linguistics (2014) 26–34.
[36] M. Abdul-Mageed, L. Ungar, EmoNet: Fine-grained emotion detection with gated recurrent neural networks, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1, pages 718–728, 2017.

[37] C. Strapparava, R. Mihalcea, Annotating and identifying emotions in text, in: Intelligent Information Access, 2010, pp. 21–38.
[39] M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing Vol. 45, pages 2673–2681.

[40] N. Liang, H.-T. Zheng, J.-Y. Chen, A. K. Sangaiah, C.-Z. Zhao, TRSDL: Tag-Aware Recommender System Based on Deep Learning - Intelligent Computing Systems, Applied Sciences 8 (5) (2018) 799.

[42] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882.
[46] P. Li, J. Li, F. Sun, P. Wang, Short text emotion analysis based on recurrent neural network, in: Proceedings of the 6th International Conference on Information Engineering, ACM, 2017.

[47] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, S. Lehmann, Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm, in: EMNLP, 2017, pp. 1616–1626.

[48] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, R. Ward, Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval, IEEE/ACM Transactions on Audio, Speech and Language Processing Vol. 24, pages 694–707.
[49] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.

[50] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: EMNLP, Vol. 14, pages 1532–1543, 2014.

[51] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, arXiv preprint arXiv:1607.01759.

[52] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, Learning sentiment-specific word embedding for twitter sentiment classification, in: ACL, 2014, pp. 1555–1565.

[55] C. Strapparava, R. Mihalcea, SemEval-2007 task 14: Affective text, in: Proceedings of the 4th International Workshop on Semantic Evaluations, Association for Computational Linguistics, 2007, pp. 70–74.
[57] P. E. Shrout, J. L. Fleiss, Intraclass correlations: Uses in assessing rater reliability, Psychological Bulletin Vol. 86, page 420.

[58] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning Vol. 20, pages 273–297.

[59] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Springer Series in Statistics, Springer, Berlin, 2001.

[60] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research Vol. 12, pages 2825–2830.
[61] A. Agrawal, A. An, Unsupervised emotion detection from text using semantic and syntactic relations, in: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology, 2012, pp. 346–353.

[63] I. T. Jolliffe, Principal component analysis and factor analysis, in: Principal Component Analysis, Springer, 1986, pp. 115–128.
Acknowledgments

We thank Balakrishnan Santhanam, Jaron Lochner and Rajesh Patel for their support with crowdsourced judgments. We also thank Oussama Elachqar, Chris Brockett, Niranjan Nayak and Kedhar Nath Narahari for helpful brainstorming sessions and comments. Finally, we are grateful to Abhay Prakash and Meghana Joshi for their constant support and guidance.
Highlights

1. Emotion detection in text finds several practical applications, such as modulating the responses of real-world chat-bots.
2. Combining sentiment and semantic information in text improves emotion detection systems.
3. Our approach learns diverse ways of expressing emotions and significantly outperforms methods described in the literature.