Tailieu RGAT

This paper presents a novel approach to emotion recognition in conversations using Relation-aware Graph Attention Networks (RGAT) enhanced with relational position encodings. The proposed model effectively captures both speaker dependencies and sequential information, outperforming state-of-the-art methods across multiple benchmark datasets. The study emphasizes the importance of considering both self- and inter-speaker dependencies in understanding emotional transitions in dialogues.

Uploaded by

Nguyễn Đức Anh (Đức Anh lsb)

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views11 pages

Tailieu RGAT

Uploaded by

Nguyễn Đức Anh (Đức Anh lsb)

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Relation-aware Graph Attention Networks with Relational Position

Encodings for Emotion Recognition in Conversations

Taichi Ishiwatari Yuki Yasuda Taro Miyazaki Jun Goto

NHK Science and Technology Research Laboratories
{ishiwatari.t-fa, yasuda.y-hk, miyazaki.t-jw, goto.j-fw} @nhk.or.jp

Abstract ] Speaker Utterance Emotion

1 A I’m just so tired all the time. Sad
Interest in emotion recognition in conversa- 2 B
Well have you been trying to get a job,
Neutral
tions (ERC) has been increasing in various look for a job or...?
fields, because it can be used to analyze user 3 A I’ve been looking for like eight months. Frustrated
behaviors and detect fake news. Many re- I know., It- It’s really tough out there.,
4 B Frustrated
cent ERC methods use graph-based neural net- It’s really hard to find a job.
works to take the relationships between the ut- I’m tired of the same excuses.,
terances of the speakers into account. In par- 5 A No, no you’re not qualified enough, Frustrated
ticular, the state-of-the-art method considers wish you had more education.
self- and inter-speaker dependencies in con-
6 B Well what are you looking for?, I mean– Neutral
versations by using relational graph attention
7 B Well, okay. Well that’s– Neutral
networks (RGAT). However, graph-based neu-
8 A Cause I went to Harvard. Anger
ral networks do not take sequential informa-
tion into account. In this paper, we propose re-
Table 1: Example for contextual emotion analysis on
lational position encodings that provide RGAT
the IEMOCAP dataset (Busso et al., 2008), which con-
with sequential information reflecting the rela-
tains emotion-labeled utterances in multi-party conver-
tional graph structure. Accordingly, our RGAT
sations.
model can capture both the speaker depen-
dency and the sequential information. Exper-
iments on four ERC datasets show that our
model is beneficial to recognizing emotions able to process long series of information (Brad-
expressed in conversations. In addition, our bury et al., 2016). DialogueRNN tries to make
approach empirically outperforms the state-of- up for this problem by using an attention mecha-
the-art on all of the benchmark datasets.
nism to focus on the relevant utterances in the en-
tire conversation (Majumder et al., 2019). How-
1 Introduction ever, these methods do not take self-dependency
Interest in emotion recognition in conversations or inter-speaker dependency into account. Table 1
(ERC) has been increasing in various fields (Pi- shows the importance of these dependencies, as il-
card, 2010), because it can be used to analyze user lustrated by an example dialogue depicting an ar-
behaviors (Lee and Hong, 2016) and detect fake gument about a job search. Because speaker A
news (Guo et al., 2019). With the recent prolifer- has not been able to find a job for a long time,
ation of social media platforms such as Facebook, his emotional state is consistently negative. In this
Twitter, and YouTube, as well as conversational way, self-dependency is critical to understanding
assistants such as Amazon Alexa, there is a need to his own emotional transitions in the conversation.
study how emotions are expressed in natural con- On the other hand, B’s emotions shift at utterance
versation. ]4 to commiserate on A’s situation. This inter-
Recent research on ERC processes the utter- speaker dependency captures how the utterances
ances of dialogues in sequence by using recurrent of other speakers affect emotions.
neural network (RNN)-based methods (Hochreiter The state-of-the-art method, DialogueGCN
and Schmidhuber, 1997; Chung et al., 2014; Liu (Ghosal et al., 2019), uses relational graph at-
et al., 2016). However, these methods are not tention networks (RGAT) to take the dependency

7360
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 7360–7370,
November 16–20, 2020. c 2020 Association for Computational Linguistics
into account; it is inspired by relational graph an effective representation of other positional vari-
convolutional networks (RGCN) (Schlichtkrull ations with absolute or relative position encodings.
et al., 2018) and graph attention networks (GAT)
(Veličković et al., 2017). This method takes into 2 Related Work
account the conversational context by using a di-
Emotion Recognition in Conversation Several
rected graph, where the nodes denote individual
studies have tackled the ERC task. Hazarika
utterances, the edges represent relationships be-
et al. (2018a,b) used memory networks for rec-
tween pairs of nodes (utterances), and the labels
ognizing humans emotion in conversation, where
of the edges represent the types of relationships.
two distinct memory networks consider the inter-
However, graph-based neural networks do not take
speaker interaction. DialogueRNN (Majumder
sequential information contained in utterances into
et al., 2019) employs an attention mechanism
account. Table 1 also represents the importance of
for grasping the relevant utterance from the en-
the sequential information. B’s emotional change
tire conversation. More related to our method
at utterance ]4 is caused by utterance ]3 rather than
is the DialogueGCN model proposed by Ghosal
]2 or ]1. In this way, human emotions may depend
et al. (2019), in which RGAT is used for model-
on more immediate utterances in the temporal or-
ing both self-dependency and inter-speaker depen-
der, and thus it is essential to take the sequence of
dency. This model has achieved state-of-the-art
utterances into account.
performance on several conversational datasets.
A common response to this issue is to en- On the other hand, as a way of considering contex-
code information about absolute position features tual information, Luo and Wang (2019) proposed
(Vaswani et al., 2017) or relative position fea- to propagate each of the utterances into an embed-
tures (Shaw et al., 2018), where these encodings ded vector. Likewise, a pre-trained BERT model
are added to nodes (utterances) or edges (relation- (Devlin et al., 2018) has been used for generating
ships). However, in order to account for self- and dialogue features to combine several utterances by
inter-speaker dependency, our model focuses on inserting separate tokens (Yang et al., 2019).
relation types rather than nodes (utterances) and
edges (relationships); thus, our position encoding Graph Neural Network Graph-based neural
also focuses on relation types. networks are used in various tasks. The fun-
In this paper, we propose novel position encod- damental model is the graph convolutional net-
ings (relational position encodings) that provide work (GCN) (Kipf and Welling, 2016), which
the RGAT model with sequential information re- uses a fixed adjacency matrix as the edge weight.
flecting relation types. By using the relational po- Our method is based on RGCN (Schlichtkrull
sition encodings, our RGAT model can capture et al., 2018) and GAT (Veličković et al., 2017).
both the speaker dependency and the sequential in- The RGCN model prepares a different structure
formation. Experiments on four ERC benchmark for each relation type and hence considers self-
datasets showed that our relational position en- dependency and inter-speaker dependency sepa-
coding outperformed baselines and state-of-the-art rately. The GAT model uses an attention mech-
methods. In addition, our method outperformed anism to attend to the neighborhood’s representa-
both the absolute and relative position encodings. tions of the utterances.

In summary, our contributions are as follows: Position Encodings In our work, positional in-
(1) For the first time, we apply position encod- formation is added to the graphical structure. Sev-
ings to RGAT to account for sequential informa- eral studies add position encodings to several
tion. (2) We propose relational position encodings structures, such as self-attention networks (SANs)
for the relational graph structure to reflect both se- and GCN. SANs (Vaswani et al., 2017) perform
quential information contained in utterances and the attention operation under the position-unaware
speaker dependency in conversations. (3) We con- assumption, in which the positions of the input are
duct extensive experiments demonstrating that the ignored. In response to this issue, the absolute po-
graphical model with relational position encod- sition (Vaswani et al., 2017) or relative position
ings is beneficial and that our method outperforms (Shaw et al., 2018), or structure position (Wang
state-of-the-art methods on four ERC datasets. (4) et al., 2019) are used to capture the sequential or-
We also empirically demonstrate that our model is der of the input. Similarly, graph-based neural net-

7361
Contextual Embedding Emotion Classification

BERT
Speaker Dependency Modeling

BERT

RGAT
BERT

BERT FFNN

Labels
BERT

Position Encodings
BERT

-2 -1 -1 0 -1 -1 -2
BERT
Concatenation

uttered by Speaker 1 Speaker1 : Self - Past type

: Inter - Past type
uttered by Speaker 2 Speaker2 : Self - Future type
: Inter - Future type

Figure 1: Our entire framework. First, we obtain a contextual embedding for each utterance by using BERT.
Then, we modify this embedding by using RGAT to consider speaker dependency. The position encodings in the
RGAT structure take sequential information into account. Finally, after concatenating the contextual embedding
to the output embedding through RGAT, we classify the concatenated vector into emotion labels by using a fully
connected feed-forward network.

works do not take sequential information. In the logueGCN model does not.
design of proteins, the relative spatial structure be-
tween proteins is modeled in order to account for 3.1 Contextual Utterance Embedding
the complex dependencies in the protein sequence We generate contextual utterance features from the
and is applied to the edges of the graph represen- tokens by following the method in (Luo and Wang,
tations (Ingraham et al., 2019). 2019). First, every utterance u1 , u2 , · · · , uN is
tokenized by the BPE tokenizer (Sennrich et al.,
3 Method 2015), i.e., ui = (ui,1 , ui,2 , · · · , ui,Ti ), where Ti
denotes the number of tokens. The tokens are
First, we define the problem of the ERC task. The
embedded through WordPiece embeddings (Wu
task is to recognize emotion labels (Happy, Sad,
et al., 2016). The pre-trained uncased BERT-Base1
Neutral, Angry, Excited, and Frustrated) of utter-
model converts the token embeddings into con-
ances u1 , u2 , · · · , uN , where N denotes the num-
textualized token representations, which can be
ber of utterances in a conversation. Let sm for
converted to the vector representations via max
m = 1, · · · , M be a collection of speakers in a
pooling, so that they are regarded as the contex-
given conversational dataset, where M denotes the (0)
number of speakers. The utterance ui is uttered by tual utterance embeddings hi ∈ RDm for i =
speaker sm , where m is the correspondence be- 1, · · · , M , where Dm denotes the dimension of
tween the utterance and its speaker. the utterance embeddings. This BERT model is
Our framework consists of three components - fine-tuned through a training process.
contextual utterance embedding, speaker depen- 3.2 Speaker Dependency Modeling with
dency modeling with position encodings and emo- Position Encodings
tion classification. The entire model architecture
is shown in Figure 1. Although our method is Graph-based neural networks are used to cap-
based on the DialogueGCN (Ghosal et al., 2019) ture the speaker dependency features of conver-
model, it considers the positional information con- sations. We design relational graph attention net-
tained in utterances in a sequential conversation 1
See https://github.com/google-research/bert for
as described in Section 3.2.3, whereas the Dia- details.

7362
works to capture both self-dependency and inter- contrast to the 8 types used by DialogueGCN2 .
speaker dependency of utterances. In addition, we In addition, the window sizes p and f repre-
introduce an attention mechanism to attend to the sent the number of past or future utterances from
neighborhood’s representations of the utterances. a target utterance in a neighborhood where each
Furthermore, novel position encodings (relational utterance ui has an edge with the p utterances
position encodings) are added to the graph to ac- (i.e. ui−1 , ui−2 , · · · , ui−p ), the f utterances (i.e.
count for the sequential information contained in ui+1 , ui+2 , · · · , ui+f ), and itself. An appropriate
utterances. window size has to be determined because a small
window makes each utterance connect to too small
3.2.1 Graphical Structure a neighborhood while an immense window size
makes the calculation very expensive. Although
We introduce the following notation: we denote
the window size can be different for each type, we
directed and multi-graphs as G = (V, E, R) with
determine the same window size for each relation.
a node (utterance) vi ∈ V and a labeled edge (re-
lation) (vi , r, vj ) ∈ E, where r ∈ R is a relation 3.2.2 Edge Weight
type. We introduce an edge weight by using an attention
mechanism. Although our attention mechanism is
Nodes Representation Each utterance in a con- based on the GAT (Veličković et al., 2017) model,
versation is represented as a node vi ∈ V. Each it is independent for each relational type r:
node vi is initialized with the contextual utterance
(0)
embeddings hi . Through a stack of graphical
layers, this embedding is modified by aggregating αijr = softmaxi LRL aTr [Wr hi ||Wr hj ]
their neighborhood’s representations, described as (1)
(L)
hi , where L denotes the number of graphical where αijr denotes the edge weight from a tar-
layers. get utterance i to its neighborhood j under rela-
tional type r, Wr denotes a parametrized weight
Labeled Edges Representation Following the matrix for the attention mechanism, ar denotes
state-of-the-art method (Ghosal et al., 2019), the a parametrized weight vector, and ·T represents
labeled edges depend on two aspects: (a) speak- transposition. After applying LeakyReLU nonlin-
ers dependency - this depends upon both self- earity (LRL), a softmax function is used to obtain
dependency and inter-speaker dependency. In the incoming edges whose sum total weight is 1.
detail, the former indicates how utterance ui of 3.2.3 Position Encodings
speaker sm influences sm ’s other utterances (in-
We propose relational position encodings for the
cluding itself). On the other hand, the latter de-
relational graph attention networks. Our position
scribes how utterance ui of speaker sm influences
encodings are based on the relative position since
the other speaker sk6=m ’s utterances; (b) temporal
it is appropriate for graph-based neural networks.
dependency - this also depends on temporal turns
The target utterance feature is connected to its
in conversation. Namely, it relies upon whether
neighborhood by an edge in the graph. There-
one utterance uj is uttered in the past or future
fore, in order to account for the sequential infor-
of the target utterance ui . While the future de-
mation between them, we need to consider the dis-
pendencies are not used in on-going conversation,
tance from the target to its neighborhood, which
the ERC task is an offline system. Furthermore,
is undoubtedly the relative distance between ut-
as past utterances plausibly influence future ut-
terances. Furthermore, we follow the speaker de-
terances, the converse may help the model fill in
pendency modeling described in 3.2.1 and use re-
some missing information like the speaker’s back-
lational graph attention networks. It is necessary
ground. For these reasons, we take the converse
that the sequential information depends on the re-
influence into account, referring to (Ghosal et al.,
lation type r. In summary, we use a different rel-
2019).
ative distance for each relation type, which is re-
Accordingly, there are four relation types of
2
edges: (1) self - past type, (2) inter - past type, The type of DialogueGCN depends on 2 distinct speak-
ers and therefore implies 2 × 4 distinct relation types, which
(3) self - future type, and (4) inter - future type, indicates that both the speaker dependency and the temporal
described as (r1 , r2 , r3 , r4 ). Note that this is in dependency are prepared for each distinct speaker.

7363
weight in (1) as

αijr = softmaxi LRL aTr [Wr hi ||Wr hj ] + PEijr
(3)
Figure 2: Example of relational positions. The rela- To add position encodings to the edge weight,
tional position depends on each relational type, and the our relational position has the same scalar dimen-
background color represents the relational type from sion as the edge weight. Because it is a scalar
the target utterance h4 . These positions, which are value, it may have limited ability to express po-
based on the relative distance, are different for each re- sitional information. In future studies, we will in-
lation.
crease the dimension of the position encodings.
3.2.4 RGAT
A graphical propagation module modifies the rep-
(l)
resentation of a node hi by aggregating represen-
tations of its neighborhood N r (i), and an attention
mechanism is used to attend to the neighborhood’s
(l−1)
representations. The features hir under relation
r are summed to compose the output embedding
(l)
of a node hi . Through a stack of graphical lay-
PE PE PE PE ers l, the representation of a node changes within
its l-hop neighborhood. We define the propagation
Figure 3: Illustration of relational position encodings. module as follows:
The encodings, which are composed of four represen-
tations, are added to the edges in a graph for each rela-
(l−1) (l−1) (l−1)
X
tion. “PE” denotes the position encodings. hir = αijr Wr(l−1) hj (4)
j∈N r (i)
R
(l) (l−1)
X
ferred to as relational position encodings. Figure 2 hi = hir (5)
illustrates the idea of relational positions. r=1
We compare two types of relational position en- where Wr
(l−1)
denotes a learnable weight matrix
coding, i.e., a fixed function and a learned repre- for each relation r. In addition, We apply multi-
sentation (Gehring et al., 2017). As the fixed posi- head attention to the aggregation module in (4)
tional function, we define its representation as and concatenate its outputs. After this propagation
module in (5), we use layer normalization with
 learnable affine transform parameters.
 max(−p, min(p, j − i))
 r = 1, where j ∈ N 1 (i)
N 2 (i)

 max(−p, min(p, j − i)) r = 2, where j ∈ 3.3 Emotion Classification
P Eijr =
 max(−f, min(f, j − i)) r = 3, where j ∈ N 3 (i)
 (L)
N 4 (i) After obtaining the representations hi of each

max(−f, min(f, j − i)) r = 4, where j ∈

(2)
node through the speaker dependency modeling
where P Eijr denotes the relational distance from
with relational position encodings, we concatenate
a target utterance i to its neighborhood j under (0)
the contextual utterance embeddings hi and the
relational type r. The maximum relational po- (L)
sition is clipped to a size of p or f , which de- representation of hi . The concatenated vector is
notes the window size of past or future utterances. classified by using a fully connected feed-forward
N r (i) denotes the neighborhood of the target i un- network, which consists of two linear transforma-
der relation type r. As the learned representations, tions with a ReLu activation between them:
we use one-layer feed-forward neural networks for
positional embeddings, whose argument is the re- Classifier(x) = max(0, xW1 + b1 )W2 + b2
lational fixed function. (6)
Our relational position is based on the relative where W1 and W2 denote learnable weight ma-
position; thus, it can be added to the edge weight, trixes, and b1 and b2 denote learnable bias vec-
as illustrated in Figure 3. We redefine the attention tors.

7364
Conversations Utterances
Datasets Classes Evaluation Metrics
train validation test train validation test
IEMOCAP 108 12 31 5320 490 1623 6 Weighted-F1
MELD 1038 114 280 9989 1109 2610 7 Weighted-F1
EmoryNLP 713 99 85 9934 1344 1328 7 Weighted-F1
DailyDialog 11118 1000 1000 87170 8069 7740 7 Micro-F1

Table 2: Dataset descriptions.

4 Experimental Settings CNN (Kim, 2014) This is a convolutional neu-

ral network trained at the utterance-level without
4.1 Datasets
contextual information.
We evaluated our method on four ERC benchmark
datasets of various sizes. Training, validation, and CNN+cLSTM (Poria et al., 2017) This model
test data distributions are reported in Table 2. extracts utterance features by using a CNN and
captures contextual information from surrounding
IEMOCAP (Busso et al., 2008) is an audio- utterances by using a bi-directional long short term
visual database consisting of recordings of ten memory (LSTM).
speakers in dyadic conversations. The utterances
are annotated with one of six emotional labels: BERT BASE (Devlin et al., 2018) This BERT-
happy, sad, neutral, angry, excited, or frustrated. based model extracts contextual information from
single sentences and uses it as input. After obtain-
MELD (Poria et al., 2018) is a multimodal
ing the sentence feature, it is classified with emo-
multi-party emotional conversational database
tion labels. We used this model as a contextual
created from scripts of the TV series Friends. The
utterance feature extractor (Section 3.1).
utterances are annotated with one of seven labels:
neutral, happiness, surprise, sadness, anger, dis- KET (Zhong et al., 2019) This is the state-of-
gust, or fear. the-art model for the EmoryNLP and DailyDialog
EmoryNLP (Zahiri and Choi, 2018) was also benchmark datasets. KET considers contextual in-
collected from Friends’ TV scripts. It contains dif- formation by using hierarchical self-attention and
ferent sizes and different types of annotations from leverages external commonsense knowledge by
those of MELD. The emotion labels include neu- using a context-aware graph attention mechanism.
ral, sad, mad, scared, powerful, peaceful, and joy-
DialogueRNN (Majumder et al., 2019) This
ful.
model uses a CNN to extract textual information.
DailyDialog (Li et al., 2017) is a multi-turn It uses three GRUs to account for the context
daily dialogue dataset, which contains human- and the speakers’ features and track the emotional
written daily communications. The emotion labels state.
are the same as the ones used in MELD.
DialogueGCN (Ghosal et al., 2019) This is
4.2 Evaluation Metrics the state-of-the-art model for the IEMOCAP and
For DailyDialog, following (Zhong et al., 2019), MELD datasets. DialogueGCN extracts textual ut-
we calculated the micro-averaged F1 score exclud- terance features by using a CNN and extracts se-
ing the majority class (neutral), due to it being an quential contextual features by using a GRU. Fur-
extremely high majority (over 80% occupancy in ther, it captures self-dependency and inter-speaker
both training and test sets). For the rest of the dependency by using two-layer graph neural net-
datasets, we followed (Zhong et al., 2019; Ghosal works, which consists of one layer RGAT and one
et al., 2019) and used the weighted-average F1 layer GCN.
score.
4.4 Other Settings
4.3 Baselines and State-of-the-Art We used cross entropy as a training loss for our
For a comprehensive performance evaluation, we approach on all datasets. The learning rate was
compared our model with the following baseline decreased in accordance with a cosine annealing
and state-of-the-art methods: schedule (Loshchilov and Hutter, 2016). We set

7365
Models IEMOCAP MELD EmoryNLP DailyDialog
CNN 48.18 55.86 32.59 49.34
CNN+cLSTM 54.95 56.87 32.89 50.24
BERT BASE 53.31 56.21 33.15 53.12
KET 59.56 58.18 34.39 53.37
DialogueRNN 62.75 57.03 31.70 50.65
DialogueGCN 64.18 58.10 - -
Ours 65.22 60.91 34.42 54.31

Table 3: Performance of our method, baseline, and state-of-the-art methods on the three test sets (the values in the
table are in terms of the evaluation metrics listed in Table 2). Bold font denotes the best performance. “-” signifies
that no results were reported for the given dataset. “Ours” denotes our methods, which are composed of a BERT
model and RGAT with relational position encodings. The position representations were learned.

initial learning rates of 4e-5 in the BERT struc- more, it achieved a weighted average F1 score
ture and 2e-3 in the RGAT structure and used the of 60.91% on the MELD dataset, outperform-
Adam optimizer (Kingma and Ba, 2014) under ing DialogueGCN by more than 2 points. For
the scheduled learning rate with a batch size of 1. EmoryNLP, it achieved a weighted average F1
The number of dimensions of the contextual em- score of 34.42%. It achieved a micro-averaged
beddings and utterance representations was set to F1 score of 54.31% on the DailyDialog dataset,
768, and the size of the internal hidden layer in improving recognition performance over the base-
the emotion classification module was set to 384. lines and KET model by around 1 point. From
We used 8-head attention for calculating the edge these results, we can see that adding our posi-
weight of RGAT and set 0.1 as the dropout rate in tion encodings caused an improvement over the
the BERT structure. We also carried out experi- baselines, KET, and DialogueGCN on all datasets.
ments with different contextual past window sizes Further, it is obvious that our approach is robust
p and future window sizes f , (1, 1), (2, 2), (3, 3), across datasets having varying training-data sizes,
(10, 10), (all, all), and RGAT layers, 1, 2, 3. We conversation lengths, and numbers of speakers.
selected either a concatenated function or a sum-
mation function as a mixing operation in the emo- 5.2 Analysis of the Experimental Results
tion classification module, as described in 3.3. We Let us investigate the importance of our model
chose the hyper-parameter that achieved the best components by analyzing the predicted emotional
score on each dataset by using development data. labels, as shown in Table 4. The results of the
All of the presented results are averages of 5 runs. model using BERT without speaker dependency
We conducted all experiments on a CentOS server modeling are listed on row ]0, while the results of
using Xeon(R) Gold 6246 CPU with 512GB of DialogueRNN, as described in Section 4.3, are on
memory, and we used Quadro RTX 8000 GPU row ]1. The results of DialogueGCN, as described
with 48GB of memory. in Section 4.3, are reported in ]2. The results of the
BERT and RGAT model without position encod-
5 Results and Discussion ings are on row ]3, and those of our model are on
]4. Note that DialogueGCN’s RGAT differs from
5.1 Comparison with Baselines and
our model in terms of its graphical structure and
State-of-the-Art
relational types.
We compared the performance of our approach As shown in the table, our method did not
with those of the baselines and state-of-the-art achieve the best score for almost all labels. How-
methods listed in Table 3. We have quoted the ever, interestingly, it achieved a state-of-the-art av-
results for the baselines and state-of-the-art re- erage F1 score, which is the target metric on the
sults reported in (Zhong et al., 2019; Ghosal et al., dataset. A possible reason for this performance is
2019), except for the results of BERT BASE on that our method consists of effective components.
IEMOCAP. Each component of BERT and RGAT with posi-
For IEMOCAP, our model obtained a weighted tion encodings worked well for each label. As a re-
average F1 score of 65.22%, outperforming Di- sult, these components led to a strong average per-
alogueGCN by more than 1 point. Further- formance. Each effective component is explained

7366
Background Components
Contextual Speaker
] Models Happy Sad Neutral Angry Excited Frustrated Average
Utterance Dependency
Embedding Modeling
0 BERT BASE BERT × 37.09 59.53 51.73 54.33 54.26 55.83 53.31
1 DialogueRNN CNN, GRU 33.18 78.80 59.21 65.28 71.86 58.91 62.75
2 DialogueGCN CNN, GRU RGAT 42.75 84.54 63.54 64.19 63.08 66.99 64.18
3 Ours(without PE) BERT RGAT 50.69 76.78 65.85 59.66 64.04 62.37 64.36
4 Ours BERT RGAT with PE 51.62 77.32 65.42 63.01 67.95 61.23 65.22

Table 4: Weighted average F1 scores of ours (with or without PE), baseline, and state-of-the-art methods for each
label in the IEMOCAP dataset. Bold font denotes the best performance. “Average” denotes the weighted average
F1 score. The variations of their background components are shown in the third and fourth columns.

as follows: Despite its strong performance, our model did not

outperform DialogueGCN and DialogueRNN on
Effect of Speaker Dependency We observed these labels (]1, ]2, and ]4). A possible explana-
that DialogueGCN and ours (with or without PE) tion is that these label’s utterances are mainly in-
achieved an F1 score of more than 60% on Frus- fluenced by the immediately preceding utterances;
trated, higher than the other methods. This may thus, RNN-based models such as GRU may be
be due to the well-functioning RGAT model. On more adequate for these two labels.
the IEMOCAP dataset, the utterances often keep
From these results, we can see that each com-
on influencing the other utterances through self
ponent of our method functioned successfully on
and inter-speaker dependency; thus, the same la-
each label. Our method achieved a state-of-the-art
bel continues in these utterances. Most of the la-
average F1 score. Moreover, it was useful on any
bels in this case are annotated with Frustrated. Be-
label; thus, it is a well-balanced method.
cause of the speaker dependency modeling, these
consecutive utterances can be well classified using Other Analyses We analyzed other aspects of
RGAT. our models. We observed that our model misclas-
sified some samples of Excited as Happy. The
Effect of Contextual Information Ours (with
cause of this issue may be due to the similarity
or without PE) achieved an F1 score of more than
of the sentences these labels appear in. There is
50% on Happy, outperforming the other baselines
almost no difference in the meanings of sentences,
by around 10 points. On the dataset, the Happy
so our method may have had difficulty distinguish-
label appears in several utterances including par-
ing these labels. In future work, we will utilize ad-
ticular words like ’love’ or ’great’. The BERT
ditional audio and visual information to help our
model with RGAT may have led to better perfor-
model by taking voice tones and facial expressions
mance. Due to the representational power afforded
into account.
by its bi-directional context modeling, the BERT
model may have functioned well in these utter- 5.3 Model Variations
ances. Note that the combination of BERT and
We evaluated the importance of our relational po-
RGAT is probably essential because the samples
sition encoding and studied the positional varia-
of Happy are also influenced by speaker depen-
tions on the IEMOCAP dataset. The experimental
dency, as compared with ]0.
results are reported in Table 5.
Effect of Sequential Feature Our position en- To make comparisons with the other position
codings contributed to the strong performance on encoding methods, absolute and relative position
the Sad and Angry labels, our model with PE out- representations were prepared; these are referred
performed our model without PE (]3 and ]4). The to as node-based position encodings and edge-
two labels often appear in the utterances influ- based position encodings, respectively. Inspired
enced by the other immediate utterances. As the by (Vaswani et al., 2017), we added node-based
RGAT with position encodings not only captures position encoding to the nodes (utterances) at the
self and inter-speaker dependency but clearly dis- bottoms of the RGAT layers. Similarly, edge-
tinguishes between immediate and far utterances; based position encoding was added to the edges
thus, it possibly performs well on these utterances. in the graph. We also compared two types of posi-

7367
] Position Encodings (PE) Type Average experiment by increasing the past and future win-
0 - - 64.36 dow sizes [(1,1), (3,3), (5,5), (7,7), (9,9), (11,11),
1 fixed 63.95
Node-based PE (20,20), (30,30), and (40,40)] on the IEMOCAP
2 learn 64.95
3 fixed 63.97 dataset and compared the results with those of the
Edge-based PE
4 learn 64.59 baseline model using BERT and RGAT without
5
Relational PE
fixed 63.99 positional information. The experimental results
6 learn 65.22 are illustrated in Figure 4.
Table 5: Impact of various position encodings compo- As an illustration, it is clear that both models
nents on the IEMOCAP dataset. The base model using perform better with a window size around 3, 5, 7.
BERT and RGAT without position encodings is shown On the other hand, long utterance information may
in ]0. “fixed” and “learn” denote a fixed function and a obstruct efficient recognition (see the results for
learned representation respectively. a window size around 30, 40). Although it is re-
quired to select a small window size, too small
a size results in poor performance, no better than
choosing a size of 1.
Furthermore, the proposed position encoding
method is robust to a varying window size. As
the window size increased, the baseline model’s
F1 score decreased, while our model maintained
its performance even with a large window. One
possible reason is that, as our position encodings
clearly distinguish between immediate and far ut-
terances, it can reduce the influence of these dis-
Figure 4: Effect of different window sizes on the tant utterances.
weighted average F1 score of our method (Ours) and
the baseline model (Base) on the IEMOCAP dataset.
We plotted the scores by using a marker with a con-
6 Conclusion
fidence interval of 95%, which was estimated using a
We proposed relational position encodings for
bootstrap.
RGAT to recognize human emotions in textual
conversation. We incorporated the relational po-
tion encoding, i.e., a fixed function and a learned sition encodings in the RGAT structure to cap-
representation. ture both speaker dependency and the sequen-
The baseline model using BERT and RGAT tial order of utterances. On four ERC datasets,
without position encodings (]0) had a recogni- our model improved recognition performance over
tion performance of 64.36%. We added various those of the baselines and existing state-of-the-art
position encodings to the baseline model and se- methods. Additional experimental studies demon-
lected fixed functions or learned representations strated that the relational position encoding ap-
as the position representation (from ]1 to ]6). The proach outperformed the other position encodings
model using the relational position encodings with and showed that it is robust to changes in window
learned representations had a recognition perfor- size.
mance of 65.22%, the best score and outperform- In future studies, we plan to increase the number
ing the base model by around 1 point. Our rela- of dimensions of the relational position encodings,
tional position encodings were more effective than since a scalar value may not be able to express po-
the other position encodings. sitional information adequately.
We also found that the fixed functions in various
positions resulted in a score lower than that of the Acknowledgements
baseline model. We can conclude that it is required
to learn a position representation. We would like to thank Dr. Ichiro Yamada, Dr.
Rei Endo, and Hideya Mino for the valuable dis-
5.4 Effect of Varying the Window Size cussions. We also thank the anonymous reviewers
We conducted another experiment to evaluate the for their helpful comments.
key aspects of our framework. We carried out an

7368
References Yoon Kim. 2014. Convolutional neural net-
works for sentence classification. arXiv preprint
James Bradbury, Stephen Merity, Caiming Xiong, and arXiv:1408.5882.
Richard Socher. 2016. Quasi-recurrent neural net-
works. arXiv preprint arXiv:1611.01576. Diederik P Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. arXiv preprint
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe arXiv:1412.6980.
Kazemzadeh, Emily Mower, Samuel Kim, Jean-
nette N Chang, Sungbok Lee, and Shrikanth S Thomas N Kipf and Max Welling. 2016. Semi-
Narayanan. 2008. IEMOCAP: Interactive emo- supervised classification with graph convolutional
tional dyadic motion capture database. Language networks. arXiv preprint arXiv:1609.02907.
resources and evaluation, 42(4):335.
Jieun Lee and Ilyoo B Hong. 2016. Predicting positive
Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, user responses to social media advertising: The roles
and Yoshua Bengio. 2014. Empirical evaluation of of emotional appeal, informativeness, and creativity.
gated recurrent neural networks on sequence model- International Journal of Information Management,
ing. arXiv preprint arXiv:1412.3555. 36(3):360–373.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang
Kristina Toutanova. 2018. BERT: Pre-training of Cao, and Shuzi Niu. 2017. DailyDialog: A man-
deep bidirectional transformers for language under- ually labelled multi-turn dialogue dataset. arXiv
standing. arXiv preprint arXiv:1810.04805. preprint arXiv:1710.03957.
Jonas Gehring, Michael Auli, David Grangier, Denis Pengfei Liu, Xipeng Qiu, and Xuanjing Huang.
Yarats, and Yann N Dauphin. 2017. Convolutional 2016. Recurrent neural network for text classi-
sequence to sequence learning. In Proceedings fication with multi-task learning. arXiv preprint
of the 34th International Conference on Machine arXiv:1605.05101.
Learning-Volume 70, pages 1243–1252. JMLR. org.
Ilya Loshchilov and Frank Hutter. 2016. SGDR:
Deepanway Ghosal, Navonil Majumder, Soujanya Po- Stochastic gradient descent with warm restarts.
ria, Niyati Chhaya, and Alexander Gelbukh. 2019. arXiv preprint arXiv:1608.03983.
DialogueGCN: A graph convolutional neural net-
work for emotion recognition in conversation. arXiv Linkai Luo and Yue Wang. 2019. EmotionX-HSU:
preprint arXiv:1908.11540. Adopting pre-trained bert for emotion classification.
arXiv preprint arXiv:1907.09669.
Chuan Guo, Juan Cao, Xueyao Zhang, Kai Shu, and
Huan Liu. 2019. DEAN: Learning dual emotion for Navonil Majumder, Soujanya Poria, Devamanyu Haz-
fake news detection on social media. arXiv preprint arika, Rada Mihalcea, Alexander Gelbukh, and Erik
arXiv:1903.01728. Cambria. 2019. DialogueRNN: An attentive rnn for
emotion detection in conversations. In Proceedings
Devamanyu Hazarika, Soujanya Poria, Rada Mihal- of the AAAI Conference on Artificial Intelligence,
cea, Erik Cambria, and Roger Zimmermann. 2018a. volume 33, pages 6818–6825.
ICON: Interactive conversational memory network
for multimodal emotion detection. In Proceedings Rosalind W Picard. 2010. Affective computing: From
of the 2018 Conference on Empirical Methods in laughter to ieee. IEEE Transactions on Affective
Natural Language Processing, pages 2594–2604. Computing, 1(1):11–17.

Devamanyu Hazarika, Soujanya Poria, Amir Zadeh, Soujanya Poria, Erik Cambria, Devamanyu Hazarika,
Erik Cambria, Louis-Philippe Morency, and Roger Navonil Majumder, Amir Zadeh, and Louis-Philippe
Zimmermann. 2018b. Conversational memory net- Morency. 2017. Context-dependent sentiment anal-
work for emotion recognition in dyadic dialogue ysis in user-generated videos. In Proceedings of the
videos. In Proceedings of the 2018 Conference 55th annual meeting of the association for compu-
of the North American Chapter of the Association tational linguistics (volume 1: Long papers), pages
for Computational Linguistics: Human Language 873–883.
Technologies, Volume 1 (Long Papers), pages 2122–
2132. Soujanya Poria, Devamanyu Hazarika, Navonil Ma-
jumder, Gautam Naik, Erik Cambria, and Rada Mi-
Sepp Hochreiter and Jürgen Schmidhuber. 1997. halcea. 2018. MELD: A multimodal multi-party
Long short-term memory. Neural computation, dataset for emotion recognition in conversations.
9(8):1735–1780. arXiv preprint arXiv:1810.02508.

John Ingraham, Vikas Garg, Regina Barzilay, and Michael Schlichtkrull, Thomas N Kipf, Peter Bloem,
Tommi Jaakkola. 2019. Generative models for Rianne Van Den Berg, Ivan Titov, and Max Welling.
graph-based protein design. In Advances in Neu- 2018. Modeling relational data with graph convolu-
ral Information Processing Systems, pages 15794– tional networks. In European Semantic Web Confer-
15805. ence, pages 593–607. Springer.

7369
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2015. Neural machine translation of rare words with
subword units. arXiv preprint arXiv:1508.07909.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani.
2018. Self-attention with relative position represen-
tations. arXiv preprint arXiv:1803.02155.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in neural information pro-
cessing systems, pages 5998–6008.
Petar Veličković, Guillem Cucurull, Arantxa Casanova,
Adriana Romero, Pietro Lio, and Yoshua Bengio.
2017. Graph attention networks. arXiv preprint
arXiv:1710.10903.
Xing Wang, Zhaopeng Tu, Longyue Wang, and Shum-
ing Shi. 2019. Self-attention with structural position
representations. arXiv preprint arXiv:1909.00383.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V
Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun, Yuan Cao, Qin Gao, Klaus
Macherey, et al. 2016. Google’s neural ma-
chine translation system: Bridging the gap between
human and machine translation. arXiv preprint
arXiv:1609.08144.
Kisu Yang, Dongyub Lee, Taesun Whang, Seolhwa
Lee, and Heuiseok Lim. 2019. EmotionX-KU: Bert-
max based contextual emotion classifier. arXiv
preprint arXiv:1906.11565.
Sayyed M Zahiri and Jinho D Choi. 2018. Emotion de-
tection on tv show transcripts with sequence-based
convolutional neural networks. In Workshops at the
Thirty-Second AAAI Conference on Artificial Intelli-
gence.
Peixiang Zhong, Di Wang, and Chunyan Miao. 2019.
Knowledge-enriched transformer for emotion de-
tection in textual conversations. arXiv preprint
arXiv:1909.10681.