Abstract
Emotion Recognition in Conversations (ERC) is a key step towards successful human–machine interaction. While the field has seen tremendous advancement in the last few years, new applications and implementation scenarios present novel challenges and opportunities. These range from leveraging the conversational context, speaker, and emotion dynamics modelling, to interpreting common sense expressions, informal language, and sarcasm, addressing challenges of real-time ERC, recognizing emotion causes, different taxonomies across datasets, multilingual ERC, and interpretability. This survey starts by introducing ERC, elaborating on the challenges and opportunities of this task. It proceeds with a description of the emotion taxonomies and a variety of ERC benchmark datasets employing such taxonomies. This is followed by descriptions comparing the most prominent works in ERC with explanations of the neural architectures employed. Then, it provides advisable ERC practices towards better frameworks, elaborating on methods to deal with subjectivity in annotations and modelling and methods to deal with the typically unbalanced ERC datasets. Finally, it presents systematic review tables comparing several works regarding the methods used and their performance. Benchmarking these works highlights resorting to pre-trained Transformer Language Models to extract utterance representations, using Gated and Graph Neural Networks to model the interactions between these utterances, and leveraging Generative Large Language Models to tackle ERC within a generative framework. This survey emphasizes the advantage of leveraging techniques to address unbalanced data, the exploration of mixed emotions, and the benefits of incorporating annotation subjectivity in the learning phase.
1 Introduction
This survey systematically reviews Deep Emotion Recognition in Conversations (ERC) works. It elaborates on several ERC challenges and opportunities, explaining how these are currently tackled by the reviewed works and discussing how these can be further addressed.
Emotion recognition capabilities are essential not only for successful interpersonal relationships but also for human–machine interaction. Understanding and knowing how to react to emotions significantly improves the interaction and its outcome. It is therefore a crucial component in the development of empathetic machines, which substantially enriches the experiences these can provide.
ERC modules are useful for a wide range of applications, from automatic opinion mining, to emotion-aware conversational agents and as assisting modules for therapeutic practices. ERC is therefore an actively growing research field, with high applicability and potential. Figure 1 depicts the evolution of the number of ERC publications across the years, using the most established benchmark datasets (Busso et al. 2008; Poria et al. 2019a; Zahiri and Choi 2018; Li et al. 2017), supporting the previous observation. Note that it does not account for datasets addressing emotion causes, a recent research direction.
It should be noted that emotions are only fully recognizable by the person having the emotion. Thus emotion recognition should be interpreted as “inferring an emotional state from observations of emotional expressions and behavior, and through reasoning about an emotion-generating situation” (Picard 2000).
Progress in ERC benefits from the advancements in Sentiment Analysis, particularly in dialogues, whose tasks include determining the polarity and emotion of text. Research in Sentiment Analysis has come a long way and there is a perception that the field has reached its maturity, but there are still aspects and sub-tasks that need to be thoroughly addressed to attain true sentiment understanding (Poria et al. 2023). Such a statement applies to ERC as well. Emotion and sentiment are used interchangeably throughout this survey, but one should bear in mind they have different meanings: emotion is a psychological state while sentiment is the mental attitude created through the existence of the emotion (Damásio 2020).
The conversational setting presents additional challenges and opportunities, besides the ones of Emotion Recognition, such as the presence of intricate dependencies between the emotional states of the subjects who participate in the conversation, and the use of context and common sense to express emotions, amongst others described in this survey.
1.1 Task definition
The task of ERC consists of determining the emotion of each utterance in a textual conversation, for which speaker information is provided for each utterance. Formally, given a sequence of N utterances \([(u_1, p_1), (u_2, p_2), \dots , (u_N, p_N)]\), where each utterance \(u_i\) is spoken by participant \(p_i\), the goal is to predict the emotion label \(e_i\) for each utterance \(u_i\).
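As an illustration, the following minimal sketch shows this input/output structure in code; the two-utterance conversation and the labels are invented for illustration, not taken from any dataset.

```python
# A minimal sketch of the ERC task structure; conversation and labels
# are illustrative, not taken from any dataset.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str      # u_i
    speaker: str   # p_i

conversation = [
    Utterance("I got the job!", speaker="p1"),
    Utterance("That's wonderful news!", speaker="p2"),
]

def classify(conversation):
    """An ERC model predicts one emotion label e_i per utterance u_i."""
    return ["neutral"] * len(conversation)  # placeholder for a real model

labels = classify(conversation)  # a real model might output, e.g., ["joy", "joy"]
```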
1.2 Survey contributions
To the best of our knowledge, the existing surveys on Emotion Recognition in Conversations are by Poria et al. (2019b), which focuses on the challenges and advances in text ERC, pointing out conversational context modelling, speaker-specific modelling, the presence of emotion shift, multiparty conversations, and the presence of sarcasm, and a more recent survey by Fu et al. (2023) on multimodal ERC, which covers context modelling in conversations, speaker dependency, and methods for fusing multimodal information, and presents challenges such as real-world ERC and classification on the typically imbalanced ERC datasets.
Related relevant surveys covered sentiment analysis research challenges and directions (Poria et al. 2023) and textual emotion recognition and its challenges (Deng and Ren 2021), the latter including a section on textual emotion recognition in dialogue, elaborating on utterance context modelling and emotion dynamics modelling. However, these are not specific to conversations, not specifying how the models and practices employed in general textual emotion recognition relate to conversations, nor covering several ERC pertaining challenges.
Our survey discusses additional challenges for this task, such as recognizing emotion causes, different taxonomies across datasets, multilingual ERC, and interpretability. It also extensively covers solutions to dealing with imbalanced ERC datasets and learning with subjectivity from annotators.
This is also the first survey presenting an extensive list of Deep Learning works in ERC, stating their methods, modalities, and performance across various datasets, while also describing key works on the topic. We provide comprehensive descriptions of Deep Learning methods from the Multi-Layer Perceptron, Recurrent Neural Networks, Long Short-Term Memory Networks, Gated Recurrent Units, Convolutional Neural Networks, Graph Neural Networks, and Attention Mechanisms to the Transformer, in the light of Emotion Recognition in Conversations. While these methods are transversal to all emotion recognition tasks, and therefore all emotion recognition surveys, some are more appropriate for ERC. Examples are Transformer-based architectures that are better at preserving long-term dependencies between sequences of utterances than recurrent neural networks.
We also set up an online repositoryFootnote 1 with the sources we present, organized into categories such as overviews, datasets, and works on ERC.
1.3 Survey methodology
We report the steps of our systematic review: identification, screening, eligibility, and inclusion.
In the identification step, to collect ERC works for Sects. 4 and 6, we search the accepted papers’ sections of leading NLP conferences (ACL, EMNLP, NAACL, EACL, COLING, and Findings) and additional conferences (AAAI, INTERSPEECH, and ICASSP), using the title keywords “emotion” in combination with “conversational” or “conversations”. Journal papers are also collected from IEEE Access and IEEE Transactions on Affective Computing.
We limit our search to papers published after 2017, given the significant advances in Deep Learning models since that time, particularly the introduction of the Transformer architecture (Vaswani et al. 2017) and pre-trained Transformer models such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019).
In the screening step, the inclusion criteria are papers that address emotion recognition within conversations, report performance on one or more of the four main benchmark datasets listed in subsection 3.2 (IEMOCAP, MELD, EmoryNLP, and DailyDialog), and report F1-scores, the adequate evaluation metric for emotion recognition in conversations, as elaborated in subsection 5.2. The exclusion criteria are papers not reporting results on the aforementioned benchmark datasets or not reporting F1-scores.
For Sects. 2, 3, and 5, we conduct topic-specific searches using Google Scholar, employing keywords such as “challenges in ERC” or “data taxonomies in emotion recognition”. Search results are filtered by relevance and publication date to prioritize recent and pertinent research.
To complement our searches, we use tools such as Connected Papers to discover papers related to the ones already retrieved.
To assess paper eligibility we evaluate each paper’s full text, focusing on the reported results and alignment with the survey scope, more specifically, the use of Deep Learning methods.
1.4 Outline
This survey is organised as follows: Sect. 1 introduces the topic of Emotion Recognition in Conversations and Sect. 2 presents its challenges and opportunities. Section 3 describes the emotion taxonomies and a variety of ERC benchmark datasets employing such taxonomies. Section 4 provides descriptions of the most prominent works in ERC and explains the Deep Learning architectures employed. Section 5 describes advisable ERC practices towards better frameworks, elaborating on methods to deal with subjectivity in annotations and methods to deal with the typically unbalanced ERC datasets. Section 6 consists of systematic review tables comparing several works regarding the methods used and their performance and provides insights into how the aforementioned works attempted to solve the ERC challenges presented in Sect. 2. Section 7 elaborates on future work directions. Finally, Sect. 8 presents the conclusions, limitations and key findings of the survey.
2 ERC challenges and opportunities
Research in ERC has come a long way and several challenges have been addressed, as described in this section. From context, speaker, and emotion-shift modelling to interpreting common sense, informal language, and sarcasm, Deep Learning architectures have demonstrated their potential to address these aspects, although there is room for improvement. Application scenarios such as real-time ERC pose additional challenges. In this section, we describe such challenges and point to the sections that are connected with each challenge, and in subsection 6.1 we wrap up the challenges, elaborating on how they are currently being addressed. In Sect. 7, we suggest future work directions concerning partly unaddressed challenges.
2.1 Context, speaker and emotion dynamics modelling
While early works (Lee et al. 2009; Wöllmer et al. 2010) tackled challenges from conversational context modelling to speaker-specific modelling and emotion dynamics, resorting, for example, to gated networks, recent ERC approaches have leveraged novel developments in Deep Learning, such as Transformers and Transformer-based pre-trained models, to address these aspects.
Context modelling can be interpreted as leveraging information from previous or surrounding words or utterances. Speaker-specific modelling leverages information about each speaker’s utterances, capitalizing on the fact that an utterance belongs to a given speaker, which is extremely important in multiparty conversations. Emotion dynamics concerns the variations of the emotions of the participants in the conversation, being more informative but also harder to model than emotion shift. Emotions concern not only those who express them but also those who receive them. For example, more conscious and reflective people tend to be more sensitive to others’ emotions.
Concerning these challenges, gated and graph-based neural networks and Transformer-based architectures are successful at modelling long-range dependencies between words in utterances and relations between utterances. Key ERC works in Sect. 4 are described with regard to how these Deep Learning architectures were employed to address these challenges.
2.2 Common sense, informal language and sarcasm
Contextual embeddings from language models, such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019), are pre-trained with large amounts of data, which makes them suitable to deal with informal language, sarcasm, and the use of common sense to express emotions, although there is room for improvement. Most of the ERC works featured in this survey make use of these contextual embeddings.
Regarding common sense, commonsense knowledge bases such as COMET (Bosselut et al. 2019) were developed to aid in leveraging the meaning behind phrases employing such knowledge. Three key ERC works described in Sect. 4 make use of knowledge bases. These knowledge bases, however, are not specific to emotions, which could be an extension to consider.
Concerning sarcasm, speaker-specific modelling can aid in identifying the sarcastic participants in the conversation, to more accurately classify their utterances (Poria et al. 2019b).
2.3 Real-time ERC
Systems to be deployed in real-life conversational scenarios may require real-time emotion recognition (Eyben et al. 2010), which is challenging not only for modelling and recognition but also for annotation, because the boundaries of each emotional expression have to be recognized at the same time as its emotion category (Picard 2000). Datasets such as the ones described in subsubsection 3.2.5 feature annotations in continuous time, being more suitable for real-time ERC.
2.4 Recognizing emotion causes
Recognizing emotion causes, i.e., what triggered the emotions in the conversation, is a challenge that has only recently been addressed (Poria et al. 2021; Wang et al. 2023; Jeong and Bak 2023; Hu et al. 2024a; Hua et al. 2024; Ma et al. 2024). It involves resorting to datasets enriched with emotion causes, guiding classifiers to identify not only the emotion at stake but also what caused that emotion to arise. Addressing this challenge can greatly contribute towards a more complete emotional understanding.
2.5 Different taxonomies
With the use of different taxonomies and taxonomy types, as described in subsection 3.1 and reflected in the different emotion representations in the datasets in subsection 3.2, it is not straightforward to compare and establish connections between the different datasets. This makes it difficult, for example, to fine-tune the same ERC model with data from more than a few datasets, unless adjustments are made to their respective taxonomies.
2.6 Multilingual ERC
Most of the benchmark datasets used to evaluate models and develop ERC applications are in English, as can be observed in subsection 3.2. Translating the English conversations to the target application languages is not ideal, since emotional nuance can be lost in translation. The collection and annotation of multilingual datasets for ERC would benefit research in this direction. Some dialogues involve code-switching, i.e., featuring several languages in the same dialogue or even the same utterance (Kumar et al. 2023). The cultural aspect of emotion expression and interpretation also poses a challenge, since emotion is not universally expressed or understood in the same way across different cultures.
2.7 Interpretability
A property lacking in most of the works resorting to deep learning architectures is interpretability. It is difficult to interpret the reasoning behind these models’ decisions and to identify the characteristic features of each class. Examining attention weights is a way of determining which features the models were attending to while computing classifications, which can aid interpretability, although some issues with this practice have been raised in the community (Jain and Wallace 2019; Wiegreffe and Pinter 2019). Other methods can be proposed to provide more interpretability to the ERC pipeline, such as Fuzzy Fingerprinting Transformer Embedding Representations (Pereira et al. 2023b).
3 ERC taxonomies and benchmark datasets
In this section, a description of different taxonomies, i.e., ways of representing emotions, is provided, followed by the presentation of various datasets that resort to such taxonomies. The “dialogues in the datasets reflect our daily communication way” (Li et al. 2017), therefore encompassing several emotions across different taxonomies.
3.1 Emotion taxonomies
One of the most important issues in the study of emotion is whether there are basic types of affect or whether affective states should be modelled as combinations of locations on shared dimensions (Zachar and Ellis 2012). These two dominant views corresponded to the categorical approach, defended by Panksepp (2004), and to the dimensional approach, proposed by Russell (1980), respectively. This is reflected in two main taxonomy approaches, that are described and compared in this section. The Plutchik wheel of emotions (Plutchik 1982) is also described.
3.1.1 Dimensional vs categorical approach
The description of emotions can follow a dimensional approach, a continuous model that allows for describing an unlimited range of states. Following this approach, emotion is described by a range of values for valence, arousal, and dominance, the so-called VAD model (Mehrabian 1980). Valence is a measure of the pleasure or displeasure of an emotion: happiness has a positive valence, fear has a negative valence. Arousal ranges from excitement to calmness: anger is high in arousal, sadness is low in arousal. Dominance measures how much choice one has over an emotion: fear is low in dominance, admiration is high in dominance. The distribution of common emotions over the VAD 3-dimensional space can be observed in Fig. 2.
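As an illustration, emotions can be represented as points in this 3-dimensional space. The coordinates below are hand-picked approximations for illustration only; lexica such as NRC-VAD provide empirically derived values.

```python
# Illustrative, hand-picked VAD coordinates in [-1, 1]; not empirical values.
vad = {
    #             valence, arousal, dominance
    "happiness": ( 0.9,  0.3,  0.4),   # positive valence
    "fear":      (-0.8,  0.7, -0.6),   # negative valence, high arousal, low dominance
    "sadness":   (-0.7, -0.5, -0.3),   # negative valence, low arousal
    "anger":     (-0.6,  0.8,  0.3),   # high arousal
}
```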
A motivation for the use of the dimensional description could be the existence of mixed emotions, elaborated in subsubsection 3.1.3.
The description can also follow a categorical approach, in which an emotion is assigned one or more labels from a set of classes. Compared to the dimensional approach, this discrete categorical framework makes it easier to annotate emotions but is more limited. Several scholars have proposed sets of primary emotions. The basic emotions proposed by Ekman (1999) constitute a standard for this emotion description: joy, anger, sadness, surprise, fear, and disgust.
3.1.2 Plutchik wheel of emotions
The Plutchik wheel of emotions, depicted in Fig. 3, identifies eight core emotions, organised in opposite pairs of a wheel: sadness and joy, anger and fear, expectation and surprise, and trust and disgust. These have associated emotions according to their intensities and also combine to form new emotional states (Plutchik 1982).
3.1.3 Mixed emotions
A person can experience several emotions simultaneously. For example, one can experience joy and surprise at the same time, and sometimes even opposite emotions simultaneously (Berrios et al. 2015). While the VAD model allows for an unlimited description of emotions, so that a mixture of emotions can be expressed in a single VAD description, the single-label categorical approach does not account for mixed emotions. This motivates the need to classify emotions with multiple labels when using the categorical approach. This need becomes more obvious with a decreasing number of labels in a dataset, but there are also multi-label datasets with a high number of emotion labels, such as the GoEmotions dataset (Demszky et al. 2020) with 28 emotions.
3.1.4 ERC taxonomy considerations
The existing taxonomies are usually applied transversally across all Emotion Recognition tasks. However, for ERC, it can be observed that there are coincident predominant classes in several benchmark datasets, such as Neutral and Happiness/Joy, as can be seen in the class distribution of the datasets in subsection 3.2. Within different ERC domains, such as chit-chat vs task-oriented, e.g., mental health (Dheeraj and Ramakrishnudu 2021) and customer support (Herzig et al. 2016), there are specific classes of emotions that will be more represented than others in each domain. With such a rich diversity of possible classes for this task, as can be observed in the 28 classes of the GoEmotions dataset, the dimensional approach can encompass all classes but might make it harder to distinguish between differentiated emotional displays.
3.2 ERC benchmark datasets
In this subsection, the different categorical, single-label datasets used by the works benchmarked in this survey are presented. These works are organised in Table 3. Other relevant datasets are also presented. Overall, the class imbalance of these datasets is significant, suggesting the use of balancing techniques, which will be elaborated on in subsection 5.2. All datasets are in English, which is a limitation concerning multilingual ERC, as pointed out in Sect. 2.
3.2.1 IEMOCAP (Busso et al. 2008)
The IEMOCAP dataset consists of 151 videos of two-speaker dialogues, comprising 7,433 utterances with an average length of 11.6 words. Each utterance is labelled with one of 8 emotions: happy, sad, neutral, angry, excited, frustrated, fear, and disgust. Most of the works, however, benchmark their performance on the first 6 classes. The maximum class imbalance ratio is 1:3 (happy:frustrated).
Sample: “Is that, is that - is that just foam? I can’t even tell. Although, if you can’t tell, it’s probably isn’t them. It’ll probably be unmistakable, don’t you think?” Label: Excited
3.2.2 MELD (Poria et al. 2019a)
The Multimodal EmotionLines Dataset (MELD) contains 13,708 utterances from 1,433 dialogues from the TV series Friends. Each utterance encompasses audio, visual, and textual modalities, with an average length of 8.0 words. Emotion categories are Ekman’s basic emotions and neutral. The maximum class imbalance ratio is 1:18 (fear:neutral).
Sample: “Are we okay now?” Label: Fear
3.2.3 EmoryNLP (Zahiri and Choi 2018)
EmoryNLP is a text dataset extracted from transcripts of the TV show Friends. It contains 12,606 utterances, labelled with one of the six primary emotions in Willcox’s feeling wheel (Willcox 1982) or a neutral label. The maximum class imbalance ratio is 1:4 (sad:neutral).
Sample: “What a coincidence, I listen in my sleep.” Label: Peaceful
3.2.4 DailyDialog (Li et al. 2017)
DailyDialog is a large text dataset built from websites used to practise English dialogue in daily life. It contains 13,118 multi-turn dialogues, comprising 102,879 utterances with an average length of 14.6 words. Each utterance is labelled with one of the six Ekman’s basic emotions or other. The maximum class imbalance ratio is 1:1156 (fear:other).
Sample: “I was scared stiff of giving my first performance” Label: Disgust
Table 1 shows a comparison of the label distributions of the aforementioned datasets and Table 2 shows a comparison of the corresponding percentages.
3.2.5 Other datasets
The SEMAINE dataset (McKeown et al. 2011) consists of audiovisual recordings of interactions between a human and an operator undertaking the role of an avatar with four personalities. Its five core emotion dimensions are valence, activation, power, anticipation/expectation, and intensity.
The RECOLA dataset (Ringeval et al. 2013) is a multimodal corpus of spontaneous human interactions collected from a collaborative task. It includes not only audio and video, but also electrocardiogram and electrodermal activity data. Emotional behaviours are annotated in terms of arousal and valence, in continuous time, and social behaviours in terms of agreement, dominance, engagement, performance, and rapport, in discrete time.
The SILICONE dataset (Chapuis et al. 2020) consists of preexisting datasets which have been considered challenging and interesting. It has annotations on Dialog Acts and Sentiment/Emotion. Concerning the latter, it comprises DailyDialog, MELD, IEMOCAP and SEMAINE.
4 Deep learning for emotion recognition in conversations
In this section, key ERC works are described, focusing on the challenges and opportunities of the conversational scenario, presented in Sect. 2, and on the diverse Deep Learning architectures that can be employed for the task. Descriptions of key Deep Learning models are provided along with descriptions of works that resorted to those models. The works on Emotion Recognition in Conversations using Deep Learning described in this survey are chosen to be representative of the different techniques that are employed, covering gated Neural Networks with and without bidirectionality, Memory Networks, Graph Neural Networks, Attention Mechanisms, and Transformers. It is remarkable to observe how, in four years, research evolved from a simple Long Short-Term Memory Network (LSTM) taking into account the context of the conversation to complex gated and graph neural network architectures, and how the invention of the Transformer was reflected not only in word embeddings, such as BERT, but also in new classifier architectures. Comparisons between these works are performed based on the reported weighted F1-score without the majority class on the IEMOCAP dataset, the choice of most authors, including the prominent works of Poria et al. (2017), Hazarika et al. (2018b), Majumder et al. (2019), Ghosal et al. (2019), Zhong et al. (2019), Shen et al. (2021a), Ghosal et al. (2020), Wang et al. (2020b), Li et al. (2021a), Shen et al. (2021b) and Lei et al. (2023), since the dataset is not too imbalanced.
4.1 Convolutional neural networks for feature extraction
Convolutional Neural Networks (CNNs) are a type of feedforward neural network inspired by the biology of the visual cortex, mimicking the local filtering over the input space performed by the receptive fields. In the case of text, a linear layer is applied over windows of the sequence of word embeddings, separately and identically at each position. The convolutional layer can have many filters, each extracting a different feature, and is usually followed by a pooling operation. In the case of Emotion Recognition in Conversations, convolutional layers act as feature extractors, which is useful since strong predictors of the label can appear in different places in the input. With the advent of Pre-Trained Language Models for utterance representation, elaborated on in subsection 4.10, convolutional neural networks have been less used.
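A minimal PyTorch sketch of such a feature extractor, under assumed dimensions: filters of several widths slide over the word embeddings of an utterance, and max-pooling keeps each filter’s strongest activation wherever it occurs.

```python
import torch
import torch.nn as nn

# A minimal sketch (assumed dimensions) of a CNN utterance feature extractor.
class CNNFeatureExtractor(nn.Module):
    def __init__(self, emb_dim=300, n_filters=100, widths=(3, 4, 5)):
        super().__init__()
        # One convolution per filter width, each sliding over the word sequence
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, w) for w in widths
        )

    def forward(self, x):               # x: (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)           # Conv1d expects (batch, channels, seq_len)
        # Max-pool over positions: keep each filter's strongest activation
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(feats, dim=1)  # (batch, n_filters * len(widths))

features = CNNFeatureExtractor()(torch.randn(2, 20, 300))  # -> shape (2, 300)
```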
4.2 The multi-layer perceptron
Emotion Recognition is a classification task. The simplest neural network architecture that can be employed for such a task is the Multi-Layer Perceptron.
The Multi-Layer Perceptron (MLP) is a non-linear function approximator that can be used for classification and regression. It is a class of feed-forward artificial neural networks that is composed of at least three layers: an input layer, a hidden layer, and an output layer. Its complexity and representation power increase with the number of layers. An MLP is mathematically described by the following equation:
\[
h_0 = x, \qquad h_l = f_l(W_l h_{l-1} + b_l), \quad l = 1, \dots , L
\]
in which L is the number of layers, \(W_l \in \mathbb {R}^{d_l \times d_{l-1}}\) and \(b_l \in \mathbb {R}^{d_l}\) are the weight matrix and bias vector corresponding to layer l, with \(d_l\) units in layer l, and \(f_{l}\) is a non-linear activation function corresponding to layer l. Except for the input layer, all layers’ units are followed by a non-linear activation function, such as the sigmoid, the hyperbolic tangent, or the rectified linear unit. When performing classification, the last layer is followed by the sigmoid activation function for binary classification or the softmax activation function for multiclass classification. A multiclass MLP with one hidden layer is depicted in Fig. 4.
The MLP is commonly trained with the Backpropagation algorithm (Rumelhart et al. 1986). The idea is to update the model parameters in the opposite direction of the gradient, such that a loss computed from the prediction error is minimised. The algorithm is composed of a forward pass and a backward pass. In the forward pass, the input is passed through all the layers in the network, yielding a prediction. In the backward pass, the gradients of the model’s parameters are computed with respect to the loss function, making use of the chain rule. These are computed one layer at a time, iterating backward from the last layer. The computation of the gradients and the weight updates follow an optimization method, commonly a Stochastic Gradient Descent (SGD) algorithm.
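A minimal PyTorch sketch of this training loop; the feature dimension and the six emotion classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A minimal sketch of an MLP emotion classifier trained with SGD;
# the 1000-dim features and 6 emotion classes are illustrative assumptions.
model = nn.Sequential(
    nn.Linear(1000, 128), nn.ReLU(),   # hidden layer with non-linearity
    nn.Linear(128, 6),                 # output layer: one logit per emotion
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()        # applies softmax internally

x = torch.randn(32, 1000)              # batch of utterance feature vectors
y = torch.randint(0, 6, (32,))         # gold emotion labels

optimizer.zero_grad()
logits = model(x)                      # forward pass
loss = loss_fn(logits, y)
loss.backward()                        # backward pass: gradients via chain rule
optimizer.step()                       # SGD parameter update
```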
When used for Emotion Recognition, the input to the network is a set of extracted features representing each utterance, obtained for example with the Bag-of-Words approach for text. This representation does not account for the order of the words, fitting this type of network but not being ideal for the task, which is the main reason why state-of-the-art works do not resort to this architecture.
4.3 Elman recurrent neural network
Feed-forward neural networks, such as the above, assume data has no sequential dependencies, which is not the case in Emotion Recognition, in which an utterance is a sequence of words that can also comprise sequential data from other modalities. Recurrent Neural Networks (RNNs) introduce memory into the network, being able to learn sequences. They process one input, such as a word, at a time, producing a hidden state which is a summary of the sequence of inputs up to the current timestep. The simplest one, the Elman RNN (Elman 1991), updates the hidden state as in the following equation:
\[
h_t = f(W x_t + U h_{t-1} + b)
\]
where \(h_t \in \mathbb {R}^{h}\) is the hidden state at time t, \(x_t \in \mathbb {R}^{N}\) is the input at time t, \(W \in \mathbb {R}^{h\times N}\), \(U \in \mathbb {R}^{h\times h}\) and \(b \in \mathbb {R}^{h}\) are parameters to be learned, and f is a non-linear activation function such as the sigmoid or the hyperbolic tangent. At each timestep the RNN can produce an output:
\[
y_t = g(V h_t + c)
\]
where \(y_t \in \mathbb {R}^{y}\) is the output at time t, \(V\in \mathbb {R}^{y\times h}\) and \(c \in \mathbb {R}^{y}\) are parameters to be learned, and g is a non-linear activation function.
A visualization of the RNN update through time-steps is provided in Fig. 5.
RNNs are trained with a modification of the Backpropagation algorithm, the Backpropagation Through Time (BPTT) algorithm (Werbos 1990). All the parameters are shared between all time steps and it can be observed that an unrolled RNN is comparable to a feed-forward neural network with the same parameters shared over layers.
In the case of Emotion Recognition, the input to the network is a sequential utterance, with the text corresponding to one word being fed per timestep, and the output is the predicted emotion.
In some cases, particularly when dealing with long sequences such as when considering several utterances to model context in Emotion Recognition in Conversations, the backpropagation algorithm yields vanishingly small gradients, preventing the parameters from changing their values and eventually stopping training.
4.4 Long short-term memory networks
To overcome this problem, Long Short-Term Memory Networks (LSTMs) (Hochreiter and Schmidhuber 1997) were proposed. LSTMs are composed of a memory cell and three gates. The memory cell stores the information about the input sequence across timesteps, while the gates control the flow of information across the memory cell: the input gate controls the proportion of the current input to include in the memory cell; the forget gate the proportion of the previous memory cell to forget; and the output gate the information to output from the current memory cell. This can be described by the following equations:
\[
\begin{aligned}
i_t &= \text{sigm}(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \text{sigm}(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \text{sigm}(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh (W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh (c_t)
\end{aligned}
\]
where \(i_t\) is the input gate, \(f_t\) the forget gate, \(o_t\) the output gate, \(c_t\) the memory cell, \(\odot \) the element-wise product, and sigm and tanh the element-wise sigmoid and hyperbolic tangent activation functions, respectively. The several \(W\), \(U\), and \(b\) are parameters to be learned. A visual depiction of an LSTM cell is provided in Fig. 6.
The vanishing gradient problem is overcome by the gates, especially the forget gate.
As in the Elman RNN, in the case of Emotion Recognition the input to the network can be a sequential utterance and the output is the predicted emotion. LSTMs are usually leveraged to model dependencies between several sequential utterances. In this case, the input to the network is a sequence of utterances.
The LSTM described in the equations above does not capture information from future time steps, which is useful for example in text applications when the context of a word is made up of both previous and subsequent words in the phrase. Therefore, Bidirectional Long Short-Term Memory Networks (Bi-LSTM) (Graves et al. 2005) were proposed. Bi-LSTMs consist of two LSTMs: one executing a forward pass and the other a backward pass. A sequence of hidden states is produced for each direction that are usually joined into a unique sequence of hidden states through concatenation.
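A minimal PyTorch sketch, with assumed dimensions, of a Bi-LSTM producing context-aware representations for a sequence of utterance feature vectors:

```python
import torch
import torch.nn as nn

# A minimal sketch of context modelling with a Bi-LSTM: each utterance is
# already a feature vector, and the Bi-LSTM yields representations informed
# by both past and future utterances. Dimensions are illustrative assumptions.
bilstm = nn.LSTM(input_size=100, hidden_size=64,
                 bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * 64, 6)      # forward and backward states concatenated

utterances = torch.randn(1, 10, 100)   # one conversation of 10 utterance vectors
context, _ = bilstm(utterances)        # (1, 10, 128): context-aware states
logits = classifier(context)           # one emotion distribution per utterance
```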
4.4.1 Context-dependent emotion recognition (Poria et al. 2017)
There is a strong correlation and influence in emotion distribution between sequential utterances in a conversation. Leveraging those contextual dependencies is key to the performance of a classifier. An utterance is a unit of speech which is bounded by breaths or pauses (Olson 1977). In textual data, it is usually delimited by punctuation such as full stops, but can also be delimited by question or exclamation marks. Along these lines, the work of Poria et al. (2017) considers interdependences among utterances. The approach consists of an architecture made of LSTMs to extract contextual features from the utterances. The model enables consecutive utterances to share information while preserving their order. The authors propose the contextual LSTM, consisting of unidirectional LSTM cells, the hidden LSTM, in which the dense layer after the LSTM cell is omitted, and the bi-directional contextual LSTM that considers information from before and after the targeted utterance. It achieved an F1-score of 54.95% on the IEMOCAP dataset.
4.5 Gated recurrent units
Another RNN, whose structure is somewhat similar to but with fewer parameters than the LSTM, is the Gated Recurrent Unit (GRU) (Cho et al. 2014), which can be described by the following equations:
\[
\begin{aligned}
z_t &= \text{sigm}(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \text{sigm}(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \tanh (W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t
\end{aligned}
\]
in which \(r_t\) is the reset gate, which leads the candidate hidden state \(\hat{h}_t\) to ignore or not the previous hidden state, and \(z_t\) is the update gate, which controls the proportion of the information from the previous hidden state to carry over to the current hidden state.
GRUs can work better than LSTMs when there is less available data since they have fewer parameters, and are less prone to overfitting.
4.5.1 DialogueRNN (Majumder et al. 2019)
DialogueRNN combines GRUs in a sophisticated way, in which each GRU plays a specific role in modelling the conversation.
DialogueRNN predicts the emotion of an utterance based on the speaker, the context of preceding utterances, and the emotions from preceding utterances. The model has three components all modeled by GRUs. The party-state models the parties’ emotion dynamics throughout the conversation. The global state has the encoding of preceding utterances and the party state, and models the context of the utterance. Finally, the emotion representation is based on the party state and the global state and is used to perform the final classification.
Several variants of the method are proposed. DialogueRNN\(_l\) considers an additional listener state while a speaker utters, BiDialogueRNN uses a bidirectional RNN, DialogueRNN+Att uses attention over all surrounding emotion representations, and BiDialogueRNN+Att combines the latter approaches.
It yielded an F1-score of 62.75% on the IEMOCAP dataset. The authors argue DialogueRNN variants outperform the contextual LSTM due to better context representation.
4.6 Attention mechanisms
The aforementioned work mentions the use of attention over surrounding emotion representations. To motivate Attention Mechanisms (AM) (Bahdanau et al. 2014), we now introduce Sequence to Sequence (Seq2Seq) models. Seq2Seq models are composed of an RNN encoder that takes the input sequence and outputs a context vector, the last encoder hidden state, which is fed into an RNN decoder that generates an output sequence from this vector. When the input sequence is very long, the context vector does not have enough capacity to represent the information from the entire sequence. This is where AMs are useful. Taking inspiration from the human visual system, which attends to different parts of the space while building its representation of the scene, the AM allows the decoder to attend to the relevant encoder hidden states. At each time step, a context vector is obtained by a weighted sum of these hidden states, where the weights of the sum are proportional to the similarity between the current decoder hidden state and the encoder hidden states, as described in the following equations:
\[
c_t = \sum _{i} \alpha _{t,i} \, h_i^e, \qquad \alpha _{t,i} = \frac{\exp \left( \text{score}(h_t^d, h_i^e)\right) }{\sum _{j} \exp \left( \text{score}(h_t^d, h_j^e)\right) }
\]
where \(h_t^d\) is the decoder hidden state and \(h_i^e\) and \(h_j^e\) are the encoder hidden states. The most common similarity score function is the dot product. By introducing an AM in an RNN, the performance of a classifier usually increases.
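A minimal NumPy sketch of these equations with the dot-product score:

```python
import numpy as np

# A minimal sketch of the attention equations above: the context vector is a
# softmax-weighted sum of encoder hidden states, scored by dot-product
# similarity with the current decoder hidden state.
def attention(h_dec, H_enc):           # h_dec: (h,), H_enc: (T, h)
    scores = H_enc @ h_dec             # dot-product score per encoder state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over encoder positions
    return weights @ H_enc             # weighted sum: the context vector

context = attention(np.random.randn(8), np.random.randn(5, 8))  # shape (8,)
```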
4.7 Memory networks
4.7.1 Conversational memory network (Hazarika et al. 2018b)
The recurrent networks described before are limited in terms of long-range summarization since they rely on a sequential processing approach with implicit memory. The Conversational Memory Network (CMN) accounts for this factor through the use of memory networks, which can better capture long-term dependencies and have attention models that can summarize specific details using explicit memory structures.
The work is based on the notion that emotional dynamics involve self and inter-speaker emotional influence. Separate histories for each speaker, consisting of the speaker’s previous utterances, are modeled into memory cells of the Conversational Memory Network using GRUs. Memory networks provide a memory component that can be read from and written to and also perform inference. The CMN then employs an attention mechanism over historical utterances from each speaker to filter out relevant content for the current utterance. This mechanism is repeated for multiple hops of the network to subsequently classify the utterance. CMN achieves an F1-score of 56.13% on the IEMOCAP dataset.
4.7.2 Interactive conversational memory network (Hazarika et al. 2018a)
The Interactive Conversational Memory Network (ICON) is an improvement upon CMN that adds a dynamic global influence module to model inter-personal emotional influence. This module maintains a global representation of the conversation, a global state that is updated using a GRU operation on the previous state and the current speaker’s history. ICON yields an F1-score of 58.54% on the IEMOCAP dataset.
4.8 Graph neural networks
An alternative to gated neural networks that is also useful to capture long-term dependencies is graph neural networks (GNNs). GNNs are based on graphs, which model a set of objects, the nodes, and their relationships, the edges. While standard neural networks such as CNNs and RNNs cannot handle graph input properly, since they need a specific order for the nodes, the output of GNNs is invariant to the input order of the nodes. Furthermore, while in standard neural networks dependencies between nodes are a node feature, in GNNs there are edges to represent these dependencies. GNNs can propagate information via the graph structure instead of using it as part of the features (Zhou et al. 2020).
Two common types of GNNs are recurrent GNNs (RecGNNs) and convolutional GNNs (ConvGNNs). RecGNNs learn node representations with recurrent neural architectures, assuming that a node in a graph constantly exchanges information with its neighbors until a stable equilibrium is reached. ConvGNNs generalize the convolution operation from grid structure to graph structure, generating a node representation by aggregating its features with its neighbours’ features. For classification tasks, such as Emotion Recognition, an end-to-end framework can be constructed for example by stacking graph convolutional layers followed by a softmax layer (Wu et al. 2020).
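A minimal NumPy sketch of one such graph convolutional layer, under a simplified, assumed formulation with mean aggregation over neighbours:

```python
import numpy as np

# A minimal sketch of one graph convolutional layer: each node (utterance)
# is updated by aggregating its neighbours' features, projecting, and
# applying a non-linearity. A is the adjacency matrix, X the node features.
def gcn_layer(A, X, W):
    A_hat = A + np.eye(A.shape[0])               # add self-loops: keep own features
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))     # normalize by node degree
    return np.maximum(D_inv @ A_hat @ X @ W, 0)  # mean-aggregate, project, ReLU

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-utterance chain
X = np.random.randn(3, 16)                       # initial utterance features
H = gcn_layer(A, X, np.random.randn(16, 8))      # updated node representations
```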
4.8.1 DialogueGCN (Ghosal et al. 2019)
Dialogue Graph Convolutional Network (GCN) takes into account intra and inter-speaker dependency and background information, having a sequential and a speaker-level encoder. The sequential context encoder encodes the utterances using a bidirectional GRU and is speaker-agnostic. The speaker-level context encoder creates a directed edge-labelled graph in which each utterance, enriched with context, is a node in the graph. The information from neighbor nodes is then aggregated and passed to a neural network in each node, resulting in an updated representation of each node that takes into account its neighbors. In DialogueGCN, node features are initialized with sequentially encoded feature vectors from the sequential context encoders, and edge weights are set using a similarity-based attention module. The utterance level emotion classification is turned into a problem of node classification in the graph. DialogueGCN achieves an F1-score of 64.18% on the IEMOCAP dataset.
4.9 Transformer
Despite the success of recurrent and graph neural networks in a wide variety of tasks including Emotion Recognition, a more powerful network model exists, constituting the current state-of-the-art: the Transformer (Vaswani et al. 2017). The Transformer is also better at capturing long-term dependencies than gated RNNs, due to its shorter path of information flow, which is useful when modelling several utterances in Emotion Recognition in Conversations. It does not resort to recurrence or convolutions. It is also more parallelizable and needs less training time. It is composed of an Encoder and Decoder, which can be visualized in Fig. 7.
Each Encoder layer is composed of two sub-layers. The first is a so-called self-attention layer for building the context of the sequence, which calculates the relevance of the tokens in the sequence for each given token.

The self-attention mechanism takes as input three matrices, the Query, Key, and Value matrices. These are computed from the dot product between the input sequence matrix and weight matrices learned in the training phase. It then computes the attention scores, as described in the following equation:
\[
\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^\top }{\sqrt{k}}\right) V
\]
in which \(Q \in \mathbb {R}^{N\times m}\), \(K \in \mathbb {R}^{N\times m}\) and \(V \in \mathbb {R}^{N\times m}\) are the Query, Key, and Value matrices, respectively, and m is the model output dimension. The dot product of the Query matrix with the Key matrix represents scores that are then multiplied by the Value matrix to keep intact the values of relevant words and drown out the irrelevant ones. The \(\frac{1}{\sqrt{k}}\) factor helps stabilize the gradients.
The self-attention layer uses a multi-head attention mechanism, so that the attention score vectors do not exclusively give a high probability score to their corresponding token, defined by the following equations:
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots , \text{head}_h) W^O
\]
where
\[
\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
\]
in which \(W_i^Q \in \mathbb {R}^{m\times k}\), \(W_i^K \in \mathbb {R}^{m\times k}\) and \(W_i^V \in \mathbb {R}^{m\times v}\) are projection matrices and \(W^O \in \mathbb {R}^{hv\times m}\), where h is the number of heads and k=v=m/h.
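A minimal NumPy sketch of a single attention head following the equations above (dimensions are illustrative):

```python
import numpy as np

# A minimal sketch of single-head scaled dot-product self-attention:
# project the input to Q, K, V, score all token pairs, softmax row-wise,
# and return the weighted values. Dimensions are illustrative assumptions.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # project input to Q, K, V
    scores = Q @ K.T / np.sqrt(K.shape[1])        # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
    return weights @ V                            # weighted values per token

X = np.random.randn(10, 32)                       # 10 tokens, model dim m = 32
out = self_attention(X, *(np.random.randn(32, 8) for _ in range(3)))  # (10, 8)
```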
The output of the self-attention layer is fed to the second layer, a fully connected feed-forward network, with two linear transformations with a ReLU activation in between, applied separately and in the same way to each position, allowing for parallelization.
A residual connection is applied around each of the two sub-layers, followed by layer normalization.
Each Decoder layer is composed of three sub-layers. The first is a masked self-attention layer similar to the self-attention layer described mathematically above, but in which each position in the decoder attends to all positions in the decoder up to and including that position, which is attained by masking out values in the softmax. The second is an “encoder-decoder attention” layer, described mathematically by the same equations, where K and V come from the output of the encoder and Q comes from the output of the decoder’s masked self-attention layer, mimicking the typical encoder-decoder attention mechanisms in Seq2Seq models. The last layer is a feed-forward neural network, similar to the one in the Encoder layer.
As in the Encoder, a residual connection is applied around each of the sub-layers, followed by layer normalization.
Positional encodings are added to the embeddings that are fed to the Encoder and Decoder to introduce information about the position of the tokens in the sequence.
The most common use of the Transformer is as the backbone for a fine-tuned pre-trained transformer-based language model, such as BERT (Devlin et al. 2019), described in subsection 4.10. These language models are trained with large amounts of data to perform many tasks and only need to be fine-tuned for the specific task at hand. In the case of a classification task such as Emotion Recognition, one just needs to add a linear layer or a more complex neural network architecture on top of the language model. The training is performed with supervised learning, resorting to backpropagation algorithms.
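A minimal sketch of this setup using the Hugging Face transformers library; the checkpoint name and the six emotion classes are illustrative assumptions:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A pre-trained language model with a linear classification head on top,
# to be fine-tuned with cross-entropy over the gold emotion labels.
# The checkpoint and the 6 classes are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6  # one logit per emotion class
)
inputs = tokenizer("I can't believe we won!", return_tensors="pt")
logits = model(**inputs).logits        # shape (1, 6): scores per emotion
```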
4.9.1 Knowledge-enriched transformer (Zhong et al. 2019)
The invention of the Transformer (Vaswani et al. 2017) led to new state-of-the-art results in several Natural Language Processing tasks. The Knowledge-Enriched Transformer (KET) uses self-attention to model context and response using the Transformer, which has a shorter path of information flow than gated RNNs, overcoming the difficulty in capturing long-term dependencies. Conversations are modeled in the entire Transformer as a single input.
First, concepts are retrieved for each word in the conversation using an external knowledge base, a graph of concepts, then the concept representation is computed using a dynamic context-aware graph attention mechanism over the concepts that are then combined with the input sentence embeddings. Self-attention is used to learn utterance representations individually and hierarchical self-attention is applied to the context to learn the context representation. Finally, the encoder and decoder attention mechanism is applied, followed by max pooling and a linear layer. KET achieves an F1-score of 59.56% on the IEMOCAP dataset.
4.9.2 CESTa (Wang et al. 2020b)
CESTa (Contextualized Emotion Sequence Tagging) approaches ERC as a sequence tagging task, choosing the set of tags with the highest likelihood for the whole utterance sequence at once, by using a Conditional Random Field (CRF) (Lafferty et al. 2001). In order to capture long-range global context, utterance representations are generated by a multi-layer Transformer encoder. These are then fed to a Bi-LSTM encoder that captures self- and inter-speaker dependencies, resulting in contextualized representations of utterances that are then passed to the CRF layer. It achieves an F1-score of 67.10% on the IEMOCAP dataset.
4.9.3 DialogXL (Shen et al. 2021a)
DialogXL is an adaptation of the pre-trained language model XLNet (Yang et al. 2019), which is based on the Transformer-XL (Dai et al. 2019), for Emotion Recognition in Conversations. It replaces XLNet’s segment recurrence with a memory utterance recurrence to leverage historical utterances. These utterances’ hidden states are stored in a memory bank to reuse them while identifying a query utterance. The approach also replaces XLNet’s vanilla self-attention proposing a dialog-aware self-attention to grasp useful intra- and inter-speaker dependencies. This is composed of four types of self-attention: global and local self-attention for different sizes of receptive fields, and speaker and listener self-attention for intra- and inter-speaker dependencies. DialogXL achieves an F1-score of 65.94% on the IEMOCAP dataset.
4.10 Embeddings from pre-trained transformer language models
Text features must represent words in a way that is useful for the models, in this case, numeric vectors referred to as word embeddings. These embeddings yield similar representations for similar words.
Static embeddings (Mikolov et al. 2013; Pennington et al. 2014; Mikolov et al. 2018) are obtained based on the co-occurrence of adjacent words and have a fixed representation for the words, not taking into account their context.
In contrast, contextual embeddings include information from the context in the word representation, a representation that differs according to the occurrence of the word. The only disadvantage of this kind of representation is that the embeddings cannot be pre-computed and simply looked up, as with static embeddings, so there is an inference cost for generating them. The models are pre-trained with large amounts of data in order to generate good embeddings.

A first approach, Embeddings from Language Models (ELMo) (Peters et al. 2018), is based on a stack of two Bi-LSTMs, leveraging the full context of a word. It provides a context-free representation of the word along with context information on the sense of the word and its syntax. A second approach, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019), built from a Transformer encoder stack, leads to state-of-the-art results in multiple Natural Language Processing tasks. The input given to BERT is the sum of three embeddings: the token embedding, corresponding to the sequence of words; the segment embedding, containing information on which sentence each token belongs to; and the position embedding, defining the position of each token in the sequence and its distance to other tokens, taking into account the order of the sequence. An improvement upon BERT, RoBERTa (Liu et al. 2019), is trained with more data and for a longer period of time.

Given that context is key for text Emotion Recognition in Conversations, contextual word embeddings are more adequate for the task than static embeddings, as can be seen in the higher performance of works that resort to these embeddings, namely those resulting from pre-trained language models, such as BERT-based embeddings. The latter are also better tailored to deal with common sense, informal language, and sarcasm, the ERC challenges described in Sect. 2, since they are pre-trained on large amounts of text.
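A minimal sketch of extracting such contextual utterance embeddings with the Hugging Face transformers library (the checkpoint choice is illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# A minimal sketch of extracting contextual utterance embeddings from a
# pre-trained RoBERTa model (no fine-tuning), e.g. as input to an ERC classifier.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

inputs = tokenizer("I never expected this!", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
utterance_embedding = hidden[:, 0]              # first-token (<s>) representation
```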
4.10.1 COSMIC (Ghosal et al. 2020)
Besides choosing appropriate utterance representations and classifier architectures, there are other factors that can be leveraged for the task of ERC. One such factor is the commonsense knowledge of the interlocutors that plays a central role in inferring the latent variables of a conversation, such as speaker state and intent.
COSMIC extracts commonsense features resorting to COMET (Bosselut et al. 2019), a commonsense transformer encoder-decoder model trained on the task of generative commonsense knowledge construction that uses GPT (Radford et al. 2019) as its generative model. The approach uses RoBERTa for context-independent feature extraction, passes each utterance through COMET’s encoder, and extracts the activations from the final time step. COSMIC maintains a context state and attention vector, which are always shared between the participants of the conversation. Five GRUs are used to model the context state, internal state, external state, intent state, and emotion state, the latter being used for emotion classification. This work achieves an F1-score of 65.28% on the IEMOCAP dataset.
4.10.2 Psychological (Li et al. 2021a)
Psychological also leverages COMET as the commonsense knowledge base and resorts to RoBERTa as the utterance-level encoder. It proposes SKAIG as a conversation-level encoder, a locally connected graph in which the targeted utterance receives information from the past and future context and is also self-connected.
Assuming that the influence of an utterance on contextual utterances is locally effective, the targeted node is connected with contextual nodes in a window of a given size of utterances in the past and future. Knowledge from COMET is introduced to enrich the edges with different relations. To propagate information through SKAIG, a Graph Transformer (Shi et al. 2021) is used. Finally, a linear unit predicts the emotion distributions. This work achieves an F1-score of 66.96% on the IEMOCAP dataset.
4.10.3 DAG-ERC (Shen et al. 2021b)
Graph-based methods tend to neglect distant utterances and sequential information. On the other hand, recurrence-based methods leverage those but tend to update the query utterance’s state with limited information from the nearest utterances. Combining both graph and recurrence structures explores their advantages and mitigates their drawbacks. This is the idea behind DAG-ERC, which regards each conversation as a directed acyclic graph (DAG), a combination of both structures. DAG-ERC is based on the DAGNN (Thost and Chen 2021) architecture, with two improvements: a relation-aware feature transformation that gathers information based on speaker identity and a contextual information unit that enhances the information of historical context. RoBERTa-Large is used as the feature extractor. It achieves an F1-score of 68.03% on the IEMOCAP dataset, outperforming the previous approaches.
4.11 ERC within a generative framework
With the advent of Generative Large Language Models (LLMs), such as the GPT series (Radford et al. 2018, 2019; Brown et al. 2020) and Llama (Touvron et al. 2023), classification tasks started to be reformulated within generative frameworks. The idea is to prompt the LLM with the utterance to be classified and other relevant additional input and request an emotion label for the utterance. Open-source LLMs can also be fine-tuned and even pre-trained.
4.11.1 InstructERC (Lei et al. 2023)
In InstructERC, the prompt consists of the utterance to be classified, the ERC instruction, the conversational context, the set of possible classification labels, and similar utterances together with their emotion labels. The LLM is pre-trained with a speaker identification task and fine-tuned both with the main ERC task and an emotion influence prediction task to improve overall performance. Experiments were made with several LLMs, and the best results were obtained with Llama 2. InstructERC achieves an F1-score of 71.39% on the IEMOCAP dataset, outperforming the previous approaches.
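An illustrative prompt in the spirit of the components listed above; the exact template and the example conversation are ours, not the paper’s:

```python
# An illustrative generative-ERC prompt; template and conversation are
# invented, not the InstructERC paper's exact format.
prompt = """Instruction: predict the emotion of the target utterance.
Labels: happy, sad, neutral, angry, excited, frustrated
Context:
  A: I got the internship in New York.
  B: Wow, that is amazing news!
Target utterance (B): "Wow, that is amazing news!"
Emotion:"""
# The (fine-tuned) LLM is expected to generate a label, e.g. "excited".
```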
5 Advisable ERC practices
This section provides useful methods for better ERC frameworks, from different ways of dealing with subjectivity in emotion annotation and modelling, to several ways of dealing with the typically unbalanced ERC datasets.
5.1 Learning subjectivity
All human annotation tasks involve some degree of uncertainty, which can stem, amongst other factors, from the subjectivity inherent to the annotator. Since Emotion Recognition is amongst the most subjective tasks (Uma et al. 2021) and subjectivity is the main source of uncertainty in Emotion Recognition annotation, we address uncertainty as subjectivity in this survey (Rizos and Schuller 2020). A good annotation practice is to resort to several annotators, to avoid biasing the results towards a single annotator’s subjectivity. This will naturally yield different labels from each annotator, the so-called annotation disagreement. The more appropriately one deals with the disagreement in the annotated labels, the more reliable the data and the resulting classifiers will be. Several techniques for dealing with disagreement are described in this section. They are divided into techniques that address subjectivity by adapting the labels, towards an estimate of a gold label, and techniques that embrace subjectivity by adapting the classifier architectures to deal with the subjectivity of the labels, leveraging such property.
5.1.1 Addressing subjectivity
For categorical labels, considering that a gold or true label exists for each sample, the simplest way to obtain it would be through majority voting, choosing the label with the most annotations. However, this does not account for the different annotator skill levels or the samples’ inherent difficulty.
Thus, a common process is to calculate an inter-annotator agreement score to find low-agreement data. A very simplistic way of calculating inter-annotator agreement for categorical labels would be to use percent agreement, dividing the number of agreeing classifications by the total number of classifications. However, non-expert annotators do not always know which label to use, so they sometimes just guess, resulting in a random label. Cohen pointed out that there is some level of agreement in these random labels and developed Cohen’s Kappa to take that factor into account (McHugh 2012). The related Fleiss’ Kappa (Fleiss 1971) is the adaptation of Cohen’s Kappa for 3 or more annotators.
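A minimal sketch of computing Cohen’s Kappa with scikit-learn over illustrative annotations from two raters:

```python
from sklearn.metrics import cohen_kappa_score

# Chance-corrected agreement between two annotators' categorical labels;
# the annotations below are illustrative.
rater_a = ["joy", "anger", "neutral", "joy", "sadness"]
rater_b = ["joy", "neutral", "neutral", "joy", "anger"]
kappa = cohen_kappa_score(rater_a, rater_b)  # 1.0 = perfect, 0 = chance level
```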
The low-agreement data can then be discarded, or trained on and evaluated separately from the high-agreement data. It can be treated as random noise if it does not reveal the existence of annotator bias (Uma et al. 2021).
For interval labels, weighted fusion of the annotations can be performed with approaches such as the Evaluator Weighted Estimator (EWE) (Grimm and Kroschel 2005), which weights each rater’s annotations by a rater-specific inter-reliability score, or the Weighted Trustability Evaluator (WTE) (Hantke et al. 2016), which is instead based on each rater’s performance consistency.
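A minimal sketch of EWE-style fusion follows, assuming every rater annotated the same samples. Each rater is weighted by how well their annotations correlate with the mean of the remaining raters, one common formulation of the estimator; see Grimm and Kroschel (2005) for the original derivation.

```python
import numpy as np

def ewe_fuse(ratings):
    """Evaluator-Weighted-Estimator-style fusion (sketch).
    ratings: (num_raters x num_samples) interval annotations, e.g. valence.
    Each rater is weighted by the correlation of their annotations with the
    mean annotation of the remaining raters (negative weights clipped)."""
    ratings = np.asarray(ratings, dtype=float)
    K = ratings.shape[0]
    weights = np.zeros(K)
    for k in range(K):
        others_mean = ratings[np.arange(K) != k].mean(axis=0)
        weights[k] = max(np.corrcoef(ratings[k], others_mean)[0, 1], 0.0)
    return (weights[:, None] * ratings).sum(axis=0) / weights.sum()

# Three raters annotating valence for five utterances; rater 3 is unreliable
# and receives (near-)zero weight.
ratings = [[0.1, 0.4, 0.8, 0.3, 0.9],
           [0.2, 0.5, 0.7, 0.2, 0.8],
           [0.9, 0.1, 0.2, 0.8, 0.1]]
print(ewe_fuse(ratings))
```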
5.1.2 Embracing subjectivity (Rizos and Schuller 2020)
The annotator (dis)agreement can be used as privileged information for master-class learning, being viewed as additional per-sample information (Eyben et al. 2012) that facilitates learning in the training phase but may not be required for making predictions during testing, for example, by weighting the loss of the corresponding samples. In this setting, it is useful to learn what high rater disagreement means for the particular dataset in use.
Instead of assuming the existence of a hard label, as in Sect. 5.1.1, one can compute the distribution of label annotations to define a soft label distribution per sample, which encodes the subjectivity of each sample and allows the classifier to learn label correlations. Another approach is to model the annotations of each particular rater through an ensemble of models, which allows giving more importance to expert raters than to novices or spammers. Adding an “unsure” label would ease the annotators’ workload and let the model learn the ambiguity ground truth, even though “unsure” labels need not be predicted during testing. Furthermore, methods to determine whether a sample is mislabelled by certain raters or inherently ambiguous could inform the model and the annotators during active learning.
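The soft-label approach translates directly into training code. Below is a sketch, assuming PyTorch and stand-in features and classifier, in which per-sample annotation counts become target distributions for the cross-entropy loss.

```python
import torch
import torch.nn.functional as F

# Annotation counts per sample over 4 emotion classes (3 annotators each).
counts = torch.tensor([[2., 1., 0., 0.],
                       [0., 0., 3., 0.],
                       [1., 1., 1., 0.]])
soft_labels = counts / counts.sum(dim=1, keepdim=True)  # per-sample distributions

features = torch.randn(3, 16)        # stand-in utterance representations
classifier = torch.nn.Linear(16, 4)  # stand-in classification head
logits = classifier(features)

# Cross-entropy against the soft distributions rather than hard labels, so the
# classifier learns to reproduce the annotators' disagreement (PyTorch >= 1.10
# accepts probability targets).
loss = F.cross_entropy(logits, soft_labels)
loss.backward()
```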
In ERC, the context of the interactions is not always explicit, and the dialogues cover a variety of topics that can exacerbate subjectivity even further, making it all the more advisable to embrace subjectivity in this domain.
5.2 Dealing with unbalanced data
All the benchmark datasets used for ERC are unbalanced, some of them severely: the unbalance ratio varies from 1:3 up to 1:1156. In the presence of such unbalance, most classifiers tend to favor the majority class, which in these datasets corresponds to the neutral emotion class. Hence, unless the unbalance issue is addressed, most models will be trained and evaluated mainly on the neutral class results, which is far from intended. All classes should be equally relevant in ERC, and even when Accuracy values look acceptable, a more detailed analysis usually reveals very poor performance on the minority classes.
5.2.1 Metrics for unbalanced data
The performance metrics used for Emotion Recognition in Conversations are Accuracy, Precision, Recall, and the F-score or F1, the harmonic mean of Precision and Recall, all commonly used in many classification applications. One can consider the weighted versions of these metrics, which take into account the relative frequency of each class, or micro-averaging, in which all samples contribute equally to the final averaged metric. Since both weighted versions and micro-averaging depend more on the classes with more items, they are not good indicators of the performance on the minority classes, and should not be used when the classes have equal importance but unequal frequencies. In unweighted macro-averaging, the metric is computed for each class and the results are averaged over all classes. By maximizing an unweighted metric, one maximizes the performance of the model in correctly classifying all classes regardless of the number of items they have.
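The contrast between the averaging schemes is easy to reproduce. In the toy evaluation below (using scikit-learn), a classifier that favors the majority class scores high under micro- and weighted averaging but is penalized by the macro F-score.

```python
from sklearn.metrics import f1_score, classification_report

# Toy predictions on an unbalanced test set dominated by "neutral" (class 0).
y_true = [0] * 8 + [1, 1, 2, 2]
y_pred = [0] * 8 + [0, 1, 0, 2]   # classifier favors the majority class

print(f1_score(y_true, y_pred, average="micro"))     # inflated by the majority class
print(f1_score(y_true, y_pred, average="weighted"))  # likewise frequency-dependent
print(f1_score(y_true, y_pred, average="macro"))     # treats all classes equally
print(classification_report(y_true, y_pred))         # per-class breakdown
```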
ERC is clearly a task that deals with real-world unbalanced data. As such, a proper evaluation should always be performed on a test set that maintains the real-world unbalance of the data, using unweighted metrics such as the macro F-score. Ideally, performance results for each individual class should also be presented. Accuracy, micro-averaging, and weighted metrics say little about the performance of the models, since they mostly depend on the majority class, usually neutral. Unweighted metrics that are not affected by dataset unbalance, such as Recall, should also be avoided unless used in conjunction with metrics that are, such as Precision.
5.2.2 Balancing techniques
Besides the use of appropriate metrics, another way of dealing with dataset unbalance is to apply balancing techniques to the training set. Common techniques are random under-sampling, removing observations from the majority class, and random over-sampling, adding copies of observations from the minority class. More elaborate techniques, such as SMOTE, the Synthetic Minority Over-sampling Technique (Chawla et al. 2002), usually improve the results.
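A sketch of these balancing techniques using the imbalanced-learn package on stand-in feature vectors is shown below; as discussed next, resampling should be applied to the training split only.

```python
from collections import Counter
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Stand-in utterance features with a 10:1 majority/minority unbalance;
# resample the training split only, never the test set.
rng = np.random.default_rng(0)
X = rng.normal(size=(110, 32))
y = np.array([0] * 100 + [1] * 10)  # 0 = neutral (majority), 1 = minority

for sampler in (RandomUnderSampler(random_state=0),
                RandomOverSampler(random_state=0),
                SMOTE(random_state=0)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```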
Balancing a training set has its advantages, but it is of utmost importance that the test set maintains the original class balance. Otherwise, metrics such as Precision and the F-score are artificially improved, and the model is unlikely to be effective on unbalanced real-world data. Note that Recall is not affected by training and testing on an artificially balanced dataset, so ignoring Precision and looking only at Recall is not a good indicator of the real-world performance of a model when there is real-world unbalance among the classes.
5.2.3 Few-shot learning
Few-Shot Learning (Wang et al. 2020a) is also an efficient method for dealing with unbalanced data, since, contrary to standard supervised learning, it does not require a large number of training examples. Instead of learning to generalize class identities from a training set to a test set, Few-Shot Learning models learn to discriminate the similarities and differences between classes by training a function that predicts similarity.
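As a concrete instance of similarity-based classification, the sketch below implements a prototypical-networks-style prediction, one common few-shot formulation covered by Wang et al. (2020a): class prototypes are the mean embeddings of a few support examples, and queries are assigned to the nearest prototype. The embeddings here are random stand-ins; in practice they would come from a trained encoder.

```python
import torch

def prototypical_predict(support_emb, support_labels, query_emb, num_classes):
    """Assign each query to the class whose prototype (mean support embedding)
    is nearest; the embedding function plays the role of the learned
    similarity."""
    prototypes = torch.stack([support_emb[support_labels == c].mean(dim=0)
                              for c in range(num_classes)])
    scores = -torch.cdist(query_emb, prototypes)  # negative distance as similarity
    return scores.argmax(dim=1)

# Two support examples per class for five emotion classes, 64-d embeddings.
support = torch.randn(10, 64)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
queries = torch.randn(4, 64)
print(prototypical_predict(support, labels, queries, num_classes=5))
```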
6 Systematic review of ERC works
Table 3 displays a plethora of ERC works since 2017, along with the methods they use and their reported performance.
Several authors resorted to gated neural networks, which can capture dependencies between words and utterances, leading to high reported results on the benchmark datasets. Other authors presented works using graph neural networks, with performances comparable to those obtained with gated neural networks. The best performances, however, were obtained with transformers or transformer-based models, which better capture dependencies in long utterances, combined with the aforementioned state-of-the-art deep-gated and graph-based network architectures for context modelling. This may reflect the promising role that transformers and pre-trained transformer-based language models play in Natural Language Processing and Emotion Recognition in Conversations, as well as the suitability of recurrent gated and graph-based neural network architectures, as elaborated in previous sections.
6.1 ERC challenges revisited
Practically all works leveraged the information of preceding, and sometimes also subsequent (Li et al. 2021a), utterances, resorting to gated or graph neural networks for context modelling over utterances represented by embeddings from fine-tuned pre-trained language models. Another viable way to perform context modelling is to feed several appended utterances to the pre-trained language model (Pereira et al. 2023a). For speaker-specific modelling, gated (Majumder et al. 2019) and graph (Ghosal et al. 2019) neural networks, as well as a combination of both (Shen et al. 2021b), were used. Emotion dynamics modelling was less explored: one work considered a GRU to model the emotion dynamics of each party (Majumder et al. 2019). Some works considered a knowledge base (Zhong et al. 2019; Ghosal et al. 2020; Li et al. 2021a; Zha et al. 2024) to aid in interpreting the meaning behind commonsense expressions. The majority of the works resorted to fine-tuning large pre-trained language models to obtain embeddings, which also aids in capturing the meaning behind commonsense expressions and informal language.
7 Directions for future research
We now present suggestions for future work directions.
The first set of directions concerns increasing the applicability of ERC modules in real-life scenarios. It comprises real-time ERC, recognizing emotion causes, and multilingual ERC, all elaborated in Sect. 2. There are plenty of research opportunities regarding annotation and modelling efforts.
Further exploring mixed emotions with multi-label datasets and classifiers is also promising. Moreover, training the same model on different datasets, by matching emotions from one dataset to another, can improve the generalization capabilities of the classifiers.
Some surveyed works resorted to knowledge bases. These are, however, not specific to emotions; developing emotion-specific knowledge bases would be a useful extension.
Most datasets are unbalanced, motivating the need to leverage balancing techniques and appropriate performance metrics, as elaborated in Sect. 5.2.
Concerning subjectivity and uncertainty in annotations, we highlight using inter-annotator (dis)agreement measures to weight or filter the annotators’ opinions, incorporating these measures as additional information for the classifiers, and considering soft label distributions per sample.
Finally, we encourage further tackling interpretability, as elaborated in Sect. 2, since most works focus on performance.
8 Conclusion
Research in Emotion Recognition in Conversations is advancing at a high pace, and novel application scenarios pose new challenges and opportunities. Although the field has come a long way, there is still much room for improvement, as can be seen from the performance of the classifiers on the benchmark datasets and from the several unexplored directions put forward in the previous section. While we described how current work has addressed several challenges, we also pointed out the opportunities that partly unaddressed challenges constitute.
As main contributions of this survey, we presented partly addressed challenges for ERC, such as recognizing emotion causes, dealing with different taxonomies across datasets, multilingual ERC, and interpretability with associated future work directions. We compiled an extensive list of Deep Learning works in ERC, being the first to simultaneously report their methods, modalities, and performance across various datasets. Finally, our descriptions of Deep Learning methods in the context of ERC provided insights into the suitability of each method for this task, which were not given in previous surveys.
The survey highlights the advantage of leveraging techniques to address unbalanced data, the exploration of mixed emotions, and the benefits of incorporating annotation subjectivity in the learning phase.
Our survey relies on established benchmark datasets. While this aids in comparing different works, it is also a limitation, since results on these benchmarks may not generalize to domain-specific applications, such as particular industries or languages. We also only briefly mention some challenges of real-world applications, without elaborating on the scalability or performance of models in such settings. Our findings may therefore be highly relevant to academic benchmarks but less so to domain-specific applications.
Finally, we point out important ethical aspects pertaining to Emotion Recognition. These include, but are not limited to, whether an Emotion Recognition module should be developed or used for a certain purpose, which data to collect and who the subjects behind the data are, diversity and inclusiveness, privacy and control, and possible biases and misuses of the application (Mohammad 2022). Research in these directions will provide the community with better Emotion Recognition in Conversations modules for current and novel applications.
References
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
Bălan O, Moise G, Petrescu L et al (2019) Emotion classification based on biophysical signals and machine learning techniques. Symmetry 12(1):21
Berrios R, Totterdell P, Kellett S (2015) Eliciting mixed emotions: a meta-analysis comparing models, types, and measures. Front Psychol 6:428
Bosselut A, Rashkin H, Sap M, et al (2019) COMET: commonsense transformers for automatic knowledge graph construction. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 4762–4779
Brown T, Mann B, Ryder N et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901
Busso C, Bulut M, Lee CC et al (2008) Iemocap: interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335–359
Chapuis E, Colombo P, Manica M, et al (2020) Hierarchical pre-training for sequence labelling in spoken dialog. In: Findings of the association for computational linguistics: EMNLP 2020. Association for Computational Linguistics, Online, pp 2636–2648
Chawla NV, Bowyer KW, Hall LO et al (2002) Smote: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
Chen J, Huang P, Huang G, et al (2023) Sdtn: Speaker dynamics tracking network for emotion recognition in conversation. In: ICASSP 2023 - 2023 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1–5
Cho K, van Merriënboer B, Gulcehre C, et al (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1724–1734
Dai Z, Yang Z, Yang Y, et al (2019) Transformer-XL: Attentive language models beyond a fixed-length context. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 2978–2988
Damásio A (2020) Sentir & Saber - A Caminho da Consciência. Temas & Debates
Demszky D, Movshovitz-Attias D, Ko J, et al (2020) GoEmotions: a dataset of fine-grained emotions. In: Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Online, pp 4040–4054
Deng J, Ren F (2021) A survey of textual emotion recognition and its challenges. IEEE Transactions on Affective Computing
Devlin J, Chang MW, Lee K, et al (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 4171–4186
Dheeraj K, Ramakrishnudu T (2021) Negative emotions detection on online mental-health related patients texts using the deep learning with mha-bcnn model. Expert Syst Appl 182:115–265
Duong AQ, Ho NH, Pant S et al (2024) Residual relation-aware attention deep graph-recurrent model for emotion recognition in conversation. IEEE Access 12:2349–2360
Ekman P (1999) Basic emotions. Handbook Cognit Emotion 98(45–60):16
Elman JL (1991) Distributed representations, simple recurrent networks, and grammatical structure. Mach Learn 7(2):195–225
Eyben F, Wöllmer M, Graves A et al (2010) On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues. J Multimodal User Interfaces 3(1):7–19
Eyben F, Wöllmer M, Schuller B (2012) A multitask approach to continuous five-dimensional affect sensing in natural speech. ACM Trans Interact Intell Syst (TiiS) 2(1):1–29
Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378
Fu Y, Yuan S, Zhang C et al (2023) Emotion recognition in conversations: a survey focusing on context, speaker dependencies, and fusion methods. Electronics 12(22):4714
Gao T, Yao X, Chen D (2021) SimCSE: simple contrastive learning of sentence embeddings. In: Proceedings of the 2021 conference on empirical methods in natural language processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 6894–6910
Ghosal D, Majumder N, Gelbukh A, et al (2020) COSMIC: COmmonSense knowledge for eMotion identification in conversations. In: Findings of the association for computational linguistics: EMNLP 2020. Association for Computational Linguistics, Online, pp 2470–2481
Ghosal D, Majumder N, Poria S, et al (2019) DialogueGCN: a graph convolutional neural network for emotion recognition in conversation. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 154–164
Graves A, Fernández S, Schmidhuber J (2005) Bidirectional lstm networks for improved phoneme classification and recognition. International conference on artificial neural networks. Springer, pp 799–804
Grimm M, Kroschel K (2005) Evaluation of natural emotions using self assessment manikins. In: IEEE workshop on automatic speech recognition and understanding, 2005., IEEE, pp 381–385
Hantke S, Marchi E, Schuller B (2016) Introducing the weighted trustability evaluator for crowdsourcing exemplified by speaker likability classification. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16), pp 2156–2161
Hazarika D, Poria S, Mihalcea R, et al (2018a) Icon: interactive conversational memory network for multimodal emotion detection. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 2594–2604
Hazarika D, Poria S, Zadeh A, et al (2018b) Conversational memory network for emotion recognition in dyadic dialogue videos. In: Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, NIH Public Access, p 2122
Herzig J, Feigenblat G, Shmueli-Scheuer M, et al (2016) Classifying emotions in customer support dialogues in social media. In: Proceedings of the 17th annual meeting of the special interest group on discourse and dialogue. Association for Computational Linguistics, Los Angeles, pp 64–73
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hong S, Sun J, Li T (2024) Detectivenn: imitating human emotional reasoning with a recall-detect-predict framework for emotion recognition in conversations
Hou G, Shen Y, Zhang W et al (2023) Enhancing emotion recognition in conversation via multi-view feature alignment and memorization. In: Bouamor H, Pino J, Bali K (eds) Findings of the association for computational linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, pp 12651–12663
Hu D, Hou X, Wei L et al (2022) Mm-dfn: multimodal dynamic fusion network for emotion recognition in conversations. ICASSP 2022–2022 IEEE international conference on acoustics. IEEE, Speech and Signal Processing (ICASSP), pp 7037–7041
Hua Y, Huang Y, Huang S, et al (2024) Causal discovery inspired unsupervised domain adaptation for emotion-cause pair extraction. arXiv preprint arXiv:2406.15490
Hu J, Liu Y, Zhao J, et al (2021b) MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, pp 5666–5675
Hu D, Wei L, Huai X (2021a) DialogueCRN: Contextual reasoning networks for emotion recognition in conversations. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, pp 7042–7052
Hu G, Zhu Z, Hershcovich D, et al (2024) Unimeec: towards unified multimodal emotion recognition and emotion cause. arXiv preprint arXiv:2404.00403
Ishiwatari T, Yasuda Y, Miyazaki T, et al (2020) Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp 7360–7370
Jain S, Wallace BC (2019) Attention is not explanation. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, pp 3543–3556
Jeong D, Bak J (2023) Conversational emotion-cause pair extraction with guided mixture of experts. In: Vlachos A, Augenstein I (eds) Proceedings of the 17th conference of the European chapter of the association for computational linguistics. Association for Computational Linguistics, Dubrovnik, Croatia, pp 3288–3298
Jiang D, Wei R, Wen J et al (2023) Automl-emo: automatic knowledge selection using congruent effect for emotion identification in conversations. IEEE Trans Affect Comput 14(3):1845–1856
Jian Z, Li J, Yao J, et al (2024) Conversation clique-based model for emotion recognition in conversation. In: ICASSP 2024 - 2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 5865–5869
Jiao W, Lyu M, King I (2020) Real-time emotion recognition via attention gated hierarchical memory network. In: Proceedings of the AAAI conference on artificial intelligence, pp 8002–8009
Kang Y, Cho YS (2023) Directed acyclic graphs with prototypical networks for few-shot emotion recognition in conversation. IEEE Access 11:633–642
Kingma DP (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114
Kumar S, S R, Akhtar M, et al (2023) From multilingual complexity to emotional clarity: Leveraging commonsense to unveil emotions in code-mixed dialogues. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing. Association for Computational Linguistics, Singapore, pp 9638–9652
Lafferty JD, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’01, pp 282–289
Lai H, Chen H, Wu S (2020) Different contextual window sizes based rnns for multimodal emotion detection in interactive conversations. IEEE Access 8:516–526
Lee CC, Busso C, Lee S, et al (2009) Modeling mutual influence of interlocutor emotion states in dyadic spoken interactions. In: Tenth annual conference of the international speech communication association
Lee B, Choi YS (2021) Graph based network with contextualized representations of turns in dialogue. In: Proceedings of the 2021 conference on empirical methods in natural language processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, pp 443–455
Lei S, Dong G, Wang X, et al (2023) Instructerc: Reforming emotion recognition in conversation with a retrieval multi-task llms framework. arXiv preprint arXiv:2309.11911
Li J, Lin Z, Fu P et al (2021) Past, present, and future: conversational emotion recognition through structural modeling of psychological knowledge. Find Assoc Comput Linguist: EMNLP 2021:1204–1214
Li Z, Tang F, Zhao M et al (2022) EmoCaps: emotion capsule based model for conversational emotion recognition. Findings of the association computational linguistics: ACL 2022. Association for Computational Linguistics, Dublin, pp 1610–1618
Li J, Wang X, Lv G et al (2024) Ga2mif: graph and attention based two-stage multi-source information fusion for conversational emotion detection. IEEE Trans Affect Comput 15(1):130–143
Liang C, Xu J, Lin Y, et al (2022) S+PAGE: A speaker and position-aware graph neural network model for emotion recognition in conversation. In: Proceedings of the 2nd conference of the Asia-Pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing (Volume 1: Long Papers). Association for Computational Linguistics, Online only, pp 148–157
Li Q, Gkoumas D, Sordoni A, et al (2021b) Quantum-inspired neural network for conversational emotion recognition. In: Proceedings of the AAAI conference on artificial intelligence, pp 270–278
Li J, Ji D, Li F, et al (2020) Hitrans: A transformer-based context-and speaker-sensitive model for emotion detection in conversations. In: Proceedings of the 28th international conference on computational linguistics, pp 4190–4200
Li Y, Su H, Shen X, et al (2017) DailyDialog: a manually labelled multi-turn dialogue dataset. In: Proceedings of the eighth international joint conference on natural language processing (Volume 1: Long Papers). Asian Federation of Natural Language Processing, Taipei, Taiwan, pp 986–995
Liu Y, Ott M, Goyal N, et al (2019) Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692
Liu Y, Zhao J, Hu J, et al (2022) DialogueEIN: emotion interaction network for dialogue affective analysis. In: Calzolari N, Huang CR, Kim H, et al (eds) Proceedings of the 29th international conference on computational linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, pp 684–693
Li D, Wang Y, Funakoshi K, et al (2023) Joyful: joint modality fusion and graph contrastive learning for multimodal emotion recognition. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing. Association for Computational Linguistics, Singapore, pp 16,051–16,069
Li J, Wang X, Liu Y, et al (2024a) Cfn-esa: A cross-modal fusion network with emotion-shift awareness for dialogue emotion recognition. IEEE Transactions on Affective Computing pp 1–16
Li S, Yan H, Qiu X (2021c) Contrast and generation make bart a good dialogue emotion recognizer. arXiv preprint arXiv:2112.11202
Lu X, Zhao Y, Wu Y, et al (2020) An iterative emotion interaction network for emotion recognition in conversations. In: Proceedings of the 28th international conference on computational linguistics, pp 4078–4088
Majumder N, Poria S, Hazarika D, et al (2019) Dialoguernn: an attentive rnn for emotion detection in conversations. In: Proceedings of the AAAI conference on artificial intelligence, pp 6818–6825
Mao Y, Liu G, Wang X et al (2021) DialogueTRM: exploring multi-modal emotional dynamics in a conversation. Findings of the association for computational linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp 2694–2704
Ma H, Yu J, Wang F, et al (2024) From extraction to generation: multimodal emotion-cause pair generation in conversations. IEEE Transactions on Affective Computing pp 1–12
McHugh ML (2012) Interrater reliability: the kappa statistic. Biochemia medica 22(3):276–282
McKeown G, Valstar M, Cowie R et al (2011) The semaine database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans Affect Comput 3(1):5–17
Mehrabian A (1980) Basic dimensions for a general psychological theory: implications for personality, social, environmental, and developmental studies. Oelgeschlager, Gunn & Hain, Cambridge
Mikolov T, Chen K, Corrado G, et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Mikolov T, Grave E, Bojanowski P, et al (2018) Advances in pre-training distributed word representations. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan
Mohammad SM (2022) Ethics sheet for automatic emotion recognition and sentiment analysis. Comput Linguist 48(2):239–278
Olson D (1977) From utterance to text: the bias of language in speech and writing. Harv Educ Rev 47(3):257–281
Panksepp J (2004) Affective neuroscience: the foundations of human and animal emotions. Oxford University Press
Partaourides H, Papadamou K, Kourtellis N et al (2020) A self-attentive emotion recognition network. ICASSP 2020–2020 IEEE international conference on acoustics. IEEE, Speech and Signal Processing (ICASSP), pp 7199–7203
Pennington J, Socher R, Manning CD (2014) Glove: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Pereira P, Moniz H, Dias I, et al (2023a) Context-dependent embedding utterance representations for emotion recognition in conversations. In: Proceedings of the 13th workshop on computational approaches to subjectivity, sentiment, & social media analysis. Association for Computational Linguistics, pp 228–236
Pereira P, Ribeiro R, Coheur L, et al (2023b) Fuzzy fingerprinting transformer language-models for emotion recognition in conversations. In: IEEE International conference on fuzzy systems (FUZZ-IEEE), IEEE
Peters ME, Neumann M, Iyyer M, et al (2018) Deep contextualized word representations. In: Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: human language technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, pp 2227–2237
Picard RW (2000) Affective computing. MIT press
Plutchik R (1982) A psychoevolutionary theory of emotions. Soc Sci Inf 21(4–5):529–553
Poria S, Majumder N, Mihalcea R et al (2019) Emotion recognition in conversation: research challenges, datasets, and recent advances. IEEE Access 7:100,943-100,953
Poria S, Majumder N, Hazarika D et al (2021) Recognizing emotion cause in conversations. Cogn Comput 13(5):1317–1332
Poria S, Hazarika D, Majumder N et al (2023) Beneath the tip of the iceberg: current challenges and new directions in sentiment analysis research. IEEE Trans Affect Comput 14(1):108–132
Poria S, Cambria E, Hazarika D, et al (2017) Context-dependent sentiment analysis in user-generated videos. In: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pp 873–883
Poria S, Hazarika D, Majumder N, et al (2019a) MELD: A multimodal multi-party dataset for emotion recognition in conversations. In: Proceedings of the 57th annual meeting of the association for computational linguistics. Association for Computational Linguistics, Florence, Italy, pp 527–536
Quan X, Wu S, Chen J et al (2024) Multi-party conversation modeling for emotion recognition. IEEE Trans Affect Comput 15(3):751–768
Radford A, Wu J, Child R et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
Radford A, Narasimhan K, Salimans T, et al (2018) Improving language understanding by generative pre-training. OpenAI blog
Ringeval F, Sonderegger A, Sauer J, et al (2013) Introducing the recola multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG), IEEE, pp 1–8
Rizos G, Schuller BW (2020) Average jane, where art thou?–recent avenues in efficient machine learning under subjectivity uncertainty. In: International conference on information processing and management of uncertainty in knowledge-based systems, Springer, pp 42–55
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
Russell JA (1980) A circumplex model of affect. J Pers Soc Psychol 39(6):1161
Shen W, Chen J, Quan X, et al (2021a) Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition. In: Proceedings of the AAAI conference on artificial intelligence, pp 789–797
Sheng D, Wang D, Shen Y, et al (2020) Summarize before aggregate: a global-to-local heterogeneous graph inference network for conversational emotion recognition. In: Proceedings of the 28th international conference on computational linguistics, pp 4153–4163
Shen W, Wu S, Yang Y, et al (2021b) Directed acyclic graph network for conversational emotion recognition. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, pp 1551–1560
Shi Y, Huang Z, Feng S, et al (2021) Masked label prediction: unified message passing model for semi-supervised classification. In: Proceedings of the thirtieth international joint conference on artificial intelligence, IJCAI-21. International Joint Conferences on Artificial Intelligence Organization, pp 1548–1554, main Track
Song X, Huang L, Xue H, et al (2022) Supervised prototypical contrastive learning for emotion recognition in conversation. In: Proceedings of the 2022 conference on empirical methods in natural language processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, pp 5197–5206
Sun Y, Yu N, Fu G (2021) A discourse-aware graph neural network for emotion recognition in multi-party conversation. Find Assoc Comput Linguist: EMNLP 2021:2949–2958
Su Y, Wei Y, Nie W, et al (2024) Dynamic causal disentanglement model for dialogue emotion detection. In: IEEE transactions on affective computing, pp 1–14
Thost V, Chen J (2021) Directed acyclic graph neural networks. arXiv preprint arXiv:2101.07965
Touvron H, Lavril T, Izacard G, et al (2023) Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971
Tu G, Liang B, Jiang D et al (2023) Sentiment- emotion- and context-guided knowledge selection framework for emotion recognition in conversations. IEEE Trans Affect Comput 14(3):1803–1816
Tu G, Liang B, Mao R et al (2023) Context or knowledge is not always necessary: a contrastive learning framework for emotion recognition in conversations. Findings of the association for computational linguistics: ACL 2023. Association for Computational Linguistics, Toronto, pp 14054–14067
Tu G, Liang B, Qin B et al (2023) An empirical study on multiple knowledge from ChatGPT for emotion recognition in conversations. In: Bouamor H, Pino J, Bali K (eds) Findings of the association for computational linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, pp 12160–12173
Tu G, Jing R, Liang B, et al (2023a) A training-free debiasing framework with counterfactual reasoning for conversational emotion detection. In: Bouamor H, Pino J, Bali K (eds) Proceedings of the 2023 conference on empirical methods in natural language processing. Association for Computational Linguistics, Singapore, pp 15,639–15,650
Uma AN, Fornaciari T, Hovy D et al (2021) Learning from disagreement: a survey. J Artif Intell Res 72:1385–1470
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. In: Advances in neural information processing systems, vol 30. Curran Associates, Inc
Wang Y, Yao Q, Kwok JT et al (2020) Generalizing from a few examples: a survey on few-shot learning. ACM Comput Surv (CSUR) 53(3):1–34
Wang F, Yu J, Xia R (2023) Generative emotion cause triplet extraction in conversations with commonsense knowledge. In: Bouamor H, Pino J, Bali K (eds) Findings of the association for computational linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, pp 3952–3963
Wang Y, Zhang J, Ma J, et al (2020b) Contextualized emotion recognition in conversation as sequence tagging. In: Proceedings of the 21th annual meeting of the special interest group on discourse and dialogue, pp 186–195
Wei J, Hu G, Tuan LA, et al (2023) Multi-scale receptive field graph model for emotion recognition in conversations. In: ICASSP 2023 - 2023 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1–5
Werbos PJ (1990) Backpropagation through time: what it does and how to do it. Proc IEEE 78(10):1550–1560
Wiegreffe S, Pinter Y (2019) Attention is not not explanation. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 11–20
Willcox G (1982) The feeling wheel: a tool for expanding awareness of emotions and increasing spontaneity and intimacy. Trans Anal J 12(4):274–276
Wöllmer M, Metallinou A, Eyben F, et al (2010) Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling. In: Proc. INTERSPEECH 2010, Makuhari, Japan, pp 2362–2365
Wu Z, Pan S, Chen F et al (2020) A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32(1):4–24
Wu X, Feng C, Xu M et al (2023) Dialoguepcn: perception and cognition network for emotion recognition in conversations. IEEE Access 11:141,251-141,260
Xie Y, Yang K, Sun CJ et al (2021) Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations. Find Assoc Comput Linguist: EMNLP 2021:2879–2889
Xing S, Mai S, Hu H (2022) Adapted dynamic memory network for emotion recognition in conversation. IEEE Trans Affect Comput 13(3):1426–1439
Xu Y, Yang M (2024) Mcm-csd: Multi-granularity context modeling with contrastive speaker detection for emotion recognition in real-time conversation. In: ICASSP 2024 - 2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 956–960
Yang K, Zhang T, Alhuzali H et al (2023) Cluster-level contrastive learning for emotion recognition in conversations. IEEE Trans Affect Comput 14(4):3269–3280
Yang K, Zhang T, Ananiadou S (2024) Disentangled variational autoencoder for emotion recognition in conversations. IEEE Trans Affect Comput 15(2):508–518
Yang Z, Li X, Cheng Y et al (2024) Emotion recognition in conversation based on a dynamic complementary graph convolutional network. IEEE Trans Affect Comput 15(3):1567–1579
Yang Z, Dai Z, Yang Y, et al (2019) Xlnet: Generalized autoregressive pretraining for language understanding. Adv Neural Inform Process Syst 32
Yang L, Shen Y, Mao Y, et al (2022) Hybrid curriculum learning for emotion recognition in conversation. In: Proceedings of the AAAI conference on artificial intelligence, pp 595–603
Yao B, Shi W (2024) Speaker-centric multimodal fusion networks for emotion recognition in conversations. In: ICASSP 2024 - 2024 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 8441–8445
Yu F, Guo J, Wu Z et al (2024) Emotion-anchored contrastive learning framework for emotion recognition in conversation. In: Duh K, Gomez H, Bethard S (eds) Findings of the association for computational linguistics: NAACL 2024. Association for Computational Linguistics, Mexico City, pp 4521–4534
Yun T, Lim H, Lee J, et al (2024) TelME: Teacher-leading multimodal fusion network for emotion recognition in conversation. In: Duh K, Gomez H, Bethard S (eds) Proceedings of the 2024 conference of the North American chapter of the association for computational linguistics: human language technologies (Volume 1: Long Papers). Association for Computational Linguistics, Mexico City, Mexico, pp 82–95
Zachar P, Ellis RD (2012) Categorical versus dimensional models of affect: a seminar on the theories of Panksepp and Russell, vol 7. John Benjamins Publishing
Zahiri SM, Choi JD (2018) Emotion detection on tv show transcripts with sequence-based convolutional neural networks. In: 1st Workshop on affective content analysis
Zhang X, Cui W, Hu B et al (2024) A multi-level alignment and cross-modal unified semantic graph refinement network for conversational emotion recognition. IEEE Trans Affect Comput 15(3):1553–1566
Zhang D, Chen F, Chen X (2023a) DualGATs: Dual graph attention networks for emotion recognition in conversations. In: Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, pp 7395–7408
Zhang D, Chen X, Xu S, et al (2020) Knowledge aware emotion recognition in textual conversations via multi-task incremental transformer. In: Proceedings of the 28th international conference on computational linguistics, pp 4429–4440
Zhang X, Li Y (2023) A cross-modality context fusion and semantic refinement network for emotion recognition in conversation. In: Proceedings of the 61st annual meeting of the association for computational linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, pp 13,099–13,110
Zhang Y, Wang M, Tiwari P, et al (2023c) Dialoguellm: context and emotion knowledge-tuned llama models for emotion recognition in conversations. arXiv preprint arXiv:2310.11374
Zhang M, Zhou X, Chen W, et al (2023b) Emotion recognition in conversation from variable-length context. In: ICASSP 2023 - 2023 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 1–5
Zhao W, Zhao Y, Qin B (2022) MuCDN: Mutual conversational detachment network for emotion recognition in multi-party conversations. In: Calzolari N, Huang CR, Kim H, et al (eds) Proceedings of the 29th international conference on computational linguistics. International Committee on Computational Linguistics, Gyeongju, Republic of Korea, pp 7020–7030
Zha X, Zhao H, Zhang Z (2024) Esihgnn: Event-state interactions infused heterogeneous graph neural network for conversational emotion recognition. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 136–140
Zhong P, Wang D, Miao C (2019) Knowledge-enriched transformer for emotion detection in textual conversations. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, pp 165–176
Zhou J, Cui G, Hu S et al (2020) Graph neural networks: a review of methods and applications. AI Open 1:57–81
Zhu L, Pergola G, Gui L, et al (2021) Topic-driven and knowledge-aware transformer for dialogue emotion detection. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, pp 1571–1582
Acknowledgements
We would like to thank the anonymous reviewers for their feedback. This work was supported by Fundação para a Ciência e a Tecnologia (FCT), through Portuguese national funds, Ref. UIDB/50021/2020, DOI: 10.54499/UIDB/50021/2020 and Ref. UI/BD/154561/2022 and the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Responsible.AI). Deep Learning TikZ images were based on contributions from Mark Wibrow, user121799, J. Leon V on StackExchange, and Renato Negrinho on GitHub.
Contributions
P.P. conceived and designed the survey, performed data collection and interpretation and wrote the article. P.P., H.M. and J.P.C. made a critical revision of the article and approved it for publication.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.