Emoji-powered Sentiment and Emotion Detection from Software Developers' Communication Data
ZHENPENG CHEN, YANBIN CAO, and HUIHAN YAO, Key Lab of High-Confidence Software
Technology, MoE (Peking University), China
XUAN LU, University of Michigan, USA
XIN PENG, Fudan University, China
HONG MEI and XUANZHE LIU, Key Lab of High-Confidence Software Technology, MoE (Peking
University), China
Sentiment and emotion detection from the textual communication records of developers has various application
scenarios in software engineering (SE). However, commonly used off-the-shelf sentiment/emotion detection
tools cannot obtain reliable results in SE tasks, and misunderstanding of technical knowledge has been demonstrated
to be the main reason. Researchers have therefore started to manually create labeled SE-related datasets and customize
SE-specific methods. However, the scarce labeled data can cover only a very limited lexicon and set of expressions. In
this article, we employ emojis as an instrument to address this problem. Different from manual labels that are
provided by annotators, emojis are self-reported labels provided by the authors themselves to intentionally
convey affective states and thus are suitable indications of sentiment and emotion in texts. Since emojis have
been widely adopted in online communication, a large amount of emoji-labeled texts can be easily accessed
to help tackle the scarcity of the manually labeled data. Specifically, we leverage Tweets and GitHub posts
containing emojis to learn representations of SE-related texts through emoji prediction. By predicting the emojis
contained in each text, texts that tend to surround the same emoji are represented with similar vectors, which
transfers the sentiment knowledge contained in emoji usage to the representations of texts. Then we leverage
the sentiment-aware representations as well as manually labeled data to learn the final sentiment/emotion
classifier via transfer learning. Compared to existing approaches, our approach can achieve significant im-
provement on representative benchmark datasets, with an average increase of 0.036 and 0.049 in macro-F1
in sentiment and emotion detection, respectively. Further investigations reveal that the large-scale Tweets
make a key contribution to the power of our approach. This finding informs future research not to unilaterally
pursue domain-specific resources but to transfer knowledge from the open domain through
ubiquitous signals such as emojis. Finally, we present the open challenges of sentiment and emotion detection
in SE through a qualitative analysis of texts misclassified by our approach.
This work was supported by the National Key R&D Program of China under the grant number 2018YFB1004800, the Beijing
Outstanding Young Scientist Program under the grant number BJJWZYJH01201910001004, the National Natural Science
Foundation of China under grant numbers J1924032 and 61725201, and the Key Laboratory of Intelligent Passenger Service
of Civil Aviation.
Authors’ addresses: Z. Chen, Y. Cao, H. Yao, H. Mei, and X. Liu (corresponding author), Key Lab of High-Confidence
Software Technology, MoE (Peking University), Beijing, China; emails: {czp, caoyanbin, yaohuihan, meih, xzl}@pku.edu.cn;
X. Lu, University of Michigan, Ann Arbor, MI; email: [email protected]; X. Peng, Fudan University, Shanghai, China;
email: [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2020 Association for Computing Machinery.
1049-331X/2020/12-ART18 $15.00
https://doi.org/10.1145/3424308
CCS Concepts: • Information systems → Sentiment analysis; • Software and its engineering → Col-
laboration in software development;
Additional Key Words and Phrases: Emoji, sentiment, emotion, software engineering
ACM Reference format:
Zhenpeng Chen, Yanbin Cao, Huihan Yao, Xuan Lu, Xin Peng, Hong Mei, and Xuanzhe Liu. 2020. Emoji-
powered Sentiment and Emotion Detection from Software Developers’ Communication Data. ACM Trans.
Softw. Eng. Methodol. 30, 2, Article 18 (December 2020), 48 pages.
https://doi.org/10.1145/3424308
1 INTRODUCTION
Software development is a highly collaborative activity that is susceptible to the affective states
of developers [40, 55, 75, 81]. Negative affective states may make developers underperform in
software projects [33, 81], while positive ones are demonstrated to be correlated with increased
productivity [37]. Therefore, awareness of developers’ affective states (including sentiment and
emotion) is crucial for stakeholders involved in the software development lifecycle. Sentiment is
usually considered to have three polarities, i.e., positive, negative, and neutral [62], while emotion
refers to love, sadness, anger, joy, surprise, fear, and so on [92]. Many approaches, such as sur-
veys [58], biometric measurements [35], and text analysis [12], have been developed for detecting
sentiment or emotion of developers.
Among these approaches, text-based detection has become increasingly popular [61, 104], due to its
convenience, low cost, and technical readiness. Many off-the-shelf sentiment/emotion detection
tools not designed for SE-related texts have been applied in SE studies, but a recent study has
indicated that these tools cannot produce reliable results in some SE tasks [54]. Furthermore, Islam
and Zibran [48] applied an off-the-shelf tool SentiStrength [100] to an SE-related dataset and found
that misunderstanding of domain-specific meanings of words (namely technical knowledge in the
rest of this article) accounts for the most misclassifications. Such a finding has inspired a series
of research efforts in recent years to create SE-related datasets and develop customized sentiment
and emotion detection methods [1, 10, 12, 50, 61].
However, since these customized methods are mainly trained on scarce labeled data (only thousands
of samples), they inevitably lack knowledge of the lexicon and expressions not covered by
that data. Given the large volume of the English vocabulary, these missing expressions
are indeed non-trivial. To tackle this problem, a straightforward solution is to annotate abundant
texts with sentiment/emotion labels. However, manual annotation on a large scale is quite time-
consuming and error prone. Instead, recent work in natural language processing (NLP) employed
emojis as indications of sentiment/emotion in texts [30]. Different from the manual labels that
are provided by annotators based on their perception, emojis are self-reported labels provided by
the authors themselves to intentionally convey affective states and thus are suitable indications of
sentiment and emotion in written communication. As emojis have become an emerging ubiqui-
tous language used worldwide [65], a large amount of texts containing emojis in social media can
be easily accessed to tackle the scarcity of manually labeled texts [30]. Inspired by the previous
work [30], we aim to incorporate emoji usage data into sentiment and emotion detection in SE.
In fact, emojis not only pervasively exist in social media [64] but also are adopted in developers’
communication to express sentiment [46, 66]. For example, in the GitHub post “thanks for writing
this great plugin, ”1 the emoji “ ” can be considered an indication of positive sentiment.
1 https://github.com/MikaAK/s3-plugin-webpack/issues/65.
In this study, we not only employ posts with emojis from Twitter (a typical social media plat-
form) but also consider the posts containing emojis from GitHub (a typical software development
platform). Both kinds of posts are used to complement the scarce manually labeled data. Here, the
core insight is as follows: GitHub posts can provide more technical knowledge beyond the limited la-
beled data, while posts from Twitter can help learn more general sentiment knowledge that is shared
in both technical and non-technical communication.
Specifically, we propose SEntiMoji, an emoji-powered transfer learning (refer to Section 2.4) [36]
approach for sentiment and emotion detection in SE. First, through an emoji prediction task, vector
representations of texts are derived based on modeling how emojis are used alongside texts on
Twitter and GitHub. By predicting emojis contained in each text, texts that tend to surround the
same emoji are represented similarly. Through such a process, sentiment knowledge contained in
emoji usage is transferred to the representations of texts and thus facilitates the sentiment/emotion
classification of these texts. Then, these informative representations are used to predict the manual
labels and learn the final sentiment/emotion classifier.
To evaluate the performance of SEntiMoji, we compare it against state-of-the-art sentiment
and emotion detection methods in SE on representative benchmark datasets. Results show that
SEntiMoji can outperform existing sentiment and emotion detection methods with an average
increase of 0.036 and 0.049 in macro-F1, respectively. Further investigations reveal that the GitHub
posts do not make a key contribution to the power of SEntiMoji. The combination of large-scale
Tweets and a small amount of manually labeled data can achieve satisfactory performance on
most tasks. Finally, by manually analyzing the samples misclassified by SEntiMoji, we distill seven
and eight common error causes for sentiment and emotion detection, respectively. Some causes,
such as implicit sentiment/emotion and complex context information, need special attention from
researchers.
The main contributions of this article are as follows:
• We propose an emoji-powered transfer learning approach for sentiment and emotion de-
tection in SE, which utilizes Tweets to capture general sentiment knowledge and GitHub
posts as well as manually labeled data to incorporate technical knowledge.
• We demonstrate the effectiveness of SEntiMoji on representative benchmark datasets in SE.
Results show that SEntiMoji can significantly improve the state-of-the-art performance on
almost all the datasets.
• We investigate the underlying reasons behind the good performance of SEntiMoji by rig-
orous comparative experiments and provide future research with some insightful implica-
tions.
• We manually identify the error causes of SEntiMoji on sentiment and emotion detection,
and suggest immediate actions and future research directions based on the findings.
• We have released the data, scripts, trained models, and experimental results used in this
study on https://github.com/SEntiMoji/SEntiMoji to facilitate replications or other types of
future work.
The rest of this article is organized as follows. Section 2 describes the preliminaries of this
work. Section 3 presents the workflow of SEntiMoji in detail. Section 4 describes baseline meth-
ods, benchmark datasets, evaluation metrics, and other experimental settings used for evaluation.
Section 5 answers four research questions based on the evaluation results. Section 6 summarizes
the lessons learned from this study and the implications. Section 7 discusses the threats that could
affect the validity of this study. Section 8 summarizes the literature related to this work, followed
by concluding remarks in Section 9.
Part of the results in this article have been reported in our previous work at ESEC/FSE 2019 [14].
For those who have read the conference version, this extension mainly includes a new application
of SEntiMoji to emotion detection (see Section 4, Section 5.2, and Section 5.3). The qualitative anal-
ysis for identifying error causes of SEntiMoji is also completely new (see Section 5.4). Moreover,
we include a more comprehensive preliminary section (see Section 2) and related work section
(see Section 8). More lessons can be learned from the results of this article (see Section 6).
2 PRELIMINARIES
We start by clarifying the notions of "sentiment" and "emotion" used in this study. Then, we briefly introduce
some techniques used in our approach, including word embedding, Long Short-Term Memory
(LSTM), and transfer learning.
corpus: A corpus (plural: corpora) refers to a large and structured collection of written or spoken texts used for linguistic analysis.
corpus. Compared to GloVe, the skip-gram algorithm has been demonstrated to be more robust and to consume
fewer system resources [59]. In practice, the skip-gram algorithm is widely adopted in SE tasks, such
as creating SE-specific word embeddings [28], enhancing software traceability [38], and localizing
bugs [114].
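For illustration, the snippet below trains skip-gram embeddings on a toy corpus of tokenized developer posts using gensim; the corpus and hyper-parameter values are assumptions made for this example, not the settings used in the cited studies.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized developer posts (illustrative only).
sentences = [
    ["thanks", "for", "writing", "this", "great", "plugin"],
    ["the", "build", "fails", "with", "a", "nullpointerexception"],
]

# sg=1 selects the skip-gram algorithm (sg=0 would select CBOW instead).
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["plugin"]  # 100-dimensional embedding of the word "plugin"
```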
words to the classification result. Such a combination has been demonstrated to be effective in
previous studies [105, 113, 117]. In this study, we also employ LSTM with attention, rather than
more sophisticated language models (such as ELMo [86] or BERT [23]), which have been demonstrated to
require exceptionally large computational resources [98].
3 METHODOLOGY
Since sentiment and emotion detection are both relatively new tasks in SE and labeled data for
them remain insufficient, we employ transfer learning to tackle the two problems. Specifically,
we use emojis as indications of sentiment/emotion and employ emoji prediction as the source
task. On one hand, emojis are able to express various emotions [45, 66]. The rich emotional infor-
mation contained in emoji usage makes emoji prediction a suitable source task of sentiment and
emotion detection. On the other hand, emojis are widely used in social media [64] and developers' communication
[46, 66], and thus can be easily collected. The large-scale emoji usage data can complement
the scarce manually labeled data for the two target tasks.
We propose SEntiMoji, an emoji-powered transfer learning approach for sentiment and emo-
tion detection in SE. First, we learn sentiment- and emotion-aware representations of texts by
using emoji prediction as an instrument. More specifically, we use emojis as indications of sen-
timent/emotion and learn vector representations of texts by predicting which emojis are used
in them. Texts that tend to surround the same emoji are represented as similar vectors. Then
these informative representations are used as features to predict the true sentiment/emotion labels.
Through these representations, sentiment knowledge contained in emoji usage data is transferred
from the emoji prediction task into the sentiment/emotion classifiers.
Since Felbo et al. [30] have released such a representation model (i.e., DeepMoji2) that is pre-trained
on 56.6 million Tweets, we directly build SEntiMoji upon the off-the-shelf DeepMoji
in a transfer learning way. Specifically, our approach takes two stages: (1) We fine-tune DeepMoji
using GitHub posts that contain emojis to incorporate technical knowledge. The fine-tuned model
is still a representation model based on emoji prediction, and we call it DeepMoji-SE. (2) We use
DeepMoji-SE to obtain vector representations of labeled texts, and then use these representations
as features to train the sentiment/emotion classifier. We call the final sentiment/emotion classifier
SEntiMoji. Next, we describe the existing DeepMoji model and the two-stage learning process in
detail.
2 https://github.com/bfelbo/deepmoji.
$h_i = \overrightarrow{h_i} \,\|\, \overleftarrow{h_i}$,   (3)
where $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ denote the forward and backward states of $d_i$, respectively. To enable unimpeded
information flow through the whole model, the outputs of the two LSTM layers and the word
embedding layer are concatenated by the skip-connection algorithm [43] and then fed as input into the
attention layer. Specifically, each word of the input sample is further represented as $c_i$:

$c_i = d_i \,\|\, h_i^1 \,\|\, h_i^2$,   (4)

where $d_i$, $h_i^1$, and $h_i^2$ represent the encoded vectors of the $i$th word extracted from the word
embedding layer and the first and second bi-directional LSTM layers, respectively.
Attention Layer. Since not all words contribute equally to the overall sentiment/emotion of the
sample, the model employs the attention mechanism to determine the importance of each word.
The attention score (i.e., the importance) of the $i$th word is computed as:

$\alpha_i = \dfrac{\exp(W c_i)}{\sum_{j=1}^{L} \exp(W c_j)}$,   (5)

where $W$ is the weight matrix of the attention layer and $L$ is the number of words in the sample. Then the sample can be represented as the
weighted sum of all the words in it, denoted as $V$.
Softmax Layer. The final representation $V$ is fed into the softmax layer to output a
64-dimensional probability vector, each element of which denotes the probability that the sample
contains one specific emoji.
Taking the real emoji contained in each sample in $E_T$ as ground truth, DeepMoji learns param-
eters by minimizing the cross entropy between the output probability vectors and the one-hot
representations of the emojis actually contained in the training samples. Through such a learning
process, plain texts that surround the same emoji can be represented similarly.
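For readers who prefer code to equations, the following PyTorch sketch mirrors the architecture described above (word embedding layer, two bi-directional LSTM layers, skip-connection concatenation, word-level attention, and a 64-way softmax). It is an illustrative re-implementation with assumed hyper-parameter values, not the released DeepMoji code.

```python
import torch
import torch.nn as nn

class EmojiPredictor(nn.Module):
    """DeepMoji-style network sketch (dimensions are illustrative)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, n_emojis=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        concat_dim = embed_dim + 4 * hidden_dim              # d_i || h_i^1 || h_i^2, Eq. (4)
        self.attn_w = nn.Linear(concat_dim, 1, bias=False)   # W in Eq. (5)
        self.out = nn.Linear(concat_dim, n_emojis)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        d = self.embed(tokens)                    # word embedding layer
        h1, _ = self.lstm1(d)                     # first bi-directional LSTM layer
        h2, _ = self.lstm2(h1)                    # second bi-directional LSTM layer
        c = torch.cat([d, h1, h2], dim=-1)        # skip-connection concatenation, Eq. (4)
        alpha = torch.softmax(self.attn_w(c).squeeze(-1), dim=1)  # attention scores, Eq. (5)
        v = (alpha.unsqueeze(-1) * c).sum(dim=1)  # weighted sum V
        return self.out(v)                        # 64-dimensional emoji logits

# Training minimizes nn.CrossEntropyLoss() between these logits and the index
# of the emoji actually contained in each sample.
```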
3 https://www.nltk.org/.
Table 1. The Number of Emoji Samples from Twitter and GitHub Associated with Each Emoji (in Millions)
Note: Since the emoji samples used for DeepMoji (i.e., those from Twitter) have not been released, their numbers
are directly obtained from Felbo et al. [30].
Then we identify URLs, email addresses, references, code snippets, numbers, and mentions from
GitHub posts through the Markdown parser4 and replace them with specific tokens, to prevent
their concrete contents from influencing the classification of sentiment and emotion. For exam-
ple, we replace any code snippet with “[code].” In addition, we use regular expressions to iden-
tify words with 2 or more repeated characters and replace them with their basic forms. From the
pre-processed GitHub data, we extract 809,918 posts containing emojis and then create separate
samples for each unique emoji in each post. Since we need to fine-tune DeepMoji, we then select
only the samples containing any of the 64 emojis predicted by DeepMoji. Finally, 1,058,413
emoji samples remain. For each of the 64 selected emojis, we present the number of correspond-
ing samples extracted from Twitter (i.e., those used for pre-training DeepMoji) and GitHub (i.e.,
those generated in this step) in Table 1.
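A rough sketch of this normalization is given below. Note that, as stated above, the structural elements are actually identified with a Markdown parser; the regular expressions and replacement tokens here are simplifying assumptions for illustration only.

```python
import re

def normalize_post(text):
    """Approximate normalization of a GitHub post (token names are assumed)."""
    fence = "`" * 3
    text = re.sub(fence + r".*?" + fence, " [code] ", text, flags=re.S)  # fenced code snippets
    text = re.sub(r"https?://\S+", " [url] ", text)                      # URLs
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", " [email] ", text)         # email addresses
    text = re.sub(r"@\w+", " [mention] ", text)                          # user mentions
    text = re.sub(r"\b\d+(?:\.\d+)?\b", " [number] ", text)              # numbers
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)                          # cooool -> cool
    return re.sub(r"\s+", " ", text).strip()

print(normalize_post("Thanks @dev, it woooorks!!! See https://example.com"))
```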
Step 2: Expanding the vocabulary list. As many SE-related words are rarely used in Tweets,
they are not included in the vocabulary of DeepMoji. As a result, these words would be treated as
unknown words by DeepMoji, and technical knowledge contained in them would be lost in the
final representation model. To tackle this problem, we extend the initial vocabulary list of Deep-
Moji. From the emoji samples created in stage 1, we identify a total of 243,986 words that are not
included in the word embeddings of DeepMoji as the out-of-vocabulary (OOV) words. Adding all
the OOV words into the vocabulary list is neither economical, as it limits model applicability in
computation- or memory-constrained scenarios, nor necessary, as many words may contribute little
to the end task [29, 115]. Instead, we sort the OOV words in the emoji samples by frequency
in descending order and add the 3,000 most frequent ones, whose occurrences account for 68.89%
of those of all the OOV words, into the vocabulary list. We observe that each of the other OOV words
appears no more than 15 times. As a result, there is not enough context information to learn their
semantics and embeddings, and they may not play an important role in the end task [103]. Therefore,
we do not take them into consideration. Finally, the representations of the top 3,000 OOV words are
randomly initialized in the word embedding layer and then tuned during the fine-tuning process
described below.
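The selection of these 3,000 OOV words can be summarized by the following sketch; the function and variable names are assumptions made for illustration.

```python
from collections import Counter

def select_oov_words(tokenized_posts, pretrained_vocab, top_k=3000):
    """Count words outside the pre-trained vocabulary and keep the top_k most
    frequent ones; their embeddings are then randomly initialized and tuned."""
    oov_counts = Counter(
        token
        for post in tokenized_posts
        for token in post
        if token not in pretrained_vocab
    )
    return [word for word, _ in oov_counts.most_common(top_k)]
```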
Step 3: Fine-tuning DeepMoji. In this step, we use the emoji samples extracted in step 1 (de-
noted as "$E_G$") to fine-tune DeepMoji through the chain-thaw approach implemented by Felbo
et al. [30]. The chain-thaw approach fine-tunes a single layer of the network at a time while keeping
the other layers frozen, which has been demonstrated to reduce the risk of over-fitting [30].
Each time we fine-tune, we use the plain texts in $E_G$ as the input and the actual emojis in $E_G$ as the ground
truth to minimize the output error of the model. Following the procedure of the chain-thaw approach,
we first fine-tune the softmax layer and then fine-tune each layer individually starting from the
4 https://python-markdown.github.io/extensions/api/.
first layer (i.e., the word embedding layer) of the network. Finally, we fine-tune the entire model
with all layers unfrozen. We refer to the fine-tuned DeepMoji as DeepMoji-SE.
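The schedule can be sketched as follows; the layer ordering and the caller-supplied `train_one_epoch` helper are assumptions for illustration, and the released implementation by Felbo et al. [30] remains the reference for the exact procedure.

```python
def chain_thaw(model, layers, train_one_epoch):
    """Chain-thaw fine-tuning sketch.

    `layers` is the ordered list [embedding, lstm1, lstm2, attention, softmax];
    `train_one_epoch(model)` is assumed to update only the parameters that
    currently have requires_grad=True.
    """
    def freeze_all():
        for p in model.parameters():
            p.requires_grad = False

    # 1) Fine-tune the (replaced) softmax layer alone.
    freeze_all()
    for p in layers[-1].parameters():
        p.requires_grad = True
    train_one_epoch(model)

    # 2) Fine-tune each layer individually, starting from the word embedding layer.
    for layer in layers[:-1]:
        freeze_all()
        for p in layer.parameters():
            p.requires_grad = True
        train_one_epoch(model)

    # 3) Finally, fine-tune the entire model with all layers unfrozen.
    for p in model.parameters():
        p.requires_grad = True
    train_one_epoch(model)
```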
4 EVALUATION
We evaluate the performance of SEntiMoji through a benchmark study, where SEntiMoji and base-
line methods tackle the same datasets for comparison. The evaluation can be considered as a partial
replication of the benchmark study performed by Novielli et al. [79].
5 http://sentistrength.wlv.ac.uk/.
6 http://laser.cs.uno.edu/Projects/Projects.html.
SentiCR7 [1] is a supervised sentiment detection method originally proposed for code reviews.
It computes TF-IDF [3] scores over bag-of-words features and uses traditional machine learning algo-
rithms to train the sentiment classifier.
Senti4SD8 [10] is a supervised sentiment detection method proposed for developer-generated
texts. It leverages three kinds of features, including lexicon-based features (based on the sentimen-
tal word list of SentiStrength), keyword-based features (such as uni- and bi-grams), and semantic
features (based on the word embeddings trained on large-scale posts collected from Stack Over-
flow). Finally, it uses Support Vector Machine (SVM) [99] to train the sentiment classifier.
4.1.2 Existing Emotion Detection Methods. For emotion detection, we employ four existing
methods for comparison, including DEVA, EmoTxt, MarValous, and ESEM-E. All of them are specif-
ically proposed for emotion detection in SE. We then describe these methods briefly:
DEVA9 [50] is an SE-specific lexicon-based emotion detection tool. It is designed for detect-
ing the presence of four emotional states (i.e., excitement, stress, depression, and relaxation) from
technical texts. To this end, it contains two arousal dictionaries (i.e., a general-purpose dictionary
named ANEW [106] and a domain-specific dictionary named SEA [70]) and a valence dictionary
(i.e., the domain-specific dictionary of SentiStrength-SE). In addition, to increase accuracy, it in-
cludes all the heuristics implemented in SentiStrength-SE.
EmoTxt10 [12] is a supervised learning method proposed for recognizing emotions from tech-
nical texts. It leverages four kinds of features, including uni- and bi-grams features (using TF-IDF
schema), emotion lexicon features (based on WordNet Affect [97]), politeness features (measured
by the tool developed by Danescu et al. [20]), and mood features (measured by the tool developed
by De Smedt et al. [94]). Finally, it uses SVM to train the emotion classifier.
MarValous [47] is a supervised learning method proposed for detecting the presence of excite-
ment, stress, depression, and relaxation from technical texts. It leverages seven kinds of features,
including n-grams, emoticons, interjections, exclamation marks, and so on. Some of the features, such
as emoticons and interjections, are specifically defined for the four emotions that it focuses on.
Finally, it uses traditional machine learning algorithms to train the emotion classifier.
ESEM-E [74] is a supervised learning method proposed for emotion detection in SE. It leverages
uni- and bi-grams as features and uses machine learning algorithms to train the emotion classifier.
4.1.3 Variants of SEntiMoji. SEntiMoji is developed based on three kinds of data, i.e., Twitter
data used for training DeepMoji, GitHub data used for training DeepMoji-SE, and manually labeled
data used for training the final SEntiMoji. To measure the contribution of the three kinds of data to
the overall performance of SEntiMoji, we employ the following seven variants as baseline methods.
SEntiMoji-G1 is a variant of SEntiMoji and shares the same network architecture with
SEntiMoji. By comparison, SEntiMoji-G1 leverages the GitHub data used by SEntiMoji to train
DeepMoji-SE from scratch, rather than to fine-tune the pre-trained DeepMoji. Moreover, to make
a fair comparison, SEntiMoji-G1 and SEntiMoji have the same selection of the 64 emojis and
both employ the upsampling technique to ensure data balance. Then, following the procedures in
Stage 2, SEntiMoji-G1 uses the new DeepMoji-SE model to train the sentiment/emotion classifier.
Compared to SEntiMoji, SEntiMoji-G1 is trained purely based on GitHub data and manually
labeled data, without the incorporation of Twitter data.
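As a side note, the upsampling mentioned above can be realized, for example, by duplicating samples of under-represented emojis until every emoji class reaches the size of the largest one; the sketch below is one naive way to do this and is not necessarily the exact procedure used.

```python
import random

def upsample_by_emoji(samples_by_emoji, seed=42):
    """Naive upsampling sketch: duplicate samples of rarer emojis at random
    until every emoji has as many samples as the most frequent one."""
    rng = random.Random(seed)
    target = max(len(samples) for samples in samples_by_emoji.values())
    balanced = {}
    for emoji, samples in samples_by_emoji.items():
        extra = [rng.choice(samples) for _ in range(target - len(samples))]
        balanced[emoji] = samples + extra
    return balanced
```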
7 https://github.com/senticr/SentiCR/.
8 https://github.com/collab-uniba/Senti4SD.
9 https://figshare.com/s/277026f0686f7685b79e.
10 https://github.com/collab-uniba/Emotion_and_Polarity_SO.
In addition, since here we are not using the pre-trained version of DeepMoji, we in fact have no
constraint on the selection of emojis. To better capture the affective expressions in
GitHub posts, we introduce a new variant of SEntiMoji, namely SEntiMoji-G2, which makes use of
the 64 most frequently used emojis on GitHub. Specifically, the only difference between SEntiMoji-
G2 and SEntiMoji-G1 lies in the selection of emojis.
SEntiMoji-T is another variant of SEntiMoji. The only difference between SEntiMoji-T and
SEntiMoji is that SEntiMoji-T uses DeepMoji rather than DeepMoji-SE to train the final senti-
ment/emotion classifier. Compared to SEntiMoji, SEntiMoji-T does not use GitHub data and thus
can learn technical knowledge only from the labeled data.
T-80%, T-60%, T-40%, and T-20% are variants of SEntiMoji-T. They differ from SEntiMoji-T
only in the amount of the labeled training data. They randomly select 80%, 60%, 40%, and 20% of
the labeled training data used by SEntiMoji-T to train the sentiment/emotion classifier and keep
other settings unchanged.
11 http://ansymore.uantwerpen.be/system/files/uploads/artefacts/alessandro/MSR16/archive3.zip.
12 An issue comment may contain more than one sentence.
13 https://github.com/collab-uniba/Senti4SD.
Table 2. Benchmark Datasets Used for Sentiment Detection

                            Polarity distribution
Dataset          #         Positive         Neutral          Negative
JIRA             2,573     1,104 (42.9%)    702 (27.3%)      767 (29.8%)
Stack Overflow   4,423     1,527 (34.5%)    1,694 (38.3%)    1,202 (27.2%)
Code Review      1,600     1,202 (75.1%)*   —                398 (24.9%)
Java Library     1,500     131 (8.7%)       1,191 (79.4%)    178 (11.9%)
*For the Code Review dataset, positive and neutral comments are merged into a single "non-negative" class.
The Code Review dataset14 [1] contains 1,600 code review comments extracted from code review
repositories of 20 popular open source software projects. The dataset originally contained 2,000
comments, each of which was annotated by three coders with positive, negative, or neutral. Based
on the annotation results, the distribution of the 2,000 review comments was: 7.7% positive, 19.9%
negative, and 72.4% neutral. Due to the serious class imbalance, its authors randomly excluded a
subset of the majority class (i.e., neutral) and aggregated the remaining neutral and positive com-
ments into a "non-negative" class. Finally, among the remaining 1,600 comments, 24.9% are negative
and 75.1% non-negative.
The Java Library dataset15 [61] contains 1,500 sentences about Java libraries/APIs extracted
from the Stack Overflow dump of July 2017. Each sentence was annotated by two coders. The
coders labeled each sentence with a sentiment score from −2 to 2 (−2 indicates strong negative, −1
weak negative, 0 neutral, 1 weak positive, and 2 strong positive). After the labeling and discussion
process, each sentence had a consistent and double-checked sentiment label. Finally, among the
1,500 collected sentences, 8.7% are positive, 79.4% neutral, and 11.9% negative.
We summarize the statistics of the above four datasets in Table 2. Besides them, we also create
a unified dataset to further evaluate the generalizability of SEntiMoji and the baseline methods.
Since JIRA, Stack Overflow, and Java Library datasets share the same classes (i.e., positive, negative,
and neutral), we merge the three datasets into a new dataset, i.e., the Unified-S dataset.
4.2.2 Datasets for Emotion Detection. We employ four benchmark datasets (i.e., JIRA-E1 dataset,
SO-E dataset, JIRA-E2 dataset, and Unified-E dataset) for emotion detection. JIRA-E1, SO-E, and
Unified-E datasets are created based on the Shaver framework, while JIRA-E2 is based on the VAD
Model. We then describe these datasets briefly:
The JIRA-E1 dataset16 [81] is a cleaned dataset that is extracted from the original 5,992 samples
in the JIRA dataset. As mentioned in Section 4.2.1, the 5,992 samples were divided into Group-1,
Group-2, and Group-3. Since DEVA includes all the heuristics implemented in SentiStrength-SE
that is developed based on the Group-1 samples, to make a fair comparison among different meth-
ods, here we do not consider the Group-1 data. In addition, the Group-2 data have a very imbalanced
distribution (e.g., only three samples are labeled with surprise), which makes it too difficult for a machine
learning algorithm to learn a classifier. Therefore, in line with previous studies [11, 12], we exclude
the Group-2 data and use only the Group-3 data for emotion detection. Group-3 focuses on four
emotions (i.e., love, joy, sadness, and anger). For each emotion, Group-3 has 1,000 sentences labeled
with the presence or absence of it. Samples labeled with love, joy, sadness, and anger account for
16.6%, 12.4%, 32.4%, and 30.2% in their corresponding 1,000 samples, respectively.
14 https://github.com/senticr/SentiCR/.
15 https://sentiment-se.github.io/replication.zip.
16 http://ansymore.uantwerpen.be/system/files/uploads/artefacts/alessandro/MSR16/archive3.zip.
Table 3. Benchmark Datasets Used for Emotion Detection: Based on Shaver Framework

                             Emotion distribution
Dataset    #         Love            Joy            Anger          Sadness        Fear          Surprise
JIRA-E1    4×1,000   166 (16.6%)     124 (12.4%)    324 (32.4%)    302 (30.2%)    —             —
SO-E       4,800     1,220 (25.4%)   491 (10.2%)    882 (18.4%)    230 (4.8%)     106 (2.2%)    45 (0.9%)
Table 4. A Benchmark Dataset Used for Emotion Detection: Based on VAD Model

                         Emotion distribution
Dataset    #        Excitement      Relaxation     Stress         Depression     Neutral
JIRA-E2    1,795    411 (22.9%)     227 (12.6%)    252 (14.0%)    289 (16.1%)    616 (34.3%)
The SO-E dataset17 [78] is the same dataset as the Stack Overflow dataset, where raters annotated
each sample using both emotional labels and sentiment polarity. Specifically, SO-E dataset contains
4,800 posts and focuses on six emotions, including love, joy, anger, sadness, fear, and surprise. Each
post was annotated with the presence or absence of each emotion. Finally, 1,959 posts received at
least one emotion label. Among the 4,800 samples, 25.4% were labeled with love, 10.2% joy, 18.4%
anger, 4.8% sadness, 2.2% fear, and 0.9% surprise.
The JIRA-E2 dataset18 [50] contains 1,795 issue comments extracted from the JIRA issue tracking
system. Different from the JIRA-E1 dataset, JIRA-E2 dataset is created based on VAD model [50],
where emotions are represented as combinations of different levels of valence and arousal. Specif-
ically, JIRA-E2 dataset involves four emotions, i.e., excitement (positive valence and high arousal),
stress (negative valence and high arousal), depression (negative valence and low arousal), and
relaxation (positive valence and low arousal). First, its authors used a keyword-based searching
method to collect 2,000 samples that were likely to contain valence and arousal from the corpus
of the JIRA issue tracking system. Then they employed three coders to annotate each sample with
one emotional state based on its valence and arousal level. During the annotation process, 205
issue comments were discarded, since the coders could not reach an agreement on the emotions per-
ceived in them. Among the remaining 1,795 comments, 22.9% were labeled with excitement, 12.6%
relaxation, 14.0% stress, 16.1% depression, and 34.3% neutral.
We summarize the statistics of the above three datasets in Table 3 and Table 4. Besides them, we also
create a unified dataset for emotion detection. Since JIRA-E1 and SO-E share four classes (i.e., love,
joy, anger, and sadness), we merge the two datasets into a new dataset, i.e., the Unified-E dataset.
Specifically, similar to JIRA-E1, Unified-E contains four subsets. Each subset is for one emotion
and contains the corresponding 1,000 samples in JIRA-E1 and all the 4,800 samples in SO-E.
17 https://github.com/collab-uniba/EmotionDatasetMSR18.
18 https://figshare.com/s/277026f0686f7685b79e.
Recall represents the sensitivity of one method. The recall of a given class $c$ is calculated as:

$\text{recall@}c = \dfrac{\#\,\text{correctly predicted samples belonging to class } c}{\#\,\text{total samples belonging to class } c}$.   (7)

F1-score is a combination of precision and recall. The F1-score of a given class $c$ is computed as the harmonic mean of precision@$c$ and recall@$c$:

$\text{F1-score@}c = \dfrac{2 \cdot \text{precision@}c \cdot \text{recall@}c}{\text{precision@}c + \text{recall@}c}$.   (8)

Accuracy measures how often one method makes the correct prediction and is defined as:

$\text{accuracy} = \dfrac{\#\,\text{correctly predicted samples}}{\#\,\text{total samples}}$.   (9)
In addition, since some datasets (e.g., the Java Library dataset) have an imbalanced class distribu-
tion, we also employ three macro-average metrics, which are demonstrated to be suitable for such
imbalanced scenarios [91]. Specifically, we employ macro-precision, macro-recall, and macro-F1,
which take the average of precision, recall, and F1-score of all classes, respectively. For instance,
when we consider the binary classification task that aims to identify a specific emotion from texts,
the macro-precision/recall/F1 can be computed as the average of the precision/recall/F1 of the
two corresponding classes (i.e., samples containing the emotion and samples not containing the
emotion).
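Both the per-class and the macro-averaged metrics can be computed, for instance, with scikit-learn; the helper below is only an illustration of how such numbers are obtained.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (illustrative helper)."""
    macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro-precision": macro_p,
        "macro-recall": macro_r,
        "macro-F1": macro_f1,
    }

print(evaluate(["pos", "neg", "neu", "neg"], ["pos", "neu", "neu", "neg"]))
```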
In this study, to make a comprehensive comparison, we report the results based on all of the
metrics above. However, we urge researchers and practitioners to determine which metrics they
should focus on according to their intended goals. For instance, if one needs to perform the sen-
timent detection in a scenario where precision is important, she may pay more attention to the
precision level of different methods. In addition, although all the metrics are well-adopted ones
that can help us quickly compare different methods, using them alone may not be enough in some
specific cases. For example, in a multi-class emotion detection task, classifying “love” as “joy” may
have a different cost than classifying “love” as “sadness,” but the adopted metrics cannot reflect
such a difference. In this case, we encourage researchers and practitioners to use other methods
to complement these metrics. For example, a confusion matrix can be used to show the distri-
bution of the classification results, from which we can observe how many samples with love are
misclassified as “joy” and “sadness,” respectively.
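For completeness, such a confusion matrix can be produced as follows; the toy labels are illustrative.

```python
from sklearn.metrics import confusion_matrix

labels = ["love", "joy", "sadness"]
y_true = ["love", "love", "joy", "sadness", "joy"]
y_pred = ["love", "joy", "joy", "joy", "joy"]

# Rows correspond to the true class and columns to the predicted class,
# so cm[0, 1] counts "love" samples that were misclassified as "joy".
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```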
• Since DEVA and MarValous aim at identifying excitement, stress, depression, and relaxation,
we compare them with other emotion detection methods only on the JIRA-E2 dataset.
• We consider JIRA, Stack Overflow, Java Library, and Unified-S datasets as trinary classi-
fication tasks. On each of the four datasets, we train a classifier that can classify texts as
positive, negative, or neutral. Since Code Review dataset has only two classes, it is consid-
ered as a binary classification task, i.e., negative or non-negative. In addition, as JIRA-E2
dataset focuses on four emotions, we consider it as a multi-class classification task involv-
ing five classes, i.e., classifying each sample as excitement, relaxation, stress, depression, or
neutral.
• JIRA-E1 and Unified-E datasets both contain four sub-datasets, each of which is for a spe-
cific emotion. For example, for the subset for joy, each sample is labeled with joy or not.
Therefore, following the previous studies [12, 81], we consider each of the two datasets as
four binary classification tasks. In each task, we consider only one emotion and use the
corresponding sub-dataset to train a classifier that can judge whether texts contain such an
emotion or not.
• SO-E dataset focuses on six emotions and contains samples that are annotated with more
than one emotion label. To tackle the complicated multi-label classification, we employ the
binary relevance method [87], which transforms the multi-label classification into indepen-
dently training binary classifiers for each label. The binary relevance method is a common
practice for tackling multi-label classification [87] and has been used in previous related
work in SE [12]. In line with this work, we consider SO-E dataset as six binary classification
tasks. In each task, we consider only one emotion and use the entire SO-E dataset to train
a classifier that can judge whether texts contain such an emotion or not.
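The binary relevance scheme can be sketched as follows; `make_classifier` is an assumed factory for an arbitrary binary classifier (e.g., a text-classification pipeline) and is not part of the original tooling.

```python
def train_binary_relevance(texts, label_sets, emotions, make_classifier):
    """Train one independent binary classifier per emotion (binary relevance sketch).

    `label_sets[i]` is the set of emotions annotated for texts[i];
    `make_classifier()` returns a fresh binary classifier with fit/predict.
    """
    classifiers = {}
    for emotion in emotions:
        y = [int(emotion in labels) for labels in label_sets]   # presence/absence
        clf = make_classifier()
        clf.fit(texts, y)
        classifiers[emotion] = clf
    return classifiers

def predict_binary_relevance(classifiers, texts):
    """Predict the set of emotions in each text by querying every binary classifier."""
    return [
        {emotion for emotion, clf in classifiers.items() if clf.predict([text])[0] == 1}
        for text in texts
    ]
```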
For SentiStrength, SentiStrength-SE, and DEVA, as mentioned before, we directly use the off-the-
shelf tools for evaluation. For SentiCR, Senti4SD, and EmoTxt, we train them on the benchmark
datasets using the scripts released by their authors. More specifically, for SentiCR, we use the Gra-
dient Boosting Tree (GBT) [31] to reproduce it as recommended by its authors. For EmoTxt, in the
training phase, it provides two settings for data sampling, i.e., “DownSampling” and “NoDownSam-
pling.” DownSampling randomly samples training data to make all classes have the same frequency
as the minority class, while NoDownSampling does not change the training data. We employ both
the two settings for training EmoTxt and denote them as EmoTxt-Down and EmoTxt-No, respec-
tively. For ESEM-E, we carefully reproduce it following the detailed descriptions in the correspond-
ing paper [74], since its training code has not been released. As recommended by its authors, we use
SVM to reproduce ESEM-E. For MarValous, although its training code has also not been released
directly, we can find the compiled files of its Python scripts in the released package.19 Therefore,
we use a decompiler named uncompyle20 to decompile these files and obtain the training scripts
of MarValous. Then we use the obtained scripts to reproduce MarValous and use SVM to train the
classifier as recommended by its authors.
For SEntiMoji, SEntiMoji-T, T-80%, T-60%, T-40%, and T-20%, we use the pre-trained DeepMoji as
initialization and then fine-tune the model parameters through the chain-thaw approach. During
this process, as recommended by Felbo et al. [30], we use the Adam optimizer [56] with gradient
clipping of the norm to 1 and set the learning rate to $10^{-3}$ for training the replaced layer and $10^{-4}$ for
fine-tuning any pre-trained layers. For SEntiMoji-G1/2, in stage 1, instead of fine-tuning the pre-
trained DeepMoji, we train DeepMoji-SE based on GitHub data from scratch. During this process,
SEntiMoji-G1/2 shares the same hyper-parameter settings as DeepMoji. In stage 2, we fine-tune
the pre-trained DeepMoji-SE following the same procedure and settings as SEntiMoji. We train
and fine-tune all these emoji-based models on a Linux machine with two Tesla M40 GPUs.
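The optimizer configuration described above can be expressed roughly as below (a PyTorch-style sketch; the parameter-group construction is an assumption about how the two learning rates are combined).

```python
import torch

def build_optimizer(model, replaced_layer):
    """Adam with lr=1e-3 for the replaced layer and lr=1e-4 for pre-trained layers (sketch)."""
    replaced_ids = {id(p) for p in replaced_layer.parameters()}
    pretrained = [p for p in model.parameters() if id(p) not in replaced_ids]
    return torch.optim.Adam([
        {"params": list(replaced_layer.parameters()), "lr": 1e-3},
        {"params": pretrained, "lr": 1e-4},
    ])

def training_step(model, optimizer, loss):
    """One update with the gradient norm clipped to 1, as described above."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```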
5 RESULTS
In this section, we answer our research questions based on the evaluation results.
5.1 RQ1: How Does SEntiMoji Perform Compared to Existing Sentiment Detection
Methods in SE?
To answer RQ1, we compare SEntiMoji with four existing sentiment detection methods (i.e., Sen-
tiStrength, SentiStrength-SE, SentiCR, and Senti4SD) on five benchmark datasets (i.e., JIRA, Stack
19 https://figshare.com/s/a3308b7087df910db38f.
20 https://github.com/rocky/python-uncompyle6/.
Overflow, Code Review, Java Library, and Unified-S datasets). We summarize the performance of
all the methods in Table 5. For each combination of dataset and metric, we highlight the best result
with shading.
Analysis: At a glance at the results in Table 5, we observe that SEntiMoji achieves the best
performance on most metrics. In terms of macro-F1, it can outperform existing methods with an
average increase of 0.036 on the five datasets. Next, we analyze the results thoroughly.
We first compare SEntiMoji with the most widely used SentiStrength, which is an off-the-shelf
sentiment detection tool without any SE-customized efforts. On JIRA, Stack Overflow, Code Re-
view, Java Library, and Unified-S datasets, the macro-F1 obtained by SEntiMoji is 0.154, 0.060, 0.168,
0.185, and 0.078 higher than that obtained by SentiStrength, respectively. By comparison, the dif-
ference on the Stack Overflow dataset is the smallest, only 0.060. This “outlier” can be attributed to
the creation process of the Stack Overflow dataset. Calefato et al. [10] created this dataset by sam-
pling the originally collected posts based on their sentiment scores computed by SentiStrength. It
is easier for SentiStrength to correctly classify the samples selected by itself, which thus results in
a relatively small performance gap between SEntiMoji and SentiStrength on the Stack Overflow
dataset.
Then we want to compare SEntiMoji with the SE-customized methods (i.e., SentiStrength-SE,
SentiCR, and Senti4SD). In general, SentiCR performs the best among the three existing methods
as it can achieve the highest accuracy and macro-F1 on each dataset except the Stack Overflow
and Unified-S datasets. On the Stack Overflow and Unified-S datasets, it performs slightly worse
than Senti4SD. Similarly, Islam and Zibran [49] also found that Senti4SD can achieve the highest
accuracy on the Stack Overflow dataset. This is reasonable, as the semantic features used by Senti4SD
are extracted based on the embeddings trained on a large-scale Stack Overflow corpus, and thus
Senti4SD is more knowledgeable than SentiCR when dealing with Stack Overflow posts in Stack
Overflow and Unified-S datasets. As SentiCR has an obvious advantage over other existing SE-
customized methods in general, we then just compare SEntiMoji with it.
Compared to SentiCR, SEntiMoji obtains a better result in 54 of 62 metrics. In terms of macro-F1,
SEntiMoji outperforms SentiCR on all the datasets, with an average increase of 0.044. Similarly,
SEntiMoji can achieve a higher accuracy level. For example, on the JIRA dataset, the accuracy
scores obtained by SEntiMoji and SentiCR are 0.904 and 0.872, respectively. In other words, their
error rates are 0.096 and 0.128, respectively. It indicates that SEntiMoji can reduce 25% of the
samples misclassified by SentiCR on the JIRA dataset. In addition, when carefully inspecting the
results, we find that the performance gap between SentiCR and SEntiMoji is particularly large in
some cases. For instance, in terms of precision, the extent to which SEntiMoji outperforms SentiCR
on the Java Library dataset is obviously larger than on other datasets. This phenomenon can be
attributed to the sampling methods of the different datasets. As the JIRA dataset is labeled with various
emotion labels, the multi-class emotions need to be mapped into trinary polarities, and some samples
are filtered due to ambiguity, which results in a relatively balanced data distribution. With regard
to the Stack Overflow and Code Review datasets, we find that both were manually pre-processed
and filtered during their creation to avoid a heavily skewed sentiment distribution [1, 10].
The sampling of these datasets makes the classification tasks considerably easier, so SentiCR and
SEntiMoji do not show an obvious performance difference on these datasets. Compared to other
datasets, Java Library dataset has a more imbalanced class distribution, i.e., 79.4% of samples are
neutral. In such a situation, SEntiMoji can achieve 0.296 and 0.183 higher precision over positive
and negative samples while keeping a similar level of recall, which further demonstrates the
superiority of SEntiMoji.
Finally, to verify whether the superiority of SEntiMoji is statistically significant, for each dataset,
we perform McNemar’s test between the results produced by SEntiMoji and each baseline method.
Table 6. McNemar’s Statistics between the Results of SEntiMoji and Other Existing
Sentiment Detection Methods on Different Datasets, with Adjusted p-values in Parentheses
We present the statistical results in Table 6. For each dataset, we formulate null and alternative hy-
potheses to determine the statistical significance of the performance gap between SEntiMoji and
existing sentiment detection methods on it. Take the JIRA dataset as an example. The correspond-
ing null and alternative hypotheses are as follows:
Null hypothesis-1 (H01): There is no significant difference in the performance of SEntiMoji com-
pared to existing sentiment detection methods on the JIRA dataset.
Alternative hypothesis-1 (Ha1): There exist significant differences in the performance of SEntiMoji
compared to existing sentiment detection methods on the JIRA dataset.
According to Table 6, on the JIRA dataset, the performance difference between SEntiMoji and
each baseline method is found to be statistically significant with p = 0.000 (i.e., p < 0.05). Therefore,
the McNemar’s test rejects our null hypothesis (H 01 ), and the alternative hypothesis (Ha1 ) holds
true. Similarly, McNemar’s test can reject the null hypotheses formulated for the remaining four
benchmark datasets at a significance level of 0.05. Therefore, we can derive the answer to RQ1 as
follows:
Ans. to RQ1: SEntiMoji can significantly outperform existing sentiment detection methods
in SE on all of the benchmark datasets.
5.2 RQ2: How Does SEntiMoji Perform Compared to Existing Emotion Detection
Methods in SE?
To answer RQ2, we compare SEntiMoji with five existing emotion detection methods (i.e.,
DEVA, EmoTxt-Down, EmoTxt-No, MarValous, and ESEM-E) on JIRA-E1, SO-E, JIRA-E2, and
Unified-E datasets. As explained in Section 4.4, we evaluate DEVA and MarValous only on the
JIRA-E2 dataset, and consider the JIRA-E1, SO-E, and Unified-E datasets as four, six, and four
binary classification tasks, respectively. We summarize the performance of all these methods in
Table 7. For each combination of dataset and metric, we highlight the best result with shading.
To save space, for each task on the JIRA-E1, SO-E, and Unified-E datasets, we report only the pre-
cision, recall, and F1-score of the detected emotions as well as the overall accuracy, macro-precision,
macro-recall, and macro-F1.
Analysis: At a glance at the results in Table 7, we find that SEntiMoji achieves the highest
value on most metrics. In particular, SEntiMoji obtains the highest accuracy in all the tasks and
the highest macro-F1 in 14 of 15 tasks. In terms of macro-F1, it can outperform these methods with
an average increase of 0.036. Next, we analyze the results thoroughly.
First, we compare SEntiMoji with DEVA and MarValous. SEntiMoji can outperform DEVA in
terms of all the 19 reported metrics on the JIRA-E2 dataset. Given the accuracy scores of DEVA
and SEntiMoji are 0.830 and 0.886, their error rates are 0.170 and 0.114, respectively. In other words,
SEntiMoji can reduce 32.9% of the errors compared to DEVA. Similarly, SEntiMoji can outperform
MarValous in 18 of 19 metrics on the JIRA-E2 dataset.
Then we compare SEntiMoji with EmoTxt-Down. An obvious advantage of EmoTxt-Down is
that it achieves the highest recall of emotions in 7 of the 10 tasks of JIRA-E1 and SO-E. This is reasonable,
since EmoTxt-Down samples training data to ensure a balanced distribution of different classes.
By comparison, SEntiMoji is trained based on imbalanced data and thus tends to classify samples
as the majority class (i.e., absence of the emotion). As a result, SEntiMoji cannot find as many
emotional samples as EmoTxt-Down in some cases. However, the disadvantage of EmoTxt-Down
is also very obvious: its precision over emotional samples is often much lower than SEntiMoji. For
example, in the sadness detection task of the SO-E dataset, SEntiMoji obtains more than three times
the precision of sadness samples obtained by EmoTxt-Down. The disadvantage of EmoTxt-Down
can be attributed to its sampling strategy, too. As emotional samples are scarce in the JIRA-E1 and
SO-E datasets, to achieve a balanced distribution, EmoTxt-Down needs to filter out a large number
of nonemotional samples. Therefore, it lacks knowledge about nonemotional samples and has a
high chance of classifying nonemotional samples as emotional, which results in a low precision
on emotional samples. By comparison, SEntiMoji can achieve a better balance between precision
and recall (i.e., higher F1-score) than EmoTxt-Down.
Next, we compare SEntiMoji with EmoTxt-No. We find that SEntiMoji can outperform EmoTxt-
No in almost all the metrics. The main disadvantage of EmoTxt-No is its low recall level, especially
on the JIRA-E1 and SO-E datasets. For example, in the love detection task of the JIRA-E1 dataset,
the recall level of love samples obtained by EmoTxt-No is only 0.066, while the one obtained by
SEntiMoji is 0.674. In the surprise detection task of the SO-E dataset, EmoTxt-No even has no
ability to distinguish between surprise and no-surprise as it classifies all the test samples as no-
surprise.21 The low recall of EmoTxt-No can be attributed to the scarcity of emotional samples
in these two datasets, which leaves EmoTxt-No with little knowledge about emotional samples
and thus able to identify only a few of them. By comparison, since SEntiMoji is developed
based on large-scale emoji samples that contain rich emotional information, it can alleviate this
shortcoming to some extent.
Finally, we compare SEntiMoji with ESEM-E. Although ESEM-E achieves a higher macro-F1
score than SEntiMoji in the surprise task of SO-E, it achieves lower macro-F1 scores than SEntiMoji
in all the other tasks. In addition, SEntiMoji achieves higher accuracy than ESEM-E in all the tasks.
21 Since all the test samples are classified as no-surprise, we cannot calculate the precision and F1-score of surprise class.
To verify whether the superiority of SEntiMoji is statistically significant, for each task, we per-
form McNemar’s test between the results produced by SEntiMoji and each baseline method. We
present the statistical results in Table 8. Then, for each task, we formulate null and alternative hy-
potheses to determine the statistical significance of the performance gap between SEntiMoji and
existing emotion detection methods. Take the love detection task of the JIRA-E1 dataset as an
example. The null and alternative hypotheses formulated for it are as follows:
Null hypothesis-2 (H02): There is no significant difference in the performance of SEntiMoji com-
pared to existing emotion detection methods in the love detection task of the JIRA-E1 dataset.
Alternative hypothesis-2 (Ha2): There exist significant differences in the performance of SEntiMoji
compared to existing emotion detection methods in the love detection task of the JIRA-E1 dataset.
Table 8. McNemar’s Statistics between the Results of SEntiMoji and Other Existing Emotion Methods
on Different Datasets, with Adjusted p-values in Parentheses
According to Table 8, in the love detection task of the JIRA-E1 dataset, the performance dif-
ference between SEntiMoji and each baseline method is found to be statistically significant with
p = 0.000 (i.e., p < 0.05). Thus, McNemar's test rejects the null hypothesis (H02), and the alternative
hypothesis (Ha2) holds true. This indicates that SEntiMoji can significantly outperform the
existing methods in the love detection task of the JIRA-E1 dataset. Similarly, McNemar’s test can
reject the null hypotheses formulated for all the remaining tasks except the joy detection task of
the JIRA-E1 dataset and the fear and surprise detection tasks of the SO-E dataset. In the JIRA-E1
dataset, samples labeled with joy account for only 12.4%, less than the ones labeled with other
three emotions. Such a distribution makes the joy detection task difficult and thus SEntiMoji can-
not show a significant difference from existing methods. Similarly, the fear and surprise detection
tasks are the most difficult ones on the SO-E dataset, as samples labeled with the two emotions
account for only 2.2% and 0.9%, respectively. Take the surprise detection task as an example. Given
that more than 99% of samples are labeled with no-surprise, the classification task is like “finding
needles in a haystack.” In such a situation, EmoTxt-No even has no ability to distinguish between
surprise and no-surprise. Although SEntiMoji does not perform as badly as EmoTxt-No, SEntiMoji
cannot achieve significantly better results than other methods due to the difficulty of the problem.
Based on the aforementioned analysis, we can derive an answer to RQ2:
Ans. to RQ2: SEntiMoji can significantly outperform existing emotion detection methods in
SE, unless the data distribution is very imbalanced.
5.3 RQ3: Which Training Corpora Contribute More to the Power of SEntiMoji?
Since SEntiMoji can achieve significant improvement on most benchmark datasets, we then inves-
tigate the reasons behind its power, which can provide insights for future research. As SEntiMoji
does not outperform other methods significantly in the joy detection task of the JIRA-E1 dataset
and the fear and surprise detection tasks of the SO-E dataset, in this part, we do not take the three
tasks into consideration. In the remaining tasks, we evaluate SEntiMoji and its variants to measure
the contribution of different training corpora.
5.3.1 GitHub Data vs. Twitter Data. We first compare SEntiMoji with SEntiMoji-G1, SEntiMoji-
G2, and SEntiMoji-T, to measure the contribution of GitHub data and Twitter data. The four
methods share the same model architecture and use the same manually labeled data. They dif-
fer only in the external data used in representation learning. Specifically, SEntiMoji uses Twitter
data to supply the general sentimental expressions and uses GitHub data to learn technical knowl-
edge. By comparison, SEntiMoji-G1 and SEntiMoji-G2 use only GitHub data, while SEntiMoji-T
uses only Twitter data. We summarize the performance of these methods in Table 9 and Table 10. In
addition, to evaluate the statistical significance of the performance difference, we perform the Mc-
Nemar’s test to compare SEntiMoji-G1, SEntiMoji-G2, and SEntiMoji-T with SEntiMoji. Table 11
presents the statistical results. Next, we analyze the results thoroughly.
Analysis for sentiment detection: Table 9 presents the performance of SEntiMoji and its
variants in sentiment detection tasks. We can observe that SEntiMoji-T obtains comparable results
with SEntiMoji, with only an average decrease of 0.013 in macro-F1 on the five datasets. On the
Unified-S dataset, SEntiMoji-T can even obtain a slightly higher macro-F1 score than SEntiMoji.
The comparable performance of SEntiMoji and SEntiMoji-T is also demonstrated by the results
of statistical tests. As shown in Table 11, the performance difference between SEntiMoji and
SEntiMoji-T is not statistically significant on any of the five datasets for sentiment detection.
By comparison, the performance degradation of SEntiMoji-G1 and SEntiMoji-G2 compared to
SEntiMoji is more obvious. Specifically, the average decrease of SEntiMoji-G1 and SEntiMoji-G2
in macro-F1 is 0.091 and 0.174, 7 times and 13 times that of SEntiMoji-T, respectively. In addition,
SEntiMoji-G1 and SEntiMoji-G2 lose the advantage of our emoji-powered method in the precision
level on the Java Library dataset. The precision@neg gained by SEntiMoji-G1 and SEntiMoji-G2 is
0.489 and 0.311, much lower than the precision levels obtained by SEntiMoji (0.729) and SEntiMoji-
T (0.734). Moreover, the statistical results in Table 11 show that SEntiMoji can significantly outperform
SEntiMoji-G1 and SEntiMoji-G2 on all the five datasets for sentiment detection with p = 0.000.
Analysis for emotion detection: Table 10 presents the performance of SEntiMoji and its vari-
ants in emotion detection tasks. In these tasks, SEntiMoji-T achieves comparable results with SEn-
tiMoji. In terms of macro-F1, SEntiMoji-T achieves better performance than SEntiMoji in three
tasks, with an average increase of 0.008. In the remaining nine tasks, SEntiMoji-T achieves slightly
worse performance, with an average decrease of only 0.009 in macro-F1 compared to SEntiMoji.
The results of statistical tests in Table 11 further demonstrate that the performance difference
between SEntiMoji and SEntiMoji-T is not significant in 11 of 12 emotion detection tasks.
By comparison, SEntiMoji-G1 and SEntiMoji-G2 perform obviously worse than SEntiMoji. For example, on the JIRA-E2 dataset, SEntiMoji-G1 and SEntiMoji-G2 achieve macro-F1 scores of
0.596 and 0.569, which are 0.267 and 0.294 lower than the macro-F1 achieved by SEntiMoji (0.863),
respectively. In addition, results in Table 11 demonstrate that in 11 of 12 tasks, the performance
gap between SEntiMoji and SEntiMoji-G1/G2 is statistically significant.
Based on the above analysis for sentiment and emotion detection, we can find that SEntiMoji
can obtain comparable performance with SEntiMoji-T. However, the performance of SEntiMoji-G1/G2
is significantly worse than SEntiMoji. Therefore, we can conclude that Twitter data contribute
more than GitHub data to the power of SEntiMoji.
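The macro-F1 comparisons throughout this section average the per-class F1 scores, so a gap such as the 0.091 average decrease of SEntiMoji-G1 can be recomputed directly from the predictions. The following sketch (with made-up label arrays) shows the metric as it is conventionally computed; it is not taken from the evaluation scripts.

from sklearn.metrics import f1_score

# Hypothetical ground truth and predictions on a sentiment dataset.
y_true    = ["pos", "neg", "neu", "neg", "pos", "neu", "neu", "neg"]
variant_a = ["pos", "neg", "neu", "neg", "pos", "neu", "neg", "neg"]   # e.g., SEntiMoji
variant_b = ["neu", "neg", "neu", "pos", "pos", "neu", "neg", "neu"]   # e.g., SEntiMoji-G1

f1_a = f1_score(y_true, variant_a, average="macro")
f1_b = f1_score(y_true, variant_b, average="macro")
print(f"macro-F1 of variant A: {f1_a:.3f}")
print(f"macro-F1 of variant B: {f1_b:.3f}")
print(f"decrease:              {f1_a - f1_b:.3f}")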
5.3.2 Contribution of Manually Labeled Data. We then measure the contribution of the manu-
ally labeled data. Specifically, we compare SEntiMoji-T with its variants (i.e., T-80%, T-60%, T-40%,
and T-20%) to verify whether the performance declines with the reduction of the manually labeled
data. All the five methods use only Tweets for representation learning, and learn technical knowl-
edge only from the labeled data. The sole difference is the amount of the labeled data that they use. The performance of these methods is summarized in Table 12 and Table 13. In addition, we
perform the McNemar’s test to evaluate the statistical significance of the performance difference
between SEntiMoji-T and its variants. The results of statistical tests are presented in Table 14. Next,
we analyze the results in detail.
Table 10. Performance of SEntiMoji and Its Variants in Emotion Detection Tasks
Analysis for sentiment detection: Table 12 presents the performance of SEntiMoji-T and its
variants in sentiment detection tasks. We can observe that when we use 80% of the labeled data (i.e.,
T-80%), the performance is comparable with SEntiMoji-T. In terms of macro-F1, T-80% even obtains
a better result than SEntiMoji-T on the Code Review dataset, although only with an increase of
0.011. On the Unified-S dataset, the macro-F1 scores obtained by SEntiMoji-T and T-80% are the
same. On the remaining three datasets, T-80% performs slightly worse than SEntiMoji-T, with an
average decrease of 0.01. However, when we use 20% of the labeled data for training (i.e., T-20%),
we can observe an obvious decrease in performance. Specifically, T-20% performs worse than SEntiMoji-T on almost all the metrics, and in terms of macro-F1, it has an average decrease of 0.056.
To better understand the changes in performance when scaling down the labeled data, we then turn to the statistical results in Table 14 for further analysis. On each of the datasets for
sentiment detection, we find that the performance difference between SEntiMoji-T and T-80%
is not statistically significant, which is consistent with our finding that the performance of the
two methods is comparable. On the JIRA and Code Review datasets, the performance does not become significantly worse until only 20% of the labeled data remain. However, on the Stack Overflow and Unified-S datasets, the performance is already significantly worse when 60% of the labeled data remain, and the performance gap widens as the labeled data are further reduced.
Analysis for emotion detection: Table 13 presents the performance of SEntiMoji-T and its
variants in emotion detection tasks. We can observe that T-80% can obtain comparable perfor-
mance with SEntiMoji-T, which is consistent with our finding in sentiment detection tasks. Specif-
ically, in terms of macro-F1, SEntiMoji-T can outperform T-80% in five tasks, while T-80% can out-
perform SEntiMoji-T in seven tasks. In many tasks, even when we use only 20% of labeled data (i.e.,
T-20%), the performance is still comparable with SEntiMoji-T. For example, in the sadness detec-
tion task of the JIRA-E1 dataset, the macro-F1 scores obtained by SEntiMoji-T and T-20% are 0.873
and 0.872, with only a trivial gap of 0.001. However, in some tasks, using such a small amount of
labeled data can result in an obvious decrease in performance. For instance, in the anger detection
task in JIRA-E1, the macro-F1 of T-20% is 0.087 lower than the one obtained by SEntiMoji-T.
Then we analyze the McNemar’s statistics between the results of SEntiMoji-T and its variants
in Table 14. In 11 of 12 emotion tasks, the performance remains comparable when 60% of the
labeled data are excluded (i.e., T-40%). In addition, in five tasks, the performance is still comparable
even when 80% of labeled data are excluded (i.e., T-20%).
Based on the above analysis for sentiment and emotion detection, we can find that a certain amount of labeled data is essential in some tasks, such as sentiment detection on the Stack Overflow and Unified-S datasets. However, in most tasks, the combination of the large-scale Tweets and a modest amount of labeled technical texts is able to obtain satisfactory results, which demonstrates the importance of the general sentimental knowledge incorporated by the Twitter data.
Ans. to RQ3: The large-scale Twitter data rather than the GitHub data make a key contribution
to the power of SEntiMoji. Although a certain amount of labeled data is essential in some tasks, the combination of Twitter data and a modest amount of labeled technical texts is able to obtain satisfactory results in most tasks.
5.4.1 Error Analysis of Sentiment Detection. On JIRA, Stack Overflow, Code Review, and Java
Library datasets, SEntiMoji misclassifies 452 samples in total. To achieve 95% confidence level and
5% confidence interval, we randomly select 208 misclassified samples for manual inspection. We
follow an open coding method [67] to inspect and classify selected samples into different error
causes. First, the first two authors jointly inspect 50 samples to distill the initial set of error causes
and develop a coding guide with definitions and examples for each identified cause. Then, the two
authors use the coding guide to independently label the remaining 158 samples, each of which is
annotated with the most likely error cause. To measure the inter-rater agreement, we employ Cohen's Kappa (κ) [19] and obtain κ = 0.819. Such a statistic indicates almost perfect agreement [62]
and evidences the reliability of the coding schema and procedure. Finally, the samples where the two
authors assign different labels are discussed to determine their final labels. To resolve conflicts, a
colleague, who is familiar with our research, is introduced as an arbitrator [21].
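As a side note, the sample size of 208 (out of 452 misclassified samples) follows from the standard formula for a 95% confidence level and a 5% confidence interval with finite-population correction, and the reported agreement is the usual Cohen's Kappa. The sketch below reproduces that calculation; the two annotator label lists are invented for illustration and are not the actual codes.

import math
from sklearn.metrics import cohen_kappa_score

def required_sample_size(population, z=1.96, margin=0.05, p=0.5):
    # Cochran's formula with finite-population correction.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)          # ~384.16 for 95%/5%
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(required_sample_size(452))    # -> 208 misclassified sentiment samples
print(required_sample_size(1863))   # -> 319 misclassified emotion samples (Section 5.4.2)

# Inter-rater agreement on hypothetical labels from the two annotators.
rater1 = ["C1", "C2", "C1", "C3", "C5", "C1", "C2", "C4"]
rater2 = ["C1", "C2", "C1", "C3", "C5", "C2", "C2", "C4"]
print(cohen_kappa_score(rater1, rater2))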
In total, we distill seven error causes of sentiment detection. The distribution of these categories
is shown in Table 15. Each category is described as follows.
C1: Implicit sentiment. About 31.7% of errors are due to the implicit sentiment polarity.
In many cases, developers use neutral expressions to convey sentiment. Such sentiment can be
captured by annotators due to their rich knowledge, but it is difficult for a machine learning model
whose knowledge is only from the training data. For example, the sample “My patch wouldn’t
compile” (id: 937006_1, JIRA dataset) conveys a negative sentiment, but it contains no negative
terms. Therefore, it is too obscure for SEntiMoji to classify it correctly.
C2: Misinterpreted purposes. This category accounts for 25.0% of bad cases. It includes neu-
tral requests, questions, and tips that are mistaken as positive or negative. For example, “Avoid
legacy date-time classes” (id: 5122, Java Library dataset) is considered by SEntiMoji as negative,
although it is just a tip. In addition, this category also includes objective polar utterances, which
describe positive or negative factual information about something without conveying a private
state [109]. Take the post “TestBlockTokenWithDFS fails for the same reason” as an example. It
reports a problem that may cause negative results and thus is classified as negative by SEntiMoji.
However, it is an objective polar utterance that does not convey any emotion or opinion.
C3: Complex context information. About 18.3% of bad cases can be attributed to complex
context information. On one hand, some samples are compound sentences that contain some con-
junction words such as “but,” “and,” “unless,” and “so.” In such samples, sub-sentences may convey
different sentiment, which makes it challenging for a machine learning model to determine the
overall sentiment. For example, in the negative sample “Also your username was stupidly named
and confusing, so I deleted it and created a new one more appropriate” (id: 2148299_4, JIRA dataset),
the first sub-sentence conveys obviously negative sentiment, but the second sub-sentence is neu-
tral. As a result, SEntiMoji misclassifies it as neutral. In addition, some samples are document-level,
i.e., a sample may contain more than one sentence. However, all the existing sentiment detection
methods (including SEntiMoji) consider the entire content of a sample as the unit of analysis, re-
gardless of whether the sample is a sentence or a document. Under such a situation, document-level
samples consisting of sentences conveying different sentiment make the sentiment detection task
challenging.
C4: Failure to capture obvious sentiment. Although SEntiMoji can outperform existing
methods in SE, it still cannot correctly classify some samples with obvious sentiment. For ex-
ample, SEntiMoji considers “Seems to work reliably now” (id: 719536_1, JIRA dataset) neutral, but
in fact it conveys an obviously positive sentiment. Cases attributed to this category account for
about 16.3%.
C5: Politeness. In online communication, developers may use lexical cues of politeness, such
as “please,” “thanks,” “sorry,” and “I’m afraid,” to decorate a neutral description. For example,
the sample “Thanks Sian. Patch applied in SECURITY module and third party notices file at repo
revision r543463. Please check it was applied as you expected” (id: 1050218, JIRA dataset) expresses
no sentiment, but the polite cues (i.e., “Thanks” and “please”) mislead SEntiMoji to consider it
positive. This category accounts for about 6.3% of bad cases.
C6: Data pre-processing. SEntiMoji needs to pre-process texts before learning representations
of them, to filter out some noise. However, data pre-processing can also filter out some useful
information. For instance, in the sample “+1 on intent from looking at what the patch fixes” (id:
819316_1, JIRA dataset), “+1” conveys positive sentiment. However, it is replaced with a specific
token during data pre-processing, which causes this sample to be classified as neutral. The bad cases
in this category account for only 1.4%.
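To make this error cause concrete, the following toy pre-processing function shows how a noise-reduction step can erase such a cue; the regular expression is a hypothetical example and not the actual SEntiMoji pipeline.

import re

def preprocess(text):
    # Replace numeric tokens with a placeholder, as many text pipelines do to reduce noise.
    return re.sub(r"[+-]?\d+", "<number>", text)

print(preprocess("+1 on intent from looking at what the patch fixes"))
# -> "<number> on intent from looking at what the patch fixes"
# The positive cue "+1" is gone before the classifier ever sees the text.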
C7: Figurative language. Figurative language can be used to express metaphor, sarcasm, irony,
humor, and so on. Its usage can make sentiment detection challenging. For example, “Regex
is your friend” (id: 695065_2, JIRA dataset) indicates a positive attitude to regex, but such sentiment
is implied in the metaphor and thus cannot be captured by SEntiMoji. Another sample we identi-
fied in this category is “Well, the funny part is, we do not know” (id: 4578, Java Library dataset),
which conveys negative sentiment through an expression of irony but is classified by SEntiMoji
as neutral.
5.4.2 Error Analysis for Emotion Detection. As is described before, for emotion detection, we
perform binary classifications on the JIRA-E1 and SO-E datasets. For example, in the love task of
JIRA-E1, we classify each sample as love or no-love. In the ten binary classification tasks, 1,863
bad cases are generated. To achieve 95% confidence level and 5% confidence interval, we randomly
select 319 bad cases for manual inspection. Similar to the error analysis for sentiment detection, we
follow an open coding method to distill categories of error causes and then label each sampled case
with one category. During the independent labeling process, the inter-rater agreement is almost
perfect given κ = 0.858. Finally, we obtain eight categories and their distribution is summarized in
Table 16.
The first seven categories are almost the same as those obtained in the error analysis for sen-
timent detection. Take the first category (C1) as an example. Just as implicit sentiment can
mislead SEntiMoji to classify sentimental samples as neutral, implicit emotion can also mislead
SEntiMoji to classify emotional samples as conveying no emotion. To save space, we do not describe the remaining six categories one by one here. What we want to highlight is the distribution of
the first seven categories. In sentiment detection, the seven categories in descending order of their
frequencies are C1, C2, C3, C4, C5, C6, and C7. However, in emotion detection, C3 and C4 occur
much more frequently than in sentiment detection. This phenomenon can be explained as follows:
• Explanation for C3: In each of the six classification tasks on SO-E, the entire 4,800 document-
level samples are used for evaluation. Considering that JIRA-E1 has only 4,000 sentence-
level samples, with 1,000 samples per classification task, we can find that a large proportion
of samples used for the binary emotion detection are documents rather than sentences.
Since SEntiMoji considers the entire content of each sample (regardless of whether it is a sen-
tence or a document) as the unit of analysis, these document-level samples have a high
chance of being misclassified due to the complex context information (C3) in the documents.
• Explanation for C4: Compared to sentiment detection, emotion detection tends to have a
more imbalanced data distribution, as samples conveying a specific emotion are usually
only a subset of positive or negative samples. In other words, emotion detection is more
challenging than sentiment detection. Take the sadness task of SO-E as an example. As there
are only 4.8% of the samples conveying sadness, SEntiMoji has only a little knowledge about
such an emotion and thus may misclassify some obviously sad samples as no-sadness. As a
result, we find that C4 accounts for such a large proportion.
Besides the seven categories shared by sentiment and emotion detection, we find an additional
category for emotion detection:
C8: The same emotion polarity. Some emotions, especially those with the same polarity,
are relatively similar. For instance, joy and love are both positive emotions, so it is difficult for a
machine learning model to distinguish them. Take the post “Great point - will change!” (id: t2837,
SO-E dataset) as an example. SEntiMoji classifies it as love due to the positive emotion contained
in it, but actually it does not convey the love emotion. We find that 23.5% of bad cases can be
attributed to this category.
The severity of C8 can also be evidenced by the distribution of the cases misclassified by SEntiMoji on the JIRA-E2 dataset (see Table 17). As excitement and relaxation are both positive, the most common mistake for each of the two emotions is to classify it as the other. Similarly, stress samples
are often identified as depression.
Ans. to RQ4: For sentiment detection, SEntiMoji mainly faces challenges caused by implicit
sentiment, misinterpreted purposes, complex context information, failure to capture obvious
sentiment, politeness, data pre-processing, and figurative language. For emotion detection, be-
sides these challenges, SEntiMoji also faces the challenge caused by the same emotion polarity.
                              Predicted Polarity
                        Positive    Neutral    Negative
Ground    Positive            43         84           4
Truth     Neutral              6      1,148          37
          Negative             2         72         104

                              Predicted Polarity
                        Positive    Neutral    Negative
Ground    Positive             0         12           2
Truth     Neutral             28          0          54
          Negative             2         27           0
Additionally, SEntiMoji can benefit the empirical studies that investigate the correlation between
developers’ sentiment/emotion and other SE-related factors. For example, when exploring the cor-
relation between negative sentiment and issue fixing time [81], the conclusion might be biased if
many identified negative posts are not really negative. The high precision of SEntiMoji on senti-
mental and emotional posts can alleviate this bias to some extent.
Although SEntiMoji achieves satisfactory precision in comparison with previous methods, we
should admit that more efforts should be devoted to further improving its recall level. If the recall
level is further improved, then SEntiMoji can be more qualified for the aforementioned tasks in
SE. For example, with a higher recall level, SEntiMoji can be more sensitive to negative emotions
of developers and thus inform the project managers more promptly.
Finally, we urge researchers and practitioners to select the appropriate method in line with the
intended goals. Before selecting the method, they should first make sure which sentiment/emotion
is the focus of the application scenario. For example, if project managers want to detect whether
developers feel stressed over a period of time, then stress is the focus of this task. Given the satisfactory performance of SEntiMoji in detecting stress, SEntiMoji can be appropriate. However, if the
managers want to know if developers lack courage or feel fearful on new tasks, they can choose
other methods, since SEntiMoji cannot outperform existing methods in detecting fear. In addition,
the granularity of emotion detection also depends on the application scenarios. In the applications
just described, we only need a simple binary classifier to detect the presence of stress/fear. How-
ever, in some scenarios, a fine-grained emotion classifier is needed. For example, Ortu et al. [81]
aimed to investigate the relation between different emotions (e.g., love, joy, anger, and sadness)
and the time to fix a JIRA issue, so they needed a fine-grained classifier that can identify all these
emotions.
• Researchers can try to leverage the general affective expressions from the open do-
main via transfer learning, rather than unilaterally pursuing the domain-specific knowl-
edge from the limited labeled data in SE. Previous studies found that sentiment detection tools
trained on non-technical texts are not always adequate for SE tasks [48, 54] and a lack of techni-
cal knowledge is the main reason [48]. Since then, many studies have focused on how to use labeled
SE-related texts to train SE-customized sentiment and emotion classifiers [1, 10, 12, 48, 50]. How-
ever, in fact, there are many general affective expressions shared by SE and open domains (e.g.,
social media). Therefore, it is worth asking whether the potential value of such information has been fully explored in previous studies. Since the customization of sentiment and emotion detection for SE is still in its early stage, the labeled SE-related texts are too scarce to train a high-quality classification model. Therefore, we should not place emphasis solely on learning technical knowledge from the existing labeled SE-related data. In this study, we demonstrate
that leveraging the general affective information from large-scale social media data via transfer
learning can effectively facilitate sentiment and emotion detection in SE.
• The construction process of datasets can affect the evaluation, so the performance
on different datasets should be analyzed more rationally. When comparing SEntiMoji with
existing baseline methods in sentiment detection tasks, we find that performance gaps vary a lot on
different datasets. This is not the first time that such a phenomenon has been observed and reported. Novielli
et al. [79] attributed this result to the different labeling approaches (i.e., model-driven annotation
or ad hoc annotation). In this study, we turn to the construction process of these datasets to seek
the reason. For example, the Stack Overflow dataset is constructed based on the corpus filtered by
SentiStrength. Therefore, when comparing the performance of SentiStrength and other methods
on this dataset, we should be careful to avoid being biased.
In addition, both the Stack Overflow and Code Review datasets are pre-processed to avoid im-
balanced class distribution. This makes the classification task much easier and may not match their
target application scenarios. If researchers want to apply a sentiment detection method to a tech-
nical Q&A site such as Stack Overflow where the sentiment distribution is highly imbalanced, then
they should not be blinded by the satisfactory results obtained on such datasets. Of course, we can-
not deny or ignore that in some applications, the distribution of sentiment is relatively balanced.
For these applications, evaluations on such processed datasets are meaningful.
• Researchers can pay more attention to the identified serious error causes, such
as complex context information and implicit sentiment/emotion, to further improve the
performance of SE-customized sentiment and emotion detection. Our error analysis in Sec-
tion 5.4 identifies open challenges in sentiment and emotion detection in SE, which can facilitate
the future work. Researchers can pay more attention to the dominant error causes. Techniques in other communities such as NLP can be borrowed to alleviate these problems. Take C3: Complex
context information as an example. SE-related texts often contain more than one sentence and the
different semantics of sentences in a text make sentiment and emotion identification tasks difficult.
In NLP, some research efforts have been devoted to document-level sentiment detection [27, 112].
Researchers can borrow some insights from these studies to improve the sentiment and emotion
detection tasks in SE.
• The characteristics of datasets can affect the results of error analysis, so the severity
of different error causes should be considered rationally. When comparing the distribution
of error causes in sentiment and emotion detection, we find some inconsistencies. For example,
because emotion detection is mainly performed on documents rather than sentences, we find that
complex context information is most severe in this task. By comparison, fewer document-level samples are used in the evaluation of sentiment detection, so this problem is masked to some extent in the error analysis for sentiment detection. This finding informs researchers to consider the
severity of different error causes rationally. If researchers want to apply the sentiment and emotion
detection methods to document-level texts, such as issue comments and Stack Overflow posts, then
C3: complex context information should be taken seriously. However, if the texts in their application
scenarios are mainly sentences, this error cause may be less severe.
7 THREATS TO VALIDITY
Threats to construct validity concern the relation between theory and observation. The JIRA
dataset is originally labeled with different emotions. To use this dataset for sentiment detection, we
follow previous studies to map the multi-class emotions to trinary sentiment polarity labels and fil-
ter some ambiguous samples. This process may alter the original distribution of this dataset and
lower the difficulty of the classification task, which can affect the performance of different meth-
ods. However, fortunately, we have several benchmark datasets for a comprehensive comparison
and the superiority of SEntiMoji is not observed only on the JIRA dataset.
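A mapping of this kind could look like the sketch below; the label sets, the handling of conflicting emotions, and the neutral fallback are assumptions made for illustration rather than the exact rules of the previous studies we follow.

# Hypothetical emotion-to-polarity mapping for the JIRA dataset.
EMOTION_TO_POLARITY = {
    "love": "positive", "joy": "positive",
    "anger": "negative", "sadness": "negative",
}

def to_sentiment(emotions):
    # Map a sample's emotion labels to one polarity; drop ambiguous samples.
    polarities = {EMOTION_TO_POLARITY[e] for e in emotions if e in EMOTION_TO_POLARITY}
    if not polarities:
        return "neutral"
    if len(polarities) == 1:
        return polarities.pop()
    return None  # conflicting polarities: filtered out

print(to_sentiment({"love", "joy"}))    # positive
print(to_sentiment(set()))              # neutral
print(to_sentiment({"joy", "anger"}))   # None -> dropped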
Threats to internal validity concern confounding factors that could affect the obtained re-
sults. In our study, they are mainly from the configuration of baseline methods. We replicate these
methods by using their released scripts or recommended hyper-parameter settings. However, some
hyper-parameters can be further tuned to improve the classification performance.
In addition, SEntiMoji also has space for further improvement. To fine-tune the pre-trained
DeepMoji, we select only those GitHub samples containing any of the 64 emojis predicted by
DeepMoji. Limiting the choice to only those 64 emojis may discard other emojis that appear more
frequently in GitHub posts and as a result hinder our model from capturing the affective expressions in GitHub posts. Although SEntiMoji has already outperformed existing methods in
most cases, we still encourage future work to further capture the affective information contained
in GitHub posts and improve the performance.
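For concreteness, the fine-tuning corpus is restricted to posts containing at least one of the 64 emojis handled by DeepMoji; a minimal filtering step of that kind is sketched below. The emoji set and the example posts are placeholders, not the actual configuration.

# Placeholder subset of the 64 DeepMoji emojis; the real set is fixed by the pre-trained model.
EMOJI_64 = {"😂", "😭", "❤", "👍", "😡", "🎉"}

def keep_for_finetuning(posts):
    # Keep only posts that contain at least one supported emoji.
    return [p for p in posts if any(e in p for e in EMOJI_64)]

posts = [
    "Thanks for the quick fix 🎉",
    "This API is so confusing 😡",
    "Rebased onto master, please review.",  # dropped: contains no supported emoji
]
print(keep_for_finetuning(posts))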
What is more, we observe that SEntiMoji cannot significantly outperform existing methods
in the surprise and fear tasks where data distribution is very imbalanced. Since data sampling
techniques (e.g., over-sampling) are demonstrated to be beneficial to tackle the imbalanced data
distribution, integrating such techniques into SEntiMoji may further improve the performance.
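As an illustration of the idea (plain random over-sampling, not a feature of SEntiMoji), minority-class samples can simply be replicated until the classes are balanced; the sample texts and labels below are invented.

import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    # Duplicate minority-class samples until every class reaches the majority-class size.
    rng = random.Random(seed)
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    target = max(len(v) for v in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        items = items + [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items)
        out_labels.extend([label] * len(items))
    return out_samples, out_labels

texts = ["wow!", "ok", "fine", "as expected", "unbelievable!", "nothing new"]
labels = ["surprise", "no-surprise", "no-surprise", "no-surprise", "surprise", "no-surprise"]
_, balanced = random_oversample(texts, labels)
print(Counter(balanced))  # both classes now have 4 samples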
Furthermore, we find that 6.3% of misclassified samples in sentiment detection can be attributed
to the usage of lexical cues of politeness in neutral descriptions. Therefore, politeness detection [20]
may be useful for improving SEntiMoji. However, note that since lexical cues of politeness can also
be used in positive or negative texts, it is impractical to directly use such approaches to filter out
samples expressing politeness. Further research efforts are needed to adapt these approaches to
improve existing sentiment detection methods.
Moreover, manual error analysis presents the threat of subjectivity. To minimize this threat, two
authors individually inspect the bad cases and finally reach agreement through discussions. The
inter-rater agreement is relatively high, which ensures the reliability of the coding schema and
procedure. Furthermore, the conflicts are resolved by introducing an experienced colleague.
Threats to external validity concern the generalizability of our experimental results. Our
evaluation has covered several widely used benchmark datasets. However, we still cannot claim
that the performance of SEntiMoji can be generalized across datasets, because the selected
datasets cannot represent all types of texts in SE. In addition, various sentiment and emotion
detection methods have been used in SE. In our study, we select only some representative ones
for comparison.
What is more, to tackle the multi-label classification task on the SO-E dataset, we follow the previous
work [12] to employ the binary relevance method, i.e., transforming it into binary classification
tasks. Besides this common method, another way is to transform it into a multi-class classification
task [96], where each class represents a label combination. For example, if we have a multi-label
classification task that involves two classes (i.e., A and B), we can transform the tasks into a multi-
class classification that involves four classes, i.e., those labeled with A but not B, those labeled
with B but not A, those labeled with both A and B, and those not labeled with A or B. However,
in this article, we evaluate SEntiMoji on multi-label classification only using the binary relevance
method.
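The two transformations can be illustrated with a small sketch; the two labels A and B and the sample label sets below are hypothetical and do not reproduce the actual SO-E evaluation pipeline.

# Each sample carries a set of labels; A and B stand for two emotions.
multi_labels = [{"A"}, {"B"}, {"A", "B"}, set()]

# Binary relevance: one independent binary task per label.
binary_relevance = {
    label: [label in y for y in multi_labels]
    for label in ("A", "B")
}
# -> {"A": [True, False, True, False], "B": [False, True, True, False]}

# Label powerset: one multi-class task over label combinations.
label_powerset = ["+".join(sorted(y)) if y else "none" for y in multi_labels]
# -> ["A", "B", "A+B", "none"]

print(binary_relevance)
print(label_powerset)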
8 RELATED WORK
We then describe the literature related to this study. Our research is particularly inspired by two
streams of literature, i.e., sentiment and emotion detection in SE and emojis in sentiment and
emotion detection.
8.1.1 SE-customized Sentiment Detection. Islam and Zibran [48] identified the challenges of
SentiStrength in detecting sentiment in SE texts and developed a lexicon-based tool SentiStrength-
SE to address the majority of the identified challenges. In addition, many feature-based machine
learning methods were proposed, including SentiCR [1] and Senti4SD [10]. Different from these
methods, SEntiMoji employs advanced word embedding and deep learning techniques. These tech-
niques have also been used in previous relevant work. Biswas et al. [8] used word embeddings and
deep learning for sentiment detection in SE. Different from this work, we employ word embedding
and deep learning in a transfer learning way, which takes advantage of the prior efforts in other
tasks. Similarly, Robbes and Janes [88] leveraged a pre-trained next word prediction model to tackle
the scarcity of data in SE-customized sentiment detection. However, different from this work, our
approach is developed based on a pre-trained emoji prediction model. Considering that express-
ing sentiment is the main function of emojis [45], the emoji prediction model can incorporate
more sentiment knowledge than the next word prediction model, which can further facilitate the
sentiment detection task. Recently, Lin et al. [60] proposed a pattern-based aspect sentiment detec-
tion method POME. The difference between POME and SEntiMoji mainly lies in two aspects. On
one hand, POME aims to classify Stack Overflow sentences referring to APIs according to different
aspects (e.g., performance and usability), and to determine their sentiment polarity. However, SEn-
tiMoji and other aforementioned techniques do not involve aspect identification. On the other hand, POME
is developed based on the patterns inferred by manually analyzing sentences from Stack Overflow
linked to APIs. Therefore, POME is a specific method for analyzing opinions about APIs from Stack
Overflow posts. Compared to POME, SEntiMoji is easier to apply to other tasks in SE.
8.1.2 SE-customized Emotion Detection. To detect the emotional states of developers in the bi-
dimensional emotional model, Islam and Zibran [50] proposed a dictionary-based tool DEVA, which
can detect excitation, stress, depression, and relaxation from SE texts. Since the performance of
DEVA is inherently limited by the quality and size of the dictionaries, Islam et al. [47] then proposed
a machine learning method MarValous to address this limitation. The features used by MarValous
are specifically defined for the four emotions that DEVA focuses on, so MarValous can also identify
only those emotions. In addition, Calefato et al. [12] developed a feature-based machine learning
method EmoTxt. Murgia et al. [74] constructed a machine learning classifier (namely ESEM-E in
this article) to identify emotions in issue comments. The two methods as well as SEntiMoji can be
used to detect more emotions than DEVA and MarValous. Moreover, different from EmoTxt and
ESEM-E, SEntiMoji is developed based on word embedding, deep learning, and transfer learning
techniques.
8.1.3 Benchmark Studies on Sentiment and Emotion Detection in SE. To assess the performance
and reliability of the sentiment and emotion detection methods specifically customized for SE, re-
searchers have performed several benchmark studies, where each method tackles the same datasets
for comparison. For instance, for emotion detection, Islam et al. [47] compared MarValous and
DEVA on JIRA and Stack Overflow comments. For sentiment detection, El Asri et al. [4] carried out
an empirical study of applying three SE-customized sentiment detection tools (i.e., Sentistrength-
SE, SentiCR, and Senti4SD) on code reviews. By comparison, Islam and Zibran [49] applied the
SE-customized methods to three datasets, not a single dataset. They investigated how well
these methods really work on different datasets and which tool to choose in which context. Sim-
ilarly, Novielli et al. [79] applied SE-customized sentiment detection methods on four benchmark
datasets and investigated the impact of labeling approach on their performance. In line with these
studies, we evaluate the performance of SEntiMoji through a benchmark study.
8.1.4 Difficulties in Sentiment Detection in SE. In previous studies, Islam and Zibran [48] and
Novielli et al. [79] have identified difficulties faced by sentiment detection in SE through qualitative
analysis. Specifically, Islam and Zibran [48] analyzed the samples misclassified by SentiStrength,
while Novielli et al. [79] analyzed the samples for which SentiStrength-SE, Senti4SD, and SentiCR
all yielded a wrong classification. By comparison, we identify difficulties (i.e., error causes) by
analyzing the samples misclassified by SEntiMoji. Since the three studies distill difficulties based
on different samples and the identification process may be affected by the subjectivity of coders,
the findings may have differences. To elaborate on the comparison with existing literature, we
analyze the similarities and dissimilarities between our findings and those of the previous two studies.
First, we compare our findings with the difficulties identified by Islam and Zibran [48]. Since
Islam and Zibran [48] derived difficulties based on the samples misclassified by lexicon-based Sen-
tiStrength, most difficulties identified by them are summarized at the word level, i.e., what difficul-
ties sentiment detection has in dealing with specific kinds of words. For instance, domain-specific
meanings of words, context-sensitive variations in meanings of words, and misinterpretation of the
letter ‘X’ are the most frequent difficulties they observed. By comparison, we mainly summarized
the difficulties in dealing with specific kinds of sentences/documents, such as those with misinter-
preted purposes (i.e., C2) and those containing complex context information (i.e., C3). However,
there are also similarities between the findings of the two studies. Islam and Zibran [48] found
that it is difficult to deal with irony and sarcasm, which can be included in C7: Figurative lan-
guage. Besides irony and sarcasm, C7 also includes the usage of other figurative language such as
metaphor. In addition, Islam and Zibran [48] found that it is difficult to detect subtle expression of
sentiments, which is also observed in our results (i.e., C1: Implicit sentiment).
Then we compare our findings with the error causes identified by Novielli et al. [79]. To ease
comprehension, we list the error causes identified by them with the exact naming from the
corresponding publication [79]:
(1) Polar facts but neutral sentiment. This error cause is included in our identified C2: Misinter-
preted purposes. Besides the polar facts, C2 also includes neutral requests, questions, and
tips that are mistaken as positive or negative.
(2) General Error. This error cause is related to the tool's inability to deal with some textual
cues or errors in pre-processing raw text and thus can map to C6: Data pre-processing.
(3) Politeness. This error cause is also observed in our results (i.e., C5: Politeness).
(4) Implicit sentiment polarity. This cause can be directly linked to C1: Implicit sentiment.
(5) Subjectivity in sentiment annotation. This error cause is not observed in our results. It can
be attributed to the subjectivity of coders, which has been discussed in the threats to
validity.
(6) Inability of classifiers to deal with pragmatics or context information. This error cause maps
to C3: Complex context information.
(7) Figurative language. This error cause corresponds to C7: Figurative language.
From the comparison, we can find a substantial overlap between the error causes identified
by Novielli et al. [79] and us, which partly demonstrates the validity of our findings. In addition,
we also report some unique error causes that have not been observed by Novielli et al. [79], such
as neutral requests, questions, and tips included in C2. We also report that although SEntiMoji
can outperform the existing methods, it still fails in capturing obvious sentiment in some cases
(i.e., C4: Failure to capture obvious sentiment), which is not included in the findings of Novielli et al.
[79]. Furthermore, Novielli et al. [79] only analyzed the error causes of sentiment detection in SE,
but we also identified the error causes of emotion detection, where a unique error cause (i.e., C8:
The same emotion polarity) is observed. Overall, we believe that our work advances the knowledge
over the previous work by Novielli et al. [79].
9 CONCLUSION
In this study, we have proposed SEntiMoji, an emoji-powered transfer learning approach for sen-
timent and emotion detection in SE. It is developed based on an existing representation model
called DeepMoji. DeepMoji is pre-trained on Tweets and can represent texts with sentiment- and
emotion-aware vectors. As technical knowledge is highlighted in current SE-customized senti-
ment and emotion detection, we also use GitHub data to incorporate more technical knowledge
into DeepMoji. Then the fine-tuned representation model, as well as manually labeled data, is used
to train the final sentiment/emotion classifier.
We evaluate the effectiveness of SEntiMoji on five benchmark datasets covering 10,096 samples
for sentiment detection and four benchmark datasets covering 10,595 samples for emotion de-
tection. In addition, for both sentiment and emotion detection, we use four representative meth-
ods as baselines to compare their performance against SEntiMoji. Results show that SEntiMoji
can outperform existing sentiment and emotion detection methods with an average increase of
0.036 and 0.049 in macro-F1, respectively. Furthermore, we investigate the impact of Tweets,
GitHub data, and labeled data on the performance of SEntiMoji and demonstrate the importance
of the large-scale Tweets. Finally, we perform a qualitative analysis to distill the error causes of
SEntiMoji, which suggests immediate actions and future research directions. To facilitate repli-
cations or other types of future work, we have released the data, scripts, trained models, and
experimental results used in this study on https://github.com/SEntiMoji/SEntiMoji.
In the future, we plan to further improve SEntiMoji based on the identified error causes and
limitations. Using SEntiMoji and its future releases, we also plan to analyze the quality of dif-
ferent software artifacts (e.g., APIs and libraries) by mining opinions posted by developers while
discussing on Q&A websites such as Stack Overflow. Then we can leverage such crowd-sourced
knowledge to build high-quality tools to recommend software artifacts to developers. In addition,
we plan to verify the relationship between developers' emotional states and their outcomes (e.g., productivity and bugs) by applying SEntiMoji to industrial datasets.
ACKNOWLEDGMENTS
The authors would like to thank Prof. Gang Huang from Peking University for his advice on de-
tecting the impact of sentiment and emotion for development tasks in software engineering.
REFERENCES
[1] Toufique Ahmed, Amiangshu Bosu, Anindya Iqbal, and Shahram Rahimi. 2017. SentiCR: A customized sentiment
analysis tool for code review interactions. In Proceedings of the 32nd IEEE/ACM International Conference on Auto-
mated Software Engineering (ASE’17). 106–111.
[2] Wei Ai, Xuan Lu, Xuanzhe Liu, Ning Wang, Gang Huang, and Qiaozhu Mei. 2017. Untangling emoji popular-
ity through semantic embeddings. In Proceedings of the 11th International Conference on Web and Social Media
(ICWSM’17). 2–11.
[3] Akiko Aizawa. 2003. An information-theoretic perspective of TF-IDF measures. Inf. Process. Manage. 39, 1 (2003),
45–65.
[4] Ikram El Asri, Noureddine Kerzazi, Gias Uddin, Foutse Khomh, and Mohammed Amine Janati Idrissi. 2019. An
empirical study of sentiments in code reviews. Inf. Softw. Technol. 114 (2019), 37–54.
[5] Lisa Feldman Barrett. 1998. Discrete emotions or dimensions? The role of valence focus and arousal focus. Cogn.
Emot. 12, 4 (1998), 579–599.
[6] Yoav Benjamini and Daniel Yekutieli. 2001. The control of the false discovery rate in multiple testing under depen-
dency. Ann. Stat. 29, 4 (2001), 1165–1188.
[7] Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the 42nd Annual Meeting
of the Association for Computational Linguistics, Barcelona (ACL’04). 31.
[8] Eeshita Biswas, K. Vijay-Shanker, and Lori L. Pollock. 2019. Exploring word embedding techniques to improve sen-
timent analysis of software engineering texts. In Proceedings of the 16th International Conference on Mining Software
Repositories (MSR’19). 68–78.
[9] Cássio Castaldi Araujo Blaz and Karin Becker. 2016. Sentiment analysis in tickets for IT support. In Proceedings of
the 13th International Conference on Mining Software Repositories (MSR’16). 235–246.
[10] Fabio Calefato, Filippo Lanubile, Federico Maiorano, and Nicole Novielli. 2018. Sentiment polarity detection for
software development. Emp. Softw. Eng. 23, 3 (2018), 1352–1382.
[11] Fabio Calefato, Filippo Lanubile, and Nicole Novielli. 2017. EmoTxt: A toolkit for emotion recognition from text.
In Proceedings of the 7th International Conference on Affective Computing and Intelligent Interaction Workshops and
Demos (ACII Workshops’17). 79–80.
[12] Fabio Calefato, Filippo Lanubile, Nicole Novielli, and Luigi Quaranta. 2019. EMTk: the emotion mining toolkit. In
Proceedings of the 4th International Workshop on Emotion Awareness in Software Engineering (SEmotion@ICSE’19).
34–37.
[13] José Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code
search. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the
Foundations of Software Engineering (ESEC/SIGSOFT FSE’19). 964–974.
[14] Zhenpeng Chen, Yanbin Cao, Xuan Lu, Qiaozhu Mei, and Xuanzhe Liu. 2019. SEntiMoji: An emoji-powered learning
approach for sentiment analysis in software engineering. In Proceedings of the ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/SIGSOFT FSE’19).
841–852.
[15] Zhenpeng Chen, Xuan Lu, Wei Ai, Huoran Li, Qiaozhu Mei, and Xuanzhe Liu. 2018. Through a gender lens: Learning
usage patterns of emojis from large-scale Android users. In Proceedings of the 2018 World Wide Web Conference
(WWW’18). 763–772.
[16] Zhenpeng Chen, Sheng Shen, Ziniu Hu, Xuan Lu, Qiaozhu Mei, and Xuanzhe Liu. 2019. Emoji-powered represen-
tation learning for cross-lingual sentiment classification. In Proceedings of the 2019 World Wide Web Conference on
World Wide Web (WWW’19). 251–262.
[17] Shaiful Alam Chowdhury and Abram Hindle. 2016. Characterizing energy-aware software projects: are they differ-
ent? In Proceedings of the 13th International Conference on Mining Software Repositories (MSR’16). 508–511.
[18] Maëlick Claes, Mika Mäntylä, and Umar Farooq. 2018. On the use of emoticons in open source software development.
In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement
(ESEM’18). 50:1–50:4.
[19] Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 1 (1960), 37–46.
[20] Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A com-
putational approach to politeness with application to social factors. In Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (ACL’13). 250–259.
[21] Jason W. Davey, P. Cristian Gugiu, and Chris L. S. Coryn. 2010. Quantitative methods for estimating the reliability
of qualitative data. J. MultiDiscipl. Eval. 6, 13 (2010), 140–162.
[22] Dmitry Davidov, Oren Tsur, and Ari Rappoport. 2010. Enhanced sentiment learning using Twitter hashtags and
smileys. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10). 241–249.
[23] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). 4171–4186.
[24] Thomas G. Dietterich. 1998. Approximate statistical tests for comparing supervised classification learning algo-
rithms. Neural Comput. 10, 7 (1998), 1895–1923.
[25] Jin Ding, Hailong Sun, Xu Wang, and Xudong Liu. 2018. Entity-level sentiment analysis of issue comments. In
Proceedings of the 3rd International Workshop on Emotion Awareness in Software Engineering (SEmotion@ICSE’18).
7–13.
[26] Hai Ha Do, P. W. C. Prasad, Angelika Maag, and Abeer Alsadoon. 2019. Deep learning for aspect-based sentiment
analysis: A comparative review. Expert Syst. Appl. 118 (2019), 272–299.
[27] Zi-Yi Dou. 2017. Capturing user and product information for document level sentiment analysis with deep memory
network. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP’17).
521–526.
[28] Vasiliki Efstathiou, Christos Chatzilenas, and Diomidis Spinellis. 2018. Word embeddings for the software engineer-
ing domain. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR’18). 38–41.
[29] Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word
vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and
the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language
Processing (ACL’15). 1491–1500.
[30] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji oc-
currences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the
2017 Conference on Empirical Methods in Natural Language Processing (EMNLP’17). 1615–1625.
[31] Jerome H. Friedman. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–
378.
[32] Daviti Gachechiladze, Filippo Lanubile, Nicole Novielli, and Alexander Serebrenik. 2017. Anger and its direction
in collaborative software development. In Proceedings of the 39th IEEE/ACM International Conference on Software
Engineering: New Ideas and Emerging Technologies Results Track (ICSE-NIER’17). 11–14.
[33] David García, Marcelo Serrano Zanetti, and Frank Schweitzer. 2013. The role of emotions in contributors activity:
A case study on the GENTOO community. In Proceedings of the 2013 International Conference on Cloud and Green
Computing (CGC’13). 410–417.
[34] Anastasia Giachanou and Fabio Crestani. 2016. Like it or not: A survey of Twitter sentiment analysis methods.
Comput. Surv. 49, 2 (2016), 28:1–28:41.
[35] Daniela Girardi, Filippo Lanubile, Nicole Novielli, Luigi Quaranta, and Alexander Serebrenik. 2019. Towards recog-
nizing the emotions of developers using biometrics: the design of a field study. In Proceedings of the 4th International
Workshop on Emotion Awareness in Software Engineering. 13–16.
[36] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
[37] Daniel Graziotin, Fabian Fagerholm, Xiaofeng Wang, and Pekka Abrahamsson. 2018. What happens when software
developers are (un)happy. J. Syst. Softw. 140 (2018), 32–47.
[38] Jin Guo, Jinghui Cheng, and Jane Cleland-Huang. 2017. Semantically enhanced software traceability using deep
learning techniques. In Proceedings of the 39th International Conference on Software Engineering (ICSE’17). 3–14.
[39] Emitza Guzman, David Azócar, and Yang Li. 2014. Sentiment analysis of commit comments in GitHub: an empirical
study. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR’14). 352–355.
[40] Emitza Guzman and Bernd Bruegge. 2013. Towards emotional awareness in software development teams. In Pro-
ceedings of the Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on
the Foundations of Software Engineering (ESEC/FSE’13). 671–674.
[41] Emitza Guzman and Walid Maalej. 2014. How do users like this feature? A fine grained sentiment analysis of app
reviews. In Proceedings of the IEEE 22nd International Requirements Engineering Conference (RE’14). 153–162.
[42] Maryam Hasan, Elke Rundensteiner, and Emmanuel Agu. 2014. EMOTEX: Detecting emotions in twitter messages.
In Proceedings of the ASE BIGDATA/SOCIALCOM/CYBERSECURITY Conference (ASE’14). 27–31.
[43] M. Hermans and B. Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Proceedings of the
27th Annual Conference on Neural Information Processing (NIPS’13). 190–198.
[44] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–
1780.
[45] Tianran Hu, Han Guo, Hao Sun, Thuy-vy Thi Nguyen, and Jiebo Luo. 2017. Spice up your chat: the intentions
and sentiment effects of using emojis. In Proceedings of the 11th International Conference on Web and Social Media
(ICWSM’17). 102–111.
[46] Nasif Imtiaz, Justin Middleton, Joymallya Chakraborty, Neill Robson, Gina Bai, and Emerson R. Murphy-Hill. 2019.
Investigating the effects of gender bias on GitHub. In Proceedings of the 41st International Conference on Software
Engineering (ICSE’19). 700–711.
[47] Md Rakibul Islam, Md Kauser Ahmmed, and Minhaz F. Zibran. 2019. MarValous: Machine learning based detec-
tion of emotions in the valence-arousal space in software engineering text. In Proceedings of the 34th ACM/SIGAPP
Symposium on Applied Computing (SAC’19). 1786–1793.
[48] Md Rakibul Islam and Minhaz F. Zibran. 2017. Leveraging automated sentiment analysis in software engineering.
In Proceedings of the 14th International Conference on Mining Software Repositories (MSR’17). 203–214.
[49] Md Rakibul Islam and Minhaz F. Zibran. 2018. A comparison of software engineering domain specific sentiment
analysis tools. In Proceedings of the 25th International Conference on Software Analysis, Evolution and Reengineering
(SANER’18). 487–491.
[50] Md Rakibul Islam and Minhaz F. Zibran. 2018. DEVA: sensing emotions in the valence arousal space in software
engineering text. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (SAC’18). 1536–1543.
[51] Md Rakibul Islam and Minhaz F. Zibran. 2018. SentiStrength-SE: Exploiting domain specificity for improved senti-
ment analysis in software engineering text. J. Syst. Softw. 145 (2018), 125–146.
[52] Pooyan Jamshidi, Norbert Siegmund, Miguel Velez, Christian Kästner, Akshay Patel, and Yuvraj Agarwal. 2017.
Transfer learning for performance modeling of configurable systems: An exploratory analysis. In Proceedings of the
32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). 497–508.
[53] Robbert Jongeling, Subhajit Datta, and Alexander Serebrenik. 2015. Choosing your weapons: On sentiment anal-
ysis tools for software engineering research. In Proceedings of the 2015 IEEE International Conference on Software
Maintenance and Evolution (ICSME’15). 531–535.
[54] Robbert Jongeling, Proshanta Sarkar, Subhajit Datta, and Alexander Serebrenik. 2017. On negative results when
using sentiment analysis tools for software engineering research. Emp. Softw. Eng. 22, 5 (2017), 2543–2584.
[55] Francisco Jurado and Pilar Rodríguez Marín. 2015. Sentiment analysis in monitoring software development pro-
cesses: An exploratory case study on GitHub’s project issues. J. Syst. Softw. 104 (2015), 82–89.
[56] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd
International Conference on Learning Representations (ICLR’15).
[57] Ekrem Kocaguneli, Tim Menzies, and Emilia Mendes. 2015. Transfer learning in effort estimation. Emp. Softw. Eng.
20, 3 (2015), 813–843.
[58] Miikka Kuutila, Mika V. Mäntylä, Maëlick Claes, Marko Elovainio, and Bram Adams. 2018. Using experience sam-
pling to link software repositories with emotions and work well-being. In Proceedings of the 12th ACM/IEEE Inter-
national Symposium on Empirical Software Engineering and Measurement (ESEM’18). 29:1–29:10.
[59] Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word
embeddings. Trans. Assoc. Comput. Ling. 3 (2015), 211–225.
[60] Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, and Michele Lanza. 2019. Pattern-based mining
of opinions in Q&A websites. In Proceedings of the 41st International Conference on Software Engineering (ICSE’19).
548–559.
[61] Bin Lin, Fiorella Zampetti, Gabriele Bavota, Massimiliano Di Penta, Michele Lanza, and Rocco Oliveto. 2018. Senti-
ment analysis for software engineering: How far can we go? In Proceedings of the 40th International Conference on
Software Engineering (ICSE’18). 94–104.
[62] Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
[63] Kun-Lin Liu, Wu-Jun Li, and Minyi Guo. 2012. Emoticon smoothed language models for Twitter sentiment analysis.
In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI’12).
[64] Nikola Ljubešić and Darja Fišer. 2016. A global analysis of emoji usage. In Proceedings of the 10th Web as Corpus
Workshop (WAC@ACL’16). 82–89.
[65] Xuan Lu, Wei Ai, Xuanzhe Liu, Qian Li, Ning Wang, Gang Huang, and Qiaozhu Mei. 2016. Learning from the ubiq-
uitous language: An empirical analysis of emoji usage of smartphone users. In Proceedings of the 2016 ACM Interna-
tional Joint Conference on Pervasive and Ubiquitous Computing (UbiComp’16). 770–780.
[66] Xuan Lu, Yanbin Cao, Zhenpeng Chen, and Xuanzhe Liu. 2018. A first look at emoji usage on GitHub: An empirical
study. arXiv:1812.04863. Retrieved from http://arxiv.org/abs/1812.04863.
[67] Howard Lune and Bruce L. Berg. 2016. Qualitative Research Methods for the Social Sciences. Pearson Higher Ed.
[68] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky.
2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics (ACL’14). 55–60.
[69] Mika Mäntylä, Bram Adams, Giuseppe Destefanis, Daniel Graziotin, and Marco Ortu. 2016. Mining valence, arousal,
and dominance: Possibilities for detecting burnout and productivity? In Proceedings of the 13th International Confer-
ence on Mining Software Repositories (MSR’16). 247–258.
[70] Mika V. Mäntylä, Nicole Novielli, Filippo Lanubile, Maëlick Claes, and Miikka Kuutila. 2017. Bootstrapping a lexicon
for emotional arousal in software engineering. In Proceedings of the 14th International Conference on Mining Software
Repositories (MSR’17). 198–202.
[71] Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, and Edward Aftandilian. 2019. DeepDelta: Learning to
repair compilation errors. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/SIGSOFT FSE’19). 925–936.
[72] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in
vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR’13).
[73] Myriam Munezero, Calkin Suero Montero, Erkki Sutinen, and John Pajunen. 2014. Are they different? Affect, feeling,
emotion, sentiment, and opinion detection in text. IEEE Trans. Affect. Comput. 5, 2 (2014), 101–111.
[74] Alessandro Murgia, Marco Ortu, Parastou Tourani, Bram Adams, and Serge Demeyer. 2018. An exploratory quali-
tative and quantitative analysis of emotions in issue report comments of open source systems. Emp. Softw. Eng. 23,
1 (2018), 521–564.
[75] Alessandro Murgia, Parastou Tourani, Bram Adams, and Marco Ortu. 2014. Do developers feel emotions? An ex-
ploratory analysis of emotions in software artifacts. In Proceedings of the 11th Working Conference on Mining Software
Repositories (MSR’14). 262–271.
[76] Jaechang Nam, Sinno Jialin Pan, and Sunghun Kim. 2013. Transfer defect learning. In Proceedings of the 35th Inter-
national Conference on Software Engineering (ICSE’13). 382–391.
[77] Nicole Novielli, Fabio Calefato, and Filippo Lanubile. 2015. The challenges of sentiment detection in the social pro-
grammer ecosystem. In Proceedings of the 7th International Workshop on Social Software Engineering (SSE’15). 33–
40.
[78] Nicole Novielli, Fabio Calefato, and Filippo Lanubile. 2018. A gold standard for emotion annotation in Stack Overflow.
In Proceedings of the 15th International Conference on Mining Software Repositories (MSR’18). 14–17.
[79] Nicole Novielli, Daniela Girardi, and Filippo Lanubile. 2018. A benchmark study on sentiment analysis for software
engineering research. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR’18).
364–375.
[80] Nicole Novielli and Alexander Serebrenik. 2019. Sentiment and emotion in software engineering. IEEE Softw. 36, 5
(2019), 6–9.
[81] Marco Ortu, Bram Adams, Giuseppe Destefanis, Parastou Tourani, Michele Marchesi, and Roberto Tonelli. 2015. Are
bullies more productive? Empirical study of affectiveness vs. issue fixing time. In Proceedings of the 12th IEEE/ACM
Working Conference on Mining Software Repositories (MSR’15). 303–313.
[82] Marco Ortu, Giuseppe Destefanis, Steve Counsell, Stephen Swift, Roberto Tonelli, and Michele Marchesi. 2016. Ar-
sonists or firefighters? Affectiveness in agile software development. In Proceedings of the 2016 International Confer-
ence on Agile Software Development (XP’16). Springer, 144–155.
[83] Marco Ortu, Alessandro Murgia, Giuseppe Destefanis, Parastou Tourani, Roberto Tonelli, Michele Marchesi, and
Bram Adams. 2016. The emotional side of software developers in JIRA. In Proceedings of the 13th International Con-
ference on Mining Software Repositories (MSR’16). 480–483.
[84] Sebastiano Panichella, Andrea Di Sorbo, Emitza Guzman, Corrado Aaron Visaggio, Gerardo Canfora, and Harald
C. Gall. 2015. How can I improve my app? Classifying user reviews for software maintenance and evolution. In
Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME’15). 281–290.
[85] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representa-
tion. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–
1543.
[86] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18).
2227–2237.
[87] Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification.
Mach. Learn. 85, 3 (2011), 333–359.
[88] Romain Robbes and Andrea Janes. 2019. Leveraging small software engineering data sets with pre-trained neural
networks. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results
(ICSE (NIER)’19). 29–32.
[89] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning representations by back-propagating
errors. Nature 323, 6088 (1988), 533–536.
[90] James A. Russell and Albert Mehrabian. 1977. Evidence for a three-factor theory of emotions. J. Res. Pers. 11, 3 (1977),
273–294.
[91] Fabrizio Sebastiani. 2002. Machine learning in automated text categorization. Comput. Surv. 34, 1 (2002), 1–47.
[92] Phillip Shaver, Judith Schwartz, Donald Kirson, and Cary O’Connor. 1987. Emotion knowledge: Further exploration
of a prototype approach. J. Pers. Soc. Psychol. 52, 6 (1987), 1061.
[93] Vinayak Sinha, Alina Lazar, and Bonita Sharif. 2016. Analyzing developer sentiment in commit logs. In Proceedings
of the 13th International Conference on Mining Software Repositories (MSR’16). 520–523.
[94] Tom De Smedt and Walter Daelemans. 2012. Pattern for Python. J. Mach. Learn. Res. 13, 1 (2012), 2063–2067.
[95] Rodrigo Souza and Bruno Silva. 2017. Sentiment analysis of Travis CI builds. In Proceedings of the 14th International
Conference on Mining Software Repositories (MSR’17). 459–462.
[96] Newton Spolaôr, Everton Alvares Cherman, Maria Carolina Monard, and Huei Diana Lee. 2012. A comparison of
multi-label feature selection methods using the problem transformation approach. In Proceedings of the XXXVIII
Latin American Computer Conference—Selected Papers (CLEI’12 Selected Papers). 135–151.
[97] Carlo Strapparava and Alessandro Valitutti. 2004. WordNet-Affect: An affective extension of WordNet. In Proceedings
of the 4th International Conference on Language Resources and Evaluation (LREC’04).
[98] Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning
in NLP. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL’19). 3645–3650.
[99] Johan A. K. Suykens and Joos Vandewalle. 1999. Least squares support vector machine classifiers. Neural Process.
Lett. 9, 3 (1999), 293–300.
[100] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection
in short informal text. J. Assoc. Inf. Sci. Technol. 61, 12 (2010), 2544–2558.
[101] Parastou Tourani and Bram Adams. 2016. The impact of human discussions on just-in-time quality assurance: An
empirical study on OpenStack and Eclipse. In Proceedings of the IEEE 23rd International Conference on Software
Analysis, Evolution, and Reengineering (SANER’16). 189–200.
[102] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing
Systems 2017 (NIPS’17). 6000–6010.
[103] Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, and C.-C. Jay Kuo. 2019. Evaluating word embedding
models: Methods and experimental results. APSIPA Trans. Sign. Inf. Process. 8 (2019).
[104] Yi Wang. 2019. Emotions extracted from text vs. true emotions: An empirical evaluation in SE context. In Proceedings
of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19). 230–242.
[105] Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment
classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP’16).
606–615.
[106] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for
13,915 English lemmas. Behav. Res. Methods 45, 4 (2013), 1191–1207.
[107] Huihui Wei and Ming Li. 2017. Supervised deep features for software functional clone detection by exploiting lexical
and syntactical information in source code. In Proceedings of the 26th International Joint Conference on Artificial
Intelligence (IJCAI’17). 3034–3040.
[108] Honghao Wei, Fuzheng Zhang, Nicholas Jing Yuan, Chuan Cao, Hao Fu, Xing Xie, Yong Rui, and Wei-Ying Ma.
2017. Beyond the words: Predicting user personality from heterogeneous information. In Proceedings of the 10th
ACM International Conference on Web Search and Data Mining (WSDM’17). 305–314.
[109] Theresa Wilson. 2008. Annotating subjective content in meetings. In Proceedings of the International Conference on
Language Resources and Evaluation (LREC’08).
[110] Michal R. Wróbel. 2013. Emotions in the software development process. In Proceedings of the 6th International Con-
ference on Human System Interactions (HSI’13). 518–523.
[111] Michal R. Wróbel. 2016. Towards the participant observation of emotions in software development teams. In Pro-
ceedings of the 2016 Federated Conference on Computer Science and Information Systems (FedCSIS’16). 1545–1548.
[112] Jiacheng Xu, Danlu Chen, Xipeng Qiu, and Xuanjing Huang. 2016. Cached long short-term memory neural networks
for document-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing (EMNLP’16). 1660–1669.
[113] Min Yang, Wenting Tu, Jingxuan Wang, Fei Xu, and Xiaojun Chen. 2017. Attention based LSTM for target dependent
sentiment classification. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI’17). 5013–5014.
[114] Xin Ye, Hui Shen, Xiao Ma, Razvan C. Bunescu, and Chang Liu. 2016. From word embeddings to document similar-
ities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference
on Software Engineering (ICSE’16). 404–415.
[115] Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A. Smith. 2015. Learning word representations with hierar-
chical sparse coding. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 87–96.
[116] Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev.
Data Min. Knowl. Discov. 8, 4 (2018).
[117] Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Attention-based LSTM network for cross-lingual sentiment
classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP’16).
247–256.