MultiEmo: Language-Agnostic Sentiment Analysis
1 Introduction
Two of the most important and widely applicable topics in natural language processing
(NLP), particularly in opinion mining, are sentiment analysis [1–3] and emotion
recognition [4], [5]. Recently, more and more online comments have been expressed
in different natural languages. Consequently, there is growing interest in
language-independent methods for sentiment analysis. For that purpose,
appropriate language-agnostic models (embeddings) may be utilized.
In this paper, we developed and validated three language-agnostic methods for
sentiment analysis: one based on the LASER model [6] and two on LaBSE [7], see
Sec. 3. The latter was used in its basic version (LaBSEb) and with an additional
attention layer (LaBSEa). All of them were implemented within a bidirectional LSTM
(biLSTM) architecture. The experiments were performed on our new benchmark
MultiEmo dataset, which is an extension of MultiEmo-Test 1.0 [8]. In the latter,
only the test texts were translated into other languages, whereas the MultiEmo data
proposed here is fully multilingual. As the experiments revealed that LaBSE with
the additional attention layer (LaBSEa) performs best (Sec. 5), it was exploited
in the MultiEmo web service for language-agnostic sentiment analysis:
https://ws.clarin-pl.eu/multiemo. All results presented in this paper are
downloadable: the MultiEmo dataset at
https://clarin-pl.eu/dspace/handle/11321/798 and source codes at
https://github.com/CLARIN-PL/multiemo.
⋆ This work was partially supported by the National Science Centre, Poland, project
no. 2020/37/B/ST6/03806; by the statutory funds of the Department of Artificial
Intelligence, Wroclaw University of Science and Technology; and by the European
Regional Development Fund as a part of the 2014–2020 Smart Growth Operational
Programme, CLARIN – Common Language Resources and Technology Infrastructure,
project no. POIR.04.02.00-00C002/19.
2 Related work
the division of the sum of the token embeddings by the sum of the attention mask.
This yields an averaged embedding of the tokens from the last output layer, enriched
with an indication of which tokens carry the most salient information.
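The pooling step described above can be sketched as follows. This is a minimal NumPy sketch; the function name and array shapes are illustrative assumptions, not taken from the paper's source code:

```python
import numpy as np

def masked_mean_pooling(token_embeddings, attention_mask):
    """Average last-layer token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim) array of last-layer outputs.
    attention_mask:   (seq_len,) array with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)         # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)       # sum over real tokens only
    count = max(float(mask.sum()), 1e-9)                 # avoid division by zero
    return summed / count

# Two real tokens ([1, 3] and [3, 5]) followed by one padding token.
emb = np.array([[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])
pooled = masked_mean_pooling(emb, mask)  # → [2.0, 4.0]
```

The padding row is excluded both from the numerator and the denominator, so the result is the mean of the real tokens only.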
4 Experimental setup
4.1 Pipeline
Model training and evaluation were done in the following stages: (1) perform
training on 80% of the data and validation on 10%; (2) train the model until the
loss function value stops decreasing for 25 epochs, keeping the model with the
lowest achieved loss; (3) evaluate the trained model on the test part of the
data – the remaining 10%.
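Stage (2) amounts to early stopping with a patience of 25 epochs. A minimal sketch of that rule (the function name and the loss-sequence input are illustrative, not from the paper's code):

```python
def train_with_early_stopping(losses_per_epoch, patience=25):
    """Stop when the loss has not improved for `patience` consecutive
    epochs; return the lowest loss achieved so far (stage 2 of the pipeline)."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for loss in losses_per_epoch:
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # loss stopped decreasing for `patience` epochs
    return best_loss

# With patience=2, training stops after two non-improving epochs
# and the best loss (0.4) is kept, never reaching the later 0.3.
best = train_with_early_stopping([0.5, 0.4, 0.45, 0.46, 0.3], patience=2)  # → 0.4
```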
All experiments were repeated 30 times so that strong statistical tests could be
performed. This reduced the uncertainty caused by the randomness of the neural
network learning process. In our statistical tests, differences between results
with a p-value of at least 5% were treated as insignificant.
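The paper does not name the exact statistical test, so the following is only an illustrative sketch: a Welch-style two-sample t statistic over the 30 repeated runs, compared against an approximate critical value of 2.0 (roughly p < 0.05 for this many runs).

```python
import statistics

def significantly_different(runs_a, runs_b, t_critical=2.0):
    """Rough two-sample (Welch) t-test over repeated runs, illustrative only.
    With 30 runs per model, |t| > ~2.0 corresponds roughly to p < 0.05."""
    mean_a, mean_b = statistics.mean(runs_a), statistics.mean(runs_b)
    var_a, var_b = statistics.variance(runs_a), statistics.variance(runs_b)
    # Standard error of the difference of means.
    se = (var_a / len(runs_a) + var_b / len(runs_b)) ** 0.5
    t = (mean_a - mean_b) / se
    return abs(t) > t_critical

# 30 F1 scores per model with a clear 0.1 gap → significant.
runs_a = [0.80 + 0.001 * i for i in range(30)]
runs_b = [0.70 + 0.001 * i for i in range(30)]
```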
Additionally, the whole corpus was machine translated into different languages
using DeepL (https://www.deepl.com/translator), which resulted in the new
MultiEmo dataset. It provides an opportunity to train and test the model in any
of 11 languages: Polish (original), English, Chinese, Italian, Japanese, Russian,
German, Spanish, French, Dutch and Portuguese. The comprehensive profile of the
MultiEmo dataset is presented in Tab. 1. Only the mixed-domain corpus was
exploited in the experiments described in Sec. 4.3 and 5, see the last row in Tab. 1.
4.3 Scenarios
To validate the quality of the models, we used three research scenarios, differing
in the language of the texts used to train and test the models:
– Any->Same – the model is both trained and tested on texts in one chosen
language (e.g. Polish-Polish, English-English).
– PL->Any – the model is trained only on Polish texts and tested on documents
translated into any other language (e.g. Polish-English, Polish-Chinese).
– Any->PL – the model is trained on texts in any language and tested only on
Polish texts (e.g. English-Polish, Chinese-Polish, Dutch-Polish).
All scenarios use the same train-validation-test split (Tab. 1), which ensures
that the model is not trained and tested on the same translated texts.
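The three scenarios can be enumerated as (train language, test language) pairs. A small sketch, assuming ISO-style language codes (the paper uses full names) and assuming the Polish-Polish pair belongs only to Any->Same:

```python
# The 11 MultiEmo languages; "pl" (Polish) is the original.
LANGS = ["pl", "en", "zh", "it", "ja", "ru", "de", "es", "fr", "nl", "pt"]

# Any->Same: train and test in the same language.
any_same = [(lang, lang) for lang in LANGS]

# PL->Any: train on Polish, test on every other language.
pl_any = [("pl", lang) for lang in LANGS if lang != "pl"]

# Any->PL: train on any other language, test on Polish.
any_pl = [(lang, "pl") for lang in LANGS if lang != "pl"]
```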
5 Experimental results
The results for same-language training and testing on the MultiEmo dataset
(all domains mixed), the first scenario described in Sec. 4.3, show that
LaBSEa is better in almost all cases. There are five cases in which LaBSEb was
insignificantly better than LaBSEa: English (positive and negative
labels), French (positive and neutral), and Italian (neutral).
In the second scenario, training was carried out on Polish data and testing
on the other languages. LaBSEa is almost always statistically better than the other
models. There are only eight cases out of all 88 in which LaBSEb was insignifi-
cantly better than LaBSEa, and one case (Portuguese: F1samples)
in which LaBSEb is insignificantly worse than LaBSEa. The results aggregated over
all languages, separately for each of the three considered models, are shown in
Fig. 1a for the LASER language model, in Fig. 1b for basic LaBSE (LaBSEb),
and in Fig. 1c for LaBSE with the custom mean pooling (LaBSEa).
Fig. 1: Distribution of F1 scores for models learned on Polish texts and evaluated
on all languages from the MultiEmo dataset (PL->Any scenario) aggregated over
all test languages. (A) – for the LASER embeddings; (B) – for the basic LaBSEb
embeddings; (C) – for the LaBSE with attention, i.e. LaBSEa embeddings
In the third scenario, the classifier was trained on different languages but tested
on Polish texts only. As in the previous scenarios, LaBSEa outperforms the
LaBSEb and LASER language models. In all scenarios, the results for the
ambivalent class are worse by about 40–50% than for the negative or positive class,
meaning that some documents are more controversial than others; for such
documents, personalized reasoning should be considered [4, 5, 23, 24]. The neutral
class is also poorly classified, especially for LASER and non-Latin-script languages
(Chinese, Japanese, Russian). LaBSEa in the second scenario overcomes this problem,
revealing the superiority of language-agnostic solutions over language-specific ones.
Languages using the Latin alphabet perform almost the same.
References
1. F. Hemmatian and M. K. Sohrabi, “A survey on classification techniques for opinion
mining and sentiment analysis,” Artificial Intelligence Review.
2. Ł. Augustyniak, P. Szymański, T. Kajdanowicz, and P. Kazienko, “Fast and accurate
- improving lexicon-based sentiment classification with an ensemble methods.”
3. R. Bartusiak, L. Augustyniak, T. Kajdanowicz, and P. Kazienko, "Sentiment analysis
for Polish using transfer learning approach," in ENIC 2015.
4. P. Miłkowski, M. Gruza, K. Kanclerz, P. Kazienko, D. Grimling, and J. Kocon,
“Personal bias in prediction of emotions elicited by textual opinions,” in ACL-IJCNLP
2021: Student Research Workshop. ACL, 2021, pp. 248–259.
5. J. Kocoń, M. Gruza, J. Bielaniewicz, D. Grimling, K. Kanclerz, P. Miłkowski, and
P. Kazienko, “Learning personal human biases and representations for subjective
tasks in natural language processing,” in ICDM 2021. IEEE, 2021, pp. 1168–1173.
6. M. Artetxe and H. Schwenk, “Massively multilingual sentence embeddings for zero-
shot cross-lingual transfer and beyond,” Transactions of the ACL.
7. F. Feng, Y. Yang, D. Cer, N. Arivazhagan, and W. Wang, "Language-agnostic BERT
sentence embedding," arXiv preprint arXiv:2007.01852, 2020.
8. K. Kanclerz, P. Miłkowski, and J. Kocoń, “Cross-lingual deep neural transfer learning
in sentiment analysis,” Procedia Computer Science, vol. 176, pp. 128–137, 2020.
9. T. Chen, R. Xu, Y. He, and X. Wang, "Improving sentiment analysis via sentence
type classification using BiLSTM-CRF and CNN," Expert Systems with Applications.
10. J. Kocoń, P. Miłkowski, and M. Zaśko-Zielińska, "Multi-level sentiment analysis of
PolEmo 2.0: Extended corpus of multi-domain consumer reviews," in CoNLL'19.
11. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep
bidirectional transformers for language understanding."
12. Y. Liu et al., "RoBERTa: A robustly optimized BERT pretraining approach."
13. P. Rybak, R. Mroczkowski, J. Tracz, and I. Gawlik, "KLEJ: Comprehensive benchmark
for Polish language understanding," arXiv preprint arXiv:2005.00630, 2020.
14. P. H. Calais Guerra, A. Veloso, W. Meira Jr, and V. Almeida, “From bias to opinion: a
transfer-learning approach to real-time sentiment analysis,” in ACM SIGKDD’2011.
15. A. Pelicon, M. Pranjić, D. Miljković, B. Škrlj, and S. Pollak, "Zero-shot learning for
cross-lingual news sentiment classification," Applied Sciences, vol. 10, no. 17, p. 5993, 2020.
16. X. Zhou, X. Wan, and J. Xiao, "Attention-based LSTM network for cross-lingual sen-
timent classification," in EMNLP'16, 2016, pp. 247–256.
17. T. Pires, E. Schlinger, and D. Garrette, "How multilingual is multilingual BERT?" in
Proceedings of the 57th Annual Meeting of the ACL, 2019, pp. 4996–5001.
18. L. Shen, J. Xu, and R. Weischedel, “A new string-to-dependency machine translation
algorithm with a target dependency language model,” in ACL-08: HLT.
19. M. Guo et al., “Effective parallel corpus mining using bilingual sentence embeddings.”
20. Y. Yang et al., “Improving multilingual sentence embedding using bi-directional dual
encoder with additive margin softmax,” arXiv preprint arXiv:1902.08564, 2019.
21. K. Gawron, M. Pogoda, N. Ropiak, M. Swędrowski, and J. Kocoń, “Deep neural
language-agnostic multi-task text classifier,” in ICDM’21. IEEE, 2021, pp. 136–142.
22. G. Hripcsak and A. S. Rothschild, “Agreement, the f-measure, and reliability in
information retrieval,” JAMIA, vol. 12, no. 3, pp. 296–298, 2005.
23. J. Kocoń, A. Figas, M. Gruza, D. Puchalska, T. Kajdanowicz, and P. Kazienko,
“Offensive, aggressive, and hate speech analysis: From data-centric to human-centered
approach,” Information Processing & Management, vol. 58, no. 5, p. 102643, 2021.
24. K. Kanclerz, A. Figas, M. Gruza, T. Kajdanowicz, J. Kocoń, D. Puchalska, and
P. Kazienko, “Controversy and conformity: from generalized to personalized aggres-
siveness detection,” in ACL-IJCNLP 2021. ACL, 2021, pp. 5915–5926.