Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation

Nattapong Tiyajamorn; Tomoyuki Kajiwara; Yuki Arase; Makoto Onizuka

doi:10.18653/v1/2021.emnlp-main.612


Abstract

We propose a method to distill a language-agnostic meaning embedding from a multilingual sentence encoder. By removing language-specific information from the original embedding, we retrieve an embedding that fully represents the sentence’s meaning. The proposed method relies only on parallel corpora without any human annotations. Our meaning embedding allows efficient cross-lingual sentence similarity estimation by simple cosine similarity calculation. Experimental results on both quality estimation of machine translation and cross-lingual semantic textual similarity tasks reveal that our method consistently outperforms the strong baselines using the original multilingual embedding. Our method consistently improves the performance of any pre-trained multilingual sentence encoder, even in low-resource language pairs where only tens of thousands of parallel sentence pairs are available.
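The similarity step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the meaning embeddings have already been obtained and are available as plain lists of floats. The function name `cosine_similarity` and the example vectors `src_meaning` and `tgt_meaning` are hypothetical placeholders, not taken from the authors' code, and the sketch does not implement the distillation of language-specific information itself.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical meaning embeddings for a source sentence and its translation.
# Real embeddings would have hundreds of dimensions; three are used for brevity.
src_meaning = [0.2, 0.7, -0.1]
tgt_meaning = [0.25, 0.65, -0.05]

score = cosine_similarity(src_meaning, tgt_meaning)
print(f"cross-lingual similarity: {score:.3f}")
```

A score near 1 indicates that the two vectors point in nearly the same direction. Under the paper's framing, that would correspond to sentences with similar meaning regardless of language.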

Anthology ID: 2021.emnlp-main.612
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7764–7774
URL: https://aclanthology.org/2021.emnlp-main.612
DOI: 10.18653/v1/2021.emnlp-main.612
Cite (ACL): Nattapong Tiyajamorn, Tomoyuki Kajiwara, Yuki Arase, and Makoto Onizuka. 2021. Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7764–7774, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Language-agnostic Representation from Multilingual Sentence Encoders for Cross-lingual Similarity Estimation (Tiyajamorn et al., EMNLP 2021)
PDF: https://aclanthology.org/2021.emnlp-main.612.pdf
Video: https://aclanthology.org/2021.emnlp-main.612.mp4
Code: nattaptiy/qe_disentangled
