
Language-agnostic BERT Sentence Embedding

Fangxiaoyu Feng∗, Yinfei Yang∗†, Daniel Cer, Naveen Arivazhagan, Wei Wang†
Google AI
Mountain View
{fangxiaoyu, cer, navari}@google.com
{yangyin7, wei.wang.world}@gmail.com

arXiv:2007.01852v2 [cs.CL] 8 Mar 2022

Abstract

While BERT is an effective method for learning monolingual sentence embeddings for semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019), BERT based cross-lingual sentence embeddings have yet to be explored. We systematically investigate methods for learning multilingual sentence embeddings by combining the best methods for learning monolingual and cross-lingual representations including: masked language modeling (MLM), translation language modeling (TLM) (Conneau and Lample, 2019), dual encoder translation ranking (Guo et al., 2018), and additive margin softmax (Yang et al., 2019a). We show that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%. Composing the best of these methods produces a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba, well above the 65.5% achieved by Artetxe and Schwenk (2019b), while still performing competitively on monolingual transfer learning benchmarks (Conneau and Kiela, 2018). Parallel data mined from CommonCrawl using our best model is shown to train competitive NMT models for en-zh and en-de. We publicly release our best multilingual sentence embedding model for 109+ languages at https://tfhub.dev/google/LaBSE.

Figure 1: Dual encoder model with BERT based encoding modules. (Source and target text are fed to two 12-layer transformer embedding networks that share parameters and are initialized from pre-trained BERT; the resulting source and target embeddings feed an additive margin softmax loss.)

1 Introduction

In this paper, we systematically explore using pre-trained language models in combination with the best of existing methods for learning cross-lingual sentence embeddings. Such embeddings are useful for clustering, retrieval, and modular use of text representations for downstream tasks. While existing cross-lingual sentence embedding models incorporate large transformer models, using large pretrained language models is not well explored. Rather, in prior work, encoders are trained directly on translation pairs (Artetxe and Schwenk, 2019b; Guo et al., 2018; Yang et al., 2019a), or on translation pairs combined with monolingual input-response prediction (Chidambaram et al., 2019; Yang et al., 2019b).

In our exploration, as illustrated in figure 1, we make use of dual-encoder models, which have been demonstrated as an effective approach for learning bilingual sentence embeddings (Guo et al., 2018; Yang et al., 2019a). However, diverging from prior work, rather than training encoders from scratch, we investigate using pre-trained encoders based on large language models. We contrast models with and without additive margin softmax (Yang et al., 2019a)¹. Figure 2 illustrates where our work stands (shaded) in the field of LM pre-training and sentence embedding learning.

Our massively multilingual models outperform the previous state-of-the-art on large bi-text retrieval tasks including the United Nations (UN) corpus (Ziemski et al., 2016) and BUCC (Zweigenbaum et al., 2018).

∗ Equal contributions.
† Work done while at Google.
¹ We also investigate the impact of mining hard negatives (Guo et al., 2018), but found it doesn't provide additional gain on top of other approaches. See supplemental material for details.
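For readers who want to try the released model, a minimal usage sketch is given below. The module version handles, the companion preprocessing module, and the "default" output key are assumptions based on the public TF Hub listing and may differ; the release URL above is the authoritative reference.

```python
# Sketch: encode sentences with the released LaBSE module from TF Hub.
# The "/2" handles, the preprocessor module, and the "default" output key
# are assumptions; consult https://tfhub.dev/google/LaBSE for the exact interface.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the text ops the preprocessor needs)

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

sentences = tf.constant(["dog", "Hund", "chien"])          # same meaning, three languages
embeddings = encoder(preprocessor(sentences))["default"]   # shape [3, 768]
embeddings = tf.nn.l2_normalize(embeddings, axis=1)        # cosine similarity == dot product

print(tf.matmul(embeddings, embeddings, transpose_b=True).numpy())
```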
Table 1 compares our best model with other recent multilingual work. Both the UN corpus and BUCC cover resource-rich languages (fr, de, es, ru, and zh). We further evaluate our models on the Tatoeba retrieval task (Artetxe and Schwenk, 2019b) that covers 112 languages. Compared to LASER (Artetxe and Schwenk, 2019b), our models perform significantly better on low-resource languages, boosting the overall accuracy on 112 languages to 83.7% from the 65.5% achieved by the previous state of the art. Surprisingly, we observe our models perform well on 30+ Tatoeba languages for which we have no explicit monolingual or bilingual training data. Finally, our embeddings perform competitively on the SentEval sentence embedding transfer learning benchmark (Conneau and Kiela, 2018).

The contributions of this paper are:

• A novel combination of pre-training and dual-encoder finetuning to boost translation ranking performance, achieving a new state-of-the-art on bi-text mining.

• A publicly released multilingual sentence embedding model spanning 109+ languages.

• Thorough experiments and ablation studies to understand the impact of pre-training, negative sampling strategies, vocabulary choice, data quality, and data quantity.

We release the pre-trained model at https://tfhub.dev/google/LaBSE.

Figure 2: Where our work stands (shaded) vs. related work in LM pre-training and sentence embedding learning. (The figure contrasts pre-training objectives (monolingual MLM, bilingual TLM) with sentence embedding models: monolingual USE & InferSent, bilingual Yang et al. (2019a), and multilingual m-USE & LASER.)

| Model | Langs | Architecture | HN | AMS | PT |
| LASER | 97 | seq2seq | N/A | N/A | N |
| Yang et al. (2019a) | 2 | DE | Y | Y | N |
| m-USE | 16 | DE | Y | Y | N |
| LaBSE | 109 | DE | N | Y | Y |

Table 1: LaBSE model compared to other recent cross-lingual embedding models. [DE]: Dual Encoder. [HN]: Hard Negative. [AMS]: Additive Margin Softmax. [PT]: Pre-training.

2 Cross-lingual Sentence Embeddings

Dual encoder models are an effective approach for learning cross-lingual embeddings (Guo et al., 2018; Yang et al., 2019a). Such models consist of paired encoding models that feed a scoring function. The source and target sentences are encoded separately, and sentence embeddings are extracted from each encoder. Cross-lingual embeddings are trained using a translation ranking task with in-batch negative sampling:

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(x_i, y_i)}}{e^{\phi(x_i, y_i)} + \sum_{n=1, n \neq i}^{N} e^{\phi(x_i, y_n)}}    (1)

The embedding space similarity of x and y is given by φ(x, y), typically φ(x, y) = xyᵀ. The loss attempts to rank y_i, the true translation of x_i, over all N−1 alternatives in the same batch. Notice that L is asymmetric and depends on whether the softmax is over the source or the target sentences. For bidirectional symmetry, the final loss sums the source-to-target, L, and target-to-source, L′, losses (Yang et al., 2019a):

\bar{L} = L + L'    (2)

Dual encoder models trained using a translation ranking loss directly maximize the similarity of translation pairs in a shared embedding space.

2.1 Additive Margin Softmax

Additive margin softmax extends the scoring function φ by introducing a margin m around positive pairs (Yang et al., 2019a):

\phi'(x_i, y_j) = \begin{cases} \phi(x_i, y_j) - m & \text{if } i = j \\ \phi(x_i, y_j) & \text{if } i \neq j \end{cases}    (3)

The margin, m, improves the separation between translations and nearby non-translations. Using φ′(x_i, y_j) with the bidirectional loss L̄, we obtain the additive margin loss:

L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\phi(x_i, y_i) - m}}{e^{\phi(x_i, y_i) - m} + \sum_{n=1, n \neq i}^{N} e^{\phi(x_i, y_n)}}    (4)
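For illustration, equations (1)-(4) can be written compactly over a batch of embeddings. The sketch below is a NumPy illustration rather than the training implementation; it assumes the embeddings are already computed and that φ is a plain dot product.

```python
# Sketch of the bidirectional additive margin softmax loss (Eqs. 1-4),
# assuming phi(x, y) is the dot product of precomputed sentence embeddings.
import numpy as np

def additive_margin_ranking_loss(src, tgt, margin=0.3):
    """src, tgt: [N, d] embeddings of N translation pairs (row i of src
    translates row i of tgt). Returns the symmetric loss of Eq. (2), with
    the margin of Eq. (3) subtracted from the positive (diagonal) scores."""
    scores = src @ tgt.T                              # [N, N] pairwise phi(x_i, y_j)
    scores = scores - margin * np.eye(len(src))       # margin only on translation pairs

    def ranking_loss(s):                              # Eq. (4): softmax over each row
        log_softmax = s - np.log(np.exp(s).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    return ranking_loss(scores) + ranking_loss(scores.T)   # L + L' as in Eq. (2)

rng = np.random.default_rng(0)
src = rng.normal(size=(8, 4)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = src + 0.1 * rng.normal(size=(8, 4)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(additive_margin_ranking_loss(src, tgt))
```

In the actual setup, the similarity is a scaled dot product of L2-normalized embeddings (Section 3.2) and the negatives come from the full cross-accelerator batch described in Section 3.3.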
2.2 MLM and TLM Pre-training

Only limited prior work has combined dual encoders trained with a translation ranking loss with encoders initialized using large pre-trained language models (Yang et al., 2021). We contrast using a randomly initialized transformer, as was done in prior work (Guo et al., 2018; Yang et al., 2019a), with using a large pre-trained language model. For pre-training, we combine masked language modeling (MLM) (Devlin et al., 2019) and translation language modeling (TLM) (Conneau and Lample, 2019). MLM is a variant of a cloze task, whereby a model uses context words surrounding a [MASK] token to try to predict what the [MASK] word should be. TLM extends this to the multilingual setting by modifying MLM training to include concatenated translation pairs.

Multilingual pre-trained models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019) and XLM-R (Conneau et al., 2019) have led to exceptional gains across a variety of cross-lingual natural language processing tasks (Hu et al., 2020). However, without a sentence-level objective, they do not directly produce good sentence embeddings. As shown in Hu et al. (2020), the performance of such models on bitext retrieval tasks is very weak; e.g. XLM-R Large gets 57.3% accuracy on a selection of 37 languages² from the Tatoeba dataset, compared to 84.4% using LASER (see the performance of more models in table 5). We contribute a detailed exploration that uses pre-trained language models to produce useful multilingual sentence embeddings.

3 Corpus and Training Details

3.1 Corpus

We use bilingual translation pairs and monolingual data in our experiments.³

Monolingual Data  We collect monolingual data from CommonCrawl⁴ and Wikipedia⁵. We use the 2019-35 version of CommonCrawl with heuristics from Raffel et al. (2019) to remove noisy text. Additionally, we remove short lines of fewer than 10 characters and long lines of more than 5,000 characters.⁶ The wiki data is extracted from the 05-21-2020 dump using WikiExtractor⁷. An in-house tool splits the text into sentences. The sentences are filtered using a sentence quality classifier.⁸ After filtering, we obtain 17B monolingual sentences, about 50% of the unfiltered version. The monolingual data is only used in customized pre-training.

Bilingual Translation Pairs  The translation corpus is constructed from web pages using a bitext mining system similar to the approach described in Uszkoreit et al. (2010). The extracted sentence pairs are filtered by a pre-trained contrastive-data-selection (CDS) scoring model (Wang et al., 2018). Human annotators manually evaluate sentence pairs from a small subset of the harvested pairs and mark the pairs as either GOOD or BAD translations. The data-selection scoring model threshold is chosen such that 80% of the retained pairs from the manual evaluation are rated as GOOD. We further limit the maximum number of sentence pairs to 100 million for each language to balance the data distribution. Many languages still have far fewer than 100M sentences. The final corpus contains 6B translation pairs.⁹ The translation corpus is used for both dual encoder training and customized pre-training.

3.2 Configurations

In this section, we describe the training details for the dual encoder model. A transformer encoder is used in all experiments (Vaswani et al., 2017). We train two versions of the model: one uses the public BERT multilingual cased vocabulary with a vocabulary size of 119,547, and a second incorporates a customized vocabulary extracted over our training data. For the customized vocabulary, we employ a wordpiece tokenizer (Sennrich et al., 2016), with a cased vocabulary extracted from the training set using TF Text.¹⁰ The language smoothing exponent for the vocabulary generation tool is set to 0.3 to counter imbalances in the amount of data available per language. The final vocabulary size is 501,153.

² The number is counted from the official evaluation script, although the original paper says 33 languages.
³ See the detailed list of supported languages in the supplemental material.
⁴ https://commoncrawl.org/
⁵ https://www.wikipedia.org/
⁶ Long lines are usually JavaScript or attempts at SEO.
⁷ https://github.com/attardi/wikiextractor
⁸ The quality classifier is trained using sentences from the main content of webpages as positives and text from other areas as negatives.
⁹ Experiments in later sections show that even 200M pairs across all languages is sufficient.
¹⁰ https://github.com/tensorflow/text
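The effect of the language smoothing exponent can be illustrated with a small sketch. The exact formula used by the vocabulary generation tool is not given in the paper; the snippet below assumes the common convention of raising per-language counts to the exponent and renormalizing.

```python
# Sketch: exponent-smoothed per-language sampling weights (assumed convention;
# the paper only states that the smoothing exponent is set to 0.3).
import numpy as np

counts = np.array([2_000_000_000, 50_000_000, 500_000], dtype=np.float64)  # head, mid, tail

def smoothed_weights(counts, alpha=0.3):
    w = counts ** alpha
    return w / w.sum()

print(smoothed_weights(counts, alpha=1.0))  # raw proportions: the tail language is nearly invisible
print(smoothed_weights(counts, alpha=0.3))  # smoothed: the tail language gets a usable share
```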
Figure 3: Negative sampling example in a dual encoder framework. [Left]: In-batch negative sampling on a single core, where the ground-truth labels lie on the diagonal of the score matrix. [Right]: Synchronized multi-accelerator negative sampling using n TPU cores and a batch size of 8 per core, with examples from the other cores all treated as negatives.
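The label layout of Figure 3 can be sketched as follows. This is a NumPy stand-in for the per-core batches; in a real TPU or GPU setup the gathered targets would come from an all-gather collective, an implementation detail the paper does not specify.

```python
# Sketch of the Figure 3 layout: each core scores its local source batch
# against target embeddings gathered from every core; the ground truth for
# local row j is the j-th target of the same core (the diagonal block).
import numpy as np

def cross_accelerator_logits(local_src, gathered_tgt, core_id, per_core_batch):
    """local_src: [b, d] source embeddings on this core.
    gathered_tgt: [n_cores * b, d] target embeddings broadcast from all cores.
    Returns the [b, n_cores * b] logits and the positive column index per row."""
    logits = local_src @ gathered_tgt.T
    positives = core_id * per_core_batch + np.arange(per_core_batch)
    return logits, positives

b, d, n_cores = 8, 768, 4
rng = np.random.default_rng(0)
tgt_all = rng.normal(size=(n_cores * b, d))            # targets from every core
src_core1 = rng.normal(size=(b, d))                    # sources assigned to core 1
logits, labels = cross_accelerator_logits(src_core1, tgt_all, core_id=1, per_core_batch=b)
print(logits.shape, labels)                            # (8, 32) [ 8  9 10 11 12 13 14 15]
```

In-batch-only sampling (the left panel) is the special case of a single core, where the positives sit on the main diagonal.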

The encoder architecture follows the BERT Base model, with 12 transformer blocks, 12 attention heads and 768 per-position hidden units. The encoder parameters are shared for all languages. Sentence embeddings are extracted as the l2-normalized [CLS] token representations from the last transformer block.¹¹

Our models are trained on Cloud TPU V3 with 32 cores using a global batch size of 4096 with a max sequence length of 128, using the AdamW (Loshchilov and Hutter, 2019) optimizer with an initial learning rate of 1e-3 and linear weight decay. We train for 50k steps for models with pre-training, and 500k steps for models without pre-training. We observe that even further training did not change the performance significantly. The default margin value for additive margin softmax is set to 0.3. Hyperparameters are tuned on a held-out development set.

3.3 Cross-Accelerator Negative Sampling

Cross-lingual embedding models trained with in-batch negative samples benefit from large training batch sizes (Guo et al., 2018). Resource-intensive models like BERT are limited to small batch sizes due to memory constraints. While data-parallelism does allow us to increase the global batch size by using multiple accelerators, the batch size on an individual core remains small. For example, a 4096 batch run across 32 cores results in a local batch size of 128, with each example then only receiving 127 negatives.

We introduce cross-accelerator negative sampling, which is illustrated in figure 3.¹² Under this strategy each core encodes its assigned sentences and then the encoded sentence representations from all cores are broadcast as negatives to the other cores. This allows us to fully realize the benefits of larger batch sizes while still distributing the computationally intensive encoding work across multiple cores.

Note that the dot-product scoring function makes it efficient to compute the pairwise scores within a batch with a single matrix multiplication. In figure 3, the values in the grids indicate the ground truth labels, with all positive labels located on the diagonal. A softmax function is applied to each row.

3.4 Pre-training

The encoder is pre-trained with Masked Language Model (MLM) (Devlin et al., 2019) and Translation Language Model (TLM) (Conneau and Lample, 2019)¹³ training on the monolingual data and bilingual translation pairs, respectively. For an L layer transformer encoder, we train using a 3 stage progressive stacking algorithm (Gong et al., 2019), where we first learn an L/4 layer model, then an L/2 layer model, and finally all L layers. The parameters of the models learned in the earlier stages are copied to the models for the subsequent stages.

Pre-training uses TPUv3 with 512 cores and a batch size of 8192. The max sequence length is set to 512 and 20% of tokens (or 80 tokens at most) per sequence are masked for MLM and TLM predictions. For the three stages of progressive stacking, we respectively train for 400k, 800k, and 1.8M steps using all monolingual and bilingual data.

¹¹ During training, the sentence embeddings after normalization are multiplied by a scaling factor. Following Chidambaram et al. (2018), we set the scaling factor to 10. We observe that the scaling factor is important for training a dual encoder model with the normalized embeddings.
¹² While our experiments use TPU accelerators, the same strategy can also be applied to models trained on GPU.
¹³ Diverging from Conneau and Lample (2019), we do not provide a language hint to encourage multilinguality.
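The three-stage schedule of Section 3.4 can be sketched as below. How the copied parameters initialize the deeper model follows Gong et al. (2019), where the shallow stack is duplicated; the paper itself only states that earlier-stage parameters are copied forward, so treat the duplication step as an assumption.

```python
# Sketch of 3-stage progressive stacking for a 12-layer encoder (L = 12):
# train L/4 layers, grow to L/2, then to L, copying learned layers forward.
def grow_encoder(trained_layers, target_depth):
    """trained_layers: list of per-layer parameter dicts from the previous stage.
    The new stack is built by repeating the trained stack (Gong et al., 2019);
    any scheme that copies the learned layers forward matches the description
    in Section 3.4."""
    grown = []
    while len(grown) < target_depth:
        grown.extend(layer.copy() for layer in trained_layers)
    return grown[:target_depth]

L = 12
stage1 = [{"layer": i} for i in range(L // 4)]     # 3 layers, trained for 400k steps
stage2 = grow_encoder(stage1, L // 2)              # 6 layers, trained for 800k steps
stage3 = grow_encoder(stage2, L)                   # 12 layers, trained for 1.8M steps
print(len(stage1), len(stage2), len(stage3))       # 3 6 12
```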
4 Evaluation Tasks

4.1 Bitext Retrieval

We evaluate models on three bitext retrieval tasks: United Nations (UN), Tatoeba, and BUCC. All tasks are to retrieve the correct English translation for each non-English sentence.

United Nations (UN)  contains 86,000 sentence-aligned bilingual documents over five language pairs: en-fr, en-es, en-ru, en-ar and en-zh (Ziemski et al., 2016). A total of 11.3 million¹⁴ aligned sentence pairs can be extracted from the document pairs. The large pool of translation candidates makes this data set particularly challenging.

Tatoeba  evaluates translation retrieval over 112 languages (Artetxe and Schwenk, 2019b). The dataset contains up to 1,000 sentences per language along with their English translations. We evaluate performance on the original version covering all 112 languages, and also on the 36 language version from the XTREME benchmark (Hu et al., 2020).

BUCC  is a parallel sentence mining shared task (Zweigenbaum et al., 2018). We use the 2018 shared task data, containing four language pairs: fr-en, de-en, ru-en and zh-en. For each pair, the task provides monolingual corpora and gold true translation pairs. The task is to extract translation pairs from the monolingual data, which are evaluated against the ground truth using F1. Since the ground truth for the BUCC test data is not released, we follow prior work in using the BUCC training set for evaluation rather than training (Yang et al., 2019b; Hu et al., 2020). Sentence embedding cosine similarity is used to identify the translation pairs.¹⁵

4.2 Downstream Classification

We also evaluate the transfer performance of multilingual sentence embeddings on downstream classification tasks from the SentEval benchmark (Conneau and Kiela, 2018). We evaluate on select tasks from SentEval including: (MR) movie reviews (Pang and Lee, 2005), (SST) sentiment analysis (Socher et al., 2013), (TREC) question-type (Voorhees and Tice, 2000), (CR) product reviews (Hu and Liu, 2004), (SUBJ) subjectivity/objectivity (Pang and Lee, 2004), (MPQA) opinion polarity (Wiebe et al., 2005), and (MRPC) paraphrase detection (Dolan et al., 2004). While SentEval is English only, we make use of this benchmark in order to directly compare to prior work on sentence embedding models.

5 Results

Table 2 shows the performance on the UN and Tatoeba bitext retrieval tasks and compares against the prior state-of-the-art bilingual models Yang et al. (2019a), LASER (Artetxe and Schwenk, 2019b), and the multilingual universal sentence encoder (m-USE) (Yang et al., 2019b)¹⁶. Rows 1-3 show the performance of the baseline models, as reported in the original papers.

Rows 4-7 show the performance of models that use the public mBERT vocabulary. The baseline model shows reasonable performance on UN, ranging from 57%-71% P@1. It also performs well on Tatoeba with 92.8% and 79.1% accuracy for the 36 language group and all languages, respectively. Adding pre-training both helps models converge faster (see details in section 6.2) and improves performance on UN retrieval using both vocabularies. Pre-training also helps on Tatoeba, but only when using the customized vocabulary.¹⁷ Additive margin softmax significantly improves the performance of all model variations.

The last two rows contain two models using the customized vocabulary. Both of them are trained with additive margin softmax, given the strong evidence from the experiments above. Both models outperform the mBERT vocabulary based models, and the model with pre-training performs best of all. The top model (Base w/ Customized Vocab + AMS + PT) achieves a new state-of-the-art on 3 of the 4 languages, with P@1 of 91.1, 88.3, and 90.8 for en-es, en-fr, and en-ru respectively. It reaches 87.7 on zh-en, only 0.2 lower than the best bilingual en-zh model and nearly 9 points better than the previous best multilingual model. On Tatoeba, the best model also outperforms the baseline model by a large margin, with +10.6 accuracy on the 36 language group from XTREME and +18.2 on all languages.

¹⁴ About 9.5 million after de-duping.
¹⁵ Reranking models can further improve performance (e.g. the margin based scorer (Artetxe and Schwenk, 2019a) and the BERT based classifier (Yang et al., 2019a)). However, this is tangential to assessing the raw embedding retrieval performance.
¹⁶ universal-sentence-encoder-multilingual-large/3
¹⁷ The coverage of the public mBERT vocabulary on the tail languages is poor, with many [UNK] tokens in those languages; e.g. the [UNK] token rate is 71% for the language si, which could be the reason that pre-training doesn't help on the Tatoeba task.
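The downstream classification evaluation of Section 4.2 amounts to fitting a lightweight classifier on frozen sentence embeddings. The sketch below is a generic logistic-regression probe with placeholder data, not the SentEval toolkit used in the paper.

```python
# Sketch: transfer classification with frozen sentence embeddings, using a
# simple logistic-regression probe (placeholder features and labels; the
# paper's numbers come from the SentEval toolkit).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_emb, test_emb = rng.normal(size=(1000, 768)), rng.normal(size=(200, 768))
train_y, test_y = rng.integers(0, 2, 1000), rng.integers(0, 2, 200)

probe = LogisticRegression(max_iter=1000).fit(train_emb, train_y)
print("transfer accuracy:", probe.score(test_emb, test_y))
```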
| Model | UN es | UN fr | UN ru | UN zh | UN avg | Tatoeba 36 Langs | Tatoeba All Langs |
| LASER (Artetxe and Schwenk, 2019b) | – | – | – | – | – | 84.4 | 65.5 |
| m-USE (Yang et al., 2019b) | 86.1 | 83.3 | 88.9 | 78.8 | 84.3 | – | – |
| Yang et al. (2019a) | 89.0 | 86.1 | 89.2 | 87.9 | 88.1 | – | – |
| Base w/ mBERT Vocab | 67.7 | 57.0 | 70.2 | 71.9 | 66.7 | 92.8 | 79.1 |
|   + PT | 68.5 | 59.8 | 65.8 | 71.7 | 66.5 | 92.7 | 78.6 |
|   + AMS | 88.2 | 84.5 | 88.6 | 86.4 | 86.9 | 93.7 | 81.2 |
|   + AMS + PT | 89.3 | 85.7 | 89.3 | 87.2 | 87.9 | 93.2 | 78.4 |
| Base w/ Customized Vocab | | | | | | | |
|   + AMS | 90.6 | 86.5 | 89.5 | 86.8 | 88.4 | 94.8 | 82.6 |
|   + AMS + PT (LaBSE) | 91.1 | 88.3 | 90.8 | 87.7 | 89.5 | 95.0 | 83.7 |

Table 2: UN (P@1, en→xx) and Tatoeba (average accuracy, xx→en) performance for different model configurations. Base uses a bidirectional dual encoder model. [AMS]: Additive Margin Softmax. [PT]: Pre-training.
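The UN and Tatoeba numbers in Table 2 are retrieval scores: for each source sentence, the true translation must be ranked first among all candidates. A minimal sketch of P@1 over normalized embeddings is shown below (brute force; the UN task's 11.3M-candidate pool would need approximate nearest-neighbor search in practice).

```python
# Sketch: precision@1 for bitext retrieval with L2-normalized embeddings.
# Row i of src_emb and row i of tgt_emb are assumed to be true translations.
import numpy as np

def precision_at_1(src_emb, tgt_emb):
    scores = src_emb @ tgt_emb.T                 # cosine similarity on normalized rows
    predicted = scores.argmax(axis=1)            # nearest candidate per source sentence
    return float((predicted == np.arange(len(src_emb))).mean())

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(precision_at_1(emb, emb))                  # 1.0 on a trivially aligned toy set
```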

It is worth noting that all our models perform similarly on Tatoeba but not on UN. This suggests it is necessary to evaluate on large-scale bitext retrieval tasks to better discern differences between competing models. In the rest of the paper we refer to the best performing model here, Base w/ Customized Vocab + AMS + PT, as LaBSE, unless otherwise specified.

Table 3 provides LaBSE's retrieval performance on BUCC, comparing against strong baselines from Artetxe and Schwenk (2019a) and Yang et al. (2019a). Following prior work, we perform both forward and backward retrieval. Forward retrieval treats en as the target and the other language as the source, and backward retrieval is vice versa. LaBSE not only systematically outperforms prior work but also covers all languages within a single model. The previous state-of-the-art required four separate bilingual models (Yang et al., 2019a).

5.1 Results on Downstream Classification Tasks

Table 4 gives the transfer performance achieved by LaBSE on the SentEval benchmark (Conneau and Kiela, 2018), comparing against other state-of-the-art sentence embedding models. Despite its massive language coverage in a single model, LaBSE still obtains competitive transfer performance with monolingual English embedding models and the 16 language m-USE model.

6 Analysis

6.1 Additive Margin Softmax

The above experiments show that additive margin softmax is a critical factor in learning good cross-lingual embeddings, which is aligned with the findings from Yang et al. (2019a). We further investigate the effect of margin size on our three model variations, as shown in figure 4. The model with an additive margin value of 0 performs poorly on the UN task, with ~60 average P@1 across all three model variations. With a small margin value of 0.1, the models improve significantly compared to no margin, with 70s to 80s average P@1. Increasing the margin value keeps improving performance until it reaches 0.3. The trend is consistent for all models.

Figure 4: Average P@1 (%) on the UN retrieval task for models trained with different margin values (Base w/ mBERT vocab + AMS, Base w/ mBERT vocab + AMS + PT, and Base w/ Customized vocab + AMS + PT (LaBSE)).

6.2 Effectiveness of Pre-training

To better understand the effect of MLM/TLM pre-training in the final LaBSE model, we explore training a variant of this model using our customized vocabulary but without pre-training. The results are shown in figure 5. We experiment with varying the number of training steps for both models, including 50K, 100K, 200K, and 500K steps. The model with pre-trained encoders already achieves the highest performance when trained for 50K steps; further training doesn't increase the performance significantly. However, the model without pre-training performs poorly when trained for only 50K steps.
| Models | fr-en P | R | F | de-en P | R | F | ru-en P | R | F | zh-en P | R | F |
| Forward search | | | | | | | | | | | | |
| Artetxe and Schwenk (2019a) | 82.1 | 74.2 | 78.0 | 78.9 | 75.1 | 77.0 | - | - | - | - | - | - |
| Yang et al. (2019a) | 86.7 | 85.6 | 86.1 | 90.3 | 88.0 | 89.2 | 84.6 | 91.1 | 87.7 | 86.7 | 90.9 | 88.8 |
| LaBSE | 86.6 | 90.9 | 88.7 | 92.3 | 92.7 | 92.5 | 86.1 | 91.9 | 88.9 | 88.2 | 89.7 | 88.9 |
| Backward search | | | | | | | | | | | | |
| Artetxe and Schwenk (2019a) | 77.2 | 72.7 | 74.7 | 79.0 | 73.1 | 75.9 | - | - | - | - | - | - |
| Yang et al. (2019a) | 83.8 | 85.5 | 84.6 | 89.3 | 87.7 | 88.5 | 83.6 | 90.5 | 86.9 | 88.7 | 87.5 | 88.1 |
| LaBSE | 87.1 | 88.4 | 87.8 | 91.3 | 92.7 | 92.0 | 86.3 | 90.7 | 88.4 | 87.8 | 90.3 | 89.0 |

Table 3: [P]recision, [R]ecall and [F]-score on the BUCC training set, scored with cosine similarity. The thresholds are chosen for the best F scores on the training set. Following the naming of the BUCC task (Zweigenbaum et al., 2018), we treat en as the target and the other language as the source in forward search; backward is vice versa.
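For the BUCC numbers in Table 3, candidate pairs are scored with cosine similarity and a threshold is chosen to maximize F1 on the training set. A small sketch of that threshold sweep, assuming scored candidate pairs and gold pairs are already available:

```python
# Sketch: pick the similarity threshold that maximizes F1 against gold pairs,
# as done for the BUCC evaluation (inputs here are toy placeholders).
def best_f1_threshold(scored_pairs, gold_pairs):
    """scored_pairs: list of ((src_id, tgt_id), score); gold_pairs: set of (src_id, tgt_id)."""
    best_f1, best_threshold = 0.0, None
    for _, threshold in scored_pairs:              # every candidate score is a possible cutoff
        kept = {pair for pair, score in scored_pairs if score >= threshold}
        true_positives = len(kept & gold_pairs)
        if not kept or not true_positives:
            continue
        precision = true_positives / len(kept)
        recall = true_positives / len(gold_pairs)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_threshold = f1, threshold
    return best_f1, best_threshold

candidates = [(("s1", "t1"), 0.92), (("s2", "t9"), 0.71), (("s3", "t3"), 0.55)]
print(best_f1_threshold(candidates, gold_pairs={("s1", "t1"), ("s3", "t3")}))  # (0.8, 0.55)
```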

| Model | MR | CR | SUBJ | MPQA | TREC | SST | MRPC |
| English Models | | | | | | | |
| InferSent | 81.1 | 86.3 | 92.4 | 90.2 | 88.2 | 84.6 | 76.2 |
| Skip-Thought LN | 79.4 | 83.1 | 93.7 | 89.3 | – | – | – |
| Quick-Thought | 82.4 | 86.0 | 94.8 | 90.2 | 92.4 | 87.6 | 76.9 |
| USETrans | 82.2 | 84.2 | 95.5 | 88.1 | 93.2 | 83.7 | – |
| Multilingual Models | | | | | | | |
| m-USETrans | 78.1 | 87.0 | 92.1 | 89.9 | 96.6 | 80.9 | – |
| LaBSE | 79.1 | 86.7 | 93.6 | 89.6 | 92.6 | 83.8 | 74.4 |

Table 4: Performance on English transfer tasks from SentEval (Conneau and Kiela, 2018). We compare the LaBSE model with InferSent (Conneau et al., 2017), Skip-Thought LN (Ba et al., 2016), Quick-Thought (Logeswaran and Lee, 2018), USETrans (Cer et al., 2018), and m-USETrans (Yang et al., 2019b).

Figure 5: Average P@1 (%) on the UN retrieval task for models trained with different numbers of training steps (Base w/ Customized vocab + AMS, and Base w/ Customized vocab + AMS + PT (LaBSE)).

Its performance increases with additional steps and approaches the model with pre-training at 500K steps. The overall performance is, however, still slightly worse. Moreover, further training past 500K steps doesn't increase the performance significantly. Pre-training thus both improves performance and dramatically reduces the amount of parallel data required. Critically, the model without pre-training sees 1B examples at 500K steps, while the 50K-step model only sees 200M examples.¹⁸

6.3 Low Resource Languages and Languages without Explicit Training Data

We evaluate performance through further experiments on Tatoeba for comparison to prior work and to identify broader trends. Besides the 36 language group and the all-languages group, two more groups of 14 languages (selected from the languages covered by m-USE) and 82 languages (covered by the LASER training data) are evaluated. Table 5 provides the macro-average accuracy achieved by LaBSE for the four language groupings drawn from Tatoeba, comparing against LASER and m-USE. All three models perform well on the 14 major languages supported by m-USE, with each model achieving an average accuracy >93%. Both LaBSE and LASER perform moderately better than m-USE, with an accuracy of 95.3%. As more languages are included, the averaged accuracy for both LaBSE and LASER decreases, but with a notably more rapid decline for LASER. LaBSE systematically outperforms LASER on the groups of 36 languages (+10.6%), 82 languages (+11.4%), and 112 languages (+18.2%).

Figure 6 lists the Tatoeba accuracy for languages where we don't have any explicit training data. There are a total of 30+ such languages. The performance is surprisingly good for most of these languages, with an average accuracy around 60%. Nearly one third of them have accuracy greater than 75%, and only 7 of them have accuracy lower than 25%. One possible reason is that language mapping is done manually, and some languages are close to languages with training data but may be treated differently according to ISO-639 standards and other information. Additionally, since automatic language detection is used, some limited amount of data for the missing languages might be included during training. We also suspect that the well performing languages are close to some language for which we have training data; for example, yue and wuu are related to zh (Chinese), and fo has similarities to is (Icelandic).

¹⁸ We note that it is relatively easy to get 200M parallel examples for many languages from public sources like Paracrawl and TED58, while obtaining 1B examples is generally much more challenging.
| Model | 14 Langs | 36 Langs | 82 Langs | All Langs |
| m-USETrans | 93.9 | – | – | – |
| LASER | 95.3 | 84.4 | 75.9 | 65.5 |
| LaBSE | 95.3 | 95.0 | 87.3 | 83.7 |

Table 5: Accuracy (%) on the Tatoeba datasets. [14 Langs]: the languages m-USE supports. [36 Langs]: the languages selected by XTREME. [82 Langs]: the languages for which LASER has training data. [All Langs]: all languages supported by Tatoeba.

| Model | dev | test |
| SentenceBERT (Reimers and Gurevych, 2019) | - | 79.2 |
| m-USE (Yang et al., 2019b) | 83.7 | 82.5 |
| USE (Cer et al., 2018) | 80.2 | 76.6 |
| ConvEmbed (Yang et al., 2018) | 81.4 | 78.2 |
| InferSent (Conneau et al., 2017) | 80.1 | 75.6 |
| LaBSE | 74.3 | 72.8 |
| STS Benchmark Tuned | | |
| SentenceBERT-STS (Reimers and Gurevych, 2019) | - | 86.1 |
| ConvEmbed (Yang et al., 2018) | 83.5 | 80.8 |

Table 6: Semantic Textual Similarity (STS) benchmark (Cer et al., 2017) performance as measured by Pearson's r.

Figure 6: Tatoeba accuracy for the languages without any explicit training data (the x-axis lists the 30+ language codes, e.g. arz, kzj/dtp, yue, wuu, fo). The average (AVG) accuracy is 60.5%, listed first.

Multilingual generalization across so many languages is only possible due to the massively multilingual nature of LaBSE.

6.4 Semantic Similarity

The Semantic Textual Similarity (STS) benchmark (Cer et al., 2017) measures the ability of models to replicate fine-grained human judgements on pairwise English sentence similarity. Models are scored according to their Pearson correlation, r, on gold labels ranging from 0, unrelated meaning, to 5, semantically equivalent, with intermediate values capturing carefully defined degrees of meaning overlap. STS is used to evaluate the quality of sentence-level embeddings by assessing the degree to which similarity between pairs of sentence embeddings aligns with human perception of sentence meaning similarity.

Table 6 reports performance on the STS benchmark for LaBSE versus existing sentence embedding models. Following prior work, the semantic similarity of a sentence pair according to LaBSE is computed as the arc cosine distance between the pair's sentence embeddings.¹⁹ For comparison, we include numbers for SentenceBERT when it is fine-tuned for the STS task, as well as ConvEmbed when an additional affine transform is trained to fit the embeddings to STS. We observe that LaBSE performs worse on pairwise English semantic similarity than other sentence embedding models. We suspect training LaBSE on translation pairs biases the model to excel at detecting meaning equivalence, but not at distinguishing between fine-grained degrees of meaning overlap.

Recently, Reimers and Gurevych (2020) showed one can distill an English sentence representation model into a student multilingual model using a language alignment loss. The distilled model performs well on (multilingual) STS benchmarks but underperforms on bitext retrieval tasks when compared to state-of-the-art models. Our approach is complementary and can be combined with their method to distill better student models.

7 Mining Parallel Text from CommonCrawl

We use the LaBSE model to mine parallel text from CommonCrawl, a large-scale multilingual web corpus, and then train NMT models on the mined data. We experiment with two language pairs: English-to-Chinese (en-zh) and English-to-German (en-de). We mine translations from monolingual CommonCrawl data processed as described above for self-supervised MLM pretraining. After processing, there are 1.17B, 0.6B, and 7.73B sentences for Chinese (zh), German (de), and English (en), respectively. LaBSE embeddings are used to pair each non-English sentence with its nearest English neighbor, dropping pairs with a similarity score < 0.6.²⁰

¹⁹ Within prior work, m-USE, USE and ConvEmbed use arccos distance to measure embedding space semantic similarity, while InferSent and SentenceBERT use cosine similarity.
²⁰ The threshold 0.6 is selected by manually inspecting a data sample, where pairs greater than or equal to this threshold are likely to be translations or partial translations of each other. This results in 715M and 302M sentence pairs for en-zh and en-de, respectively. Note that the pairs may still be noisy, but we resort to data selection to select higher quality sentence pairs for training NMT models.
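The mining step just described pairs every non-English sentence with its nearest English neighbor in embedding space and keeps pairs scoring at least 0.6. A brute-force sketch is given below; at the paper's scale (billions of sentences) this would require an approximate nearest-neighbor index, which the paper does not detail.

```python
# Sketch: nearest-neighbor bitext mining with a similarity cutoff of 0.6
# (brute force over small arrays; the real corpus needs ANN search).
import numpy as np

def mine_pairs(non_en_emb, en_emb, threshold=0.6):
    """Both inputs are L2-normalized [n, d] arrays. Returns (non_en_idx, en_idx) pairs."""
    sims = non_en_emb @ en_emb.T              # cosine similarity
    nearest = sims.argmax(axis=1)             # best English candidate per source sentence
    keep = sims[np.arange(len(sims)), nearest] >= threshold
    return list(zip(np.flatnonzero(keep), nearest[keep]))

rng = np.random.default_rng(0)
en = rng.normal(size=(1000, 768)); en /= np.linalg.norm(en, axis=1, keepdims=True)
zh = en[:50] + 0.05 * rng.normal(size=(50, 768)); zh /= np.linalg.norm(zh, axis=1, keepdims=True)
print(len(mine_pairs(zh, en)))                # most of the 50 noisy copies are recovered
```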
| Langs | # of XX Sents | # of En Sents | # of Mined Pairs | BLEU (News) | BLEU (TED) |
| en-zh | 1.17B | 7.73B | 715M | 36.3 | 15.2 |
| en-de | 601M | 7.73B | 302M | 28.1 | 31.3 |

Table 7: The number of source/target sentences and the number of mined parallel sentence pairs from CommonCrawl. BLEU scores (en→xx) are evaluated on the WMT News dataset and the TED dataset. We use wmtnews17 and wmtnews14 for zh-en and de-en respectively in the WMT News set.

For en-de and en-zh, we train a model with Transformer-Big (Vaswani et al., 2017) in the following way: first we train the model on the mined data as is for 120k steps with a batch size of 10k. Then we select the best 20% of the data using Wang et al. (2018)'s data selection method, and train for another 80k steps.

Results in table 7 show the effectiveness of the mined training data. By referencing previous results (Edunov et al., 2018), we see the mined data yields performance that is only 2.8 BLEU away from the performance of the best system that made use of the WMT17 en-de parallel data. Compared to prior en-zh results (Sennrich et al., 2017), we see that the model is as good as a WMT17 NMT model (Sennrich et al., 2017) trained on the WMT en-zh parallel data. The table also gives BLEU performance on the TED test set (Qi et al., 2018), with performance being comparable with models trained using CCMatrix (Schwenk et al., 2019).²¹

8 Conclusion

This paper presents a language-agnostic BERT sentence embedding (LaBSE) model supporting 109 languages. The model achieves state-of-the-art performance on various bi-text retrieval/mining tasks compared to the previous state-of-the-art, while also providing increased language coverage. We show the model performs strongly even on those languages for which LaBSE doesn't have any explicit training data, likely due to language similarity and the massively multilingual nature of the model. Extensive experiments show that additive margin softmax is a key factor for training the model, that parallel data quality matters, and that the effect of increased amounts of parallel data diminishes when a pre-trained language model is used. The pre-trained model is released at https://tfhub.dev/google/LaBSE.

²¹ CCMatrix is another dataset containing billions of parallel sentences mined from CommonCrawl using an embedding based mining approach, with an additional cleaning step.

Acknowledgments

We thank our teammates from Descartes, Translate and other Google groups for their feedback and suggestions. Special thanks goes to Sidharth Mudgal and Jax Law for help with data processing, as well as Jialu Liu, Tianqi Liu, Chen Chen, and Anosh Raj for help on BERT pretraining.

References

Mikel Artetxe and Holger Schwenk. 2019a. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3197–3203, Florence, Italy. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019b. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguistics, 7:597–610.

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.

Muthu Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yunhsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Learning cross-lingual sentence representations via a multi-task dual-encoder model. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 250–259, Florence, Italy. Association for Computational Linguistics.

Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning cross-lingual sentence representations via a multi-task dual-encoder model. CoRR, abs/1810.12836.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, F. Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. ArXiv, abs/1911.02116.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

William Dolan, Chris Quirk, Chris Brockett, and Bill Dolan. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In International Conference on Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Efficient training of BERT by progressively stacking. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2337–2346, Long Beach, California, USA. PMLR.

Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Effective parallel corpus mining using bilingual sentence embeddings. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 165–176. Association for Computational Linguistics.

Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. In International Conference on Learning Representations (ICLR).

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Jing Lu, Gustavo Hernández Ábrego, Ji Ma, Jianmo Ni, and Yinfei Yang. 2020. Neural passage retrieval with improved negative contrast. CoRR, abs/2010.12523.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 115–124.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv, abs/1910.10683.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP/IJCNLP.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525, Online. Association for Computational Linguistics.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, E. Grave, and Armand Joulin. 2019. CCMatrix: Mining billions of high-quality parallel sentences on the web. ArXiv, abs/1911.04944.

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh's neural MT systems for WMT17. In Proceedings of the Second Conference on Machine Translation, pages 389–399, Copenhagen, Denmark. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 1101–1109, Stroudsburg, PA, USA. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Ellen M. Voorhees and Dawn M. Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Wei Wang, Taro Watanabe, Macduff Hughes, Tetsuji Nakagawa, and Ciprian Chelba. 2018. Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation, pages 133–143. Association for Computational Linguistics.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Yinfei Yang, Gustavo Hernández Ábrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019a. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5370–5378. ijcai.org.

Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernández Ábrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019b. Multilingual universal sentence encoder for semantic retrieval. CoRR, abs/1907.04307.

Yinfei Yang, Steve Yuan, Daniel Cer, Sheng-yi Kong, Noah Constant, Petr Pilar, Heming Ge, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning semantic textual similarity from conversations. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 164–174, Melbourne, Australia. Association for Computational Linguistics.

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2021. Universal sentence representation learning with conditional masked language model. To appear in EMNLP 2021.

Michal Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC '16. European Language Resources Association (ELRA).

Pierre Zweigenbaum, Serge Sharoff, and Reinhard Rapp. 2018. Overview of the third BUCC shared task: Spotting parallel sentences in comparable corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
A LaBSE-Large

Motivated by the recent progress of giant models, we also train a model with increased model capacity. Following BERT-Large, we develop LaBSE-Large using a 24 layer transformer with 16 attention heads and a hidden size of 1024. Constrained by computation resources, we train 1M steps of one-stage pre-training instead of the progressive multi-stage pre-training used when training the LaBSE model. Fine-tuning configurations are exactly the same as for the base LaBSE model.

Table 8 shows the UN performance of the LaBSE-Large model compared to the LaBSE model. The results are mixed, and the average performances are very close. We also evaluate the model on Tatoeba, and the average performances across all languages are also very close: 83.7 (LaBSE) vs. 83.8 (LaBSE-Large).

| Model | es | fr | ru | zh | avg. |
| LaBSE | 91.1 | 88.3 | 90.8 | 87.7 | 89.5 |
| LaBSE-Large | 90.9 | 87.9 | 89.4 | 89.5 | 89.4 |

Table 8: P@1 on UN (en→xx).

We suspect that the translation matching training objective is too easy, so the model cannot learn more information from the current in-batch negative sampling approach. An improved negative contrast could help the larger model to learn better representations. We experimented with one type of hard negative in the section below, but more types of hard negatives could be explored, as described in Lu et al. (2020). We leave this as future work.

B Hard Negative Mining

Since their introduction into models that make use of dual encoders to learn cross-lingual embeddings, hard negatives (Guo et al., 2018) have become the de facto data augmentation method for learning cross-lingual sentence embeddings (Chidambaram et al., 2019; Yang et al., 2019a). To get the hard negatives, a weaker dual encoder model is trained using a similar model but with fewer parameters and less training data. For each training example, those incorrect translations that are semantically similar to the correct translation are retrieved as "hard negatives" from a candidate pool. Semantic similarity is determined using the cosine similarity of the embeddings generated by the weaker model. It is challenging to apply hard negatives to large datasets as it is very time consuming and computationally costly.

We investigate hard negative mining closely following Guo et al. (2018). By contacting the original authors, we obtained their negative mining pipeline, which employs a weaker dual encoder that uses a deep averaging network trained to identify translation pairs. Similar to the cross-accelerator negatives, the mined negatives are also appended to each example.

We only experiment with hard negatives for Spanish (es), as it is very costly to get hard negatives for all languages. Due to memory constraints, we only append 3 mined hard negatives in es for each en source sentence. Since the number of examples increased 4x per en sentence in es batches, we also decrease the batch size from 128 to 32 in the hard negative experiment. For languages other than es, the training data was the same as in the other experiments, but with the batch size likewise decreased to 32; those languages are otherwise trained as usual. Table 9 shows the results of these models on UN. The accuracy on all four languages went down, even for en-es where we have the hard negatives. We suspect the worse performance is caused by the decrease in batch size required by the memory constraints of including more hard negatives per example.

| Model | es | fr | ru | zh | avg. |
| LaBSE | 91.1 | 88.3 | 90.8 | 87.7 | 89.5 |
| LaBSE + es HN | 90.4 | 87.1 | 89.9 | 87.2 | 88.7 |

Table 9: P@1 on UN (en→xx) with hard negative examples in en-es.

C Supported Languages

The supported languages are listed in table 10. The distribution for each supported language is shown in figure 7.
ISO NAME ISO NAME ISO NAME
af AFRIKAANS ht HAITIAN_CREOLE pt PORTUGUESE
am AMHARIC hu HUNGARIAN ro ROMANIAN
ar ARABIC hy ARMENIAN ru RUSSIAN
as ASSAMESE id INDONESIAN rw KINYARWANDA
az AZERBAIJANI ig IGBO si SINHALESE
be BELARUSIAN is ICELANDIC sk SLOVAK
bg BULGARIAN it ITALIAN sl SLOVENIAN
bn BENGALI ja JAPANESE sm SAMOAN
bo TIBETAN jv JAVANESE sn SHONA
bs BOSNIAN ka GEORGIAN so SOMALI
ca CATALAN kk KAZAKH sq ALBANIAN
ceb CEBUANO km KHMER sr SERBIAN
co CORSICAN kn KANNADA st SESOTHO
cs CZECH ko KOREAN su SUNDANESE
cy WELSH ku KURDISH sv SWEDISH
da DANISH ky KYRGYZ sw SWAHILI
de GERMAN la LATIN ta TAMIL
el GREEK lb LUXEMBOURGISH te TELUGU
en ENGLISH lo LAOTHIAN tg TAJIK
eo ESPERANTO lt LITHUANIAN th THAI
es SPANISH lv LATVIAN tk TURKMEN
et ESTONIAN mg MALAGASY tl TAGALOG
eu BASQUE mi MAORI tr TURKISH
fa PERSIAN mk MACEDONIAN tt TATAR
fi FINNISH ml MALAYALAM ug UIGHUR
fr FRENCH mn MONGOLIAN uk UKRAINIAN
fy FRISIAN mr MARATHI ur URDU
ga IRISH ms MALAY uz UZBEK
gd SCOTS_GAELIC mt MALTESE vi VIETNAMESE
gl GALICIAN my BURMESE wo WOLOF
gu GUJARATI ne NEPALI xh XHOSA
ha HAUSA nl DUTCH yi YIDDISH
haw HAWAIIAN no NORWEGIAN yo YORUBA
he HEBREW ny NYANJA zh CHINESE
hi HINDI or ORIYA zu ZULU
hmn HMONG pa PUNJABI
hr CROATIAN pl POLISH

Table 10: The supported languages of LaBSE (ISO 639-1/639-2).


Figure 7: Quantity of monolingual sentences and bilingual sentence-pairs for each of the 109 languages in our training set. The English (en) sentences are capped at 2 billion.