Language-Agnostic BERT Sentence Embedding
Fangxiaoyu Feng∗, Yinfei Yang∗†, Daniel Cer, Naveen Arivazhagan, Wei Wang†
Google AI
Mountain View
{fangxiaoyu, cer, navari}@google.com
{yangyin7, wei.wang.world}@gmail.com
Abstract

[...] semantic similarity and embedding based transfer learning (Reimers and Gurevych, 2019) [...]

[Figure (model architecture): additive margin loss over paired 12-layer transformer embedding networks with shared parameters.]

• A publicly released multilingual sentence embedding model spanning 109+ languages.

• Thorough experiments and ablation studies to understand the impact of pre-training, negative sampling strategies, vocabulary choice, data quality, and data quantity.

We release the pre-trained model at https://tfhub.dev/google/LaBSE.

Figure 2: Where our work stands (shaded) vs. related work in LM pre-training and sentence embedding learning. [Diagram labels: Cross-lingual; Bilingual; Multilingual; TLM; Yang et al. (2019a); m-USE & LASER.]

Model                Langs  Type     HN   AMS  Pre-train
LASER                97     seq2seq  N/A  N/A  N
Yang et al. (2019a)  2      DE       Y    Y    N
m-USE                16     DE       Y    Y    N
LaBSE                109    DE       N    Y    Y

Table 1: Model comparison. [DE]: Dual Encoder. [HN]: Hard Negatives. [AMS]: Additive Margin Softmax.

Dual encoder models are an effective approach for learning cross-lingual embeddings (Guo et al., 2018; Yang et al., 2019a). Such models consist of paired encoders that feed a scoring function. The source and target sentences are encoded separately, and sentence embeddings are extracted from each encoder. Cross-lingual embeddings are trained using a translation ranking task with in-batch negative sampling.

The margin, m, improves the separation between translations and nearby non-translations. Using \phi'(x_i, y_j) with the bidirectional loss \bar{L}_s, we obtain the additive margin loss

L = -\frac{1}{N} \sum_{i=1}^{N} \frac{e^{\phi(x_i, y_i) - m}}{e^{\phi(x_i, y_i) - m} + \sum_{n=1,\, n \neq i}^{N} e^{\phi(x_i, y_n)}}    (4)
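To make the ranking objective concrete, the sketch below computes a bidirectional in-batch translation ranking loss with additive margin over a batch of N translation pairs. It is a minimal illustration, not the released implementation: it assumes the usual log-softmax (cross-entropy) form of the objective, l2-normalized embeddings scored by a scaled dot product, and the helper name additive_margin_ranking_loss and its arguments are ours.

```python
import tensorflow as tf

def additive_margin_ranking_loss(src_emb, tgt_emb, margin=0.3, scale=10.0):
    """Bidirectional in-batch translation ranking loss with additive margin.

    src_emb, tgt_emb: [N, d] l2-normalized embeddings of N translation pairs,
    where row i of src_emb is a translation of row i of tgt_emb.
    """
    # phi(x_i, y_j): scaled dot product between every source/target pair.
    scores = scale * tf.matmul(src_emb, tgt_emb, transpose_b=True)   # [N, N]
    # Subtract the margin m from the matching (diagonal) scores only,
    # so true translations must beat in-batch negatives by at least m.
    positives = tf.eye(tf.shape(scores)[0])
    logits = scores - margin * positives
    # Softmax over each row: rank the true target against N-1 negatives.
    loss_fwd = tf.nn.softmax_cross_entropy_with_logits(labels=positives, logits=logits)
    # Bidirectional: also rank the true source for each target (columns).
    loss_bwd = tf.nn.softmax_cross_entropy_with_logits(labels=positives, logits=tf.transpose(logits))
    return tf.reduce_mean(loss_fwd + loss_bwd)
```

With cross-accelerator negative sampling (Section 3.3) and a global batch of 4096, each pair in this loss is ranked against 4095 negatives rather than the 127 available on a single core.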
2.2 MLM and TLM

Only limited prior work has combined dual encoders trained with a translation ranking loss with encoders initialized from large pre-trained language models (Yang et al., 2021). We contrast using a randomly initialized transformer, as was done in prior work (Guo et al., 2018; Yang et al., 2019a), with using a large pre-trained language model. For pre-training, we combine masked language modeling (MLM) (Devlin et al., 2019) and translation language modeling (TLM) (Conneau and Lample, 2019). MLM is a variant of a cloze task, whereby a model uses the context words surrounding a [MASK] token to try to predict what the [MASK] word should be. TLM extends this to the multilingual setting by modifying MLM training to include concatenated translation pairs.

Multilingual pre-trained models such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019) and XLM-R (Conneau et al., 2019) have led to exceptional gains across a variety of cross-lingual natural language processing tasks (Hu et al., 2020). However, without a sentence-level objective, they do not directly produce good sentence embeddings. As shown in Hu et al. (2020), the performance of such models on bitext retrieval tasks is very weak; e.g., XLM-R Large gets 57.3% accuracy on a selected set of 37 languages2 from the Tatoeba dataset, compared to 84.4% using LASER (see the performance of more models in Table 5). We contribute a detailed exploration that uses pre-trained language models to produce useful multilingual sentence embeddings.
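To make the difference between MLM and TLM inputs concrete, the sketch below builds a masked training example from a concatenated translation pair; for MLM, the same masking is applied to a single monolingual sentence. This is a schematic under our own assumptions: tokenization is elided, BERT's 80-10-10 replacement scheme is skipped, no language hint is added (footnote 13), and the 20% / 80-token masking budget follows Section 3.4.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def tlm_example(src_tokens, tgt_tokens, mask_rate=0.2, max_masks=80, seed=0):
    """Build a TLM training example: concatenate a translation pair and mask
    tokens across BOTH sides, so the model can use the other language as
    context when filling in the blanks. MLM is the same procedure applied
    to a single monolingual sentence.
    """
    rng = random.Random(seed)
    tokens = [CLS] + list(src_tokens) + [SEP] + list(tgt_tokens) + [SEP]
    # Candidate positions exclude the special tokens.
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    budget = min(max_masks, max(1, int(mask_rate * len(candidates))))
    targets = {}
    for i in rng.sample(candidates, budget):
        targets[i] = tokens[i]   # label = original token
        tokens[i] = MASK         # input = masked token
    return tokens, targets

# e.g. tlm_example(["how", "are", "you"], ["comment", "ça", "va"])
```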
3 Corpus and Training Details

3.1 Corpus

We use bilingual translation pairs and monolingual data in our experiments3.

Monolingual Data We collect monolingual data from CommonCrawl4 and Wikipedia5. We use the 2019-35 version of CommonCrawl with heuristics from Raffel et al. (2019) to remove noisy text. Additionally, we remove short lines of fewer than 10 characters and long lines of more than 5,000 characters.6 The wiki data is extracted from the 05-21-2020 dump using WikiExtractor7. An in-house tool splits the text into sentences. The sentences are filtered using a sentence quality classifier.8 After filtering, we obtain 17B monolingual sentences, about 50% of the unfiltered version. The monolingual data is only used in customized pre-training.

Bilingual Translation Pairs The translation corpus is constructed from web pages using a bitext mining system similar to the approach described in Uszkoreit et al. (2010). The extracted sentence pairs are filtered by a pre-trained contrastive-data-selection (CDS) scoring model (Wang et al., 2018). Human annotators manually evaluate sentence pairs from a small subset of the harvested pairs and mark the pairs as either GOOD or BAD translations. The data-selection scoring threshold is chosen such that 80% of the retained pairs from the manual evaluation are rated as GOOD. We further limit the maximum number of sentence pairs to 100 million for each language to balance the data distribution. Many languages still have far fewer than 100M sentences. The final corpus contains 6B translation pairs.9 The translation corpus is used for both dual encoder training and customized pre-training.

3.2 Configurations

In this section, we describe the training details for the dual encoder model. A transformer encoder is used in all experiments (Vaswani et al., 2017). We train two versions of the model: one uses the public BERT multilingual cased vocabulary, with a vocabulary size of 119,547, and a second incorporates a customized vocabulary extracted over our training data. For the customized vocabulary, we employ a wordpiece tokenizer (Sennrich et al., 2016), with a cased vocabulary extracted from the training set using TF Text.10 The language smoothing exponent for the vocabulary generation tool is set to 0.3 to counter imbalances in the amount of data available per language. The final vocabulary size is 501,153.
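The text does not spell out the exact smoothing formula used by the vocabulary generation tool; a common interpretation, and the assumption behind this short sketch, is exponentiated-count smoothing, where each language's sampling weight is proportional to its sentence count raised to the 0.3 exponent.

```python
import numpy as np

def smoothed_language_weights(sentence_counts, alpha=0.3):
    """Down-weight high-resource languages when building the wordpiece vocab.

    sentence_counts: dict mapping language code -> number of sentences.
    alpha: smoothing exponent (0.3 here); alpha=1.0 keeps the raw
    distribution, alpha=0.0 gives every language equal weight.
    """
    langs = sorted(sentence_counts)
    counts = np.array([sentence_counts[l] for l in langs], dtype=np.float64)
    probs = counts ** alpha
    probs /= probs.sum()
    return dict(zip(langs, probs))

# Example: a language with 1000x more sentences keeps a larger weight,
# but the gap shrinks from 1000x to roughly 8x with alpha = 0.3.
print(smoothed_language_weights({"en": 1_000_000, "si": 1_000}))
```

Keeping tail languages represented in this way is what allows the customized vocabulary to reach 501,153 entries without being dominated by the highest-resource languages.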
2 The number is counted from the official evaluation script, although the original paper says 33 languages.
3 See the detailed list of supported languages in the supplemental material.
4 https://commoncrawl.org/
5 https://www.wikipedia.org/
6 Long lines are usually JavaScript or attempts at SEO.
7 https://github.com/attardi/wikiextractor
8 The quality classifier is trained using sentences from the main content of webpages as positives and text from other areas as negatives.
9 Experiments in later sections show that even 200M pairs across all languages is sufficient.
10 https://github.com/tensorflow/text
Figure 3: Negative sampling example in a dual encoder framework. [Left]: in-batch negative sampling on a single core; [Right]: synchronized multi-accelerator negative sampling using n TPU cores and a batch size of 8 per core, with examples from the other cores all treated as negatives. [The grids show the ground-truth label matrix for a target sentence batch, with positives on the diagonal.]
The encoder architecture follows the BERT Base model, with 12 transformer blocks, 12 attention heads and 768 per-position hidden units. The encoder parameters are shared for all languages. Sentence embeddings are extracted as the l2-normalized [CLS] token representations from the last transformer block.11

Our models are trained on Cloud TPU V3 with 32 cores, using a global batch size of 4096, a max sequence length of 128, and the AdamW (Loshchilov and Hutter, 2019) optimizer with an initial learning rate of 1e-3 and linear weight decay. We train for 50k steps for models with pre-training, and 500k steps for models without pre-training. We observe that even further training did not change the performance significantly. The default margin value for additive margin softmax is set to 0.3. Hyperparameters are tuned on a held-out development set.

3.3 Cross-Accelerator Negative Sampling

Cross-lingual embedding models trained with in-batch negative samples benefit from large training batch sizes (Guo et al., 2018). Resource-intensive models like BERT are limited to small batch sizes due to memory constraints. While data parallelism does allow us to increase the global batch size by using multiple accelerators, the batch size on an individual core remains small. For example, a batch of 4096 run across 32 cores results in a local batch size of 128, with each example then only receiving 127 negatives.

We introduce cross-accelerator negative sampling, which is illustrated in figure 3.12 Under this strategy, each core encodes its assigned sentences, and the encoded sentence representations from all cores are then broadcast as negatives to the other cores. This allows us to fully realize the benefits of larger batch sizes while still distributing the computationally intensive encoding work across multiple cores.

Note that the dot-product scoring function makes it efficient to compute all pairwise scores within a batch with a single matrix multiplication. In figure 3, the values in the grids indicate the ground-truth labels, with all positive labels located on the diagonal. A softmax function is applied to each row.
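A minimal sketch of the scoring side of cross-accelerator negative sampling is shown below. The all-gather step is simulated with a plain list of per-core batches; in a real TPU or GPU setup it would be a cross-replica collective, and the function names here are ours, not the training code's.

```python
import numpy as np

def cross_accelerator_scores(local_src, all_tgt_shards, core_id):
    """Score a core's local source batch against target embeddings gathered
    from every core. all_tgt_shards simulates the result of an all-gather:
    a list with one [local_batch, dim] array per core.
    """
    global_tgt = np.concatenate(all_tgt_shards, axis=0)   # [n_cores*b, d]
    scores = local_src @ global_tgt.T                     # [b, n_cores*b]
    local_batch = local_src.shape[0]
    # The positive for local example j sits at this core's offset + j.
    positive_cols = core_id * local_batch + np.arange(local_batch)
    return scores, positive_cols

# Toy check: 4 "cores", batch 8 per core, 16-dim embeddings.
rng = np.random.default_rng(0)
shards = [rng.standard_normal((8, 16)) for _ in range(4)]
scores, pos = cross_accelerator_scores(shards[1], shards, core_id=1)
assert scores.shape == (8, 32) and pos[0] == 8
```

Each local example is now ranked against n_cores x local_batch - 1 negatives (31 in the toy check, 4095 for the 4096-example global batch used in training) instead of the 127 available on a single core.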
3.4 Pre-training

The encoder is pre-trained with Masked Language Model (MLM) (Devlin et al., 2019) and Translation Language Model (TLM) (Conneau and Lample, 2019)13 training on the monolingual data and bilingual translation pairs, respectively. For an L-layer transformer encoder, we train using a 3-stage progressive stacking algorithm (Gong et al., 2019), where we first learn an L/4-layer model, then an L/2-layer model, and finally the full L-layer model. The parameters of the models learned in the earlier stages are copied to the models for the subsequent stages.

Pre-training uses TPUv3 with 512 cores and a batch size of 8192. The max sequence length is set to 512, and 20% of tokens (or 80 tokens at most) per sequence are masked for MLM and TLM predictions. For the three stages of progressive stacking, we respectively train for 400k, 800k, and 1.8M steps using all monolingual and bilingual data.
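The stage-to-stage parameter copy can be pictured with a small sketch. Gong et al. (2019) grow the network by stacking the trained shallow blocks on top of a copy of themselves; the exact copy scheme used here is not detailed in the text, so the block-doubling below is an illustrative assumption.

```python
def stack_parameters(shallow_layers):
    """Initialize a model with twice as many transformer blocks by stacking
    a copy of the trained shallow blocks on top of themselves, as in
    progressive stacking (Gong et al., 2019). `shallow_layers` is a list of
    per-block parameter dicts; embeddings and other shared weights would be
    copied over unchanged.
    """
    return [dict(block) for block in shallow_layers + shallow_layers]

# 3-stage schedule for a 12-layer encoder: train 3 blocks, stack to 6,
# keep training, stack to 12, then train the full model.
stage1 = [{"name": f"block_{i}"} for i in range(3)]   # L/4 = 3 layers
stage2 = stack_parameters(stage1)                     # L/2 = 6 layers
stage3 = stack_parameters(stage2)                     # L   = 12 layers
assert len(stage3) == 12
```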
11 During training, the sentence embeddings after normalization are multiplied by a scaling factor. Following Chidambaram et al. (2018), we set the scaling factor to 10. We observe that the scaling factor is important for training a dual encoder model with the normalized embeddings.
12 While our experiments use TPU accelerators, the same strategy can also be applied to models trained on GPU.
13 Diverging from Conneau and Lample (2019), we do not provide a language hint to encourage multilinguality.
4 Evaluation Tasks

4.1 Bitext Retrieval

We evaluate models on three bitext retrieval tasks: United Nations (UN), Tatoeba, and BUCC. All tasks are to retrieve the correct English translation for each non-English sentence.

United Nations (UN) contains 86,000 sentence-aligned bilingual documents over five language pairs: en-fr, en-es, en-ru, en-ar and en-zh (Ziemski et al., 2016). A total of 11.3 million14 aligned sentence pairs can be extracted from the document pairs. The large pool of translation candidates makes this data set particularly challenging.

Tatoeba evaluates translation retrieval over 112 languages (Artetxe and Schwenk, 2019b). The dataset contains up to 1,000 sentences per language along with their English translations. We evaluate performance on the original version covering all 112 languages, and also on the 36-language version from the XTREME benchmark (Hu et al., 2020).

BUCC is a parallel sentence mining shared task (Zweigenbaum et al., 2018). We use the 2018 shared task data, containing four language pairs: fr-en, de-en, ru-en and zh-en. For each pair, the task provides monolingual corpora and ground-truth translation pairs. The task is to extract translation pairs from the monolingual data, which are evaluated against the ground truth using F1. Since the ground truth for the BUCC test data is not released, we follow prior work in using the BUCC training set for evaluation rather than training (Yang et al., 2019b; Hu et al., 2020). Sentence embedding cosine similarity is used to identify the translation pairs.15
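To make the cosine-similarity mining and the threshold selection used for BUCC (see Table 3) concrete, here is a small sketch under our own assumptions: brute-force scoring of pre-computed, l2-normalized embeddings, greedy one-best matching, and a simple sweep over candidate thresholds to maximize F1 on the training set. The actual shared-task tooling and any approximate nearest-neighbor index are outside this illustration.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, threshold):
    """Greedy cosine mining: for each source sentence take its best target
    and keep the pair if the similarity clears the threshold.
    Embeddings are assumed l2-normalized, so dot product == cosine.
    """
    sims = src_emb @ tgt_emb.T                              # [n_src, n_tgt]
    best = sims.argmax(axis=1)
    keep = sims[np.arange(len(src_emb)), best] >= threshold
    return {(int(i), int(best[i])) for i in np.where(keep)[0]}

def f1(predicted, gold):
    tp = len(predicted & gold)
    if not predicted or not gold or tp == 0:
        return 0.0
    p, r = tp / len(predicted), tp / len(gold)
    return 2 * p * r / (p + r)

def best_threshold(src_emb, tgt_emb, gold, candidates=np.linspace(0.3, 0.9, 61)):
    """Pick the threshold that maximizes F1 against the gold pairs,
    mirroring how the BUCC thresholds are chosen on the training set."""
    return max(candidates, key=lambda t: f1(mine_pairs(src_emb, tgt_emb, t), gold))
```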
4.2 Downstream Classification

We also evaluate the transfer performance of multilingual sentence embeddings on downstream classification tasks from the SentEval benchmark (Conneau and Kiela, 2018). We evaluate on select tasks from SentEval, including: (MR) movie reviews (Pang and Lee, 2005), (SST) sentiment analysis (Socher et al., 2013), (TREC) question-type (Voorhees and Tice, 2000), (CR) product reviews (Hu and Liu, 2004), (SUBJ) subjectivity/objectivity (Pang and Lee, 2004), (MPQA) opinion polarity (Wiebe et al., 2005), and (MRPC) paraphrase detection (Dolan et al., 2004). While SentEval is English only, we make use of this benchmark in order to directly compare to prior work on sentence embedding models.

5 Results

Table 2 shows the performance on the UN and Tatoeba bitext retrieval tasks and compares against the prior state-of-the-art bilingual models of Yang et al. (2019a), LASER (Artetxe and Schwenk, 2019b), and the multilingual universal sentence encoder (m-USE) (Yang et al., 2019b)16. Rows 1-3 show the performance of baseline models, as reported in the original papers.

Rows 4-7 show the performance of models that use the public mBERT vocabulary. The baseline model shows reasonable performance on UN, ranging from 57%-71% P@1. It also performs well on Tatoeba, with 92.8% and 79.1% accuracy for the 36-language group and all languages, respectively. Adding pre-training both helps models converge faster (see details in section 6.2) and improves performance on UN retrieval using both vocabularies. Pre-training also helps on Tatoeba, but only when using the customized vocabulary.17 Additive margin softmax significantly improves the performance of all model variations.

The last two rows contain two models using the customized vocab. Both of them are trained with additive margin softmax, given the strong evidence from the experiments above. Both models outperform the mBERT vocabulary based models, and the model with pre-training performs best of all. The top model (Base w/ Customized Vocab + AMS + PT) achieves a new state-of-the-art on 3 of the 4 languages, with P@1 of 91.1, 88.3 and 90.8 for en-es, en-fr and en-ru, respectively. It reaches 87.7 on zh-en, only 0.2 lower than the best bilingual en-zh model and nearly 9 points better than the previous best multilingual model. On Tatoeba, the best model also outperforms the baseline model by a large margin, with +10.6 accuracy on the 36-language group [...]
14 About 9.5 million after de-duping.
15 Reranking models can further improve performance (e.g., the margin-based scorer of Artetxe and Schwenk (2019a) and the BERT-based classifier of Yang et al. (2019a)). However, this is tangential to assessing the raw embedding retrieval performance.
16 universal-sentence-encoder-multilingual-large/3
17 The coverage of the public mBERT vocabulary on the tail languages is poor, with many [UNK] tokens in those languages; e.g., the [UNK] token rate is 71% for the language si, which could be the reason pre-training doesn't help on the Tatoeba task.
                                         UN (en → xx)                     Tatoeba (xx → en)
Model                                    es     fr     ru     zh     avg     36 Langs   All Langs
LASER (Artetxe and Schwenk, 2019b)       –      –      –      –      –       84.4       65.5
m-USE (Yang et al., 2019b)               86.1   83.3   88.9   78.8   84.3    –          –
Yang et al. (2019a)                      89.0   86.1   89.2   87.9   88.1    –          –
Base w/ mBERT Vocab                      67.7   57.0   70.2   71.9   66.7    92.8       79.1
  + PT                                   68.5   59.8   65.8   71.7   66.5    92.7       78.6
  + AMS                                  88.2   84.5   88.6   86.4   86.9    93.7       81.2
  + AMS + PT                             89.3   85.7   89.3   87.2   87.9    93.2       78.4
Base w/ Customized Vocab
  + AMS                                  90.6   86.5   89.5   86.8   88.4    94.8       82.6
  + AMS + PT (LaBSE)                     91.1   88.3   90.8   87.7   89.5    95.0       83.7

Table 2: UN (P@1) and Tatoeba (average accuracy) performance for different model configurations. Base uses a bidirectional dual encoder model. [AMS]: Additive Margin Softmax. [PT]: Pre-training.
                                    fr-en               de-en               ru-en               zh-en
                                    P     R     F       P     R     F       P     R     F       P     R     F
Forward
Artetxe and Schwenk (2019a)         82.1  74.2  78.0    78.9  75.1  77.0    –     –     –       –     –     –
Yang et al. (2019a)                 86.7  85.6  86.1    90.3  88.0  89.2    84.6  91.1  87.7    86.7  90.9  88.8
LaBSE                               86.6  90.9  88.7    92.3  92.7  92.5    86.1  91.9  88.9    88.2  89.7  88.9
Backward
Artetxe and Schwenk (2019a)         77.2  72.7  74.7    79.0  73.1  75.9    –     –     –       –     –     –
Yang et al. (2019a)                 83.8  85.5  84.6    89.3  87.7  88.5    83.6  90.5  86.9    88.7  87.5  88.1
LaBSE                               87.1  88.4  87.8    91.3  92.7  92.0    86.3  90.7  88.4    87.8  90.3  89.0

Table 3: [P]recision, [R]ecall and [F]-score on the BUCC training set, scored with cosine similarity. The thresholds are chosen for the best F-score on the training set. Following the naming of the BUCC task (Zweigenbaum et al., 2018), we treat en as the target and the other language as the source in the forward search; backward is vice versa.
Figure 5: Average P@1 (%) on the UN retrieval task for models trained with different numbers of training steps. [Plot: P@1 (%) versus training steps (K), comparing Base w/ Customized Vocab + AMS against Base w/ Customized Vocab + AMS + PT (LaBSE).]

[...] however, still slightly worse. Moreover, further training past 500k steps does not increase the performance significantly. Pre-training thus both improves performance and dramatically reduces the amount of parallel data required. Critically, the model sees 1B examples at 500K steps, while the 50K model only sees 200M examples.18

6.3 Low Resource Languages and Languages without Explicit Training Data

We evaluate performance through further experiments on Tatoeba for comparison to prior work and [...] rapid decline for LASER. LaBSE systematically outperforms LASER on the groups of 36 languages (+10.6%), 82 languages (+11.4%), and 112 languages (+18.2%).

Figure 6 lists the Tatoeba accuracy for languages where we do not have any explicit training data. There are a total of 30+ such languages. The performance is surprisingly good for most of these languages, with an average accuracy around 60%. Nearly one third of them have accuracy greater than 75%, and only 7 of them have accuracy lower than 25%. One possible reason is that language mapping is done manually, and some languages are close to languages with training data but may be treated differently according to ISO-639 standards and other information. Additionally, since automatic language detection is used, some limited amount of data for the missing languages might be included during training. We also suspect that the well-performing languages are close to some language for which we have training data; for example, yue and wuu are related to zh (Chinese), and fo has similarities to is (Icelandic). Multilingual generalization across so many languages is only [...]

18 We note that it is relatively easy to get 200M parallel examples for many languages from public sources like Paracrawl and TED58, while obtaining 1B examples is generally much more challenging.
Model          14 Langs   36 Langs   82 Langs   All Langs
m-USE_Trans.   93.9       –          –          –
LASER          95.3       84.4       75.9       65.5
LaBSE          95.3       95.0       87.3       83.7

Table 5: Accuracy (%) on the Tatoeba datasets. [14 Langs]: the languages m-USE supports. [36 Langs]: the languages selected by XTREME. [82 Langs]: the languages for which LASER has training data. [All Langs]: all languages supported by Tatoeba.

Model                                               dev    test
SentenceBERT (Reimers and Gurevych, 2019)           –      79.2
m-USE (Yang et al., 2019b)                          83.7   82.5
USE (Cer et al., 2018)                              80.2   76.6
ConvEmbed (Yang et al., 2018)                       81.4   78.2
InferSent (Conneau et al., 2017)                    80.1   75.6
LaBSE                                               74.3   72.8
STS Benchmark Tuned
SentenceBERT-STS (Reimers and Gurevych, 2019)       –      86.1
ConvEmbed (Yang et al., 2018)                       83.5   80.8

Table 6: Semantic Textual Similarity (STS) benchmark (Cer et al., 2017) performance as measured by Pearson's r.
[Figure 6: Tatoeba accuracy (%) per language for the 30+ languages without explicit training data (x-axis: language; y-axis: Tatoeba accuracy; AVG marks the average).]

Figure 7: Quantity of monolingual sentences and bilingual sentence pairs for each of the 109 languages in our [...] [Bar chart per language; series: Monolingual Sentences and Bilingual Sentence Pairs (en-xx).]