
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

Paul-Ambroise Duquenne (Meta AI & Inria)    Holger Schwenk (Meta AI)    Benoît Sagot (Inria)
[email protected]    [email protected]    [email protected]

Abstract

We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive with the state-of-the-art NLLB 1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.

Figure 1: SONAR architecture.

1 Introduction

Representation learning of sentences has been widely studied in recent years for different purposes: from classification of sentences (Devlin et al., 2018) to multilingual representations for translation purposes (Pham et al., 2019). Different pre-training objectives were explored to build contextual representations from sentences (Devlin et al., 2018; Conneau et al., 2019; Clark et al., 2020). However, these methods often lack sentence-level objectives, making it difficult to evaluate the semantic similarity between two sentences. On the other hand, several works focused on learning sentence embeddings (Cer et al., 2018; Conneau et al., 2017; Reimers and Gurevych, 2019), aiming to encode sentences with similar meanings closely in the sentence embedding space. Artetxe and Schwenk (2019) and Feng et al. (2020) extended this idea to multilingual sentences, enabling the semantic comparison of sentences from different languages. This was used to perform bitext mining at scale, automatically aligning monolingual sentences from Common Crawl (Schwenk et al., 2021). Such mined bitext data can successfully be used to train state-of-the-art machine translation (MT) models (Schwenk et al., 2021; NLLB Team et al., 2022). In recent research, we may distinguish three main approaches to building multilingual fixed-size sentence representations.

Encoder-only approaches, such as Feng et al. (2020), learn sentence embeddings for text based on a siamese encoder architecture. A contrastive loss is often used to learn similar representations for different translations of a text while avoiding collapse (i.e., predicting the same embedding for every input).

Encoder-decoder approaches, such as Artetxe and Schwenk (2019), learn sentence embeddings with a translation objective, which can be computed thanks to an additional decoder. The main difference with a classical sequence-to-sequence model is the bottleneck layer, or pooling function, that computes a fixed-size sentence representation between the encoder and the decoder.

Teacher-student approaches, such as Reimers and Gurevych (2020) and Heffernan et al. (2022), extend a (possibly monolingual) pre-existing sentence embedding space to new languages with a teacher-student learning strategy. The existing embedding space is used as teacher to train student encoders for new languages. Bitext training data is used for this kind of training: the sentence in the new language is encoded with the student encoder being trained, while its translation in an already supported language is encoded with the pre-existing encoder as target. The same teacher-student approach can be used to extend a text-only multilingual sentence embedding space to the speech modality by training speech encoders (Duquenne et al., 2021; Khurana et al., 2022). These speech encoders can then be used to perform speech-to-text or speech-to-speech translation mining (Duquenne et al., 2022a).

In this work, we used an encoder-decoder approach to build our sentence embedding space SONAR on text data only. We then used a teacher-student approach to train speech encoders for the same space.

Our motivation for using an encoder-decoder approach for the initial text-based training phase is twofold. First, a multilingual decoder is trained along with the multilingual encoder, which opens possibilities such as zero-shot MT (Duquenne et al., 2022b). Second, a pre-trained state-of-the-art MT encoder-decoder model can be used to initialize the whole encoder-decoder architecture; in this work, we used the NLLB 1B dense model as initialization. In contrast to previous work, we study the effect of different training objective functions on the properties of the resulting embedding space. More precisely, we combine translation, auto-encoding and denoising objectives, together with a cross-lingual similarity objective in the sentence embedding space.

In a second step, we train speech student encoders using our multilingual text encoder as a teacher. We demonstrate the cross-modal similarity search and speech translation¹ capabilities of the resulting SONAR framework.

In summary, the main contributions of the SONAR (Sentence-level multimOdal and laNguage-Agnostic Representations) model are as follows:

• We explore different training objectives to learn a multilingual sentence embedding space initialized from the NLLB 1B model, thoroughly comparing the different approaches on a wide range of decoding and similarity search evaluations;

• This yields a single sentence encoder for 200 languages which significantly outperforms state-of-the-art sentence embedding approaches;

• We trained speech encoders for 37 languages using teacher-student training;

• We provide a text decoder for 200 languages, enabling (zero-shot) text and speech translation;

• We analyzed the cross-lingual and cross-modal similarity search and decoding capabilities of our SONAR framework;

• The SONAR text and speech encoders as well as the text decoders are freely available at https://github.com/facebookresearch/SONAR.

2 Related work

Multilingual sentence representations. Many works have studied how to efficiently learn multilingual representations of sentences. Some of them focused on variable-length representations of sentences, learning high-level contextual representations for each sub-word, like multilingual BERT (Devlin et al., 2018) or XLM-R (Conneau et al., 2020). Others learnt fixed-size sentence representations by integrating sentence-level objectives into training. This is the case, for example, of Sentence-BERT (Reimers and Gurevych, 2019), which was initially trained on English text only and later extended to other languages with a teacher-student approach (Reimers and Gurevych, 2020). The English model serves as a teacher to train a multilingual encoder covering other languages. The student model is initialized with the XLM-R pretrained encoder and fine-tuned using bitext training data. The original English encoder, which is kept frozen, is used to generate an embedding for the English translation of each sentence, which then serves as a target for the student encoder via a regression loss.

¹The term "speech translation" customarily denotes speech-to-text translation.
Bitexts can also be used in other ways to train multilingual sentence embedding spaces. LASER (Artetxe and Schwenk, 2019) is an encoder-decoder architecture, with a fixed-size sentence representation between the encoder and the decoder, trained with a translation objective. The original LASER covers 93 languages. Its decoder was originally used for training only, as the encoder itself defines the sentence embedding space. However, recent work such as (Duquenne et al., 2022b) showed that it is possible to learn high-quality decoders from LASER representations into multiple languages, thereby enabling zero-shot MT on unseen language directions. Similarly to Reimers and Gurevych (2020), Heffernan et al. (2022) introduced LASER3, extending LASER to new languages, including low-resource languages, using a teacher-student approach. Finally, LaBSE (Feng et al., 2020) uses a dual-encoder approach with an additive margin softmax objective (Yang et al., 2019). It highlights the benefits of initializing encoders with multilingual pre-trained models and covers 109 languages.

Joint speech/text sentence representations. There has been a large body of research on unsupervised representation learning for monolingual (Baevski et al., 2020) and multilingual speech (Babu et al., 2021), with recently w2v-bert (Chung et al., 2021), which combines contrastive learning and masked language modeling to learn self-supervised representations from speech. Other works explored multilingual and multimodal (speech/text) pre-training methods, including mSLAM (Bapna et al., 2022). Finally, Duquenne et al. (2021), followed by Khurana et al. (2022), introduced multilingual and multimodal sentence embeddings, extending a pre-existing multilingual text sentence embedding space to the speech modality with a distillation approach. Duquenne et al. (2022b, 2023) also showed that it is possible to efficiently decode multilingual speech sentence embeddings with decoders trained on text sentence embeddings into different languages, to perform zero-shot speech translation.

3 Methodology

To build our multilingual and multimodal sentence embedding space SONAR, we follow a two-step training strategy, inspired by Duquenne et al. (2021, 2022b). The first step is to build a sentence embedding space for text: we build a multilingual sentence embedding space based on an encoder-decoder approach. The second step extends the multilingual text sentence embedding space to the speech modality, using a teacher-student approach.

3.1 Multilingual sentence representations for text

Contrarily to LASER's bidirectional LSTM architecture (Artetxe and Schwenk, 2019), SONAR relies on a Transformer encoder-decoder architecture, initialized with pre-trained MT model weights. However, as opposed to standard sequence-to-sequence architectures for MT, the architecture we use to train SONAR on parallel text data goes through a single-vector bottleneck that represents the full sentence and does not use token-level cross-attention. The fixed-size sentence representation is computed by pooling the token-level outputs of the encoder. Instead of performing cross-attention on a variable-length sequence of encoder outputs, the decoder only attends to this single vector at each decoding step. Different pooling methods can be used to compute this fixed-size representation, including max- and mean-pooling on token-level encoder outputs, as well as the encoder output for a special BOS token.
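To make the bottleneck concrete, here is a minimal PyTorch sketch of mean-pooling token-level encoder outputs into a single sentence embedding that the decoder then cross-attends to as a length-one memory. All names, shapes and the usage example are illustrative assumptions, not the actual SONAR implementation.

```python
import torch

def mean_pool(encoder_outputs: torch.Tensor,
              padding_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token-level encoder outputs into one vector per sentence.

    encoder_outputs: (batch, seq_len, dim)
    padding_mask:    (batch, seq_len), True for real tokens, False for padding.
    """
    mask = padding_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (encoder_outputs * mask).sum(dim=1)  # ignore padded positions
    counts = mask.sum(dim=1).clamp(min=1.0)       # avoid division by zero
    return summed / counts                        # (batch, dim)

# Hypothetical usage: the decoder cross-attends to this single vector
# instead of the full encoder output sequence, i.e. a memory of length 1.
enc_out = torch.randn(4, 32, 1024)             # (batch, seq_len, dim)
mask = torch.ones(4, 32, dtype=torch.bool)
sentence_embedding = mean_pool(enc_out, mask)  # (4, 1024)
memory = sentence_embedding.unsqueeze(1)       # (4, 1, 1024)
```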
Contrarily to LASER (Artetxe and Schwenk, 2019), we do not train our encoder-decoder architecture using an MT objective only. We investigated several other objectives and combinations thereof, and analyzed their effect on the sentence embedding space and the decoding performance of the resulting model. We introduce below the different objectives used to train our encoder-decoder architecture.

Translation objective. Following Artetxe and Schwenk (2019), we used parallel data to train our encoder-decoder architecture with a translation objective. To better understand the motivation behind this objective, consider a triplet of translations x, y, z, where z is the English translation: decoding x and y into English may be easily achieved by the decoder if the sentence representations of these two input sentences are similar in the sentence embedding space. Training an encoder-decoder architecture on a translation objective may end up in this potential local minimum, where translations are encoded closely to one another so as to be

English sentence from FLORES:
Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chair of the clinical and scientific division of the Canadian Diabetes Association cautioned that the research is still in its early days.
Auto-encoding of the sentence with SONAR:
Dr. Ehud Ur, professor of medicine at Dalhousie University in Halifax, Nova Scotia and chairman of the clinical and scientific division of the Canadian Diabetes Association warned that the research is still in its early stages.
Figure 2: Example of a long sentence with named entities auto-encoded with SONAR.

decoded into the same target-language sentence. However, there is no guarantee of converging to this local minimum: nothing explicitly constrains a sentence in one language and its translation in another language to be encoded closely to one another. As a result, other local minima are possible, where translations are not encoded closely but are still decoded into the same sentence for a given target language. To mitigate this, shallow decoders were used by Artetxe and Schwenk (2019): a deeper decoder can more easily decode different points into the same sentence, whereas a shallower decoder is more likely to need two vectors to be very similar whenever they must be decoded into the same sentence.

Auto-encoding and denoising auto-encoding objective. Auto-encoders have been widely used to build representations. They have the advantage of encouraging the encoding of fine-grained details of the input. However, this objective by itself is not likely to learn semantic representations of sentences. Moreover, this objective is much simpler to learn than a translation objective, which makes the combination of the two objectives difficult. To mitigate these issues, Liu et al. (2020) introduced a denoising auto-encoding task, which has proven to be a good pre-training objective for translation tasks.

MSE loss objective in the sentence embedding space. Teacher-student approaches to multilingual sentence embedding space learning have shown that ensuring that translations of a same sentence are embedded close to one another in the sentence embedding space with an MSE loss works really well (Reimers and Gurevych, 2020; Heffernan et al., 2022). However, using this kind of loss without a frozen pre-existing teacher embedding space would lead to collapse (all inputs mapped to the same embedding), which is why contrastive learning methods were introduced to learn multilingual sentence embeddings from scratch (Feng et al., 2020). However, combining an MSE loss with a translation objective and/or a denoising auto-encoding objective could also prevent collapse from happening, as the model is forced to keep embeddings distinct to encode and decode different sentences.

Decoder finetuning. Duquenne et al. (2022b) demonstrated that learning deep decoders for an existing sentence embedding space (in their case, LASER) can significantly improve translation and auto-encoding performance. While keeping the existing embedding space unchanged, such decoders greatly improve the decoding of sentence embeddings, therefore significantly improving auto-encoding and translation performance when combined with compatible encoders. This is of great interest for zero-shot (possibly cross-modal) translation, as shown by Duquenne et al. (2023). In this paper, we introduce a decoder fine-tuning method called random interpolation decoding. Starting from a trained encoder-decoder model with a bottleneck representation between the encoder and the decoder, we freeze the encoder weights and fine-tune the decoder weights only, on a specific decoding task: given a bitext x, y, we encode x and y with the frozen encoder, randomly draw z as a random interpolation of the x and y embeddings, and learn to decode the sentence embedding z into y. This can be viewed as a continuous combination of the translation and auto-encoding tasks.
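The following is a minimal sketch of one such fine-tuning step, directly transcribing the description above; the `encoder` and `decoder` interfaces are hypothetical stand-ins, not the actual training code.

```python
import torch

def random_interpolation_step(encoder, decoder, src_tokens, tgt_tokens):
    """One sketched step of random interpolation decoding.

    encoder: frozen; maps a token batch to fixed-size embeddings (batch, dim).
    decoder: trainable; assumed to return the cross-entropy loss of
             generating `tgt_tokens` from a single sentence embedding.
    """
    with torch.no_grad():                # the encoder weights stay frozen
        emb_src = encoder(src_tokens)    # (batch, dim)
        emb_tgt = encoder(tgt_tokens)    # (batch, dim)

    # Randomly interpolate the two embeddings, one coefficient per sentence.
    lam = torch.rand(emb_src.size(0), 1, device=emb_src.device)
    z = lam * emb_src + (1.0 - lam) * emb_tgt

    # lam = 1 recovers pure translation, lam = 0 pure auto-encoding of the
    # target; intermediate values continuously mix the two tasks.
    return decoder(z, tgt_tokens)
```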
3.2 Multilingual sentence representations for speech

Duquenne et al. (2021) introduced the first semantic sentence embeddings for multilingual speech. Their method follows a teacher-student approach, where the teacher model is an encoder for multilingual sentence embeddings trained on text. We follow the same approach, but using our newly trained text sentence embedding space as teacher: we trained speech student encoders to encode audio into fixed-size representations, minimizing the MSE loss between the transcription sentence embeddings and the speech sentence embeddings.
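A minimal sketch of this teacher-student objective is shown below, assuming a frozen text teacher; the encoder interfaces are hypothetical.

```python
import torch
import torch.nn.functional as F

def speech_distillation_loss(speech_encoder, text_teacher,
                             audio_features, transcription_tokens):
    """Sketch of one teacher-student step: the speech student is trained to
    land on the frozen text teacher's embedding of the transcription."""
    with torch.no_grad():                            # teacher stays frozen
        target = text_teacher(transcription_tokens)  # (batch, dim)
    student = speech_encoder(audio_features)         # (batch, dim)
    return F.mse_loss(student, target)
```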
Written translation embeddings could also be used as targets in this teacher-student approach (Duquenne et al., 2021). However, in this work, we only focus on transcriptions as targets; using written translations is left for future work. As in previous work, we leveraged self-supervised pre-trained models for our speech encoder training, using a w2v-bert pretrained model as initialization.

4 Evaluations

To evaluate the semantic properties of the resulting sentence embedding space, we relied on a number of evaluation tasks on both text and speech modalities.

4.1 Evaluations on text

xsim. Cross-lingual similarity search, also called xsim,² evaluates the similarity between sentence embeddings across languages. Given a test dataset of bitexts, translations are encoded into the multilingual sentence embedding space and cosine similarities between all embeddings are computed. For each test instance, if the two corresponding translations are not the closest, we count it as an error, in order to compute an error rate on the whole test set.
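As an illustration, a simplified sketch of this error rate is given below; note that the official implementation (see footnote 2) also supports margin-based scoring, whereas this sketch uses plain cosine nearest-neighbour search.

```python
import torch
import torch.nn.functional as F

def xsim_error_rate(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> float:
    """src_emb[i] and tgt_emb[i] embed the two sides of the i-th bitext pair.

    An instance counts as an error when the true translation is not the
    nearest neighbour under cosine similarity."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    sim = src @ tgt.T                # (n, n) cosine similarities
    nearest = sim.argmax(dim=-1)     # index of the closest target
    gold = torch.arange(src.size(0)) # the aligned translation
    return (nearest != gold).float().mean().item()
```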
xsim++. More recently, xsim++ was introduced as a more semantically challenging similarity search task (Chen et al., 2023).² It augments the test set with hard negative examples for the similarity search, generating several modified versions of ground-truth examples based on causality alternation, entity replacement and number replacement.

Translation tasks. Multilingual embeddings are decoded into other target languages to perform MT. We report spBLEU (flores200) scores and COMET scores on the generated translations. Decoding sentence embeddings into other languages partially evaluates how much information is encoded in sentence embeddings, which is complementary to the xsim and xsim++ evaluations. However, please note that information may also be restored from the internal language modeling capabilities of the decoder, and not from the sentence embeddings themselves.

Auto-encoding task. Similarly to translation tasks, we decode sentence embeddings in the same language to perform auto-encoding and evaluate the content preservation of this operation.

All these text evaluations were performed on the FLORES-200 devtest set,³ which provides an N-way parallel corpus of translations in 200 languages.

4.2 Evaluations on speech

xsim for speech. We follow Duquenne et al. (2021) and calculate cross-modal and cross-lingual similarity search on the FLEURS speech translation test set (Conneau et al., 2023). It follows the xsim evaluation presented above, but xsim is run on speech embeddings against English text translation embeddings.

xsim++ for speech. In addition to the xsim computation for speech, we augment the English texts with challenging negative examples from the xsim++ modified English sentences of FLORES.

Zero-shot speech-to-text translation. Following Duquenne et al. (2022b), speech student encoders can be combined with text decoders at inference time. Since the speech encoders were trained on ASR data only and the SONAR text decoder was only trained on text and has never seen speech embeddings during training, this corresponds to zero-shot speech-to-text translation. Similarly to text, it enables evaluating the content encoded in the speech embeddings. It also evaluates the compatibility between speech and text representations.

Zero-shot Automatic Speech Recognition. We also decode speech embeddings in the same language to perform speech recognition.

All these speech evaluations were performed on the FLEURS test set (Conneau et al., 2023), an N-way parallel speech dataset in 102 languages built on top of the text FLORES-101 benchmark.

²https://github.com/facebookresearch/LASER
³https://github.com/facebookresearch/flores/tree/main/flores200

5 Experiments on text

In this paper, we first trained a multilingual sentence embedding space using an encoder-decoder architecture on text, with a fixed-size representation of sentences between the encoder and the decoder.
Method                                            X-eng↑  eng-X↑  AE↑    xsim↓  xsim++↓
L_MT                                              33.2    21.1    28.6   1.3    19.6
L_MT + L_AE                                       17.6    18.6    94.6   15.9   65.7
L_MT + 0.1·L_DAE                                  31.6    20.9    41.6   2.6    26.2
L_MT + 0.1·L_MSE                                  31.7    20.2    27.2   1.3    14.3
SONAR sentence embedding space
L_MT + 0.1·L_MSE + 0.01·L_DAE                     32.9    20.7    32.4   1.4    15.2
L_MT + 0.1·L_MSE + 0.01·L_DAE & fine-tuned dec.   32.7    21.6    41.7   1.4    15.2
MT topline
NLLB 1B                                           35.2    24.9    39.0*  3.7*   49.6*
Similarity search baselines
LaBSE                                             —       —       —      10.7   36.1
LASER3                                            —       —       —      5.1    36.4

Table 1: Text evaluations on FLORES200 devtest set, averaged over the 200 languages supported by NLLB 1B: translation spBLEU for X-eng and eng-X directions, auto-encoding spBLEU, and xsim and xsim++ similarity search results on X-eng pairs. Results with * are zero-shot evaluations of the NLLB 1B model, which was not trained to optimize these tasks.

5.1 Training setup

We initialized our model with the NLLB 1B dense model (NLLB Team et al., 2022), which was trained on translation tasks with full cross-attention on variable-length encoder outputs, as is commonly done for sequence-to-sequence MT model training. The model is composed of a 24-layer Transformer encoder and a 24-layer Transformer decoder, and was trained on a combination of human-labeled data, back-translated data and mined data (NLLB Team et al., 2022). In order to build our fixed-size sentence representation, we added a pooling operation on the encoder outputs. Several pooling methods are possible: max-pooling as done in (Artetxe and Schwenk, 2019), mean-pooling as done in (Reimers and Gurevych, 2019), or EOS pooling, which uses the output representation of the EOS special token appended at the end of sentences during NLLB training. Contrary to mean-pooling or EOS-pooling, max-pooling outputs a vector with a different range of values compared to NLLB training (due to the max operation), leading to worse results in our initial experiments. Since the training happened to be unstable for EOS-pooling in initial experiments, we focused on mean-pooling for the rest of our experiments. We trained our encoder-decoder model for 100k updates with the same learning rate and batch size as NLLB training in the following experiments, unless explicitly specified. We used all the bitext data used in NLLB training: human-labeled bitexts, back-translated data and mined data. This training dataset involves 200 target languages, which contrasts with LASER training, which only used English and Spanish as target languages.

As presented in Section 3, we ran an extensive study on the use of different training objectives, namely a translation objective (MT), an auto-encoding objective (AE), a denoising auto-encoding objective (DAE) and a Mean Squared Error loss (MSE) in the sentence embedding space:

L = L_MT + α·L_MSE + β·L_AE/DAE

We use the same training data for the auto-encoding and translation objectives, inputting the target sentences instead of the source sentences to perform auto-encoding of target sentences only. Incorporating more monolingual data in the training for the auto-encoding task is left to future work.
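A sketch of how these objectives could be combined in a single training step is given below; `model.encode`, `model.decode_loss` and `noise_fn` are hypothetical interfaces used only to mirror the formula above, not the actual training code.

```python
import torch.nn.functional as F

def combined_loss(model, noise_fn, src_tokens, tgt_tokens,
                  alpha: float = 0.1, beta: float = 0.01):
    """Sketch of L = L_MT + alpha * L_MSE + beta * L_DAE for one batch.

    model.encode(tokens)       -> fixed-size embeddings (batch, dim)
    model.decode_loss(emb, y)  -> cross-entropy of generating y from emb
    noise_fn(tokens)           -> corrupted tokens (e.g. span masking)
    """
    emb_src = model.encode(src_tokens)
    emb_tgt = model.encode(tgt_tokens)

    loss_mt = model.decode_loss(emb_src, tgt_tokens)  # translation objective
    loss_mse = F.mse_loss(emb_src, emb_tgt)           # align translations
    # Denoising auto-encoding of the target sentence only, as in the text.
    loss_dae = model.decode_loss(model.encode(noise_fn(tgt_tokens)),
                                 tgt_tokens)
    return loss_mt + alpha * loss_mse + beta * loss_dae
```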
5.2 Initial experiment with translation objective only

We report the results of our experiments on text sentence embedding modeling in Table 1. Our first experiment, using only the translation objective for our encoder-decoder model with a fixed-size intermediate representation, gives surprisingly good translation performance given the bottleneck between the encoder and the decoder: it yields -2 BLEU on the X-eng direction and -3.8 BLEU on the eng-X direction compared to the NLLB 1B model with full cross-attention.

We notice that the auto-encoding evaluation (AE) significantly lags behind the NLLB 1B model. This result may come from an inductive bias of the sequence-to-sequence architecture with full cross-attention, which could bias the model towards copying encoder inputs.

The xsim and xsim++ results are significantly better compared to previous work, namely LaBSE and LASER3, on our 200 languages of focus, with approximately 45% relative reduction of the xsim++ error rate compared to the baseline models. Note that averaging NLLB 1B encoder outputs to perform similarity search already gives good xsim scores. This directly comes from the translation objective used during NLLB 1B training, which encourages encoding multilingual sentences in similar ways for efficient cross-lingual transfer. However, the more difficult xsim++ evaluation remains challenging, in this zero-shot setting, for the original NLLB 1B model.

5.3 Experiments with auto-encoding objectives

Noticing the gap in auto-encoding performance between the fixed-size bottleneck encoder-decoder model and NLLB 1B, we integrated an auto-encoding objective, hoping to close the gap with the NLLB 1B model. This model was only trained for 50k steps, as it converged quickly compared to other variants. We notice that the auto-encoding task is easy to learn, even with a fixed-size bottleneck between the encoder and the decoder, almost reaching 95 BLEU on average over the 200 languages of NLLB. This shows that a lot of information can be efficiently stored in a fixed-size representation and that the bottleneck should not be seen as a hard limitation. While the translation performance on eng-X directions is not much impacted, we see a big drop in translation performance for X-eng directions (-15.6 BLEU) compared to the fixed-size bottleneck encoder-decoder model trained only on a translation task. Moreover, we see a big drop in both xsim and xsim++ evaluations, showing that the model is no longer learning language-agnostic representations, due to the auto-encoding objective, which seems more easily optimized than the translation objective.

To mitigate the negative effects of the auto-encoding objective while improving auto-encoding performance at inference time, we switched to a denoising auto-encoding criterion, to prevent the model from overfitting on the copy task. This also makes the task harder compared to simple auto-encoding, so it can be better combined with the non-trivial translation task. We also scaled down this denoising auto-encoding objective by a factor of 0.1. This mostly mitigated the performance drops on translation tasks, while significantly boosting the auto-encoding task (+13 BLEU) compared to the translation-only model. However, the denoising auto-encoding criterion significantly affects the xsim and xsim++ scores. This again shows that auto-encoding affects the organization of the sentence embedding space, learning distinct representations for different languages to optimize auto-encoding.

5.4 Experiments with cross-lingual similarity objective

Motivated by the recent distillation approaches that extend a sentence embedding space to new languages by explicitly aligning languages with an MSE criterion in the embedding space, we explored the use of an auxiliary MSE loss in the sentence embedding space. This is in addition to the translation loss, with the hope of explicitly bringing translations closer in the embedding space. In Table 1, we notice that this new constraint degrades the decoding performance of the encoder-decoder model for both translation and auto-encoding tasks. However, it significantly boosts the xsim++ scores compared to the encoder-decoder model trained only on a translation task, with a -5.3 xsim++ error rate reduction.

Method                    X-eng   eng-X
SONAR                     85.9    83.4
SONAR & fine-tuned dec.   85.9    84.2
Topline
NLLB 1B                   86.5    85.2

Table 2: Translation evaluations for X-eng and eng-X directions on FLORES200 devtest set: COMET scores averaged over the 89 languages supported by both the COMET and NLLB 1B models.

5.5 Training the SONAR embedding space
Based on the conclusions drawn from the previously trained models, we combined the translation loss, the auxiliary MSE loss and the denoising auto-encoding loss to create the SONAR embedding space. In this run, the denoising auto-encoding loss is further downscaled, motivated by the high xsim++ score of the previous sentence embedding space trained with denoising auto-encoding. First, following the same tendency as the previous training with a (denoising) auto-encoding objective, we notice a slight degradation in xsim++ scores when adding the denoising auto-encoding on top of the MSE loss. However, this degradation is only 0.9%, which can be considered acceptable. Initial experiments with larger scaling factors for the denoising auto-encoding criterion further increased, as expected, the xsim++ degradation, and we thus decided to stick with a 0.01 scaling factor for the denoising auto-encoding objective. On the other hand, for our new SONAR model, we see improvements on translation tasks compared to the model trained on the MT and MSE losses. This may be due to efficient mitigation of the collapse that could happen with the MSE loss, thanks to the denoising auto-encoding objective. We also see big improvements in the auto-encoding task (>+3.8 BLEU) compared to all models not trained with auto-encoding objectives. This variant seems to be the best setup in terms of sentence embedding space organization (following xsim and xsim++ scores) and decoding performance (following translation and auto-encoding evaluations).

We also report the xsim and xsim++ results on the intersection of languages handled by LaBSE, LASER3 and SONAR in Table 4, and notice again that SONAR outperforms previous state-of-the-art sentence embedding spaces for multilingual similarity search.

                          fra    spa    swh    rus
X-eng BLEU
SONAR & fine-tuned dec.   46.1   34.5   42.4   37.1
LASER3 MSE & T-mod.       40.4   29.6   27.2   29.7
xsim++
SONAR                     4.8    7.9    7.1    6.5
LASER3 MSE                7.6    12.6   15.2   12.4

Table 3: Comparison to the T-modules framework based on the LASER embedding space: spBLEU scores for X-eng translation directions and xsim++ for X-eng pairs, both on FLORES200 devtest set.

          98 languages
          xsim↓   xsim++↓
SONAR     0.1     9.3
LASER3    1.1     27.5
LaBSE     1.5     15.4

Table 4: Comparison of similarity search results (error rates) on the intersection of languages handled by LaBSE, LASER3 and SONAR.

Finally, we tried to improve the decoding performance of our architecture, freezing the embedding space and our multilingual encoder while fine-tuning only the decoder. We used the random interpolation decoding method introduced in Section 3, where we compute a random interpolation of the source and target sentence embeddings and learn to decode the target sentence tokens. As the encoder is frozen, the xsim and xsim++ scores do not change, but the decoding results do. With this decoder fine-tuning step, we notice similar translation results on the X-eng direction, while observing a +0.9 BLEU gain on the eng-X translation directions. More importantly, the auto-encoding performance is boosted by 9.3 BLEU with the decoder fine-tuning method, while the sentence embedding space is not affected. This fine-tuning step is trained for 50k additional steps.

We also evaluated the best-performing models on translation tasks with COMET, which has proven to correlate better with human judgments than BLEU scores. We evaluated the X-eng and eng-X directions involving the languages on which XLM-R was trained, which are the languages supported by COMET (see Table 2). We see less than 1 point of difference between our SONAR encoder-decoder model (with fine-tuned decoder) and the NLLB 1B model for both eng-X and X-eng directions, showing the good quality of the translations.

The NLLB 1B model still represents a topline, and to evaluate our SONAR framework against a fairer baseline involving a fixed-size sentence representation between the encoder and the decoder, we compared our results to the decoding of LASER embeddings, recently introduced in T-modules (Duquenne et al., 2022b, 2023). As LASER3 encoders were trained with a cosine loss, their sentence embeddings cannot be efficiently decoded with the T-modules decoder. This is why we trained new LASER3 encoders with an MSE loss, and added back-translated data from the NLLB project in addition to the original training data of the LASER3 encoders. These newly trained LASER3 MSE encoders can be combined with the T-modules decoder (Duquenne et al., 2023) to perform X-eng translation.
We report the results for four languages, French, Spanish, Swahili and Russian, in Table 3, and notice big improvements using SONAR on both the X-eng translation task and the xsim++ evaluation. Please note that, compared to previous work (Duquenne et al., 2022b), we are able to encode and decode 200 languages with a single encoder and a single decoder.

6 Experiments on speech

Based on the experiments and evaluations of multilingual sentence embedding spaces for text, we chose to focus only on the embedding space learnt with the translation, denoising auto-encoding and MSE objectives, which seems to be a good trade-off between good semantic representation (xsim and xsim++) and good decoding performance (translation and auto-encoding). We follow a teacher-student approach to extend this space to the speech modality for several languages. We first performed an initial extensive study on five languages only: English (eng), Spanish (spa), French (fra), Russian (rus) and Swahili (swh). We then scale to 37 languages.

6.1 Experiments on 5 languages

We use a pre-trained w2v-bert model with 600 million parameters to initialize the speech encoders and train them on the Common Voice 12 ASR training set (Ardila et al., 2019). For our English speech encoder, we also used ASR training data from MuST-C (Di Gangi et al., 2019), VoxPopuli (Wang et al., 2021) and Librispeech (Panayotov et al., 2015). We tested different pooling methods, namely mean-pooling, max-pooling and attention-pooling. Attention-pooling is performed with a three-layer transformer decoder architecture with cross-attention on the speech encoder outputs, in order to output a single vector as our speech sentence embedding. Best results are achieved with attention-pooling (see Table 5).

BLEU                      fra-eng   spa-eng
SONAR mean-pooling        25.2      20.6
SONAR max-pooling         31.6      24.5
SONAR attention-pooling   33.3      25.5

Table 5: spBLEU X-eng zero-shot speech translation on FLEURS test set for different pooling methods.
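A minimal PyTorch sketch of such an attention-pooling head is shown below: a small transformer decoder whose single learned query cross-attends to the speech encoder outputs and returns one vector per utterance. Hyper-parameters and shapes are illustrative assumptions, not the trained configuration.

```python
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    """Three-layer transformer decoder pooling frame-level speech states
    into a single fixed-size sentence embedding."""

    def __init__(self, dim: int = 1024, num_layers: int = 3,
                 num_heads: int = 16):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one learned query

    def forward(self, speech_states: torch.Tensor) -> torch.Tensor:
        # speech_states: (batch, frames, dim) from the speech encoder
        query = self.query.expand(speech_states.size(0), -1, -1)
        pooled = self.decoder(tgt=query, memory=speech_states)
        return pooled.squeeze(1)                           # (batch, dim)

pooler = AttentionPooler()
states = torch.randn(2, 200, 1024)  # two utterances, 200 frames each
embedding = pooler(states)          # (2, 1024)
```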
As a baseline, we compared our SONAR speech encoders to speech encoders trained with LASER as teacher (using our newly trained LASER3 MSE text encoders as teacher), with the exact same training data and pre-trained w2v-bert model. We report the xsim and xsim++ cross-lingual and cross-modal results in Table 6 on the FLEURS test set, for foreign speech embeddings against English text embeddings. Similarly to what Chen et al. (2023) noticed on FLORES, xsim scores saturate to a zero error rate on the FLEURS test set, not providing useful insights into the organization of the multimodal sentence em-

              fra    spa    swh    rus
xsim
SONAR         0.0    0.0    0.0    0.0
LASER3 MSE    0.0    0.0    0.0    0.3
xsim++
SONAR         12.3   13.9   22.8   24.6
LASER3 MSE    17.5   24.9   40.7   42.1

Table 6: Multilingual and multimodal similarity search evaluations on FLEURS test set: xsim and xsim++ error rates on speech translation X-eng pairs.

                          fra    spa    swh    rus
Training hours
SONAR/LASER ASR           0.8k   0.4k   0.3k   0.2k
Whisper ASR               10k    11k    0.01k  10k
Whisper ST                4k     7k     0.3k   8k
SONAR zero-shot ST
SONAR                     33.3   25.5   14.9   15.0
SONAR & fine-tuned dec.   33.4   24.8   15.6   14.6
Zero-shot ST baseline
LASER3 MSE & T-mod.       30.7   22.9   3.7    16.2
Supervised ST toplines
Whisper Large v1          33.8   27.0   5.2    30.2
Whisper Large v2          34.9   27.2   7.6    31.1

Table 7: spBLEU scores on FLEURS test set for zero-shot Speech Translation on X-eng directions.

src\tgt   eng    fra    spa    swh    rus    200 langs
eng       69.7   44.3   26.9   27.8   29.8   17.7
fra       33.4   64.1   21.5   18.2   23.3   13.4
spa       24.8   25.1   58.9   16.0   16.8   11.7
swh       15.6   13.5   9.0    25.7   9.8    7.0
rus       14.6   17.3   11.0   10.4   35.0   8.0

Table 8: spBLEU scores on FLEURS test set for zero-shot Speech Translation on {eng,fra,spa,swh,rus}-X directions. The last column is the average spBLEU Speech Translation score for decoding into the 200 languages supported by the SONAR text decoder.
                          eng     fra     spa     swh     rus
Training hours
SONAR/LASER ASR           4k      0.8k    0.4k    0.3k    0.2k
Whisper ASR               438k    10k     11k     0.01k   10k
Whisper ST                —       4k      7k      0.3k    8k
BLEU
SONAR                     64.7    54.3    50.0    17.7    29.1
SONAR & fine-tuned dec.   69.7    64.1    58.9    25.7    35.0
Whisper v1                80.8    79.8    84.8    26.9    84.3
Whisper v2                81.3    82.0    85.3    36.0    85.3
Bert-score
SONAR                     0.948   0.926   0.923   0.808   0.853
SONAR & fine-tuned dec.   0.951   0.939   0.936   0.831   0.870
Whisper v1                0.972   0.965   0.977   0.837   0.975
Whisper v2                0.972   0.969   0.979   0.865   0.978

Table 9: Speech recognition spBLEU scores and Bert-scores on FLEURS test set.

bedding space. Therefore, we also report xsim++ scores, augmenting the FLEURS test set with the hard negatives generated by Chen et al. (2023) based on FLORES, which composes the source sentences of FLEURS. We notice a 41% relative xsim++ reduction when switching from LASER as teacher to SONAR as teacher.

Following Duquenne et al. (2022b), we decoded the speech sentence embeddings with our SONAR text decoder, performing zero-shot speech-to-text translation. Indeed, the text decoder has never seen speech sentence embeddings during training. Moreover, speech representations were only trained to match their transcription representations, never translations. In Table 7, we report our zero-shot speech-to-text translation results on the FLEURS test set for X-eng directions and compare them to the baseline trained on the LASER space. We also report the state-of-the-art results for speech-to-text translation, trained in a supervised way on significantly more training data. First, we notice big improvements in BLEU scores compared to the LASER baseline on French, Spanish and Swahili, with an average 5.5 BLEU gain on these languages, while being slightly behind on Russian-to-English translation (-1.2 BLEU). This last result is surprising, as our SONAR speech encoder has a much better xsim++ score on Russian than the LASER speech encoder. Second, we notice that for our two high-resource languages, namely French and Spanish, our zero-shot speech-to-text results are close to the Whisper Large v1 supervised results, while being trained on much less training data. As for Swahili, our framework significantly outperforms the Whisper models. We notice much better Russian-to-English results for Whisper, which was expected given the amount of training data and the supervised setting.

Thanks to the compatibility across modalities and across languages, we decoded English, French, Spanish, Swahili and Russian speech sentence embeddings into the 200 text languages supported by our SONAR decoders. We report the zero-shot speech translation results using the fine-tuned SONAR decoder in Table 8. We notice that BLEU scores remain high for languages other than English, still in a zero-shot setting, highlighting again the compatibility between representations.

Finally, speech embeddings can be decoded into text in the same language, which can be seen as speech transcription. Since our model can often paraphrase transcriptions, we report in Table 9 BLEU scores as well as Bert-scores for this zero-shot transcription task. While being significantly behind on BLEU scores, which is expected as our model often paraphrases transcriptions, we see a much smaller gap with Whisper transcriptions on the Bert-score metric (which still favors exact transcriptions over paraphrases, but less so than BLEU). The amount of training data is also significantly different between the two setups, but it is interesting to note that the gap in terms of Bert-score remains reasonable.
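To summarize the modularity exploited throughout this section, here is a minimal sketch of the zero-shot inference pipeline: a speech encoder and a text decoder that were never trained together are simply chained through the shared embedding space. The interfaces are illustrative assumptions, not the released SONAR API.

```python
import torch

def zero_shot_speech_translation(speech_encoder, text_decoder,
                                 audio_features, tgt_lang: str):
    """Sketch: encode speech into the shared sentence embedding space, then
    decode with a text decoder that has never seen speech embeddings."""
    with torch.no_grad():
        embedding = speech_encoder(audio_features)  # (batch, dim)
        # Setting tgt_lang to the source language would instead yield the
        # zero-shot transcription task evaluated in Table 9.
        return text_decoder.generate(embedding, tgt_lang=tgt_lang)
```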
6.2 Scaling to 37 languages

We use the same recipe as described above to extend the coverage of the speech encoders to 37 languages. These speech encoders were trained per linguistic language family, e.g. Romance or Indian languages, using speech transcriptions only, from public and licensed sources. The "Train" column of Table 10 gives statistics on the amount of training data. As in Section 6.1, we evaluate the speech encoders by connecting them to the SONAR text decoder and calculating speech-to-text translation performance, as measured by BLEU. Although our results are fully zero-shot speech translation, we achieve very competitive performance compared to the state-of-the-art model Whisper v2 large (Radford et al., 2022). The average BLEU score is slightly better for SONAR than for Whisper, while being zero-shot speech translation. Our model performs less well on some high-resource languages like Mandarin Chinese, German or French, but outperforms Whisper for others like Spanish or Dutch, and for several less common languages, like Swahili or Uzbek. Our modular approach seems to achieve particularly good results on Indian languages: Bengali, Hindi, Kannada, Telugu, Tamil and Urdu.

ISO   Language           Train   Ours   Whisper
arb   MS Arabic          822     30.9   26.9
ben   Bengali            335     21.3   14.1
cat   Catalan            1738    37.7   36.9
cmn   Mandarin Chinese   9320    18.6   20.8
ces   Czech              181     32.0   30.3
cym   Welsh              99      14.5   13.4
dan   Danish             115     34.9   36.0
deu   German             3329    36.2   38.8
est   Estonian           131     26.1   21.2
fin   Finnish            184     24.9   25.2
fra   French             2057    33.7   34.9
hin   Hindi              150     22.6   24.2
ind   Indonesian         269     28.7   31.9
ita   Italian            588     29.3   27.5
jpn   Japanese           17319   20.2   20.8
kan   Kannada            114     21.4   13.1
kor   Korean             316     17.1   24.2
mlt   Maltese            106     24.4   16.2
nld   Dutch              1723    29.3   28.4
pes   Western Persian    386     24.4   20.9
por   Portuguese         269     38.3   41.4
pol   Polish             304     21.1   25.8
ron   Romanian           135     34.7   34.1
rus   Russian            259     28.4   31.1
slk   Slovak             102     32.3   29.3
spa   Spanish            1511    28.0   27.2
swh   Swahili            361     23.5   7.6
tam   Tamil              245     16.2   10.0
tel   Telugu             84      18.0   14.7
tgl   Tagalog            108     14.6   26.8
tha   Thai               195     16.9   17.8
tur   Turkish            174     22.7   29.9
ukr   Ukrainian          105     30.7   32.5
urd   Urdu               185     19.7   18.1
uzn   Uzbek              115     20.0   6.6
vie   Vietnamese         194     19.1   21.9
Total/avg                43628   25.3   24.5

Table 10: spBLEU evaluation of S2T into English on FLEURS test set. Our models were trained on ASR transcriptions only, compared to Whisper large v2.

7 Discussion

From all the experiments presented in Section 5 and Section 6, we can draw a couple of high-level conclusions.
objective is well suited to build language-agnostic
representations while making sure that the encoder
model encodes enough information in the sentence tive and the mutual compatibility between speech
embedding to be efficiently decoded (in another and text multilingual embeddings is greatly high-
language). Adding an MSE loss in the training, lighted by the fact that speech embeddings can be
which explicitly encourages to align languages decoded in foreign text in a zero-shot way.
in the sentence embedding space, leads to better
language-agnostic representations. Moreover, de- 8 Conclusion
noising auto-encoding combined with MSE loss,
can bring gains for decoding tasks, but too much To conclude, we introduced a new multilingual
of it affects the language-agnostic representations. and multimodal sentence embedding space called
Finally, teacher-student approach to extend to the S ONAR. We conducted an extensive study on ob-
speech modality has once again proven to be effec- jective functions to build our multilingual teacher
sentence embedding space for text, and an ex- References
tensive evaluation of our S ONAR framework for
both similarity search and decoding tasks. We ex- Rosana Ardila, Megan Branson, Kelly Davis,
tended this new text sentence embedding space Michael Henretty, Michael Kohler, Josh Meyer,
to the speech modality to introduce Sentence- Reuben Morais, Lindsay Saunders, Francis M
level multimOdal and laNguage-Agnostic Rep- Tyers, and Gregor Weber. 2019. Common
resentations (S ONAR). The S ONAR text and voice: A massively-multilingual speech corpus.
speech encoders as well as the text decoders arXiv preprint arXiv:1912.06670.
are freely available at https://github.com/
Mikel Artetxe and Holger Schwenk. 2019. Mas-
facebookresearch/SONAR.
sively multilingual sentence embeddings for
9 Acknowledgment zero-shot cross-lingual transfer and beyond.
TACL, pages 597–610.
We would like to thank Kevin Heffernan for his
help on providing xsim and xsim++ baselines Arun Babu, Changhan Wang, Andros Tjan-
for LaBSE and L ASER 3, Andy Chung for pro- dra, Kushal Lakhotia, Qiantong Xu, Naman
viding the w2v-bert pre-trained models used as Goyal, Kritika Singh, Patrick von Platen,
initialization for our speech encoders, Changhan Yatharth Saraf, Juan Pino, et al. 2021. Xls-
Wang for providing speech data manifests used r: Self-supervised cross-lingual speech repre-
for training and Artyom Kozhevnikov and Naji El sentation learning at scale. arXiv preprint
Hachem for the migration of models to fairseq2 arXiv:2111.09296.
for open-sourcing.
The last author’s contribution was partly funded Alexei Baevski, Yuhao Zhou, Abdelrahman Mo-
by his chair in the PRAIRIE institute funded by the hamed, and Michael Auli. 2020. wav2vec
French national agency ANR as part of the “In- 2.0: A framework for self-supervised learning
vestissements d’avenir” programme under the ref- of speech representations. Advances in Neu-
erence ANR-19-P3IA-0001. ral Information Processing Systems, 33:12449–
12460.

Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, and Alexis Conneau. 2022. mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Mingda Chen, Kevin Heffernan, Onur Çelebi, Alex Mourachko, and Holger Schwenk. 2023. xsim++: An improved proxy to bitext mining performance for low-resource languages. arXiv preprint arXiv:2306.12907.

Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu. 2021. W2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 244–250. IEEE.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: A Multilingual Speech Translation Corpus. In NAACL, pages 2012–2017.

Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswani, Changhan Wang, Juan Pino, Benoît Sagot, and Holger Schwenk. 2022a. SpeechMatrix: A large-scale mined corpus of multilingual speech-to-speech translations. arXiv preprint arXiv:2211.04508.

Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, and Holger Schwenk. 2022b. T-modules: Translation modules for zero-shot cross-modal machine translation. arXiv preprint arXiv:2205.12216.

Paul-Ambroise Duquenne, Hongyu Gong, and Holger Schwenk. 2021. Multimodal and multilingual embeddings for large-scale speech mining. Advances in Neural Information Processing Systems, 34.

Paul-Ambroise Duquenne, Holger Schwenk, and Benoît Sagot. 2023. Modular speech-to-text translation for zero-shot cross-modal transfer. In Interspeech.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654.

Sameer Khurana, Antoine Laurent, and James Glass. 2022. SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation. arXiv preprint arXiv:2205.08180.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, and Alex Waibel. 2019. Improving zero-shot translation with language-independent constraints.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. Technical report, OpenAI. URL https://cdn.openai.com/papers/whisper.pdf.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In EMNLP, pages 4512–4525.

Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. 2021. CCMatrix: Mining billions of high-quality parallel sentences on the web. In ACL, pages 6490–6500.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 993–1003, Online. Association for Computational Linguistics.

Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. arXiv preprint arXiv:1902.08564.