SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
decoded into the same target language sentence. However, there is no guarantee of converging to this local minimum. Nothing explicitly constrains a sentence in one language and its translation in another language to be encoded close to one another. As a result, other local minima are possible, where translations are not encoded closely but are still decoded into the same sentence for a given target language. To mitigate this, shallow decoders were used by Artetxe and Schwenk (2019): a deeper decoder can more easily decode different points into the same sentence, whereas a shallower decoder is more likely to need two vectors to be very similar whenever they must be decoded into the same sentence.

Auto-encoding and denoising auto-encoding objective. Auto-encoders have been widely used to build representations. They have the advantage of encouraging the encoding of fine-grained details of the input. However, this objective by itself is not likely to learn semantic representations of sentences. Moreover, this objective is much simpler to learn than a translation objective, which makes the combination of the two objectives difficult. To mitigate these issues, Liu et al. (2020) introduce a denoising auto-encoding task, which has proven to be a good pre-training objective for translation tasks.

MSE loss objective in the sentence embedding space. Teacher-student approaches to multilingual sentence embedding space learning have shown that ensuring that translations of the same sentence are embedded close to one another in the sentence embedding space with an MSE loss works very well (Reimers and Gurevych, 2020; Heffernan et al., 2022). However, using this kind of loss without a frozen pre-existing teacher embedding space would lead to collapse (all inputs mapped to the same embedding), which is why contrastive learning methods were introduced to learn multilingual sentence embeddings from scratch (Feng et al., 2020). However, combining an MSE loss with a translation objective and/or a denoising auto-encoding objective could also prevent collapse from happening, as the model is forced to keep embeddings distinct to encode and decode different sentences.

Decoder finetuning. Duquenne et al. (2022b) demonstrated that learning deep decoders for an existing sentence embedding space (in their case, LASER) can significantly improve translation and auto-encoding performance. While keeping the existing embedding space unchanged, such decoders greatly improve the decoding of sentence embeddings, therefore significantly improving auto-encoding and translation performance when combined with compatible encoders. This is of great interest for zero-shot (possibly cross-modal) translation, as shown by Duquenne et al. (2023). In this paper, we introduce a decoder fine-tuning method called random interpolation decoding. Based on a trained encoder-decoder model with a bottleneck representation between the encoder and the decoder, we freeze the encoder weights and fine-tune the decoder weights only, on a specific decoding task: given a bitext (x, y), we encode x and y with the frozen encoder, randomly draw z as a random interpolation of the x and y embeddings, and learn to decode the sentence embedding z into y. This can be viewed as a continuous combination of the translation and auto-encoding tasks.
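As an illustration, a single fine-tuning step of random interpolation decoding could look like the following sketch, assuming generic `encoder` and `decoder` modules, shifted-target token tensors and an optimizer built over the decoder parameters only; the names and shapes are illustrative placeholders, not the actual SONAR training code.

```python
# Sketch of one random interpolation decoding step (illustrative, not the
# actual SONAR implementation): the encoder is frozen, the decoder learns to
# decode a random mix of source and target embeddings into the target tokens.
import torch
import torch.nn.functional as F

def random_interpolation_step(encoder, decoder, src_tokens, tgt_tokens, optimizer):
    with torch.no_grad():                       # frozen encoder
        z_src = encoder(src_tokens)             # (batch, dim) sentence embedding of x
        z_tgt = encoder(tgt_tokens)             # (batch, dim) sentence embedding of y
    # One mixing coefficient per sentence: lam = 1 -> pure translation (decode y
    # from the embedding of x), lam = 0 -> pure auto-encoding of y.
    lam = torch.rand(z_src.size(0), 1, device=z_src.device)
    z = lam * z_src + (1.0 - lam) * z_tgt
    logits = decoder(z, tgt_tokens[:, :-1])     # teacher forcing on the target y
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_tokens[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                             # gradients flow into the decoder only
    optimizer.step()
    return loss.item()
```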
3.2 Multilingual sentence representations for speech

Duquenne et al. (2021) introduced the first semantic sentence embeddings for multilingual speech. Their method follows a teacher-student approach, where the teacher model is an encoder for multilingual sentence embeddings trained on text. We follow the same approach but use our newly trained text sentence embedding space as teacher: we train a speech student encoder to encode audio into fixed-size representations and minimize the MSE loss between the transcription sentence embeddings and the trained speech sentence embeddings. Written translation embeddings could also be used as targets in this teacher-student approach (Duquenne et al., 2021). However, in this work we only focus on transcriptions as targets; using written translations is left for future work. As in previous work, we leveraged self-supervised pre-trained models for our speech encoder training, using a w2v-bert pretrained model as initialization.
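A minimal sketch of this teacher-student objective is shown below, assuming a frozen `text_teacher` encoder that produces the target embeddings and a trainable `speech_student` encoder; all module and variable names are illustrative placeholders, not the actual training code.

```python
# Sketch of the speech teacher-student step (illustrative): the speech student
# is trained to place its pooled embedding close to the frozen text teacher's
# embedding of the transcription, using an MSE loss.
import torch
import torch.nn.functional as F

def distillation_step(speech_student, text_teacher, audio_features, transcript_tokens, optimizer):
    with torch.no_grad():
        target = text_teacher(transcript_tokens)   # (batch, dim) frozen teacher embedding
    pred = speech_student(audio_features)          # (batch, dim) pooled speech embedding
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```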
4 Evaluations

To evaluate the semantic properties of the resulting sentence embedding space, we relied on a number of evaluation tasks on both the text and speech modalities:

4.1 Evaluations on text

xsim. Cross-lingual similarity search, also called xsim,2 evaluates the similarity between sentence embeddings across languages. Given a test dataset of bitexts, translations are encoded into the multilingual sentence embedding space and cosine similarities between all embeddings are computed. For each test instance, if the two corresponding translations are not the closest, we count it as an error, in order to compute an error rate on the whole test set.
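For illustration, this error rate can be computed as in the following sketch, given two aligned embedding matrices for an N-way parallel test set; this is a schematic version of the metric, not the reference implementation from the LASER repository.

```python
# Sketch of the xsim error rate (illustrative): for every source sentence, the
# correct translation must be its nearest neighbour under cosine similarity.
import torch
import torch.nn.functional as F

def xsim_error_rate(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> float:
    src = F.normalize(src_emb, dim=-1)             # (N, dim) embeddings, row i aligned
    tgt = F.normalize(tgt_emb, dim=-1)             # with row i of the other language
    sim = src @ tgt.T                              # (N, N) cosine similarity matrix
    nearest = sim.argmax(dim=-1)                   # closest target for each source
    gold = torch.arange(src.size(0), device=src.device)
    return 100.0 * (nearest != gold).float().mean().item()
```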
xsim++. More recently, xsim++ was introduced as a more semantically challenging similarity search task (Chen et al., 2023).2 It augments the test set with hard negative examples for the similarity search, generating several modified versions of the ground-truth examples based on causality alternation, entity replacement and number replacement.

Translation tasks. Multilingual embeddings are decoded into other target languages to perform MT. We report spBLEU (flores200) scores and COMET scores on the generated translations. Decoding sentence embeddings into other languages partially evaluates how much information is encoded in the sentence embeddings, which is complementary to the xsim and xsim++ evaluations. However, please note that information may also be restored from the internal language modeling capabilities of the decoder, and not from the sentence embeddings themselves.

Auto-encoding task. Similarly to the translation tasks, we decode sentence embeddings in the same language to perform auto-encoding and evaluate the content preservation of this operation.

All these evaluations for text were performed on the FLORES-200 devtest set,3 which provides an N-way parallel corpus of translations in 200 languages.

4.2 Evaluations on speech

xsim for speech. We follow Duquenne et al. (2021) and calculate cross-modal and cross-lingual similarity search on the FLEURS speech translation test set (Conneau et al., 2023). It follows the xsim evaluation presented above, but xsim is run on speech embeddings against English text translation embeddings.

xsim++ for speech. In addition to the xsim computation for speech, we augment the English texts with challenging negative examples from the xsim++ modified English sentences of FLORES.

Zero-shot speech-to-text translation. Following Duquenne et al. (2022b), speech student encoders can be combined with text decoders at inference time. Since the speech encoders were trained on ASR data only and the SONAR text decoder was only trained on text and has never seen speech embeddings during training, this corresponds to zero-shot speech-to-text translation. Similarly to text, it enables evaluating the content encoding in the speech embeddings. It also evaluates the compatibility between speech and text representations.
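Schematically, this evaluation simply composes the two independently trained modules, as in the sketch below; `speech_encoder`, `text_decoder` and its `generate` method are illustrative placeholders rather than the actual SONAR API.

```python
# Sketch of zero-shot speech-to-text translation by composition (illustrative):
# the speech encoder produces a sentence embedding that the text decoder, which
# never saw speech embeddings during training, decodes into the target language.
import torch

@torch.no_grad()
def zero_shot_speech_translation(speech_encoder, text_decoder, audio_features, tgt_lang):
    sent_emb = speech_encoder(audio_features)      # (batch, dim) speech sentence embeddings
    return text_decoder.generate(sent_emb, tgt_lang=tgt_lang)
```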
Zero-shot Automatic Speech Recognition. We also decode speech embeddings in the same language to perform speech recognition.

All these evaluations for speech were performed on the FLEURS test set (Conneau et al., 2023), an N-way parallel speech dataset in 102 languages built on top of the text FLORES-101 benchmark.

2 https://github.com/facebookresearch/LASER
3 https://github.com/facebookresearch/flores/tree/main/flores200
Method | X-eng↑ | eng-X↑ | AE↑ | xsim↓ | xsim++↓
L_MT | 33.2 | 21.1 | 28.6 | 1.3 | 19.6
L_MT + L_AE | 17.6 | 18.6 | 94.6 | 15.9 | 65.7
L_MT + 0.1·L_DAE | 31.6 | 20.9 | 41.6 | 2.6 | 26.2
L_MT + 0.1·L_MSE | 31.7 | 20.2 | 27.2 | 1.3 | 14.3
SONAR sentence embedding space
L_MT + 0.1·L_MSE + 0.01·L_DAE | 32.9 | 20.7 | 32.4 | 1.4 | 15.2
L_MT + 0.1·L_MSE + 0.01·L_DAE & fine-tuned dec. | 32.7 | 21.6 | 41.7 | 1.4 | 15.2
MT topline
NLLB 1B | 35.2 | 24.9 | 39.0* | 3.7* | 49.6*
Similarity search baselines
LaBSE | — | — | — | 10.7 | 36.1
LASER3 | — | — | — | 5.1 | 36.4

Table 1: Text evaluations on FLORES-200 devtest set, averaged over the 200 languages supported by NLLB 1B: translation spBLEU for X-eng and eng-X directions, auto-encoding spBLEU, and xsim and xsim++ similarity search results on X-eng pairs. Results with * are zero-shot evaluations of the NLLB 1B model, which was not trained to optimize these tasks.
5.1 Training setup

We initialized our model with the NLLB 1B dense model (NLLB Team et al., 2022), which was trained on translation tasks with full cross-attention on variable-length encoder outputs, as is commonly done for sequence-to-sequence MT model training. The model is composed of a 24-layer Transformer encoder and a 24-layer Transformer decoder, and is trained on a combination of human-labeled data, back-translated data and mined data (NLLB Team et al., 2022). In order to build our fixed-size sentence representation, we added a pooling operation on the encoder outputs. Several pooling methods are possible: max-pooling as done in (Artetxe and Schwenk, 2019), mean-pooling as done in (Reimers and Gurevych, 2019), or EOS-pooling, which uses the output representation of the EOS special token appended at the end of sentences during NLLB training. Contrary to mean-pooling or EOS-pooling, max-pooling outputs a vector with a different range of values compared to NLLB training (due to the max operation), leading to worse results in our initial experiments. Since training with EOS-pooling happened to be unstable in initial experiments, we focused on mean-pooling for the rest of our experiments. We trained our encoder-decoder model for 100k updates with the same learning rate and batch size as NLLB training in the following experiments, unless explicitly specified. We used all bitext data used in the NLLB training: human-labeled bitexts, back-translated data and mined data. This training dataset involves 200 target languages, which contrasts with LASER training that only used English and Spanish as target languages.

As presented in Section 3, we ran an extensive study on the use of different training objectives, namely a translation objective (MT), an auto-encoding objective (AE), a denoising auto-encoding objective (DAE) and a Mean Squared Error loss (MSE) in the sentence embedding space:

L = L_MT + α · L_MSE + β · L_AE/DAE

We use the same training data for the auto-encoding and translation objectives, inputting the target sentences instead of the source sentences to perform auto-encoding of target sentences only. Incorporating more monolingual data in the training for the auto-encoding task is left to future work.
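The sketch below spells out this combined objective with mean-pooled encoder outputs as the fixed-size sentence embedding; the `encoder`, `decoder` and `noise` callables and the default scaling factors are illustrative placeholders, not the actual NLLB/SONAR training code.

```python
# Sketch of the combined objective L = L_MT + alpha * L_MSE + beta * L_DAE
# (illustrative): sentence embeddings are mean-pooled encoder outputs, and the
# MT and DAE terms are token-level cross-entropy losses through the bottleneck.
import torch
import torch.nn.functional as F

def combined_loss(encoder, decoder, src_tokens, tgt_tokens, noise, alpha=0.1, beta=0.01):
    def embed(tokens):
        states = encoder(tokens)                  # (batch, seq, dim) encoder outputs
        return states.mean(dim=1)                 # mean-pooling -> (batch, dim)

    def decode_loss(sent_emb, out_tokens):
        logits = decoder(sent_emb, out_tokens[:, :-1])   # teacher-forced decoding
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), out_tokens[:, 1:].reshape(-1)
        )

    z_src, z_tgt = embed(src_tokens), embed(tgt_tokens)
    l_mt = decode_loss(z_src, tgt_tokens)         # translation through the bottleneck
    l_mse = F.mse_loss(z_src, z_tgt)              # pull translations together
    l_dae = decode_loss(embed(noise(tgt_tokens)), tgt_tokens)  # denoising auto-encoding
    return l_mt + alpha * l_mse + beta * l_dae
```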
5.2 Initial experiment with translation objective only

We report the results of our experiments on text sentence embedding modeling in Table 1. Our first experiment, using only the translation objective for our encoder-decoder model with a fixed-size intermediate representation, gives surprisingly good translation performance given the bottleneck between the encoder and the decoder. It yields -2 BLEU in the X-eng direction and -3.8 BLEU in the eng-X direction compared to the NLLB 1B model with full cross-attention.
We notice that the auto-encoding evaluation (AE) significantly lags behind the NLLB 1B model. This result may come from an inductive bias of the sequence-to-sequence architecture with full cross-attention, which could bias the model towards copying encoder inputs.

The xsim and xsim++ results are significantly better compared to previous work, namely LaBSE and LASER3, on our 200 languages of focus, with approximately 45% relative reduction of the xsim++ error rate compared to the baseline models. Note that averaging NLLB 1B encoder outputs to perform similarity search already gives good xsim scores. This directly comes from the translation objective used during NLLB 1B training, which encourages encoding multilingual sentences in similar ways for efficient cross-lingual transfer. However, the more difficult xsim++ evaluation remains challenging, in this zero-shot setting, for the original NLLB 1B model.

Method | X-eng | eng-X
SONAR | 85.9 | 83.4
SONAR & fine-tuned dec. | 85.9 | 84.2
Topline
NLLB 1B | 86.5 | 85.2

Table 2: Translation evaluations for X-eng and eng-X directions on FLORES-200 devtest set: COMET scores averaged over the 89 languages supported by both the COMET and NLLB 1B models.

5.3 Experiments with auto-encoding objectives

Noticing the gap in auto-encoding performance between the fixed-size bottleneck encoder-decoder model and NLLB 1B, we integrated an auto-encoding objective, hoping to close the gap with the NLLB 1B model. This model was only trained for 50k steps, as it converged quickly compared to other variants. We notice that the auto-encoding task is easy to learn, even with a fixed-size bottleneck between the encoder and the decoder, almost reaching 95 BLEU on average over the 200 languages of NLLB. This shows that a lot of information can be efficiently stored in a fixed-size representation and that the bottleneck should not be seen as a hard limitation. While the translation performance in eng-X directions is not much impacted, we see a big drop in translation performance for X-eng directions (-15.6 BLEU) compared to the fixed-size bottleneck encoder-decoder model trained only on a translation task. Moreover, we see a big drop in both xsim and xsim++ evaluations, showing that the model no longer learns language-agnostic representations, due to the auto-encoding objective that seems more easily optimized than the translation objective.

To mitigate the negative effects of the auto-encoding objective, while improving the auto-encoding performance at inference time, we switched to a denoising auto-encoding criterion, to avoid the model overfitting on the copy task. That also makes the task harder than simple auto-encoding, so it can be better combined with the non-trivial translation task. We also scaled down this denoising auto-encoding objective by a factor of 0.1. This mostly mitigated the performance drops on translation tasks, while significantly boosting the auto-encoding task (+13 BLEU) compared to the translation-only model. However, the denoising auto-encoding criterion significantly affects the xsim and xsim++ scores. This again shows that auto-encoding affects the organization of the sentence embedding space, learning distinct representations for different languages to optimize auto-encoding.

5.4 Experiments with cross-lingual similarity objective

Motivated by the recent distillation approaches that extend a sentence embedding space to new languages by explicitly aligning languages with an MSE criterion in the embedding space, we explored the use of an auxiliary MSE loss in the sentence embedding space. This is in addition to the translation loss, with the hope of explicitly bringing translations closer together in the embedding space. In Table 1, we notice that this new constraint degrades the decoding performance of the encoder-decoder model for both translation and auto-encoding tasks. However, it significantly boosts the xsim++ scores compared to the encoder-decoder model trained only on a translation task, with a 5.3-point reduction in xsim++ error rate.

5.5 Training the SONAR embedding space

Based on the conclusions drawn from the previously trained models, we combined the translation loss, the auxiliary MSE loss and the denoising auto-encoding loss to create the SONAR embedding space.
In this run, the denoising auto-encoding loss is further downscaled, motivated by the high xsim++ score of the previously trained sentence embedding space trained with denoising auto-encoding. First, following the same tendency as the previous trainings with a (denoising) auto-encoding objective, we notice a slight degradation in xsim++ scores when adding the denoising auto-encoding in addition to the MSE loss. However, this degradation is only 0.9%, which can be considered acceptable. Initial experiments with larger scaling factors for the denoising auto-encoding criterion further increased, as expected, the xsim++ degradation, and we thus decided to stick with a 0.01 scaling factor for the denoising auto-encoding objective. On the other hand, for our new SONAR model, we see improvements on translation tasks compared to the model trained on the MT and MSE losses. This may be due to efficient mitigation of the collapse that could happen with the MSE loss, thanks to the denoising auto-encoding objective. We also see big improvements in the auto-encoding task (>+3.8 BLEU) compared to all models not trained with auto-encoding objectives. This variant seems to be the best setup in terms of sentence embedding space organization (following xsim and xsim++ scores) and decoding performance (following translation and auto-encoding evaluations).

We also report the xsim and xsim++ results on the intersection of languages handled by LaBSE, LASER3 and SONAR in Table 4, and notice again that SONAR outperforms previous state-of-the-art sentence embedding spaces for multilingual similarity search.

Finally, we tried to improve the decoding performance of our architecture, freezing the embedding space and our multilingual encoder while fine-tuning only the decoder. We used the random interpolation decoding method introduced in Section 3, where we compute a random interpolation of the source and target sentence embeddings and learn to decode the target sentence tokens. As the encoder is frozen, the xsim and xsim++ scores do not change, but the decoding results do. With this decoder fine-tuning step, we notice similar translation results in the X-eng direction, while noticing a +0.9 BLEU gain in the eng-X translation directions. More importantly, the auto-encoding performance is boosted by 9.3 BLEU with the decoder fine-tuning method, while the sentence embedding space is not affected. This fine-tuning step is trained for 50k additional steps.

We also evaluated the best performing models on translation tasks with COMET, which has proven to correlate better with human judgments than BLEU scores. We evaluated the X-eng and eng-X directions involving the languages on which XLM-R was trained, which are the languages supported by COMET (see Table 2). We see less than a 1-point difference between our SONAR encoder-decoder model (with fine-tuned decoder) and the NLLB 1B model for both eng-X and X-eng directions, showing the good quality of the translations.

The NLLB 1B model still represents a topline, and to evaluate our SONAR framework against a fairer baseline involving a fixed-size sentence representation between the encoder and the decoder, we compared our results to the decoding of LASER embeddings, recently introduced in T-modules (Duquenne et al., 2022b, 2023). As LASER3 encoders were trained with a cosine loss, their sentence embeddings cannot be efficiently decoded with the T-modules decoder. This is why we trained new LASER3 encoders with an MSE loss, and added back-translated data from the NLLB project in addition to the original training data of the LASER3 encoders. These newly trained LASER3 MSE encoders can be combined with the T-modules decoder (Duquenne et al., 2023) to perform X-eng translation. We report the results on four languages, French, Spanish, Swahili and Russian, in Table 3 and notice big improvements using SONAR on both the X-eng translation task and the xsim++ evaluation. Please note that, compared to previous work (Duquenne et al., 2022b), we are able to encode and decode 200 languages with a single encoder and a single decoder.

| fra | spa | swh | rus
X-eng BLEU
SONAR & fine-tuned dec. | 46.1 | 34.5 | 42.4 | 37.1
LASER3 MSE & T-mod. | 40.4 | 29.6 | 27.2 | 29.7
xsim++
SONAR | 4.8 | 7.9 | 7.1 | 6.5
LASER3 MSE | 7.6 | 12.6 | 15.2 | 12.4

Table 3: Comparison to the T-modules framework based on the LASER embedding space: spBLEU scores for X-eng translation directions and xsim++ error rates for X-eng pairs on FLORES-200 devtest set.

98 languages | xsim↓ | xsim++↓
SONAR | 0.1 | 9.3
LASER3 | 1.1 | 27.5
LaBSE | 1.5 | 15.4

Table 4: Comparison of similarity search results (error rates) on the intersection of languages handled by LaBSE, LASER3 and SONAR.
BLEU | fra-eng | spa-eng
SONAR mean-pooling | 25.2 | 20.6
SONAR max-pooling | 31.6 | 24.5
SONAR attention-pooling | 33.3 | 25.5

Table 5: spBLEU X-eng zero-shot speech translation on FLEURS test set for different pooling methods.

| fra | spa | swh | rus
Training hours
SONAR/LASER ASR | 0.8k | 0.4k | 0.3k | 0.2k
Whisper ASR | 10k | 11k | 0.01k | 10k
Whisper ST | 4k | 7k | 0.3k | 8k
SONAR zero-shot ST
SONAR | 33.3 | 25.5 | 14.9 | 15.0
SONAR & fine-tuned dec. | 33.4 | 24.8 | 15.6 | 14.6
Zero-shot ST baseline
LASER3 MSE & T-mod | 30.7 | 22.9 | 3.7 | 16.2
Supervised ST toplines
Whisper Large v1 | 33.8 | 27.0 | 5.2 | 30.2
Whisper Large v2 | 34.9 | 27.2 | 7.6 | 31.1

Table 7: spBLEU scores on FLEURS test set for zero-shot Speech Translation in X-eng directions.
6 Experiments on speech

Based on the experiments and evaluations of multilingual sentence embedding spaces for text, we chose to focus only on the embedding space learnt with the translation, denoising auto-encoding and MSE objectives, which seems to be a good trade-off between good semantic representations (xsim and xsim++) and good decoding performance (translation and auto-encoding). We follow a teacher-student approach to extend this space to the speech modality for several languages. We first performed an initial extensive study on five languages only: English (eng), Spanish (spa), French (fra), Russian (rus) and Swahili (swh). We then scale to 37 languages.

6.1 Experiments on 5 languages

We use a pre-trained w2v-bert 600-million-parameter model to initialize the speech encoders and train them on the Common Voice 12 ASR training set (Ardila et al., 2019). For our English speech encoder, we also used ASR training data from MuST-C (Di Gangi et al., 2019), VoxPopuli (Wang et al., 2021) and Librispeech (Panayotov et al., 2015). We tested different pooling methods, namely mean-pooling, max-pooling and attention-pooling. Attention-pooling is performed with a three-layer transformer decoder architecture with cross-attention on the speech encoder outputs, in order to output a single vector as our speech sentence embedding. Best results are achieved with attention-pooling (see Table 5).
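A minimal sketch of such an attention-pooling head is given below, using a single learned query that cross-attends to the speech encoder states; the dimensions are illustrative (the paper uses a three-layer decoder), and this is not the actual SONAR module.

```python
# Sketch of attention-pooling (illustrative): a small Transformer decoder with a
# single learned query cross-attends to the speech encoder outputs and returns
# one fixed-size vector per utterance.
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    def __init__(self, dim: int = 1024, n_layers: int = 3, n_heads: int = 16):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learned query token
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, frames, dim) speech encoder outputs
        query = self.query.expand(encoder_states.size(0), -1, -1)
        pooled = self.decoder(tgt=query, memory=encoder_states)    # cross-attention pooling
        return pooled.squeeze(1)                                   # (batch, dim) embedding
```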
As a baseline, we compared our SONAR speech encoders to speech encoders trained with LASER as teacher (using our newly trained LASER3 MSE text encoders as teachers), with the exact same training data and pre-trained w2v-bert model. We report the xsim and xsim++ cross-lingual and cross-modal results on the FLEURS test set in Table 6, for foreign speech embeddings against English text embeddings. Similarly to what Chen et al. (2023) noticed on FLORES, xsim scores saturate to a zero error rate on the FLEURS test set, not providing useful insights into the multimodal sentence embedding space organization.

| fra | spa | swh | rus
xsim
SONAR | 0.0 | 0.0 | 0.0 | 0.0
LASER3 MSE | 0.0 | 0.0 | 0.0 | 0.3
xsim++
SONAR | 12.3 | 13.9 | 22.8 | 24.6
LASER3 MSE | 17.5 | 24.9 | 40.7 | 42.1

Table 6: Multilingual and multimodal similarity search evaluations on FLEURS test set: xsim and xsim++ error rates on speech translation X-eng pairs.

src\tgt | eng | fra | spa | swh | rus | 200 langs
eng | 69.7 | 44.3 | 26.9 | 27.8 | 29.8 | 17.7
fra | 33.4 | 64.1 | 21.5 | 18.2 | 23.3 | 13.4
spa | 24.8 | 25.1 | 58.9 | 16.0 | 16.8 | 11.7
swh | 15.6 | 13.5 | 9.0 | 25.7 | 9.8 | 7.0
rus | 14.6 | 17.3 | 11.0 | 10.4 | 35.0 | 8.0

Table 8: spBLEU scores on FLEURS test set for zero-shot Speech Translation in {eng,fra,spa,swh,rus}-X directions. The last column is the average spBLEU Speech Translation score for decoding into the 200 languages supported by the SONAR text decoder.
| eng | fra | spa | swh | rus
Training hours
SONAR/LASER ASR | 4k | 0.8k | 0.4k | 0.3k | 0.2k
Whisper ASR | 438k | 10k | 11k | 0.01k | 10k
Whisper ST | — | 4k | 7k | 0.3k | 8k
BLEU
SONAR | 64.7 | 54.3 | 50.0 | 17.7 | 29.1
SONAR & fine-tuned dec. | 69.7 | 64.1 | 58.9 | 25.7 | 35.0
Whisper v1 | 80.8 | 79.8 | 84.8 | 26.9 | 84.3
Whisper v2 | 81.3 | 82.0 | 85.3 | 36.0 | 85.3
Bert-score
SONAR | 0.948 | 0.926 | 0.923 | 0.808 | 0.853
SONAR & fine-tuned dec. | 0.951 | 0.939 | 0.936 | 0.831 | 0.870
Whisper v1 | 0.972 | 0.965 | 0.977 | 0.837 | 0.975
Whisper v2 | 0.972 | 0.969 | 0.979 | 0.865 | 0.978

Table 9: Speech recognition spBLEU scores and Bert-scores on FLEURS test set.
Therefore, we also report xsim++ scores, augmenting the FLEURS test set with the hard negatives generated in (Chen et al., 2023) based on FLORES, which comprises the source sentences of FLEURS. We notice a 41% relative xsim++ reduction when switching from LASER as teacher to SONAR as teacher.

Following (Duquenne et al., 2022b), we decoded the speech sentence embeddings with our SONAR text decoder, performing zero-shot speech-to-text translation. Indeed, the text decoder has never seen speech sentence embeddings during training. Moreover, speech representations were only trained to match their transcription representations, never translations. In Table 7, we report our zero-shot speech-to-text translation results on the FLEURS test set for X-eng directions and compare them to the baseline trained on the LASER space. We also report the state-of-the-art results for speech-to-text translation, trained in a supervised way on significantly more training data. First, we notice big improvements in the BLEU scores compared to the LASER baseline on French, Spanish and Swahili, with an average 5.5 BLEU gain on these languages, while being slightly behind on Russian-to-English translation (-1.2 BLEU). This last result is surprising, as our SONAR speech encoder has a much better xsim++ score on Russian compared to the LASER speech encoder. Second, we notice that for our two high-resource languages, namely French and Spanish, our zero-shot speech-to-text results are close to the Whisper Large v1 supervised results, while being trained on much less training data. As for Swahili, our framework significantly outperforms the Whisper models. We notice much better results for Russian-to-English for Whisper, which was expected given the amount of training data and the supervised setting.

Thanks to the compatibility across modalities and across languages, we decoded English, French, Spanish, Swahili and Russian speech sentence embeddings into the 200 text languages supported by our SONAR decoders. We report the zero-shot speech translation results using the fine-tuned SONAR decoder in Table 8. We notice that BLEU scores remain high for languages other than English, still in a zero-shot setting, highlighting again the compatibility between representations.

Finally, speech embeddings can be decoded into text in the same language, which can be seen as speech transcription. Since our model can often paraphrase transcriptions, we report in Table 9 BLEU scores as well as bert-scores for this zero-shot transcription task. While being significantly behind on BLEU scores, which is expected as our model often paraphrases transcriptions, we see a much smaller gap with Whisper transcriptions on the bert-score metric (which still advantages real transcriptions compared to paraphrases, but less than BLEU). The amount of training data is also significantly different between the two setups, but it is interesting to notice that the gap in terms of bert-score remains reasonable.
6.2 Scaling to 37 languages

We use the same recipe as described above to extend the coverage of the speech encoders to 37 languages. These speech encoders were trained by linguistic language family, e.g. Romance or Indian languages, using speech transcriptions only, from public and licensed sources. Table 10, column "Train", gives statistics on the amount of training data. As in Section 6.1, we evaluate the speech encoders by connecting them to the SONAR text decoder and calculating speech-to-text translation performance, as measured by BLEU. Although our results are fully zero-shot speech translation, we achieve very competitive performance compared to the state-of-the-art model Whisper v2 large (Radford et al., 2022). The average BLEU score is slightly better for SONAR than for Whisper, while being zero-shot speech translation. Our model performs less well on some high-resource languages like Mandarin Chinese, German or French, but outperforms Whisper for others like Spanish or Dutch and for several less common languages, like Swahili or Uzbek. Our modular approach seems to achieve particularly good results on Indian languages: Bengali, Hindi, Kannada, Telugu, Tamil and Urdu.

ISO | Language | Train | Ours | Whisper
arb | MS Arabic | 822 | 30.9 | 26.9
ben | Bengali | 335 | 21.3 | 14.1
cat | Catalan | 1738 | 37.7 | 36.9
cmn | Mandarin Chinese | 9320 | 18.6 | 20.8
ces | Czech | 181 | 32.0 | 30.3
cym | Welsh | 99 | 14.5 | 13.4
dan | Danish | 115 | 34.9 | 36.0
deu | German | 3329 | 36.2 | 38.8
est | Estonian | 131 | 26.1 | 21.2
fin | Finnish | 184 | 24.9 | 25.2
fra | French | 2057 | 33.7 | 34.9
hin | Hindi | 150 | 22.6 | 24.2
ind | Indonesian | 269 | 28.7 | 31.9
ita | Italian | 588 | 29.3 | 27.5
jpn | Japanese | 17319 | 20.2 | 20.8
kan | Kannada | 114 | 21.4 | 13.1
kor | Korean | 316 | 17.1 | 24.2
mlt | Maltese | 106 | 24.4 | 16.2
nld | Dutch | 1723 | 29.3 | 28.4
pes | Western Persian | 386 | 24.4 | 20.9
por | Portuguese | 269 | 38.3 | 41.4
pol | Polish | 304 | 21.1 | 25.8
ron | Romanian | 135 | 34.7 | 34.1
rus | Russian | 259 | 28.4 | 31.1
slk | Slovak | 102 | 32.3 | 29.3
spa | Spanish | 1511 | 28.0 | 27.2
swh | Swahili | 361 | 23.5 | 7.6
tam | Tamil | 245 | 16.2 | 10.0
tel | Telugu | 84 | 18.0 | 14.7
tgl | Tagalog | 108 | 14.6 | 26.8
tha | Thai | 195 | 16.9 | 17.8
tur | Turkish | 174 | 22.7 | 29.9
ukr | Ukrainian | 105 | 30.7 | 32.5
urd | Urdu | 185 | 19.7 | 18.1
uzn | Uzbek | 115 | 20.0 | 6.6
vie | Vietnamese | 194 | 19.1 | 21.9
Total/avg | | 43628 | 25.3 | 24.5

Table 10: spBLEU evaluation of S2T into English on FLEURS test set. Our models were trained on ASR transcriptions only, compared to Whisper large v2.

7 Discussion

From all the experiments presented in Section 5 and Section 6, we can draw a couple of high-level conclusions.

First, we have seen that the auto-encoding task can be largely solved even with a fixed-size bottleneck between the encoder and the decoder, showing that a fixed-size representation should not be seen as a hard limitation, as a lot of information can be stored in a single vector. Then, similarly to Artetxe and Schwenk (2019), we noticed that a translation objective is well suited to building language-agnostic representations while making sure that the encoder model encodes enough information in the sentence embedding to be efficiently decoded (in another language). Adding an MSE loss to the training, which explicitly encourages aligning languages in the sentence embedding space, leads to better language-agnostic representations. Moreover, denoising auto-encoding combined with the MSE loss can bring gains for decoding tasks, but too much of it affects the language-agnostic representations. Finally, the teacher-student approach to extend to the speech modality has once again proven to be effective, and the mutual compatibility between speech and text multilingual embeddings is clearly highlighted by the fact that speech embeddings can be decoded into foreign text in a zero-shot way.

8 Conclusion
To conclude, we introduced a new multilingual and multimodal sentence embedding space called SONAR. We conducted an extensive study on objective functions to build our multilingual teacher sentence embedding space for text, and an extensive evaluation of our SONAR framework for both similarity search and decoding tasks. We extended this new text sentence embedding space to the speech modality to introduce Sentence-level multimOdal and laNguage-Agnostic Representations (SONAR). The SONAR text and speech encoders as well as the text decoders are freely available at https://github.com/facebookresearch/SONAR.

9 Acknowledgment

We would like to thank Kevin Heffernan for his help in providing xsim and xsim++ baselines for LaBSE and LASER3, Andy Chung for providing the w2v-bert pre-trained models used as initialization for our speech encoders, Changhan Wang for providing the speech data manifests used for training, and Artyom Kozhevnikov and Naji El Hachem for the migration of models to fairseq2 for open-sourcing.

The last author's contribution was partly funded by his chair in the PRAIRIE institute, funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference ANR-19-P3IA-0001.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, pages 597–610.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In NAACL, pages 2012–2017.

Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswani, Changhan Wang, Juan Pino, Benoît Sagot, and Holger Schwenk. 2022a. Speechmatrix:

Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, and Holger Schwenk. 2022b. T-modules: Translation modules for zero-shot cross-modal machine translation. arXiv preprint arXiv:2205.12216.

Paul-Ambroise Duquenne, Hongyu Gong, and Holger Schwenk. 2021. Multimodal and multilingual embeddings for large-scale speech mining. Advances in Neural Information Processing Systems, 34.

Paul-Ambroise Duquenne, Holger Schwenk, and Benoît Sagot. 2023. Modular speech-to-text translation for zero-shot cross-modal transfer. In Interspeech.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654.

Sameer Khurana, Antoine Laurent, and James Glass. 2022. SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation. arXiv preprint arXiv:2205.08180.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.