SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
decoded into the same target language sentence. However, there is no guarantee of converging to this local minimum. Nothing explicitly constrains a sentence in one language and its translation in another language to be encoded close to one another. As a result, other local minima are possible, where translations are not encoded closely but are still decoded into the same sentence for a given target language. To mitigate this, shallow decoders were used by Artetxe and Schwenk (2019): a deeper decoder can more easily decode different points into the same sentence, whereas a shallower decoder is more likely to need two vectors to be very similar whenever they must be decoded into the same sentence.

Auto-encoding and denoising auto-encoding objective. Auto-encoders have been widely used to build representations. They have the advantage of encouraging the encoding of fine-grained details of the input. However, this objective by itself is not likely to learn semantic representations of sentences. Moreover, this objective is much simpler to learn than a translation objective, which makes the combination of the two objectives difficult. To mitigate these issues, Liu et al. (2020) introduce a denoising auto-encoding task, which has proven to be a good pre-training objective for translation tasks.

MSE loss objective in the sentence embedding space. Teacher-student approaches to multilingual sentence embedding space learning have shown that ensuring that translations of the same sentence are embedded close to one another in the sentence embedding space with an MSE loss works very well (Reimers and Gurevych, 2020; Heffernan et al., 2022). However, using this kind of loss without a frozen pre-existing teacher embedding space would lead to collapse (all inputs mapped to the same embedding), which is why contrastive learning methods were introduced to learn multilingual sentence embeddings from scratch (Feng et al., 2020). However, combining an MSE loss with a translation objective and/or a denoising auto-encoding objective could also prevent collapse from happening, as the model is forced to keep embeddings distinct to encode and decode different sentences.

Decoder finetuning. Duquenne et al. (2022b) demonstrated that learning deep decoders for an existing sentence embedding space (in their case, LASER) can significantly improve translation and auto-encoding performance. While keeping the existing embedding space unchanged, such decoders greatly improve the decoding of sentence embeddings, therefore significantly improving auto-encoding and translation performance when combined with compatible encoders. This is of great interest for zero-shot (possibly cross-modal) translation, as shown by Duquenne et al. (2023). In this paper, we introduce a decoder fine-tuning method called random interpolation decoding. Based on a trained encoder-decoder model with a bottleneck representation between the encoder and the decoder, we freeze the encoder weights and fine-tune the decoder weights only, on a specific decoding task: given a bitext (x, y), we encode x and y with the frozen encoder, randomly draw z as a random interpolation of the x and y embeddings, and learn to decode the sentence embedding z into y. This can be viewed as a continuous combination of the translation and auto-encoding tasks.
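As an illustration, a single fine-tuning step of random interpolation decoding could look like the following sketch, assuming generic `encoder` and `decoder` modules, shifted-target token tensors and an optimizer built over the decoder parameters only; the names and shapes are illustrative placeholders, not the actual SONAR training code.

```python
# Sketch of one random interpolation decoding step (illustrative, not the
# actual SONAR implementation): the encoder is frozen, the decoder learns to
# decode a random mix of source and target embeddings into the target tokens.
import torch
import torch.nn.functional as F

def random_interpolation_step(encoder, decoder, src_tokens, tgt_tokens, optimizer):
    with torch.no_grad():                       # frozen encoder
        z_src = encoder(src_tokens)             # (batch, dim) sentence embedding of x
        z_tgt = encoder(tgt_tokens)             # (batch, dim) sentence embedding of y
    # One mixing coefficient per sentence: lam = 1 -> pure translation (decode y
    # from the embedding of x), lam = 0 -> pure auto-encoding of y.
    lam = torch.rand(z_src.size(0), 1, device=z_src.device)
    z = lam * z_src + (1.0 - lam) * z_tgt
    logits = decoder(z, tgt_tokens[:, :-1])     # teacher forcing on the target y
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tgt_tokens[:, 1:].reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()                             # gradients flow into the decoder only
    optimizer.step()
    return loss.item()
```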
3.2 Multilingual sentence representations for speech

Duquenne et al. (2021) introduced the first semantic sentence embeddings for multilingual speech. Their method follows a teacher-student approach, where the teacher model is an encoder for multilingual sentence embeddings trained on text. We follow the same approach but use our newly trained text sentence embedding space as teacher: we train a speech student encoder to encode audio into fixed-size representations and minimize the MSE loss between the transcription sentence embeddings and the trained speech sentence embeddings. Written translation embeddings could also be used as targets in this teacher-student approach (Duquenne et al., 2021). However, in this work we only focus on transcriptions as targets; using written translations is left for future work. As in previous work, we leveraged self-supervised pre-trained models for our speech encoder training, using a w2v-bert pretrained model as initialization.
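A minimal sketch of this teacher-student objective is shown below, assuming a frozen `text_teacher` encoder that produces the target embeddings and a trainable `speech_student` encoder; all module and variable names are illustrative placeholders, not the actual training code.

```python
# Sketch of the speech teacher-student step (illustrative): the speech student
# is trained to place its pooled embedding close to the frozen text teacher's
# embedding of the transcription, using an MSE loss.
import torch
import torch.nn.functional as F

def distillation_step(speech_student, text_teacher, audio_features, transcript_tokens, optimizer):
    with torch.no_grad():
        target = text_teacher(transcript_tokens)   # (batch, dim) frozen teacher embedding
    pred = speech_student(audio_features)          # (batch, dim) pooled speech embedding
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```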
4 Evaluations

To evaluate the semantic properties of the resulting sentence embedding space, we relied on a number of evaluation tasks on both the text and speech modalities:

4.1 Evaluations on text

xsim. Cross-lingual similarity search, also called xsim,2 evaluates the similarity between sentence embeddings across languages. Given a test dataset of bitexts, translations are encoded into the multilingual sentence embedding space and cosine similarities between all embeddings are computed. For each test instance, if the two corresponding translations are not the closest, we count it as an error, in order to compute an error rate on the whole test set.
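For illustration, this error rate can be computed as in the following sketch, given two aligned embedding matrices for an N-way parallel test set; this is a schematic version of the metric, not the reference implementation from the LASER repository.

```python
# Sketch of the xsim error rate (illustrative): for every source sentence, the
# correct translation must be its nearest neighbour under cosine similarity.
import torch
import torch.nn.functional as F

def xsim_error_rate(src_emb: torch.Tensor, tgt_emb: torch.Tensor) -> float:
    src = F.normalize(src_emb, dim=-1)             # (N, dim) embeddings, row i aligned
    tgt = F.normalize(tgt_emb, dim=-1)             # with row i of the other language
    sim = src @ tgt.T                              # (N, N) cosine similarity matrix
    nearest = sim.argmax(dim=-1)                   # closest target for each source
    gold = torch.arange(src.size(0), device=src.device)
    return 100.0 * (nearest != gold).float().mean().item()
```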
xsim++. More recently, xsim++ was introduced as a more semantically challenging similarity search task (Chen et al., 2023).2 It augments the test set with hard negative examples for the similarity search, generating several modified versions of the ground-truth examples based on causality alternation, entity replacement and number replacement.

Translation tasks. Multilingual embeddings are decoded into other target languages to perform MT. We report spBLEU (flores200) scores and COMET scores on the generated translations. Decoding sentence embeddings into other languages partially evaluates how much information is encoded in the sentence embeddings, which is complementary to the xsim and xsim++ evaluations. However, please note that information may also be restored from the internal language modeling capabilities of the decoder, and not from the sentence embeddings themselves.

Auto-encoding task. Similarly to the translation tasks, we decode sentence embeddings in the same language to perform auto-encoding and evaluate the content preservation of this operation.

All these evaluations for text were performed on the FLORES-200 devtest set,3 which provides an N-way parallel corpus of translations in 200 languages.

4.2 Evaluations on speech

xsim for speech. We follow Duquenne et al. (2021) and calculate cross-modal and cross-lingual similarity search on the FLEURS speech translation test set (Conneau et al., 2023). It follows the xsim evaluation presented above, but xsim is run on speech embeddings against English text translation embeddings.

xsim++ for speech. In addition to the xsim computation for speech, we augment the English texts with challenging negative examples from the xsim++ modified English sentences of FLORES.

Zero-shot speech-to-text translation. Following Duquenne et al. (2022b), speech student encoders can be combined with text decoders at inference time. Since the speech encoders were trained on ASR data only and the SONAR text decoder was only trained on text and has never seen speech embeddings during training, this corresponds to zero-shot speech-to-text translation. Similarly to text, it enables evaluating the content encoding in the speech embeddings. It also evaluates the compatibility between speech and text representations.
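Schematically, this evaluation simply composes the two independently trained modules, as in the sketch below; `speech_encoder`, `text_decoder` and its `generate` method are illustrative placeholders rather than the actual SONAR API.

```python
# Sketch of zero-shot speech-to-text translation by composition (illustrative):
# the speech encoder produces a sentence embedding that the text decoder, which
# never saw speech embeddings during training, decodes into the target language.
import torch

@torch.no_grad()
def zero_shot_speech_translation(speech_encoder, text_decoder, audio_features, tgt_lang):
    sent_emb = speech_encoder(audio_features)      # (batch, dim) speech sentence embeddings
    return text_decoder.generate(sent_emb, tgt_lang=tgt_lang)
```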
Zero-shot Automatic Speech Recognition. We also decode speech embeddings in the same language to perform speech recognition.

All these evaluations for speech were performed on the FLEURS test set (Conneau et al., 2023), an N-way parallel speech dataset in 102 languages built on top of the text FLORES-101 benchmark.

2 https://github.com/facebookresearch/LASER
3 https://github.com/facebookresearch/flores/tree/main/flores200
Method | X-eng↑ | eng-X↑ | AE↑ | xsim↓ | xsim++↓
L_MT | 33.2 | 21.1 | 28.6 | 1.3 | 19.6
L_MT + L_AE | 17.6 | 18.6 | 94.6 | 15.9 | 65.7
L_MT + 0.1·L_DAE | 31.6 | 20.9 | 41.6 | 2.6 | 26.2
L_MT + 0.1·L_MSE | 31.7 | 20.2 | 27.2 | 1.3 | 14.3
SONAR sentence embedding space
L_MT + 0.1·L_MSE + 0.01·L_DAE | 32.9 | 20.7 | 32.4 | 1.4 | 15.2
L_MT + 0.1·L_MSE + 0.01·L_DAE & fine-tuned dec. | 32.7 | 21.6 | 41.7 | 1.4 | 15.2
MT topline
NLLB 1B | 35.2 | 24.9 | 39.0* | 3.7* | 49.6*
Similarity search baselines
LaBSE | — | — | — | 10.7 | 36.1
LASER3 | — | — | — | 5.1 | 36.4

Table 1: Text evaluations on FLORES-200 devtest set, averaged over the 200 languages supported by NLLB 1B: translation spBLEU for X-eng and eng-X directions, auto-encoding spBLEU, and xsim and xsim++ similarity search results on X-eng pairs. Results with * are zero-shot evaluations of the NLLB 1B model, which was not trained to optimize these tasks.
5.1 Training setup

We initialized our model with the NLLB 1B dense model (NLLB Team et al., 2022), which was trained on translation tasks with full cross-attention on variable-length encoder outputs, as is commonly done for sequence-to-sequence MT model training. The model is composed of a 24-layer Transformer encoder and a 24-layer Transformer decoder, and is trained on a combination of human-labeled data, back-translated data and mined data (NLLB Team et al., 2022). In order to build our fixed-size sentence representation, we added a pooling operation on the encoder outputs. Several pooling methods are possible: max-pooling as done in (Artetxe and Schwenk, 2019), mean-pooling as done in (Reimers and Gurevych, 2019), or EOS-pooling, which uses the output representation of the EOS special token appended at the end of sentences during NLLB training. Contrary to mean-pooling or EOS-pooling, max-pooling outputs a vector with a different range of values compared to NLLB training (due to the max operation), leading to worse results in our initial experiments. Since training with EOS-pooling happened to be unstable in initial experiments, we focused on mean-pooling for the rest of our experiments. We trained our encoder-decoder model for 100k updates with the same learning rate and batch size as NLLB training in the following experiments, unless explicitly specified. We used all bitext data used in the NLLB training: human-labeled bitexts, back-translated data and mined data. This training dataset involves 200 target languages, which contrasts with LASER training that only used English and Spanish as target languages.

As presented in Section 3, we ran an extensive study on the use of different training objectives, namely a translation objective (MT), an auto-encoding objective (AE), a denoising auto-encoding objective (DAE) and a Mean Squared Error loss (MSE) in the sentence embedding space:

L = L_MT + α · L_MSE + β · L_AE/DAE

We use the same training data for the auto-encoding and translation objectives, inputting the target sentences instead of the source sentences to perform auto-encoding of target sentences only. Incorporating more monolingual data in the training for the auto-encoding task is left to future work.
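The sketch below spells out this combined objective with mean-pooled encoder outputs as the fixed-size sentence embedding; the `encoder`, `decoder` and `noise` callables and the default scaling factors are illustrative placeholders, not the actual NLLB/SONAR training code.

```python
# Sketch of the combined objective L = L_MT + alpha * L_MSE + beta * L_DAE
# (illustrative): sentence embeddings are mean-pooled encoder outputs, and the
# MT and DAE terms are token-level cross-entropy losses through the bottleneck.
import torch
import torch.nn.functional as F

def combined_loss(encoder, decoder, src_tokens, tgt_tokens, noise, alpha=0.1, beta=0.01):
    def embed(tokens):
        states = encoder(tokens)                  # (batch, seq, dim) encoder outputs
        return states.mean(dim=1)                 # mean-pooling -> (batch, dim)

    def decode_loss(sent_emb, out_tokens):
        logits = decoder(sent_emb, out_tokens[:, :-1])   # teacher-forced decoding
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), out_tokens[:, 1:].reshape(-1)
        )

    z_src, z_tgt = embed(src_tokens), embed(tgt_tokens)
    l_mt = decode_loss(z_src, tgt_tokens)         # translation through the bottleneck
    l_mse = F.mse_loss(z_src, z_tgt)              # pull translations together
    l_dae = decode_loss(embed(noise(tgt_tokens)), tgt_tokens)  # denoising auto-encoding
    return l_mt + alpha * l_mse + beta * l_dae
```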
5.2 Initial experiment with translation objective only

We report the results of our experiments on text sentence embedding modeling in Table 1. Our first experiment, using only the translation objective for our encoder-decoder model with a fixed-size intermediate representation, gives surprisingly good translation performance given the bottleneck between the encoder and the decoder. It yields -2 BLEU in the X-eng direction and -3.8 BLEU in the eng-X direction compared to the NLLB 1B model with full cross-attention.
We notice that the auto-encoding evaluation (AE) significantly lags behind the NLLB 1B model. This result may come from an inductive bias of the sequence-to-sequence architecture with full cross-attention, which could bias the model towards copying encoder inputs.

The xsim and xsim++ results are significantly better compared to previous work, namely LaBSE and LASER3, on our 200 languages of focus, with approximately 45% relative reduction of the xsim++ error rate compared to the baseline models. Note that averaging NLLB 1B encoder outputs to perform similarity search already gives good xsim scores. This directly comes from the translation objective used during NLLB 1B training, which encourages encoding multilingual sentences in similar ways for efficient cross-lingual transfer. However, the more difficult xsim++ evaluation remains challenging, in this zero-shot setting, for the original NLLB 1B model.

Method | X-eng | eng-X
SONAR | 85.9 | 83.4
SONAR & fine-tuned dec. | 85.9 | 84.2
Topline
NLLB 1B | 86.5 | 85.2

Table 2: Translation evaluations for X-eng and eng-X directions on FLORES-200 devtest set: COMET scores averaged over the 89 languages supported by both the COMET and NLLB 1B models.

5.3 Experiments with auto-encoding objectives

Noticing the gap in auto-encoding performance between the fixed-size bottleneck encoder-decoder model and NLLB 1B, we integrated an auto-encoding objective, hoping to close the gap with the NLLB 1B model. This model was only trained for 50k steps, as it converged quickly compared to other variants. We notice that the auto-encoding task is easy to learn, even with a fixed-size bottleneck between the encoder and the decoder, almost reaching 95 BLEU on average over the 200 languages of NLLB. This shows that a lot of information can be efficiently stored in a fixed-size representation and that the bottleneck should not be seen as a hard limitation. While the translation performance in eng-X directions is not much impacted, we see a big drop in translation performance for X-eng directions (-15.6 BLEU) compared to the fixed-size bottleneck encoder-decoder model trained only on a translation task. Moreover, we see a big drop in both xsim and xsim++ evaluations, showing that the model no longer learns language-agnostic representations, due to the auto-encoding objective that seems more easily optimized than the translation objective.

To mitigate the negative effects of the auto-encoding objective, while improving the auto-encoding performance at inference time, we switched to a denoising auto-encoding criterion, to avoid the model overfitting on the copy task. That also makes the task harder than simple auto-encoding, so it can be better combined with the non-trivial translation task. We also scaled down this denoising auto-encoding objective by a factor of 0.1. This mostly mitigated the performance drops on translation tasks, while significantly boosting the auto-encoding task (+13 BLEU) compared to the translation-only model. However, the denoising auto-encoding criterion significantly affects the xsim and xsim++ scores. This again shows that auto-encoding affects the organization of the sentence embedding space, learning distinct representations for different languages to optimize auto-encoding.

5.4 Experiments with cross-lingual similarity objective

Motivated by the recent distillation approaches that extend a sentence embedding space to new languages by explicitly aligning languages with an MSE criterion in the embedding space, we explored the use of an auxiliary MSE loss in the sentence embedding space. This is in addition to the translation loss, with the hope of explicitly bringing translations closer together in the embedding space. In Table 1, we notice that this new constraint degrades the decoding performance of the encoder-decoder model for both translation and auto-encoding tasks. However, it significantly boosts the xsim++ scores compared to the encoder-decoder model trained only on a translation task, with a 5.3-point reduction in xsim++ error rate.

5.5 Training the SONAR embedding space

Based on the conclusions drawn from the previously trained models, we combined the translation loss, the auxiliary MSE loss and the denoising auto-encoding loss to create the SONAR embedding space.
In this run, the denoising auto-encoding loss is further downscaled, motivated by the high xsim++ score of the previously trained sentence embedding space trained with denoising auto-encoding. First, following the same tendency as the previous trainings with a (denoising) auto-encoding objective, we notice a slight degradation in xsim++ scores when adding the denoising auto-encoding in addition to the MSE loss. However, this degradation is only 0.9%, which can be considered acceptable. Initial experiments with larger scaling factors for the denoising auto-encoding criterion further increased, as expected, the xsim++ degradation, and we thus decided to stick with a 0.01 scaling factor for the denoising auto-encoding objective. On the other hand, for our new SONAR model, we see improvements on translation tasks compared to the model trained on the MT and MSE losses. This may be due to efficient mitigation of the collapse that could happen with the MSE loss, thanks to the denoising auto-encoding objective. We also see big improvements in the auto-encoding task (>+3.8 BLEU) compared to all models not trained with auto-encoding objectives. This variant seems to be the best setup in terms of sentence embedding space organization (following xsim and xsim++ scores) and decoding performance (following translation and auto-encoding evaluations).

We also report the xsim and xsim++ results on the intersection of languages handled by LaBSE, LASER3 and SONAR in Table 4, and notice again that SONAR outperforms previous state-of-the-art sentence embedding spaces for multilingual similarity search.

Finally, we tried to improve the decoding performance of our architecture, freezing the embedding space and our multilingual encoder while fine-tuning only the decoder. We used the random interpolation decoding method introduced in Section 3, where we compute a random interpolation of the source and target sentence embeddings and learn to decode the target sentence tokens. As the encoder is frozen, the xsim and xsim++ scores do not change, but the decoding results do. With this decoder fine-tuning step, we notice similar translation results in the X-eng direction, while noticing a +0.9 BLEU gain in the eng-X translation directions. More importantly, the auto-encoding performance is boosted by 9.3 BLEU with the decoder fine-tuning method, while the sentence embedding space is not affected. This fine-tuning step is trained for 50k additional steps.

We also evaluated the best performing models on translation tasks with COMET, which has proven to correlate better with human judgments than BLEU scores. We evaluated the X-eng and eng-X directions involving the languages on which XLM-R was trained, which are the languages supported by COMET (see Table 2). We see less than a 1-point difference between our SONAR encoder-decoder model (with fine-tuned decoder) and the NLLB 1B model for both eng-X and X-eng directions, showing the good quality of the translations.

The NLLB 1B model still represents a topline, and to evaluate our SONAR framework against a fairer baseline involving a fixed-size sentence representation between the encoder and the decoder, we compared our results to the decoding of LASER embeddings, recently introduced in T-modules (Duquenne et al., 2022b, 2023). As LASER3 encoders were trained with a cosine loss, their sentence embeddings cannot be efficiently decoded with the T-modules decoder. This is why we trained new LASER3 encoders with an MSE loss, and added back-translated data from the NLLB project in addition to the original training data of the LASER3 encoders. These newly trained LASER3 MSE encoders can be combined with the T-modules decoder (Duquenne et al., 2023) to perform X-eng translation. We report the results on four languages, French, Spanish, Swahili and Russian, in Table 3 and notice big improvements using SONAR on both the X-eng translation task and the xsim++ evaluation. Please note that, compared to previous work (Duquenne et al., 2022b), we are able to encode and decode 200 languages with a single encoder and a single decoder.

| fra | spa | swh | rus
X-eng BLEU
SONAR & fine-tuned dec. | 46.1 | 34.5 | 42.4 | 37.1
LASER3 MSE & T-mod. | 40.4 | 29.6 | 27.2 | 29.7
xsim++
SONAR | 4.8 | 7.9 | 7.1 | 6.5
LASER3 MSE | 7.6 | 12.6 | 15.2 | 12.4

Table 3: Comparison to the T-modules framework based on the LASER embedding space: spBLEU scores for X-eng translation directions and xsim++ error rates for X-eng pairs on FLORES-200 devtest set.

98 languages | xsim↓ | xsim++↓
SONAR | 0.1 | 9.3
LASER3 | 1.1 | 27.5
LaBSE | 1.5 | 15.4

Table 4: Comparison of similarity search results (error rates) on the intersection of languages handled by LaBSE, LASER3 and SONAR.
BLEU | fra-eng | spa-eng
SONAR mean-pooling | 25.2 | 20.6
SONAR max-pooling | 31.6 | 24.5
SONAR attention-pooling | 33.3 | 25.5

Table 5: spBLEU X-eng zero-shot speech translation on FLEURS test set for different pooling methods.

| fra | spa | swh | rus
Training hours
SONAR/LASER ASR | 0.8k | 0.4k | 0.3k | 0.2k
Whisper ASR | 10k | 11k | 0.01k | 10k
Whisper ST | 4k | 7k | 0.3k | 8k
SONAR zero-shot ST
SONAR | 33.3 | 25.5 | 14.9 | 15.0
SONAR & fine-tuned dec. | 33.4 | 24.8 | 15.6 | 14.6
Zero-shot ST baseline
LASER3 MSE & T-mod | 30.7 | 22.9 | 3.7 | 16.2
Supervised ST toplines
Whisper Large v1 | 33.8 | 27.0 | 5.2 | 30.2
Whisper Large v2 | 34.9 | 27.2 | 7.6 | 31.1

Table 7: spBLEU scores on FLEURS test set for zero-shot Speech Translation in X-eng directions.
6 Experiments on speech

Based on the experiments and evaluations of multilingual sentence embedding spaces for text, we chose to focus only on the embedding space learnt with the translation, denoising auto-encoding and MSE objectives, which seems to be a good trade-off between good semantic representations (xsim and xsim++) and good decoding performance (translation and auto-encoding). We follow a teacher-student approach to extend this space to the speech modality for several languages. We first performed an initial extensive study on five languages only: English (eng), Spanish (spa), French (fra), Russian (rus) and Swahili (swh). We then scale to 37 languages.

6.1 Experiments on 5 languages

We use a pre-trained w2v-bert 600-million-parameter model to initialize the speech encoders and train them on the Common Voice 12 ASR training set (Ardila et al., 2019). For our English speech encoder, we also used ASR training data from MuST-C (Di Gangi et al., 2019), VoxPopuli (Wang et al., 2021) and Librispeech (Panayotov et al., 2015). We tested different pooling methods, namely mean-pooling, max-pooling and attention-pooling. Attention-pooling is performed with a three-layer transformer decoder architecture with cross-attention on the speech encoder outputs, in order to output a single vector as our speech sentence embedding. Best results are achieved with attention-pooling (see Table 5).
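A minimal sketch of such an attention-pooling head is given below, using a single learned query that cross-attends to the speech encoder states; the dimensions are illustrative (the paper uses a three-layer decoder), and this is not the actual SONAR module.

```python
# Sketch of attention-pooling (illustrative): a small Transformer decoder with a
# single learned query cross-attends to the speech encoder outputs and returns
# one fixed-size vector per utterance.
import torch
import torch.nn as nn

class AttentionPooler(nn.Module):
    def __init__(self, dim: int = 1024, n_layers: int = 3, n_heads: int = 16):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learned query token
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, frames, dim) speech encoder outputs
        query = self.query.expand(encoder_states.size(0), -1, -1)
        pooled = self.decoder(tgt=query, memory=encoder_states)    # cross-attention pooling
        return pooled.squeeze(1)                                   # (batch, dim) embedding
```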
As a baseline, we compared our SONAR speech encoders to speech encoders trained with LASER as teacher (using our newly trained LASER3 MSE text encoders as teachers), with the exact same training data and pre-trained w2v-bert model. We report the xsim and xsim++ cross-lingual and cross-modal results on the FLEURS test set in Table 6, for foreign speech embeddings against English text embeddings. Similarly to what Chen et al. (2023) noticed on FLORES, xsim scores saturate to a zero error rate on the FLEURS test set, not providing useful insights into the multimodal sentence embedding space organization.

| fra | spa | swh | rus
xsim
SONAR | 0.0 | 0.0 | 0.0 | 0.0
LASER3 MSE | 0.0 | 0.0 | 0.0 | 0.3
xsim++
SONAR | 12.3 | 13.9 | 22.8 | 24.6
LASER3 MSE | 17.5 | 24.9 | 40.7 | 42.1

Table 6: Multilingual and multimodal similarity search evaluations on FLEURS test set: xsim and xsim++ error rates on speech translation X-eng pairs.

src\tgt | eng | fra | spa | swh | rus | 200 langs
eng | 69.7 | 44.3 | 26.9 | 27.8 | 29.8 | 17.7
fra | 33.4 | 64.1 | 21.5 | 18.2 | 23.3 | 13.4
spa | 24.8 | 25.1 | 58.9 | 16.0 | 16.8 | 11.7
swh | 15.6 | 13.5 | 9.0 | 25.7 | 9.8 | 7.0
rus | 14.6 | 17.3 | 11.0 | 10.4 | 35.0 | 8.0

Table 8: spBLEU scores on FLEURS test set for zero-shot Speech Translation in {eng,fra,spa,swh,rus}-X directions. The last column is the average spBLEU Speech Translation score for decoding into the 200 languages supported by the SONAR text decoder.
| eng | fra | spa | swh | rus
Training hours
SONAR/LASER ASR | 4k | 0.8k | 0.4k | 0.3k | 0.2k
Whisper ASR | 438k | 10k | 11k | 0.01k | 10k
Whisper ST | — | 4k | 7k | 0.3k | 8k
BLEU
SONAR | 64.7 | 54.3 | 50.0 | 17.7 | 29.1
SONAR & fine-tuned dec. | 69.7 | 64.1 | 58.9 | 25.7 | 35.0
Whisper v1 | 80.8 | 79.8 | 84.8 | 26.9 | 84.3
Whisper v2 | 81.3 | 82.0 | 85.3 | 36.0 | 85.3
Bert-score
SONAR | 0.948 | 0.926 | 0.923 | 0.808 | 0.853
SONAR & fine-tuned dec. | 0.951 | 0.939 | 0.936 | 0.831 | 0.870
Whisper v1 | 0.972 | 0.965 | 0.977 | 0.837 | 0.975
Whisper v2 | 0.972 | 0.969 | 0.979 | 0.865 | 0.978

Table 9: Speech recognition spBLEU scores and Bert-scores on FLEURS test set.
Therefore, we also report xsim++ scores, augmenting the FLEURS test set with the hard negatives generated in (Chen et al., 2023) based on FLORES, which comprises the source sentences of FLEURS. We notice a 41% relative xsim++ reduction when switching from LASER as teacher to SONAR as teacher.

Following (Duquenne et al., 2022b), we decoded the speech sentence embeddings with our SONAR text decoder, performing zero-shot speech-to-text translation. Indeed, the text decoder has never seen speech sentence embeddings during training. Moreover, speech representations were only trained to match their transcription representations, never translations. In Table 7, we report our zero-shot speech-to-text translation results on the FLEURS test set for X-eng directions and compare them to the baseline trained on the LASER space. We also report the state-of-the-art results for speech-to-text translation, trained in a supervised way on significantly more training data. First, we notice big improvements in the BLEU scores compared to the LASER baseline on French, Spanish and Swahili, with an average 5.5 BLEU gain on these languages, while being slightly behind on Russian-to-English translation (-1.2 BLEU). This last result is surprising, as our SONAR speech encoder has a much better xsim++ score on Russian compared to the LASER speech encoder. Second, we notice that for our two high-resource languages, namely French and Spanish, our zero-shot speech-to-text results are close to the Whisper Large v1 supervised results, while being trained on much less training data. As for Swahili, our framework significantly outperforms the Whisper models. We notice much better results for Russian-to-English for Whisper, which was expected given the amount of training data and the supervised setting.

Thanks to the compatibility across modalities and across languages, we decoded English, French, Spanish, Swahili and Russian speech sentence embeddings into the 200 text languages supported by our SONAR decoders. We report the zero-shot speech translation results using the fine-tuned SONAR decoder in Table 8. We notice that BLEU scores remain high for languages other than English, still in a zero-shot setting, highlighting again the compatibility between representations.

Finally, speech embeddings can be decoded into text in the same language, which can be seen as speech transcription. Since our model can often paraphrase transcriptions, we report in Table 9 BLEU scores as well as bert-scores for this zero-shot transcription task. While being significantly behind on BLEU scores, which is expected as our model often paraphrases transcriptions, we see a much smaller gap with Whisper transcriptions on the bert-score metric (which still advantages real transcriptions compared to paraphrases, but less than BLEU). The amount of training data is also significantly different between the two setups, but it is interesting to notice that the gap in terms of bert-score remains reasonable.
6.2 Scaling to 37 languages

We use the same recipe as described above to extend the coverage of the speech encoders to 37 languages. These speech encoders were trained by linguistic language family, e.g. Romance or Indian languages, using speech transcriptions only, from public and licensed sources. Table 10, column "Train", gives statistics on the amount of training data. As in Section 6.1, we evaluate the speech encoders by connecting them to the SONAR text decoder and calculating speech-to-text translation performance, as measured by BLEU. Although our results are fully zero-shot speech translation, we achieve very competitive performance compared to the state-of-the-art model Whisper v2 large (Radford et al., 2022). The average BLEU score is slightly better for SONAR than for Whisper, while being zero-shot speech translation. Our model performs less well on some high-resource languages like Mandarin Chinese, German or French, but outperforms Whisper for others like Spanish or Dutch and for several less common languages, like Swahili or Uzbek. Our modular approach seems to achieve particularly good results on Indian languages: Bengali, Hindi, Kannada, Telugu, Tamil and Urdu.

ISO | Language | Train | Ours | Whisper
arb | MS Arabic | 822 | 30.9 | 26.9
ben | Bengali | 335 | 21.3 | 14.1
cat | Catalan | 1738 | 37.7 | 36.9
cmn | Mandarin Chinese | 9320 | 18.6 | 20.8
ces | Czech | 181 | 32.0 | 30.3
cym | Welsh | 99 | 14.5 | 13.4
dan | Danish | 115 | 34.9 | 36.0
deu | German | 3329 | 36.2 | 38.8
est | Estonian | 131 | 26.1 | 21.2
fin | Finnish | 184 | 24.9 | 25.2
fra | French | 2057 | 33.7 | 34.9
hin | Hindi | 150 | 22.6 | 24.2
ind | Indonesian | 269 | 28.7 | 31.9
ita | Italian | 588 | 29.3 | 27.5
jpn | Japanese | 17319 | 20.2 | 20.8
kan | Kannada | 114 | 21.4 | 13.1
kor | Korean | 316 | 17.1 | 24.2
mlt | Maltese | 106 | 24.4 | 16.2
nld | Dutch | 1723 | 29.3 | 28.4
pes | Western Persian | 386 | 24.4 | 20.9
por | Portuguese | 269 | 38.3 | 41.4
pol | Polish | 304 | 21.1 | 25.8
ron | Romanian | 135 | 34.7 | 34.1
rus | Russian | 259 | 28.4 | 31.1
slk | Slovak | 102 | 32.3 | 29.3
spa | Spanish | 1511 | 28.0 | 27.2
swh | Swahili | 361 | 23.5 | 7.6
tam | Tamil | 245 | 16.2 | 10.0
tel | Telugu | 84 | 18.0 | 14.7
tgl | Tagalog | 108 | 14.6 | 26.8
tha | Thai | 195 | 16.9 | 17.8
tur | Turkish | 174 | 22.7 | 29.9
ukr | Ukrainian | 105 | 30.7 | 32.5
urd | Urdu | 185 | 19.7 | 18.1
uzn | Uzbek | 115 | 20.0 | 6.6
vie | Vietnamese | 194 | 19.1 | 21.9
Total/avg | | 43628 | 25.3 | 24.5

Table 10: spBLEU evaluation of S2T into English on FLEURS test set. Our models were trained on ASR transcriptions only, compared to Whisper large v2.

7 Discussion

From all the experiments presented in Section 5 and Section 6, we can draw a couple of high-level conclusions.

First, we have seen that the auto-encoding task can be largely solved even with a fixed-size bottleneck between the encoder and the decoder, showing that a fixed-size representation should not be seen as a hard limitation, as a lot of information can be stored in a single vector. Then, similarly to Artetxe and Schwenk (2019), we noticed that a translation objective is well suited to building language-agnostic representations while making sure that the encoder model encodes enough information in the sentence embedding to be efficiently decoded (in another language). Adding an MSE loss to the training, which explicitly encourages aligning languages in the sentence embedding space, leads to better language-agnostic representations. Moreover, denoising auto-encoding combined with the MSE loss can bring gains for decoding tasks, but too much of it affects the language-agnostic representations. Finally, the teacher-student approach to extend to the speech modality has once again proven to be effective, and the mutual compatibility between speech and text multilingual embeddings is clearly highlighted by the fact that speech embeddings can be decoded into foreign text in a zero-shot way.

8 Conclusion
To conclude, we introduced a new multilingual and multimodal sentence embedding space called SONAR. We conducted an extensive study on objective functions to build our multilingual teacher sentence embedding space for text, and an extensive evaluation of our SONAR framework for both similarity search and decoding tasks. We extended this new text sentence embedding space to the speech modality to introduce Sentence-level multimOdal and laNguage-Agnostic Representations (SONAR). The SONAR text and speech encoders as well as the text decoders are freely available at https://github.com/facebookresearch/SONAR.

9 Acknowledgment

We would like to thank Kevin Heffernan for his help in providing xsim and xsim++ baselines for LaBSE and LASER3, Andy Chung for providing the w2v-bert pre-trained models used as initialization for our speech encoders, Changhan Wang for providing the speech data manifests used for training, and Artyom Kozhevnikov and Naji El Hachem for the migration of models to fairseq2 for open-sourcing.

The last author's contribution was partly funded by his chair in the PRAIRIE institute, funded by the French national agency ANR as part of the "Investissements d'avenir" programme under the reference ANR-19-P3IA-0001.

References

Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, pages 597–610.

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, et al. 2021. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. 2023. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a Multilingual Speech Translation Corpus. In NAACL, pages 2012–2017.

Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee, Vedanuj Goswani, Changhan Wang, Juan Pino, Benoît Sagot, and Holger Schwenk. 2022a. Speechmatrix:

Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, and Holger Schwenk. 2022b. T-modules: Translation modules for zero-shot cross-modal machine translation. arXiv preprint arXiv:2205.12216.

Paul-Ambroise Duquenne, Hongyu Gong, and Holger Schwenk. 2021. Multimodal and multilingual embeddings for large-scale speech mining. Advances in Neural Information Processing Systems, 34.

Paul-Ambroise Duquenne, Holger Schwenk, and Benoît Sagot. 2023. Modular speech-to-text translation for zero-shot cross-modal transfer. In Interspeech.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2020. Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852.

Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. arXiv preprint arXiv:2205.12654.

Sameer Khurana, Antoine Laurent, and James Glass. 2022. SAMU-XLSR: Semantically-aligned multimodal utterance-level cross-lingual speech representation. arXiv preprint arXiv:2205.08180.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.