Enhancing Multilingual LLM Pretraining With Model-Based Data Selection
Dataset curation has become a basis for strong large language model (LLM) performance. While

[Figure 1: average accuracy comparison between baseline FineWeb-2 and our filtering approaches.]
to multilingual datasets. While model perplexity-based filtering is commonly applied to multilingual datasets (Wenzek et al., 2019; Laurençon et al., 2022; Nguyen et al., 2023), the current state-of-the-art, FineWeb-2 (Penedo et al., 2024c), primarily relies on heuristic-based filters. In this work, we focus on model-based filtering with a quality definition that emphasizes: 1) structured data and 2) knowledge-rich data samples, to enhance multilingual pretraining datasets.

To achieve this, we leverage embedding-based classification models. Firstly, we adopt the FastText quality filtering approach from DCLM to develop a unified framework for multilingual datasets that span diverse language families, scripts, and resource availability, focusing on Chinese, German, French, Arabic, and Danish as representative languages for our experiments. Additionally, we extend this embedding-based approach by incorporating Transformer (Vaswani et al., 2023) embeddings, specifically XLM-RoBERTa (Conneau et al., 2020), for filtering. We compare the performance between baseline FineWeb-2 data and our best FastText and Transformer embedding-based approaches in Figure 1.

In summary, our contributions are as follows:

• We propose a transparent, simple, and unified framework for multilingual model-based filtering at web scale, enabling data curation across diverse language families, scripts, and resource availability.

• We present comprehensive per-language ablation studies of embedding-based multilingual quality filtering on top of the FineWeb-2 dataset (Penedo et al., 2024c), achieving performance comparable to the baseline while using as little as 15% of the tokens. We additionally analyze the impact of dataset contamination and multilingual LLM training.

• We evaluate the impact of different training datasets for data selection classifiers on the downstream performance of LLMs.

• We release the refined pretraining dataset2 covering 20 languages3, filtered using our proposed framework, along with the codebase4, to advance multilingual language modeling.

2 huggingface.co/epfml
3 Russian, Chinese, German, Japanese, Spanish, French, Italian, Portuguese, Polish, Dutch, Indonesian, Turkish, Czech, Vietnamese, Swedish, Persian, Arabic, Greek, Danish, Hungarian
4 github.com/epfml/fineweb2-hq
5 commoncrawl.org

2. Related Work

Data Curation. In order to pretrain LLMs on a large amount of diverse texts, Common Crawl5 is often used as the base dataset. However, early works already observed that performing quality filtering on Common Crawl is crucial for model performance (Brown et al., 2020). There exist various data curation approaches, such as deduplication (Lee et al., 2022), PII removal (Subramani et al., 2023), or toxicity filtering (Arnett et al., 2024). Another important aspect is quality filtering of the documents. For this, the definition of quality is an important aspect. A common approach is to use heuristics to remove documents outside of the target distribution, such as filtering based on average word length, existence of punctuation, or document length (Rae et al., 2021; Raffel et al., 2020). Another approach is to define model-based filters, where research has focused on the perplexity measure of the text (Wenzek et al., 2019) or focused on educational (Penedo et al., 2024a) and conversational documents (Li et al., 2024b). In this work, we build upon previous curated datasets based on heuristic filtering, specifically FineWeb-2 (Penedo et al., 2024c), and focus on model-based filtering for structured and knowledge-rich documents relying on textual embedding representation.

Curated English datasets. One of the early curated datasets was C4 (Raffel et al., 2020), followed by MassiveText (Rae et al., 2021). RefinedWeb (Penedo et al., 2023) was an important step forward, demonstrating that filtered web data can outperform selected high-quality data sources. While these datasets have not been made fully publicly available, their filtering techniques have been expanded upon in recent fully public datasets, such as Dolma (Soldaini et al., 2024), FineWeb, and FineWeb-Edu (Penedo et al., 2024a). While FineWeb primarily relies on filter heuristics for data quality, Dolma adopts model perplexity filtering. FineWeb-Edu takes model-based filtering a step further and relies on LLM-based quality assessment. Similarly, a concurrent work, DCLM, has achieved competitive performance using a FastText (Joulin et al., 2017) classifier trained on a carefully selected training dataset. In this work we adapt and extend this approach to the multilingual context.

Curated Multilingual Datasets. Analogously to the English datasets, there have been efforts in the multilingual space. An influential work has been CCNet (Wenzek et al., 2019), whose language identification and model perplexity filter for data quality has been re-used in later datasets. Again, while CCNet was not published directly but rather provided the tools for data cleaning, RedPajama (Together Computer, 2023) is a prominent multilingual dataset relying on these filtering techniques. While RedPajama offers data in 5 European languages, other datasets, such as OSCAR (Ortiz Suárez et al., 2019; Abadji et al., 2021; Abadji et al., 2022), mC4 (Xue et al., 2021), ROOTS (Laurençon et al., 2022), MADLAD-400 (Kudugunta et al., 2023), CulturaX (Nguyen et al., 2023), and HPLT (de Gibert et al., 2024), focus on expanding beyond, spanning a variety of language families and scripts.
While they offer refined datasets for hundreds of languages, FineWeb-2 (Penedo et al., 2024c) pushes the limit to thousands of languages and further improves the performance. Our work also focuses on filtering quality samples across various language families and scripts. However, we limit our scope to 20 languages, as the number of documents drops quickly and there is a trade-off between retaining a sufficient number of pretraining tokens and ensuring data quality (Muennighoff et al., 2023; Held et al., 2025). In our results, we observe the greatest benefits using stricter data filtering.

Multilingual Embedding Models. Early word embedding models like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) lacked contextual understanding. FastText (Bojanowski et al., 2017) built upon them and improved performance by incorporating subword information. Transformer (Vaswani et al., 2023) models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) then revolutionized the field with context-aware embeddings. Multilingual models like mBERT, XLM (Lample & Conneau, 2019), and XLM-RoBERTa (Conneau et al., 2020) further advanced cross-lingual understanding, with recent open-source LLMs pushing performance even higher (Llama Team, 2024; Mistral AI, 2025). Using such models, documents as well as representative samples can be mapped into a shared embedding space to estimate their similarity. Focusing on transparency, simplicity, and efficiency in our work, we use FastText and XLM-RoBERTa for our filtering, and analyze the trade-off between computational complexity and filtering performance.

Multilingual Evaluation. Evaluating LLMs requires diverse benchmarks testing linguistic and cognitive abilities like reading comprehension, reasoning, and knowledge. While English benchmarks like MMLU (Hendrycks et al., 2020) and ARC (Clark et al., 2018) exist, other languages often use translations from English, e.g., XNLI (Conneau et al., 2018) and a machine-translated version of MMLU (Lai et al., 2023). However, translations can be problematic, failing to capture cultural nuances or introducing "translationese" (Romanou et al., 2024). Recent work by Romanou et al. (2024); Singh et al. (2024a) emphasizes the need for culturally sensitive, natively collected benchmarks. Task difficulty and task formulation also impact model performance when models are trained for shorter durations (Kydlíček et al., 2024). In our work, we follow the recent evaluation task selection and methodology by Kydlíček et al. (2024) to assess our model-based filtering approaches across multiple languages.

3. Methods

In this work, we present our model-based filtering approaches. Our methodology is structured into two key components: 1) we select suitable training datasets, aiming to identify a diverse set of structured and knowledge-rich samples, and 2) we describe the different models, namely FastText and Transformer embedding-based filters, used to capture and leverage these characteristics.

3.1. Classifier Training Dataset

Representative Sample Selection. Our goal is to identify a diverse set of structured and knowledge-rich samples, especially within a multilingual context. We define two criteria for our training datasets: 1) the samples must be informative and well-structured and 2) the datasets must be available in multiple languages. While some multilingual benchmark datasets meet these criteria precisely, it is important to note that we do not train the LLM directly on this data. Instead, we train a proxy model to assess pretraining data quality. Nevertheless, we must remain cautious about potentially increased pretraining data contamination stemming from this approach, as discussed in Section 4.2.6.

Based on our criteria, we selected the following datasets as representative examples.

• Aya Collection. A prompt-completion dataset comprising ∼514M samples covering a wide variety of tasks, generated using instruction-style templates in 101 languages (Singh et al., 2024b).

• Aya Dataset. A human-annotated instruction fine-tuning dataset consisting of ∼202K prompt-completion pairs in 65 languages (Singh et al., 2024b).

• MMLU. Originally created for English, the dataset contains ∼14K multiple-choice knowledge questions in diverse subjects and areas (Hendrycks et al., 2020). A multilingual version was translated into 14 languages by professional translators (OpenAI, 2024).

• OpenAssistant-2. The dataset contains ∼14K user-assistant conversations with multiple messages in 28 languages (Fischer et al., 2024).

• Include-Base-44. Multiple-choice questions focused on general and regional knowledge, as well as reasoning, extracted from academic and professional exams. Spanning 44 languages, it includes a total of ∼23K samples (Romanou et al., 2024).

Representative Sample Collection. MMLU and Include-Base-44 are highly curated benchmark datasets, containing structured, knowledge-rich samples. The Aya Dataset is human-curated, while OpenAssistant-2 is partially human-curated and partially generated by large language models (LLMs). In contrast, the Aya Collection consists of various AI-generated samples without quality guarantee, though it represents the largest and most multilingual of the five. To address this quality difference, we create two Multilingual Knowledge Collection (MKC) configurations:
• MKC: Includes Include-Base-44, OpenAssistant-2, MMLU, and the Aya Dataset

• MKC+: Includes MKC and the Aya Collection

This allows us to evaluate the trade-off between data quality and scale.
Dataset Creation. For our model-based filtering approaches, our goal is to identify documents from the pretraining dataset that are most similar to our representative samples, with the notion of similarity determined by the specific classifier used. We can measure the similarity to our training dataset directly, for example, by computing the cosine similarity to our training samples in the embedding space. Alternatively, following the approach of Li et al. (2024b), the task can be framed as a binary classification problem, with the representative samples as the positive class. For the negative class, we can simply subsample documents from our pretraining dataset, under the assumption that the majority of these documents are neither well-structured nor knowledge-rich. We use both approaches for our classifiers.

To create the binary classification training dataset, we selected 80K random examples from the training set (MKC or MKC+) as positive samples and 80K random examples from FineWeb-2 as negative samples. For smaller datasets, such as Include-Base-44, the entire dataset was used. The same training dataset was utilized across all model-based filtering approaches, disregarding negative samples when unnecessary. Additionally, we created a training dataset for each language individually to avoid leaking language-specific biases to data of other languages.
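To make this construction concrete, the following is a minimal sketch of how such a per-language binary training set could be assembled; the pandas-based implementation, the column names, and the shuffling details are illustrative assumptions rather than the released pipeline.

import pandas as pd

def build_classifier_training_set(positive_df, fineweb2_df, n_per_class=80_000, seed=42):
    # Representative (MKC or MKC+) samples form the positive class,
    # random FineWeb-2 documents of the same language form the negative class.
    n_pos = min(n_per_class, len(positive_df))  # smaller datasets are used in full
    pos = positive_df.sample(n=n_pos, random_state=seed).assign(label=1)
    neg = fineweb2_df.sample(n=n_per_class, random_state=seed).assign(label=0)
    data = pd.concat([pos, neg], ignore_index=True)[["text", "label"]]
    return data.sample(frac=1.0, random_state=seed).reset_index(drop=True)  # shuffle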
hidden-layer neural network with a hidden dimension of 256,
Sample Pre-processing. We applied no pre-processing the ReLU activation function, a dropout rate of 20%, and the
to the FineWeb-2 (negative) samples but performed mini- sigmoid function on the output. The network was trained
mal pre-processing on the representative (positive) samples. for 6 epochs using the AdamW optimizer (Loshchilov,
For instance, in datasets like MMLU or OpenAssistant-2, 2017) with a constant learning rate 0.0003 and binary cross-
we concatenated various sample components. For the Aya entropy loss. We computed document scores using the out-
Collection, we resolved encoding issues in non-Latin lan- put layer of the MLP model, which used XML-RoBERTa
guages and removed samples containing <unk> tokens, document embeddings as input.
which were particularly prevalent in Arabic data (37.1%).
3.2. FastText-based Filtering (FT)

To efficiently process datasets with over 100 million documents (Penedo et al., 2024c), similar to DCLM (Li et al., 2024b), we used a binary FastText classifier (Joulin et al., 2017). This classifier runs on CPU and can be easily deployed across multiple cores, for example using DataTrove (Penedo et al., 2024b).

We trained our FastText classifier on the processed training set using 2-gram features (4-gram for Chinese). Additional details about the training process are given in Appendix A.1. These classifiers were then used to assign scores to all documents in the pretraining dataset. To filter the dataset, we applied a score threshold based on the desired retention percentage of documents. This approach balances dataset size and the predicted quality of the samples.
3.3. Transformer Embedding-based Filtering

To leverage rich semantic information based on contextual relationships, we utilized Transformer model embeddings. Specifically, we selected a pretrained XLM-RoBERTa base model (Conneau et al., 2020) due to its support of 100 languages, a relatively small size of approximately 279M parameters, and its transparent training procedure. This choice enabled us to process web-scale data efficiently without being restricted to a single language and to align with our commitment to open science.

To retain general embeddings that can be reused across methods, we opted against fine-tuning the model. For each document from our datasets, we computed the 768-dimensional embedding by mean pooling the embeddings of the output sequence. Since the model has a fixed maximum sequence length of 512 tokens, we considered only the first 512 tokens of each document, assuming they are representative of the entire document.
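The embedding computation can be sketched as follows with the Hugging Face transformers library; the masked mean pooling over non-padding positions is our assumption of how the mean pooling described above is implemented.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def embed(documents: list[str]) -> torch.Tensor:
    # Mean-pooled 768-dimensional embeddings over the first 512 tokens of each document.
    batch = tokenizer(documents, max_length=512, truncation=True,
                      padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # exclude padding from the mean
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)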
After computing the embeddings of our corpora, we experimented with two methods: 1) classification of embeddings using a multi-layer perceptron and 2) cosine similarity between the embeddings. As in the FastText approach, we scored each document and applied a threshold to retain the desired percentage of the highest-scoring documents.

Multi-Layer Perceptron (MLP). We trained a single-hidden-layer neural network with a hidden dimension of 256, the ReLU activation function, a dropout rate of 20%, and the sigmoid function on the output. The network was trained for 6 epochs using the AdamW optimizer (Loshchilov, 2017) with a constant learning rate of 0.0003 and binary cross-entropy loss. We computed document scores using the output layer of the MLP model, which used XLM-RoBERTa document embeddings as input.
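A sketch of this classifier in PyTorch, using the hyperparameters stated above; the batching and training-loop details are assumptions.

import torch
from torch import nn

# Single-hidden-layer classifier over 768-dimensional XLM-RoBERTa document embeddings.
mlp = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)
optimizer = torch.optim.AdamW(mlp.parameters(), lr=3e-4)  # constant learning rate
loss_fn = nn.BCELoss()

def train(loader, epochs=6):
    # `loader` is assumed to yield (embedding, label) batches from the 80K/80K training set.
    mlp.train()
    for _ in range(epochs):
        for emb, label in loader:
            optimizer.zero_grad()
            loss = loss_fn(mlp(emb).squeeze(-1), label.float())
            loss.backward()
            optimizer.step()

@torch.no_grad()
def mlp_score(embeddings: torch.Tensor) -> torch.Tensor:
    mlp.eval()
    return mlp(embeddings).squeeze(-1)  # per-document score in [0, 1]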
Cosine Similarity (CS). We computed each document's score as the maximum cosine similarity between its embedding and a set of K randomly sampled positive sample embeddings. We experimented with varying values of K, including 1024, 2048, 4096, 8192, and 16384. However, we did not observe significant differences in the documents with high scores across these variations when manually inspecting the data. To strike a balance between the diversity of the positive samples and computational efficiency, we chose K = 8192 for our experiments.
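The cosine-similarity scoring admits a compact sketch; the sampling of the K reference embeddings and the batching are illustrative assumptions.

import torch
import torch.nn.functional as F

def cosine_scores(doc_embeddings: torch.Tensor,
                  positive_embeddings: torch.Tensor,
                  k: int = 8192, seed: int = 0) -> torch.Tensor:
    # Score each document by its maximum cosine similarity to K randomly
    # sampled positive (representative) sample embeddings.
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(positive_embeddings.size(0), generator=g)[:k]
    ref = F.normalize(positive_embeddings[idx], dim=-1)   # (K, 768)
    docs = F.normalize(doc_embeddings, dim=-1)            # (N, 768)
    return (docs @ ref.T).max(dim=-1).values              # (N,) scores

As in the other methods, a score threshold corresponding to the desired retention rate is then applied to the resulting scores.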
[Figure 2: accuracy versus tokens seen (0B-120B) for panels (a) English (MMLU), (b) Chinese (CMMLU), (c) German (MMLU), (d) French (MMLU), (e) Arabic (MMLU), and (f) Danish (ARC), with curves for FT MKC+, MLP MKC+, CS MKC+, and FineWeb-2.]

Figure 2. Benchmark performance comparison (accuracy) during training on 119B tokens between the baseline methods (FineWeb, DCLM, FineWeb-Edu, and FineWeb-2) and our proposed filtering methods (FT, MLP, and CS), trained on MKC+. When using our approaches, the data retention rates are set to 10% for English, Chinese, German, and French, 56% for Arabic, and 65% for Danish. For English, Chinese, German, and French, baseline-level performance is observed around 20B tokens consumed (16.7% of the total).
Despite the lack of a quality guarantee for all samples in the Aya Collection, this dataset yields strong performance, making our approach applicable for various languages. Overall, we observe that the diversity resulting from combining all individual training datasets gives the best results.

Interestingly, models trained exclusively on Include-Base-44 and OpenAssistant-2 perform worse overall than the baseline. This may be due to the nature of these datasets. For instance, Include-Base-44 is relatively small and domain-specific, e.g., consisting primarily of driving license exam questions in its German subset. Similarly, OpenAssistant-2 includes a limited number of samples, with fewer than 2K positive samples per training set, which likely negatively impacts classifier performance. Again, we relate model performance to the average document length bias in Appendix B.3 and confirm the findings from Section 4.2.2, suggesting that factors beyond the retained document length bias may influence performance.

Table 3. Benchmark performance comparison (average rank) between the baseline (FineWeb-2) and the MLP filtering method trained on either MKC+ as a whole or its individual dataset components, retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish. The average rank is computed across FineTasks performance of 1B-parameter models trained on each language with 30B tokens.

Dataset            Average Rank
MKC+               2.52
Aya Collection     2.91
Aya Dataset        3.17
MMLU               3.57
Baseline           4.09
OpenAssistant-2    4.53
Include-Base-44    5.42

Table 4 presents the results of experiments where 5% and 10% of unfiltered data are mixed back into our filtered datasets.

Table 4. Benchmark performance comparison (average rank) of our MLP MKC+ and FT MKC+ approaches, retaining top 10% of the documents while mixing in 0%, 5% or 10% of the original FineWeb-2 dataset. The average rank is computed across FineTasks performance of 1B-parameter models evaluated for Chinese, German, or French, after consuming 70B and 119B tokens.

Approach     Mixture Rate    Average Rank
MLP MKC+     0%              4.36
MLP MKC+     5%              5.09
MLP MKC+     10%             5.40
FT MKC+      10%             7.17
FT MKC+      0%              7.51
FT MKC+      5%              8.66

4.2.5. Approach Validation on English

Previous experiments have shown strong performance of our MLP MKC+ approach. But do these results translate to English? Table 5 presents the performance of MLP MKC+ with 10% retention applied to the English FineWeb dataset (Penedo et al., 2024a). Our method is compared against FineWeb and baselines using model-based filtered datasets, including DCLM (Li et al., 2024b) and FineWeb-Edu (Penedo et al., 2024a). To save computational resources, we use the 6 most recent FineWeb and FineWeb-Edu dumps and the first partition of DCLM6, which we denote with ∗. Each of these subsets contains more than 119B tokens, with FineWeb retaining this size even after applying our filtering with a top-10% retention rate.

6 huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet

While each approach demonstrates strengths in different benchmarks, as seen from Table 5 and Figure 2, the overall average rank results indicate that our method outperforms all other baselines.
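For reference, the average rank reported throughout these tables can be computed from per-task accuracies as in the following sketch; the three example values per column are taken from Table 5 purely for illustration.

import pandas as pd

# Rows: benchmark tasks, columns: datasets/approaches, values: accuracy.
scores = pd.DataFrame(
    {"Ours": [0.3550, 0.6670, 0.3870], "FW*": [0.3010, 0.5880, 0.3850]},
    index=["ARC (Challenge)", "ARC (Easy)", "CommonsenseQA"],
)
# Rank the approaches within each task (1 = best), then average the ranks over tasks.
average_rank = scores.rank(axis=1, ascending=False).mean(axis=0)
print(average_rank)  # lower is better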
Table 5. Benchmark performance comparison for English of our MLP MKC+ approach (retaining top 10% of the documents) against baseline datasets: FineWeb, DCLM, and FineWeb-Edu. The average rank is computed across SmolLM task performance for 1B-parameter models trained on 119B tokens.

Dataset           Ours     DCLM∗    FW-Edu∗   FW∗
Average Rank      1.8333   2.3889   2.4444    3.3333
ARC (Challenge)   0.3550   0.3530   0.3850    0.3010
ARC (Easy)        0.6670   0.6470   0.6970    0.5880
CommonsenseQA     0.3870   0.4100   0.3770    0.3850
HellaSwag         0.6040   0.5960   0.5700    0.5930
MMLU              0.3400   0.3160   0.3470    0.3030
OpenBookQA        0.3860   0.3840   0.4180    0.3560
PIQA              0.7510   0.7510   0.7410    0.7620
WinoGrande        0.5720   0.5610   0.5660    0.5550
TriviaQA          0.0820   0.1240   0.0320    0.0370

Table 6. Benchmark performance comparison in English for our MLP MKC+ approach (retaining top 10% of the documents), both decontaminated (D) and non-decontaminated, against the baseline FineWeb datasets, also in decontaminated and non-decontaminated variants. The average rank is computed across SmolLM task performance for 1B-parameter models trained on 119B tokens.

Dataset           Ours     OursD    FW∗      FW∗D
Average Rank      1.5000   2.1111   3.0556   3.3333
ARC (Challenge)   0.3550   0.3440   0.3010   0.2880
ARC (Easy)        0.6670   0.6520   0.5880   0.5700
CommonsenseQA     0.3870   0.4000   0.3850   0.3820
HellaSwag         0.6040   0.6040   0.5930   0.5890
MMLU              0.3400   0.3220   0.3030   0.3050
OpenBookQA        0.3860   0.3840   0.3560   0.3740
PIQA              0.7510   0.7590   0.7620   0.7600
WinoGrande        0.5720   0.5550   0.5550   0.5570
TriviaQA          0.0820   0.0380   0.0370   0.0250
4.2.6. Data Contamination Analysis

Our LLMs are never trained on benchmark datasets. But is the strong performance observed in the previous sections primarily due to an increased ratio of data contamination? To ensure the validity of our approach, we conduct decontamination experiments, as web crawl data may include evaluation benchmark tasks. While Li et al. (2024b) addressed similar concerns, our approach follows the methodology of Brown et al. (2020). Specifically, we perform 13-gram decontamination of the LLM training data separately for English and French evaluation benchmarks. However, unlike the original approach, we remove the entire document if it is flagged as contaminated, using the implementation provided in DataTrove (Penedo et al., 2024b).
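A simplified sketch of this 13-gram document removal is given below; the whitespace tokenization and the in-memory set representation are illustrative assumptions, whereas our experiments rely on the DataTrove implementation.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(documents, benchmark_samples, n: int = 13):
    # Remove the entire document if it shares any 13-gram with an evaluation sample.
    contaminated = set()
    for sample in benchmark_samples:
        contaminated |= ngrams(sample, n)
    return [doc for doc in documents if not (ngrams(doc, n) & contaminated)]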
Tables 6 and 7 present the results of decontamination experiments for English and French, respectively. We used the following experimental setup (removed document contamination rates): baseline FineWeb English (0.16%), MLP MKC+ English with 10% retention (0.19%), baseline FineWeb-2 French (0.14%), and MLP MKC+ French with 10% retention (0.14%). As in our previous experiments, we train the models on 119B tokens. Additionally, we compare the results against equivalent training runs without decontamination to further analyze its impact. For an example of a contaminated sample, see Appendix E.

For English models, decontamination slightly reduces performance both for our approach and baseline FineWeb data. However, even when decontaminated, our approach still outperforms training on non-decontaminated baseline data. For French models, performance of our approach is comparable between decontaminated and non-decontaminated datasets, with both continuing to outperform baseline FineWeb-2 data. Interestingly, decontaminated baseline data yields better results than its non-decontaminated counterpart.

Table 7. Benchmark performance comparison in French for our MLP MKC+ approach (retaining top 10% of the documents), both decontaminated (D) and non-decontaminated, against the baseline FineWeb-2 datasets, also in decontaminated and non-decontaminated variants. The average rank is computed across FineTasks performance for 1B-parameter models trained on 119B tokens.

Dataset           Ours     OursD    FW-2D    FW-2
Average Rank      2.0556   2.0556   2.7222   3.1667
Belebele          0.3533   0.3400   0.3778   0.3444
HellaSwag         0.5380   0.5350   0.5180   0.5180
X-CSQA            0.2740   0.2810   0.2730   0.2870
XNLI 2.0          0.7400   0.7400   0.7070   0.7180
FQuAD             0.2803   0.2620   0.2890   0.2401
MMLU              0.2895   0.2875   0.2711   0.2706
Mintaka           0.0438   0.0797   0.0658   0.0712
X-CODAH           0.2667   0.2900   0.2800   0.2633
ARC (Challenge)   0.3180   0.3110   0.2880   0.2850

4.2.7. Impact on Multilingual Model Training

Although not the primary focus of our work, we believe that refined datasets can contribute to advancing the performance of multilingual models. To investigate this, we conducted an ablation study by training a 1B-parameter model on 595B tokens (5×119B), covering all five languages: Chinese, German, French, Arabic, and Danish. We trained two models: the first one using our filtered FineWeb-2 dataset and the second one using unfiltered FineWeb-2 data. We then compared these results for each language against their monolingual counterparts trained on 119B tokens.

The results for French are presented in Table 8. We observe that the multilingual LLM outperforms its monolingual counterpart on our filtered datasets, whereas the monolingual model achieves better performance than the multilingual one on the unfiltered FineWeb-2 data.
Acknowledgements

We thank Guilherme Penedo, Hynek Kydlíček, and Leandro von Werra for their help with FineWeb-2 data, and Alex Hägele for providing feedback on the paper draft.

This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a06 on Alps.

References

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Chen, Z., Cano, A. H., Romanou, A., Bonnet, A., Matoba, K., Salvi, F., Pagliardini, M., Fan, S., Köpf, A., Mohtashami, A., Sallinen, A., Sakhaeirad, A., Swamy, V., Krawczuk, I., Bayazit, D., Marmet, A., Montariol, S., Hartley, M.-A., Jaggi, M., and Bosselut, A. Meditron-70b: Scaling medical pretraining for large language models, 2023. URL https://arxiv.org/abs/2311.16079.
Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020. URL https://arxiv.org/abs/2003.05002.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018. URL https://arxiv.org/abs/1803.05457.

Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V. XNLI: Evaluating cross-lingual sentence representations. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269/.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale, 2020. URL https://arxiv.org/abs/1911.02116.

Cui, Y., Liu, T., Che, W., Xiao, L., Chen, Z., Ma, W., Wang, S., and Hu, G. A span-extraction dataset for chinese machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019. doi: 10.18653/v1/d19-1600. URL http://dx.doi.org/10.18653/v1/D19-1600.

de Gibert, O., Nail, G., Arefyev, N., Bañón, M., van der Linde, J., Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., and Tiedemann, J. A new massive multilingual dataset for high-performance language technologies. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N. (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 1116–1128, Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.100.

De Gibert, O., Nail, G., Arefyev, N., Bañón, M., Van Der Linde, J., Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., et al. A new massive multilingual dataset for high-performance language technologies. arXiv preprint arXiv:2403.14009, 2024.

DeepSeek-AI, Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., Gao, H., Gao, K., Gao, W., Ge, R., Guan, K., Guo, D., Guo, J., Hao, G., Hao, Z., He, Y., Hu, W., Huang, P., Li, E., Li, G., Li, J., Li, Y., Li, Y. K., Liang, W., Lin, F., Liu, A. X., Liu, B., Liu, W., Liu, X., Liu, X., Liu, Y., Lu, H., Lu, S., Luo, F., Ma, S., Nie, X., Pei, T., Piao, Y., Qiu, J., Qu, H., Ren, T., Ren, Z., Ruan, C., Sha, Z., Shao, Z., Song, J., Su, X., Sun, J., Sun, Y., Tang, M., Wang, B., Wang, P., Wang, S., Wang, Y., Wang, Y., Wu, T., Wu, Y., Xie, X., Xie, Z., Xie, Z., Xiong, Y., Xu, H., Xu, R. X., Xu, Y., Yang, D., You, Y., Yu, S., Yu, X., Zhang, B., Zhang, H., Zhang, L., Zhang, L., Zhang, M., Zhang, M., Zhang, W., Zhang, Y., Zhao, C., Zhao, Y., Zhou, S., Zhou, S., Zhu, Q., and Zou, Y. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, 2024. URL https://arxiv.org/abs/2401.02954.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.

d'Hoffschmidt, M., Belblidia, W., Brendlé, T., Heinrich, Q., and Vidal, M. FQuAD: French Question Answering Dataset, 2020. URL https://arxiv.org/abs/2002.06071.

Fischer, S., Rossetto, F., Gemmell, C., Ramsay, A., Mackie, I., Zubel, P., Tecklenburg, N., and Dalton, J. Open assistant toolkit – version 2. arXiv preprint arXiv:2403.00586, 2024.

Fourrier, C., Habib, N., Wolf, T., and Tunstall, L. LightEval: A lightweight framework for LLM evaluation, 2023. URL https://github.com/huggingface/lighteval.

Hägele, A., Bakouch, E., Kosson, A., Allal, L. B., Von Werra, L., and Jaggi, M. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint arXiv:2405.18392, 2024.

Hardalov, M., Mihaylov, T., Zlatkova, D., Dinkov, Y., Koychev, I., and Nakov, P. Exams: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering, 2020. URL https://arxiv.org/abs/2011.03080.

Held, W., Paranjape, B., Koura, P. S., Lewis, M., Zhang, F., and Mihaylov, T. Optimizing pretraining data mixtures with llm-estimated utility. arXiv preprint arXiv:2501.11747, 2025.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hu, H., Richardson, K., Xu, L., Li, L., Kübler, S., and Moss, L. OCNLI: Original Chinese Natural Language Inference. In Cohn, T., He, Y., and Liu, Y. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3512–3526, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.314. URL https://aclanthology.org/2020.findings-emnlp.314/.

Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Lei, J., Fu, Y., Sun, M., and He, J. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, 2023. URL https://arxiv.org/abs/2305.08322.

Hugging Face. Nanotron, 2024a. URL https://github.com/huggingface/nanotron. Accessed 30 Jan. 2025.

Hugging Face. SmolLM - blazingly fast and remarkably powerful, 2024b. URL https://huggingface.co/blog/smollm. Accessed 30 Jan. 2025.

Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Barzilay, R. and Kan, M.-Y. (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147/.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics, April 2017.

Kudugunta, S., Caswell, I., Zhang, B., Garcia, X., Choquette-Choo, C. A., Lee, K., Xin, D., Kusupati, A., Stella, R., Bapna, A., and Firat, O. MADLAD-400: A Multilingual And Document-Level Large Audited Dataset, 2023. URL https://arxiv.org/abs/2309.04662.

Kydlíček, H., Penedo, G., Fourier, C., Habib, N., and Wolf, T. FineTasks: Finding signal in a haystack of 200+ multilingual tasks, 2024. URL https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks. Accessed 30 Jan. 2025.

Lai, V., Nguyen, C., Ngo, N., Nguyen, T., Dernoncourt, F., Rossi, R., and Nguyen, T. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. In Feng, Y. and Lefever, E. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 318–327, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.28. URL https://aclanthology.org/2023.emnlp-demo.28/.

Lample, G. and Conneau, A. Cross-lingual language model pretraining, 2019. URL https://arxiv.org/abs/1901.07291.

Laurençon, H., Saulnier, L., Wang, T., Akiki, C., Villanova del Moral, A., Le Scao, T., Von Werra, L., Mou, C., González Ponferrada, E., Nguyen, H., et al. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems, 35:31809–31826, 2022.

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8424–8445, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.577. URL https://aclanthology.org/2022.acl-long.577/.

Lewis, P., Oğuz, B., Rinott, R., Riedel, S., and Schwenk, H. Mlqa: Evaluating cross-lingual extractive question answering, 2020. URL https://arxiv.org/abs/1910.07475.

Li, H., Zhang, Y., Koto, F., Yang, Y., Zhao, H., Gong, Y., Duan, N., and Baldwin, T. Cmmlu: Measuring massive multitask language understanding in chinese, 2024a. URL https://arxiv.org/abs/2306.09212.

Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S., Bansal, H., Guha, E., Keh, S., Arora, K., et al. DataComp-LM: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024b.

Lin, B. Y., Lee, S., Qiao, X., and Ren, X. Common sense beyond English: Evaluating and improving multilingual language models for commonsense reasoning. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1274–1287, Online, August 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.102. URL https://aclanthology.org/2021.acl-long.102/.

Lin, X. V., Mihaylov, T., Artetxe, M., Wang, T., Chen, S., Simig, D., Ott, M., Goyal, N., Bhosale, S., Du, J., Pasunuru, R., Shleifer, S., Koura, P. S., Chaudhary, V., O'Horo, B., Wang, J., Zettlemoyer, L., Kozareva, Z., Diab, M. T., Stoyanov, V., and Li, X. Few-shot learning with multilingual language models. CoRR, abs/2112.10668, 2021b. URL https://arxiv.org/abs/2112.10668.

Llama Team. The Llama 3 Herd of Models, 2024. URL https://arxiv.org/abs/2407.21783.

Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018. URL https://arxiv.org/abs/1809.02789.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space, 2013. URL https://arxiv.org/abs/1301.3781.

Mistral AI. v3 (tekken) tokenizer, 2024. URL https://docs.mistral.ai/guides/tokenization/. Accessed 30 Jan. 2025.

Mistral AI. Mistral small 3, 2025. URL https://mistral.ai/news/mistral-small-3/. Accessed 30 Jan. 2025.

Mozannar, H., Hajal, K. E., Maamary, E., and Hajj, H. Neural arabic question answering, 2019. URL https://arxiv.org/abs/1906.05394.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models, 2023. URL https://arxiv.org/abs/2305.16264.

Nguyen, T., Van Nguyen, C., Lai, V. D., Man, H., Ngo, N. T., Dernoncourt, F., Rossi, R. A., and Nguyen, T. H. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400, 2023.

OpenAI. MMMLU, 2024. URL https://huggingface.co/datasets/openai/MMMLU. Accessed 30 Jan. 2025.

Ortiz Suárez, P. J., Sagot, B., and Romary, L. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019, pp. 9–16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. URL http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215.

Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Alobeidli, H., Cappelli, A., Pannier, B., Almazrouei, E., and Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only. Advances in Neural Information Processing Systems, 36:79155–79172, 2023.
Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., Wolf, T., et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557, 2024a.

Penedo, G., Kydlíček, H., Cappelli, A., Sasko, M., and Wolf, T. DataTrove: large scale data processing, 2024b. URL https://github.com/huggingface/datatrove. Accessed 30 Jan. 2025.

Penedo, G., Kydlíček, H., Sabolčec, V., Messmer, B., Foroutan, N., Jaggi, M., von Werra, L., and Wolf, T. FineWeb2: A sparkling update with 1000s of languages, December 2024c. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-2. Accessed 30 Jan. 2025.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162/.

Pluto-Junzeng. pluto-junzeng/chinesesquad, 2019. URL https://github.com/pluto-junzeng/ChineseSquad. Accessed 30 Jan. 2025.

Ponti, E. M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., and Korhonen, A. XCOPA: A multilingual dataset for causal commonsense reasoning. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2362–2376, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185/.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Romanou, A., Foroutan, N., Sotnikova, A., Chen, Z., Nelaturu, S. H., Singh, S., Maheshwary, R., Altomare, M., Haggag, M. A., Amayuelas, A., et al. Include: Evaluating multilingual language understanding with regional knowledge. arXiv preprint arXiv:2411.19799, 2024.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641, 2019.

Sen, P., Aji, A. F., and Saffari, A. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering, 2022. URL https://arxiv.org/abs/2210.01613.

Singh, S., Romanou, A., Fourrier, C., Adelani, D. I., Ngui, J. G., Vila-Suero, D., Limkonchotiwat, P., Marchisio, K., Leong, W. Q., Susanto, Y., Ng, R., Longpre, S., Ko, W.-Y., Smith, M., Bosselut, A., Oh, A., Martins, A. F. T., Choshen, L., Ippolito, D., Ferrante, E., Fadaee, M., Ermis, B., and Hooker, S. Global mmlu: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2024a. URL https://arxiv.org/abs/2412.03304.

Singh, S., Vargus, F., Dsouza, D., Karlsson, B. F., Mahendiran, A., Ko, W.-Y., Shandilya, H., Patel, J., Mataciunas, D., OMahony, L., Zhang, M., Hettiarachchi, R., Wilson, J., Machado, M., Moura, L. S., Krzemiński, D., Fadaei, H., Ergün, I., Okoh, I., Alaagib, A., Mudannayake, O., Alyafeai, Z., Chien, V. M., Ruder, S., Guthikonda, S., Alghamdi, E. A., Gehrmann, S., Muennighoff, N., Bartolo, M., Kreutzer, J., Üstün, A., Fadaee, M., and Hooker, S. Aya dataset: An open-access collection for multilingual instruction tuning, 2024b.

Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.

Subramani, N., Luccioni, S., Dodge, J., and Mitchell, M. Detecting personal information in training corpora: an analysis. In Ovalle, A., Chang, K.-W., Mehrabi, N., Pruksachatkun, Y., Galystan, A., Dhamala, J., Verma, A., Cao, T., Kumar, A., and Gupta, R. (eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pp. 208–220, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.trustnlp-1.18. URL https://aclanthology.org/2023.trustnlp-1.18/.

Sun, K., Yu, D., Yu, D., and Cardie, C. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155, 2020. doi: 10.1162/tacl_a_00305. URL https://aclanthology.org/2020.tacl-1.10/.

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421/.

Tikhonov, A. and Ryabinin, M. It's all in the heads: Using attention heads as a baseline for cross-lingual transfer in commonsense reasoning, 2021. URL https://arxiv.org/abs/2106.12066.

Together Computer. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data. Accessed 30 Jan. 2025.

Upadhyay, A. K. and Upadhya, H. K. Xnli 2.0: Improving xnli dataset and performance on cross lingual understanding (xlu), 2023. URL https://arxiv.org/abs/2301.06527.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41/.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830.

Zhang, W., Aljunied, S. M., Gao, C., Chia, Y. K., and Bing, L. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models, 2023. URL https://arxiv.org/abs/2306.05179.

Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364.
checkpoints:
  checkpoint_interval: 1000
  checkpoints_path: checkpoints/
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder: template
    num_loading_workers: 1
    seed: 42
  name: General purpose training (Single dataset)
  start_training_step: 1
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: template
  run: template
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 2
    hidden_act: silu
    hidden_size: 1536
    initializer_range: 0.02
    intermediate_size: 6144
    is_llama_config: true
    max_position_embeddings: 1024
    num_hidden_layers: 24
    num_attention_heads: 16
    num_key_value_heads: 16
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-06
    rope_scaling: null
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 131072
optimizer:
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  learning_rate_scheduler:
    learning_rate: 0.0008
    lr_decay_starting_step: 61001 # for 119B tokens (36001 for 70B tokens, 15001 for 30B tokens)
    lr_decay_steps: 12000 # for 119B tokens (7000 for 70B tokens, 4000 for 30B tokens)
    lr_decay_style: 1-sqrt
    lr_warmup_steps: 2000
    lr_warmup_style: linear
    min_decay_lr: 0.00
  zero_stage: 0
  clip_grad: 1.0
  weight_decay: 0.1
  accumulate_grad_in_fp32: true
parallelism:
  dp: 80
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: mistralai/Mistral-Nemo-Base-2407
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 20
  sequence_length: 1024
  train_steps: 73000 # for 119B tokens (43000 for 70B tokens, 19000 for 30B tokens)
  val_check_interval: -1
B. Additional Results
B.1. Model Selection - Per Language Results
For completeness, we present the individual benchmark results of the 1B-parameter model trained on 119B tokens for each
language in the following tables: Table 9 for Chinese, Table 10 for French, Table 11 for German, Table 12 for Arabic, and
Table 13 for Danish.
Table 9. Benchmark performance comparison in Chinese between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 10% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach MLP MKC+ MLP MKC CS MKC FT MKC FT MKC+ Baseline CS MKC+
Average Rank 1.7333 2.4333 4.0667 4.0667 4.4667 5.2333 6.0000
AGIEval 0.2995 0.2948 0.2897 0.2919 0.2817 0.2853 0.2773
Belebele 0.3300 0.3233 0.3178 0.3133 0.3133 0.3056 0.3022
C3 0.4550 0.4480 0.4400 0.4500 0.4400 0.4400 0.4370
C-Eval 0.3095 0.3060 0.2760 0.2903 0.2906 0.2878 0.2805
CMMLU 0.3312 0.3259 0.3041 0.3043 0.3060 0.3009 0.2995
CMRC 2018 0.2224 0.2125 0.1614 0.2251 0.2164 0.1949 0.1866
HellaSwag 0.3790 0.3800 0.3530 0.3680 0.3660 0.3510 0.3370
M3Exam 0.3319 0.3245 0.3084 0.3201 0.3245 0.3216 0.3245
X-CODAH 0.3033 0.3000 0.3233 0.3100 0.2900 0.2967 0.3067
X-CSQA 0.2740 0.2680 0.2690 0.2610 0.2520 0.2510 0.2650
XCOPA 0.6200 0.6400 0.6180 0.5740 0.5740 0.6000 0.5620
OCNLI 0.5470 0.5470 0.5340 0.5250 0.5600 0.5420 0.5060
Chinese-SQuAD 0.0929 0.1097 0.0865 0.0889 0.0850 0.0777 0.0585
XStoryCloze 0.5800 0.5630 0.5710 0.5560 0.5610 0.5580 0.5570
XWINO 0.6429 0.6528 0.6587 0.6131 0.5992 0.6429 0.6111
Table 10. Benchmark performance comparison in French between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 10% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach FT MKC+ MLP MKC+ MLP MKC FT MKC CS MKC CS MKC+ Baseline
Average Rank 3.2222 3.5000 3.5556 3.7778 4.0000 4.6667 5.2778
Belebele 0.3378 0.3533 0.3678 0.3489 0.3444 0.3344 0.3444
HellaSwag 0.5380 0.5380 0.4990 0.5150 0.5280 0.5070 0.5180
X-CSQA 0.2820 0.2740 0.2730 0.2990 0.2850 0.2900 0.2870
XNLI 2.0 0.7340 0.7400 0.7430 0.7230 0.7450 0.7330 0.7180
FQuAD 0.2597 0.2803 0.3032 0.2981 0.2411 0.2476 0.2401
MMLU 0.2896 0.2895 0.2925 0.2886 0.2806 0.2815 0.2706
Mintaka 0.0710 0.0438 0.0334 0.0670 0.0610 0.0976 0.0712
X-CODAH 0.3000 0.2667 0.2867 0.2767 0.3000 0.2800 0.2633
ARC (Challenge) 0.3120 0.3180 0.3090 0.3060 0.2950 0.2830 0.2850
Table 11. Benchmark performance comparison in German between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 10% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach MLP MKC+ FT MKC+ FT MKC CS MKC MLP MKC CS MKC+ Baseline
Average Rank 3.1250 3.1250 3.5000 3.7500 4.5000 4.7500 5.2500
MMLU 0.2940 0.2879 0.2926 0.2770 0.2905 0.2764 0.2718
ARC (Challenge) 0.2760 0.2850 0.2820 0.2880 0.2830 0.2640 0.2680
Mintaka 0.0580 0.0548 0.0735 0.0576 0.0494 0.0766 0.0498
Belebele 0.3611 0.3578 0.3544 0.3544 0.3567 0.3422 0.3544
X-CODAH 0.3367 0.3500 0.3300 0.3567 0.3400 0.3600 0.3467
X-CSQA 0.2978 0.3008 0.2877 0.2887 0.2857 0.2918 0.2787
HellaSwag 0.4640 0.4710 0.4870 0.4820 0.4540 0.4390 0.4470
XNLI 2.0 0.6620 0.6530 0.6740 0.6440 0.6610 0.6520 0.6890
approaches for Chinese, French, Arabic, and Danish. These results complement the findings for German discussed in
Section 4.2.2 and are shown in Figure 4. Table 15 lists the actual dataset sizes (number of retained tokens) after tokenization
for all languages.
Table 12. Benchmark performance comparison in Arabic between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 56% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach MLP MKC+ MLP MKC FT MKC+ Baseline CS MKC+ CS MKC FT MKC
Average Rank 2.7812 3.2500 3.6875 3.9688 3.9688 5.0312 5.3125
EXAMS 0.3537 0.3656 0.3552 0.3582 0.3443 0.3262 0.3346
MMLU 0.4007 0.3909 0.4023 0.3894 0.3912 0.3781 0.3885
ARC (Easy) 0.4330 0.4230 0.4210 0.4120 0.4020 0.3940 0.4080
AlGhafa SciQ 0.6915 0.7005 0.6965 0.6854 0.6724 0.6683 0.6804
Belebele 0.3456 0.3356 0.3322 0.3311 0.3356 0.3567 0.3233
SOQAL 0.7333 0.6867 0.7000 0.7200 0.7267 0.6867 0.7133
MLQA 0.2386 0.2402 0.1928 0.1901 0.2189 0.2154 0.1793
TyDi QA 0.1547 0.1476 0.1230 0.1441 0.1223 0.1097 0.1182
AlGhafa RACE 0.3720 0.3740 0.3640 0.3710 0.3590 0.3660 0.3730
ARCD 0.3638 0.3505 0.3235 0.3354 0.3358 0.3432 0.3043
X-CODAH 0.2600 0.2533 0.2567 0.2633 0.2633 0.2500 0.2600
AlGhafa PIQA 0.6360 0.6320 0.6400 0.6240 0.6320 0.6320 0.6370
X-CSQA 0.2740 0.2810 0.2770 0.2900 0.2880 0.2720 0.2770
XNLI 2.0 0.6570 0.6910 0.6990 0.7010 0.6910 0.6900 0.6770
HellaSwag 0.4270 0.4220 0.4280 0.4250 0.4260 0.4320 0.4150
XStoryCloze 0.6150 0.6100 0.6100 0.6070 0.6130 0.6180 0.5930
Table 13. Benchmark performance comparison in Danish between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 65% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Table 14. Benchmark performance comparison (average rank) between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, CS) trained on MKC+ or MKC, retaining top 10%, 15% or 20% of the documents. The average rank is computed across FineTasks
performance of 1B-parameter models evaluated for Chinese, German and French after 70B and 119B tokens were consumed.
Chinese French
300
Document Length
Document Length
2000
200
100 1000
0 0
Arabic Danish
2000 2000
Document Length
Document Length
1500 1500
1000 1000
500 500
0 0
C+
C+
C+
C+
C+
C+
KC
KC
KC
KC
C
b-2
b-2
MK
MK
M
M
PM
PM
We
We
MK
MK
MK
MK
MK
MK
FT
FT
CS
CS
ML
ML
e
e
P
P
FT
FT
CS
CS
Fin
Fin
ML
ML
Figure 4. Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using one of our
approaches retaining top 10% of the documents for Chinese and French, 56% for Arabic and 65% for Danish. The average document
length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
Table 15. Comparison of retained tokens in FineWeb-2 before and after filtering using one of our proposed approaches retaining top 10%
of the documents for Chinese, French and German, 56% for Arabic and 65% for Danish. The token counts correspond to the size of the
tokenized datasets, processed with the multilingual Mistral v3 (Tekken) tokenizer (Mistral AI, 2024).
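A minimal sketch of how such token counts can be obtained with a Hugging Face tokenizer is given below; the tokenizer identifier is a placeholder and may not correspond exactly to the Mistral v3 (Tekken) tokenizer used for Table 15.

```python
from transformers import AutoTokenizer

# Placeholder identifier; substitute the tokenizer actually used (Table 15
# reports counts under the multilingual Mistral v3 "Tekken" tokenizer).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")

def count_tokens(documents):
    # Sum token counts over all documents, without special tokens,
    # to approximate the "retained tokens" statistic.
    total = 0
    for text in documents:
        total += len(tokenizer.encode(text, add_special_tokens=False))
    return total

# Hypothetical usage:
# print(count_tokens(filtered_docs))
```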
[Figure 5: bar chart for German; y-axis: document length; bars compare FineWeb-2 with MLP filters trained on different training datasets.]
Figure 5. Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using MLP filtering
method retaining top 10% of the documents with different training datasets. The average document length of FineWeb-2 is represented as
a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
[Figure 6: four bar-chart panels (Chinese, Arabic, French, Danish); y-axis: document length; bars compare FineWeb-2 with MLP filters trained on different training datasets.]
Figure 6. Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using MLP filtering
method retaining top 10% of the documents for Chinese and French, 56% for Arabic and 65% for Danish with different training datasets.
The average document length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
Table 16. Benchmark performance comparison for Chinese of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 17. Benchmark performance comparison for Arabic of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 18. Benchmark performance comparison for German of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 19. Benchmark performance comparison for Danish of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 20. List of evaluation benchmarks and metrics used in our setup for Chinese, French, German, Arabic, Danish, and English.
Benchmark Chinese French German Arabic Danish English Evaluation metric
AGIEval (Zhong et al., 2023) ✓ Normalized accuracy
AlGhafa ARC (Almazrouei et al., 2023) ✓ Normalized accuracy
AlGhafa PIQA (Almazrouei et al., 2023) ✓ Normalized accuracy
AlGhafa RACE (Almazrouei et al., 2023) ✓ Normalized accuracy
AlGhafa SciQ (Almazrouei et al., 2023) ✓ Normalized accuracy
ARC (Clark et al., 2018) ✓ Normalized accuracy
ARCD (Mozannar et al., 2019) ✓ F1 score
Belebele (Bandarkar et al., 2024) ✓ ✓ ✓ ✓ ✓ Normalized accuracy
C3 (Sun et al., 2020) ✓ Normalized accuracy
C-Eval (Huang et al., 2023) ✓ Normalized accuracy
Chinese-SQuAD (Pluto-Junzeng, 2019) ✓ F1 score
CMMLU (Li et al., 2024a) ✓ Normalized accuracy
CMRC 2018 (Cui et al., 2019) ✓ F1 score
CommonsenseQA (Talmor et al., 2019) ✓ Normalized accuracy
EXAMS (Hardalov et al., 2020) ✓ Normalized accuracy
FQuAD (d’Hoffschmidt et al., 2020) ✓ F1 score
HellaSwag (Zellers et al., 2019) ✓ Normalized accuracy
M3Exam (Zhang et al., 2023) ✓ Normalized accuracy
Mintaka (Sen et al., 2022) ✓ ✓ F1 score
MLMM ARC (Lai et al., 2023) ✓ ✓ ✓ Normalized accuracy
MLMM HellaSwag (Lai et al., 2023) ✓ ✓ ✓ ✓ ✓ Normalized accuracy
MLMM MMLU (Lai et al., 2023) ✓ ✓ ✓ Normalized accuracy
MLQA (Lewis et al., 2020) ✓ F1 score
MMLU (Hendrycks et al., 2020) ✓ Normalized accuracy
OCNLI (Hu et al., 2020) ✓ Normalized accuracy
OpenBookQA (Mihaylov et al., 2018) ✓ Normalized accuracy
PIQA (Bisk et al., 2019) ✓ Normalized accuracy
SOQAL (Mozannar et al., 2019) ✓ Normalized accuracy
TriviaQA (Joshi et al., 2017) ✓ Quasi-exact match
TyDi QA (Clark et al., 2020) ✓ F1 score
WinoGrande (Sakaguchi et al., 2019) ✓ Normalized accuracy
X-CODAH (Lin et al., 2021a) ✓ ✓ ✓ ✓ Normalized accuracy
XCOPA (Ponti et al., 2020) ✓ Normalized accuracy
X-CSQA (Lin et al., 2021a) ✓ ✓ ✓ ✓ Normalized accuracy
XNLI 2.0 (Upadhyay & Upadhya, 2023) ✓ ✓ ✓ Normalized accuracy
XStoryCloze (Lin et al., 2021b) ✓ ✓ Normalized accuracy
XWINO (Tikhonov & Ryabinin, 2021) ✓ Normalized accuracy
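Most benchmarks in Table 20 are scored with normalized accuracy, where the predicted answer is the choice whose log-likelihood, normalized by answer length, is highest. The sketch below illustrates this selection rule under that assumption; the exact normalization (characters vs. tokens) depends on the evaluation harness and is not specified here.

```python
def pick_choice(choice_logprobs, choice_lengths):
    """Index of the answer choice with the highest length-normalized
    log-likelihood (summed log-probabilities divided by answer length)."""
    normalized = [lp / max(length, 1)
                  for lp, length in zip(choice_logprobs, choice_lengths)]
    return max(range(len(normalized)), key=normalized.__getitem__)

def normalized_accuracy(predictions, gold):
    # Fraction of questions where the selected choice matches the gold label.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Toy example: the second choice is longer but more likely per unit length.
print(pick_choice([-12.0, -15.0], [4, 10]))  # -> 1
```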
Highest score:
hi. i couldn’t solve my problem because it has two conditional logical propositions. the problem is:can anyone help
me about this, thanks =)we’re expected to know that: . is equivalent tofind a logically equivalent proposition for:by
first writing its contrapositive, and then applying demorgan’s lawand the equality forthey were trying to be helpful
by outlining the steps we should follow,. . but i think they made it more confusing.i don’t see the purpose of using
the contrapositive here.. . i wouldn’t have done it that way.besides, the statement is a tautology . . .which gives us:
.and this is a tautology: ”a thing implies itself” ... which is always true.i don’t know of any ”logically equivalent
proposition” we can write . . .
Lowest score:
|starts||23 sep 2016 (fri) (one day only)|want to travel soon but dont wish to fork out a fortune for flights? check
out todays promotion from jetstar featuring promo fares fr $35 all−in valid for travel period commencing 12
october 2016dont miss out! all−in frenzy fares to hong kong, penang and more from $35.sale ends 23 sep, 11
pm!|travelling||price||travel period||find flight||penang||$35ˆ|| [...]
Highest score:
Naqhadeh County is a county in West Azerbaijan Province in Iran. The capital of the county is Naqadeh. At the
2006 census, the county’s population was 117,831, in 27,937 families. The county is subdivided into two districts:
the Central District and Mohammadyar District. The county has two cities: Naqadeh and Mohammadyar.
Lowest score:
Custom Wedding Gifts
Personalized photo frames, albums & keepsakes. Heirloom quality!
Custom Engraved Journals
Handmade in Florence Italy. Dozens of sizes and paper styles!
Awesome Leather Journals
Personalized, Customizable, Artisan made in Santa Fe, NM.
Ink Rendering from Photos
100% Hand painted with unique style by pro artists. From $49.
Highest score:
When you are renting a 5, 10, 15, 20, 30 or 40 yard dumpster, you want a company you can trust with prices that
make you smile. Give us a call today and see the difference we can make in your next construction or clean out
project.
Simply give us a call and we will help you figure out your dumpster rental needs.
Our dumpsters usually go out same-day or next-day depending on when you call.
We provide top-notch service, while going easy on your bottom line. What more could you ask for?
Our trained operators are here to give you a fast and hassle-free experience from start to finish.[...]
Lowest score:
Cooperative flat 206/J
- Cooperative flat 201/J - Sold
2(1)+kitchenette, 50,1 m2Cooperative flat 202/J - Sold
2(1)+kitchenette, 44,9 m2Cooperative flat 203/J - Sold
2(1)+kitchenette, 50,6 m2Cooperative flat 204/J - Sold
1+kitchenette, 27,1 m2Cooperative flat 205/J - Sold
2(1)+kitchenette, 50,1 m2Cooperative flat 206/J - On sale
3+kitchenette 86,7 m2[...]
Here is our diagram of the Preamble to the Constitution of the United States. It is based on our understanding of the
use of ”in order to” as a subordinating conjunction that introduces a series of infinitival clauses (without subjects)
that, in turn, modify the compound verbs ”do ordain” and ”establish.”
See A Grammar of Contemporary English by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik.
Longman Group: London. 1978. p. 753.
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic
Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty
to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
If you have alternative rendering for this sentence, we would be happy to hear of it. Use the e-mail icon to the left.
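The highest- and lowest-scoring documents above illustrate the ordering induced by a quality classifier. A minimal sketch of how such scores could be turned into a filtered subset by keeping the top fraction of documents is shown below; `score_document` is a hypothetical placeholder for any of the FT, MLP, or CS scorers, and at web scale one would typically estimate a score threshold on a sample and filter in a streaming pass rather than sorting the full corpus.

```python
def retain_top_fraction(documents, score_document, fraction=0.10):
    """Score every document and keep the top `fraction` by classifier score,
    e.g. 0.10 for Chinese/German/French, 0.56 for Arabic, 0.65 for Danish."""
    scored = sorted(documents, key=score_document, reverse=True)
    keep = max(1, int(len(scored) * fraction))
    return scored[:keep]

# Hypothetical usage with a placeholder scorer:
# filtered = retain_top_fraction(docs, mlp_classifier.predict_score, fraction=0.10)
```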