Enhancing Multilingual LLM Pretraining With Model-Based Data Selection
Dataset curation has become a basis for strong large language model (LLM) performance. While

[Figure 1: average accuracy comparison between baseline FineWeb-2 and our filtering approaches.]
to multilingual datasets. While model perplexity-based filtering is commonly applied to multilingual datasets (Wenzek et al., 2019; Laurençon et al., 2022; Nguyen et al., 2023), the current state-of-the-art, FineWeb-2 (Penedo et al., 2024c), primarily relies on heuristic-based filters. In this work, we focus on model-based filtering with a quality definition that emphasizes: 1) structured data and 2) knowledge-rich data samples, to enhance multilingual pretraining datasets.

To achieve this, we leverage embedding-based classification models. Firstly, we adopt the FastText quality filtering approach from DCLM to develop a unified framework for multilingual datasets that span diverse language families, scripts, and resource availability, focusing on Chinese, German, French, Arabic, and Danish as representative languages for our experiments. Additionally, we extend this embedding-based approach by incorporating Transformer (Vaswani et al., 2023) embeddings, specifically XLM-RoBERTa (Conneau et al., 2020), for filtering. We compare the performance between baseline FineWeb-2 data and our best FastText and Transformer embedding-based approaches in Figure 1.

In summary, our contributions are as follows:

• We propose a transparent, simple, and unified framework for multilingual model-based filtering at web scale, enabling data curation across diverse language families, scripts, and resource availability.

• We present comprehensive per-language ablation studies of embedding-based multilingual quality filtering on top of the FineWeb-2 dataset (Penedo et al., 2024c), achieving performance comparable to the baseline while using as little as 15% of the tokens. We additionally analyze the impact of dataset contamination and multilingual LLM training.

• We evaluate the impact of different training datasets for data selection classifiers on the downstream performance of LLMs.

• We release the refined pretraining dataset2 covering 20 languages3, filtered using our proposed framework, along with the codebase4, to advance multilingual language modeling.

2 huggingface.co/epfml
3 Russian, Chinese, German, Japanese, Spanish, French, Italian, Portuguese, Polish, Dutch, Indonesian, Turkish, Czech, Vietnamese, Swedish, Persian, Arabic, Greek, Danish, Hungarian
4 github.com/epfml/fineweb2-hq
5 commoncrawl.org

2. Related Work

Data Curation. In order to pretrain LLMs on a large amount of diverse texts, Common Crawl5 is often used as the base dataset. However, early works already observed that performing quality filtering on Common Crawl is crucial for model performance (Brown et al., 2020). There exist various data curation approaches, such as deduplication (Lee et al., 2022), PII removal (Subramani et al., 2023), or toxicity filtering (Arnett et al., 2024). Another important aspect is quality filtering of the documents. For this, the definition of quality is an important aspect. A common approach is to use heuristics to remove documents outside of the target distribution, such as filtering based on average word length, existence of punctuation, or document length (Rae et al., 2021; Raffel et al., 2020). Another approach is to define model-based filters, where research has focused on the perplexity measure of the text (Wenzek et al., 2019) or focused on educational (Penedo et al., 2024a) and conversational documents (Li et al., 2024b). In this work, we build upon previous curated datasets based on heuristic filtering, specifically FineWeb-2 (Penedo et al., 2024c), and focus on model-based filtering for structured and knowledge-rich documents relying on textual embedding representation.

Curated English datasets. One of the early curated datasets was C4 (Raffel et al., 2020), followed by MassiveText (Rae et al., 2021). RefinedWeb (Penedo et al., 2023) was an important step forward, demonstrating that filtered web data can outperform selected high-quality data sources. While these datasets have not been made fully publicly available, their filtering techniques have been expanded upon in recent fully public datasets, such as Dolma (Soldaini et al., 2024), FineWeb, and FineWeb-Edu (Penedo et al., 2024a). While FineWeb primarily relies on filter heuristics for data quality, Dolma adopts model perplexity filtering. FineWeb-Edu takes model-based filtering a step further and relies on LLM-based quality assessment. Similarly, a concurrent work, DCLM, has achieved competitive performance using a FastText (Joulin et al., 2017) classifier trained on a carefully selected training dataset. In this work we adapt and extend this approach to the multilingual context.

Curated Multilingual Datasets. Analogously to the English datasets, there have been efforts in the multilingual space. An influential work has been CCNet (Wenzek et al., 2019), whose language identification and model perplexity filter for data quality has been re-used in later datasets. Again, while CCNet was not published directly but rather provided the tools for data cleaning, RedPajama (Together Computer, 2023) is a prominent multilingual dataset relying on these filtering techniques. While RedPajama offers data in 5 European languages, other datasets, such as OSCAR (Ortiz Suárez et al., 2019; Abadji et al., 2021; Abadji et al., 2022), mC4 (Xue et al., 2021), ROOTS (Laurençon et al., 2022), MADLAD-400 (Kudugunta et al., 2023), CulturaX (Nguyen et al., 2023), and HPLT (de Gibert et al., 2024), focus on expanding beyond, spanning a variety of language families and scripts.
While they offer refined datasets for hundreds of languages, FineWeb-2 (Penedo et al., 2024c) pushes the limit to thousands of languages and further improves the performance. Our work also focuses on filtering quality samples across various language families and scripts. However, we limit our scope to 20 languages, as the number of documents drops quickly and there is a trade-off between retaining a sufficient number of pretraining tokens and ensuring data quality (Muennighoff et al., 2023; Held et al., 2025). In our results, we observe the greatest benefits using stricter data filtering.

Multilingual Embedding Models. Early word embedding models like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) lacked contextual understanding. FastText (Bojanowski et al., 2017) built upon them and improved performance by incorporating subword information. Transformer (Vaswani et al., 2023) models like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018) then revolutionized the field with context-aware embeddings. Multilingual models like mBERT, XLM (Lample & Conneau, 2019), and XLM-RoBERTa (Conneau et al., 2020) further advanced cross-lingual understanding, with recent open-source LLMs pushing performance even higher (Llama Team, 2024; Mistral AI, 2025). Using such models, documents as well as representative samples can be mapped into a shared embedding space to estimate their similarity. Focusing on transparency, simplicity, and efficiency in our work, we use FastText and XLM-RoBERTa for our filtering, and analyze the trade-off between computational complexity and filtering performance.

Multilingual Evaluation. Evaluating LLMs requires diverse benchmarks testing linguistic and cognitive abilities like reading comprehension, reasoning, and knowledge. While English benchmarks like MMLU (Hendrycks et al., 2020) and ARC (Clark et al., 2018) exist, other languages often use translations from English, e.g., XNLI (Conneau et al., 2018) and a machine-translated version of MMLU (Lai et al., 2023). However, translations can be problematic, failing to capture cultural nuances or introducing "translationese" (Romanou et al., 2024). Recent work by Romanou et al. (2024); Singh et al. (2024a) emphasizes the need for culturally sensitive, natively collected benchmarks. Task difficulty and task formulation also impact model performance when models are trained for shorter durations (Kydlíček et al., 2024). In our work, we follow the recent evaluation task selection and methodology by Kydlíček et al. (2024) to assess our model-based filtering approaches across multiple languages.

3. Methods

In this work, we present our model-based filtering approaches. Our methodology is structured into two key components: 1) we select suitable training datasets, aiming to identify a diverse set of structured and knowledge-rich samples, and 2) we describe the different models, namely FastText and Transformer embedding-based filters, used to capture and leverage these characteristics.

3.1. Classifier Training Dataset

Representative Sample Selection. Our goal is to identify a diverse set of structured and knowledge-rich samples, especially within a multilingual context. We define two criteria for our training datasets: 1) the samples must be informative and well-structured and 2) the datasets must be available in multiple languages. While some multilingual benchmark datasets meet these criteria precisely, it is important to note that we do not train the LLM directly on this data. Instead, we train a proxy model to assess pretraining data quality. Nevertheless, we must remain cautious about potentially increased pretraining data contamination stemming from this approach, as discussed in Section 4.2.6.

Based on our criteria, we selected the following datasets as representative examples.

• Aya Collection. A prompt-completion dataset comprising ∼514M samples covering a wide variety of tasks, generated using instruction-style templates in 101 languages (Singh et al., 2024b).

• Aya Dataset. A human-annotated instruction fine-tuning dataset consisting of ∼202K prompt-completion pairs in 65 languages (Singh et al., 2024b).

• MMLU. Originally created for English, the dataset contains ∼14K multiple-choice knowledge questions in diverse subjects and areas (Hendrycks et al., 2020). A multilingual version was translated into 14 languages by professional translators (OpenAI, 2024).

• OpenAssistant-2. The dataset contains ∼14K user-assistant conversations with multiple messages in 28 languages (Fischer et al., 2024).

• Include-Base-44. Multiple-choice questions focused on general and regional knowledge, as well as reasoning, extracted from academic and professional exams. Spanning 44 languages, it includes a total of ∼23K samples (Romanou et al., 2024).

Representative Sample Collection. MMLU and Include-Base-44 are highly curated benchmark datasets, containing structured, knowledge-rich samples. The Aya Dataset is human-curated, while OpenAssistant-2 is partially human-curated and partially generated by large language models (LLMs). In contrast, the Aya Collection consists of various AI-generated samples without quality guarantee, though it represents the largest and most multilingual of the five. To address this quality difference, we create two Multilingual Knowledge Collection (MKC) configurations:
• MKC: Includes Include-Base-44, OpenAssistant-2, MMLU, and the Aya Dataset

• MKC+: Includes MKC and the Aya Collection

This allows us to evaluate the trade-off between data quality and scale.
Dataset Creation. For our model-based filtering approaches, our goal is to identify documents from the pretraining dataset that are most similar to our representative samples, with the notion of similarity determined by the specific classifier used. We can measure the similarity to our training dataset directly, for example, by computing the cosine similarity to our training samples in the embedding space. Alternatively, following the approach of Li et al. (2024b), the task can be framed as a binary classification problem, with the representative samples as the positive class. For the negative class, we can simply subsample documents from our pretraining dataset, under the assumption that the majority of these documents are neither well-structured nor knowledge-rich. We use both approaches for our classifiers.

To create the binary classification training dataset, we selected 80K random examples from the training set (MKC or MKC+) as positive samples and 80K random examples from FineWeb-2 as negative samples. For smaller datasets, such as Include-Base-44, the entire dataset was used. The same training dataset was utilized across all model-based filtering approaches, disregarding negative samples when unnecessary. Additionally, we created a training dataset for each language individually to avoid leaking language-specific biases to data of other languages.
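To make this construction concrete, the following is a minimal sketch of how such a per-language binary training set could be assembled; the pandas-based implementation, the column names, and the shuffling details are illustrative assumptions rather than the released pipeline.

import pandas as pd

def build_classifier_training_set(positive_df, fineweb2_df, n_per_class=80_000, seed=42):
    # Representative (MKC or MKC+) samples form the positive class,
    # random FineWeb-2 documents of the same language form the negative class.
    n_pos = min(n_per_class, len(positive_df))  # smaller datasets are used in full
    pos = positive_df.sample(n=n_pos, random_state=seed).assign(label=1)
    neg = fineweb2_df.sample(n=n_per_class, random_state=seed).assign(label=0)
    data = pd.concat([pos, neg], ignore_index=True)[["text", "label"]]
    return data.sample(frac=1.0, random_state=seed).reset_index(drop=True)  # shuffle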
hidden-layer neural network with a hidden dimension of 256,
Sample Pre-processing. We applied no pre-processing the ReLU activation function, a dropout rate of 20%, and the
to the FineWeb-2 (negative) samples but performed mini- sigmoid function on the output. The network was trained
mal pre-processing on the representative (positive) samples. for 6 epochs using the AdamW optimizer (Loshchilov,
For instance, in datasets like MMLU or OpenAssistant-2, 2017) with a constant learning rate 0.0003 and binary cross-
we concatenated various sample components. For the Aya entropy loss. We computed document scores using the out-
Collection, we resolved encoding issues in non-Latin lan- put layer of the MLP model, which used XML-RoBERTa
guages and removed samples containing <unk> tokens, document embeddings as input.
which were particularly prevalent in Arabic data (37.1%).
3.2. FastText-based Filtering (FT)

To efficiently process datasets with over 100 million documents (Penedo et al., 2024c), similar to DCLM (Li et al., 2024b), we used a binary FastText classifier (Joulin et al., 2017). This classifier runs on CPU and can be easily deployed across multiple cores, for example using DataTrove (Penedo et al., 2024b).

We trained our FastText classifier on the processed training set using 2-gram features (4-gram for Chinese). Additional details about the training process are given in Appendix A.1. These classifiers were then used to assign scores to all documents in the pretraining dataset. To filter the dataset, we applied a score threshold based on the desired retention percentage of documents. This approach balances dataset size and the predicted quality of the samples.
3.3. Transformer Embedding-based Filtering

To leverage rich semantic information based on contextual relationships, we utilized Transformer model embeddings. Specifically, we selected a pretrained XLM-RoBERTa base model (Conneau et al., 2020) due to its support of 100 languages, a relatively small size of approximately 279M parameters, and its transparent training procedure. This choice enabled us to process web-scale data efficiently without being restricted to a single language and to align with our commitment to open science.

To retain general embeddings that can be reused across methods, we opted against fine-tuning the model. For each document from our datasets, we computed the 768-dimensional embedding by mean pooling the embeddings of the output sequence. Since the model has a fixed maximum sequence length of 512 tokens, we considered only the first 512 tokens of each document, assuming they are representative of the entire document.
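The embedding computation can be sketched as follows with the Hugging Face transformers library; the masked mean pooling over non-padding positions is our assumption of how the mean pooling described above is implemented.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def embed(documents: list[str]) -> torch.Tensor:
    # Mean-pooled 768-dimensional embeddings over the first 512 tokens of each document.
    batch = tokenizer(documents, max_length=512, truncation=True,
                      padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # exclude padding from the mean
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)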
After computing the embeddings of our corpora, we experimented with two methods: 1) classification of embeddings using a multi-layer perceptron and 2) cosine similarity between the embeddings. As in the FastText approach, we scored each document and applied a threshold to retain the desired percentage of the highest-scoring documents.

Multi-Layer Perceptron (MLP). We trained a single-hidden-layer neural network with a hidden dimension of 256, the ReLU activation function, a dropout rate of 20%, and the sigmoid function on the output. The network was trained for 6 epochs using the AdamW optimizer (Loshchilov, 2017) with a constant learning rate of 0.0003 and binary cross-entropy loss. We computed document scores using the output layer of the MLP model, which used XLM-RoBERTa document embeddings as input.
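A sketch of this classifier in PyTorch, using the hyperparameters stated above; the batching and training-loop details are assumptions.

import torch
from torch import nn

# Single-hidden-layer classifier over 768-dimensional XLM-RoBERTa document embeddings.
mlp = nn.Sequential(
    nn.Linear(768, 256), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)
optimizer = torch.optim.AdamW(mlp.parameters(), lr=3e-4)  # constant learning rate
loss_fn = nn.BCELoss()

def train(loader, epochs=6):
    # `loader` is assumed to yield (embedding, label) batches from the 80K/80K training set.
    mlp.train()
    for _ in range(epochs):
        for emb, label in loader:
            optimizer.zero_grad()
            loss = loss_fn(mlp(emb).squeeze(-1), label.float())
            loss.backward()
            optimizer.step()

@torch.no_grad()
def mlp_score(embeddings: torch.Tensor) -> torch.Tensor:
    mlp.eval()
    return mlp(embeddings).squeeze(-1)  # per-document score in [0, 1]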
Cosine Similarity (CS). We computed each document's score as the maximum cosine similarity between its embedding and a set of K randomly sampled positive sample embeddings. We experimented with varying values of K, including 1024, 2048, 4096, 8192, and 16384. However, we did not observe significant differences in the documents with high scores across these variations when manually inspecting the data. To strike a balance between the diversity of the positive samples and computational efficiency, we chose K = 8192 for our experiments.
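The cosine-similarity scoring admits a compact sketch; the sampling of the K reference embeddings and the batching are illustrative assumptions.

import torch
import torch.nn.functional as F

def cosine_scores(doc_embeddings: torch.Tensor,
                  positive_embeddings: torch.Tensor,
                  k: int = 8192, seed: int = 0) -> torch.Tensor:
    # Score each document by its maximum cosine similarity to K randomly
    # sampled positive (representative) sample embeddings.
    g = torch.Generator().manual_seed(seed)
    idx = torch.randperm(positive_embeddings.size(0), generator=g)[:k]
    ref = F.normalize(positive_embeddings[idx], dim=-1)   # (K, 768)
    docs = F.normalize(doc_embeddings, dim=-1)            # (N, 768)
    return (docs @ ref.T).max(dim=-1).values              # (N,) scores

As in the other methods, a score threshold corresponding to the desired retention rate is then applied to the resulting scores.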
[Figure 2: accuracy versus tokens seen (0B-120B) for panels (a) English (MMLU), (b) Chinese (CMMLU), (c) German (MMLU), (d) French (MMLU), (e) Arabic (MMLU), and (f) Danish (ARC), with curves for FT MKC+, MLP MKC+, CS MKC+, and FineWeb-2.]

Figure 2. Benchmark performance comparison (accuracy) during training on 119B tokens between the baseline methods (FineWeb, DCLM, FineWeb-Edu, and FineWeb-2) and our proposed filtering methods (FT, MLP, and CS), trained on MKC+. When using our approaches, the data retention rates are set to 10% for English, Chinese, German, and French, 56% for Arabic, and 65% for Danish. For English, Chinese, German, and French, baseline-level performance is observed around 20B tokens consumed (16.7% of the total).
Despite the lack of a quality guarantee for all samples in the Aya Collection, this dataset yields strong performance, making our approach applicable for various languages. Overall, we observe that the diversity resulting from combining all individual training datasets gives the best results.

Interestingly, models trained exclusively on Include-Base-44 and OpenAssistant-2 perform worse overall than the baseline. This may be due to the nature of these datasets. For instance, Include-Base-44 is relatively small and domain-specific, e.g., consisting primarily of driving license exam questions in its German subset. Similarly, OpenAssistant-2 includes a limited number of samples, with fewer than 2K positive samples per training set, which likely negatively impacts classifier performance. Again, we relate model performance to the average document length bias in Appendix B.3 and confirm the findings from Section 4.2.2, suggesting that factors beyond the retained document length bias may influence performance.

Table 3. Benchmark performance comparison (average rank) between the baseline (FineWeb-2) and the MLP filtering method trained on either MKC+ as a whole or its individual dataset components, retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish. The average rank is computed across FineTasks performance of 1B-parameter models trained on each language with 30B tokens.

Dataset            Average Rank
MKC+               2.52
Aya Collection     2.91
Aya Dataset        3.17
MMLU               3.57
Baseline           4.09
OpenAssistant-2    4.53
Include-Base-44    5.42

Table 4 presents the results of experiments where 5% and 10% of unfiltered data are mixed back into our filtered datasets.

Table 4. Benchmark performance comparison (average rank) of our MLP MKC+ and FT MKC+ approaches, retaining top 10% of the documents while mixing in 0%, 5% or 10% of the original FineWeb-2 dataset. The average rank is computed across FineTasks performance of 1B-parameter models evaluated for Chinese, German, or French, after consuming 70B and 119B tokens.

Approach     Mixture Rate    Average Rank
MLP MKC+     0%              4.36
MLP MKC+     5%              5.09
MLP MKC+     10%             5.40
FT MKC+      10%             7.17
FT MKC+      0%              7.51
FT MKC+      5%              8.66

4.2.5. Approach Validation on English

Previous experiments have shown strong performance of our MLP MKC+ approach. But do these results translate to English? Table 5 presents the performance of MLP MKC+ with 10% retention applied to the English FineWeb dataset (Penedo et al., 2024a). Our method is compared against FineWeb and baselines using model-based filtered datasets, including DCLM (Li et al., 2024b) and FineWeb-Edu (Penedo et al., 2024a). To save computational resources, we use the 6 most recent FineWeb and FineWeb-Edu dumps and the first partition of DCLM6, which we denote with ∗. Each of these subsets contains more than 119B tokens, with FineWeb retaining this size even after applying our filtering with a top-10% retention rate.

6 huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet

While each approach demonstrates strengths in different benchmarks, as seen from Table 5 and Figure 2, the overall average rank results indicate that our method outperforms all other baselines.
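For reference, the average rank reported throughout these tables can be computed from per-task accuracies as in the following sketch; the three example values per column are taken from Table 5 purely for illustration.

import pandas as pd

# Rows: benchmark tasks, columns: datasets/approaches, values: accuracy.
scores = pd.DataFrame(
    {"Ours": [0.3550, 0.6670, 0.3870], "FW*": [0.3010, 0.5880, 0.3850]},
    index=["ARC (Challenge)", "ARC (Easy)", "CommonsenseQA"],
)
# Rank the approaches within each task (1 = best), then average the ranks over tasks.
average_rank = scores.rank(axis=1, ascending=False).mean(axis=0)
print(average_rank)  # lower is better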
Table 5. Benchmark performance comparison for English of our MLP MKC+ approach (retaining top 10% of the documents) against baseline datasets: FineWeb, DCLM, and FineWeb-Edu. The average rank is computed across SmolLM task performance for 1B-parameter models trained on 119B tokens.

Dataset           Ours     DCLM∗    FW-Edu∗   FW∗
Average Rank      1.8333   2.3889   2.4444    3.3333
ARC (Challenge)   0.3550   0.3530   0.3850    0.3010
ARC (Easy)        0.6670   0.6470   0.6970    0.5880
CommonsenseQA     0.3870   0.4100   0.3770    0.3850
HellaSwag         0.6040   0.5960   0.5700    0.5930
MMLU              0.3400   0.3160   0.3470    0.3030
OpenBookQA        0.3860   0.3840   0.4180    0.3560
PIQA              0.7510   0.7510   0.7410    0.7620
WinoGrande        0.5720   0.5610   0.5660    0.5550
TriviaQA          0.0820   0.1240   0.0320    0.0370

Table 6. Benchmark performance comparison in English for our MLP MKC+ approach (retaining top 10% of the documents), both decontaminated (D) and non-decontaminated, against the baseline FineWeb datasets, also in decontaminated and non-decontaminated variants. The average rank is computed across SmolLM task performance for 1B-parameter models trained on 119B tokens.

Dataset           Ours     OursD    FW∗      FW∗D
Average Rank      1.5000   2.1111   3.0556   3.3333
ARC (Challenge)   0.3550   0.3440   0.3010   0.2880
ARC (Easy)        0.6670   0.6520   0.5880   0.5700
CommonsenseQA     0.3870   0.4000   0.3850   0.3820
HellaSwag         0.6040   0.6040   0.5930   0.5890
MMLU              0.3400   0.3220   0.3030   0.3050
OpenBookQA        0.3860   0.3840   0.3560   0.3740
PIQA              0.7510   0.7590   0.7620   0.7600
WinoGrande        0.5720   0.5550   0.5550   0.5570
TriviaQA          0.0820   0.0380   0.0370   0.0250
4.2.6. Data Contamination Analysis

Our LLMs are never trained on benchmark datasets. But is the strong performance observed in the previous sections primarily due to an increased ratio of data contamination? To ensure the validity of our approach, we conduct decontamination experiments, as web crawl data may include evaluation benchmark tasks. While Li et al. (2024b) addressed similar concerns, our approach follows the methodology of Brown et al. (2020). Specifically, we perform 13-gram decontamination of the LLM training data separately for English and French evaluation benchmarks. However, unlike the original approach, we remove the entire document if it is flagged as contaminated, using the implementation provided in DataTrove (Penedo et al., 2024b).
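A simplified sketch of this 13-gram document removal is given below; the whitespace tokenization and the in-memory set representation are illustrative assumptions, whereas our experiments rely on the DataTrove implementation.

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(documents, benchmark_samples, n: int = 13):
    # Remove the entire document if it shares any 13-gram with an evaluation sample.
    contaminated = set()
    for sample in benchmark_samples:
        contaminated |= ngrams(sample, n)
    return [doc for doc in documents if not (ngrams(doc, n) & contaminated)]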
Tables 6 and 7 present the results of decontamination experiments for English and French, respectively. We used the following experimental setup (removed document contamination rates): baseline FineWeb English (0.16%), MLP MKC+ English with 10% retention (0.19%), baseline FineWeb-2 French (0.14%), and MLP MKC+ French with 10% retention (0.14%). As in our previous experiments, we train the models on 119B tokens. Additionally, we compare the results against equivalent training runs without decontamination to further analyze its impact. For an example of a contaminated sample, see Appendix E.

For English models, decontamination slightly reduces performance both for our approach and baseline FineWeb data. However, even when decontaminated, our approach still outperforms training on non-decontaminated baseline data. For French models, performance of our approach is comparable between decontaminated and non-decontaminated datasets, with both continuing to outperform baseline FineWeb-2 data. Interestingly, decontaminated baseline data yields better results than its non-decontaminated counterpart.

Table 7. Benchmark performance comparison in French for our MLP MKC+ approach (retaining top 10% of the documents), both decontaminated (D) and non-decontaminated, against the baseline FineWeb-2 datasets, also in decontaminated and non-decontaminated variants. The average rank is computed across FineTasks performance for 1B-parameter models trained on 119B tokens.

Dataset           Ours     OursD    FW-2D    FW-2
Average Rank      2.0556   2.0556   2.7222   3.1667
Belebele          0.3533   0.3400   0.3778   0.3444
HellaSwag         0.5380   0.5350   0.5180   0.5180
X-CSQA            0.2740   0.2810   0.2730   0.2870
XNLI 2.0          0.7400   0.7400   0.7070   0.7180
FQuAD             0.2803   0.2620   0.2890   0.2401
MMLU              0.2895   0.2875   0.2711   0.2706
Mintaka           0.0438   0.0797   0.0658   0.0712
X-CODAH           0.2667   0.2900   0.2800   0.2633
ARC (Challenge)   0.3180   0.3110   0.2880   0.2850

4.2.7. Impact on Multilingual Model Training

Although not the primary focus of our work, we believe that refined datasets can contribute to advancing the performance of multilingual models. To investigate this, we conducted an ablation study by training a 1B-parameter model on 595B tokens (5×119B), covering all five languages: Chinese, German, French, Arabic, and Danish. We trained two models: the first one using our filtered FineWeb-2 dataset and the second one using unfiltered FineWeb-2 data. We then compared these results for each language against their monolingual counterparts trained on 119B tokens.

The results for French are presented in Table 8. We observe that the multilingual LLM outperforms its monolingual counterpart on our filtered datasets, whereas the monolingual model achieves better performance than the multilingual one on the unfiltered FineWeb-2 data.
Acknowledgements

We thank Guilherme Penedo, Hynek Kydlíček, and Leandro von Werra for their help with FineWeb-2 data, and Alex Hägele for providing feedback on the paper draft.

This work was supported as part of the Swiss AI Initiative by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID a06 on Alps.

References

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Chen, Z., Cano, A. H., Romanou, A., Bonnet, A., Matoba, K., Salvi, F., Pagliardini, M., Fan, S., Köpf, A., Mohtashami, A., Sallinen, A., Sakhaeirad, A., Swamy, V., Krawczuk, I., Bayazit, D., Marmet, A., Montariol, S., Hartley, M.-A., Jaggi, M., and Bosselut, A. Meditron-70b: Scaling medical pretraining for large language models, 2023. URL https://arxiv.org/abs/2311.16079.
Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., and Palomaki, J. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages, 2020. URL https://arxiv.org/abs/2003.05002.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018. URL https://arxiv.org/abs/1803.05457.

Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H., and Stoyanov, V. XNLI: Evaluating cross-lingual sentence representations. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J. (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2475–2485, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1269. URL https://aclanthology.org/D18-1269/.

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., and Stoyanov, V. Unsupervised cross-lingual representation learning at scale, 2020. URL https://arxiv.org/abs/1911.02116.

Cui, Y., Liu, T., Che, W., Xiao, L., Chen, Z., Ma, W., Wang, S., and Hu, G. A span-extraction dataset for chinese machine reading comprehension. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019. doi: 10.18653/v1/d19-1600. URL http://dx.doi.org/10.18653/v1/D19-1600.

de Gibert, O., Nail, G., Arefyev, N., Bañón, M., van der Linde, J., Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., and Tiedemann, J. A new massive multilingual dataset for high-performance language technologies. In Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N. (eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 1116–1128, Torino, Italia, May 2024. ELRA and ICCL. URL https://aclanthology.org/2024.lrec-main.100.

De Gibert, O., Nail, G., Arefyev, N., Bañón, M., Van Der Linde, J., Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., et al. A new massive multilingual dataset for high-performance language technologies. arXiv preprint arXiv:2403.14009, 2024.

DeepSeek-AI, Bi, X., Chen, D., Chen, G., Chen, S., Dai, D., Deng, C., Ding, H., Dong, K., Du, Q., Fu, Z., Gao, H., Gao, K., Gao, W., Ge, R., Guan, K., Guo, D., Guo, J., Hao, G., Hao, Z., He, Y., Hu, W., Huang, P., Li, E., Li, G., Li, J., Li, Y., Li, Y. K., Liang, W., Lin, F., Liu, A. X., Liu, B., Liu, W., Liu, X., Liu, X., Liu, Y., Lu, H., Lu, S., Luo, F., Ma, S., Nie, X., Pei, T., Piao, Y., Qiu, J., Qu, H., Ren, T., Ren, Z., Ruan, C., Sha, Z., Shao, Z., Song, J., Su, X., Sun, J., Sun, Y., Tang, M., Wang, B., Wang, P., Wang, S., Wang, Y., Wang, Y., Wu, T., Wu, Y., Xie, X., Xie, Z., Xie, Z., Xiong, Y., Xu, H., Xu, R. X., Xu, Y., Yang, D., You, Y., Yu, S., Yu, X., Zhang, B., Zhang, H., Zhang, L., Zhang, L., Zhang, M., Zhang, M., Zhang, W., Zhang, Y., Zhao, C., Zhao, Y., Zhou, S., Zhou, S., Zhu, Q., and Zou, Y. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism, 2024. URL https://arxiv.org/abs/2401.02954.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.

d'Hoffschmidt, M., Belblidia, W., Brendlé, T., Heinrich, Q., and Vidal, M. FQuAD: French Question Answering Dataset, 2020. URL https://arxiv.org/abs/2002.06071.

Fischer, S., Rossetto, F., Gemmell, C., Ramsay, A., Mackie, I., Zubel, P., Tecklenburg, N., and Dalton, J. Open assistant toolkit – version 2. arXiv preprint arXiv:2403.00586, 2024.

Fourrier, C., Habib, N., Wolf, T., and Tunstall, L. LightEval: A lightweight framework for LLM evaluation, 2023. URL https://github.com/huggingface/lighteval.

Hägele, A., Bakouch, E., Kosson, A., Allal, L. B., Von Werra, L., and Jaggi, M. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint arXiv:2405.18392, 2024.

Hardalov, M., Mihaylov, T., Zlatkova, D., Dinkov, Y., Koychev, I., and Nakov, P. Exams: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering, 2020. URL https://arxiv.org/abs/2011.03080.

Held, W., Paranjape, B., Koura, P. S., Lewis, M., Zhang, F., and Mihaylov, T. Optimizing pretraining data mixtures with llm-estimated utility. arXiv preprint arXiv:2501.11747, 2025.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hu, H., Richardson, K., Xu, L., Li, L., Kübler, S., and Moss, L. OCNLI: Original Chinese Natural Language Inference. In Cohn, T., He, Y., and Liu, Y. (eds.), Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3512–3526, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.314. URL https://aclanthology.org/2020.findings-emnlp.314/.

Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Lei, J., Fu, Y., Sun, M., and He, J. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models, 2023. URL https://arxiv.org/abs/2305.08322.

Hugging Face. Nanotron, 2024a. URL https://github.com/huggingface/nanotron. Accessed 30 Jan. 2025.

Hugging Face. SmolLM - blazingly fast and remarkably powerful, 2024b. URL https://huggingface.co/blog/smollm. Accessed 30 Jan. 2025.

Joshi, M., Choi, E., Weld, D., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Barzilay, R. and Kan, M.-Y. (eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147/.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Association for Computational Linguistics, April 2017.

Kudugunta, S., Caswell, I., Zhang, B., Garcia, X., Choquette-Choo, C. A., Lee, K., Xin, D., Kusupati, A., Stella, R., Bapna, A., and Firat, O. MADLAD-400: A Multilingual And Document-Level Large Audited Dataset, 2023. URL https://arxiv.org/abs/2309.04662.

Kydlíček, H., Penedo, G., Fourier, C., Habib, N., and Wolf, T. FineTasks: Finding signal in a haystack of 200+ multilingual tasks, 2024. URL https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks. Accessed 30 Jan. 2025.

Lai, V., Nguyen, C., Ngo, N., Nguyen, T., Dernoncourt, F., Rossi, R., and Nguyen, T. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. In Feng, Y. and Lefever, E. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 318–327, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.28. URL https://aclanthology.org/2023.emnlp-demo.28/.

Lample, G. and Conneau, A. Cross-lingual language model pretraining, 2019. URL https://arxiv.org/abs/1901.07291.

Laurençon, H., Saulnier, L., Wang, T., Akiki, C., Villanova del Moral, A., Le Scao, T., Von Werra, L., Mou, C., González Ponferrada, E., Nguyen, H., et al. The bigscience roots corpus: A 1.6 tb composite multilingual dataset. Advances in Neural Information Processing Systems, 35:31809–31826, 2022.

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N. Deduplicating training data makes language models better. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8424–8445, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.577. URL https://aclanthology.org/2022.acl-long.577/.

Lewis, P., Oğuz, B., Rinott, R., Riedel, S., and Schwenk, H. Mlqa: Evaluating cross-lingual extractive question answering, 2020. URL https://arxiv.org/abs/1910.07475.

Li, H., Zhang, Y., Koto, F., Yang, Y., Zhao, H., Gong, Y., Duan, N., and Baldwin, T. Cmmlu: Measuring massive multitask language understanding in chinese, 2024a. URL https://arxiv.org/abs/2306.09212.

Li, J., Fang, A., Smyrnis, G., Ivgi, M., Jordan, M., Gadre, S., Bansal, H., Guha, E., Keh, S., Arora, K., et al. DataComp-LM: In search of the next generation of training sets for language models. arXiv preprint arXiv:2406.11794, 2024b.

Lin, B. Y., Lee, S., Qiao, X., and Ren, X. Common sense beyond English: Evaluating and improving multilingual language models for commonsense reasoning. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1274–1287, Online, August 2021a. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.102. URL https://aclanthology.org/2021.acl-long.102/.

Lin, X. V., Mihaylov, T., Artetxe, M., Wang, T., Chen, S., Simig, D., Ott, M., Goyal, N., Bhosale, S., Du, J., Pasunuru, R., Shleifer, S., Koura, P. S., Chaudhary, V., O'Horo, B., Wang, J., Zettlemoyer, L., Kozareva, Z., Diab, M. T., Stoyanov, V., and Li, X. Few-shot learning with multilingual language models. CoRR, abs/2112.10668, 2021b. URL https://arxiv.org/abs/2112.10668.

Llama Team. The Llama 3 Herd of Models, 2024. URL https://arxiv.org/abs/2407.21783.

Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering, 2018. URL https://arxiv.org/abs/1809.02789.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space, 2013. URL https://arxiv.org/abs/1301.3781.

Mistral AI. v3 (tekken) tokenizer, 2024. URL https://docs.mistral.ai/guides/tokenization/. Accessed 30 Jan. 2025.

Mistral AI. Mistral small 3, 2025. URL https://mistral.ai/news/mistral-small-3/. Accessed 30 Jan. 2025.

Mozannar, H., Hajal, K. E., Maamary, E., and Hajj, H. Neural arabic question answering, 2019. URL https://arxiv.org/abs/1906.05394.

Muennighoff, N., Rush, A. M., Barak, B., Scao, T. L., Piktus, A., Tazi, N., Pyysalo, S., Wolf, T., and Raffel, C. Scaling data-constrained language models, 2023. URL https://arxiv.org/abs/2305.16264.

Nguyen, T., Van Nguyen, C., Lai, V. D., Man, H., Ngo, N. T., Dernoncourt, F., Rossi, R. A., and Nguyen, T. H. Culturax: A cleaned, enormous, and multilingual dataset for large language models in 167 languages. arXiv preprint arXiv:2309.09400, 2023.

OpenAI. MMMLU, 2024. URL https://huggingface.co/datasets/openai/MMMLU. Accessed 30 Jan. 2025.

Ortiz Suárez, P. J., Sagot, B., and Romary, L. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. Cardiff, 22nd July 2019, pp. 9–16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. URL http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215.

Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Alobeidli, H., Cappelli, A., Pannier, B., Almazrouei, E., and Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only. Advances in Neural Information Processing Systems, 36:79155–79172, 2023.
Penedo, G., Kydlíček, H., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., Wolf, T., et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. arXiv preprint arXiv:2406.17557, 2024a.

Penedo, G., Kydlíček, H., Cappelli, A., Sasko, M., and Wolf, T. DataTrove: large scale data processing, 2024b. URL https://github.com/huggingface/datatrove. Accessed 30 Jan. 2025.

Penedo, G., Kydlíček, H., Sabolčec, V., Messmer, B., Foroutan, N., Jaggi, M., von Werra, L., and Wolf, T. FineWeb2: A sparkling update with 1000s of languages, December 2024c. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-2. Accessed 30 Jan. 2025.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Moschitti, A., Pang, B., and Daelemans, W. (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162/.

Pluto-Junzeng. pluto-junzeng/chinesesquad, 2019. URL https://github.com/pluto-junzeng/ChineseSquad. Accessed 30 Jan. 2025.

Ponti, E. M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., and Korhonen, A. XCOPA: A multilingual dataset for causal commonsense reasoning. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2362–2376, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.185. URL https://aclanthology.org/2020.emnlp-main.185/.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. 2018.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., Aslanides, J., Henderson, S., Ring, R., Young, S., et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

Romanou, A., Foroutan, N., Sotnikova, A., Chen, Z., Nelaturu, S. H., Singh, S., Maheshwary, R., Altomare, M., Haggag, M. A., Amayuelas, A., et al. Include: Evaluating multilingual language understanding with regional knowledge. arXiv preprint arXiv:2411.19799, 2024.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv preprint arXiv:1907.10641, 2019.

Sen, P., Aji, A. F., and Saffari, A. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering, 2022. URL https://arxiv.org/abs/2210.01613.

Singh, S., Romanou, A., Fourrier, C., Adelani, D. I., Ngui, J. G., Vila-Suero, D., Limkonchotiwat, P., Marchisio, K., Leong, W. Q., Susanto, Y., Ng, R., Longpre, S., Ko, W.-Y., Smith, M., Bosselut, A., Oh, A., Martins, A. F. T., Choshen, L., Ippolito, D., Ferrante, E., Fadaee, M., Ermis, B., and Hooker, S. Global mmlu: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2024a. URL https://arxiv.org/abs/2412.03304.

Singh, S., Vargus, F., Dsouza, D., Karlsson, B. F., Mahendiran, A., Ko, W.-Y., Shandilya, H., Patel, J., Mataciunas, D., OMahony, L., Zhang, M., Hettiarachchi, R., Wilson, J., Machado, M., Moura, L. S., Krzemiński, D., Fadaei, H., Ergün, I., Okoh, I., Alaagib, A., Mudannayake, O., Alyafeai, Z., Chien, V. M., Ruder, S., Guthikonda, S., Alghamdi, E. A., Gehrmann, S., Muennighoff, N., Bartolo, M., Kreutzer, J., Üstün, A., Fadaee, M., and Hooker, S. Aya dataset: An open-access collection for multilingual instruction tuning, 2024b.

Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159, 2024.

Subramani, N., Luccioni, S., Dodge, J., and Mitchell, M. Detecting personal information in training corpora: an analysis. In Ovalle, A., Chang, K.-W., Mehrabi, N., Pruksachatkun, Y., Galystan, A., Dhamala, J., Verma, A., Cao, T., Kumar, A., and Gupta, R. (eds.), Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pp. 208–220, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.trustnlp-1.18. URL https://aclanthology.org/2023.trustnlp-1.18/.

Sun, K., Yu, D., Yu, D., and Cardie, C. Investigating prior knowledge for challenging Chinese machine reading comprehension. Transactions of the Association for Computational Linguistics, 8:141–155, 2020. doi: 10.1162/tacl_a_00305. URL https://aclanthology.org/2020.tacl-1.10/.

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4149–4158, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1421. URL https://aclanthology.org/N19-1421/.

Tikhonov, A. and Ryabinin, M. It's all in the heads: Using attention heads as a baseline for cross-lingual transfer in commonsense reasoning, 2021. URL https://arxiv.org/abs/2106.12066.

Together Computer. Redpajama: An open source recipe to reproduce llama training dataset, 2023. URL https://github.com/togethercomputer/RedPajama-Data. Accessed 30 Jan. 2025.

Upadhyay, A. K. and Upadhya, H. K. Xnli 2.0: Improving xnli dataset and performance on cross lingual understanding (xlu), 2023. URL https://arxiv.org/abs/2301.06527.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762.

Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A., and Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., and Zhou, Y. (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. URL https://aclanthology.org/2021.naacl-main.41/.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830.

Zhang, W., Aljunied, S. M., Gao, C., Chia, Y. K., and Bing, L. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models, 2023. URL https://arxiv.org/abs/2306.05179.

Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364.
checkpoints:
  checkpoint_interval: 1000
  checkpoints_path: checkpoints/
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_initial_state: false
data_stages:
- data:
    dataset:
      dataset_folder: template
    num_loading_workers: 1
    seed: 42
  name: General purpose training (Single dataset)
  start_training_step: 1
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: template
  run: template
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 2
    hidden_act: silu
    hidden_size: 1536
    initializer_range: 0.02
    intermediate_size: 6144
    is_llama_config: true
    max_position_embeddings: 1024
    num_hidden_layers: 24
    num_attention_heads: 16
    num_key_value_heads: 16
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-06
    rope_scaling: null
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 131072
optimizer:
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  learning_rate_scheduler:
    learning_rate: 0.0008
    lr_decay_starting_step: 61001 # for 119B tokens (36001 for 70B tokens, 15001 for 30B tokens)
    lr_decay_steps: 12000 # for 119B tokens (7000 for 70B tokens, 4000 for 30B tokens)
    lr_decay_style: 1-sqrt
    lr_warmup_steps: 2000
    lr_warmup_style: linear
    min_decay_lr: 0.00
  zero_stage: 0
  clip_grad: 1.0
  weight_decay: 0.1
  accumulate_grad_in_fp32: true
parallelism:
  dp: 80
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: mistralai/Mistral-Nemo-Base-2407
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 20
  sequence_length: 1024
  train_steps: 73000 # for 119B tokens (43000 for 70B tokens, 19000 for 30B tokens)
  val_check_interval: -1
B. Additional Results
B.1. Model Selection - Per Language Results
For completeness, we present the individual benchmark results of the 1B-parameter model trained on 119B tokens for each
language in the following tables: Table 9 for Chinese, Table 10 for French, Table 11 for German, Table 12 for Arabic, and
Table 13 for Danish.
Table 9. Benchmark performance comparison in Chinese between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 10% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach MLP MKC+ MLP MKC CS MKC FT MKC FT MKC+ Baseline CS MKC+
Average Rank 1.7333 2.4333 4.0667 4.0667 4.4667 5.2333 6.0000
AGIEval 0.2995 0.2948 0.2897 0.2919 0.2817 0.2853 0.2773
Belebele 0.3300 0.3233 0.3178 0.3133 0.3133 0.3056 0.3022
C3 0.4550 0.4480 0.4400 0.4500 0.4400 0.4400 0.4370
C-Eval 0.3095 0.3060 0.2760 0.2903 0.2906 0.2878 0.2805
CMMLU 0.3312 0.3259 0.3041 0.3043 0.3060 0.3009 0.2995
CMRC 2018 0.2224 0.2125 0.1614 0.2251 0.2164 0.1949 0.1866
HellaSwag 0.3790 0.3800 0.3530 0.3680 0.3660 0.3510 0.3370
M3Exam 0.3319 0.3245 0.3084 0.3201 0.3245 0.3216 0.3245
X-CODAH 0.3033 0.3000 0.3233 0.3100 0.2900 0.2967 0.3067
X-CSQA 0.2740 0.2680 0.2690 0.2610 0.2520 0.2510 0.2650
XCOPA 0.6200 0.6400 0.6180 0.5740 0.5740 0.6000 0.5620
OCNLI 0.5470 0.5470 0.5340 0.5250 0.5600 0.5420 0.5060
Chinese-SQuAD 0.0929 0.1097 0.0865 0.0889 0.0850 0.0777 0.0585
XStoryCloze 0.5800 0.5630 0.5710 0.5560 0.5610 0.5580 0.5570
XWINO 0.6429 0.6528 0.6587 0.6131 0.5992 0.6429 0.6111
Table 10. Benchmark performance comparison in French between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 10% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach FT MKC+ MLP MKC+ MLP MKC FT MKC CS MKC CS MKC+ Baseline
Average Rank 3.2222 3.5000 3.5556 3.7778 4.0000 4.6667 5.2778
Belebele 0.3378 0.3533 0.3678 0.3489 0.3444 0.3344 0.3444
HellaSwag 0.5380 0.5380 0.4990 0.5150 0.5280 0.5070 0.5180
X-CSQA 0.2820 0.2740 0.2730 0.2990 0.2850 0.2900 0.2870
XNLI 2.0 0.7340 0.7400 0.7430 0.7230 0.7450 0.7330 0.7180
FQuAD 0.2597 0.2803 0.3032 0.2981 0.2411 0.2476 0.2401
MMLU 0.2896 0.2895 0.2925 0.2886 0.2806 0.2815 0.2706
Mintaka 0.0710 0.0438 0.0334 0.0670 0.0610 0.0976 0.0712
X-CODAH 0.3000 0.2667 0.2867 0.2767 0.3000 0.2800 0.2633
ARC (Challenge) 0.3120 0.3180 0.3090 0.3060 0.2950 0.2830 0.2850
Table 11. Benchmark performance comparison in German between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 10% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach MLP MKC+ FT MKC+ FT MKC CS MKC MLP MKC CS MKC+ Baseline
Average Rank 3.1250 3.1250 3.5000 3.7500 4.5000 4.7500 5.2500
MMLU 0.2940 0.2879 0.2926 0.2770 0.2905 0.2764 0.2718
ARC (Challenge) 0.2760 0.2850 0.2820 0.2880 0.2830 0.2640 0.2680
Mintaka 0.0580 0.0548 0.0735 0.0576 0.0494 0.0766 0.0498
Belebele 0.3611 0.3578 0.3544 0.3544 0.3567 0.3422 0.3544
X-CODAH 0.3367 0.3500 0.3300 0.3567 0.3400 0.3600 0.3467
X-CSQA 0.2978 0.3008 0.2877 0.2887 0.2857 0.2918 0.2787
HellaSwag 0.4640 0.4710 0.4870 0.4820 0.4540 0.4390 0.4470
XNLI 2.0 0.6620 0.6530 0.6740 0.6440 0.6610 0.6520 0.6890
approaches for Chinese, French, Arabic, and Danish. These results complement the findings for German discussed in
Section 4.2.2 and are shown in Figure 4. Table 15 lists the actual dataset sizes (number of retained tokens) after tokenization
for all languages.
Table 12. Benchmark performance comparison in Arabic between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 56% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Approach MLP MKC+ MLP MKC FT MKC+ Baseline CS MKC+ CS MKC FT MKC
Average Rank 2.7812 3.2500 3.6875 3.9688 3.9688 5.0312 5.3125
EXAMS 0.3537 0.3656 0.3552 0.3582 0.3443 0.3262 0.3346
MMLU 0.4007 0.3909 0.4023 0.3894 0.3912 0.3781 0.3885
ARC (Easy) 0.4330 0.4230 0.4210 0.4120 0.4020 0.3940 0.4080
AlGhafa SciQ 0.6915 0.7005 0.6965 0.6854 0.6724 0.6683 0.6804
Belebele 0.3456 0.3356 0.3322 0.3311 0.3356 0.3567 0.3233
SOQAL 0.7333 0.6867 0.7000 0.7200 0.7267 0.6867 0.7133
MLQA 0.2386 0.2402 0.1928 0.1901 0.2189 0.2154 0.1793
TyDi QA 0.1547 0.1476 0.1230 0.1441 0.1223 0.1097 0.1182
AlGhafa RACE 0.3720 0.3740 0.3640 0.3710 0.3590 0.3660 0.3730
ARCD 0.3638 0.3505 0.3235 0.3354 0.3358 0.3432 0.3043
X-CODAH 0.2600 0.2533 0.2567 0.2633 0.2633 0.2500 0.2600
AlGhafa PIQA 0.6360 0.6320 0.6400 0.6240 0.6320 0.6320 0.6370
X-CSQA 0.2740 0.2810 0.2770 0.2900 0.2880 0.2720 0.2770
XNLI 2.0 0.6570 0.6910 0.6990 0.7010 0.6910 0.6900 0.6770
HellaSwag 0.4270 0.4220 0.4280 0.4250 0.4260 0.4320 0.4150
XStoryCloze 0.6150 0.6100 0.6100 0.6070 0.6130 0.6180 0.5930
Table 13. Benchmark performance comparison in Danish between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, and CS) trained on MKC+ or MKC, retaining 65% of the documents. The average rank is computed across FineTasks performance
of 1B-parameter models evaluated after 119B tokens were consumed.
Table 14. Benchmark performance comparison (average rank) between the baseline (FineWeb-2) and our proposed filtering methods (FT,
MLP, CS) trained on MKC+ or MKC, retaining top 10%, 15% or 20% of the documents. The average rank is computed across FineTasks
performance of 1B-parameter models evaluated for Chinese, German and French after 70B and 119B tokens were consumed.
Chinese French
300
Document Length
Document Length
2000
200
100 1000
0 0
Arabic Danish
2000 2000
Document Length
Document Length
1500 1500
1000 1000
500 500
0 0
C+
C+
C+
C+
C+
C+
KC
KC
KC
KC
C
b-2
b-2
MK
MK
M
M
PM
PM
We
We
MK
MK
MK
MK
MK
MK
FT
FT
CS
CS
ML
ML
e
e
P
P
FT
FT
CS
CS
Fin
Fin
ML
ML
Figure 4. Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using one of our
approaches retaining top 10% of the documents for Chinese and French, 56% for Arabic and 65% for Danish. The average document
length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
Table 15. Comparison of retained tokens in FineWeb-2 before and after filtering using one of our proposed approaches retaining top 10%
of the documents for Chinese, French and German, 56% for Arabic and 65% for Danish. The token counts correspond to the size of the
tokenized datasets, processed with the multilingual Mistral v3 (Tekken) tokenizer (Mistral AI, 2024).
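A minimal sketch of how such token counts can be obtained with a Hugging Face tokenizer is given below; the tokenizer identifier is a placeholder and may not correspond exactly to the Mistral v3 (Tekken) tokenizer used for Table 15.

```python
from transformers import AutoTokenizer

# Placeholder identifier; substitute the tokenizer actually used (Table 15
# reports counts under the multilingual Mistral v3 "Tekken" tokenizer).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Base-2407")

def count_tokens(documents):
    # Sum token counts over all documents, without special tokens,
    # to approximate the "retained tokens" statistic.
    total = 0
    for text in documents:
        total += len(tokenizer.encode(text, add_special_tokens=False))
    return total

# Hypothetical usage:
# print(count_tokens(filtered_docs))
```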
[Figure 5: bar chart for German; y-axis: document length; bars compare FineWeb-2 with MLP filters trained on different training datasets.]
Figure 5. Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using MLP filtering
method retaining top 10% of the documents with different training datasets. The average document length of FineWeb-2 is represented as
a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
[Figure 6: four bar-chart panels (Chinese, Arabic, French, Danish); y-axis: document length; bars compare FineWeb-2 with MLP filters trained on different training datasets.]
Figure 6. Comparison of average document length and standard deviation in FineWeb-2 before and after filtering using MLP filtering
method retaining top 10% of the documents for Chinese and French, 56% for Arabic and 65% for Danish with different training datasets.
The average document length of FineWeb-2 is represented as a red horizontal line, while the medians are shown as red dots. Document length is measured as the number of space-separated tokens.
Table 16. Benchmark performance comparison for Chinese of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 17. Benchmark performance comparison for Arabic of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 18. Benchmark performance comparison for German of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 19. Benchmark performance comparison for Danish of multilingual LLMs (M ) trained on FineWeb-2 or the refined dataset using
our MLP MKC+ approach (retaining top 10% of the documents for Chinese, German, and French, 56% for Arabic, and 65% for Danish)
trained on 595B tokens, against their monolingual counterparts trained on 119B tokens. The average rank is computed across FineTasks
performance for 1B-parameter models trained on 119B tokens.
Table 20. List of evaluation benchmarks and metrics used in our setup for Chinese, French, German, Arabic, Danish, and English.
Benchmark Chinese French German Arabic Danish English Evaluation metric
AGIEval (Zhong et al., 2023) ✓ Normalized accuracy
AlGhafa ARC (Almazrouei et al., 2023) ✓ Normalized accuracy
AlGhafa PIQA (Almazrouei et al., 2023) ✓ Normalized accuracy
AlGhafa RACE (Almazrouei et al., 2023) ✓ Normalized accuracy
AlGhafa SciQ (Almazrouei et al., 2023) ✓ Normalized accuracy
ARC (Clark et al., 2018) ✓ Normalized accuracy
ARCD (Mozannar et al., 2019) ✓ F1 score
Belebele (Bandarkar et al., 2024) ✓ ✓ ✓ ✓ ✓ Normalized accuracy
C3 (Sun et al., 2020) ✓ Normalized accuracy
C-Eval (Huang et al., 2023) ✓ Normalized accuracy
Chinese-SQuAD (Pluto-Junzeng, 2019) ✓ F1 score
CMMLU (Li et al., 2024a) ✓ Normalized accuracy
CMRC 2018 (Cui et al., 2019) ✓ F1 score
CommonsenseQA (Talmor et al., 2019) ✓ Normalized accuracy
EXAMS (Hardalov et al., 2020) ✓ Normalized accuracy
FQuAD (d’Hoffschmidt et al., 2020) ✓ F1 score
HellaSwag (Zellers et al., 2019) ✓ Normalized accuracy
M3Exam (Zhang et al., 2023) ✓ Normalized accuracy
Mintaka (Sen et al., 2022) ✓ ✓ F1 score
MLMM ARC (Lai et al., 2023) ✓ ✓ ✓ Normalized accuracy
MLMM HellaSwag (Lai et al., 2023) ✓ ✓ ✓ ✓ ✓ Normalized accuracy
MLMM MMLU (Lai et al., 2023) ✓ ✓ ✓ Normalized accuracy
MLQA (Lewis et al., 2020) ✓ F1 score
MMLU (Hendrycks et al., 2020) ✓ Normalized accuracy
OCNLI (Hu et al., 2020) ✓ Normalized accuracy
OpenBookQA (Mihaylov et al., 2018) ✓ Normalized accuracy
PIQA (Bisk et al., 2019) ✓ Normalized accuracy
SOQAL (Mozannar et al., 2019) ✓ Normalized accuracy
TriviaQA (Joshi et al., 2017) ✓ Quasi-exact match
TyDi QA (Clark et al., 2020) ✓ F1 score
WinoGrande (Sakaguchi et al., 2019) ✓ Normalized accuracy
X-CODAH (Lin et al., 2021a) ✓ ✓ ✓ ✓ Normalized accuracy
XCOPA (Ponti et al., 2020) ✓ Normalized accuracy
X-CSQA (Lin et al., 2021a) ✓ ✓ ✓ ✓ Normalized accuracy
XNLI 2.0 (Upadhyay & Upadhya, 2023) ✓ ✓ ✓ Normalized accuracy
XStoryCloze (Lin et al., 2021b) ✓ ✓ Normalized accuracy
XWINO (Tikhonov & Ryabinin, 2021) ✓ Normalized accuracy
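Most benchmarks in Table 20 are scored with normalized accuracy, where the predicted answer is the choice whose log-likelihood, normalized by answer length, is highest. The sketch below illustrates this selection rule under that assumption; the exact normalization (characters vs. tokens) depends on the evaluation harness and is not specified here.

```python
def pick_choice(choice_logprobs, choice_lengths):
    """Index of the answer choice with the highest length-normalized
    log-likelihood (summed log-probabilities divided by answer length)."""
    normalized = [lp / max(length, 1)
                  for lp, length in zip(choice_logprobs, choice_lengths)]
    return max(range(len(normalized)), key=normalized.__getitem__)

def normalized_accuracy(predictions, gold):
    # Fraction of questions where the selected choice matches the gold label.
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Toy example: the second choice is longer but more likely per unit length.
print(pick_choice([-12.0, -15.0], [4, 10]))  # -> 1
```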
Highest score:
hi. i couldn’t solve my problem because it has two conditional logical propositions. the problem is:can anyone help
me about this, thanks =)we’re expected to know that: . is equivalent tofind a logically equivalent proposition for:by
first writing its contrapositive, and then applying demorgan’s lawand the equality forthey were trying to be helpful
by outlining the steps we should follow,. . but i think they made it more confusing.i don’t see the purpose of using
the contrapositive here.. . i wouldn’t have done it that way.besides, the statement is a tautology . . .which gives us:
.and this is a tautology: ”a thing implies itself” ... which is always true.i don’t know of any ”logically equivalent
proposition” we can write . . .
Lowest score:
|starts||23 sep 2016 (fri) (one day only)|want to travel soon but dont wish to fork out a fortune for flights? check
out todays promotion from jetstar featuring promo fares fr $35 all−in valid for travel period commencing 12
october 2016dont miss out! all−in frenzy fares to hong kong, penang and more from $35.sale ends 23 sep, 11
pm!|travelling||price||travel period||find flight||penang||$35ˆ|| [...]
Highest score:
Naqhadeh County is a county in West Azerbaijan Province in Iran. The capital of the county is Naqadeh. At the
2006 census, the county’s population was 117,831, in 27,937 families. The county is subdivided into two districts:
the Central District and Mohammadyar District. The county has two cities: Naqadeh and Mohammadyar.
Lowest score:
Custom Wedding Gifts
Personalized photo frames, albums & keepsakes. Heirloom quality!
Custom Engraved Journals
Handmade in Florence Italy. Dozens of sizes and paper styles!
Awesome Leather Journals
Personalized, Customizable, Artisan made in Santa Fe, NM.
Ink Rendering from Photos
100% Hand painted with unique style by pro artists. From $49.
Highest score:
When you are renting a 5, 10, 15, 20, 30 or 40 yard dumpster, you want a company you can trust with prices that
make you smile. Give us a call today and see the difference we can make in your next construction or clean out
project.
Simply give us a call and we will help you figure out your dumpster rental needs.
Our dumpsters usually go out same-day or next-day depending on when you call.
We provide top-notch service, while going easy on your bottom line. What more could you ask for?
Our trained operators are here to give you a fast and hassle-free experience from start to finish.[...]
Lowest score:
Cooperative flat 206/J
- Cooperative flat 201/J - Sold
2(1)+kitchenette, 50,1 m2Cooperative flat 202/J - Sold
2(1)+kitchenette, 44,9 m2Cooperative flat 203/J - Sold
2(1)+kitchenette, 50,6 m2Cooperative flat 204/J - Sold
1+kitchenette, 27,1 m2Cooperative flat 205/J - Sold
2(1)+kitchenette, 50,1 m2Cooperative flat 206/J - On sale
3+kitchenette 86,7 m2[...]
Here is our diagram of the Preamble to the Constitution of the United States. It is based on our understanding of the
use of ”in order to” as a subordinating conjunction that introduces a series of infinitival clauses (without subjects)
that, in turn, modify the compound verbs ”do ordain” and ”establish.”
See A Grammar of Contemporary English by Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik.
Longman Group: London. 1978. p. 753.
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic
Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty
to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
If you have alternative rendering for this sentence, we would be happy to hear of it. Use the e-mail icon to the left.
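The highest- and lowest-scoring documents above illustrate the ordering induced by a quality classifier. A minimal sketch of how such scores could be turned into a filtered subset by keeping the top fraction of documents is shown below; `score_document` is a hypothetical placeholder for any of the FT, MLP, or CS scorers, and at web scale one would typically estimate a score threshold on a sample and filter in a streaming pass rather than sorting the full corpus.

```python
def retain_top_fraction(documents, score_document, fraction=0.10):
    """Score every document and keep the top `fraction` by classifier score,
    e.g. 0.10 for Chinese/German/French, 0.56 for Arabic, 0.65 for Danish."""
    scored = sorted(documents, key=score_document, reverse=True)
    keep = max(1, int(len(scored) * fraction))
    return scored[:keep]

# Hypothetical usage with a placeholder scorer:
# filtered = retain_top_fraction(docs, mlp_classifier.predict_score, fraction=0.10)
```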