Sailor: Open Language Models For South-East Asia
Longxu Dou1 ∗ Qian Liu1 ∗ Guangtao Zeng2 Jia Guo1 Jiahui Zhou1
Wei Lu2 Min Lin1
1 Sea AI Lab, Singapore 2 SUTD, Singapore
{doulx, liuqian}@sea.com
Homepage: https://sailorllm.github.io
Model: https://huggingface.co/sail
Abstract
Takeaway
1. Language models struggle with multiple languages, and continual pre-training presents an opportunity to improve specific language capabilities.
2. Code-switching techniques can be beneficial in multilingual scenarios, improving the ability to handle language mixing.
3. Language models are sensitive to subword segmentation, and techniques like BPE dropout can improve model robustness.
4. Even available high-quality multilingual corpora may require further data deduplication and cleaning.
5. Simulation experiments on smaller models can provide insights into performance trends for large-scale experiments.
1 Introduction
Recent years have witnessed an astonishing surge in the performance of large language
models (LLMs), driven by the rapid growth of Internet data (Rana, 2010) and advances in
pre-training techniques. The advent of models such as GPT-3 (Brown et al., 2020), Gemini (Anil et al., 2023), and Llama (Touvron et al., 2023a) has fueled ever-increasing expectations for LLMs across diverse domains, ranging from creative writing (Wang et al., 2024) and coding (Lozhkov et al., 2024) to logical reasoning (Zhong et al., 2023). Developing high-
quality models crucially depends on access to a large-scale and high-quality dataset. The
ubiquity of digitized English content has established it as a preeminent source for training
LLMs. Consequently, mainstream LLMs (Touvron et al., 2023a; AI et al., 2024; Bai et al.,
2023) tend to heavily rely on English datasets. For example, 89.70% of the training data of
Llama-2 is English (Touvron et al., 2023b). However, these English-centric LLMs frequently
encounter difficulties in achieving comparable performance across other languages (e.g.,
Thai). This phenomenon, termed the curse of multilinguality (Chang et al., 2023), implies
that an over-reliance on English training leads to sub-optimal performance for non-English
languages, as the model lacks sufficient exposure to other languages during pre-training.
In this paper, we aim to develop LLMs that perform well across the South-East Asia (SEA)
region, encompassing a range of languages that include English, Chinese, Vietnamese,
Thai, Indonesian, Malay, and Lao. We share both successful experiences and failed at-
tempts in a completely open manner to accelerate the development of LLMs for the SEA
region. Specifically, we introduce and discuss the benefits of merging adjacent short examples, document-level code-switching, and word-level code-switching, as illustrated in Figure 1. Additionally, we share our entire data cleaning pipeline and deduplication procedure, which turn out to be extremely important for the quality of LLMs, especially in the scenario of continual pre-training. As for tokenization, we explore the usage of BPE Dropout (Provilkov et al., 2020) and highlight its importance for the robustness of LLMs. Finally, we use small models as proxies to optimize the hyper-parameters for continual pre-training, including the learning rate and the data mixture ratio of different data sources.
2 Insights
During our development, we perform ablation studies on small LMs to understand the
impact of various strategies. We then apply the key insights gained from these studies to
improve LLMs. Most of the experimental results are obtained from three series of models:
our internal 120M model trained on 20B English tokens using SlimPajama (Soboleva et al.,
2023), the TinyLlama 1.1B model (Zhang et al., 2024), and the Qwen1.5-0.5B model (Bai
et al., 2023). All techniques we have considered are listed in Table 1.
2.1 Data
Merging Adjacent Short Examples Several studies have emphasized the importance of
deduplication (Lee et al., 2022), with some popular corpora undergoing deduplication at
the paragraph level, resulting in final corpora composed of unique paragraphs, such as CC100 (Wenzek et al., 2020a). This approach enhances data efficiency by maximizing the number of unique tokens the model encounters, but it can adversely impact model performance, as the connections between different pieces within the context become less relevant.
To mitigate the issue, we have employed a simple method of randomly combining several
adjacent examples into one example before applying a global shuffle. The method can be
applied because the deduplicated paragraphs still retain the order in which they appear in
the original documents, allowing for the reconstruction of context when necessary. More-
over, the method is also applied to certain sources, such as subtitles, which are inherently
composed of short sentences.
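A minimal sketch of this merging step is given below, assuming the deduplicated paragraphs are kept in their original document order; the merge window and the joining delimiter are illustrative choices, not our exact implementation.

```python
import random

def merge_adjacent(paragraphs, max_merge=4, seed=0):
    """Randomly combine up to `max_merge` adjacent paragraphs into one longer example."""
    rng = random.Random(seed)
    merged, i = [], 0
    while i < len(paragraphs):
        k = rng.randint(1, max_merge)
        merged.append("\n\n".join(paragraphs[i:i + k]))  # adjacent pieces keep their original order
        i += k
    return merged

paragraphs = ["short paragraph 1", "short paragraph 2", "short paragraph 3", "short paragraph 4"]
examples = merge_adjacent(paragraphs)
random.shuffle(examples)  # the global shuffle is applied only after merging
```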
Aggressive Data Cleaning and Deduplication Data quality is crucial during continual pre-training. We employ aggressive cleaning parameters, an extended filtering list for each language, and multi-round deduplication. Consequently, even though we started with well-curated open datasets, e.g., the MADLAD-400 clean set (Kudugunta et al., 2023), we still removed a further 31.11% of the data during cleaning and 11.16% during deduplication. By ex-
tensively filtering out noisy, harmful, and duplicated content, we are able to significantly
improve the efficiency of the pre-training process and the stability of the optimization pro-
cedure. Furthermore, LLMs are less prone to memorization issues when training data has
undergone thorough deduplication (Lee et al., 2022).
Figure 2: Initially, Sailor models were trained on 200B tokens using a greedy tokenization strategy. Subsequently, they were fine-tuned using BPE dropout for an additional 2B tokens, with a dropout rate of 0.1. As observed, BPE dropout improves robustness.
Figure 3: We initially pre-train a 120M model using a corpus of 20B tokens focusing on En-
glish. Subsequently, we continually pre-train the model using a mixed corpus comprising
both English and SEA languages. Each data point here corresponds to a different configu-
ration of data mixture and learning rate. As indicated, under a fixed token budget, there is a trade-off between the model's performance on English and SEA languages.
2.2 Tokenization
BPE Dropout We have observed that the model is unreasonably sensitive to small variations of the prompt, especially around spaces. As illustrated in Figure 2a, prompting the model with the string “Answer:” without any trailing space yields substantially better performance than prompting with “Answer: ” 1 . The same phenomenon is observed in Qwen1.5, Mistral and Llama 2, and a similar issue has been discussed in the lm-evaluation-harness library2 (Gao et al., 2023). We attribute this kind of vulnerability to the tokenization strategy used in data processing. Modern tokenization methods usually employ Byte Pair Encoding (BPE) (Sennrich et al., 2016) under the greedy segmentation setting 3 , which means that sentences are segmented into subwords using the optimal tokenization strategy. However, this always-optimal strategy can leave the model vulnerable when it encounters noisy subwords, such as an unexpected trailing space in “Answer: ”. Typically, a space is segmented into a subword together with the subsequent characters (e.g., “ 1” constitutes a single subword). Yet, if a space is left at the end of the prompt, it becomes an isolated subword “ ”, deviating from the segmentation seen in the demonstration examples. To alleviate the problem, we employ BPE-Dropout (Provilkov et al., 2020) during continual pre-training, which stochastically corrupts the segmentation procedure of BPE to achieve subword regularization. Experimental results indicate that although BPE Dropout slightly increases the loss under greedy subword segmentation, it enhances both the performance and the robustness of models, as shown in Figure 2b.
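For illustration, the self-contained sketch below contrasts greedy BPE with BPE-dropout using the HuggingFace tokenizers library; the toy corpus and vocabulary size are assumptions, since in practice the dropout would be applied with the existing pre-trained tokenizer rather than a freshly trained one.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

def build_tokenizer(dropout=None):
    # Tiny toy BPE tokenizer trained on a toy corpus, purely for demonstration.
    tok = Tokenizer(models.BPE(dropout=dropout, unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
    tok.train_from_iterator(["Answer: 1", "Answer: 2", "The answer is 1"] * 50, trainer)
    return tok

greedy = build_tokenizer()              # deterministic greedy merges
dropped = build_tokenizer(dropout=0.1)  # each merge is skipped with probability 0.1 at encode time

print(greedy.encode("Answer: 1").tokens)   # identical segmentation across calls
print(dropped.encode("Answer: 1").tokens)  # segmentation varies across calls
```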
(Figure 4 plots: the fitted quadratic trends, e.g., y = 0.13x² + 1.24x + 6.28, reach R² = 0.9936 and R² = 0.9534; panel (c) shows the average SEA loss as the learning rate, in units of 1e-4, increases.)
Figure 4: Under the same token budget, we observe that (a) the validation loss on English
can be modeled as a quadratic function of log(English Proportion) − log(Learning Rate);
(b) the validation loss on SEA languages, using Malay as an example, can be approximately represented by a quadratic function of log(Malay Proportion) + log(Learning Rate); (c)
we can tune the learning rate by analyzing the learning curves on SEA languages.
We build Sailor models based on Qwen1.5 (Bai et al., 2023), which is inherently multilingual-friendly and possesses a large vocabulary, thereby guaranteeing a high compression rate for SEA languages.
2.3 Training
When it comes to continual pre-training, two crucial hyper-parameters to consider are the
learning rate and the data mixture. In our practice, we begin by generating a number of
training configurations with varying learning rates 4 and language proportions to train
several proxy models 5 . By analyzing the trade-off between English and SEA languages on
these proxy models, we can select a suitable learning rate. Once the learning rate is deter-
mined, we then conduct fine-grained data mixture simulation experiments to optimize the
joint loss across all languages, which is finally used in large-scale training.
(Figure 5 illustration: each of the 64 proxy runs pairs a data mixture over languages such as English and Vietnamese with its joint loss (e.g., 2.421, 2.115, 2.342), which is then used to score a new candidate data mixture.)
Figure 5: We employ the experimental results from proxy models across a variety of data
mixtures (e.g., 64 distinct data mixtures here) to fit a linear regression model. The model is then utilized to predict the validation loss of numerous randomly simulated data mixtures, enabling us to identify the most effective data mixture for optimizing the joint loss. Subsequently, the best data mixture is applied to large-scale training.
Learning Rate Tuning Figure 3 also demonstrates an inverse relationship between the
number of tokens and the loss on English. As more tokens are consumed (e.g., 0.25B →
2.25B), the curve shifts towards the upper-left area, signifying an increase in the loss on
English. Interestingly, the loss trend on the source domain (i.e., English) is primarily in-
fluenced by two factors: the proportion of English data during continual pre-training and
the learning rate. Under the same token budget, the model’s loss on English can be accu-
rately modeled as a quadratic function of log(English Proportion) − log(Learning Rate), as
shown in Figure 4a. In other words, while keeping the proportion of English data constant,
increasing the learning rate may adversely affect the model’s performance on English.
Meanwhile, the loss trend on the target domain (i.e., SEA languages) is also mainly affected
by the proportion of the target domain and the learning rate. However, the relationship among the model's loss on SEA languages, the data proportion, and the learning rate takes a different form, as demonstrated in Figure 4b. From this observation, it becomes evident that the learning
rate serves as a crucial hyper-parameter. A well-tuned learning rate plays a pivotal role in
striking a balance between the acquisition of SEA languages and the forgetting of English.
As shown in Figure 4c, considering that increasing the learning rate beyond 1e-4 does not
yield significant improvements in the loss on SEA languages, we set the peak learning rate
to 1e-4 in our experiments.
Data Mixture Simulation We aim to develop an improved LLM tailored for the en-
tire SEA region, with a focus on ensuring balanced representation across all target lan-
guages. To achieve this, we have developed a new algorithm that determines the appro-
priate weights for various languages during continual pre-training. This method involves
conducting a series of randomized data mixture experiments, while adhering to a prede-
termined learning rate. Our goal is to determine the most effective data mixture. To this
end, we suggest employing simulations in conjunction with linear regression models. As
depicted in Figure 5, we begin by training a set of proxy models (e.g., 64 in total here)
on a variety of data mixtures for a limited number of training steps (e.g., 1000 steps). We
then fit a linear regression model, using the data mixture as the input feature and the joint
loss considering all languages 6 as the target. With this model, we can perform numerous
simulation experiments (e.g., 1,000,000) on randomly sampled data mixtures to explore the
vast array of possibilities within seconds. The linear model then guides us in selecting the
combination that yields the lowest predicted joint loss. Once this data mixture has been
optimized, it can be directly applied to large-scale training. More details and findings will
be discussed in our upcoming paper.
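A minimal sketch of this simulation loop is given below; the Dirichlet sampling and the synthetic joint losses are placeholders standing in for the measured proxy-run results, and scikit-learn's LinearRegression is one possible choice of regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
num_langs, num_proxy_runs = 7, 64            # e.g., en, zh, vi, th, id, ms, lo

# Stand-ins for the (data mixture, joint loss) pairs measured on the proxy models.
mixtures = rng.dirichlet(np.ones(num_langs), size=num_proxy_runs)
joint_loss = 2.4 + rng.normal(0.0, 0.05, size=num_proxy_runs)  # synthetic targets

reg = LinearRegression().fit(mixtures, joint_loss)

# Simulate many random mixtures and keep the one with the lowest predicted joint loss.
candidates = rng.dirichlet(np.ones(num_langs), size=1_000_000)
best_mixture = candidates[np.argmin(reg.predict(candidates))]
print(best_mixture.round(4))
```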
Drawing from the above insights, we highlight the importance of selecting the learning rate
and the proportion of source domain data to mitigate issues such as catastrophic forgetting.
Therefore, we focus on the metric log(Source Domain Proportion) − log(Learning Rate),
which we refer to as the magic metric below. We suggest the following steps (a toy curve-fitting sketch is given after the list):
1. Fit a parametric quadratic function modeling the relationship between the source-domain loss and the magic metric, via experiments varying learning rates and proportions.
2. Estimate the boundary of the magic metric beyond which the model's source-domain loss starts to deviate significantly from the original one.
3. Balance the learning progress on the target domain against the retention of the source domain by selecting a suitable magic metric larger than the boundary.
4. If the magic metric substantially exceeds the estimated boundary, the model retains more knowledge from the source domain; conversely, a smaller value facilitates a more rapid learning pace on the target domain.
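A toy version of steps 1 and 2 might look like the sketch below; the loss values, proportions, and tolerance are synthetic placeholders rather than measurements from our runs.

```python
import numpy as np

# Step 1: fit the source-domain loss as a quadratic in the magic metric from a few runs.
english_prop = np.array([0.1, 0.2, 0.3, 0.5, 0.7])
learning_rate = np.array([3e-4, 2e-4, 1e-4, 1e-4, 5e-5])
magic = np.log(english_prop) - np.log(learning_rate)
loss_source = np.array([2.95, 2.80, 2.68, 2.55, 2.50])   # synthetic English validation losses

a, b, c = np.polyfit(magic, loss_source, deg=2)

# Step 2: smallest magic-metric value whose predicted loss stays within a tolerance
# of the original (pre-continual-pre-training) loss.
original_loss, tolerance = 2.48, 0.05
grid = np.linspace(magic.min(), magic.max(), 1000)
predicted = a * grid ** 2 + b * grid + c
boundary = grid[np.argmax(predicted <= original_loss + tolerance)]
print(boundary)
```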
The above guideline can potentially explain why Lemur (Xu et al., 2024) demonstrated
negligible performance deterioration on natural language benchmarks (e.g., MMLU) de-
spite undergoing continual pre-training from Llama-2 on an extremely imbalanced data
distribution (i.e., text:code as 1:10). The employment of a smaller learning rate (i.e., 4e-5)
during Lemur’s training likely preserved the magic metric within a good range, allowing the model to maintain its proficiency in the source natural language domain.
3 Data Sources
Here we describe all the corpora used in our training. Note that we performed an additional
round of data deduplication and cleaning on these datasets before using them.
SkyPile SkyPile (Wei et al., 2023) is a massive, high-quality Chinese dataset for pre-
training. It comprises 233M web pages, totaling 150B tokens, carefully filtered and dedu-
plicated from public web sources. We download SkyPile by accessing its hosted dataset on
HuggingFace 8 .
CC100 CC100 9 is a multilingual corpus comprising monolingual data from over 100 lan-
guages. The corpus was originally constructed for training the XLM-R model (Conneau et al.,
2020), a powerful cross-lingual language model. The data was sourced from the Common
Crawl project (Rana, 2010). Specifically, the corpus was generated by processing Common
7 https://huggingface.co/datasets/cerebras/SlimPajama-627B
8 https://huggingface.co/datasets/Skywork/SkyPile-150B
9 https://data.statmt.org/cc-100
Crawl snapshots from January to December 2018, using the open-source CC-Net reposi-
tory (Wenzek et al., 2020a). In our pre-training corpus, we take the Indonesian, Malay, Lao,
Thai and Vietnamese subsets.
MADLAD-400 The CC100 corpus is a great multilingual resource due to its high quality, but it has already split every document into separate paragraphs, making it a paragraph-level corpus. We believe using paragraphs as examples would greatly
hurt the document-level performance of the model, as evidenced by our preliminary study.
Therefore, we also consider MADLAD-400 (Kudugunta et al., 2023), a manually audited
and large-scale multilingual corpus spanning 419 languages. MADLAD-400 is also based
on CommonCrawl, which uses all available corpus till August 2022. In our pre-training
corpus, we take its clean version, downloaded from the dataset hosted by HuggingFace 10 .
Wikipedia We utilize the Wikipedia dump (encompassing Malay, Indonesian, Thai, and
Vietnamese) up to November 2023 from the Wikipedia dataset hosted on HuggingFace 11 .
It should be noted that some of the Wikipedia corpus may be duplicated, as the SlimPajama
dataset has already included the multilingual Wikipedia corpora.
OpenSubtitles We collect the Malay, Indonesian, Thai and Vietnamese subtitles from the
OPUS OpenSubtitles category 12 . For all subtitles, we use a sliding window of 100 to con-
catenate adjacent subtitles to compose longer documents. An example of an Indonesian subtitle can be found below:
Translation While our preliminary studies indicate that translation data may have sim-
ilar effects to document-level code-switching, we still incorporated translation data since
translation is an important task. We curate a selection of English-SEA language translation
pairs available in the OPUS project 13 (e.g., TED2020 talks). Notably, we observe substan-
tial duplication within the translation data, thus necessitating a further deduplication step.
Concurrently, to account for both directions, we processed data for both English-to-SEA
and SEA-to-English translation directions for each example. An illustrative example is
provided below:
Indonesian to English: Pak Tanaka bukan murid. Mr. Tanaka is not a student.
English to Indonesian: Did the Israelites execute criminals by hanging them on
stakes? Apakah mereka menghukum mati penjahat dengan memakukannya pada
tiang?
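For illustration, the following sketch materializes both directions from a single aligned pair, mirroring the format above; the helper function is an assumption rather than our exact preprocessing code.

```python
def make_bidirectional(src_lang, tgt_lang, src_text, tgt_text):
    # Emit one example per translation direction, following the layout shown above.
    return [
        f"{src_lang} to {tgt_lang}: {src_text} {tgt_text}",
        f"{tgt_lang} to {src_lang}: {tgt_text} {src_text}",
    ]

examples = make_bidirectional(
    "Indonesian", "English", "Pak Tanaka bukan murid.", "Mr. Tanaka is not a student."
)
print(examples)
```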
4 Preprocessing Pipeline
The data quality is crucial during continual pre-training. We found that several publicly
available multilingual datasets could be further cleaned and deduplicated. To improve
the data cleaning process for SEA languages specifically, we expanded our list of filtering
words, trained new filtering models, and implemented a more aggressive deduplication
strategy. As a result of these optimizations, we retained 61.19% of the data for SEA languages
10 https://huggingface.co/datasets/allenai/MADLAD-400
11 https://huggingface.co/datasets/wikimedia/wikipedia
12 https://opus.nlpl.eu/OpenSubtitles-v2018.php
13 https://opus.nlpl.eu/
Figure 6: With aggressive data cleaning and deduplication, we obtain 61.19% high-quality
data from two well-curated open datasets, including CC100 (Wenzek et al., 2020a) and
MADLAD-400 (Kudugunta et al., 2023). This forms the SailCraft dataset, used to train the
Sailor models. The reported removal rate (grey) is with respect to each previous stage, and
the kept rate (colored) demonstrates the overall rate.
from public datasets, and constructed the final SailCraft dataset. The specific removal rates
are shown in Figure 6.
1. Uniform whitespace. We first unify the whitespace within the sentence, trans-
forming all forms of whitespace to the classic space character. This approach guar-
antees consistency across various whitespace characters and facilitates the segmen-
tation, i.e., converting the documents into words.
3. Remove incorrect words. We exclude emojis using the EMOJI package 14 , remove
HTML-related tags to eliminate links associated with source page code, and filter
out certain terms based on a pre-defined word list.
Note that for MADLAD-400, we have fixed the Unicode escaping issue (i.e., lots of “\\n”) reported in the HuggingFace forum 15 , which would cause trouble in few-shot in-context learning and chatbot applications, where “\n” acts as an important delimiter between demonstrations and task input. Concretely, we replace all “\\n” with “\n\n” or
“\n” with some heuristic rules. Please refer to Appendix A for implementation details and
concrete cases.
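A simplified version of this un-escaping step might look like the sketch below; the actual heuristic rules are described in Appendix A, so the mapping here is an assumption.

```python
def unescape_newlines(text: str) -> str:
    # Treat runs of escaped newlines as paragraph breaks and single ones as line breaks
    # (a simplification of the heuristics described in Appendix A).
    text = text.replace("\\n\\n", "\n\n")
    return text.replace("\\n", "\n")

print(unescape_newlines("Question: 2+2?\\nAnswer: 4\\n\\nQuestion: 3+3?"))
```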
The data cleaning mainly follows the BigScience data cleaning recipe 16 . Note that for most
languages, we can make use of publicly available resources. However, for several low-resource languages, we have to train models from scratch. The entire data cleaning process
is as follows:
1. Filtering on the number of words. We first tokenize the documents with a sentencepiece model for each language. Then we count the number of words and remove documents that are shorter than the minimum length or longer than the maximum length. Filtering short documents removes incorrect sentences or sentences without enough context; filtering long documents removes redundant information or documents that exceed the maximum input length.
2. Filtering on the character repetition ratio. We first compile the list of character-level n-grams for the given document. Then, we calculate the frequency of each n-gram. We define the character repetition ratio as the sum of the frequencies of the top m most frequent n-grams. A document is dropped if its character repetition ratio is above the pre-defined threshold. Note that m is chosen as a trade-off so that it balances the distribution of short and long documents; in practice, we set m to the square root of the number of n-grams (a minimal sketch of this filter is given after the list).
3. Filtering on the word repetition ratio. The word repetition ratio is defined as the
sum of frequencies of all n-grams whose frequency is greater than 2. A document
is dropped if its word repetition ratio score is above the pre-defined threshold.
4. Filtering on the special characters ratio. A list is maintained to track special char-
acters. If a document’s ratio of special characters exceeds a pre-defined threshold,
it will be dropped. The purpose of this filter is to eliminate documents that consist
primarily of special characters.
5. Filtering on the stop words ratio. A list of stop words for each language is main-
tained, and a document will be removed if its stop words ratio is above the pre-
defined threshold. This removes machine-generated text that does not carry much semantically meaningful information. However, one significant challenge arises with languages such as Chinese and Vietnamese that do not use spaces, as it becomes difficult to recognize stop words after tokenization. Following BigScience practice, we address the issue by expanding the stop list to include both word-level and byte-piece-level stop words, thereby enhancing the coverage and effectiveness of the filtering. For the stop word lists, we collected those for Thai 17 and Malay 18 from
available resources. However, we did not find relevant resources for Lao, and thus
we translated the Thai stop words list into Lao.
6. Filtering on the flagged words ratio. We maintain a list of flagged words for each
language. A document is removed if its flagged words ratio is above the pre-
defined threshold. This removes pornography-related buzzwords, which are harmful for model training. We create or expand the flagged word lists for Thai, Malay, and Lao by translating the English ones developed by BigScience.
7. Filtering on the language identification prediction score. We adopt the fastText (Joulin et al., 2016) model 19 to obtain the language identification result for each document along with the corresponding confidence score. A document is dropped if its confidence score is below the pre-defined threshold. This filter removes unnatural content such as machine-generated text, advertisements, or rapidly changing spoken language. However, it also has the drawback of removing code-switching text that exists in SEA regions, such as Singlish (Singapore English) and Manglish (Malaysian English).
14 https://github.com/carpedm20/emoji
15 https://huggingface.co/datasets/allenai/MADLAD-400/discussions/2
16 https://drive.google.com/file/d/1cCJ8sWE88TRLDAa3eHLmXO4JlkR2QzLY/view
17 https://github.com/stopwords-iso/stopwords-th/blob/master/stopwords-th.txt
18 https://github.com/stopwords-iso/stopwords-ms/blob/master/stopwords-ms.txt
19 https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
Table 2: The storage statistics for each subset, including the raw data, the data after cleaning, and the data after deduplication. Even though we started with high-quality open datasets (i.e., the MADLAD-400 clean set and CC100), we still removed 31.11% of the data during cleaning, and a further 11.16% during deduplication.
8. Filtering on the perplexity score. We adopt the KenLM (Heafield, 2011) model to calculate the perplexity score of documents for each language. The KenLM models are trained on high-quality corpora such as Wikipedia. A document is removed if its perplexity score is above the pre-defined threshold. This filter removes documents full of unrelated tokens such as tags, times, dates, and heavy repetition. One main drawback is that it inevitably removes some useful documents whose distribution differs from Wikipedia. For the KenLM models, we download most language models from the BigScience repository 20 . However, there are no KenLM models available for Thai, Malay, and Lao, so we sample a high-quality subset from the Wikipedia corpus and train KenLM models with a vocabulary size of 65536 21 .
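As referenced in step 2 above, here is a minimal sketch of the character-repetition-ratio filter; the n-gram length and the threshold value are illustrative assumptions rather than our production settings.

```python
from collections import Counter
import math

def char_repetition_ratio(doc: str, n: int = 5) -> float:
    ngrams = [doc[i:i + n] for i in range(len(doc) - n + 1)]
    if not ngrams:
        return 0.0
    freqs = Counter(ngrams)
    m = max(1, int(math.sqrt(len(freqs))))             # m = square root of the number of n-grams
    top_m_count = sum(count for _, count in freqs.most_common(m))
    return top_m_count / len(ngrams)                   # share held by the top-m most frequent n-grams

def keep_document(doc: str, threshold: float = 0.2) -> bool:
    return char_repetition_ratio(doc) <= threshold

print(keep_document("spam spam spam spam spam spam"))   # highly repetitive -> dropped
print(keep_document("a reasonably varied sentence"))    # kept
```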
The data deduplication procedure is the most important and challenging part of our data preprocessing. First, it distills the corpus for efficient pre-training. Moreover, it further filters out noisy content, such as machine-generated advertisements, that cannot easily be recognized by rule-based cleaning methods. Most importantly, LLMs are less prone to memorization issues when the training data has undergone thorough deduplication (Lee et al., 2022).
20 https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/
training/01b_oscar_cleaning_and_filtering#2-download-everything-you-need
21 https://github.com/bigscience-workshop/data_tooling/tree/master/kenlm_training
22 https://github.com/ChenghaoMou/text-dedup
23 https://tinyurl.com/bdf6zerm
Figure 7: The most frequent textual duplicates identified across two datasets for three SEA
languages, along with their respective frequencies. Inappropriate content and personally
identifiable information are replaced with [Sensitive Data]. For brevity, we highlight only
the duplicate content. The lengthy content is truncated for better visualization.
and false negatives 24 . Ultimately, the number of bands and the number of rows per band
(i.e., two crucial hyper-parameters) are optimized to 25 and 10, respectively.
For resource requirements, the data deduplication mainly consumes memory, CPU cores, and disk space. The memory requirement is primarily determined by the number of documents, rather than their total disk size 25 . To handle large files within limited memory resources, we first split them into 30GB chunks to make processing tractable on our CPU server. It takes about 200GB of memory to process a 30GB corpus with 256 permutations, and approximately 30 minutes in total to deduplicate the 30GB corpus using 64 CPU cores. To improve deduplication performance, we iteratively cycled through splitting into chunks, deduplicating, and recombining the chunks for 3 rounds, until the chunk size converged.
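For illustration, the sketch below shows MinHash-based near-duplicate detection with the banding configuration mentioned above (256 permutations, 25 bands of 10 rows); the datasketch library and character 5-gram shingling are assumptions for this example, not necessarily the exact tooling used in our pipeline.

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 256, n: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(text) - n + 1)):
        m.update(text[i:i + n].encode("utf-8"))   # character n-gram shingles
    return m

docs = {
    "doc_a": "contoh dokumen yang hampir sama dengan dokumen lain",
    "doc_b": "contoh dokumen yang hampir sama dengan dokumen lain!",
    "doc_c": "sebuah teks yang benar-benar berbeda isinya",
}

# 25 bands x 10 rows, matching the optimized banding configuration above.
lsh = MinHashLSH(num_perm=256, params=(25, 10))
kept = []
for key, text in docs.items():
    sig = minhash_signature(text)
    if lsh.query(sig):        # near-duplicate of an already-kept document: drop it
        continue
    lsh.insert(key, sig)
    kept.append(key)
print(kept)                   # doc_b is dropped as a near duplicate of doc_a
```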
(Figure 8 content: three pairs of near-duplicate Indonesian web documents shown side by side, each pair differing only in minor wording, spelling, and template changes.)
Figure 8: The above pairs of documents from the CC100-Indonesian dataset are identified
as duplicates by our deduplication algorithm. To enhance readability, the matching sub-
sequences within these document pairs are highlighted.
Case Study Figure 7 showcases the most prevalent duplicate sentences across various
language subsets identified by our deduplication algorithm. These duplicates span a wide
range of domains, including medicine, customer service, and address information. The
presence of such noisy and redundant data can impede the pre-training process, as indi-
cated by Penedo et al. (2023). Additionally, sensitive information like emails and phone
numbers poses privacy risks.
Our deduplication approach effectively addresses the prevalent scenario where documents
are nearly identical, differing only in the interspersed template fields, as exemplified
in Figure 8. Despite cleaning efforts by CCNet (Wenzek et al., 2020b) and MADLAD-
400 (Kudugunta et al., 2023), the quality of open datasets remains sub-optimal, under-
scoring the challenges in multilingual data cleaning. For more deduplication cases, please
refer to Appendix B.
As detailed in Section 2.3, our algorithm involves utilizing proxy models to fit a linear re-
gression model, which then aids in determining the optimal data mixture for large-scale
training. To elaborate, we extend the data mixture beyond language-level considerations to also include the source of the data. This means we treat each language from every source as a distinct dataset and try to optimize the data mixture over these
datasets. The Qwen1.5-0.5B model serves as our proxy model, and we apply the optimized
data mixture to the continual pre-training process across all model sizes. The effective to-
kens and equivalent epochs in SailCraft are documented in Table 3. From the table, we
observe that, in terms of quality or diversity, the CC100 dataset exhibits a relative advan-
tage over the MADLAD-400 dataset, particularly for Indonesian and Vietnamese.
5 Model Training
We obtain the Sailor models through continual pre-training of Qwen1.5 (Bai et al., 2023) on
140B high-quality SEA tokens and 60B tokens for replay (see Section 4.4).
Hardware For training devices, we use NVIDIA A100 SXM4 40GB GPUs. To accelerate multi-node training, we further employ InfiniBand for low latency and high throughput. During training, we use 64 GPUs for the 7B / 4B models, and 32 GPUs for the 1.8B / 0.5B models.
26 https://github.com/epfLLM/Megatron-LLM
27 https://epfllm.github.io/Megatron-LLM/
28 https://github.com/jzhang38/TinyLlama
We adopt most of the pre-training settings and model architectures from Qwen1.5 (Bai
et al., 2023). It follows the standard transformer architecture (Vaswani et al., 2017), adopting pre-normalization with RMSNorm (Jiang et al., 2023b), the SwiGLU activation (Shazeer, 2020), and rotary positional embeddings (Su et al., 2022). Notably, Qwen1.5 adds a bias term to the QKV projections in attention to improve extrapolation ability. Meanwhile, for the 0.5B model, we set tie_word_embeddings to False, i.e., we do not tie the input embedding (embedding module) and the output projection (lm_head module). Thus, the parameter count of Sailor-0.5B is approximately 0.6B. However, we still name it 0.5B to be consistent with Qwen1.5.
During training, we utilize a context window length of 4,096, and integrate Flash Attention
2 (Dao, 2023) to improve the training efficiency and reduce the memory usage 29 . We utilize
AdamW (Kingma & Ba, 2014) for optimization, with the hyper-parameters β1 = 0.9, β2 = 0.95, and eps = 1e−5. We use a weight decay of 0.1 and gradient clipping of 1.0. We train models with BFloat16 mixed precision to balance training efficiency and stability. Notably, we set attention_softmax_in_fp32 to True to execute attention masking and softmax operations in fp32, thereby preventing precision underflow 30 .
The final pre-training corpus, SailCraft, is composed of approximately 200B tokens, inte-
grating both SEA tokens and replay tokens, as elaborated in Section 4.4. We use a batch size of 4M tokens and a learning rate of 1e-4. Following a warmup period of 500 steps, the learning rate remains constant. This scheduling strategy encourages more transferable conclusions from simulations and allows for easier recovery from interrupted training sessions. Generally, Sailor models consume around 200B tokens, completing one full pass through the SailCraft corpus. However, the Sailor-0.5B model undergoes training
with 400B tokens, equivalent to 2 epochs.
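The optimization settings above roughly correspond to the following PyTorch sketch; the tiny linear module and synthetic loss are stand-ins for the actual transformer and language-modeling loss, and the distributed-training setup is omitted.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the Sailor transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.95), eps=1e-5, weight_decay=0.1
)
# 500-step linear warmup, then hold the learning rate constant at the 1e-4 peak.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 500)
)

for step in range(1000):
    with torch.autocast("cpu", dtype=torch.bfloat16):     # BFloat16 mixed precision
        loss = model(torch.randn(4, 8)).pow(2).mean()     # synthetic stand-in for the LM loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```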
6 Experiments
Sailor models are evaluated on several high-quality benchmarks, including question an-
swering, commonsense reasoning, reading comprehension and examination.
6.1 Benchmark
Question Answering The XQuAD dataset (Artetxe et al., 2020) (Thai, Vietnamese) and
the TydiQA dataset (Clark et al., 2020) (Indonesian) were selected as the representative
benchmarks for question answering. The XQuAD dataset comprises 1,190 question-answer
pairs from professional translations of the development set of SQuAD v1.1 (Rajpurkar
et al., 2016). The TydiQA dataset covers 204,000 question-answer pairs directly sourced
from data in their original languages, with human-written questions.
29 In contrast to Flash Attention 1 (Dao et al., 2022), Flash Attention 2 makes it possible to train models on arbitrary datasets that also include padding tokens.
30 https://github.com/huggingface/transformers/pull/17437
Commonsense Reasoning The XCOPA dataset (Ponti et al., 2020) (Indonesian, Thai, Viet-
namese) provides two choices for each premise, requiring the model to select one that bet-
ter addresses either the cause or effect of the event mentioned in the premise.
Examination The M3Exam dataset (Zhang et al., 2023) (Javanese, Thai, Vietnamese) is a
multilingual exam benchmark collected from official school tests used in nine countries.
Note that we chose its Javanese subset since the Indonesian version has yet to be released.
[Question] Pak Untung iku sabendinane lunga menyang sanggar saperlu mimpin
lan ngatur lumakune crita drama
Miturut wacan ing inggil, pendamelan (penggawean) pak Untung dados
[Option 1] Aktor
[Option 2] Penulis
[Option 3] Pelawak
[Option 4] Sutradara
[Answer] Sutradara
Table 4: Experimental results of different models on the question answering task. Note that
SeaLLM-7b-Hybrid and SeaLLM-7B-v2 are both models trained with instruction tuning
datasets; the same holds for the other tables.
for Indonesian tasks). Note that we keep the tokenizer consistent when computing the F1
scores of different models.
Following the evaluation approaches adopted in OpenCompass (Contributors, 2023) and
the Eleuther AI evaluation framework (Gao et al., 2023) on the popular HellaSwag bench-
mark (Zellers et al., 2019), we reformulated the tasks with limited output spaces (i.e., XCOPA, BELEBELE) as continuation writing tasks. That is, each possible answer is appended to the given input or question, and the one that achieves the lowest perplexity
score is considered as the model prediction. As for the M3Exam dataset, we adopt the
official evaluation method used in Zhang et al. (2023) to evaluate all models. The evalu-
ation approach involves directly prompting LLMs to produce the correct option ID when
presented with a question and its corresponding options.
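A bare-bones version of the perplexity-based protocol is sketched below; the checkpoint name and the toy XCOPA-style Indonesian example are assumptions, and scoring the full concatenated sequence is a simplification of the exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sail/Sailor-0.5B"   # assumed checkpoint name under the HuggingFace "sail" org
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sequence_loss(prompt: str, option: str) -> float:
    ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()   # lower loss ~ lower perplexity

prompt = "Pria itu membuka keran air. Akibatnya, "  # toy Indonesian premise
options = ["air mengalir keluar.", "kerannya berubah warna."]
prediction = min(options, key=lambda o: sequence_loss(prompt, o))
print(prediction)
```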
We compare Sailor models with SeaLLM (Nguyen et al., 2023b), Sea-Lion (AI Singapore,
2023), Typhoon (Pipatanakul et al., 2023), and VinaLLaMA (Nguyen et al., 2023a). Our
reporting strictly adheres to the same evaluation methodology to ensure a fair comparison,
and we make every effort to closely match the reported results of all baselines.
Experimental results shown in Table 4, 5, 6 indicate that our Sailor models typically out-
perform the baseline model, Qwen1.5, in terms of performance on SEA languages. Addi-
tionally, the performance of Sailor models is either superior or comparable to major SEA
LLMs such as SeaLLM, Sea-Lion, Typhoon, and VinaLLaMA on these benchmarks.
However, it is not the case for M3Exam. As shown in Table 7, our Sailor models exhibit
no evident advantage over Qwen1.5 at the 4B parameter scale or lower, and in certain instances, they display noticeable weaknesses. We have observed that the discrepancy is due to
a significant option bias, which leads the Sailor models to favor certain option IDs (e.g., al-
ways C) when making predictions. Interestingly, a similar phenomenon was also observed
among other baseline LLMs focusing on SEA languages. While instruction tuning could
mitigate the option bias, we have chosen not to tune the Sailor models to maintain fairness
and consistency in the evaluation process. We also provide additional results evaluated us-
ing the HellaSwag protocol in Appendix C, which is better aligned with other benchmark
results.
7 Conclusion and Future Work
In this paper, we present the Sailor family of open language models, tailored for South-East
Asian languages, which exhibit strong performance across various multilingual tasks and
benchmarks, fostering advancements in multilingual language models for the SEA region.
Here are some of the most important directions for future work:
duplicated paragraph is found across multiple documents, it is advisable to retain only one
instance of the paragraph in a single document while removing the duplicate occurrences
from the other documents. For example, assume documents A, B, and C contain paragraphs {a1, a2}, {b1, b2}, and {c1, c2}, respectively, where a1 is a duplicate of b1 and c1 is a duplicate of b2. Then the algorithm should filter out the duplicated paragraphs b1 and b2 from document B, thereby preserving the integrity and completeness of both documents A and C.
Cross-Lingual Instruction In the diverse and multilingual context of the South-East Asia region, it is quite common for users to communicate in various languages. This creates a
particularly challenging scenario for chat models, which must be adept at understanding
and responding to queries in multiple languages. For example, if the user asks the chat
model a question in Indonesian, such as “Draf undangan pernikahan dalam bahasa Vietnam”
(English: Wedding invitation draft in Vietnamese), the user would expect the model to reply in
Vietnamese. Currently, in our internal evaluations, the Sailor models do not address this challenge well. We plan to build cross-lingual instruction datasets to address the problem.
More South-East Asian Languages To broaden the impact of open language models for
SEA languages, we are dedicated to expanding our coverage to include more languages
from the region. We plan to achieve the goal by gathering high-quality training corpora
from all CommonCrawl snapshots and other open resources. Moreover, we aim to explore
language generalization techniques to transfer knowledge from high-resource languages
to low-resource languages, thereby enhancing the capabilities of Sailor models for the un-
derserved languages in the SEA region.
Acknowledgement
We extend our sincere gratitude to Zhikai Huang, Joseph Wong, and Xinyi Wan for their
regular maintenance of the cluster of Sea AI Lab, ensuring its stable operation and enabling
our jobs to run smoothly. We are deeply thankful to Xiaosen Zheng, Fan Zhou, Zhoujun
Cheng, Binyuan Hui, Junyang Lin and Terry Yin for the fruitful discussions. We appreciate
HuggingFace for providing a platform for open-source models and datasets, which have
been invaluable resources in building our pre-training corpus and advancing our research.
References
01.AI, Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang,
Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, Kaidong Yu, Peng Liu, Qiang Liu,
Shawn Yue, Senbin Yang, Shiming Yang, Tao Yu, Wen Xie, Wenhao Huang, Xiaohui Hu,
Xiaoyi Ren, Xinyao Niu, Pengcheng Nie, Yuchi Xu, Yudong Liu, Yue Wang, Yuxuan Cai,
Zhenyu Gu, Zhiyuan Liu, and Zonghong Dai. Yi: Open foundation models by 01.ai,
2024.
AI Singapore. Sea-lion (southeast asian languages in one network): A family of large lan-
guage models for southeast asia. https://github.com/aisingapore/sealion, 2023.
Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Sori-
cut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav
Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin
Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy,
Michael Isard, Paul Ronald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm
Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford,
Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim
Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaı̈s White, Anders
Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez,
Misha Khalman, Jakub Sygnowski, and et al. Gemini: A family of highly capable mul-
timodal models. CoRR, abs/2312.11805, 2023. doi: 10.48550/ARXIV.2312.11805. URL
https://doi.org/10.48550/arXiv.2312.11805.
Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of
monolingual representations. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 4623–4637. Associa-
tion for Computational Linguistics, 2020. URL https://doi.org/10.18653/v1/2020.
acl-main.421.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin
Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayi-
heng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren,
Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei
Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng
Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang,
Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tian-
hang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla,
Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian
Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 lan-
guage variants. CoRR, abs/2308.16884, 2023. URL https://doi.org/10.48550/arXiv.
2308.16884.
Andrei Z. Broder. On the resemblance and containment of documents. Proceedings. Com-
pression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29, 1997. URL
https://api.semanticscholar.org/CorpusID:11748509.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya
Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner,
Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language mod-
els are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and
H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–
1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_
files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Tyler A. Chang, Catherine Arnett, Zhuowen Tu, and Benjamin K. Bergen. When is multi-
linguality a curse? language modeling for 250 high- and low-resource languages. CoRR,
abs/2311.09205, 2023. doi: 10.48550/ARXIV.2311.09205. URL https://doi.org/10.
48550/arXiv.2311.09205.
Jonathan H. Clark, Jennimaria Palomaki, Vitaly Nikolaev, Eunsol Choi, Dan Garrette,
Michael Collins, and Tom Kwiatkowski. Tydi QA: A benchmark for information-seeking
question answering in typologically diverse languages. Trans. Assoc. Comput. Linguistics,
8:454–470, 2020. URL https://doi.org/10.1162/tacl_a_00317.
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partition-
ing. ArXiv, abs/2307.08691, 2023. URL https://api.semanticscholar.org/CorpusID:
259936734.
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R’e. Flashattention:
Fast and memory-efficient exact attention with io-awareness. ArXiv, abs/2205.14135,
2022. URL https://api.semanticscholar.org/CorpusID:249151871.
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles
Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell,
Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf,
Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang,
and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL
https://zenodo.org/records/10256836.
Kenneth Heafield. KenLM: Faster and smaller language model queries. In Chris Callison-
Burch, Philipp Koehn, Christof Monz, and Omar F. Zaidan (eds.), Proceedings of the
Sixth Workshop on Statistical Machine Translation, pp. 187–197, Edinburgh, Scotland, July
2011. Association for Computational Linguistics. URL https://aclanthology.org/
W11-2123.
Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven-
dra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume
Lample, Lucile Saulnier, L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock,
Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.
Mistral 7b. ArXiv, abs/2310.06825, 2023a. URL https://api.semanticscholar.org/
CorpusID:263830494.
Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-rmsnorm and pre-crmsnorm
transformers: Equivalent and efficient pre-ln transformers. ArXiv, abs/2305.14858,
2023b. URL https://api.semanticscholar.org/CorpusID:258865592.
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and
Tomas Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint
arXiv:1612.03651, 2016.
Seungduk Kim, Seungtaek Choi, and Myeongho Jeong. Efficient and effective vocabulary
expansion towards multilingual large language models. ArXiv, abs/2402.14714, 2024.
URL https://api.semanticscholar.org/CorpusID:267782714.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR,
abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:6628106.
Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya
Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. MADLAD-400: A mul-
tilingual and document-level large audited dataset. In Alice Oh, Tristan Nau-
mann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Ad-
vances in Neural Information Processing Systems 36: Annual Conference on Neural In-
formation Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December
10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/
d49042a5d49818711c401d34172f9900-Abstract-Datasets_and_Benchmarks.html.
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris
Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language mod-
els better. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Pro-
ceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pp. 8424–8445, Dublin, Ireland, May 2022. Association for Computa-
tional Linguistics. doi: 10.18653/v1/2022.acl-long.577. URL https://aclanthology.
org/2022.acl-long.577.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier,
Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max
Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry
Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu,
Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß,
Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas
Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone,
Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier
Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Se-
bastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa
Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang,
Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Star-
coder 2 and the stack v2: The next generation, 2024.
Chenghao Mou, Chris Ha, Kenneth Enevoldsen, and Peiyuan Liu. Chenghaomou/text-
dedup: Reference snapshot, September 2023. URL https://doi.org/10.5281/zenodo.
8364980.
Quan Nguyen, Huy Pham, and Dung Dao. Vinallama: Llama-based vietnamese foun-
dation model. CoRR, abs/2312.11011, 2023a. doi: 10.48550/ARXIV.2312.11011. URL
https://doi.org/10.48550/arXiv.2312.11011.
Xuan-Phi Nguyen, Wenxuan Zhang, Xin Li, Mahani Aljunied, Qingyu Tan, Liying Cheng,
Guanzheng Chen, Yue Deng, Sen Yang, Chaoqun Liu, Hang Zhang, and Lidong Bing.
Seallms - large language models for southeast asia. CoRR, abs/2312.00738, 2023b. doi:
10.48550/ARXIV.2312.00738. URL https://doi.org/10.48550/arXiv.2312.00738.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro
Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay.
The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data,
and web data only. arXiv preprint arXiv:2306.01116, 2023. URL https://arxiv.org/abs/
2306.01116.
Kunat Pipatanakul, Phatrasek Jirabovonvisut, Potsawee Manakul, Sittipong Sripaisarn-
mongkol, Ruangsak Patomwong, Pathomporn Chokchainant, and Kasima Tharnpip-
itchai. Typhoon: Thai large language models. CoRR, abs/2312.13951, 2023. doi:
10.48550/ARXIV.2312.13951. URL https://doi.org/10.48550/arXiv.2312.13951.
Edoardo Maria Ponti, Goran Glavas, Olga Majewska, Qianchu Liu, Ivan Vulic, and Anna
Korhonen. XCOPA: A multilingual dataset for causal commonsense reasoning. In Pro-
ceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP
2020, Online, November 16-20, 2020, pp. 2362–2376. Association for Computational Lin-
guistics, 2020. URL https://doi.org/10.18653/v1/2020.emnlp-main.185.
Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. BPE-dropout: Simple and effective
subword regularization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault
(eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics,
pp. 1882–1892, Online, July 2020. Association for Computational Linguistics. doi: 10.
18653/v1/2020.acl-main.170. URL https://aclanthology.org/2020.acl-main.170.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+
questions for machine comprehension of text. In Proceedings of the 2016 Conference on Em-
pirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, Novem-
ber 1-4, 2016, pp. 2383–2392. The Association for Computational Linguistics, 2016. URL
https://doi.org/10.18653/v1/d16-1264.
Ahad Rana. Common crawl – building an open web-scale crawl using hadoop, 2010. URL
https://www.slideshare.net/hadoopusergroup/common-crawlpresentation.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare
words with subword units. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.
1715–1725, Berlin, Germany, August 2016. Association for Computational Linguistics.
doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.
Noam Shazeer. Glu variants improve transformer, 2020.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and
Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language mod-
els using model parallelism. ArXiv, abs/1909.08053, 2019. URL https://api.
semanticscholar.org/CorpusID:202660670.
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel
Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and
deduplicated version of RedPajama. https://www.cerebras.net/blog/
slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama,
2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer:
Enhanced transformer with rotary position embedding, 2022.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux,
Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien
Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and
efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/
ARXIV.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine
Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel,
Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu,
Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami,
Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin
Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh
Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu,
Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin
Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schel-
ten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh
Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov,
Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez,
Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and
fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/ARXIV.2307.09288.
URL https://doi.org/10.48550/arXiv.2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings
of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp.
6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei
Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, Yibin Liu, Jialong Wu, Shengwei Ding, Long
Li, Zhiwei Huang, Xinle Deng, Teng Yu, Gangan Ma, Han Xiao, Zixin Chen, Danjun
Xiang, Yunxia Wang, Yuanyuan Zhu, Yi Xiao, Jing Wang, Yiru Wang, Siran Ding, Jiayang
Huang, Jiayi Xu, Yilihamu Tayier, Zhenyu Hu, Yuan Gao, Chengfeng Zheng, Yueshu Ye,
Yihang Li, Lei Wan, Xinyue Jiang, Yujie Wang, Siyu Cheng, Zhule Song, Xiangru Tang,
Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang, and Wangchunshu
Zhou. Weaver: Foundation models for creative writing. CoRR, abs/2401.17268, 2024. doi:
10.48550/ARXIV.2401.17268. URL https://doi.org/10.48550/arXiv.2401.17268.
Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li,
Cheng Cheng, Weiwei Lü, Rui Hu, Chenxia Li, Liu Yang, Xilin Luo, Xuejie Wu, Lu-
nan Liu, Wenjun Cheng, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Lei Lin, Xiaokun
Wang, Yutuan Ma, Chuanhai Dong, Yanqi Sun, Yifu Chen, Yongyi Peng, Xiaojuan Liang,
Shuicheng Yan, Han Fang, and Yahui Zhou. Skywork: A more open bilingual foun-
dation model. CoRR, abs/2310.19341, 2023. doi: 10.48550/ARXIV.2310.19341. URL
https://doi.org/10.48550/arXiv.2310.19341.
Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco
Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality mono-
lingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe
Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isa-
hara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and
Stelios Piperidis (eds.), Proceedings of The 12th Language Resources and Evaluation Confer-
ence, LREC 2020, Marseille, France, May 11-16, 2020, pp. 4003–4012. European Language
Resources Association, 2020a. URL https://aclanthology.org/2020.lrec-1.494/.
Yiheng Xu, Hongjin SU, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou,
Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang,
Caiming Xiong, and Tao Yu. Lemur: Harmonizing natural language and code for lan-
guage agents. In The Twelfth International Conference on Learning Representations, 2024.
URL https://openreview.net/forum?id=hNhwSmtXRh.
Zheng Xin Yong, Ruochen Zhang, Jessica Forde, Skyler Wang, Arjun Subramonian, Holy
Lovenia, Samuel Cahyawijaya, Genta Winata, Lintang Sutawika, Jan Christian Blaise
Cruz, Yin Lin Tan, Long Phan, Long Phan, Rowena Garcia, Thamar Solorio, and Alham
Aji. Prompting multilingual large language models to generate code-mixed texts: The
case of South East Asian languages. In Genta Winata, Sudipta Kar, Marina Zhukova,
Thamar Solorio, Mona Diab, Sunayana Sitaram, Monojit Choudhury, and Kalika Bali
(eds.), Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-
Switching, pp. 43–63, Singapore, December 2023. Association for Computational Linguis-
tics. doi: 10.18653/v1/2023.calcs-1.5. URL https://aclanthology.org/2023.calcs-1.
5.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a
machine really finish your sentence? In Proceedings of the 57th Conference of the Association
for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume
1: Long Papers, pp. 4791–4800. Association for Computational Linguistics, 2019. doi:
10.18653/V1/P19-1472. URL https://doi.org/10.18653/v1/p19-1472.
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source
small language model. CoRR, abs/2401.02385, 2024. doi: 10.48550/ARXIV.2401.02385.
URL https://doi.org/10.48550/arXiv.2401.02385.
Wenxuan Zhang, Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. M3Exam:
A multilingual, multimodal, multilevel benchmark for examining large language mod-
els. In Advances in Neural Information Processing Systems 36: Annual Conference on Neu-
ral Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, Decem-
ber 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/
117c5c8622b0d539f74f6d1fb082a2e9-Abstract-Datasets_and_Benchmarks.html.
Jun Zhao, Zhihao Zhang, Luhui Gao, Qi Zhang, Tao Gui, and Xuanjing Huang.
Llama beyond English: An empirical study on language capability transfer. CoRR,
abs/2401.01055, 2024. doi: 10.48550/ARXIV.2401.01055. URL https://doi.org/10.
48550/arXiv.2401.01055.
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin
Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluat-
ing foundation models. CoRR, abs/2304.06364, 2023. doi: 10.48550/ARXIV.2304.06364.
URL https://doi.org/10.48550/arXiv.2304.06364.
The MADLAD-400 dataset suffers from a unicode escaping issue, most notably the excessive occurrence of the literal sequence "\\n", as discussed in the HuggingFace forum32. This issue can disrupt in-context learning and chatbot applications, since "\n" is a common delimiter between demonstrations and task inputs. Consequently, it is necessary to replace "\\n" with either "\n\n" or "\n".
We fix the problem with a simple heuristic rule, as shown in Algorithm 1. For example, given the original input "A.\\nB.\\nC. D.\\nE. F.\\nG.", where each upper-case letter such as A and G stands for an individual sentence, the fixed output is "A.\nB.\n\nC. D.\n\nE. F.\n\nG.". The key difficulty is deciding to emit "\n\n" rather than "\n" between B. and C. The splitting rules are: (1) when two neighboring segments together contain more than two periods, they are separated by a double newline "\n\n"; (2) in all other cases, a single newline "\n" is used as the delimiter. A short sketch of this rule and a specific example from the dataset are given below.
32 https://huggingface.co/datasets/allenai/MADLAD-400/discussions/2
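To make the rule concrete, below is a minimal Python sketch of the heuristic under the splitting rule stated above; the function name fix_escaped_newlines is ours, and Algorithm 1 may differ in its exact conditions.

def fix_escaped_newlines(text: str) -> str:
    """Turn escaped newlines (a literal backslash followed by 'n') into real newlines.

    Heuristic sketch: split on the escaped newline; if two adjacent segments
    together contain more than two periods (i.e., more than two sentences),
    join them with a blank line, otherwise with a single newline.
    """
    segments = text.split("\\n")  # split on the two-character sequence backslash + n
    fixed = segments[0]
    for prev, curr in zip(segments, segments[1:]):
        sep = "\n\n" if prev.count(".") + curr.count(".") > 2 else "\n"
        fixed += sep + curr
    return fixed

# The worked example from the paragraph above.
assert fix_escaped_newlines("A.\\nB.\\nC. D.\\nE. F.\\nG.") == "A.\nB.\n\nC. D.\n\nE. F.\n\nG."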
Original Example with Escape Issue from the MADLAD-400 Indonesian Subset.
Oleh karena itu, jika ada sekelompok orang mengatasnamakan suku tertentu menolak pernyataan Puan atau berencana melaporkan ke proses hukum, tampaknya kurang pas dan bisa jadi belum melakukan pengkajian mendalam dan hilostik.
\"Seharusnya wacana publik tertuju pada bagaimana perwujudan hak setiap individu sebagai WNI yang tinggal di Sumbar dan di semua provinsi di Indonesia dapat dijamin dan diwujudkan dalam kehidupan sehari-hari,\"ucap Emrus.
\"Konstitusi kita, UUD 1945, menggunakan kata 'setiap' warga negara, bukan menggunakan diksi 'kelompok' atas dasar kategori sosial tertentu, termasuk etnis. Artinya, setiap individu WNI memiliki hak dan kewajiban yang sama sekalipun dari suku atau etnis yang berbeda,\"tambahnya.
Tag: Puan Maharani, Partai Demokrasi Indonesia Perjuangan (PDIP), Pancasila
We list the top-3 most frequent textual duplicates for each language in Figure 9. To keep the displayed examples distinct from one another, we group duplicates into frequency buckets of 100 and select the most frequent example from each of the first three buckets for demonstration, as sketched below.
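As an illustration, the selection could look like the following sketch; the function name is ours, and we assume that "the first three groups" refers to the three highest-frequency buckets.

from collections import Counter

def pick_figure_examples(dup_counts: Counter, bucket_size: int = 100, k: int = 3):
    """Keep the most frequent duplicate from each of the k highest frequency buckets,
    where frequencies are grouped in units of bucket_size."""
    best_per_bucket = {}
    for text, freq in dup_counts.items():
        b = freq // bucket_size
        if b not in best_per_bucket or freq > dup_counts[best_per_bucket[b]]:
            best_per_bucket[b] = text
    top_buckets = sorted(best_per_bucket, reverse=True)[:k]
    return [(best_per_bucket[b], dup_counts[best_per_bucket[b]]) for b in top_buckets]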
Table 8 shows model performance on the M3Exam dataset following the evaluation approach adopted for the HellaSwag benchmark in both the EleutherAI LM Evaluation Harness and the OpenCompass platform. Concretely, we replace the answer part with each possible option, append it to the pre-defined prompt of the given question, rank the concatenated strings by the perplexity assigned by the model, and take the option with the lowest perplexity as the model prediction (a sketch of this scoring follows). While models evaluated in this way generally score lower than under the evaluation method used in M3Exam, our inspection reveals that they exhibit significantly reduced option bias under this protocol.
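A minimal sketch of the scoring procedure using the Hugging Face transformers API is shown below; the helper name and the commented-out checkpoint are illustrative only, and prompt templates, truncation, and batching are omitted.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def predict_by_perplexity(model, tokenizer, prompt: str, options: list) -> int:
    """Return the index of the option whose concatenation with the prompt
    receives the lowest perplexity from the model."""
    perplexities = []
    for option in options:
        enc = tokenizer(prompt + option, return_tensors="pt")
        with torch.no_grad():
            # The causal LM loss is the mean token negative log-likelihood,
            # so exp(loss) is the perplexity of the concatenated string.
            loss = model(**enc, labels=enc["input_ids"]).loss
        perplexities.append(torch.exp(loss).item())
    return min(range(len(options)), key=perplexities.__getitem__)

# Illustrative usage (checkpoint name is an example only):
# tokenizer = AutoTokenizer.from_pretrained("sail/Sailor-7B")
# model = AutoModelForCausalLM.from_pretrained("sail/Sailor-7B")
# prediction = predict_by_perplexity(model, tokenizer, question_prompt, candidate_options)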
Figure 9: The top-3 most frequent textual duplicates identified across CC100 and MADLAD-400 for three SEA languages, along with their respective frequencies. Inappropriate content and personally identifiable information are replaced with [Sensitive Data]. Lengthy content is truncated for clarity.
Table 8: Experimental results of different models on M3Exam using the HellaSwag-style evaluation protocol.