OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
INF M-A-P
arXiv:2411.04905v1 [cs.CL] 7 Nov 2024
ABSTRACT
Large language models (LLMs) for code have become indispensable in various
domains, including code generation, reasoning tasks and agent systems. While
open-access code LLMs are increasingly approaching the performance levels of
proprietary models, high-quality code LLMs suitable for rigorous scientific inves-
tigation, particularly those with reproducible data processing pipelines and trans-
parent training protocols, remain limited. This scarcity is due to various chal-
lenges, including resource constraints, ethical considerations, and the competitive
advantages of keeping models advanced. To address this gap, we introduce Open-
Coder, a top-tier code LLM that not only achieves performance comparable to
leading models but also serves as an “open cookbook” for the research commu-
nity. Unlike most prior efforts, we release not only model weights and inference
code, but also the reproducible training data, complete data processing pipeline,
rigorous experimental ablation results, and detailed training protocols for open
scientific research. Through this comprehensive release, we identify the key in-
gredients for building a top-tier code LLM: (1) code-optimized heuristic rules for
data cleaning and methods for data deduplication, (2) recall of text corpora related
to code, and (3) high-quality synthetic data in both the annealing and supervised fine-
tuning stages. By offering this level of openness, we aim to broaden access to all
aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model
and an open foundation to accelerate research, and enable reproducible advance-
ments in code AI.
Figure 1: OpenCoder surpasses all previous fully open models (i.e., with open model weights and
reproducible datasets) and other open-access models (i.e., with open model weights only) at the 6B+
parameter scale, pushing the frontier of fully open models to new heights.
∗ The first two authors contributed equally to this work. Work done during the internships of Siming Huang, Tianhao Cheng and Jason Klein Liu at INF. † Correspondence to Wei Chu ([email protected]) and Zili Wang ([email protected]).
CONTENTS
1 Introduction
2 Pretraining Data
  2.1 RefineCode
    2.1.1 Raw Code
    2.1.2 Code-Related Web Data
    2.1.3 Summary
  2.2 Annealing Data
3 Pretraining
  3.1 Model Architecture
  3.2 Training Details
4 Post Training
  4.1 Data Composition
  4.2 Two-Stage Instruction-Tuning
  4.3 Training Details
  4.4 Decontamination
5 Experimental Results
  5.1 Evaluation on Base Models
  5.2 Evaluation on Instruct Model
6 Analysis
  6.1 Analysis of the Deduplication Level
  6.2 Analysis on the Importance of High-quality Data in the Annealing Phase
  6.3 Analysis on the Effect of GitHub Stars
  6.4 Analysis on the Two-stage Instruction Tuning Strategy
7 Related Work
A Filtering Rules
  A.1 Design of Filtering Rules
  A.2 Examples of Filtering Rules
1 INTRODUCTION
Large Language Models (LLMs) have achieved significant success in various domains (Wang et al.,
2023; Que et al., 2024; Liu et al., 2024a;c; Wu et al., 2024), particularly in code-related tasks, rev-
olutionizing the current paradigm of software development (Qian et al., 2024; Wang et al., 2024).
Code-specific LLMs have emerged as a critical area within LLM research, with tools such as Chat-
GPT, Copilot, and Cursor reshaping the workflows of developers. Despite this, the performance
of open-source LLMs focused on code (Li et al., 2023; Tao et al.; Lozhkov et al., 2024a; Zhang
et al., 2024a) still falls short compared to state-of-the-art LLMs (Hui et al., 2024; Zhu et al., 2024),
largely because these leading models keep their training datasets—an essential factor in LLM de-
velopment—proprietary. This lack of transparency limits the broader research community’s ability
to establish strong baselines and gain deeper insights into the workings of top-tier code LLMs.
To remedy this gap, we set forth three primary goals in releasing OpenCoder and its development
materials: (1) to provide scholars with a meticulously curated and fully transparent strong baseline
code LLM for research on mechanistic interpretability and the data distribution of code LLMs;
(2) to conduct in-depth investigations into the pretraining and instruction data curation pipelines
for the development of stronger code LLMs; and (3) by enabling a detailed review of the models'
development, to unlock more diverse customized solutions based on transparent code LLMs. Through
OpenCoder, we strive to stimulate and accelerate the growth of the open-source code LLM community.
Our comprehensive set of controlled experiments highlights key design choices for data curation for
top-tier code LLMs in different training stages: (1) During the pretraining phase, the importance
of data cleaning is highlighted (Zhou et al., 2024), emphasizing the removal of non-informative
data such as pure hexadecimal code and excessively short code snippets that do not contribute to the
learning process. (2) The impact of deduplication is significant, with file-level deduplication proving
to be more effective than repository-level deduplication by maintaining data diversity and enhancing
model performance on downstream tasks (Li et al., 2023). (3) The influence of GitHub stars is also
examined, revealing that filtering data based on GitHub star count can reduce data diversity
and affect the overall data distribution, contributing to a suboptimal result (Allal et al., 2023). (4)
In the annealing phase, the use of high-quality data is crucial for further enhancing the model’s
capabilities, indicating that data quality is more important than quantity in the later stages of model
training. (5) Finally, during the instruction tuning phase, a two-stage instruction tuning strategy
is shown to be effective, allowing the model to acquire broad capabilities initially and then refine
them with code-specific tasks, resulting in improved performance on both theoretical and practical
coding tasks. These five key points underscore the importance of data quality, diversity, and targeted
enhancement strategies in developing a high-performing code generation model like OpenCoder.
This work introduces OpenCoder, a completely open-source code LLM built on a transparent
data processing pipeline and reproducible datasets. As shown in Table 1, we provide an open cookbook
for building a code LLM from scratch, releasing the data cleaning pipeline, the reproducible pretraining
dataset, a large-scale SFT corpus, and intermediate checkpoints. OpenCoder, through its meticulous
data processing and advanced training methods, has surpassed expectations by achieving top-tier
results on multiple code LLM evaluation benchmarks. The introduction of the open cookbook of
code LLM is designed to push forward the field of code intelligence studies and to encourage its
broad use in the community of code intelligence.
2 PRETRAINING DATA
Pretraining data plays a crucial role in the development of LLMs, where the scale, quality, and
diversity of the data greatly affect the model’s overall performance. Therefore, we introduce an
efficient and effective methodology for producing data tailored for our code LLM pretraining. In
this section, we will comprehensively illustrate the data processing strategies used in both the general
pretraining stage and the annealing stage.
Table 1: The comparison of released resources between our OpenCoder with other popular open-
sourced code LLMs. HumanEval scores are reported for the corresponding chat models.
2.1 REFINECODE
Pretraining data forms the foundation for the capabilities of large language models. In the LLM
open-source community, The Stack v2 (Lozhkov et al., 2024a) has provided a valuable code dataset,
which significantly facilitates the training of code LLMs. However, the quality of the training part in
The Stack v2 is insufficient to train LLMs with top-rated performance. To address this, we present
RefineCode, a high-quality, reproducible dataset of 960 billion tokens across 607 programming lan-
guages, incorporating over 130 language-specific rules with customized weight assignments. This
dataset is composed of two main parts: raw code and code-related web data. Specifically, we collect
the raw code primarily from GitHub repositories up to November 2023 with non-GitHub data from
The Stack v2. Additionally, the code-related web data is primarily sourced from web corpora. A
detailed comparison with previous versions of The Stack is provided in the Appendix D. Besides,
to ensure both quality and diversity, as shown in Figure 2, we have designed a sophisticated data
processing pipeline to produce the code pretraining corpus. In the following sections, we provide
a detailed description of this processing pipeline and of the RefineCode dataset.
2.1.1 RAW CODE
To ensure the curation of high-quality raw code data, we have developed a code-specific data processing pipeline comprising preprocessing, deduplication, transformation, filtering, and data sampling modules. The following sections provide the details of these processes.
Preprocessing Initially, we exclude files exceeding 8 MB in size, as these are predominantly non-text files that incur considerable resource overhead. Furthermore, given the miscellaneous file types present on GitHub, we restrict our selection to file types related to programming languages by their file extensions, referring to linguist1, and filter out types with low volume or low quality. Finally, we preserve 607 different types of programming language files. A comprehensive list of the included and excluded programming languages is provided in Appendix E.
1 https://github.com/github-linguist/linguist/blob/main/lib/linguist/languages.yml
Deduplication The purpose of deduplication is to construct an unbiased and diverse training set while significantly reducing the data volume. Owing to the extremely high repetition of source code on GitHub, we place the deduplication process early in the pipeline and adopt an aggressive file-level deduplication strategy (see the detailed analysis in Section 6.1). More specifically, we leverage both exact deduplication and fuzzy deduplication to eliminate documents containing identical or near-identical code content, as follows:
Exact Deduplication: Due to the prevalence of forking and copy-pasting within the codebase, nearly 75% of files are completely duplicated. Because of this, and differing from general text deduplication pipelines, identity removal is applied to the code data as the first step in this module. We compute the SHA256 hash value of each document; files with identical hash values are compared, and only the code file with the highest star count and the latest commit time is retained.
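A minimal sketch of this identity-removal step is shown below; the `content`, `star_count`, and `commit_time` record fields are illustrative assumptions, not the pipeline's actual schema.

```python
import hashlib

def exact_dedup(files):
    """Keep one file per SHA256 digest: the one with the most stars,
    breaking ties by the most recent commit time."""
    best = {}  # sha256 hex digest -> file record
    for f in files:
        digest = hashlib.sha256(f["content"].encode("utf-8", errors="ignore")).hexdigest()
        kept = best.get(digest)
        if kept is None or (f["star_count"], f["commit_time"]) > (kept["star_count"], kept["commit_time"]):
            best[digest] = f
    return list(best.values())

# Toy usage: two byte-identical files collapse into the higher-starred one.
files = [
    {"content": "print('hi')", "star_count": 3, "commit_time": "2023-01-01"},
    {"content": "print('hi')", "star_count": 9, "commit_time": "2023-05-01"},
]
print(len(exact_dedup(files)))  # -> 1
```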
Fuzzy Deduplication: Following the fuzzy deduplication settings commonly used in general data pipelines, we split the raw text into 5-gram pieces and then compute 2048 MinHash functions (Broder, 1997). Additionally, we apply LSH (Leskovec et al., 2014) with the number of bands set to 16 and the number of rows set to 128, retaining only the distinct files with the highest star count and latest commit time. This process removes about 6% of the file volume.
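A minimal sketch of this fuzzy step using the open-source datasketch library follows; the paper does not name its implementation, and the whitespace-based 5-gram tokenization here is an illustrative assumption.

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 2048          # number of MinHash permutations, as in the paper
BANDS, ROWS = 16, 128    # LSH banding parameters, as in the paper

def ngrams(text, n=5):
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def fuzzy_dedup(files):
    """Insert files into an LSH index, visiting high-star / recent files first;
    a file is dropped if it collides with an already-kept near-duplicate."""
    lsh = MinHashLSH(num_perm=NUM_PERM, params=(BANDS, ROWS))
    kept = []
    ordered = sorted(files, key=lambda x: (x["star_count"], x["commit_time"]), reverse=True)
    for idx, f in enumerate(ordered):
        m = MinHash(num_perm=NUM_PERM)
        for gram in ngrams(f["content"]):
            m.update(gram.encode("utf-8"))
        if not lsh.query(m):          # no near-duplicate kept so far
            lsh.insert(str(idx), m)
            kept.append(f)
    return kept
```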
Transformation Filtering is generally adequate for removing files that fail to meet specific criteria. However, certain issues, though small in text size, are pervasive across numerous files, and excluding every affected file would discard too much data. Instead, we opt to transform these files to rectify the identified issues before the filtering module. Concretely, we implement two types of transformation rules as follows:
Copyright Removal: Over 15% of code files include a copyright notice at the beginning of the content, such as “Copyright Intel Corporation (C) 2014-2016”. These notices are highly repetitive and irrelevant to coding tasks, and may affect the performance of the LLM. Consequently, we specifically identify and remove such copyright notices from the initial code comments.
PII Reduction: Personally Identifiable Information (PII) encompasses content such as passwords, emails, and IP addresses. Training on data containing PII carries significant privacy risks. Therefore, we employ complex regular expressions to detect such information and replace it with placeholders such as “<name>” and “<password>”.
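A minimal sketch of this PII reduction step is shown below; the regular expressions are simplified illustrations rather than the pipeline's actual patterns, and the placeholder tokens are examples.

```python
import re

# Simplified illustrative patterns; the actual pipeline uses more complex rules.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<email>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip_address>"),
    (re.compile(r"(?i)(password|passwd|pwd)\s*[:=]\s*\S+"), r"\1=<password>"),
]

def redact_pii(text: str) -> str:
    """Replace emails, IPv4 addresses, and password assignments with placeholders."""
    for pattern, repl in PII_PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(redact_pii('db_password = "hunter2"  # admin@example.com, 10.0.0.1'))
```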
Filtering The quality of the original code files on GitHub exhibits significant variability, where
lower-quality code potentially hinders the LLM pretraining process. Given the distinct nature of
code compared to natural language, the criteria for high-quality code differ significantly from those
for natural language. Furthermore, different programming languages also exhibit distinct properties.
Based on this, we believe that designing a set of detailed heuristic filtering rules tailored specifically
to the characteristics of pretraining data is important to enhance the model’s capabilities. Drawing
inspiration from the principles of high-quality code data proposed in Gunasekar et al. (2023), we
consider the following guidelines when designing our filters: 1) Filter out files with poor self-
containment; 2) Filter out files with poor or minimal logical structure; 3) Remove files that
deviate significantly from standard formatting.
Based on these guidelines and the characteristics of our dataset, our work presents the first heuristic
filtering framework by considering the unique characteristics of different programming languages.
Based on RedPajama (Computer, 2023), this framework extends and refines the existing rules from
StarCoder (Li et al., 2023) to better align with the unique properties of code datasets, resulting in
more precise and higher-quality data cleansing. We developed the following three categories of
filtering rules:
Figure 3: Visualization on the PCA data distributions of RefineCode and The Stack v2.
1. Natural Language Filtering Rules: These rules filter data based on common properties
for all text files, such as file size, number of lines, and other general metrics. Both text and
code files share these filtering rules.
2. General Code Filtering Rules: These rules apply to all code files by filtering data based
on general code characteristics, such as the number of variables, average function length,
and other common features.
3. Language-Specific Filtering Rules: These rules are designed according to the unique
characteristics of specific programming languages, such as the frequency of “pass” state-
ments in Python or the use of “goto” statements in C. We have developed these rules for
the following eight commonly used programming languages: Python, C, C++, C#, Java,
JavaScript, Go, and HTML.
Heuristic rules involve extensive threshold setting. When defining these rules and determining
thresholds, we consistently follow a guiding principle: to remove harmful data as much as pos-
sible, while ensuring the overall distribution of the dataset is not significantly affected. We outline
our motivations for rule design in Appendix A.1, along with a detailed explanation of the tuning
process for the corresponding thresholds. Besides, we show the details of several representative
rules in Appendix A.2.
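To make the shape of such rules concrete, the following is a minimal sketch of a language-specific filter in the spirit of the Python “pass” rule mentioned above; the thresholds are illustrative placeholders, not the tuned values described in Appendix A.

```python
# Illustrative heuristic filters; thresholds are placeholders, not the tuned values.
def keep_python_file(text: str,
                     max_pass_ratio: float = 0.3,
                     min_lines: int = 5,
                     max_avg_line_len: int = 200) -> bool:
    lines = [l for l in text.splitlines() if l.strip()]
    if len(lines) < min_lines:                      # too short to carry logic
        return False
    avg_len = sum(len(l) for l in lines) / len(lines)
    if avg_len > max_avg_line_len:                  # likely minified / generated
        return False
    n_pass = sum(l.strip() == "pass" for l in lines)
    if n_pass / len(lines) > max_pass_ratio:        # mostly stub functions
        return False
    return True

assert keep_python_file("def f():\n    pass\n" * 10) is False  # stub-heavy file is dropped
```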
Data Sampling We try to preserve the original data distribution as much as possible to maximize the utilization of our cleaned high-quality dataset. However, we downsample certain high-resource programming languages before using the dataset in pretraining. Specifically, we downsample Java data from 409 GB to 200 GB due to its excessive volume compared to other common languages, and we downsample HTML data from 213 GB to 64 GB, as HTML files often contain a significant amount of non-informative structured content and lack substantial coding logic. Finally, we produce about 730B tokens for the pretraining stage.
Notably, as illustrated in Figure 3, we use PCA to visualize embeddings extracted with CodeBERT (Feng et al., 2020) for The Stack v2 and RefineCode, and observe a clear distinction between these datasets. Specifically, The Stack v2 data shows a greater number of outliers, while the embeddings of RefineCode appear more tightly clustered. Moreover, after analyzing the outlier data, we observe that the outliers usually exhibit low-quality patterns, such as pure text comments, hexadecimal-only data, and excessively short code lacking computational logic, which can distort the distribution of the pretraining dataset and ultimately hinder the efficiency of pretraining.
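A visualization along these lines can be produced as sketched below; the use of the CLS-token output of microsoft/codebert-base as the file embedding is an assumption, since the paper does not specify the pooling.

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

@torch.no_grad()
def embed(snippets):
    """Return one CLS-token embedding per code snippet."""
    batch = tokenizer(snippets, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    hidden = model(**batch).last_hidden_state      # (batch, seq, 768)
    return hidden[:, 0].numpy()                    # CLS embedding

refinecode = ["def add(a, b):\n    return a + b"]   # toy samples
stack_v2 = ["48656c6c6f20776f726c64"]               # e.g. a hex-only blob

emb = embed(refinecode + stack_v2)
points = PCA(n_components=2).fit_transform(emb)
plt.scatter(points[:len(refinecode), 0], points[:len(refinecode), 1], label="RefineCode")
plt.scatter(points[len(refinecode):, 0], points[len(refinecode):, 1], label="The Stack v2")
plt.legend()
plt.savefig("pca_embeddings.png")
```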
2.1.2 CODE-RELATED WEB DATA
As shown in Figure 2, the processing pipeline for code-related web data comprises four main components. 1) FastText Model Training: To maintain a controllable vocabulary size in fastText and to enable tokenization of Chinese texts using spaces, we first apply a BPE (Byte Pair Encoding) tokenizer to segment the corpus; the open-source fastText framework is then used for model training. 2) Recall From Common Crawl: We perform recall on Common Crawl with this classifier to generate the code-related web corpus. 3) Code-Related Domain Discovery: We conduct statistical analysis of the recalled data by domain, defining a domain as the set of web pages sharing the same base URL (e.g., stackoverflow.com); domains in which over 10% of web pages are recalled are classified as code-related. Note that, given the scarcity of Chinese data, we provide detailed annotations of code- and mathematics-related domain names within the Common Crawl dataset in Appendix C. 4) URL Annotation: We manually annotate the URLs associated with code content within these identified domains. For instance, we identify all content under “stackoverflow.com/questions” as computer-technology questions; samples whose URLs match “stackoverflow.com/questions” but that are not correctly classified by fastText are then added to the code seed corpus. After three iterations, we obtain about 220 GB of code-related web data. Note that as the iterations progress, the quantity and diversity of the seed corpus improve.
We also apply the same recall pipeline to FineWeb (Penedo et al., 2024a), Skypile (Wei et al., 2023a), and the web portion of AutoMathText (Zhang et al., 2024b), producing 330 GB of code-related web data in total. Furthermore, we observe that only a very small portion of the data on GitHub is natural-language text; we therefore also train a classifier to determine whether this text is code-related and obtain an additional 178 GB of code-related data.
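A minimal sketch of the fastText recall classifier using the fasttext Python package is shown below; the training-file path, labels, and hyperparameters are illustrative assumptions.

```python
import fasttext

# Training file: one example per line, "__label__code ..." or "__label__other ...",
# where the text has already been segmented as described above.
model = fasttext.train_supervised(
    input="code_seed_corpus.txt",  # hypothetical path to the labelled seed corpus
    lr=0.1,
    epoch=5,
    wordNgrams=2,
    dim=128,
)

labels, probs = model.predict("how do I reverse a linked list in python", k=1)
if labels[0] == "__label__code" and probs[0] > 0.5:
    print("recalled as code-related")
```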
2.1.3 SUMMARY
Ultimately, we curate a high-quality code pretraining dataset, RefineCode, consisting of about 960 billion tokens. The composition of the data sources is illustrated in Table 2, while the distribution of different programming languages is displayed in Figure 4. For more details regarding the data composition of different programming languages, please refer to Appendix F. To demonstrate the efficacy of RefineCode, we train 1.5B code LLMs for up to 600B tokens on RefineCode and on the training subset of The Stack v2, respectively. The results in Figure 1 indicate that RefineCode significantly improves training efficiency compared to The Stack v2, highlighting the superiority of our dataset.
2.2 ANNEALING DATA
The annealing stage can be seen as a bridge between the general pretraining stage and the supervised fine-tuning (SFT) stage. Following the training strategy in MiniCPM (Hu et al., 2024), our model also undergoes a rapid learning-rate annealing phase after the general pretraining stage, in which very high-quality training data is used to further enhance the model's capabilities. In addition to RefineCode data drawn from the original distribution, we further incorporate the Algorithmic Corpus and synthetic data during the annealing phase. The detailed data mixture can be found in Table 3.
Original Distribution Data In the annealing stage, it is necessary to ensure that the overall data distribution remains similar to that of the pretraining phase, since a significant distribution shift can lead to catastrophic forgetting of the model's knowledge. We therefore ensure that 84% of the annealing data
comes from the original distribution of RefineCode. Note that given the limited computing budget available, this mixture ratio might not be ideal.
Figure 4: The distribution of top programming languages in RefineCode.
Algorithmic Corpus Algorithmic code files exhibit strong code logic and minimal dependency on external files, demonstrating excellent self-containment. Additionally, these files are more aligned with the distribution of smaller, independent tasks commonly encountered in real-world interactive scenarios. Therefore, we sample a certain proportion of the original pretraining data containing keywords such as “leetcode”, “def solution”, or “class solution” to create this corpus.
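A minimal sketch of this keyword-based sampling is shown below; the sampling ratio and record fields are illustrative assumptions.

```python
import random

ALGO_KEYWORDS = ("leetcode", "def solution", "class solution")

def sample_algorithmic_corpus(files, ratio=0.1, seed=0):
    """Select files containing algorithmic keywords, then keep a fixed proportion.
    The ratio here is illustrative, not the paper's actual value."""
    hits = [f for f in files
            if any(k in f["content"].lower() for k in ALGO_KEYWORDS)]
    random.Random(seed).shuffle(hits)
    return hits[: int(len(hits) * ratio)]
```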
Synthetic Data High-quality rewriting of pretraining data is also extremely important at this stage, as it helps the model memorize and embed knowledge for efficient retrieval (Allen-Zhu & Li, 2023). We select the Algorithmic Corpus as the seed because it encompasses a wide range of algorithmic logic, and we employ two forms of rewriting enhancement: high-quality code snippets and code textbooks.
1. High-Quality Code Snippets: Inspired by the synthetic CodeExercises dataset in Gunasekar et al. (2023), we use the Algorithmic Corpus as seeds and employ a strong LLM to synthesize a batch of self-contained, independent functions along with their corresponding test cases. We retain only the data that successfully passes the test cases and include it in the annealing-stage dataset (a minimal sketch of this validation loop is shown after this list). This approach is similarly extended to support multiple programming languages.
2. Code Textbooks: To enable the model to understand code from multiple perspectives,
we constructed educational text snippets based on the hqcode 2 dataset using Qwen2-72B-
Instruct (Yang et al., 2024). Hqcode is a multilingual code dataset synthesized with GPT-
4o-Mini, where each entry describes an independent task and provides a corresponding
function as a solution. We engaged LLMs to perform interactive analysis on the code
within this dataset, extracting and elaborating on abstract code knowledge. This approach
aims to enable the model to learn code from diverse perspectives.
2 https://huggingface.co/datasets/yuxiang630/hqcode
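A minimal sketch of the execution-based filtering used for the high-quality code snippets above, assuming Python snippets and plain assert-style tests; the actual sandboxing setup is not specified in the paper.

```python
import subprocess
import sys
import tempfile

def passes_tests(function_code: str, test_code: str, timeout: int = 10) -> bool:
    """Run the synthesized function together with its generated test cases
    in a separate Python process; keep the sample only if it exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(function_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

func = "def add(a, b):\n    return a + b"
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"
print(passes_tests(func, tests))  # True -> keep this sample
```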
Table 4: Overview of the key hyperparameters of OpenCoder, including 1.5B and 8B.
                         OpenCoder-1.5B      OpenCoder-8B
Layers                   24                  32
Model Dimension          2240                4096
Attention Heads          14                  32
Key / Value Heads        14                  8
Activation Function      SwiGLU              SwiGLU
Vocab Size               96640               96640
Positional Embedding     RoPE (θ = 10000)    RoPE (θ = 500000)
Context Window Size      4096                8192
3 PRETRAINING
3.1 MODEL ARCHITECTURE
In this section, we provide a detailed overview of our model architecture. As shown in Table 4, the models are available in two sizes: 1.5 billion and 8 billion parameters. The 1.5B model consists of 24 layers with a hidden size of 2240, 14 attention heads, and 14 key/value heads, supporting a context window size of 4096. The 8B model closely follows the Llama-3.1-8B architecture, with 32 layers, a hidden size of 4096, 32 attention heads, and 8 key/value heads. Both models use the SwiGLU activation function and have a vocabulary size of 96,640.
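Since the 8B model closely follows Llama-3.1-8B, its architecture can be approximated with a Hugging Face LlamaConfig as sketched below; this is an illustrative assumption based on Table 4, not the released configuration.

```python
from transformers import LlamaConfig

# Approximate OpenCoder-8B architecture (an assumption based on Table 4;
# the released checkpoints ship their own config).
opencoder_8b = LlamaConfig(
    vocab_size=96640,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,
    hidden_act="silu",            # the SwiGLU MLP uses the SiLU activation
    max_position_embeddings=8192,
    rope_theta=500000.0,
)
print(opencoder_8b)
```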
3.2 TRAINING DETAILS
The training process, based on the aforementioned model architecture, involves several critical details. The dataset encompasses both Chinese and English text, alongside 607 programming languages, the complete list of which is provided in Appendix E.
For the 1.5B model, due to incomplete data curation at the time, training was performed on 2 trillion tokens over four epochs. Following the pretraining phase, we conducted annealing training on an additional 100 billion tokens. The WSD (Warmup-Stable-Decay) learning-rate schedule, as used in MiniCPM (Hu et al., 2024), was employed, featuring a warm-up phase of 2,000 steps across 8 billion tokens. The peak learning rate was 3e-4, which remained constant after the warm-up and subsequently decayed exponentially to 1e-5 during the annealing phase. A micro-batch size of 4 and a global batch size of 1024 were used. Training was conducted using Megatron-LM (Shoeybi et al., 2020) with distributed optimization and DDP gradient overlap on a cluster of 256 H800 GPUs over a total of 109.5 hours, equating to 28,034 GPU hours.
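A minimal sketch of such a WSD schedule is given below; the peak and final learning rates follow the description above, while the step boundaries are illustrative assumptions.

```python
def wsd_lr(step, warmup_steps=2000, decay_start_step=480000, total_steps=505000,
           peak_lr=3e-4, final_lr=1e-5):
    """Warmup-Stable-Decay schedule: linear warmup, constant plateau,
    then exponential decay to final_lr. Step boundaries are illustrative."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < decay_start_step:
        return peak_lr
    # exponential decay from peak_lr to final_lr over the annealing steps
    progress = (step - decay_start_step) / (total_steps - decay_start_step)
    return peak_lr * (final_lr / peak_lr) ** progress

for s in (0, 1000, 2000, 100000, 490000, 505000):
    print(s, f"{wsd_lr(s):.2e}")
```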
For the 8B model, the WSD learning schedule was again employed, with a warm-up phase covering 8 billion tokens over 2,000 steps. This model was trained for 3.5 epochs on 2.5 trillion tokens, followed by a decay phase on an additional 100 billion tokens. Unlike the 1.5B model, which lacked the code-related recall data due to incomplete data processing, the 8B model incorporated this data during training. The learning-rate schedule mirrored that of the 1.5B model. The micro-batch size was set to 1, with a tensor-parallel (TP) size of 2 and a sequence length of 8192. The global batch size was 1024. Training was conducted on a cluster of 512 H100 GPUs over 187.5 hours, totaling 96,000 GPU hours. It is noteworthy that the first 130,000 steps were trained with a sequence length of 4096 and a global batch size of 2048.
4 POST TRAINING
4.1 DATA COMPOSITION
Open-source Training Data To enhance model training, we collect open-source instruction corpora from the web, including Evol-Instruct (Luo et al., 2024), Infinity-Instruct, and McEval (Chai et al., 2024; Yang et al., 2021), where the instruction data is created from multilingual raw code snippets by language sampling with a fixed ratio. We employ an LLM to perform binary classification on the content of Infinity-Instruct, aiming to extract the segments specifically related to code. Additionally, we sample real user queries from WildChat (Zhao et al., 2024) and Code-290k-ShareGPT, extract code-related dialogue histories using an LLM, and subsequently perform data cleaning. For low-quality responses, we employ a strong LLM to regenerate the content, enhancing the overall data quality. The resulting RealUser-Instruct dataset not only exhibits high diversity but also aligns more closely with real-world problem complexity, focusing on addressing practical issues in authentic scenarios.
identify high-quality seed data. By using only high-quality seed data, we ensure that the resulting
instruction-tuning dataset includes more educational example responses. Subsequently, we use a
teacher model to generate multiple test cases for the code sections in each problem. These test cases
are appended to the code snippets and executed using a Python interpreter. Only the data samples
that successfully pass the tests are retained. By using this strategy, we maximize the likelihood that
the generated data is both syntactically and semantically sound, thereby enhancing the reliability of
the dataset.
Large-scale Diverse Instruction Synthesis Following previous work (Yue et al., 2024), we build a large-scale instruction data synthesis framework to increase the diversity of the instruction dataset. The framework for synthesizing code instruction data with LLMs incorporates the following key components: (1) An LLM is first used to remove irrelevant context (e.g., advertisements) from web pages and to select useful sentences as seeds for further question generation. (2) A task specification module defines programming languages, difficulty levels, and coding task types, using a configuration file for easy customization. The prompt engineering component employs a template-based system to generate diverse, contextually rich prompts, incorporating real-world scenarios and best practices in software development. We set the temperature to T = 1.0 for diverse questions. (3) A larger, more capable LLM first generates the questions and then the corresponding answers; a validation module combines automated code execution and unit testing to check their correctness. (4) Finally, an LLM refines each response by adding code comments and further explanation.
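A minimal sketch of the task-specification and prompt-templating components follows; the configuration fields, template wording, and the placeholder generation call are illustrative assumptions, not the released framework.

```python
import random

# Hypothetical task-specification config; the actual configuration file
# format and option values are not given here.
TASK_SPEC = {
    "languages": ["python", "cpp", "java"],
    "difficulties": ["easy", "medium", "hard"],
    "task_types": ["algorithm", "data structure", "debugging"],
}

PROMPT_TEMPLATE = (
    "You are a programming instructor. Based on the following web snippet, "
    "write one {difficulty} {task_type} question in {language}, then solve it.\n"
    "Snippet:\n{seed}\n"
)

def build_prompt(seed_text: str) -> str:
    """Sample one task configuration and fill in the prompt template."""
    return PROMPT_TEMPLATE.format(
        language=random.choice(TASK_SPEC["languages"]),
        difficulty=random.choice(TASK_SPEC["difficulties"]),
        task_type=random.choice(TASK_SPEC["task_types"]),
        seed=seed_text,
    )

# The prompt would then be sent to a generation model (placeholder, not shown)
# with temperature T = 1.0, and the answers validated by execution and unit tests.
print(build_prompt("Binary search keeps halving a sorted search range."))
```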
quality code from GitHub, we ensured it was exposed to real-world examples of well-maintained
and formatted code. One key advantage of using high-quality code in the fine-tuning process is that
it enhances the model’s ability to generate code that is both syntactically and semantically correct.
The two-stage fine-tuning approach allows the model to excel in theoretical knowledge and practical
coding tasks, thereby avoiding the limitations of focusing on only one area. Models that only priori-
tize theory may struggle with coding, while those focused solely on code generation may lack depth
in explaining complex concepts. By refining both areas, the model becomes technically proficient
and versatile, able to meet the needs of developers, beginners, and professionals alike.
4.3 TRAINING DETAILS
In the first stage of SFT, we train for one epoch with a batch size of 4096, a learning rate of 2e-5, 100 warmup steps, and a cosine learning-rate scheduler. In the second stage of SFT, we train for three epochs with a batch size of 512, a learning rate of 5e-5, 100 warmup steps, and the same cosine learning-rate scheduler.
4.4 DECONTAMINATION
We apply strict decontamination to all SFT data. Specifically, we remove any sample containing the entry points of test sets such as HumanEval and MBPP. Additionally, we perform 10-gram decontamination, removing any sample with a 10-gram overlap with the test sets.
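A minimal sketch of the 10-gram overlap check, assuming whitespace tokenization (the paper does not specify the tokenization used).

```python
def ngram_set(text, n=10):
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(sft_samples, test_samples, n=10):
    """Drop any SFT sample sharing at least one 10-gram with a test sample."""
    test_ngrams = set()
    for t in test_samples:
        test_ngrams |= ngram_set(t, n)
    return [s for s in sft_samples if not (ngram_set(s, n) & test_ngrams)]
```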
5 EXPERIMENTAL RESULTS
In this section, we conduct a comprehensive and fair evaluation to demonstrate that the model we construct using cleaned and synthesized data performs comparably to other, closed large language models. We also compare against the most widely used and powerful open-source language models, including the Crystal and StarCoder series. To further highlight the practicality and effectiveness of our models, we focus on tasks such as code generation, code completion, and code understanding.
5.1 EVALUATION ON BASE MODELS
For base models, we focus on evaluating their code completion ability. Code completion is a fundamental capability that enables code models to tackle complex tasks. This evaluation goal aligns with our optimization objective in the annealing stage, as code completion can be regarded as a special case of the code generation task. To ensure the reproducibility of all results, we use the publicly available LLM evaluation framework OpenCodeEval7. As baselines, we compare OpenCoder-1.5B with state-of-the-art small language models.
HumanEval & MBPP We select two widely used code completion benchmarks to evaluate OpenCoder: HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021). To further enhance the accuracy of the evaluation, EvalPlus (Liu et al., 2024d) extends HumanEval and MBPP into HumanEval+ and MBPP+ by adding unique and challenging test cases and correcting inaccurate ground-truth solutions. These results indicate the model's ability to understand and apply basic Python data structures and algorithmic knowledge. For HumanEval, we report 0-shot results. For MBPP, we report 3-shot results on the 500 questions in the test split of the original dataset, while other works following EvalPlus report results on the 378 questions in the sanitized subset.
7 https://github.com/richardodliu/OpenCodeEval
Table 6: Performance of various base models on HumanEval, MBPP, and the “complete” task of BigCodeBench. Models trained on reproducible datasets are marked with green.
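HumanEval and MBPP are typically scored with the pass@k metric; the unbiased estimator from Chen et al. (2021), for n generated samples per problem of which c pass all tests, is

\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\]

which reduces to the fraction of passing samples, c/n, for k = 1.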
BigCodeBench BigCodeBench (Zhuo et al., 2024) is a challenging benchmark for code comple-
tion, designed to assess models on their ability to handle complex instructions and make accurate
function calls across diverse external libraries. In the Completion setup, models are provided with
a function signature and related documentation to generate appropriate code, along with a unit test
for the completed function. Covering a range of practical programming tasks, it evaluates models’
ability to handle real-world scenarios involving complex, task-specific libraries.
5.2 EVALUATION ON INSTRUCT MODEL
MultiPL-E MultiPL-E extends the HumanEval benchmark to evaluate the code generation capa-
bilities of large language models across multiple languages. MultiPL-E translates tasks into lan-
guages such as C++, Java, PHP, TypeScript, C#, Bash, and JavaScript, providing a consistent basis
for assessing how models apply their programming skills across different syntaxes and paradigms.
We follow the evaluation code of Qwen2.5-Coder8 to systematically measure performance in each lan-
guage, providing insights into the adaptability and code generation accuracy of LLMs in a multilin-
gual context.
8 https://github.com/QwenLM/Qwen2.5-Coder
Table 7: Performance of various chat models on HumanEval, MBPP, the “instruct” task of Big-
CodeBench and LiveCodeBench. Models trained on reproducible datasets are marked with green.
Table 8: Performance of various chat models on the MultiPL-E benchmark across different pro-
gramming languages.
McEval The comprehensive multilingual code evaluation benchmark McEval (Chai et al., 2024) provides a detailed assessment of OpenCoder's programming capabilities across 40 languages. In contrast to MultiPL-E, this benchmark is not derived from HumanEval or MBPP. Figure 6 depicts the results of the multilingual generation task, comprising nearly 2,000 samples, for OpenCoder-8B-Instruct. The figure illustrates that the model exhibits superior multilingual performance compared to other open-source models of comparable size.
MdEval OpenCoder is also evaluated on the comprehensive multilingual code debugging benchmark MdEval (Liu et al., 2024e), which covers 18 languages. In contrast to McEval, this benchmark focuses on the assessment of code debugging, especially for language-specific bugs. Figure 7 shows the results of the multilingual automated program repair task, comprising nearly 1.2K samples, for OpenCoder-8B-Instruct, demonstrating that OpenCoder can effectively find and fix bugs compared to other open-source models of comparable size.
Figure 6: McEval results of OpenCoder-8B-Instruct across 40 programming languages.
Figure 7: MdEval results of OpenCoder-8B-Instruct across 18 programming languages.
6 ANALYSIS
6.1 ANALYSIS OF THE DEDUPLICATION LEVEL
Recent studies (Lee et al., 2021) have demonstrated the significant performance improvements that
can be achieved by deduplicating training datasets for LLM, where MinHash combined with LSH
has emerged as the predominant method for deduplication in code training datasets (Li et al., 2023;
Lozhkov et al., 2024a; Guo et al., 2024; Mishra et al., 2024). Recently, DeepSeekCoder (Guo et al., 2024) reported that its deduplication is performed at the repository level. To study this choice, we conduct extensive experiments on the Python corpus of RefineCode by performing deduplication at both the file level and the repository level. Specifically, deduplication is conducted at each level across the 485 million Python files available on GitHub, and we then train two 1.5B LLMs on the resulting corpora. The findings are as follows. First, as shown in Table 9, the number of tokens retained by repository-level deduplication is almost three times that of file-level deduplication. Second, in Figure 8, we compare the downstream performance (i.e., HumanEval and MBPP) of models trained on the two datasets during pretraining and observe that file-level deduplication performs substantially better than repository-level deduplication. Third, for repository-level deduplication, we observe that a substantial portion of the retained data (52 billion tokens) exhibits complete character-level equivalence with another file. Fourth, when conducting file-level deduplication as a post-processing step on the results of repository-level deduplication, we find that approximately 68 billion tokens (about 68.4% of the data) can be further deduplicated. Our further investigation into chunk-level deduplication revealed no observable benefits, as detailed in Appendix B. In summary, for large-scale code datasets, performing exact deduplication followed by file-level fuzzy deduplication is an efficient and CPU-saving approach.
Table 9: The statistics for file level deduplication and repository level deduplication on Python code.
Rows for file level and repository level represent the number of files and repositories, respectively.
Figure 8: HumanEval and MBPP Pass@1 during pretraining for file-level and repository-level deduplication.
6.2 ANALYSIS ON THE IMPORTANCE OF HIGH-QUALITY DATA IN THE ANNEALING PHASE
During the annealing phase of training, we conduct experiments using annealing data with different distributions, as shown in Figure 9. We again train two 1.5B LLMs: the first on the original annealing data introduced above, and the second on annealing data without the high-quality portion (i.e., the Algorithmic Corpus and the synthetic data). From Figure 9, we observe that performance drops considerably when the high-quality training data is removed, which demonstrates the effectiveness of the high-quality data we constructed for the annealing phase.
Figure 9: HumanEval and MBPP Pass@1 during the annealing phase with and without the high-quality data.
Figure 10: HumanEval and MBPP Pass@1 for models trained on the original data and on the data filtered by GitHub stars.
6.3 ANALYSIS ON THE EFFECT OF GITHUB STARS
Following SantaCoder (Allal et al., 2023), we also conduct experiments comparing the performance of models trained on the original code data and on code data filtered by GitHub stars. Specifically, as shown in Figure 10, we train two 1.5B LLMs, one on the original data and another on data filtered by GitHub stars (stars >= 5), and obtain the following findings. First, in Figure 10, we observe that the LLM trained on the original data outperforms the LLM trained on the filtered data, which is similar to the results of SantaCoder. Second, in Figure 11, we also provide the training losses of these two LLMs and observe that the loss of the LLM trained on the filtered data is lower than that of the LLM trained on the original data. For this phenomenon, we assume that the data quality is higher when using stars as the filtering signal, but the diversity is relatively limited compared to the original data. Moreover, we find that this effect can be anticipated from the data distribution alone through visualization, without any training: as depicted in Figure 11, the star-based filter significantly alters the overall data distribution, compromising data diversity. Upon closer examination of the filtered data, we find that it still contains a considerable amount of well-structured, algorithmically rich code. Therefore, we argue that using stars as a filtering criterion is not an optimal choice.
Figure 11: Left figure: Losses of using different training data with different distributions. Right
figure: Visualization of the embeddings for original data and filtered data. Note that filtering based
on the number of stars can reduce data diversity and result in a lower overall loss for pretraining.
6.4 ANALYSIS ON THE TWO-STAGE INSTRUCTION TUNING STRATEGY
We compare three tuning strategies for OpenCoder-1.5B-Instruct: Stage 1 only, Stage 1 + Stage 2, and Mix Training. Table 10 indicates that the two-stage SFT training brings consistent improvements on both public benchmarks and real-world scenarios. We observe that the data in Stage 1 exhibits significant diversity, though with relatively lower average quality, whereas the data in Stage 2 consists of high-quality, code-specific SFT data. This two-stage SFT strategy allows for the acquisition of broad capabilities in Stage 1, followed by targeted enhancement of code-related tasks in Stage 2.
Besides, similar to Chatbot Arena, we adopt the CodeArena test set, covering nearly 400 human-created samples, to emulate user code-related prompts in realistic environments. We use GPT-4 as the baseline and as the judge of which LLM gives the better response; the reported results are win rates against GPT-4. Table 10 demonstrates the importance of the two-stage SFT training strategy on both the algorithmic benchmark EvalPlus and the realistic benchmark CodeArena.
Table 10: Performance of different training strategies across benchmarks. Mix Training refers to the
process of combining and shuffling the data from Stage 1 and Stage 2 for joint training.
7 RELATED WORK
Code Large Language Models. The remarkable progress in generative language modeling has
sparked numerous studies on AI applications for software engineering (Black et al., 2022; Brown
et al., 2020; Radford et al., 2019; Touvron et al., 2023; Sun et al., 2024; Chai et al., 2024; Liu et al.,
2024e). While proprietary models (Achiam et al., 2023; Chen et al., 2021; Chowdhery et al., 2023)
achieve significant performance improvements in many code-related benchmark datasets (Chen
et al., 2021; Hendrycks et al., 2020), the inaccessible model checkpoints hinder further innovation.
In contrast, the research community has introduced several open-source models (e.g., CodeGen (Ni-
jkamp et al., 2023a;b), StarCoder (Li et al., 2023; Lozhkov et al., 2024b), CodeLlama (Roziere et al.,
2023) and DeepSeekCoder (Guo et al., 2024)), which greatly foster continued innovation in the field.
Code Benchmarks. Code generation models can be leveraged to address programming challenges
by interpreting and acting upon input specifications, which involves the automatic creation of pro-
gramming solutions based on given problem descriptions (Athiwaratkun et al., 2023; Austin et al.,
2021; Chen et al., 2021; Gu et al., 2024; Lai et al., 2023; Chai et al., 2024; Muennighoff et al., 2024a;
Sun et al., 2024). Moreover, many benchmark datasets have been proposed to comprehensively as-
sess code large language models, such as code retrieval (Husain et al., 2019; Lu et al., 2021), code
translation (Yan et al., 2023), code efficiency (Du et al., 2024) and the challenging repository-level
code completion tasks (Allal et al., 2023; Liu et al., 2023a; Shrivastava et al., 2023; Zhang et al.,
2023; Deng et al., 2024; Liu et al., 2024b; Deng et al., 2024).
Open Large Language Models. Recently, many open-sourced LLMs have been proposed to em-
power the open research community and inspire a new wave of innovation. Specifically, many LLMs
(e.g., LLaMA (Touvron et al., 2023), Mistral (Jiang et al., 2023), Qwen (Bai et al., 2023), Chat-
GLM (GLM, 2024)), pretraining datasets (e.g., RedPajama (Computer, 2023), SlimPajama (Sobol-
eva et al., 2023), FineWeb (Penedo et al., 2024b)), and chat-related datasets (e.g., WildChat (Zhao
et al., 2024), LMSYS-Chat-1M (Zheng et al., 2023)) are open-sourced, which greatly inspire more
research innovations and accelerate the improvements of LLMs. Notably, several fully open LLMs
have been introduced, which provide as many details as possible to reproduce high-performance
LLMs. For example, in general LLMs, OLMo (Groeneveld et al., 2024), OLMoE (Muennighoff
et al., 2024b), LLM360 (Liu et al., 2023b) and MAP-Neo (Zhang et al., 2024a) are proposed. These
models release not only the final model checkpoint but also many training details (e.g., the data
processing pipeline, the pretraining data, and the intermediate checkpoints). Among code LLMs, StarCoder (Li et al., 2023) and StarCoder2 (Lozhkov et al., 2024a) also release high-quality code pretraining corpora.
REFERENCES
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale-
man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical
report. 2023.
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz
Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, et al. Santacoder: don’t
reach for the stars! arXiv preprint arXiv:2301.03988, 2023.
Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 3.1, knowledge storage and
extraction. arXiv preprint arXiv:2309.14316, 2023.
Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan,
Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, Sujan Kumar Gonugondla, Hantian
Ding, Varun Kumar, Nathan Fulton, Arash Farahani, Siddhartha Jain, Robert Giaquinto, Haifeng
Qian, Murali Krishna Ramanathan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Sudipta
Sengupta, Dan Roth, and Bing Xiang. Multi-lingual evaluation of code generation models. In
The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=Bo7eeXm6An8.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language
models. ArXiv preprint, abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge,
Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu,
Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi
Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng
Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi
Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang
Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint
arXiv:2309.16609, 2023.
Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Ho-
race He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shiv-
anshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-
20B: An open-source autoregressive language model. In Proceedings of BigScience Episode
#5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pp. 95–
136, virtual+Dublin, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.
bigscience-1.9. URL https://aclanthology.org/2022.bigscience-1.9.
Andrei Z. Broder. On the resemblance and containment of documents. In Bruno Carpentieri, Al-
fredo De Santis, Ugo Vaccaro, and James A. Storer (eds.), Compression and Complexity of SE-
QUENCES 1997, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997, Proceedings, pp.
21–29. IEEE, 1997. doi: 10.1109/SEQUEN.1997.666900. URL https://doi.org/10.1109/SEQUEN.1997.666900.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhari-
wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agar-
wal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh,
Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCan-
dlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot
learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan,
and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual
Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12,
2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Linzheng Chai, Shukai Liu, Jian Yang, Yuwei Yin, Ke Jin, Jiaheng Liu, Tao Sun, Ge Zhang,
Changyu Ren, Hongcheng Guo, et al. Mceval: Massively multilingual code evaluation. arXiv
preprint arXiv:2406.07436, 2024.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared
Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large
language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):
1–113, 2023.
Together Computer. Redpajama: an open dataset for training large language models, 2023. URL
https://github.com/togethercomputer/RedPajama-Data.
Ken Deng, Jiaheng Liu, He Zhu, Congnan Liu, Jingxin Li, Jiakai Wang, Peng Zhao, Chenchen
Zhang, Yanan Wu, Xueqiao Yin, Yuanxing Zhang, Wenbo Su, Bangyu Xiang, Tiezheng Ge, and
Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Confer-
ence on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume
202 of Proceedings of Machine Learning Research, pp. 18319–18345. PMLR, 2023. URL
https://proceedings.mlr.press/v202/lai23b.html.
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-
Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv
preprint arXiv:2107.06499, 2021.
Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets, 2nd Ed.
Cambridge University Press, 2014. ISBN 978-1107077232. URL http://www.mmds.org/.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou,
Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. Starcoder: may the source be with
you! arXiv preprint arXiv:2305.06161, 2023.
Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge Zhang, Jiakai Wang,
Haoran Que, Yukang Chen, Wenbo Su, et al. E2-llm: Efficient and extreme length extension of
large language models. arXiv preprint arXiv:2401.06951, 2024a.
Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai,
Yanan Wu, Ke Jin, Ge Zhang, Zekun Moore Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su,
and Bo Zheng. M2rc-eval: Massively multilingual repository-level code completion evaluation.
2024b.
Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai,
Jie Liu, Ge Zhang, Jiakai Wang, et al. Ddk: Distilling domain knowledge for efficient large
language models. arXiv preprint arXiv:2407.16154, 2024c.
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chat-
gpt really correct? rigorous evaluation of large language models for code generation. Advances
in Neural Information Processing Systems, 36, 2024d.
Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei
Zhu, Shuyue Guo, Tao Sun, Jiaheng Liu, Yunlong Duan, Yu Hao, Liqun Yang, Guanglin Niu,
Ge Zhang, and Zhoujun Li. Mdeval: Massively multilingual code debugging. arXiv preprint
arXiv:2411.02310, 2024e.
Tianyang Liu, Canwen Xu, and Julian J. McAuley. Repobench: Benchmarking repository-level
code auto-completion systems. abs/2306.03091, 2023a. doi: 10.48550/ARXIV.2306.03091. URL
https://doi.org/10.48550/arXiv.2306.03091.
Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo
Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang,
Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren,
Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, and Eric P.
Xing. Llm360: Towards fully transparent open-source llms, 2023b.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane
Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2: The
next generation. arXiv preprint arXiv:2402.19173, 2024a.
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane
Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, et al. Starcoder 2 and the stack v2: The
next generation. 2024b.
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin
Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark
dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao,
Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language
models with evol-instruct. In The Twelfth International Conference on Learning Representa-
tions, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=UnUwSIgK5W.
Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, Adriana Meza So-
ria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, et al. Gran-
ite code models: A family of open foundation models for code intelligence. arXiv preprint
arXiv:2405.04324, 2024.
Niklas Muennighoff, Qian Liu, Armel Randy Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo,
Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruc-
tion tuning code large language models. In The Twelfth International Conference on Learning
Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024a. URL
https://openreview.net/forum?id=mw1PWNSWZP.
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia
Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia,
Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela,
Ali Farhadi, Noah A. Smith, Pang Wei Koh, Amanpreet Singh, and Hannaneh Hajishirzi. Olmoe:
Open mixture-of-experts language models, 2024b. URL https://arxiv.org/abs/2409.02060.
Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. Code-
gen2: Lessons for training llms on programming and natural languages. arXiv preprint
arXiv:2305.02309, 2023a. URL https://arxiv.org/abs/2305.02309.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
and Caiming Xiong. Codegen: An open large language model for code with multi-turn program
synthesis. In International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=iaYcJKpY2B_.
Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro
Von Werra, Thomas Wolf, et al. The fineweb datasets: Decanting the web for the finest text data
at scale. arXiv preprint arXiv:2406.17557, 2024a.
Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin
Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the
finest text data at scale, 2024b.
Jim Plotts and Megan Risdal. Meta kaggle code, 2023. URL https://www.kaggle.com/ds/
3240808.
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen,
Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Pro-
ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pp. 15174–15186, 2024.
Haoran Que, Jiaheng Liu, Ge Zhang, Chenchen Zhang, Xingwei Qu, Yi Ma, Feiyu Duan, Zhiqi Bai,
Jiakai Wang, Yuanxing Zhang, Xu Tan, Jie Fu, Wenbo Su, Jiamang Wang, Lin Qu, and Bo Zheng.
D-cpt law: Domain-specific continual pre-training scaling law for large language models. ArXiv,
abs/2406.01375, 2024.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. OpenAI preprint, 2019. URL
https://cdn.openai.com/better-language-models/language_models_
are_unsupervised_multitask_learners.pdf.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi
Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code.
arXiv preprint arXiv:2308.12950, 2023.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. Wu,
Y.K. Li, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open
language models, 2024. URL https://arxiv.org/abs/2402.03300.
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-lm: Training multi-billion parameter language models using model par-
allelism, 2020. URL https://arxiv.org/abs/1909.08053.
Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. Repository-level prompt generation for
large language models of code. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara
Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International
Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research,
pp. 31693–31715. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/
v202/shrivastava23a.html.
Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey.
SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, 2023.
Tao Sun, Linzheng Chai, Jian Yang, Yuwei Yin, Hongcheng Guo, Jiaheng Liu, Bing Wang, Liqun
Yang, and Zhoujun Li. Unicoder: Scaling code large language model via universal code. arXiv
preprint arXiv:2406.16441, 2024.
Tianhua Tao, Junbo Li, Bowen Tan, Hongyi Wang, William Marshall, Bhargav M Kanakiya, Joel
Hestness, Natalia Vassilieva, Zhiqiang Shen, Eric P Xing, et al. Crystal: Illuminating llm abilities
on language and code. In First Conference on Language Modeling, 2024.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan,
Yueqi Song, Bowen Li, Jaskirat Singh, et al. Opendevin: An open platform for ai software
developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024.
Zekun Moore Wang, Zhongyuan Peng, Haoran Que, Jiaheng Liu, Wangchunshu Zhou, Yuhan Wu,
Hongcheng Guo, Ruitong Gan, Zehao Ni, Man Zhang, Zhaoxiang Zhang, Wanli Ouyang, Ke Xu,
Wenhu Chen, Jie Fu, and Junran Peng. Rolellm: Benchmarking, eliciting, and enhancing role-
playing abilities of large language models. arXiv preprint arXiv: 2310.00746, 2023.
Tianwen Wei, Liang Zhao, Lichang Zhang, Bo Zhu, Lijie Wang, Haihua Yang, Biye Li, Cheng
Cheng, Weiwei Lü, Rui Hu, et al. Skywork: A more open bilingual foundation model. arXiv
preprint arXiv:2310.19341, 2023a.
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Source code
is all you need. arXiv preprint arXiv:2312.02120, 2023b.
Yanan Wu, Jie Liu, Xingyuan Bu, Jiaheng Liu, Zhanhui Zhou, Yuanxing Zhang, Chenchen
Zhang, Zhiqi Bai, Haibin Chen, Tiezheng Ge, et al. Conceptmath: A bilingual concept-wise
benchmark for measuring mathematical reasoning of large language models. arXiv preprint
arXiv:2402.14660, 2024.
Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. Codetransocean: A compre-
hensive multilingual benchmark for code translation. 2023.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li,
Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint
arXiv:2407.10671, 2024.
Jian Yang, Shuming Ma, Haoyang Huang, Dongdong Zhang, Li Dong, Shaohan Huang, Alexandre
Muzio, Saksham Singhal, Hany Hassan, Xia Song, and Furu Wei. Multilingual machine trans-
lation systems from microsoft for WMT21 shared task. In Loïc Barrault, Ondrej Bojar, Fethi
Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander
Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow,
Matthias Huck, Antonio Jimeno-Yepes, Philipp Koehn, Tom Kocmi, André Martins, Makoto Mor-
ishita, and Christof Monz (eds.), Proceedings of the Sixth Conference on Machine Translation,
WMT@EMNLP 2021, Online Event, November 10-11, 2021, pp. 446–455. Association for Com-
putational Linguistics, 2021. URL https://aclanthology.org/2021.wmt-1.54.
Xiang Yue, Tuney Zheng, Ge Zhang, and Wenhu Chen. Mammoth2: Scaling instructions from the
web. arXiv preprint arXiv:2405.03548, 2024.
Fengji Zhang, Bei Chen, Yue Zhang, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu
Chen. Repocoder: Repository-level code completion through iterative retrieval and generation.
arXiv preprint arXiv:2303.12570, 2023. URL https://arxiv.org/abs/2303.12570.
Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan,
Esther Cheng, Jie Liu, Qunshu Lin, et al. Map-neo: Highly capable and transparent bilingual
large language model series. arXiv preprint arXiv:2405.19327, 2024a.
Yifan Zhang, Yifan Luo, Yang Yuan, and Andrew Chi-Chih Yao. Automathtext: Autonomous data
selection with language models for mathematical texts. arXiv preprint arXiv:2402.07625, 2024b.
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat:
1m chatGPT interaction logs in the wild. In The Twelfth International Conference on Learning
Representations, 2024.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao
Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, and Hao Zhang.
Lmsys-chat-1m: A large-scale real-world llm conversation dataset, 2023.
Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, and Pengfei Liu. Programming every example:
Lifting pre-training data quality like experts at scale. arXiv preprint arXiv:2409.17115, 2024.
Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li,
Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models
in code intelligence. arXiv preprint arXiv:2406.11931, 2024.
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari,
Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Bench-
marking code generation with diverse function calls and complex instructions. arXiv preprint
arXiv:2406.15877, 2024.
A FILTERING RULES
Designing heuristic filtering rules is inherently challenging and often requires iterative refinement
and experimentation before an effective set of rules emerges. Given this complexity, in addition
to providing detailed explanations of our designed rules, we also share the general insights and
methodologies we accumulated throughout the design process. We believe this section offers valuable
guidance for designing heuristic filtering rules applicable to any dataset, thereby significantly
improving the efficiency of constructing an effective data cleaning pipeline.
Heuristic rules filter data based on specific characteristics of a file; for each file, every rule is
ultimately expressed as a score describing one of the file's attributes together with a threshold set
by the rule. During the rule design process, we found that understanding the distribution of scores
and the impact of different threshold settings on data filtering is critical to creating effective rules.
Therefore, following the approach used in RedPajama (Computer, 2023), we decompose the heuristic
filtering process into two steps: quality signal computation and filtering execution. The quality signal
computation step calculates the scores of all rules for each file, while the filtering execution module
decides whether a file is retained based on its quality signal scores and the corresponding thresholds.
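To make this decomposition concrete, the following is a minimal sketch of the two-step process; the signal names and thresholds are illustrative placeholders, not the exact rules used for RefineCode.

# Minimal sketch of the two-step heuristic filtering described above.
# Signal names and thresholds are illustrative placeholders, not the exact
# rules used for RefineCode.

def compute_quality_signals(text: str) -> dict:
    """Step 1: quality signal computation -- one score per signal per file."""
    lines = text.splitlines() or [""]
    return {
        "num_lines": len(lines),
        "max_line_length": max(len(line) for line in lines),
        "frac_alphabetic": sum(ch.isalpha() for ch in text) / max(len(text), 1),
    }

# Thresholds live in a separate config so they can be re-tuned without
# recomputing the (expensive) quality signals.
THRESHOLDS = {
    "num_lines": lambda v: v >= 5,
    "max_line_length": lambda v: v <= 1000,
    "frac_alphabetic": lambda v: v >= 0.25,
}

def keep_file(signals: dict) -> bool:
    """Step 2: filtering execution -- keep a file only if every rule passes."""
    return all(check(signals[name]) for name, check in THRESHOLDS.items())

Keeping the thresholds separate from the signal computation is what allows the threshold tuning described below to be repeated cheaply.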
Additionally, we recommend placing the heuristic filtering process as late as possible in the overall
data pipeline. Unlike other, more fixed stages of the data processing pipeline, this stage requires
frequent adjustments based on the final quality of the data. Placing it later in the process allows
for more precise control over the data and minimizes the need to repeat subsequent steps after this
filtering module.
The specific steps for designing our heuristic filtering rules are as follows:
1. Quality Signal Design: Based on the definition of low-quality data and the attributes of the
dataset, we first design a series of quality signals describing the attributes that contribute
to file quality.
2. Coarse Threshold Tuning: Referring to the definition of low-quality data and the distri-
bution of quality signal scores, we roughly set filtering thresholds for all rules at once. We
then apply the filters to obtain an initial version of the filtered dataset.
3. Fine-grained Threshold Tuning: For each rule, we focus on the data that was exclusively
affected by that specific rule, meaning it did not trigger other filters. This part of the data is
directly influenced by the current rule, so we can examine whether the retention or removal
of this data under different threshold settings aligns with the intended purpose of the rule.
If a rule is effective in improving data quality based on its target attribute, we select the
optimal threshold; otherwise, the rule is discarded. After evaluating each rule, we apply
the filters again to obtain a more refined filtered dataset.
4. Data Quality Inspection: We then assess whether the filtered dataset meets our expecta-
tions for the quality of pretraining data. In addition to traditional manual inspection, we
introduce a perplexity (PPL)-based method for data quality evaluation (a code sketch of
this check follows this list). Specifically, we randomly sample a set of data from the filtered
dataset and use a high-performing LLM to compute the PPL of these samples. We then
examine the top-N and bottom-N samples ranked by PPL. Generally, extremely low PPL
suggests that the data is overly simplistic and contains limited valuable knowledge, while
extremely high PPL indicates that the data may lack learnable patterns; both kinds of samples
should generally be filtered out. We closely inspect both sets of samples and, based on their
characteristics, decide whether to add new rules or adjust existing thresholds. This process
can be repeated until the dataset reaches the desired quality.
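The sketch below illustrates the PPL-based inspection in step 4 under stated assumptions: the model name is a placeholder, and any high-performing causal LM and sampling scheme could be substituted.

# Hedged sketch of the PPL-based inspection: score randomly sampled files
# with a causal LM and surface the lowest- and highest-PPL samples for
# manual review. The model name below is a placeholder.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any high-performing LLM can be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def perplexity(text: str) -> float:
    """Compute token-level perplexity of one sample under the scoring LM."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def extreme_samples(samples, n=5):
    """Return the n lowest-PPL and n highest-PPL samples for inspection."""
    scored = sorted(((perplexity(s), s) for s in samples), key=lambda x: x[0])
    return scored[:n], scored[-n:]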
We elaborate on several representative examples of general code filtering rules in Table 11 and
language-specific filtering rules in Table 12, and explain their rationale. It is essential to note that
for general code filtering rules, the threshold values may be slightly adjusted depending on the
programming language of the file. For specific threshold values, please refer to the implementation
details of our data processing pipeline.
During pretraining, data is first randomly concatenated and segmented into chunks of context length,
followed by full-attention computation within each chunk. We further explored chunk-level dedu-
plication. Specifically, the pretraining data was randomly concatenated and segmented into chunks
of 4096 tokens, followed by MinHash and LSH deduplication on these chunks. Additionally, we
applied chunk-level deduplication after file-level and repo-level deduplication.
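The following is an illustrative sketch of this chunk-level deduplication: a concatenated token stream is cut into 4096-token chunks and near-duplicate chunks are dropped with MinHash and LSH (here via the datasketch library); the num_perm and threshold values are assumptions rather than the exact settings used.

# Illustrative sketch of chunk-level deduplication with MinHash + LSH.
from datasketch import MinHash, MinHashLSH

CHUNK_SIZE = 4096                    # tokens per chunk, as in the experiment
NUM_PERM, JACCARD_THRESHOLD = 128, 0.8   # assumed LSH settings

def to_chunks(token_ids, size=CHUNK_SIZE):
    """Segment a concatenated token stream into fixed-size chunks."""
    for i in range(0, len(token_ids), size):
        yield token_ids[i:i + size]

def dedup_chunks(chunks):
    """Keep a chunk only if no previously seen chunk is a near duplicate."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for idx, chunk in enumerate(chunks):
        mh = MinHash(num_perm=NUM_PERM)
        for token in set(chunk):
            mh.update(str(token).encode("utf-8"))
        if not lsh.query(mh):        # no near-duplicate chunk seen so far
            lsh.insert(f"chunk-{idx}", mh)
            kept.append(chunk)
    return kept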
Table 13: Comparison of deduplication strategies on Python data. At the File level, "Lines" refers
to the number of lines in individual files; at the Repo level, it indicates the line count of aggregated
strings. Note that for all deduplication strategies involving the Chunk level, "Lines" specifically
refers to 4096-token chunks.
[Figure 12 plot: two panels showing Pass@1 (y-axis, 0 to 0.2) on HumanEval (left) and MBPP
(right) over training (x-axis, 0 to 100), with one curve per deduplication strategy: File-Level,
Repo-Level, and Repo&Chunk-Level.]
Figure 12: Comparison of Pass@1 performance on HumanEval and MBPP for different deduplication
strategies (File-Level, Repo-Level, and Repo-Level + Chunk-Level) on the RefineCode Python corpus.
From the results in Table 13, we observe that chunk-level deduplication alone was even less effective
than repo-level deduplication, and that applying chunk-level deduplication after file-level deduplication
removed only an additional 0.04B of data. This indicates that chunk-level deduplication is not an
effective approach. We pre-trained three 1.5B models on the data retained under the file-level,
repo-level, and repo-level + chunk-level deduplication strategies; the benchmark results are shown in
Figure 12. File-level deduplication clearly achieves the highest training efficiency, while repo-level
+ chunk-level deduplication outperforms repo-level deduplication alone. We attribute the superior
performance of file-level deduplication to its higher degree of data removal. Overall, we conclude that
file-level deduplication is the most suitable method for GitHub data.
The manual annotation of website URLs is presented in Table 14. For future new CC datasets, pages
from these domains can be sampled as an initial seed corpus.
Table 14: We manually annotate code-like and math-like Chinese domains, utilizing the ’%’ symbol
as a wildcard in our pattern matching. For example, the URL ’https://my.oschina.net/u/4/blog/11’ is
matched by the pattern ’%my.oschina.net%blog%’.
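A small helper illustrating how such '%' wildcard patterns can be matched against URLs is sketched below; the wildcard semantics (any, possibly empty, character sequence, as in SQL LIKE) and the regex translation are our reading of the caption, and the example reproduces the one above.

# Match Table-14-style '%' wildcard patterns against URLs.
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    # '%' matches any (possibly empty) sequence of characters.
    return re.compile(".*".join(re.escape(part) for part in pattern.split("%")))

assert wildcard_to_regex("%my.oschina.net%blog%").fullmatch(
    "https://my.oschina.net/u/4/blog/11"
) is not None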
GitHub text files primarily consist of content written in natural language, which includes abundant
code-related knowledge. However, we observed that a substantial portion of this data is unrelated
to code, which is detrimental to the model's ability to learn code-related knowledge. Therefore,
we employed the following strategies to extract and retain the code-relevant portions before our
filtering module. First, following the strategy used in StarCoder (Li et al., 2023), we retained
files whose lowercased filename contains "requirement", or whose filename without the extension is
one of "readme", "notes", "todo", "description", or "cmakelists", to ensure that only text files
pertinent to coding contexts are preserved. This strategy recalled 3% of the total text volume.
Additionally, we trained a fastText model to recall code-related text files, which recovered an extra
7% of file volume from the original text data.
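The sketch below illustrates the two recall strategies under stated assumptions: the training-file path, label names, and classifier hyperparameters are placeholders, not the exact configuration we used.

# Hedged sketch of the two recall strategies for GitHub text files.
import fasttext

KEEP_STEMS = {"readme", "notes", "todo", "description", "cmakelists"}

def keep_by_filename(filename: str) -> bool:
    """Recall strategy 1: retain files whose name suggests a coding context."""
    name = filename.lower()
    stem = name.rsplit(".", 1)[0]
    return "requirement" in name or stem in KEEP_STEMS

# Recall strategy 2: a supervised fastText classifier trained on lines of the
# form "__label__code_related <text>" / "__label__other <text>".
model = fasttext.train_supervised(input="code_related_train.txt", epoch=5)

def keep_by_classifier(text: str, threshold: float = 0.5) -> bool:
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__code_related" and probs[0] >= threshold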
Our Jupyter notebook data is sourced from GitHub and Meta Kaggle Code (Plotts & Risdal, 2023).
We converted this data into the Jupyter-structured format used in StarCoder (Li et al., 2023),
which consists of triplets of consecutive markdown, code, and code execution results. However, we
discarded the Jupyter-script format mentioned in StarCoder, because the code files generated from
Jupyter notebook conversions tend to have poor overall code-writing standards, and the content of the
Jupyter-script and Jupyter-structured formats is highly redundant, making it sufficient to retain only
one format.
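A minimal sketch of converting an .ipynb file into such (markdown, code, execution result) triplets is given below; output extraction is simplified, and the exact serialization used for training is an assumption.

# Convert a notebook into Jupyter-structured triplets (simplified).
import json

def notebook_to_triplets(path: str):
    with open(path, encoding="utf-8") as f:
        cells = json.load(f)["cells"]
    triplets, pending_markdown = [], ""
    for cell in cells:
        source = "".join(cell.get("source", []))
        if cell["cell_type"] == "markdown":
            pending_markdown += source + "\n"
        elif cell["cell_type"] == "code":
            outputs = "\n".join(
                "".join(out.get("text", [])) for out in cell.get("outputs", [])
            )
            triplets.append((pending_markdown.strip(), source, outputs.strip()))
            pending_markdown = ""
    return triplets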
Table 15: Comparison of training data between RefineCode and The Stack series. "LS" denotes
"Language Specific".
The included programming languages can be categorized into three classes: code, data, and text.
Among them, the "code" category represents files rich in code logic, the "data" category primarily
consists of files with structured data, and the "text" category refers to files dominated by natural
language content. The threshold settings for the filtering rules vary slightly depending on the data
type.
Code(470 types): 1C Enterprise, 4D, ABAP, ABAP CDS, AIDL, AL, AMPL, ANTLR, API
Blueprint, APL, ASL, ASP.NET, ATS, ActionScript, Ada, Agda, Alloy, Alpine Abuild, An-
gelScript, Apex, Apollo Guidance Computer, AppleScript, Arc, AspectJ, Assembly, Astro, Asymp-
tote, Augeas, AutoHotkey, AutoIt, Awk, BASIC, BQN, Ballerina, Batchfile, Beef, Befunge,
Berry, Bikeshed, Bison, BitBake, Blade, BlitzBasic, BlitzMax, Bluespec, Boo, Boogie, Brain-
fuck, Brightscript, C, C#, C++, C2hs Haskell, CAP CDS, CLIPS, CMake, COBOL, CUE, Ca-
dence, Cairo, CameLIGO, Cap’n Proto, Ceylon, Chapel, Charity, ChucK, Circom, Cirru, Clarion,
Clarity, Classic ASP, Clean, Click, Clojure, Closure Templates, CodeQL, CoffeeScript, ColdFu-
sion, ColdFusion CFC, Common Lisp, Common Workflow Language, Component Pascal, Coq,
Crystal, Csound, Csound Document, Csound Score, Cuda, Curry, Cycript, Cypher, Cython, D,
D2, DIGITAL Command Language, DM, Dafny, Dart, DataWeave, Dhall, Diff, Dockerfile, Do-
gescript, Dylan, E, ECL, EJS, EQ, Earthly, Edge, EdgeQL, Elixir, Elm, Elvish, Emacs Lisp,
EmberScript, Erlang, F#, F*, FIRRTL, FLUX, Factor, Fancy, Fantom, Faust, Fennel, Filebench
WML, Fluent, Forth, Fortran, Fortran Free Form, FreeBasic, Futhark, GAML, GAMS, GAP, GDB,
GLSL, GSC, Game Maker Language, Genero 4gl, Genero per, Genshi, Gentoo Ebuild, Gentoo
Eclass, Gherkin, Gleam, Glimmer JS, Glyph, Go, Golo, Gosu, Grace, Grammatical Framework,
Groovy, Groovy Server Pages, HCL, HLSL, HTML, HTML+ECR, HTML+EEX, HTML+ERB,
HTML+PHP, HTML+Razor, Hack, Haml, Handlebars, Harbour, Haskell, Haxe, HiveQL, HolyC,
Hy, IDL, IGOR Pro, Idris, ImageJ Macro, Imba, Inform 7, Ink, Inno Setup, Io, Ioke, Isabelle, Is-
abelle ROOT, J, JCL, JFlex, JSONiq, Janet, Jasmin, Java, Java Server Pages, JavaScript, JetBrains
MPS, Jinja, Jison, Jison Lex, Jolie, Jsonnet, Julia, Just, KRL, Kaitai Struct, KakouneScript, Kerbo-
Script, Kit, Kotlin, LFE, LLVM, LOLCODE, LSL, LabVIEW, Latte, Lean, Less, Lex, LigoLANG,
LilyPond, Limbo, Liquid, Literate Agda, Literate CoffeeScript, Literate Haskell, LiveScript, Logos,
Logtalk, LookML, Lua, Luau, M, M4, M4Sugar, MATLAB, MAXScript, MLIR, MQL4, MQL5,
MTML, MUF, Macaulay2, Makefile, Mako, Marko, Mask, Mathematica, Mercury, Mermaid, Me-
son, Metal, MiniD, Mint, Mirah, Modelica, Modula-3, Module Management System, Mojo, Mon-
key, MoonScript, Motorola 68K Assembly, Move, Mustache, Myghty, NASL, NSIS, NWScript,
Nearley, Nemerle, NetLinx, NetLogo, Nextflow, Nim, Nit, Nix, Nu, NumPy, Nunjucks, OCaml,
Oberon, Objective-C++, Objective-J, Omgrofl, Opa, Opal, Open Policy Agent, OpenCL, Open-
QASM, OpenSCAD, Ox, Oxygene, Oz, P4, PDDL, PEG.js, PHP, PLSQL, PLpgSQL, Pact, Pan, Pa-
pyrus, Parrot, Parrot Assembly, Parrot Internal Representation, Pascal, Pawn, Pep8, Perl, PigLatin,
Pike, PogoScript, Polar, Pony, Portugol, PowerBuilder, PowerShell, Praat, Processing, Procfile, Pro-
log, Promela, Propeller Spin, Pug, Puppet, PureScript, Prover9, Pyret, Python, Q#, QML, QMake,
Qt Script, Quake, R, RAML, REALbasic, REXX, RPGLE, RUNOFF, Racket, Ragel, Raku, Ras-
cal, ReScript, Reason, ReasonLIGO, Rebol, Red, Redcode, RenderScript, Ring, Riot, RobotFrame-
work, Roc, Rouge, Ruby, Rust, SAS, SMT, SQF, SQL, Sage, SaltStack, Sass, Scala, Scaml, Scenic,
Scheme, Scilab, Self, Shell, ShellSession, Shen, Sieve, Singularity, Slash, Slim, Slint, SmPL, Smali,
Smalltalk, Smarty, Smithy, Snakemake, SourcePawn, Squirrel, Stan, Standard ML, Starlark, Stata,
Stylus, SugarSS, Svelte, Sway, Swift, SystemVerilog, TI Program, TL-Verilog, TLA, TSX, TXL,
Talon, Tcl, Tcsh, Tea, Terraform Template, Thrift, Toit, Turing, Twig, TypeScript, Typst, Unified
Parallel C, Uno, UnrealScript, UrWeb, V, VBA, VBScript, VCL, VHDL, Vala, Velocity Template
Language, Verilog, Vim Script, Vim Snippet, Visual Basic .NET, Visual Basic 6.0, Volt, Vue, Vyper,
WDL, WGSL, WebAssembly, WebIDL, Whiley, Witcher Script, Wollok, Wren, X10, XC, XProc,
XQuery, XS, XSLT, Xojo, Xonsh, Xtend, YARA, YASnippet, Yacc, Yul, ZAP, ZIL, Zeek, Zen-
Script, Zephir, Zig, Zimpl, eC, fish, hoon, kvlang, mIRC Script, mcfunction, mupad, nesC, ooc,
templ, wisp, xBase
Data(115 types): ABNF, ASN.1, Adobe Font Metrics, Altium Designer, Ant Build System,
ApacheConf, Avro IDL, BibTeX, Browserslist, CIL, CODEOWNERS, CSON, CSS, Cabal Config,
Caddyfile, CartoCSS, Cloud Firestore Security Rules, CoNLL-U, DNS Zone, Darcs Patch, Debian
Package Control File, Dotenv, EBNF, Eagle, Easybuild, Ecere Projects, EditorConfig, Edje Data
Collection, FIGlet Font, Formatted, GEDCOM, GN, Gemfile.lock, Gerber Image, Git Attributes,
Git Config, Glyph Bitmap Distribution Format, Go Checksums, Go Module, Go Workspace, Godot
Resource, Gradle, Gradle Kotlin DSL, GraphQL, Graphviz (DOT), HAProxy, HOCON, HTTP,
HXML, INI, Ignore List, JAR Manifest, JSON, JSON with Comments, Jest Snapshot, Kusto, Lark,
Linker Script, Maven POM, NEON, NL, NPM Config, Nginx, Ninja, ObjDump, Object Data In-
stance Notation, OpenStep Property List, OpenType Feature File, Option List, PlantUML, PostCSS,
Prisma, Protocol Buffer, Protocol Buffer Text Format, Python traceback, RBS, RON, Readline Con-
fig, Record Jar, Redirect Rules, Regular Expression, SCSS, SELinux Policy, SPARQL, SSH Config,
STAR, STON, ShellCheck Config, Simple File Verification, Soong, Spline Font Database, TOML,
TextMate Properties, Turtle, Type Language, Valve Data Format, Wavefront Material, Web Ontol-
ogy Language, WebAssembly Interface Type, Wget Config, Windows Registry Entries, X BitMap,
X Font Directory Index, XCompose, XML, XML Property List, XPages, YAML, YANG, cURL
Config, crontab, desktop, dircolors, edn, nanorc
Text(22 types): AsciiDoc, Creole, Gemini, Gettext Catalog, MDX, Markdown, Muse, Org, Pod,
Pod 6, RDoc, RMarkdown, Rich Text Format, Roff, SRecode Template, Sweave, TeX, Texinfo,
Text, Textile, Wikitext, reStructuredText
2-Dimensional Array, AGS Script, Adblock Filter List, Bicep, COLLADA, CSV, Checksums, Di-
rectX 3D File, E-mail, G-code, Git Revision List, Gnuplot, IRC log, KiCad Layout, KiCad Legacy
Layout, KiCad Schematic, Lasso, Linux Kernel Module, Max, Microsoft Developer Studio Project,
Microsoft Visual Studio Solution, POV-Ray SDL, Pic, Pickle, PostScript, Public Key, Pure Data,
PureBasic, Raw token data, Roff Manpage, STL, SVG, SubRip Text, TSV, Unity3D Asset, Wave-
front Object, WebVTT, X PixMap, robots.txt
Figure 16 shows the composition of raw code data for the top 85 programming languages in the
RefineCode dataset, both after deduplication and after filtering. It can be observed that, after
filtering, the proportions of data across programming languages shift significantly, with a notable
increase in the representation of commonly used programming languages.
Table 16: Overview of the data composition of RefineCode. The items in the table are sorted in
descending order according to the file volume after filtering.
You are a teaching assistant helping to create a Python programming task from a given code
snippet. You must provide the best response to the Python programming task, including
reasoning thought, reference solutions, explanation of test cases, and test code.
[Code Snippet]
{Code}
[Task]
{Create an independent and detailed Python programming task}
[Analysis]
{Analyze the task and reason about the given task step by step}
[Solution]
{Write a high-quality reference solution in a self-contained script that solves the task}
[Test]
{Provide ten assert statements to check the correctness of your solution}
You are exceptionally skilled at crafting high-educational level problems and offering
precise solutions. Please gain inspiration from the following code snippet to create a high-
quality programming problem, which is beneficial for learning the use of corresponding
libraries. Present your output in two distinct sections: [Problem Description] and [Solution].
[Code Snippet]
{Code}
2. [Solution]: Offer a comprehensive, **correct** solution that addresses the [Problem De-
scription] you provided. This solution should follow the standards of the corresponding library
API documentation. Please ensure that the Solution only involves answering the Problem, **without
addressing the requirements I provided!** Please provide an essential explanation of this
solution, especially the use of the required library APIs.
You are an expert in designing high-quality programming questions based on the given text.
[Guidelines]
- You can draw inspiration from the given text to create the programming questions.
- The created question should be a self-contained question, which does not depend on any
external context.
- The created response must contain the complete code snippet.
[Given Text]
{Given Text}
[Created Question]
{Created Question}