0% found this document useful (0 votes)
15 views

aixcoder-7b

Uploaded by

nikhil11.unp
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

aixcoder-7b

Uploaded by

nikhil11.unp
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

aiXcoder-7B: A Lightweight and Effective Large

Language Model for Code Completion


Siyuan Jiang* Jia Li* ♂ He Zong Huanyu Liu, Hao Zhu
aiXcoder Peking University aiXcoder Peking University
Beijing, China Beijing, China Beijing, China Beijing, China
[email protected] [email protected] [email protected] {huanyuliu, zhuhao}@stu.pku.edu.cn

Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning Ge Li


aiXcoder Peking University
Beijing, China Beijing, China
arXiv:2410.13187v1 [cs.CL] 17 Oct 2024

{hushukai, lierlu, dingjiazheng, hanyu, ningwei}@aixcoder.com [email protected]

Abstract—Large Language Models (LLMs) have been widely is a critical factor for developer experience and productivity.
used in code completion, and researchers are focusing on scaling Thus, it is necessary to train lightweight LLMs that maintain
up LLMs to improve their accuracy. However, larger LLMs will high code completion accuracy while having smaller scales.
increase the response time of code completion and decrease the
developers’ productivity. In this paper, we propose a lightweight Recognizing the above research gap, we present aiXcoder-
and effective LLM for code completion named aiXcoder-7B. 7B, a lightweight and powerful LLM for code completion.
Compared to existing LLMs, aiXcoder-7B achieves higher code aiXcoder-7B contains 7 billion parameters, ensuring a high
completion accuracy while having smaller scales (i.e., 7 billion inference speed while achieving superior code completion
parameters). We attribute the superiority of aiXcoder-7B to accuracy. In our later experiments, aiXcoder-7B outperforms
three key factors: ❶ Multi-objective training. We employ three
training objectives, one of which is our proposed Structured the latest LLMs with similar sizes in six code comple-
Fill-In-the-Middle (SFIM). SFIM considers the syntax structures tion benchmarks and even surpasses larger LLMs (e.g.,
in code and effectively improves the performance of LLMs StarCoder2-15B and CodeLlama-34B). aiXcoder-7B effec-
for code. ❷ Diverse data sampling strategies. They consider tively balances model size and performance, providing a better
inter-file relationships and enhance the capability of LLMs in foundational model for both academia and industry.
understanding cross-file contexts. ❸ Extensive high-quality data.
We establish a rigorous data collection pipeline and consume
Compared to previous LLMs, we attribute the superiority
a total of 1.2 trillion unique tokens for training aiXcoder-7B. of aiXcoder-7B to the following three key factors:
This vast volume of data enables aiXcoder-7B to learn a broad • Multi-objective training. Previous LLMs mainly use Next-
distribution of code. We evaluate aiXcoder-7B in five popular Token Prediction (NTP) as the training objective, which only
code completion benchmarks and a new benchmark collected by covers limited code completion scenarios. To address this
this paper. The results show that aiXcoder-7B outperforms the
latest six LLMs with similar sizes and even surpasses four larger limitation, we propose multi-objective training, including
LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning NTP, Fill-In-the-Middle (FIM), and Structured Fill-In-
aiXcoder-7B as a lightweight and effective LLM for academia the-Middle (SFIM). NTP simulates the scenario where
and industry. Finally, we summarize three valuable insights for developers write a new file from top to bottom, and FIM
helping practitioners train the next generations of LLMs for models the scenario of developers modifying existing code.
code. aiXcoder-7B has been open-souced and gained significant
attention [1]. As of the submission date, aiXcoder-7B has received Because FIM mainly trains models to predict incomplete
2,193 GitHub Stars. and irregular code snippets, we further propose SFIM. It
Index Terms—Code Completion, Large Language Model parses the code into a syntax tree and mines a relatively
complete code span based on a tree node. aiXcoder-7B is
I. I NTRODUCTION trained to predict the code span based on its surrounding
Large Language Models (LLMs) have been widely used in context. The three objectives help aiXcoder-7B learn a
code completion [2]–[5], i.e., predicting the subsequent code comprehensive code completion ability across a wider range
based on the previous context. For example, GitHub Copilot of code completion scenarios. Details of the multi-objective
[6], an LLM-based code completion tool, is regularly utilized training are in Section III-C.
• A diverse data sampling algorithm. A code repository of-
by developers from over 10,000 organizations. Nowadays,
researchers often improve the accuracy of LLMs by scaling ten contains multiple code files. Previous studies [2], [4], [5]
up LLMs, e.g., CodeLlama-70B [2]. However, larger LLMs typically randomly sample files for training, failing to lever-
will increase the response time of code completion, which age the relationships and contextual information between
files. We propose four new sampling strategies: sampling
* Siyuan Jiang and Jia Li contribute equally and are co-first authors. based on file content similarity, sampling based on file
path similarity, sampling based on inter-file dependency, file relationships and code structures into the training. These
and random sampling. The first three strategies simulate insights can help practitioners train the next generations of
common cross-file code completion scenarios, such as code LLMs for code.
completion augmented by similar code and cross-file API We summarize the key contributions of this paper:
completion, helping aiXcoder-7B better understand and uti- • We present aiXcoder-7B, a lightweight and effective LLM
lize dependencies across files. The fourth strategy, random with 7 billion parameters for code completion. We have
sampling, is to simulate other potential code completion released its weights and code [1]. As of the submission date,
scenarios. Through these diverse sampling strategies, we aiXcoder-7B has received 2,193 GitHub Stars.
enhance aiXcoder-7B’s understanding capability of cross- • We propose a novel training objective - Structured Fill-In-
file contexts within a repository. Details of our data sampling the-Middle, which considers the syntax structures in code
algorithm are in Section III-B. and effectively improves the performance of LLMs.
• Extensive high-quality data. LLMs are inherently data- • We propose a new data sampling algorithm for code, which
driven, and their performance is significantly influenced by considers inter-file relationships and enhances the capability
the quantity and quality of the training data. We establish of LLMs in understanding cross-file contexts.
a rigorous data collection pipeline, including data crawl- • We release a new code completion benchmark, consisting
ing, data cleaning, deduplication, code quality checks, and of 16,136 samples and covering four languages.
sensitive information removal. We leverage this pipeline to • We evaluate the effectiveness of aiXcoder-7B in six code
collect a substantial amount of high-quality training data. completion benchmarks. aiXcoder-7B substantially outper-
We continuously feed the training data into aiXcoder- forms 6 LLMs with similar sizes and even surpasses 4 larger
7B, consuming a total of 1.2 trillion unique tokens. This LLMs (15B and 34B).
vast volume of data enables aiXcoder-7B to learn a broad
distribution of code data, allowing it to perform exception- II. DATA C OLLECTION P IPELINE
ally well across different code completion scenarios. More This section presents the process of collecting the pre-
details of our data collection pipeline are in Section II. training data of aiXcoder-7B. Figure 1 shows an overview
of our data collection pipeline, consisting of five stages:
We assess the effectiveness of aiXcoder-7B in three data crawl (Section II-A), data cleaning (Section II-B), data
code completion tasks, including Fill-In-the-Middle (FIM), deduplication (Section II-C), code quality checking (II-D),
cross-file code completion, and Natural language to Code and sensitive and personally identifiable information removal
(NL2Code). We experiment with six code completion bench- (Section II-E). Through this pipeline, we collect and clean
marks, five of which are popular public datasets and one is 2.8TB of natural language data and 3.5TB of source code data.
FIM-Eval collected by this paper. FIM-Eval is a benchmark Figure 2 visualizes the distributions of the top 10 programming
for FIM, consisting of 16,136 samples and covering four languages in the code data. Next, we describe the details of
languages (i.e., Java, Python, C++, JavaScript). FIM-Eval the data collection pipeline in the following sections.
additionally labels the types of code to be completed, including
13 types, e.g., function signatures and comments. Then, we A. Data Crawling
compare aiXcoder-7B to 10 recently released LLMs (from 7B The pre-training data of aiXcoder-7B consists of two parts:
to 34B) on these benchmarks and yield the following insights: natural language data and source code data.
❶ aiXcoder-7B substantially outperforms LLMs with similar Natural Language Data. We collect natural language data
sizes and even surpasses larger LLMs in six benchmarks. For from two public datasets: WuDaoCorpora [7] and RefineWeb
example, in a popular benchmark - HumanEval, aiXcoder-7B [8], driven by two key motivations. First, these datasets are
achieves a Pass@1 score of 54.9%, outperforming CodeLlama- highly diverse, covering a wide range of domains and lan-
34B (i.e., 48.2%) and StarCoder2-15B (i.e., 46.3%). The guages. They include a broad spectrum of natural language
improvements show that aiXcoder-7B achieves higher code text from the internet, such as social media conversations,
completion accuracy while having smaller scales. ❷ Based on books, and technical papers, and cover two mainstream lan-
our FIM-Eval, we analyze the performance of aiXcoder-7B in guages, i.e., English and Chinese. Second, both datasets have
completing different types of code. aiXcoder-7B outperforms been thoroughly cleaned and deduplicated in previous studies,
LLMs with similar sizes on most types (max: 13, min: 8). The which significantly reduces the preprocessing workload and
results show the strong generalization ability of aiXcoder-7B allows us to focus on processing code data. Finally, we collect
in code completion. ❸ We show that existing LLMs are prone 2.8TB of natural language data for pre-training.
to generate longer code in FIM, while the code generated by Source Code Data. The raw source code data comes from
aiXcoder-7B is closer in length to human-written reference two sources: one is the open-source dataset - The Stack v1.2,
code. The result shows that the code generated by aiXcoder- and the other is the code data we crawled ourselves.
7B is more concise and closer to the human coding style. • The Stack v1.2 [9] is a comprehensive dataset comprising
Insights of training LLMs for code. Based on our prac- approximately 6TB of permissively licensed source code
tices in aiXcoder-7B, we summarize three valuable insights, sourced from public GitHub repositories, spanning 358
including scaling up training data and introducing the inter- programming languages, with notable representation from
Fig. 1: An overview of our data collection pipeline.

Markdown • Identifying repositories’ licenses. GHArchive provides li-


HTML cense information when repository owners explicitly set the
C# code license through the web interface. We first extract each
Kotlin repository’s license from GHArchive. If a license is not
11.5%
8.2% 11.4% listed in GHArchive, we leverage the go-license-detector2
0.4% TypeScript to identify the most likely license.
2.9% Go • Removing repositories with impressive licenses. After iden-
Java 3.1%
19.6% tifying the licenses, we exclude repositories whose licenses
do not appear on the permissive license list.
15.8% • Removing low-quality repositories. We score the repositories

12.8% JavaScript from different aspects, including the number of stars, the
14.3% number of git commits, and the number of test files. Then,
we sort the repositories in descending order based on their
C++ scores and remove the lowest 10%.
Python File-level Cleaning. Next, we filter out low-quality files
in repositories. Specifically, we empirically design some rules
Fig. 2: The distributions of the top 10 programming languages in our
to filter out low-quality documents: ❶ Trivial files, including
source code training data.
empty files, corrupted files, non-text files, and auto-generated
files. ❷ Too long files. Too long files typically contain wordy
HTML, JavaScript, Java, and C. This dataset has under- or repetitive content and are not suitable as training data. If
gone rigorous cleaning processes to enhance data integrity a line in a file exceeds 1000 characters, the total number of
and prevent inflated performance metrics due to duplicate lines in the file exceeds 10,000, or the size of the file exceeds
content. Only repositories with permissive licenses were 1MB, we consider it a long file.
retained, while non-essential files, such as binaries and those
C. Data Deduplication
exceeding 1MB, were systematically excluded. The version
1.2 of The Stack has excluded opt-out requests submitted Previous work [8] has shown that data deduplication can
by February 9, 2023, as well as initially flagged malicious significantly improve the performance of trained models. It is
files (this exclusion is not exhaustive). particularly necessary for code data, where code reuse leads to
• Our crawled data. Popular repositories we have been crawl- a large amount of duplicate content. Therefore, in this stage,
ing from GitHub for the past decade. we eliminate duplicate code files within the repositories. Our
deduplication process consists of two steps:
B. Data Cleaning • Exact deduplication. We extract file contents, find the files
In this stage, we clean the collected data by removing invalid with exactly the same content, and keep only one copy.
or low-quality data. Because the natural language data has • Near deduplication. Exact deduplication is too strict and
already undergone rigorous cleaning, we focus on cleaning the may cause false positives. Thus, we further perform near
source code data. Our cleaning process comprises two steps: deduplication. We compute the MinHash [10] with 256
repository-level cleaning and file-level cleaning. Below, we permutations of all files and use Locality Sensitive Hashing
provide a detailed explanation of each step. [11] to find clusters of duplicates. We further reduce the
Repository-level Cleaning. Our goal is to remove repos- clusters by ensuring that each file in the original cluster
itories with impressive licenses and low-quality repositories. is similar to at least one other file in the reduced cluster.
To achieve this goal, our cleaning is performed in three steps: We consider two files near-duplicate when their Jaccard
• Collecting permissive licenses. We build a list of permissive similarity exceeds 0.85.
licenses based on the Blue Oak Council1 and previous
work [9]. This list includes various permissive licenses with D. Code Quality Checking
minimal restrictions on software copying, modification, and In this stage, we use code analysis tools to assess the quality
redistribution. Only repositories with licenses from this list of code data and filter out low-quality code. Low-quality code
are retained for pre-training. often contains syntax errors, code defects, vulnerabilities, and
1 https://blueoakcouncil.org/list 2 https://github.com/src-d/go-license-detector
misleading models that generate unreliable code. Specifically, B. Data Sampling Algorithm
we use the following tools to assess the quality of code: Through the pipeline in Section II, we collect extensive code
Syntax Parser. Syntax correctness is one of the basic repositories and natural language articles. We randomly shuffle
principles that source code should satisfy. We use a public these repositories and articles and iterate through them. If a
syntax parser - tree-sitter3 to parse all code files and delete natural language article is sampled, we process it into training
files that fail to parse or time out. sequences based on the training objectives (Section III-C).
SonarQube. SonarQube4 is an open-source tool for the in- If a code repository is sampled, we design an algorithm for
spection of code quality. It can detect code defects, vulnerabil- sampling files from the repository, as described in Algorithm 1.
ities, code smells, and technical debt in various programming The algorithm contains four strategies: sampling based on file
languages. We use SonarQube to identify problematic code content similarity, sampling based on file path similarity, sam-
files and delete them. pling based on inter-file dependencies, and random sampling.
The first three strategies simulate common cross-file code
E. Sensitive Information Removal completion scenarios, such as code completion augmented by
In this section, we remove the sensitive information in similar code and cross-file API completion, helping aiXcoder-
the pre-training data, e.g., texts involving sensitive topics 7B better understand and utilize dependencies across files. The
and personally identifiable information (PII). We remove this fourth strategy, random sampling, is to simulate other potential
information in two steps: code completion scenarios. For each repository, the probability
Match-based filter. We manually build a list of sensitive of selecting each of the first three strategies is 30%, and the
words, which covers a broad range of sensitive topics (e.g., probability of selecting the last strategy is 10%. These sampled
politics). Then, we scan all pre-training data and delete the files are further converted into training sequences based on the
data containing sensitive words. training objectives (Section III-C).
Model-based filter. Following previous work [4], we use a C. Training Objectives
Named Entity Recognition (NER) model to identify PII in the
The training objectives of aiXcoder-7B consist of the Next-
data. Specifically, we reuse a trained NER model in previous
Token Prediction (NTP) and Structured Fill-In-the-Middle
work [4], which can identify six categories of PII, including
(SFIM), detailed as follows.
emails, names, IP addresses, usernames, passwords, and keys.
Next-Token Prediction (NTP). It is similar to code com-
Then, we replace the detected PII entities with the follow-
pletion, training models to predict the subsequent token based
ing special tokens: <EMAIL>, <NAME>, <IP_ADDRESS>,
on the provided context. Given a code file or natural language
<USERNAME>, <PASSWORD>, <KEY>.
article x = {x0 , x1 , . . . , xl }, NTP trains models predict the
III. M ODEL T RAINING next token xi based on previous tokens {x<i }. The objective
is to minimize the following loss function:
In this section, we describe the pre-training procedure l−1
of aiXcoder-7B, including model architecture, data sampling lossN T P = −
X
log p (xi | xt<i ) (1)
algorithm, and training objectives. i=0

A. Model Architecture Fill-In-the-Middle (FIM) [16]. The motivation behind this


training objective is that human developers frequently modify
aiXcoder-7B is built upon an auto-regressive dense Trans- existing code, e.g., inserting new code snippets. Thus, FIM
former architecture [12]. aiXcoder-7B consists of 32 Trans- trains models to predict the middle content based on the
former decoder layers, with a hidden state size of 4096 and preceding and following context. Specifically, given a code
an intermediate size of 14464. More details are in our open- file or natural language article x = {x0 , . . . , xl }, we ran-
sourced repository [1]. Our tokenizer is trained with Senten- domly select a span of contiguous tokens as the middle =
cePiece [13] upon 500GB of training data. The vocabulary {xi , . . . , xj }, using the content above the span as the prefix =
size is 49,512. We adopt Rotary Positional Encodings (RoPE) {x0 , . . . , xi−1 } and the content below as the suffix =
[14] to enhance the representation of positional information {xj+1 , . . . , xl }. We employ two distinct modes to construct
in sequences, following [2], [4]. RoPE allows for a more the training sequence: PSM (i.e., [prefix; suffix; middle])
flexible encoding of position than absolute position encoding. or SPM (i.e., [suffix; prefix; middle]). [; ] means the con-
Additionally, we implement Grouped Query Attention (GQA) catenation of multiple strings using special tokens. Previous
[15], which enhances the efficiency of the attention mechanism work [3] has found that models work best when PSM and
by grouping queries, allowing for a more scalable attention SPM account for 50% each. Thus, we choose the probability
computation. We maintain a balanced design in our attention of each mode being 50%.
heads, with a configuration of 32 query attention heads and 8 Finally, we feed the training sequence into aiXcoder-7B and
key-value attention heads. minimize the following loss function:
3 https://tree-sitter.github.io/tree-sitter/ lossF IM = − log p ([prefix; suffix; middle])
4 https://www.sonarsource.com/products/sonarqube/ (2)
− log p ([suffix; prefix; middle])
Algorithm 1 Sampling code files within a repository. Selected Span:
FIM
Original Code: or i in range(2
Inputs:
for i in range(20):
A list of code files F iles if i % 5 == 0: For
Outputs: print("Hello World") Selected Span:
An ordered list of code files orderedF iles Call If i % 5 == 0:
SFIM
1: orderedF iles ← []
2: randomV alue ← random(0, 1) Compare For

3: if randomV alue < 0.3 then


4: // Sampling based on file content similarity Fig. 3: Examples of selected spans in FIM and SFIM.
5: tf Idf List ← []
6: for f ile in F iles do
Structured Fill-In-the-Middle (SFIM). As shown in Fig-
7: tf Idf List.append(TFIDF(f ile))
ure 3, FIM randomly selects spans and trains models to
8: end for
predict incomplete and irregular code snippets (e.g., or i
9: clusterN um ← min((random(1, 20), F iles.size))
in range(2). However, developers often expect models to
10: Clusters ← KMEANS(clusterN um, tf Idf List)
complete the current code into a complete snippet, such as a
11: for Cluster in SHUFFLE(Clusters) do
completed code line or loop block, instead of suggesting an
12: orderedF iles.extend(SHUFFLE(Cluster))
incomplete code snippet. To address this, we propose SFIM,
13: end for
which trains aiXcoder-7B to predict complete code snippets,
14: else if randomV alue < 0.6 then
enabling aiXcoder-7B to align with the practical needs of
15: // Sampling based on file path similarity
developers. Given a code file, SFIM uses a new strategy for
16: while F iles.size > 0 do
selecting spans: ❶ randomly select a function from the file
17: K ← random(1, F iles.size)
and parse the function into a syntax tree; ❷ randomly choose
18: randomF ile ← random(F iles)
a non-root, non-leaf node from the tree and locate the code
19: for F ile in KNN PATH(randomF ile, K) do
snippet corresponding to this node; ❸ within this code snippet,
20: orderedF iles.append(F ile)
randomly select a span, where the start position is randomly
21: F iles.remove(F ile)
determined, but the end position must be the end of a code line.
22: end for
As shown in Figure 3, SFIM selects an If node and mines a
23: orderedF iles.append(randomF ile)
relatively complete code snippet (i.e., i%5==0:) as the span.
24: end while
Subsequently, we follow the FIM and convert the select span
25: else if randomV alue < 0.9 then
into a training sequence in the format of PSM or SPM. Based
26: // Sampling based on file dependencies
on preliminary experiments, we set the probability of selecting
27: callGraph ← CALL GRAPH(F iles)
PSM to 30% and SPM to 70%. We input the training sequence
28: leaf N odes ← GET LEAFS(callGraph)
into aiXcoder-7B and minimize the following loss function:
29: while leaf N odes.isN otEmpty() do
30: for node in leaf N odes do lossSF IM = − log p ([prefix; suffix; middle])
31: lastP redecessors ← {node} (3)
− log p ([suffix; prefix; middle])
32: while lastP redecessors.isN otEmpty() do
33: for node in SHUFFLE(lastP redecessors) do Multi-objective Training. We optimize the above three ob-
34: for p in node.predecessors do jectives alternately. Given a code repository, we choose SFIM
35: p.successors.remove(node) with a probability of 70% and FIM and NTP with a probability
36: if p.successors.isEmpty() then of 15%, respectively. Given a natural language article, we
37: leaf N odes.add(p) choose FIM and NTP with a probability of 50%, respectively.
38: end if We determine these probabilities based on previous work [3]–
39: end for [5] and our preliminary experiments.
40: orderedF iles.append(node.f ile) D. Training Details
41: end for
42: lastP redecessors ← node.predecessors We leverage Megatron to train aiXcoder-7B. The training
43: end while process is conducted on 128 A100 40GB GPUs, consuming a
44: end for total of 1.2 trillion unique tokens. The hyper-parameters used
45: end while during training are shown in Table I.
46: else IV. S TUDY D ESIGN
47: // Random sampling
48: orderedF iles.extend(SHUFFLE(F iles)) We design a large-scale study to evaluate the effectiveness
49: end if
of aiXcoder-7B. This section presents the details of our
50: return orderedF iles
study, including research questions, benchmarks, compared
baselines, and evaluation metrics.
TABLE I: Training Hyperparameters for aiXcoder-7B
• CodeShell-7B [20], introduced by Shell AI, is a 7B parame-
Hyperparameter aiXcoder-7B ter model focused on shell scripting and code interpretation,
lr-decay-iters 320000 trained on a mixture of code and command-line data.
• StarCoder2-7B [4], from BigCode, is a 7B parameter
weight-decay 0.01
lr-decay-style cosine model trained on The Stack v2 dataset, specializing in code
clip-grad 1.0 understanding and generation across multiple programming
hidden-dropout 0.1 languages.
attention-dropout 0.05 • DeepSeekCoder-7B [3], by DeepSeek AI, is a 7B parameter
adam-beta1, beat2 0.9, 0.98 model trained on a blend of code and natural language data,
Batch Size 512 designed for programming tasks.
Max Learning Rate 1e-5 ❷ LLMs with larger sizes. We also select four larger LLMs
Context window 32768 as baselines to demonstrate the superiority of aiXcoder-7B:
• CodeLlama-13B [2] is an enhanced version of the CodeL-
lama model with 13B parameters.
A. Research Questions • StarCoder-15B [4] is the expanded version of the StarCoder

Our study aims to answer the following Research Questions model with 15B parameters, delivering improved accuracy
(RQs). They evaluate aiXcoder-7B in three code completion for code synthesis and interpretation.
tasks, including Natural Language to Code (NL2Code), Fill- • StarCoder2-15B [5] is a 15B parameter model upgraded

In-the-Middle (FIM), and cross-file code completion. on the original StarCoder, offering refined code generation
RQ1: How does aiXcoder-7B perform on NL2Code and more diverse programming languages.
• CodeLlama-34B [2] is the largest variant of the CodeLlama
task compared to existing LLMs? NL2Code is the task
of completing the source code based on a natural language series with 34B parameters.
description or function signature. C. Benchmarks
RQ2: How does aiXcoder-7B perform on Fill-In-the-
Middle task compared to existing LLMs? FIM simulate NL2Code Benchmarks. Following previous studies [2]–
scenarios where developers modify existing code by predicting [4], we select three popular NL2Code benchmarks in our
the missing middle portion using bidirectional contexts. experiments, detailed as follows.
RQ3: How does aiXcoder-7B perform on Cross-File • HumanEval [21] and MBPP [22] consist of 164 and 974

Code Completion compared to existing LLMs? This task Python programming problems. Each problem includes a
requires completing code by using relevant context from other function signature, a detailed docstring, and several test
files within the current repository. cases. LLMs are required to complete the function body
In the RQs, we apply aiXcoder-7B on 6 benchmarks based on the signature and docstring. The generated code
totally. These benchmarks cover 6 programming languages. is checked by executing test cases, being considered correct
To show the superiority of aiXcoder-7B, we also select 10 only if all tests pass.
• MultiPL-E [23] is the multilingual version of HumanEval,
popular LLMs as baselines for comparison. Then, we report
the execution-based and text-based metrics (Section IV-D) of covering multiple programming languages, e.g., C++, Java,
completed programs. and JavaScript.
FIM Benchmarks. Code is rarely composed in a straight-
B. Compared Models forward left-to-right sequence. Simulating when a developer
modifies existing code, FIM refers to the task of completing
Note that our aiXcoder-7B was open-sourced in March missing a middle code snippet leveraging bidirectional con-
2024. Thus, we select LLMs released before March 2024 texts.
for comparison. Specifically, we select 10 popular LLMs for • Santacoder-data [24] is a popular FIM benchmarks con-
comparison, and they can be divided into two groups: sisting of 4,792 samples. It is built from MultiPL-E [23]
❶ LLMs with similar sizes. The first group contains six and requires LLMs to predict a single line of code based on
popular LLMs, which have similar sizes to aiXcoder-7B. the preceding and following context.
• CodeGen2.5-7B [17], released by Salesforce, is a 7B pa- • FIM-Eval is a large-scale FIM benchmark collected by
rameter model specialized in code generation and under- this paper. We construct FIM-Eval from some real-world
standing, trained on a diverse set of programming languages. repositories, which are excluded from the training data of
• CodeGeex2-7B [18], developed by Zhipu AI, is a 7B aiXcoder-7B. We extract 13 types of code snippets from
parameter model designed for code completion and bug these repositories and randomly mine spans from these
fixing, leveraging a large corpus of code data. code snippets. These 13 types of code snippets encompass
• CodeLlama-7B [2], an open-source model by Meta AI, is common code completion scenarios, including method sig-
a 7B parameter architecture fine-tuned on a vast collection natures, method bodies, single-line statements, methods with
of code and natural language data based on Llama2 [19]. comments, empty code blocks, specific positions within a
TABLE II: The Pass@1 of LLMs on NL2Code benchmarks.
Model HumanEval MBPP MultiPL-E (C++) MultiPL-E (Java) MultiPL-E (JS) Average
CodeGen2.5-7B 28.7% 39.2% 25.7% 26.1% 26.2% 29.1%
CodeGeex2-7B 36.0% 36.2% 29.2% 25.9% 24.8% 30.4%
CodeLlama-7B 31.7% 38.6% 29.8% 34.2% 29.2% 32.7%
CodeShell-7B 34.4% 38.6% 28.2% 30.4% 33.2% 32.9%
StarCoder2-7B 35.4% 54.4% 33.6% 29.4% 35.4% 37.6%
DeepSeekCoder-7B 49.4% 60.6% 50.3% 43.0% 48.4% 50.3%
aiXcoder-7B 54.9% 66.0% 58.2% 57.0% 64.5% 60.1%
StarCoder-15B 31.7% 42.8% 31.1% 28.5% 29.8% 32.8%
CodeLlama-13B 36.0% 48.4% 37.9% 38.0% 32.3% 38.5%
StarCoder2-15B 46.3% 66.2% 41.4% 33.9% 44.2% 46.4%
CodeLlama-34B 48.2% 55.2% 44.7% 44.9% 42.2% 47.0%

method body (top, middle, and bottom), specific control • Exact Match (EM) evaluates the percentage of cases where
statements (i.e., if statements, for loops, while loops, try the prediction exactly matches the reference, providing a
statements, and switch-case statements). Finally, we collect strict measure of how often LLMs produce correct code
16,140 samples covering four programming languages: C++ without deviations.
(4,080 samples), Java (4,080 samples), Python (3,900 sam- • Edit Similarity (ES) measures the similarity between the
ples), and JavaScript (4,080 samples). FIM-Eval provides a prediction and the reference based on the number of edits
reliable, practical, and diverse evaluation platform for FIM. required to transform one into the other, typically using
FIM-Eval has been open-sourced in our repository [1]. metrics like Levenshtein distance [28].
Cross-File Code Completion Benchmarks. This task re-
V. R ESULTS AND A NALYSES
quires LLMs to complete the code based on cross-file context
within the same project. Building upon insights from prior TABLE III: The exact match of LLMs on the SantaCoder-data
research [3], [4], detailed as follows. benchmark.
• CrossCodeEval [25] covers four popular programming lan- Model Python JavaScript Java Avg
guages: 2,665 Python samples, 2,139 Java samples, 3,356 StarCoder2-7B 61.1% 77.5% 81.1% 73.2%
TypeScript samples, and 1,768 C# samples. Each sample is CodeLlama-7B 67.6% 74.3% 80.2% 74.0%
CodeLlama-13B 68.3% 77.6% 80.7% 75.5%
provided in three formats: no cross-file context, retrieved DeepSeekCoder-7B 66.6% 79.7% 88.1% 78.1%
cross-file context, and retrieval with reference. The LLMs aiXcoder-7B 73.3% 81.7% 83.0% 79.3%
completed code snippet is compared using text-based met-
rics. A. RQ1: Performance on NL2Code
Following recent work on LLMs [3], [5], we use greedy
D. Evaluation Metrics
decoding and report Pass@1. Table II shows the results of
We describe the evaluation metrics used in different code different LLMs on NL2Code benchmarks. From Table II, we
completion tasks. draw the following observations:
NL2Code. NL2Code benchmarks provide test cases for • Compared to LLMs of similar sizes, our aiXcoder-7B
evaluation. Thus, we execute test cases to check the correct- achieves the current best results, outperforming the top-
ness of the generated code and report Pass@k [21]. Specif- performing model DeepSeekCoder-7B by an average of
ically, we generate n ≥ k code snippets per testing sample, 9.8%. Moreover, it significantly surpasses CodeGen2.5-7B
count the number of correct code snippets c ≤ n that pass all with a 31% absolute advantage.
test cases, and calculate the Pass@k: • aiXcoder-7B even surpasses four larger LLMs (e.g.,
StarCoder2-15B and CodeLlama-34B), achieving a lead of
 
n−c

 k  13.1% over CodeLlama-34B, which is nearly five times
Pass@k := E  1−    (4) larger, and 13.7% over StarCoder2-15B on average.
Samples  n 
• Across languages like Java, Python, C++, and JavaScript,
k
our aiXcoder-7B shows strong performance. It surpasses
FIM and cross-file code completion. We consider the DeepSeekCoder-7B by 16.1% in JavaScript and exceeds by
LLMs’ completions as predictions and the human-written 5.5% in Python.
completions as references. We compare the predictions to
references and compute the following metric: B. RQ2: Performance on Fill-In-the-Middle (FIM)
• BLEU [26] measures the n-gram similarity between predic- Generally, FIM closely mirrors how developers modify
tions and references. n is empirically set to 4. existing code, making it an ideal method for evaluating models
• CodeBLEU [27] is a variant of BLEU for code. It considers in real-world programming scenarios.
not only the n-gram similarity but also the syntax and data Based on the experimental findings outlined in Table III,
flow similarity. aiXcoder-7B demonstrates the highest overall performance
TABLE IV: Performance of LLMs on CrossCodeEval.
Python Java TypeScript C# Average
Model
EM ES EM ES EM ES EM ES EM ES
Base Model
CodeLlama-7B 22.3 55.2 27.9 66.9 10.8 70.9 45.8 77.2 26.7 67.6
StarCoder2-7B 22.5 57.3 25.9 65.9 28.9 71.6 39.5 70.5 29.2 66.3
DeepSeekCoder-7B 27.2 62.3 33.4 73.2 36.6 77.3 45.9 77.0 35.8 72.4
aiXcoder-7B 30.0 70.8 34.9 77.8 35.3 79.6 49.9 86.4 37.5 78.7
+ Retrieval BM25
CodeLlama-7B 23.5 53.5 33.9 68.4 11.5 71.5 50.6 75.3 29.9 67.2
StarCoder2-7B 25.3 58.0 31.4 67.4 33.3 73.2 43.5 69.8 33.4 67.1
DeepSeekCoder-7B 29.9 62.9 39.8 74.8 39.0 77.0 52.2 78.1 40.2 73.2
aiXcoder-7B 35.3 74.3 42.2 80.4 39.9 81.3 57.7 88.8 43.8 81.2
+ Retrieval w/ Ref.
CodeLlama-7B 26.7 54.9 36.3 69.0 12.8 72.9 52.8 75.0 32.1 67.9
StarCoder2-7B 28.5 59.0 35.0 69.2 36.0 72.6 47.9 71.6 36.9 68.1
DeepSeekCoder-7B 33.2 64.5 43.7 76.1 43.4 78.4 55.4 78.7 43.9 74.4
aiXcoder-7B 40.4 76.3 47.0 82.4 45.0 83.8 61.0 89.4 48.4 83.0

TABLE V: The performance of LLMs on FIM-Eval.


• aiXcoder-7B offers no clear edge over DeepSeekCoder-
Java 7B in Python, likely due to lower training data propor-
Model
EM(%) BLEU-4 CodeBLEU Average tion. When calculating CodeBLEU in FIM-Eval, aiXcoder-
CodeLLama-7B 38.1 56.9 69.9 55.0 7B’s score of 63.0 is slightly lower than DeepSeekCoder-
StarCoder2-7B 37.7 57.7 69.2 54.9 7B’s score of 63.4. In several aspects, such as method
DeepSeekCoder-7B 43.4 63.4 71.7 59.5
aiXcoder-7B 49.4 70.6 74.0 64.7
body top/mid and if statement, it falls behind by up to
10%, indicating the need for a better understanding of
C++
Model method initiations and conditional branches. Moreover, in
EM(%) BLEU-4 CodeBLEU Average the SantaCoder-data benchmark, aiXcoder-7B’s 83.0% EM
CodeLLama-7B 25.4 44.2 63.9 44.5 of Java is 5.1% lower than the best score of 88.1%. This will
StarCoder2-7B 22.6 41.9 61.2 41.9
DeepSeekCoder-7B 29.1 50.0 65.8 48.3 be rectified by boosting Python and Java data in training.
aiXcoder-7B 37.3 60.3 67.4 55.0
JavaScript
C. RQ3: Performance on Cross-File Code Completion
Model
EM(%) BLEU-4 CodeBLEU Average Another important capability of LLMs is the ability to
CodeLLama-7B 25.7 44.1 60.4 43.4
understand code context across files, as developers often need
StarCoder2-7B 23.9 42.0 57.4 41.1 to consider information from other files within the current
DeepSeekCoder-7B 29.3 49.0 60.3 46.2 project when writing code. In Table IV, we fix the context
aiXcoder-7B 36.5 58.0 64.3 52.9
length for all LLMs at 16K and format the input using the
Python PSM pattern in FIM. All LLMs employ the greedy search to
Model
EM(%) BLEU-4 CodeBLEU Average generate code.
CodeLLama-7B 21.6 39.9 60.0 40.5 We design three experimental settings: ❶ Base Setting.
StarCoder2-7B 24.4 43.5 59.5 42.5 As a baseline, LLMs complete based solely on the current
DeepSeekCoder-7B 29.6 51.1 63.4 48.0
aiXcoder-7B 35.0 56.1 63.0 51.4 file without cross-fire context. ❷ Retrieval BM25. Based on
the current file context in the base settings, It additionally
uses BM25 to match repository code fragments. The top 5
on SantaCoder-data, achieving the best results in Python, matches, capped at 512 tokens, are added to the prompt, along
JavaScript, and Java among the models tested. with formatted class definitions from other files. ❸ Retrieval
Table V shows the average generation performance on FIM- w/Ref. In this setting, we make use of not only the in-file
Eval. Figure 4 shows the performance of LLMs in predicting context (as in Retrieval BM25 setting) but also the reference to
different types of code. Based on the results, we obtain the retrieve the cross-file context. We prepend the retrieved context
following observations: to the in-file context to construct the prompt for this setting.
• In real-world programming, aiXcoder-7B performs well in The subsequent conclusions can be made:
FIM. When evaluated on Java, C++, and JavaScript in • Under three experimental settings, aiXcoder-7B performs
FIM-Eval, aiXcoder-7B surpasses DeepSeekCoder-7B by very well, achieving EM of 30.0 and ES of 70.8 in Python,
an average of 5.2, 6.7, and 6.4 in FIM metrics for these outperforming CodeLlama-7B by 7.7 and 21.4 in the base
three languages, highlighting its multilingual versatility. It setting. In the other two experimental settings with re-
is highest in C++, exceeding StarCoder2-7B by 11.8. trieval, aiXcoder-7B has an average EM that is higher than
Fig. 4: Performance of LLMs on different types of code in FIM-Eval.
DeepSeekCoder-7B by 3.6 and 4.5 and an average ES that For example, CodeLlama-7B produces ratios of 2.14 for
is higher by 4.5 and 8.6. Java and 3.02 for C++, while StarCoder2-7B reaches even
• The model’s performance varies across different languages. higher ratios, such as 3.62 for C++. In contrast, aiXcoder-
aiXcoder-7B excels in C# and achieves an EM of 61 in 7B consistently generates code predictions that are similar in
the third setting. In other languages, the improvements of length to human-written code, achieving ratios of 0.97 for
aiXcoder-7B slightly decrease. For example, it only achieves Java and 0.87 for Python. We attribute this performance to
an EM of 40.4 in Python. In the future, we will continue to aiXcoder-7B’s Structured Fill-In-the-Middle (SFIM) training
enhance the model’s performance across different languages. objective, which helps align model outputs with human-written
code, resulting in more efficient coding practices.
VI. D ISCUSSION
A. Comparison in the Length of Code TABLE VI: The length comparison between the generated code and
reference code
We propose a novel evaluation perspective in FIM, i.e., Model Java C++ JavaScript Python
comparing the code length between human-written reference
CodeLlama-7B 2.14(340/159) 3.02(486/161) 2.39(407/170) 3.28(547/167)
code and code generated by LLMs. It is essential not only to StarCoder2-7B 2.22(353/159) 3.62(583/161) 2.69(458/170) 2.92(488/167)
ensure that the completed code is functionally correct but also DeepSeekCoder-7B 1.37(217/159) 2.05(330/161) 1.37(232/170) 1.65(275/167)
aiXcoder-7B 0.97(154/159) 1.35(217/161) 1.04(177/170) 0.87(146/167)
that its length is consistent with what a human programmer
would produce.
B. Insights of Training LLMs for Code
To gain insights into this aspect, we evaluate LLMs’ per-
formance using FIM-Eval (Section IV-C), which includes a Based on our practices in aiXcoder-7B, we summarize
variety of scenarios. Additionally, we present the code length the following insights to help practitioners train the next
ratio, which is calculated as the ratio of the number of tokens generations of LLMs for code.
in the prediction to the number of tokens in the ground truth Scaling up training data can continuously improve
code. Based on the experimental results in Table VI below, we the performance of LLMs. Although the scaling law [29]
observe that existing LLMs tend to over-generate, producing provides a relationship between model size and the amount
code that is substantially longer than necessary. Too long code of training data, we discover that further scaling up training
will increase the burden on users and reduce maintainability. data is necessary. Even if the training loss of LLMs is already
small, continually training with new data still can improve the Codex is fine-tuned from GPT-3 using a high-quality code
performance of models. Similar phenomena have also been dataset and demonstrates strong performance in Python code
found in other works [30]. completion. Following Codex, OpenAI continued to lead with
Exploiting the relationships between code files during the development of GPT-4, GPT-4 Turbo, and GPT-4o, all of
training can enhance the LLMs’ ability to understand which exhibit strong capabilities in both code generation and
cross-file context. In practical applications, LLMs often need code completion [31]. Alongside OpenAI, Google introduced
to predict code based on cross-file context [25]. Thus, during Gemini [32], and Anthropic contributed with Claude3 [33],
the training process, we should organize files based on their all of which integrate code-related tasks into broader con-
relationships and train LLMs to understand and utilize the versational capabilities. These closed-source models dominate
cross-file context. For example, we sample files based on the landscape with outstanding performance in code-related
the similarity of their content, which is closer to retrieval- tasks due to their extensive training datasets and advanced
augmented code completion scenarios. optimization techniques.
Incorporating the code structures into the training Open-Source LLMs for Code. Parallel to the development
objectives can improve the performance of LLMs. The code of closed-source LLMs, open-source LLMs have significantly
is highly structured. However, previous LLMs [2]–[4] view the expanded access to code-related technologies, thereby foster-
code into plain text, ignoring the underlying code structures. ing exploration and innovation in the research community.
This paper first incorporates code structures into the training Meta AI’s CodeLlama [2], built upon Llama2 [19], showcased
objectives of LLMs and proposes SFIM. SFIM constructs advanced capabilities, including fill-in-the-blank and zero-shot
code spans based on syntax trees of code and trains LLMs to programming. DeepSeek AI’s DeepSeek Coder [3], trained
generate accurate and concise code. The results in Section V on a diverse dataset of 2 trillion tokens, offered multiple
show the effectiveness of our SFIM. This inspires practitioners model sizes to meet various needs. The BigCode community
to explore new training objectives to model valuable code released StarCoder [4], trained on The Stack v2 dataset [9],
structures, e.g., control flows and data flows. which outperformed other open-source models in Python and
other programming languages at the time. These models,
C. Threats to Validity along with newer iterations like DeepSeek Coder V2 [34],
We summarize two main threats to this paper. have effectively reduced the performance gap with closed-
Data Leakage. A critical threat when training LLMs is the source models while promoting transparency, reproducibility,
potential inclusion of evaluation data within the training set, and community-driven development.
which can undermine the reliability of evaluation outcomes. Our aiXcoder-7B is part of the open-source community’s
To address this threat, we exclude any data related to our ongoing efforts to advance code completion. It outperforms
evaluation datasets during data collection. Additionally, the existing LLMs with similar sizes in six code completion
FIM-Eval dataset we constructed and used in our experiments benchmarks, serving as a lightweight and effective function
was further emphasized to ensure its independence from the model for academia and industry.
training data. While we cannot guarantee the absence of data VIII. C ONCLUSION AND F UTURE W ORK
leakage in other models due to lack of access, our benchmarks
demonstrate that aiXcoder-7B outperforms them reliably. Conclusion. This paper presents aiXcoder-7B, a lightweight
The selection of hyper-parameters. Another threat to the and effective LLM for code completion. aiXcoder-7B is
validity of our study lies in the selection of hyperparameters trained with 1.2 trillion unique tokens and employs some
and rules used during the training of aiXcoder-7B, including novel training techniques, including diverse data sampling
model architecture hyperparameters, thresholds for data clean- strategies and multi-objective training. We conduct extensive
ing, data deduplication parameters, code quality assessment experiments on six code completion benchmarks covering six
criteria, and sensitive information removal strategies. We se- programming languages. The results show that aiXcoder-7B
lected these based on our preliminary experiments and prior outperforms the latest six LLMs with similar sizes and even
empirical knowledge. We did not conduct a comprehensive hy- surpasses four larger LLMs (e.g., CodeLLaMa-34B). We also
perparameter search due to the substantial computational costs provide some valuable insights for helping practitioners train
involved in pre-training, potentially resulting in suboptimal the next generations of LLMs for code.
configurations. However, this does not affect our contributions, Future Work. In the future, we will train more powerful
as future improvements in hyperparameter optimization or lightweight LLMs for code completion. Specifically, we plan
heuristic rules can be easily integrated into our framework. to design a model architecture dedicated to code. Compared
with Transformer, it can explicitly model code structures, e.g.,
VII. R ELATED W ORK syntax structures. In addition, we will soon release instruction
This section provides an overview of the evolution of LLMs fine-tuned versions of aiXcoder-7B to support more software
for code. We categorize LLMs into closed-source and open- engineering tasks, e.g., code summarization and code repair.
source models. R EFERENCES
Closed-Source LLMs for Code. One of the earliest no- [1] aiXcoder, “aixcoder-7b,” https://github.com/aixcoder-plugin/
table breakthroughs is Codex [21], introduced by OpenAI. aiXcoder-7B, 2024.
[2] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, [16] M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek,
Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, and M. Chen, “Efficient training of language models to fill in the
J. Bitton, M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong, middle,” CoRR, vol. abs/2207.14255, 2022.
A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, [17] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou, “Code-
T. Scialom, and G. Synnaeve, “Code llama: Open foundation models gen2: Lessons for training llms on programming and natural languages,”
for code,” CoRR, vol. abs/2308.12950, 2023. [Online]. Available: CoRR, vol. abs/2305.02309, 2023.
https://doi.org/10.48550/arXiv.2308.12950 [18] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang,
[3] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang, “Codegeex: A
Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-coder: pre-trained model for code generation with multilingual benchmarking
When the large language model meets programming - the rise of code on humaneval-x,” in Proceedings of the 29th ACM SIGKDD Conference
intelligence,” CoRR, vol. abs/2401.14196, 2024. on Knowledge Discovery and Data Mining, 2023, pp. 5673–5684.
[4] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, [19] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei,
M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher,
T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes,
J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M. Yee, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn,
L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. M. S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa,
V, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee,
Z. Zhang, N. Moustafa-Fahmy, U. Bhattacharyya, W. Yu, S. Singh, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra,
S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi,
N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang,
M. Mishra, A. Gu, J. Robinson, C. J. Anderson, B. Dolan-Gavitt, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang,
D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov,
Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,”
“Starcoder: may the source be with you!” CoRR, vol. abs/2305.06161, CoRR, vol. abs/2307.09288, 2023.
2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.06161 [20] R. Xie, Z. Zeng, Z. Yu, C. Gao, S. Zhang, and W. Ye, “Codeshell
[5] A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, technical report,” CoRR, vol. abs/2403.15747, 2024.
A. Tang, D. Pykhtar, J. Liu, Y. Wei, T. Liu, M. Tian, D. Kocetkov, [21] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan,
A. Zucker, Y. Belkada, Z. Wang, Q. Liu, D. Abulkhanov, I. Paul, Z. Li, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri,
W. Li, M. Risdal, J. Li, J. Zhu, T. Y. Zhuo, E. Zheltonozhskii, N. O. O. G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,
Dade, W. Yu, L. Krauß, N. Jain, Y. Su, X. He, M. Dey, E. Abati, Y. Chai, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian,
N. Muennighoff, X. Tang, M. Oblokulov, C. Akiki, M. Marone, C. Mou, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis,
M. Mishra, A. Gu, B. Hui, T. Dao, A. Zebaze, O. Dehaene, N. Patry, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak,
C. Xu, J. J. McAuley, H. Hu, T. Scholak, S. Paquet, J. Robinson, C. J. J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
Anderson, N. Chapados, and et al., “Starcoder 2 and the stack v2: The A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford,
next generation,” CoRR, vol. abs/2402.19173, 2024. M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew,
[6] GitHub, “Github copilot,” https://github.com/features/copilot, 2023. D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating
[7] BAAI, “Wudaocorporatext,” https://data.baai.ac.cn/details/ large language models trained on code,” CoRR, vol. abs/2107.03374,
WuDaoCorporaText, 2023. 2021.
[8] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, H. Alobeidli, A. Cap-
[22] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan,
pelli, B. Pannier, E. Almazrouei, and J. Launay, “The refinedweb dataset
E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, “Program synthesis
for falcon LLM: outperforming curated corpora with web data only,”
with large language models,” CoRR, vol. abs/2108.07732, 2021.
in Advances in Neural Information Processing Systems 36: Annual
Conference on Neural Information Processing Systems 2023, NeurIPS [23] F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin,
2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, D. Pinckney, M. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha,
T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., M. Greenberg, and A. Jangda, “A scalable and extensible approach
2023. to benchmarking nl2code for 18 programming languages,” CoRR, vol.
abs/2208.08227, 2022.
[9] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, Y. Jernite, M. Mitchell,
C. M. Ferrandis, S. Hughes, T. Wolf, D. Bahdanau, L. von Werra, and [24] L. B. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, C. M. Ferrandis,
H. de Vries, “The stack: 3 TB of permissively licensed source code,” N. Muennighoff, M. Mishra, A. Gu, M. Dey, L. K. Umapathi, C. J.
Trans. Mach. Learn. Res., vol. 2023, 2023. Anderson, Y. Zi, J. Lamy-Poirier, H. Schoelkopf, S. Troshin, D. Ab-
[10] A. Z. Broder, “Identifying and filtering near-duplicate documents,” ulkhanov, M. Romero, M. Lappert, F. D. Toni, B. G. del Rı́o, Q. Liu,
in Combinatorial Pattern Matching, 11th Annual Symposium, CPM S. Bose, U. Bhattacharyya, T. Y. Zhuo, I. Yu, P. Villegas, M. Zocca,
2000, Montreal, Canada, June 21-23, 2000, Proceedings, ser. S. Mangrulkar, D. Lansky, H. Nguyen, D. Contractor, L. Villa, J. Li,
Lecture Notes in Computer Science, R. Giancarlo and D. Sankoff, D. Bahdanau, Y. Jernite, S. Hughes, D. Fried, A. Guha, H. de Vries,
Eds., vol. 1848. Springer, 2000, pp. 1–10. [Online]. Available: and L. von Werra, “Santacoder: don’t reach for the stars!” CoRR, vol.
https://doi.org/10.1007/3-540-45123-4 1 abs/2301.03988, 2023.
[11] S. Har-Peled, P. Indyk, and R. Motwani, “Approximate nearest [25] Y. Ding, Z. Wang, W. U. Ahmad, H. Ding, M. Tan, N. Jain, M. K.
neighbor: Towards removing the curse of dimensionality,” Theory Ramanathan, R. Nallapati, P. Bhatia, D. Roth, and B. Xiang, “Cross-
Comput., vol. 8, no. 1, pp. 321–350, 2012. [Online]. Available: codeeval: A diverse and multilingual benchmark for cross-file code
https://doi.org/10.4086/toc.2012.v008a014 completion,” in NeurIPS, 2023.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, [26] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for
L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, automatic evaluation of machine translation,” in Proceedings of the 40th
pp. 5998–6008. annual meeting of the Association for Computational Linguistics, 2002,
[13] T. Kudo and J. Richardson, “Sentencepiece: A simple and language pp. 311–318.
independent subword tokenizer and detokenizer for neural text pro- [27] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan,
cessing,” in EMNLP (Demonstration). Association for Computational M. Zhou, A. Blanco, and S. Ma, “Codebleu: a method for automatic
Linguistics, 2018, pp. 66–71. evaluation of code synthesis,” CoRR, vol. abs/2009.10297, 2020.
[14] J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “Roformer: [28] V. I. Levenshtein et al., “Binary codes capable of correcting deletions,
Enhanced transformer with rotary position embedding,” Neurocomput- insertions, and reversals,” in Soviet physics doklady, vol. 10, no. 8.
ing, vol. 568, p. 127063, 2024. Soviet Union, 1966, pp. 707–710.
[15] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, [29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess,
and S. Sanghai, “GQA: training generalized multi-query transformer R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling
models from multi-head checkpoints,” in EMNLP. Association for laws for neural language models,” CoRR, vol. abs/2001.08361, 2020.
Computational Linguistics, 2023, pp. 4895–4901. [Online]. Available: https://arxiv.org/abs/2001.08361
[30] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra,
“Grokking: Generalization beyond overfitting on small algorithmic
datasets,” CoRR, vol. abs/2201.02177, 2022. [Online]. Available:
https://arxiv.org/abs/2201.02177
[31] OpenAI, “GPT-4 technical report,” CoRR, vol. abs/2303.08774, 2023.
[32] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly
capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
[33] “The claude 3 model family: Opus, sonnet, haiku.” [Online]. Available:
https://api.semanticscholar.org/CorpusID:268232499
[34] Q. Zhu, D. Guo, Z. Shao, D. Yang, P. Wang, R. Xu, Y. Wu, Y. Li, H. Gao,
S. Ma et al., “Deepseek-coder-v2: Breaking the barrier of closed-source
models in code intelligence,” arXiv preprint arXiv:2406.11931, 2024.

You might also like