aiXcoder-7B
Abstract—Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs will increase the response time of code completion and decrease the developers' productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy while having smaller scales (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: ❶ Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. ❷ Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. ❸ Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B in five popular code completion benchmarks and a new benchmark collected by this paper. The results show that aiXcoder-7B outperforms the latest six LLMs with similar sizes and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights for helping practitioners train the next generations of LLMs for code. aiXcoder-7B has been open-sourced and gained significant attention [1]. As of the submission date, aiXcoder-7B has received 2,193 GitHub Stars.

Index Terms—Code Completion, Large Language Model

* Siyuan Jiang and Jia Li contribute equally and are co-first authors.

I. INTRODUCTION

Large Language Models (LLMs) have been widely used in code completion [2]–[5], i.e., predicting the subsequent code based on the previous context. For example, GitHub Copilot [6], an LLM-based code completion tool, is regularly utilized by developers from over 10,000 organizations. Nowadays, researchers often improve the accuracy of LLMs by scaling up LLMs, e.g., CodeLlama-70B [2]. However, larger LLMs will increase the response time of code completion, which is a critical factor for developer experience and productivity. Thus, it is necessary to train lightweight LLMs that maintain high code completion accuracy while having smaller scales.

Recognizing the above research gap, we present aiXcoder-7B, a lightweight and powerful LLM for code completion. aiXcoder-7B contains 7 billion parameters, ensuring a high inference speed while achieving superior code completion accuracy. In our later experiments, aiXcoder-7B outperforms the latest LLMs with similar sizes in six code completion benchmarks and even surpasses larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B). aiXcoder-7B effectively balances model size and performance, providing a better foundational model for both academia and industry.

Compared to previous LLMs, we attribute the superiority of aiXcoder-7B to the following three key factors:
• Multi-objective training. Previous LLMs mainly use Next-Token Prediction (NTP) as the training objective, which only covers limited code completion scenarios. To address this limitation, we propose multi-objective training, including NTP, Fill-In-the-Middle (FIM), and Structured Fill-In-the-Middle (SFIM). NTP simulates the scenario where developers write a new file from top to bottom, and FIM models the scenario of developers modifying existing code. Because FIM mainly trains models to predict incomplete and irregular code snippets, we further propose SFIM. It parses the code into a syntax tree and mines a relatively complete code span based on a tree node. aiXcoder-7B is trained to predict the code span based on its surrounding context. The three objectives help aiXcoder-7B learn a comprehensive code completion ability across a wider range of code completion scenarios. Details of the multi-objective training are in Section III-C.
• A diverse data sampling algorithm. A code repository often contains multiple code files. Previous studies [2], [4], [5] typically randomly sample files for training, failing to leverage the relationships and contextual information between files. We propose four new sampling strategies: sampling based on file content similarity, sampling based on file
path similarity, sampling based on inter-file dependency, and random sampling. The first three strategies simulate common cross-file code completion scenarios, such as code completion augmented by similar code and cross-file API completion, helping aiXcoder-7B better understand and utilize dependencies across files. The fourth strategy, random sampling, is to simulate other potential code completion scenarios. Through these diverse sampling strategies, we enhance aiXcoder-7B's understanding capability of cross-file contexts within a repository. Details of our data sampling algorithm are in Section III-B.
• Extensive high-quality data. LLMs are inherently data-driven, and their performance is significantly influenced by the quantity and quality of the training data. We establish a rigorous data collection pipeline, including data crawling, data cleaning, deduplication, code quality checks, and sensitive information removal. We leverage this pipeline to collect a substantial amount of high-quality training data. We continuously feed the training data into aiXcoder-7B, consuming a total of 1.2 trillion unique tokens. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code data, allowing it to perform exceptionally well across different code completion scenarios. More details of our data collection pipeline are in Section II.
We assess the effectiveness of aiXcoder-7B in three code completion tasks, including Fill-In-the-Middle (FIM), cross-file code completion, and Natural language to Code (NL2Code). We experiment with six code completion benchmarks, five of which are popular public datasets and one is FIM-Eval collected by this paper. FIM-Eval is a benchmark for FIM, consisting of 16,136 samples and covering four languages (i.e., Java, Python, C++, JavaScript). FIM-Eval additionally labels the types of code to be completed, including 13 types, e.g., function signatures and comments. Then, we compare aiXcoder-7B to 10 recently released LLMs (from 7B to 34B) on these benchmarks and yield the following insights: ❶ aiXcoder-7B substantially outperforms LLMs with similar sizes and even surpasses larger LLMs in six benchmarks. For example, in a popular benchmark - HumanEval, aiXcoder-7B achieves a Pass@1 score of 54.9%, outperforming CodeLlama-34B (i.e., 48.2%) and StarCoder2-15B (i.e., 46.3%). The improvements show that aiXcoder-7B achieves higher code completion accuracy while having smaller scales. ❷ Based on our FIM-Eval, we analyze the performance of aiXcoder-7B in completing different types of code. aiXcoder-7B outperforms LLMs with similar sizes on most types (max: 13, min: 8). The results show the strong generalization ability of aiXcoder-7B in code completion. ❸ We show that existing LLMs are prone to generating longer code in FIM, while the code generated by aiXcoder-7B is closer in length to human-written reference code. The result shows that the code generated by aiXcoder-7B is more concise and closer to the human coding style.
Insights of training LLMs for code. Based on our practices in aiXcoder-7B, we summarize three valuable insights, including scaling up training data and introducing inter-file relationships and code structures into the training. These insights can help practitioners train the next generations of LLMs for code.

We summarize the key contributions of this paper:
• We present aiXcoder-7B, a lightweight and effective LLM with 7 billion parameters for code completion. We have released its weights and code [1]. As of the submission date, aiXcoder-7B has received 2,193 GitHub Stars.
• We propose a novel training objective - Structured Fill-In-the-Middle, which considers the syntax structures in code and effectively improves the performance of LLMs.
• We propose a new data sampling algorithm for code, which considers inter-file relationships and enhances the capability of LLMs in understanding cross-file contexts.
• We release a new code completion benchmark, consisting of 16,136 samples and covering four languages.
• We evaluate the effectiveness of aiXcoder-7B in six code completion benchmarks. aiXcoder-7B substantially outperforms 6 LLMs with similar sizes and even surpasses 4 larger LLMs (15B and 34B).
II. DATA COLLECTION PIPELINE

This section presents the process of collecting the pre-training data of aiXcoder-7B. Figure 1 shows an overview of our data collection pipeline, consisting of five stages: data crawling (Section II-A), data cleaning (Section II-B), data deduplication (Section II-C), code quality checking (Section II-D), and sensitive and personally identifiable information removal (Section II-E). Through this pipeline, we collect and clean 2.8TB of natural language data and 3.5TB of source code data. Figure 2 visualizes the distributions of the top 10 programming languages in the code data. Next, we describe the details of the data collection pipeline in the following sections.

Fig. 1: An overview of our data collection pipeline.

Fig. 2: The distributions of the top 10 programming languages in our source code training data.

A. Data Crawling

The pre-training data of aiXcoder-7B consists of two parts: natural language data and source code data.
Natural Language Data. We collect natural language data from two public datasets: WuDaoCorpora [7] and RefineWeb [8], driven by two key motivations. First, these datasets are highly diverse, covering a wide range of domains and languages. They include a broad spectrum of natural language text from the internet, such as social media conversations, books, and technical papers, and cover two mainstream languages, i.e., English and Chinese. Second, both datasets have been thoroughly cleaned and deduplicated in previous studies, which significantly reduces the preprocessing workload and allows us to focus on processing code data. Finally, we collect 2.8TB of natural language data for pre-training.
Source Code Data. The raw source code data comes from two sources: one is the open-source dataset - The Stack v1.2, and the other is the code data we crawled ourselves.
• The Stack v1.2 [9] is a comprehensive dataset comprising approximately 6TB of permissively licensed source code sourced from public GitHub repositories, spanning 358 programming languages, with notable representation from HTML, JavaScript, Java, and C. This dataset has undergone rigorous cleaning processes to enhance data integrity and prevent inflated performance metrics due to duplicate content. Only repositories with permissive licenses were retained, while non-essential files, such as binaries and those exceeding 1MB, were systematically excluded. Version 1.2 of The Stack has excluded opt-out requests submitted by February 9, 2023, as well as initially flagged malicious files (this exclusion is not exhaustive).
• Our crawled data. These are popular repositories that we have been crawling from GitHub for the past decade.
B. Data Cleaning

In this stage, we clean the collected data by removing invalid or low-quality data. Because the natural language data has already undergone rigorous cleaning, we focus on cleaning the source code data. Our cleaning process comprises two steps: repository-level cleaning and file-level cleaning. Below, we provide a detailed explanation of each step.
Repository-level Cleaning. Our goal is to remove repositories with non-permissive licenses and low-quality repositories. To achieve this goal, our cleaning is performed in three steps:
• Collecting permissive licenses. We build a list of permissive licenses based on the Blue Oak Council1 and previous work [9]. This list includes various permissive licenses with minimal restrictions on software copying, modification, and redistribution. Only repositories with licenses from this list are retained for pre-training.
We score each repository from different aspects, including the number of stars, the number of git commits, and the number of test files. Then, we sort the repositories in descending order based on their scores and remove the lowest 10%.

1 https://blueoakcouncil.org/list
2 https://github.com/src-d/go-license-detector
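To make the repository-level cleaning concrete, the sketch below shows one way such a filter could be implemented. It is illustrative only, not the aiXcoder-7B code: the license allow-list is a tiny placeholder subset, and the equal-weight scoring over stars, commits, and test files is an assumption (the paper does not give the exact scoring formula).

```python
# Illustrative sketch of repository-level cleaning (not the exact aiXcoder-7B implementation).
# Assumption: each repository is a dict carrying its license and the three quality signals.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # hypothetical subset of the allow-list

def score_repository(repo: dict) -> float:
    # Hypothetical equal-weight score over the three signals mentioned in the text.
    return repo["stars"] + repo["num_commits"] + repo["num_test_files"]

def clean_repositories(repos: list[dict]) -> list[dict]:
    # Step 1: keep only repositories whose license is on the permissive allow-list.
    licensed = [r for r in repos if r.get("license", "").lower() in PERMISSIVE_LICENSES]
    # Step 2: sort by the quality score in descending order and drop the lowest 10%.
    ranked = sorted(licensed, key=score_repository, reverse=True)
    keep = int(len(ranked) * 0.9)
    return ranked[:keep]
```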
File-level Cleaning. Next, we filter out low-quality files in repositories. Specifically, we empirically design some rules to filter out low-quality documents: ❶ Trivial files, including empty files, corrupted files, non-text files, and auto-generated files. ❷ Too long files. Too long files typically contain wordy or repetitive content and are not suitable as training data. If a line in a file exceeds 1000 characters, the total number of lines in the file exceeds 10,000, or the size of the file exceeds 1MB, we consider it a long file.
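A minimal sketch of the "too long file" rule as described above; the thresholds (1000 characters per line, 10,000 lines, 1MB) come from the text, while the file-reading details are assumptions.

```python
import os

MAX_LINE_CHARS = 1000       # thresholds taken from the paper's description
MAX_LINES = 10_000
MAX_FILE_BYTES = 1_000_000  # roughly 1MB

def is_long_file(path: str) -> bool:
    """Return True if the file trips any of the 'too long' rules described above."""
    if os.path.getsize(path) > MAX_FILE_BYTES:
        return True
    try:
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            lines = f.readlines()
    except OSError:
        return True  # unreadable files are treated as low quality (assumption)
    if len(lines) > MAX_LINES:
        return True
    return any(len(line) > MAX_LINE_CHARS for line in lines)
```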
C. Data Deduplication

Previous work [8] has shown that data deduplication can significantly improve the performance of trained models. It is particularly necessary for code data, where code reuse leads to a large amount of duplicate content. Therefore, in this stage, we eliminate duplicate code files within the repositories. Our deduplication process consists of two steps:
• Exact deduplication. We extract file contents, find the files with exactly the same content, and keep only one copy.
• Near deduplication. Exact deduplication is too strict and may miss near-duplicate files. Thus, we further perform near deduplication. We compute the MinHash [10] with 256 permutations of all files and use Locality Sensitive Hashing [11] to find clusters of duplicates. We further reduce the clusters by ensuring that each file in the original cluster is similar to at least one other file in the reduced cluster. We consider two files near-duplicates when their Jaccard similarity exceeds 0.85.
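The following sketch illustrates both steps, using SHA-256 hashing for exact deduplication and the datasketch library for MinHash/LSH near deduplication. It is a simplified stand-in, assuming whitespace tokenization and in-memory data; the 256 permutations and the 0.85 Jaccard threshold follow the text, but the cluster-reduction step is omitted.

```python
import hashlib
from datasketch import MinHash, MinHashLSH

def exact_dedup(files: dict[str, str]) -> dict[str, str]:
    """Keep one file per identical content (exact deduplication)."""
    seen, kept = set(), {}
    for path, content in files.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[path] = content
    return kept

def near_duplicate_clusters(files: dict[str, str], threshold: float = 0.85) -> dict[str, list]:
    """Group files whose MinHash-estimated Jaccard similarity exceeds the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=256)
    signatures = {}
    for path, content in files.items():
        m = MinHash(num_perm=256)
        for token in content.split():  # simplistic tokenization (assumption)
            m.update(token.encode("utf-8"))
        signatures[path] = m
        lsh.insert(path, m)
    # For each file, query the LSH index for its candidate near-duplicates.
    return {path: lsh.query(m) for path, m in signatures.items()}
```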
D. Code Quality Checking

In this stage, we use code analysis tools to assess the quality of code data and filter out low-quality code. Low-quality code often contains syntax errors, code defects, and vulnerabilities,
misleading models into generating unreliable code. Specifically, we use the following tools to assess the quality of code:
Syntax Parser. Syntax correctness is one of the basic principles that source code should satisfy. We use a public syntax parser - tree-sitter3 to parse all code files and delete files that fail to parse or time out.
SonarQube. SonarQube4 is an open-source tool for the inspection of code quality. It can detect code defects, vulnerabilities, code smells, and technical debt in various programming languages. We use SonarQube to identify problematic code files and delete them.
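As an illustration of the syntax check, the sketch below parses a file with tree-sitter and rejects it when the parse tree contains error nodes. The tree_sitter_languages convenience package and the omission of timeout handling are assumptions; SonarQube runs as a standalone service and is not shown here.

```python
from tree_sitter_languages import get_parser  # prebuilt grammars; an assumed convenience package

def passes_syntax_check(source: bytes, language: str) -> bool:
    """Parse the file with tree-sitter and keep it only if the tree has no error nodes."""
    parser = get_parser(language)
    tree = parser.parse(source)
    return not tree.root_node.has_error

def filter_syntactically_valid(paths: list[str], language: str) -> list[str]:
    """Return only the files that parse cleanly; the rest are dropped from the corpus."""
    kept = []
    for path in paths:
        with open(path, "rb") as f:
            if passes_syntax_check(f.read(), language):
                kept.append(path)
    return kept
```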
E. Sensitive Information Removal

In this stage, we remove the sensitive information in the pre-training data, e.g., texts involving sensitive topics and personally identifiable information (PII). We remove this information in two steps:
Match-based filter. We manually build a list of sensitive words, which covers a broad range of sensitive topics (e.g., politics). Then, we scan all pre-training data and delete the data containing sensitive words.
Model-based filter. Following previous work [4], we use a Named Entity Recognition (NER) model to identify PII in the data. Specifically, we reuse a trained NER model from previous work [4], which can identify six categories of PII, including emails, names, IP addresses, usernames, passwords, and keys. Then, we replace the detected PII entities with the following special tokens: <EMAIL>, <NAME>, <IP_ADDRESS>, <USERNAME>, <PASSWORD>, <KEY>.
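A hedged sketch of the two filters: a word-list scan that drops whole documents, and a replacement step that swaps detected PII spans for the special tokens above. The sensitive-word list is a placeholder, and the NER model itself is stubbed out (entity spans are assumed to arrive as (start, end, label) tuples from the reused model [4]).

```python
SENSITIVE_WORDS = {"example_sensitive_term"}  # placeholder; the real list is manually curated

PII_TOKENS = {
    "EMAIL": "<EMAIL>", "NAME": "<NAME>", "IP_ADDRESS": "<IP_ADDRESS>",
    "USERNAME": "<USERNAME>", "PASSWORD": "<PASSWORD>", "KEY": "<KEY>",
}

def match_based_filter(doc: str) -> bool:
    """Return True if the document should be dropped because it contains a sensitive word."""
    lowered = doc.lower()
    return any(word in lowered for word in SENSITIVE_WORDS)

def replace_pii(doc: str, entities: list[tuple[int, int, str]]) -> str:
    """Replace detected PII spans (from the NER model) with the corresponding special tokens."""
    for start, end, label in sorted(entities, reverse=True):  # right-to-left keeps offsets valid
        doc = doc[:start] + PII_TOKENS.get(label, "<PII>") + doc[end:]
    return doc
```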
III. MODEL TRAINING

In this section, we describe the pre-training procedure of aiXcoder-7B, including model architecture, data sampling algorithm, and training objectives.

B. Data Sampling Algorithm

Through the pipeline in Section II, we collect extensive code repositories and natural language articles. We randomly shuffle these repositories and articles and iterate through them. If a natural language article is sampled, we process it into training sequences based on the training objectives (Section III-C). If a code repository is sampled, we design an algorithm for sampling files from the repository, as described in Algorithm 1. The algorithm contains four strategies: sampling based on file content similarity, sampling based on file path similarity, sampling based on inter-file dependencies, and random sampling. The first three strategies simulate common cross-file code completion scenarios, such as code completion augmented by similar code and cross-file API completion, helping aiXcoder-7B better understand and utilize dependencies across files. The fourth strategy, random sampling, is to simulate other potential code completion scenarios. For each repository, the probability of selecting each of the first three strategies is 30%, and the probability of selecting the last strategy is 10%. These sampled files are further converted into training sequences based on the training objectives (Section III-C).
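For illustration, a toy sketch of the strategy mix is shown below. The 30%/30%/30%/10% probabilities follow the text; everything else (the anchor-file idea, the similarity measures, the fallback for dependency analysis) is a simplified stand-in rather than the actual Algorithm 1.

```python
import random
from difflib import SequenceMatcher

def order_by_content_similarity(files: dict[str, str], anchor: str) -> list[str]:
    """Toy stand-in: order files by textual similarity to an anchor file."""
    sim = lambda p: SequenceMatcher(None, files[anchor], files[p]).ratio()
    return sorted((p for p in files if p != anchor), key=sim, reverse=True)

def order_by_path_similarity(files: dict[str, str], anchor: str) -> list[str]:
    """Toy stand-in: order files by how much of the directory prefix they share with the anchor."""
    def shared_prefix(p: str) -> int:
        a, b = anchor.split("/"), p.split("/")
        return sum(1 for x, y in zip(a, b) if x == y)
    return sorted((p for p in files if p != anchor), key=shared_prefix, reverse=True)

def sample_repository(files: dict[str, str]) -> list[str]:
    """Pick a strategy with the 30/30/30/10 mix from the text and return an ordered file list."""
    anchor = random.choice(list(files))
    strategy = random.choices(
        ["content", "path", "dependency", "random"],
        weights=[0.30, 0.30, 0.30, 0.10], k=1)[0]
    if strategy == "content":
        ordered = order_by_content_similarity(files, anchor)
    elif strategy == "path":
        ordered = order_by_path_similarity(files, anchor)
    elif strategy == "dependency":
        # Real dependency analysis is omitted here; fall back to path order as a placeholder.
        ordered = order_by_path_similarity(files, anchor)
    else:
        ordered = random.sample(list(files), k=len(files))
    return [anchor] + [p for p in ordered if p != anchor]
```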
C. Training Objectives

The training objectives of aiXcoder-7B consist of the Next-Token Prediction (NTP) and Structured Fill-In-the-Middle (SFIM), detailed as follows.
Next-Token Prediction (NTP). It is similar to code completion, training models to predict the subsequent token based on the provided context. Given a code file or natural language article x = {x_0, x_1, ..., x_l}, NTP trains models to predict the next token x_i based on the previous tokens x_{<i}. The objective is to minimize the following loss function:

loss_{NTP} = -\sum_{i=0}^{l-1} \log p\left(x_i \mid x_{<i}\right) \quad (1)
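In practice, Eq. (1) is the standard token-level cross-entropy of a causal language model. A minimal PyTorch sketch (an illustration, not the paper's training code) is:

```python
import torch
import torch.nn.functional as F

def ntp_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Average negative log-likelihood of each token given its prefix, as in Eq. (1).

    logits: [batch, seq_len, vocab] from a causal LM; tokens: [batch, seq_len] token ids.
    """
    shift_logits = logits[:, :-1, :].contiguous()  # position i predicts token i + 1
    shift_targets = tokens[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_targets.view(-1),
    )
```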
Our study aims to answer the following Research Questions (RQs). They evaluate aiXcoder-7B in three code completion tasks, including Natural Language to Code (NL2Code), Fill-In-the-Middle (FIM), and cross-file code completion.
RQ1: How does aiXcoder-7B perform on the NL2Code task compared to existing LLMs? NL2Code is the task of completing the source code based on a natural language description or function signature.
RQ2: How does aiXcoder-7B perform on the Fill-In-the-Middle task compared to existing LLMs? FIM simulates scenarios where developers modify existing code by predicting the missing middle portion using bidirectional contexts.
RQ3: How does aiXcoder-7B perform on Cross-File Code Completion compared to existing LLMs? This task requires completing code by using relevant context from other files within the current repository.
In the RQs, we apply aiXcoder-7B to 6 benchmarks in total. These benchmarks cover 6 programming languages. To show the superiority of aiXcoder-7B, we also select 10 popular LLMs as baselines for comparison. Then, we report the execution-based and text-based metrics (Section IV-D) of completed programs.

B. Compared Models

Note that our aiXcoder-7B was open-sourced in March 2024. Thus, we select LLMs released before March 2024 for comparison. Specifically, we select 10 popular LLMs for comparison, and they can be divided into two groups:
❶ LLMs with similar sizes. The first group contains six popular LLMs, which have similar sizes to aiXcoder-7B.
• CodeGen2.5-7B [17], released by Salesforce, is a 7B parameter model specialized in code generation and understanding, trained on a diverse set of programming languages.
• CodeGeex2-7B [18], developed by Zhipu AI, is a 7B parameter model designed for code completion and bug fixing, leveraging a large corpus of code data.
• CodeLlama-7B [2], an open-source model by Meta AI, is a 7B parameter architecture fine-tuned on a vast collection of code and natural language data based on Llama2 [19].
model with 15B parameters, delivering improved accuracy for code synthesis and interpretation.
• StarCoder2-15B [5] is a 15B parameter model upgraded on the original StarCoder, offering refined code generation and more diverse programming languages.
• CodeLlama-34B [2] is the largest variant of the CodeLlama series with 34B parameters.

C. Benchmarks

NL2Code Benchmarks. Following previous studies [2]–[4], we select three popular NL2Code benchmarks in our experiments, detailed as follows.
• HumanEval [21] and MBPP [22] consist of 164 and 974 Python programming problems. Each problem includes a function signature, a detailed docstring, and several test cases. LLMs are required to complete the function body based on the signature and docstring. The generated code is checked by executing test cases, being considered correct only if all tests pass.
• MultiPL-E [23] is the multilingual version of HumanEval, covering multiple programming languages, e.g., C++, Java, and JavaScript.
FIM Benchmarks. Code is rarely composed in a straightforward left-to-right sequence. Simulating how a developer modifies existing code, FIM refers to the task of completing a missing middle code snippet leveraging bidirectional contexts.
• Santacoder-data [24] is a popular FIM benchmark consisting of 4,792 samples. It is built from MultiPL-E [23] and requires LLMs to predict a single line of code based on the preceding and following context.
TABLE II: The Pass@1 of LLMs on NL2Code benchmarks.
Model HumanEval MBPP MultiPL-E (C++) MultiPL-E (Java) MultiPL-E (JS) Average
CodeGen2.5-7B 28.7% 39.2% 25.7% 26.1% 26.2% 29.1%
CodeGeex2-7B 36.0% 36.2% 29.2% 25.9% 24.8% 30.4%
CodeLlama-7B 31.7% 38.6% 29.8% 34.2% 29.2% 32.7%
CodeShell-7B 34.4% 38.6% 28.2% 30.4% 33.2% 32.9%
StarCoder2-7B 35.4% 54.4% 33.6% 29.4% 35.4% 37.6%
DeepSeekCoder-7B 49.4% 60.6% 50.3% 43.0% 48.4% 50.3%
aiXcoder-7B 54.9% 66.0% 58.2% 57.0% 64.5% 60.1%
StarCoder-15B 31.7% 42.8% 31.1% 28.5% 29.8% 32.8%
CodeLlama-13B 36.0% 48.4% 37.9% 38.0% 32.3% 38.5%
StarCoder2-15B 46.3% 66.2% 41.4% 33.9% 44.2% 46.4%
CodeLlama-34B 48.2% 55.2% 44.7% 44.9% 42.2% 47.0%
• FIM-Eval is a large-scale FIM benchmark collected by this paper. We construct FIM-Eval from some real-world repositories, which are excluded from the training data of aiXcoder-7B. We extract 13 types of code snippets from these repositories and randomly mine spans from these code snippets. These 13 types of code snippets encompass common code completion scenarios, including method signatures, method bodies, single-line statements, methods with comments, empty code blocks, specific positions within a method body (top, middle, and bottom), and specific control statements (i.e., if statements, for loops, while loops, try statements, and switch-case statements). Finally, we collect 16,140 samples covering four programming languages: C++ (4,080 samples), Java (4,080 samples), Python (3,900 samples), and JavaScript (4,080 samples). FIM-Eval provides a reliable, practical, and diverse evaluation platform for FIM. FIM-Eval has been open-sourced in our repository [1].
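To illustrate the format of a FIM evaluation sample, the sketch below splits a code snippet into a prefix, a middle span (the reference the LLM must predict), and a suffix. The uniform line-level span policy here is a simplification; FIM-Eval mines spans according to the 13 labeled snippet types rather than uniformly.

```python
import random

def make_fim_sample(code: str, seed: int = 0) -> dict:
    """Split a snippet into prefix, middle (the ground truth to predict), and suffix."""
    rng = random.Random(seed)
    lines = code.splitlines(keepends=True) or [code]
    start = rng.randrange(len(lines))
    end = rng.randrange(start, len(lines)) + 1  # at least one line in the middle span
    return {
        "prefix": "".join(lines[:start]),
        "middle": "".join(lines[start:end]),
        "suffix": "".join(lines[end:]),
    }
```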
Cross-File Code Completion Benchmarks. This task requires LLMs to complete the code based on cross-file context within the same project. Building upon insights from prior research [3], [4], we use the following benchmark.
• CrossCodeEval [25] covers four popular programming languages: 2,665 Python samples, 2,139 Java samples, 3,356 TypeScript samples, and 1,768 C# samples. Each sample is provided in three formats: no cross-file context, retrieved cross-file context, and retrieval with reference. The LLMs' completed code snippets are compared using text-based metrics.

D. Evaluation Metrics

We describe the evaluation metrics used in different code completion tasks.
NL2Code. NL2Code benchmarks provide test cases for evaluation. Thus, we execute test cases to check the correctness of the generated code and report Pass@k [21]. Specifically, we generate n ≥ k code snippets per testing sample, count the number of correct code snippets c ≤ n that pass all test cases, and calculate the Pass@k:

\text{Pass@}k := \mathbb{E}_{\text{Samples}}\left[1 - \binom{n-c}{k} \Big/ \binom{n}{k}\right] \quad (4)
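Equation (4) is usually evaluated with the numerically stable, unbiased estimator below (the form popularized alongside HumanEval [21]); averaging it over all testing samples gives the reported Pass@k. This is an illustrative sketch rather than code from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem with n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```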
FIM and cross-file code completion. We consider the LLMs' completions as predictions and the human-written completions as references. We compare the predictions to the references and compute the following metrics:
• BLEU [26] measures the n-gram similarity between predictions and references. n is empirically set to 4.
• CodeBLEU [27] is a variant of BLEU for code. It considers not only the n-gram similarity but also the syntax and data flow similarity.
• Exact Match (EM) evaluates the percentage of cases where the prediction exactly matches the reference, providing a strict measure of how often LLMs produce correct code without deviations.
• Edit Similarity (ES) measures the similarity between the prediction and the reference based on the number of edits required to transform one into the other, typically using metrics like Levenshtein distance [28].
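One common way to instantiate ES (an assumption; the paper only states that Levenshtein-style edit distance is typically used) is to normalize the edit distance by the longer string's length:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(prediction: str, reference: str) -> float:
    """1.0 means identical strings; 0.0 means completely different."""
    if not prediction and not reference:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / max(len(prediction), len(reference))
```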
V. RESULTS AND ANALYSES

TABLE III: The exact match of LLMs on the SantaCoder-data benchmark.
Model Python JavaScript Java Avg
StarCoder2-7B 61.1% 77.5% 81.1% 73.2%
CodeLlama-7B 67.6% 74.3% 80.2% 74.0%
CodeLlama-13B 68.3% 77.6% 80.7% 75.5%
DeepSeekCoder-7B 66.6% 79.7% 88.1% 78.1%
aiXcoder-7B 73.3% 81.7% 83.0% 79.3%

A. RQ1: Performance on NL2Code

Following recent work on LLMs [3], [5], we use greedy decoding and report Pass@1. Table II shows the results of different LLMs on NL2Code benchmarks. From Table II, we draw the following observations:
• Compared to LLMs of similar sizes, our aiXcoder-7B achieves the current best results, outperforming the top-performing model DeepSeekCoder-7B by an average of 9.8%. Moreover, it significantly surpasses CodeGen2.5-7B with a 31% absolute advantage.
• aiXcoder-7B even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), achieving a lead of 13.1% over CodeLlama-34B, which is nearly five times larger, and 13.7% over StarCoder2-15B on average.
• Across languages like Java, Python, C++, and JavaScript, our aiXcoder-7B shows strong performance. It surpasses DeepSeekCoder-7B by 16.1% in JavaScript and exceeds it by 5.5% in Python.

B. RQ2: Performance on Fill-In-the-Middle (FIM)

Generally, FIM closely mirrors how developers modify existing code, making it an ideal method for evaluating models in real-world programming scenarios.
Based on the experimental findings outlined in Table III, aiXcoder-7B demonstrates the highest overall performance
TABLE IV: Performance of LLMs on CrossCodeEval.
Model Python (EM ES) Java (EM ES) TypeScript (EM ES) C# (EM ES) Average (EM ES)
Base Model
CodeLlama-7B 22.3 55.2 27.9 66.9 10.8 70.9 45.8 77.2 26.7 67.6
StarCoder2-7B 22.5 57.3 25.9 65.9 28.9 71.6 39.5 70.5 29.2 66.3
DeepSeekCoder-7B 27.2 62.3 33.4 73.2 36.6 77.3 45.9 77.0 35.8 72.4
aiXcoder-7B 30.0 70.8 34.9 77.8 35.3 79.6 49.9 86.4 37.5 78.7
+ Retrieval BM25
CodeLlama-7B 23.5 53.5 33.9 68.4 11.5 71.5 50.6 75.3 29.9 67.2
StarCoder2-7B 25.3 58.0 31.4 67.4 33.3 73.2 43.5 69.8 33.4 67.1
DeepSeekCoder-7B 29.9 62.9 39.8 74.8 39.0 77.0 52.2 78.1 40.2 73.2
aiXcoder-7B 35.3 74.3 42.2 80.4 39.9 81.3 57.7 88.8 43.8 81.2
+ Retrieval w/ Ref.
CodeLlama-7B 26.7 54.9 36.3 69.0 12.8 72.9 52.8 75.0 32.1 67.9
StarCoder2-7B 28.5 59.0 35.0 69.2 36.0 72.6 47.9 71.6 36.9 68.1
DeepSeekCoder-7B 33.2 64.5 43.7 76.1 43.4 78.4 55.4 78.7 43.9 74.4
aiXcoder-7B 40.4 76.3 47.0 82.4 45.0 83.8 61.0 89.4 48.4 83.0