
Performance-Aligned LLMs for Generating Fast Code

Daniel Nichols†, Pranav Polasam†, Harshitha Menon∗, Aniruddha Marathe∗, Todd Gamblin‡, Abhinav Bhatele†
†Department of Computer Science, University of Maryland, College Park, MD, USA
∗Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA, USA
‡Livermore Computing, Lawrence Livermore National Laboratory, Livermore, CA, USA
Email: {dnicho, ppolasam}@umd.edu, {marathe1, gopalakrishn1, tgamblin}@llnl.gov, [email protected]

arXiv:2404.18864v1 [cs.DC] 29 Apr 2024

Abstract—Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors including the algorithm, its implementation, and hardware, among others. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a multitude of work that use large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text, and are not specifically designed to understand performance aspects of code. In this work, we introduce a reinforcement learning based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks from 0.9 to 1.6 for serial code and 1.9 to 4.5 for OpenMP code.

Index Terms—Large Language Models, Code Generation, Performance Optimization, Reinforcement Learning

I. INTRODUCTION

Developing fast and scalable code is a difficult, but often necessary task for scientific software developers. It can require expert knowledge of the application domain, algorithm design, programming languages, and hardware. This is a challenging task for even serial code, and even more complex for parallel code. Further, programmers and performance engineers are often tasked with optimizing existing code, often not written by them, which requires understanding an existing codebase and the performance implications of changes. Large language models (LLMs) have emerged as a powerful tool for assisting in the software development process for a variety of tasks such as code completion [1], bug detection [2], [3], and code summarization [4]–[7]. Recently, they have also been used with limited success to generate parallel code [8]. Yet they struggle to understand performance aspects of code because they were not designed for this task. Code LLMs are trained on just code as text, and as a result, are not well-suited to reason about complex performance issues. Additionally, the code they generate does not consider performance and could be slow, despite being correct. This has been demonstrated in existing works that show LLMs often generate inefficient parallel code [9], [10].

Creating artificial intelligence (AI) models that can generate faster code has the potential to significantly improve the productivity of software developers. By using performance-aware code LLMs, developers can focus on design and correctness without worrying about the performance implications of using LLMs to generate code. Additionally, as LLM-based tools become more integrated with software development workflows, developers will become more and more reliant on the quality of their outputs. Improving the performance of LLM generated code while maintaining its correctness will improve the quality of the target software being developed. Further, code LLMs that can write fast code can remove the need for every scientific and parallel programmer to be a performance expert in addition to their existing domain expertise.

It is non-trivial to create code LLMs that can generate faster code. Since creating performance-aware code LLMs will require fine-tuning of LLMs using performance data, one challenge is creating such datasets. LLMs typically require very large, general datasets for training tasks, and it is challenging to create such large datasets for performance data. Arbitrary code can have a wide range of performance characteristics, and depend on many factors such as input data, hardware, and software environment. Due to the complexity in collecting performance data for arbitrary code, performance datasets are often small and/or narrow in focus. Further, even with such a dataset in hand, an LLM needs to be carefully fine-tuned to align its generated outputs with more performant code. There are many potential pitfalls here, for instance, improving the performance of generated code at the cost of correctness. Additionally, fine-tuned LLMs can learn a distribution too disjoint from the initial code distribution they modeled and lose their ability to generalize.

In order to overcome the challenges associated with collecting large scale performance data, we propose a new approach that combines a structured, narrow performance dataset with a more general synthetic code dataset for fine-tuning. We also propose two novel fine-tuning methodologies: (1) reinforcement learning with performance feedback (RLPF), which is based on reinforcement learning with human feedback (RLHF) [11], and (2) direct performance alignment (DPA), which is based on direct preference optimization (DPO) [12]. We use these two approaches and the new dataset to align an existing code LLM to generate faster code. These proposed fine-tuning methodologies use fast and slow code pairs to fine-tune the LLMs to generate samples more similar to the fast code and less similar to the slow code. The aligned model is then evaluated on two code generation benchmarks and one code optimization benchmark. We find that the aligned model is able to generate code with higher expected speedups than that of the original model, while maintaining correctness.

This work makes the following important contributions:

• A code performance dataset that combines narrow, structured performance data with broad synthetic data to help models learn performance properties, but maintain their ability to generalize.
• Two novel fine-tuning methodologies, reinforcement learning with performance feedback (RLPF) and direct performance alignment (DPA), for aligning code LLMs to generate faster code.
• A fine-tuned, performance-aligned LLM that generates faster code than traditional code LLMs.
• A detailed study of the performance and correctness of the code generated by performance-aligned LLMs including serial, OpenMP, and MPI code. Additionally, an ablation study motivating the use of synthetic data to fine-tune code LLMs for performance.

II. BACKGROUND

In this section, we provide a background on large language models and their use for code generation. We further provide an overview of reinforcement learning and the Proximal Policy Optimization algorithm.

A. Large Language Models for Code

LLMs have been shown to be effective tools for many code generation tasks [1], [13], [14]. These LLMs are typically Transformer models [15] fine-tuned on large code datasets [13], [16], [17] to model the probability distribution of code text data. These models can then be used to generate code, fill in missing code snippets, complete code snippets, and more. Code is generated by showing them a sequence of code text (as tokens) and using the model to predict the next token in the sequence. Getting good text generation with this method is not always straightforward, so additional sampling techniques such as temperature and top-p are often used to improve the quality of the generated text [18]. These control the randomness of the sampling process, with temperature controlling the entropy of the distribution and top-p controlling the number of tokens considered for sampling.
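To make the sampling step concrete, the sketch below applies temperature scaling and top-p filtering to a vector of next-token logits. It is a minimal illustration of the idea described above, not the decoding code used in this work, and the logits in the usage line are hypothetical.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token id from next-token logits with temperature and top-p (nucleus) sampling."""
    rng = rng or np.random.default_rng()
    # Temperature rescales the logits: values < 1 sharpen the distribution, values > 1 flatten it.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Top-p keeps the smallest set of most-likely tokens whose cumulative probability reaches p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))

# Hypothetical logits over a five-token vocabulary.
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.5], temperature=0.2, top_p=0.95))
```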
B. Reinforcement Learning and Proximal Policy Optimization

Reinforcement learning (RL) is a popular machine learning training paradigm where an agent model learns to interact with an environment to maximize a reward signal. This learning is typically accomplished by the agent iteratively taking actions in the environment, observing the results, and updating its policy to maximize the reward. While RL techniques have been popular for a number of years, they have recently been applied to LLMs due to their success in aligning LLM outputs with human preferences [11].

Proximal Policy Optimization (PPO) [19] is a popular RL algorithm that has been used to successfully fine-tune LLMs. It is a state-of-the-art algorithm that has become widely used due to its efficiency and robustness across a number of different tasks. A key difference between PPO and other RL algorithms is its use of clipping to prevent unusually large updates to the policy. PPO clips the ratio of the new agent policy and the previous agent policy to a range of [1 − ϵ, 1 + ϵ]. This prevents large weight updates, which can lead to instability in the training process. The clipped policy updates are combined with a value loss function (the reward signal) and an entropy loss function (to encourage exploration) to train the agent. After running many iterations of the training process, the agent learns to make decisions that optimize the reward signal. In this paper, we will train an agent (an LLM) to generate code that is fast (higher reward for faster, correct code).
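The following is a minimal PyTorch sketch of the clipped policy term described above. It assumes the per-token log-probabilities and advantages have already been computed, and it omits the value and entropy terms that a full PPO implementation (such as the one used later in this paper) would add; the tensors in the usage lines are hypothetical.

```python
import torch

def ppo_clipped_policy_loss(logprobs_new, logprobs_old, advantages, eps=0.2):
    """Clipped PPO policy loss for a batch of sampled actions.

    The probability ratio between the new and old policies is clipped to
    [1 - eps, 1 + eps] so that a single update cannot move the policy too far.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)              # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the minimum of the two terms; we return a loss to minimize.
    return -torch.mean(torch.min(unclipped, clipped))

# Hypothetical log-probabilities and advantages for four sampled tokens.
new_lp = torch.tensor([-1.0, -0.5, -2.0, -0.8], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.7, -1.9, -0.9])
adv = torch.tensor([0.5, 1.2, -0.3, 0.1])
loss = ppo_clipped_policy_loss(new_lp, old_lp, adv)
loss.backward()
```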
III. OVERVIEW OF METHODOLOGY

Figure 1 presents an overview of our methodology for aligning code large language models (LLMs) to generate faster code. We start by creating a dataset that can be used to fine-tune an LLM to generate code that is both correct and fast (Section IV). To accomplish this, we collect a large, structured code dataset with performance data and test cases to measure correctness. This structured dataset is, however, not representative of the entire distribution of code we want an LLM to optimize, so we ameliorate its shortcomings by using LLMs to generate a synthetic code dataset that covers a wider range of code.

Fig. 1. An overview of the proposed methodology. We first collect a large dataset of fast and slow code pairs using coding contest submissions and synthetically generated data. Then we fine-tune three different LLMs on this data to generate faster code. Finally, we evaluate the fine-tuned models on code generation and optimization tasks.

These datasets are then used to align the outputs of an LLM with performance considerations. We employ three different techniques – supervised learning, reinforcement learning, and direct alignment – to fine-tune code LLMs (Section V). The models are aligned to answers that are not only correct, but also fast. Using the fine-tuned models we then generate code for a set of three different benchmark tasks for code generation and optimization (Section VI). These tasks measure the correctness and performance of the generated code for coding problems within and outside the distribution of the training data.

IV. DATA COLLECTION AND LABELING

In order to align LLMs to generate more performant output, we need to fine-tune them on performance data. Further, to apply the proposed fine-tuning methods, we need a dataset of code where we have a slow and a fast implementation of a particular problem. This type of structured performance data paired with source code is difficult to collect. It requires being able to build, execute, validate, and profile arbitrary code snippets, which is difficult to accomplish at scale. In this section, we describe our process of collecting a large performance dataset (Dc). Additionally, we discuss how we extend the dataset with synthetic data (Ds) to cover a wider distribution of code patterns. The final dataset D contains over 4.5 million code samples, distributed over three source languages (C++, Java, and Python) as shown in Table I.

TABLE I
THE NUMBER OF SAMPLES IN BOTH DATASETS DISTRIBUTED BY SOURCE LANGUAGE.

Dataset (D)              Runtime Data   C++    Java   Python   No. of Samples
CodeContests+Perf (Dc)   ✓              1.8M   0.9M   1.8M     4.5M
Synthetic (Ds)           ✗              5k     0      5k       10k

A. Performance Dataset Collection

We build our performance dataset using the CodeContests dataset introduced by DeepMind in [14]. This dataset contains coding contest problems and solutions from the Aizu [20], AtCoder [21], CodeChef [22], Codeforces [23], and HackerEarth [24] online competition platforms. In total there are 13,610 coding problems in the dataset. These range in difficulty from simple to very difficult, and cover a wide range of topics such as graph algorithms, dynamic programming, and search. Each problem in the dataset has a corresponding set of submissions from users, labeled as correct or incorrect on the respective coding contest website. The number of submissions per problem ranges between tens and thousands. There are solutions in three different programming languages: C++, Java, and Python. Additionally, the dataset includes meta-data for the problem such as the problem statement, test cases, time limits, and memory limits.

This dataset is extremely valuable for our study as it provides a large amount of code samples along with the necessary tests to measure correctness and performance. More so, it contains many code samples that solve the same problem, but in different ways and with different runtimes. While many of the code contest websites record runtimes for submissions, the CodeContests dataset as provided by DeepMind does not include this information. We collect this data ourselves into a new dataset, CodeContests-Perf (Dc), by executing each of the correct submissions and recording their runtimes. Each submission is run on all the test cases for its problem. Generally, there are between 5 and 20 test cases per problem. We create submission-runtime pairs using the average runtime over all the test cases. Each run is executed on a single core of an AMD EPYC 7763 CPU with a 2.45 GHz base frequency.

The final CodeContests-Perf dataset contains 4.5 million samples. The distribution of samples by source language is shown in Table I. There were a small fraction of submissions labeled as correct in the CodeContests dataset that errored or failed the test cases when we ran them. These are omitted from the final dataset. We also include code submissions that were marked as incorrect in the original dataset, however, we do not run them. These will eventually be useful to prevent the model from generating fast, but incorrect code.
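As a rough illustration of this measurement step, the sketch below runs one compiled submission on its problem's test cases, checks the output, and averages the runtimes. The executable/test-case interface shown here is a hypothetical stand-in, not the authors' actual harness.

```python
import subprocess
import time

def time_submission(executable, test_cases, time_limit=10.0):
    """Run a compiled submission on all test cases; return (correct, mean runtime in seconds).

    test_cases is a list of (stdin_text, expected_stdout) pairs. Any wrong answer,
    crash, or timeout marks the submission as incorrect.
    """
    runtimes = []
    for stdin_text, expected in test_cases:
        start = time.perf_counter()
        try:
            result = subprocess.run(
                [executable], input=stdin_text, capture_output=True,
                text=True, timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return False, None
        runtimes.append(time.perf_counter() - start)
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False, None
    return True, sum(runtimes) / len(runtimes)
```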
B. Synthetic Data Generation

The amount of data and the availability of easy testing in the CodeContests-Perf dataset makes it a crucial component of our study. However, the distribution of code represented in the dataset is significantly different than that of the code that is typically found in production code. Coding contests generally award participants based on time-to-submission, leading to users writing messy and/or disorganized code to solve problems as quickly as possible. Further, the types of problems typically found in coding contests, such as depth-first search and dynamic programming, while an important subset of problems, do not cover the full range of relevant computational problems that are found in production code, and in particular, in scientific computing.

To address the shortcomings of the CodeContests-Perf data, we generate an additional synthetic dataset Ds of fast and slow code samples. This is inspired by several recent works demonstrating the effectiveness of fine-tuning LLMs on synthetic data to improve performance on real tasks [17], [25]–[28]. Gilardi et al. [26] even find that LLMs can outperform humans for many text annotation tasks. In our case of annotating code performance, real runtimes are the best annotation, but in the absence of runtime data, synthetic data is a promising candidate for obtaining labeled code performance data.

We use the Gemini-Pro-1.0 model [29] to generate synthetic code samples as we found it to give the best outputs among a number of models we tested. We adapt the methodology in [17], where samples are generated using seed code snippets to get diverse outputs from the model. First, we create a dataset of 10,000 seed samples that are 1-15 line random substrings of random files from The Stack dataset [30], which is a large, 3TB dataset of permissively licensed code. Then the LLM is asked to generate three pieces of text: a problem statement inspired by the seed snippet, a fast solution to the problem, and a slow solution to the problem. This produces inherently noisy data, since the LLM does not always generate correct or optimal (fast vs. slow) outputs. However, prior work has shown that the gain in predictive performance from fine-tuning on synthetic data often outweighs the downsides from noisy data [17].

In total, we collect 10,000 synthetic samples, 5,000 in C++ and 5,000 in Python. While adding more synthetic samples would likely continue to improve the quality of the fine-tuned model, we found that limiting to 10,000 samples provided adequate model quality while operating within time/cost constraints for this study. Table I shows the distribution of samples by language in the synthetic dataset.
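A minimal sketch of this seed-snippet prompting is shown below. The prompt wording and helper names are hypothetical stand-ins; the actual pipeline sends a prompt of this kind to Gemini-Pro-1.0 and parses the three generated pieces of text.

```python
import random

PROMPT_TEMPLATE = """Below is a random snippet of source code:

{seed}

Using the snippet only as inspiration, write three things:
1. A short programming problem statement.
2. A fast (efficient) solution to the problem.
3. A slow (inefficient but still correct) solution to the problem.
"""

def make_seed_snippet(source_text, min_lines=1, max_lines=15):
    """Pick a random 1-15 line substring of a source file to use as a seed snippet."""
    lines = source_text.splitlines()
    if not lines:
        return ""
    length = random.randint(min_lines, min(max_lines, len(lines)))
    start = random.randint(0, len(lines) - length)
    return "\n".join(lines[start:start + length])

def make_synthetic_prompt(source_text):
    """Build the generation prompt for one synthetic (problem, fast, slow) triple."""
    return PROMPT_TEMPLATE.format(seed=make_seed_snippet(source_text))
```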
V. ALIGNING LLMS TO GENERATE FASTER CODE: PROPOSED FINE-TUNING APPROACHES

Large language models have been shown to be capable of generating correct code with high frequency on several benchmarks [1], [31], [32], yet they do not always generate code that is efficient [9]. They require further fine-tuning to align them with performance considerations. In this section, we detail how we fine-tune large language models with supervised learning and reinforcement learning techniques to generate faster code. We utilize the dataset introduced in Section IV to train three different models using supervised learning, reinforcement learning with performance feedback, and direct performance alignment.

A. Supervised Learning

In the first approach, we fine-tune a language model on the dataset of code snippets from D to predict the next token in a sequence given previous tokens. For our methodology, we begin with a model that has already been trained on a large corpus of text and code, and then fine-tune it on a smaller dataset of coding problems and fast solutions.

We create two types of prompts using the samples in D to fine-tune the model. In the first type of prompt, we use a standard instruction prompt where the model is given a problem statement and a fast solution (shown in Listing 1). Using the coding contest data in Dc, we use the problem description as the instruction and randomly sample one of the five fastest solutions as the response. In the second type of prompt, we use a variation of the standard instruction prompt where the task is to optimize a given code snippet and the output is an optimized version of the code. For this, we use the problem description and one of the slowest 33% of solutions as the instruction, and one of the five fastest solutions as the response. Forming prompts from the synthetic dataset Ds is similar except we only have one slow and one fast solution for each problem, so we do not sample from ranges of solutions.

### Instruction:
Given a list of strings, find the longest common prefix shared by all strings in the list. The prefix should be the longest possible string that is a prefix of every string in the list.
### Response:
```python
def longest_common_prefix(strings):
    if not strings:
        return ""
    prefix = strings[0]
    for string in strings[1:]:
        while not string.startswith(prefix):
            prefix = prefix[:-1]
    return prefix
```

Listing 1: The instruction prompt format used to fine-tune the models. During fine-tuning, a coding problem is given to the model as an instruction-response pair, and the model is trained to generate similar responses when used for inference.

Over these prompts, the model is fine-tuned to minimize the cross-entropy loss between its predicted next token and the actual next token. We refer the reader to [33] for more details on fine-tuning LLMs for text generation. After fine-tuning, the model should have more fast code snippets in its training data and its probability distribution should shift toward faster code. Several prior works, however, have observed that methods more sophisticated than supervised fine-tuning are required to align LLM outputs with certain properties, such as safety and human preferences [11], [34], [35].

Supervised Fine-Tuning Evaluation Metric: We evaluate the success of the supervised fine-tuning by measuring the perplexity of the tuned model over an evaluation dataset. Perplexity is inversely proportional to how confident a model is that a data sample is in the distribution it models. A lower perplexity is better and indicates the LLM is less "perplexed" by a particular sample. A model's perplexity over t tokens from a dataset X is given by Equation (1), where p_\theta(x_i \mid x_{<i}) is the predicted probability of token x_i given the previous tokens x_{<i}:

\mathrm{Perplexity}(\mathcal{X}) = \exp\left( -\frac{1}{t} \sum_{i}^{t} \log p_\theta(x_i \mid x_{<i}) \right)   (1)
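In code, Equation (1) amounts to exponentiating the mean token-level negative log-likelihood. The sketch below assumes the model's next-token logits for a sequence are already available; the example tensors are hypothetical.

```python
import torch
import torch.nn.functional as F

def perplexity(logits, target_ids):
    """Compute perplexity from model logits over a token sequence (Equation (1)).

    logits: (t, vocab_size) next-token logits; target_ids: (t,) the observed tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[torch.arange(target_ids.numel()), target_ids]
    cross_entropy = -token_log_probs.mean()   # mean negative log-likelihood
    return torch.exp(cross_entropy)           # e.g. cross-entropy ~0.48 -> perplexity ~1.62

# Hypothetical logits for a 3-token sequence over a 5-token vocabulary.
print(perplexity(torch.randn(3, 5), torch.tensor([1, 4, 0])))
```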
B. Reinforcement Learning with Performance Feedback

To further align an LLM's outputs with performance considerations, we propose a new method, which we call reinforcement learning with performance feedback. This method is inspired by the success of reinforcement learning with human feedback (RLHF) [11], which aligns LLM outputs with human preferences. RLHF uses human-labeled preference data to train a reward model that assigns rewards to LLM outputs that are more preferred by humans. This reward model is used in conjunction with reinforcement learning to fine-tune an LLM to generate outputs that are more preferred by humans. We adapt this method into reinforcement learning with performance feedback (RLPF) that uses performance feedback instead of human feedback to fine-tune LLMs to generate faster code.

Reward Model: We first need to design a reward function that can be used to guide the reinforcement learning process. If we can automatically run, test, and measure the performance of a generated LLM output, then we can simply use a function of the recorded runtime as the reward. In our case, this is possible for the coding contests dataset Dc, where we have unit tests available to run and test the generated code (see Section IV-A). This further highlights the utility of this dataset for our study.

As mentioned in Section IV-B, we want to be capable of generating fast code outside the context of coding contests, i.e. we do not want to exclusively use the code contests data for RL fine-tuning. Since we may not be able to obtain runtime data for other arbitrary code samples, we need to train a reward model that rewards faster code more than slower code for samples where we cannot obtain runtime data. Fine-tuning LLMs for relative performance modeling was previously demonstrated by Nichols et al. [8] and, thus, a fine-tuned LLM is a viable candidate for the reward model.

To accomplish this we train a reward model (an LLM), rθ, to predict a reward for a given code sample, where a higher reward indicates faster code. To train this model, we first use a subset of D to create a dataset of triplets (p, df, ds) where p is a problem description and df and ds are fast and slow code solutions to the problem, respectively. Using rθ, we compute predicted rewards for df and ds, and use these to calculate the loss function Lr in Equation (2), where rθ(p, df) and rθ(p, ds) are the predicted rewards for the fast and slow code and µ is an adaptive margin that scales the loss based on the runtimes:

\mathcal{L}_r = -\log \sigma\left( r_\theta(p, d_f) - r_\theta(p, d_s) - \mu(p, d_f, d_s) \right)   (2)

This loss function is used to train rθ to predict a higher reward for df than ds. In Equation (2), σ is the logistic function and µ is an adaptive margin as defined in Equation (3). The loss function in Equation (2) is adapted from Wang et al. [36] to include runtime information. It trains the reward model to generate rewards farther and farther apart for faster and slower code samples. As rθ(p, df) − rθ(p, ds) gets larger, the loss function tends towards zero. On the flip side, the loss increases as the difference between the rewards decreases or rθ assigns a larger reward to the slower code. We utilize an adaptive margin µ to further scale the rewards based on how much faster the fast code is than the slow code, where λ is the max margin value and runtime(ds)/runtime(df) is the speedup of df over ds:

\mu(p, d_f, d_s) = \begin{cases} \min\left( \lambda, \frac{\mathrm{runtime}(d_s)}{\mathrm{runtime}(d_f)} \right) & \text{if } p \in \mathcal{D}_c \\ 0 & \text{otherwise} \end{cases}   (3)

Since we can train the reward model on both datasets Dc and Ds, we can use the runtime information from Dc to scale the rewards appropriately. We use a max margin λ to prevent extremely large margins when ds is very slow. Figure 2 provides an overview of the reward model fine-tuning process.

Fig. 2. An overview of the reward model fine-tuning process. The reward model outputs a reward for a fast and slow code sample. The loss function uses these rewards alongside runtime data to update the weights of the model so that its predicted rewards move farther apart for faster and slower code scaled by the runtime speedup.
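A minimal PyTorch sketch of Equations (2) and (3) is shown below; it assumes the reward model's scores and the pairwise runtimes are already available as tensors, and the example values are hypothetical rather than taken from the dataset.

```python
import torch
import torch.nn.functional as F

def adaptive_margin(runtime_slow, runtime_fast, in_code_contests, max_margin=3.0):
    """Equation (3): the speedup of the fast sample over the slow one, capped at lambda.

    The margin is only defined where runtimes exist (p in Dc); elsewhere it is 0.
    """
    speedup = runtime_slow / runtime_fast
    return torch.where(in_code_contests,
                       torch.clamp(speedup, max=max_margin),
                       torch.zeros_like(speedup))

def reward_model_loss(reward_fast, reward_slow, margin):
    """Equation (2): push the reward for fast code above the reward for slow code
    by at least the runtime-based margin."""
    return -F.logsigmoid(reward_fast - reward_slow - margin).mean()

# Hypothetical batch of predicted rewards and runtimes.
r_fast = torch.tensor([1.2, 0.4])
r_slow = torch.tensor([0.1, 0.6])
mu = adaptive_margin(torch.tensor([8.0, 2.0]), torch.tensor([1.0, 1.9]),
                     torch.tensor([True, False]))
loss = reward_model_loss(r_fast, r_slow, mu)
```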
It is important to note that the reward model rθ is not directly modeling code performance. Doing so would likely be impossible as performance can depend on a number of factors like hardware, input, etc. that are not accounted for in the input to the reward model. Instead, the reward model is trained to learn code structures and patterns that generally lead to better performance. This is another reason it is important to have a large dataset that covers a wide distribution of code, so that the model can learn these generalizations.

Using the runtime data in Dc and the trained reward model, we can define a reward function r(p, d) that assigns a reward to an LLM generated code sample. This reward function is defined in Equation (4):

r(p, d) = \begin{cases} -1 & \text{if } p \in \mathcal{D}_c,\ d \text{ incorrect} \\ \frac{\mathrm{median\_runtime}(p)}{\mathrm{runtime}(d)} - 1 & \text{if } p \in \mathcal{D}_c,\ d \text{ correct} \\ r_\theta(p, d) & \text{otherwise} \end{cases}   (4)

The model is penalized with a negative reward if it generates incorrect code. If it generates correct code, then the reward is based on the speedup over the median runtime, median_runtime(p), of the submissions already in the dataset. For the synthetic problems, we use the output of the reward model rθ.
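Equation (4) can be read as a small piece of control flow. A sketch is given below, with the runtime and reward-model values assumed to be supplied by the evaluation harness:

```python
def rollout_reward(problem_in_dc, is_correct, runtime=None, median_runtime=None,
                   reward_model_score=None):
    """Equation (4): reward assigned to one generated code sample d for problem p."""
    if problem_in_dc:
        if not is_correct:
            return -1.0                           # penalize incorrect code
        return median_runtime / runtime - 1.0     # positive iff faster than the median submission
    return reward_model_score                     # synthetic problems: use the learned reward model
```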
Reward Model Fine-Tuning Evaluation Metric: We can evaluate the fine-tuning of the reward model by computing its accuracy over an evaluation dataset. The accuracy here is defined as the proportion of samples where the reward signal is larger for the fast code than it is for the slow code:

\mathrm{acc}_{\mathrm{reward}}(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{(p, d_f, d_s) \in \mathcal{X}} \mathbb{1}\left[ r_\theta(p, d_f) > r_\theta(p, d_s) \right]   (5)

Here \mathbb{1} is the indicator function that returns 1 if the condition is true and 0 otherwise. A perfect accuracy of 1 indicates that the reward model always predicts a higher reward signal for the fast code sample than the slow code sample.

Reinforcement Learning: Using the reward function r(p, d) and Proximal Policy Optimization (PPO) [19], we can align an LLM to generate faster code. We use the supervised fine-tuned model from Section V-A as the base model to fine-tune with RL as is common in RLHF [11]. Following standard PPO training practices we optimize the base model using the reward objective function in Equation (6), where π^RLPF is the new model being fine-tuned with RL and π^S is the supervised model:

\mathcal{L}_p = r(p, d) - \eta\, \mathrm{KL}\left( \pi^{\mathrm{RLPF}}(d \mid p) \,\|\, \pi^{S}(d \mid p) \right)   (6)

Here KL is the Kullback-Leibler divergence and η is a hyperparameter that controls the divergence penalty. This penalty helps prevent the model from getting stuck in local optima or diverging too far from the original distribution of the supervised model [37], [38].

During fine-tuning, a prompt is given to the base model (a coding problem or optimization task) and is used to generate a response. The reward function r(p, d) is then used to compute a reward for the response either by running the generated code or getting a reward from the reward model. The reward is then used to compute the loss function Lp in Equation (6). The loss is then used to update the base model's parameters using PPO. The process is repeated for a number of iterations T or until the model converges. Figure 3 provides an overview of the RLPF fine-tuning process.

Fig. 3. The RLPF fine-tuning process. A prompt is given to the model and a reward is calculated based on the code it generates. Additionally, the KL-divergence between a reference model and the fine-tuned model is included in the reward to prevent deviating too far from the original distribution. Finally, PPO is used to update the model's parameters based on the reward.

RLPF Fine-Tuning Evaluation Metric: We can measure the success of the reinforcement learning using two metrics: the mean reward and the magnitude of the KL-divergence over an evaluation dataset. The mean reward indicates how well the LLM being fine-tuned is able to optimize the reward function. A higher mean reward is better and indicates that the model is generating faster code. The KL-divergence measures how far the fine-tuned model has diverged from the supervised model. The absolute magnitude of this is difficult to interpret, but it should remain positive and low to indicate that the fine-tuned model is not diverging too far from the supervised model.

C. Direct Performance Alignment

In recent work, Rafailov et al. [12] demonstrated an alternative approach that does not use reinforcement learning to align LLM outputs with certain properties. Their approach, called Direct Preference Optimization (DPO), uses a derivation of RLHF's reward objective (similar to Equation (6)) to directly update the model's parameters to align with a reward signal, rather than train a reward model and use RL. The derived loss takes a similar form to the reward loss in Equation (2). This DPO fine-tuning has many advantages over RLHF, such as requiring less computation, being easier to implement, and being generally more stable with fewer hyperparameters [12]. However, some works still find that RL fine-tuning can outperform DPO for certain tasks and datasets [39]. Thus, we adapt the DPO approach to compare it with RLPF. We propose Direct Performance Alignment (DPA), an adaptation of the training procedure and loss function from [12] that takes into account performance, to fine-tune an LLM to generate faster code. The proposed loss function in DPA is shown in Equation (7), where π^P and π^S are the predicted probabilities from the fine-tuned model and the supervised model on the fast (df) and slow (ds) code:

\mathcal{L}_d = -\log \sigma\left( \beta \log \frac{\pi^{P}(d_f \mid p)}{\pi^{S}(d_f \mid p)} - \beta \log \frac{\pi^{P}(d_s \mid p)}{\pi^{S}(d_s \mid p)} - \mu(p, d_f, d_s) \right)   (7)

Like with the reward loss in Equation (2), we utilize the adaptive margin µ from Equation (3) to scale the loss based on the runtime of the fast and slow code samples. This loss function can be used to fine-tune a base LLM to generate faster code without using reinforcement learning. To compute the loss, we need to get model predictions for a fast and slow code pair for both the model being fine-tuned and a base reference model (the supervised model). Then the loss from Equation (7) is used to update the weights of the model being fine-tuned. This process is iteratively repeated for a number of iterations T or until the model converges. This DPA fine-tuning process is portrayed in Figure 4.

Fig. 4. The DPA fine-tuning process. The model being fine-tuned and a reference model are used to generate probabilities for a fast and slow code sample. These probabilities, combined with runtime data, are used to compute a loss and update the model's parameters.
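A minimal PyTorch sketch of Equation (7) is given below. It assumes the summed sequence log-probabilities under the tuned and reference models are precomputed and reuses the adaptive margin from Equation (3); it is an illustration of the loss, not the modified TRL code described in Section VII.

```python
import torch
import torch.nn.functional as F

def dpa_loss(logp_fast_tuned, logp_fast_ref, logp_slow_tuned, logp_slow_ref,
             margin, beta=0.6):
    """Equation (7): a DPO-style loss over (fast, slow) code pairs with a runtime margin.

    Each argument is the summed log-probability of the fast or slow response under
    the model being fine-tuned (tuned) or the frozen supervised reference (ref).
    """
    fast_term = beta * (logp_fast_tuned - logp_fast_ref)   # beta * log pi_P(df|p) / pi_S(df|p)
    slow_term = beta * (logp_slow_tuned - logp_slow_ref)   # beta * log pi_P(ds|p) / pi_S(ds|p)
    return -F.logsigmoid(fast_term - slow_term - margin).mean()
```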
DPA Fine-Tuning Evaluation Metric: The success of DPA fine-tuning can be measured using a similar accuracy metric to the reward model from RLPF. Since we do not have a direct reward signal like in Equation (2), we can instead measure how often the difference in log probabilities between the fine-tuned model and the supervised model for the fast code, i.e. log(π^P(df | p) / π^S(df | p)), is greater than the log probability difference for the slow code, i.e. log(π^P(ds | p) / π^S(ds | p)). This is shown in Equation (8):

\mathrm{acc}_{\mathrm{dpa}}(\mathcal{X}) = \frac{1}{|\mathcal{X}|} \sum_{(p, d_f, d_s) \in \mathcal{X}} \mathbb{1}\left[ \frac{\pi^{P}(d_f \mid p)}{\pi^{S}(d_f \mid p)} > \frac{\pi^{P}(d_s \mid p)}{\pi^{S}(d_s \mid p)} \right]   (8)

VI. EVALUATION TASKS

It is important to quantify how well the models do on downstream tasks after fine-tuning. In this section we present two different tasks, code generation and optimization, to evaluate how well the training methodologies in Section V improved the LLMs' ability to generate fast code. We further detail an ablation study to motivate the use of synthetic data.

A. Code Generation

To evaluate the ability of the models to generate fast code, we utilize two sets of coding problems. The first is a subset of 100 coding contest problems from the CodeContests dataset [14] (see Section IV-A) that were removed from the training set. We can provide the model with the problem statement and use it to write a solution to the problem. We can then run the code and measure both its correctness and performance. Correctness can easily be tested using the unit tests provided with the problems.

In addition to the coding contest problems, we also evaluate the models on the ParEval benchmark [9], which is a collection of parallel code generation problems for evaluating the ability of LLMs to generate correct and efficient parallel code. We narrow our focus to a subset of 180 problems, namely the serial, OpenMP [40], and MPI [41] problems. We include OpenMP and MPI problems to evaluate the models' ability to generate fast parallel code. The problems in ParEval span a wide variety of domains, such as linear algebra, graph algorithms, sorting, etc. The problems are designed to be challenging and require the generation of efficient code. The ParEval benchmark provides a great way to test the LLMs on problems unlike what is in their training data (coding contests).

Code Generation Evaluation Metrics: We evaluate the generated code on two metrics: correctness and performance. To study correctness we adopt the popular pass@k metric from Chen et al. [1]. This metric measures the probability that if an LLM is given k attempts to write a correct solution, it will succeed. Equation (9) shows how this value can be estimated using N generated samples from an LLM, where P is the set of prompts and cp is the number of correct samples for prompt p. Typically the average pass@k over a set of prompts is reported and, as LLMs have progressed, only the pass@1 value is reported. We refer the reader to [1] for further discussion of pass@k.

\mathrm{pass}@k = \frac{1}{|P|} \sum_{p \in P} \left[ 1 - \binom{N - c_p}{k} \Big/ \binom{N}{k} \right]   (9)
To evaluate the performance of the generated code, we use the speedupn@k metric introduced by Nichols et al. [9]. This metric measures the expected max speedup over a baseline implementation if the LLM is given k attempts to write a solution. The speedupn@k metric is defined in Equation (10), where T*_p is the runtime of the baseline for prompt p and T_{p,j,n} is the runtime of sample j of prompt p on n processors. We refer the reader to [9] for a complete derivation of this metric. For the coding contest problems, we use the median submission runtime as the baseline. For the ParEval problems, we use the baselines provided by the benchmark.

\mathrm{speedup}_n@k = \frac{1}{|P|} \sum_{p \in P} \sum_{j=1}^{N} \frac{\binom{j-1}{k-1}}{\binom{N}{k}} \cdot \frac{T_p^{*}}{T_{p,j,n}}   (10)
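A sketch of the estimator in Equation (10) is shown below. Following the derivation in [9], it assumes the N samples for a prompt are ordered so that the j-th term is weighted by the probability of being the best of k random draws; the runtimes in the usage line are hypothetical.

```python
from math import comb

def speedup_n_at_k(baseline_runtime, sample_runtimes, k):
    """Estimate speedup_n@k for one prompt (Equation (10)).

    sample_runtimes holds the runtimes T_{p,j,n} of the N generated samples on n
    processors; the baseline runtime is T*_p.
    """
    n_samples = len(sample_runtimes)
    # Order samples so that speedups are non-decreasing in j (slowest sample first).
    ordered = sorted(sample_runtimes, reverse=True)
    total = 0.0
    for j, runtime in enumerate(ordered, start=1):
        weight = comb(j - 1, k - 1) / comb(n_samples, k)  # P(sample j is the best of k draws)
        total += weight * (baseline_runtime / runtime)
    return total

# Hypothetical runtimes (seconds) for N=4 samples against a 10 s baseline.
print(speedup_n_at_k(10.0, [9.0, 4.0, 5.0, 2.5], k=1))
```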
B. Code Optimization

In addition to generating code, we also evaluate the ability of the models to optimize existing code. This is accomplished by providing a code snippet and instructing the model to generate an optimized version of it. To evaluate this task we use the functions in the PolyBench benchmark suite [42]. This is comprised of 30 unique kernels that are typically used to test compiler optimizations and auto-tuning tools. We utilize the kernels by providing the existing kernel implementation to the LLM and instructing it to generate an optimized implementation. We can then evaluate the correctness and performance of the generated code.

Code Optimization Evaluation Metrics: We evaluate the generated code on the same metrics as the code generation task: correctness and performance. We use the same pass@k metric (Equation (9)) to evaluate correctness. To evaluate performance, we use speedupn@k (Equation (10)), except with the baseline being the runtime of the original kernel.

C. Synthetic Data Ablation Study

Finally, we test our hypothesis that training on synthetic data helps the models' ability to generalize and prevents them from over-fitting to code contest data. To accomplish this we train the models exclusively on the code contests dataset Dc without any of the synthetic dataset Ds. We then evaluate the models on the code generation (Section VI-A) and code optimization (Section VI-B) tasks. We compute the same pass@k and speedupn@k metrics and compare the impact of the synthetic data on the models' performance. Of most interest is the performance on the ParEval and PolyBench benchmarks, as these are the most different from the training data.

VII. EXPERIMENTAL SETUP

Using the large performance dataset D from Section IV and the training methodology introduced in Section V, we can now fine-tune LLMs to generate faster code. Once fine-tuned, these models can then be evaluated on the benchmarks detailed in Section VI. This section details the base models for fine-tuning, the data subsets for each fine-tuning task, how we implement the fine-tuning process, and the experimental setup used to evaluate the fine-tuned models.

A. Base Model for Fine-Tuning

Each of the training methodologies introduced in Section V begins with a base LLM that has already been trained and fine-tunes it further. We select the Deepseek-Coder 6.7B model [16] as the base for the supervised fine-tuning (Section V-A). This model is a 6.7B parameter code LLM released by Deepseek-AI that is trained on 2T tokens comprised of mostly code with a context length of 16k tokens. We select this model due to its good performance on code generation tasks [43] and due to other works finding it a better base model for fine-tuning than the popular CodeLlama models [17]. Furthermore, its 6.7B parameter size makes it tractable for end-users to use it to generate code themselves on consumer hardware. While Deepseek-Coder is a strong base model for our studies, the proposed fine-tuning methodologies can be applied to any existing code LLM.

For the remaining two fine-tuning methods, RLPF and DPA, we use the supervised fine-tuned Deepseek model as the base. This is in line with the methodologies in [11], [12] and ensures that the model being aligned is within the distribution of the text data it is trying to model (i.e. instruction prompts as shown in Listing 1). Additionally, we use Deepseek-Coder 6.7B as the base for the reward model. The final set of models used for comparison is shown in Table II.

TABLE II
MODELS USED FOR COMPARISON IN THIS PAPER. DEEPSEEK-CODER-6.7B [16] IS THE BASE MODEL WE USE IN OUR FINE-TUNING METHODOLOGIES.

Model Name   Description                       Fine-Tuning Methodology
DS           Deepseek-Coder 6.7B base model    —
DS+SFT       DS after supervised fine-tuning   Section V-A
DS+RLPF      DS+SFT after RLPF fine-tuning     Section V-B
DS+DPA       DS+SFT after DPA fine-tuning      Section V-C

B. Data Setup

We fine-tune the LLMs using the dataset D from Section IV. We set aside 100 contests from the CodeContests dataset for the code generation evaluation task. The dataset is further split into smaller datasets for each fine-tuning task. The supervised fine-tuning dataset, DSFT, is comprised of 40% of the full dataset, D, and the remaining 60% is used for the reinforcement fine-tuning dataset, DRLPF, and the direct performance alignment dataset, DDPA. These two datasets can be the same since the alignment fine-tuning tasks are disjoint. The DRLPF dataset is further split into 66% for the reward model dataset, DREWARD, and 33% for the reinforcement learning dataset, DRL. During each fine-tuning stage we set aside 5% of the respective dataset for evaluation (i.e. 5% of DREWARD is set aside to calculate the reward model accuracy after training). All of the dataset splits are stratified so that the proportion of code contest to synthetic data is equal to that of the original dataset.

When creating prompt, fast code, and slow code triplets (p, df, ds) from Dc for RLPF and DPA fine-tuning, we select df randomly from the top 5 fastest solutions. We then select ds from the slowest 50% of the solutions. Additionally, a random 5% subset of slow solutions is replaced with an incorrect solution. This is to ensure that the model is not just learning to generate fast code, but also to avoid generating incorrect code. We directly use the fast and slow code pairs from Ds to form the triplets.

C. Fine-Tuning Setup

In order to implement the fine-tuning we extend the TRL Python library [44], which is built on top of the popular transformers library [45]. TRL provides existing implementations of RLHF and DPO, which we modify to use our custom rewards, loss function, and datasets. We fine-tune the models on a single node with four 80GB A100 GPUs and two AMD EPYC 7763 CPUs.

1) Supervised Fine-Tuning Hyperparameters: We fine-tune the supervised model for three epochs over the DSFT dataset. We use bfloat16 precision and a global batch size of 64 (1 sample per GPU and 16 gradient accumulation steps). To fine-tune in parallel we make use of the PyTorch fully sharded data parallelism (FSDP) implementation [46], which shards model parameters across ranks to save memory. Furthermore, we fine-tune with the Adam optimizer [47] and an initial learning rate of 1.41 × 10^-5.

2) Reward Model Fine-Tuning Hyperparameters: The reward model is fine-tuned with the same hyperparameters as the supervised model (Section VII-C1), except it is fine-tuned for only one epoch over the DREWARD dataset. We use a max margin of λ = 3 for the margin function µ(p, df, ds).

3) RLPF Fine-Tuning Hyperparameters: We fine-tune the RLPF model for four PPO epochs over the DRL dataset. We use a global batch size of four and a learning rate of 1.41 × 10^-5. The KL regularization coefficient is initialized to γ = 0.1. When sampling outputs from the fine-tuned and reference model we follow best conventions [44] and use sampling with a top-k of 0 and a top-p of 1.0.

4) DPA Fine-Tuning Hyperparameters: The DPA model is fine-tuned for 1 epoch over the DDPA dataset with a global batch size of four. We employ a learning rate of 1 × 10^-7 in the AdamW optimizer [48]. Additionally, we found a value of β = 0.6 to be most stable for training.

D. Evaluation Setup

For the code generation tasks we use each of the LLMs to generate code for the prompts in the evaluation subset of Dc and ParEval. We generate 20 samples per prompt with a temperature of 0.2 and a top-p of 0.95, following standard practices for LLM code benchmarks [9], [31]. For the optimization task we similarly generate 20 optimized versions of each kernel in the PolyBench benchmark suite [42] using each of the fine-tuned LLMs.
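For illustration, generating multiple samples with these settings can be done with the Hugging Face transformers API roughly as sketched below; the checkpoint shown is the public base model rather than the fine-tuned checkpoints from Table II, and the prompt text is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; the evaluated models are the fine-tuned variants listed in Table II.
model_id = "deepseek-ai/deepseek-coder-6.7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16,
                                             device_map="auto")

prompt = "### Instruction:\nWrite a function that ...\n### Response:\n"  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 20 samples per prompt with temperature 0.2 and top-p 0.95, as stated in Section VII-D.
outputs = model.generate(**inputs, do_sample=True, temperature=0.2, top_p=0.95,
                         num_return_sequences=20, max_new_tokens=512)
completions = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:],
                                     skip_special_tokens=True)
```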
The generated code is run on a single AMD EPYC 7763 CPU. For the ParEval OpenMP tests we report results on 8 cores and we use 512 ranks for the MPI tests. We make use of the existing tests in the CodeContests dataset and ParEval to record the correctness and runtime of the generated code. For the optimized PolyBench kernels we test correctness and runtime against the original kernel implementations. All runtimes are averaged over 5 runs.

VIII. RESULTS

With the fine-tuned models from Section V we can now evaluate their code generation capabilities on the tasks described in Section VI. In this section we present the results from the fine-tuning process and the evaluation tasks.

A. Fine-Tuning Results

We record the fine-tuning metrics on the 5% evaluation datasets at the end of each fine-tuning step. The DS+SFT model yields an evaluation perplexity of 1.62. It is generally difficult to reason about specific perplexity values, but values near 1 show a strong ability to model the underlying text distribution. Since perplexity is the exponential of cross-entropy (see Equation (1)), a perplexity value of 1.62 means that the cross-entropy between predicted probabilities is ≈ 0.48.

The RLPF reward model achieves a final evaluation accuracy of 93% after one epoch of training, calculated using Equation (5). This means that in 93% of samples the model assigns a higher reward signal to faster code than slower code. This is a strong result as the success of RL-based LLM fine-tuning is highly dependent on the quality of the reward model [36]. Using this reward model the DS+RLPF model is then able to achieve a mean reward of 1.8 and a KL divergence of 0.29. This means that DS+RLPF is getting a positive mean reward, while maintaining a similar distribution to the original model.

Finally, we see that the DS+DPA model achieves an evaluation accuracy of 87% calculated as shown in Equation (8). This is not quite as high as the RLPF reward model, but is still a strong result. The log-probability difference between DS+DPA and the reference model for fast code samples is greater than the log-probability difference for slow code samples in 87% of the evaluation dataset.

Fig. 5. Correctness results for each model on the code generation tasks. Each of the fine-tuned models shows an improvement in correctness over the baseline model with the DS+RLPF model showing the most improvement.

Fig. 6. Speedup results for each fine-tuned model on the code generation tasks. OpenMP runtimes are on 8 cores and MPI runtimes are on 512 ranks. The DS+RLPF model is the best performing model across all benchmarks.

B. Code Generation Results

Figures 5 and 6 show the correctness and performance results of each fine-tuned model on the code generation tasks. We see a promising trend in pass@1 scores in Figure 5 where the fine-tuned models improve in correctness over the baseline model. The DS+RLPF model shows the most improvement across all tasks. These improvements can be attributed to training over more data and, in the case of the RLPF and DPA models, using incorrect samples as negative rewards. Improving the correctness of the models is a strong result considering that the primary goal of this work is to improve the performance while keeping the correctness levels the same.

Figure 6 further details the speedup results for each fine-tuned model. We present the speedup results for OpenMP on 8 cores and MPI on 512 ranks with a sequential implementation as the baseline. Across all four benchmarks DS+RLPF produces faster code than the other three models. In the case of the code contests and ParEval serial problems, the speedup1@1 value is easy to interpret. For instance, in the case of the serial ParEval problems, DS+RLPF generates code with an expected max speedup of 1.6x over the sequential baseline. We see the same order of model performance across all the benchmarks with DS+RLPF performing the best, followed by DS+DPA, DS+SFT, and DS.
C. Code Optimization Results

Figure 7 shows the correctness and performance results when using the fine-tuned models to optimize PolyBench kernels. DS is omitted because it is only a code completion model and was not trained to optimize code inputs. We first see that all three fine-tuned models transform the input code to a correct output code with relatively high accuracy. While provably correct compiler optimizations may seem more desirable, LLM optimizations can be applied at a higher level of abstraction and include natural language comments to explain the transformation to a developer.

We show the distribution of speedup1@1 per PolyBench benchmark in Figure 7 rather than an average to highlight the spread of results. The speedup results show that DS+RLPF is the best performing model. It is able to produce an expected max speedup greater than 1 in 26 out of the 30 benchmarks. In the case of the 3mm kernel (three matrix multiplies) it is able to get up to 22.4x expected speedup. Many of the optimizations come from loop unrolling and/or cache friendly data access patterns. The DS+DPA model is able to produce faster optimizations than DS+SFT, but is not as strong as DS+RLPF.

Fig. 7. pass@1 (left) and speedup1@1 (right) results for optimizing the PolyBench kernels. The distribution of speedup1@1 values over the 30 benchmarks is shown on the right. The DS+RLPF model has further outliers at 11.6 and 22.4.

D. Synthetic Data Ablation Study Results

We further highlight the use of synthetic data in the fine-tuning process in Figures 8 and 9. Results for DS+RLPF are shown since it is the best performing model. We see a general improvement in both correctness and performance of generated code when incorporating synthetic data versus fine-tuning on just coding contest data. The correctness improves for all the benchmarks (Figure 8) and, notably, even improves on the coding contest benchmarks. The broader synthetic data is able to help the model generalize better even within the coding contest domain.

Fig. 8. pass@1 results for DS+RLPF on each task with and without synthetic data in the fine-tuning dataset. For all tasks, the model fine-tuned on synthetic data produces correct code at a higher rate.

The speedup results in Figure 9 show that fine-tuning with synthetic data also helps the models produce faster code. Only in the case of the coding contests and ParEval serial problems do we see a decrease or no change in speedupn@1. However, these differences are small. The performance increases for OpenMP, MPI, and PolyBench are much more significant. Incorporating synthetic performance data into the fine-tuning process has prevented the models from overfitting code contest data and enabled them to generalize better to new tasks.

Fig. 9. speedupn@1 results for DS+RLPF on each task with and without synthetic data in the fine-tuning dataset. For OpenMP, MPI, and PolyBench tasks, the model fine-tuned on synthetic data produces faster code, while the coding contest and ParEval serial problems show a slight decrease or no change in speedup.

IX. RELATED WORK

Large Language Models (LLMs), like OpenAI Codex [49], CodeLlama [50], StarCoder [13], WizardCoder [51], Phind-CodeLlama [52], and DeepSeek [16] are revolutionizing how developers approach their coding tasks. These models are trained on vast datasets that include code repositories, documentation, and high quality programming problems and solutions. They have shown incredible potential in a variety of software-related tasks ranging from code completion [14], [32], [53]–[56], code refactoring [57], bug detection [58]–[61], documentation [62], and testing [63], [64], among others.

In HPC, researchers are particularly interested in using LLMs for generating parallel code [8]–[10], [65], [66]. Nichols et al. [8] proposed HPCCoder, a model fine-tuned on HPC data, to generate parallel code, label OpenMP pragmas, and predict performance. Chen et al. designed the LM4HPC [65] framework to facilitate the research and development of HPC software and proposed OMPGPT [67] for generating OpenMP pragmas and data race detection [68]. Despite their popularity, LLMs still struggle at generating efficient code [9], [10]. Our work addresses this concern by incorporating performance aspects of code in order to generate efficient code while maintaining its correctness.

While Reinforcement Learning with Human Feedback (RLHF) [36] has been shown to be critical for boosting the performance of LLMs by incorporating human feedback into the reward model [12], it is not specialized to code and furthermore does not consider performance. Another work by Mankowitz et al. [69] looked at training a deep reinforcement learning agent, AlphaDev, to discover sorting algorithms from scratch that outperformed previously known human benchmarks. However, the training process is limited to a single algorithm at a time and does not fine-tune a general model that can be used to generate fast code for a variety of problems. To address this gap and enable LLMs to generate faster versions of code, we introduced RLPF and DPA to tune LLMs on performance data.
X. CONCLUSION

In this paper, we have explored the idea of fine-tuning large language models to help them learn code structures and patterns that generally lead to better performance. To accomplish this, we first collected a large performance dataset from coding contests and extended it with synthetically generated samples to cover a wider distribution of code. We then introduced two novel fine-tuning methodologies, Reinforcement Learning with Performance Feedback (RLPF) and Direct Performance Alignment (DPA), that align LLMs with faster code outputs. We have demonstrated that using such techniques we can incorporate performance feedback into the fine-tuning of code LLMs. The fine-tuned models were evaluated on code generation and optimization tasks and shown to increase the expected performance of generated code over baseline LLMs while maintaining correctness for both serial and parallel codes.

REFERENCES

[1] M. Chen et al., “Evaluating large language models trained on code,” 2021.
[2] C. Richter and H. Wehrheim, “Can we learn from developer mistakes? learning to localize and repair real bugs from real bug fixes,” ArXiv, vol. abs/2207.00301, 2022.
[3] A. Kharkar, R. Z. Moghaddam, M. Jin, X. Liu, X. Shi, C. B. Clement, and N. Sundaresan, “Learning to reduce false positives in analytic bug detectors,” 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 1307–1316, 2022.
[4] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “A transformer-based approach for source code summarization,” ArXiv, vol. abs/2005.00653, 2020.
[5] S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, “Semantic similarity metrics for evaluating source code summarization,” 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC), pp. 36–47, 2022.
[6] J. Gu, P. Salza, and H. C. Gall, “Assemble foundation models for automatic code summarization,” 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 935–946, 2022.
[7] T. Ahmed and P. Devanbu, “Learning code summarization from a small and local dataset,” ArXiv, vol. abs/2206.00804, 2022.
[8] D. Nichols, A. Marathe, H. Menon, T. Gamblin, and A. Bhatele, “Modeling parallel programs using large language models,” 2023.
[9] D. Nichols, J. H. Davis, Z. Xie, A. Rajaram, and A. Bhatele, “Can large language models write parallel code?” 2024.
[10] P. Valero-Lara, A. Huante, M. A. Lail, W. F. Godoy, K. Teranishi, P. Balaprakash, and J. S. Vetter, “Comparing llama-2 and gpt-3 llms for hpc kernels generation,” 2023.
[11] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” 2022.
[12] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” 2023.
[13] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,
[14] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. Mankowitz, E. Sutherland Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code generation with alphacode,” arXiv preprint arXiv:2203.07814, 2022.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[16] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang, “Deepseek-coder: When the large language model meets programming – the rise of code intelligence,” 2024.
[17] Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang, “Magicoder: Source code is all you need,” arXiv preprint arXiv:2312.02120, 2023.
[18] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=rygGQyrFvH
[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
[20] “Aizu,” https://judge.u-aizu.ac.jp/onlinejudge/.
[21] “Atcoder,” https://atcoder.jp/.
[22] “Codechef,” https://www.codechef.com/.
[23] “Codeforces,” https://codeforces.com/.
[24] “Hackerearth,” https://www.hackerearth.com/.
[25] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-judge with mt-bench and chatbot arena,” 2023.
[26] F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt outperforms crowd workers for text-annotation tasks,” Proceedings of the National Academy of Sciences, vol. 120, no. 30, Jul. 2023. [Online]. Available: http://dx.doi.org/10.1073/pnas.2305016120
[27] X. He, Z. Lin, Y. Gong, A.-L. Jin, H. Zhang, C. Lin, J. Jiao, S. M. Yiu, N. Duan, and W. Chen, “Annollm: Making large language models to be better crowdsourced annotators,” 2023.
[28] L. Ben Allal, A. Lozhkov, G. Penedo, T. Wolf, and L. von Werra, “Cosmopedia,” 2024. [Online]. Available: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
[29] G. Team, “Gemini: A family of highly capable multimodal models,” 2023.
[30] D. Kocetkov, R. Li, L. Ben Allal, J. Li, C. Mou, C. Muñoz Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf, D. Bahdanau, L. von Werra, and H. de Vries, “The stack: 3 tb of permissively licensed source code,” Preprint, 2022.
[31] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021. [Online]. Available: https://arxiv.org/abs/2108.07732
[32] OpenAI, “Gpt-4 technical report,” 2023.
[33] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” 2024.
[34] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” 2020.
[35] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds,
M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda,
Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah,
O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee, L. K. B. Mann, and J. Kaplan, “Training a helpful and harmless assistant with
Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, reinforcement learning from human feedback,” 2022.
J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, [36] B. Wang, R. Zheng, L. Chen, Y. Liu, S. Dou, C. Huang, W. Shen,
N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, S. Jin, E. Zhou, C. Shi, S. Gao, N. Xu, Y. Zhou, X. Fan, Z. Xi, J. Zhao,
M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, X. Wang, T. Ji, H. Yan, L. Shen, Z. Chen, T. Gui, Q. Zhang, X. Qiu,
C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, X. Huang, Z. Wu, and Y.-G. Jiang, “Secrets of rlhf in large language
J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, models part ii: Reward modeling,” 2024.
D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, [37] N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza,
A. Guha, L. von Werra, and H. de Vries, “Starcoder: may the source be N. Jones, S. Gu, and R. Picard, “Way off-policy batch deep reinforce-
with you!” 2023. ment learning of implicit human preferences in dialog,” 2019.
[38] C. Laidlaw, S. Singhal, and A. Dragan, “Preventing reward hacking [59] M. Yasunaga and P. Liang, “Break-it-fix-it: Unsupervised learning
with occupancy measure regularization,” in ICML Workshop on New for program repair,” in International conference on machine learning.
Frontiers in Learning, Control, and Dynamical Systems, 2023. [Online]. PMLR, 2021, pp. 11 941–11 952.
Available: https://openreview.net/forum?id=oiT8js6p3Z [60] J. Zhang, J. Cambronero, S. Gulwani, V. Le, R. Piskac, G. Soares,
[39] Z. Wang, Y. Dong, J. Zeng, V. Adams, M. N. Sreedhar, D. Egert, and G. Verbruggen, “Repairing bugs in python assignments using large
O. Delalleau, J. P. Scowcroft, N. Kant, A. Swope, and O. Kuchaiev, language models,” arXiv preprint arXiv:2209.14876, 2022.
“Helpsteer: Multi-attribute helpfulness dataset for steerlm,” 2023. [61] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of
[40] “OpenMP Application Program Interface. Version 4.0. July 2013,” 2013. the automatic bug fixing performance of chatgpt,” in 2023 IEEE/ACM
[41] M. Snir, MPI–the Complete Reference: The MPI core, ser. MPI: International Workshop on Automated Program Repair (APR). IEEE,
The Complete Reference. Mass, 1998. [Online]. Available: https: 2023, pp. 23–30.
//books.google.com/books?id=x79puJ2YkroC [62] J. Y. Khan and G. Uddin, “Automatic code documentation generation
[42] J. C. S. Grauer-Gray, “Polybench,” https://web.cs.ucla.edu/∼pouchet/ using gpt-3,” in Proceedings of the 37th IEEE/ACM International
software/polybench/, 2012. Conference on Automated Software Engineering, 2022, pp. 1–6.
[43] “Big code models leaderboard - a hugging face space by [63] M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of
bigcode,” 2023. [Online]. Available: https://huggingface.co/spaces/ using large language models for automated unit test generation,” IEEE
bigcode/bigcode-models-leaderboard Transactions on Software Engineering, 2023.
[44] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, [64] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and
N. Lambert, and S. Huang, “Trl: Transformer reinforcement learning,” W. Chen, “Codet: Code generation with generated tests,” arXiv preprint
https://github.com/huggingface/trl, 2020. arXiv:2207.10397, 2022.
[45] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, [65] L. Chen, P.-H. Lin, T. Vanderbruggen, C. Liao, M. Emani, and
P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, B. de Supinski, “Lm4hpc: Towards effective language model application
S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: in high-performance computing,” in OpenMP: Advanced Task-Based,
State-of-the-Art Natural Language Processing.” Association for Device and Compiler Programming, S. McIntosh-Smith, M. Klemm,
Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: B. R. de Supinski, T. Deakin, and J. Klinkenberg, Eds. Cham: Springer
https://www.aclweb.org/anthology/2020.emnlp-demos.6 Nature Switzerland, 2023, pp. 18–33.
[46] Y. Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, [66] X. Ding, L. Chen, M. Emani, C. Liao, P.-H. Lin, T. Vanderbruggen,
H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Dama- Z. Xie, A. Cerpa, and W. Du, “Hpc-gpt: Integrating large language model
nia, B. Nguyen, G. Chauhan, Y. Hao, A. Mathews, and S. Li, “Pytorch for high-performance computing,” in Proceedings of the SC’23 Work-
fsdp: Experiences on scaling fully sharded data parallel,” Proc. VLDB shops of The International Conference on High Performance Computing,
Endow., vol. 16, no. 12, p. 3848–3860, aug 2023. Network, Storage, and Analysis, 2023, pp. 951–960.
[47] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” [67] L. Chen, A. Bhattacharjee, N. Ahmed, N. Hasabnis, G. Oren, V. Vo, and
in 3rd International Conference on Learning Representations, ICLR A. Jannesari, “Ompgpt: A generative pre-trained transformer model for
2015, San Diego, CA, USA, May 7-9, 2015, Conference Track openmp,” arXiv preprint arXiv:2401.16445, 2024.
Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: [68] L. Chen, X. Ding, M. Emani, T. Vanderbruggen, P. hung Lin, and
http://arxiv.org/abs/1412.6980 C. Liao, “Data race detection using large language models,” 2023.
[48] I. Loshchilov and F. Hutter, “Fixing weight decay regularization [69] D. J. Mankowitz, A. Michi, A. Zhernov, M. Gelmi, M. Selvi, C. Padu-
in adam,” CoRR, vol. abs/1711.05101, 2017. [Online]. Available: raru, E. Leurent, S. Iqbal, J.-B. Lespiau, A. Ahern et al., “Faster sorting
http://arxiv.org/abs/1711.05101 algorithms discovered using deep reinforcement learning,” Nature, vol.
[49] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, 618, no. 7964, pp. 257–263, 2023.
H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri,
G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,
S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian,
C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis,
E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak,
J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse,
A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford,
M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew,
D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating
large language models trained on code,” 2021.
[50] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat
models,” 2023.
[51] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin,
and D. Jiang, “Wizardcoder: Empowering code large language models
with evol-instruct,” arXiv preprint arXiv:2306.08568, 2023.
[52] Phind. (2023) Phind-codellama-34b-v2. [Online]. Available: https:
//huggingface.co/Phind/Phind-CodeLlama-34B-v2
[53] S. Barke, M. B. James, and N. Polikarpova, “Grounded copilot:
How programmers interact with code-generating models,” ArXiv, vol.
abs/2206.15000, 2022.
[54] J.-B. Döderlein, M. Acher, D. E. Khelladi, and B. Combemale, “Piloting
copilot and codex: Hot temperature, cold prompts, or black magic?”
ArXiv, vol. abs/2210.14699, 2022.
[55] A. Sarkar, A. D. Gordon, C. Negreanu, C. Poelitz, S. S. Ragavan, and
B. G. Zorn, “What is it like to program with artificial intelligence?”
ArXiv, vol. abs/2208.06213, 2022.
[56] D. Guo, C. Xu, N. Duan, J. Yin, and J. McAuley, “Longcoder: A long-
range pre-trained language model for code completion,” in International
Conference on Machine Learning. PMLR, 2023, pp. 12 098–12 107.
[57] J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, “Chatgpt
prompt patterns for improving code quality, refactoring, requirements
elicitation, and software design,” arXiv preprint arXiv:2303.07839,
2023.
[58] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language
models to self-debug,” arXiv preprint arXiv:2304.05128, 2023.

You might also like