HPC-Coder: Modeling Parallel Programs using
Large Language Models
Daniel Nichols†, Aniruddha Marathe∗, Harshitha Menon∗, Todd Gamblin‡, Abhinav Bhatele†
† Department of Computer Science, University of Maryland, College Park, MD, USA
∗ Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA, USA
‡ Livermore Computing, Lawrence Livermore National Laboratory, Livermore, CA, USA
Abstract—Parallel programs in high performance computing data that is fortunately available online from open-source
(HPC) continue to grow in complexity and scale in the exascale code repositories on GitHub, GitLab, etc. However, this data
era. The diversity in hardware and parallel programming models requirement for training LLMs is prohibitive for tasks where
make developing, optimizing, and maintaining parallel software
even more burdensome for developers. One way to alleviate some such data may not exist. One such task is that of modeling
of these burdens is with automated development and analysis performance (execution time) based on source code. Another
tools. Such tools can perform complex and/or remedial tasks difficult task is modeling parallel and HPC code where there
for developers that increase their productivity and decrease the is less data available and it is often more complex code.
chance for error. Until recently, such tools for code development Performance data for arbitrary code is difficult to obtain at
and performance analysis have been limited in the complexity
of tasks they can perform, especially for parallel programs. scale with large numbers of samples. First and foremost, it
However, with recent advancements in language modeling, and is non-trivial to automate the collection of performance data
the availability of large amounts of open-source code related data, for arbitrary source code. The code needs to be built and
these tools have started to utilize predictive language models run in order to measure performance, and this process can
to automate more complex tasks. In this paper, we show how vary significantly across repositories. This can be particularly
large language models (LLMs) can be applied to tasks specific
to high performance and scientific codes. We introduce a new difficult for production scientific codes due to code complexity,
dataset of HPC and scientific codes and use it to fine-tune several dependence on external libraries, and the fact that it often
pre-trained models. We compare several pre-trained LLMs on needs to be run in parallel with many resources. Second,
HPC-related tasks and introduce a new model, HPC-Coder, fine- performance depends on numerous variables besides just the
tuned on parallel codes. In our experiments, we show that this code such as input problem, architecture, and current machine
model can auto-complete HPC functions where generic models
cannot, decorate for loops with OpenMP pragmas, and model load/congestion. These either need to be fixed in the dataset
performance changes in scientific application repositories as well or accounted for within the modeling pipeline. Finally, source
as programming competition solutions. code needs to be considered holistically when modeling per-
Index Terms—large language models, parallel code generation, formance, since minor changes in one place may drastically
performance modeling impact performance elsewhere. For example, changing the data
layout within a data structure will impact the performance of
I. INTRODUCTION
data access where that structure is used. This means that the
In recent years, large language models (LLMs) have become entirety of the source code needs to be included in the dataset
the state of the art for many language modeling related and performance needs to be collected at a finer granularity.
tasks [1]. Their ability to model token probabilities within When a lack of data becomes a hurdle in machine learning
a sequential context makes them desirable for language tasks tasks, it is typically solved through data augmentation and/or
such as text generation and sequence classification. In addition transfer learning. Data augmentation involves extending and/or
to being used for natural language, such models have recently duplicating data in a manner that still preserves meaning
been applied to many programming language related tasks [2]– and representational capacity. Transfer learning is done by
[4]. The predictive capabilities of these models translate well first training a model on a related or simpler task and then
to coding tasks, and the wealth of open-source code available transferring that knowledge to a new problem requiring fewer
online provides significant data for training large models. samples to learn. For our task we employ transfer learning
LLMs trained on source code data have been utilized to by using LLMs that have learned to model source code and
automate numerous software development tasks such as code then transferring that knowledge to learn how to model
completion, malware detection, code refactoring, etc. [3]–[12]. performance of source code using fewer samples. In particular,
Additionally, they have been able to automate tasks previously we explore modeling parallel and HPC codes.
considered impossible to automate such as code summariza- In this paper, we utilize LLMs to model high performance
tion and generation using natural language. Training LLMs and scientific codes, and then apply that to the problem
for these tasks requires significant amounts of source code of performance modeling. In order to accomplish this, we
introduce a new dataset of HPC and scientific codes from a multi-attention head layer. Having multiple attention heads
popular open-source repositories. We first demonstrate how allows each of them to learn, or attend to, different abstractions
our trained model, HPC-Coder, outperforms other LLMs on in the input, such as parts-of-speech for natural language input.
HPC specific tasks such as code generation and OpenMP Generally these networks are trained to model the condi-
pragma labeling. A set of code generation tests specific to tional probability of observing a language token or a sequence
HPC are introduced and the model can pass these at up to of tokens. For instance, given a string of observed tokens
53% higher rate than the other models. Additionally, it is able t1 t2 . . . ti−1 we may want the most likely next token ti .
to label for loops with OpenMP pragmas with 97% accuracy.
Finally, we demonstrate how the model can predict relative
performance of source code changes with up to 92% accuracy.
$$t_i = \arg\max_{t} P(t_i = t \mid t_1 t_2 \ldots t_{i-1})$$
Similarly we may want to know the probability of a se-
In summary, this paper makes the following contributions:
quence of tokens occurring given the entire observed dataset
• A large curated dataset containing HPC and scientific
$P(t_1, t_2, \ldots, t_N)$ (i.e., how likely is a given English sentence to
code from numerous open-source repositories. be real given my previous knowledge of the language). Using
• We present an LLM, HPC-Coder, fine-tuned to model
this probability we can define a metric called perplexity.
HPC and scientific code. We show that it trains to better
language modeling scores over HPC related code than
other state-of-the-art models.
$$\text{Perplexity}(T) = \left( \frac{1}{P(t_1, t_2, \ldots, t_N)} \right)^{\frac{1}{N}}$$
• We introduce a set of HPC code generation tasks and
demonstrate that our model completes these tasks at With this metric a model that scores a lower perplexity on
a significantly better rate than other models on HPC- its test set T is better as it assigns a higher probability to the
specific code. test data. The ratio is normalized to be invariant to the size of
• We demonstrate how our model can be used to predict
the test set. Rewriting the formula for perplexity we can see
OpenMP pragmas with high accuracy. that it is equivalent to the exponential of the cross-entropy.
• We utilize our model to predict relative performance
of source code changes for two distinct datasets from
scientific application repositories and coding competition
solutions.
$$\begin{aligned}
\text{Perplexity}(T) &= \left( P(t_1, t_2, \ldots, t_N) \right)^{-\frac{1}{N}} \\
&= \left( \exp \log P(t_1, t_2, \ldots, t_N) \right)^{-\frac{1}{N}} \\
&= \exp\left( -\frac{1}{N} \log P(t_1, t_2, \ldots, t_N) \right)
\end{aligned}$$
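As a small numerical check of the perplexity identity, the following Python sketch computes perplexity both from the joint probability and from the mean negative log-likelihood. The per-token probabilities are made-up illustrative values and the function names are hypothetical; the joint probability is taken as the product of the per-token conditional probabilities (chain rule).

```python
import math

def perplexity_from_joint(token_probs):
    """Perplexity as P(t_1, ..., t_N)^(-1/N)."""
    joint = math.prod(token_probs)  # chain rule: product of conditionals
    return joint ** (-1.0 / len(token_probs))

def perplexity_from_cross_entropy(token_probs):
    """Perplexity as exp of the mean negative log-likelihood (cross-entropy)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Illustrative values only: probability the model assigned to each observed token.
probs = [0.5, 0.25, 0.8, 0.1]
ppl_joint = perplexity_from_joint(probs)
ppl_ce = perplexity_from_cross_entropy(probs)
```

Both functions return the same value up to floating-point error, which is why a model trained with cross-entropy loss can report perplexity by exponentiating the loss.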
II. BACKGROUND
This section provides background on transformer-based This allows us to train the language model with cross-
language models and how they can be applied to source code. entropy loss. Minimizing the loss will, in turn, minimize the
perplexity. The perplexity is recovered by simply taking the
A. Large Language Models exponential of the loss. It is important to note that perplexity
When applying machine learning to textual data we need measures model confidence and not accuracy. However, it has
a model that takes text as input and, through the process of been demonstrated empirically that lower perplexity generally
training on previous data, learns how to predict some property leads to better performance on downstream tasks.
of that text. In recent years such models have been mostly B. Text Generation
dominated by large transformer-based models. Transformers
were first introduced by Vaswani et al. [13]. They are designed A trained model can then be used to generate new text. Since
to work with sequential data much like recurrent and long the LLM models token probability it may seem simple to select
short-term memory neural networks. However, they differ in the most probable next token; however, this can lead to poor
their use of a self-attention mechanism to attribute importance text generation. Often a model’s attention puts more focus on
weights to inputs into the model. Due to this mechanism the most recent tokens causing this selection method to get
transformers also process entire sequences at once unlike stuck in loops or suddenly forget context. Most recent works
recurrent neural networks. combat this issue by sampling from the model’s distribution,
These self-attention units make up the basis of transformer but there are several important caveats when doing this. For
networks. Weights are divided into query, key, and value instance, we want to avoid sampling from the tail as this could
weights (namely WQ , WK , WV ). These are multiplied by drastically throw off further tokens sampled. Here we discuss
each input token i and stacked to form the matrices Q, K, several of the sampling methods used later in this paper such
and V , respectively. Given these matrices and the dimensions as temperature, top-k, and nucleus sampling.
Temperature: When sampling, temperature controls how
confident the model is in the sampled token. Lower temperature
leads the model to assign more confidence in the most likely
tokens in the distribution. On the other end, the model will
of the key vector $d_k$ the attention can be computed as below.
$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V$$
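The attention formula can be illustrated with a minimal pure-Python sketch. It assumes Q, K, and V are already given as small row-major lists of lists (the projection by W_Q, W_K, and W_V is skipped), and all helper names and input values are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix product of two row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

# Toy inputs: two tokens, key dimension d_k = 2 (values are illustrative).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Each output row is a weighted average of the rows of V, with more weight on the value whose key best matches the query.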
These weight matrices form a single attention head. Typ- more uniformly assign confidence across the distribution when
ically transformers employ several attention heads to form the temperature is higher. This term comes from statistical
thermodynamics where higher energy states are more frequent Additionally, when applying language models to code it is
with a higher temperature. typical to customize the training process slightly to take ad-
Temperature is incorporated by dividing the logits by the vantage of the syntactic differences between natural language
temperature, temp, before computing the softmax output. The and code. For instance, the tokenizer, which is responsible for
logits are the raw, un-normalized outputs of the model and the mapping text to a sequence of integers, is often set to group
softmax is used to turn this vector into probabilities. whitespace into single tokens. This is not necessary in natural
language inputs as multiple consecutive spaces are uncommon.
$$\text{softmax}\left( \frac{\text{logits}}{\text{temp}} \right)$$
Thus, as temp → 0 the output becomes the argmax and as
temp → ∞ it leads to a uniform sampling.
However, in code this can meaningfully reduce the sequence
size and a formatter can be applied after code generation to
regain formatting.
III. OVERVIEW OF THE PROPOSED METHODOLOGY
Top-k Sampling: In top-k sampling the most likely k tokens Figure 1 provides an overview of the data gathering, train-
are sampled from the model. This aims to exclude the distribu- ing, and downstream application in this paper. In order to train
tion’s tail and prevent the model from rapidly getting off-topic. a large HPC-specific language model we need a large dataset
However, this can also reduce the quality of predictions if the of HPC code. To obtain this, we gather a dataset of HPC
body of the distribution is wider than k. A common choice source code and use it to fine-tune a pre-trained language
for k is 50. model. This data gathering is described in Section IV and
Nucleus Sampling: Nucleus, or top-p, sampling aims to solve presents what HPC sources are used and how they are pre-
the shortcomings of top-k sampling by choosing a more processed. Following this, the model fine-tuning and selection
meaningful cut-off point. In this method the CDF of the are detailed in Section V where we explain the training setup
distribution is computed and sampling is cut-off when the CDF and methodology.
exceeds p. A common choice for p is 0.9.
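The three sampling controls described in this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the logits are made-up values, the function names are hypothetical, and the nucleus cut-off keeps the token that first pushes the cumulative probability to p or beyond.

```python
import math
import random

def softmax_with_temperature(logits, temp):
    """Divide logits by the temperature, then apply a stable softmax."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def nucleus_filter(probs, p):
    """Keep the most likely token ids until their cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # illustrative values for a 4-token vocabulary
sharp = softmax_with_temperature(logits, 0.2)
flat = softmax_with_temperature(logits, 5.0)
base = softmax_with_temperature(logits, 1.0)
top2 = top_k_filter(base, 2)
nucleus = nucleus_filter(base, 0.9)
token = sample(nucleus, random.Random(0))
```

Lower temperatures concentrate probability on the most likely token, while top-k and nucleus filtering both remove the tail before sampling; nucleus sampling adapts the number of kept tokens to the shape of the distribution.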
C. Using LLMs for Code Generation
LLMs can be trained on a variety of downstream tasks and
objectives. When applied to source code data they are typically
trained as left-to-right, masked, or encoder-decoder models.
Left-to-Right: Left-to-right or causal language models are
trained to predict the most probable next token in a sequence.
The model receives and generates text in a left-to-right fashion,
which is where it gets its name. This limits the amount of
context the model can see as it cannot use later tokens in its
prediction even if they are present in the data. Left-to-right
models are useful for text generation related tasks.
Fig. 1. Overview of the steps described in this paper to train an HPC specific
model and run it on several downstream tasks. After collecting a large dataset
of HPC code we fine-tune several pre-trained language models and select the
best one. The selected model is then used to generate code, label OpenMP
pragmas, and predict relative performance as part of several downstream tasks.
We need several realistic tests to study the performance of
Masked: Unlike left-to-right models, masked models can the language model on relevant metrics. We present three main
predict the most probable token for any position in the text. downstream tasks for evaluation in Section VI. The first two,
After removing random tokens in the samples and replacing code generation and OpenMP pragma labeling, test the model
them with mask tokens, the model is trained to predict the on its ability to generate correct and meaningful code. The last
most probable tokens to replace the masks with. In this test, relative performance prediction, shows how this trained
configuration masked models can make use of more context model can be used for useful tasks that require language
in their predictions. comprehension. Results from each of these tests are presented
and discussed in Section VII.
Encoder-Decoder: Another common approach is to train a
left-to-right model to decode a sequence after it has been
passed through an encoder. This type of model can be combined
with several different objectives and is often used with
sequence-to-sequence prediction.
IV. DATA GATHERING AND PRE-PROCESSING
In order to train a large language model to understand and
generate HPC code, we need to show it lots of examples. We
must first build a dataset to accomplish this. In this section,
To apply left-to-right models, which are focused on in this we detail our collected dataset and how it is processed. We
paper, to source code you simply need to provide the model present two additional code datasets paired with performance
with prior context as a sequence of tokens and then let it data for further fine-tuning model performance.
generate new tokens until some stopping threshold. The prior
context is typically a natural language comment followed by
A. HPC Source Code Data
a function declaration. Tokens are then generated until the We first collect a sufficiently large dataset of source code to
function is complete (a closing } bracket in the case of C/C++). train the model on HPC and scientific code. The HPC source
dataset is collected from GitHub repositories. The source files After filtering source files, we tokenize the dataset to obtain
are pulled from repositories with C/C++ marked as the primary integer values for the text that can be used as input into
language and with ≥ 3 stars. The repositories are additionally the model. We use the pre-trained tokenizers for each of our
filtered by HPC related GitHub topics. Once cloned, we collect selected models (see Section V). These are all GPT-2 [16]
all the C/C++ source files based on their file extension. based Byte-Pair Encoding (BPE) tokenizers.
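Collecting the C/C++ files of a cloned repository by extension could look like the sketch below. The extension set is an assumption: it combines the four extensions named in the Fig. 2 caption with other common C/C++ extensions, and the function name and demo files are hypothetical.

```python
import tempfile
from pathlib import Path

# Assumed extension list; the paper does not enumerate the full set.
CPP_EXTENSIONS = {".c", ".cc", ".cpp", ".cxx", ".h", ".hh", ".hpp", ".hxx", ".H"}

def collect_source_files(repo_root):
    """Return all files under repo_root whose extension marks them as C/C++."""
    root = Path(repo_root)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in CPP_EXTENSIONS)

# Demo on a throwaway directory standing in for a cloned repository.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "saxpy.cpp").write_text("// placeholder\n")
    (src / "kernel.h").write_text("// placeholder\n")
    (Path(tmp) / "README.md").write_text("docs\n")
    found = [p.name for p in collect_source_files(tmp)]
```

Matching on the extension alone is cheap and needs no build step, at the cost of missing source files with unusual names.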
This dataset is collected and structured in the same manner
as the C/C++ source dataset from Xu et al. [14]. Their dataset
C. Performance Datasets
is scraped from GitHub in a similar manner with the exception In addition to the large HPC source code dataset, we create
of only including repositories with ≥ 5 stars. Figure 2 shows two datasets of code paired with performance data. These
the distribution of lines of code (LOC) by file types in the datasets contain code pairs with performance data for both
HPC source dataset. There are roughly the same number of codes in the pair, and can be used to train an LLM to model
LOC in both C and C++ files. The distribution of actual file performance characteristics between them.
counts follows the same trend. We create two datasets – one with pairs of code that are
functionally different and one where they are the same. The
first dataset is created by using version control history to
capture performance regressions. We run each commit for the
Kripke [17] and Laghos [18] applications. These are small
HPC apps meant to mimic the computational behavior of larger
scientific applications. We automate building and running each
commit to the best of our ability and collect performance
results for 830 commits in total.
The second dataset is a set of programming competition
solutions from the code contests dataset [19]. These are
aggregated from several online programming competitions:
Aizu, AtCoder, CodeChef, CodeForces, and HackerEarth. This
dataset allows us to create pairs of code that solve the
same problem (the contest problem), but may be different
in implementation. We run every correct solution for each
problem in the dataset, with the corresponding problem’s test
cases as inputs, and record the run time. Using all the C++
solutions in the dataset we create ∼1.7 million samples of
code. Using the run times, we group the solutions into pairs
and label them as slower and faster pairs.
Fig. 2. Distribution of no. of lines of code in each file type. .cxx, .hh, .H, and
.hxx files are included in the dataset, but omitted here due to small counts.
B. Data Pre-processing
Allamanis [15] shows how duplicate source data, which
is prevalent across GitHub repositories, can adversely bias
V. FINE-TUNING METHODOLOGY
LLMs during training. To prevent this we filter our datasets by
removing duplicate files based on the hash of their contents. In this section, we describe the models used and how they
We use sha256 to hash the contents of the file. were selected. We also discuss the methods used to fine-tune
In addition to deduplicating we also filter out small and them on our collected dataset.
large files. Source files larger than 1 MB are designated as
A. Models Selected For Fine-tuning
large files and removed. These are generally entire libraries
in a single source file or contain raw data within the code. Recent years have seen the introduction of a significant
Additionally, files containing fewer than 15 tokens, as defined number of large language models. These models can range
by the language vocab, are not included. The reduced dataset in size from 100 million to more than 100 billion parameters.
sizes after deduplication and filtering are listed in Table I. Such large models have been shown to work well for language
Approximately 18% of the files are removed during this modeling, but pose significant hurdles to train and use in
processing. Table I shows the properties of the dataset after practice. They can take months to train on large GPU clusters
each step of deduplication and filtering. and typically cannot feasibly run inference on consumer-grade
hardware. Thus, choosing the right model requires selecting
one that can sufficiently model the language data, but also be
reasonably deployed for downstream tasks.
Keeping the above mentioned requirements in mind, we
select several models for fine-tuning and/or testing. These
are listed in Table II. All of these are based on GPT-2 [16]
and/or GPT-3 [23] architectures with slight variations in size,
configuration, and pre-training data. GPT-2, the smallest in our

TABLE I
PROPERTIES OF THE HPC SOURCE CODE DATASET.

Filter                                   # Files    # LOC         Size (GB)
None                                     239,469    61,585,704    2.02
Deduplicate                              198,958    53,043,265    1.74
Deduplicate + remove small/large files   196,140    50,017,351    1.62
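The deduplication and size filtering steps can be sketched as follows. This is a minimal illustration under stated assumptions: the exact byte threshold for "1 MB" is assumed to be 1,000,000, the whitespace-based `count_tokens` is a hypothetical stand-in for the model tokenizer's vocabulary, and the input is an in-memory mapping rather than files on disk.

```python
import hashlib

MAX_BYTES = 1_000_000  # assumed byte value for the 1 MB limit
MIN_TOKENS = 15        # files with fewer tokens are dropped

def count_tokens(text):
    # Hypothetical stand-in: whitespace tokens instead of the model tokenizer.
    return len(text.split())

def deduplicate_and_filter(files):
    """files maps path -> bytes. Returns paths kept after all three filters."""
    seen_hashes = set()
    kept = []
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of an earlier file
        seen_hashes.add(digest)
        if len(data) > MAX_BYTES:
            continue  # large file, e.g. a whole library or embedded raw data
        if count_tokens(data.decode("utf-8", errors="ignore")) < MIN_TOKENS:
            continue  # too small to be useful
        kept.append(path)
    return kept

body = (b"void saxpy(float *x, float *y, float a, int N) { "
        b"for (int i = 0; i < N; i++) y[i] += a * x[i]; }")
files = {
    "a.c": body,
    "copy_of_a.c": body,        # duplicate contents
    "tiny.h": b"int x;",        # fewer than 15 tokens
    "big.c": b"x " * 600_000,   # larger than the size limit
}
kept = deduplicate_and_filter(files)
```

Hashing the contents catches byte-identical copies only; near-duplicates with small edits would pass through this filter.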
experiments, is pre-trained on the WebText [20] dataset, which
TABLE II
DESCRIPTION OF THE MODELS USED FOR FINE-TUNING.

Model            # Params.   # Layers   Hidden Size   Window Size   Pre-Training Set
GPT-2 [16]       1.5B        48         1600          1024          WebText [20]
GPT-Neo [21]     2.7B        32         2560          256           Pile [22]
PolyCoder [14]   2.7B        32         2560          2048          Source Code

of the training loss (see Section II-A). Every 1000 optimizer
steps, we also test the model using the validation dataset, and
record the perplexity and accuracy at predicting tokens. The
validation dataset is 5% of the full dataset, separate from the
training dataset.
VI. DOWNSTREAM INFERENCE TASKS AND EVALUATION METRICS
In this section, we introduce the benchmarks and metrics
is a collection of language data scraped from the internet. used to evaluate the performance of the language models.
We use the 1.5 billion parameter GPT-2 model variant in this
A. Code Completion
paper. PolyCoder [14] is pre-trained on a collection of solely
source code data from GitHub that contains a mixture of 12 A standard benchmark for code generation tasks is the
popular programming languages [14]. Between these two is HumanEval benchmark [31]. This is comprised of 164 sample
GPT-Neo [21] that is pre-trained on the Pile dataset [22]. This Python problems, where the input to the model is a natural
dataset contains a collection of approximately 800GB of text language description of a function and function header. The
data from the internet, academic articles, source code, etc. model generates code for the function implementation, and is
Notably this dataset has a mixture of natural language and scored on functional correctness rather than textual similarity
code. It has been demonstrated that pre-training over both or equivalence.
natural language and code can improve the performance of We introduce our own adaptation of this benchmark for HPC
the model. C/C++ programs. Our benchmark consists of 25 custom HPC
We exclude models such as GPT-4 [24], the state-of-the-art code generation problems including simple numerics, OpenMP
model that powers GitHub Copilot, from our experiments due parallel code, and MPI routines. Table III lists the tests used
to the model and its dataset being closed source. It is currently in our evaluation. Figure 3 shows a sample prompt (top) and
only accessible for inference via a non-free API. GPT-4’s output (bottom) for a shared-memory parallel implementation
dataset being closed source is significant as we cannot remove of saxpy. The prompt is provided as input to the model and
data it has trained on from the dataset we use to evaluate its it is expected to generate text functionally equivalent to the
performance, so its results would be overly optimistic. This text on the bottom.
TABLE III
CODE GENERATION TESTS. OPENMP AND MPI COLUMNS DENOTE IF THE
TEST INCLUDES A VERSION WITH THAT PARALLEL BACKEND.

Name             Description                               Seq.   OpenMP   MPI
Average          Average of an array of doubles            X      X        X
Reduce           Reduce by generic function foo            X      X        X
Saxpy            Saxpy                                     X      X        X
Daxpy            Daxpy                                     X      X        X
Matmul           Double-precision matrix multiply          X      X        X
Simple Send      Send MPI message                                          X
Simple Receive   Receive MPI message                                       X
FFT              Double-precision FFT                      X      X        X
Cholesky         Single-precision Cholesky factorization   X      X        X
Ping-pong        MPI ping-pong                                             X
Ring pass        MPI ring pass                                             X

prevents a realistic evaluation and comparison.
B. Fine-tuning Setup and Hyperparameters
We rely on the functionality provided in the HuggingFace [25]
Python library for fine-tuning the models. This
library automates many of the tasks related to loading and
pre-processing datasets, and running language models on
the datasets. In particular, we use the Trainer interface
with DeepSpeed [26] as the backend to optimize fine-tuning.
DeepSpeed is a framework that provides distributed training
functionality and several memory optimizations to enable large
models to fit in GPU memory.
Starting with the pre-trained models, we fine-tune them on a
single node with an AMD EPYC 7763 CPU, 512 GB memory,
and four 40 GB NVIDIA A100 GPUs. With DeepSpeed’s
ZeRO memory optimizations [27], all of the models fit entirely
within a single A100 GPU and are, thus, fine-tuned using
pure data parallelism. We refer the reader to [28], [29] for
a comprehensive overview of training deep neural networks Evaluation Metric: We first record the ratio of generated
in parallel. samples that build correctly to those that do not. This indicates
We use the AdamW [30] optimizer for all the models to up- the model’s ability to generate syntactically correct code. For
date model weights and minimize the loss. We set the learning those that compile we compute the pass@k metric that denotes
rate to $5 \times 10^{-5}$ and Adam parameters β1 and β2 to 0.9 and the probability that at least one of k samples out of Np
0.999, respectively. These hyperparameters are consistent with code samples is correct. We do Np trials with each prompt
typical values in the literature. 16-bit floating point precision p to generate Np code samples, compile/run the samples,
is used to accelerate fine-tuning and reduce model size on the and record the number that are functionally correct (cp ). To
A100s. We record the perplexity of the model on the training estimate the probability that at least one of k samples chosen
data during fine-tuning. This is calculated as the exponential from Np samples is correct for a particular prompt, p, we
(a) Prompt
    /*
    multiply scalar float a by vector x and add to y
    vectors x and y are length N
    use OpenMP to compute in parallel
    */
    void saxpy(float *x, float *y, float a, int N) {

(b) Output
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            y[i] += a * x[i];
        }
    }

Fig. 3. An example prompt asking the model to generate a parallel version of
saxpy. The comment and function header make up the prompt. The function
body on the bottom shows a potential model output.

all left-to-right and can only append tokens to sequences. Thus,
we need to further fine-tune the models on a smaller dataset
that puts the for loop before the pragma. To accomplish this,
we first create a dataset of every for loop with an OpenMP
pragma from our HPC code dataset. 500 tokens of context
from before the for loop are also included. This results in a
dataset with 13,900 samples.
Since our model is left-to-right, we format each sample by
moving the pragma to directly after the loop and a unique
separating token <begin-omp>. This allows us to use the
model by providing a for loop plus some context and the
model will generate an OpenMP pragma for the for loop.
Each model is fine-tuned on this smaller dataset for three
epochs (passes over the entire dataset). To prevent overfitting
we use a starting learning rate of $3 \times 10^{-5}$. During training
10% of the dataset is set aside for validation.
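The sample layout for pragma prediction (context, then the for loop, then the separating token, then the pragma) can be sketched as below. Only the `<begin-omp>` token comes from the text; the newline placement around it, the function names, and the example strings are assumptions made for illustration.

```python
SEPARATOR = "<begin-omp>"  # separating token named in the text

def format_training_sample(context, loop, pragma):
    """Put the loop first and the pragma last so a left-to-right model
    can generate the pragma after it has read the loop."""
    return f"{context}\n{loop}\n{SEPARATOR}\n{pragma}"

def build_prompt(context, loop):
    """Inference-time prompt: everything up to and including the separator."""
    return f"{context}\n{loop}\n{SEPARATOR}\n"

def extract_pragma(generated_text):
    """Recover the pragma from text that follows the separator."""
    return generated_text.split(SEPARATOR, 1)[1].strip()

context = "void saxpy(float *x, float *y, float a, int N) {"
loop = "for (int i = 0; i < N; i++) {\n    y[i] += a * x[i];\n}"
pragma = "#pragma omp parallel for"

sample = format_training_sample(context, loop, pragma)
prompt = build_prompt(context, loop)
recovered = extract_pragma(sample)
```

At inference time the prompt ends at the separator, so whatever the model appends is read as the predicted pragma.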
Evaluation Metric: To measure the success of this test, we use
can use the number of generated samples that are functionally the accuracy of generating correct pragmas. This is calculated
correct, cp , out of the Np total samples generated to calculate as shown in Equation 3.
$$\text{accuracy} = \frac{\#\ \text{correct pragmas}}{\text{total pragmas tested}} \tag{3}$$
For this problem, we define a correct pragma in two ways:
syntactic and functional. To measure syntactic correctness we
compare the generated pragma with the actual pragma for
textual equivalence. Since it is impossible to automate the
running and evaluation of arbitrary for loops from our dataset
we measure functional correctness by comparing the generated
pass@k for a given k as,
$$\text{pass@}k = 1 - \binom{N_p - c_p}{k} \Big/ \binom{N_p}{k} \tag{1}$$
For each model, we report the average pass@k metric as the
average pass@k over all P prompts as shown below:
$$\text{average pass@}k = \frac{1}{P} \sum_{i=1}^{P} \left[ 1 - \frac{\binom{N_i - c_i}{k}}{\binom{N_i}{k}} \right] \tag{2}$$
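Equations 1 and 2 translate directly into a few lines of Python. The sketch below uses exact binomial coefficients from the standard library; the function names and the sample counts in the demo are illustrative assumptions, not results from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Equation 1: probability that at least one of k samples drawn from
    n generated samples is correct, given that c of the n are correct."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def average_pass_at_k(results, k):
    """Equation 2: mean pass@k over prompts; results holds (N_i, c_i) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Illustrative counts: 100 samples per prompt with 0, 25, and 100 correct.
demo = [(100, 0), (100, 25), (100, 100)]
avg_pass_1 = average_pass_at_k(demo, 1)
```

For k = 1 the metric reduces to the fraction of correct samples, so the demo average is (0 + 0.25 + 1) / 3.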
This metric provides insight into the probability of a model pragmas with the actual ones while ignoring differences that
generating functionally correct code. In our experiments, we do not contribute to functionality. For instance we ignore
calculate the pass@k score for several temperatures, namely reordering of variables and clauses where these do not mat-
0.1, 0.2, 0.4, 0.6, and 0.8, and select the best one. This is ter. Additionally, clauses such as schedule are ignored. This
in line with experiments in related literature [14]. For each correctness check is done using a custom Python script that
temperature and prompt, we generate Np = 100 samples. The parses the pragmas and compares them. We record accuracy
code is generated with nucleus sampling using 0.93 as the from both of these correctness metrics for each model.
cutoff value in the CDF (see Section II).
C. Relative Performance Prediction
To compile the generated code samples, we use g++ with
the “-O2 -std=c++17 -fopenmp” flags. For tests that In addition to text generation, we can also use the LLMs
need MPI we use the OpenMPI mpicxx compiler. If the build for classification. Here we use them to predict performance
is successful, then a corresponding driver binary is called that slowdowns between pairs of code.
will call and test the generated function for correctness. These Further Fine-tuning: In order to use the models for relative
are run on an AMD EPYC 7763 CPU with 64 physical cores performance classification we need to first fine-tune them on
at 2.45 GHz each. For tests that require OpenMP or MPI we new data for this output task. Using the Git commit data from
only denote them as correct if they used the corresponding Section IV-C we give the model text for a region of code
parallel framework to compute their result. before and after a Git commit. The codes are concatenated
B. Predicting OpenMP Pragmas with a unique token separating them, namely <COMMIT>. We
repeat a similar process for the code contest dataset, but instead
A common HPC coding task is decorating for
separate pairs by the token <PAIR>. With this data the model
loops with OpenMP pragmas. Every pragma starts with
is fine-tuned to predict whether the second code will be slower
#pragma omp parallel for and is followed by a list
(positive) or the same/faster (negative).
of optional clauses that modify the behavior of the parallel
For each dataset we fine-tune the model on 90% of the
for. We test the model’s ability to write OpenMP pragmas
data with the other 10% set aside for evaluation. The model
for arbitrary for loops.
takes the concatenated sequences of the two versions of
Further Fine-tuning: We cannot directly use the existing the code implementation and is fine-tuned for the binary
models to generate pragmas before a for loop, since they are classification problem of predicting relative performance. The
training objective is classification accuracy, which we also use to measure success for this task.
Evaluation Metric: To evaluate the performance on this task we measure the model’s classification accuracy. This is calculated as shown in Equation 4.
$$\text{accuracy} = \frac{\#\,\text{correct performance predictions}}{\text{total performance predictions}} \tag{4}$$
For this metric higher is better and a classification accuracy of 100% signifies a perfect score.
VII. RESULTS
We now present the fine-tuning and evaluation results using the downstream tasks discussed in Section VI.
A. Fine-tuning on HPC Source Code Data
We first show the results of fine-tuning the three models selected in Table II. Table IV shows the validation perplexity at the end of fine-tuning. Here perplexity is calculated as the exponential of the loss as described in Section II. Each model converges to a low perplexity score over the separate testing set (between 2 and 5). GPT-Neo and PolyCoder achieve comparable perplexity scores (within 0.01) while GPT-2 achieves a higher perplexity. All three have different pre-training datasets and the former two are of a larger size than GPT-2 (see Table II). From this we can conclude that for this problem the pre-training dataset had less of an impact on validation perplexity than the model size. The lower perplexity of the larger models means that they model the language better.
TABLE IV
FINAL VALIDATION PERPLEXITIES FOR EACH MODEL AFTER FINE-TUNING ON THE HPC SOURCE CODE DATASET.
Model                          GPT-2    GPT-Neo    PolyCoder
Final Validation Perplexity    4.47     2.23       2.24
For the rest of the results presented in this section we will use PolyCoder+HPC, GPT-Neo+HPC, and GPT2+HPC to refer to the respective models fine-tuned on the HPC dataset.
After fine-tuning each of the models and evaluating them on the downstream tasks we noticed that the perplexity would keep improving with more fine-tuning, but the downstream evaluation performance would start to decrease. This is likely because LLMs are subject to catastrophic forgetting during fine-tuning. Catastrophic forgetting is the phenomenon where previously learned information is lost or forgotten as the model continues training and updating its weights. It is typically mitigated by minimizing the amount of fine-tuning and using a sufficiently low learning rate.
To explore this phenomenon we ran the code generation tasks every 1000 samples during fine-tuning of the PolyCoder model. Figure 4 presents the results of these evaluation tests. After seeing about 45,000 samples during fine-tuning the model starts to decrease in evaluation performance. This is in contrast to the perplexity, which keeps improving past 45,000 samples. Based on this result we stop fine-tuning at 45,000 samples and use these weights for the rest of the evaluations. Additionally, due to the computation time needed to run this test we use the 45,000 samples stopping point for fine-tuning all the models.
Fig. 4. Downstream evaluation performance across training iterations for PolyCoder+HPC. The model starts to perform worse around 45,000 samples even though the perplexity keeps improving.
B. Code Completion
Having fine-tuned the three models, we now start using them for the different downstream tasks described in Section VI. The first downstream task is code generation, described in Section VI-A. Figure 5 shows the average pass@k rates for the code generation tests. The average pass@k values are computed according to Equation 2. We use PolyCoder as a baseline for comparison since it is a state-of-the-art LLM for code generation. PolyCoder+HPC scores the best for average pass@1, pass@10, and pass@100. For each value of k the models score in the order of PolyCoder+HPC, PolyCoder, GPT-Neo+HPC, and GPT2+HPC. PolyCoder+HPC gains the slight edge over the original PolyCoder by successfully generating code for the HPC-specific tasks (see Figure 6).
Fig. 5. Comparison of models on code generation. The clusters represent the average pass@k scores for k = 1, 10 and 100. Higher percentage is better.
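As a rough illustration of how an average pass@k score can be computed, the sketch below assumes that Equation 2 (not reproduced in this section) is the standard unbiased estimator used for the HumanEval evaluation in [31], namely 1 - C(n-c, k)/C(n, k) for a test with n generated samples of which c are correct. The names TestResult, pass_at_k, and mean_pass_at_k are hypothetical and are not taken from the paper's code.

```cpp
#include <cassert>
#include <vector>

// Outcome for one test: n samples were generated and c of them were correct.
// (Hypothetical structure for illustration only.)
struct TestResult {
    int n;
    int c;
};

// Unbiased pass@k estimate for a single test, assuming the standard form
// 1 - C(n-c, k) / C(n, k). The ratio of binomials is evaluated as a running
// product, which avoids computing large factorials.
double pass_at_k(int n, int c, int k) {
    assert(n > 0 && c >= 0 && c <= n && k > 0 && k <= n);
    if (n - c < k) {
        return 1.0;  // every subset of k samples contains a correct one
    }
    double prob_all_wrong = 1.0;
    for (int i = n - c + 1; i <= n; ++i) {
        prob_all_wrong *= 1.0 - static_cast<double>(k) / i;
    }
    return 1.0 - prob_all_wrong;
}

// Average pass@k over all tests.
double mean_pass_at_k(const std::vector<TestResult>& results, int k) {
    assert(!results.empty());
    double total = 0.0;
    for (const TestResult& r : results) {
        total += pass_at_k(r.n, r.c, k);
    }
    return total / static_cast<double>(results.size());
}
```

Under this assumption, a test with 4 samples of which 2 are correct gives pass@2 = 1 - 1/6, and a test with no correct samples contributes 0 to the average.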
In Figure 5 we see that GPT2+HPC scores significantly lower than the other models. This is likely due to the smaller model size and the fact that there is no source code in its pre-training dataset. In this instance fine-tuning is not enough to enable GPT-2 to generate correct C++ HPC code.
Altogether, the scores are indicative that PolyCoder+HPC and GPT-Neo+HPC have learned how to generate valid C++ code. For instance, if the best model, PolyCoder+HPC, is permitted to generate 100 samples, then at least one of them is correct for 71% of the tests on average. Similarly, for 1 sample generated this is 25%. These numbers roughly align with results from [14] on the HumanEval Python tests. However, the results are not directly comparable since they are a different set of tests in a different programming language.
To demonstrate the generative capabilities of the specialized models we reduce the code generation tasks to those that are specific to HPC. This includes code that uses OpenMP and/or MPI parallelism. Figure 6 shows the performance when restricted to these tests. We see that PolyCoder is unable to generate OpenMP and MPI code as it scores significantly lower than the rest. GPT2+HPC still scores fairly low; however, its score has actually improved slightly over Figure 5. This is due to the fact that the only source code it has seen during training is HPC-specific, and that is what is being tested here.
Fig. 6. Comparison of models on code generation for HPC-specific functions. The clusters represent the average pass@k scores for k = 1, 10 and 100. Higher percentage is better.
Another point of interest besides functional correctness is syntactic correctness. This can be measured by the total number of generated samples that compile successfully. This is how often the model generates valid code, whether it is functionally correct or not. This data is presented in Figure 7. PolyCoder and PolyCoder+HPC perform the best of all the models, with 84% and 86% of samples compiling correctly, respectively. GPT-Neo+HPC performs slightly worse at 74% and GPT2+HPC has only 30% of samples compile. The worse performance of the latter two can likely be attributed to their pre-training datasets having less code. We also observe that for all models there is a visual correlation between build and correctness rates, which is expected as a sample needs to compile in order to be functionally correct.
Fig. 7. Comparison of the models’ build rate. Both PolyCoder and PolyCoder+HPC have the best percentage of total samples that successfully compile. Higher percentage is better.
The code in Figure 8 shows example output from PolyCoder and PolyCoder+HPC on generating OpenMP code to compute a sum in parallel. We see that PolyCoder is able to produce correct sequential code; however, it fails to add any OpenMP pragmas. The PolyCoder+HPC model is able to correctly tag the for loop with an OpenMP pragma.
(a) Prompt
    /*
    Compute the sum of the array X and return the sum.
    X has N elements.
    Use OpenMP to compute the sum in parallel.
    */
    float sum(float *X, int N) {
(b) PolyCoder output
        float sum = 0.0f;
        for (int i = 0; i < N; i++)
            sum += X[i];
        return sum;
    }
(c) PolyCoder+HPC output
        float sum = 0.0f;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += X[i];
        return sum;
    }
Fig. 8. Example OpenMP output from (b) PolyCoder and (c) PolyCoder+HPC. The comment and function description (top) make up the prompt that is given to the model, while the bottom two blocks are the generated text. We see that PolyCoder is unable to generate OpenMP pragmas for the reduction in this example.
A similar example with distributed memory is shown in Figure 9. PolyCoder+HPC is able to generate correct MPI code to compute an average in parallel across ranks. PolyCoder often generated long and incorrect code that demonstrated little understanding of how to write MPI routines. The example presented is indicative of the rest of the MPI samples generated by PolyCoder. Notably, some instances did contain MPI calls, so PolyCoder must have some MPI code in its pre-training dataset. However, it lacked the ability to properly use MPI.
(a) Prompt
    /*
    Use MPI to compute the average of X across ranks
    and return the result on all ranks.
    Use the default communicator.
    */
    double mpiAverage(double X) {
(b) PolyCoder output
        int rank;
        int nprocs;
        double average;
        MPI_Status status;
        MPI_Request request;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        if (rank == 0) {
            MPI_Recv(&status, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     &request);
            MPI_Send(&X, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            MPI_Send(&average, 1, MPI_DOUBLE, 0, 0,
                     MPI_COMM_WORLD);
        }
        return average;
    }
(c) PolyCoder+HPC output
        int size;
        double sum;

        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Allreduce(&X, &sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        return sum / size;
    }
Fig. 9. Example MPI output from (b) PolyCoder and (c) PolyCoder+HPC. The highlighted region is code generated by the model (reformatted to fit the column). PolyCoder results varied significantly; however, the above example demonstrates the general lack of understanding it had for MPI.
Figure 10 shows the speedups for the code generated by PolyCoder+HPC over sequential baselines. These are hand-written, efficient sequential implementations for each test. We see that PolyCoder+HPC is able to generate code that is faster than the sequential baseline. This demonstrates that it is not generating very poor performing parallel code and is likely using the parallelism correctly.
Fig. 10. Comparison of the speedups for the code generation tests over sequential baselines. They are all above 1, demonstrating that the model is not generating very poor performing parallel code.
Since PolyCoder+HPC scores among the best in training and the highest in these code generation tests, we select it for further comparisons in the rest of the paper. PolyCoder+HPC is the fine-tuned model we present as HPC-Coder. We continue to use PolyCoder as a baseline.
C. Predicting OpenMP Pragmas
Next, we examine the results from the OpenMP prediction tests described in Section VI-B, which are shown in Figure 11. We see that both models are able to generate functionally correct OpenMP pragmas with high accuracy (right plot). PolyCoder+HPC is able to do this with 97% accuracy and PolyCoder with 94%. The LLMs are exemplary at understanding the dependencies of the for loop and what clauses are required to correctly parallelize it. We see that the model that has seen large amounts of OpenMP code performs better.
We can also look at how well the models reproduce the pragmas exactly. This means all the clauses and variables within those clauses are in the same order in the dataset and in the output from the model. These results are shown in the left plot in Figure 11. While less meaningful than functional correctness, it is interesting that the models are able to exactly reproduce pragmas they have not seen before with relatively high accuracy (67% and 61%). This is likely due to certain trends in the construction and ordering of OpenMP clauses that the LLMs are learning as they train.
Fig. 11. Comparison of models on predicting OpenMP pragmas. The left plot presents accuracy in predicting OpenMP pragmas exactly as they appear in the dataset. The right plot shows the accuracy in predicting functionally correct OpenMP pragmas. Higher accuracy is better.
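To make the distinction between exact reproduction and a looser comparison concrete, the following sketch compares two pragmas in two ways: clause by clause in order, and as an order-insensitive set of clauses. This is only an illustration. The helper names split_clauses, exact_match, and same_clause_set are hypothetical, the order-insensitive check is not the functional-correctness test used in the paper, and the parser assumes there is no space between a clause name and its opening parenthesis.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split the text after "#pragma omp parallel for" into clauses.
// Whitespace separates clauses only outside parentheses, and all whitespace
// is dropped so that "private( i )" and "private(i)" compare equal.
std::vector<std::string> split_clauses(const std::string& pragma) {
    const std::string prefix = "#pragma omp parallel for";
    assert(pragma.compare(0, prefix.size(), prefix) == 0);
    std::vector<std::string> clauses;
    std::string current;
    int depth = 0;
    for (std::size_t i = prefix.size(); i < pragma.size(); ++i) {
        const char ch = pragma[i];
        if (std::isspace(static_cast<unsigned char>(ch))) {
            if (depth == 0 && !current.empty()) {
                clauses.push_back(current);
                current.clear();
            }
            continue;  // whitespace is never stored
        }
        if (ch == '(') ++depth;
        if (ch == ')') --depth;
        current += ch;
    }
    if (!current.empty()) clauses.push_back(current);
    return clauses;
}

// Exact match: same clauses, same order, same variable order inside clauses.
bool exact_match(const std::string& a, const std::string& b) {
    return split_clauses(a) == split_clauses(b);
}

// Looser check: same clauses regardless of the order they are written in.
bool same_clause_set(const std::string& a, const std::string& b) {
    std::vector<std::string> ca = split_clauses(a);
    std::vector<std::string> cb = split_clauses(b);
    std::sort(ca.begin(), ca.end());
    std::sort(cb.begin(), cb.end());
    return ca == cb;
}
```

For example, `private(i) reduction(+:sum)` and `reduction(+:sum) private(i)` fail the exact match but pass the order-insensitive check, which is one reason an exact-match score can sit well below a functional-correctness score.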
D. Relative Performance Prediction
Finally, we look at the results from the relative performance prediction tests described in Section VI-C, which are shown in Figure 12. Both models achieve high classification accuracy with PolyCoder+HPC being slightly better for the two proxy applications at 88% and PolyCoder at 86%. This means that for 88% of the code changes in the two repositories’ version control history PolyCoder+HPC is able to correctly identify if there will be a performance slowdown. Likewise for the programming competition dataset we see that PolyCoder+HPC outperforms the PolyCoder baseline with an accuracy of 92% vs 86%. This is a higher accuracy improvement than the proxy applications by 4 percentage points. This is likely due to the fact that the programming competition dataset is larger and PolyCoder+HPC has been trained on more C/C++ code.
Fig. 12. Comparison of models on predicting relative performance of code changes. Both models achieve similarly high accuracy. The PolyCoder+HPC model performs slightly better on both datasets. Higher accuracy is better.
The success of this test demonstrates that the models are able to correlate their prior language understanding with performance related properties of code. This means we can leverage LLMs and fine-tuning to model code performance without the need to collect large amounts of data.
VIII. RELATED WORK
In this section we detail related work that uses LLMs to study source code and work that uses machine learning to model the performance of source code.
A. LLMs for Code Generation
With the explosion in research in transformer models and LLMs there have been a large number of papers applying these techniques to source code. Most of these methods have extended GPT-2 [16], GPT-3 [23], or BERT [32], [33] models and trained them on code. A notable instance is Codex [2], which is a modification of GPT-3 that is targeted for source code generation. Following Codex’s introduction there have been several other works that have introduced state-of-the-art large language models [3], [4], [34]. While some of these are open source, the best, such as GPT-4 [24], keep their architecture, weights, and training data closed source and only inference is available via a paid API.
A large amount of this recent research has focused on code generation. These models usually take a mix of code and natural language and learn how to meaningfully finish the code. While seminal works have continued to improve code generation with better and bigger models [2], [23], [33], other works have explored how to better utilize these tools in software engineering workflows [35]–[37]. Some flip code generation around and learn to generate natural language code summaries from code snippets [7]–[10].
These models can even be trained for tasks such as bug and malware detection [11], [12]. LLMs can also be used to suggest fixes in these cases rather than just identify problematic code. Many other previously difficult to automate software development tasks have since been tackled by applying LLMs [6]. More recently some of these tasks have included HPC development tasks such as race detection [38] and OpenACC compiler validation [39].
B. Machine Learning Applied to Source Code Performance
However, one important problem in software development that has not received much research with LLMs is that of performance. Many of the reasons listed in Section I have prevented meaningful studies from being accomplished. Previous approaches used code2vec [40], ir2vec [41], or a similar method to first map source code to an embedded space that could then be learned on. These were successfully used for some performance related analytical modeling such as OpenCL kernel device placement [41], but never leveraged LLMs for a full performance study.
Garg et al. [42] recently introduced DeepDevPERF, which is a BART-based [43] LLM designed to suggest performance improvements to arbitrary C# code. They overcome the issue of data collection by using code changes from Git commits that have performance related keywords in their commit message, although this dataset is still noisy. This work is different from that presented in this paper as it suggests code transformations rather than learning relative performance. The latter is useful in cases where two versions of a code already exist, such as with Git commits. Additionally, our model is trained on real performance data and can be used for HPC and parallel code generation tasks.
IX. CONCLUSION AND FUTURE WORK
In this paper, we have demonstrated the fine-tuning of an LLM using HPC code, and its ability to outperform other LLMs in HPC related tasks such as HPC code generation and performance modeling. We have accomplished this by fine-tuning a model, and showing that it can generate functionally correct HPC code at up to a 53% higher pass@k rate and can accurately label for loops with OpenMP pragmas with 97% success. We have further demonstrated how this fine-tuned model can be utilized to study performance properties of source code with little data. These results demonstrate the need for and usefulness of HPC-specific language models. The
best model in our experiments, PolyCoder+HPC, we present as HPC-Coder.
In the future, we plan to explore further analyses that can be accomplished using our language model. We also plan on exploring how to tune the model to generate not just correct but performant code. Additionally, we plan to investigate how to engineer these innovations into practical tools that can be easily used by computational scientists and HPC developers to enable them to produce better code more efficiently.
ACKNOWLEDGMENT
This material is based upon work supported in part by the National Science Foundation under Grant No. 2047120. This work was performed in part under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-844549).
REFERENCES
[1] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, and J.-R. Wen, “A survey of large language models,” 2023.
[2] M. Chen et al., “Evaluating large language models trained on code,” 2021.
[3] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo, T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier, J. Monteiro, O. Shliazhko, N. Gontier, N. Meade, A. Zebaze, M.-H. Yee, L. K. Umapathi, J. Zhu, B. Lipkin, M. Oblokulov, Z. Wang, R. Murthy, J. Stillerman, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu, S. Singh, S. Luccioni, P. Villegas, M. Kunakov, F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao, M. Mishra, A. Gu, J. Robinson, C. J. Anderson, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes, T. Wolf, A. Guha, L. von Werra, and H. de Vries, “Starcoder: may the source be with you!” 2023.
[4] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” 2023.
[5] J. Senanayake, H. Kalutarage, and M. O. Al-Kadri, “Android mobile malware detection using machine learning: A systematic review,” Electronics, vol. 10, no. 13, 2021. [Online]. Available: https://www.mdpi.com/2079-9292/10/13/1606
[6] “Ml4code,” https://ml4code.github.io/, accessed: 2022.
[7] J. Gu, P. Salza, and H. C. Gall, “Assemble foundation models for automatic code summarization,” 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 935–946, 2022.
[8] T. Ahmed and P. Devanbu, “Learning code summarization from a small and local dataset,” ArXiv, vol. abs/2206.00804, 2022.
[9] S. Haque, Z. Eberhart, A. Bansal, and C. McMillan, “Semantic similarity metrics for evaluating source code summarization,” 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC), pp. 36–47, 2022.
[10] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “A transformer-based approach for source code summarization,” ArXiv, vol. abs/2005.00653, 2020.
[11] C. Richter and H. Wehrheim, “Can we learn from developer mistakes? learning to localize and repair real bugs from real bug fixes,” ArXiv, vol. abs/2207.00301, 2022.
[12] A. Kharkar, R. Z. Moghaddam, M. Jin, X. Liu, X. Shi, C. B. Clement, and N. Sundaresan, “Learning to reduce false positives in analytic bug detectors,” 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 1307–1316, 2022.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[14] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A Systematic Evaluation of Large Language Models of Code,” Feb. 2022, https://arxiv.org/abs/2202.13169. [Online]. Available: https://doi.org/10.5281/zenodo.6363556
[15] M. Allamanis, “The adverse effects of code duplication in machine learning models of code,” in Proceedings of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, ser. Onward! 2019. New York, NY, USA: Association for Computing Machinery, 2019, p. 143–153. [Online]. Available: https://doi.org/10.1145/3359591.3359735
[16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.
[17] A. Kunen, T. Bailey, and P. Brown, “KRIPKE-a massively parallel transport mini-app,” Lawrence Livermore National Laboratory (LLNL), Livermore, CA, Tech. Rep, 2015.
[18] V. A. Dobrev, T. V. Kolev, and R. N. Rieben, “High-order curvilinear finite element methods for lagrangian hydrodynamics,” SIAM Journal on Scientific Computing, vol. 34, no. 5, pp. B606–B641, 2012. [Online]. Available: https://doi.org/10.1137/120864672
[19] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. d. M. d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code generation with alphacode,” 2022. [Online]. Available: https://arxiv.org/abs/2203.07814
[20] A. Gokaslan and V. Cohen, “Openwebtext corpus,” http://Skylion007.github.io/OpenWebTextCorpus, 2019.
[21] S. Black, G. Leo, P. Wang, C. Leahy, and S. Biderman, “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow,” Mar. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.5297715
[22] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy, “The pile: An 800gb dataset of diverse text for language modeling,” CoRR, vol. abs/2101.00027, 2021. [Online]. Available: https://arxiv.org/abs/2101.00027
[23] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” CoRR, vol. abs/2005.14165, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
[24] OpenAI, “Gpt-4 technical report,” 2023.
[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-Art Natural Language Processing.” Association for Computational Linguistics, Oct. 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[26] Microsoft, “Deepspeed: Extreme-scale model training for everyone,” https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/.
[27] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, “Zero-offload: Democratizing billion-scale model training,” CoRR, vol. abs/2101.06840, 2021. [Online]. Available: https://arxiv.org/abs/2101.06840
[28] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” ACM Comput. Surv., vol. 52, no. 4, Aug. 2019. [Online]. Available: https://doi.org/10.1145/3320060
[29] D. Nichols, S. Singh, S.-H. Lin, and A. Bhatele, “A survey and empirical evaluation of parallel deep learning frameworks,” 2022.
[30] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” CoRR, vol. abs/1711.05101, 2017. [Online]. Available: http://arxiv.org/abs/1711.05101
[31] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri,
G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, “Evaluating large language models trained on code,” 2021.
[32] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423
[33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019. [Online]. Available: http://arxiv.org/abs/1907.11692
[34] Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang, “Magicoder: Source code is all you need,” arXiv preprint arXiv:2312.02120, 2023.
[35] J.-B. Döderlein, M. Acher, D. E. Khelladi, and B. Combemale, “Piloting copilot and codex: Hot temperature, cold prompts, or black magic?” ArXiv, vol. abs/2210.14699, 2022.
[36] S. Barke, M. B. James, and N. Polikarpova, “Grounded copilot: How programmers interact with code-generating models,” ArXiv, vol. abs/2206.15000, 2022.
[37] A. Sarkar, A. D. Gordon, C. Negreanu, C. Poelitz, S. S. Ragavan, and B. G. Zorn, “What is it like to program with artificial intelligence?” ArXiv, vol. abs/2208.06213, 2022.
[38] L. Chen, X. Ding, M. Emani, T. Vanderbruggen, P.-H. Lin, and C. Liao, “Data race detection using large language models,” 2023.
[39] C. Munley, A. Jarmusch, and S. Chandrasekaran, “Llm4vv: Developing llm-driven testsuite for compiler validation,” 2023.
[40] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “code2vec: Learning distributed representations of code,” 2018. [Online]. Available: https://arxiv.org/abs/1803.09473
[41] S. VenkataKeerthy, R. Aggarwal, S. Jain, M. S. Desarkar, R. Upadrasta, and Y. N. Srikant, “IR2Vec: LLVM IR based scalable program embeddings,” ACM Trans. Archit. Code Optim., vol. 17, no. 4, Dec. 2020. [Online]. Available: https://doi.org/10.1145/3418463
[42] S. Garg, R. Z. Moghaddam, C. B. Clement, N. Sundaresan, and C. Wu, “Deepdev-perf: a deep learning-based approach for improving software performance,” Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022.
[43] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” 2019. [Online]. Available: https://arxiv.org/abs/1910.13461