
Towards Translating Real-World Code with LLMs:

A Study of Translating to Rust


Hasan Ferit Eniser* (MPI-SWS)
Hanliang Zhang*, Cristina David, Meng Wang (University of Bristol)
{hanliang.zhang,meng.wang,cristina.david}@bristol.ac.uk

Maria Christakis (TU Wien)
Brandon Paulsen, Joey Dodds, Daniel Kroening (Amazon Web Services, Inc.)
{bpaulse,jldodds,dkr}@amazon.com

*Equal Contribution

arXiv:2405.11514v2 [cs.SE] 21 May 2024

Large language models (LLMs) show promise in code translation – the task of translating code written in one programming language to another language – due to their ability to write code in most programming languages. However, LLMs' effectiveness on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation, as well as their capacity to fix a previously generated buggy one. If the original and the translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks, and they also provide insights into next steps for improvements.

I. INTRODUCTION

The task of program translation between programming languages is becoming particularly relevant, given the recent interest in safe programming languages such as Rust, and the expectation of translating potentially buggy, legacy code into such modern languages. While "rule-based" translation tools have been developed [1]–[3] that target a fixed source and target language (e.g. C to Rust), recent work [4]–[7] provides hope that large language models (LLMs) can accomplish this task for any source and target language.

Prior work in using LLMs for code translation [4]–[9] has almost exclusively focused on translating code taken from competitive programming websites [10], educational websites [11], or hand-crafted coding problems [12], [13]. While useful, such benchmarks are not representative of real-world code. For example, these benchmarks are typically a single function using only primitive data types, whereas real-world code has many functions and user-defined data types (e.g. structs).

In this work, we take a step towards answering the question: Can LLMs translate real-world code? Towards this end, we develop FLOURINE, an end-to-end code translation tool capable of producing validated Rust translations. FLOURINE first uses an LLM to obtain a candidate translation, then applies compilation-driven repair, where we make use of the Rust compiler's error messages as described in [14]. Once the translation compiles, FLOURINE uses cross-language differential fuzzing to test the I/O equivalence between the source program and the Rust translation. Notably, our cross-language differential fuzzer removes the need for unit tests – prior work assumed test cases already exist in the target language, or they were hand-written as part of the study, making a substantial investigation difficult. If a counterexample is discovered, FLOURINE executes a feedback strategy, which provides feedback to the LLM to fix the counterexample.

For the dataset, we extract benchmarks from seven open source projects written in C and Go. We do not use the entire projects because LLMs cannot fit them in their context window. We choose these languages because Rust, C, and Go are typically used for low-level programming tasks, such as systems development, so C and Go are likely candidates for translation to Rust. The open source projects are from a diverse set of domains: audio processing, text processing, geometry, banking, 2D triangulation, graph algorithms, and sound card emulation. To automate and reduce bias in the selection of code samples, we develop a methodology and tool for extracting them from projects. We use this tool to extract code samples that contain between 1 and 25 functions and use only standard libraries, and which also use features such as global variables, user-defined, dynamically allocated data structures, array pointers, type casts, enumeration types, etc.

For example, Figure 1 contains a program extracted from the ACH library featuring a global variable moov_io_ach_stringZeros, which is initialised with the function call moov_io_ach_populateMap(94, "0").
var (
    moov_io_ach_stringZeros map[int]string = moov_io_ach_populateMap(94, "0")
)

func moov_io_ach_populateMap(max int, zero string) map[int]string {
    out := make(map[int]string, max)
    for i := 0; i < max; i++ {
        out[i] = strings.Repeat(zero, i)
    }
    return out
}

Fig. 1: Code sample from ACH
"0"). This kind of initialization of a global variable is not N int64
allowed in Rust, making it non-trivial to find an equivalent G *Matrix
S []bool
translation without resorting to unsafe code. Claude3 managed Slack, Slackx, Prev []int64
to find the following translation: Lx, Ly []int64
}
static MOOV_IO_ACH_STRING_ZEROS:
Lazy<HashMap<usize, String>> = type Matrix struct {
Lazy::new(|| populate_map(94, "0")); N int64
A []int64
This snippet uses once_cell::Lazy, which stores a value }
that gets initialized on the first access.
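For reference, the following is a minimal, self-contained sketch of this pattern; it assumes the once_cell crate is available and uses a simplified populate_map helper rather than the exact code generated for the benchmark.

use std::collections::HashMap;
use once_cell::sync::Lazy;

// Simplified stand-in for the generated populate_map helper.
fn populate_map(max: usize, zero: &str) -> HashMap<usize, String> {
    (0..max).map(|i| (i, zero.repeat(i))).collect()
}

// The global is initialized on first access, mirroring Go's package-level
// variable initialization without resorting to unsafe code.
static STRING_ZEROS: Lazy<HashMap<usize, String>> =
    Lazy::new(|| populate_map(94, "0"));

fn main() {
    assert_eq!(STRING_ZEROS.get(&3).map(String::as_str), Some("000"));
}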
As another example, Figure 2 contains a program we extracted from the go-gt library, featuring a user-defined type Env that assembles several arrays, pointers and numeric data. Mapping Env to an exact counterpart in Rust is not obvious, as a slice []int64 in Golang can be represented by a vector in Rust, Vec<i64>, which is a growable, owning array-like data type, or by a borrowed slice, &'a [i64], a non-growable, non-owning array-like data type. Our cross-language differential fuzzer handles translations of the function add that use both variants for Env by correctly mapping between the Go and Rust representations of its inputs (where the receiver e of type *Env is one of the inputs).

func (e *Env) add(i, p int64) {
    var j int64
    e.S[i] = true
    e.Prev[i] = p
    for j = 0; j < e.N; j++ {
        if e.Lx[i]+e.Ly[i]-e.G.Get(i, j) < e.Slack[i] {
            e.Slack[i] = e.Lx[i] + e.Ly[i] - e.G.Get(i, j)
            e.Slackx[i] = j
        }
    }
}

func (m Matrix) Get(i int64, j int64) int64 {
    return m.A[i*m.N+j]
}

type Env struct {
    N                   int64
    G                   *Matrix
    S                   []bool
    Slack, Slackx, Prev []int64
    Lx, Ly              []int64
}

type Matrix struct {
    N int64
    A []int64
}

Fig. 2: Function add from go-gt

The code in Figure 3, extracted from the go-edlib library, returns all the longest common subsequences of two input strings (for brevity, we omit the callees). This code presents several challenges, for instance, finding a correct mapping between different styles of error handling. In Golang, a failable computation's output is typically expressed by a pair of the target output type and the error type, as shown in the signature of LCSBacktrackAll. On the other hand, in Rust, this is often expressed by an optional output type, such as Result<Vec<String>, Error>. Moreover, this program contains casts from strings to arrays, and array manipulation, which need to be correctly mapped to their corresponding Rust representations.

func LCSBacktrackAll(str1, str2 string) ([]string, error) {
    runeStr1 := []rune(str1)
    runeStr2 := []rune(str2)

    if len(runeStr1) == 0 || len(runeStr2) == 0 {
        return nil, errors.New("Can't process and backtrack any LCS with empty string")
    } else if Equal(runeStr1, runeStr2) {
        return []string{str1}, nil
    }

    return processLCSBacktrackAll(
        str1,
        str2,
        lcsProcess(runeStr1, runeStr2),
        len(runeStr1),
        len(runeStr2),
    ).ToArray(), nil
}

Fig. 3: Function LCSBacktrackAll from go-edlib
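To make this error-handling mapping concrete, a plausible Rust counterpart of the LCSBacktrackAll signature is sketched below. This is only an illustration of the style of mapping, not the translation produced in our experiments; the error type and the omitted backtracking step are placeholders.

// Sketch: the Go result pair ([]string, error) becomes a Result in Rust.
fn lcs_backtrack_all(str1: &str, str2: &str) -> Result<Vec<String>, String> {
    let rune_str1: Vec<char> = str1.chars().collect();
    let rune_str2: Vec<char> = str2.chars().collect();
    if rune_str1.is_empty() || rune_str2.is_empty() {
        // Go's `return nil, errors.New(...)` maps to returning an Err value.
        return Err("Can't process and backtrack any LCS with empty string".to_string());
    }
    if rune_str1 == rune_str2 {
        // Go's `return []string{str1}, nil` maps to returning an Ok value.
        return Ok(vec![str1.to_string()]);
    }
    // The recursive backtracking over the LCS table is omitted in this sketch.
    Ok(Vec::new())
}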
We evaluate the LLMs' ability to produce compilable translations, as well as translations that preserve the source behavior. Given that semantic equivalence is critical to program translation, we further investigate the LLMs' potential to take feedback and fix equivalence errors. We develop and compare four different feedback strategies. Three of our feedback strategies provide the LLM with counterexamples returned by the fuzzer. We compare these with a baseline strategy that repeats the original prompt, relying on randomness in LLM inference to obtain a new candidate translation.

In total, we perform 8160 code translation experiments across 408 code samples, four feedback strategies and five state-of-the-art LLMs – GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. Overall, the LLMs achieve successful translation rates of 21% to 47% on our code samples, and feedback strategies are responsible for up to 8% absolute of this success rate. Somewhat unsurprisingly, we find that larger programs (in LoC) are less likely to translate successfully than smaller programs.
Surprisingly, we also find that our feedback strategies that include counterexamples in the prompt actually perform worse than the simple baseline strategy. We discuss why this may be the case, and suggest directions for future work.

We claim the following contributions:
• We develop FLOURINE, a tool capable of producing validated Rust translations without the need for hand-written test cases
• We build a cross-language fuzzer, capable of passing inputs and outputs between languages
• We use FLOURINE to conduct the first substantial study of using LLMs to translate real-world code
• We demonstrate that LLMs are capable of translating parts of real-world projects, and that directly providing counterexamples as feedback to an LLM is less effective than repeating the original prompt
• We open source all code, benchmarks, and results to reproduce our experiments*

* The artifact can be downloaded at https://d34gtk3knhjgeg.cloudfront.net/artifact.tar.gz

II. RELATED WORK

In this section, we discuss closely related work from the literature under several categories.

Code Translation. The most closely related code translation works use LLMs for translation where the source and target languages are different. Most of them [4]–[7], [9] evaluate exclusively on competitive programming style code. In contrast, our work evaluates on real-world code, allowing us to draw stronger conclusions about LLMs' ability for code translation. Others use some real-world code: [15] evaluates on real-world API code, translating Java to C#. Their technique requires additional fine-tuning, unlike ours. Another work, [8], uses real-world benchmarks but does not produce syntactically correct code for those examples. Two of the works [8], [9] conclude that counterexamples can be useful feedback, which does not match our conclusion. We compare our results with theirs in Section VI-C1.

Other LLM/ML code translation works focus on problems where the source and target language are the same [16], [17]. We consider this a different task than ours, because the goals are different. Meta studies have been conducted on code translation as well in [18], [19], though they do not provide insight on translating real-world code. Finally, several works have developed rule-based techniques for specific source and target language pairs such as C to Rust [3], [20], [21], C to Go [1], and Java to C# [2]. While rule-based approaches can theoretically guarantee correctness of the translation, they require significant engineering effort to build, and they can produce unidiomatic code as we demonstrate in our results.

Cross-Language Differential Fuzzing. While differential fuzzing/testing has a rich literature, the majority do not consider comparing implementations in two different languages. There are many works that compare programs in the same language using symbolic execution [22]–[25] and fuzzing [26]–[29]. Such works do not need to solve the problem of mapping data from one language to another, though they are likely complementary to our work – they could be used to improve the coverage achieved by our fuzzer. Works in fuzzing multi-language systems [30] do not address this problem either. Only one work [31] attempts general cross-language testing like we do, by compiling both languages down to a shared IR. As we will discuss in Section IV-B, this approach cannot effectively handle user-defined data types, and is heavily dependent on the IR compiler preserving the structure of the original source program.

Feedback Strategies for LLMs. Only a limited number of works have tried to develop feedback strategies for LLMs. Recent work in automated program repair [32], [33] reports success with an approach that provides counterexamples as feedback. We discuss their results in relation to ours in Section VI-C1. While any automated program repair technique could be used as a feedback strategy, we focus only on feedback strategies that use an LLM to fix errors.

III. OVERVIEW

In this section, we define the task of code translation, provide an overview of our algorithm for code translation with LLMs, and then illustrate it on a concrete example.
A. Code Translation

We first formally define the problem of code translation. Let l be a programming language, and P_l the set of all valid programs written in l. Assume we have a program p ∈ P_l that we wish to translate to a different language l′. That is, we wish to find p′ ∈ P_l′ that has the same behavior as p with respect to a mapping between the values of l and l′.

In our work, a program (i.e. p or p′) is a set of functions, user-defined types (e.g. struct definitions), global variables, import/include statements, etc. One of the functions in a program is the entry point function. Note that the entry point function is not necessarily main() – the inputs and outputs of the entry point could be primitive data types, user-defined types (e.g. structs, classes), and even pointer types.

For simplicity of notation, we define p and p′ as operating on program states. A program state contains the values of the inputs and outputs of the program, as well as variables defined in the global scope. Letting S_p and S_p′ be the sets of all program states for p and p′, respectively, we have p : S_p → S_p and p′ : S_p′ → S_p′. We write p(s_in) = s_out, where s_in, s_out ∈ S_p, to denote the result of executing p on s_in.

To complete our definition of code translation, we define M : S_p → S_p′ and M′ : S_p′ → S_p, which are mapping functions that map states of program p to states of p′, and vice versa. Formally, translation's goal is to discover a program p′ such that:

∀s ∈ S_p . p(s) = M′(p′(M(s)))

B. Our Code Translation Algorithm

Next, we present our iterative algorithm for code translation. We again assume we have a source program p, and we wish to discover a translation p′ with the same behavior.

Let G : Q → P_l′ be an LLM that takes a natural language query q ∈ Q and outputs a candidate translation p′ ∈ P_l′. q contains the original source program p and natural language instructions to translate p into the target language l′. Note that in practice, the resulting p′ may have a top-level function whose function signature is incompatible with p, and therefore the mapping functions M, M′ cannot be defined, or the program output by the LLM may not compile. We find that the former rarely happens, and we address the latter through a compilation repair phase, which is based on the approach in [14].

We also assume the existence of a fuzzer. We define this as FUZZER(p, p′), which takes the original source program and the translation, and returns two sets of examples E+ and E−. E+ is a set of positive examples where p and p′ agree. Positive examples have the form (s_in, s_out), where s_in, s_out ∈ S_p′. E− is the set of counterexamples where the output produced by p disagrees with p′. A counterexample is a triple of states from S_p′ of the form (s_in, s_exp, s_act), which are the initial state, the expected output state, and the actual output state.

Finally, we have a routine FEEDBACK(q, p′, E−, E+) which takes the query q, the candidate translation p′, and the examples E+, E− returned by the fuzzer, and returns a new query that can be provided to G to generate a new candidate translation.

The routine for code translation with feedback strategies is shown in Algorithm 1. We first use G to generate a candidate program p′ from the initial query q, which we then pass to the compilation-driven repair routine. If this is unsuccessful at making p′ compile, we exit the loop and fail. Otherwise, we invoke the fuzzer to check for counterexamples. If none are found, we assume p′ is correct, and return it. Otherwise, we invoke a feedback routine, which generates a new q, and repeat the process until a program is found that passes the fuzzer check, or until some fixed budget is reached and we fail.

Algorithm 1 Iterative Code Translation Algorithm
Require: p: the program to translate, q: the initial task description, FEEDBACK: a feedback strategy, b: a budget
 1: while b > 0 do
 2:   p′ ← G(q)
 3:   if ¬COMPILATION-REPAIR(p′) then
 4:     break
 5:   end if
 6:   E−, E+ ← FUZZER(p, p′)
 7:   if E− = ∅ then
 8:     return p′
 9:   end if
10:   q ← FEEDBACK(q, p′, E−, E+)
11:   b ← b − 1
12: end while
13: return FAIL
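To connect the notation to code, the loop of Algorithm 1 can be sketched in Rust roughly as follows. The helper functions standing in for G, the compilation-driven repair, the fuzzer, and FEEDBACK are assumed stubs, not FLOURINE's actual interfaces.

// Assumed placeholder types for the sketch.
struct Counterexample;   // (s_in, s_exp, s_act) triple
struct Example;          // (s_in, s_out) pair

fn translate(p: &str, mut q: String, mut budget: u32) -> Option<String> {
    while budget > 0 {
        let mut candidate = llm_generate(&q);            // p' <- G(q)
        if !compilation_repair(&mut candidate) {
            break;                                       // candidate never compiles: fail
        }
        let (neg, pos) = fuzzer(p, &candidate);          // E-, E+ <- FUZZER(p, p')
        if neg.is_empty() {
            return Some(candidate);                      // no counterexamples: accept p'
        }
        q = feedback(&q, &candidate, &neg, &pos);        // build the next query
        budget -= 1;
    }
    None                                                 // FAIL
}

// Assumed stubs so the sketch is self-contained.
fn llm_generate(_q: &str) -> String { String::new() }
fn compilation_repair(_p: &mut String) -> bool { true }
fn fuzzer(_p: &str, _c: &str) -> (Vec<Counterexample>, Vec<Example>) { (Vec::new(), Vec::new()) }
fn feedback(q: &str, _c: &str, _n: &[Counterexample], _p: &[Example]) -> String { q.to_string() }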
C. Motivating Example

We now illustrate our Rust translation approach with the concrete example in Figure 2. In the example, our source program p is the function add from the go-gt library. This is a subroutine of the Hungarian algorithm [34] for finding a maximum matching, which adds two edges to an alternating path during the search, and records the output by mutating the receiver e.

We first create an initial query containing the Go code and instructions describing the translation task, which is given to the LLM to generate a candidate translation. If we continue past compilation-driven repair, the candidate translation p′ in Figure 4 is guaranteed to compile, but not to be I/O equivalent to the original source program. To check for I/O equivalence, p and p′ are passed to the fuzzer, which uses an off-the-shelf fuzzing technique to generate input states, execute both programs, and check that they produce the same output state. We capture side-effects by comparing whole program states, rather than just the explicit output.

struct Env {
    n: i64,
    g: Box<Matrix>,
    s: Vec<bool>,
    slack: Vec<i64>,
    slackx: Vec<i64>,
    prev: Vec<i64>,
    lx: Vec<i64>,
    ly: Vec<i64>,
}

struct Matrix {
    n: i64,
    a: Vec<i64>,
}

fn add(e: &mut Env, i: i64, p: i64) {
    let mut j: i64 = 0;
    e.s[i as usize] = true;
    e.prev[i as usize] = p;
    for j in 0..e.n {
        if e.lx[i as usize] + e.ly[i as usize] - get(&e.g, i, j) < e.slack[i as usize] {
            e.slack[i as usize] = e.lx[i as usize] + e.ly[i as usize] - get(&e.g, i, j);
            e.slackx[i as usize] = j;
        }
    }
}

fn get(m: &Matrix, i: i64, j: i64) -> i64 {
    m.a[(i * m.n + j) as usize]
}

Fig. 4: Rust translation of function add from go-gt

One of the challenges that we face is executing p and p′ in two different languages on matching input states, and then comparing their output states. Specifically for our running example, we must convert primitive types as well as user-defined data structures: Env has distinct representations in Go and Rust; arguments i and p have type int64 in Go, but i64 in Rust; e is a pointer to an Env value in Go, but a mutable reference in Rust.
{"e": {
Human:
"n": 3, # Preamble
"g": { You are given a C/Go program. We need to translate
"n": 3, it to Rust.
"a": [0, 0, 0, 0, 0, 0, 0, 0, 0]
}, # Code to be translated
"s": [false, false, false], {C/Go Program}
"slack": [0, 0, 0],
"slackx": [0, 0, 0], # Instruction
Give me a Rust translation of the above C/Go code.
"prev": [0, 0, 0], # Constraints
"lx": [0, 0, 0], Here are some constraints that you should respect:
"ly": [0, 0, 0] • Give me only the translated code, don’t add
}, explanations or anything else. # formatting guideline
"i": 2, • Use only safe Rust. # code characteristic
"p": 0} • Do not use custom generics. # fuzzer limitation
• ...

Fig. 5: Serialized JSON input state for function add Assistant:

but i64 in Rust; e is a pointer to an Env value in Go, but a Fig. 6: LLM Prompt for obtaining translations.
mutable reference in Rust.
To solve this challenge, we develop a technique based on
In particular, we have three types of constraints: formatting
serializing then de-serializing to exchange data between lan-
guidelines, code characteristics and fuzzer constraints. Format-
guages. We use the JSON [35] format, because most languages
ting guidelines describe how the generated code should look,
support it. Most data types, including complex data types and
simplifying parsing and extraction of relevant information
pointers can be automatically serialized into JSON, thus it
from the response. For code characteristics, we instruct the
allows us to easily support real-world code. For our example,
LLM to produce safe Rust code, and to maintain the same
Fig. 5 denotes a serialized valid input state. Once the two
function names, parameter names, and return types from the
programs are executed, the Go output state is again serialized
input code. Finally, the fuzzer constraints ensure that the
to JSON, deserialized to Rust, and compared against the Rust
generated code can be handled by our fuzzer (more details
output state. For our example, the expected output state, as
on this in Section IV-B).
obtained by executing the Go code, is the same as the input
The translation generated by the LLM may not initially
state in Fig. 5, with the only difference that the last element
compile. We address this with approach in [14]. At a high
of field s is set to true instead of false. The translation
level, we iteratively query the LLM to fix the error, until
in Figure 4 computes the expected output state, and it is thus
the code becomes compilable. Each time, we provide both
deemed I/O equivalent to the original Go code, and returned
the faulty translation and the error message from the Rust
by F LOURINE.
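On the Rust side, this state exchange can be realised with derive-based JSON (de)serialisation. The sketch below assumes a serde-style library (the paper does not prescribe one) and mirrors a simplified version of the Env type from Figure 4.

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Matrix { n: i64, a: Vec<i64> }

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct Env {
    n: i64,
    g: Matrix,
    s: Vec<bool>,
    slack: Vec<i64>,
    slackx: Vec<i64>,
    prev: Vec<i64>,
    lx: Vec<i64>,
    ly: Vec<i64>,
}

// The whole input state of Fig. 5: the receiver plus the two arguments.
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct InputState { e: Env, i: i64, p: i64 }

// Serialise a Rust state to JSON and read it back, as the fuzzer does when
// shipping states across the language boundary; field names drive the mapping.
fn roundtrip(state: &InputState) -> InputState {
    let json = serde_json::to_string(state).expect("serialise state");
    serde_json::from_str(&json).expect("deserialise state")
}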
Conversely, if a counterexample is discovered by the fuzzer, then we invoke a feedback method, which uses the counterexample to create a new query to the LLM and generates a new candidate translation. Designing a suitable feedback method is another challenging aspect of the translation task. There are many ways to re-query the LLM for a new translation, each with its own likelihood of success. Moreover, most state-of-the-art LLMs are operated as API services which charge per input token, so different query strategies will have different dollar costs. To address this, we propose and evaluate a set of feedback strategies.

IV. LLM-BASED CODE TRANSLATION

A. Obtaining Translations

As mentioned in the previous sections, we are considering the problem of translating a program written in C or Go to Rust. We use zero-shot prompting and follow the best practices given by the LLM's provider. We construct the initial query q (to be input to the LLM) as sketched in Figure 6. We start with a preamble describing the overall task. Then, we supply the program to be translated, and, finally, we provide specific constraints to be followed by the LLM.

Human:
# Preamble
You are given a C/Go program. We need to translate it to Rust.

# Code to be translated
{C/Go Program}

# Instruction
Give me a Rust translation of the above C/Go code.

# Constraints
Here are some constraints that you should respect:
• Give me only the translated code, don't add explanations or anything else. # formatting guideline
• Use only safe Rust. # code characteristic
• Do not use custom generics. # fuzzer limitation
• ...

Assistant:

Fig. 6: LLM Prompt for obtaining translations.

In particular, we have three types of constraints: formatting guidelines, code characteristics and fuzzer constraints. Formatting guidelines describe how the generated code should look, simplifying parsing and extraction of relevant information from the response. For code characteristics, we instruct the LLM to produce safe Rust code, and to maintain the same function names, parameter names, and return types from the input code. Finally, the fuzzer constraints ensure that the generated code can be handled by our fuzzer (more details on this in Section IV-B).

The translation generated by the LLM may not initially compile. We address this with the approach in [14]. At a high level, we iteratively query the LLM to fix the error, until the code becomes compilable. Each time, we provide both the faulty translation and the error message from the Rust compiler to the LLM, and ask it to use a specific format for the suggested fixes, applying them only to the affected lines of code.
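As a rough illustration of how compiler feedback can be captured for this loop, the sketch below builds a candidate crate and returns the compiler's diagnostics; the cargo invocation and project layout are assumptions for the sketch, not a description of FLOURINE's exact implementation.

use std::process::Command;

// Build the candidate crate; return the compiler diagnostics if it fails.
fn compile_errors(crate_dir: &str) -> Option<String> {
    let output = Command::new("cargo")
        .args(["build", "--quiet"])
        .current_dir(crate_dir)
        .output()
        .expect("failed to run cargo");
    if output.status.success() {
        None // the candidate compiles; no repair round is needed
    } else {
        // stderr carries the error messages that would be inserted into the
        // next repair prompt together with the faulty translation.
        Some(String::from_utf8_lossy(&output.stderr).into_owned())
    }
}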
B. Checking Translations

To test the I/O equivalence between the original source program p and a candidate Rust translation p′, we develop a cross-language differential fuzzer. For a given p and p′, we automatically generate a fuzzing harness in Rust, which uses Bolero and libfuzzer [36] to perform fuzzing. The test harness generates program states from S_p′, which are directly invoked on p′. We implement the mapping function M′ : S_p′ → S_p using JSON de/serialization. We serialize the Rust program state s′ into JSON format, and then instrument the source program p to deserialize the JSON into a program state of S_p. The instrumented p is invoked on the serialized s′ from Rust using a foreign function interface. To compare outputs, we map the output state of p to one of p′ using JSON de/serialization as well, which can then be directly compared.
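A harness of this shape could look roughly like the sketch below. It assumes Bolero's check!/with_type API for value generation and uses trivial stand-ins for the candidate translation and for the FFI call into the original program, so it only illustrates the structure of the check.

// Candidate Rust translation under test (trivial stand-in for the sketch).
fn candidate(x: i64, y: i64) -> i64 {
    x.wrapping_add(y)
}

// Stand-in for executing the original Go/C program on the same input state
// via JSON serialisation and a foreign function interface.
fn original_via_ffi(x: i64, y: i64) -> i64 {
    x.wrapping_add(y)
}

#[test]
fn differential_check() {
    bolero::check!()
        .with_type::<(i64, i64)>()
        .for_each(|&(x, y)| {
            // A mismatch here is reported back to FLOURINE as a counterexample in E-.
            assert_eq!(candidate(x, y), original_via_ffi(x, y));
        });
}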
We use JSON serializers for two reasons. First, the mapping between fields of user-defined data types in the source and target language is automatically determined based on the field names. This requires the LLM to produce data types with field names that match the source program, but in our benchmarks LLMs always do this. Second, most languages support automatic serialization of primitive, pointer, and user-defined types.

We note that an alternative approach, taken by [31], is to compile both p and p′ down to a common IR, such as LLVM, and then perform fuzzing on the IR. However, we find that IR compilers for different languages typically discard type and layout information (e.g. user-defined data types are represented as a void pointer). This makes it nearly impossible for a fuzzer to generate any meaningful inputs.

Soundness & Limitations. Our fuzzer can only make heuristic-based guarantees (e.g. coverage) on the equivalence of p and p′. This is a limitation of fuzzing and testing in general. However, our fuzzer achieves an average line coverage of 97%.

In addition, JSON serialization is not automatically supported for all types. For example, features in Rust like trait definitions, impl traits, and lifetimes in data type definitions are only partially supported. This means that the equivalence check may fail because serialization fails. We report these errors in Section VI-B. In addition, we do not support features like concurrency, network, and file I/O. Our benchmark selection excludes these features.
V. FEEDBACK STRATEGIES

In this section, we present four feedback methods that can be used if the fuzzer finds a counterexample E− for the correctness of the translation p′ produced by the LLM in Alg. 1.
a) Simple Restart (Restart): We discard the generated code p′ and re-query the model with the same prompt q.

b) Hinted Restart (Hinted): This builds on the previous strategy by adding positive and negative examples from the fuzzer, E+ and E−, to the original prompt q. The intention is to suggest desirable behaviours to the LLM, as well as known faulty cases to avoid. We separately group the examples in E+ and E− based on the paths they exercise in p′. Intuitively, this corresponds to splitting them into equivalence classes, where each equivalence class corresponds to a particular program path. Then, the query constructed by Hinted only contains positive and negative examples from a single equivalence class, respectively.

c) Counterexample-Guided Repair (BaseRepair): Discarding the generated code p′ when the fuzzer check fails may not always be the optimal choice. For instance, if p′ is close to passing the fuzzer, trying to repair it might work better. As part of BaseRepair, we give counterexamples from the fuzzer to the LLM. Similarly to Hinted, a query only contains negative examples from the same equivalence class, which correspond to bugs associated with the same program path. The expectation is that the candidate translation generated in the next iteration of Alg. 1 will produce the correct outputs for the given examples. A sketch of the prompt used for BaseRepair is given in Figure 7 (excluding the lines colored in magenta). In Alg. 1, if the translation generated by G for the query q constructed by BaseRepair still fails the fuzzer check, then this last faulty translation will be considered by the next call to BaseRepair.

d) Conversational Repair (CAPR): Recent work in code translation [8] and automated program repair [37] has proposed conversational repair approaches, wherein previous incorrect code is included in the prompt to the LLM to discourage the LLM from producing the same code again. The CAPR approach begins with the same prompt as BaseRepair; however, they differ if the new translation still fails the fuzzer check. In BaseRepair, we create a new prompt from scratch, but in CAPR, we keep the prompt, and append a new piece of dialogue to it, as shown in magenta in Figure 7. This process can be repeated multiple times, meaning the prompt is a dialogue of failed translations.

Human:
# Preamble
You are given a C/Go program and its faulty Rust translation. We need to repair the faulty Rust program.

# Code to be translated
{C/Go Program}

# Code to be repaired
{Faulty Rust Program}

# Instruction
Make changes to the given code to obtain expected outputs for the given test inputs.

# Constraints
Here are some constraints that you should respect:
...

# Counterexamples
CE1
CE2

Assistant:
{LLM generated rust translation}

Human:
That is incorrect on the following inputs:
# Counterexamples
CE1
CE2

Assistant:

Fig. 7: LLM Prompt for BaseRepair and CAPR. BaseRepair is shown in black. CAPR is shown in black and magenta.

The methods Restart and Hinted cost less than BaseRepair and CAPR as they don't include the incorrect translation in the prompt. Therefore the former use about half the input tokens of the latter.
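As an illustration of the grouping step shared by Hinted and BaseRepair, the sketch below buckets fuzzer-produced counterexamples by an opaque path identifier; the PathId and Counterexample representations are assumptions made for the sketch rather than FLOURINE's internal types.

use std::collections::HashMap;

type PathId = u64; // identifier of the program path exercised in p'

struct Counterexample {
    path: PathId,
    input: String,    // serialised s_in
    expected: String, // serialised s_exp
    actual: String,   // serialised s_act
}

// Split E- into equivalence classes, one per exercised program path; a single
// class is then selected when constructing the Hinted or BaseRepair prompt.
fn group_by_path(neg: Vec<Counterexample>) -> HashMap<PathId, Vec<Counterexample>> {
    let mut classes: HashMap<PathId, Vec<Counterexample>> = HashMap::new();
    for ce in neg {
        classes.entry(ce.path).or_default().push(ce);
    }
    classes
}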
VI. EVALUATION

In this section, we present our results for the following research questions.
RQ1: How do LLMs perform on translating code taken from real-world projects to Rust? We gather a large number of benchmarks by extracting code samples from real-world projects, and we use LLMs to generate translations, which are then checked for correctness by the fuzzer, and fixed if needed by applying feedback strategies. We answer the following concrete questions.

(RQ1.1) How many benchmarks can each LLM translate from each of our projects? We report the percentage of benchmarks from each project that are successfully translated for each LLM. We show that success rates vary widely based on the benchmark and LLM. LLMs achieve up to 80% success rate on benchmarks from our "easiest" project, and between 15-40% on our "hardest" project.

(RQ1.2) How does code complexity affect the success rate of translation? We look at how lines of code and number of functions in a benchmark influence the success rate. We show that lines of code strongly influences success rates.

(RQ1.3) How idiomatic is the Rust produced by LLMs? We run Clippy [38], Rust's standard linter, on the successful translations, and analyze the rates of different categories of linter warnings. We show that LLMs occasionally (1-15% of the time) produce code with linter warnings, suggesting that the translations could be made more performant or concise, or that they use unsafe code.

RQ2: How effective are feedback strategies at fixing translation bugs? In addition to overall translation success rates, we record the initial success rates – the rate at which the first translation passes the fuzzer – and compare this to the overall success rate. We answer two concrete questions.

(RQ2.1) How much do feedback strategies increase the translation success rate? We compare overall success rates directly to initial success rates. We show that the most effective feedback strategy increases the success rate by an absolute 6-8% on average for the best LLMs.

(RQ2.2) Which feedback strategies increase success rates the most? We compare the increase in success rates for each feedback strategy. We show that, surprisingly, Restart and Hinted outperform BaseRepair and CAPR consistently. We provide a plausible explanation for this result.

RQ3: How do LLM translations compare to rule-based translation tools? We compare LLM translations to translations produced by the rule-based translation tool C2Rust [3]. While C2Rust theoretically can guarantee the correctness of the translation, we show that LLMs produce far more concise and idiomatic translations.

RQ4: Why do translations fail? Translation can fail for several reasons beyond the fuzzer finding counterexamples. We report failure rates for different failure reasons.

A. Experimental Setup

1) Implementation: We implement an end-to-end translation tool, FLOURINE, which takes as input (1) a program, (2) a feedback strategy to apply, and (3) a budget. FLOURINE outputs either a corresponding Rust translation that passes the fuzzer, or it fails with an error. Algorithm 1 is used for the implementation of FLOURINE. FLOURINE is written entirely in Python, except for the fuzzer, which is written in Rust. FLOURINE currently supports C and Go for the input program. FLOURINE is implemented as a framework, which can be extended with new LLMs, feedback strategies, and language support for the input program. We use GNU Parallel [39] to run experiments in parallel.

2) LLMs: We limit our study to LLMs hosted by third party providers. This is in part because they are the highest performing on coding tasks, and they are the most accessible in that they do not require the user to own powerful compute resources. We use five LLMs in our evaluation: GPT-4-Turbo [40], Claude 2.1 [41], Claude 3 Sonnet [41], Gemini Pro [42], and Mixtral [43]. The first four are likely very large (1T+ parameters). On the other hand, Mixtral is relatively small (45B parameters), but is known for performing well on coding tasks, and costs less than the others. We access GPT-4-Turbo and Gemini Pro through OpenAI's and Google's APIs. We access Claude and Mixtral through AWS Bedrock. Due to lack of access to GPU machines, we do not attempt to run open source LLMs like CodeLLaMA.

3) Benchmarks: We collect benchmarks from real-world projects hosted on GitHub. We focus on C and Go as the source program languages for multiple reasons. First, C, Go, and Rust are typically used for lower-level programming tasks, unlike other popular languages like Java or Python. Thus they are likely candidates for translating to Rust. Second, and more pragmatically, projects written in C and Go make less use of third party libraries, which we do not attempt to support for this work. Conversely, most Java and Python projects make heavy use of third party libraries.

We choose seven projects with the aim of getting a diverse set of application domains. Our projects are:
• ACH [44]: a Go library implementing a reader, writer, and validator for banking operations
• geo [45]: a math-focused Go library implementing common geometry functions and interval arithmetic
• libopenaptx [46]: a C library for audio processing
• opl [47]: a C library for sound card emulation
• go-gt [48]: a Go library for graph algorithms
• go-edlib [49]: a Go library for string comparison and edit distance algorithms
• triangolatte [50]: a 2D triangulation library in Golang

As we will show in our experiments, LLMs are still not capable of translating entire projects. To create benchmarks of manageable size, we develop a tool for automatically extracting benchmarks from these projects. Our tool takes as input the entire project and a specific function identifier f in the project. The tool then analyzes the project to find all of f's dependencies, including all functions called by f (including transitive calls), type definitions, standard libraries, global variables, etc., and extracts them into a single, compilable file. The translation task is then to write a compilable Rust file with a function equivalent to f's behavior. Our methodology for selecting benchmarks is to iterate over all functions in a project, create a benchmark for each, and keep it if it meets the following criteria: (1) it does not use third-party libraries, and (2) the maximum depth of the call graph is less than 4.
Details on the benchmarks are given in Table I. The total number of benchmarks extracted from each project is given in the column "#Benchs". LoC and number of functions for individual programs vary from 13 to 597 and from 1 to 25, respectively.

TABLE I: Benchmark details

Project       Lang.  #Benchs  Min/Max/Avg LoC  Min/Max/Avg #Func
libopenaptx   C      31       13 / 173 / 69    1 / 9 / 2.9
opl           C      81       19 / 460 / 67    1 / 15 / 2.8
go-gt         Go     43       9 / 213 / 51     1 / 16 / 3.5
go-edlib      Go     36       13 / 597 / 62    1 / 25 / 3.1
ach           Go     121      43 / 194 / 64    3 / 7 / 3.4
geo           Go     67       13 / 70 / 35     3 / 7 / 4.1
triangolatte  Go     29       9 / 164 / 38     1 / 10 / 2.5

4) LLM Hyperparameters: All LLMs use a temperature parameter for controlling the randomness/creativity of their output. To make our results more deterministic, we use a lower temperature (i.e. less random) of 0.2. Other hyperparameters, e.g. topP and topK, are set to the default values recommended by the LLM's provider.

5) FLOURINE Hyperparameters: We set the budget b in Algorithm 1 to 5. For the Hinted and BaseRepair strategies we provide 4 examples in the prompt (more examples appeared to reduce efficiency as the context window grew). For the CAPR strategy, we keep the conversation window size at 3, which means that only the latest 3 incorrect translations are made available to the LLM. A translation is deemed equivalent if 5 minutes of fuzzing does not return any counterexamples.

6) Compute Resources: We run our experiments on a machine with an AMD EPYC 7R13 Processor with 192 cores and 380 GB of RAM. Each translation task is run sequentially in a single thread (we do not parallelize individual translation tasks or fuzzing). As previously mentioned, all LLMs are accessed through APIs provided by third party services.

B. Results

We run a translation experiment for each of our five LLMs, four feedback strategies, and 408 benchmarks for a total of 8160 translation experiments. A translation is successful if it compiles and passes the fuzzer. A translation fails if: (1) it does not compile, (2) the fuzzer cannot de/serialize the types used in the translation, or (3) the fuzzer finds a counterexample in the translation and the budget is reached, if applicable. We answer our research questions based on these results.

RQ1: How do LLMs perform on translating code taken from real-world projects to Rust? Our LLMs achieve overall success rates of 47.7% (Claude 2), 43.9% (Claude 3), 21.0% (Mixtral), 36.9% (GPT-4-Turbo), and 33.8% (Gemini Pro). We present detailed results for each LLM in Figures 8, 9, 10, and 11. The success rate is the total number of successful translations divided by the total number of translation experiments in the category (experiments for different feedback strategies are averaged together). We answer our sub-questions below.

Fig. 8: Success rate for each LLM on each benchmark. Averaged across all feedback strategies.
Fig. 9: Success rate for each LLM on benchmarks grouped by lines of code.
Fig. 10: Success rate for each LLM on benchmarks grouped by number of functions.
Fig. 11: Rates of different types of linter warnings for each LLM.

(RQ1.1) How many benchmarks can each LLM translate from each of our projects? Figure 8 shows success rates by benchmark and LLM. The best LLMs achieve success rates of 20-60% depending on the benchmark, with one outlier of 80% by Claude 2 on ACH. The outlier is in large part due to ACH having ∼40 extremely similar benchmarks, which Claude 2 nearly always gets right. If we remove these similar benchmarks, the success rate for Claude 2 drops to 55%, which is in line with the other LLMs. A consistent trend is that Mixtral, while somewhat capable, has 5-20% lower success rates than the other much larger and more expensive LLMs. However, the cost of running Mixtral (both in dollars and compute) is at least 10x less than the other LLMs. Other trends are that Claude 2, Claude 3, and GPT-4-Turbo perform similarly on most benchmarks, and they outperform Gemini in most cases.

(RQ1.2) How does code complexity affect the success rate of translation? We use lines of code and number of functions as proxy metrics for complexity, and we show success rates for benchmarks grouped by level of complexity in Figures 9 and 10. The general trend is that increasing complexity, especially in lines of code, reduces the success rate. The spikes for 3 functions and 48-82 lines of code are again due to the ACH benchmarks mentioned in the previous research question. Removing these flattens the spike. In particular, success rates tend to drop off somewhere around 100+ lines of code. We discuss approaches for handling larger benchmarks in Section VI-C2.

(RQ1.3) How idiomatic is the Rust produced by LLMs? Figure 11 shows the rate of different categories of linter warnings produced by Clippy [38], Rust's standard linter. We limit our analysis to successful translations. Clippy reports five types of warnings, and we add unsafe. We describe them below, and give specific examples of the warnings most frequently reported by Clippy on the Rust translations.
Correctness: reports code that may have correctness bugs. The common examples we find are: checking if an unsigned integer is greater than 0, and using MaybeUninit::uninit().assume_init() (i.e. assuming that potentially uninitialized data is initialized).

Suspicious: the same as Correctness, but could be a false positive.

Style: code that is unidiomatic, but still correct. The common examples we find are: not following naming conventions, unnecessary borrows, using return statements, unnecessary closure expressions (e.g. xs.map(|x| foo(x)) instead of xs.map(foo)), using class types (e.g. String) when a simple primitive type will suffice (e.g. str), and not using idiomatic statements (e.g. using x <= z && z <= y instead of (x..y).contains(z)).

Complexity: code that could be simplified. Common examples are: unnecessary casting or type conversion, unnecessary parentheses, and unnecessarily putting a Box<..> around the type of a function parameter.

Performance: code that could be written to run faster. The most common example is unnecessarily putting a Box<..> around local variables or collection types (e.g. Vec).

Unsafe: code wrapped in an unsafe block.
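To make the Style and Complexity categories concrete, the following pair contrasts patterns Clippy flags with their idiomatic counterparts; the function names are invented for the example.

// Flagged by Clippy: needless `return` and a manual range check.
fn in_range_unidiomatic(x: i64, y: i64, z: i64) -> bool {
    return x <= z && z <= y;
}

// Idiomatic version suggested by the corresponding lints.
fn in_range(x: i64, y: i64, z: i64) -> bool {
    (x..=y).contains(&z)
}

// Flagged by Clippy: redundant closure.
fn doubled_unidiomatic(xs: Vec<i64>) -> Vec<i64> {
    xs.into_iter().map(|x| double(x)).collect()
}

// Idiomatic version: pass the function directly.
fn doubled(xs: Vec<i64>) -> Vec<i64> {
    xs.into_iter().map(double).collect()
}

fn double(x: i64) -> i64 {
    x * 2
}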
Overall, LLMs produce very few correctness warnings; however, they occasionally (1-15% of the time) produce code that could be more idiomatic (Style warnings), more concise (Complexity warnings), or more performant (Performance warnings). Gemini's high rate of Style warnings is due to its preference for using return statements when not necessary. We suspect that a large number of these warnings could be eliminated through prompting (e.g. instructing the LLM to prefer not using return statements, follow specific naming conventions, and not use Box<..> in certain cases). We also observe occasional use of unsafe; however, the use of unsafe code was not necessary in those cases.

RQ2: How effective are feedback strategies at fixing translation bugs? We answer this question by comparing the initial success rate – the rate at which the first translation passes the fuzzer, after fixing compilation errors – to the final success rate after applying feedback strategies to the unsuccessful translations.

Fig. 12: Absolute improvement in success rates after applying feedback strategies as compared to initial success rate.
(RQ2.1) How much do feedback strategies increase the translation success rate? Figure 12 shows the final success rate subtracted from the initial success rate. Disappointingly, the best feedback strategy only improves success rates by 6-8% absolute across all of our benchmarks.

(RQ2.2) Which feedback strategies increase success rates the most? Figure 12 also shows, surprisingly, that the most reliable strategy is Restart (simply repeating the same prompt). In fact, our results suggest that providing counterexamples in the prompt may actually confuse the LLM. We discuss this trend further in Section VI-C1.

RQ3: How do LLM translations compare to rule-based translation tools? We compare the idiomatic-ness of LLM generated Rust to C2Rust [3] on our opl benchmark (C2Rust failed to produce code for most of libopenaptx). Rates of linter warnings are presented in Figure 11. Overall, we can see that the majority of code produced by C2Rust is unsafe, and it is far less idiomatic, as indicated by the rate of style warnings. In addition, we observe that C2Rust produces far more verbose Rust than LLMs. On average, C2Rust translations have 1.98x more LoC than LLM translations.

RQ4: What is the main cause of translation failure? There are three reasons a translation can fail. (1) A compiling translation cannot be found. This accounts for only 7.0% of failures. (2) The fuzzer cannot de/serialize the data types. These account for 52.6% of failures. (3) A counterexample is found in the final translation. These account for 40.3% of failures. The implication of this result is that we likely under-report the true translation success rate, because at least some serialization failures might be successes.

C. Discussion & Future Work

1) Improving Feedback Strategies: Our result that counterexamples harm performance contradicts several recent works' results [8], [9], [32], [33]. We note that two of the works [8], [9] do not compare with a simple baseline like Restart, so they cannot conclude if counterexamples helped or hurt. However, the other two [32], [33] do report benefit from counterexamples relative to a baseline, and they evaluate on real-world code (though their task is automated program repair as opposed to code translation). The most likely explanation for their success and our failure is that their counterexamples use inputs from human-written test cases, so they might be more "intuitive" to an LLM. On the other hand, our counterexamples come from random inputs generated by a fuzzer. Our own manual analysis reveals that random fuzzer inputs are not intuitive to a human, and their textual representation can be very large and unintuitive to an LLM as well. A future direction is to study what types of counterexamples are useful for LLMs. Studying input selection and input reduction techniques would likely be immediately fruitful.

2) Handling Larger Benchmarks: We conjecture that the stochastic nature of LLMs' next token prediction poses fundamental limitations for translating large source programs in one go. Larger input source programs require more Rust code to be generated by the LLM, or in other words, more tokens to be predicted. Each time a token prediction is made, there is some probability that an erroneous prediction is made. Letting e be the probability of an erroneous prediction, the probability that the LLM correctly predicts n tokens is (1 − e)^n. This probability quickly goes to 0 as n increases. A future direction that (at least theoretically) solves this problem is to develop a solution that partitions the input source program into chunks that can be individually translated and validated. The most obvious way to partition an input program is by function, but one could imagine even going down to the basic block level.
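As a rough illustration of this effect, and assuming purely for the sake of the example a fixed per-token error probability of e = 0.001, a translation requiring n = 500 tokens is produced without any erroneous prediction with probability (1 − 0.001)^500 ≈ 0.61, whereas one requiring n = 5000 tokens succeeds with probability only about 0.007. This is consistent with the drop in success rates we observe for larger programs.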
[3] “C2rust transpiler.” https://c2rust.com/. [24] H. Palikareva, T. Kuchta, and C. Cadar, “Shadow of a doubt: testing
[4] Z. Tang, M. Agarwal, A. Shypula, B. Wang, D. Wijaya, J. Chen, for divergences between software versions,” in Proceedings of the 38th
and Y. Kim, “Explain-then-translate: an analysis on improving pro- International Conference on Software Engineering, pp. 1181–1192,
gram translation with self-generated explanations,” in Findings of the 2016.
Association for Computational Linguistics: EMNLP 2023 (H. Bouamor, [25] S. Person, G. Yang, N. Rungta, and S. Khurshid, “Directed incremental
J. Pino, and K. Bali, eds.), (Singapore), pp. 1741–1788, Association for symbolic execution,” Acm Sigplan Notices, vol. 46, no. 6, pp. 504–515,
Computational Linguistics, Dec. 2023. 2011.
[5] B. Rozière, M. Lachaux, L. Chanussot, and G. Lample, “Unsupervised [26] J. Guo, Y. Jiang, Y. Zhao, Q. Chen, and J. Sun, “Dlfuzz: Differential
translation of programming languages,” in NeurIPS, 2020. fuzzing testing of deep learning systems,” in Proceedings of the 2018
[6] B. Rozière, J. Zhang, F. Charton, M. Harman, G. Synnaeve, and 26th ACM Joint Meeting on European Software Engineering Conference
G. Lample, “Leveraging automated unit tests for unsupervised code and Symposium on the Foundations of Software Engineering, pp. 739–
translation,” in ICLR, OpenReview.net, 2022. 743, 2018.
[7] M. Szafraniec, B. Roziere, H. L. F. Charton, P. Labatut, and G. Synnaeve, [27] W. Jin, A. Orso, and T. Xie, “Automated behavioral regression testing,”
“Code translation with compiler representations,” ICLR, 2023. in 2010 Third international conference on software testing, verification
[8] R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, and validation, pp. 137–146, IEEE, 2010.
M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Lost [28] S. Nilizadeh, Y. Noller, and C. S. Pasareanu, “Diffuzz: differential
in translation: A study of bugs introduced by large language models fuzzing for side-channel analysis,” in 2019 IEEE/ACM 41st International
while translating code,” 2024. Conference on Software Engineering (ICSE), pp. 176–187, IEEE, 2019.
[9] P. Jana, P. Jha, H. Ju, G. Kishore, A. Mahajan, and V. Ganesh, “Attention, [29] T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana, “Nezha:
compilation, and solver-based symbolic analysis are all you need,” arXiv Efficient domain-independent differential testing,” in 2017 IEEE Sym-
preprint arXiv:2306.06755, 2023. posium on security and privacy (SP), pp. 615–632, IEEE, 2017.
[30] W. Li, J. Ruan, G. Yi, L. Cheng, X. Luo, and H. Cai, “PolyFuzz:
[10] R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov,
Holistic greybox fuzzing of Multi-Language systems,” in 32nd USENIX
J. Dolby, J. Chen, M. Choudhury, L. Decker, et al., “Codenet: A large-
Security Symposium (USENIX Security 23), (Anaheim, CA), pp. 1379–
scale ai for code dataset for learning a diversity of coding tasks,” arXiv
1396, USENIX Association, Aug. 2023.
preprint arXiv:2105.12655, 2021.
[31] J. J. Garzella, M. Baranowski, S. He, and Z. Rakamarić, “Leveraging
[11] W. U. Ahmad, M. G. R. Tushar, S. Chakraborty, and K.-W. Chang,
compiler intermediate representation for multi- and cross-language ver-
“Avatar: A parallel corpus for java-python program translation,” arXiv
ification,” in Verification, Model Checking, and Abstract Interpretation
preprint arXiv:2108.11590, 2021.
(D. Beyer and D. Zufferey, eds.), (Cham), pp. 90–111, Springer Inter-
[12] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by national Publishing, 2020.
chatgpt really correct? rigorous evaluation of large language models for [32] C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era
code generation,” Advances in Neural Information Processing Systems, of large pre-trained language models,” in ICSE, IEEE, 2023.
vol. 36, 2024. [33] J. Kong, M. Cheng, X. Xie, S. Liu, X. Du, and Q. Guo, “Contrastrepair:
[13] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, Enhancing conversation-based automated program repair via contrastive
H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., “Evaluating large test case pairs,” arXiv preprint arXiv:2403.01971, 2024.
language models trained on code,” arXiv preprint arXiv:2107.03374, [34] H. W. Kuhn, “The hungarian method for the assignment problem,” in
2021. 50 Years of Integer Programming 1958-2008 - From the Early Years
[14] P. Deligiannis, A. Lal, N. Mehrotra, and A. Rastogi, “Fixing rust to the State-of-the-Art (M. Jünger, T. M. Liebling, D. Naddef, G. L.
compilation errors using llms,” arXiv preprint arXiv:2308.05177, 2023. Nemhauser, W. R. Pulleyblank, G. Reinelt, G. Rinaldi, and L. A. Wolsey,
[15] J. Zhang, P. Nie, J. J. Li, and M. Gligoric, “Multilingual code co- eds.), pp. 29–47, Springer, 2010.
evolution using large language models,” in Proceedings of the 31st ACM [35] E. T. Bray, “The javascript object notation (json) data interchange
Joint European Software Engineering Conference and Symposium on the format,” RFC 8259, RFC Editor, 12 2017.
Foundations of Software Engineering, pp. 695–707, 2023. [36] K. Serebryany, “Continuous fuzzing with libfuzzer and addresssanitizer,”
[16] Q. Zhang, J. Wang, G. H. Xu, and M. Kim, “Heterogen: transpiling c in 2016 IEEE Cybersecurity Development (SecDev), pp. 157–157, 2016.
to heterogeneous hls code with automated test generation and program [37] C. S. Xia and L. Zhang, “Conversational automated program repair,”
repair,” in Proceedings of the 27th ACM International Conference on arXiv preprint arXiv:2301.13246, 2023.
Architectural Support for Programming Languages and Operating Sys- [38] “Clippy: A bunch of lints to catch common mistakes and improve your
tems, ASPLOS ’22, (New York, NY, USA), p. 1017–1029, Association rust code.” https://rust-lang.github.io/rust-clippy/.
for Computing Machinery, 2022. [39] O. Tange, “Gnu parallel 20240122 (’frederik x’),” Jan. 2023. GNU
[17] B. Mariano, Y. Chen, Y. Feng, G. Durrett, and I. Dillig, “Automated Parallel is a general parallelizer to run multiple serial command line
transpilation of imperative to functional code using neural-guided pro- programs in parallel without changing them.
gram synthesis,” Proceedings of the ACM on Programming Languages, [40] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman,
vol. 6, no. OOPSLA1, pp. 1–27, 2022. D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4
[18] H. F. Eniser, V. Wüstholz, and M. Christakis, “Automatically test- technical report,” arXiv preprint arXiv:2303.08774, 2023.
ing functional properties of code translation models,” arXiv preprint [41] “Claude.” https://www.anthropic.com/index/introducing-claude.
arXiv:2309.12813, 2023. [42] “Gemini.” https://blog.google/technology/ai/google-gemini-ai/.
[19] M. Jiao, T. Yu, X. Li, G. Qiu, X. Gu, and B. Shen, “On the evaluation [43] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bam-
of neural code translation: Taxonomy and benchmark,” in 2023 38th ford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al.,
IEEE/ACM International Conference on Automated Software Engineer- “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
ing (ASE), pp. 1529–1541, IEEE, 2023. [44] “Moov ach.” https://github.com/moov-io/ach.
[20] H. Zhang, C. David, Y. Yu, and M. Wang, “Ownership guided C to Rust [45] “S2 geometry library in go.” https://github.com/golang/geo.
translation,” in Computer Aided Verification (CAV), vol. 13966 of LNCS, [46] “Open source implementation of audio processing technology codec
pp. 459–482, Springer, 2023. (aptx).” https://github.com/pali/libopenaptx.
[21] M. Emre, R. Schroeder, K. Dewey, and B. Hardekopf, “Translating C [47] “Engine for making things with a ms-dos feel, but for modern plat-
to safer Rust,” Proceedings of the ACM on Programming Languages, forms.” https://github.com/mattiasgustavsson/dos-like/blob/main/source/
vol. 5, no. OOPSLA, pp. 1–29, 2021. libs/opl.h.
[22] Y. Noller, C. S. Păsăreanu, M. Böhme, Y. Sun, H. L. Nguyen, and [48] “go-gt.” https://github.com/ThePaw/go-gt.
L. Grunske, “Hydiff: Hybrid differential software analysis,” in Pro- [49] “String comparison and edit distance algorithms library.” https://github.
ceedings of the ACM/IEEE 42nd International Conference on Software com/hbollon/go-edlib.
Engineering, pp. 1273–1285, 2020. [50] “2d triangulation library.” https://github.com/tchayen/triangolatte.
[23] M. Böhme, B. C. d. S. Oliveira, and A. Roychoudhury, “Regression [51] S. Ouyang, J. M. Zhang, M. Harman, and M. Wang, “Llm is like a
tests to expose change interaction errors,” in Proceedings of the 2013 box of chocolates: the non-determinism of chatgpt in code generation,”
9th Joint Meeting on Foundations of Software Engineering, pp. 334–344, 2023.
2013.
