Translation Using LLM (Rust)
Abstract—Large language models (LLMs) show promise in code translation – the task of translating code written in one programming language to another – due to their ability to write code in most programming languages. However, the effectiveness of LLMs on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and the translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks, and our study also provides insights into next steps for improvements.

I. INTRODUCTION

The task of program translation between programming languages is becoming particularly relevant, given the recent interest in safe programming languages such as Rust and the expectation of translating potentially buggy, legacy code into such modern languages. While "rule-based" translation tools have been developed [1]–[3] that target a fixed source and target language (e.g. C to Rust), recent work [4]–[7] provides hope that large language models (LLMs) can accomplish this task for any source and target language.

Prior work on using LLMs for code translation [4]–[9] has almost exclusively focused on translating code taken from competitive programming websites [10], educational websites [11], or hand-crafted coding problems [12], [13]. While useful, such benchmarks are not representative of real-world code. For example, these benchmarks are typically a single function using only primitive data types, whereas real-world code has many functions and user-defined data types (e.g. structs).

In this work, we take a step towards answering the question: Can LLMs translate real-world code? Towards this end, we develop FLOURINE, an end-to-end code translation tool capable of producing validated Rust translations. FLOURINE first uses an LLM to obtain a candidate translation, then applies a compilation-driven repair, where we make use of the Rust compiler's error messages as described in [14]. Once the translation compiles, FLOURINE uses cross-language differential fuzzing to test the I/O equivalence between the source program and the Rust translation. Notably, our cross-language differential fuzzer removes the need for unit tests – prior work assumed test cases already exist in the target language, or they were hand-written as part of the study, making a substantial investigation difficult. If a counterexample is discovered, FLOURINE executes a feedback strategy, which provides feedback to the LLM to fix the counterexample.

For the dataset, we extract benchmarks from seven open source projects written in C and Go. We do not use the entire projects because LLMs cannot fit them in their context window. We choose these languages because Rust, C, and Go are typically used for low-level programming tasks, such as systems development, so C and Go are likely candidates for translation to Rust. The open source projects are from a diverse set of domains: audio processing, text processing, geometry, banking, 2D triangulation, graph algorithms, and sound card emulation. To automate and reduce bias in the selection of code samples, we develop a methodology and tool for extracting them from projects. We use this tool to extract code samples that contain between 1 and 25 functions and use only standard libraries, and which also use features such as global variables, user-defined dynamically-allocated data structures, array pointers, type casts, enumeration types, etc.

For example, Figure 1 contains a program extracted from the ACH library featuring a global variable moov_io_ach_stringZeros, which is initialised with the function call moov_io_ach_populateMap(94, "0").
[Figure 1: Go code extracted from the ACH library]

var (
    moov_io_ach_stringZeros map[int]string = moov_io_ach_populateMap(94, "0")
)

func moov_io_ach_populateMap(max int, zero string) map[int]string {
    out := make(map[int]string, max)
    for i := 0; i < max; i++ {
        out[i] = strings.Repeat(zero, i)
    }
    return out
}

[Go source of the Env.add function referenced in the discussion below]

func (e *Env) add(i, p int64) {
    var j int64
    e.S[i] = true
    e.Prev[i] = p
    for j = 0; j < e.N; j++ {
        if e.Lx[i]+e.Ly[i]-e.G.Get(i, j) < e.Slack[i] {
            e.Slack[i] = e.Lx[i] + e.Ly[i] - e.G.Get(i, j)
            e.Slackx[i] = j
        }
    }
}
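For illustration, the sketch below shows one plausible shape of a safe Rust counterpart to the Figure 1 code. It is not an output produced by the LLMs in this study; the snake_case names, the map key type, and the use of OnceLock for the Go package-level variable are assumptions made for this sketch.

use std::collections::HashMap;
use std::sync::OnceLock;

// One possible translation of moov_io_ach_populateMap. The names mirror the
// Go original; the key/value types follow Go's map[int]string.
fn moov_io_ach_populate_map(max: usize, zero: &str) -> HashMap<usize, String> {
    let mut out = HashMap::with_capacity(max);
    for i in 0..max {
        out.insert(i, zero.repeat(i));
    }
    out
}

// The Go package-level variable becomes a lazily initialised global,
// computed on first access, mirroring Go's package initialisation.
fn moov_io_ach_string_zeros() -> &'static HashMap<usize, String> {
    static ZEROS: OnceLock<HashMap<usize, String>> = OnceLock::new();
    ZEROS.get_or_init(|| moov_io_ach_populate_map(94, "0"))
}

A candidate translation produced by an LLM would additionally have to satisfy the formatting and fuzzer constraints described in Section IV.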
... but i64 in Rust; e is a pointer to an Env value in Go, but a mutable reference in Rust.

To solve this challenge, we develop a technique based on serializing then de-serializing to exchange data between languages. We use the JSON [35] format, because most languages support it. Most data types, including complex data types and pointers, can be automatically serialized into JSON, which allows us to easily support real-world code. For our example, Fig. 5 shows a serialized valid input state. Once the two programs are executed, the Go output state is again serialized to JSON, deserialized to Rust, and compared against the Rust output state. For our example, the expected output state, as obtained by executing the Go code, is the same as the input state in Fig. 5, with the only difference that the last element of field s is set to true instead of false. The translation in Figure 4 computes the expected output state, and it is thus deemed I/O equivalent to the original Go code, and returned by FLOURINE.

Conversely, if a counterexample is discovered by the fuzzer, then we invoke a feedback method, which uses the counterexample to create a new query to the LLM and generates a new candidate translation. Designing a suitable feedback method is another challenging aspect of the translation task. There are many ways to re-query the LLM for a new translation, each with its own likelihood of success. Moreover, most state-of-the-art LLMs are operated as API services which charge per input token, so different query strategies will have different dollar costs. To address this, we propose and evaluate a set of feedback strategies.

IV. LLM-BASED CODE TRANSLATION

A. Obtaining Translations

As mentioned in the previous sections, we are considering the problem of translating a program written in C or Go to Rust. We use zero-shot prompting and follow the best practices given by the LLM's provider. We construct the initial query q (to be input to the LLM) as sketched in Figure 6.

We start with a preamble describing the overall task. Then, we supply the program to be translated, and, finally, we provide specific constraints to be followed by the LLM. In particular, we have three types of constraints: formatting guidelines, code characteristics, and fuzzer constraints. Formatting guidelines describe how the generated code should look, simplifying parsing and extraction of relevant information from the response. For code characteristics, we instruct the LLM to produce safe Rust code, and to maintain the same function names, parameter names, and return types from the input code. Finally, the fuzzer constraints ensure that the generated code can be handled by our fuzzer (more details on this in Section IV-B).

The translation generated by the LLM may not initially compile. We address this with the approach in [14]. At a high level, we iteratively query the LLM to fix the error until the code becomes compilable. Each time, we provide both the faulty translation and the error message from the Rust compiler to the LLM, and ask it to use a specific format for the suggested fixes, applying them only to the affected lines of code.

Fig. 6: LLM Prompt for obtaining translations.

B. Checking Translations

To test the I/O equivalence between the original source program p and a candidate Rust translation p′, we develop a cross-language differential fuzzer. For a given p and p′, we automatically generate a fuzzing harness in Rust, which uses Bolero and libfuzzer [36] to perform fuzzing. The test harness generates program states from Sp′, on which p′ is directly invoked. We implement the mapping function M′ : Sp′ → Sp using JSON de/serialization. We serialize the Rust program state s′ into JSON format, and then instrument the source program p to deserialize the JSON into a program state of Sp. The instrumented p is invoked on the serialized s′ from Rust using a foreign function interface. To compare outputs, we map the output state of p to a state of p′ using JSON de/serialization as well, so that the two output states can be directly compared.
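To make the generated harness concrete, the following sketch shows the core of such an equivalence check. It is a simplified illustration, not FLOURINE's generated code: the State type, its fields, and the run_go_original wrapper around the foreign-function call into the instrumented Go program are hypothetical stand-ins.

use serde::{Deserialize, Serialize};

// Program state exchanged between the two sides. Field names mirror the Go
// struct so that JSON (de)serialization lines them up automatically.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
struct State {
    s: Vec<bool>,
    prev: Vec<i64>,
    slack: Vec<i64>,
}

// Hypothetical wrapper around the foreign-function call into the
// instrumented Go program: it receives the input state as JSON and
// returns the Go output state as JSON.
fn run_go_original(input_json: &str) -> String {
    unimplemented!("FFI call into the instrumented Go binary")
}

// Hypothetical Rust candidate translation under test.
fn candidate_translation(state: &mut State, i: usize, p: i64) {
    let _ = (state, i, p); // placeholder body
}

#[test]
fn differential_fuzz() {
    bolero::check!()
        .with_type::<(Vec<bool>, Vec<i64>, Vec<i64>, u8, i64)>()
        .for_each(|(s, prev, slack, i, p)| {
            let mut rust_state = State {
                s: s.clone(),
                prev: prev.clone(),
                slack: slack.clone(),
            };
            // Serialize the generated input state before the Rust side mutates it.
            let input_json = serde_json::to_string(&rust_state).unwrap();

            // Run the Rust candidate directly on the generated state.
            candidate_translation(&mut rust_state, *i as usize, *p);

            // Run the Go original on the same serialized input and map its
            // output state back into the Rust representation.
            let go_state: State =
                serde_json::from_str(&run_go_original(&input_json)).unwrap();

            // I/O equivalence: the two output states must agree.
            assert_eq!(rust_state, go_state);
        });
}

Bolero drives the input generation; the assertion at the end is what reports a counterexample when the two output states diverge.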
We use JSON serializers for two reasons. First, the mapping between fields of user-defined data types in the source and target language is automatically determined based on the field names. This requires the LLM to produce data types with field names that match the source program, but in our benchmarks LLMs always do this. Second, most languages support automatic serialization of primitive, pointer, and user-defined types.

We note that an alternative approach, taken by [31], is to compile both p and p′ down to a common IR, such as LLVM, and then perform fuzzing on the IR. However, we find that IR compilers for different languages typically discard type and layout information (e.g. user-defined data types are represented as a void pointer). This makes it nearly impossible for a fuzzer to generate any meaningful inputs.

Soundness & Limitations. Our fuzzer can only make heuristic-based guarantees (e.g. coverage) on the equivalence of p and p′. This is a limitation of fuzzing and testing in general. However, our fuzzer achieves an average line coverage of 97%.

In addition, JSON serialization is not automatically supported for all types. For example, features in Rust like trait definitions, impl traits, and lifetimes in data type definitions are only partially supported. This means that the equivalence check may fail because serialization fails. We report these errors in Section VI-B. In addition, we do not support features like concurrency, network, and file I/O. Our benchmark extraction therefore excludes code that uses these features.
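As an illustration of where automatic serialization stops working, the hypothetical Rust types below contrast a shape that derives JSON support mechanically with one that cannot; a translation using the latter shape would fail the equivalence check for serialization reasons rather than behavioural ones.

use serde::{Deserialize, Serialize};

// Plain data: serde can derive JSON support, so the fuzzer can exchange
// values of this type with the Go side.
#[derive(Serialize, Deserialize)]
struct Supported {
    name: String,
    values: Vec<i64>,
}

// A shape like the one below has no automatic JSON support: there is no
// Serialize/Deserialize implementation for a boxed trait object, so the
// derive cannot be emitted and the equivalence check cannot be run.
//
// struct Unsupported {
//     transform: Box<dyn Fn(i64) -> i64>,
// }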
Human:
# Preamble
You are given a C/Go program and its faulty Rust
translation. We need to repair the faulty Rust
program.
# Code to be translated
{C/Go Program}
# Code to be repaired
{Faulty Rust Program}
# Instruction
Make changes to the given code to obtain expected
outputs for the given test inputs.
# Constraints
Here are some constraints that you should respect:
...
# Counterexamples
CE1
CE2

Assistant:
{LLM generated rust translation}

Human:
That is incorrect on the following inputs:
# Counterexamples
CE1
CE2

Fig. 7: LLM Prompt for BaseRepair and CAPR. BaseRepair is shown in black. CAPR is shown in black and magenta.

V. FEEDBACK STRATEGIES

In this section, we present four feedback methods that can be used if the fuzzer finds counterexamples E− to the correctness of the translation p′ produced by the LLM in Alg. 1.
a) Simple Restart (Restart): We discard the generated code p′ and re-query the model with the same prompt q.

b) Hinted Restart (Hinted): This builds on the previous strategy by adding positive and negative examples from the fuzzer, E+ and E−, to the original prompt q. The intention is to suggest desirable behaviours to the LLM, as well as known faulty cases to avoid. We separately group the examples in E+ and E− based on the paths they exercise in p′. Intuitively, this corresponds to splitting them into equivalence classes, where each equivalence class corresponds to a particular program path. Then, the query constructed by Hinted only contains positive and negative examples from a single equivalence class, respectively.

c) Counterexample-Guided Repair (BaseRepair): Discarding the generated code p′ when the fuzzer check fails may not always be the optimal choice. For instance, if p′ is close to passing the fuzzer, trying to repair it might work better. As part of BaseRepair, we give counterexamples from the fuzzer to the LLM. Similarly to Hinted, a query only contains negative examples from the same equivalence class, which correspond to bugs associated with the same program path. The expectation is that the candidate translation generated in the next iteration of Alg. 1 will produce the correct outputs for the given examples. A sketch of the prompt used for BaseRepair is given in Figure 7 (excluding the lines colored in magenta). In Alg. 1, if the translation generated by G for the query q constructed by BaseRepair still fails the fuzzer check, then this last faulty translation will be considered by the next call to BaseRepair.

d) Conversational Repair (CAPR): Recent work in code translation [8] and automated program repair [37] has proposed conversational repair approaches, wherein previous incorrect code is included in the prompt to the LLM to discourage the LLM from producing the same code again. The CAPR approach begins with the same prompt as BaseRepair; however, the two differ if the new translation still fails the fuzzer check. In BaseRepair, we create a new prompt from scratch, but in CAPR, we keep the prompt and append a new piece of dialogue to it, as shown in magenta in Figure 7. This process can be repeated multiple times, meaning the prompt is a dialogue of failed translations.

The methods Restart and Hinted cost less than BaseRepair and CAPR as they do not include the incorrect translation in the prompt. Therefore, the former use about half the input tokens of the latter.
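To make the grouping used by Hinted and BaseRepair concrete, the sketch below groups fuzzer-generated examples into equivalence classes keyed by the program path they exercise in p′. The PathId and Example types are illustrative stand-ins, not FLOURINE's actual data structures.

use std::collections::HashMap;

// Illustrative stand-ins: a path is identified by the sequence of branch
// ids covered in p′, and an example records an input together with the
// expected (source) and actual (translation) outputs.
type PathId = Vec<u32>;

struct Example {
    input: String,
    expected_output: String,
    actual_output: String,
}

// Group examples by the path they exercise; Hinted and BaseRepair then
// build a query from the examples of a single equivalence class.
fn group_by_path(examples: Vec<(PathId, Example)>) -> HashMap<PathId, Vec<Example>> {
    let mut classes: HashMap<PathId, Vec<Example>> = HashMap::new();
    for (path, example) in examples {
        classes.entry(path).or_default().push(example);
    }
    classes
}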
VI. EVALUATION

In this section, we present our results for the following research questions.

RQ1: How do LLMs perform on translating code taken from real-world projects to Rust? We gather a large number of benchmarks by extracting code samples from real-world projects, and we use LLMs to generate translations, which are then checked for correctness by the fuzzer and fixed if needed by applying feedback strategies. We answer the following concrete questions.

(RQ1.1) How many benchmarks can each LLM translate from each of our projects? We report the percentage of benchmarks from each project that are successfully translated for each LLM. We show that success rates vary widely based on the benchmark and LLM. LLMs achieve up to 80% success rate on benchmarks from our "easiest" project, and between 15-40% on our "hardest" project.

(RQ1.2) How does code complexity affect the success rate of translation? We look at how lines of code and the number of functions in a benchmark influence the success rate. We show that lines of code strongly influence success rates.

(RQ1.3) How idiomatic is the Rust produced by LLMs? We run Clippy [38], Rust's standard linter, on the successful translations, and analyze the rates of different categories of linter warnings. We show that LLMs occasionally (1-15% of the time) produce code with linter warnings, suggesting that the translations could be made more performant or more concise, or that they use unsafe code.

RQ2: How effective are feedback strategies at fixing translation bugs? In addition to overall translation success rates, we record the initial success rates – the rate at which the first translation passes the fuzzer – and compare this to the overall success rate. We answer two concrete questions.

(RQ2.1) How much do feedback strategies increase the translation success rate? We compare overall success rates directly to initial success rates. We show that the most effective feedback strategy increases the success rate by an absolute 6-8% on average for the best LLMs.

(RQ2.2) Which feedback strategies increase success rates the most? We compare the increase in success rates for each feedback strategy. We show that, surprisingly, Restart and Hinted consistently outperform BaseRepair and CAPR. We provide a plausible explanation for this result.

RQ3: How do LLM translations compare to rule-based translation tools? We compare LLM translations to translations produced by the rule-based translation tool C2Rust [3]. While C2Rust can theoretically guarantee the correctness of the translation, we show that LLMs produce far more concise and idiomatic translations.

RQ4: Why do translations fail? Translation can fail for several reasons beyond the fuzzer finding counterexamples. We report failure rates for different failure reasons.

A. Experimental Setup

1) Implementation: We implement an end-to-end translation tool, FLOURINE, which takes as input (1) a program, (2) a feedback strategy to apply, and (3) a budget. FLOURINE outputs either a corresponding Rust translation that passes the fuzzer, or it fails with an error. Algorithm 1 is used for the implementation of FLOURINE. FLOURINE is written entirely in Python, except for the fuzzer, which is written in Rust. FLOURINE currently supports C and Go for the input program. FLOURINE is implemented as a framework, which can be extended with new LLMs, feedback strategies, and language support for the input program. We use GNU Parallel [39] to run experiments in parallel.

2) LLMs: We limit our study to LLMs hosted by third party providers. This is in part because they are the highest performing on coding tasks, and they are the most accessible in that they do not require the user to own powerful compute resources. We use five LLMs in our evaluation: GPT-4-Turbo [40], Claude 2.1 [41], Claude 3 Sonnet [41], Gemini Pro [42], and Mixtral [43]. The first four are likely very large (1T+ parameters). On the other hand, Mixtral is relatively small (45B parameters), but is known for performing well on coding tasks, and costs less than the others. We access GPT-4-Turbo and Gemini Pro through OpenAI's and Google's APIs. We access Claude and Mixtral through AWS Bedrock. Due to lack of access to GPU machines, we do not attempt to run open source LLMs like CodeLLaMA.

3) Benchmarks: We collect benchmarks from real-world projects hosted on GitHub. We focus on C and Go as the source program languages for multiple reasons. First, C, Go, and Rust are typically used for lower-level programming tasks, unlike other popular languages like Java or Python. Thus they are likely candidates for translating to Rust. Second, and more pragmatically, projects written in C and Go make less use of third party libraries, which we do not attempt to support for this work. Conversely, most Java and Python projects make heavy use of third party libraries.

We choose seven projects with the aim of getting a diverse set of application domains. Our projects are:
• ACH [44]: a Go library implementing a reader, writer, and validator for banking operations
• geo [45]: a math-focused Go library implementing common geometry functions and interval arithmetic
• libopenaptx [46]: a C library for audio processing
• opl [47]: a C library for sound card emulation
• go-gt [48]: a Go library for graph algorithms
• go-edlib [49]: a Go library for string comparison and edit distance algorithms
• triangolatte [50]: a 2D triangulation library in Golang

As we will show in our experiments, LLMs are still not capable of translating entire projects. To create benchmarks of manageable size, we develop a tool for automatically extracting benchmarks from these projects. Our tool takes as input the entire project and a specific function identifier f in the project. The tool then analyzes the project to find all of f's dependencies, including all functions called by f (including transitive calls), type definitions, standard libraries, global variables, etc., and extracts them into a single, compilable file. The translation task is then to write a compilable Rust file with a function equivalent to f's behavior. Our methodology for selecting benchmarks is to iterate over all functions in a project, create a benchmark for each, and keep it if it meets the following criteria: (1) it does not use 3rd party libraries, and (2) the maximum depth of the call graph is less than 4.
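As an illustration of the second criterion, the sketch below computes the maximum call-graph depth rooted at the target function. It is a simplified stand-in that assumes an acyclic call graph and a pre-computed caller-to-callee map; the actual extraction tool analyzes C and Go sources.

use std::collections::HashMap;

// Depth of the longest call chain starting at f, counting f itself.
// Assumes the call graph is acyclic (recursion is not handled here).
fn call_graph_depth(f: &str, calls: &HashMap<String, Vec<String>>) -> usize {
    match calls.get(f) {
        None => 1,
        Some(callees) if callees.is_empty() => 1,
        Some(callees) => {
            1 + callees
                .iter()
                .map(|callee| call_graph_depth(callee, calls))
                .max()
                .unwrap_or(0)
        }
    }
}

// Selection criterion (2): keep the benchmark only if the call graph
// rooted at f is shallower than 4.
fn keep_benchmark(f: &str, calls: &HashMap<String, Vec<String>>) -> bool {
    call_graph_depth(f, calls) < 4
}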
TABLE I: Benchmark details

Project       Lang.  #Benchs  Min/Max/Avg LoC  Min/Max/Avg #Func
libopenaptx   C      31       13 / 173 / 69    1 / 9 / 2.9
opl           C      81       19 / 460 / 67    1 / 15 / 2.8
go-gt         Go     43       9 / 213 / 51     1 / 16 / 3.5
go-edlib      Go     36       13 / 597 / 62    1 / 25 / 3.1
ach           Go     121      43 / 194 / 64    3 / 7 / 3.4
geo           Go     67       13 / 70 / 35     3 / 7 / 4.1
triangolatte  Go     29       9 / 164 / 38     1 / 10 / 2.5

Details on the benchmarks are given in Table I. The total number of benchmarks extracted from each project is given in the column "#Benchs". LoC and the number of functions for individual programs vary from 13 to 597 and from 1 to 25, respectively.

4) LLM Hyperparameters: All LLMs use a temperature parameter for controlling the randomness/creativity of their output. To make our results more deterministic, we use a lower temperature (i.e. less random) of 0.2. Other hyperparameters, e.g. topP and topK, are set to the default values recommended by the LLM's provider.

5) FLOURINE Hyperparameters: We set the budget b in Algorithm 1 to 5. For the Hinted and BaseRepair strategies, we provide 4 examples in the prompt (more examples appeared to reduce efficiency as the context window grew). For the CAPR strategy, we keep the conversation window size at 3, which means that only the latest 3 incorrect translations are made available to the LLM. A translation is deemed equivalent if 5 minutes of fuzzing does not return any counterexamples.

6) Compute Resources: We run our experiments on a machine with an AMD EPYC 7R13 processor with 192 cores and 380 GB of RAM. Each translation task is run sequentially in a single thread (we do not parallelize individual translation tasks or fuzzing). As previously mentioned, all LLMs are accessed through APIs provided by third party services.

B. Results

We run a translation experiment for each of our five LLMs, four feedback strategies, and 408 benchmarks, for a total of 8160 translation experiments. A translation is successful if it compiles and passes the fuzzer. A translation fails if: (1) it does not compile, (2) the fuzzer cannot de/serialize the types used in the translation, or (3) the fuzzer finds a counterexample in the translation and the budget is reached, if applicable. We answer our research questions based on these results.

RQ1: How do LLMs perform on translating code taken from real-world projects to Rust? Our LLMs achieve overall success rates of 47.7% (Claude 2), 43.9% (Claude 3), 21.0% (Mixtral), 36.9% (GPT-4-Turbo), and 33.8% (Gemini Pro). We present detailed results for each LLM in Figures 8, 9, 10, and 11. The success rate is the total number of successful translations divided by the total number of translation experiments in the category (experiments for different feedback strategies are averaged together). We answer our sub-questions below.

(RQ1.1) How many benchmarks can each LLM translate from each of our projects? Figure 8 shows success rates by benchmark and LLM. The best LLMs achieve success rates of 20-60% depending on the benchmark, with one outlier of 80% by Claude 2 on ACH. The outlier is in large part due to ACH having ∼40 extremely similar benchmarks, which Claude 2 nearly always gets right. If we remove these similar benchmarks, the success rate for Claude 2 drops to 55%, which is in line with the other LLMs. A consistent trend is that Mixtral, while somewhat capable, has 5-20% lower success rates than the other, much larger and more expensive LLMs. However, the cost of running Mixtral (both in dollars and compute) is at least 10x less than the other LLMs. Other trends are that Claude 2, Claude 3, and GPT-4-Turbo perform similarly on most benchmarks, and they outperform Gemini in most cases.

(RQ1.2) How does code complexity affect the success rate of translation? We use lines of code and the number of functions as proxy metrics for complexity, and we show success rates for benchmarks grouped by level of complexity in Figures 9 and 10. The general trend is that increasing complexity, especially in lines of code, reduces the success rate. The spikes for 3 functions and 48-82 lines of code are again due to the ACH benchmarks mentioned in the previous research question. Removing these flattens the spike. In particular, success rates tend to drop off somewhere around 100+ lines of code. We discuss approaches for handling larger benchmarks in Section VI-C2.

(RQ1.3) How idiomatic is the Rust produced by LLMs? Figure 11 shows the rate of different categories of linter warnings produced by Clippy [38], Rust's standard linter. We limit our analysis to successful translations. Clippy reports five types of warnings, and we add unsafe. We describe them below, and give specific examples of the warnings most frequently reported by Clippy on the Rust translations.

Correctness: reports code that may have correctness bugs. The common examples we find are: checking if an unsigned integer is greater than 0, and using MaybeUninit::uninit().assume_init() (i.e. assuming that potentially uninitialized data is initialized).

Suspicious: the same as Correctness, but could be a false positive.

Style: code that is unidiomatic, but still correct. The common examples we find are: not following naming conventions, unnecessary borrows, using return statements, unnecessary closure expressions (e.g. xs.map(|x| foo(x)) instead of xs.map(foo)), using class types (e.g. String) when a simple primitive type will suffice (e.g. str), and not using idiomatic statements (e.g. using x <= z && z <= y instead of (x..=y).contains(&z)).

Complexity: code that could be simplified. Common examples are: unnecessary casting or type conversion, unnecessary
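For illustration (these are constructed examples, not excerpts from the study's translations), the snippet below shows the kinds of patterns behind the Style and Correctness warnings described above, alongside the idiomatic forms Clippy suggests.

// Style warnings: an unnecessary closure and a manual range check,
// followed by the idiomatic equivalents.
fn style_examples(xs: Vec<i32>, x: i32, y: i32, z: i32) -> (Vec<i32>, bool) {
    // Flagged: closure that only forwards to a function; manual range check.
    let unidiomatic: Vec<i32> = xs.iter().map(|v| i32::abs(*v)).collect();
    let manual_check = x <= z && z <= y;

    // Idiomatic equivalents.
    let idiomatic: Vec<i32> = unidiomatic.into_iter().map(i32::abs).collect();
    let range_check = (x..=y).contains(&z);

    (idiomatic, manual_check && range_check)
}

// Correctness warning: the commented-out pattern assumes potentially
// uninitialized memory is initialized, which is undefined behaviour and is
// flagged by Clippy.
//
// let value: i32 = unsafe { std::mem::MaybeUninit::uninit().assume_init() };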
Fig. 8: Success rate for each LLM on each benchmark. Averaged across all feedback strategies.
Fig. 9: Success rate for each LLM on benchmarks grouped by lines of code.
Fig. 10: Success rate for each LLM on benchmarks grouped by number of functions.
Fig. 11: Rates of different types of linter warnings for each LLM.