Translation Using LLM (Rust)
Abstract—Large language models (LLMs) show promise in code translation – the task of translating code written in one programming language to another – due to their ability to write code in most programming languages. However, the effectiveness of LLMs on translating real-world code remains largely unstudied. In this work, we perform the first substantial study on LLM-based translation to Rust by assessing the ability of five state-of-the-art LLMs: GPT4, Claude 3, Claude 2.1, Gemini Pro, and Mixtral. We conduct our study on code extracted from real-world open source projects. To enable our study, we develop FLOURINE, an end-to-end code translation tool that uses differential fuzzing to check whether a Rust translation is I/O equivalent to the original source program, eliminating the need for pre-existing test cases. As part of our investigation, we assess both the LLMs' ability to produce an initially successful translation and their capacity to fix a previously generated buggy one. If the original and the translated programs are not I/O equivalent, we apply a set of automated feedback strategies, including feedback to the LLM with counterexamples. Our results show that the most successful LLM can translate 47% of our benchmarks, and our study also provides insights into next steps for improvements.

I. INTRODUCTION

The task of program translation between programming languages is becoming particularly relevant, given the recent interest in safe programming languages such as Rust and the expectation of translating potentially buggy, legacy code into such modern languages. While "rule-based" translation tools have been developed [1]–[3] that target a fixed source and target language (e.g. C to Rust), recent work [4]–[7] provides hope that large language models (LLMs) can accomplish this task for any source and target language.

Prior work on using LLMs for code translation [4]–[9] has almost exclusively focused on translating code taken from competitive programming websites [10], educational websites [11], or hand-crafted coding problems [12], [13]. While useful, such benchmarks are not representative of real-world code. For example, these benchmarks are typically a single function using only primitive data types, whereas real-world code has many functions and user-defined data types (e.g. structs).

In this work, we take a step towards answering the question: Can LLMs translate real-world code? Towards this end, we develop FLOURINE, an end-to-end code translation tool capable of producing validated Rust translations. FLOURINE first uses an LLM to obtain a candidate translation, then applies a compilation-driven repair, where we make use of the Rust compiler's error messages as described in [14]. Once the translation compiles, FLOURINE uses cross-language differential fuzzing to test the I/O equivalence between the source program and the Rust translation. Notably, our cross-language differential fuzzer removes the need for unit tests – prior work assumed test cases already exist in the target language, or they were hand-written as part of the study, making a substantial investigation difficult. If a counterexample is discovered, FLOURINE executes a feedback strategy, which provides feedback to the LLM to fix the counterexample.

For the dataset, we extract benchmarks from seven open source projects written in C and Go. We do not use the entire projects because LLMs cannot fit them in their context window. We choose these languages because Rust, C, and Go are typically used for low-level programming tasks, such as systems development, so C and Go are likely candidates for translation to Rust. The open source projects are from a diverse set of domains: audio processing, text processing, geometry, banking, 2D triangulation, graph algorithms, and sound card emulation. To automate and reduce bias in the selection of code samples, we develop a methodology and tool for extracting them from projects. We use this tool to extract code samples that contain between 1 and 25 functions and use only standard libraries, and which also use features such as global variables, user-defined dynamically-allocated data structures, array pointers, type casts, enumeration types, etc.

For example, Figure 1 contains a program extracted from the ACH library featuring a global variable moov_io_ach_stringZeros, which is initialised with the function call moov_io_ach_populateMap(94, "0").
[Figure 1: Go code extracted from the ACH library]

var (
    moov_io_ach_stringZeros map[int]string = moov_io_ach_populateMap(94, "0")
)

func moov_io_ach_populateMap(max int, zero string) map[int]string {
    out := make(map[int]string, max)
    for i := 0; i < max; i++ {
        out[i] = strings.Repeat(zero, i)
    }
    return out
}

[Go source of the Env.add function referenced in the discussion below]

func (e *Env) add(i, p int64) {
    var j int64
    e.S[i] = true
    e.Prev[i] = p
    for j = 0; j < e.N; j++ {
        if e.Lx[i]+e.Ly[i]-e.G.Get(i, j) < e.Slack[i] {
            e.Slack[i] = e.Lx[i] + e.Ly[i] - e.G.Get(i, j)
            e.Slackx[i] = j
        }
    }
}
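For illustration, the sketch below shows one plausible shape of a safe Rust counterpart to the Figure 1 code. It is not an output produced by the LLMs in this study; the snake_case names, the map key type, and the use of OnceLock for the Go package-level variable are assumptions made for this sketch.

use std::collections::HashMap;
use std::sync::OnceLock;

// One possible translation of moov_io_ach_populateMap. The names mirror the
// Go original; the key/value types follow Go's map[int]string.
fn moov_io_ach_populate_map(max: usize, zero: &str) -> HashMap<usize, String> {
    let mut out = HashMap::with_capacity(max);
    for i in 0..max {
        out.insert(i, zero.repeat(i));
    }
    out
}

// The Go package-level variable becomes a lazily initialised global,
// computed on first access, mirroring Go's package initialisation.
fn moov_io_ach_string_zeros() -> &'static HashMap<usize, String> {
    static ZEROS: OnceLock<HashMap<usize, String>> = OnceLock::new();
    ZEROS.get_or_init(|| moov_io_ach_populate_map(94, "0"))
}

A candidate translation produced by an LLM would additionally have to satisfy the formatting and fuzzer constraints described in Section IV.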
... but i64 in Rust; e is a pointer to an Env value in Go, but a mutable reference in Rust.

To solve this challenge, we develop a technique based on serializing then de-serializing to exchange data between languages. We use the JSON [35] format, because most languages support it. Most data types, including complex data types and pointers, can be automatically serialized into JSON, which allows us to easily support real-world code. For our example, Fig. 5 shows a serialized valid input state. Once the two programs are executed, the Go output state is again serialized to JSON, deserialized to Rust, and compared against the Rust output state. For our example, the expected output state, as obtained by executing the Go code, is the same as the input state in Fig. 5, with the only difference that the last element of field s is set to true instead of false. The translation in Figure 4 computes the expected output state, and it is thus deemed I/O equivalent to the original Go code, and returned by FLOURINE.

Conversely, if a counterexample is discovered by the fuzzer, then we invoke a feedback method, which uses the counterexample to create a new query to the LLM and generates a new candidate translation. Designing a suitable feedback method is another challenging aspect of the translation task. There are many ways to re-query the LLM for a new translation, each with its own likelihood of success. Moreover, most state-of-the-art LLMs are operated as API services which charge per input token, so different query strategies will have different dollar costs. To address this, we propose and evaluate a set of feedback strategies.

IV. LLM-BASED CODE TRANSLATION

A. Obtaining Translations

As mentioned in the previous sections, we are considering the problem of translating a program written in C or Go to Rust. We use zero-shot prompting and follow the best practices given by the LLM's provider. We construct the initial query q (to be input to the LLM) as sketched in Figure 6.

We start with a preamble describing the overall task. Then, we supply the program to be translated, and, finally, we provide specific constraints to be followed by the LLM. In particular, we have three types of constraints: formatting guidelines, code characteristics, and fuzzer constraints. Formatting guidelines describe how the generated code should look, simplifying parsing and extraction of relevant information from the response. For code characteristics, we instruct the LLM to produce safe Rust code, and to maintain the same function names, parameter names, and return types from the input code. Finally, the fuzzer constraints ensure that the generated code can be handled by our fuzzer (more details on this in Section IV-B).

The translation generated by the LLM may not initially compile. We address this with the approach in [14]. At a high level, we iteratively query the LLM to fix the error until the code becomes compilable. Each time, we provide both the faulty translation and the error message from the Rust compiler to the LLM, and ask it to use a specific format for the suggested fixes, applying them only to the affected lines of code.

Fig. 6: LLM Prompt for obtaining translations.

B. Checking Translations

To test the I/O equivalence between the original source program p and a candidate Rust translation p′, we develop a cross-language differential fuzzer. For a given p and p′, we automatically generate a fuzzing harness in Rust, which uses Bolero and libfuzzer [36] to perform fuzzing. The test harness generates program states from Sp′, on which p′ is directly invoked. We implement the mapping function M′ : Sp′ → Sp using JSON de/serialization. We serialize the Rust program state s′ into JSON format, and then instrument the source program p to deserialize the JSON into a program state of Sp. The instrumented p is invoked on the serialized s′ from Rust using a foreign function interface. To compare outputs, we map the output state of p to a state of p′ using JSON de/serialization as well, so that the two output states can be directly compared.
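To make the generated harness concrete, the following sketch shows the core of such an equivalence check. It is a simplified illustration, not FLOURINE's generated code: the State type, its fields, and the run_go_original wrapper around the foreign-function call into the instrumented Go program are hypothetical stand-ins.

use serde::{Deserialize, Serialize};

// Program state exchanged between the two sides. Field names mirror the Go
// struct so that JSON (de)serialization lines them up automatically.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
struct State {
    s: Vec<bool>,
    prev: Vec<i64>,
    slack: Vec<i64>,
}

// Hypothetical wrapper around the foreign-function call into the
// instrumented Go program: it receives the input state as JSON and
// returns the Go output state as JSON.
fn run_go_original(input_json: &str) -> String {
    unimplemented!("FFI call into the instrumented Go binary")
}

// Hypothetical Rust candidate translation under test.
fn candidate_translation(state: &mut State, i: usize, p: i64) {
    let _ = (state, i, p); // placeholder body
}

#[test]
fn differential_fuzz() {
    bolero::check!()
        .with_type::<(Vec<bool>, Vec<i64>, Vec<i64>, u8, i64)>()
        .for_each(|(s, prev, slack, i, p)| {
            let mut rust_state = State {
                s: s.clone(),
                prev: prev.clone(),
                slack: slack.clone(),
            };
            // Serialize the generated input state before the Rust side mutates it.
            let input_json = serde_json::to_string(&rust_state).unwrap();

            // Run the Rust candidate directly on the generated state.
            candidate_translation(&mut rust_state, *i as usize, *p);

            // Run the Go original on the same serialized input and map its
            // output state back into the Rust representation.
            let go_state: State =
                serde_json::from_str(&run_go_original(&input_json)).unwrap();

            // I/O equivalence: the two output states must agree.
            assert_eq!(rust_state, go_state);
        });
}

Bolero drives the input generation; the assertion at the end is what reports a counterexample when the two output states diverge.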
We use JSON serializers for two reasons. First, the mapping between fields of user-defined data types in the source and target language is automatically determined based on the field names. This requires the LLM to produce data types with field names that match the source program, but in our benchmarks LLMs always do this. Second, most languages support automatic serialization of primitive, pointer, and user-defined types.

We note that an alternative approach, taken by [31], is to compile both p and p′ down to a common IR, such as LLVM, and then perform fuzzing on the IR. However, we find that IR compilers for different languages typically discard type and layout information (e.g. user-defined data types are represented as a void pointer). This makes it nearly impossible for a fuzzer to generate any meaningful inputs.

Soundness & Limitations. Our fuzzer can only make heuristic-based guarantees (e.g. coverage) on the equivalence of p and p′. This is a limitation of fuzzing and testing in general. However, our fuzzer achieves an average line coverage of 97%.

In addition, JSON serialization is not automatically supported for all types. For example, features in Rust like trait definitions, impl traits, and lifetimes in data type definitions are only partially supported. This means that the equivalence check may fail because serialization fails. We report these errors in Section VI-B. In addition, we do not support features like concurrency, network, and file I/O. Our benchmark extraction therefore excludes code that uses these features.
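As an illustration of where automatic serialization stops working, the hypothetical Rust types below contrast a shape that derives JSON support mechanically with one that cannot; a translation using the latter shape would fail the equivalence check for serialization reasons rather than behavioural ones.

use serde::{Deserialize, Serialize};

// Plain data: serde can derive JSON support, so the fuzzer can exchange
// values of this type with the Go side.
#[derive(Serialize, Deserialize)]
struct Supported {
    name: String,
    values: Vec<i64>,
}

// A shape like the one below has no automatic JSON support: there is no
// Serialize/Deserialize implementation for a boxed trait object, so the
// derive cannot be emitted and the equivalence check cannot be run.
//
// struct Unsupported {
//     transform: Box<dyn Fn(i64) -> i64>,
// }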
Human:
# Preamble
You are given a C/Go program and its faulty Rust
translation. We need to repair the faulty Rust
program.
# Code to be translated
{C/Go Program}
# Code to be repaired
{Faulty Rust Program}
# Instruction
Make changes to the given code to obtain expected
outputs for the given test inputs.
# Constraints
Here are some constraints that you should respect:
...
# Counterexamples
CE1
CE2

Assistant:
{LLM generated rust translation}

Human:
That is incorrect on the following inputs:
# Counterexamples
CE1
CE2

Fig. 7: LLM Prompt for BaseRepair and CAPR. BaseRepair is shown in black. CAPR is shown in black and magenta.

V. FEEDBACK STRATEGIES

In this section, we present four feedback methods that can be used if the fuzzer finds counterexamples E− to the correctness of the translation p′ produced by the LLM in Alg. 1.
a) Simple Restart (Restart): We discard the generated code p′ and re-query the model with the same prompt q.

b) Hinted Restart (Hinted): This builds on the previous strategy by adding positive and negative examples from the fuzzer, E+ and E−, to the original prompt q. The intention is to suggest desirable behaviours to the LLM, as well as known faulty cases to avoid. We separately group the examples in E+ and E− based on the paths they exercise in p′. Intuitively, this corresponds to splitting them into equivalence classes, where each equivalence class corresponds to a particular program path. Then, the query constructed by Hinted only contains positive and negative examples from a single equivalence class, respectively.

c) Counterexample-Guided Repair (BaseRepair): Discarding the generated code p′ when the fuzzer check fails may not always be the optimal choice. For instance, if p′ is close to passing the fuzzer, trying to repair it might work better. As part of BaseRepair, we give counterexamples from the fuzzer to the LLM. Similarly to Hinted, a query only contains negative examples from the same equivalence class, which correspond to bugs associated with the same program path. The expectation is that the candidate translation generated in the next iteration of Alg. 1 will produce the correct outputs for the given examples. A sketch of the prompt used for BaseRepair is given in Figure 7 (excluding the lines colored in magenta). In Alg. 1, if the translation generated by G for the query q constructed by BaseRepair still fails the fuzzer check, then this last faulty translation will be considered by the next call to BaseRepair.

d) Conversational Repair (CAPR): Recent work in code translation [8] and automated program repair [37] has proposed conversational repair approaches, wherein previous incorrect code is included in the prompt to the LLM to discourage the LLM from producing the same code again. The CAPR approach begins with the same prompt as BaseRepair; however, the two differ if the new translation still fails the fuzzer check. In BaseRepair, we create a new prompt from scratch, but in CAPR, we keep the prompt and append a new piece of dialogue to it, as shown in magenta in Figure 7. This process can be repeated multiple times, meaning the prompt is a dialogue of failed translations.

The methods Restart and Hinted cost less than BaseRepair and CAPR as they do not include the incorrect translation in the prompt. Therefore, the former use about half the input tokens of the latter.
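To make the grouping used by Hinted and BaseRepair concrete, the sketch below groups fuzzer-generated examples into equivalence classes keyed by the program path they exercise in p′. The PathId and Example types are illustrative stand-ins, not FLOURINE's actual data structures.

use std::collections::HashMap;

// Illustrative stand-ins: a path is identified by the sequence of branch
// ids covered in p′, and an example records an input together with the
// expected (source) and actual (translation) outputs.
type PathId = Vec<u32>;

struct Example {
    input: String,
    expected_output: String,
    actual_output: String,
}

// Group examples by the path they exercise; Hinted and BaseRepair then
// build a query from the examples of a single equivalence class.
fn group_by_path(examples: Vec<(PathId, Example)>) -> HashMap<PathId, Vec<Example>> {
    let mut classes: HashMap<PathId, Vec<Example>> = HashMap::new();
    for (path, example) in examples {
        classes.entry(path).or_default().push(example);
    }
    classes
}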
VI. EVALUATION

In this section, we present our results for the following research questions.

RQ1: How do LLMs perform on translating code taken from real-world projects to Rust? We gather a large number of benchmarks by extracting code samples from real-world projects, and we use LLMs to generate translations, which are then checked for correctness by the fuzzer and fixed if needed by applying feedback strategies. We answer the following concrete questions.

(RQ1.1) How many benchmarks can each LLM translate from each of our projects? We report the percentage of benchmarks from each project that are successfully translated for each LLM. We show that success rates vary widely based on the benchmark and LLM. LLMs achieve up to 80% success rate on benchmarks from our "easiest" project, and between 15-40% on our "hardest" project.

(RQ1.2) How does code complexity affect the success rate of translation? We look at how lines of code and the number of functions in a benchmark influence the success rate. We show that lines of code strongly influence success rates.

(RQ1.3) How idiomatic is the Rust produced by LLMs? We run Clippy [38], Rust's standard linter, on the successful translations, and analyze the rates of different categories of linter warnings. We show that LLMs occasionally (1-15% of the time) produce code with linter warnings, suggesting that the translations could be made more performant or more concise, or that they use unsafe code.

RQ2: How effective are feedback strategies at fixing translation bugs? In addition to overall translation success rates, we record the initial success rates – the rate at which the first translation passes the fuzzer – and compare this to the overall success rate. We answer two concrete questions.

(RQ2.1) How much do feedback strategies increase the translation success rate? We compare overall success rates directly to initial success rates. We show that the most effective feedback strategy increases the success rate by an absolute 6-8% on average for the best LLMs.

(RQ2.2) Which feedback strategies increase success rates the most? We compare the increase in success rates for each feedback strategy. We show that, surprisingly, Restart and Hinted consistently outperform BaseRepair and CAPR. We provide a plausible explanation for this result.

RQ3: How do LLM translations compare to rule-based translation tools? We compare LLM translations to translations produced by the rule-based translation tool C2Rust [3]. While C2Rust can theoretically guarantee the correctness of the translation, we show that LLMs produce far more concise and idiomatic translations.

RQ4: Why do translations fail? Translation can fail for several reasons beyond the fuzzer finding counterexamples. We report failure rates for different failure reasons.

A. Experimental Setup

1) Implementation: We implement an end-to-end translation tool, FLOURINE, which takes as input (1) a program, (2) a feedback strategy to apply, and (3) a budget. FLOURINE outputs either a corresponding Rust translation that passes the fuzzer, or it fails with an error. Algorithm 1 is used for the implementation of FLOURINE. FLOURINE is written entirely in Python, except for the fuzzer, which is written in Rust. FLOURINE currently supports C and Go for the input program. FLOURINE is implemented as a framework, which can be extended with new LLMs, feedback strategies, and language support for the input program. We use GNU Parallel [39] to run experiments in parallel.

2) LLMs: We limit our study to LLMs hosted by third party providers. This is in part because they are the highest performing on coding tasks, and they are the most accessible in that they do not require the user to own powerful compute resources. We use five LLMs in our evaluation: GPT-4-Turbo [40], Claude 2.1 [41], Claude 3 Sonnet [41], Gemini Pro [42], and Mixtral [43]. The first four are likely very large (1T+ parameters). On the other hand, Mixtral is relatively small (45B parameters), but is known for performing well on coding tasks, and costs less than the others. We access GPT-4-Turbo and Gemini Pro through OpenAI's and Google's APIs. We access Claude and Mixtral through AWS Bedrock. Due to lack of access to GPU machines, we do not attempt to run open source LLMs like CodeLLaMA.

3) Benchmarks: We collect benchmarks from real-world projects hosted on GitHub. We focus on C and Go as the source program languages for multiple reasons. First, C, Go, and Rust are typically used for lower-level programming tasks, unlike other popular languages like Java or Python. Thus they are likely candidates for translating to Rust. Second, and more pragmatically, projects written in C and Go make less use of third party libraries, which we do not attempt to support for this work. Conversely, most Java and Python projects make heavy use of third party libraries.

We choose seven projects with the aim of getting a diverse set of application domains. Our projects are:
• ACH [44]: a Go library implementing a reader, writer, and validator for banking operations
• geo [45]: a math-focused Go library implementing common geometry functions and interval arithmetic
• libopenaptx [46]: a C library for audio processing
• opl [47]: a C library for sound card emulation
• go-gt [48]: a Go library for graph algorithms
• go-edlib [49]: a Go library for string comparison and edit distance algorithms
• triangolatte [50]: a 2D triangulation library in Golang

As we will show in our experiments, LLMs are still not capable of translating entire projects. To create benchmarks of manageable size, we develop a tool for automatically extracting benchmarks from these projects. Our tool takes as input the entire project and a specific function identifier f in the project. The tool then analyzes the project to find all of f's dependencies, including all functions called by f (including transitive calls), type definitions, standard libraries, global variables, etc., and extracts them into a single, compilable file. The translation task is then to write a compilable Rust file with a function equivalent to f's behavior. Our methodology for selecting benchmarks is to iterate over all functions in a project, create a benchmark for each, and keep it if it meets the following criteria: (1) it does not use 3rd party libraries, and (2) the maximum depth of the call graph is less than 4.
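As an illustration of the second criterion, the sketch below computes the maximum call-graph depth rooted at the target function. It is a simplified stand-in that assumes an acyclic call graph and a pre-computed caller-to-callee map; the actual extraction tool analyzes C and Go sources.

use std::collections::HashMap;

// Depth of the longest call chain starting at f, counting f itself.
// Assumes the call graph is acyclic (recursion is not handled here).
fn call_graph_depth(f: &str, calls: &HashMap<String, Vec<String>>) -> usize {
    match calls.get(f) {
        None => 1,
        Some(callees) if callees.is_empty() => 1,
        Some(callees) => {
            1 + callees
                .iter()
                .map(|callee| call_graph_depth(callee, calls))
                .max()
                .unwrap_or(0)
        }
    }
}

// Selection criterion (2): keep the benchmark only if the call graph
// rooted at f is shallower than 4.
fn keep_benchmark(f: &str, calls: &HashMap<String, Vec<String>>) -> bool {
    call_graph_depth(f, calls) < 4
}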
TABLE I: Benchmark details

Project       Lang.  #Benchs  Min/Max/Avg LoC  Min/Max/Avg #Func
libopenaptx   C      31       13 / 173 / 69    1 / 9 / 2.9
opl           C      81       19 / 460 / 67    1 / 15 / 2.8
go-gt         Go     43       9 / 213 / 51     1 / 16 / 3.5
go-edlib      Go     36       13 / 597 / 62    1 / 25 / 3.1
ach           Go     121      43 / 194 / 64    3 / 7 / 3.4
geo           Go     67       13 / 70 / 35     3 / 7 / 4.1
triangolatte  Go     29       9 / 164 / 38     1 / 10 / 2.5

Details on the benchmarks are given in Table I. The total number of benchmarks extracted from each project is given in the column "#Benchs". LoC and the number of functions for individual programs vary from 13 to 597 and from 1 to 25, respectively.

4) LLM Hyperparameters: All LLMs use a temperature parameter for controlling the randomness/creativity of their output. To make our results more deterministic, we use a lower temperature (i.e. less random) of 0.2. Other hyperparameters, e.g. topP and topK, are set to the default values recommended by the LLM's provider.

5) FLOURINE Hyperparameters: We set the budget b in Algorithm 1 to 5. For the Hinted and BaseRepair strategies, we provide 4 examples in the prompt (more examples appeared to reduce efficiency as the context window grew). For the CAPR strategy, we keep the conversation window size at 3, which means that only the latest 3 incorrect translations are made available to the LLM. A translation is deemed equivalent if 5 minutes of fuzzing does not return any counterexamples.

6) Compute Resources: We run our experiments on a machine with an AMD EPYC 7R13 processor with 192 cores and 380 GB of RAM. Each translation task is run sequentially in a single thread (we do not parallelize individual translation tasks or fuzzing). As previously mentioned, all LLMs are accessed through APIs provided by third party services.

B. Results

We run a translation experiment for each of our five LLMs, four feedback strategies, and 408 benchmarks, for a total of 8160 translation experiments. A translation is successful if it compiles and passes the fuzzer. A translation fails if: (1) it does not compile, (2) the fuzzer cannot de/serialize the types used in the translation, or (3) the fuzzer finds a counterexample in the translation and the budget is reached, if applicable. We answer our research questions based on these results.

RQ1: How do LLMs perform on translating code taken from real-world projects to Rust? Our LLMs achieve overall success rates of 47.7% (Claude 2), 43.9% (Claude 3), 21.0% (Mixtral), 36.9% (GPT-4-Turbo), and 33.8% (Gemini Pro). We present detailed results for each LLM in Figures 8, 9, 10, and 11. The success rate is the total number of successful translations divided by the total number of translation experiments in the category (experiments for different feedback strategies are averaged together). We answer our sub-questions below.

(RQ1.1) How many benchmarks can each LLM translate from each of our projects? Figure 8 shows success rates by benchmark and LLM. The best LLMs achieve success rates of 20-60% depending on the benchmark, with one outlier of 80% by Claude 2 on ACH. The outlier is in large part due to ACH having ∼40 extremely similar benchmarks, which Claude 2 nearly always gets right. If we remove these similar benchmarks, the success rate for Claude 2 drops to 55%, which is in line with the other LLMs. A consistent trend is that Mixtral, while somewhat capable, has 5-20% lower success rates than the other, much larger and more expensive LLMs. However, the cost of running Mixtral (both in dollars and compute) is at least 10x less than the other LLMs. Other trends are that Claude 2, Claude 3, and GPT-4-Turbo perform similarly on most benchmarks, and they outperform Gemini in most cases.

(RQ1.2) How does code complexity affect the success rate of translation? We use lines of code and the number of functions as proxy metrics for complexity, and we show success rates for benchmarks grouped by level of complexity in Figures 9 and 10. The general trend is that increasing complexity, especially in lines of code, reduces the success rate. The spikes for 3 functions and 48-82 lines of code are again due to the ACH benchmarks mentioned in the previous research question. Removing these flattens the spike. In particular, success rates tend to drop off somewhere around 100+ lines of code. We discuss approaches for handling larger benchmarks in Section VI-C2.

(RQ1.3) How idiomatic is the Rust produced by LLMs? Figure 11 shows the rate of different categories of linter warnings produced by Clippy [38], Rust's standard linter. We limit our analysis to successful translations. Clippy reports five types of warnings, and we add unsafe. We describe them below, and give specific examples of the warnings most frequently reported by Clippy on the Rust translations.

Correctness: reports code that may have correctness bugs. The common examples we find are: checking if an unsigned integer is greater than 0, and using MaybeUninit::uninit().assume_init() (i.e. assuming that potentially uninitialized data is initialized).

Suspicious: the same as Correctness, but could be a false positive.

Style: code that is unidiomatic, but still correct. The common examples we find are: not following naming conventions, unnecessary borrows, using return statements, unnecessary closure expressions (e.g. xs.map(|x| foo(x)) instead of xs.map(foo)), using class types (e.g. String) when a simple primitive type will suffice (e.g. str), and not using idiomatic statements (e.g. using x <= z && z <= y instead of (x..=y).contains(&z)).

Complexity: code that could be simplified. Common examples are: unnecessary casting or type conversion, unnecessary
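For illustration (these are constructed examples, not excerpts from the study's translations), the snippet below shows the kinds of patterns behind the Style and Correctness warnings described above, alongside the idiomatic forms Clippy suggests.

// Style warnings: an unnecessary closure and a manual range check,
// followed by the idiomatic equivalents.
fn style_examples(xs: Vec<i32>, x: i32, y: i32, z: i32) -> (Vec<i32>, bool) {
    // Flagged: closure that only forwards to a function; manual range check.
    let unidiomatic: Vec<i32> = xs.iter().map(|v| i32::abs(*v)).collect();
    let manual_check = x <= z && z <= y;

    // Idiomatic equivalents.
    let idiomatic: Vec<i32> = unidiomatic.into_iter().map(i32::abs).collect();
    let range_check = (x..=y).contains(&z);

    (idiomatic, manual_check && range_check)
}

// Correctness warning: the commented-out pattern assumes potentially
// uninitialized memory is initialized, which is undefined behaviour and is
// flagged by Clippy.
//
// let value: i32 = unsafe { std::mem::MaybeUninit::uninit().assume_init() };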
Fig. 8: Success rate for each LLM on each benchmark. Averaged across all feedback strategies.
Fig. 9: Success rate for each LLM on benchmarks grouped by lines of code.
Fig. 10: Success rate for each LLM on benchmarks grouped by number of functions.
Fig. 11: Rates of different types of linter warnings for each LLM.